Job Closed
This listing is no longer active.
Innovative Investing Experiences
Senior Site Reliability Engineer
Location
United States
Posted
69 days ago
Salary
$150K - $170K / year
Seniority
Senior
Job Description
DriveWealth
- Engineering & Automation: Design and develop internal tools and SRE platforms to eliminate repetitive tasks (toil) and improve developer velocity.
- Infrastructure as Code: Architect and maintain modular, reusable IaC using Terraform and manage GitOps workflows via ArgoCD.
- Observability & Reliability: Implement OpenTelemetry standards and the Grafana stack (Alloy, Loki, Tempo, Mimir) to provide deep insights into system health. Define and manage SLIs, SLOs, and Error Budgets.
- Platform Governance: Review software architecture and Kubernetes metrics to ensure high availability, capacity planning, and cost-optimization across AWS regions.
- Incident Engineering: Lead incident response, perform complex root-cause analysis (RCA), and champion a blameless post-mortem culture.
- Collaboration: Partner with engineering teams to foster the adoption of new tools, security standards, and reliability best practices.
Job Requirements
- Linux & Networking Mastery: Proficient in Linux administration with a deep understanding of the TCP/IP stack, OSI model, DNS, and network troubleshooting.
- FinTech Background: Experience working in highly regulated financial environments or with FIX/API connectivity.
- Production Kubernetes: Hands-on experience managing production-grade clusters, including RBAC, autoscaling, Helm, and multi-cluster patterns.
- Cloud Native Expertise (AWS): Strong grasp of AWS core services, security, and high-availability patterns. Proficiency with boto3 and AWS CLI for automation.
- Modern CI/CD & GitOps: Experience building secure, automated delivery pipelines and operating GitOps workflows (ArgoCD).
- Code Proficiency: Strong scripting and development skills in Python or Golang, along with Bash and Ansible.
- Security Mindset: Experience with secrets management, vulnerability scanning, and securing the software supply chain.
- AI & Prompt Engineering: Familiarity with using LLMs, public MCP servers, or Amazon Bedrock AgentCore to enhance SRE workflows.
- Data & Middleware: Experience managing Kafka, MQ, SQS, or orchestration tools like Airflow and Rundeck.
Benefits
- Competitive compensation
- Equity
- 401(k) match
- Full insurance coverage
- Wellness reimbursement
- Company-provided phone
- Personal development allowance
- Generous PTO
- Observed holidays
- Extended leave