The Mozilla Corporation was founded in 2005 as a taxable, wholly-owned subsidiary of the Mozilla Foundation, which launched in 2003. The corporation serves the
Senior Site Reliability Engineer
Location
California
Posted
4 days ago
Salary
$123K - $144K / year
Seniority
Senior
Job Description
Senior Site Reliability Engineer
Mozilla
- Operate and evolve our EKS-based Kubernetes platform, supporting service migrations, platform improvements, and reliability initiatives.
- Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases, contributing to pipeline reliability and OIDC-based authentication across GitHub Actions workflows.
- Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts.
- Operate and evolve our observability stack (VictoriaMetrics, VictoriaLogs, Grafana, Vector) and partner with engineering teams to incorporate instrumentation and monitoring into service design.
- Apply security-conscious infrastructure practices, including least-privilege IAM, secrets management via AWS Secrets Manager and External Secrets Operator, and network segmentation.
- Diagnose and debug production incidents; drive root-cause analysis and post-incident improvements to prevent recurring problems.
- Participate in on-call rotation and collaborate with SDEs and fellow SREs to ship, maintain, and monitor new builds and support service onboarding.
- Contribute to runbooks, architecture documentation, and team processes.
Job Requirements
- 7+ years of experience in infrastructure, platform engineering, or site reliability roles, including hands-on production Kubernetes experience in workload operations, troubleshooting, and cluster management.
- Hands-on experience with infrastructure-as-code on AWS using Terraform, OpenTofu, or Pulumi.
- Security awareness in day-to-day infrastructure work: identity, least privilege, secrets hygiene, and network controls.
- Demonstrated ownership mindset with the ability to proactively identify issues, drive work to completion, and communicate risks early.
- Excellent async written communication skills; comfortable working with a geographically distributed team.
- Ability to collaborate effectively with software engineers and non-engineering stakeholders to improve platform reliability and operational efficiency.
- Ability to learn, evaluate, and responsibly use emerging technologies, including AI-enabled tools, to improve work processes.
Benefits
- Fully remote work & schedule flexibility
- Company-provided laptop
- Annual bonus program
- Monthly remote work stipend
- Annual professional development stipend
- Industry conferences
- Company all-hands and team gatherings
- 24 days PTO per year (prorated)
- Your birthday
- Year-end company shutdown
- 9 wellbeing days
- Public holidays
- Other paid leave
- Quarterly wellbeing stipend for personal / family activities
- 401(k) / RRSP contributions
- Health, dental, & vision insurance
- Disability insurance
- Life insurance
- Employee assistance program
- Paid parental leave
- Paid sick days
More DevOps Engineer Jobs
Senior Site Reliability Engineer
Mozilla
- Operate and evolve our EKS-based Kubernetes platform
- Design and develop CI/CD systems supporting websites, services, and Thunderbird desktop releases
- Write and maintain infrastructure in Pulumi and/or Terraform/OpenTofu across multiple AWS accounts
- Operate and evolve our observability stack and partner with engineering teams to incorporate instrumentation and monitoring into service design
- Apply security-conscious infrastructure practices
- Diagnose and debug production incidents and drive root-cause analysis
- Participate in on-call rotation and collaborate with SDEs and fellow SREs
- Contribute to runbooks, architecture documentation, and team processes
- Oversee the design, implementation, and maintenance of cloud infrastructure across AWS and Azure
- Lead and manage a team of DevOps engineers, assigning tasks and ensuring best practices in deployment, monitoring, and security
- Define and enforce CI/CD processes and infrastructure automation standards using Terraform
- Own the implementation and governance of Landing Zones in both AWS and Azure
- Ensure HIPAA compliance and security policies are followed across all environments
- Drive adoption of observability tools like DataDog and establish logging standards
- Coordinate incident response and root cause analysis for infrastructure issues
- Build SLA frameworks for critical services and define strategies for auto-scaling, failure recovery, and disaster recovery
- Collaborate with stakeholders to align DevOps strategies with business needs
- Review and approve technical documentation produced by the team
Site Reliability Engineer, Fintech
Ant-Tech
Ant-tech is a reputable headhunter agency in France, specializing in providing high-quality recruitment services for companies across various industries. With a team of experienced professionals and an extensive network of partners, Ant-tech connects talented candidates with organizations in need, particularly in the technology, finance, and other sectors. Committed to delivering optimal recruitment solutions, Ant-tech focuses not only on finding the right talent but also ensuring long-term and sustainable growth for both candidates and partner companies.
- Build and enhance automated provisioning for servers and network infrastructure across physical environments and cloud platforms (AWS, GCP).
- Improve and evolve CI/CD pipelines for infrastructure provisioning and software deployment.
- Develop and maintain infrastructure automation using tools such as Ansible.
- Support and manage server, network, and platform reliability across the organisation.
- Work closely with hardware vendors, telecom providers, and third-party service providers.
- Coordinate procurement, deployment, and lifecycle management of infrastructure hardware.
- Contribute to an engineering culture focused on automation, reliability, and continuous improvement.
- Shared responsibility for system availability: You actively contribute to the availability, reliability, and efficiency of our complex system architecture, which consists of about 70 servers at Hetzner.
- Maintenance and automation: You support the maintenance and automation of our existing infrastructure, which is based on technologies such as Ubuntu, Percona MySQL Cluster, MinIO, Elasticsearch, Redis, NGINX, HAProxy, TiDB, ClickHouse, and Kubernetes, and you contribute your own ideas for optimization.
- Monitoring and analysis: You improve our monitoring strategies and carry out thorough failure analyses.
- High availability: In exceptional cases, you are willing to get up at night to make sure our systems run smoothly.
- Software development: Several years of experience in one or more programming languages (e.g., Rust, Java, Go, TypeScript) is required.



