We're a global company building smart software that helps improve public services
Senior Site Reliability Engineer
Location
United Kingdom
Posted
27 days ago
Salary
0
Seniority
Senior
Job Description
Civica US
- Designing and implementing for scale & resilience: Architect, implement and continuously improve our existing Data Center and Cloud environments on AWS, Azure, and VMware, ensuring they meet our SLAs and adapt dynamically to demand, working alongside the Platform teams providing PaaS/IaaS.
- Driving automation: Build and evolve infrastructure as code (Terraform, etc.) and CI/CD pipelines (GitHub Actions, etc.) to ship new features safely and at speed.
- Defining and measuring reliability: Partner with teams to set up meaningful SLIs/SLOs, implement real-time observability (Datadog, Prometheus, Grafana, ...) and proactively identify risks before they impact our users.
- Leading incident response: Own the on-call rota, coach teams through blameless post-mortems, and embed a culture of continuous improvement so outages become learning opportunities.
- Mentoring & evangelism: Share your deep expertise by pairing with engineers, running brown-bag sessions on reliability best practices, and helping raise the bar across our global engineering organisation.
- Securing our stack: Collaborate with our Security team to build security controls into CI/CD, runtime environments and disaster-recovery plans, so our customers and citizens are always protected.
Job Requirements
- Demonstrable experience in a production SRE, DevOps or infrastructure role, ideally within a SaaS or large-scale web environment
- Expert in at least one public cloud (AWS, Azure, or GCP) and comfortable designing hybrid migrations from on-prem to cloud
- Strong coding/scripting and troubleshooting skills (in any of Go, .NET, Java, Python, etc.) and a passion for building reusable, tested libraries and tooling
- Proven track record with IaC tools (Terraform, CloudFormation, or similar) and container orchestration (Kubernetes, ECS, AKS, OpenShift)
- Proven track record with virtual machine orchestration / provisioning and resiliency strategies (KubeVirt, Packer, Ansible)
- Deep understanding of monitoring, logging, and tracing frameworks (Prometheus/Grafana, ELK/OpenSearch, Jaeger, etc.)
- Excellent communicator who thrives in cross-functional teams, with a passion for translating complex technical issues into clear, actionable plans
Benefits
- 25 Days Annual Leave + bank holidays – plus the option to buy up to 10 extra days!
- Days of Difference – Up to 3 extra days off for volunteering.
- Pension Contributions – 5% employer match to support your future.
- Income Protection – Up to 75% salary cover for long-term illness.
- Life Assurance – 4x salary tax-free lump sum.
- Critical Illness Cover – £25,000 lump sum (extendable to dependents).
- Private Medical Insurance – Fast access to private healthcare.
- Health Cash Plan – Claim back physio, therapies & more.
- Dental Insurance – Cover for routine & emergency care.
- Electric Vehicle (EV) Scheme – A wide range of electric & hybrid vehicles.
- Affinity Groups – Join employee-led communities.
- Bounty Bonus – Refer a friend & get rewarded.
More DevOps Engineer Jobs
- Design, implement, and maintain CI/CD pipelines that pull source code from GitHub, build Java applications, and package deployment artifacts
- Automate deployments across multiple environments, including Development, QA, UAT, Pilot, and Production
- Integrate automated smoke tests and validation checks into deployment pipelines
- Establish artifact versioning, tagging, and promotion strategies to support controlled releases and rollbacks
- Support application deployments across AWS and Azure virtual machine environments
- Assist with the transition of applications toward containerized deployments using Kubernetes
- Maintain environment consistency across platforms and deployment stages
- Own and enforce source control branching, merging, tagging, and release conventions
- Maintain build, deployment, and release documentation, including runbooks and standard operating procedures
- Coordinate software releases with Development, QA, and Application Support teams
- Assist with troubleshooting build, deployment, and environment-related issues
- Participate in incident response related to deployment or configuration failures
- Perform other duties as assigned
Team Lead, Site Reliability Engineering - Fleet Management
MongoDB
MongoDB, originally called 10gen, is a software development company. Since 2007, MongoDB has created an open-source, document-oriented database to help clients
Role Description
The Fleet Management team provides the core runtime environment that empowers our developers to build and ship products to delight our customers. We manage the end-to-end lifecycle of our Kubernetes fleet, alongside the critical components that ensure cluster reliability and security (e.g., CoreDNS, cert-manager, and Gatekeeper). As our infrastructure scales to support new use cases and products, we are spearheading a migration from Terraform-based Infrastructure as Code (IaC) to an Operator-driven lifecycle management model.
This role can be based out of our Austin, Boston, Los Angeles, New York City, Raleigh, or San Francisco offices, or remotely in the United States region.
- Manage a team of 6-8 engineers, fostering a positive culture, handling career growth and performance conversations, and proactively removing blockers.
- Help develop a clear technical vision and comprehensive roadmap for our runtime environment, balancing long-term strategic infrastructure goals with immediate engineering needs.
- Contribute through light hands-on technical work, such as leading architectural design reviews, reviewing PRs, and stepping in to guide the team through complex operational challenges.
- Act as the primary liaison for the Fleet Management team, collaborating closely with other engineering leaders to ensure platform alignment and manage stakeholder expectations.
Qualifications
- 10+ years of experience working on software and operating distributed systems, with 2+ years managing engineering teams.
- Possess a customer-focused mindset, treating internal developers as your primary users.
- Value efficiency in processes and operations, and have a track record of optimizing team workflows.
- Prefer automation over manual processes ("allergic to ops work"), fostering a culture of building software solutions to eliminate toil.
- Deep technical familiarity with Kubernetes ecosystems, containerization technologies, and modern IaC tooling (e.g., Terraform, Crossplane, or Operators).
- Excel at translating complex business and engineering requirements into actionable, phased technical roadmaps.
- High level of empathy, responsibility, ownership, and accountability.
- Excellent verbal and written technical communication skills.
Requirements
- Leading major architectural shifts, such as migrating teams from traditional IaC to Operator-driven lifecycle management.
- Managing and scaling infrastructure across multi-cloud environments (AWS, GCP, or Azure).
- Designing secure, multi-tenant runtime environments at scale.
Benefits
- Equity participation.
- Participation in the employee stock purchase program.
- Flexible paid time off.
- 20 weeks fully-paid gender-neutral parental leave.
- Fertility and adoption assistance.
- 401(k) plan.
- Mental health counseling.
- Access to transgender-inclusive health insurance coverage.
- Health benefits offerings.
- Works with development teams to improve efficiency in the software deployment process.
- Responsible for updating, maintaining, and selecting development tools used for tracking, building and deploying software.
- Defines processes used to perform development work supporting the Agile framework.
- Develops continuous integration and continuous deployment concepts and tools.
- Improves deployment frequency while minimizing business impact.
- Improves, updates, and maintains the branching strategy.
- Improves processes to lower failure rates of new releases, and improves tools and processes to create a continuous integration environment.
- Assists in containerization of software environments.
- Provides documentation and training for supported tools.
- Other duties as assigned.
DevOps Engineer
Open Function
OpenFn helps scale public health & humanitarian interventions via data integration, automation, and interoperability.
Build World-Class Deployment, Monitoring, and Instance-Maintenance Tooling
- Develop and maintain DevOps tooling (Ansible, Terraform, custom CLI programs), deployment runbooks, configuration templates, and documentation that allow the teams and people responsible for deploying, monitoring, and maintaining instances of OpenFn to succeed.
Deliver On-Premise and Local Deployments
- Lead and execute OpenFn deployments on government and ministry-managed infrastructure, including air-gapped, low-connectivity, and sovereign-hosting environments.
- Configure and maintain containerized deployments using Docker, Docker Compose, and Docker Swarm, and support Kubernetes-based setups where applicable.
- Work directly with government IT teams to navigate local infrastructure constraints, security requirements, and network configurations.
- Troubleshoot infrastructure and runtime issues in the field, often with limited access to external resources.
Cloud Infrastructure
- Maintain and optimize OpenFn deployments on GCP, AWS, and occasionally Azure, including compute, networking, storage, and managed services configuration.
- Implement and maintain CI/CD pipelines for services team deployments.
- Monitor system performance, set up alerting, and respond to infrastructure incidents across cloud-hosted client environments.
- Advise implementation teams on cloud architecture decisions and cost optimization.
Internal Standards and Enablement
- Build and maintain internal DevOps standards, deployment guides, and infrastructure-as-code templates that the wider services team can use and build on.
- Contribute to pre-sales and scoping conversations by advising on infrastructure feasibility, hosting options, and deployment effort for prospective clients.
- Work closely with the Principal Solutions Architect and the CTO to ensure deployment strategy is aligned with solution architecture from the start of each engagement.




