Make experiences flow.
Site Reliability Engineer
Location
United Kingdom
Posted
33 days ago
Seniority
Senior
Job Description
NICE
- Act as a primary or escalation responder in a 24x7 on-call rotation
- Lead or support Major Incident (MI) response, including triage, mitigation, and resolution
- Coordinate across Engineering, Infrastructure, Security, and Product teams
- Execute and improve runbooks, playbooks, and escalation paths
- Drive blameless post-incident reviews (PIRs) and track corrective actions
- Own service health monitoring across infrastructure, applications, and dependencies
- Design and maintain alerting strategies that align with SLIs/SLOs
- Reduce alert fatigue through signal-to-noise improvements
- Build dashboards using tools such as Grafana, Prometheus, Datadog, Splunk, CloudWatch
- Automate repetitive operational tasks to reduce manual toil
- Improve mean time to detect (MTTD) and mean time to resolve (MTTR)
- Develop scripts and tools (Python, Bash, Go, etc.) to support NOC/SRE workflows
- Implement self-healing and auto-remediation where possible
- Partner with engineering teams to improve system design for reliability
- Support and troubleshoot Linux-based systems, cloud platforms, Kubernetes/containerized environments
- Assist with capacity planning and availability reviews
- Ensure operational readiness for production releases
Job Requirements
- Strong Linux systems administration
- Experience with incident management and production support
- Familiarity with cloud infrastructure (AWS preferred)
- Containers & orchestration (Docker, Kubernetes)
- Monitoring/alerting platforms
- Scripting or programming experience in Python, Bash, Go, or similar
- Understanding of networking fundamentals (DNS, TCP/IP, load balancing)
- Experience working in 24x7 NOC or production operations environments
- Ability to handle high-pressure incidents calmly and effectively
- Strong written and verbal communication for incident coordination
- Comfort working from runbooks—but improving them when they fall short
- Experience defining or operating to SLOs / SLIs
- Prior migration from traditional NOC → SRE model
- Infrastructure as Code experience (Terraform, Ansible, etc.)
- Exposure to security, compliance, or regulated environments
Benefits
- Professional development opportunities
- Flexible working hours
- Work from home
More DevOps Engineer Jobs
Mainframe SRE
Zensar
At Zensar, we’re “experience-led everything”. We are committed to conceptualizing, designing, engineering, marketing, and managing digital solutions and experiences for over 130 leading enterprises. We are a company driven by a bold purpose: Together, we shape experiences for better futures. Whether for our clients, our people, or the world around us, this belief powers everything we do. At the heart of our culture is ONE with Client - a set of four core values that reflect who we are and how we work: One Zensar, Nurturing, Empowering, and Client Focus. Part of the $4.8 billion RPG Group, we’re a community of 10,000+ innovators across 30+ global locations, including Milpitas, Seattle, Princeton, Cape Town, London, Zurich, Singapore, and Mexico City. We believe the best work happens when individuality is celebrated, growth is encouraged, and well-being is prioritized. We are an equal employment opportunity (EEO) and affirmative action employer, committed to creating an inclusive workplace. All qualified applicants will be considered without regard to race, creed, color, ancestry, religion, sex, national origin, citizenship, age, sexual orientation, gender identity, disability, marital status, family medical leave status, or protected veteran status.
Technical Expertise Required
Basic knowledge of 4 technical expertise areas, with a strong interest in 1 area
- Strong knowledge of Linux/Unix systems and command line tools.
- Proficiency in scripting languages such as Python, Shell, or Perl.
- Experience with configuration management tools like Ansible, Puppet, or Chef.
- Familiarity with cloud platforms like AWS, Azure, or Google Cloud.
- Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.).
- Knowledge of containerization technologies (Docker, Kubernetes) and orchestration tools.
- Expertise in monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk. (Optional, but good to know)
- Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues.
- Excellent communication and collaboration skills to work effectively with cross-functional teams.
- Strong attention to detail and ability to work in a fast-paced, dynamic environment.
- Terraform basic syntax and GitLab CI/CD configuration, pipelines, and jobs
- Cloud resource provisioning and configuration through CLI/API
- Ability to run basic queries in log tools to answer general questions
- Operating system (Linux) configuration, package management, startup, and troubleshooting
- Block and object storage configuration
- Networking: VPCs, proxies, and CDNs
Responsibilities
- Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website or application.
- Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems.
- Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues.
- Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance.
- Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents.
- Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
- Create and maintain documentation for system architecture, configuration, and troubleshooting procedures.
- Perform capacity planning and resource allocation to ensure optimal system performance and scalability.
- Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
- Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering.
Execution
- Follow established processes and runbooks, and submit updates to improve them for others.
- Propose ideas and solutions within the Infrastructure Department to reduce the workload through automation.
- Plan and execute configuration change operations at both the application and the infrastructure levels.
- Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.
Objectives of this role
- Run the production environment by monitoring availability and taking a holistic view of system health
- Build software and systems to manage platform infrastructure and applications
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement
- Provide primary operational support and engineering for multiple large-scale distributed software applications
Required skills and qualifications
- Bachelor's degree in computer science, engineering, or a related field.
- Proven experience as a Site Reliability Engineer or a similar role.
- Solid understanding of software development methodologies and DevOps principles.
- Experience with agile and iterative development processes.
- Certification in relevant technologies or frameworks is a plus (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator).
- Familiarity with continuous integration/continuous deployment (CI/CD) pipelines.
- Experience with source control systems such as Git or SVN.
- Knowledge of security best practices and experience implementing security measures in a production environment.
- Ability to work independently and handle multiple projects and priorities simultaneously.
- Strong analytical and problem-solving skills, with a focus on continuous improvement and automation.
Preferred skills and qualifications
- Previous success in technical engineering
- Coding experience beyond simple scripts
Explore Life at Zensar and join us to Grow. Own. Achieve. Learn. to be the best version of yourself.
- Work cross-functionally with the InfoSec, SRE, and Engineering teams
- Keep up to date with current vulnerabilities in the DevOps space, patch, mitigate, or procure acceptance of the vulnerability by InfoSec standards
- Check code and repositories for insecure coding practices and work with Engineering teams to remediate
- Work closely with InfoSec to create and maintain Secure SDLC training
- Conduct security based quality assurance on pre-deployment packages, and seek approval or denial of those deployments based upon security findings
- Conduct security based quality assurance such as dynamic and static code testing
- Work closely with Compliance and Engineering teams to conduct pre-project risk assessments
- Implement security checks and practices within CI/CD pipelines to ensure secure code deployment and infrastructure
- Develop automation scripts and tools to streamline security processes, including vulnerability scanning, patch management, and incident response
- Conduct security training and awareness programs for engineering teams to promote a security-first culture
DevOps Intern
AssetWorks Inc
We provide innovative and practical solutions to help our customers, and the people they serve, thrive.
- Assist in building and enhancing CI/CD pipelines (e.g., Azure DevOps) for application deployment and upgrades
- Support AI-first initiatives by exploring automation opportunities using scripting and AI tools
- Contribute to reporting and analysis of system performance, deployment metrics, and operational KPIs
- Conduct research and experimentation on AI/ML applications in DevOps (AIOps, predictive monitoring, automation)
- Help maintain and improve observability tools (logging, monitoring, alerting systems)
- Collaborate with cross-functional teams (Hosting, IT, DBA) on assigned projects
- Prepare documentation, reports, and presentations related to DevOps processes and improvements
- Support day-to-day operational tasks, including troubleshooting, deployments, and system validation
- Assist in maintaining organized records of pipelines, environments, and automation scripts
DevSecOps/DevOps Engineer
Caspar Health
Effective, recognized digital rehabilitation combined with personal therapeutic care - independent of time & location!
- Take primary responsibility for security alerts and vulnerabilities.
- Integrate automated security tests and compliance checks into our CI/CD pipelines.
- Use Terraform and Terragrunt to advance our AWS infrastructure.
- Work within a team to translate regulatory requirements into automated controls.
- Manage and harden our data layers (PostgreSQL, Redis) and orchestrate our Kubernetes environment.
- Collaborate with development teams to identify and remediate vulnerabilities early in the software lifecycle.



