NICE

Make experiences flow.

Site Reliability Engineer

DevOps Engineer, Full Time, Remote, Senior, Team 5,001-10,000, Since 1991, H1B Sponsor

Location

United Kingdom

Posted

33 days ago

Salary

Seniority

Senior

English Ansible AWS Cloud DNS Docker Grafana Kubernetes Linux Prometheus Python Splunk TCP/IP Terraform Go

Job Description

• Act as a primary or escalation responder in a 24x7 on-call rotation
• Lead or support Major Incident (MI) response, including triage, mitigation, and resolution
• Coordinate across Engineering, Infrastructure, Security, and Product teams
• Execute and improve runbooks, playbooks, and escalation paths
• Drive blameless post-incident reviews (PIRs) and track corrective actions
• Own service health monitoring across infrastructure, applications, and dependencies
• Design and maintain alerting strategies that align with SLIs/SLOs
• Reduce alert fatigue through signal-to-noise improvements
• Build dashboards using tools such as Grafana, Prometheus, Datadog, Splunk, CloudWatch
• Automate repetitive operational tasks to reduce manual toil
• Improve mean time to detect (MTTD) and mean time to resolve (MTTR)
• Develop scripts and tools (Python, Bash, Go, etc.) to support NOC/SRE workflows
• Implement self-healing and auto-remediation where possible
• Partner with engineering teams to improve system design for reliability
• Support and troubleshoot Linux-based systems, cloud platforms, Kubernetes/containerized environments
• Assist with capacity planning and availability reviews
• Ensure operational readiness for production releases

Job Requirements

Strong Linux systems administration
Experience with incident management and production support
Familiarity with cloud infrastructure (AWS preferred)
Containers & orchestration (Docker, Kubernetes)
Monitoring/alerting platforms
Scripting or programming experience in Python, Bash, Go, or similar
Understanding of networking fundamentals (DNS, TCP/IP, load balancing)
Experience working in 24x7 NOC or production operations environments
Ability to handle high-pressure incidents calmly and effectively
Strong written and verbal communication for incident coordination
Comfort working from runbooks—but improving them when they fall short
Experience defining or operating to SLOs / SLIs
Prior migration from traditional NOC → SRE model
Infrastructure as Code experience (Terraform, Ansible, etc.)
Exposure to security, compliance, or regulated environments

Benefits

Professional development opportunities
Flexible working hours
Work from home

More DevOps Engineer Jobs

Mainframe SRE

Zensar

At Zensar, we’re “experience-led everything”. We are committed to conceptualizing, designing, engineering, marketing, and managing digital solutions and experiences for over 130 leading enterprises. We are a company driven by a bold purpose: Together, we shape experiences for better futures. Whether for our clients, our people, or the world around us, this belief powers everything we do. At the heart of our culture is ONE with Client - a set of four core values that reflect who we are and how we work: One Zensar, Nurturing, Empowering, and Client Focus. Part of the $4.8 billion RPG Group, we’re a community of 10,000+ innovators across 30+ global locations, including Milpitas, Seattle, Princeton, Cape Town, London, Zurich, Singapore, and Mexico City. We believe the best work happens when individuality is celebrated, growth is encouraged, and well-being is prioritized. We are an equal employment opportunity (EEO) and affirmative action employer, committed to creating an inclusive workplace. All qualified applicants will be considered without regard to race, creed, color, ancestry, religion, sex, national origin, citizenship, age, sexual orientation, gender identity, disability, marital status, family medical leave status, or protected veteran status.

DevOps Engineer, posted 33 days ago

Full Time, Remote, Team 10,001

Technical Expertise Required

Basic knowledge of 4 technical expertise areas, with a strong interest in 1 area

- Strong knowledge of Linux/Unix systems and command line tools.
- Proficiency in scripting languages such as Python, Shell, or Perl.
- Experience with configuration management tools like Ansible, Puppet, or Chef.
- Familiarity with cloud platforms like AWS, Azure, or Google Cloud.
- Understanding of networking principles and protocols (TCP/IP, HTTP, DNS, etc.).
- Knowledge of containerization technologies (Docker, Kubernetes) and orchestration tools.
- Expertise in monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk. (Optional - But Good to Know)
- Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues.
- Excellent communication and collaboration skills to work effectively with cross-functional teams.
- Strong attention to detail and ability to work in a fast-paced, dynamic environment.
- Terraform basic syntax and GitLab CI/CD configuration, pipelines, jobs
- Cloud resources provisioning and configuration through CLI/API
- Understanding of how to do basic queries in logs tools for general questions
- Operating system (Linux) configuration, package management, startup and troubleshooting
- Block and object storage configuration
- Networking VPCs, proxies and CDNs

Responsibilities

- Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website or application.
- Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems.
- Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues.
- Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance.
- Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents.
- Automate repetitive tasks and processes to improve efficiency and reduce manual intervention.
- Create and maintain documentation for system architecture, configuration, and troubleshooting procedures.
- Perform capacity planning and resource allocation to ensure optimal system performance and scalability.
- Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards.
- Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering.

Execution

Follow established processes and runbooks, and submit updates to improve them for others. Proposes ideas and solutions within the Infrastructure Department to reduce the workload through automation. Plan and execute configuration change operations both at the application and the infrastructure levels. Actively looks for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.

Objectives of this role

- Run the production environment by monitoring availability and taking a holistic view of system health
- Build software and systems to manage platform infrastructure and applications
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement
- Provide primary operational support and engineering for multiple large-scale distributed software applications

Required skills and qualifications

- Bachelor's degree in computer science, engineering, or a related field.
- Proven experience as a Site Reliability Engineer or a similar role.
- Solid understanding of software development methodologies and DevOps principles.
- Experience with agile and iterative development processes.
- Certification in relevant technologies or frameworks is a plus (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator).
- Familiarity with continuous integration/continuous deployment (CI/CD) pipelines.
- Experience with source control systems such as Git or SVN.
- Knowledge of security best practices and experience implementing security measures in a production environment.
- Ability to work independently and handle multiple projects and priorities simultaneously.
- Strong analytical and problem-solving skills, with a focus on continuous improvement and automation.

Preferred skills and qualifications

- Previous success in technical engineering
- Coding experience beyond simple scripts

Explore Life at Zensar and join us to Grow. Own. Achieve. Learn. to be the best version of yourself.

India

DevSecOps Engineer

Roadie

Across town or across the country, Roadie delivers.

DevOps Engineer, posted 33 days ago

Full Time, Remote, Team 51-200, Since 2014, H1B Sponsor

• Work cross-functionally with the InfoSec, SRE, and Engineering teams
• Keep up to date with current vulnerabilities in the DevOps space, patch, mitigate, or procure acceptance of the vulnerability by InfoSec standards
• Check code and repositories for insecure coding practices and work with Engineering teams to remediate
• Work closely with InfoSec to create and maintain Secure SDLC training
• Conduct security based quality assurance on pre-deployment packages, and seek approval or denial of those deployments based upon security findings
• Conduct security based quality assurance such as dynamic and static code testing
• Work closely with Compliance and Engineering teams to conduct pre-project risk assessments
• Implement security checks and practices within CI/CD pipelines to ensure secure code deployment and infrastructure
• Develop automation scripts and tools to streamline security processes, including vulnerability scanning, patch management, and incident response
• Conduct security training and awareness programs for engineering teams to promote a security-first culture

Azure SDLC Terraform

United States

DevOps Intern

AssetWorks Inc

We provide innovative and practical solutions to help our customers, and the people they serve, thrive.

DevOps Engineer, posted 33 days ago

Internship, Remote, Team 5,001-10,000, H1B Sponsor

• Assist in building and enhancing CI/CD pipelines (e.g., Azure DevOps) for application deployment and upgrades
• Support AI-first initiatives by exploring automation opportunities using scripting and AI tools
• Contribute to reporting and analysis of system performance, deployment metrics, and operational KPIs
• Conduct research and experimentation on AI/ML applications in DevOps (AIOps, predictive monitoring, automation)
• Help maintain and improve observability tools (logging, monitoring, alerting systems)
• Collaborate with cross-functional teams (Hosting, IT, DBA) on assigned projects
• Prepare documentation, reports, and presentations related to DevOps processes and improvements
• Support day-to-day operational tasks, including troubleshooting, deployments, and system validation
• Assist in maintaining organized records of pipelines, environments, and automation scripts

Azure Cloud Python

Pennsylvania

Job Closed

DevSecOps/DevOps Engineer

Caspar Health

Effective, recognized digital rehabilitation combined with personal therapeutic care - independent of time & location!

DevOps Engineer, posted 33 days ago

Full Time, Remote, Team 51-200, H1B No Sponsor

• Take primary responsibility for security alerts and vulnerabilities.
• Integrate automated security tests and compliance checks into our CI/CD pipelines.
• Use Terraform and Terragrunt to advance our AWS infrastructure.
• Work within a team to translate regulatory requirements into automated controls.
• Manage and harden our data layers (PostgreSQL, Redis) and orchestrate our Kubernetes environment.
• Collaborate with development teams to identify and remediate vulnerabilities early in the software lifecycle.

AWS Docker JavaScript Kubernetes Linux Node.js PostgreSQL Python Redis Terraform Go

Germany

Job Closed
