Jobtome

Site Reliability Engineer

DevOps Engineer, Full Time, Remote, Senior, Team 51-200, Since 2014, H1B No Sponsor

Location

Switzerland

Posted

137 days ago

Salary

Seniority

Senior

Bachelor Degree, English, Distributed Systems

Job Description

• Ensure the reliability, scalability, and performance of production systems
• Design resilient architectures
• Define reliability standards
• Improve observability and incident response
• Reduce operational toil through automation
• Contribute to codebases
• Collaborate on system design
• Help evolve engineering culture toward SRE best practices

Job Requirements

Strong experience running production systems at scale
Solid understanding of distributed systems and failure modes
Proven experience with SLO-driven reliability
Strong coding skills
Cloud infrastructure automation experience
Ability to debug complex cross-system issues
Ownership mindset and strong communication skills
Pragmatic approach to reliability, speed, and cost trade-offs

Benefits

Flexible working hours
Remote-friendly setup

More DevOps Engineer Jobs

Senior AWS DevOps Engineer

IBMC

Driving Business Success in Indonesia.

DevOps Engineer, 137 days ago

Full Time, Remote, Team 51-200, Since 2022, H1B No Sponsor

• Infrastructure Excellence: Design, build, and maintain robust AWS infrastructure to support scalable, secure, and high-availability applications.
• Automation & CI/CD: Manage and continuously improve CI/CD pipelines to streamline deployments and ensure maximum system reliability.
• Security & Compliance: Implement AWS security best practices, including IAM roles/policies, secure networking (VPC/VPN), and data protection measures.
• Performance & Cost Optimization: Proactively monitor and optimize system performance, scalability, and cost efficiency across all AWS environments.
• Resilience & Troubleshooting: Troubleshoot complex infrastructure issues, develop disaster recovery strategies, and ensure overall operational resilience.
• Technical Mentorship: Provide guidance and mentorship on AWS and DevOps best practices to the wider engineering team to foster a culture of excellence.

AWS, DynamoDB, Amazon EC2, Elasticsearch, NoSQL, Prometheus, Python, Terraform

Indonesia

Principal Site Reliability Engineer – AI-first SRE

The New York Times

DevOps Engineer, 137 days ago

Full Time, Remote, Team 1,001-5,000, Since 1851, H1B Sponsor

• Architect and maintain self-healing systems with 99.9%+ availability targets.
• Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
• Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
• Build AIOps-based observability and auto-remediation pipelines.
• Apply predictive modeling to forecast failures before they impact users.
• Lead chaos, performance, and resilience testing programs.
• Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance.
• Mentor engineers and drive reliability standards across teams.
• Partner with platform, data, and product teams to ensure stability aligns with business goals.
• Support major incident response, incident review, and participate in on-call rotations.

AWS, GCP, Grafana, Kubernetes, Prometheus, Python, Terraform

Argentina

Job Closed

Software Engineer – Site Reliability Engineer

Captions

Your AI-powered creative studio.

DevOps Engineer, 138 days ago

Full Time, Remote, Team 11-50, H1B No Sponsor

• You will be responsible for the availability and integrity of the infrastructure that underpins Alkira’s Cloud Networking platform
• You hold the production systems together; troubleshoot issues that arise in production deployment
• Provide 24x7 coverage as a part of scheduled shift and on-call rotation
• Work with multiple tools like Prometheus, Grafana, Jira etc. to monitor, manage, triage and document infrastructure issues in real time
• Automate infrastructure deployment using CI/CD
• Build necessary tools to evolve how we maintain and monitor our solution
• Develop and execute system and integration test plans

AWS, Azure, GCP, Grafana, Jenkins, Kubernetes, Prometheus, Terraform

India

Senior Software Reliability Engineer – AI

MixMode

Automated threat detection, unparalleled network visibility, & deep guided investigation powered by Self-Supervised AI.

DevOps Engineer, 139 days ago

Other, Remote, Team 11-50, H1B No Sponsor

• Own the reliability, performance, and operational health of production AI systems, focusing on improving complex, existing services.
• Lead efforts to refactor and harden the AI codebase to improve observability, maintainability, and resilience.
• Diagnose and resolve issues across distributed systems, including latency, throughput, data pipelines, and resource utilization.
• Design and build monitoring, alerting, and debugging tools for high-availability services.
• Partner with researchers and ML engineers to productionize models at scale.
• Establish best practices for testing, deployment, capacity planning, and incident response.
• Serve as a technical leader during on-call rotations, driving incident response, postmortems, and continuous system improvements.

Distributed Systems, Java, Apache Kafka, Kotlin, Kubernetes, MySQL, PostgreSQL, Python, Scala, Apache Spark

California

Job Closed
