Site Reliability Engineer

DevOps Engineer · Full Time · Remote · Senior · Team 51-200 · Since 2014 · H1B No Sponsor

Location

Switzerland

Posted

137 days ago

Salary

0

Seniority

Senior

Bachelor Degree · English · Distributed Systems

Job Description

Site Reliability Engineer

Jobtome

  • Ensure the reliability, scalability, and performance of production systems
  • Design resilient architectures
  • Define reliability standards
  • Improve observability and incident response
  • Reduce operational toil through automation
  • Contribute to codebases
  • Collaborate on system design
  • Help evolve engineering culture toward SRE best practices

Job Requirements

  • Strong experience running production systems at scale
  • Solid understanding of distributed systems and failure modes
  • Proven experience with SLO-driven reliability
  • Strong coding skills
  • Cloud infrastructure automation experience
  • Ability to debug complex cross-system issues
  • Ownership mindset and strong communication skills
  • Pragmatic approach to reliability, speed, and cost trade-offs

Benefits

  • Flexible working hours
  • Remote-friendly setup

More DevOps Engineer Jobs

Senior AWS DevOps Engineer

IBMC

Driving Business Success in Indonesia.

DevOps Engineer · 137 days ago
Full Time · Remote · Team 51-200 · Since 2022 · H1B No Sponsor

  • Infrastructure Excellence: Design, build, and maintain robust AWS infrastructure to support scalable, secure, and high-availability applications.
  • Automation & CI/CD: Manage and continuously improve CI/CD pipelines to streamline deployments and ensure maximum system reliability.
  • Security & Compliance: Implement AWS security best practices, including IAM roles/policies, secure networking (VPC/VPN), and data protection measures.
  • Performance & Cost Optimization: Proactively monitor and optimize system performance, scalability, and cost efficiency across all AWS environments.
  • Resilience & Troubleshooting: Troubleshoot complex infrastructure issues, develop disaster recovery strategies, and ensure overall operational resilience.
  • Technical Mentorship: Provide guidance and mentorship on AWS and DevOps best practices to the wider engineering team to foster a culture of excellence.

Indonesia
Full Time · Remote · Team 1,001-5,000 · Since 1851 · H1B Sponsor

  • Architect and maintain self-healing systems with 99.9%+ availability targets.
  • Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns.
  • Implement adaptive SLIs/SLOs that evolve automatically from real-time data.
  • Build AIOps-based observability and auto-remediation pipelines.
  • Apply predictive modeling to forecast failures before they impact users.
  • Lead chaos, performance, and resilience testing programs.
  • Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance.
  • Mentor engineers and drive reliability standards across teams.
  • Partner with platform, data, and product teams to ensure stability aligns with business goals.
  • Support major incident response and incident review, and participate in on-call rotations.

Argentina
Job Closed
Full Time · Remote · Team 11-50 · H1B No Sponsor

  • You will be responsible for the availability and integrity of the infrastructure that underpins Alkira’s Cloud Networking platform
  • You hold the production systems together and troubleshoot issues that arise in production deployments
  • Provide 24x7 coverage as part of a scheduled shift and on-call rotation
  • Work with multiple tools such as Prometheus, Grafana, and Jira to monitor, manage, triage, and document infrastructure issues in real time
  • Automate infrastructure deployment using CI/CD
  • Build the tools needed to evolve how we maintain and monitor our solution
  • Develop and execute system and integration test plans

India

Senior Software Reliability Engineer – AI

MixMode

Automated threat detection, unparalleled network visibility, & deep guided investigation powered by Self-Supervised AI.

DevOps Engineer · 139 days ago
Other · Remote · Team 11-50 · H1B No Sponsor

  • Own the reliability, performance, and operational health of production AI systems, focusing on improving complex, existing services.
  • Lead efforts to refactor and harden the AI codebase to improve observability, maintainability, and resilience.
  • Diagnose and resolve issues across distributed systems, including latency, throughput, data pipelines, and resource utilization.
  • Design and build monitoring, alerting, and debugging tools for high-availability services.
  • Partner with researchers and ML engineers to productionize models at scale.
  • Establish best practices for testing, deployment, capacity planning, and incident response.
  • Serve as a technical leader during on-call rotations, driving incident response, postmortems, and continuous system improvements.

California
Job Closed