Job Closed

This listing is no longer active.

Okta

The World's Identity Company

Site Reliability Engineer

DevOps Engineer · Full Time · Remote · Mid Level · Team 5,001-10,000 · Since 2010 · H1B Sponsor

Location

Worldwide

Posted

87 days ago

Seniority

Mid Level

Job Description

Role Description

As a mid-level Site Reliability Engineer, you'll join our SRE team based in Europe to ensure our production systems are not only operational but also resilient, scalable, and ready for exponential growth. This isn't just about keeping the lights on; it's about directly contributing to the platform's core resiliency and robustness. You'll be a hands-on builder, crafting solutions that make our system more reliable by design.

- Design and build custom software in Go to enhance the platform's reliability, resiliency, and redundancy.
- Partner with engineering teams to embed reliability principles, improving the availability, performance, and observability of our services.
- Use your deep understanding of infrastructure and observability principles to identify opportunities for improvement within the product and implement solutions.
- Contribute to our on-call rotation, providing rapid, effective response to critical incidents and using your expertise to troubleshoot, mitigate, or accurately escalate production issues.
- Develop and refine our SRE tooling and processes, focusing on automation and operational efficiency.
- Define, document, and champion reliability best practices across the organisation.

Qualifications

- A proactive and systematic approach to problem-solving, with a high degree of ownership.
- Proven experience in a production environment supporting large-scale, mission-critical applications with a high degree of autonomy.
- Proficiency in at least one programming language, with a strong preference for Go. You should be comfortable writing custom applications, not just scripts.
- Experience with infrastructure as code (Terraform), container orchestration (Kubernetes, Docker), and GitOps (ArgoCD).
- Demonstrable expertise in a major cloud provider (Azure, AWS, or GCP).
- A strong grasp of microservices architecture, databases (SQL, NoSQL), and networking fundamentals, so you can understand how custom code can solve platform-level issues.
- An understanding of core SRE principles, including SLIs, SLOs, and error budgets.
- Experience in an on-call rotation for a 24/7 cloud-based environment.
- Exceptional communication and collaboration skills, with a proven ability to work effectively in a remote, distributed team where tasks may be self-driven.

Benefits

- Supporting Your Well-Being
- Driving Social Impact
- Developing Talent and Fostering Connection + Community

Company Description

Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence.

We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.

Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran.

More DevOps Engineer Jobs

DevOps Intern, Direct-to-Consumer Engineering

NBCUniversal

Here you can create the extraordinary. Join us.

DevOps Engineer · 87 days ago

Internship · Remote · Team 10,001+ · Since 2004 · H1B Sponsor

• As an NBCUniversal Academic Year intern, you’ll work on real projects and be part of our collaborative culture.
• Contribute to meaningful work while building skills that matter.
• Work closely with development and operations partners to ensure reliable, scalable, and efficient systems that power TVE platforms.
• Responsibilities include building, testing, and maintaining infrastructure and technology stacks, implementing and optimizing CI/CD pipelines, monitoring system health and automating maintenance tasks, and contributing to process improvement initiatives aimed at enhancing quality while reducing time and costs.

AWS CI/CD CircleCI DevOps Git Github Actions Grafana Java Jenkins Node.js Python Terraform

New York

$30 / hour

Job Closed

Principal Engineer, Python DevOps

Nagarro

Nagarro (Frankfurt: NA9) is a leader in digital product engineering and drives technology-led business breakthroughs.

DevOps Engineer · 87 days ago

Full Time · Remote · Team 10,001+ · Since 1996 · H1B Sponsor

• Design and develop scalable web applications using Python and modern frontend frameworks.
• Build and maintain backend services and APIs for integrations.
• Develop responsive frontend applications using JavaScript frameworks.
• Implement microservices and integrate with databases and cloud platforms.
• Ensure application security, performance, and scalability.
• Contribute to CI/CD pipelines and DevOps processes.
• Participate in code reviews, technical discussions, and mentoring.
• Implement monitoring, logging, and system reliability improvements.
• Collaborate with cross-functional teams to deliver end-to-end solutions.

Angular AWS Cloud Django Docker Flask GraphQL JavaScript Microservices MongoDB MySQL NoSQL PostgreSQL Python React Redis TypeScript Vue.js

India

Job Closed

DevSecOps Lead

Corning

Headquartered in Corning, New York, Corning is a leading global manufacturer of specialty glass and ceramics. This company has a long history of innovation and

DevOps Engineer · 87 days ago

Full Time · Remote

Lead the security and compliance program while managing security tools and cloud infrastructure. Collaborate across teams to implement automated solutions and enhance security processes, ensuring readiness for audits and compliance standards.

Canada

Site Reliability Engineer

Mistral AI

Developing the best generative AI models

DevOps Engineer · 87 days ago

Full Time · Remote · Team 11-50 · H1B No Sponsor

• Balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.
• Design, build, and maintain scalable, highly available, and fault-tolerant infrastructure to support our web services and ML workloads.
• Make sure our platform, inference, and model training environments are always highly available, and enable seamless replication of work environments across several HPC clusters.
• Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, user administration, data extraction, infrastructure scaling, etc.).
• Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
• Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging, and alerting systems) for both our client-facing APIs and large training runs.
• Occasionally participate in on-call rotations to respond to incidents, and perform root cause analysis to prevent future occurrences.
• Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, and Terraform.
• Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments.
• Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure.
• Design and develop new workflows and tooling to improve the reliability, availability, and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.).
• Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements.
• Document processes and procedures to ensure consistency and knowledge sharing across the team.
• Contribute to open-source projects, research publications, blog articles, and conferences.

Cloud Distributed Systems Docker Flux Grafana Kubernetes Prometheus Python Terraform Go

New York
