Job Closed
This listing is no longer active.
The World's Identity Company
Site Reliability Engineer
Location: Worldwide
Posted: 38 days ago
Salary: 0
Seniority: Mid Level
Job Description
Site Reliability Engineer
Okta
Role Description

As a mid-level Site Reliability Engineer, you'll join our SRE team based in Europe to ensure our production systems are not only operational but also resilient, scalable, and ready for exponential growth. This isn't just about keeping the lights on; it's about directly contributing to the platform's core resiliency and robustness. You'll be a hands-on builder, crafting solutions that make our system more reliable by design.

- Design and build custom software in Go to enhance the platform's reliability, resiliency, and redundancy.
- Partner with engineering teams to embed reliability principles, improving the availability, performance, and observability of our services.
- Use your deep understanding of infrastructure and observability principles to identify opportunities for improvement within the product and implement solutions.
- Contribute to our on-call rotation, providing rapid, effective response to critical incidents and using your expertise to troubleshoot, mitigate or accurately escalate production issues.
- Develop and refine our SRE tooling and processes, focusing on automation and operational efficiency.
- Define, document, and champion reliability best practices across the organisation.

Qualifications

- A proactive and systematic approach to problem-solving, with a high degree of ownership.
- Proven experience in a production environment supporting large-scale, mission-critical applications with a high degree of autonomy.
- Proficiency in at least one programming language, with a strong preference for Go. You should be comfortable writing custom applications, not just scripts.
- Experience with infrastructure as code (Terraform), container orchestration (Kubernetes, Docker) and GitOps (ArgoCD).
- Demonstrable expertise in a major cloud provider (Azure, AWS, or GCP).
- A strong grasp of microservices architecture, databases (SQL, NoSQL), and networking fundamentals, so you can understand how custom code can solve platform-level issues.
- An understanding of core SRE principles, including SLIs, SLOs, and error budgets.
- Experience in an on-call rotation for a 24/7 cloud-based environment.
- Exceptional communication and collaboration skills, with a proven ability to work effectively in a remote, distributed team, where tasks may be self-driven.

Benefits

- Supporting Your Well-Being
- Driving Social Impact
- Developing Talent and Fostering Connection + Community

Company Description

Okta secures AI by building the trusted, neutral infrastructure that enables organizations to safely embrace this new era. This work requires a relentless drive to solve complex challenges with real-world stakes. We are looking for builders and owners who operate with speed and urgency and execute with excellence.

We are intentional about connection. Our global community, spanning over 20 offices worldwide, is united by a drive to innovate. Your journey begins with an immersive, in-person onboarding experience designed to accelerate your impact and connect you to our mission and team from day one.

Okta is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, ancestry, marital status, age, physical or mental disability, or status as a protected veteran.
More DevOps Engineer Jobs
DevOps Intern, Direct-to-Consumer Engineering
NBCUniversal
Here you can create the extraordinary. Join us.
• As an NBCUniversal Academic Year intern, you’ll work on real projects and be part of our collaborative culture.
• Contribute to meaningful work while building skills that matter.
• Work closely with development and operations partners to ensure reliable, scalable, and efficient systems that power TVE platforms.
• Responsibilities include building, testing, and maintaining infrastructure and technology stacks, implementing and optimizing CI/CD pipelines, monitoring system health and automating maintenance tasks, and contributing to process improvement initiatives aimed at enhancing quality while reducing time and costs.
Principal Engineer, Python DevOps
Nagarro
Nagarro (Frankfurt: NA9) is a leader in digital product engineering and drives technology-led business breakthroughs.
• Design and develop scalable web applications using Python and modern frontend frameworks.
• Build and maintain backend services and APIs for integrations.
• Develop responsive frontend applications using JavaScript frameworks.
• Implement microservices and integrate with databases and cloud platforms.
• Ensure application security, performance, and scalability.
• Contribute to CI/CD pipelines and DevOps processes.
• Participate in code reviews, technical discussions, and mentoring.
• Implement monitoring, logging, and system reliability improvements.
• Collaborate with cross-functional teams to deliver end-to-end solutions.
DevSecOps Lead
Corning
Headquartered in Corning, New York, Corning is a leading global manufacturer of specialty glass and ceramics. This company has a long history of innovation and
Lead the security and compliance program while managing security tools and cloud infrastructure. Collaborate across teams to implement automated solutions and enhance security processes, ensuring readiness for audits and compliance standards.
Site Reliability Engineer
Mistral AI
Mistral AI is dedicated to democratizing frontier AI, making it accessible to everyone by promoting open-source, efficient, and innovative AI models, products,
• Balance the day-to-day operations on production systems with long-term software engineering improvements to reduce operational toil and foster the reliability, availability, and performance of these systems.
• Design, build, and maintain scalable, highly available and fault-tolerant infrastructures to support our web services and ML workloads.
• Make sure our platform, inference and model training environments are always highly available and enable seamless replication of work environments across several HPC clusters.
• Operate systems and troubleshoot issues in production environments (interrupts, on-call responses, users admin, data extraction, infrastructure scaling, etc.).
• Implement and improve monitoring, alerting, and incident response systems to ensure optimal system performance and minimize downtime.
• Implement and maintain workflows and tools (CI/CD, containerization, orchestration, monitoring, logging and alerting systems) for both our client-facing APIs and large training runs.
• Participate occasionally in on-call rotations to respond to incidents and perform root cause analysis to prevent future occurrences.
• Drive continuous improvement in infrastructure automation, deployment, and orchestration using tools like Kubernetes, Flux, Terraform.
• Collaborate with AI/ML researchers to develop and implement solutions that enable safe and reproducible model-training experiments.
• Build a cloud-agnostic platform offering an abstraction layer between science and infrastructure.
• Design and develop new workflows and tooling to improve the reliability, availability and performance of our systems (automation scripts, refactoring, new API-based features, web apps, dashboards, etc.).
• Collaborate with the security team to ensure infrastructure adheres to best security practices and compliance requirements.
• Document processes and procedures to ensure consistency and knowledge sharing across the team.
• Contribute to open-source projects, research publications, blog articles and conferences.



