The Enterprise MLOps platform powering over 20% of the Fortune 100
Staff Site Reliability Engineer
Location
California
Posted
4 days ago
Salary
$200K - $230K / year
Seniority
Lead
Job Description
Domino Data Lab
- Lead the development of Domino's internal AI-assisted reliability tooling, including systems that analyze tickets, logs, traces, and documentation to help teams resolve outages faster with less recurring toil
- Improve the observability coverage and signal quality for our most critical customer-facing systems, so engineers have more to work with throughout the development and support lifecycle
- Own incident response end-to-end, from detection to remediation, and leave each problem space better documented, better understood, and less likely to recur
- Guide the development of customer and user-facing observability tools within our products
- Define and mature SLO/SLI frameworks for priority services, turning abstract reliability goals into measurable, actionable standards
- Scale cloud operations practices for Domino’s single-tenant SaaS offering, and work with engineering teams to improve the reliability and repeatability of customer deployments and upgrades
- Mentor other engineers and shape how SRE is practiced at Domino, including incident response workflows, operational readiness expectations, and post-incident learning culture
Job Requirements
- Deep experience in Site Reliability Engineering, platform engineering, or a software engineering role with genuine, hands-on operational ownership
- Fluency with Kubernetes, Linux, cloud platforms, and observability tooling, and the ability to use them to investigate complex, real-world production problems
- A strong ability to perceive and close reliability gaps in technical products, tools and processes
- Strong software engineering skills in Python or Go, with a track record of building internal tools or services that people actually rely on
- Comfort leading technically ambiguous work and influencing direction across teams without needing direct authority to get things done
- A history of improving reliability through engineering and automation, not just putting out fires manually
- Strong communication skills and real experience mentoring engineers or shaping technical decision-making on your team
- Sound judgment about AI/LLM tooling: you know where it genuinely helps in operational workflows and where it adds noise instead of signal
- Bonus: Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or building tooling for support or developer teams
Benefits
- equity
- company bonus or sales commissions/bonuses
- 401(k) plan
- medical, dental, and vision benefits
- wellness stipends
More DevOps Engineer Jobs
Senior Network Deployment Engineer
CodiLime
A strategic partner for technology-driven companies | Network engineering | Software engineering
- Leading design, architecture, and optimization for networking infrastructure and devices in large-scale DCs and/or offices
- Overseeing entire site deployment cycles from initial requirements to operational handover
- Collaborating with cross-functional IT, security, and facility teams plus external vendors
- Enforcing and maintaining robust technical documentation and architectural standards
- Developing Python and/or Ansible scripts to drive network automation initiatives
Senior DevSecOps Engineer
Kaseya
Kaseya® is the leading provider of IT and security management solutions for managed service providers (MSPs) and SMBs.
- Design and implement security controls across CI/CD pipelines, cloud infrastructure, and software development workflows
- Integrate security testing tools, including SAST, DAST, dependency scanning, and vulnerability management solutions
- Conduct threat modeling and risk assessments for applications, infrastructure, and platform services
- Implement and maintain security controls for cloud environments, infrastructure-as-code, and containerized workloads
- Develop automated security and compliance checks supporting regulatory and internal security requirements
- Partner with Engineering, Infrastructure, and Security teams to implement secure development practices
- Evaluate, implement, and optimize security tooling supporting application and infrastructure security
- Mentor engineers on secure development practices and DevSecOps methodologies
- Define product infrastructure in line with the architecture definitions
- Ensure the resilience of the environment
- Align and track SLIs, SLAs, and SLOs
- Troubleshoot the application infrastructure (know it, take part, propose solutions)
- Assist with application troubleshooting when invited by the developers
- Drive monitoring, logging, and automation solutions
- Document the product infrastructure
- Understand and take part in infrastructure capacity and cost planning
- Analyze application trends
- Drive new solutions in the product
- Take part in POCs and tests of new solutions
- IaC: infrastructure as code
- Deploy and create cloud infrastructure (Azure, OCI, AWS, and GCP)
- Submit and follow up on requests to the on-premises infrastructure teams



