Domino Data Lab

The Enterprise MLOps platform powering over 20% of the Fortune 100

Staff Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Lead | Team 201-500 | Since 2013 | H1B Sponsor

Location

California

Posted

4 days ago

Salary

$200K - $230K / year

Seniority

Lead

Postgraduate Degree | English | Cloud | Kubernetes | Linux | Python | Go

Job Description

  • Lead the development of Domino's internal AI-assisted reliability tooling, including systems that analyze tickets, logs, traces, and documentation to help teams resolve outages faster with less recurring toil
  • Improve the observability coverage and signal quality for our most critical customer-facing systems, so engineers have more to work with throughout the development and support lifecycle
  • Own incident response end-to-end, from detection to remediation, and leave each problem space better documented, better understood, and less likely to recur
  • Guide the development of customer and user-facing observability tools within our products
  • Define and mature SLO/SLI frameworks for priority services, turning abstract reliability goals into measurable, actionable standards
  • Scale cloud operations practices for Domino’s single-tenant SaaS offering, and work with engineering teams to improve the reliability and repeatability of customer deployments and upgrades
  • Mentor other engineers and shape how SRE is practiced at Domino, including incident response workflows, operational readiness expectations, and post-incident learning culture

Job Requirements

  • Deep experience in Site Reliability Engineering, platform engineering, or a software engineering role with genuine, hands-on operational ownership
  • Fluency with Kubernetes, Linux, cloud platforms, and observability tooling, and the ability to use them to investigate complex, real-world production problems
  • A strong ability to perceive and close reliability gaps in technical products, tools and processes
  • Strong software engineering skills in Python or Go, with a track record of building internal tools or services that people actually rely on
  • Comfort leading technically ambiguous work and influencing direction across teams without needing direct authority to get things done
  • A history of improving reliability through engineering and automation, not just putting out fires manually
  • Strong communication skills and real experience mentoring engineers or shaping technical decision-making on your team
  • Sound judgment about AI/LLM tooling: you know where it genuinely helps in operational workflows and where it adds noise instead of signal
  • Bonus: Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or building tooling for support or developer teams

Benefits

  • equity
  • company bonus or sales commissions/bonuses
  • 401(k) plan
  • medical, dental, and vision benefits
  • wellness stipends

More DevOps Engineer Jobs

Senior Network Deployment Engineer

CodiLime

A strategic partner for technology-driven companies | Network engineering | Software engineering

DevOps Engineer | 4 days ago | Contract | Remote | Team 201-500 | Since 2011 | H1B No Sponsor

  • Leading design, architecture, and optimization for networking infrastructure and devices in large-scale DCs and/or offices
  • Overseeing entire site deployment cycles from initial requirements to operational handover
  • Collaborating with cross-functional IT, security, and facility teams plus external vendors
  • Enforcing and maintaining robust technical documentation and architectural standards
  • Developing Python and/or Ansible scripts to drive network automation initiatives

Poland
zł22K - zł26K / month

Senior Network Deployment Engineer

CodiLime

A strategic partner for technology-driven companies | Network engineering | Software engineering

DevOps Engineer | 4 days ago | Contract | Remote | Team 201-500 | Since 2011 | H1B No Sponsor

  • Leading design, architecture, and optimization for networking infrastructure and devices in large-scale DCs and/or offices
  • Overseeing entire site deployment cycles from initial requirements to operational handover
  • Collaborating with cross-functional IT, security, and facility teams plus external vendors
  • Enforcing and maintaining robust technical documentation and architectural standards
  • Developing Python and/or Ansible scripts to drive network automation initiatives

Brazil

Senior DevSecOps Engineer

Kaseya

Kaseya® is the leading provider of IT and security management solutions for managed service providers (MSPs) and SMBs.

DevOps Engineer | 4 days ago | Full Time | Remote | Team 1,001-5,000 | H1B Sponsor

  • Design and implement security controls across CI/CD pipelines, cloud infrastructure, and software development workflows
  • Integrate security testing tools, including SAST, DAST, dependency scanning, and vulnerability management solutions
  • Conduct threat modeling and risk assessments for applications, infrastructure, and platform services
  • Implement and maintain security controls for cloud environments, infrastructure-as-code, and containerized workloads
  • Develop automated security and compliance checks supporting regulatory and internal security requirements
  • Partner with Engineering, Infrastructure, and Security teams to implement secure development practices
  • Evaluate, implement, and optimize security tooling supporting application and infrastructure security
  • Mentor engineers on secure development practices and DevSecOps methodologies

United Kingdom

SRE Specialist

credsystem

Making new achievements possible.

DevOps Engineer | 4 days ago | Full Time | Remote | Team 201-500 | Since 1996 | H1B No Sponsor

  • Define product infrastructure in line with the architecture definitions
  • Ensure the resilience of the environment
  • Align and track SLIs, SLAs, and SLOs
  • Troubleshoot the application infrastructure (knows it, takes part, proposes solutions)
  • Assist with application troubleshooting at the developers' invitation
  • Direct monitoring, logging, and automation solutions
  • Document the product infrastructure
  • Take part in and understand infrastructure capacity and cost
  • Analyze application trends
  • Direct new solutions within the product
  • Take part in POCs and tests of new solutions
  • IaC: Infrastructure as Code
  • Deploy/create cloud infrastructure (Azure, OCI, AWS, and GCP)
  • Submit and follow up on requests to the on-premises infrastructure teams

Brazil