Job Closed

This listing is no longer active.

Staff Software Reliability Engineer

DevOps Engineer | Other | Remote | Lead | Team 11-50

Location

United States | Canada

Posted

128 days ago

Salary

$195K - $245K / year

Seniority

Lead

Job Description

Staff Software Reliability Engineer

Stairwell

Role Description

We're looking for a Staff SRE who can own the reliability, scalability, and operational excellence of our platform. You'll work at the intersection of infrastructure and software engineering - building the systems, tooling, and practices that let our team ship confidently and operate at scale.

  • Set technical direction for infrastructure and reliability - evaluate approaches, make architectural decisions, and establish standards.
  • Own and evolve our Kubernetes-based infrastructure on GCP.
  • Build and maintain CI/CD pipelines, deployment tooling, and release processes.
  • Maintain and simplify our build system (Bazel) for faster, more reliable builds across the org.
  • Define and instrument SLIs/SLOs; build dashboards and alerting that surface real problems.
  • Drive incident response, post-mortems, and reliability improvements.
  • Partner with product engineers to design systems that are reliable and operable from day one.
  • Contribute to our engineering culture around AI-augmented development - sharing patterns, workflows, and lessons learned.

Job Requirements

  • Significant experience in SRE, platform engineering, or infrastructure roles at scale.
  • Demonstrated technical leadership: you've driven significant infrastructure or reliability initiatives, not just executed on them.
  • Deep hands-on expertise with Kubernetes (GKE preferred) and GCP services.
  • Strong programming skills - Go preferred.
  • Experience with build systems (Bazel strongly preferred) and CI/CD tooling.
  • Practical experience with AI coding assistants as part of your regular workflow - not just experimentation, but daily use.
  • Ability to critically evaluate AI-generated code and infrastructure configs: you know when to trust it, when to revise it, and when to write it yourself.
  • Track record of improving reliability through automation, observability, and good engineering practices.
  • Comfort with ambiguity and ownership; we're a small team where engineers drive decisions.

Nice to Have

  • Background in security, malware analysis, or threat detection.
  • Experience with large-scale data systems (BigTable, Spanner, BigQuery).
  • Deep proficiency in Go.

Benefits

  • Hard technical problems with real security impact.
  • Small team, huge impact, high autonomy, low process overhead.
  • Opportunity to collaborate with world-class experts in cybersecurity.
  • Work remotely in the USA or Canada, or use our co-working space in Santa Clara to collaborate with teammates in person.
