Job Closed
This listing is no longer active.
Staff Software Reliability Engineer
Location
United States | Canada
Posted
128 days ago
Salary
$195K - $245K / year
Seniority
Lead
Job Description
Stairwell
This description is a summary of our understanding of the job description.
Role Description
We're looking for a Staff SRE who can own the reliability, scalability, and operational excellence of our platform. You'll work at the intersection of infrastructure and software engineering - building the systems, tooling, and practices that let our team ship confidently and operate at scale.
- Set technical direction for infrastructure and reliability - evaluate approaches, make architectural decisions, and establish standards.
- Own and evolve our Kubernetes-based infrastructure on GCP.
- Build and maintain CI/CD pipelines, deployment tooling, and release processes.
- Maintain and simplify our build system (Bazel) for faster, more reliable builds across the org.
- Define and instrument SLIs/SLOs; build dashboards and alerting that surface real problems.
- Drive incident response, post-mortems, and reliability improvements.
- Partner with product engineers to design systems that are reliable and operable from day one.
- Contribute to our engineering culture around AI-augmented development - sharing patterns, workflows, and lessons learned.
Qualifications
- Significant experience in SRE, platform engineering, or infrastructure roles at scale.
- Demonstrated technical leadership: you've driven significant infrastructure or reliability initiatives, not just executed on them.
- Deep hands-on expertise with Kubernetes (GKE preferred) and GCP services.
- Strong programming skills - Go preferred.
- Experience with build systems (Bazel strongly preferred) and CI/CD tooling.
- Practical experience with AI coding assistants as part of your regular workflow - not just experimentation, but daily use.
- Ability to critically evaluate AI-generated code and infrastructure configs: you know when to trust it, when to revise it, and when to write it yourself.
- Track record of improving reliability through automation, observability, and good engineering practices.
- Comfort with ambiguity and ownership; we're a small team where engineers drive decisions.
Nice to Have
- Background in security, malware analysis, or threat detection.
- Experience with large-scale data systems (BigTable, Spanner, BigQuery).
- Deep proficiency in Go.
Benefits
- Hard technical problems with real security impact.
- Small team, huge impact, high autonomy, low process overhead.
- Opportunity to collaborate with world-class experts in cybersecurity.
- Work remotely in the USA or Canada, or use our co-working space in Santa Clara to collaborate with teammates in-person.
More DevOps Engineer Jobs
Senior Site Reliability Engineer, Azure Red Hat OpenShift
Red Hat
The leading provider of enterprise open source solutions.
• Contribute code to increase the scalability and reliability of the service
• Contribute software tests and participate in peer review to increase the quality of our codebase
• Help and develop peers’ capabilities through knowledge sharing, mentoring, and collaboration
• Participate in a regular on-call schedule, including occasional paid weekends and holidays
• Practice sustainable incident response and blameless postmortems
• Resolve customer issues escalated from the Red Hat Global Support team
• Work within a small agile team to develop and improve SRE software, support your peers, plan and self-improve
• Explore and experiment with emerging AI technologies relevant to software development, proactively identifying opportunities to incorporate new AI capabilities into existing workflows and tooling.
Senior Site Reliability Engineer – m/f/d
Famedly GmbH
Famedly is a complete medical collaboration platform delivered as a single decentralized application.
• Responsibility for the reliability, observability, and performance of backend systems
• Design and implement SRE practices
• Maintain infrastructure as code
• Work closely with development teams
• Automate incident detection and remediation
• Contribute to architecture and roadmap
• Analyze systemic failure patterns and design improvements that prevent incident recurrence
• Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
• Build tooling and automation to reduce incident response toil and scale team impact
• Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
• Analyze reliability data to identify systemic improvements; build dashboards that drive action
• Explore AI-assisted approaches to documentation quality and incident analysis
• Design scalable reliability standards that reduce reactive workload over time.
• Own standards, practices, and continuous improvement of incident response
• Define incident commander eligibility criteria and manage the rotation
• Available as escalation IC when incidents exceed a team's management chain
• Develop and deliver training programs for engineering teams at all levels
• Coach teams through post-mortems and on developing actionable corrective actions
• Edit and review customer-facing incident documents to ensure quality and clarity
• Drive turnaround SLAs while maintaining technical accuracy
• Ensure clear explanation of what happened, why, and how we'll prevent recurrence
• Partner with engineering leaders to elevate reliability practices
• Be the expert who teams proactively engage for guidance
DevOps Engineer
Veradigm®
Driving value through its unique combination of platforms, data, expertise, connectivity, and scale.
• Veradigm is expanding its DevOps Engineering team and is seeking a highly skilled and enthusiastic DevOps Engineer to support and evolve our platforms and systems.
• This role is critical to the success of our VEHR/VPM/VIE products and will be responsible for building and deploying solutions and services in on-premises and hosted environments.
• It will also support the Azure environments used by the Dev/QA teams.
• Knowledge of secure DevOps practices (secrets management, compliance, scanning tools).
• Exposure to and understanding of container technologies such as Docker and/or Kubernetes.
• Experience with configuration management tools (e.g., Ansible, Chef) is a plus.
• Able to work with developers supporting both modern and legacy applications.
• Comfortable with CI/CD, including debugging build failures and deployment issues.
• Self-driven and motivated, with the ability to work independently and prioritize tasks effectively.
• Strong communication and interpersonal skills, with the ability to collaborate and communicate effectively with cross-functional teams.
• Excellent troubleshooting and problem-solving skills, with keen attention to detail.
• Excellent documentation skills.



