Job Closed

This listing is no longer active.

Confluent

Set data in motion.

Staff Site Reliability Engineer – Incident Management & Reliability

DevOps Engineer | Full Time | Remote | Lead | Team 1,001-5,000 | Since 2014 | H1B Sponsor

Location

Canada

Posted

128 days ago

Salary

CA$225.1K - CA$264.5K / year

Seniority

Lead

Job Description

  • Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
  • Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
  • Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
  • Own standards, practices, and continuous improvement of incident response across engineering
  • Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
  • Develop and deliver training programs; coach teams through post-mortems
  • Partner with engineering leaders to elevate reliability practices org-wide

Job Requirements

  • 10+ years of relevant experience in SRE, incident management, or reliability engineering
  • Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
  • Experience navigating reliability/incident programs at 500+ engineer organizations
  • Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
  • Strong understanding of distributed systems and failure modes at scale
  • Deep experience with observability: metrics, logging, tracing
  • Kubernetes and container orchestration experience
  • Understanding of CI/CD pipelines and release processes
  • Strong written communication (design docs, runbooks, post-mortems)
  • Experience driving org-wide process and cultural changes
  • Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems

Benefits

  • Offers Equity

More DevOps Engineer Jobs


Build and Release Engineer

HeadSpin

Optimize Digital Experiences with Data Science Capabilities

DevOps Engineer | 128 days ago
Full Time | Remote | Team 201-500 | H1B Sponsor

  • Maintain and improve the deployment pipeline
  • Ensure that software can be deployed smoothly to cloud infrastructure and on-premise environments
  • Collaborate with the engineering team to track dependencies and apply upgrades
  • Create and maintain documentation for the release process
  • Coordinate with stakeholders to ensure a smooth release process
  • Set up and configure CI/CD pipelines to automate the build and deployment process

India
Job Closed

Site Reliability Engineer

Dropbox

Dropbox is the one place to keep life organized and keep work moving.

DevOps Engineer | 128 days ago
Full Time | Remote | Team 1,001-5,000 | Since 2007 | H1B Sponsor

  • Ensure the reliability, scalability, and performance of Dropbox's infrastructure and services
  • Collaborate with cross-functional teams to develop and maintain best practices for monitoring, logging, and incident response
  • Build, implement, and maintain automation and infrastructure-as-code tooling, specifically Terraform, Ansible, and GitHub Actions, as well as custom code platforms
  • Use container orchestration platforms, such as Kubernetes, Amazon ECS, and Red Hat OpenShift, to manage containers at scale
  • Manage and optimize monitoring and logging pipelines using tools like Datadog and Cribl LogStream
  • Drive improvement projects related to service health and visibility for our stakeholders, ranging from developers to business service owners to C-level
  • Develop and maintain custom tooling and automation scripts in Bash, Python, and other scripting languages

Mexico

Senior Data DevOps Engineer

Digitanity

Bridging your culture with international talent

DevOps Engineer | 128 days ago
Contract | Remote | Team 51-200 | H1B No Sponsor

  • Deploy, maintain, and optimize cloud-based data infrastructure on AWS
  • Own CI/CD pipelines, infrastructure automation, and monitoring
  • Ensure platform stability, observability, and scalability
  • Support the transition from single-client to multi-client architecture
  • Work closely with the founder and data engineers to move fast and safely

Bulgaria

DevSecOps Site Reliability Engineer

System Automation Corporation

Bringing innovative solutions to our regulatory communities. FOLLOW us to be connected to the Evoke Network.

DevOps Engineer | 128 days ago
Other | Remote | Team 51-200 | Since 1973 | H1B No Sponsor

  • Design and evolve Azure platform infrastructure with a focus on scalability, reliability, and growth readiness.
  • Participate in capacity planning to support growth, peak demand, and seasonal usage patterns.
  • Integrate with development resources to implement infrastructure-as-code (e.g., Bicep).
  • Troubleshoot production infrastructure issues and lead incident response efforts, including coordination, escalation, and real-time remediation across teams.
  • Conduct post-incident reviews (postmortems) focused on root cause analysis, corrective actions, and long-term reliability improvements rather than blame.
  • Monitor and operate production systems using Azure Monitor, Application Insights, Sentinel, and related observability tooling.
  • Improve system reliability and performance through alerting, error monitoring, SLOs/SLAs, and analysis of performance and capacity trends.
  • Collaborate with security analyst to define and implement security controls across Azure resources and pipelines.
  • Manage secrets, certificates, and identity integrations.
  • Automate security posture checks in CI/CD pipelines.
  • Maintain policy-as-code using Azure Blueprints or Defender for Cloud Compliance & Audit Support.
  • Support SOC 2 Type II compliance through tooling, automation, and audit readiness.
  • Respond to evidence requests and generate reports from observability and security systems.
  • Contribute to the documentation of platform controls and best practices.
  • Support, maintain, and own CI/CD pipelines (GitHub Actions, Azure DevOps, or equivalent).
  • Optimize build, test, and release flows, partnering with engineers to diagnose failures and improve deployment reliability.
  • Define and maintain consistent environment standards across development, staging, and production to ensure deployment safety, reliability, and compliance.
  • Partner with engineering teams to improve deployment promotion strategies, rollback mechanisms, and release safety practices.

United States
$130K - $150K / year
Job Closed