Job Closed
This listing is no longer active.
Set data in motion.
Staff Site Reliability Engineer – Incident Management & Reliability
Location
Canada
Posted
128 days ago
Salary
CA$225.1K - CA$264.5K / year
Seniority
Lead
Job Description
Confluent
- Analyze systemic failure patterns and design reliability improvements that prevent incident recurrence
- Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
- Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
- Own standards, practices, and continuous improvement of incident response across engineering
- Edit and review customer-facing incident documents (CRCAs) to ensure quality and clarity
- Develop and deliver training programs; coach teams through post-mortems
- Partner with engineering leaders to elevate reliability practices org-wide
Job Requirements
- 10+ years of relevant experience in SRE, incident management, or reliability engineering
- Cloud experience with at least one of AWS, GCP, or Azure (we run all three)
- Experience navigating reliability/incident programs at organizations with 500+ engineers
- Deep expertise with incident management tooling (Rootly, PagerDuty, or similar)
- Strong understanding of distributed systems and failure modes at scale
- Deep experience with observability: metrics, logging, tracing
- Kubernetes and container orchestration experience
- Understanding of CI/CD pipelines and release processes
- Strong written communication (design docs, runbooks, post-mortems)
- Experience driving org-wide process and cultural changes
- Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems
Benefits
- Offers Equity