Job Closed

This listing is no longer active.

Akuity

Remove complexity, add velocity.

Senior Site Reliability Engineer

DevOps Engineer | Other Remote | Senior | Team 11-50 | Since 2021 | H1B No Sponsor | Company Site | LinkedIn

Location

United States

Posted

70 days ago


Seniority

Senior

Bachelor Degree | 5 yrs exp | English | AWS, Amazon EC2, Grafana, Kubernetes, Prometheus, Python

Job Description

• Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them
• Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure
• Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes
• Partner with engineering teams to build reliability into new features before they ship to production
• Participate in an on-call rotation and act as incident commander for high-severity production events
• Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low
• Drive improvements to alerting fidelity; reduce noise, increase signal, eliminate toil
• Lead post-incident reviews with clear timelines, root cause analysis, and follow-through on action items

Job Requirements

5+ years of SRE, platform engineering, or production operations experience in a SaaS environment
Deep hands-on Kubernetes expertise; you understand the scheduler, networking, storage, and autoscaling at a level where you can debug anything
Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route53), storage (S3, RDS), and IAM
Experience defining and operating against SLOs in production; you've written error budgets, not just read about them
Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent)
Solid scripting and automation skills; Go, Python, Bash, or similar; you automate what you touch
Strong written communication: clear runbooks, sharp incident reports, thoughtful post-mortems
Live within US time zones (Pacific through Eastern), including Canada and other regions

Benefits

Health insurance, dental, and vision coverage
Equity participation in a well-funded, growing company
Home office stipend and equipment budget
Flexible time off and a culture that respects it
Work directly with the engineers who built Argo CD and Kargo; you'll learn a lot here


More DevOps Engineer Jobs

Lead Site Reliability Engineer

Gifthealth

DevOps Engineer | 70 days ago

Other Remote

• Designs, builds, and maintains reliable, scalable software systems supporting Ruby on Rails applications
• Embeds reliability, performance, and operational best practices into application code and development workflows
• Owns DevOps practices including CI/CD reliability, deployment strategies, and release safety
• Leads incident response, debugging, and root cause analysis across application and platform layers
• Implements and evolves observability (logging, metrics, tracing) within application and service code
• Partners with engineering teams on architecture, capacity planning, and technical standards

AWS, Azure, Docker, GCP, Prometheus, Ruby, Ruby on Rails, Terraform

View details: Lead Site Reliability Engineer

United States

$123K - $154K / year


Job Closed

Site Reliability Engineer – SaaS

Infiterra

Infiterra helps IT Distributors and MSPs transform and grow. Our platform automates each step from quote to bill.

DevOps Engineer | 70 days ago

Full Time Remote | Team 51-200 | Since 2012 | H1B No Sponsor

Company Site LinkedIn

• Maintain and continuously improve production uptime, supporting our ≥99.9% target for 2026.
• Monitor systems proactively and respond effectively to production incidents.
• Drive improvements in MTTR (Mean Time to Resolution).
• Perform structured root cause analysis and contribute to long-term preventive actions.
• Participate in an evolving on-call model as we mature toward structured production support.
• Manage and optimize Azure infrastructure across compute, networking, and identity components.
• Work hands-on with AKS clusters as part of our growing Kubernetes adoption.
• Maintain networking components including load balancers and private endpoints.
• Contribute to improving platform resilience and scalability as demand grows.
• Design and improve observability practices, including metrics, logs, and alerting standards across production systems.
• Contribute to and improve Infrastructure as Code practices (Terraform or similar), ensuring consistent and repeatable deployments.
• Reduce manual operational effort through scripting and automation.
• Work closely with DevOps to ensure smooth CI/CD integration and reliable production deployments.
• Support Security initiatives related to infrastructure hardening.
• Partner with DevOps on deployment reliability and configuration changes impacting production.

Azure, Kubernetes, Linux, Terraform

View details: Site Reliability Engineer – SaaS

Greece


Job Closed

Site Reliability Engineer

Illumination Systems Arizona

Arizona's Lighting & Controls Agency.

DevOps Engineer | 70 days ago

Full Time Remote | Team 51-200 | Since 1937 | H1B No Sponsor

Company Site LinkedIn

• Enhance, optimize, validate, and automate core MinIO software for performance, scalability, and security.
• Help build and deliver high-performance distributed storage solutions with a focus on cloud-native architectures.
• Validate MinIO software against each customer's environment and requirements so that customer deployments hold no surprises.
• Improve existing features, fix critical issues, and contribute to open-source repositories.
• Collaborate with other engineers to refine architecture, APIs, and integrations.
• Write efficient, well-documented, and maintainable code.
• Conduct performance benchmarking and debugging of complex storage environments.
• Work closely with customers to address issues and manage expectations.

Cloud, Distributed Systems, Kubernetes, Microservices, Rust, Go

View details: Site Reliability Engineer

South Korea


Senior Site Reliability Engineer

Akuity

Remove complexity, add velocity.

DevOps Engineer | 70 days ago

Other Remote | Team 11-50 | Since 2021 | H1B No Sponsor

Company Site LinkedIn

About Akuity

With the move to the cloud, Kubernetes has become widely adopted by DevOps and Platform Engineering teams, but it has also added complexity. While scaling Kubernetes at Intuit, the Akuity founders started building Argo CD in order to streamline the adoption of Kubernetes. Argo CD helps developers own, understand, and deploy their K8s deployments via GitOps. Today, Argo CD is the third most popular project in the CNCF (Cloud Native Computing Foundation) and is used by 70% of companies using Kubernetes in production. The list of Argo CD users includes companies like Intuit, BlackRock, Tesla, Major League Baseball, Peloton, and many more.

The team founded Akuity in 2021 to enable enterprises to ship software faster and more reliably with modern GitOps best practices. The Akuity Platform enables teams to manage development and deployment across hundreds, if not thousands, of Kubernetes clusters from a single control plane. Trusted by top companies around the globe, the Akuity Platform provides the only end-to-end GitOps platform for enterprises. Our mission is to simplify the software delivery process so that DevOps and Platform Engineering teams can move fast and deploy code effortlessly without the fear of breaking things.

The Role

We are looking for a Senior SRE to help us keep the Akuity platform running at the level our enterprise customers expect. This is a high-ownership role; you won't just respond to incidents, you'll shape how we define and defend reliability across the entire platform. You'll work closely with engineering, infrastructure, and product to build the systems and culture that let us scale with confidence.

What You'll Own

Platform Reliability & SLAs
- Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them
- Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure
- Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes
- Partner with engineering teams to build reliability into new features before they ship to production

On-Call & Incident Response
- Participate in an on-call rotation and act as incident commander for high-severity production events
- Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low
- Drive improvements to alerting fidelity; reduce noise, increase signal, eliminate toil
- Lead post-incident reviews with clear timelines, root cause analysis, and follow-through on action items

What We're Looking For

Required
- 5+ years of SRE, platform engineering, or production operations experience in a SaaS environment
- Deep hands-on Kubernetes expertise; you understand the scheduler, networking, storage, and autoscaling at a level where you can debug anything
- Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route53), storage (S3, RDS), and IAM
- Experience defining and operating against SLOs in production; you've written error budgets, not just read about them
- Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent)
- Solid scripting and automation skills; Go, Python, Bash, or similar; you automate what you touch
- Strong written communication: clear runbooks, sharp incident reports, thoughtful post-mortems
- Live within US time zones (Pacific through Eastern), including Canada and other regions

Strong Advantage
- Experience with Argo CD, Kargo, or GitOps-based delivery workflows
- Familiarity with multi-region, multi-cluster Kubernetes deployments
- Experience with compliance-adjacent infrastructure (SOC 2, ISO 27001, HIPAA, or PCI DSS)
- Background operating infrastructure for other platform or developer tooling companies

Our Stack
- Kubernetes (EKS): multi-region, enterprise-grade clusters serving Argo CD and Kargo workloads
- AWS: primary cloud provider across all production and DR environments
- Argo CD & Kargo: GitOps delivery tools we build and run ourselves
- Prometheus, Grafana, and OpenTelemetry for observability
- Terraform and GitOps-driven infrastructure management

What We Offer
- Competitive compensation, commensurate with experience
- Equity participation in a well-funded, growing company
- Fully remote: work from anywhere within US time zones (Pacific through Eastern), including Canada and other regions
- Home office stipend and equipment budget
- Flexible time off and a culture that respects it
- Work directly with the engineers who built Argo CD and Kargo; you'll learn a lot here

US-based employees receive full benefits, including comprehensive health, dental, and vision coverage. Candidates based outside the US will be engaged as contractors.

Kubernetes, AWS, Prometheus, Grafana, Docker, Terraform, Shell, Python, OpenTelemetry, Amazon EKS, Amazon S3, Amazon RDS, Amazon IAM, Amazon EC2, Observability / Monitoring

View details: Senior Site Reliability Engineer

United States

