Akuity

Remove complexity, add velocity.

Senior Site Reliability Engineer

DevOps Engineer | Other | Remote | Senior | Team 11-50 | Since 2021 | H1B No Sponsor

Location

United States

Posted

70 days ago

Salary

Not specified

Seniority

Senior

Job Description

About Akuity

With the move to the cloud, Kubernetes has become widely adopted by DevOps and Platform Engineering teams, but it has also added complexity. While scaling Kubernetes at Intuit, the Akuity founders started building Argo CD to streamline the adoption of Kubernetes. Argo CD helps developers own, understand, and deploy their Kubernetes workloads via GitOps. Today, Argo CD is the third most popular project in the CNCF (Cloud Native Computing Foundation) and is used by 70% of companies that run Kubernetes in production. Argo CD users include Intuit, BlackRock, Tesla, Major League Baseball, Peloton, and many more.

The team founded Akuity in 2021 to enable enterprises to ship software faster and more reliably with modern GitOps best practices. The Akuity Platform enables teams to manage development and deployment across hundreds, if not thousands, of Kubernetes clusters from a single control plane. Trusted by top companies around the globe, the Akuity Platform provides the only end-to-end GitOps platform for enterprises. Our mission is to simplify the software delivery process so that DevOps and Platform Engineering teams can move fast and deploy code effortlessly, without the fear of breaking things.

The Role

We are looking for a Senior SRE to help us keep the Akuity platform running at the level our enterprise customers expect. This is a high-ownership role; you won't just respond to incidents, you'll shape how we define and defend reliability across the entire platform. You'll work closely with engineering, infrastructure, and product to build the systems and culture that let us scale with confidence.

What You'll Own

Platform Reliability & SLAs

  • Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them.
  • Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure.
  • Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes.
  • Partner with engineering teams to build reliability into new features before they ship to production.

On-Call & Incident Response

  • Participate in an on-call rotation and act as incident commander for high-severity production events.
  • Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low.
  • Drive improvements to alerting fidelity; reduce noise, increase signal, eliminate toil.
  • Lead post-incident reviews with clear timelines, root cause analysis, and follow-through on action items.

What We're Looking For

Required

  • 5+ years of SRE, platform engineering, or production operations experience in a SaaS environment.
  • Deep hands-on Kubernetes expertise; you understand the scheduler, networking, storage, and autoscaling at a level where you can debug anything.
  • Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route53), storage (S3, RDS), and IAM.
  • Experience defining and operating against SLOs in production; you've written error budgets, not just read about them.
  • Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent).
  • Solid scripting and automation skills; Go, Python, Bash, or similar; you automate what you touch.
  • Strong written communication: clear runbooks, sharp incident reports, thoughtful post-mortems.
  • Live within US time zones (Pacific through Eastern), including Canada and other regions.

Strong Advantage

  • Experience with Argo CD, Kargo, or GitOps-based delivery workflows.
  • Familiarity with multi-region, multi-cluster Kubernetes deployments.
  • Experience with compliance-adjacent infrastructure (SOC 2, ISO 27001, HIPAA, or PCI DSS).
  • Background operating infrastructure for other platform or developer tooling companies.

Our Stack

  • Kubernetes (EKS): multi-region, enterprise-grade clusters serving Argo CD and Kargo workloads.
  • AWS: primary cloud provider across all production and DR environments.
  • Argo CD & Kargo: GitOps delivery tools we build and run ourselves.
  • Prometheus, Grafana, and OpenTelemetry for observability.
  • Terraform and GitOps-driven infrastructure management.

What We Offer

  • Competitive compensation, commensurate with experience.
  • Equity participation in a well-funded, growing company.
  • Fully remote: work from anywhere within US time zones (Pacific through Eastern), including Canada and other regions.
  • Home office stipend and equipment budget.
  • Flexible time off and a culture that respects it.
  • Work directly with the engineers who built Argo CD and Kargo; you'll learn a lot here.

US-based employees receive full benefits, including comprehensive health, dental, and vision coverage. Candidates based outside the US will be engaged as contractors.

Job Requirements

  • 5+ years of SRE, platform engineering, or production operations experience in a SaaS environment.
  • Deep hands-on Kubernetes expertise; you understand the scheduler, networking, storage, and autoscaling at a level where you can debug anything.
  • Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route53), storage (S3, RDS), and IAM.
  • Experience defining and operating against SLOs in production; you've written error budgets, not just read about them.
  • Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent).
  • Solid scripting and automation skills; Go, Python, Bash, or similar; you automate what you touch.
  • Strong written communication: clear runbooks, sharp incident reports, thoughtful post-mortems.
  • Live within US time zones (Pacific through Eastern), including Canada and other regions.
  • Experience with Argo CD, Kargo, or GitOps-based delivery workflows.
  • Familiarity with multi-region, multi-cluster Kubernetes deployments.
  • Experience with compliance-adjacent infrastructure (SOC 2, ISO 27001, HIPAA, or PCI DSS).
  • Background operating infrastructure for other platform or developer tooling companies.

Benefits

  • Competitive compensation, commensurate with experience.
  • Equity participation in a well-funded, growing company.
  • Fully remote: work from anywhere within US time zones (Pacific through Eastern), including Canada and other regions.
  • Home office stipend and equipment budget.
  • Flexible time off and a culture that respects it.
  • Work directly with the engineers who built Argo CD and Kargo; you'll learn a lot here.
  • US-based employees receive full benefits, including comprehensive health, dental, and vision coverage. Candidates based outside the US will be engaged as contractors.

More DevOps Engineer Jobs

Senior DevOps Engineer

minteo

Superpowering Latam's financial system

DevOps Engineer | 70 days ago
Full Time | Remote | Team 11-50 | Since 2022 | H1B No Sponsor

  • Design and implement infrastructure in a multi-account AWS environment.
  • Evaluate new technology and product changes to improve SLAs, visibility and quality of the operation, while collaborating closely with all other engineering teams.
  • Lead security best practices and regular audits of our environment to meet compliance requirements.
  • Collaborate with a compliance office or the director of operations to ensure regular compliance requirements are met.
  • Responsible for highly technical parts such as pen-testing, vulnerabilities, alerting, etc.
  • Implement scripts and automation tasks for infrastructure and configuration as code, monitoring, metrics collection, etc.
  • Lead automation efforts for data collection and processing as well as backups and SRE efforts.
  • Work directly with the technology and data teams to design, test and secure cloud infrastructure.
  • Design and implement build, deployment, and configuration management.
  • Build and test automation tools for infrastructure provisioning.
  • Provide technical guidance and educate team members and coworkers on development and operations.
  • Monitor metrics and develop ways to improve infrastructure monitoring.
  • Follow all best practices and procedures as established by the company.
  • Regularly create and update documentation relating to process, automation and architecture.
  • Evaluate and assist in the design of new features that may impact infrastructure and operations performance, provisioning, security, etc.

Colombia

Site Reliability Engineer, Remote – UK & EU

CAKE.com

Deliciously simple way to run a business and empower your team 💫

DevOps Engineer | 70 days ago
Full Time | Remote | Team 201-500 | Since 2009 | H1B No Sponsor

  • Scale and secure infrastructure to handle increasing user demand.
  • Define and deploy monitoring, alerting, and logging systems.
  • Respond to and resolve production incidents; conduct post-mortems.
  • Monitor server logs and design tools to automate operational processes.

United Kingdom

Site Reliability Engineer, SRE

CAKE.com

Deliciously simple way to run a business and empower your team 💫

DevOps Engineer | 70 days ago
Other | Remote | Team 201-500 | Since 2009 | H1B No Sponsor

  • Scale and secure our rapidly growing infrastructure
  • Automate critical processes
  • Ensure a seamless experience for new users
  • Make sure the infrastructure keeps up with the growth
  • Ensure system scalability and high traffic handling
  • Define and deploy monitoring, alerting, and logging systems
  • Respond to and resolve production incidents
  • Conduct thorough post-mortems
  • Monitor server logs for abnormalities
  • Design, manage and maintain automation tools for operational processes

United States

Site Reliability Engineer

Tillster

We’re a unified commerce platform that enables QSR restaurants to deliver personalized brand experiences & drive sales.

DevOps Engineer | 70 days ago
Full Time | Remote | Team 201-500 | Since 2002 | H1B Sponsor

  • Analyzing and troubleshooting large-scale distributed systems in the public cloud
  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity
  • Improve and maintain monitoring and logging solutions that measure availability, latency and overall system health of production systems
  • Provision and manage cloud infrastructure through automation and infrastructure as code
  • Restore healthy operation of applications and services through sustainable incident response and blameless postmortems
  • Follow and monitor security and compliance best practices
  • A proactive approach to spotting problems, areas for improvement, and performance bottlenecks

Portugal
Job Closed