Fabric

The national pay range for this role is $165,000.00 - $210,000.00 per year. Actual compensation will be determined by factors such as the candidate's geographic market, experience, skills, and qualifications. Certain roles may also be eligible for additional compensation. If your compensation requirement is greater than our posted range, please still consider applying; a determination can be made based on unique qualifications. Expected compensation ranges for this role may change over time.

Site Reliability Engineer

Location

United States

Posted

22 hours ago

Salary

$135K - $160K / year

Seniority

Mid Level

Job Description

Role Description

As a Site Reliability Engineer, you will own and evolve the infrastructure powering healthcare experiences for millions of patients. This role bridges the gap between traditional infrastructure excellence and the future of AI-driven operations. You will act as a primary architect for our AWS and Kubernetes (EKS) environment, ensuring the platform is resilient, scalable, and compliant while exploring how agentic workflows can modernize SRE practices.

What You'll Do

- Infrastructure & Kubernetes Orchestration
  - Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users.
  - Eliminating manual configuration by building and managing a scalable infrastructure state entirely through Terraform.
  - Optimizing the AWS footprint (specifically EC2, RDS, and S3) to balance high performance with cost-efficiency and reliability.
- AI-Assisted Operations & Automation
  - Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks.
  - Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe.
  - Focusing on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems.
- Observability & Incident Management
  - Driving the evolution of the observability stack in Datadog by implementing the sophisticated metrics, traces, and logs needed to meet SLOs.
  - Leading incident response efforts and facilitating the blameless postmortems that help systematically reduce recovery time (MTTR).
  - Defining and monitoring the SLIs and SLOs that ensure the platform consistently meets rigorous healthcare performance standards.
- Compliance & Collaboration
  - Ensuring every piece of infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements.
  - Mentoring engineers across the company on reliability best practices and contributing a clinical-safety perspective to cross-functional design reviews.

Qualifications

- 5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale.
- Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management.
- Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems.
- Deeply proficient coding and scripting skills in Python, Bash, Ruby, or Go.
- Preferred experience building agentic workflows or AI-assisted tooling to drive operational efficiency.
- A "rigor-first" mindset with a dedication to HIPAA-compliant, high-availability architecture.

Requirements

- You are a deeply proficient engineer who excels at the intersection of cloud infrastructure, automation, and system design.
- You possess a meticulous approach to observability and a passion for finding the "root cause" rather than just applying a patch.
- You enjoy exploring the "next frontier" of SRE, including how AI and agentic tools can make operations more efficient.
- You thrive in fast-paced environments where technical rigor is balanced with pragmatism and clinical-grade safety.

This Might Not Be The Right Fit If...

- You prefer working on static infrastructure rather than evolving systems through code and automation.
- You are uncomfortable with the "agile" pace of tech-driven platform development or integrating AI tools into your daily workflow.
- You prefer a siloed role that does not involve active participation in incident response or collaborative postmortems.

Benefits

- The national pay range for this role is $135,000.00 - $160,000.00 per year.
- Actual compensation will be determined by factors such as the candidate's geographic market, experience, skills, and qualifications.
- Certain roles may also be eligible for additional compensation, including a comprehensive benefits package such as medical, dental, vision, unlimited PTO, and a 401(k) plan, stock options and bonuses.
- If your compensation requirement is greater than our posted range, please still consider applying; a determination can be made based on unique qualifications.
- Expected compensation ranges for this role may change over time.

More DevOps Engineer Jobs

Site Reliability Engineer

Referral Board

Remote's Total Rewards philosophy is to ensure fair, unbiased compensation and fair equity pay along with competitive benefits in all locations in which we operate. We do not agree to or encourage cheap-labor practices, so we make sure to pay above in-location rates. At Remote, we foster internal mobility as a key element of our culture of employee growth and development, supported by a compensation philosophy that guarantees pay equity and fairness.

DevOps Engineer | 23 hours ago
Full Time | Remote | Team 1,001-5,000

Role Description

We are Cloud Infrastructure SREs that integrate, scale, and evolve multi-cloud infrastructure across 4 Cloud Service Providers, 70+ globally distributed regions, and tens of thousands of hosts to power Elastic Cloud. We tackle hard problems at scale through automation, Infrastructure as Code (IaC), configuration management, and purpose-built software that eliminates toil and improves reliability. If that scale of challenge genuinely excites you, we'd love to hear from you.

What you will be doing

- Engineering software to automate large-scale systems: building internal tools and services, not just running scripts.
- Optimizing the reliability and lifecycle of hosts across multiple cloud providers.
- Strengthening our observability posture: crafting alerting and monitoring systems that drive incident prevention over incident response.
- Scaling global infrastructure and evolving the infrastructure management processes to meet growing demand.
- Contributing to code reviews, sharing your work, planning what we need to do next, and mentoring teammates.
- Being part of a balanced SRE on-call rotation: responding to incidents, improving runbooks, leading postmortems, and championing reliability improvements.

Qualifications

- Experience building software with Golang. You are also comfortable reviewing others' code and have opinions about what good code looks like.
- Production experience operating large-scale cloud compute (hundreds of hosts or more) via automated workflows.
- Deep experience with Linux systems; you are at home in the terminal debugging at the OS level.
- Proficiency working with containerized workloads in production.
- A customer-first, systems-thinking approach to operational problems: you care about root causes, not just symptoms.
- Comfortable working across time zones in both real-time and asynchronous contexts.
- You write clear and maintainable documentation such as software designs, runbooks, architecture diagrams/decisions, postmortems, etc.
- You communicate project status regularly and clearly, flag blockers early, and follow through on action items.
- A sensible approach to AI integration: identifying where AI tools genuinely reduce operational burden and embedding them into workflows without adding complexity.

Bonus Points

- Production experience with any of: Terraform, Puppet, Ansible, Argo CD, Argo Workflows, CUE, Docker, Kubernetes, Ubuntu, or Ubuntu Live Patch.
- Experience being on-call during incidents and using observability tools (e.g. Elastic Stack, Graphite, Prometheus, Influx) to diagnose issues, quantify impact, and confirm mitigations.
- Hands-on experience engineering solutions with the Elastic Stack.

Compensation

Compensation for this role is in the form of base salary. This role does not have a variable compensation component.

The typical starting salary range for new hires in this role is:
- $143,100 - $175,000 USD

The typical starting salary range for this role in select locations (including Seattle WA, Los Angeles CA, the San Francisco Bay Area CA, and the New York City Metro Area) is:
- $143,100 - $175,000 USD

Benefits

- Competitive pay based on the work you do here and not your previous salary.
- Health coverage for you and your family in many locations.
- Ability to craft your calendar with flexible locations and schedules for many roles.
- Generous number of vacation days each year.
- Increase your impact - We match up to $2000 (or local currency equivalent) for financial donations and service.
- Up to 40 hours each year to use toward volunteer projects you love.
- Embracing parenthood with a minimum of 16 weeks of parental leave.

Additional Information

As a distributed company, diversity drives our identity. Whether you’re looking to launch a new career or grow an existing one, Elastic is the type of company where you can balance great work with great life. We strive to have parity of benefits across regions and while regulations differ from place to place, we believe taking care of our people is the right thing to do. Different people approach problems differently. We need that.

Elastic is an equal opportunity/affirmative action employer committed to diversity, equity, and inclusion. We welcome individuals with disabilities and strive to create an accessible and inclusive experience for all individuals.

United States
$143.1K - $175K / year
DevSecOps Engineer

Interlead GmbH

The global platform for everything around the home

Full Time | Remote | Team 51-200 | Since 2013 | H1B No Sponsor

• You monitor and maintain our servers.
• You manage internal monitoring.
• You are responsible for the development and maintenance of the IT infrastructure.
• You manage our infrastructure as code (e.g., with Terraform, Ansible).
• You integrate security checks directly into our CI/CD pipelines (shift-left approach) and continuously improve them.
• You perform regular vulnerability scans and penetration tests.
• You handle secrets management and the secure handling of credentials.
• You ensure compliance with security standards and regulatory requirements (e.g., ISO 27001, GDPR).
• You conduct threat modeling and assess security risks early.
• You document processes and infrastructure.
• You monitor current trends and introduce new tools and best practices.
• You advise colleagues on security-related issues.

Germany

Role Description

We are currently looking for a Senior DevOps Engineer (m/f/d) for a client in the public sector. For this position we can offer approx. €85.00 per hour for remote and on-site work.

- Engagement period: ASAP until 30 July 2028
- Scope: approx. 3,200 hours, optionally +50%
- Location: approx. 95% remote (Germany), approx. 5% Düsseldorf or Bonn

Qualifications

- Professional experience: at least 5 years
- Language: German C2 (spoken and written)

Requirements

- Agile ways of working (Scrum/Kanban, Definition of Ready/Done, acceptance criteria): 3 years
- Use of Git (branching models, merge/rebase, reset, fetch/pull): 3 years
- General build automation (CI pipelines, build jobs, staging vs. production builds, GitLab CI, Bash/Python): 3 years
- Software quality (code review, clean code, SOLID, static code analysis, linters): 3 years
- Knowledge of software architecture (3-tier architecture, hexagonal/clean architecture, design patterns): 3 years
- Use of web standards and protocols (HTTP REST, authentication TLS, OIDC): 3 years
- Implementing IT security for apps / privacy by design (CORS, OWASP MASVS & ASVS, CSP, dependency handling, CVE handling, BSI IT-Grundschutz): 1 year
- Experience with container technologies (image definitions, container registries, build automation, development with containers): 3 years
- Experience with GitLab (GitLab CI, container images in the build process, deployment of backend systems via CI): 3 years
- Experience in Linux system administration (account management, vi editor, analyzing network traffic, backup/restore, hardware setup, troubleshooting, monitoring): 5 years
- Experience with container technologies, advanced (operating containers, building container images, Kubernetes, observability with Prometheus/OpenTelemetry): 5 years
- Creating build scripts and tools (Bash, Python): 5 years
- Experience building backend systems (SQL databases, reverse proxy, proxy, web server): 5 years
- GitOps with ArgoCD/Crossplane (configuration and administration, App of Apps pattern/ApplicationSets, variables and templates, error handling): 5 years
- Experience administering/using macOS (basic operation, Brew): 1 year

Benefits

- Flexible working hours
- Remote work
- Competitive compensation

Germany
€85 / hour
DevOps Manager

Lantana Consulting Group

Transforming healthcare through health information.

Full Time | Remote | Team 51-200 | H1B No Sponsor

• Manages and integrates DevSecOps practices across DevOps Engineering, Release Management, and Security operations
• Acts as a key manager and technical liaison, bridging internal teams and IT stakeholders
• Ensures that software-delivery pipelines are aligned with technical standards and federal security frameworks
• Plays an active role in implementation and problem-solving
• Guides integration of infrastructure and drives project delivery
• Communicates clearly with leadership and monitors delivery and release metrics.

United States
$140K - $170K / year