Sword Health

Sword Health is the world’s fastest growing virtual MSK care provider, on a mission to free two billion people from pain

Senior DevOps Engineer

DevOps Engineer | Other | Remote | Senior | Team 201-500 | Since 2015 | H1B No Sponsor

Location

United States

Posted

85 days ago

Salary

$140K - $220K / year

Seniority

Senior

Job Description

  • Design, implement, and maintain scalable, resilient infrastructure to support Sword Health’s high-demand applications and services.
  • Automate and streamline deployment processes, CI/CD pipelines, and routine maintenance tasks to enhance efficiency and reduce downtime.
  • Monitor and optimize system performance, proactively identifying and resolving issues to ensure high availability and reliability.
  • Collaborate closely with development, data, and security teams to ensure seamless integration of infrastructure and code changes.
  • Drive security best practices by implementing and managing access control, network security, and compliance-related policies across the infrastructure.
  • Lead incident response and troubleshooting for infrastructure-related issues, ensuring rapid and effective resolution to maintain service continuity.
  • Mentor and guide junior team members, sharing DevOps best practices and fostering a culture of continuous learning and improvement within the team.
  • Stay up-to-date with industry trends and emerging technologies, bringing innovative solutions to Sword Health’s DevOps processes and toolchains.

Job Requirements

  • Experience with Linux environments.
  • Experience with DevOps and GitOps methodologies.
  • Experience with Kubernetes and Containerized applications (Docker).
  • Experience with Infrastructure as Code (Terraform).
  • Experience with monitoring tools (Google Cloud Monitoring/Stackdriver, Grafana, Prometheus/Alertmanager, New Relic).
  • Experience with Jenkins.
  • Experience with CI/CD.
  • Team player with a solution-oriented, proactive attitude and a “Get Things Done” mindset.
  • Enthusiastic about technology and innovation.
  • Fluent in English (written and oral).
  • Extra: Experience with or knowledge of Kafka, Prometheus/Alertmanager, Grafana, Elasticsearch/Logstash/Kibana, Vault, Redis, MySQL, DNS.
  • Extra: Experience with PHP, JavaScript, Go.
  • Extra: Experience provisioning servers and services using AWS, Azure, or GCP.
  • Extra: Experience with or knowledge of Istio.
  • Extra: Solid knowledge of cloud networking, including VPC management, routing, NAT, and general troubleshooting using tcpdump analysis.

Benefits

  • Comprehensive health, dental and vision insurance*
  • Life and AD&D Insurance*
  • Financial advisory services*
  • Supplemental Insurance Benefits (Accident, Hospital and Critical Illness)*
  • Health Savings Account*
  • Equity shares*
  • Discretionary PTO plan*
  • Parental leave*
  • 401(k)
  • Flexible working hours
  • Remote-first company
  • Paid company holidays
  • Free digital therapist for you and your family

More DevOps Engineer Jobs

Senior Site Reliability Engineer

ClickHouse

ClickHouse is an open-source, column-oriented OLAP database management system.

DevOps Engineer | 85 days ago
Full Time | Remote | Team 51-200 | Since 2016 | H1B Sponsor

  • Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
  • Ensure all the infrastructure components in ClickHouse Cloud (including Data Plane, Control Plane, ClickHouse Core, etc.) have monitoring and alerting in place to ensure timely detection and resolution of incidents.
  • Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate to the impacted customers.
  • Continuously improve the reliability and performance of our ClickHouse services.
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

Australia

Senior Site Reliability Engineer

ClickHouse

ClickHouse is an open-source, column-oriented OLAP database management system.

DevOps Engineer | 85 days ago
Full Time | Remote | Team 51-200 | Since 2016 | H1B Sponsor

  • Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse.
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud.
  • Ensure all the infrastructure components in ClickHouse Cloud (including Data Plane, Control Plane, ClickHouse Core, etc.) have monitoring and alerting in place to ensure timely detection and resolution of incidents.
  • Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate to the impacted customers.
  • Continuously improve the reliability and performance of our ClickHouse services.
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities.
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime.

Singapore
Full Time | Remote | Team 501-1,000 | Since 1965 | H1B No Sponsor

  • Perform system updates (deployments) using the project-specific technologies.
  • Develop complex scripts to automate project deployments or any other tasks the project deems necessary to automate.
  • Monitor systems and applications using project-defined technologies, analyze complex data obtained, and perform advanced configurations in the monitoring tools.
  • Define architecture in terms of documentation and technology.
  • Use cloud computing platforms and provision basic cloud resources and services, both manually and through automation.
  • Develop medium-complexity web tools for internal project use.
  • Document processes and technical solutions according to project needs.
  • Analyze and resolve critical software and infrastructure incidents in the system.
  • Automate the build process.
  • Analyze and isolate defects identified during solution testing, identify root causes, and propose solutions to meet software quality processes.
  • Create supplementary documentation according to project needs.
  • Implement new software development procedures as needed, describing methods and their operation in standardized sequences according to the quality assurance system, aiming for continuous improvement.
  • Act as a knowledge multiplier by providing technical support to the team.

Brazil
Job Closed

SRE Lead (US Remote)

First Advantage

Founded in 2003, First Advantage provides comprehensive background-check insights and solutions, enabling employers and housing providers to make confident choices, diminish risks,

DevOps Engineer | 85 days ago

At First Advantage (Nasdaq: FA), people are at the heart of everything we do. From our customers and partners to our greatest advantage: our team members. Operating with empathy and compassion, First Advantage fosters a global inclusive workforce devoted to the diverse voices that make up our talent and products. Our team members empower each other to be their authentic selves and treat all with respect, integrity, and fairness. Say hello to a rewarding career, and come join a leading provider of mission-critical background screening solutions to some of the most recognized Fortune 100 and Global 500 brands.

First Advantage is a global leader in background screening, identity, and verification solutions. As we continue to scale our digital platforms and modern cloud-native infrastructure, we are seeking a highly skilled and forward-thinking Lead Site Reliability Engineer (SRE) to drive reliability, resilience, and operational excellence across our systems. The Lead SRE will be responsible for guiding reliability strategy, overseeing complex incident response, improving observability, strengthening automation and CI/CD practices, and partnering closely with engineering teams to embed SRE principles throughout the organization. This role requires a deep understanding of modern cloud architecture (including both Azure and AWS) as well as expertise in Linux systems, monitoring technologies, and root-cause analysis. This is a senior hands-on engineering role, ideal for someone who enjoys solving difficult problems at scale and mentoring others while driving meaningful improvements to uptime, performance, and customer experience.

What You'll Do:

Site Reliability & Platform Stability

  • Lead reliability initiatives across multiple high-availability, large-scale SaaS systems, ensuring platform uptime, performance, and resilience.
  • Build and maintain distributed systems, infrastructure components, and automation tooling to ensure consistent, reliable delivery of production services.
  • Champion proactive reliability engineering, holistic system monitoring, and continuous operational improvements.
  • Partner with architecture, engineering, and operations teams to define SLAs, SLOs, and SLIs.

Cloud Engineering (Azure & AWS)

  • Architect, build, and maintain cloud infrastructure using best practices.
  • Guide cloud migrations, cost optimization, and resilience engineering across multi-cloud environments.
  • Implement and enforce cloud security, compliance, and governance standards.

DevOps, CI/CD, and Automation

  • Create and maintain CI/CD pipelines using GitHub Actions, Azure DevOps, Jenkins, or equivalent.
  • Automate deployments using IaC tools (Terraform, Bicep, CloudFormation).
  • Reduce manual operational burden through automation and self-service tooling.

Monitoring, Observability & Performance

  • Implement observability stacks covering metrics, logs, traces, and synthetic checks.
  • Standardize monitoring practices using industry tooling.
  • Perform performance analysis, load testing, and optimization.

Incident Response & Management

  • Serve as Incident Commander for major production incidents.
  • Define and improve incident management processes.
  • Ensure clear communication during outages and lead technical bridges.
  • Deliver high-quality RCAs with actionable follow-ups.

Root-Cause Analysis (RCA) & Continuous Improvement

  • Drive deep, data-driven RCAs and long-term reliability improvements.
  • Identify and eliminate systemic issues and operational toil.

Leadership, Collaboration & Mentorship

  • Provide technical leadership across teams.
  • Mentor engineers and promote SRE best practices.
  • Foster strong cross-functional partnerships.

What You'll Need to be Successful:

  • 7+ years in SRE, DevOps, Platform Engineering, or Cloud Engineering.
  • Strong expertise in Azure and AWS.
  • Proficiency in CI/CD, automation, and release engineering.
  • Deep monitoring, logging, and observability experience.
  • Incident response leadership experience.
  • Proven RCA experience.
  • Strong Linux skills.
  • Scripting skills (Python, Bash, PowerShell, Go).
  • IaC experience.
  • Strong systems and networking fundamentals.

Additional Preferred Qualifications

  • Experience with large-scale distributed systems.
  • Message queues or event streaming knowledge.
  • Familiarity with incident management frameworks.
  • Multi-cloud enterprise experience.
  • Kubernetes, ECS, AKS, or EKS exposure.

Why First Advantage is Your Next Big Career Move

First Advantage is going through a technology transformation! We are looking for experts who are excited to work with advanced technologies and provide best-in-class user experiences, drive the development and deployment of scalable solutions, and smoothly guide our agile teams and clients through meaningful changes as we continue to expand our impact.

What Are You Waiting For? Apply Today!

You have learned a little about us today – we want to learn about you! If you think this position and our company are a great fit for your areas of interest and expertise, tell us about you by applying now!

The salary range for this position is approximately $120,000 - $150,000 base annually. This range reflects our good faith estimate to pay fairly as to what our ideal candidates are likely to expect, and we tailor our offers within the range based on the selected candidate’s experience, industry knowledge, technical and communication skills, and other factors that may prove relevant during the interview process.

United States Equal Opportunity Employment:

First Advantage is proud to be a global leader in removing barriers and supporting our community members to ensure the changing demographics of the workforce are reflected in our hiring and employment practices. We value all of our candidates, employees, and clients, and place great emphasis on hiring and supporting qualified individuals in each role. We are an equal opportunity employer. We do not discriminate on the basis of race, color, ethnicity, ancestry, religion, sex, national origin, sexual orientation, age, citizenship status, marital status, disability, gender identity, gender expression, veteran status, genetic information, or any other area protected by applicable law.

United States
$120K - $150K / year