Job Closed

This listing is no longer active.

HavocAI

Autonomous Solutions for Maritime Operations

Senior Site Reliability Engineer

DevOps Engineer, Other, Remote, Senior, Team 11-50, Since 2024, H1B No Sponsor

Location

United States

Posted

81 days ago

Salary

$150K - $185K / year

Seniority

Senior

Bachelor Degree, 7 yrs exp, English, Distributed Systems, Kubernetes, Linux, Python

Job Description

  • Design and evolve reliability architecture for distributed and cloud-hosted systems.
  • Define and implement SRE best practices, including SLIs, SLOs, error budgets, and capacity planning.
  • Partner with platform and application teams to design systems for reliability, scalability, and operability.
  • Identify and mitigate systemic reliability risks across infrastructure and services.
  • Lead incident response processes including on-call rotations, escalation, and post-incident reviews.
  • Conduct root cause analysis for complex production incidents and drive long-term improvements.
  • Improve operational readiness through runbooks, automation, and resilience testing.
  • Reduce operational toil through tooling, automation, and process improvements.
  • Design and maintain observability systems for metrics, logging, tracing, and alerting.
  • Ensure services and data pipelines are observable, debuggable, and performant in production.
  • Drive performance analysis and tuning across infrastructure and service layers.
  • Build automation to improve system reliability, deployment safety, and recovery processes.
  • Partner with DevOps and Cloud Platform teams on CI/CD reliability, rollout strategies, and safe deployment patterns.
  • Support and improve Kubernetes-based environments and containerized workloads.
  • Collaborate with security teams to ensure secure and resilient system design.
  • Participate in disaster recovery planning and testing.
  • Maintain strong operational practices around access control, secrets management, and change management.

Job Requirements

  • 7+ years of experience in SRE, infrastructure, or systems engineering roles
  • Strong experience operating large-scale distributed production systems
  • Deep understanding of Linux systems, networking, and distributed systems fundamentals
  • Hands-on experience with Kubernetes and container orchestration
  • Programming or scripting experience in Go, Python, or similar languages
  • Experience designing and operating observability systems for production environments
  • Proven ability to lead incident response and reliability improvements
  • Strong communication skills and ability to collaborate across engineering teams
  • Must be a US Citizen.
  • Must be eligible to obtain a government clearance, if required.

Benefits

  • 100% Employer-paid Health, Dental, and Vision Insurance for you and your family
  • Life Insurance (Employer Paid)
  • Ability to participate in the company's 401(k) program (with matching)
  • Unlimited PTO policy with an enforced 2-week minimum
  • Equity Package
  • Work / Home Office Stipend
  • Global Entry
  • 16-Week Paid Parental Leave
  • Monthly Health and Wellness Stipend

More DevOps Engineer Jobs

Other, Remote, Team 51-200

This description is a summary of our understanding of the job description.

Role Description

Collate is the creator of the fast-growing open-source OpenMetadata project, and we’re passionate about transforming the way data teams work together. Our mission is to help every company realize the fullest potential of data through AI Agents via open-source and unified metadata, by solving problems around data discovery, observability, and governance.

What you'll be doing:

  • Design and build OpenMetadata as a SaaS platform that is iterating and evolving rapidly.
  • Put automation at the forefront of our engineering culture. Automate CI/CD and our integration tests.
  • Define the release methodologies and implement CI/CD for our SaaS and On-Prem releases.
  • Build the observability platform for our growing SaaS customers.
  • Provide leadership for the DevOps team.
  • Set priorities/roadmap and lead initiatives on SaaS and open-source infrastructure.
  • Be a part of the team that'll change the way data products are being built in OSS.
  • Implement corporate governance on cloud service usage and security measures.
  • Lead the effort to develop cloud mechanisms to improve scalability and reduce costs.
  • Work with our current technical stack including Java, JSON, REST, Python, Docker, React/Node/TypeScript.

Qualifications

  • 5+ years of experience in building scalable SaaS.
  • Experience with CI/CD setup for Cloud (ECS) and On-premises environments; experience with orchestration tools (Docker, Kubernetes) is beneficial.
  • Knowledge of DevSecOps concepts and infrastructure-as-code, as well as hands-on experience in security related to cloud-based infrastructure.
  • Provide guidance on operational concerns to development and business teams.
  • Experience managing and maintaining load balancers, proxies, web servers, queues, caches, etc.
  • A desire to learn and grow as an engineer.
  • A passion to uncover and solve real-user problems.
  • Excitement about being an early engineer. You'll be defining our engineering culture, choosing our shared tools, and helping us build a world-class team.
  • Previous open-source contributions are a plus.

Benefits

  • Competitive salary and Seed stage startup equity.
  • Fully remote.
  • Work with smart, motivated, like-minded peers, whom you can teach and learn from to grow together.
  • You'll be joining a small team with no bureaucracy or politics and get to work directly with the founders.

United States

Staff Site Reliability Engineer

Dave

We started Dave for one reason: banks weren’t built for people like us, and we knew we deserved better.

DevOps Engineer, 81 days ago
Other, Remote, Team 201-500, H1B Sponsor

  • Lead architecture and automation across our GCP environment, ensuring reliability, scalability, security, and thoughtful cost management.
  • Define and improve SLIs, SLOs, and error budgets using Cloud Monitoring and Datadog, connecting reliability goals to real business outcomes.
  • Shape our multi-region, disaster recovery, and capacity planning strategies so the platform holds up as we grow.
  • Design and optimize cloud networking, including VPC architecture, ingress/egress, Cloud Armor, VPN, and DNS to support internal systems, partner integrations, and member-facing services.
  • Drive infrastructure-as-code and GitOps practices using Terraform, Kubernetes, Helm, and ArgoCD to make deployments predictable and repeatable.
  • Mentor SREs and infrastructure engineers through design reviews, incident retros, and hands-on collaboration, strengthening technical depth across the team.
  • Explore practical LLM-driven automation where it meaningfully reduces operational toil and shortens incident resolution time.

United States
$208K - $330K / year

Senior Software Engineer II - CI Pipeline Engineer

Aledade

Self-described as "a new company with an old-fashioned goal," Aledade aims to put healthcare control back into the hands of doctors. Headquartered in Bethesda, Maryland, the compan

DevOps Engineer, 81 days ago

As a Senior II Engineer on the CI Pipeline team, you will serve as a primary architect of our CI/CD vision, helping to ensure that as Aledade scales, our delivery speed and compliance posture accelerate together. You will initially lead the evolution of a "Universal Pipeline", the initiative to make the "Right Way" the "Easy Way" by building automation and guardrails to ensure every deployment is HIPAA-compliant by default. Beyond the initial pipeline framework, you will be involved in the long-term strategy for our internal developer experience, moving into the test tooling infrastructure (interwoven into the CI pipeline), self-service tooling, and ephemeral environments to leverage those technologies. Your goal is to foster a high-velocity engineering culture where security, compliance, and audit evidence are seamless side-effects of a delivery lifecycle, not manual tasks.

Primary Duties:

  • Develop and implement scalable and performant solutions.
  • Partner, as a peer, with Engineering Managers, Product Managers, and stakeholders throughout Aledade to develop and execute technical roadmaps using Agile processes.
  • Mentor and coach more junior engineers, including thorough pull request reviews for other developers, and be receptive to critical feedback on your own work.

Minimum Qualifications:

  • BS/BTech (or higher) in Computer Science, Engineering or a related field.
  • 6+ years of experience as an engineer building and managing highly automated CI/CD infrastructure and developer tooling as part of a cross-functional team.
  • 3+ years of experience working with infrastructure-as-code and automation scripting (e.g., Python, Bash, or Go) to manage complex delivery pipelines.
  • 3+ years of experience acting as a trusted technical decision-maker in a team setting, solving for short-term and long-term business value.
  • 3+ years of experience coaching other engineers on testing strategies and pipeline integration.

Preferred KSAs:

Engineering & Custom Tooling

  • Systems Programming: Proficiency in a high-level language (Python, Go, etc.) to build custom CLI tools, internal providers, or API integrations that extend the capabilities of off-the-shelf CI/CD products.
  • Developer Experience (DX) Tooling: Experience building internal abstractions or "Golden Path" templates that simplify complex cloud interactions for product engineers.
  • Infrastructure as Code (IaC): Expert-level Terraform or Pulumi skills used to treat the entire delivery platform as a version-controlled, testable software product.

Test Infrastructure & Orchestration

  • Ephemeral Test Environments: Expertise in architecting "on-demand" testing environments (using Kubernetes/Namespaces or Docker) that allow developers to run full-stack integration tests within the pipeline.
  • Test Tooling Integration: Experience building or integrating frameworks for Contract Testing (e.g., Pact), Synthetic Testing, and Automated Regression at scale.
  • Mocking & Service Virtualization: Ability to provide engineers with the infrastructure needed to mock healthcare-specific dependencies (e.g., EHR simulators) within the CI flow.

Compliance & Security as Code

  • Automated Governance: Experience building "Compliance as Code" into pipelines, ensuring that SOC2, SOX, and HIPAA audit evidence (the "Triple-Lock" of Author, Approver, and Scan results) is captured automatically.
  • Secure Supply Chain: Proficiency in integrating security gates (including SAST, DAST, Secret Detection, and automated SBOM generation) into the automated delivery flow.
  • Identity & Secrets Management: Deep understanding of managing sensitive credentials and least-privilege access for CI/CD runners in a cloud environment (AWS preferred).

Pipeline Architecture & Reliability

  • Universal Pipeline Design: Expertise in building modular, reusable CI/CD templates (e.g., GitHub Actions) that standardize deployment patterns across diverse stacks (ECS, EKS, Databricks).
  • Build Optimization: Proven ability to optimize monorepo build performance through intelligent caching, change-detection, and parallelization.
  • Observability & DORA Metrics: Ability to instrument the delivery platform to track and improve core metrics like Deployment Frequency and Lead Time for Changes.

Physical Requirements:

  • Sitting for prolonged periods of time. Extensive use of computers and keyboard. Occasional walking and lifting may be required.

United States

Senior DevOps Engineer

Dev.Pro

Software Development Partner. Result-driven. Quality-obsessed.

DevOps Engineer, 81 days ago
Full Time, Remote, Team 501-1,000, Since 2011, H1B No Sponsor

  • Manage, scale, and optimize cloud environments used for data science workloads (primarily AWS, Databricks, dbt).
  • Provision, maintain, and optimize compute clusters for ML workloads (e.g., Kubernetes, ECS/EKS, Databricks, SageMaker).
  • Implement and maintain high-availability solutions for mission-critical analytics platforms.
  • Develop CI/CD pipelines for model deployment, infrastructure-as-code (IaC), and automated testing using industry standard toolchains.
  • Build monitoring, alerting, and logging systems for cloud and ML infrastructure (e.g., Datadog, CloudWatch, Prometheus, Grafana, ELK).
  • Automate provisioning, configuration, and deployments using tools such as Terraform, CloudFormation, and GitHub Actions.
  • Collaborate with Data Engineering to maintain integrations between data pipelines and cloud systems.
  • Share responsibility for provisioning and operating application networking capabilities that support data platforms, including API gateways, CDNs, application load balancers, TLS, and WAFs.
  • Conduct periodic risk assessments, best practice reviews, and remediation efforts to strengthen security and resiliency.

Brazil