Job Closed

This listing is no longer active.

HHAeXchange

Better Homecare, Better Health

Site Reliability Architect

DevOps Engineer | Other | Remote | Lead | Team 501-1,000 | Since 2008 | H1B Sponsor

Location

United States

Posted

120 days ago

Salary

$170K - $185K / year

Seniority

Lead

Bachelor Degree | 10 yrs exp | English | AWS | DNS | GCP | Java | Kubernetes | Python | TCP/IP | Terraform

Job Description

  • Architect with a resiliency-by-design intent, for self-healing, fault-tolerant systems, focusing on proactive readiness rather than reactive correction.
  • Operate within a secure, high-volume, high-volatility application environment, utilizing advanced networking and compute structures in cloud-hosted environments (AWS/GCP).
  • Move the organization from "firefighting" to a proactive culture through habits and systems supporting feature flagging, production readiness reviews, architectural decision records, and chaos engineering.
  • Support the incident management practice, mentoring SREs and software engineers alike in utilizing our monitoring and observability toolsets for effective troubleshooting.
  • Define SLIs, SLOs, and error budgets that balance feature velocity with platform stability, supporting a shift to service ownership.
  • Underscore an automation-first perspective using Terraform, CDK, and other cloud-formation infrastructure-as-code toolsets to ensure repeatable, audit-ready environments.

Job Requirements

  • Bachelor's or Master's degree in Computer Science, Information Systems, or related field and applicable experience.
  • 10+ years in SRE/DevOps, including 4 in an enterprise SaaS environment.
  • 4+ years in software development contributing to a SaaS-based, cloud-hosted product line.
  • Proven track record in a distributed SaaS environment managing multi-cloud or multi-region workloads.
  • Proficiency in modern cloud networking, including DNS, TCP/IP, Load Balancing, and Zero Trust security models.
  • Strong coding skills in Go, Python, Java, C#, or others, to build internal reliability tools and automate complex operational workflows.
  • Expert-level knowledge of Kubernetes (EKS/GKE) architecture, including multi-cluster management and stateful workloads.
  • Ability to optimize cloud spend while maintaining high performance and reliability.
  • Experience operating in a DevSecOps context with compliance guardrails (e.g., GDPR, HIPAA, HITRUST) across varied infrastructures.
  • Willingness to explore and adopt AI tools responsibly to enhance productivity and innovation in your role.

Benefits

  • Competitive health plans
  • Paid time off
  • Company-paid holidays
  • 401(k) retirement program with a company-elected match
  • Other company-sponsored programs

More DevOps Engineer Jobs

Sr. Site Reliability Engineer (SRE)

Moonlite AI

Moonlite is building a cloud-native experience on-prem. Our software provides the control and customization enterprises need for AI.

Build Faster with Moonlite: Instantly download and deploy NIMS from NVIDIA or build your own applications with Hugging Face. Customize and deploy AI agents in one click or integrate your own with ease.

Total Control Over Your AI: Obtain the highest level of security by design for your private environments. Moonlite provides total visibility into all your resources, applications, and users.

Find Value with Your Use Case: Allocate resources in real-time as needed in your environment. Use the models that best align with your use cases. When a new model is released, test it out and power your applications with it.

DevOps Engineer | 120 days ago
Other | Remote | Team 10 | Since 2024

Moonlite delivers high-performance AI infrastructure for organizations running intensive computational research, large-scale model training, and demanding data processing workloads. We provide infrastructure deployed in our facilities or co-located in yours, delivering flexible on-demand or reserved compute that feels like an extension of your existing data center. Our team of AI infrastructure specialists combines bare-metal performance with cloud-native operational simplicity, enabling research teams and enterprises to deploy demanding AI workloads with enterprise-grade reliability and compliance.

Your Role

You will be instrumental in building and operating production-grade AI infrastructure with deep Kubernetes expertise at its core. Working closely with our systems engineers, network engineers, and platform engineering team, you’ll architect and operate the Kubernetes infrastructure that powers our control plane and orchestrates compute, storage, and networking at scale. This role requires deep understanding of Kubernetes internals, custom resource definitions (CRDs), storage and network integrations, and building production-grade clusters from the ground up (not just deploying in managed environments). You'll ensure enterprise-grade reliability while establishing the automation, observability, and operational practices.

Job Responsibilities

  • Kubernetes Infrastructure Engineering: Design, build, and operate production Kubernetes clusters on bare-metal infrastructure – including cluster bootstrapping, control plane architecture, etcd management, and scaling strategies for high-performance compute workloads.
  • Kubernetes Networking & CNIs: Implement and operate custom Kubernetes networking solutions with SR-IOV for high-performance GPU interconnects, multi-tenancy isolation, and advanced networking policies. Configure CNI plugins and network segmentation for research workloads.
  • Custom Operators & Controllers: Develop and maintain custom Kubernetes operators and controllers for bare-metal provisioning, infrastructure lifecycle management, and resource orchestration across compute, storage, and networking domains.
  • GPU Infrastructure Integration: Deploy and optimize NVIDIA GPU operators, device plugins, and other custom scheduling logic for GPU workload placement and utilization optimization.
  • Platform Integration & Storage: Build deep integrations between Kubernetes and underlying infrastructure including CSI drivers for storage, custom admission controllers for policy enforcement, and scheduling extensions for specialized hardware placement. Design and implement automation using Terraform, Ansible, Helm, and custom operators to orchestrate infrastructure workflows and enable deployments across multiple regions.
  • Production Operations & Reliability: Manage production bare-metal infrastructure across multiple regions. Build systems ensuring high availability, fault tolerance, and graceful degradation – establishing SLIs, SLOs, and monitoring to meet enterprise reliability commitments.
  • Observability & Incident Response: Build comprehensive monitoring, logging, and alerting using Prometheus, Grafana, and ELK stack. Lead incident response, conduct postmortems, and implement preventative measures to improve reliability and reduce MTTR.
  • Performance & Capacity Planning: Identify and resolve performance bottlenecks across infrastructure domains. Monitor utilization trends, forecast capacity needs, and optimize resource allocation for various workloads.

Requirements

Preferred Qualifications

  • Experience building custom Kubernetes operators or controllers for infrastructure orchestration
  • Deep familiarity with Kubernetes networking (Calico, Cilium, Multus), service mesh technologies, and network policy management
  • Experience with GPU workload orchestration including NVIDIA GPU Operator, MIG, time-slicing, and device plugins
  • Background with advanced Kubernetes features including custom schedulers, admission controllers, and API server extensions
  • Experience with Kubernetes cluster federation or multi-cluster management
  • Knowledge of high-performance networking technologies (InfiniBand, RDMA, RoCE) and their integration with Kubernetes
  • Experience with enterprise storage systems (VAST, Lightbits, Ceph, or similar)
  • Familiarity with configuration management at scale and GitOps practices
  • Understanding of security best practices for Kubernetes and bare-metal infrastructure
  • Experience operating infrastructure in regulated industries or co-located data center environments
  • Background supporting research institutions, technical computing environments, or enterprise AI infrastructure

Key Technologies

Kubernetes, Linux, Terraform, Ansible, Prometheus, Grafana, ELK Stack, Go, Python, Bash, NVIDIA GPU Technologies, High-Performance Networking, Enterprise Storage Systems

Why Moonlite

  • Build Critical Research Infrastructure: Your work will directly enable quantitative research teams and AI practitioners to push the boundaries of what's possible in financial modeling and AI research.
  • Enterprise Impact: Build and operate infrastructure that supports mission-critical research and AI workloads for leading financial institutions and research organizations.
  • Technical Excellence: Join an infrastructure team focused on delivering enterprise-grade reliability while pushing the boundaries of high-performance computing capabilities.
  • Hands-On Ownership: As part of our growing infrastructure team, you'll have significant ownership over critical systems and the autonomy to influence our operational practices and technology choices.
  • Industry Leadership: Work alongside experienced infrastructure professionals who have built and operated systems for the most demanding computing environments.

We offer a competitive total compensation package combining a competitive base salary, startup equity, and industry-leading benefits. The total compensation range for this role is $165,000 – $225,000, which includes both base salary and equity. Actual compensation will be determined based on experience, skills, and market alignment. We provide generous benefits, including a 6% 401(k) match, fully covered health insurance premiums, and other comprehensive offerings to support your well-being and success as we grow together.

All locations: Indiana | Illinois
$165K - $225K / year

Senior DevOps Engineer

Awetomaton

Awetomaton is a team of curious, tenacious, and seasoned analysts and engineers. Let us make your cloud and data adventure a joy.

DevOps Engineer | 120 days ago
Other | Remote | Team 26 | Since 2020

We are seeking a Senior DevOps Engineer to support our application and algorithm development teams. The ideal candidate will have 5-10 years of experience in CI/CD, containerization, and cloud-based platform deployments, particularly in C2S and GovCloud environments. This role is responsible for automating deployments, meeting security requirements, and deploying scalable solutions that balance performance and cost. This position will be performed from WPAFB. Active Top Secret security clearance required.

  • Tenacity when tackling system security and compliance requirements
  • Strong background in Linux, networking, and system administration fundamentals

Required Experience and Skills

  • B.S. in Information Technology or related field
  • 10+ years of experience in DevOps engineering, system administration, and infrastructure management
  • 2+ years of experience with scripting languages like Python and Bash
  • 2+ years of experience managing and deploying applications on Kubernetes environments (Helm / Kustomize / etc.)
  • 2+ years of experience with deployment and administration of database (Postgres/MySQL/etc.) / messaging (RabbitMQ/ZeroMQ/etc.) solutions
  • 2+ years of experience with Infrastructure as Code tooling (Terraform, CloudFormation, Ansible, etc.)
  • Familiarity with security compliance tools (e.g., Grype, Fortify, etc.)
  • Hands-on experience with CI/CD tools such as GitLab CI and ArgoCD

Desired Education and Experience

  • Experience with Argo Workflows and batch- and stream-based processing paradigms
  • Leadership and mentorship experience, capable of leading a small team or project group
  • Expertise navigating Authority to Operate and Risk Management Framework per NIST 800-53
  • Familiarity with enterprise processes for Firewall Change Requests (FCRs), Security Impact Determinations (SIDRs), and Change Requests in ServiceNow workflow systems

About Awetomaton

We build software that enables humans to make informed decisions from raw intel data, with a specialization in satellite collection. We are a Dayton-based defense startup that believes there is a critical skill and culture gap, one we are filling with high-output, mission-focused engineering talent. Results matter to our customers. We've never lost a customer because we deliver what they need. Awetomaton is where you do work that matters with a passionate team in an environment that inspires your creativity.

Benefits

We provide industry-leading benefits that balance your health and wellbeing today, as well as lay a solid financial future into retirement. Highlights of our benefit plan include:

  • Flexible time off totaling 36 days / year. No approvals required.
  • 100% 401k company match up to IRS annual limits. This is a benefit of up to $24,500 for 2026.
  • Health plan through Blue Cross Blue Shield with Health Savings Account (HSA). We pay 95% of employee premiums including family plans.
  • Employee favorite: Tech and Wellness reimbursement of $3000 / year. Treat yourself with the newest gadgets or athletic gear. No approvals required.
  • Annual profit sharing as 401k contribution. No set percent and dependent on annual company performance. Benefit is above and beyond 401k match.
  • MacBook laptop and well-equipped office spaces for optimal productivity in office and on-the-go.

Ohio

DevOps Engineer

Metova, Inc.

Helping companies transform their business through technology to meet the growing expectations of their customers.

DevOps Engineer | 120 days ago
Full Time | Remote | Team 201-500 | Since 2006 | H1B No Sponsor

  • Identify problems in software development components.
  • Create build and release pipelines.
  • Interpret, document, and improve baselines.
  • Analyze release metrics to find opportunities for improvement.
  • Participate in releases on matters of component configuration.
  • Grant permissions and/or access to the DevOps and infrastructure tools for the people involved.

Mexico
Job Closed

DevOps Engineer

MED-REVIEW

Your partner on the journey to the medical residency of your dreams!

DevOps Engineer | 121 days ago
Full Time | Remote | Team 51-200 | Since 2020 | H1B No Sponsor

  • Pipeline Automation: Design, implement, and maintain CI/CD pipelines (Continuous Integration / Continuous Deployment).
  • Infrastructure as Code (IaC): Provision and manage cloud resources using tools such as Terraform or CloudFormation.
  • Observability: Configure monitoring, logging, and alerts to proactively detect issues in production environments.
  • Container Management: Orchestrate applications using Docker and Kubernetes.
  • Security (DevSecOps): Implement security best practices across the development lifecycle, from vulnerability analysis to secrets management.
  • Collaboration: Act as a facilitator between developers and the infrastructure team to optimize application performance.

Brazil
Job Closed