Job Closed

This listing is no longer active.

Deepgram

Building foundational AI for speech transcription and understanding.

Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, Terraform

DevOps Engineer, Other, Remote, Senior, Team 51-200, Since 2015, H1B Sponsor

Location

United States

Posted

115 days ago

Salary

$160K - $220K / year

Seniority

Senior

Job Description

  • Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
  • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
  • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
  • Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
  • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
  • Automate the life cycle of single-tenant, managed deployments.

Job Requirements

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
  • Proven, hands-on experience building and managing production infrastructure with Terraform
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
  • Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
  • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
  • Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits

  • Medical, dental, vision benefits
  • Annual wellness stipend
  • Mental health support
  • Life, STD, LTD Income Insurance Plans
  • Unlimited PTO
  • Generous paid parental leave
  • Flexible schedule
  • 12 Paid US company holidays
  • Quarterly personal productivity stipend
  • One-time stipend for home office upgrades
  • 401(k) plan with company match
  • Tax Savings Programs
  • Learning / Education stipend
  • Participation in talks and conferences
  • Employee Resource Groups
  • AI enablement workshops / sessions

More DevOps Engineer Jobs

Senior DevOps Software Engineer

eClinical Solutions

We bring people and data together to support tomorrow’s breakthroughs

DevOps Engineer, 115 days ago
Other, Remote, Team 201-500, Since 2012, H1B Sponsor

  • Design, develop, test, and deploy scalable, secure, and highly interactive web applications
  • Own and evolve core platform modules
  • Influence application and system architecture
  • Lead by example through clean, well-tested code
  • Collaborate closely with Product Management, QA, and other engineers
  • Provide technical mentorship and guidance to other engineers
  • Diagnose and resolve complex production issues
  • Ensure solutions meet eClinical Solutions quality standards

Massachusetts
$132K - $165K / year

Senior Site Reliability Engineer

Zscaler

Zscaler helps leading organizations in 180+ countries securely transform their networks and applications for a mobile and cloud-first world. Founded in 2008, th

DevOps Engineer, 115 days ago

  • Expertly navigate networking principles, firewalls, and load balancing solutions to ensure robust infrastructure performance
  • Partner with Software Engineering and Infrastructure teams to design, implement, and deploy comprehensive end-to-end monitoring solutions
  • Execute seamless patches and upgrades, ensuring all administrative tools and utilities remain current and high-performing
  • Proactively monitor applications and services, participating in an on-call rotation to resolve issues and implement strategic prevention measures
  • Troubleshoot complex technical challenges and provide clear, candid communication regarding issues and their resolutions.

Illinois
$101K - $145K / year
Job Closed
Full Time, Remote, Team 1,001-5,000, H1B No Sponsor

  • Participate in on-call rotations as the primary technical lead. Act as the Incident Commander during major severity incidents: initiating war rooms, coordinating cross-functional teams, and providing clear status updates.
  • Instrument code to expose high-cardinality metrics and distributed traces. Collaboratively define, measure, and defend Service Level Objectives (SLOs) and Error Budgets with product owners.
  • Write high-quality, production-ready code (in Java, Go, or Python) to build internal tooling, automation platforms, and self-healing mechanisms that eliminate manual operator intervention.
  • Partner with Product Engineering teams during the design phase to ensure new services are built with reliability, scalability, and observability patterns (circuit breakers, rate limiting, backpressure, fallback strategies) from day one.
  • Analyze system performance and traffic patterns to model future capacity needs. Conduct load testing and chaos engineering experiments to verify system resilience under failure conditions.

Nigeria
Full Time, Remote, Team 1,001-5,000, H1B No Sponsor

  • Participate in on-call rotations to detect and triage service and reliability issues across all environments. Act as the Incident Commander during major incidents: initiating war room or bridge calls, coordinating cross-functional teams, providing timely and clear status updates to all stakeholders.
  • Create and maintain meaningful dashboards and alerts. Work with development teams to instrument their code to ensure visibility.
  • Develop automation to eliminate manual and repetitive operational tasks (toil) related to reliability across both applications and infrastructure.
  • Implement and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) defined by the engineering leadership.
  • Investigate and resolve customer complaints escalated beyond L1 and L2 support, especially those involving performance, reliability, or complex system behavior.

Nigeria