Job Closed
This listing is no longer active.
Building foundational AI for speech transcription and understanding.
Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, Terraform
Location
United States
Posted
115 days ago
Salary
$160K - $220K / year
Seniority
Senior
Job Description
Deepgram
- Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
- Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
- Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
- Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
- Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
- Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
- Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
- Automate the life cycle of single-tenant, managed deployments.
Job Requirements
- 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
- Proven, hands-on experience building and managing production infrastructure with Terraform
- Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
- Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
- Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
- Strong scripting and automation skills (e.g., Python, Go, Bash)
Benefits
- Medical, dental, vision benefits
- Annual wellness stipend
- Mental health support
- Life, STD, LTD Income Insurance Plans
- Unlimited PTO
- Generous paid parental leave
- Flexible schedule
- 12 Paid US company holidays
- Quarterly personal productivity stipend
- One-time stipend for home office upgrades
- 401(k) plan with company match
- Tax Savings Programs
- Learning / Education stipend
- Participation in talks and conferences
- Employee Resource Groups
- AI enablement workshops / sessions
More DevOps Engineer Jobs
Senior DevOps Software Engineer
eClinical Solutions
We bring people and data together to support tomorrow’s breakthroughs
- Design, develop, test, and deploy scalable, secure, and highly interactive web applications
- Own and evolve core platform modules
- Influence application and system architecture
- Lead by example through clean, well-tested code
- Collaborate closely with Product Management, QA, and other engineers
- Provide technical mentorship and guidance to other engineers
- Diagnose and resolve complex production issues
- Ensure solutions meet eClinical Solutions quality standards
Senior Site Reliability Engineer
Zscaler
Zscaler helps leading organizations in 180+ countries securely transform their networks and applications for a mobile and cloud-first world. Founded in 2008, th
- Expertly navigate networking principles, firewalls, and load balancing solutions to ensure robust infrastructure performance
- Partner with Software Engineering and Infrastructure teams to design, implement, and deploy comprehensive end-to-end monitoring solutions
- Execute seamless patches and upgrades, ensuring all administrative tools and utilities remain current and high-performing
- Proactively monitor applications and services, participating in an on-call rotation to resolve issues and implement strategic prevention measures
- Troubleshoot complex technical challenges and provide clear, candid communication regarding issues and their resolutions.

- Participate in on-call rotations as the primary technical lead. Act as the Incident Commander during major severity incidents: initiating war rooms, coordinating cross-functional teams, and providing clear status updates.
- Instrument code to expose high-cardinality metrics and distributed traces. Collaboratively define, measure, and defend Service Level Objectives (SLOs) and Error Budgets with product owners.
- Write high-quality, production-ready code (in Java, Go, or Python) to build internal tooling, automation platforms, and self-healing mechanisms that eliminate manual operator intervention.
- Partner with Product Engineering teams during the design phase to ensure new services are built with reliability, scalability, and observability patterns (circuit breakers, rate limiting, backpressure, fallback strategies) from day one.
- Analyze system performance and traffic patterns to model future capacity needs. Conduct load testing and chaos engineering experiments to verify system resilience under failure conditions.

- Participate in on-call rotations to detect and triage service and reliability issues across all environments. Act as the Incident Commander during major incidents: initiating war room or bridge calls, coordinating cross-functional teams, providing timely and clear status updates to all stakeholders.
- Create and maintain meaningful dashboards and alerts. Work with development teams to instrument their code to ensure visibility.
- Develop automation to eliminate manual and repetitive operational tasks (toil) related to reliability across both applications and infrastructure.
- Implement and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) defined by the engineering leadership.
- Investigate and resolve customer complaints escalated beyond L1 and L2 support, especially those involving performance, reliability, or complex system behavior.



