Job Closed

This listing is no longer active.

Deepgram

Building foundational AI for speech transcription and understanding.

Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, Terraform

DevOps Engineer | Other | Remote | Senior | Team 51-200 | Since 2015 | H1B Sponsor

Location

United States

Posted

115 days ago

Salary

$160K - $220K / year

Seniority

Senior

5 yrs exp | English | AWS Kubernetes Python Terraform

Job Description

• Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
• Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
• Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
• Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
• Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
• Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
• Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
• Automate the life cycle of single-tenant, managed deployments.

Job Requirements

5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
Proven, hands-on experience building and managing production infrastructure with Terraform
Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits

Medical, dental, vision benefits
Annual wellness stipend
Mental health support
Life, STD, LTD Income Insurance Plans
Unlimited PTO
Generous paid parental leave
Flexible schedule
12 Paid US company holidays
Quarterly personal productivity stipend
One-time stipend for home office upgrades
401(k) plan with company match
Tax Savings Programs
Learning / Education stipend
Participation in talks and conferences
Employee Resource Groups
AI enablement workshops / sessions

More DevOps Engineer Jobs

Senior DevOps Software Engineer

eClinical Solutions

We bring people and data together to support tomorrow’s breakthroughs

DevOps Engineer | 115 days ago

Other | Remote | Team 201-500 | Since 2012 | H1B Sponsor

• Design, develop, test, and deploy scalable, secure, and highly interactive web applications
• Own and evolve core platform modules
• Influence application and system architecture
• Lead by example through clean, well-tested code
• Collaborate closely with Product Management, QA, and other engineers
• Provide technical mentorship and guidance to other engineers
• Diagnose and resolve complex production issues
• Ensure solutions meet eClinical Solutions quality standards

AWS Kubernetes Python Terraform

Massachusetts

$132K - $165K / year

Senior Site Reliability Engineer

Zscaler

Zscaler helps leading organizations in 180+ countries securely transform their networks and applications for a mobile and cloud-first world. Founded in 2008, th

DevOps Engineer | 115 days ago

Other | Remote

• Expertly navigate networking principles, firewalls, and load balancing solutions to ensure robust infrastructure performance
• Partner with Software Engineering and Infrastructure teams to design, implement, and deploy comprehensive end-to-end monitoring solutions
• Execute seamless patches and upgrades, ensuring all administrative tools and utilities remain current and high-performing
• Proactively monitor applications and services, participating in an on-call rotation to resolve issues and implement strategic prevention measures
• Troubleshoot complex technical challenges and provide clear, candid communication regarding issues and their resolutions.

DNS Firewalls Python TCP/IP

Illinois

$101K - $145K / year

Job Closed

Senior Site Reliability Engineer

Moniepoint Inc. (Formerly TeamApt Inc.)

DevOps Engineer | 115 days ago

Full Time | Remote | Team 1,001-5,000 | H1B No Sponsor

• Participate in on-call rotations as the primary technical lead. Act as the Incident Commander during major severity incidents: initiating war rooms, coordinating cross-functional teams, and providing clear status updates.
• Instrument code to expose high-cardinality metrics and distributed traces. Collaboratively define, measure, and defend Service Level Objectives (SLOs) and Error Budgets with product owners.
• Write high-quality, production-ready code (in Java, Go, or Python) to build internal tooling, automation platforms, and self-healing mechanisms that eliminate manual operator intervention.
• Partner with Product Engineering teams during the design phase to ensure new services are built with reliability, scalability, and observability patterns (circuit breakers, rate limiting, backpressure, fallback strategies) from day one.
• Analyze system performance and traffic patterns to model future capacity needs. Conduct load testing and chaos engineering experiments to verify system resilience under failure conditions.

AWS Azure Distributed Systems GCP Java Apache Kafka Kubernetes Microservices MySQL PostgreSQL Prometheus Python RabbitMQ Rust

Nigeria

Site Reliability Engineer

Moniepoint Inc. (Formerly TeamApt Inc.)

DevOps Engineer | 115 days ago

Full Time | Remote | Team 1,001-5,000 | H1B No Sponsor

• Participate in on-call rotations to detect and triage service and reliability issues across all environments. Act as the Incident Commander during major incidents: initiating war room or bridge calls, coordinating cross-functional teams, providing timely and clear status updates to all stakeholders.
• Create and maintain meaningful dashboards and alerts. Work with development teams to instrument their code to ensure visibility.
• Develop automation to eliminate manual and repetitive operational tasks (toil) related to reliability across both applications and infrastructure.
• Implement and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs) defined by the engineering leadership.
• Investigate and resolve customer complaints escalated beyond L1 and L2 support, especially those involving performance, reliability, or complex system behavior.

AWS Azure Distributed Systems GCP Grafana Java Kubernetes Microservices MySQL PostgreSQL Python SQL

Nigeria
