Deepgram

Building foundational AI for speech transcription and understanding.

Site Reliability Engineer – AI & ML Infrastructure, Kubernetes, AWS, Terraform

DevOps Engineer · Other · Remote · Senior · Team 51-200 · Since 2015 · H1B Sponsor

Location

United States

Posted

97 days ago

Salary

$150K - $220K / year

Seniority

Senior

Bachelor Degree · 5 yrs exp · English · AWS · Kubernetes · Python · Terraform

Job Description

  • Architect and maintain our core computing platform using Kubernetes on AWS and on-premise, providing a stable, scalable environment for all applications and services.
  • Develop and manage our entire infrastructure using Infrastructure-as-Code (IaC) principles with Terraform, ensuring our environments are reproducible, versioned, and automated.
  • Design, build, and optimize our AI/ML job scheduling and orchestration systems, integrating Slurm with our Kubernetes clusters to efficiently manage GPU resources.
  • Provision, manage, and maintain our on-premise bare metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions to support high-throughput, low-latency workloads across hybrid environments.
  • Develop a comprehensive observability stack (monitoring, logging, tracing) to ensure platform health, and create automation for operational tasks, incident response, and performance tuning.
  • Collaborate with AI researchers and ML engineers to understand their infrastructure needs and build the tools and workflows that accelerate their development cycle.
  • Automate the life cycle of single-tenant, managed deployments.

Job Requirements

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE)
  • Proven, hands-on experience building and managing production infrastructure with Terraform
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment
  • Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads
  • Experience managing bare metal infrastructure, including server provisioning (e.g., PXE boot, MAAS), configuration, and lifecycle management
  • Strong scripting and automation skills (e.g., Python, Go, Bash)

Benefits

  • Medical, dental, vision benefits
  • Annual wellness stipend
  • Mental health support
  • Life, STD, LTD Income Insurance Plans
  • Unlimited PTO
  • Generous paid parental leave
  • Flexible schedule
  • 12 Paid US company holidays
  • Quarterly personal productivity stipend
  • One-time stipend for home office upgrades
  • 401(k) plan with company match
  • Tax Savings Programs
  • Learning / Education stipend
  • Participation in talks and conferences
  • Employee Resource Groups
  • AI enablement workshops / sessions

More DevOps Engineer Jobs

Release Engineer

Continuum

Accelerating Digital at the Speed of Government

DevOps Engineer · 97 days ago
Other · Remote · Team 11-50 · Since 2023 · H1B Sponsor

  • Lead CI/CD practices for Dynamics 365, Power Platform, and related Azure components, driving consistent and reliable releases.
  • Manage solution builds, packaging, and deployments across Dev, Test, UAT, and Production using Azure DevOps and proven ALM methods.
  • Review and validate environment configurations, layering (managed/unmanaged), patch strategies, and dependencies for D365 apps, Dataverse, plugins, and integrations.
  • Ensure version consistency, connection references, environment variables, and API dependencies are aligned for each release.
  • Troubleshoot complex release issues such as solution import errors, plugin exceptions, or schema conflicts, and establish long-term prevention tactics.
  • Support change control by ensuring all Change Requests include accurate release notes, dependency documentation, and the required approvals aligned with governance standards.
  • Use Power Platform tools (Solution Checker, Plugin Trace Logs, Dataverse telemetry, Azure Application Insights) to diagnose and resolve deployment issues.
  • Champion automation by building and maintaining Azure DevOps Pipelines or GitHub Actions for exports, validation, versioning, and deployment.
  • Collaborate with Dynamics developers, architects, integration teams, and functional consultants to coordinate cross-system releases involving Azure APIs, Power Automate, and Power Pages.
  • Continuously improve release governance, environment management, and ALM practices to support a sustainable D365 delivery model.

Virginia
Job Closed

Senior DevOps Engineer

EverOps

The Embedded Service Provider

DevOps Engineer · 97 days ago
Other · Remote · Team 51-200 · H1B No Sponsor

  • Develop and use automation tools effectively to operate, manage, and scale production and development environments in Azure quickly
  • Design, build, and maintain CI/CD pipelines using Azure DevOps Pipelines, including multi-stage YAML pipelines for infrastructure and application deployments
  • Author and maintain Azure infrastructure using Bicep templates and Terraform modules, following IaC best practices
  • Participate in regular customer and internal EverOps scrums
  • Monitor Azure environments using native tooling and third-party platforms while focusing on constant improvement
  • Implement new Azure services and technologies as customer requirements evolve
  • Design and execute new solutions while working to improve existing ones
  • Provide operational support and project deployments for our customer environments

United States
Job Closed

Senior Staff Site Reliability Engineer

Jobgether

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

DevOps Engineer · 97 days ago
Other · Remote · H1B No Sponsor

This description is a summary of our understanding of the job description. Click on the 'Apply' button to find out more.

Role Description

This role is pivotal in ensuring the reliability, scalability, and performance of cloud-based enterprise software. As a Senior Staff Site Reliability Engineer, you will:

  • Design, deploy, and maintain robust infrastructure for mission-critical services
  • Collaborate closely with development teams to optimize CI/CD pipelines and automate operational workflows
  • Provide guidance on distributed systems, cloud architecture, and containerized environments
  • Influence both technical strategy and day-to-day operations
  • Combine hands-on engineering with leadership in best practices for deployment automation, observability, and cost optimization
  • Mentor peers and participate in technology evaluations
  • Ensure resilient, customer-focused infrastructure

Qualifications

  • Strong experience in scalable, distributed systems architecture and cloud platforms
  • Proficiency in programming with Go (Golang) and containerization technologies such as Docker
  • Hands-on experience with Kubernetes and orchestration technologies
  • Expertise in CI/CD processes, deployment automation, and configuration management
  • Solid understanding of Git workflows in a collaborative team environment
  • Bachelor’s degree in Computer Science or equivalent experience
  • Strong analytical, problem-solving, and communication skills
  • Experience with networking fundamentals, identity and access management, and monitoring/observability tools is a plus
  • Ability to work independently and collaboratively in a fast-paced, fully remote environment

Requirements

  • Participate in on-call rotation to maintain operational excellence for production systems
  • Evaluate emerging technologies and recommend solutions to enhance system reliability and security

Benefits

  • Competitive salary range of $170,000–$230,000
  • Flexible, fully remote U.S. work environment
  • Generous paid time off and holiday schedule
  • Parental leave and progressive healthcare options
  • Retirement savings programs
  • Education reimbursement opportunities
  • Team bonding events and global volunteering initiatives
  • Inclusive, collaborative culture emphasizing growth and development

United States
Job Closed

DevOps Engineer III

Modivcare

Modivcare is on a mission to transform access to healthcare so people across America can sustain healthier and happier lives. More specifically, the company wan

DevOps Engineer · 97 days ago

  • Designs, builds, and maintains scalable and robust infrastructure using cloud platforms (e.g., AWS, Azure) and containerization technologies (e.g., Docker, Kubernetes, ECS).
  • Collaborates with InfoSec to ensure the team is building a secure, scalable cloud infrastructure.
  • Develops and maintains enterprise-grade CI/CD pipelines and components to automate the build, test, and deployment processes for applications.
  • Implements and manages version control systems (e.g., Git) and artifact repositories to ensure efficient code collaboration and artifact management.
  • Monitors and improves the performance and reliability of CI/CD pipelines, addressing bottlenecks and implementing proactive measures.
  • Implements monitoring and logging solutions (e.g., Datadog, Prometheus, ELK stack) to track system health, identify performance issues, and troubleshoot incidents.
  • Collaborates with development and operations teams to diagnose and resolve production issues, ensuring quick resolution and minimal disruption to services.
  • Continuously monitors system capacity, performance, and security, implementing proactive measures to optimize resource utilization and enhance system stability.
  • Develops automation scripts (e.g., TypeScript, Bash, Python) to streamline routine operational tasks, improve efficiency, and reduce manual intervention.
  • Automates the deployment and configuration of applications, services, and infrastructure components using Infrastructure as Code (IaC) tools such as Pulumi, Terraform, CDK, or CloudFormation.
  • Works closely with cross-functional teams, including developers, testers, and operations, to foster a collaborative DevOps culture and drive continuous improvement.
  • Creates and maintains detailed technical documentation, including system diagrams, architectural designs, and standard operating procedures (SOPs).

United States
$97.2K - $133.7K / year
Job Closed