Job Closed

This listing is no longer active.

Somewhere

Logistical & physical assistance for non-emergency medical transport

Site Reliability Engineer, AI Infrastructure

DevOps Engineer | Full Time | Remote | Mid Level

Location

United Kingdom + 8 more

Posted

99 days ago

Salary

Seniority

Mid Level

Job Description

Role Description

We are hiring a Senior Site Reliability Engineer (SRE) to own the reliability of our GPU training and inference clusters from the US West Coast. You will serve as the on-call anchor for Asian hours, drive incident response on multi-thousand GPU fabrics, and push our platform toward higher availability, faster recovery, and cleaner operations. This is a hands-on role with significant production impact from week one.

Key Responsibilities

Cluster Operations & Hardening
- Production SLURM Management: Operate and harden production SLURM clusters running large-scale distributed training and inference jobs.
- Hardware Health: Own the health of NVIDIA HGX and DGX nodes, including GPU, NVLink, NVSwitch, and BMC diagnostics.
- Fabric Tuning: Debug and tune NVIDIA Quantum InfiniBand fabrics (NDR and HDR), including Subnet Manager, topology, adaptive routing, SHARP, and congestion issues.
- Root Cause Analysis: Drive deep-dive RCA on GPU failures, XID errors, ECC events, thermal throttling, and link flaps.

Automation & Observability
- Systems Automation: Write robust automation in Python, Go, or Bash to replace manual tasks, improve MTTR, and scale operations efficiently.
- Observability Stack: Build and maintain observability for GPU fleets using Prometheus, Grafana, DCGM, node exporter, and custom exporters.
- Capacity & Rollouts: Contribute to capacity planning, firmware rollout strategy, and cluster bring-up for new sites.

Collaboration & Incident Response
- Workload Optimization: Partner with customer workload teams on NCCL tuning, job scheduling policy, QoS, and fairshare.
- Operational Excellence: Lead post-mortems, write comprehensive runbooks, and improve change management processes across global regions.
- On-Call Leadership: Participate in the on-call rotation for US hours and handle escalations from international sites when necessary.

Qualifications
- 5+ years in SRE, systems engineering, or HPC operations.
- Extensive production experience with SLURM at scale (accounting/slurmdbd, prolog/epilog scripts, cgroups, GRES, topology awareness).
- Hands-on experience with NVIDIA datacenter GPUs, driver stacks, CUDA runtime, Fabric Manager, nvidia-smi, DCGM, and GPU Direct RDMA.
- Operational experience with InfiniBand fabrics at 100G or higher (OpenSM/UFM, ibdiagnet, perfquery, and fabric troubleshooting).
- Expert-level Linux admin skills (Ubuntu/RHEL family), including kernel tuning, systemd, networking, and PXE provisioning.
- Solid scripting skills in Python and Bash, plus working knowledge of Ansible or Terraform.

Nice to Have
- Experience with NCCL internals, PyTorch distributed, or Megatron-style training stacks.
- Familiarity with BCM (Base Command Manager), Run:ai, or similar managers.
- Experience running Kubernetes on bare metal with GPU, Network, and MPI Operators.
- Exposure to high-performance storage like Lustre, WEKA, VAST, or BeeGFS.
- Prior work in an AI cloud, neocloud, HPC center, or hyperscaler environment.

Benefits
- You will touch clusters that train world-class models, working with the most advanced hardware available.
- We maintain a flat structure with direct access to leadership and a culture built around technical craftsmanship and ownership.
- Full remote flexibility with occasional travel for team summits and datacenter site visits.
- Comprehensive US benefits including performance bonuses, equity participation, and 401(k) eligibility.

More DevOps Engineer Jobs

DevOps Specialist

inventCloud

Experience in mission-critical environments (banks, payment providers, fintechs). A culture of automation, reliability, and continuous improvement. Data-driven decision-making.

DevOps Engineer | 99 days ago

Full Time Remote

Role Description

We are looking for a DevOps Specialist to work hands-on in critical, high-volume, low-latency environments. This professional will be responsible for designing, implementing, and sustaining resilient, secure, and scalable platforms, with a focus on automation, reliability (SRE), and governance.

Technical responsibilities:
- Design and evolve CI/CD pipelines with a focus on versioning, traceability, and safe rollback
- Implement blue/green deployment, canary release, and feature flag strategies
- Manage and optimize production Kubernetes clusters (HA, autoscaling, network policies, RBAC)
- Develop and maintain infrastructure as code (Terraform) with reusable modules and governance
- Implement full observability (logs, metrics, distributed tracing) with event correlation
- Define and monitor SLIs/SLOs/SLAs, applying SRE practices
- Perform advanced troubleshooting (dump analysis, profiling, I/O, CPU, and network bottlenecks)
- Manage secure pipelines with DevSecOps practices (vulnerability scanning, secrets, compliance)
- Work with distributed, event-driven architectures
- Optimize cost and performance in large-scale cloud environments
- Automate operational routines and reduce toil

Qualifications
- Solid experience with CI/CD (Jenkins, GitHub Actions, GitLab CI) in corporate environments
- Advanced experience with Docker and Kubernetes (managing production clusters)
- Deep knowledge of Linux (kernel tuning, process analysis, advanced troubleshooting)
- Cloud experience (AWS, Azure, or GCP) with a focus on architecture and security
- Experience with Terraform (state management, modules, workspaces)
- Advanced networking knowledge (TCP/IP, TLS, DNS, proxies, load balancers)
- Experience with observability (Prometheus, Grafana, ELK/EFK, OpenTelemetry)
- Experience with version control and branching strategy (GitFlow or trunk-based)
- Advanced scripting (Shell, Python, or Go)

Requirements
- Experience with regulated environments (LGPD, compliance, auditing)
- Knowledge of offensive/defensive security (hardening, IAM, Zero Trust)
- Experience with service mesh (Istio, Linkerd)
- Experience with messaging and streaming (Kafka, RabbitMQ)
- Experience with high-volume, low-latency systems
- Cloud certifications (AWS, Azure, GCP)
- Experience with chaos engineering

Benefits
- Ownership mindset in critical environments
- Strong analytical skills and the ability to resolve complex incidents
- Ability to work under pressure with a focus on stability and business continuity
- Clear communication with technical teams and stakeholders

Brazil

Job Closed

Solution Architecture- DevOps

NTT DATA Services

NTT DATA is a $30 billion business and technology services leader, serving 75% of the Fortune Global 100. We are committed to accelerating client success and positively impacting society through responsible innovation. We are one of the world's leading AI and digital infrastructure providers, with unmatched capabilities in enterprise-scale AI, cloud, security, connectivity, data centers, and application services. Our consulting and industry solutions help organizations and society move confidently and sustainably into the digital future. As a Global Top Employer, we have experts in more than 50 countries. We also offer clients access to a robust ecosystem of innovation centers as well as established and start-up partners. NTT DATA is a part of NTT Group, which invests over $3 billion each year in R&D.

DevOps Engineer | 100 days ago

Full Time Remote

Role Description

We are currently seeking a Solution Architecture- DevOps to join our team in Austin, Texas (US-TX), United States (US).

A DevOps Architect is the visionary behind an organization’s automation and cloud infrastructure strategy and a part of NTT DATA’s CCoE. He or she will design the tools and systems that power efficient software delivery. The DevOps Architect takes a higher-level approach, designing frameworks that guide teams and systems toward automation excellence. The DevOps Architect acts as both strategist and engineer, designing systems that ensure automation, scalability, and operational excellence.

Key Responsibilities:
- Design the framework for scalable and secure CI/CD pipelines.
- Oversee the architecture of infrastructure-as-code (IaC) templates for consistent environment provisioning.
- Recommend integration of DevOps tools across multi-cloud environments.
- Establish governance, monitoring, and compliance frameworks.
- Define automation standards and best practices in consultation with engineering teams.
- Define and guide code-driven automation.
- Operate across multiple cloud platforms such as AWS, Azure, OCI, and Google Cloud.
- Design frameworks and reference architectures that embrace DevSecOps principles, embedding security checks into every stage of the CI/CD pipeline.
- Align DevOps strategy with business outcomes.

Qualifications
- Minimum 10 years in cloud technology with experience in at least two of the following platforms: AWS, Microsoft Azure, and Google Cloud Platform (GCP).
- A minimum of 5 years of experience in automation tools with a combination of any of the following: Jenkins, GitLab CI/CD, CircleCI, and Bamboo.
- A minimum of 5 years of experience with Containerization & Orchestration: Docker or Kubernetes for scalable microservices management.
- A minimum of 5 years of experience with Infrastructure as Code: Terraform, Ansible, Puppet, or Chef.
- A minimum of 5 years of experience with Monitoring Tools: Prometheus, Grafana, or similar technology.

Requirements
- Degree in Computer Science or equivalent practical experience.
- One or more certifications, including: AWS Certified DevOps Engineer – Professional, Microsoft Certified: DevOps Engineer Expert, Google Professional Cloud DevOps Engineer, Certified Kubernetes Administrator (CKA).

Benefits
- Medical, dental, and vision insurance with an employer contribution.
- Flexible spending or health savings account.
- Life and AD&D insurance.
- Short and long term disability coverage.
- Paid time off.
- Employee assistance program.
- Participation in a 401k program with company match.
- Additional voluntary or legally-required benefits.

United States

$128.9K - $190.1K / year

Job Closed

Cloud Applications DevOps Engineer

One New Zealand

Need help? Visit: https://one.nz/contact For current outages, visit: https://one.nz/help/network-status/

DevOps Engineer | 100 days ago

Full Time | Remote | Team 1,001-5,000 | Since 2023 | H1B No Sponsor

• Validate and verify resolved defects, enhancements, and newly delivered functionality to ensure quality, stability, and alignment with business requirements before release.
• Collaborate with stakeholders to review data extracts, dashboards, and reports, ensuring accuracy, consistency, and meaningful insights for decision-making.
• Participate in a 24x7 on-call roster as required, providing timely support and ensuring continuity of critical platform operations.
• Manage and resolve incidents within agreed operational SLA targets, maintaining clear communication and ownership through to resolution.
• Provide technical support to the Security Operations Centre (SOC), assisting with investigations, system-health checks, and incident triage.
• Perform repeat fault analysis and root cause investigations, identifying underlying issues and implementing preventative measures to reduce recurrence.
• Conduct quality assurance (QA) of reporting data, pipelines, and outputs to maintain high data integrity and reliability across platforms.
• Monitor cloud applications and services to proactively identify risks, performance issues, and optimisation opportunities.
• Support continuous improvement initiatives by identifying automation opportunities and streamlining operational processes.
• Work closely with DevOps, engineering, and business teams to ensure smooth deployment, integration, and ongoing support of cloud-based applications.
• Maintain documentation for processes, incidents, and system changes to support knowledge sharing and operational transparency.

Cloud, JavaScript, Python, SDLC, SQL, Go

New Zealand

Job Closed

DE&A - GCP Data Engineer

Zensar

At Zensar, we’re “experience-led everything”. We are committed to conceptualizing, designing, engineering, marketing, and managing digital solutions and experiences for over 130 leading enterprises. We are a company driven by a bold purpose: Together, we shape experiences for better futures. Whether for our clients, our people, or the world around us, this belief powers everything we do. At the heart of our culture is ONE with Client - a set of four core values that reflect who we are and how we work: One Zensar, Nurturing, Empowering, and Client Focus. Part of the $4.8 billion RPG Group, we’re a community of 10,000+ innovators across 30+ global locations, including Milpitas, Seattle, Princeton, Cape Town, London, Zurich, Singapore, and Mexico City. We believe the best work happens when individuality is celebrated, growth is encouraged, and well-being is prioritized. We are an equal employment opportunity (EEO) and affirmative action employer, committed to creating an inclusive workplace. All qualified applicants will be considered without regard to race, creed, color, ancestry, religion, sex, national origin, citizenship, age, sexual orientation, gender identity, disability, marital status, family medical leave status, or protected veteran status.

DevOps Engineer | 100 days ago

Full Time | Remote | Team 10,001

Role Description

- Creating impact faster: Delivering more impact in less time by quickly deploying solutions or augmenting existing ones, enabling teams to accelerate business results and shorten time to value.
- Breaking through barriers: Helping create a better customer experience by taking a data-first approach and delivering insights.
- Adapting to anything: The agility to react and respond to new business priorities, market conditions, and customer opportunities with rapidly deployable solutions.
- Innovating anywhere: Solving problems with powerful, interoperable solutions across multiple lines of business.

India

Job Closed
