Hydra Host

A distributed marketplace for compute

AI Infrastructure Engineer

Infrastructure Engineer, Other, Remote, Senior, Team 11-50, H1B No Sponsor

Location

United States

Posted

112 days ago

Salary

$150K - $225K / year

Seniority

Senior

Bachelor Degree, English, Ansible, Kubernetes, Linux, PyTorch, TCP/IP, Terraform

Job Description

• Get AI Platform customers production-ready on Hydra: standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
• Own the bare metal ←→ platform layer: bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and MLOps tooling that customers actually use.
• Configure, benchmark, and debug NVIDIA driver stacks: firmware versions, CUDA compatibility, NCCL tuning, MIG configurations.
• Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
• Identify gaps before customers do: pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken.
• Turn customer learnings into product: working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
• Advise customers on chip selection and tokenomics: helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads.

Job Requirements

Bare metal Linux depth — you've administered GPU servers at the metal: driver stacks, kernel tuning, firmware, storage configuration. Not just managed K8s.
NVIDIA GPU stack expertise — drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance.
Kubernetes and orchestration — production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
AI Networking fundamentals — TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads.
Customer-facing communication — you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
Bias toward scalable solutions — you'd rather build a feature that helps 10 customers than a custom deployment that helps 1.
Nice to Have

HPC or large-scale distributed training environments.
AI workload experience (vLLM, PyTorch, inference frameworks).
Storage systems (NVMe, distributed filesystems, CEPH, WEKA).
IaC and provisioning tools (Terraform, Ansible, Cloud-init, MaaS).

Benefits

Competitive salary
Equity ownership
Healthcare — medical, dental, vision for you and your family
Remote-first — with hubs in Phoenix, Boulder, and Miami
Direct impact — your work shapes how GPU infrastructure gets deployed across the AI ecosystem

