A distributed marketplace for compute
AI Infrastructure Engineer
Location
United States
Posted
112 days ago
Salary
$150K - $225K / year
Seniority
Senior
Job Description
Hydra Host
- Get AI platform customers production-ready on Hydra: standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
- Own the bare metal ↔ platform layer: bridging GPU infrastructure (NCCL, InfiniBand, NVLink, storage) with orchestration layers (Kubernetes, SLURM) and the MLOps tooling that customers actually use.
- Configure, benchmark, and debug NVIDIA driver stacks: firmware versions, CUDA compatibility, NCCL tuning, MIG configurations.
- Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
- Identify gaps before customers do: pressure-testing Hydra's infrastructure, APIs, and workflows to find what's missing or broken.
- Turn customer learnings into product: working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
- Advise customers on chip selection and tokenomics: helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads.
Job Requirements
- Bare metal Linux depth — you've administered GPU servers at the metal: driver stacks, kernel tuning, firmware, storage configuration. Not just managed K8s.
- NVIDIA GPU stack expertise — drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance.
- Kubernetes and orchestration — production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
- AI Networking fundamentals — TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads.
- Customer-facing communication — you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
- Bias toward scalable solutions — you'd rather build a feature that helps 10 customers than a custom deployment that helps 1.
Nice to Have
- HPC or large-scale distributed training environments.
- AI workload experience (vLLM, PyTorch, inference frameworks).
- Storage systems (NVMe, distributed filesystems, Ceph, WEKA).
- IaC and provisioning tools (Terraform, Ansible, cloud-init, MAAS).
Benefits
- Competitive salary
- Equity ownership
- Healthcare — medical, dental, vision for you and your family
- Remote-first — with hubs in Phoenix, Boulder, and Miami
- Direct impact — your work shapes how GPU infrastructure gets deployed across the AI ecosystem
More Infrastructure Engineer Jobs
Software Engineer, Privacy Infrastructure Engineering
Netflix
Described as the world's top internet television network, Netflix is a publicly-traded entertainment company offering video-on-demand and streaming media. As an
- Build privacy engineering solutions spanning the entire Netflix infrastructure.
- Partner closely with other software engineers, data scientists, and TPMs to design, prototype, develop, run, and improve data detection, classification, and metadata systems at scale.
- Innovate, solve challenging problems, and impact the business in a meaningful way.
- Work across diverse environments including distributed systems, data streaming, and data warehouses.
- Design and develop GTT’s Managed Services infrastructure platforms.
- Continuously improve existing infrastructure platforms.
- Act as the last escalation point for operational support of infrastructure platforms.
Senior Infrastructure Engineer
Pure IT CUSO
We’re a growing Managed Services Provider (MSP) that specializes in supporting credit unions with their IT needs. Our mission? To keep their technology running smoothly, securely, and efficiently, so they can focus on serving their members. We’re all about teamwork, innovation, and having fun while doing what we love.
Role Description
We are seeking a Senior Engineer who can deliver high-quality project work today while helping us move toward a more automated, consistent, and scalable operating model over time.
This is a hands-on role. You will execute deployments, perform configurations, troubleshoot issues, and work through complex escalations. At the same time, we expect you to identify patterns, reduce manual effort where possible, and contribute to building the foundation for a more repeatable project and managed services framework.
We are not fully modernized yet, and this role is part of that transformation. This is a senior and highly compensated position for someone who can operate with good judgment, move quickly when needed, and still create improvements as the environment matures.
What You Will Do
Project Delivery and Hands-On Engineering
- Deliver networking, virtualization, public cloud, and Microsoft 365 projects from planning through completion.
- Perform hands-on configuration, migrations, cloud resource setup, and implementation tasks for customer environments.
- Make sound decisions under real-world constraints while balancing ideal patterns with project timelines.
Tier 3 Escalation and Deep Troubleshooting
- Serve as a senior escalation resource for complex issues that require advanced analysis.
- Identify root causes and ensure that lessons learned are integrated into standards, procedures, or tooling.
Infrastructure Standards and Emerging Automation
- Help define and document patterns and procedures where they do not yet exist.
- Convert successful project patterns into reusable templates, scripts, or infrastructure-as-code modules when possible.
- Support the gradual shift from manual work toward predictable and automated builds and operations.
Documentation and Knowledge Transfer
- Create clear and concise documentation that supports delivery teams and operational handoff.
- Communicate what was done and why in order to improve transparency and consistency.
Continuous Improvement
- Identify operational friction, repeated work, and inefficient processes, and propose practical improvements.
- Work with leadership to prioritize what to automate now and what to automate later.
Travel
- Occasional scheduled travel for onsite project work.
Qualifications
- Strong knowledge of routing, switching, VPN, segmentation, and diagnostic workflows.
- Experience with virtualization platforms such as VMware or Hyper-V.
- Microsoft 365 administration skills, including identity and security controls.
- Familiarity with enterprise backup and replication technologies.
- Practical experience with basic public cloud services such as virtual machines, networking, storage, identity, and security controls in Azure or AWS.
- Exposure to automation tools or infrastructure-as-code concepts, with a desire to expand them over time.
Mindset
- Comfortable switching between hands-on implementation, deep troubleshooting, and incremental improvement work.
- Understands how to balance getting the job done today with making the job easier in the future.
- Strong analytical skills and the ability to break complex issues into clear steps.
- Operates independently with sound judgment and communicates clearly.
Nice to Have
- Experience in the credit union or financial services industry.
Work Environment
Location: Remote or Hybrid in Tomball, Texas.
You will be joining a team that is modernizing how we deliver projects and managed services. This role is ideal for someone who thrives in real-world engineering constraints while contributing to a shift toward cleaner, more scalable, and more automated operations.
Data Infrastructure Engineer
Funga
Harnessing forest fungal networks to address the biodiversity and climate crises.
- Own the Stack: Architect and maintain our central storage (PostgreSQL/PostGIS, SQLite) and cloud environment (AWS/GCP), leveraging ECS, Lambda, and S3.
- Modern DevOps: Standardize environments using Docker, CI/CD pipelines, and Infrastructure as Code to automate the testing and deployment of data services.
- Scale for Performance: Optimize data models and database performance management for extensibility as our genomic and geospatial inputs scale.
- Build Lean Ingest: Design and automate scalable ELT/ETL pipelines for genomic, geospatial, and tabular data from sources like Survey123, ArcGIS, and Asana.
- QA/QC at Scale: Build automated validation pipelines to ensure data integrity and version control from the moment it hits our system.
- Internal Enablement: Support scientists and operational teams by designing the data models that power internal modeling workflows, dashboards, and reporting.
- System Connectivity: Develop lightweight APIs and connectors to sync data between our core infrastructure and downstream applications (e.g. Asana, ArcGIS, and internal dashboards).



