AI Infrastructure Engineer

Infrastructure Engineer, Full Time, Remote, Mid Level

Location

United States

Posted

2 days ago

Salary

$100K - $150K / year

Seniority

Mid Level

AI AI/ML Python C++ Kubernetes Ray Linux CI/CD PyTorch JAX Observability/Monitoring Mode

Job Description

Role Description

We are seeking an AI Infrastructure Engineer to design, build, and operate the platform layer that powers large-scale AI training and inference workloads. The role focuses on:

- GPU clusters
- Distributed training frameworks
- Scheduling
- Storage performance
- Developer experience for ML engineers and researchers

The ideal candidate has built or operated production AI infrastructure at scale, understands the interaction between hardware, kernel, scheduler, and ML framework, and brings strong software engineering discipline to platform work.

Qualifications

- Bachelor’s or Master’s degree in Computer Science or a related field.
- Six or more years of experience in infrastructure, platform, or HPC engineering.
- Hands-on experience operating GPU clusters or large-scale ML training infrastructure.
- Strong proficiency in Python and at least one systems language such as Go or C++.
- Deep understanding of distributed training, accelerator architectures, and collective communication.
- Experience with Kubernetes, Slurm, Ray, or similar scheduling systems for ML workloads.
- Strong understanding of Linux internals, networking, and high-performance storage.
- Experience with at least one major cloud provider’s ML infrastructure offerings.
- Strong software engineering practices including testing, CI/CD, and code review.
- Excellent communication and cross-functional collaboration skills.

Requirements

- Design and operate GPU and accelerator infrastructure for training and inference, spanning on-prem clusters, cloud-managed services, and hybrid configurations.
- Build scheduling, queueing, and resource-sharing systems that maximize accelerator utilization across many teams.
- Integrate frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train into a unified platform offering.
- Operate high-performance storage systems and data pipelines that keep accelerators fed with training data at near-line-rate.
- Design networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth collective communication.
- Build observability for AI workloads including utilization, throughput, training stability, and failure-mode analytics.
- Implement checkpointing, restart, and fault-tolerance patterns for long-running training jobs at scale.
- Drive cost optimization across compute, storage, and networking through scheduling, spot capacity, and right-sizing.
- Develop developer tooling and paved-road workflows that let researchers launch experiments safely and efficiently.
- Partner with research and applied ML teams to plan capacity for upcoming training runs.
- Implement security controls, isolation, and access management for multi-tenant AI infrastructure.
- Drive automation across cluster provisioning, lifecycle management, and configuration enforcement.
- Maintain runbooks, capacity dashboards, and operational documentation for the AI platform.
- Stay current with AI infrastructure research, accelerator hardware, and emerging open-source AI tooling.

Benefits

- Competitive base salary commensurate with experience, plus benefits.

More Infrastructure Engineer Jobs

Cloud Platform Infrastructure Architect

Guidehouse

Solving big problems, building trust in society, and empowering our clients to shape the future.

Infrastructure Engineer, posted 2 days ago

Full Time, Remote, Team 10,001+, Since 2018, H1B Sponsor

• Design end-to-end AWS cloud infrastructure solutions to support enterprise applications, data platforms, and business-critical services.
• Create high-availability, disaster recovery, and multi-region strategies.
• Develop migration strategies for transitioning on-premises workloads to AWS.
• Contribute to cloud strategy and technology roadmaps.
• Oversee deployment of infrastructure as code (IaC) using tools like Terraform, AWS CloudFormation, or CDK.
• Establish and enhance continuous integration and continuous delivery pipelines to streamline software deployment and infrastructure updates.
• Ensure observability, monitoring, and logging using AWS CloudWatch, X-Ray, or third-party tools.
• Design architecture that meets security, governance, and regulatory compliance requirements (HIPAA, FedRAMP, SOC 2, etc.) as applicable.
• Implement IAM best practices, encryption strategies, and secure networking.
• Partner with developers, DevOps engineers, and business stakeholders to ensure solutions meet mission-critical needs.
• Diagnose and resolve complex technical issues related to cloud infrastructure, ensuring high performance and reliability.
• Provide technical leadership, mentoring, and guidance to engineering teams.
• Stay current with emerging cloud technologies and trends, evaluating and recommending new solutions to enhance our capabilities.

Ansible AWS Cloud Docker EC2 Kubernetes Puppet Python Ray SQL Terraform

Virginia + 1 more

$102K - $170K / year

Director, Infrastructure Engineering – Program Management

NVIDIA

Infrastructure Engineer, posted 2 days ago

Full Time, Remote, Team 10,001+, Since 1993, H1B Sponsor

• Lead how NVIDIA responds to and adapts to growth and rapid change.
• Drive the organization’s program design and execution.
• Collect requirements, define priorities, coordinate scheduling, and address challenges throughout the implementation lifecycle.
• Optimize workflow using objective measures to improve engineering efficiency.
• Establish a Program Management charter based on accountability, outcomes, leadership, and delivery.
• Set and clearly define high standards and help the team achieve them.
• Hire, retain, and grow outstanding people.
• Drive high performance, clarity, positive culture, and collaboration.

Colorado + 3 more

$272K - $425.5K / year

Senior Backend & Infrastructure Engineer

Very Good Ventures

The Flutter Development Experts

Infrastructure Engineer, posted 2 days ago

Full Time, Remote, Team 51-200, H1B No Sponsor

Role Description

We are seeking a hands-on Backend & Infrastructure Engineer (Infrastructure & Scalability) to design, build, and maintain scalable cloud infrastructure and backend services. This role requires deep expertise in backend services development, DevOps, programmatic infrastructure / Infrastructure as Code (IaC), and Site Reliability Engineering to ensure the stability and performance of our internal and client product initiatives. You will be instrumental in standing up, scaling, and supporting complete backends as well as individual services, including deployed AI agents and agentic infrastructure.

Because we are a consultancy, the needs and technology stacks of our clients range widely across projects, so we are looking for candidates with breadth of expertise and a flexible mindset. Our product development focus leans heavily towards cloud infrastructure and, more recently, towards deployment of AI agents and agentic infrastructure.

Responsibilities

Platform & Infrastructure

- Design, implement, and manage cloud infrastructure using programmatic tools such as Terraform, OpenTofu, or similar.
- Build and manage staged development environments and corresponding CI/CD pipelines to ensure rapid, reliable, and automated deployments.
- Oversee observability, monitoring, logging, and alerting systems for performance, usage metrics, and security.
- Establish best practices for security, compliance, and cost optimization within cloud deployments.
- Build, deploy, and manage individual services, including autonomous agents, ensuring they are scalable and performant.

Backend Development & Integration

- Develop, deploy, and scale robust backend services and microservices, with a focus on high availability and resilience.
- Develop and maintain robust backend APIs and integration services. Strong working knowledge of API design and expertise with GraphQL deployment.
- Working knowledge of backend architectural patterns and their application.
- Collaboration with front-end teams for services integrations and performance.

Qualifications

- 6+ years of experience as a Backend or Full Stack Engineer with a strong emphasis on DevOps, programmatic infrastructure, and/or Site Reliability Engineering (SRE).
- Expertise in programmatic infrastructure and cloud resource management using infrastructure-as-code tooling.
- Strong experience with one or more of the major cloud platforms (e.g., GCP, AWS, or Azure).
- Deep proficiency in at least one modern backend language (e.g., Python, TypeScript, Go, Java/Kotlin, C#, or Rust).
- Solid experience designing and scaling production backends, building and deploying custom services, and utilizing cloud managed services.
- Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Familiarity with feature-flagging and staged product rollouts.
- Experience implementing and managing observability, logging, and alerting systems.
- Comfortable working in fast-paced environments with evolving requirements.
- All resumes must be in English.
- Must be fluent in English (level 3).

Nice to haves

- Certification from one or more cloud platforms (GCP, AWS, or Azure).
- Experience building and scaling AI-powered services or agents.
- Anthropic or other AI certification (e.g., Claude Certified Architect Foundations).

Benefits

- Passion and enthusiasm for what we create.
- Remote-first, global company.
- Subsidized health insurance, dental, and vision coverage.
- Generous parental leave.
- Flexible PTO and company holidays.
- End-of-year company shutdown (December 25 - January 1), in addition to observing company holidays.
- 12-16 weeks universal fully paid family leave.
- Other benefits available based on location.

Infrastructure as Code AI Agents Terraform CI/CD Observability/Monitoring Microservices API GraphQL GCP AWS Azure Python TypeScript Java Kotlin C# Rust Docker/Containers Docker Kubernetes

Northern America + 1 more

$175K - $205K / year

Lead Infrastructure Engineer

Capital Bank, N.A.

Capital Bank, N.A. is an Affirmative Action, E-Verify, and Equal Opportunity Employer.

Infrastructure Engineer, posted 2 days ago

Full Time Remote

Role Description

Technical Leadership

- Serve as a technical lead for infrastructure initiatives, providing guidance on system design, cloud design, networking, and security.
- Mentor and develop junior and senior engineers; provide training, task direction, coaching, and knowledge transfer.
- Establish and enforce infrastructure engineering best practices, standards, and documentation.

Infrastructure Engineering

- Design, deploy, and maintain cloud and on-premises infrastructure with a focus on Azure IaaS, PaaS, identity, and security services.
- Oversee enterprise-wide patching, configuration management, and vulnerability remediation processes.
- Lead the lifecycle management of virtualized environments within cloud and on-premises infrastructure.
- Ensure reliability, redundancy, and performance across all systems, networks, and cloud services.
- Manage advanced Active Directory, Microsoft 365, and Azure AD/Entra ID configurations.
- Evaluate and integrate third-party technologies, SaaS platforms, and emerging cloud solutions.

Operations & Support

- Act as the highest-level escalation point for complex service desk and infrastructure issues affecting systems, applications, and cloud environments.
- Oversee environment health monitoring, including cloud resource optimization, performance dashboards, and availability metrics.
- Ensure comprehensive documentation of systems, procedures, and architecture.
- Participate in, and sometimes lead, after-hours maintenance windows and on-call support rotations.

Project & Process Management

- Lead and contribute to infrastructure-focused projects, upgrades, migrations, and security initiatives.
- Drive automation and infrastructure-as-code adoption to optimize delivery and reduce manual effort.
- Collaborate with Information Security to maintain compliance and strengthen cloud and on-prem security posture.
- Assist with vendor management, procurement, licensing, and budgeting for infrastructure-related technologies.
- Ensure Infrastructure Engineering tasks and deliverables are updated for leadership.
- Required to be on-call as needed for emergency situations.
- Other responsibilities and duties, as assigned.

Qualifications

- Bachelor’s degree in Computer Science, Information Technology, Business Administration, or equivalent combination of education and experience.
- Minimum 10 total years of IT infrastructure experience.
- Minimum 5 years of hands-on Azure cloud experience in an enterprise environment.
- Required certifications: Azure Administrator Associate (AZ-104) and at least one advanced Azure certification such as AZ-305, AZ-700, or AZ-500.
- Experience designing and deploying Azure IaaS/PaaS resources, networking, identity, and security controls.
- Minimum 7 years of experience with Microsoft Windows platforms, Active Directory, and Microsoft 365.
- Strong experience with LAN/WAN technologies.
- Expertise with VMware vSphere hypervisors.
- ITIL experience or certification preferred.

Requirements

- Demonstrated leadership ability, including mentoring and technical oversight of engineering teams.
- Strong project leadership, prioritization, and operational planning skills.
- Ability to independently manage workload, drive initiatives, and operate with minimal supervision.
- Strong analytical thinking, troubleshooting, and problem-solving abilities.
- Ability to collaborate effectively across IT, business stakeholders, vendors, and auditors.

Benefits

- Base salary range: $100,000 - $130,000 annually.
- This role includes an annual target bonus based on individual performance.
- Remote opportunity.
- Comprehensive benefits package including Medical, Dental, Vision, Company Paid Life Insurance, Disability Insurance, and more.
- Company contributions to your 401k, regardless of your contribution.
- Employee perks: Paid Parental Leave, Employee Recognition Program, Leadership Program, Tuition Reimbursement Program, Employee Bank Checking Account, and much more.
- Generous Paid Time Off and Paid Holidays, including Paid Charity Hours to support volunteer opportunities.

Azure Active Directory Observability/Monitoring Microsoft Windows VMware

United States

$100K - $130K / year
