AI Infrastructure Engineer

Location

United States

Posted

2 days ago

Salary

$100K - $150K / year

Seniority

Mid Level

Job Description

Bright Vision Technologies

Role Description

We are seeking an AI Infrastructure Engineer to design, build, and operate the platform layer that powers large-scale AI training and inference workloads. The role focuses on:

- GPU clusters
- Distributed training frameworks
- Scheduling
- Storage performance
- Developer experience for ML engineers and researchers

The ideal candidate has built or operated production AI infrastructure at scale, understands the interaction between hardware, kernel, scheduler, and ML framework, and brings strong software engineering discipline to platform work.

Qualifications

- Bachelor’s or Master’s degree in Computer Science or a related field.
- Six or more years of experience in infrastructure, platform, or HPC engineering.
- Hands-on experience operating GPU clusters or large-scale ML training infrastructure.
- Strong proficiency in Python and at least one systems language such as Go or C++.
- Deep understanding of distributed training, accelerator architectures, and collective communication.
- Experience with Kubernetes, Slurm, Ray, or similar scheduling systems for ML workloads.
- Strong understanding of Linux internals, networking, and high-performance storage.
- Experience with at least one major cloud provider’s ML infrastructure offerings.
- Strong software engineering practices including testing, CI/CD, and code review.
- Excellent communication and cross-functional collaboration skills.

Requirements

- Design and operate GPU and accelerator infrastructure for training and inference, spanning on-prem clusters, cloud-managed services, and hybrid configurations.
- Build scheduling, queueing, and resource-sharing systems that maximize accelerator utilization across many teams.
- Integrate frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train into a unified platform offering.
- Operate high-performance storage systems and data pipelines that keep accelerators fed with training data at near-line-rate.
- Design networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth collective communication.
- Build observability for AI workloads including utilization, throughput, training stability, and failure-mode analytics.
- Implement checkpointing, restart, and fault-tolerance patterns for long-running training jobs at scale.
- Drive cost optimization across compute, storage, and networking through scheduling, spot capacity, and right-sizing.
- Develop developer tooling and paved-road workflows that let researchers launch experiments safely and efficiently.
- Partner with research and applied ML teams to plan capacity for upcoming training runs.
- Implement security controls, isolation, and access management for multi-tenant AI infrastructure.
- Drive automation across cluster provisioning, lifecycle management, and configuration enforcement.
- Maintain runbooks, capacity dashboards, and operational documentation for the AI platform.
- Stay current with AI infrastructure research, accelerator hardware, and emerging open-source AI tooling.

Benefits

- Competitive base salary commensurate with experience, plus benefits.

More Infrastructure Engineer Jobs

Cloud Platform Infrastructure Architect

Guidehouse

Solving big problems, building trust in society, and empowering our clients to shape the future.

Full Time | Remote | Team 10,001+ | Since 2018 | H1B Sponsor

• Design end-to-end AWS cloud infrastructure solutions to support enterprise applications, data platforms, and business critical services.
• Create high-availability, disaster recovery, and multi-region strategies.
• Develop migration strategies for transitioning on-premises workloads to AWS.
• Contribute to cloud strategy and technology roadmaps.
• Oversee deployment of Infrastructure as code (IaC) using tools like Terraform, AWS CloudFormation or CDK.
• Establish and enhance continuous integration and continuous delivery pipelines to streamline software deployment and infrastructure updates.
• Ensure observability, monitoring and logging using AWS CloudWatch, X-Ray or third-party tools.
• Design architecture that meets security, governance, and regulatory compliance requirements (HIPAA, FedRAMP, SOC 2 etc.) as applicable.
• Implement IAM best practices, encryption strategies, and secure networking.
• Partner with developers, DevOps engineers, and business stakeholders to ensure solutions meet mission-critical needs.
• Diagnose and resolve complex technical issues related to cloud infrastructure, ensuring high performance and reliability.
• Provide technical leadership, mentoring, and guidance to engineering teams.
• Stay current with emerging cloud technologies and trends, evaluating and recommending new solutions to enhance our capabilities.

All locations: Virginia | Washington
$102K - $170K / year
Full Time | Remote | Team 10,001+ | Since 1993 | H1B Sponsor

• Lead how NVIDIA responds to and adapts to growth and rapid change.
• Drive the organization’s program design and execution.
• Collect requirements, define priorities, coordinate scheduling, and address challenges throughout the implementation lifecycle.
• Optimize workflow using objective measures to improve engineering efficiency.
• Establish a Program Management charter based on accountability, outcomes, leadership, and delivery.
• Set and clearly define high standards and help the team achieve them.
• Hire, retain, and grow outstanding people.
• Drive high performance, clarity, positive culture, and collaboration.

All locations: Colorado | New York | Oregon | Texas
$272K - $425.5K / year
Full Time | Remote | Team 51-200 | H1B No Sponsor

Role Description

We are seeking a hands-on Backend & Infrastructure Engineer (Infrastructure & Scalability) to design, build, and maintain scalable cloud infrastructure and backend services. This role requires deep expertise in backend services development, DevOps, programmatic infrastructure / Infrastructure as Code (IaC), and Site Reliability Engineering to ensure the stability and performance of our internal and client product initiatives. You will be instrumental in standing up, scaling, and supporting complete backends as well as individual services, including deployed AI agents and agentic infrastructure.

As a consultancy, we serve clients whose needs and technology stacks range widely across projects, so we are looking for candidates with breadth of expertise and a flexible mindset. Our product development focus leans heavily towards cloud infrastructure and, more recently, towards deployment of AI agents and agentic infrastructure.

Responsibilities

Platform & Infrastructure

- Design, implement, and manage cloud infrastructure using programmatic tools such as Terraform, OpenTofu, or similar.
- Build and manage staged development environments and corresponding CI/CD pipelines to ensure rapid, reliable, and automated deployments.
- Oversee observability, monitoring, logging, and alerting systems for performance, usage metrics, and security.
- Establish best practices for security, compliance, and cost optimization within cloud deployments.
- Build, deploy, and manage individual services, including autonomous agents, ensuring they are scalable and performant.

Backend Development & Integration

- Develop, deploy, and scale robust backend services and microservices, with a focus on high availability and resilience.
- Develop and maintain robust backend APIs and integration services. Strong working knowledge of API design and expertise with GraphQL deployment.
- Apply working knowledge of backend architectural patterns.
- Collaborate with front-end teams on service integrations and performance.

Qualifications

- 6+ years of experience as a Backend or Full Stack Engineer with a strong emphasis on DevOps, programmatic infrastructure, and/or Site Reliability Engineering (SRE).
- Expertise in programmatic infrastructure and cloud resource management using infrastructure-as-code tooling.
- Strong experience with one or more of the major cloud platforms (e.g., GCP, AWS, or Azure).
- Deep proficiency in at least one modern backend language (e.g., Python, TypeScript, Go, Java/Kotlin, C#, or Rust).
- Solid experience designing and scaling production backends, building and deploying custom services, and utilizing cloud managed services.
- Experience with containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Familiarity with feature-flagging and staged product rollouts.
- Experience implementing and managing observability, logging, and alerting systems.
- Comfortable working in fast-paced environments with evolving requirements.
- All resumes must be in English.
- Must be fluent in English (level 3).

Nice to haves

- Certification from one or more cloud platforms (GCP, AWS, or Azure).
- Experience building and scaling AI-powered services or agents.
- Anthropic or other AI certification (e.g., Claude Certified Architect Foundations).

Benefits

- Passion and enthusiasm for what we create.
- Remote-first, global company.
- Subsidized health insurance, dental, and vision coverage.
- Generous parental leave.
- Flexible PTO and company holidays.
- End-of-year company shutdown (December 25 - January 1), in addition to company holidays.
- 12-16 weeks of universal, fully paid family leave.
- Other benefits available based on location.

All locations: Northern America | Latin America (LATAM)
$175K - $205K / year

Lead Infrastructure Engineer

Capital Bank, N.A.

Capital Bank, N.A. is an Affirmative Action, E-Verify, and Equal Opportunity Employer.

Role Description

Technical Leadership

- Serve as a technical lead for infrastructure initiatives, providing guidance on system design, cloud design, networking, and security.
- Mentor and develop junior and senior engineers; provide training, task direction, coaching, and knowledge transfer.
- Establish and enforce infrastructure engineering best practices, standards, and documentation.

Infrastructure Engineering

- Design, deploy, and maintain cloud and on-premises infrastructure with a focus on Azure IaaS, PaaS, identity, and security services.
- Oversee enterprise-wide patching, configuration management, and vulnerability remediation processes.
- Lead the lifecycle management of virtualized environments within cloud and on-premises infrastructure.
- Ensure reliability, redundancy, and performance across all systems, networks, and cloud services.
- Manage advanced Active Directory, Microsoft 365, and Azure AD/Entra ID configurations.
- Evaluate and integrate third-party technologies, SaaS platforms, and emerging cloud solutions.

Operations & Support

- Act as the highest-level escalation point for complex service desk and infrastructure issues affecting systems, applications, and cloud environments.
- Oversee environment health monitoring, including cloud resource optimization, performance dashboards, and availability metrics.
- Ensure comprehensive documentation of systems, procedures, and architecture.
- Participate in, and sometimes lead, after-hours maintenance windows and on-call support rotations.

Project & Process Management

- Lead and contribute to infrastructure-focused projects, upgrades, migrations, and security initiatives.
- Drive automation and infrastructure-as-code adoption to optimize delivery and reduce manual effort.
- Collaborate with Information Security to maintain compliance and strengthen cloud and on-prem security posture.
- Assist with vendor management, procurement, licensing, and budgeting for infrastructure-related technologies.
- Keep leadership updated on Infrastructure Engineering tasks and deliverables.
- Be on call as needed for emergency situations.
- Other responsibilities and duties, as assigned.

Qualifications

- Bachelor’s degree in Computer Science, Information Technology, Business Administration, or equivalent combination of education and experience.
- Minimum 10 total years of IT infrastructure experience.
- Minimum 5 years of hands-on Azure cloud experience in an enterprise environment.
- Required certifications: Azure Administrator Associate (AZ-104) and at least one advanced Azure certification such as AZ-305, AZ-700, or AZ-500.
- Experience designing and deploying Azure IaaS/PaaS resources, networking, identity, and security controls.
- Minimum 7 years of experience with Microsoft Windows platforms, Active Directory, and Microsoft 365.
- Strong experience with LAN/WAN technologies.
- Expertise with VMware vSphere hypervisors.
- ITIL experience or certification preferred.

Requirements

- Demonstrated leadership ability, including mentoring and technical oversight of engineering teams.
- Strong project leadership, prioritization, and operational planning skills.
- Ability to independently manage workload, drive initiatives, and operate with minimal supervision.
- Strong analytical thinking, troubleshooting, and problem-solving abilities.
- Ability to collaborate effectively across IT, business stakeholders, vendors, and auditors.

Benefits

- Base Salary Range: $100,000 - $130,000 annually.
- This role includes an annual target bonus based on individual performance.
- Remote Opportunity.
- Comprehensive benefits package including Medical, Dental, Vision, Company Paid Life Insurance, Disability Insurance, and more.
- Company contributions to your 401(k), regardless of your own contribution.
- Employee Perks: Paid Parental Leave, Employee Recognition Program, Leadership Program, Tuition Reimbursement Program, Employee Bank Checking Account, and much more!
- Generous paid time off and paid holidays, including paid charity hours to support volunteer opportunities.

United States
$100K - $130K / year