Job Closed

This listing is no longer active.

IT Search Corp

This is a remote position.

Certified NVIDIA AI Infrastructure & Kubernetes Platform Engineer

AI Engineer | Machine Learning Engineer | Full Time | Remote | Mid Level | Team 2-10

Location

United States

Posted

48 days ago

Salary

$100 - $130 / hour

Seniority

Mid Level

Job Description

Role Description

We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record in deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads using Kubernetes, and ensuring secure, high-throughput operations across InfiniBand-powered networks. The ideal candidate will hold a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), coupled with hands-on training in DGX, BlueField, and high-speed network operations.

This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.

Core Responsibilities

AI Infrastructure Operations
- Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
- Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
- Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
- Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.

Kubernetes Platform Engineering
- Architect secure and scalable Kubernetes clusters optimized for GPU-accelerated workloads using NVIDIA GPU Operator.
- Leverage expertise from CKA/CKAD/CKS to develop, deploy, and secure AI applications on Kubernetes.
- Implement CI/CD pipelines and GitOps methodologies for deploying and managing ML workflows.

High-Performance Networking & DPUs
- Administer InfiniBand networks and BlueField DPUs using Unified Fabric Manager (UFM).
- Enable NVLink/NVSwitch performance across GPU nodes and tune fabric configurations for minimal latency and maximum throughput.
- Use BlueField for offloading storage, firewalling, and telemetry, enhancing AI workload security and performance.

Security & Compliance
- Apply best practices from the CKS certification to secure containerized AI environments.
- Configure runtime security, secrets management, network segmentation, and auditing using DPU-enhanced Kubernetes deployments.
- Support zero-trust architecture initiatives by enforcing workload identity, RBAC policies, and supply chain integrity across AI container images and model artifacts.

Monitoring, Telemetry & Optimization
- Monitor GPU, CPU, and I/O performance using NVIDIA DCGM, Prometheus, Grafana, and Base Command APIs.
- Tune system performance and model training pipelines for cost-efficiency and throughput.
- Build and maintain operational runbooks, incident response playbooks, and SLA reporting dashboards covering GPU utilization, thermal thresholds, and fabric health.

Qualifications
- Certified Kubernetes Administrator (CKA)
- Certified Kubernetes Application Developer (CKAD)
- Certified Kubernetes Security Specialist (CKS)
- NVIDIA Certified Associate: AI Infrastructure & Operations (NCA-AIIO)
- NVIDIA Certified Professional: AI Infrastructure (NCP-AII)
- NVIDIA Certified Professional: AI Operations (NCP-AIO)
- NVIDIA Certified Professional: AI Networking (NCP-AIN)

Requirements

Expertise With:
- DGX System, BasePOD, and SuperPOD Administration
- BlueField DPU Configuration & Operations
- InfiniBand Fabric and UFM Management
- Base Command Manager for workload orchestration

Technical Skills:
- Kubernetes, Helm, GPU Operator, Kubeflow
- DevOps tools: Ansible, Terraform, GitOps, CI/CD pipelines
- Storage: NFS, BeeGFS, Lustre
- Networking: RoCE, InfiniBand, DPU offload, gRPC, RDMA
- Programming/scripting: Python, YAML, Bash

Related Job Pages

More AI Engineer Jobs

Senior AI/ML Capacity and Performance Engineer

General Motors

General Motors (GM), founded in 1908 by William "Billy" Durant in Flint, Michigan, began with the Buick Motor Company and later acquired brands like Oldsmobile and Cadillac, evolvi

AI Engineer | 48 days ago
Full Time | Remote | Team 165,000 | Since 1908

Description

About Us: The AV Infrastructure org provides developer environments, cloud infrastructure, and ML/AI GPU platforms for AV research and development teams to build, test, and run faster in GM.

The Role: GM is looking for a Senior Performance Engineer to join the AV Capacity and Performance Engineering team in the AV Infrastructure org to support our critical efforts in developing autonomous vehicles. The mission of the AVCPE team is to provide input into large scale ML infrastructure strategy, advise on key decisions affecting our cloud budget, identify and execute optimization projects, and provide capacity planning and engineering expertise to support GM's efforts in developing autonomous vehicles (AV).

What you'll be doing (Responsibilities)
- Strategic Infrastructure Development: Adopt and run AV models to support GM's long-term GPU system strategy and "evergreen" infrastructure roadmap.
- Performance Optimization: Conduct deep-dive analyses of production workloads to identify bottlenecks and propose high-impact optimization strategies.
- Cross-Functional Collaboration: Partner with AI/ML Research, Infrastructure Engineering, and Cloud Vendors to spearhead projects that enhance engineering velocity and cost-efficiency.
- Proactive System Scaling: Identify opportunities for architectural improvements to ensure the scalability and reliability of large-scale ML training and inference environments.

Your skills & abilities (Required Qualifications)
- Experience: 5+ years of professional experience in high-scale infrastructure or ML systems.
- Education: Bachelor's Degree in Computer Science, a related technical field, or equivalent practical experience.
- Software Proficiency: Expert-level coding skills in Python and the ability to architect/debug within the PyTorch ecosystem.
- Systems Engineering: Proven track record of resolving performance issues within large-scale distributed production environments.
- Architectural Knowledge: Deep understanding of distributed systems, specifically modern ML system design and high-performance computing (HPC).
- Containerization: Hands-on experience with Kubernetes for orchestrating complex workloads.
- GPU Monitoring: Technical proficiency with Nvidia DCGM, nvidia-smi, and Grafana for real-time telemetry and observability.
- Cloud Platforms: Extensive experience working within major cloud ecosystems (AWS, GCP, or Azure).

What will give you a competitive edge (Preferred Qualifications)
- Advanced Experience: 8+ years of relevant industry experience.
- Hardware Expertise: Working knowledge of Enterprise-grade Nvidia GPU architectures, including H100, B200, and GB200.
- Model Deployment: Experience deploying and scaling open-source models via the Hugging Face ecosystem.
- Data Analytics: Proficiency in BigQuery for large-scale data analysis and reporting.
- Profiling Tools: Practical experience utilizing Nvidia Nsight and Nsight Compute for kernel-level performance tuning.
- Soft Skills: Strong technical communication skills with the ability to translate complex infrastructure needs into actionable business insights.

Hybrid/Remote: This role is categorized as hybrid/Remote. This means the successful candidate is expected to report to Sunnyvale Technical Center at minimum three days per week or at the hiring manager's discretion. Ability to sit remote in Seattle, WA.

Compensation: The compensation information is a good faith estimate only. It is based on what a successful applicant might be paid in accordance with applicable state laws. The compensation may not be representative for positions located outside of New York, Colorado, California, or Washington.
- The salary range for this role is $144,700 to $261,300. The actual base salary a successful candidate will be offered within this range will vary based on factors relevant to the position.
- Bonus Potential: An incentive pay program offers payouts based on company performance, job level, and individual performance.
- Benefits: GM offers a variety of health and wellbeing benefit programs. Benefit options include medical, dental, vision, Health Savings Account, Flexible Spending Accounts, retirement savings plan, sickness and accident benefits, life insurance, paid vacation & holidays, tuition assistance programs, employee assistance program, GM vehicle discounts and more.

About GM

Our vision is a world with Zero Crashes, Zero Emissions and Zero Congestion and we embrace the responsibility to lead the change that will make our world better, safer and more equitable for all.

Why Join Us

We believe we all must make a choice every day, individually and collectively, to drive meaningful change through our words, our deeds and our culture. Every day, we want every employee to feel they belong to one General Motors team.

Total Rewards | Benefits Overview

From day one, we're looking out for your well-being, at work and at home, so you can focus on realizing your ambitions. Learn how GM supports a rewarding career that rewards you personally by visiting Total Rewards resources.

Non-Discrimination and Equal Employment Opportunities (U.S.)

General Motors is committed to being a workplace that is not only free of unlawful discrimination, but one that genuinely fosters inclusion and belonging. We strongly believe that providing an inclusive workplace creates an environment in which our employees can thrive and develop better products for our customers. All employment decisions are made on a non-discriminatory basis without regard to sex, race, color, national origin, citizenship status, religion, age, disability, pregnancy or maternity status, sexual orientation, gender identity, status as a veteran or protected veteran, or any other similarly protected status in accordance with federal, state and local laws.

We encourage interested candidates to review the key responsibilities and qualifications for each role and apply for any positions that match their skills and capabilities. Applicants in the recruitment process may be required, where applicable, to successfully complete a role-related assessment(s) and/or a pre-employment screening prior to beginning employment. To learn more, visit How we Hire.

Accommodations

General Motors offers opportunities to all job seekers including individuals with disabilities. If you need a reasonable accommodation to assist with your job search or application for employment, email us at [email protected] or call us at 1-800-865-7580. In your email, please include a description of the specific accommodation you are requesting as well as the job title and requisition number of the position for which you are applying.

California
$144.7K - $261.3K / year

Senior Gen AI Engineer

Hudson IT and Manpower

Information Technology and Manpower Services

AI Engineer | 48 days ago
Contract | Remote | Team 11-50 | Since 2019 | H1B No Sponsor

• Build PoCs, MVPs, and production-grade applications using Generative AI technologies
• Work with Large Language Models (LLMs) including fine-tuning, deployment, and evaluation
• Design and implement AI-powered solutions using OpenAI APIs, LangChain, and Hugging Face
• Develop and optimize agentic AI workflows using frameworks like LangGraph, CrewAI, etc.
• Build multi-modal AI solutions integrating text, image, and other data types
• Collaborate with cross-functional teams to translate business requirements into AI solutions
• Ensure performance, scalability, and reliability of AI systems

All locations: Alaska | New York | Texas
$40 - $50 / hour
Job Closed

AI Engineer

Budgetly

Employee cards with spending rules.

AI Engineer | 48 days ago
Full Time | Remote | Team 11-50 | H1B No Sponsor

• Translate ambiguous business problems into clear, usable systems and tools
• Work directly with revenue owners to understand friction and remove it
• Build tooling and automations that improve execution quality
• Identify where workflows fail and continuously improve them
• Act as a teacher and multiplier, not just a builder

Australia
$110K - $130K / year
Job Closed

GTM AI Architect

Nerdio

Empowering MSPs and IT professionals to deploy, manage, and optimize virtual desktops in Microsoft Azure

AI Engineer | 48 days ago
Full Time | Remote | Team 51-200 | H1B No Sponsor

• Use Case Identification & Prioritization: Work closely with team members across marketing to identify the highest-value opportunities to improve efficacy and efficiency through AI, data, and automation. Evaluate and prioritize use cases based on potential impact, feasibility, and data readiness, and maintain a clear roadmap of initiatives communicated to stakeholders. Stay current on the evolving AI tool landscape and bring forward-looking recommendations on where to invest next.
• Data Strategy & Sourcing: For each priority use case, determine what data is needed to drive the best outcome, whether from the CRM, intent platforms, engagement tools, third-party sources, internal assets, etc. Work cross-functionally with RevOps, Sales, and IT to pull relevant data together in a structured and accessible way.
• Automation & Agentic Workflow Building: Partner with marketing stakeholders to design and build automation and agentic workflows that improve the quality and speed of their work. Use Claude, Claude Code, Claude Workspace, and related tools to create working solutions, from prompt-driven workflows to multi-step agentic processes that run with minimal manual intervention. Iterate on deployed workflows based on real usage and outcomes, continuously improving impact. Act as a thought partner and co-builder for team members who want to apply AI to their specific function, making complex capabilities accessible and practical.
• AI Capability Repository & Enablement: Architect and maintain a central repository of AI capabilities, skills, prompts, and reference assets that the marketing team can access, reuse, and build on. Establish standards and lightweight governance for how the marketing org uses AI tools, ensuring alignment, consistency, and scalability as usage grows. Enable team members to adopt new workflows and build their own AI fluency through documentation, training, and hands-on support. Ensure that knowledge and solutions developed for one function are systematically available to the rest of the team, preventing duplicate effort and compounding gains over time.

United States
$110K - $130K / year