This is a remote position.
NVIDIA AI Infrastructure & Kubernetes Platform Engineer
Location
United States
Posted
47 days ago
Salary
$110 - $135 / hour
Seniority
Mid Level
Job Description
IT Search Corp
Role Description

We are seeking a highly skilled AI Infrastructure & Kubernetes Platform Engineer with a proven track record in deploying and managing NVIDIA DGX-based AI clusters, orchestrating containerized AI workloads using Kubernetes, and ensuring secure, high-throughput operations across InfiniBand-powered networks. The ideal candidate will hold a combination of Kubernetes certifications (CKA, CKAD, CKS) and NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN), coupled with hands-on training in DGX, BlueField, and high-speed network operations. This position plays a key role in supporting AI/ML infrastructure at scale, enabling efficient training and inference for complex models, and integrating NVIDIA's cutting-edge compute, storage, and fabric solutions with modern DevOps practices.

AI Infrastructure Operations
- Deploy and manage NVIDIA DGX BasePODs and SuperPODs for high-performance AI workloads.
- Oversee DGX system lifecycle operations including provisioning, monitoring, firmware upgrades, and capacity planning.
- Operate Base Command Manager to manage GPU clusters, schedule workloads, and integrate with MLOps tools.
- Perform DGX node health validation, NCCL interconnect testing, and NVLink topology verification following new deployments or hardware changes.

Kubernetes Platform Engineering
- Architect secure and scalable Kubernetes clusters optimized for GPU-accelerated workloads using NVIDIA GPU Operator.
- Leverage expertise from CKA/CKAD/CKS to develop, deploy, and secure AI applications on Kubernetes.
- Implement CI/CD pipelines and GitOps methodologies for deploying and managing ML workflows.

High-Performance Networking & DPUs
- Administer InfiniBand networks and BlueField DPUs using Unified Fabric Manager (UFM).
- Enable NVLink/NVSwitch performance across GPU nodes and tune fabric configurations for minimal latency and maximum throughput.
- Use BlueField for offloading storage, firewalling, and telemetry, enhancing AI workload security and performance.

Security & Compliance
- Apply best practices from the CKS certification to secure containerized AI environments.
- Configure runtime security, secrets management, network segmentation, and auditing using DPU-enhanced Kubernetes deployments.
- Support zero-trust architecture initiatives by enforcing workload identity, RBAC policies, and supply chain integrity across AI container images and model artifacts.
- Monitor GPU, CPU, and I/O performance using NVIDIA DCGM, Prometheus, Grafana, and Base Command APIs.
- Tune system performance and model training pipelines for cost-efficiency and throughput.
- Build and maintain operational runbooks, incident response playbooks, and SLA reporting dashboards covering GPU utilization, thermal thresholds, and fabric health.

Qualifications
- Kubernetes certifications (CKA, CKAD, CKS)
- NVIDIA certifications (NCA-AIIO, NCP-AIO, NCP-AII, NCP-AIN)
- Hands-on training in DGX, BlueField, and high-speed network operations

Requirements
- Expertise with DGX System, BasePOD, and SuperPOD Administration
- BlueField DPU Configuration & Operations
- InfiniBand Fabric and UFM Management
- Base Command Manager for workload orchestration

Benefits
- Remote position
More Platform Engineer Jobs
Power Platform Developer
TrueTandem
Your Trusted IT Solutions Provider & Microsoft Gold Certified Partner - Careers Info: https://jobs.lever.co/truetandem
• TrueTandem is seeking a Power Platform Developer to create and maintain robust, responsive, and scalable applications within a Federal environment.
• This role involves hands-on Power Platform development of canvas apps that streamline and automate organizational business processes.
• The candidate will develop application Proof-of-Concepts (PoCs) and Minimal Viable Products (MVPs) to demonstrate and iteratively build on their business value and functionality.
• The developer should be able to create Canvas Apps from scratch or using data from an existing data source.
• This includes understanding the user interface, controls, and screen navigation for forms and views.
• This position requires experience and passion for coding, and a strong desire to solve our customers’ unique technology challenges.
• Define the technical vision and multi-year roadmap for the platform, including how we evolve from today's fleet to the next order-of-magnitude in scale, sensor count, and autonomy maturity.
• Own the future architecture of our Ubuntu/Yocto/Linux distributions tailored for real-time, safety-critical autonomous vehicle workloads, and make the build-vs-buy decisions that follow.
• Set the strategy for ROS 2 IPC middleware (Cyclone DDS, Fast DDS, Zenoh, etc.) across the fleet, including profile selection, QoS standards, and determinism budgets for multi-sensor data flows.
• Lead development of user-space drivers for LiDARs, cameras, radars, GNSS/INS, CAN, and other vehicle interfaces, and set the standards other engineers follow when adding new hardware.
• Own the platform's functional safety and security strategy end-to-end: secure boot, OTA update pipelines, CVE response, and alignment with ISO 26262 / SOTIF workflows as we mature toward production.
• Define the observability contract for the platform: what "healthy" looks like in the lab and in the field, and the SLIs/SLOs the autonomy and perception teams can build against.
• Collaborate with autonomy, perception, and controls leads to set cross-stack performance budgets (CPU, GPU, memory, bus bandwidth, end-to-end latency) and drive the cross-team work to hit them.
• Set standards for how the platform is built, tested, and released: CI/CD for OS images and driver packages, hardware-in-the-loop testing, release gates, and rollback strategy.
• Contribute to the platform team's technical hiring and calibration: own the interview rubric, grow senior engineers into tech leads, and raise the bar on code review and design review across the team.
• Represent AeroVect technically in relationships with silicon, sensor, and middleware vendors, and influence their roadmaps where it matters to us.
• Provide on-call escalation support for platform components during field trials and customer pilots, and use what you learn in the field to drive systemic fixes.
• Identify strategic technical debt and drive it down, not just within the platform but stack-wide.
Principal Platform Engineer
Clarity Innovations
An education technology company based in Portland, Oregon, Clarity Innovations offers consulting, marketing strategy, and other services designed to match "prom
• Developing, testing, and maintaining software automation for Air Force-based GIS products
• Developing, testing, and maintaining cloud-based deployments of supporting COTS systems
• Optimizing software designs and architectures to deliver desired performance targets, and devising tooling and methodologies to profile execution and capture performance metrics
• Guiding technical decisions to optimize the performance and cost benefit of government cloud resources
• Mentoring junior engineers in DevSecOps best practices
• Maintaining and guiding the development of common best practices and tools used by multiple teams
• Implementing and practicing DevOps developer enablement, and helping junior or less experienced engineers do the same
Senior Software Engineer, Platform Engineering
CommandLink
#1 Global Platform To Simplify & Scale Your Telco, ISP, Network, Phone, & Security Stack.
About your new role:

We’re looking for a Senior Software Engineer to join our growing platform team. In this role, you’ll be a core contributor to the systems that power CommandLink’s SaaS platform, from back-end services and API integrations to the architectural decisions that keep us scalable as we grow. You’ll work closely with product managers, designers, and other engineers in a collaborative, high-trust environment where your judgment matters and your impact is visible. This is not a ticket-shuffling role. We want engineers who think deeply about the problems they’re solving, advocate for quality, and take pride in what they ship.

What You'll Do:
- Design, develop, and maintain scalable back-end systems and APIs that power our SaaS platform across a global customer base.
- Lead technical design and architecture discussions for new features and platform improvements, balancing speed with long-term maintainability.
- Integrate with third-party APIs, telemetry systems, and network infrastructure to extend platform capabilities.
- Collaborate cross-functionally with product, design, and engineering teams to define requirements, break down complexity, and ship high-quality features.
- Write clean, testable, well-documented code and hold your team to the same standard through thoughtful code reviews.
- Identify and resolve performance bottlenecks, reliability issues, and scalability gaps before they become customer problems.
- Mentor junior and mid-level engineers, sharing best practices in software design, testing, and system thinking.
- Participate in on-call rotations and contribute to a culture of operational excellence.
- Take on additional responsibilities and projects as needed to support the success of the team and organization.

What you'll need for success:
- 5+ years of professional software engineering experience, with meaningful time spent in a SaaS product environment.
- Proven experience at a high-growth or fast-paced technology company where priorities shift quickly and engineers own outcomes, not just tasks.
- Strong proficiency in PHP (Laravel, Symfony, or similar framework), with working knowledge of Golang and Python.
- Hands-on experience with Kubernetes for container orchestration and OpenSearch for search/analytics workloads.
- Solid understanding of relational and non-relational databases, query optimization, and data modeling.
- Experience designing and building RESTful or event-driven APIs that integrate with external platforms and services.
- Comfort working across major cloud environments (AWS, Azure, or GCP), including managed services, IAM, networking fundamentals, and cost-aware architecture.
- Strong written and verbal communication skills in English, since you’ll be collaborating with distributed teams across time zones.
- An ownership mindset: you don’t just close tickets. You understand the product, ask the right questions, push back on unnecessary complexity, and leave the codebase better than you found it.

Bonus Points:
- Exposure to network engineering technologies such as SD-WAN, VoIP, BGP, or MPLS, or genuine curiosity about how networks work.
- Experience with workflow orchestration engines like Temporal or Camunda.
- Familiarity with event streaming platforms (Kafka) or stream processing (Flink, Spark).
- Prior experience working in a fully remote, globally distributed engineering team.

Why you'll love life at Command|Link

Join us at CommandLink, where you'll have the opportunity to shape the future of business communication. We value the innovative spirit and seek individuals ready to bring their unique vision and expertise to a team that values bold ideas and strategic thinking. Are you ready to make an impact? Apply now and be the architect of your career as well as our clients' success.
- Room to grow at a high-growth company
- An environment that celebrates ideas and innovation
- Your work will have a tangible impact
- Flexible time off
- Fun events at cool locations
- Employee referral bonuses to encourage the addition of great new people to the team

At CommandLink, we’re committed to creating a fair, consistent, and efficient hiring experience. As part of our process, we use AI-assisted tools to help review and analyze applications. These tools support our recruiting team by identifying qualifications and experience that align with the requirements of each role. AI tools are used only to assist in the evaluation process; they do not make final hiring decisions. Every application is reviewed by a member of our recruiting or hiring team before any decisions are made.


