Senior Software Engineer, DGX Cloud AI Infrastructure
Location
California | Oregon | Texas | Washington
Posted
4 days ago
Salary
$184K - $356.5K / year
Seniority
Senior
Job Description
NVIDIA
- Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
- Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
- Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
- Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
- Own root-cause analysis of complex failures: hangs, performance regressions, and topology sensitivity in large distributed environments.
- Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
- Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
- Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
- Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
- Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.
Job Requirements
- Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
- 8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.
- Expertise debugging and triaging AI applications across the full stack — from the application layer down to the hardware.
- Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.
- Proven track record of architecting, debugging, and scaling large-scale distributed systems.
- Expert-level Python and C/C++ programming skills.
- Experience operating workloads in scheduled, containerized cluster environments.
- Excellent analytical, debugging, and communication skills, with the ability to influence across teams.
Benefits
- equity
- benefits
More LLM Engineer Jobs
Generative AI Engineer – LATAM Candidates Only
Talentus Global
We facilitate talent & software solutions across the globe. Near-shore, managed services, ERPs, CRMs, EdTech/HigherEd.
- Design, develop, and deploy Generative AI solutions leveraging Large Language Models (LLMs) and multimodal AI technologies.
- Build and maintain scalable AI applications using cloud platforms such as Azure, AWS, or GCP.
- Develop and optimize Retrieval-Augmented Generation (RAG) architectures, vector databases, and knowledge retrieval systems.
- Fine-tune, evaluate, and monitor foundation models to improve performance, accuracy, and reliability.
- Implement prompt engineering strategies and AI orchestration frameworks to support business use cases.
- Collaborate with software engineering, data science, DevOps, and security teams to integrate AI solutions into production environments.
- Develop APIs, microservices, and AI-powered applications following software engineering best practices.
- Ensure compliance with AI governance, security, privacy, and responsible AI standards.
- Monitor AI workloads, model performance, and operational costs, recommending continuous improvements.
- Stay current with emerging Generative AI technologies, frameworks, and industry trends.
Senior AI Infrastructure Engineer
Pyyne
Pyyne is a modern technology consultancy engineering the next generation of digital products and services. At Pyyne, we believe in using technology to unlock business potential, create sustainable growth, and drive forward digital excellence. Our solutions span advanced Software Engineering, Cloud, and Data & AI.
- Design and deploy AI platforms that integrate with infrastructure tools
- Develop AI-powered workflows to automate operational tasks
- Build AI-driven automation for incident response and operational workflows
- Implement AI-powered monitoring and anomaly detection capabilities
- Create intelligent operational dashboards with actionable insights
- Ensure AI platforms operate reliably in production environments
- Develop AI solutions for cost optimization and predictive capacity planning
- Add AI functionality into existing Python, Node.js, and Ruby codebases
- Build LLM-powered features: chat, summaries, classification, smart search, document Q&A
- Design lightweight RAG pipelines using embeddings and vector search
- Work with vector DBs (pgvector, Pinecone, Qdrant)
- Implement safe, reliable LLM endpoints (OpenAI, Anthropic, Azure)
- Work directly with clients to shape AI features and reduce manual effort
- Advise clients when NOT to use AI and navigate trade-offs around latency, accuracy, and cost
Senior MLOps, Generative AI Engineer
Sentara Health
Sentara Health Plans provides health plan coverage to close to one million members in Virginia. We offer a full suite of commercial products including employee-owned and employer-sponsored plans, as well as Individual & Family Health Plans, Employee Assistance Programs, and plans serving Medicare and Medicaid enrollees. Our quality provider network includes specialists, primary care physicians, and hospitals. We offer programs to support members with chronic illnesses, customized wellness programs, and integrated clinical and behavioral health services, all to help our members improve their health. Our success is supported by a family-friendly culture that encourages community involvement and creates unlimited opportunities for development and growth. Be a part of an excellent healthcare organization that cares about our People, Quality, Patient Safety, Service, and Integrity. Join a team that has a mission to improve health every day and a vision to be the healthcare choice of the communities that we serve!
- Design, build, and maintain scalable ML infrastructure and pipelines supporting model training, deployment, monitoring, governance, and lifecycle management.
- Develop and optimize CI/CD pipelines for machine learning and AI workloads across development, staging, and production environments.
- Build reusable ML platform capabilities including feature stores, model registries, experimentation frameworks, artifact management, and deployment automation.
- Implement scalable orchestration and workflow solutions for batch and real-time ML inference workloads.
- Create robust monitoring systems to measure model performance, detect model drift, monitor data quality, and ensure production reliability.
- Develop automation tools and self-service capabilities to improve the efficiency, scalability, and reliability of MLOps processes.
- Collaborate with Data Scientists and Software Engineers to streamline the ML lifecycle from experimentation through enterprise production deployment.
- Apply software engineering best practices to AI/ML systems including testing, observability, resiliency, security, versioning, and infrastructure-as-code.
- Identify gaps and improvement opportunities within the organization’s ML platform ecosystem and architect scalable solutions to address them.
- Support enterprise AI governance, compliance, auditability, and model risk management requirements.
- Ensure platform scalability, reliability, security, and operational excellence across AI/ML systems.
- Lead the architecture, design, and deployment of enterprise Generative AI solutions leveraging LLMs, foundation models, and agentic AI systems.
- Design and implement Retrieval-Augmented Generation (RAG) pipelines using vector databases, embeddings, semantic search, reranking, and retrieval optimization strategies.
- Build scalable LLM orchestration frameworks using technologies such as LangChain, LlamaIndex, Semantic Kernel, or equivalent frameworks.
- Develop advanced prompt engineering strategies, prompt chaining, context management, and agent workflows to improve LLM accuracy and reliability.
- Evaluate and implement fine-tuning, parameter-efficient tuning, and prompt-based optimization approaches for domain-specific use cases.
- Build AI evaluation and benchmarking frameworks to measure hallucination rates, response quality, grounding accuracy, toxicity, bias, latency, and business performance metrics.
- Implement AI safety guardrails, governance controls, content filtering, and responsible AI practices for enterprise healthcare environments.
- Design scalable GenAI APIs and microservices supporting high-throughput enterprise AI applications.
- Optimize GenAI systems for cost, latency, throughput, and inference performance across cloud and hybrid environments.
- Integrate enterprise data sources, healthcare systems, and knowledge repositories into secure GenAI workflows.
- Research and evaluate emerging GenAI technologies, open-source frameworks, and foundation models to drive innovation and continuous improvement.
- Develop architecture diagrams, technical roadmaps, implementation strategies, and executive-level documentation for enterprise AI initiatives.
- Collaborate with cybersecurity, compliance, and infrastructure teams to ensure secure and compliant deployment of GenAI solutions involving PHI and sensitive healthcare data.
- Contribute to the development of AI platform standards, reusable GenAI accelerators, templates, and engineering best practices.