NVIDIA

Senior Software Engineer, DGX Cloud AI Infrastructure

LLM Engineer | Machine Learning Engineer | Full Time | Remote | Senior | Team 10,001+ | Since 1993 | H1B Sponsor

Location

California + 3 more

Posted

4 days ago

Salary

$184K - $356.5K / year

Seniority

Senior

Bachelor Degree | 8 yrs exp | English | Distributed Systems | Node.js | Python | PyTorch

Job Description

• Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
• Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
• Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
• Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
• Own root-cause analysis of complex failures (hangs, performance regressions, and topology sensitivity) in large distributed environments.
• Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
• Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
• Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
• Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
• Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.

Job Requirements

Bachelor’s or Master’s in Computer Science or a related technical field (or equivalent experience).
8+ years of experience developing software infrastructure for large-scale AI or HPC systems, including a track record of technical leadership.
Expertise debugging and triaging AI applications across the full stack — from the application layer down to the hardware.
Deep hands-on experience with NCCL, CUDA-aware distributed execution, and debugging multi-GPU and multi-node workloads at scale.
Proven track record of architecting, debugging, and scaling large-scale distributed systems.
Expert-level Python and C/C++ programming skills.
Experience operating workloads in scheduled, containerized cluster environments.
Excellent analytical, debugging, and communication skills, with the ability to influence across teams.

Benefits

equity
benefits

More LLM Engineer Jobs

Generative AI Engineer – LATAM Candidates Only

Talentus Global

We facilitate talent & software solutions across the globe: near-shore, managed services, ERPs, CRMs, EdTech/HigherEd.

LLM Engineer | 5 days ago

Full Time | Remote | Team 201-500 | Since 2020 | H1B No Sponsor

• Design, develop, and deploy Generative AI solutions leveraging Large Language Models (LLMs) and multimodal AI technologies.
• Build and maintain scalable AI applications using cloud platforms such as Azure, AWS, or GCP.
• Develop and optimize Retrieval-Augmented Generation (RAG) architectures, vector databases, and knowledge retrieval systems.
• Fine-tune, evaluate, and monitor foundation models to improve performance, accuracy, and reliability.
• Implement prompt engineering strategies and AI orchestration frameworks to support business use cases.
• Collaborate with software engineering, data science, DevOps, and security teams to integrate AI solutions into production environments.
• Develop APIs, microservices, and AI-powered applications following software engineering best practices.
• Ensure compliance with AI governance, security, privacy, and responsible AI standards.
• Monitor AI workloads, model performance, and operational costs, recommending continuous improvements.
• Stay current with emerging Generative AI technologies, frameworks, and industry trends.

AWS | Azure | Cloud | Docker | Google Cloud Platform | Kubernetes | Microservices | Python

Colombia

Senior AI Infrastructure Engineer

Pyyne

Pyyne is a modern technology consultancy engineering the next generation of digital products and services. At Pyyne, we believe in using technology to unlock business potential, create sustainable growth, and drive forward digital excellence. Our solutions span advanced Software Engineering, Cloud, and Data & AI.

LLM Engineer | 6 days ago

Full Time | Remote | Team 51-200 | Since 2020 | H1B No Sponsor

• Design and deploy AI platforms that integrate with infrastructure tools
• Develop AI-powered workflows to automate operational tasks
• Build AI-driven automation for incident response and operational workflows
• Implement AI-powered monitoring and anomaly detection capabilities
• Create intelligent operational dashboards with actionable insights
• Ensure AI platforms operate reliably in production environments
• Develop AI solutions for cost optimization and predictive capacity planning

Ansible | AWS | Azure | Cloud | Google Cloud Platform | ITSM | Python | Terraform

Brazil

LLM Engineer, Freelancer

Monterail

Delivering Innovative Software

LLM Engineer | 6 days ago

Part Time | Remote | Team 51-200 | Since 2011 | H1B No Sponsor

• Add AI functionality into existing Python, Node.js, and Ruby codebases
• Build LLM-powered features: chat, summaries, classification, smart search, document Q&A
• Design lightweight RAG pipelines using embeddings and vector search
• Work with vector DBs (pgvector, Pinecone, Qdrant)
• Implement safe, reliable LLM endpoints (OpenAI, Anthropic, Azure)
• Work directly with clients to shape AI features and reduce manual effort
• Advise clients when NOT to use AI and navigate trade-offs around latency, accuracy, and cost

Azure | JavaScript | Node.js | Python | Ruby

Poland

Senior MLOps, Generative AI Engineer

Sentara Health

Sentara Health Plans provides health plan coverage to close to one million members in Virginia. We offer a full suite of commercial products including employee-owned and employer-sponsored plans, as well as Individual & Family Health Plans, Employee Assistance Programs, and plans serving Medicare and Medicaid enrollees. Our quality provider network includes specialists, primary care physicians, and hospitals. We offer programs to support members with chronic illnesses, customized wellness programs, and integrated clinical and behavioral health services, all to help our members improve their health. Our success is supported by a family-friendly culture that encourages community involvement and creates unlimited opportunities for development and growth. Be a part of an excellent healthcare organization that cares about our People, Quality, Patient Safety, Service, and Integrity. Join a team that has a mission to improve health every day and a vision to be the healthcare choice of the communities that we serve!

LLM Engineer | 13 days ago

Full Time | Remote | Team 10,001+ | Since 1890 | H1B No Sponsor

• Design, build, and maintain scalable ML infrastructure and pipelines supporting model training, deployment, monitoring, governance, and lifecycle management.
• Develop and optimize CI/CD pipelines for machine learning and AI workloads across development, staging, and production environments.
• Build reusable ML platform capabilities including feature stores, model registries, experimentation frameworks, artifact management, and deployment automation.
• Implement scalable orchestration and workflow solutions for batch and real-time ML inference workloads.
• Create robust monitoring systems to measure model performance, detect model drift, monitor data quality, and ensure production reliability.
• Develop automation tools and self-service capabilities to improve the efficiency, scalability, and reliability of MLOps processes.
• Collaborate with Data Scientists and Software Engineers to streamline the ML lifecycle from experimentation through enterprise production deployment.
• Apply software engineering best practices to AI/ML systems including testing, observability, resiliency, security, versioning, and infrastructure-as-code.
• Identify gaps and improvement opportunities within the organization’s ML platform ecosystem and architect scalable solutions to address them.
• Support enterprise AI governance, compliance, auditability, and model risk management requirements.
• Ensure platform scalability, reliability, security, and operational excellence across AI/ML systems.
• Lead the architecture, design, and deployment of enterprise Generative AI solutions leveraging LLMs, foundation models, and agentic AI systems.
• Design and implement Retrieval-Augmented Generation (RAG) pipelines using vector databases, embeddings, semantic search, reranking, and retrieval optimization strategies.
• Build scalable LLM orchestration frameworks using technologies such as LangChain, LlamaIndex, Semantic Kernel, or equivalent frameworks.
• Develop advanced prompt engineering strategies, prompt chaining, context management, and agent workflows to improve LLM accuracy and reliability.
• Evaluate and implement fine-tuning, parameter-efficient tuning, and prompt-based optimization approaches for domain-specific use cases.
• Build AI evaluation and benchmarking frameworks to measure hallucination rates, response quality, grounding accuracy, toxicity, bias, latency, and business performance metrics.
• Implement AI safety guardrails, governance controls, content filtering, and responsible AI practices for enterprise healthcare environments.
• Design scalable GenAI APIs and microservices supporting high-throughput enterprise AI applications.
• Optimize GenAI systems for cost, latency, throughput, and inference performance across cloud and hybrid environments.
• Integrate enterprise data sources, healthcare systems, and knowledge repositories into secure GenAI workflows.
• Research and evaluate emerging GenAI technologies, open-source frameworks, and foundation models to drive innovation and continuous improvement.
• Develop architecture diagrams, technical roadmaps, implementation strategies, and executive-level documentation for enterprise AI initiatives.
• Collaborate with cybersecurity, compliance, and infrastructure teams to ensure secure and compliant deployment of GenAI solutions involving PHI and sensitive healthcare data.
• Contribute to the development of AI platform standards, reusable GenAI accelerators, templates, and engineering best practices.

AWS | Azure | Cloud | Cyber Security | Distributed Systems | Google Cloud Platform | Kubernetes | Microservices | Python | PyTorch | Tensorflow

Alabama + 24 more

$91.4K - $152.4K / year

Job Closed

Senior Software Engineer, DGX Cloud AI Infrastructure