Senior DGX Cloud AI Infrastructure Software Engineer
Location
California | Oregon | Texas | Washington
Posted
135 days ago
Salary
$184K - $287.5K / year
Seniority
Senior
Job Description
NVIDIA
- Develop infrastructure software and tools for large-scale pre-training, post-training, and inference.
- Develop and optimize tools and libraries to improve infrastructure efficiency and resiliency.
- Co-design and implement APIs for integration with NVIDIA's resiliency stacks.
- Enhance infrastructure and products underpinning NVIDIA's AI platforms.
- Define meaningful and actionable reliability metrics to track and improve system and service reliability.
- Apply problem-solving, root cause analysis, and optimization skills.
- Root-cause, analyze, and triage failures from the application level to the hardware level.
Job Requirements
- 8+ years of experience developing software infrastructure for large-scale AI systems.
- Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
- Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level.
- Experience with observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki).
- Proven track record in building and scaling large-scale distributed systems.
- Experience with AI training and inferencing infrastructure services.
- Proficiency in programming languages such as Python and C/C++, as well as scripting languages.
- Experience in quality software engineering practices, including test development, defensive programming, version control, and CI.
- Excellent communication and collaboration skills. A culture of diversity, intellectual curiosity, problem solving, and openness is essential.
Benefits
- equity
- benefits