NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you have initiative and creativity, we want to hear from you! Applications for this job will be accepted at least until June 1, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes. NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
Principal Software Engineer, DGX Cloud Production Engineering
Location
California
Posted
8 days ago
Salary
$272K - $431.3K / year
Seniority
Lead
Job Description
NVIDIA
- Define and execute the technical strategy for DGX Cloud cluster operations, building the automation, GitOps, and Day 2 reliability needed to operate large-scale GPU clusters across NVIDIA Cloud Partners (NCPs) and on-prem environments
- Lead design and implementation of systems for cluster lifecycle, validation, repair, upgrades, observability, and readiness
- Establish patterns for Kubernetes-based GPU cluster operations across partner and on-prem environments
- Identify and eliminate operational toil through software, APIs, automation, and agent-assisted workflows
- Set technical standards for production readiness, SLOs, incident response, handoff gates, and operational acceptance
- Mentor engineers and influence platform, infrastructure, storage, networking, security, and workload teams
Job Requirements
- 15+ years of experience building and operating large-scale distributed systems or cloud infrastructure
- Deep experience with Kubernetes, Linux, infrastructure automation, and production operations
- Strong programming experience in Go, Python, or similar
- Proven ability to lead complex cross-org technical initiatives
- Experience designing reliable systems with clear SLOs, observability, incident response, and automation
- BS/MS in Computer Science or equivalent experience.
Benefits
- equity
- benefits
More Production Engineer Jobs
Document Production Workflow Coordinator
RR Donnelley - RRD
RR Donnelley - RRD is a “global integrated communications” firm with a history dating back to 1864. Headquartered in Chicago, Illinois, RR Donnelley employs
Title: Document Production Workflow Coordinator - Wheeling, WV
Location: Wheeling, WV, Columbus, OH, United States
Job Description:
- Wheeling, WV, USA
- Employees work in a hybrid mode
- Full-time
- Department: Legal & Document Processing
Company Description
RRD provides marketing, packaging, print, and business services to the world’s most respected brands. The company’s proprietary technology, advanced data analytics, and expertise fuel organizational decision-making from strategy through execution, delivering sustainable solutions with the lowest possible environmental impact. Global organizations and regulated industries trust RRD to reduce complexity and drive audience connections across the entire customer journey.
Job Description
Do you have a keen eye for even the smallest details? This position could be for you! In this role, your primary function will be document review and preparation. This includes creating spreadsheets, charts, graphs, mail merges, tables, presentations, and other documents to support the client’s brand and track the progress of all work.
Location: Hybrid in Columbus, OH. Also open to candidates in the Wheeling, WV location. After training, this will be mostly remote. However, we are seeking candidates local to either Columbus or Wheeling.
Shift: Saturday and Sunday; 9am to 10pm AND Friday and Monday; 12pm to 9pm. This position qualifies for an additional $1.75/hour shift differential.
Job duties:
- Create and edit legal documents to client specifications using applicable software.
- Transcribe tapes, scan and clean documents, and convert documents to/from different file formats.
- Recover/restore corrupted document files when needed.
- Handle sensitive and/or confidential documents and information.
- Communicate with managers and supervisors on job or deadline issues.
Qualifications
Job Requirements:
- High school diploma or equivalent
- Advanced knowledge of MS Office (Word, Excel, and PPT), including formatting documents with Styles, and generating a table of contents and table of authorities; strong keyboarding and typing skills
- Ability to work in a fast-paced team environment and as an independent operator.
- Attention to detail with emphasis on accuracy and quality.
- Able to apply intermediate requisite knowledge of appropriate grammar, spelling, and composition to work requests
Additional Information
The salary for this role at the noted RRD location is $23 to $24 / hour. Starting pay decisions are determined based on multiple factors including but not limited to relevant education, qualifications, skills, experience, certifications, proficiency, performance, shift, location, and other business needs. Typically, roles follow step progressions to a target rate or set increments over time. Depending on the role, in addition to the hourly rate of pay, the total compensation package may also include overtime, shift differential, call-in, and/or stand-by pay. RRD’s benefit offerings include medical, dental, and vision coverage, paid time off, disability insurance, 401(k) with company match, life insurance and other voluntary supplemental insurance coverages, plus parental leave, adoption assistance, tuition assistance and employer/partner discounts.
#WLWV
All employment offers are contingent upon the successful completion of both a pre-employment background and drug screen.
Document Production Workflow Coordinator
RR Donnelley - RRD
RR Donnelley - RRD is a “global integrated communications” firm with a history dating back to 1864. Headquartered in Chicago, Illinois, RR Donnelley employs
Title: Document Production Workflow Coordinator
Location: Columbus United States
Job Description:
Company Description
RRD provides marketing, packaging, print, and business services to the world's most respected brands. The company's proprietary technology, advanced data analytics, and expertise fuel organizational decision-making from strategy through execution, delivering sustainable solutions with the lowest possible environmental impact. Global organizations and regulated industries trust RRD to reduce complexity and drive audience connections across the entire customer journey.
Job Description
Do you have a keen eye for even the smallest details? This position could be for you! In this role, your primary function will be document review and preparation. This includes creating spreadsheets, charts, graphs, mail merges, tables, presentations, and other documents to support the client's brand and track the progress of all work.
Location: Hybrid in Columbus, OH. Also open to candidates in the Wheeling, WV location. After training, this will be mostly remote. However, we are seeking candidates local to either Columbus or Wheeling.
Shift: Saturday and Sunday; 9am to 10pm AND Friday and Monday; 12pm to 9pm. This position qualifies for an additional $1.75/hour shift differential.
Job duties:
- Create and edit legal documents to client specifications using applicable software.
- Transcribe tapes, scan and clean documents, and convert documents to/from different file formats.
- Recover/restore corrupted document files when needed.
- Handle sensitive and/or confidential documents and information.
- Communicate with managers and supervisors on job or deadline issues.
Qualifications
Job Requirements:
- High school diploma or equivalent
- Advanced knowledge of MS Office (Word, Excel, and PPT), including formatting documents with Styles, and generating a table of contents and table of authorities; strong keyboarding and typing skills
- Ability to work in a fast-paced team environment and as an independent operator.
- Attention to detail with emphasis on accuracy and quality.
- Able to apply intermediate requisite knowledge of appropriate grammar, spelling, and composition to work requests
Additional Information
The salary for this role at the noted RRD location is $23 to $24 / hour. Starting pay decisions are determined based on multiple factors including but not limited to relevant education, qualifications, skills, experience, certifications, proficiency, performance, shift, location, and other business needs. Typically, roles follow step progressions to a target rate or set increments over time. Depending on the role, in addition to the hourly rate of pay, the total compensation package may also include overtime, shift differential, call-in, and/or stand-by pay. RRD's benefit offerings include medical, dental, and vision coverage, paid time off, disability insurance, 401(k) with company match, life insurance and other voluntary supplemental insurance coverages, plus parental leave, adoption assistance, tuition assistance and employer/partner discounts.
#WLOH
All employment offers are contingent upon the successful completion of both a pre-employment background and drug screen. RRD is an Equal Opportunity Employer, including disability/veterans.
Engineering Manager, DGX Cloud Production Engineering
NVIDIA
- Lead a team of software and production engineers building and operating DGX Cloud infrastructure across NVIDIA Cloud Partner (NCP) and on-prem environments
- Drive execution across cluster operations, Kubernetes operability, automation, GitOps, observability, and incident response
- Help define team priorities, roadmap, staffing, and operational ownership
- Partner with platform, workload, storage, networking, security, and TPM teams to improve production readiness
- Build a healthy on-call and incident review culture focused on learning, ownership, and durable fixes
- Coach engineers, grow technical leaders, and create clear ownership across ambiguous problem spaces
Senior Production Engineer – DGX Cloud
NVIDIA
- You will be part of a DGX Cloud team responsible for production systems that enable large, scalable GPU clusters to be used for a variety of AI workloads.
- This includes working on custom software for GPU asset provisioning, configuration, and lifecycle management across cloud providers.
- Implementing monitoring and health management capabilities that enable industry-leading reliability, availability, and scalability of GPU assets.
- You will be harnessing multiple data streams, ranging from GPU hardware diagnostics to cluster and network telemetry.
- Working with teams across NVIDIA to ensure production AI clusters run reliably and consistently with maximum performance.
- Evaluating system failures and improving services based on a well-defined incident management process.

