Job Closed
This listing is no longer active.
NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you! Applications for this job will be accepted at least until June 15, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes. NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
Senior Systems Software Engineer, AI Infrastructure
Location
California
Posted
71 days ago
Salary
$152K - $241.5K / year
Seniority
Senior
Job Description
NVIDIA
- Develop and maintain large-scale systems supporting critical use-cases including frontier model training for AI Infrastructure
- Collaborate on tooling for HPC, GPU Training, and AI Model training workflows
- Build tools and frameworks to improve observability, define actionable reliability metrics, and enable fast issue resolution
- Establish frameworks for operational maturity, lead sustainable incident response protocols, and conduct blameless postmortems
- Implement SRE fundamentals, including incident management, monitoring, and performance optimization
- Work with engineering teams to deliver innovative solutions
Job Requirements
- Degree in Computer Science or related field, or equivalent experience
- 5+ years in Software Development, SRE, or Production Engineering
- Proficiency in Python and at least one other language (C/C++, Go, Perl, Ruby)
- Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, Azure, GCP, or OCI)
- Strong understanding of SRE principles, including error budgets, SLOs, and SLAs, and of Infrastructure as Code tools (e.g., Terraform CDK)
- Hands-on experience with observability platforms (e.g., ELK, Prometheus, Loki) and CI/CD systems (e.g., GitLab)
Benefits
- Comprehensive benefits package
- Health insurance
- Equity options
- Professional development opportunities