Job Closed
This listing is no longer active.
NVIDIA is widely considered to be one of the technology industry's most desirable employers. We have some of the most brilliant and hardworking people in the world working with us, and our product lines are growing fast in some of the hottest state-of-the-art fields, such as Virtual Reality, Artificial Intelligence, Deep Learning, and Autonomous Vehicles. Applications for this job will be accepted at least until June 8, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.
Systems Software Engineer, AI Infrastructure
Location
India
Posted
106 days ago
Salary
0
Seniority
Senior
Job Description
NVIDIA
• Develop and maintain large-scale systems supporting critical use-cases including frontier model training for AI Infrastructure, driving reliability, operability, and scalability across global public and private clouds.
• Collaborate on tooling for HPC, GPU Training, and AI Model training workflows.
• Build tools and frameworks to improve observability, define actionable reliability metrics, and enable fast issue resolution, driving continuous improvement in system performance.
• Establish frameworks for operational maturity, lead sustainable incident response protocols, and conduct blameless postmortems to improve team efficiency and system resilience.
• Implement SRE fundamentals, including incident management, monitoring, and performance optimization, while designing automation tools to reduce manual processes and operational overhead.
• Work with engineering teams to deliver innovative solutions, uphold high standards for code and infrastructure, and contribute to hiring for a diverse, high-performing team.
Job Requirements
- Degree in Computer Science or related field, or equivalent experience with 5+ years in Software Development, SRE, or Production Engineering.
- Proficiency in Python and at least one other language (C/C++, Go, Perl, Ruby).
- Expertise in systems engineering within Linux or Windows environments and cloud platforms (AWS, Azure, GCP, or OCI).
- Strong understanding of SRE principles, including error budgets, SLOs, SLAs, and Infrastructure as Code tools (e.g., Terraform CDK).
- Hands-on experience with observability platforms (e.g., ELK, Prometheus, Loki) and CI/CD systems (e.g., GitLab).
- Strong communication skills with the ability to convey technical concepts effectively to diverse audiences.
- Commitment to fostering a culture of diversity, curiosity, and continuous improvement.
Benefits
- Highly competitive salaries
- Comprehensive benefits package
More LLM Engineer Jobs
Senior AI/ML Engineer (LLM)
iBusiness Funding
Helping to provide capital in an efficient and transparent manner to every small and medium-sized business in America.
About iBusiness Funding
iBusiness Funding is a software and lender service provider specializing in small business lending. Our technology, team, and process enable us to support loans from $10,000 to $25 million for our lending partners. Our technology solutions have been proven to quickly scale our clients’ portfolios without the need for additional overhead. Our flagship product, LenderAI, features end-to-end lending functionality from sales all the way through servicing. To date, we’ve processed over $11 billion in SBA and non-SBA volume and handle more than 1,000 business loan applications daily. Our team is driven by our core values of innovation, integrity, enjoyment, and family. Join us and be part of a team that’s transforming the finance industry and empowering businesses to thrive!
Position Description
We are seeking an experienced Senior LLM Engineer to join our team. You will play a key role in designing and implementing workflows that leverage large language models (LLMs, LAMs, LMMs, LVLMs, etc.) to automate processes and drive innovation in our products. The ideal candidate will have a deep understanding of NLP, experience with foundational models, and a flexible, problem-solving mindset. You will collaborate closely with cross-functional teams, contributing to the development of scalable AI-driven solutions.
Major Areas of Responsibility
- Design, implement, and optimize workflows that incorporate large language models to automate and enhance product features.
- Leverage existing foundational models and adapt them to fit various product requirements, ensuring alignment with business goals.
• Own the Technical Blueprint: Personally architect the infrastructure solutions for our most strategic M&E partnerships: studio-scale content production pipelines, agency data consolidation plays, and generative AI model deployments.
• The Physics to P&L Narrative: Fluently demonstrate to executive stakeholders how infrastructure decisions (data lake locality, storage tiering, inference optimization) directly impact their business model and operability.
• Deconstruct the Bottleneck: Go beyond the stated problem to find the technical truth. Translate vague business goals (e.g., “We need lower rendering costs”) into precise engineering requirements (e.g., “We need to optimize the inference batch size on L40s to reduce cost-per-token by 30%”).
• Map the Transition: Identify exactly where a customer sits on the curve from legacy service bureau to AI-native tech platform and prescribe the specific infrastructure intervention needed to move them forward.
• Build and Validate the Integration Layer: Identify, engage, and technically validate relationships with the most critical ISVs in the media and entertainment landscape, from rendering and VFX toolchains to generative AI platforms.
• Define the Standard: It is not enough to support these tools. You will define the reference architectures for how they run best on Nebius infrastructure, and work directly with ISV engineering teams to build and publish those standards.
• Decide What’s Worth Doing: In partnership with the GM, evaluate ISV and partner opportunities on their technical merit and strategic leverage, and be equally rigorous about what not to pursue.
• Shape the M&E Roadmap: Use forensic evidence from the field to prioritize and justify the M&E vertical roadmap. You will work directly with Nebius’s global Head of Product and Head of Engineering to translate partner and customer needs into product direction.
• Lead the M&E Product Summit: Chair a quarterly summit with Core Engineering leadership, using field evidence to drive roadmap decisions and maintain vertical momentum.
• Develop performant, scalable, and high-quality APIs and backend processes for Ostro's SaaS platform, with a strong emphasis on LLM integration.
• Collaborate with cross-functional teams to implement new features and refine existing ones, particularly those involving AI/LLM capabilities.
• Provide feedback on roadmap and features for your team, contributing to the strategic direction of Ostro’s AI/LLM initiatives.
• Ensure code quality and compliance through thorough reviews, unit testing, and adherence to best practices for LLM-powered applications and Ostro engineering.
• Optimize application performance and scalability to meet user demands, especially for LLM inference and data processing.
• Stay informed about emerging AI/LLM technologies, prompt engineering techniques, and industry trends.
• Troubleshoot and resolve production issues, ensuring performance, reliability, and scalability of LLM-driven features.
• Build and improve agentic workflows (tool/function calling, planning, self-checks) for analytics, summaries, visualizations, and task automation.
• Implement adapters and tools to connect LLMs with internal and external services.
• Contribute to our FastAPI backend with clean interfaces, Pydantic validation, and tests.
• Develop evaluation metrics to measure accuracy, latency, and cost.
• Optimize prompts, retrieval/contexting, and execution strategies for privacy, reliability, and performance.
• Ship services in containers (Docker) and collaborate on deployments (Kubernetes), CI, and observability.
• Document technical decisions and share learnings with the team.




