Job Closed
This listing is no longer active.
NVIDIA is widely considered one of the world's most desirable employers in technology. We have some of the world's most forward-thinking and passionate people working for us. If you're creative and autonomous, we want to hear from you!
Senior Software Engineer – Inference Platform Infrastructure
Location
All locations: California | New York | Texas
Posted
104 days ago
Salary
$152K - $241.5K / year
Seniority
Senior
Job Description
NVIDIA
• Build automation that makes inference at scale easy to operate: provisioning, configuration, upgrades, rollbacks, and routine maintenance, optimized for repeatability and safety.
• Create and evolve deployment patterns for inference workloads on Kubernetes: rollouts, autoscaling, multi‑cluster patterns, GPU scheduling/isolation, and safe upgrade strategies.
• Own platform reliability outcomes through software: define and improve SLIs/SLOs, error budgets, alert quality, and automated remediation for common failure modes.
• Own and operate a large fleet of NVIDIA GPU and datacenter hardware from pre-release to production.
Job Requirements
- Strong software engineering skills; ability to build platforms and systems that our teams rely on.
- 5+ years building and operating production distributed systems with strong ownership and a track record of improving reliability and eliminating toil.
- Proven expertise in cloud-native platforms: Kubernetes, containers, service networking, configuration management, and modern CI/CD.
- Deep experience with infrastructure‑as‑code and automation-first operations (e.g., GitOps workflows, policy enforcement, fleet management patterns).
- Excellent communication and collaboration skills; ability to lead cross‑functional efforts and drive improvements to completion.
- BS/MS in Computer Science, Computer Engineering, or a related field, or equivalent experience.
Benefits
- Eligible for equity and benefits.
More Platform Engineer Jobs
Senior Data Platform Engineer (Snowflake Specialist) - Mortgage Lending - LATAM
Truelogic Software
Premium boutique software development company that helps brands with big ideas to make a difference in people’s lives.
About Truelogic
At Truelogic, we are a leading provider of nearshore staff augmentation services headquartered in New York. For over two decades, we’ve been delivering top-tier technology solutions to companies of all sizes, from innovative startups to industry leaders, helping them achieve their digital transformation goals. Our team of 600+ highly skilled tech professionals, based in Latin America, drives digital disruption by partnering with U.S. companies on their most impactful projects. Whether collaborating with Fortune 500 giants or scaling startups, we deliver results that make a difference. By applying for this position, you’re taking the first step in joining a dynamic team that values your expertise and aspirations. We aim to align your skills with opportunities that foster exceptional career growth and success while contributing to transformative projects that shape the future.
Our client operates in the financial services sector, providing residential mortgage solutions. The company originates, services, and invests in home loans, offering purchase, refinance, and servicing platforms. It leverages data, technology, and capital markets expertise to manage risk, optimize pricing, and support homeowners throughout the loan lifecycle.
Job Summary
Senior Data Platform Engineer specializing in enterprise-scale data warehousing and advanced Snowflake architectures. Leads the design, optimization, and governance of high-performance data platforms supporting mission-critical analytics and financial systems. Combines deep Snowflake expertise with strong software engineering foundations in Python, cloud infrastructure, and DataOps. Proven in performance tuning, cost optimization, secure data modeling, and CI/CD implementation. Adept at translating complex data architectures into actionable solutions through cross-functional, agile collaboration.
Responsibilities
- Lead the design, architecture, and optimization of enterprise-scale Snowflake data warehouse environments.
- Manage advanced Snowflake features, including multi-cluster warehouses, data sharing, streams, tasks, Time Travel, and zero-copy cloning.
- Drive performance tuning, query optimization, and cost management at an enterprise level.
- Design and implement robust data models and dimensional schemas using modern warehouse patterns.
- Develop scalable data pipelines in Python 3, applying OOP principles and design patterns.
- Build and orchestrate data workflows using tools such as Apache Airflow, pandas, and SQLAlchemy.
- Design and expose RESTful APIs and event-driven integrations to enable cross-platform data flow.
- Implement AWS cloud infrastructure as code with Terraform, CDK, or CloudFormation.
- Enforce data security through RBAC, data masking, and row-level security policies.
- Establish DataOps best practices, including CI/CD pipelines, automated testing, and monitoring.
Qualifications and Job Requirements
Staff Software Engineer, AI Platform
GoodLeap
GoodLeap is America's leading fintech for sustainable home solutions.
Role Description
The Staff Software Engineer will be responsible for building, maintaining, monitoring, and scaling AI products and components in our portfolio. This role will work directly with Product Managers and other stakeholders to help define the product roadmap and provide input into the overall technical strategy.
- Collaborate with product, architecture, and design leads to deliver highly available, fault-tolerant products and services.
- Work on significant and unique technical challenges, evaluate and recommend solutions, and guide decision-making by weighing technical tradeoffs.
- Understand both the technical and business perspectives to help drive innovation.
- Work autonomously and with self-discipline, without requiring supervision or guidance.
- Mentor and coach team members to grow both their technical and soft skills.
Qualifications
- Bachelor's degree
- 7+ years of experience in backend software, ML, or AI engineering
- Highly proficient in ML and AI engineering tools, Python/FastAPI, the A2A protocol, agent design and evaluation, and prompt and context engineering
- Adept with the agile software development lifecycle and DevOps principles
- Passion for software development, emerging technologies, and a culture of innovation
- Excellent communication and interpersonal skills
Requirements
- Experience with AWS cloud infrastructure and Terraform
- Full-stack experience building UIs (React/Flutter)
- Statistical and analysis skills: hypothesis testing, A/B testing, and translating technical findings for business stakeholders
Benefits
- $173,000 - $200,000 a year
- This role may be eligible for a bonus and equity.
Senior Data Platform Engineer
Crystal Intelligence
Blockchain intelligence and compliance solutions for financial institutions, governments & regulators
• Active participation in the development and maintenance of our data pipelines and backend services
• Integration of blockchains, Automated Market Maker (AMM) protocols, and bridges within Crystal's platform
• Integration of new technologies into our processes and tools
• End-to-end feature design and implementation
• Coding, debugging, testing, and delivering features and improvements continuously
• Code review, assistance, and feedback for other team members
• Build Release Confidence Systems: Design and implement automated smoke tests and workflow validations for our most important payment flows, and integrate them into CI and release gates
• Improve Environment Fidelity: Make testing and staging behave more like production by improving stability, data hygiene, environment health checks, and repeatable test scenarios
• Develop Test Infrastructure: Build frameworks and harnesses for API and event-driven workflow testing, including partner simulations and contract checks where appropriate
• Enable Developer Productivity: Build internal self-service tooling and automation that reduces manual steps, improves debugging workflows, and increases engineering velocity
• Improve Release Observability and Documentation: Strengthen release dashboards, visibility, and documentation practices to improve triage, bug tracking, and regression prevention over time
• Strengthen CI and Developer Workflows: Improve CI speed and signal, reduce flaky tests, and make failures easy to debug with useful artifacts and visibility
• Add Operational Guardrails: Partner with engineering to define monitors, canaries, rollout strategies, and rollback patterns for higher-risk changes
• Close the Loop on Regressions: Turn production regressions into durable prevention by adding automated checks, checklist steps, improved documentation, or better safety rails