Job Closed
This listing is no longer active.
Senior AI-HPC Cluster Engineer – MLOps
Location
California | Texas
Posted
106 days ago
Salary
$184K - $356.5K / year
Seniority
Senior
Job Description
NVIDIA
• Provide leadership and strategic mentorship on the management of large-scale HPC systems including the deployment of compute, networking, and storage
• Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions
• Build and nurture customer and cross-team relationships to consistently support the clusters and address changing user needs
• Support our researchers to run their workloads including performance analysis and optimizations
• Conduct root cause analysis and suggest corrective action
• Proactively find and fix issues before they occur
• Build innovative tooling to accelerate researchers' velocity, troubleshooting, and software performance at scale
Job Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience
- Minimum of 8 years of experience crafting and operating large-scale compute infrastructure
- Experience with AI/HPC job schedulers and orchestrators, such as Slurm, K8s or LSF
- Applied experience with AI/HPC workflows that use MPI and NCCL
- Proficient in using Linux, including CentOS/RHEL and/or Ubuntu distributions
- A solid understanding of container technologies like Enroot, Docker and Podman
- Proficiency in one scripting language (Python, Bash) and at least one compiled language (Golang, Rust, C, C++...)
- Experience analyzing and tuning performance for a variety of AI/HPC workloads
- Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions
- Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals
- Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC and AI/ML infrastructure fields
Benefits
- equity
- benefits
More Machine Learning Engineer Jobs
Senior Software Engineer, Machine Learning
AssemblyAI
Offering speech-to-text APIs for modern developers, AssemblyAI is ultimately on a mission to use the latest deep learning technology to build practical products that make futuristi
• Design and implement tooling that enables researchers to quickly deploy and evaluate new models in production
• Design, build, and maintain high-performance, cost-efficient inference pipelines, making architectural decisions about scaling, reliability, and cost trade-offs
• Proactively identify and resolve infrastructure bottlenecks, proposing and scoping improvements to iteration speed and production reliability
• Develop and maintain user-facing APIs that interact with our ML systems
• Implement comprehensive observability solutions to monitor model performance and system health
• Troubleshoot and lead resolution of complex production issues across distributed systems, driving root-cause analysis and implementing preventive measures
• Set the direction for and continuously improve our MLOps practices, identifying the highest-impact opportunities to reduce friction between research and production
• Collaborate closely with research and engineering teams to align on technical direction, and help onboard and mentor engineers on ML infrastructure best practices

• Own SentiLink’s real-time ML model monitoring domain, leading the design, implementation, and ongoing improvement of monitoring systems and workflows
• Own our ML experimentation, model tracking, and versioning infrastructure, ensuring strong reproducibility and visibility across the model lifecycle
• Drive improvements to the model development process, reducing inefficiencies, improving code quality, resolving DS tooling gaps, and enabling faster iteration
• Serve as the primary technical owner of key touchpoints and interfaces between Data Science and Engineering/Infrastructure, defining standards and workflows
• Support efforts to optimize model behavior in production, including latency, reliability, maintainability, and operational best practices
• Investigate and diagnose model performance issues on an ad-hoc basis, including partner escalations and analysis of model behavior in real-world scenarios
• Evaluate, prototype, and recommend new ML infrastructure, tools, and data capabilities, partnering with DS to validate impact and support adoption

• Collaborate with LangChain engineers to develop educational content that teaches agentic evaluation, monitoring and refinement using LangSmith, LangChain and LangGraph
• Design curriculum and structured learning paths for our community of over 1 million developers and agent builders
• Create and deliver content across multiple formats:
  • Online courses for LangChain Academy, video tutorials, and webinars
  • Live presentations at workshops, hackathons, meetups, and conferences
• Build and maintain example projects, code demos, and visuals to support educational content
• Translate experimental applied AI code and internal agent evaluation techniques into crisp, developer-friendly learning materials

Machine Learning Engineer
Dropbox
Dropbox is the one place to keep life organized and keep work moving.
• Work with large-scale data systems and infrastructure
• Help productionize multimodal and semantic retrieval systems at scale, powering Dash’s multimedia and creative search experiences
• Partner with product, design, and infrastructure teams to improve retrieval, ranking, and conversational experiences across image, video, and text content
• Build and iterate on quick prototypes and experimental features, driving innovation in multimodal interaction and creative workflows
• Run quality and performance benchmarks across individual components and end-to-end systems to identify optimization opportunities
• Contribute to open source projects and leverage OSS tools for efficient inference and scaling
• On-call work may occasionally be necessary to help address bugs, outages, or other operational issues




