Job Closed

This listing is no longer active.

Senior AI-HPC Cluster Engineer – MLOps

Machine Learning Engineer | Other | Remote | Senior | Team 10,001+ | Since 1993 | H1B Sponsor

Location

California | Texas

Posted

106 days ago

Salary

$184K - $356.5K / year

Seniority

Senior

Bachelor Degree | 8 yrs exp | English | Docker | Kubernetes | Linux | Python | Rust

Job Description

NVIDIA

  • Provide leadership and strategic mentorship on the management of large-scale HPC systems including the deployment of compute, networking, and storage
  • Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions
  • Build and nurture customer and cross-team relationships to consistently support the clusters and address changing user needs
  • Support our researchers to run their workloads including performance analysis and optimizations
  • Conduct root cause analysis and suggest corrective action
  • Proactively find and fix issues before they occur
  • Build innovative tooling to accelerate researchers' velocity, troubleshooting, and software performance at scale

Job Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience
  • Minimum of 8 years of experience crafting and operating large-scale compute infrastructure
  • Experience with AI/HPC job schedulers and orchestrators, such as Slurm, K8s or LSF
  • Applied experience with AI/HPC workflows that use MPI and NCCL
  • Proficient in using Linux, including CentOS/RHEL and/or Ubuntu distributions
  • A solid understanding of container technologies like Enroot, Docker and Podman
  • Proficiency in one scripting language (Python, Bash) and at least one compiled language (Golang, Rust, C, C++...)
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads
  • Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions
  • Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals
  • Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC and AI/ML infrastructure fields

Benefits

  • equity
  • benefits

Related Job Pages

More Machine Learning Engineer Jobs

Senior Software Engineer, Machine Learning

AssemblyAI

Offering speech-to-text APIs for modern developers, AssemblyAI is ultimately on a mission to use the latest deep learning technology to build practical products that make futuristi

  • Design and implement tooling that enables researchers to quickly deploy and evaluate new models in production
  • Design, build, and maintain high-performance, cost-efficient inference pipelines, making architectural decisions about scaling, reliability, and cost trade-offs
  • Proactively identify and resolve infrastructure bottlenecks, proposing and scoping improvements to iteration speed and production reliability
  • Develop and maintain user-facing APIs that interact with our ML systems
  • Implement comprehensive observability solutions to monitor model performance and system health
  • Troubleshoot and lead resolution of complex production issues across distributed systems, driving root-cause analysis and implementing preventive measures
  • Set the direction for and continuously improve our MLOps practices, identifying the highest-impact opportunities to reduce friction between research and production.
  • Collaborate closely with research and engineering teams to align on technical direction, and help onboard and mentor engineers on ML infrastructure best practices.

Europe
$195K - $225K / year
Job Closed
Other | Remote | Team 51-200 | Since 2017 | H1B Sponsor

  • Own SentiLink’s real-time ML model monitoring domain, leading the design, implementation, and ongoing improvement of monitoring systems and workflows.
  • Own our ML experimentation, model tracking, and versioning infrastructure, ensuring strong reproducibility and visibility across the model lifecycle.
  • Drive improvements to the model development process, reducing inefficiencies, improving code quality, resolving DS tooling gaps, and enabling faster iteration.
  • Serve as the primary technical owner of key touchpoints and interfaces between Data Science and Engineering/Infrastructure, defining standards and workflows.
  • Support efforts to optimize model behavior in production, including latency, reliability, maintainability, and operational best practices.
  • Investigate and diagnose model performance issues on an ad-hoc basis, including partner escalations and analysis of model behavior in real-world scenarios.
  • Evaluate, prototype, and recommend new ML infrastructure, tools, and data capabilities, partnering with DS to validate impact and support adoption.

United States
$170K - $240K / year
Job Closed
Other | Remote | Team 11-50 | H1B No Sponsor

  • Collaborate with LangChain engineers to develop educational content that teaches agentic evaluation, monitoring and refinement using LangSmith, LangChain and LangGraph.
  • Design curriculum and structured learning paths for our community of over 1 million developers and agent builders.
  • Create and deliver content across multiple formats:
      • Online courses for LangChain Academy, video tutorials, and webinars
      • Live presentations at workshops, hackathons, meetups, and conferences
  • Build and maintain example projects, code demos, and visuals to support educational content.
  • Translate experimental applied AI code and internal agent evaluation techniques into crisp, developer-friendly learning materials.

Utah
$175K - $195K / year
Job Closed

Machine Learning Engineer

Dropbox

Dropbox is the one place to keep life organized and keep work moving.

Full Time | Remote | Team 1,001-5,000 | Since 2007 | H1B Sponsor

  • Work with large-scale data systems and infrastructure
  • Help productionize multimodal and semantic retrieval systems at scale, powering Dash’s multimedia and creative search experiences.
  • Partner with product, design, and infrastructure teams to improve retrieval, ranking, and conversational experiences across image, video, and text content.
  • Build and iterate on quick prototypes and experimental features, driving innovation in multimodal interaction and creative workflows.
  • Run quality and performance benchmarks across individual components and end-to-end systems to identify optimization opportunities.
  • Contribute to open source projects and leverage OSS tools for efficient inference and scaling.
  • On-call work may be necessary occasionally to help address bugs, outages, or other operational issues.

Germany
€125K - €169.1K / year
Job Closed