Job Closed

This listing is no longer active.

Senior AI-HPC Cluster Engineer – MLOps

Machine Learning Engineer | Other | Remote | Senior | Team 10,001+ | Since 1993 | H1B Sponsor

Location

California + 1 more

Posted

106 days ago

Salary

$184K - $356.5K / year

Seniority

Senior

Bachelor Degree | 8 yrs exp | English | Docker, Kubernetes, Linux, Python, Rust

Job Description

• Provide leadership and strategic mentorship on the management of large-scale HPC systems including the deployment of compute, networking, and storage
• Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions
• Build and nurture customer and cross-team relationships to consistently support the clusters and address changing user needs
• Support our researchers to run their workloads including performance analysis and optimizations
• Conduct root cause analysis and suggest corrective action
• Proactively find and fix issues before they occur
• Build innovative tooling to accelerate researchers' velocity, troubleshooting, and software performance at scale

Job Requirements

Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience
Minimum of 8 years of experience crafting and operating large-scale compute infrastructure
Experience with AI/HPC job schedulers and orchestrators, such as Slurm, K8s or LSF
Applied experience with AI/HPC workflows that use MPI and NCCL
Proficiency with Linux, including CentOS/RHEL and/or Ubuntu distributions
A solid understanding of container technologies like Enroot, Docker and Podman
Proficiency in one scripting language (Python, Bash) and at least one compiled language (Golang, Rust, C, C++...)
Experience analyzing and tuning performance for a variety of AI/HPC workloads
Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions
Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals
Passion for continual learning and staying ahead of new technologies and effective approaches in the HPC and AI/ML infrastructure fields.

Benefits

equity
benefits

More Machine Learning Engineer Jobs

Senior Software Engineer, Machine Learning

AssemblyAI

Offering speech-to-text APIs for modern developers, AssemblyAI is ultimately on a mission to use the latest deep learning technology to build practical products that make futuristi

Machine Learning Engineer | 106 days ago

Full Time | Remote

• Design and implement tooling that enables researchers to quickly deploy and evaluate new models in production
• Design, build, and maintain high-performance, cost-efficient inference pipelines, making architectural decisions about scaling, reliability, and cost trade-offs
• Proactively identify and resolve infrastructure bottlenecks, proposing and scoping improvements to iteration speed and production reliability
• Develop and maintain user-facing APIs that interact with our ML systems
• Implement comprehensive observability solutions to monitor model performance and system health
• Troubleshoot and lead resolution of complex production issues across distributed systems, driving root-cause analysis and implementing preventive measures
• Set the direction for and continuously improve our MLOps practices, identifying the highest-impact opportunities to reduce friction between research and production.
• Collaborate closely with research and engineering teams to align on technical direction, and help onboard and mentor engineers on ML infrastructure best practices.

AWS, Distributed Systems, Kubernetes, Python, PyTorch

Europe

$195K - $225K / year

Job Closed

Founding Senior Machine Learning Engineer

SentiLink

SentiLink Stops Identity Fraud

Machine Learning Engineer | 106 days ago

Other | Remote | Team 51-200 | Since 2017 | H1B Sponsor

• Own SentiLink’s real-time ML model monitoring domain, leading the design, implementation, and ongoing improvement of monitoring systems and workflows.
• Own our ML experimentation, model tracking, and versioning infrastructure, ensuring strong reproducibility and visibility across the model lifecycle.
• Drive improvements to the model development process, reducing inefficiencies, improving code quality, resolving DS tooling gaps, and enabling faster iteration.
• Serve as the primary technical owner of key touchpoints and interfaces between Data Science and Engineering/Infrastructure, defining standards and workflows.
• Support efforts to optimize model behavior in production, including latency, reliability, maintainability, and operational best practices.
• Investigate and diagnose model performance issues on an ad-hoc basis, including partner escalations and analysis of model behavior in real-world scenarios.
• Evaluate, prototype, and recommend new ML infrastructure, tools, and data capabilities, partnering with DS to validate impact and support adoption.

Docker, Kubernetes, Python, SQL

United States

$170K - $240K / year

Job Closed

Education Engineer, Machine Learning

LangChain

Machine Learning Engineer | 106 days ago

Other | Remote | Team 11-50 | H1B No Sponsor

• Collaborate with LangChain engineers to develop educational content that teaches agentic evaluation, monitoring and refinement using LangSmith, LangChain and LangGraph.
• Design curriculum and structured learning paths for our community of over 1 million developers and agent builders.
• Create and deliver content across multiple formats:
  • Online courses for LangChain Academy, video tutorials, and webinars
  • Live presentations at workshops, hackathons, meetups, and conferences
• Build and maintain example projects, code demos, and visuals to support educational content.
• Translate experimental applied AI code and internal agent evaluation techniques into crisp, developer-friendly learning materials.

Utah

$175K - $195K / year

Job Closed

Machine Learning Engineer

Dropbox

Dropbox is the one place to keep life organized and keep work moving.

Machine Learning Engineer | 106 days ago

Full Time | Remote | Team 1,001-5,000 | Since 2007 | H1B Sponsor

• Work with large-scale data systems and infrastructure
• Help productionize multimodal and semantic retrieval systems at scale, powering Dash’s multimedia and creative search experiences.
• Partner with product, design, and infrastructure teams to improve retrieval, ranking, and conversational experiences across image, video, and text content.
• Build and iterate on quick prototypes and experimental features, driving innovation in multimodal interaction and creative workflows.
• Run quality and performance benchmarks across individual components and end-to-end systems to identify optimization opportunities.
• Contribute to open source projects and leverage OSS tools for efficient inference and scaling.
• On-call work may be necessary occasionally to help address bugs, outages, or other operational issues.

Keras, Python, PyTorch, scikit-learn, TensorFlow

Germany

€125K - €169.1K / year

Job Closed
