Serverless AI Inference - run any model, at any scale, without managing GPUs

Machine Learning Engineer – Training Optimization

Machine Learning Engineer, Full Time, Remote, Senior, Team 1-10, Since 2023, H1B No Sponsor

Location

Worldwide

Posted

142 days ago

Salary

Seniority

Senior

English, Distributed Systems, Node.js, PyTorch

Job Description

• Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
• Improve distributed training strategies (data, model, and pipeline parallelism)
• Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
• Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
• Collaborate with researchers on architecture-aware training strategies
• Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
• Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)
• Own training performance metrics and continuously push them forward

Job Requirements

• Strong experience training large neural networks (LLMs or similarly large models)
• Hands-on experience with training optimization (not just model usage)
• Solid understanding of:
  • Backpropagation, optimization algorithms, and training dynamics
  • Distributed systems for ML training
• Experience with PyTorch (required)
• Comfort working close to hardware (GPUs, memory, networking constraints)
• Ability to move fluidly between research ideas and production-ready code

Nice to Have

• Experience with large-scale distributed training (multi-node, multi-GPU)
• Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks
• Experience optimizing training on AMD or NVIDIA GPUs
• Contributions to open-source ML infrastructure or research codebases
• Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)

Benefits

Competitive compensation + meaningful equity

Related Categories

Machine Learning Engineer, AI Engineer, AI Research Scientist, LLM Engineer, Computer Vision Engineer, NLP Engineer

More Machine Learning Engineer Jobs

Machine Learning Engineer – Platform

Artera.net

Artera is a Swiss ISP that provides premium hosting and cloud services.

Machine Learning Engineer, 142 days ago

Full Time, Remote, Team 11-50, Since 2002, H1B No Sponsor

• Work on the AI Platform team focusing on scalable and efficient pipelines for model training, evaluation, and data processing
• Build and evolve core libraries used by AI scientists to develop, launch, and monitor AI products
• Optimize GPU and CPU efficiency and data throughput of large-scale foundation models
• Ensure Artera’s observability infrastructure provides a clear picture of model performance optimization

AWS, Docker, Kubernetes, Node.js, Python, PyTorch, Ray, TensorFlow, Terraform

California

$140K - $180K / year

Senior Machine Learning Engineer

A3Data

Machine Learning Engineer, 142 days ago

Full Time, Remote, Team 51-200, H1B No Sponsor

• Define and implement scalable, reproducible, monitorable, production-ready Machine Learning architectures.
• Develop, evolve, and maintain production Machine Learning pipelines and services, ensuring reliability and performance.
• Deploy highly available models and pipelines with a focus on MLOps, CI/CD, and automation.
• Collaborate with data scientists, data engineers, developers, and business stakeholders.
• Diagnose and resolve complex issues related to models and pipelines in production.
• Lead technical discussions and workshops, and support architectural decisions with teams and clients.
• Contribute to raising the client's and A3 Data's technical maturity by promoting best practices.

AWS, Azure, Docker, GCP, Kubernetes, Python, PyTorch, scikit-learn, TensorFlow

Brazil

Machine Learning Engineer – Mid-level

A3Data

Machine Learning Engineer, 142 days ago

Full Time, Remote, Team 51-200, H1B No Sponsor

• Develop, train, and improve Machine Learning models, ensuring reproducibility, scalability, and production monitoring;
• Implement and manage the model lifecycle, with versioning for code, data, metrics, and artifacts, following MLOps best practices;
• Package models as scalable, highly available services integrated into automated pipelines;
• Support and continuously improve ML solutions in production, identifying and fixing issues;
• Collaborate with Data Engineering, Data Science, and business teams in a multidisciplinary environment;
• Perform code reviews and support the technical development of more junior engineers;
• Participate in technical discussions with clients, explaining solutions, architectural decisions, and trade-offs.

Airflow, PySpark, Python, PyTorch, scikit-learn, TensorFlow

Brazil

Senior Machine Learning Scientist

Matterworks

AI-Powered Tools to Engineer Biology

Machine Learning Engineer, 143 days ago

Other, Remote, Team 11-50, H1B No Sponsor

• Design, adapt, and optimize deep learning architectures for scientific domains and data modalities.
• Own and deliver on complex ML projects, including experiment design, implementation, evaluation, and iteration based on results.
• Write clean, well-tested code in PyTorch and NumPy enabling a high experimentation rate.
• Stay current with deep learning research and its applications in chemistry and biology.
• Propose and prototype new ideas to enhance our modeling capabilities.
• Work closely with scientists and engineers across the team to integrate models into our product and infrastructure.

NumPy, Python, PyTorch

United States

Job Closed
