Featherless AI

Serverless AI Inference - run any model, at any scale, without managing GPUs

Machine Learning Engineer – Training Optimization

Machine Learning Engineer, Full Time, Remote, Senior, Team 1-10, Since 2023, H1B No Sponsor

Location

Worldwide

Posted

142 days ago

Salary

Not listed

Seniority

Senior

Job Description

  • Optimize large-scale model training pipelines (throughput, convergence, stability, and cost)
  • Improve distributed training strategies (data, model, and pipeline parallelism)
  • Tune optimizers, schedulers, batch sizing, and precision (bf16 / fp16 / fp8)
  • Reduce training time and compute cost via profiling, bottleneck analysis, and systems-level improvements
  • Collaborate with researchers on architecture-aware training strategies
  • Build and maintain robust training infrastructure (checkpointing, fault tolerance, reproducibility)
  • Evaluate and integrate new training techniques (e.g. gradient checkpointing, ZeRO, FSDP, custom kernels)
  • Own training performance metrics and continuously push them forward

Job Requirements

  • Strong experience training large neural networks (LLMs or similarly large models)
  • Hands-on experience with training optimization (not just model usage)
  • Solid understanding of:
      • Backpropagation, optimization algorithms, and training dynamics
      • Distributed systems for ML training
  • Experience with PyTorch (required)
  • Comfort working close to hardware (GPUs, memory, networking constraints)
  • Ability to move fluidly between research ideas and production-ready code

Nice to Have

  • Experience with large-scale distributed training (multi-node, multi-GPU)
  • Familiarity with DeepSpeed, FSDP, Megatron, or custom training stacks
  • Experience optimizing training on AMD or NVIDIA GPUs
  • Contributions to open-source ML infrastructure or research codebases
  • Exposure to non-Transformer architectures (RNNs, hybrid models, etc.)

Benefits

  • Competitive compensation + meaningful equity

Job Closed