Pluralis Research

Protocol Learning: Multi-participant, low-bandwidth model parallel.

Machine Learning Engineer – Distributed ML Systems

Machine Learning Engineer · Full Time · Remote · Senior · Team 1-10 · H1B No Sponsor

Location

United States

Posted

64 days ago

Salary

Not specified

Seniority

Senior

Bachelor Degree · 5 yrs exp · English · Distributed Systems · gRPC · Python

Job Description

  • Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.
  • Develop and optimize parallel training strategies (data, tensor, and pipeline parallelism) with custom sharding techniques that minimize communication overhead.
  • Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
  • Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs.
  • Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.
  • Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.
  • Design and optimize peer-to-peer topologies for decentralized coordination across non-co-located nodes.
  • Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
  • Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.

Job Requirements

  • Strong experience building and operating distributed systems in production
  • Hands-on expertise with distributed training frameworks (FSDP, DeepSpeed, Megatron, or similar)
  • Deep understanding of parallelism strategies (data, tensor, and pipeline parallelism)
  • Expert-level Python with production experience (concurrency, error handling, retry logic, clean architecture)
  • Strong networking fundamentals: P2P systems, gRPC, routing, NAT traversal, distributed coordination
  • Experience optimizing GPU workloads, memory management, and large-scale compute efficiency

Benefits

  • Equity-heavy compensation with meaningful ownership in a mission-driven company
  • Competitive base salary for senior engineering roles in Australia
  • Visa sponsorship available for exceptional candidates
  • Remote-first with optional access to our Melbourne hub
  • World-class team: teammates were previously at Google, Amazon, Microsoft, and leading startups

Related Job Pages

More Machine Learning Engineer Jobs

Full Time · Remote · Team 51-200 · H1B No Sponsor

  • Package and version machine learning models (MLflow, SageMaker Model Registry)
  • Select and implement appropriate AWS services (SageMaker, Lambda, ECS/EKS, API Gateway, among others)
  • Build and maintain CI/CD pipelines, ensuring automated testing, build, and deployment
  • Automate deployments across multiple environments (dev/staging/prod) with security and rollback
  • Expose models for consumption by other services (via endpoints or Lambdas)
  • Configure and track production monitoring (CloudWatch, logs, metrics)
  • Collaborate with cross-functional teams to ensure efficient, secure, and scalable solutions

Brazil
Full Time · Remote · Team 51-200 · H1B No Sponsor

  • Develop, program, and test machine learning systems
  • Design architectures for reproducible, scalable, and monitored Machine Learning solutions
  • Assist the data science team in designing and building model deployment pipelines
  • Build data engineering processes/pipelines within the data environment
  • Define architecture standards, procedures, and tooling
  • Define data layers and their intended usage
  • Implement, fine-tune, and monitor LLM models (e.g., Bedrock, OpenSearch, HuggingFace)
  • Build inference pipelines for RAG-based responses (Retrieval-Augmented Generation)
  • Ensure versioning and reuse of trained models and embeddings
  • Work on continuous improvement of response relevance and source ranking
  • Work on reducing token costs using optimization techniques
  • Support integrations with AI APIs, vector stores, and document repositories
  • Communicate information clearly with technical and business team members
  • Help identify potential errors and report inconsistencies

Brazil

Staff Software Engineer, Machine Learning

ZipRecruiter

ZipRecruiter is a leading online employment marketplace, actively connecting people to their next great opportunity.

Full Time · Remote · Team 1,001-5,000 · Since 2010 · H1B Sponsor

  • Design, develop, and maintain advanced machine learning models and algorithms to solve complex business problems
  • Lead complex machine learning projects that solve business challenges and add value to the broader ZipRecruiter Business
  • Design the overall architecture and infrastructure for machine learning systems, ensuring scalability, efficiency, and robustness
  • Push the boundaries of what's possible in machine learning in your organization, finding new and innovative ways to use AI to drive business value
  • Provide leadership and set tone for Machine Learning Engineers at ZipRecruiter leveraging industry best practices and innovation
  • Stay up-to-date with the latest developments in machine learning and AI, and drive adoption of new techniques and technologies as appropriate

United States
$180K - $225K / year

Senior Software Engineer, Machine Learning

ZipRecruiter

ZipRecruiter is a leading online employment marketplace, actively connecting people to their next great opportunity.

Full Time · Remote · Team 1,001-5,000 · Since 2010 · H1B Sponsor

  • Design, develop, and maintain machine learning models and algorithms to solve complex business problems
  • Identify patterns, trends, and anomalies in the data, and visualize insights using appropriate tools
  • Assess the performance of machine learning models using appropriate metrics, validation techniques, and testing datasets
  • Discover opportunities to optimize models by fine-tuning hyperparameters, feature selection, or employing regularization techniques to improve accuracy, performance, and scalability

United States
$140K - $225K / year