Protocol Learning: Multi-participant, low-bandwidth model parallel.
Machine Learning Engineer – Distributed ML Systems
Location
United States
Posted
64 days ago
Seniority
Senior
Job Description
Pluralis Research
- Design and implement large-scale distributed training systems optimized for heterogeneous hardware operating under low-bandwidth, high-latency conditions.
- Develop and optimize model-parallel training strategies (data, tensor, pipeline parallelism) with custom sharding techniques that minimize communication overhead.
- Optimize GPU utilization, memory efficiency, and compute performance across distributed nodes.
- Implement robust checkpointing, state synchronization, and recovery mechanisms for long-running, fault-prone training jobs.
- Build monitoring and metrics systems to track training progress, model quality, and system bottlenecks.
- Architect resilient training systems where nodes can fail, networks can partition, and participants can dynamically join or leave.
- Design and optimize peer-to-peer topologies for decentralized coordination across non-co-located nodes.
- Implement NAT traversal, peer discovery, dynamic routing, and connection lifecycle management.
- Profile and optimize communication patterns to reduce latency and bandwidth overhead in multi-participant environments.
Job Requirements
- Strong experience building and operating distributed systems in production
- Hands-on expertise with distributed training frameworks (FSDP, DeepSpeed, Megatron, or similar)
- Deep understanding of model parallelism (data, tensor, pipeline parallelism)
- Expert-level Python with production experience (concurrency, error handling, retry logic, clean architecture)
- Strong networking fundamentals: P2P systems, gRPC, routing, NAT traversal, distributed coordination
- Experience optimizing GPU workloads, memory management, and large-scale compute efficiency
Benefits
- Equity-heavy compensation with meaningful ownership in a mission-driven company
- Competitive base salary for senior engineering roles in Australia
- Visa sponsorship available for exceptional candidates
- Remote-first with optional access to our Melbourne hub
- World-class team: teammates were previously at Google, Amazon, Microsoft, and leading startups
More Machine Learning Engineer Jobs
- Package and version machine learning models (MLflow, SageMaker Model Registry)
- Define and implement appropriate AWS services (SageMaker, Lambda, ECS/EKS, API Gateway, among others)
- Build and maintain CI/CD pipelines, ensuring automated testing, build, and deployment
- Automate deployments across multiple environments (dev/staging/prod) with security and rollback
- Expose models for consumption by other services (via endpoints or Lambdas)
- Configure and track production monitoring (CloudWatch, logs, metrics)
- Collaborate with cross-functional teams to ensure efficient, secure, and scalable solutions

- Develop, program, and test machine learning systems
- Design architectures for reproducible, scalable, and monitored machine learning solutions
- Assist the data science team in designing and building model deployment pipelines
- Build data engineering processes and pipelines within the data environment
- Define architecture standards, procedures, and tooling
- Define data layers and their intended usage
- Implement, fine-tune, and monitor LLMs (e.g., Bedrock, OpenSearch, HuggingFace)
- Build inference pipelines for RAG-based (Retrieval-Augmented Generation) responses
- Ensure versioning and reuse of trained models and embeddings
- Work on continuous improvement of response relevance and source ranking
- Work on reducing token costs using optimization techniques
- Support integrations with AI APIs, vector stores, and document repositories
- Communicate information clearly with technical and business team members
- Help identify potential errors and report inconsistencies
Staff Software Engineer, Machine Learning
ZipRecruiter
ZipRecruiter is a leading online employment marketplace, actively connecting people to their next great opportunity.
- Design, develop, and maintain advanced machine learning models and algorithms to solve complex business problems
- Lead complex machine learning projects that solve business challenges and add value to the broader ZipRecruiter business
- Design the overall architecture and infrastructure for machine learning systems, ensuring scalability, efficiency, and robustness
- Push the boundaries of what's possible in machine learning in your organization, finding new and innovative ways to use AI to drive business value
- Provide leadership and set the tone for Machine Learning Engineers at ZipRecruiter, leveraging industry best practices and innovation
- Stay up-to-date with the latest developments in machine learning and AI, and drive adoption of new techniques and technologies as appropriate
Senior Software Engineer, Machine Learning
ZipRecruiter
ZipRecruiter is a leading online employment marketplace, actively connecting people to their next great opportunity.
- Design, develop, and maintain machine learning models and algorithms to solve complex business problems
- Identify patterns, trends, and anomalies in the data, and visualize insights using appropriate tools
- Assess the performance of machine learning models using appropriate metrics, validation techniques, and testing datasets
- Discover opportunities to optimize models by fine-tuning hyperparameters, feature selection, or employing regularization techniques to improve accuracy, performance, and scalability


