Job Closed
This listing is no longer active.
Reddit is an online platform utilized by thousands of communities to connect and converse about a wide variety of topics, including TV and movie fan theories, s
Senior Machine Learning Engineer, ML Training Platform
Location
United States
Posted
71 days ago
Salary
$216.7K - $303.4K / year
Seniority
Senior
Job Description
• Lead the building, testing, and maintenance of ML training infrastructure at Reddit.
• Play a pivotal role in designing, building, and optimizing the infrastructure and tooling required to support large-scale machine learning workflows.
• Evolve the MLE experience, from provisioning interactive GPU environments through large-scale training, supporting on-demand and self-service workflows.
• Kubernetes Automation: Write custom Kubernetes Controllers and Operators to manage the lifecycle of interactive Jupyter workspaces and long-running ML training jobs, handle auto-idling, and ensure fault tolerance.
• GPU Orchestration: Work with the underlying compute team to ensure MLEs have efficient access to training hardware resources and handle resource contention gracefully.
• Developer Experience (DevX): Treat internal MLEs as your customers. Conduct user research, reduce friction in the "Idea-to-Prototype" loop, and standardize software environments (Docker images, Python dependency management).
Job Requirements
- 5+ years of software engineering experience, with a focus on Platform Engineering, ML Infrastructure, or Backend Systems.
- Deep Kubernetes Expertise: You know K8s beyond just "deploying pods." You understand CRDs, Controllers and the Operator pattern.
- Jupyter Ecosystem Knowledge: Experience customizing JupyterHub, JupyterLab extensions, or building similar interactive computing platforms.
- Strong Coding Skills: Proficiency in Python (for the ML ecosystem) and Go (for Kubernetes controllers/infrastructure tooling).
- GPU Experience: Hands-on practice with CUDA environments, GPU virtualization/containerization, and doing it all within Kubernetes.
- Cloud Provider Experience: Familiarity with both managed ML offerings (Vertex AI, SageMaker, etc.) and building custom ML components in AWS and/or GCP.
- Experience working with distributed training frameworks, including Ray and Kubernetes.
- Comfortable with distributed systems, big data (petabyte scale), and data-intensive systems.
- Strong focus on scalability, reliability, performance, and ease of use. You are an undying advocate for platform users and have a deep intuition for the machine learning development lifecycle.
- Strong organizational & communication skills.
Benefits
- Comprehensive Healthcare Benefits and Income Replacement Programs
- 401k Match
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Flexible Vacation & Reddit Global Days off
- Generous paid Parental Leave
- Paid Volunteer time off
More Machine Learning Engineer Jobs
Principal Software Engineer, ML Infrastructure
Motional
We're making driverless vehicles a safe, reliable, and accessible reality.
Mission Summary
Our team builds the foundational infrastructure that empowers Machine Learning Engineers to develop the next generation of self-driving technology. We design and operate the high-performance, large-scale systems that process petabytes of vehicle data, run massive simulations, and train complex models. This is a software engineering role focused on building robust platforms, not an SRE position. As a Principal Engineer, you will be the technical cornerstone of the team, setting the architectural vision and leading the development of critical infrastructure used by hundreds of engineers.
What you'll be doing:
- Architect and lead the development of highly scalable, distributed infrastructure services on Kubernetes.
- Set the technical vision and roadmap for the team, making high-impact decisions that accelerate AV development.
- Mentor and guide other engineers on the team, elevating the group's technical abilities through design reviews, pair programming, and direct feedback.
- Get hands-on to build and deploy the most complex components of the platform, from high-throughput data processing services to core compute orchestration.
What we're looking for:
- 7+ years of professional software engineering experience.
- BS, MS, or PhD in Computer Science or a related technical field.
- Hands-on experience with Kubernetes (k8s) is a must. You have designed, built, and operated critical services on Kubernetes at a significant scale.
- Proven experience building large-scale infrastructure. You have a deep understanding of the challenges in developing high-throughput, reliable distributed systems.
- Expertise in Python, Go, or a similar language.
- Strong experience with a major cloud provider (AWS, GCP, Azure).
- Demonstrated ability to lead complex, cross-functional projects from concept to production.
The salary range for this role is an estimate based on a wide range of compensation factors including but not limited to specific skills, experience and expertise, role location, certifications, licenses, and business needs. The estimated compensation range listed in this job posting reflects base salary only. This role may include additional forms of compensation such as a bonus or company equity. The recruiter assigned to this role can share more information about the specific compensation and benefit details associated with this role during the hiring process.
Candidates for certain positions are eligible to participate in Motional’s benefits program. Motional’s benefits include but are not limited to medical, dental, vision, 401k with a company match, health saving accounts, life insurance, pet insurance, and more.
Salary Range: $200,000 - $275,000 USD
Motional is a driverless technology company making autonomous vehicles a safe, reliable, and accessible reality.
We’re driven by something more. Our journey is always people first. We aren't just developing driverless cars; we're creating safer roadways, more equitable transportation options, and making our communities better places to live, work, and connect. Our team is made up of engineers, researchers, innovators, dreamers and doers, who are creating a technology with the potential to transform the way we move.
Higher purpose, greater impact. We’re creating first-of-its-kind technology that will transform transportation. To do so successfully, we must design for everyone in our cities and on our roads. We believe in building a great place to work through a progressive, global culture that is diverse, inclusive, and ensures people feel valued at every level of the organization. Diversity helps us to see the world differently; it’s not only good for our business, it’s the right thing to do.
Scale up, not starting up. Our team is behind some of the industry's largest leaps forward, including the first fully-autonomous cross-country drive in the U.S., the launch of the world's first robotaxi pilot, and operation of the world's longest-standing public robotaxi fleet. We’re driven to scale; we’re moving towards commercialization of our technology, and we need team members who are ready to embrace change and challenges.
Formed as a joint venture between Hyundai Motor Group and Aptiv, Motional is fundamentally changing how people move through their lives. Headquartered in Boston, Motional has operations in the U.S. and Asia. For more information, visit www.Motional.com and follow us on Twitter, LinkedIn, Instagram and YouTube.
Motional AD Inc. is an EOE. We celebrate diversity and are committed to creating an inclusive environment for all employees. To comply with Federal Law, we participate in E-Verify. All newly-hired employees are queried through this electronic system established by the DHS and the SSA to verify their identity and employment eligibility.
AI/ML Engineer
Orbofi AI
The Ultimate AI-generated content factory and AI engine, for web3, games, and every online community
• Designing, implementing, and maintaining cutting-edge AI models
• Collaborating with cross-functional teams to define, develop, and deploy AI-generated content solutions
• Conducting research on the latest advancements in AI/ML
• Optimizing existing AI models and algorithms
• Ensuring the quality, accuracy, and robustness of AI-generated content
• Developing and maintaining clear documentation for AI models, algorithms, and systems
MLOps Engineer
Moody's
Moody's is a public company that provides credit ratings, analysis, research, and other economic tools designed to "contribute to transparent and integrated fin
• Work closely with the Data Science, Data Engineering, and DevOps teams to deploy machine learning models
• Execute continuous integration and continuous delivery (CI/CD) activities to release ML code and ML pipelines into a production environment
• Maintain the machine learning pipeline and make sure everything is running accurately and reliably
• Liaise with senior stakeholders across the Data function and the wider business
• Use industry best practices such as code reviews, pull requests, and peer testing to ensure high-quality AI/ML deliverables
• Build AI/ML model performance benchmarking, evaluation, and monitoring capabilities, and facilitate resolution of issues with the appropriate teams
Machine Learning Engineer
Sunergi
Sunergi is a solar data company specializing in investment-grade market expansion reporting.
• Implement data pipelines focused on machine learning applications.
• Develop data sets for POCs to demonstrate new insights.
• Partner with various cross-functional teams to define, develop, and implement data technology solutions.
• Identify requirements for new features, propose designs, and drive the solution.