Job Closed
This listing is no longer active.
The Power to Predict. See the future in your data.
MLOps Team Lead
Location
Europe
Posted
102 days ago
Seniority
Senior
Job Description
MLOps Team Lead
Fundamental
• Lead and mentor a team of MLOps engineers, fostering technical growth and a culture of operational excellence
• Define and drive the MLOps roadmap, aligning infrastructure capabilities with Research, Engineering and product objectives
• Establish best practices, standards, and processes for ML infrastructure, deployment, and operations
• Own technical decision-making for ML infrastructure architecture and tooling choices
• Architect and oversee scalable, automated machine learning pipelines, CI/CD workflows, and orchestration frameworks
• Drive the design and implementation of robust model serving infrastructure using platforms like Triton, TorchServe, TensorFlow Serving, and KServe
• Define inference architecture strategy optimized for ultra-low latency and high throughput
• Design and maintain feature stores, robust data pipelines, and scalable storage solutions to efficiently handle large volumes of data
• Collaborate with research teams to bridge the gap between experimentation and production
• Define logging, alerting, and monitoring strategy to track model performance, drift, and system reliability
Job Requirements
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent practical experience)
- 7+ years of experience in MLOps, with 3+ years in a technical leadership role
- Strong software engineering skills in Python, with experience in Bash and/or Go
- Proven track record of building and leading high-performing MLOps or infrastructure teams
- Experience building and designing MLOps infrastructure from the ground up
- Deep experience with MLOps platforms (MLflow, WandB, etc.) and frameworks (PyTorch, TensorFlow, etc.)
- Deep experience with model serving frameworks (Triton, TorchServe, TensorFlow Serving, KServe) for high scalability and low latency inference
- Experience building and managing data pipelines to support both model training and inference
- Good experience with Kubernetes on a major cloud provider (AWS, GCP, or Azure) and with infrastructure as code (Terraform, Helm, GitOps)
- Proficient with observability and monitoring tools (Prometheus, Grafana, Datadog, OpenTelemetry)
- Excellent communication skills with ability to translate between research and production contexts
Benefits
- Competitive compensation with salary and equity
- Comprehensive health coverage, including medical, dental, and vision, plus a 401(k) plan
- Fertility support, as well as paid parental leave for all new parents, inclusive of adoptive and surrogate journeys
- Relocation support for employees moving to join the team in one of our office locations
- A mission-driven, low-ego culture that values diversity of thought, ownership, and bias toward action
More Machine Learning Engineer Jobs
Senior Machine Learning Engineer
MenT
MenT is an executive search firm that has been helping companies solve their recruitment problems since 2001.
• A Senior ML Engineer is responsible for designing, implementing, and maintaining AI systems across various applications.
• They contribute to the organization's AI strategy, work on complex solutions and optimize existing systems to enhance performance.
• Responsibilities include mentoring junior ML engineers and collaborating with cross-functional teams.
• Develop AI applications and solutions by understanding business needs, collaborating with stakeholders, analyzing data, and implementing AI algorithms.
• Design, develop, and maintain robust AI systems, including machine learning models and deep learning networks.
• Document and demonstrate solutions with clear technical documentation, diagrams, and code comments.
• Contribute to the organization’s AI strategy by researching cutting-edge tools and techniques, participating in educational opportunities, and maintaining professional networks.
• Identify and resolve performance and scalability issues in AI applications by improving software and addressing bottlenecks and bugs.
• Lead and collaborate with cross-functional teams to define and implement innovative AI solutions, optimizing user interaction and experience.
• Conduct code reviews and mentor team members to uphold high coding standards.
• Translate business requirements into actionable technical requirements.
• Work closely with data engineering and data science teams to implement automated and unit testing.
• Improve operations by analyzing systems and recommending procedural changes.
• Support engineering goals by delivering project outcomes as needed.
• Design, build, and operate production-grade AI/ML systems that power Asteri’s orchestration platform
• Collaborate closely with backend, platform, and frontend engineers to integrate AI capabilities into scalable, reliable product workflows
• Deploy and iterate on LLM-based applications in production, continuously evaluating quality, latency, and cost
• Own retrieval and agentic systems (e.g., RAG pipelines, workflow agents, policy-driven logic) end-to-end
• Define and run rigorous evaluation and testing for AI systems, including offline experiments and production monitoring
• Improve model performance and system behavior over time through experimentation, tuning, and system-level optimizations
• Implement strong engineering practices for AI development, including testing, CI/CD, versioning, and rollback strategies
• Stay current with advances in applied AI and generative models and translate relevant techniques into practical product improvements
• Partner with cross-functional teams to understand product requirements and translate them into robust AI solutions
Machine Learning Engineer – Content Safety Platform
Canva
Founded in 2012, Canva offers an online graphic design and publishing platform used by millions of people across the globe. As an employer, Canva offers flexibl
• Own end-to-end delivery of ML-based safety features, from technical design through production rollout and iteration
• Build and maintain ML models that safeguard AI-generated content across multiple modalities (images, video, audio, text), detecting harmful content, IP violations, bias, and other safety concerns
• Design and implement RAG (Retrieval-Augmented Generation) architectures and other advanced ML systems to enhance detection capabilities
• Fine-tune and evaluate LLM-based models for content moderation and prompt filtering, making data-driven decisions about model selection and optimization
• Collaborate with Legal, Product Policy, and AI product teams to define requirements, balance safety with user experience, and deliver compliant solutions
• Create evaluation frameworks to measure model quality, safety coverage, false positive/negative rates, and policy alignment
• Monitor production systems, respond to incidents, and maintain operational excellence through documentation and runbooks
• Propose and prototype innovative solutions to real-world problems, leveraging state-of-the-art techniques in the field
• Develop and maintain core ML pipelines
• Train and deploy deep learning models for real-time applications
• Collaborate cross-functionally with camera, systems, and labeling teams
• Curate datasets for evaluating performance and comparing performance trends over time
• Provide technical mentorship to junior ML engineers



