Lead Machine Learning Operations Engineer
Location
California | New York
Posted
1 day ago
Salary
$157K - $235K / year
Seniority
Senior
Job Description
Paramount
- Own ML production reliability strategy
  - Define and lead the operational strategy for production ML systems, including monitoring, traceability, deployment safety, incident response, and post-deployment validation.
  - Set the standards ML teams use to assess model health, performance, and trustworthiness in production.
- Own model traceability and governance
  - Ensure every production model has clear lineage (data, features, code, artifacts, validation, deployment history) and drive adoption of model registry and metadata tooling across ML teams.
- Build end-to-end ML observability
  - Design and implement monitoring across the full ML signal path: data arrival, feature freshness, distribution stability, candidate generation, ranking behavior, model metrics, serving latency, and SLA performance.
- Define production health metrics
  - Partner with ML, data, product, and business stakeholders to define post-deployment metrics covering model quality, system reliability, business guardrails, and degradation indicators.
- Detect drift and degradation proactively
  - Detect data drift, feature drift, model behavior changes, and silent failures before they impact customers via thresholding, alerting, anomaly detection, and release-over-release monitoring.
- Lead diagnostic tooling and root-cause analysis
  - Build dashboards, logs, and diagnostic workflows that progress quickly from “recommendations look off” to root cause, with context captured across candidates, features, scores, ranking decisions, and downstream outcomes.
- Own ML deployment safety
  - Define and operate automated gates that prevent bad models or bad data from being promoted to production.
  - Partner with MLEs to establish validation checks, rollback criteria, canary strategies, shadow testing, and release health reviews.
- Lead ML incident response
  - Own incident response practices for ML systems, including rollback playbooks, hotfix strategies, severity definitions, tradeoff frameworks, communications, and post-mortems.
  - Drive closure of systemic gaps after incidents rather than only resolving the immediate issue.
- Partner across ML Platform, Data, and ML
  - Partner with DevOps/Platform on infrastructure and observability needs; with Data Engineering on data quality, drift, and freshness; and with ML Engineering to embed operational requirements into development and deployment workflows.
- Set standards and mentor others
  - Act as the technical lead for ML operations: establish reusable patterns, playbooks, and standards, and mentor engineers on reliability, observability, and operational rigor.
Job Requirements
- 5+ years of experience in machine learning engineering, ML platform, applied ML, MLOps, data platform, reliability engineering, or a related technical role.
- Demonstrated experience operating production ML systems, including monitoring, deployment, incident response, model validation, data quality, or reliability ownership.
- Experience leading technical initiatives across multiple engineering teams, especially where success required influencing architecture, tooling, standards, or adoption.
- Hands-on experience with model registries, feature stores, ML metadata systems, production monitoring, model deployment pipelines, or ML observability platforms.
- Solid knowledge of end-to-end ML systems, including training data, features, model artifacts, offline validation, online serving, post-deployment metrics, and business outcome measurement.
- Ability to reason about ML operational failure modes: stale features, distribution shift, training-serving skew, delayed labels, and offline-online metric gaps.
- Solid SQL skills and comfort investigating data quality, feature distributions, model outputs, pipeline behavior, and production anomalies.
- Track record of cross-functional collaboration with Platform, Data, and ML Engineering to deliver production-grade operational capabilities.
- Solid written and verbal communication skills, including the ability to explain ML system health, risks, incidents, and tradeoffs to both technical and non-technical stakeholders.
Benefits
- medical
- dental
- vision
- 401(k) plan
- life insurance coverage
- disability benefits
- tuition assistance program
- PTO

