Job Closed

This listing is no longer active.

Looking for someone who:
Has built scalable backend systems from scratch
Has configured and worked extensively with AWS services
Has migrated or refactored outdated infrastructure
Has resolved production job failures
Has worked extensively with containerized solutions

Principal Machine Learning Engineer

AI Engineer · Machine Learning Engineer · Other · Remote

Location

United States

Posted

101 days ago

Job Description

This role reports to the Director of MLE and works closely with Engineering, Data Science, Product, and the Principal SRE. You will influence cross team platform standards and help elevate engineering rigor across ML and infrastructure. In addition to system design, you will mentor engineers on ML reliability, architecture decision making, and operational excellence.

Own ML systems architecture
Define ML lifecycle standards
Push event driven ML integration
Design model packaging and deployment strategy
Introduce systemic improvements
Reduce architectural and data debt
Establish testing and QA standards across ML workflows

Build a resilient, scalable ML platform that:
Trains distributed models at scale
Supports event driven feature computation
Enables portable model deployment (internal + external)
Standardizes ML lifecycle across products
Aligns infrastructure to product usage patterns

ML Platform Architecture
Define and evolve training orchestration standards
Batch vs. streaming inference strategy
Feature store direction
State store patterns and tooling
CPU/GPU scaling strategy
When to extend current tooling, and when to replace it

Event Driven ML Integration
Design feature pipelines as first-class ML system components
Integrate queuing and event systems with ML workflows
Build reactive retraining triggers
Define model drift detection and automated response systems
Ensure retraining pipelines are reproducible and fault tolerant

Model Packaging & Distribution
Define model artifact standardization
Deterministic builds
Dependency isolation
Runtime configuration injection
Security constraints
Version compatibility contracts

ML Observability, Testing & Reliability Standards
Define model performance SLIs
Drift detection frameworks
Data freshness guarantees
Latency SLOs
Model failure modes
Establish standards for:
Automated testing of feature pipelines
Training pipeline validation
Model artifact verification
CI/CD workflows for ML systems
Safe promotion from experiment to production
Work closely with the Principal SRE to integrate telemetry and operational standards across the full stack.

Operational Excellence & On Call
You will help define and operate a sustainable ML on call model in partnership with Engineering and SRE. This includes:
Clear ownership boundaries between ML systems and infrastructure
Incident classification and severity alignment
Runbooks for model failures and data drift
Postmortem processes focused on systemic improvement
Reducing operational toil through automation
You are comfortable being accountable for production ML systems, as well as designing systems that make firefighting rare.

Reduce Data Architecture Debt
Evaluate service landscape alignment to product usage
Improve or redefine streaming feature architecture
Reduce batch rigidity
Recommend infrastructure simplifications

Job Requirements

10–15+ years of experience building and operating production systems
Bachelor’s degree in Computer Science, Engineering, Mathematics, or a related field, or equivalent practical experience. Advanced degrees are welcome but not required.
Deep production experience with distributed ML systems
Strong PyTorch and large-scale data engineering expertise
Experience with Ray or comparable distributed frameworks
Experience operating ML systems in production at scale
Exposure to event driven architectures
Experience improving testing and CI/CD practices for ML workflows
Adtech experience preferred but not required
Strong architectural opinions backed by real production experience

Current Technology Environment
ML Frameworks: PyTorch, Ray (Train, Tune, Datasets), PySpark ML
Data Platform: Databricks (Delta Lake, Unity Catalog), Snowflake, AWS (S3, EC2)
MLOps: MLflow (experiment tracking, model registry), GitHub Actions
Observability: Prometheus, Grafana, Datadog
Languages: Python, SQL, JavaScript/TypeScript
External LLM integrations (AWS Bedrock and OpenAI)

What They’re Looking For
Has designed ML systems from zero
Has migrated or rebuilt broken ML infrastructure
Has owned production model failures
Understands cost implications of ML design
Challenges architectural assumptions constructively

Anticipated Interview Process
Conversational + Architecture Discussion: A live discussion focused on past systems, tradeoffs, and a collaborative diagramming and troubleshooting exercise.
Take Home GitHub Exercise: A practical ML systems exercise evaluating structure, testing, reproducibility, and clarity.
DS/MLE Deep Dive: Technical and strategic discussion around platform evolution and leadership approach.
CEO Conversation: Focused on long-term platform direction and company alignment.

Related Categories

AI Engineer, Machine Learning Engineer, AI Research Scientist, LLM Engineer, Computer Vision Engineer, NLP Engineer

More AI Engineer Jobs

Director, Client Scientific Solutions

UnitedHealth Group

UnitedHealth Group is a healthcare and well-being company that’s dedicated to improving the health outcomes of millions around the world. We are comprised of two distinct and com

AI Engineer · 101 days ago

Other · Remote

The Optum AI Director, Client Scientific Solutions manages an applied research team focused on transitioning AI models from research to production. This role emphasizes operational execution, team leadership, and implementation of best practices for model development and deployment, while contributing to strategic discussions within the scope of assigned client projects.
Manage the development and deployment of production-ready models, ensuring adherence to quality and scalability standards.
Guide the team in converting research outputs into operational solutions that meet client needs.
Work closely with product managers and engineering leads to define technical requirements and prioritize tasks for successful delivery.
Introduce and apply innovative AI techniques within client projects to improve model performance and efficiency.
Apply ethical AI principles and ensure compliance with organizational standards in all development activities.
Implement best practices for model development, testing, and deployment to maintain operational excellence.
Coach and develop team members, fostering technical growth and collaboration.

United States

Job Closed

Senior AI Engineer

ChargePoint

Driving A Better Way®

AI Engineer · 101 days ago

Full Time · Remote · Team 1,001-5,000 · Since 2007 · H1B Sponsor

• Architect and build LLM-based applications such as copilots, chatbots, and AI agents
• Design and optimize RAG and grounded AI systems using enterprise data
• Lead development of agentic workflows with tool/function calling and multi-step reasoning
• Own backend services and APIs for AI inference and orchestration
• Drive LLMOps / GenAIOps practices (evaluation, monitoring, CI/CD, versioning)
• Optimize AI systems for quality, cost, latency, and reliability
• Apply and advocate for Responsible AI and security-by-design
• Mentor engineers and influence AI engineering best practices
• Partner with product, platform, and security teams to shape AI strategy

AWS, Azure, Distributed Systems, Docker, GCP, Java, Kubernetes, Microservices, Python, TypeScript

India

Job Closed

Senior Solution Consultant

Jobgether

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best! Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time. We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.

AI Engineer · 101 days ago

Other · Remote · H1B No Sponsor

This role involves ensuring successful deployment and operationalization of cutting-edge cybersecurity solutions for our customers.
Oversee overall customer experience and delivery of services.
Build and maintain strong customer relationships.
Deploy and integrate cybersecurity capabilities across broad enterprises.
Write technical documentation and briefings.
Lead technical exchange meetings.
Collaborate with end-users and stakeholders.
Develop engineering artifacts such as system design diagrams.
Participate in development and pre-deployment testing.
Lead deployment planning and execution.
Analyze and integrate technical requirements into customer infrastructures.
Monitor system health and functionality.
Communicate technical problems and provide recommendations.

United States

Job Closed

Senior AI Solutions Specialist

Jobgether

AI Engineer · 101 days ago

Other · Remote · H1B No Sponsor

This is a high-impact, client-facing role at the intersection of artificial intelligence, enterprise performance management, and strategic partnerships. You will serve as a trusted AI subject matter expert, collaborating with partners to shape go-to-market strategies and support advanced sales initiatives. From leading tailored product demonstrations to contributing to proofs of concept and solution design, you’ll help translate complex AI capabilities into real business value. Working in a fast-evolving environment, you’ll stay ahead of emerging AI/ML trends while enabling partners with cutting-edge platform innovations. This role is ideal for a technically strong consultant who thrives in strategic conversations and enjoys bridging technology with business outcomes.
Lead and deliver tailored demonstrations showcasing AI capabilities, platform functionality, and product roadmaps to partners and stakeholders.
Act as a subject matter expert within strategic alliance initiatives, supporting partner engagement activities and events.
Collaborate with partners to understand business objectives and co-develop effective go-to-market (GTM) strategies.
Support AI-focused sales cycles by assisting with scoping discussions, proofs of concept (POCs), and solution positioning.
Enable partners and internal sales teams with technical guidance, best practices, and solution knowledge.
Develop high-quality presentations and technical demonstrations that clearly articulate advanced product features and value propositions.
Build and maintain strong technical relationships with partners to foster long-term collaboration and reference development.
Stay current with advancements in AI/ML technologies and evolving enterprise performance management solutions.

United States
