Job Closed
This listing is no longer active.
ML Platform Engineer
Location
All locations: Alabama | California | Colorado | Florida | Illinois | Kentucky | Montana | Nevada | New Jersey | New York | North Carolina | Oregon | Massachusetts | Michigan | Missouri | Pennsylvania | Texas | Utah | Virginia | Washington | Wisconsin
Posted
130 days ago
Salary
$130K - $176K / year
Seniority
Senior
Job Description
ML Platform Engineer
Afresh
• You will be instrumental in elevating our core ML platform to its next level of performance, reliability, and scalability.
• You'll work on the critical infrastructure that directly enables all of Afresh's Machine Learning and Applied Science teams to innovate faster and deliver impact.
• Your contributions will empower our product suite, including our flagship Prediction Engine, to power replenishment decisions on more than 15% of all produce sold in the United States.
• In your first 3 months, you might deliver a feature that helps generalize model configuration, enables no-code model deploys for our various ML solutions, or vastly improves integration testing across our ML systems.
• By the end of your first 6 months, you will have owned the implementation of significant scalability improvements and additions to our ML platform.
Job Requirements
- BS in Computer Science or a relevant technical field.
- 3+ years of professional software development experience with a proven track record of shipping high-quality applications and services.
- Experience working collaboratively with machine learning engineers, data scientists, or applied scientists on large-scale software projects involving machine learning models.
- Deep expertise in library design, API design, data structures, and algorithms.
- Strong familiarity with Python.
Benefits
- Afresh provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability, genetics, sexual orientation, gender identity/expression, marital status, pregnancy or related condition, or any other basis protected by law.
More Platform Engineer Jobs
This description is a summary of our understanding of the job description. Click on 'Apply' button to find out more.
Role Description
As a Senior Platform Engineer at Axiomatic, you will own the reliability, deployment, and operational excellence of our AI platform. This role focuses primarily on infrastructure, CI/CD, and operations, with additional responsibilities for automation and tooling development.
- Lead deployment strategies and CI/CD pipelines across multiple environments
- Architect and maintain multi-cloud infrastructure (Azure, AWS, GCP) and on-premise deployments
- Own infrastructure as code using Terraform to automate provisioning and configuration
- Build comprehensive observability systems: monitoring, metrics, logging, and alerting
- Implement security controls, compliance frameworks, and data governance policies
- Develop automation tools, APIs, and scripts (Python) to improve operational efficiency
- Ensure system reliability, performance, and scalability
- Drive incident response, postmortems, and continuous improvement
- Troubleshoot infrastructure and application issues across multiple environments
Qualifications
- 7+ years of experience in Platform Engineering, Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
- Deployment expert: Deep experience with CI/CD pipelines, release strategies, and production deployments at scale
- Multi-cloud expertise: Hands-on experience with Azure and AWS required (GCP is a plus)
- On-premise deployment experience: Linux system administration, bare-metal provisioning, networking
- Terraform expert: Deep experience writing and maintaining infrastructure as code
- Observability systems: Proven track record building monitoring, alerting, and metrics platforms
- Security mindset: Experience implementing security controls and best practices. Security certification preferred (CISSP, CEH, AWS/Azure Security Specialty, or similar)
- Data governance: Understanding of data privacy, residency requirements, and governance frameworks
- Backend/scripting skills: Python (preferred) or Go for automation, tooling, and operational scripts
- Kubernetes and container orchestration in production
- Strong Linux/Unix administration and scripting (Bash, Python)
- CI/CD platforms: GitHub Actions, GitLab CI, Jenkins, or similar
- Version control and GitOps practices
- Strong problem-solving and debugging skills
- Fluent in English (Spanish is a plus)
Requirements
- Design and implement deployment pipelines for multi-environment releases (dev, staging, production)
- Own the full deployment lifecycle: build, test, release, and rollback strategies
- Implement blue-green deployments, canary releases, and progressive rollouts
- Build automated deployment tooling and workflows
- Ensure zero-downtime deployments and rollback capabilities
- Optimize build and deployment performance
- Manage artifact repositories and container registries
- Design and operate multi-cloud infrastructure across Azure, AWS, and GCP
- Architect and deploy on-premise solutions for enterprise customers (Linux-based)
- Manage Kubernetes clusters, container orchestration, and networking
- Implement disaster recovery, backup strategies, and business continuity
- Optimize cloud costs and resource utilization
- Define and track SLIs, SLOs, and error budgets for critical services
- Write and maintain Terraform modules for infrastructure provisioning
- Implement GitOps workflows for infrastructure changes
- Automate infrastructure scaling, updates, and operations
- Ensure reproducible and version-controlled infrastructure
- Design comprehensive monitoring, logging, and alerting (Prometheus, Grafana, Datadog, or similar)
- Build dashboards for system health, performance, and business metrics
- Implement distributed tracing for microservices
- Conduct capacity planning and performance analysis
- Drive reliability improvements through data-driven insights
- Implement security best practices: identity management, secrets management, network policies
- Work towards or maintain security certifications (SOC 2, ISO 27001, or similar)
- Conduct security audits and vulnerability remediation
- Implement data governance policies for AI pipelines and user data
- Ensure compliance with data privacy regulations (GDPR, CCPA)
- Write automation scripts and tools in Python for operational tasks
- Build internal tooling for deployments, monitoring, and incident response
- Develop runbooks, automation, and self-healing systems
- Create APIs for infrastructure operations when needed
- Maintain high code quality and testing standards for tooling
- Participate in on-call rotation and lead incident response
- Conduct blameless postmortems and drive action items
- Build and maintain incident response playbooks
- Improve system resilience and failure modes
- Partner with engineering teams on deployment strategies and architecture
- Work with security team on compliance and governance
- Mentor engineers on operational best practices
- Document systems, procedures, and runbooks
Benefits
- Opportunity to work on technology that drives innovation in AI for scientific and engineering applications
- Contribute to the development of new AI architectures that can reason coherently and produce interpretable and verifiable solutions
- Collaborate with a global team of engineers and AI specialists
- Flexible working arrangements, including hybrid or fully remote options
Company Description
Axiomatic AI is building a new class of AI systems designed to reason with the rigor of the scientific method. Our mission, 30×30, is to deliver a 30× improvement in the speed, accessibility, and cost of semiconductor and photonic hardware development by 2030.
• Build and maintain CI/CD pipelines and deployment automation using RWX, focusing on reliability, speed, and cost efficiency.
• Manage and evolve AWS infrastructure (Aurora, ElastiCache, VPC, IAM, EC2, Secrets Manager) using Infrastructure as Code with Pulumi.
• Operate, debug, and scale Kubernetes workloads in production environments.
• Improve developer experience by reducing build times, enhancing tooling, and creating self-service capabilities for engineering teams.
• Support and optimize the TypeScript monorepo build infrastructure and related tooling.
• Collaborate closely with product engineers on debugging, system design, and performance optimization.
• Participate in the on-call rotation (Tuesday-to-Tuesday) and support incident response without burnout-driven expectations.
Senior ML Platform Engineer – ML Platforms, MLOps
Software Mind
Software House focused on results since 1999
• Support and contribute hands-on to multiple ML platform POCs
• Work closely with Applied Scientists, ML Engineers, and internal platform teams
• Evaluate platform capabilities across: GPU training and experimentation, real-time and batch inference, orchestration, monitoring, and operability, multi-tenancy, isolation, and scalability
• Assess integration points with existing in-house tooling
• Perform performance and operability analysis
• Contribute technical input to: Build vs buy vs extend decisions, target platform stack recommendations, OPEX and CAPEX justification for rollout
Senior Platform Engineer
Pondurance
Delivering personalized, 24/7 MDR services that grow with your organization.
• Support and enhance Pondurance’s MDR Data Pipeline
• Focus on enhancing the durability and reliability of HashiCorp-based, containerized Vector endpoints and the systems that deploy them
• Act as a subject matter expert for GitHub Actions, Terraform, and Nomad–based CI/CD pipeline
• Own and maintain self-service deployment mechanisms for platform frontend
• Manage system and application configuration using Salt and HashiCorp tooling
• Oversee and maintain Docker-based systems
• Implement and maintain monitoring, alerting, and dataflow solutions using Datadog
• Participate in on-call rotations and root cause analysis (RCA)
• Collaborate with cross-functional teams on reliability, scalability, and continuous improvement initiatives
• Document processes, procedures, and troubleshooting steps to support knowledge sharing and team efficiency




