Albert Invent

Invent the future, faster.

Staff AI/ML Platform Engineer

Platform Engineer | Other | Remote | Lead | Team 51-200 | Since 2022 | H1B No Sponsor

Location

California

Posted

129 days ago

Salary

0

Seniority

Lead

Bachelor Degree | 7 yrs exp | English | AWS | Docker | DynamoDB | Kubernetes | NoSQL | Python | Ray | Redis

Job Description

  • You'll own the APIs, data pipelines, and workflow orchestration that power our AI products, from real-time model inference to long-running optimization pipelines.
  • This role sits at the intersection of backend engineering and data engineering: you'll build the services that serve models, manage workflows, and connect AI capabilities to the structured data that makes them useful.
  • You'll work closely with our Active Learning and LLM/Agents team leads, translating their product vision into scalable, production-grade systems.
  • The infrastructure you build will power model playgrounds for chemists, inverse design pipelines that optimize experiments across high-dimensional spaces, and orchestrated agent workflows that reason through complex scientific problems.
  • Design and build high-performance Python APIs that serve models, manage workflows, and expose AI capabilities to the broader platform
  • Architect backend services for scalability, reliability, and low latency
  • Build integrations between AI/ML systems, graph databases, and external data sources
  • Build and maintain long-running workflow pipelines using Ray and Temporal
  • Design orchestration patterns for multi-step agent pipelines, batch inference, and numerical optimization workflows
  • Ensure fault tolerance, graceful degradation, and efficient resource utilization
  • Architect and maintain data pipelines that feed AI/ML workflows
  • Work with Neptune (graph), Redis, DynamoDB, and other data stores to enable efficient data access patterns
  • Implement observability including logging, metrics, tracing, and alerting
  • Own system reliability: troubleshoot issues, conduct post-mortems, and continuously improve
  • Design CI/CD pipelines and promote automation best practices

Job Requirements

  • Deep expertise in Python backend development and building production APIs
  • Experience designing and operating data pipelines and workflow orchestration systems
  • A builder's mindset—you want to create foundational systems that others build on
  • Genuine curiosity about how your work enables scientific discovery
  • A commitment to rigor: AI makes mistakes confidently, and our customers won't accept hand-waving—neither should we
  • A degree in Computer Science or a related field with 7+ years of industry experience (Bachelor's) or 5+ years (Master's or PhD) in software engineering
  • Advanced proficiency in Python including async programming and performance optimization
  • Experience building and maintaining REST APIs using FastAPI or similar frameworks
  • Experience with workflow orchestration tools (Ray, Temporal, or similar)
  • Strong background in data engineering: pipelines, transformations, and working with diverse data stores
  • Experience with cloud platforms (AWS preferred) and containerization (Docker, Kubernetes)
  • Familiarity with graph databases, key-value stores, or other NoSQL systems (Neptune, Redis, DynamoDB a plus)
  • Track record of operating production systems at scale.

Benefits

  • We care about you.
  • We love distributed teams.
  • We value diversity.

More Platform Engineer Jobs


ML Platform Engineer

Afresh

The smartest solution for fresh

Platform Engineer | 129 days ago
Other | Remote | Team 51-200 | Since 2017 | H1B Sponsor

  • You will be instrumental in elevating our core ML platform to its next level of performance, reliability, and scalability.
  • You'll work on the critical infrastructure that directly enables all of Afresh's Machine Learning and Applied Science teams to innovate faster and deliver impact.
  • Your contributions will empower our product suite, including our flagship Prediction Engine, to power replenishment decisions on more than 15% of all produce sold in the United States.
  • In your first 3 months, you might deliver a feature that helps generalize model configuration, enables no-code model deploys for our various ML solutions, or vastly improves integration testing across our ML systems.
  • By the end of your first 6 months, you will have owned the implementation of significant scalability improvements and additions to our ML platform.

All locations: Alabama | California | Colorado | Florida | Illinois | Kentucky | Montana | Nevada | New Jersey | New York | North Carolina | Oregon | Massachusetts | Michigan | Missouri | Pennsylvania | Texas | Utah | Virginia | Washington | Wisconsin
$130K - $176K / year
Job Closed

Senior Platform Engineer

Axiomatic_AI

https://www.axiomatic-ai.com/

Platform Engineer | 129 days ago
Other | Remote | Team 11-50 | Since 2024 | H1B No Sponsor

This description is a summary of our understanding of the job description.

Role Description

As a Senior Platform Engineer at Axiomatic, you will own the reliability, deployment, and operational excellence of our AI platform. This role focuses primarily on infrastructure, CI/CD, and operations, with additional responsibilities for automation and tooling development.

  • Lead deployment strategies and CI/CD pipelines across multiple environments
  • Architect and maintain multi-cloud infrastructure (Azure, AWS, GCP) and on-premise deployments
  • Own infrastructure as code using Terraform to automate provisioning and configuration
  • Build comprehensive observability systems: monitoring, metrics, logging, and alerting
  • Implement security controls, compliance frameworks, and data governance policies
  • Develop automation tools, APIs, and scripts (Python) to improve operational efficiency
  • Ensure system reliability, performance, and scalability
  • Drive incident response, postmortems, and continuous improvement
  • Troubleshoot infrastructure and application issues across multiple environments

Qualifications

  • 7+ years of experience in Platform Engineering, Site Reliability Engineering, DevOps, or Infrastructure Engineering roles
  • Deployment expert: Deep experience with CI/CD pipelines, release strategies, and production deployments at scale
  • Multi-cloud expertise: Hands-on experience with Azure and AWS required (GCP is a plus)
  • On-premise deployment experience: Linux system administration, bare-metal provisioning, networking
  • Terraform expert: Deep experience writing and maintaining infrastructure as code
  • Observability systems: Proven track record building monitoring, alerting, and metrics platforms
  • Security mindset: Experience implementing security controls and best practices. Security certification preferred (CISSP, CEH, AWS/Azure Security Specialty, or similar)
  • Data governance: Understanding of data privacy, residency requirements, and governance frameworks
  • Backend/scripting skills: Python (preferred) or Go for automation, tooling, and operational scripts
  • Kubernetes and container orchestration in production
  • Strong Linux/Unix administration and scripting (Bash, Python)
  • CI/CD platforms: GitHub Actions, GitLab CI, Jenkins, or similar
  • Version control and GitOps practices
  • Strong problem-solving and debugging skills
  • Fluent in English (Spanish is a plus)

Requirements

  • Design and implement deployment pipelines for multi-environment releases (dev, staging, production)
  • Own the full deployment lifecycle: build, test, release, and rollback strategies
  • Implement blue-green deployments, canary releases, and progressive rollouts
  • Build automated deployment tooling and workflows
  • Ensure zero-downtime deployments and rollback capabilities
  • Optimize build and deployment performance
  • Manage artifact repositories and container registries
  • Design and operate multi-cloud infrastructure across Azure, AWS, and GCP
  • Architect and deploy on-premise solutions for enterprise customers (Linux-based)
  • Manage Kubernetes clusters, container orchestration, and networking
  • Implement disaster recovery, backup strategies, and business continuity
  • Optimize cloud costs and resource utilization
  • Define and track SLIs, SLOs, and error budgets for critical services
  • Write and maintain Terraform modules for infrastructure provisioning
  • Implement GitOps workflows for infrastructure changes
  • Automate infrastructure scaling, updates, and operations
  • Ensure reproducible and version-controlled infrastructure
  • Design comprehensive monitoring, logging, and alerting (Prometheus, Grafana, Datadog, or similar)
  • Build dashboards for system health, performance, and business metrics
  • Implement distributed tracing for microservices
  • Conduct capacity planning and performance analysis
  • Drive reliability improvements through data-driven insights
  • Implement security best practices: identity management, secrets management, network policies
  • Work towards or maintain security certifications (SOC 2, ISO 27001, or similar)
  • Conduct security audits and vulnerability remediation
  • Implement data governance policies for AI pipelines and user data
  • Ensure compliance with data privacy regulations (GDPR, CCPA)
  • Write automation scripts and tools in Python for operational tasks
  • Build internal tooling for deployments, monitoring, and incident response
  • Develop runbooks, automation, and self-healing systems
  • Create APIs for infrastructure operations when needed
  • Maintain high code quality and testing standards for tooling
  • Participate in on-call rotation and lead incident response
  • Conduct blameless postmortems and drive action items
  • Build and maintain incident response playbooks
  • Improve system resilience and failure modes
  • Partner with engineering teams on deployment strategies and architecture
  • Work with security team on compliance and governance
  • Mentor engineers on operational best practices
  • Document systems, procedures, and runbooks

Benefits

  • Opportunity to work on technology that drives innovation in AI for scientific and engineering applications
  • Contribute to the development of new AI architectures that can reason coherently and produce interpretable and verifiable solutions
  • Collaborate with a global team of engineers and AI specialists
  • Flexible working arrangements, including hybrid or fully remote options

Company Description

Axiomatic AI is building a new class of AI systems designed to reason with the rigor of the scientific method. Our mission, 30×30, is to deliver a 30× improvement in the speed, accessibility, and cost of semiconductor and photonic hardware development by 2030.

All locations: United States | Spain
Job Closed

Platform Engineer

Curri

Transforming the way construction and industrial supplies are delivered.

Platform Engineer | 129 days ago
Other | Remote | Team 51-200 | Since 2018 | H1B No Sponsor

  • Build and maintain CI/CD pipelines and deployment automation using RWX, focusing on reliability, speed, and cost efficiency.
  • Manage and evolve AWS infrastructure (Aurora, ElastiCache, VPC, IAM, EC2, Secrets Manager) using Infrastructure as Code with Pulumi.
  • Operate, debug, and scale Kubernetes workloads in production environments.
  • Improve developer experience by reducing build times, enhancing tooling, and creating self-service capabilities for engineering teams.
  • Support and optimize the TypeScript monorepo build infrastructure and related tooling.
  • Collaborate closely with product engineers on debugging, system design, and performance optimization.
  • Participate in the on-call rotation (Tuesday-to-Tuesday) and support incident response without burnout-driven expectations.

California
Job Closed

Senior ML Platform Engineer – ML Platforms, MLOps

Software Mind

Software House focused on results since 1999

Platform Engineer | 129 days ago
Full Time | Remote | Team 1,001-5,000 | Since 1999 | H1B No Sponsor

  • Support and contribute hands-on to multiple ML platform POCs
  • Work closely with Applied Scientists, ML Engineers, and internal platform teams
  • Evaluate platform capabilities across: GPU training and experimentation; real-time and batch inference; orchestration, monitoring, and operability; multi-tenancy, isolation, and scalability
  • Assess integration points with existing in-house tooling
  • Perform performance and operability analysis
  • Contribute technical input to: build vs buy vs extend decisions, target platform stack recommendations, OPEX and CAPEX justification for rollout

Mali
Job Closed