Job Closed

This listing is no longer active.

Pathway

pathway.com - The smartest way to build Data Products

Senior ML Infrastructure – DevOps Engineer

Location

France

Posted

172 days ago

Salary

Not disclosed (based on profile and location)

Seniority

Senior

Job Description

  • Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management).
  • Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster‑tooling) and configuration management.
  • Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback.
  • Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services.
  • Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch).
  • Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges.
  • Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems.
  • Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break.

Job Requirements

  • Former or current Linux / systems / network administrator who is comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing).
  • 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads.
  • Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services.

What we are looking for

  • Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments.
  • Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch.
  • Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI).
  • Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations.
  • Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents).
  • Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management.
  • Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer.
  • Strong ownership mindset, comfort with ambiguity, and enthusiasm for scaling and hardening critical infrastructure for an ML‑heavy environment.
  • Willingness to learn.

Benefits

  • Intellectually stimulating work environment. Be a pioneer: you get to work with real-time data processing and AI.
  • Work in one of the hottest AI startups, with exciting career prospects. Team members are distributed across the world.
  • Real responsibility and the ability to make a significant contribution to the company’s success.
  • Inclusive workplace culture

Further details

  • Type of contract: Permanent employment contract.
  • Preferred joining date: Immediate.
  • Compensation: Based on profile and location.
  • Location: Remote work. Possibility to work or meet with other team members in one of our offices: Palo Alto, CA; Paris, France; or Wroclaw, Poland. Candidates based anywhere in the EU, United States, and Canada will be considered.
