Job Closed
This listing is no longer active.
pathway.com - The smartest way to build Data Products
Senior ML Infrastructure – DevOps Engineer
Location
France
Posted
172 days ago
Seniority
Senior
Job Description
Pathway
• Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management).
• Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster‑tooling) and configuration management.
• Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback.
• Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services.
• Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch).
• Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges.
• Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems.
• Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break.
Job Requirements
- Former or current Linux / systems / network administrator who is comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing).
- 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads.
- Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services.
What we are looking for
- Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments.
- Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch.
- Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI).
- Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations.
- Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents).
- Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management.
- Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer.
- Strong ownership mindset, comfort with ambiguity, and enthusiasm for scaling and hardening critical infrastructure for an ML‑heavy environment.
- Willingness to learn.
Benefits
- Intellectually stimulating work environment. Be a pioneer: you get to work with real-time data processing and AI.
- Work in one of the hottest AI startups, with exciting career prospects. Team members are distributed across the world.
- Responsibility and the ability to make a significant contribution to the company's success.
- Inclusive workplace culture
Further details
- Type of contract: Permanent employment contract.
- Preferred joining date: Immediate.
- Compensation: Based on profile and location.
- Location: Remote work. Possibility to work or meet with other team members in one of our offices: Palo Alto, CA; Paris, France; or Wroclaw, Poland. Candidates based anywhere in the EU, United States, and Canada will be considered.



