Job Closed

This listing is no longer active.

Pathway

pathway.com - The smartest way to build Data Products

Senior ML Infrastructure – DevOps Engineer

DevOps Engineer | Full Time Remote Senior | Team 11-50 | H1B Sponsor

Location

France

Posted

172 days ago

Salary

Seniority

Senior

5 yrs exp | English

Airflow AWS Azure DNS Docker GCP Grafana Jenkins Kubernetes Linux Prometheus Python PyTorch Shell TensorFlow Terraform

Job Description

• Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management).
• Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster‑tooling) and configuration management.
• Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback.
• Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services.
• Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch).
• Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges.
• Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems.
• Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break.

Job Requirements

Former or current Linux / systems / network administrator who is comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing).
5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads.
Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services.

What we are looking for

Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments.
Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch.
Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI).
Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations.
Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents).
Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management.
Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer.
Strong ownership mindset, comfort with ambiguity, and enthusiasm for scaling and hardening critical infrastructure for an ML‑heavy environment.
Willingness to learn.

Benefits

Intellectually stimulating work environment. Be a pioneer: you get to work with real-time data processing and AI.
Work in one of the hottest AI startups, with exciting career prospects. Team members are distributed across the world.
Responsibility and the ability to make a significant contribution to the company’s success
Inclusive workplace culture
Further details
Type of contract: Permanent employment contract
Preferred joining date: Immediate.
Compensation: Based on profile and location.
Location: Remote work. Possibility to work or meet with other team members in one of our offices: Palo Alto, CA; Paris, France; or Wroclaw, Poland. Candidates based anywhere in the EU, United States, and Canada will be considered.

More DevOps Engineer Jobs

DevOps Engineer

Mindex

DevOps Engineer | 172 days ago

Other Remote

• Design, implement, and manage CI/CD pipelines using GitHub Actions, GitOps, GCP-native tools, Azure DevOps, and Jenkins.
• Develop, automate, and maintain infrastructure as code across multi-cloud environments (GCP, Azure; Kubernetes, Cloud Run, AKS, GKE).
• Build, deploy, and operate containerized services, ensuring health, security, scalability, and high availability.
• Integrate tools, platforms, and APIs to streamline operations across AI/ML and virtual agent ecosystems.
• Develop and maintain automation scripts and tooling using Python, Shell, and Java/Groovy where appropriate.
• Monitor system and application health, implement proactive alerting, and continuously improve reliability and performance.
• Troubleshoot infrastructure, networking, and deployment issues across cloud services and application stacks.
• Collaborate with engineering and product teams to optimize release processes and platform reliability.
• Promote DevOps best practices and maintain clear documentation for system architecture, operations, and troubleshooting.

AWS Azure Docker GCP Groovy Java Jenkins Kubernetes Python Terraform

New York

$90K - $120K / year

Job Closed

Senior Software Engineer – DevSecOps Architect

Nava

Building simple, effective government services. Want to contribute? We're hiring!

DevOps Engineer | 172 days ago

Other Remote | Team 501-1,000 | Since 2015 | H1B Sponsor

• Design, implement, and maintain the organization’s security architecture in alignment with federal security standards (e.g., FISMA, NIST SP 800-53, 800-171) and contract requirements
• Lead security planning and risk assessments for government systems hosted in AWS
• Serve as the primary security point of contact for government programs, overseeing incident response, vulnerability management, and system hardening activities
• Develop and maintain security documentation required for system authorization, including System Security Plans (SSPs), Plans of Action and Milestones (POA&Ms), Security Assessment Reports (SARs), and Continuous Monitoring strategies
• Support the Authority to Operate (ATO) process across multiple projects, working closely with compliance teams, federal partners, and internal stakeholders
• Architect, oversee and support implementation of security controls across AWS services (e.g., IAM, KMS, Security Hub, GuardDuty, CloudTrail, Config, WAF, etc.)
• Perform regular audits, security assessments, and continuous monitoring to ensure compliance with government standards and internal policies
• Collaborate with engineering teams to integrate security into SDLC/DevOps pipelines, using tools such as SonarQube, Snyk, Tenable, and Jenkins
• Lead incident response efforts for government systems, including containment, eradication, and recovery, while maintaining proper documentation and communication protocols
• Research and recommend emerging AWS security services and technologies to improve security posture and maintain compliance
• Mentor junior DevSecOps team members and foster a culture of security-first thinking across the organization
• Interface with federal agency stakeholders, auditors, and security assessors to represent the organization’s security practices and compliance efforts
• Participate in proposal development and pre-award planning by advising on security architecture and compliance strategies for new federal opportunities

Angular AWS DynamoDB Amazon EC2 Java JavaScript Jenkins Python SDLC Spring Terraform TypeScript

Alabama + 29 more

$153K - $171K / year

Job Closed

Senior Site Reliability Engineer – SRE

Xenon Seven

Human Experts Implementing Artificial Intelligence #AI #ArtificialIntelligence #HumanIntelligence

DevOps Engineer | 172 days ago

Full Time Remote | Team 11-50 | H1B No Sponsor

• Design and architect highly available and scalable OpenShift/Kubernetes infrastructure for banking applications on on-premise servers
• Lead and implement comprehensive monitoring and observability strategy using Prometheus and Grafana
• Design and oversee centralized logging infrastructure using ELK Stack (Elasticsearch, Logstash, Kibana)
• Lead SRE best practices implementation and adoption of production support standards across teams
• Mentor and coach junior SRE and DevOps engineers on OpenShift, Kubernetes, monitoring, and production support
• Define and implement Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs) with measurable metrics
• Lead incident response strategy, post-incident reviews, and drive continuous improvement in production stability
• Architect and implement advanced alerting, monitoring dashboards, and visualization strategies using Prometheus and Grafana
• Design automation frameworks and tools to reduce operational toil and improve production efficiency
• Lead OpenShift/Kubernetes cluster upgrades, security patches, and infrastructure modernization on-premise
• Establish production support procedures, on-call rotation policies, and escalation frameworks
• Optimize system performance, cost, and resource utilization across containerized on-premise infrastructure
• Conduct capacity planning, performance optimization, and infrastructure scaling initiatives
• Lead technical architecture reviews and infrastructure design decisions for banking applications
• Manage on-premise data center resources and infrastructure planning
• Participate in 24/7 on-call rotation and escalation for critical production incidents
• Ensure compliance, security hardening, and disaster recovery procedures for financial systems

Elasticsearch Grafana Kubernetes Linux Logstash OpenShift Prometheus Unix

Germany

DevOps Business Analyst

GXA

Building Stronger Businesses & Communities. Providing Managed IT Services in the Dallas-Fort Worth Area since 2008.

DevOps Engineer | 172 days ago

Contract Remote | Team 11-50 | Since 2004 | H1B No Sponsor

• Act as the client-facing bridge between business stakeholders and the Dev/Ops engineering team.
• Gather and document business requirements, mapping workflows, identifying integration or automation opportunities.
• Translate needs into clear, actionable specifications for the Dev/Ops Engineer.
• Ensure proposed solutions align with client goals, are technically feasible, and operationally robust.
• Lead structured discovery sessions with clients and internal stakeholders.
• Document business processes, pain points, data flows, and desired outcomes.
• Identify opportunities for integrations, automation, or custom development.
• Translate business requirements into functional specifications, user stories, wireframes, and acceptance criteria.
• Map system interactions, including API usage, data movement, and workflow triggers.
• Validate requirements with the Dev/Ops Engineer to ensure technical feasibility.
• Define project scope, success criteria, timelines, and dependencies.
• Serve as the primary contact for status updates and requirement clarifications.
• Ensure clients understand trade-offs, risks, and set realistic expectations.
• Analyze existing business processes and recommend improvements.
• Identify automation opportunities, such as ETL, API-based workflows, and SQL-driven tasks.
• Produce detailed documentation, diagrams, decision logs, and training materials.
• Facilitate handovers to operations, support, and engineering teams.

Azure ETL SQL

Philippines
