Job Closed

This listing is no longer active.

Submer

Datacenters that make sense

Senior AI Workload Platform Engineer

Platform Engineer, Full Time, Remote, Senior, Team 51-200, Since 2015, H1B No Sponsor

Location

Europe

Posted

123 days ago

Seniority

Senior

English Java Kubernetes Node.js Python

Job Description

• Design, build, and operate the compute orchestration layer powering a GPU-native cloud platform for AI and high-performance workloads.
• Maintain the existing CloudStack code base used in current production deployments.
• Integrate new upstream CloudStack releases into the internal platform fork.
• Perform upgrades of existing customer environments to newer CloudStack versions.
• Design and execute safe upgrade paths for running production environments.
• Troubleshoot orchestration and provisioning issues in existing deployments.
• Maintain and troubleshoot CloudStack VPC networking.
• Work with and understand CloudStack Debian VPC routers.
• Manage networking implementations based on Open vSwitch (OVS) and OVN.
• Improve the reliability of network orchestration components.
• Manage hypervisor implementations based on KVM and QEMU.
• Maintain and evolve the code responsible for QEMU GPU passthrough, including PCI mapping and exposure of L40S, RTX 6000 Pro, and H200 GPUs to virtual machines.
• Design orchestration and scheduling primitives for the next-generation platform based on Kubernetes, Slurm, and Argo Workflows.
• Build orchestration workflows that expose GPU and CPU compute resources to platform users.
• Implement Kubernetes scheduling strategies including GPU partitioning, multi-GPU job placement, and topology-aware scheduling for distributed training and inference.
• Design and implement Kubernetes-based GPU/CPU scheduling infrastructure for multi-tenant AI workloads.
• Design and operate Slurm-based HPC scheduling environments integrated with Kubernetes clusters.
• Implement support for multi-node distributed GPU training and gang scheduling, and build automation for dynamic Slurm node registration.
• Design and implement workflow orchestration using Argo Workflows and develop reusable workflow templates for common platform workloads.

Job Requirements

Proven experience working with large-scale distributed compute environments at a neo-cloud, hyperscaler, or HPC provider.
Strong experience with CloudStack internals, including extending and maintaining platform functionality.
Experience operating cloud orchestration platforms in production environments.
Experience running GPU-heavy infrastructure for AI training, inference, or HPC workloads.
Experience maintaining or extending large Java codebases, ideally within infrastructure platforms.
Strong programming skills in Go and Python, with experience building cloud-native platform components.
Deep practical knowledge of Kubernetes internals and Slurm scheduling systems.
Familiarity with workflow orchestration systems such as Argo Workflows.
Familiarity with virtual networking and distributed networking technologies such as OVS, OVN, VPC networking, RDMA, RoCE, ECMP, EVPN/VXLAN, and leaf-spine fabrics.
Understanding of GPU virtualization and passthrough mechanisms such as QEMU PCI passthrough and NVIDIA MIG.
Experience working with GPU infrastructure, including passthrough, NVIDIA MIG, scheduling, and lifecycle management of GPUs in distributed clusters.
Ability to independently own major compute-orchestration initiatives from design through rollout and operational stabilization.
Comfort with mentoring peers and improving implementation quality, documentation, operational workflows, and platform reliability within the compute orchestration domain.

Benefits

Attractive compensation package reflecting your expertise and experience.
A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
You'll be part of a fast-growing scale-up with a mission to make a positive impact, offering an exciting career evolution.

More Platform Engineer Jobs

Principal Platform Engineer

FICO - Fair Isaac Corporation

FICO, also known as Fair Isaac Corporation, is one of the world’s leading credit history and financial analysis organizations. It was founded in 1956 on the i

Platform Engineer, posted 123 days ago

Full Time, Remote

• Lead the design, implementation, and evolution of our cloud-native platform infrastructure.
• Build and maintain scalable, resilient systems that empower our engineering teams to deliver innovative solutions rapidly and reliably.
• Design, deploy, and manage scalable cloud solutions on the AWS public cloud platform via Infrastructure as Code.
• Manage infrastructure as code (IaC) using Terraform, CloudFormation, and Go.
• Design and implement Kubernetes-based platform solutions with a focus on scalability, reliability, and security.
• Support and maintain large Kubernetes clusters in production environments.
• Implement security best practices and ensure compliance with industry standards and regulations.
• Work closely with development, operations, and security teams to integrate infrastructure as code practices.
• Develop automation to build and deploy Docker containers through CI/CD pipelines so that engineering teams can deploy and test services.
• Write policy and standard validation tests and integrations with security scanning software to ensure compliance.
• Implement and support observability solutions to ensure platform performance, reliability, and scalability.
• Create dashboards and integrate them into the Backstage IDP for visibility into system health.
• Provide guidance and mentorship to team members on best practices in GitOps, CI/CD, and infrastructure management.

AWS Docker Kubernetes Python Terraform

United States

$150.5K - $236.5K / year

Job Closed

Lead Platform Engineer

FICO - Fair Isaac Corporation

FICO, also known as Fair Isaac Corporation, is one of the world’s leading credit history and financial analysis organizations. It was founded in 1956 on the i

Platform Engineer, posted 123 days ago

Full Time, Remote

• Design, develop, deploy, and support modules of a large, world-class, enterprise-level product.
• Participate in the architectural design of the product.
• Develop high-level development timelines based on project scope and an understanding of the existing application code.
• Evaluate new design specifications, raise quality standards, and address architectural concerns.
• Evaluate the stability, compatibility, scalability, interoperability, and performance of the software product.
• Maintain and upgrade the product source code.
• Demonstrate technical expertise through publications, presentations, white papers, and event participation.
• Continually learn new technologies in related areas.
• Serve as a source of technical expertise and mentor junior team members.

United States

$105K - $165K / year

Job Closed

Working Student AI Platform Engineer

engaige GmbH

We give your data purpose - Tech Company focused on innovation and artificial intelligence.

Platform Engineer, posted 123 days ago

Part Time, Remote, Team 1-10, Since 2021, H1B No Sponsor

• Responsible for designing and implementing complex AI components, from initial concept to production
• Enhance and scale existing AI solutions to improve performance and efficiency
• Develop robust data pipelines, backends, and API interfaces for seamless integration
• Build and maintain automated deployment pipelines to streamline releases and testing
• Ensure quality and performance of backend processes and NLP algorithms through continuous optimization
• Work closely with the product team to translate data-driven insights into market-ready, innovative solutions

AWS Azure Docker Kubernetes Python

United States

Job Closed

AI Platform Engineer – m/f/d

engaige GmbH

We give your data purpose - Tech Company focused on innovation and artificial intelligence.

Platform Engineer, posted 123 days ago

Full Time, Remote, Team 1-10, Since 2021, H1B No Sponsor

• Design and build AI components, from concept to production
• Scale and optimize existing AI solutions (performance, cost, stability: the full triathlon package)
• Develop data pipelines, backends & APIs that feel good for both users and logging
• Automate CI/CD and deployments so releases don’t feel like an adventure
• Ensure quality & performance, including continuous optimization of backend processes and NLP pipelines
• Work closely with Product to turn data-driven insights into market-ready features

AWS Azure Docker Kubernetes Python

United States

Job Closed
