Job Closed
This listing is no longer active.
Datacenters that make sense
Senior AI Workload Platform Engineer
Location
Europe
Posted
69 days ago
Salary
Not specified
Seniority
Senior
Job Description
Submer
- Design, build, and operate the compute orchestration layer powering a GPU-native cloud platform for AI and high-performance workloads.
- Maintain the existing CloudStack code base used in current production deployments.
- Integrate new upstream CloudStack releases into the internal platform fork.
- Perform upgrades of existing customer environments to newer CloudStack versions.
- Design and execute safe upgrade paths for running production environments.
- Troubleshoot orchestration and provisioning issues in existing deployments.
- Maintain and troubleshoot CloudStack VPC networking.
- Work with and understand CloudStack Debian VPC routers.
- Manage networking implementations based on Open vSwitch (OVS) and OVN.
- Improve the reliability of network orchestration components.
- Manage hypervisor implementations based on KVM and QEMU.
- Maintain and evolve the code responsible for QEMU GPU passthrough, including PCI mapping and exposure of L40S, RTX 6000 Pro, and H200 GPUs to virtual machines.
- Design orchestration and scheduling primitives for the next-generation platform based on Kubernetes, Slurm, and Argo Workflows.
- Build orchestration workflows that expose GPU and CPU compute resources to platform users.
- Implement Kubernetes scheduling strategies, including GPU partitioning, multi-GPU job placement, and topology-aware scheduling for distributed training and inference.
- Design and implement Kubernetes-based GPU/CPU scheduling infrastructure for multi-tenant AI workloads.
- Design and operate Slurm-based HPC scheduling environments integrated with Kubernetes clusters.
- Implement support for multi-node distributed GPU training and gang scheduling, and build automation for dynamic Slurm node registration.
- Design and implement workflow orchestration using Argo Workflows, and develop reusable workflow templates for common platform workloads.
Job Requirements
- Proven experience working with large-scale distributed compute environments at a neo-cloud, hyperscaler, or HPC provider.
- Strong experience with CloudStack internals, including extending and maintaining platform functionality.
- Experience operating cloud orchestration platforms in production environments.
- Experience running GPU-heavy infrastructure for AI training, inference, or HPC workloads.
- Experience maintaining or extending large Java codebases, ideally within infrastructure platforms.
- Strong programming skills in Go and Python, with experience building cloud-native platform components.
- Deep practical knowledge of Kubernetes internals and Slurm scheduling systems.
- Familiarity with workflow orchestration systems such as Argo Workflows.
- Familiarity with virtual and distributed networking technologies such as OVS, OVN, VPC networking, RDMA, RoCE, ECMP, EVPN/VXLAN, and leaf-spine fabrics.
- Understanding of GPU virtualization and passthrough mechanisms such as QEMU PCI passthrough and NVIDIA MIG.
- Experience working with GPU infrastructure, including passthrough, NVIDIA MIG, scheduling, and lifecycle management of GPUs in distributed clusters.
- Ability to independently own major compute-orchestration initiatives from design through rollout and operational stabilization.
- Willingness to mentor peers and improve implementation quality, documentation, operational workflows, and platform reliability within the compute orchestration domain.
Benefits
- Attractive compensation package reflecting your expertise and experience.
- A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
- The chance to join a fast-growing scale-up with a mission to make a positive impact, with room for your career to grow.