Job Closed

This listing is no longer active.

Andromeda Cluster

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

Performance Engineer - AI Infrastructure

Infrastructure Engineer · Other · Remote · Team 11-50

Location

United States

Posted

100 days ago

Job Description

This description is a summary of our understanding of the job description.

Role Description

We are hiring a Performance Engineer to join our Growth team. In this role, your "product" is the efficiency and throughput of our massive-scale AI clusters. As we scale our network, the difference between a "working" cluster and an "optimized" one represents millions of dollars in value and weeks of saved research time for our customers.

  • Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.
  • Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.
  • Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime.
  • Design technical processes (e.g., postmortem reviews, incident response) that help the team operate effectively and avoid repeating performance regressions.

Qualifications

  • You love optimizing performance and digging into systems to understand how every layer interacts, from the training loop to the hardware.
  • Proven experience running distributed training jobs on multi-GPU systems or HPC clusters.
  • Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus).
  • Solid understanding of PyTorch, JAX, or TensorFlow, and how large-scale training loops are built.
  • Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code.
  • A passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.

Requirements

  • Experience with Linux kernel tuning, eBPF, and understanding systems design tradeoffs at the hardware level.
  • Hands-on experience with GPUs, TPUs, or Trainium, and the networking libraries that power them (NCCL, MPI, UCX).
  • Expertise in security best practices for high-scale infrastructure.
  • Familiarity with monitoring tools like Prometheus and Grafana.

Benefits

This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.

More Infrastructure Engineer Jobs

Open-Source Infrastructure Specialist

becon GmbH

Full-service provider of information and telecommunications technology solutions and services. #DataCenterLove

Full Time · Remote · Team 51-200 · Since 1993 · H1B No Sponsor

  • Set up and configure Linux-based infrastructures
  • Integrate and adapt open-source components
  • Perform system updates, hardening, and performance optimizations
  • Monitor and ensure system availability
  • Support technical concepts and architecture decisions
  • Automate recurring administrative processes

Germany
Job Closed
Other · Remote · Team 51-200 · Since 2017 · H1B No Sponsor

  • Build, maintain, and optimize the physical network and compute layer of on-premises environments.
  • Ensure reliability, performance, and scalability of networks, firewalls, server hardware, racks, power, cabling, and related physical systems.
  • Deploy, install, rack, cable, and power physical servers, storage systems, and supporting hardware.
  • Perform hardware diagnostics, break/fix tasks, component replacement, and lifecycle upgrades.
  • Manage firmware, BIOS, and device-level updates.
  • Maintain accurate inventory of physical compute assets, spares, and components.
  • Design, deploy, and manage enterprise network infrastructure using Cisco technologies.
  • Architect and implement AWS networking solutions.

United States
Job Closed

Infrastructure Engineer

Inngest

Inngest is the developer platform for easily building reliable workflows with zero infrastructure.

Full Time · Remote · Team 1-10 · H1B No Sponsor

  • Collaborate with other engineers to architect core systems for Inngest, including message streams, real-time connectivity to external clients, consistent hashing caches, etc.
  • Collaborate with systems engineers to help develop our execution engine, state store, etc. (in Go)
  • Develop internal tooling (in Go) to manage our systems
  • Provision and monitor systems on public clouds, bare metal, and, in the future, our own hardware.
  • Ensure our systems are up to date and secure

California
$160K - $205K / year
Job Closed

About Delphina

Today’s data scientists are in pain, spending their time manually wrangling data, building models through slow trial and error, taking on painstaking rewrites for deployment, and dealing with countless other frustrating bottlenecks. And the tools they are using for much of this work (e.g. Jupyter notebooks and Pandas) are over a decade old. We founded Delphina to change this: our mission is to help the world get better at using data to understand the present and predict the future.

Delphina is an AI agent for data science. It leverages a combination of generative AI, large-scale optimization, and specialized infrastructure to automate the time-consuming but necessary tasks needed to build powerful ML models quickly: Delphina will identify relevant data, clean it, train models, and even productionize pipelines. Our team has previously led large data science and machine learning teams (covering both applications and infrastructure), built startups, and created successful tools for enterprise ML. We're backed by top AI investors, including Fei-Fei Li, Radical VC, and Costanoa VC.

What you’ll do

We're looking for an experienced ML Infrastructure Engineer to join as a Member of our Technical Staff at Delphina. As one of our key early hires, you will partner closely with our early team on the direction of our product and drive critical technical decisions. You will have broad impact over the technology, product, and our company's culture.

You will be responsible for:

  • Developing platforms that enable scientists, researchers, and developers to run ML jobs easily and quickly at scale using the latest technologies
  • Developing solutions that orchestrate and support massive quantities of data through stages like ingestion, indexing/mining, transformation, machine learning, and online deployment
  • Defining a consistent continuous integration/deployment model that encourages cross-functional development teams to self-serve application unit testing, deployment, and operations
  • Influencing and leading cross-functional initiatives that align the team toward commonly used technologies and methodologies

What we’re looking for

  • Proficiency in multiple programming languages relevant for such systems (e.g. Python, Rust, C++, Go, Java)
  • Knowledge of what it takes to deploy and operate high-availability production systems in the cloud
  • Experience designing service-oriented architectures and leveraging various data store technologies
  • Energy and ambition to build a product that is surprisingly good in surprising ways
  • Intrinsic desire to always be improving our product and yourself
  • Growth mindset to both stay ahead of the curve and pick up whatever knowledge you're missing to get the job done

Nice to haves

  • Experience working directly on machine learning models, either by partnering with scientists and engineers who are building models or by building models yourself
  • Experience leading cross-functional teams through ambiguous problems

Benefits

  • Equity in the company
  • Medical, dental, and vision insurance
  • 401k
  • Unlimited PTO
  • Top of the line Apple equipment
  • Free lunch in the office

California