Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.
Site Reliability Engineer - AI Infrastructure
Location
United States
Posted
88 days ago
Job Description
Site Reliability Engineer - AI Infrastructure
Andromeda Cluster
Location: Global Remote / San Francisco · Full-Time

About Andromeda
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth. Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI in a way not dissimilar to the flows of capital in the world's financial markets. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

What You’ll Do
- Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
- Build automation and tooling to streamline cluster deployments and integrations.
- Debug customer issues across networking, storage, scheduling, and system layers.
- Improve reliability and scalability of both training and inference infrastructure.
- Design and implement monitoring, alerting, and observability for critical systems.
- Collaborate with engineering and product teams to plan and deliver infrastructure for new services.
- Participate in on-call and incident response, leading postmortems and reliability improvements.

What We’re Looking For
- 5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
- Strong Linux systems and networking fundamentals.
- Deep experience with Kubernetes and container orchestration at scale.
- Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).
- Strong automation and scripting skills (Python, Go, or Bash).
- Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).
- Track record of operating production systems and leading incident response.

Nice to Have
- Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.).
- Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).
- Customer-facing support or consulting experience.

Why You’ll Love It Here
This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.
More Infrastructure Engineer Jobs
Performance Engineer - AI Infrastructure
Andromeda Cluster
Location: Global Remote / San Francisco · Full-Time

The Opportunity
We are hiring a Performance Engineer to join our Growth team. In this role, your "product" is the efficiency and throughput of our massive-scale AI clusters. As we scale our network, the difference between a "working" cluster and an "optimized" one represents millions of dollars in value and weeks of saved research time for our customers. You will sit at the intersection of systems engineering and research, profiling end-to-end training runs to hunt down bottlenecks in compute, communication, and storage.

What You’ll Do
- Profile & Optimize: Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.
- System Refinement: Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.
- Observability: Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime.
- Process Design: Design technical processes (e.g., postmortem reviews, incident response) that help the team operate effectively and avoid repeating performance regressions.

What We’re Looking For
- Systems Intuition: You love optimizing performance and digging into systems to understand how every layer interacts, from the training loop to the hardware.
- Distributed Training Experience: Proven experience running distributed training jobs on multi-GPU systems or HPC clusters.
- Coding Proficiency: Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus).
- ML Framework Depth: Solid understanding of PyTorch, JAX, or TensorFlow, and how large-scale training loops are built.
- Infrastructure Knowledge: Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code.
- Rigor: A passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.

Strong Candidates May Have
- Low-Level Mastery: Experience with Linux kernel tuning, eBPF, and understanding systems design tradeoffs at the hardware level.
- Specialized AI Infra: Hands-on experience with GPUs, TPUs, or Trainium, and the networking libraries that power them (NCCL, MPI, UCX).
- Security & Privacy: Expertise in security best practices for high-scale infrastructure.
- Observability: Familiarity with monitoring tools like Prometheus and Grafana.

Why You’ll Love It Here
This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
Infrastructure Manager
Andromeda Cluster
Location: North America Remote / San Francisco · Full-Time

The Opportunity
We're hiring an Infrastructure Manager to accelerate supply and demand matching on our platform. This is an Individual Contributor role reporting to the Head of Infrastructure.

The Infrastructure team sits at the core of our infrastructure. We're responsible for acquiring and facilitating compute resources across the company, working closely with compute providers, sales, and technical teams to match compute supply with demand. Today we have already established the fundamental layer of capacity with providers. As we scale, we are building the next layer: widening our network and liquidity, deepening the scope of our services, and accelerating our growth.

What You'll Do
• Match incoming leads from our sales team with internal capacity and external capacity in the market
• Maximize utilization of our compute resources
• Source and onboard new compute suppliers across the globe
• Source capacity based on customer needs and market trends
• Solve customer and supplier problems in a fast-moving, dynamic market
• Understand technical and commercial differences between suppliers to optimize our capacity funnel
• Develop a proactive compute strategy informed by market intelligence
• Negotiate cost with suppliers and other vendors
• Create and implement processes around capacity planning

What We're Looking For
• 2+ years in cloud sales, GPUs, data centers, or a related field
• Existing network of contacts in the compute market (providers, brokers, or buyers)
• Deep understanding of the GPU compute market and what drives supply and demand
• Strong written and verbal communication across technical and commercial stakeholders
• Sound judgment in decisions that directly impact revenue and cost
• Comfortable operating in ambiguity
• Self-directed and energetic, able to operate autonomously while collaborating cross-functionally
• Bias toward action in a fast-paced environment

Why You'll Love It Here
- Impact: Be in a critical team unlocking revenue for the wider company
- Real business: Meaningful revenue, complex transactions, and tangible impact
- High-growth environment: Get in early at a company in a massive market
- Ownership: Direct line to leadership and influence over how we scale
- Competitive compensation + meaningful equity
- Comprehensive benefits for you and your dependents, including healthcare, dental, and vision coverage, 401(k), and unlimited PTO
Role Description
We are looking for a Lead of Trading Infrastructure to take care of the existing globally distributed infrastructure, ensuring fast go-to-market and reliable SLAs for the SWE and Trading teams. The ideal candidate will have strong hands-on experience and a willingness to dive deep into technical tasks.
- Implement enhancements and scaling of current infrastructure, including hardware market analysis and third-party vendor interactions.
- Lead efforts to improve and scale existing infrastructure.

Qualifications
- Minimum of 2 years of experience in a lead role on a successful project.

Requirements
- Manage SLAs for critical infrastructure.

Nice-to-have
- Ownership mindset.
- Hands-on experience with globally distributed bare-metal infrastructure.
- Experience in designing and delivering SLAs for critical infrastructure.
- Experience with Nomad, LXC and exquisite networking hardware.

Benefits
- Great challenges with many opportunities to prove yourself.
- A welcoming group of highly qualified international professionals.
- Cutting-edge hardware and technology.
- Work remotely from anywhere in the world.
- Access any of our global offices anytime.
- Flexible schedule.
- 40 paid days off.
- Competitive salary.
• Architect and deploy scalable, secure, and highly available network infrastructure solutions in AWS
• Manage and optimize AWS networking services such as VPC, Direct Connect, Route 53, CloudWAN, and Transit Gateway
• Implement and maintain network security best practices, including firewalls, VPNs, and security groups
• Develop and maintain automation scripts using tools like Terraform, CloudFormation, and Python
• Monitor network performance and troubleshoot issues to ensure optimal performance and reliability
• Work with cross-functional teams to support application development and deployment
• Create and maintain detailed network documentation and diagrams
• Monitor and optimize cloud costs, implementing cost-saving measures where possible
• Ensure that all cloud network solutions comply with LexisNexis standards
• Continuously improve the performance and reliability of network systems
• Lead incident response efforts for network-related issues and outages
• Provide training and mentorship to junior network engineers and other team members
• Stay updated with the latest cloud networking technologies and propose innovative solutions to improve the infrastructure

