Job Closed
This listing is no longer active.
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.
Performance Engineer - AI Infrastructure
Location
United States
Posted
91 days ago
Job Description
Performance Engineer - AI Infrastructure
Andromeda Cluster
Location: Global Remote / San Francisco · Full-Time

About Andromeda
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth. Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

The Opportunity
We are hiring a Performance Engineer to join our Growth team. In this role, your "product" is the efficiency and throughput of our massive-scale AI clusters. As we scale our network, the difference between a "working" cluster and an "optimized" one represents millions of dollars in value and weeks of saved research time for our customers. You will sit at the intersection of systems engineering and research, profiling end-to-end training runs to hunt down bottlenecks in compute, communication, and storage.

What You’ll Do
- Profile & Optimize: Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.
- System Refinement: Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.
- Observability: Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime.
- Process Design: Design technical processes (e.g., postmortem reviews, incident response) that help the team operate effectively and avoid repeating performance regressions.

What We’re Looking For
- Systems Intuition: You love optimizing performance and digging into systems to understand how every layer interacts, from the training loop to the hardware.
- Distributed Training Experience: Proven experience running distributed training jobs on multi-GPU systems or HPC clusters.
- Coding Proficiency: Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus).
- ML Framework Depth: Solid understanding of PyTorch, JAX, or TensorFlow, and how large-scale training loops are built.
- Infrastructure Knowledge: Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code.
- Rigor: A passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.

Strong Candidates May Have
- Low-Level Mastery: Experience with Linux kernel tuning, eBPF, and understanding systems design tradeoffs at the hardware level.
- Specialized AI Infra: Hands-on experience with GPUs, TPUs, or Trainium, and the networking libraries that power them (NCCL, MPI, UCX).
- Security & Privacy: Expertise in security best practices for high-scale infrastructure.
- Observability: Familiarity with monitoring tools like Prometheus and Grafana.

Why You’ll Love It Here
This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
More Infrastructure Engineer Jobs
Infrastructure Manager
Andromeda Cluster
Location: North America Remote / San Francisco · Full-Time

About Andromeda
Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth. Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest people working in AI infrastructure, research, and engineering.

The Opportunity
We're hiring an Infrastructure Manager to accelerate supply and demand matching on our platform. This is an Individual Contributor role reporting to the Head of Infrastructure. The Infrastructure team sits at the core of our infrastructure. We're responsible for acquiring and facilitating compute resources across the company, working closely with compute providers, sales, and technical teams to match compute supply with demand. Today we have already established the fundamental layer of capacity with providers. As we scale, we are building the next layer: widening our network and liquidity, deepening the scope of our services, and accelerating our growth.

What You'll Do
• Match incoming leads from our sales team with internal capacity and external capacity in the market
• Maximize utilization of our compute resources
• Source and onboard new compute suppliers across the globe
• Source capacity based on customer needs and market trends
• Solve customer and supplier problems in a fast-moving, dynamic market
• Understand technical and commercial differences between suppliers to optimize our capacity funnel
• Develop a proactive compute strategy informed by market intelligence
• Negotiate cost with suppliers and other vendors
• Create and implement processes around capacity planning

What We're Looking For
• 2+ years in cloud sales, GPUs, data centers, or a related field
• Existing network of contacts in the compute market (providers, brokers, or buyers)
• Deep understanding of the GPU compute market and what drives supply and demand
• Strong written and verbal communication across technical and commercial stakeholders
• Sound judgment in decisions that directly impact revenue and cost
• Comfortable operating in ambiguity
• Self-directed and energetic, able to operate autonomously while collaborating cross-functionally
• Bias toward action in a fast-paced environment

Why You'll Love It Here
- Impact: Be in a critical team unlocking revenue for the wider company
- Real business: Meaningful revenue, complex transactions, and tangible impact
- High-growth environment: Get in early at a company in a massive market
- Ownership: Direct line to leadership and influence over how we scale
- Competitive compensation + meaningful equity
- Comprehensive benefits for you and your dependents, including healthcare, dental, and vision coverage, 401(k), and unlimited PTO

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
This description is a summary of our understanding of the job description.

Role Description
We are looking for a Lead of Trading Infrastructure to take care of the existing globally distributed infrastructure, ensuring fast go-to-market and reliable SLAs for the SWE and Trading teams. The ideal candidate will have strong hands-on experience and a willingness to dive deep into technical tasks.
- Implement enhancements and scaling of current infrastructure, including hardware market analysis and 3rd party vendor interactions.
- Lead efforts to improve and scale existing infrastructure.

Qualifications
- Minimum of 2 years of experience in a lead role on a successful project.

Requirements
- Manage SLAs for critical infrastructure.

Nice-to-have
- Ownership mindset.
- Hands-on experience with globally distributed bare-metal infrastructure.
- Experience in designing and delivering SLAs for critical infrastructure.
- Experience with Nomad, LXC and exquisite networking hardware.

Benefits
- Great challenges with many opportunities to prove yourself.
- A welcoming group of highly qualified international professionals.
- Cutting-edge hardware and technology.
- Work remotely from anywhere in the world.
- Access any of our global offices anytime.
- Flexible schedule.
- 40 paid days off.
- Competitive salary.
• Architect and deploy scalable, secure, and highly available network infrastructure solutions in AWS
• Manage and optimize AWS networking services such as VPC, Direct Connect, Route 53, CloudWAN, and Transit Gateway
• Implement and maintain network security best practices, including firewalls, VPNs, and security groups
• Develop and maintain automation scripts using tools like Terraform, CloudFormation, and Python
• Monitor network performance and troubleshoot issues to ensure optimal performance and reliability
• Work with cross-functional teams to support application development and deployment
• Create and maintain detailed network documentation and diagrams
• Monitor and optimize cloud costs, implementing cost-saving measures where possible
• Ensure that all cloud network solutions comply with LexisNexis standards
• Continuously improve the performance and reliability of network systems
• Lead incident response efforts for network-related issues and outages
• Provide training and mentorship to junior network engineers and other team members
• Stay updated with the latest cloud networking technologies and propose innovative solutions to improve the infrastructure
Infrastructure Architect
ARC-One Solutions
Saving lives by providing market-leading blood supply solutions.
This description is a summary of our understanding of the job description.

Role Description
An Infrastructure Architect plays a crucial role in designing, implementing, and maintaining the Cloud infrastructure of the next generation Blood Establishment (BECS) platform. All team members are considered problem-solvers and actively participate in identifying problems and building solutions that solve the root causes. This is a hands-on position that designs, builds, and deploys high quality capabilities and features. The ideal candidate is comfortable leading Infrastructure as Code (IaC) Cloud initiatives while championing availability, reliability, monitoring, and performance excellence. This role collaborates closely with business leaders, product teams, engineers, and other stakeholders to create value.
- Analyzes business requirements and specifications to design Cloud solutions that optimize value and meet customer objectives
- Leads the deployment of infrastructure solutions, ensuring seamless integration with existing systems
- Evaluates infrastructure performance on an ongoing basis, identifying areas for improvement and cost reduction
- Identifies and resolves infrastructure and application defects, working closely with development teams to track and address issues
- Delivers and monitors solutions through scripting automations (e.g., Python, UNIX shell, YAML, and JSON)
- Provides support for and implements DevOps tools such as GitLab and CI/CD Pipelines
- Implements Infrastructure as Code (IaC) principles to build and manage infrastructure in a consistent and repeatable way
- Develops and implements best practices for automation development and operation
- Stays current on emerging Cloud trends and technologies
- Collaborates with customers, developers, and stakeholders to understand business needs and deliver effective automation solutions
- Documents solutions and processes, sharing knowledge across teams

Qualifications
- Bachelor's Degree or above in Computer Science or related field
- 10+ years of deep technical systems experience designing and supporting production environments
- 5+ years of hands-on experience in developing, building, and deploying Cloud solutions
- 5+ years of experience with AWS services and solutions
- 3+ years working with monitoring solutions (e.g., Dynatrace)
- Skilled in system administration for Unix (Linux) and Windows environments
- Skilled in scripting to automate administrative and security processes
- In-depth knowledge and hands-on experience of multiple infrastructure automation tools and technologies including Terraform
- Proficiency in containerization technologies such as Rancher, Docker, K8S, and EKS
- Experience with CI/CD frameworks and tools, including GitLab and Artifactory
- Experience with test automation frameworks is a plus
- Strong understanding of software development life cycle (SDLC) and Agile methodologies
- Experience working in a SaaS-based product development environment or FDA-regulated medical device environment desired
- Cloud and Security certifications are a plus

Requirements
- Flexible work hours in a fun, collaborative environment
- Working remotely requires a reliable internet connection
- Must have the ability to travel, as needed for company meetings


