Job Closed

This listing is no longer active.

Andromeda

Where technology meets empathy – pioneering the future of human-robot interaction.

Performance Engineer – AI Infrastructure

LLM Engineer · Machine Learning Engineer · Other · Remote · Senior · Team 11-50 · H1B Sponsor

Location

California

Posted

92 days ago

Salary

0

Seniority

Senior

Job Description

  • Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O
  • Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution
  • Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime
  • Design technical processes that help the team operate effectively and avoid repeating performance regressions

Job Requirements

  • Proven experience running distributed training jobs on multi-GPU systems or HPC clusters
  • Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus)
  • Solid understanding of PyTorch, JAX, or TensorFlow, and large-scale training loops
  • Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code
  • Passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements

Benefits

  • Ownership and autonomy to shape how systems run
  • A culture that celebrates diversity and fosters an inclusive environment

Related Job Pages

More LLM Engineer Jobs

Senior Conversational AI Engineer

Miratech

Helping Visionaries Change the World

LLM Engineer · 92 days ago
Full Time · Remote · Team 501-1,000 · Since 1989 · H1B No Sponsor

  • Design, develop, and scale agentic AI systems using Google Agent Development Kit (ADK), ensuring enterprise-grade performance, security, and scalability.
  • Architect and implement multi-agent workflows, tool orchestration, and stateful conversational systems integrated with Dialogflow CX/ES.
  • Develop production-grade Python services (FastAPI, Flask, or equivalent) to support middleware, APIs, and enterprise integrations.
  • Design and deploy scalable solutions on Google Cloud Platform (GCP), leveraging services such as CCAI, Cloud Run, Cloud Functions, Pub/Sub, and BigQuery.
  • Implement advanced prompt engineering strategies, NLP/NLU best practices, context management, and robust error handling to optimize conversational experiences.
  • Integrate conversational agents with enterprise platforms (CRM systems, contact centers, databases) while ensuring observability through logging, monitoring, and performance optimization.
  • Provide technical leadership through architecture reviews, mentorship, best-practice enforcement, and cross-functional collaboration with product, DevOps, and business stakeholders.

India
Job Closed

Senior Datacenter Architect – AI Infrastructure

ePlus Technology Solutions

Dedicated, capable, growing, reaching far, ...

LLM Engineer · 94 days ago
Other · Remote · Team 51-200 · Since 2015 · H1B No Sponsor

  • Design and deliver end-to-end data center solutions covering compute, storage, and networking
  • Deploy and manage GPU-based systems (NVIDIA DGX, HGX, or similar) for AI and HPC workloads
  • Implement and support virtualization platforms (VMware ESXi, vCenter, vSAN, NSX)
  • Build and manage containerized environments using Kubernetes or related platforms
  • Automate infrastructure provisioning and operations using Ansible, Terraform, or scripting (Bash/Python)
  • Conduct infrastructure assessments, capacity planning, and performance tuning
  • Work closely with networking, storage, and DevOps teams to ensure smooth integration and delivery
  • Create and maintain technical documentation for customers and internal teams

United States
$125K - $170K / year

Director, Data Center Energy Strategy – AI Infrastructure

EQL Tech (sales & engineering talent)

Tech recruitment specialists, scaling AI-native startups by hiring top 1% Sales, GTM & Engineering talent globally.

LLM Engineer · 94 days ago
Other · Remote · Team 1-10 · Since 2025 · H1B No Sponsor

  • Define the Standard: Establish technical and operational frameworks for solar + storage, fire safety, and water usage in next-gen data centers.
  • Drive the Narrative: Reframe solar as critical infrastructure for national security and economic competitiveness.
  • Build the Coalition: Engage directly with Frontier AI labs, hyperscalers, and energy experts to move solar-first design from concept to pilot.
  • Navigate Siting: Work with federal and local authorities to define permitting pathways for industrial and public land (e.g., BLM).
  • Publish the Manifesto: Author and gain external validation for a "Data Center Manifesto" defining best practices for the industry.

United States
Job Closed