Job Closed
This listing is no longer active.
Where technology meets empathy – pioneering the future of human-robot interaction.
Performance Engineer – AI Infrastructure
Location
California
Posted
92 days ago
Salary
Not disclosed
Seniority
Senior
Job Description
Andromeda
- Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O
- Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution
- Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime
- Design technical processes that help the team operate effectively and avoid repeating performance regressions
Job Requirements
- Proven experience running distributed training jobs on multi-GPU systems or HPC clusters
- Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus)
- Solid understanding of PyTorch, JAX, or TensorFlow, and large-scale training loops
- Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code
- Passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements
Benefits
- Ownership and autonomy to shape how systems run
- A culture that celebrates diversity and fosters an inclusive environment
More LLM Engineer Jobs
- Design, develop, and scale agentic AI systems using Google Agent Development Kit (ADK), ensuring enterprise-grade performance, security, and scalability
- Architect and implement multi-agent workflows, tool orchestration, and stateful conversational systems integrated with Dialogflow CX/ES
- Develop production-grade Python services (FastAPI, Flask, or equivalent) to support middleware, APIs, and enterprise integrations
- Design and deploy scalable solutions on Google Cloud Platform (GCP), leveraging services such as CCAI, Cloud Run, Cloud Functions, Pub/Sub, and BigQuery
- Implement advanced prompt engineering strategies, NLP/NLU best practices, context management, and robust error handling to optimize conversational experiences
- Integrate conversational agents with enterprise platforms (CRM systems, contact centers, databases) while ensuring observability through logging, monitoring, and performance optimization
- Provide technical leadership through architecture reviews, mentorship, best-practice enforcement, and cross-functional collaboration with product, DevOps, and business stakeholders
Senior Datacenter Architect – AI Infrastructure
ePlus Technology Solutions
Dedicated, capable, growing, reaching far, ...
- Design and deliver end-to-end data center solutions covering compute, storage, and networking
- Deploy and manage GPU-based systems (NVIDIA DGX, HGX, or similar) for AI and HPC workloads
- Implement and support virtualization platforms (VMware ESXi, vCenter, vSAN, NSX)
- Build and manage containerized environments using Kubernetes or related platforms
- Automate infrastructure provisioning and operations using Ansible, Terraform, or scripting (Bash/Python)
- Conduct infrastructure assessments, capacity planning, and performance tuning
- Work closely with networking, storage, and DevOps teams to ensure smooth integration and delivery
- Create and maintain technical documentation for customers and internal teams
Director, Data Center Energy Strategy – AI Infrastructure
EQL Tech (sales & engineering talent)
Tech recruitment specialists, scaling AI-native startups by hiring top 1% Sales, GTM & Engineering talent globally.
- Define the Standard: Establish technical and operational frameworks for solar + storage, fire safety, and water usage in next-gen data centers
- Drive the Narrative: Reframe solar as critical infrastructure for national security and economic competitiveness
- Build the Coalition: Engage directly with Frontier AI labs, hyperscalers, and energy experts to move solar-first design from concept to pilot
- Navigate Siting: Work with federal and local authorities to define permitting pathways for industrial and public land (e.g., BLM)
- Publish the Manifesto: Author and gain external validation for a "Data Center Manifesto" defining best practices for the industry



