Yotta Labs

Building the Decentralized OS for AI Optimization and Orchestration at Planet Scale

GPU Cloud Platform Engineer

Cloud Engineer | Other | Remote | Senior | Team 1-10 | Since 2024 | H1B No Sponsor

Location

All locations: United States | Canada | Brazil | Mexico | Argentina

Posted

129 days ago

Salary

$0

Seniority

Senior

Job Description

Location: Remote (Global)
Type: Full-time
Company: Yotta Labs
Apply: careers@yottalabs.ai

🧠 About Yotta Labs

Yotta Labs is pioneering the development of a Decentralized Operating System (DeOS) for AI workload orchestration at planetary scale. Our mission is to democratize access to AI resources by aggregating geo-distributed GPUs, enabling high-performance computing for AI training and inference on a wide spectrum of hardware, from commodity to high-end GPUs. Our platform supports major large language models (LLMs) and offers customizable solutions for new models, facilitating elastic and efficient AI development.

🛠️ Role Overview

We are seeking a GPU Cloud Platform Engineer to join our core infrastructure team and help build the next-generation AI compute cloud. In this role, you will design, deploy, and operate large-scale, multi-cluster GPU infrastructure across data centers and cloud environments. You will be responsible for ensuring high availability, performance, and efficiency of containerized AI workloads, ranging from LLMs to generative models, deployed in Kubernetes-based GPU clusters.

If you're passionate about high-performance systems, distributed orchestration, and scaling real-world AI infrastructure, this role offers a unique opportunity to shape the backbone of our AI cloud platform.

🎯 Responsibilities

  • Build and operate large-scale, high-performance GPU clusters; ensure stable operation of compute, network, and storage systems; monitor and troubleshoot online issues.
  • Conduct performance testing and evaluation of multi-node GPU clusters using standard benchmarking tools to identify and resolve performance bottlenecks.
  • Deploy and orchestrate large models (e.g., LLMs, video generation models) across multi-cluster environments using Kubernetes; implement elastic scaling and cross-cluster load balancing to ensure efficient service response under high concurrency for global users.
  • Participate in the design, development, and iteration of GPU cluster scheduling and optimization systems.
  • Define and lead Kubernetes multi-cluster configuration standards; optimize scheduling strategies (e.g., node affinity, taints/tolerations) to improve GPU resource utilization.
  • Build a unified multi-cluster management and monitoring system to support cross-region resource monitoring, traffic scheduling, and fault failover.
  • Collect key metrics such as GPU memory usage, QPS, and response latency in real time; configure alert mechanisms.
  • Coordinate with IDC providers for planning and deploying large-scale GPU clusters, networks, and storage infrastructure to support internal cloud platforms and external customer needs.

✅ Qualifications

Job Requirements

  • Bachelor's degree or higher in Computer Science, Software Engineering, Electronic Engineering, or related fields; 3+ years of experience in system engineering or DevOps.
  • 5+ years of experience in cloud-native development or AI engineering, with at least 2 years of hands-on experience in Kubernetes multi-cluster management and orchestration.
  • Familiarity with the Kubernetes ecosystem; hands-on experience with tools such as kubectl, Helm, and expertise in multi-cluster deployment, upgrade, scaling, and disaster recovery.
  • Proficient in Docker and containerization technologies; knowledge of image management and cross-cluster distribution.
  • Experience with monitoring tools such as Prometheus and Grafana; practical experience in GPU fault monitoring and alerting.
  • Hands-on experience with cloud platforms such as AWS, GCP, or Azure; understanding of cloud-native multi-cluster architecture.
  • Experience with cluster management tools such as Ray, Slurm, KubeSphere, Rancher, or Karmada is a plus.
  • Familiarity with distributed file systems such as NFS, JuiceFS, CephFS, or Lustre; ability to diagnose and resolve performance bottlenecks.
  • Understanding of high-performance interconnects and communication protocols such as InfiniBand (IB), RoCE, NVLink, and PCIe.
  • Strong communication skills, self-motivation, and team collaboration.

🌟 Preferred Experience

  • Experience in developing and operating MaaS platforms or large-scale model inference clusters. Proven track record of leading multi-cluster system development or performance optimization projects.
  • Proficiency in CUDA programming and the NCCL communication library; understanding of high-performance GPUs such as the H100.
  • Ability to develop standardized inference APIs (RESTful/gRPC) and automation tools using Golang or Python.
  • Hands-on experience with optimization techniques such as model quantization, static compilation, and multi-GPU parallelism; capable of profiling inference processes in multi-cluster setups and identifying bottlenecks like memory fragmentation and low compute efficiency.
  • Active engagement with open-source communities such as Hugging Face and GitHub; deep understanding of the design principles of inference frameworks like Triton, vLLM, and SGLang; ability to perform secondary development and optimization based on open-source projects and quickly translate cutting-edge techniques into production-ready multi-cluster solutions.

🌐 Why Join Yotta Labs?

  • Be part of a visionary team aiming to redefine AI infrastructure.
  • Work on cutting-edge technologies that bridge AI and decentralized computing.
  • Collaborate with experts from leading institutions and tech companies.
  • Enjoy a flexible, remote work environment that values innovation and autonomy.

📩 How to Apply

Interested candidates should apply directly or send their resume and a brief cover letter to careers@yottalabs.ai. Please include links to any relevant projects or contributions.

More Cloud Engineer Jobs

Senior Cloud Platform Engineer

Curotec

We help companies master digital innovation.

Cloud Engineer | 129 days ago
Full Time | Remote | Team 51-200 | Since 2010 | H1B No Sponsor

  • Designing and deploying AI agents using Google’s Gemini models
  • Using GCP’s Vertex AI as the primary ML platform (model hosting, pipelines, endpoints)
  • Building with Google’s agent tooling: the Agent Development Kit (ADK) for constructing agent logic, and the Managed Agents API for running/orchestrating them at scale
  • Architecting and implementing infrastructure as code (IaC).
  • Defining administrative choices for environments that are auto-configured by IaC.
  • Focus on agent use cases, including relationships with BigQuery databases, enterprise connectors, and agent orchestration.
  • Heavy leverage of Vertex.
  • Investigating and tweaking designs to avoid issues with Pfizer’s enterprise infrastructure and GCP limitations.

Pennsylvania
Other | Remote | Team 1,001-5,000 | H1B Sponsor

  • Lead the design and architecture of end-to-end integration solutions using Oracle Integration Cloud (OIC), including app-driven orchestrations, scheduled integrations, file-based integrations, event-driven architectures, REST/SOAP APIs, and B2B integrations.
  • Define integration standards, reference architectures, governance models, reusable patterns, and best practices to ensure consistency, scalability, and maintainability across the enterprise.
  • Collaborate with business stakeholders, solution architects, and technical teams to gather requirements, analyze integration needs, perform gap analysis, and translate business processes into robust technical designs.
  • Architect and implement complex integrations, data mappings, transformations (using XSLT, lookups, and canonical models), adapters, and packages in OIC.
  • Oversee integration with Oracle SaaS applications, legacy systems (including Oracle E-Business Suite), and external platforms, ensuring data integrity, security, and performance.
  • Provide technical leadership on migration strategies from on-premises middleware (e.g., SOA) to OIC, hybrid cloud integrations, and optimization of existing integration landscapes.
  • Conduct design reviews, mentor development teams, troubleshoot complex issues, and support testing, deployment, and post-production monitoring.
  • Stay current with Oracle Integration Cloud updates, new features (such as enhanced observability, projects, file server capabilities, AI agents, and process improvements in recent releases), and industry trends to recommend proactive enhancements.
  • Ensure compliance with security protocols, data governance, and regulatory requirements in all integration designs.

United States
$130K - $150K / year
Job Closed

Cloud Developer – API Platform

Nordcloud, an IBM Company

We supercharge our customers, using the world's best technology platforms.

Cloud Engineer | 130 days ago
Full Time | Remote | Team 501-1,000 | Since 2011 | H1B No Sponsor

  • Enhance and optimize our API platform using ApigeeX and Google Cloud services
  • Develop features and automation for APIs and platform components
  • Implement Infrastructure-as-Code (Terraform, GitHub Actions) for deployments and provisioning
  • Ensure API security, compliance, and observability
  • Designing technical solutions, building, testing and implementing new business requirements
  • Working in a small and close-knit team who supports each other and shares knowledge actively
  • Being responsible and guarding the service you build

Poland

Microsoft Cloud Support Infrastructure Engineer

Kraft & Kennedy

Kraft Kennedy is an equal opportunity employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity, sexual orientation, national origin, ethnicity, age, disability, marital status, veteran status or any other characteristic protected by law.

Cloud Engineer | 130 days ago

Role Description

The Infrastructure and Enterprise Systems practice group (IES) at Kraft Kennedy provides strategic guidance, technology planning, and systems design and integration services for organizations of all sizes. Within Kraft Kennedy, the IES is responsible for all cloud and on-premises infrastructure, data center technologies, enterprise applications (e.g. messaging systems, unified communications, etc.), identity services, cloud security and compliance, and more. We are looking for an experienced technology professional with strong experience in Microsoft’s cloud infrastructure, security, and collaboration technologies to provide the highest degree of service to our clients.

You must live in one of these locations to be considered for this remote position: Connecticut, Delaware, Florida, Georgia, Illinois, Maryland, Massachusetts, New York, South Carolina, North Carolina, Tennessee, Texas, Utah, Virginia, Vermont, DC, Kentucky, Pennsylvania, Ohio, or Washington.

Qualifications

  • Bachelor’s degree
  • 5 plus years of experience in IT
  • Excellent verbal and written communication skills
  • Very organized and detail oriented, with a high degree of accuracy and follow-up
  • Strong problem solving and technical troubleshooting skills

Requirements

  • Implement and support infrastructure technologies such as Microsoft Azure, VMware, and networking technologies
  • Execute migrations of on-premises platforms to cloud infrastructure
  • Manage enterprise support requests from clients subscribing to Kraft Kennedy’s enterprise managed services
  • Execute planned evening and weekend maintenance tasks in support of Kraft Kennedy’s enterprise managed services clients, when necessary
  • Participate in weekly on-call rotation for evening and weekend support assistance, as requested by enterprise managed services clients
  • Escalate to internal and, when necessary, external resources in an appropriate time frame to manage the resolution of complex client issues
  • Provide on-site support, as necessary

Benefits

  • Medical, dental, life and disability insurance
  • 401k with company match
  • Holidays/vacation/sick days
  • Cutting edge training on the latest technologies
  • Employee referral bonus program
  • Phone reimbursement

Compensation

The base pay for this position has a salary range of $85,000 to $140,000. The actual salary offer will take into account a wide range of factors including the individual’s qualifications, experience as well as location. In addition, certain positions are eligible for bonuses or commissions.

Physical Requirements

  • Prolonged periods of sitting at a desk and working on a computer.
  • Periodic after hours work

United States
$85K - $140K / year
Job Closed