Nebius

Nebius is a European AI infrastructure company based in Amsterdam, North Holland, the Netherlands, specializing in full-stack AI solutions. The company offers large-scale GPU clust

GPU Cluster Architect

Location

United States

Posted

110 days ago

Salary

$150k - $180k base

Seniority

Mid Level

Bachelor Degree | 9 yrs exp | English | Python

Job Description

Why work at Nebius

Nebius is leading a new era in cloud computing to serve the global AI economy. We create the tools and resources our customers need to solve real-world challenges and transform industries, without massive infrastructure costs or the need to build large in-house AI/ML teams. Our employees work at the cutting edge of AI cloud infrastructure alongside some of the most experienced and innovative leaders and engineers in the field.

Where we work

Headquartered in Amsterdam and listed on Nasdaq, Nebius has a global footprint with R&D hubs across Europe, North America, and Israel. The team of over 800 employees includes more than 400 highly skilled engineers with deep expertise across hardware and software engineering, as well as an in-house AI R&D team.

The role

We are seeking a GPU Cluster Architect to drive the design of our next-generation AI infrastructure. In this high-impact, hands-on role, you will make end-to-end architectural decisions across compute, networking, and storage, ensuring our platforms can meet the massive scale, performance, and reliability requirements of modern AI workloads.

Your responsibilities will include:

  • Cluster Design: Architect scalable GPU cluster topologies, including compute nodes, interconnects (InfiniBand, Ethernet), storage, and control planes.
  • Performance Modeling: Analyze AI/ML workloads (e.g., LLM training, inference) to inform design tradeoffs across latency, bandwidth, and GPU density.
  • Network Architecture: Work with the network architects to design and validate low-latency, high-throughput interconnects (e.g., InfiniBand HDR/NDR, RoCEv2) at POD and DC scale.
  • Storage Integration: Work with storage teams to optimize performance for training datasets, checkpointing, and other storage needs.
  • Reliability & Monitoring: Analyze signals from monitoring systems to detect flaws in the design.
  • Collaboration: Partner with site reliability, networking, storage, and DC engineering teams to operationalize and scale your architecture.

We expect you to have:

  • 5+ years of experience designing clusters.
  • Deep understanding of modern GPU architecture (NVIDIA, AMD, etc.).
  • Experience with HPC interconnects (InfiniBand & RoCE).
  • Solid background in systems architecture, networking, and hardware reliability.
  • Experience in scripting for automation and telemetry pipelines (Python, Go, etc.).

Key employee benefits in the US:

  • Health insurance: 100% company-paid medical, dental, and vision coverage for employees and families.
  • 401(k) plan: Up to 4% company match with immediate vesting.
  • Parental leave: 20 weeks paid for primary caregivers, 12 weeks for secondary caregivers.
  • Remote work reimbursement: Up to $85/month for mobile and internet.
  • Disability & life insurance: Company-paid short-term, long-term, and life insurance coverage.

Compensation

We offer competitive salaries, ranging from $150k to $180k base, plus quarterly performance bonuses.

What we offer

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within Nebius.
  • Hybrid working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.

We’re growing and expanding our products every day. If you’re up to the challenge and are as excited about AI and ML as we are, join us!

Job Requirements

  • This is a high-impact, hands-on architecture role where you’ll define how tens of thousands of GPUs are interconnected, cooled down, powered, and optimized across multiple data center sites.
  • You are welcome to work remotely from the USA.

Related Job Pages

More Artificial Intelligence Jobs

Other | Remote | Team 596 | Since 2002

Company And Culture

Complex is the definitive platform for global youth culture and music lifestyle, seamlessly integrating cutting-edge content, commerce and live experiences with unparalleled scale. Through innovative content, Complex tells stories of music, streetwear and style, sports, art and beyond. Its content engages in a dynamic conversation with the audience, reflecting and shaping the zeitgeist of convergence culture. A powerful media juggernaut paired with a curated marketplace, Complex is redefining the way fans interact with their favorite brands and artists and reshaping the future of digital culture and commerce.

What You'll Do

  • Execute the voice of Complex Sneakers (Facebook, Twitter/X, Instagram, TikTok) on all of our social platforms, exhibiting excellent judgment and audience-sensitive framing.
  • Create, curate, and be responsible for all content published to Facebook, Twitter, and Instagram for Complex Sneakers.
  • Lead community management, moderation, and DM management using a deep understanding of the Complex tone and voice.
  • Surface and pitch viral, breaking, and social-friendly content to our news team.
  • Develop creative assets in Photoshop and Premiere/CapCut.
  • Ideate, source, and create original real-time content for all Complex Sneakers audiences, identifying memes and trends at the cusp of virality.
  • Support all social goals (growth, traffic, views, and engagement) and be accountable for providing vertical-specific tactics, strategic pivots, and added direction when needed.
  • Track and share social and platform best practices with internal teams, providing best-in-class examples where possible.
  • Manage workflow of editorial social posts, working with the branded social team to coordinate posting and schedules, when necessary.

Who You Are

New York
Other | Remote | Team 596 | Since 2002

Company And Culture

Complex is the definitive platform for global youth culture and music lifestyle, seamlessly integrating cutting-edge content, commerce and live experiences with unparalleled scale. Through innovative content, Complex tells stories of music, streetwear and style, sports, art and beyond. Its content engages in a dynamic conversation with the audience, reflecting and shaping the zeitgeist of convergence culture. A powerful media juggernaut paired with a curated marketplace, Complex is redefining the way fans interact with their favorite brands and artists and reshaping the future of digital culture and commerce.

What You'll Do

  • Execute the voice of Complex Style (Facebook, Twitter/X, Instagram, TikTok) on all of our social platforms, exhibiting excellent judgment and audience-sensitive framing.
  • Leverage a robust knowledge of streetwear and high fashion (and their intersections with music, pop culture, and sports) to inform content curation, trendspotting, and storytelling that resonates authentically with Complex’s audience.
  • Create, curate, and be responsible for all content published to Facebook, Twitter, and Instagram for Complex Style.
  • Lead community management, moderation, and DM management using a deep understanding of the Complex tone and voice.
  • Surface and pitch viral, breaking, and social-friendly content to our news team.
  • Develop creative assets in Photoshop and Premiere/CapCut.
  • Ideate, source, and create original real-time content for all Complex Style audiences, identifying memes and trends at the cusp of virality.
  • Support all social goals (growth, traffic, views, and engagement) and be accountable for providing vertical-specific tactics, strategic pivots, and added direction when needed.
  • Track and share social and platform best practices with internal teams, providing best-in-class examples where possible.
  • Manage workflow of editorial social posts, working with the branded social team to coordinate posting and schedules, when necessary.

Who You Are

New York

Deployment Strategist

Forerunner

A venture capital firm that understands the evolving consumer; we invest at the intersection of innovation & culture.

Other | Remote | Team 11-50 | Since 2012 | H1B No Sponsor

Hi! We're Forerunner. We believe that climate adaptation is a necessity, not a luxury, and communities deserve access to powerful software that helps them plan for the future. The challenge of climate change is complex – it implicates how municipalities plan, manage capital, and communicate to both residents and higher levels of government. Forerunner helps local communities do these things better by empowering them to access, understand, and mobilize local-level flood risk data at scale.

About the role

As a Deployment Specialist, you will own the end-to-end deployment experience for our customers, from discovery to go-live, ensuring the outcome aligns with business impact. You’ll act at the intersection of strategy, project management, and operational delivery, collaborating with customers, internal teams, and technical resources to make deployments high-quality, repeatable, and outcome-driven.

What you'll do

Customer Discovery & Planning

  • Lead structured workshops with customers to uncover business objectives, operational constraints, and success metrics.
  • Translate findings into a clear deployment roadmap and set of deliverables.
  • Define project scope, milestones, dependencies, and governance.

Project Leadership

  • Serve as the primary point of contact for the customer during deployment.
  • Coordinate across cross-functional teams (Implementation Specialist, Forward Deployed Engineer, Customer Success Manager, internal stakeholders) to ensure alignment and momentum.
  • Proactively identify risks, manage dependencies, and escalate when necessary.
  • Provide regular status updates to both the customer and internal leadership.

Solution Design & Oversight

  • Partner with the Implementation Specialist to ensure system configurations align with business goals.
  • Work with the Forward Deployed Engineer to scope integrations and customizations.
  • Ensure that deployment deliverables map back to the business requirements and success criteria.

Maine + 1 more
All locations: Maine | California
Job Closed

AI Engineer - FDE (Forward Deployed Engineer)

Databricks

Databricks is the data and AI company. More than 10,000 organizations worldwide — including Comcast, Condé Nast, Grammarly, and over 50% of the Fortune 500 — rely on the Databricks Data Intelligence Platform to unify and democratize data, analytics and AI. Databricks is headquartered in San Francisco, with offices around the globe and was founded by the original creators of Lakehouse, Apache Spark™, Delta Lake and MLflow.

Other | Remote | Team 1,001-5,000 | Since 2013 | H1B Sponsor

CSQ426R189

The AI Forward Deployed Engineering (AI FDE) team is a highly specialized customer-facing AI team at Databricks. We deliver professional services engagements to help our customers build and productionize first-of-its-kind AI applications. We work cross-functionally to shape long-term strategic priorities and initiatives alongside engineering, product, and developer relations, as well as support internal subject matter expert (SME) teams. We view our team as an ensemble: we look for individuals with strong, unique specializations to improve the overall strength of the team. This team is the right fit for you if you love working with customers and teammates and fueling your curiosity for the latest trends in GenAI, LLMOps, and ML more broadly. This role can be remote.

The impact you will have:

  • Develop cutting-edge GenAI solutions, incorporating the latest techniques from our Mosaic AI Research to solve customer problems.
  • Own production rollouts of consumer and internally facing GenAI applications.
  • Serve as a trusted technical advisor to customers across a variety of domains.
  • Present at conferences such as Data + AI Summit, recognized as a thought leader internally and externally.
  • Collaborate cross-functionally with the product and engineering teams to influence priorities and shape the product roadmap.

What we look for:

  • Experience building GenAI applications, including RAG, multi-agent systems, Text2SQL, fine-tuning, etc., with tools such as HuggingFace, LangChain, and DSPy.
  • Expertise in deploying production-grade GenAI applications, including evaluation and optimizations.
  • Extensive hands-on industry data science experience, leveraging common machine learning and data science tools (e.g., pandas, scikit-learn, PyTorch).
  • Experience building production-grade machine learning deployments on AWS, Azure, or GCP.
  • Graduate degree in a quantitative discipline (Computer Science, Engineering, Statistics, Operations Research, etc.) or equivalent practical experience.
  • Experience communicating and/or teaching technical concepts to non-technical and technical audiences alike.
  • Passion for collaboration, life-long learning, and driving business value through AI.
  • [Preferred] Experience using the Databricks Intelligence Platform and Apache Spark™ to process large-scale distributed datasets.

Pay Range Transparency

Databricks is committed to fair and equitable compensation practices. The pay range(s) for this role is listed below and represents the expected base salary range for non-commissionable roles or on-target earnings for commissionable roles. Actual compensation packages are based on several factors that are unique to each candidate, including but not limited to job-related skills, depth of experience, relevant certifications and training, and specific work location. Based on the factors above, Databricks anticipates utilizing the full width of the range. The total compensation package for this position may also include eligibility for annual performance bonus, equity, and the benefits listed above. For more information regarding which range your location is in visit our page here.

  • Zone 1 Pay Range: $180,656 - $248,360 USD
  • Zone 2 Pay Range: $180,656 - $248,360 USD
  • Zone 3 Pay Range: $180,656 - $248,360 USD
  • Zone 4 Pay Range: $180,656 - $248,360 USD

Benefits

At Databricks, we strive to provide comprehensive benefits and perks that meet the needs of all of our employees. For specific details on the benefits offered in your region, please visit https://www.mybenefitsnow.com/databricks.

Our Commitment to Diversity and Inclusion

At Databricks, we are committed to fostering a diverse and inclusive culture where everyone can excel. We take great care to ensure that our hiring practices are inclusive and meet equal employment opportunity standards. Individuals looking for employment at Databricks are considered without regard to age, color, disability, ethnicity, family or marital status, gender identity or expression, language, national origin, physical and mental ability, political affiliation, race, religion, sexual orientation, socio-economic status, veteran status, and other protected characteristics.

Compliance

If access to export-controlled technology or source code is required for performance of job duties, it is within Employer's discretion whether to apply for a U.S. government license for such positions, and Employer may decline to proceed with an applicant on this basis alone.

Texas + 3 more
All locations: Texas | Illinois | California | Washington
$180.7K - $248.4K / year
Job Closed