Job Closed

This listing is no longer active.

Runpod

Runpod is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full-stack AI applications. Founded in 2022, we are a rapidly growing, well-funded company with a remote-first organization spread globally. Our mission is to empower innovators and enterprises to unlock AI's true potential, driving technology and transforming industries. Join us as we shape the future of AI. We are building cloud services focused on accelerating AI adoption. Whether you're an experienced ML developer training a large language model or an enthusiast tinkering with Stable Diffusion, we strive to make GPU compute as seamless and affordable as possible.

Engineering Manager, Datacenter Network Engineering

Engineering Manager | Other | Remote | Lead | Team 80 | Since 2022

Location

United States

Posted

136 days ago

Salary

$150K - $240K / year

Seniority

Lead

Job Description

This description is a summary of our understanding of the job description.

Role Description

We are looking for an Engineering Manager, Datacenter Network Engineering to lead the team responsible for designing, deploying, and operating Runpod’s global datacenter and backbone network. This role manages engineers working on L2/L3 fabrics, high-performance GPU networking, and global WAN connectivity that underpin our AI platform. You will lead execution across multiple regions and vendors, while setting technical direction for network architecture that supports massive east-west traffic, low-latency GPU collectives, and secure multi-tenant isolation. This role is hands-on at the architectural level, while focused on team leadership, operational excellence, and scalability.

Responsibilities

  • Lead the Datacenter Networking Team:
      - Manage and grow a team of network engineers responsible for datacenter fabrics, interconnects, and global WAN connectivity.
      - Provide mentorship, technical guidance, and clear ownership boundaries.
  • Own Datacenter Network Architecture:
      - Define and evolve network designs for GPU-heavy clusters, including spine-leaf topologies, ECMP routing, and high-bandwidth east-west traffic patterns.
  • High-Performance GPU Networking:
      - Oversee design and operation of InfiniBand and RoCE-based fabrics supporting distributed training and inference workloads.
      - Ensure performance, loss characteristics, and congestion control meet AI workload requirements.
  • Encapsulation & Overlay Protocols:
      - Guide implementation and operations of encapsulation technologies such as VXLAN, EVPN, Geneve, or similar, enabling scalable multi-tenant isolation and flexible network provisioning.
  • Global WAN & Backbone Connectivity:
      - Lead strategy and execution for global WAN connectivity, including private backbone links, IX connectivity, and hybrid connectivity with cloud providers and partners.
  • Reliability & Operations:
      - Establish operational best practices for monitoring, capacity planning, change management, incident response, and post-mortems across the network stack.
  • Cross-Functional Collaboration:
      - Partner closely with Infrastructure, SRE, Hardware, and Product Engineering teams to ensure network capabilities align with platform and customer requirements.
  • Vendor & Partner Management:
      - Work with hardware vendors, colocation providers, and transit partners on network design, procurement, deployment timelines, and escalations.
  • Security & Segmentation:
      - Ensure network designs support secure isolation, DDoS resilience, and compliance requirements without compromising performance.

Qualifications

  • 3+ years managing network or infrastructure engineering teams, with experience scaling teams and systems in production environments.
  • 8+ years designing and operating large-scale datacenter networks, including spine-leaf architectures, BGP-based routing, and high-throughput fabrics.
  • Strong hands-on experience with VXLAN/EVPN or equivalent encapsulation protocols, including control-plane and data-plane considerations.
  • Proven experience with InfiniBand and/or RoCE, including congestion management, lossless Ethernet concepts, and performance tuning for GPU workloads.
  • Deep familiarity with global WAN technologies, including private backbone design, inter-region connectivity, routing policy, and traffic engineering.
  • Comfortable working with Linux-based systems, network operating systems, and automation tooling.
  • Strong background in network observability, incident management, capacity forecasting, and change control.
  • Clear written and verbal communication skills, with the ability to align stakeholders and lead teams through complex technical challenges.
  • Successful completion of a background check.

Preferred Qualifications

  • Experience operating networks for GPU clusters, HPC environments, or AI/ML platforms.
  • Familiarity with RDMA tuning, NCCL traffic patterns, and distributed training communication models.
  • Experience with automation frameworks and network-as-code (e.g., Terraform, Ansible, internal tooling).
  • Background in multi-region or multi-cloud networking architectures.
  • Experience working in high-growth or hyperscale infrastructure environments.

Benefits

  • The competitive base pay for this position ranges from $150,000 to $240,000.
  • Meaningful equity in a fast-growing company: everyone on the team receives stock options.
  • Generous medical, dental & vision plans: we cover 100% for all employees and partial for dependents.
  • Flexible PTO: take the time you need to recharge.
  • Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication.
  • Join a passionate team on the cutting edge of AI infrastructure, where culture, learning, and ownership are at the heart of how we scale.

More Engineering Manager Jobs

Engineering Manager, Ambassador Routing

Airbnb

Airbnb is a community based on connection and belonging.

Other | Remote | Team 5,001-10,000 | Since 2007 | H1B Sponsor

  • Lead a team of engineers building routing and decisioning capabilities that improve customer outcomes and operational efficiency.
  • Build and scale Skills Based Routing that better matches contacts to ambassador skills; enable Differentiated Service experiences that appropriately tailor support pathways to customer context, improving quality, speed, and customer satisfaction.
  • Contribute to Airbnb-wide initiatives (e.g., Services, Experiences, multi-region resiliency) by aligning team capabilities with shared goals and ensuring smooth integration across dependencies.
  • Raise the bar on technical quality and execution through roadmap planning, clear ownership, and pragmatic architecture and design reviews.
  • Partner closely with Product, Data, Science, Operations, and adjacent engineering teams to drive success metrics (e.g., NPS, CSAT, time-to-resolution).
  • Champion customer-centric discovery: use qualitative insights alongside metrics to prioritize what will most improve the experience.
  • Develop talent and team health through coaching, feedback, and career development, creating an environment of high trust, inclusion, and accountability.

United States
$204K - $255K / year
Job Closed

Senior Data Engineering Manager

Built

Connect and Simplify Doing Business in Real Estate

Other | Remote | Team 201-500 | H1B Sponsor

  • Coach engineers through clear expectations, feedback, and career development.
  • Hire and retain top talent and build a high-performance, inclusive culture.
  • Establish strong delivery and operational rituals, including planning, retrospectives, and incident reviews.
  • Define and evolve the architecture for ingestion, transformation, orchestration, governance, and data products.
  • Drive a roadmap that balances foundational platform investments with product delivery needs.
  • Champion best practices, including dbt patterns, data contracts, testing, and documentation.
  • Ensure pipelines and models are accurate, observable, secure, and scalable.
  • Improve reliability through alerting, SLAs and SLOs, runbooks, and root-cause analysis.
  • Partner with platform engineering on deployment patterns, cost optimization, and environment strategy.
  • Collaborate with Product, Analytics, Security, and Engineering leaders to ensure data enables customer and business outcomes.
  • Communicate clearly with stakeholders on tradeoffs, risks, and timelines.
  • Influence the broader organization on data quality, trust, and accountability.
  • Stay close to the work through architecture reviews, pairing, design docs, and occasional implementation.
  • Bring strong judgment to tooling and build-versus-buy decisions across Snowflake, dbt, and Sigma.

United States
$225K - $260K / year
Job Closed

Engineering Manager

WelbeHealth

Unlocking the full potential of our most vulnerable seniors.

Other | Remote | Team 501-1,000 | H1B Sponsor

  • Lead end-to-end architecture, development, and delivery of APIs, microservices, and integration pipelines for internal and external healthcare systems, as well as act as the senior-most engineer for the team: designing components, writing high-quality code, troubleshooting complex issues, and performing rigorous code reviews.
  • Build scalable, cloud-native solutions using Azure, containerized workloads (Docker/Kubernetes), and modern integration frameworks.
  • Implement repeatable MLOps patterns for model deployment, monitoring, and lifecycle management.
  • Build prototype PoCs that validate new technologies, workflows, or AI-enabled automations before scaling to production.
  • Lead and mentor a team of engineers, fostering a culture of innovation, ownership, and continuous learning, as well as step in as the escalation point for technical blockers, guiding the team toward solutions.
  • Collaborate with Product, Architecture, Cloud, Security, and other engineering teams to define integration strategies and solution blueprints, as well as translate complex technical designs into clear communication for non-technical stakeholders.
  • Ensure all development aligns with HIPAA, PHI/PII-safe engineering, and healthcare interoperability requirements, as well as apply FHIR, HL7, EDI, and related healthcare interoperability patterns where applicable.

New York
$132.2K - $174.5K / year
Job Closed

Engineering Manager

Rebuy Engine

Create intelligent shopping experiences.

Other | Remote | Team 51-200 | Since 2017 | H1B No Sponsor

  • Lead, support, and empower a team of brilliant engineers by fostering collaboration, innovation, and ownership.
  • Lead technical architecture decisions for scalable data processing systems on Google Cloud Platform (GCP).
  • Drive technical strategy for data ingestion, processing, and storage solutions.
  • Collaborate with our Product Owners and other Engineering Managers to define clear functional requirements, timelines, roadmaps, and designs for your team.
  • Oversee production monitoring, incident response, and blameless post-mortem processes to continuously improve system reliability.
  • Own the full SDLC execution aligned with Rebuy’s product roadmap.
  • Manage team resources by clarifying requirements, removing blockers, and overseeing task prioritization.
  • Oversee release management, ensuring smooth QA and production rollouts.
  • Develop and maintain dashboards, logging, and alerting systems for proactive issue resolution.
  • Improve operational efficiency by reviewing development processes, testing strategies, code reviews, and release management.
  • Communicate and present to stakeholders and internal teams regarding roadmaps, updates, release schedules, etc.

United States
$150K - $195K / year
Job Closed