Develop, train, and scale AI models. All in one cloud.
Manager, HPC Storage Engineer
Location
United States
Posted
136 days ago
Salary
$150K - $240K / year
Seniority
Lead
Job Description
RunPod
- Own Distributed Storage Architecture: Define, evolve, and operate Runpod’s global storage platforms, supporting training, inference, checkpointing, and dataset access at scale.
- Build the Storage Engineering Team: Manage and grow a team of storage and systems engineers. Set clear ownership, technical direction, and operational standards across regions.
- High-Performance Shared Filesystems: Design and operate large-scale SAN and NFS deployments, including performance-sensitive shared storage for GPU clusters.
- Advanced Filesystems & Platforms: Lead deployments and operations of VAST Data, as well as Lustre or similar parallel filesystems used in HPC and AI environments.
- End-to-End Performance Ownership: Drive performance optimization from NAND and NVMe media through controllers, networking, and client access patterns.
- Next-Generation Storage Technologies: Evaluate and deploy cutting-edge capabilities such as NFS over RDMA, GPUDirect Storage (GDS), and low-latency data paths for accelerated workloads.
- Reliability & Scale: Establish best practices for replication, data tiering, data protection, failure recovery, capacity planning, and lifecycle management.
- Automation & Observability: Build automation for provisioning, expansion, upgrades, and monitoring. Ensure deep observability into throughput, latency, and error characteristics.
- Cross-Functional Collaboration: Partner with Datacenter Networking, GPU Platform, SRE, and Product teams to ensure storage systems meet evolving workload and customer needs.
- Vendor & Partner Management: Own technical relationships with storage vendors, hardware partners, and colocation providers; drive roadmap alignment and issue resolution.
Job Requirements
- 3+ years managing storage, systems, or infrastructure engineering teams in production environments.
- 8+ years designing and operating large-scale storage systems, including SAN and NFS architectures at multi-petabyte scale.
- Hands-on experience deploying, operating, or deeply integrating VAST Data in production environments is required.
- Experience with Lustre or comparable HPC filesystems (e.g., GPFS, BeeGFS) supporting high-concurrency workloads.
- Deep understanding of NAND, NVMe, PCIe, storage controllers, and performance characteristics across the stack.
- Proven experience with NFS over RDMA, RDMA-capable transports, or similar technologies. Familiarity with GPUDirect Storage strongly preferred.
- Strong Linux internals knowledge, including filesystems, I/O scheduling, memory management, and tuning for performance workloads.
- Experience running 24/7 storage platforms with strong incident response, change management, and post-mortem discipline.
- Ability to clearly communicate complex technical tradeoffs and lead teams through high-stakes infrastructure decisions.
- Successful completion of a background check.
Benefits
- Meaningful equity in a fast-growing company: everyone on the team receives stock options, so your impact drives our growth and you share in the upside.
- Generous medical, dental & vision plans — we cover 100% for all employees and partial for dependents.
- Flexible PTO: take the time you need to recharge.
- Most roles are remote-first, with inclusive, collaborative teams that use Slack as the main form of internal communication.
- Join a passionate team on the cutting edge of AI infrastructure — where culture, learning, and ownership are at the heart of how we scale.




