Principal Software Engineer – Large-Scale LLM Memory and Storage Systems
Location
California | Massachusetts | Washington
Posted
166 days ago
Salary
$272K - $425.5K / year
Seniority
Lead
Job Description
NVIDIA
• Design and evolve a unified memory layer that spans GPU memory, pinned host memory, RDMA-accessible memory, SSD tiers, and remote file/object/cloud storage to support large-scale LLM inference
• Architect and implement deep integrations with leading LLM serving engines (such as vLLM, SGLang, TensorRT-LLM), with a focus on KV-cache offload, reuse, and remote sharing across heterogeneous and disaggregated clusters
• Co-design interfaces and protocols that enable disaggregated prefill, peer-to-peer KV-cache sharing, and multi-tier KV-cache storage (GPU, CPU, local disk, and remote memory) for high-throughput, low-latency inference
• Partner closely with GPU architecture, networking, and platform teams to exploit GPUDirect, RDMA, NVLink, and similar technologies for low-latency KV-cache access and sharing across heterogeneous accelerators and memory pools
• Mentor senior and junior engineers, set technical direction for memory and storage subsystems, and represent the team in internal reviews and external forums (open source, conferences, and customer-facing technical deep dives)
Job Requirements
- Master's or PhD, or equivalent experience
- 15+ years of experience building large-scale distributed systems, high-performance storage, or ML systems infrastructure in C/C++ and Python, with a track record of delivering production services
- Deep understanding of memory hierarchies (GPU HBM, host DRAM, SSD, and remote/object storage) and experience designing systems that span multiple tiers for performance and cost efficiency
- Experience with distributed caching or key-value systems, especially designs optimized for low latency and high concurrency
- Hands-on experience with networked I/O and RDMA/NVMe-oF/NVLink-style technologies, and familiarity with concepts like disaggregated and aggregated deployments for AI clusters
- Strong skills in profiling and optimizing systems across CPU, GPU, memory, and network, using metrics to drive architectural decisions and validate improvements in TTFT and throughput
- Excellent communication skills and prior experience leading cross-functional efforts with research, product, and customer teams
Benefits
- Equity
- Benefits



