Vultr is on a mission to make high-performance cloud computing easy to use, affordable, and locally accessible.
Senior Product Manager, AI Infrastructure
Location: United States
Posted: 19 days ago
Salary: $130K - $165K / year
Seniority: Senior
Job Description
- Own the discovery and definition of customer requirements for AI infrastructure use cases, including training, inference, GPU clusters, bare metal, managed orchestration, networking, and storage
- Work directly with strategic customers to understand their technical needs, deployment timelines, workload patterns, and success criteria
- Translate customer requirements into clear product requirements, technical specifications, and engineering priorities
- Partner closely with engineering, infrastructure, networking, storage, and operations teams to deliver customer-ready solutions
- Drive alignment across customer needs, product roadmap, architecture decisions, and delivery execution
- Understand and define requirements across GPU compute, CPU, memory, local and shared storage, high-performance networking, cluster topology, and orchestration layers such as Kubernetes and Slurm
- Support customer conversations around infrastructure design, capacity planning, performance expectations, operational readiness, and acceptance criteria
- Help define product capabilities that can scale beyond a single customer into repeatable AI infrastructure offerings
Job Requirements
- 5+ years of experience in technical product management, with a focus on infrastructure, cloud computing, or AI/ML platforms
- Deep technical understanding of GPU compute, high-performance networking (InfiniBand, RoCE), distributed storage, and cluster orchestration (Kubernetes, Slurm)
- Experience working directly with enterprise customers to define infrastructure requirements and translate them into product specifications
- Strong understanding of AI/ML workloads including training, fine-tuning, and inference at scale
- Proven ability to collaborate with engineering teams to deliver complex infrastructure products
- Excellent written and verbal communication skills, with the ability to engage both technical and executive audiences
- Experience with bare metal infrastructure and managed cloud services
- Bachelor's degree in Computer Science, Engineering, or related technical field; advanced degree preferred
Benefits
- 100% company-paid insurance premiums for employee medical, dental, and vision plans
- 401(k) plan that matches 100% up to 4%, with immediate vesting
- Professional Development Reimbursement of $2,500 each year
- 11 Holidays + Paid Time Off Accrual + Rollover Plan
- Commitment matters to Vultr! Increased PTO at 3-year and 10-year anniversaries + 1 month paid sabbatical every 5 years + Anniversary Bonus each year
- $500 stipend for remote office setup in first year + $400 each following year
- Internet reimbursement up to $75 per month
- Gym membership reimbursement up to $50 per month
- Company-paid Wellable subscription




