Vultr

Vultr is on a mission to make high-performance cloud computing easy to use, affordable, and locally accessible.

Senior Product Manager, AI Infrastructure

LLM Engineer, Machine Learning Engineer, Full Time, Remote, Senior, Team 201-500, Since 2014, H1B No Sponsor

Location

United States

Posted

19 days ago

Salary

$130K - $165K / year

Seniority

Senior

Bachelor Degree, 5 yrs exp, English, Cloud, Kubernetes

Job Description

• Own the discovery and definition of customer requirements for AI infrastructure use cases, including training, inference, GPU clusters, bare metal, managed orchestration, networking, and storage
• Work directly with strategic customers to understand their technical needs, deployment timelines, workload patterns, and success criteria
• Translate customer requirements into clear product requirements, technical specifications, and engineering priorities
• Partner closely with engineering, infrastructure, networking, storage, and operations teams to deliver customer-ready solutions
• Drive alignment across customer needs, product roadmap, architecture decisions, and delivery execution
• Understand and define requirements across GPU compute, CPU, memory, local and shared storage, high-performance networking, cluster topology, and orchestration layers such as Kubernetes and Slurm
• Support customer conversations around infrastructure design, capacity planning, performance expectations, operational readiness, and acceptance criteria
• Help define product capabilities that can scale beyond a single customer into repeatable AI infrastructure offerings

Job Requirements

5+ years of experience in technical product management, with a focus on infrastructure, cloud computing, or AI/ML platforms
Deep technical understanding of GPU compute, high-performance networking (InfiniBand, RoCE), distributed storage, and cluster orchestration (Kubernetes, Slurm)
Experience working directly with enterprise customers to define infrastructure requirements and translate them into product specifications
Strong understanding of AI/ML workloads including training, fine-tuning, and inference at scale
Proven ability to collaborate with engineering teams to deliver complex infrastructure products
Excellent written and verbal communication skills, with the ability to engage both technical and executive audiences
Experience with bare metal infrastructure and managed cloud services
Bachelor's degree in Computer Science, Engineering, or related technical field; advanced degree preferred

Benefits

100% company-paid insurance premiums for employee medical, dental and vision plans.
401(k) plan that matches 100% up to 4%, with immediate vesting
Professional Development Reimbursement of $2,500 each year
11 Holidays + Paid Time Off Accrual + Rollover Plan
Commitment matters to Vultr! Increased PTO at the 3-year and 10-year anniversaries, a 1-month paid sabbatical every 5 years, and an anniversary bonus each year
$500 stipend for remote office setup in first year + $400 each following year
Internet reimbursement up to $75 per month
Gym membership reimbursement up to $50 per month
Company-paid Wellable subscription

More LLM Engineer Jobs

Generative AI Engineer

Dataiku

Everyday AI, Extraordinary People

LLM Engineer, 21 days ago

Full Time, Remote, Team 1,001-5,000, H1B Sponsor

• Design end-to-end AI solutions on Dataiku's platform, leveraging Dataiku Agent Hub, Prompt Studio, LLM Mesh, and Knowledge Banks (Vector Stores), or Python-based frameworks where needed.
• Build and orchestrate multi-agent systems using Dataiku's Visual Agents (simple and structured), as well as code-based frameworks (LangGraph, CrewAI, Claude Agent SDK, OpenAI Agents SDK) as appropriate.
• Integrate and optimize LLM APIs across providers (OpenAI, Anthropic, Google Gemini, AWS Bedrock, Azure, open-source models via Dataiku's LLM Mesh), applying model routing strategies to balance cost, latency, and quality.
• Implement Retrieval-Augmented Generation (RAG) pipelines, including agentic RAG and GraphRAG, using Dataiku's Knowledge Banks with reranking, dynamic filtering, and document extraction capabilities.
• Work exclusively with the Marketing organisation, partnering across functions such as Demand Generation, Content Marketing, Product Marketing, Field Marketing, Marketing Operations, Brand, and Communications.
• Engage marketing stakeholders to gather business requirements, then go further: identify the underlying user or team pain points those requirements represent, and design solutions that address both the stated need and the deeper problem.
• Own projects end-to-end, from requirements intake and solution design through to build, deployment, and handover.
• Develop autonomous and semi-autonomous AI agents, using Dataiku's Agent Builder, custom Python-based architectures (LangGraph, CrewAI, Claude Agent SDK, etc.), or a combination of both. Exercise judgment on when to leverage platform capabilities and when to build custom solutions.
• Design and build Agent Tools beyond documented examples, including custom API integrations, data retrieval modules, decisioning logic, and automated workflows, pushing past out-of-the-box patterns to deliver solutions tailored to specific business problems.
• Build, publish, and consume MCP (Model Context Protocol) servers to enable agent-to-tool integration across systems, including designing custom MCP servers where needed.
• Develop evaluation and monitoring approaches for agent systems, combining Dataiku's built-in capabilities with custom instrumentation to measure reliability, accuracy, cost, and business impact in production.
• Design and maintain evaluation frameworks (evals) for LLM-based systems, measuring accuracy, latency, cost, and reliability in production.
• Adhere to data governance, security, and regulatory compliance requirements (EU AI Act awareness, responsible AI practices) for all AI solutions.
• Leverage Dataiku's Cost Guard and Quality Guard features to manage LLM spend, enforce usage policies, and maintain output quality standards.
• Work closely with analytics and data engineering teams to maintain metadata on reference datasets for LLM consumption.
• Create front-end user interfaces for AI applications using HTML, CSS, and JavaScript, within Dataiku's webapps framework, Dataiku Answers for chat-based interfaces, or standalone applications built with Vue.js and Node.js.
• Collaborate on UX design, ensuring internal stakeholders find AI solutions intuitive and responsive.
• Provide product feedback to the development team to improve the platform.
• Stay current with the rapidly evolving AI engineering landscape, agent frameworks, model capabilities, evaluation practices, governance requirements, and tools like MCP and A2A protocols.

AWS Azure JavaScript Node.js Python Vue.js Go

New York

$160K - $240K / year

Senior Software Engineer, DGX Cloud AI Infrastructure

NVIDIA

LLM Engineer, 22 days ago

Full Time, Remote, Team 10,001+, Since 1993, H1B Sponsor

• Lead bring-up, validation, and debugging of large-scale AI clusters, infrastructure, and end-to-end workloads, setting the standard for how the team operates.
• Bring up, tune, and benchmark AI pre-training, post-training, and inference workloads using PyTorch, NeMo / Megatron, TensorRT-LLM, and adjacent NVIDIA AI software stacks.
• Profile and optimize end-to-end workload performance across compute, memory, networking, and communication layers using tools such as Nsight Systems, NCCL tests, and custom microbenchmarks.
• Analyze scaling efficiency for distributed LLM workloads using data, tensor, pipeline, and expert parallelism across modern GPU clusters, and translate findings into concrete tuning guidance.
• Own root-cause analysis of complex failures (hangs, performance regressions, topology sensitivity) in large distributed environments.
• Define and build the resilience and failure-attribution stack: detecting, triaging, and attributing node, fabric, and workload failures across the cluster at scale.
• Build repeatable benchmark suites, automation, acceptance criteria, and qualification workflows on new platforms.
• Tune runtime settings, communication parameters, and deployment configurations in close partnership with framework, systems, and platform teams.
• Deliver actionable, data-driven recommendations based on profiling, benchmark results, and cluster characterization.
• Mentor engineers, drive technical standards, and act as a force multiplier across the broader performance and infrastructure organization.

Distributed Systems Node.js Python PyTorch

California + 3 more

$184K - $356.5K / year

Generative AI Engineer – LATAM Candidates Only

Talentus Global

We facilitate talent & software solutions across the globe. Near-shore, managed services, ERPs, CRMs, EdTech/HigherEd.

LLM Engineer, 23 days ago

Full Time, Remote, Team 201-500, Since 2020, H1B No Sponsor

• Design, develop, and deploy Generative AI solutions leveraging Large Language Models (LLMs) and multimodal AI technologies.
• Build and maintain scalable AI applications using cloud platforms such as Azure, AWS, or GCP.
• Develop and optimize Retrieval-Augmented Generation (RAG) architectures, vector databases, and knowledge retrieval systems.
• Fine-tune, evaluate, and monitor foundation models to improve performance, accuracy, and reliability.
• Implement prompt engineering strategies and AI orchestration frameworks to support business use cases.
• Collaborate with software engineering, data science, DevOps, and security teams to integrate AI solutions into production environments.
• Develop APIs, microservices, and AI-powered applications following software engineering best practices.
• Ensure compliance with AI governance, security, privacy, and responsible AI standards.
• Monitor AI workloads, model performance, and operational costs, recommending continuous improvements.
• Stay current with emerging Generative AI technologies, frameworks, and industry trends.

AWS Azure Cloud Docker Google Cloud Platform Kubernetes Microservices Python

Colombia

Senior AI Infrastructure Engineer

Pyyne

LLM Engineer, 24 days ago

Full Time, Remote, Team 51-200, Since 2020, H1B No Sponsor

• Design and deploy AI platforms that integrate with infrastructure tools
• Develop AI-powered workflows to automate operational tasks
• Build AI-driven automation for incident response and operational workflows
• Implement AI-powered monitoring and anomaly detection capabilities
• Create intelligent operational dashboards with actionable insights
• Ensure AI platforms operate reliably in production environments
• Develop AI solutions for cost optimization and predictive capacity planning

Ansible AWS Azure Cloud Google Cloud Platform ITSM Python Terraform

Brazil
