Principal AI Researcher – Agentic Systems, AI Infrastructure
Location
Virginia | Washington
Posted
4 days ago
Salary
$250K - $300K / year
Seniority
Lead
Job Description
Principal AI Researcher – Agentic Systems, AI Infrastructure
Trase
• Define and evolve the long-term AI/ML research strategy and technical roadmap for Trase OS in alignment with product and platform direction.
• Lead large-scale experimentation and prototyping efforts requiring significant compute infrastructure, translating frontier AI research into scalable, production-grade systems with measurable impact.
• Drive original research and technical breakthroughs in agentic systems, autonomous execution, multi-agent orchestration, post-training and fine-tuning systems, SLM/LLM-based architectures, and applied AI infrastructure.
• Design how models operate within long-lived execution environments, including agent workflows, tool use, planning, memory systems, reasoning, and human-in-the-loop controls.
• Establish evaluation methodologies and reliability frameworks for autonomous systems, including benchmarking, regression testing, safety, controllability, and production behavior analysis.
• Drive architecture decisions across orchestration, model serving, routing, inference, and infrastructure governance, including latency, reliability, and cost optimization.
• Partner closely with engineering and product teams to operationalize research outcomes into deployable systems and enterprise workflows.
• Build AI systems that operate reliably in regulated and constrained environments, including secure cloud, on-premise, and air-gapped deployments.
• Contribute to the broader AI research community through technical papers, publications, conference participation, architecture proposals, and thought leadership.
• Serve as a senior technical authority and mentor across the organization, influencing technical direction, research rigor, experimentation practices, and best practices across research, engineering, and product teams.
Job Requirements
- 12–15+ years of experience in machine learning, AI systems, or applied AI research, including experience operating at a Principal, Distinguished, or equivalent technical level.
- Strong research and publication track record, including authored papers, major technical contributions, or active participation in frontier AI research.
- Experience publishing at top-tier conferences or contributing influential open-source, research, or AI infrastructure systems.
- Experience conducting large-scale experimentation requiring significant compute infrastructure, evaluation workflows, and iterative model/system analysis.
- Deep expertise in one or more areas including agentic systems, LLMs and generative AI, multi-agent systems, reasoning systems, reinforcement learning, orchestration infrastructure, AI systems reliability, NLP, multimodal systems, or deep learning.
- Hands-on experience with agent-based systems, prompt engineering, RAG, RLHF, SLMs, fine-tuning/post-training techniques, tool integration, memory systems, and human-in-the-loop orchestration.
- Proven experience building, deploying, and operating enterprise-grade AI systems, including GenAI, LLM, or agent-based applications at scale.
- Strong understanding of ML system behavior in production, including reliability, latency, cost tradeoffs, observability, evaluation frameworks, regression testing, and failure modes.
- Strong systems thinking and demonstrated ability to partner cross-functionally with engineering and product organizations to move research into production systems.
- Strong programming and prototyping skills in Python and modern ML infrastructure stacks, with experience in Java or related systems languages preferred.
- Experience deploying AI/ML systems in regulated, constrained, or enterprise environments, and demonstrated ability to lead technical direction from research through production impact.
Benefits
- Career track opportunity with potential for rapid advancement with strong performance as the firm grows
- 100% employer paid, comprehensive health care including medical, dental, and vision for you and your family.
- Paid maternity and paternity leave for 14 weeks at the employee's normal pay.
- Unlimited PTO, with management approval.
- Opportunities for professional development and continued learning.
- Optional 401K, FSA, and equity incentives available.
- Mental health benefits are available through Tara Mind.
More LLM Engineer Jobs
Principal AI Researcher – Agentic Systems, AI Infrastructure
Red Cell Partners: Impact Through Innovation
AI Infrastructure Datacenter Technical Project Manager II
Astreya: IT services that put people at the center of your business
Role Description
The AI Infrastructure Datacenter Technical Project Manager Level 2 serves as a senior project leader responsible for managing large-scale AI infrastructure programs, complex technical deployments, and cross-functional strategic initiatives. This role drives execution excellence across compute, GPU, storage, networking, and data center infrastructure domains while ensuring alignment with business and operational objectives.
Key Responsibilities
- Lead large-scale AI infrastructure deployment programs across multiple sites, regions, or business units.
- Drive end-to-end project execution for GPU clusters, AI compute environments, storage platforms, high-speed networks, and data center infrastructure.
- Develop integrated project plans, implementation strategies, and operational readiness frameworks.
- Manage cross-functional coordination between engineering, operations, supply chain, vendors, and executive stakeholders.
- Identify and mitigate program risks, schedule impacts, technical dependencies, and operational constraints.
- Lead infrastructure migration, expansion, upgrade, and modernization initiatives.
- Drive governance reviews, project reporting, KPI tracking, and executive-level communications.
- Coordinate infrastructure acceptance testing, deployment validation, and production readiness activities.
- Mentor junior project managers and contribute to PMO process standardization and operational maturity.
- Support vendor negotiations, technical evaluations, and infrastructure planning initiatives.
Scope & Complexity
- Leads highly complex infrastructure programs with multiple concurrent workstreams.
- Manages enterprise-scale AI infrastructure deployments and operational initiatives.
- Influences program execution standards, governance models, and delivery methodologies.
Qualifications
- Advanced understanding of AI infrastructure technologies including GPU platforms, storage systems, networking, and data center operations.
- 8+ years of technical project or program management experience within infrastructure environments.
- Proven experience leading large-scale infrastructure deployment or transformation programs.
- Strong risk management, executive communication, and stakeholder alignment skills.
- Experience coordinating multi-vendor and cross-functional technical teams.
- Ability to manage complex schedules, budgets, and operational dependencies.
- Relevant certifications preferred (PMP, PgMP, ITIL, Agile, CCNA, etc.).
Requirements
- Salary Range: $72,960.00 - $100,800.00 USD (Salary)
- Please note that the salary information provided herein is base pay only (gross); it does not include other forms of compensation which may or may not apply to this specific position, namely, performance-based bonuses, benefits-related payments, or other general incentives, none of which are guaranteed, may be subject to specific eligibility requirements, and are wholly within the discretion of Astreya to remit.
- Further, the salary information noted above is a range that consists of a minimum and maximum rate of pay for this specific position. Where an applicant or employee is placed on this range will depend and be contingent on objective, documented work-related considerations like education, experience, certifications, licenses, preferred qualifications, among other factors.
Benefits
- Medical provided through UHC (PPO, HSA, Surest options)
- Medical provided through Kaiser (HMO option only) for California employees only
- Dental provided through UHC
- Nationwide Vision provided by UHC
- Flexible Spending Account for Health & Dependent Care
- Pre-Tax Account for Commuter Benefit/Parking & Transit (location-specific)
- Continuing Education and Professional Development via various integrated platforms, e.g. Udemy and Coursera
- Corporate Wellness Program provided by Goomi Group
- Employee Assistance Program
- Wellness Days
- 401k Plan
- Basic and Supplemental Life Insurance
- Short Term & Long Term Disability
- Critical Illness, Critical Hospital, and Voluntary Accident Insurance
- Tuition Reimbursement (available 6 months after start date, capped)
- Paid Time Off (accrued and prorated, maximum of 120 hours annually)
- Paid Holidays
- Any other statutory leaves, paid time, or other ancillary benefits required under state and federal law
• Identify, vet, and manage Tier 1/2 OEMs and regional distributors for high-density servers, network gear, and cabling.
• Drive end-to-end contract lifecycles, including Master Purchase Agreements (MPAs), Service Level Agreements (SLAs), and complex warranty/support negotiations.
• Monitor global semiconductor trends to mitigate long-lead-time risks. Support Solution Engineering by ensuring "just-in-time" inventory of mission-critical hardware (GPUs, NICs, Switches).
• Partner with Systems Engineering and Architecture teams to translate technical specs into scalable, multi-year procurement roadmaps.
• Oversee the procurement and delivery of integrated components, including NVIDIA Grace CPUs, NVLink, InfiniBand, and ConnectX-8 technologies.
• Architect procurement workflows that satisfy stringent security, data residency, and national compliance requirements for Sovereign AI cloud deployments.
Principal AI Infrastructure – Hardware Program Management
Astera Labs: Purpose-built connectivity solutions for intelligent systems
• Lead and manage global AI, Storage, and Networking hardware design programs, ensuring on-time delivery, scope control, and budget adherence
• Drive program governance, risk management, and execution excellence across all phases of product development
• Provide regular program updates, risk assessments, and financial reporting to executive leadership through structured reviews (e.g., Leadership Program Reviews)
• Oversee the successful launch of complex hardware platforms, including AI GPU-based systems
• Manage high-priority, resource-constrained programs while maintaining quality and schedule commitments
• Enable innovation in next-generation AI infrastructure and high-performance computing platforms
• Lead end-to-end program management for UALink / PCIe Gen6 switch tray development supporting rack-scale AI platforms and GPU clusters
• Coordinate design, validation, and manufacturing readiness of switch trays across EVT, DVT, and PVT phases
• Drive integration of switch silicon, retimer cards, cabling, and system-level connectivity within full rack-scale architectures
• Collaborate with ODMs and partners to align on design specifications, cost models, and development schedules
• Manage technical trade-offs across performance, signal integrity, power delivery, thermals, and scalability for high-density GPU deployments
• Partner with Tier-1 customers to deliver Joint Design Manufacturing (JDM) programs
• Ensure alignment with customer-specific requirements in design, supply chain, and quality
• Build strong customer relationships and act as a trusted advisor throughout the product lifecycle
• Lead globally distributed teams across engineering (HW/SW), supply chain, manufacturing, and quality organizations
• Coordinate teams across multiple regions (e.g., North America and Asia) to drive seamless execution
• Guide programs through EVT, DVT, and PVT phases, ensuring technical validation across electrical, thermal, power, signal integrity, and mechanical domains




