IT services that put people at the center of your business
AI Infrastructure TPM II
Location
California
Posted
3 days ago
Salary
$73.0K - $100.8K / year
Seniority
Lead
Job Description
AI Infrastructure TPM II
Astreya
- Lead large-scale AI infrastructure deployment programs across multiple sites, regions, or business units.
- Drive end-to-end project execution for GPU clusters, AI compute environments, storage platforms, high-speed networks, and data center infrastructure.
- Develop integrated project plans, implementation strategies, and operational readiness frameworks.
- Manage cross-functional coordination between engineering, operations, supply chain, vendors, and executive stakeholders.
- Identify and mitigate program risks, schedule impacts, technical dependencies, and operational constraints.
- Lead infrastructure migration, expansion, upgrade, and modernization initiatives.
- Drive governance reviews, project reporting, KPI tracking, and executive-level communications.
- Coordinate infrastructure acceptance testing, deployment validation, and production readiness activities.
- Mentor junior project managers and contribute to PMO process standardization and operational maturity.
- Support vendor negotiations, technical evaluations, and infrastructure planning initiatives.
Job Requirements
- Advanced understanding of AI infrastructure technologies including GPU platforms, storage systems, networking, and data center operations.
- 8+ years of technical project or program management experience within infrastructure environments.
- Proven experience leading large-scale infrastructure deployment or transformation programs.
- Strong risk management, executive communication, and stakeholder alignment skills.
- Experience coordinating multi-vendor and cross-functional technical teams.
- Ability to manage complex schedules, budgets, and operational dependencies.
- Relevant certifications preferred (PMP, PgMP, ITIL, Agile, CCNA, etc.).
Benefits
- Medical provided through UHC (PPO, HSA, Surest options)
- Medical provided through Kaiser (HMO option only) for California employees only
- Dental provided through UHC Nationwide
- Vision provided by UHC
- Flexible Spending Account for Health & Dependent Care
- Pre-Tax Account for Commuter Benefit/Parking & Transit (location-specific)
- Continuing Education and Professional Development via various integrated platforms, e.g. Udemy and Coursera
- Corporate Wellness Program provided by Goomi Group
- Employee Assistance Program
- Wellness Days
- 401k Plan
- Basic and Supplemental Life Insurance
- Short Term & Long Term Disability
- Critical Illness, Critical Hospital, and Voluntary Accident Insurance
- Tuition Reimbursement (available 6 months after start date, capped)
- Paid Time Off (accrued and prorated, maximum of 120 hours annually)
- Paid Holidays
- Any other statutory leaves, paid time, or other ancillary benefits required under state and federal law
AI Infrastructure Datacenter Technical Project Manager II
Astreya
Role Description
The AI Infrastructure Datacenter Technical Project Manager Level 2 serves as a senior project leader responsible for managing large-scale AI infrastructure programs, complex technical deployments, and cross-functional strategic initiatives. This role drives execution excellence across compute, GPU, storage, networking, and data center infrastructure domains while ensuring alignment with business and operational objectives.
Scope & Complexity
- Leads highly complex infrastructure programs with multiple concurrent workstreams.
- Manages enterprise-scale AI infrastructure deployments and operational initiatives.
- Influences program execution standards, governance models, and delivery methodologies.
Requirements
- Salary Range: $72,960.00 - $100,800.00 USD (Salary)
- Please note that the salary information provided herein is base pay only (gross); it does not include other forms of compensation which may or may not apply to this specific position, namely, performance-based bonuses, benefits-related payments, or other general incentives - none of which are guaranteed, may be subject to specific eligibility requirements, and are wholly within the discretion of Astreya to remit.
- Further, the salary information noted above is a range that consists of a minimum and maximum rate of pay for this specific position. Where an applicant or employee is placed on this range will depend and be contingent on objective, documented work-related considerations like education, experience, certifications, licenses, preferred qualifications, among other factors.


