CACI International Inc

Expertise and Technology for National Security

Infrastructure Lead Architect

Infrastructure Engineer · Other · Remote · Team 10,001+ · Since 1962 · H1B No Sponsor

Location

United States

Posted

94 days ago

Salary

$114K - $252K / year

Job Description

Job Title: Infrastructure Lead Architect
Job Category: Information Technology
Time Type: Full time
Minimum Clearance Required to Start: Secret
Employee Type: Regular
Percentage of Travel Required: Up to 10%
Type of Travel: Continental US

* * *

The Opportunity:

The Infrastructure Lead Architect will oversee all IT Engineering efforts supporting the Department of the Air Force (DAF) Enterprise Information Technology as a Service (EITaaS) Cloud Infrastructure IT Services Group (IITS). This is a strategic technical role that will oversee and lead the technical direction of the infrastructure group within the DAF EITaaS Program at the direction of the Senior PM.

As the Lead Architect, you will work with the Senior PM to oversee the technical resources of the entire group and assist in leading all technical projects, technical capability roadmaps, and the technical success of projects within the group. You will be asked to assist the PM in high-level Government meetings and briefings, as well as internal Executive briefings. You will assist in the development of schedules, documentation, and requirements for individual projects within IITS. You will assist in the improvement of technical processes such as ticket and change management and ensure that the group aligns with the technical needs of internal EITaaS and external DAF customers.

The Infrastructure Lead Architect will oversee the technical strategic direction of multiple infrastructure teams within the EITaaS Infrastructure IT Services Group (IITS) to provide full project lifecycle technical oversight at the direction of the Senior PM.

What you’ll get to do:

- Work with a team of more than 50 engineers. This includes technical oversight of projects and the greater infrastructure group, leading technical requirements, and leading strategic planning for the group, with technical initiatives spanning 12-24 months.
- Assist the Senior PMs and technical staff within the projects in writing and tracking technical requirements and documentation.
- Work with the individual project PMs on the 2-year goals and objectives for the successful delivery of all Project Scopes.
- Work with the Sr. PM daily to track and support technical progress of projects.
- Define continuous improvement opportunities and project improvement efforts within the overall project strategic plan.
- Support cost management with the Sr. PM to manage project costs; develop an ODC spend plan by quarter to track project hardware and software expenditures.
- Prioritize work and ensure successful delivery of program and project milestones.

Qualifications:

Required:

- Ability to obtain a Secret security clearance.
- 15+ years of relevant experience (a Bachelor’s Degree in an applicable field may be substituted for 5 years of experience).
- 10+ years of experience as an Architect leading complex, high-visibility infrastructure projects.
- 5+ years of experience managing a team of 15+ technical resources.
- 3+ years of experience developing schedules within Microsoft Project or Microsoft Project Online.
- 3+ years of experience working with project cost models for Labor and Other Direct Costs (ODCs).
- 10+ years of experience delivering SE solutions, preferably to the Department of Defense.
- 5+ years of experience managing Cloud computing infrastructure solutions (Azure preferred, but not required).
- Understanding of and experience with the System Development Lifecycle (SDLC).
- Experience developing technical project roadmaps projecting 18+ months of project milestones.
- Experience briefing senior members of a program team and senior Government officials.
- Experience managing the full SELC lifecycle and moving capabilities from Design to Production and Operational Readiness. Previous ownership of a project from development through deployment and maintenance.
- Experience managing incident and change tickets in Production Enterprise Environments.
- Strong analytical and problem-solving skills, and the ability to prioritize requirements for value in line with customer expectations and product vision.
- Excellent communication and leadership skills. You must effectively communicate the priorities, product vision, and requirements to the development team, stakeholders, and customers. Additionally, you should be able to collaborate effectively with cross-functional teams, including developers, designers, marketers, and salespeople.
- Experience managing a large DoD enterprise environment.
- Experience building and managing complex schedules within Microsoft Project or Microsoft Project Online.

Desired:

- Experience with work of similar complexity establishing Enterprise capabilities for a .MIL service.
- Current DoD Secret clearance or higher.
- Experience managing a team responsible for incident ticket response and change management.
- Experience working on large programs with over 1,000 FTEs.

What You Can Expect:

A culture of integrity. At CACI, we place character and innovation at the center of everything we do. As a valued team member, you’ll be part of a high-performing group dedicated to our customer’s missions and driven by a higher purpose: to ensure the safety of our nation.

An environment of trust. CACI values the unique contributions that every employee brings to our company and our customers, every day. You’ll have the autonomy to take the time you need through a unique flexible time off benefit and have access to robust learning resources to make your ambitions a reality.

A focus on continuous growth. Together, we will advance our nation's most critical missions, build on our lengthy track record of business success, and find opportunities to break new ground, in your career and in our legacy.

Pay Range:

There are a host of factors that can influence final salary including, but not limited to, geographic location, Federal Government contract labor categories and contract wage rates, relevant prior work experience, specific skills and competencies, education, and certifications.

Our employees value the flexibility at CACI that allows them to balance quality work and their personal lives. We offer competitive compensation, benefits, and learning and development opportunities. Our broad and competitive mix of benefits options is designed to support and protect employees and their families. At CACI, you will receive comprehensive benefits such as healthcare, wellness, financial, retirement, family support, continuing education, and time off benefits.

Since this position can be worked in more than one location, the range shown is the national average for the position. The proposed salary range for this position is: $114,600-$252,100

CACI is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, age, national origin, disability, status as a protected veteran, or any other protected characteristic.

More Infrastructure Engineer Jobs

Infrastructure Engineer – Bare Metal Kubernetes

Defense Unicorns

We help mission-focused heroes solve the world’s biggest software challenges.

Full Time · Remote · Team 51-200 · H1B No Sponsor

• Work in a small, dedicated team architecting US Navy enterprise systems.
• Collaborate with other teams to integrate applications with Defense Unicorns’ software delivery platform.
• Help build pipelines that automate testing, building, packaging, and deploying platform components.
• Leverage Infrastructure-as-Code to create standardized, reusable modules that automate provisioning across diverse hypervisors and cloud platforms.
• Provide grounded time and effort estimates for tasks. Assist in project planning and resource allocation based on estimated timelines.
• Validate Solutions/Implementations: Ensure that solutions and implementations align with the outlined tasks and business requirements.
• Rapidly understand customer mission requirements, conceptualize and prototype solutions, and rapidly iterate to deliver impact.
• Author and maintain technical design documents, architecture decision records (ADRs), and test plans.

United States
$148.8K - $201.3K / year
Job Closed
AI Infrastructure Engineer

Umpisa Inc

It’s easier to start than you think.

Full Time · Remote · Team 11-50 · H1B No Sponsor

• Define AI infrastructure architecture strategy.
• Lead cross-functional collaboration with Data Science and Security teams.
• Design multi-region GPU cluster strategy.
• Evaluate emerging AI infrastructure technologies.
• Establish best practices and governance models.
• Design and implement inference efficiency initiatives such as prompt/context caching.
• Build systems that allow fine-grained control over cache prefixes and retrieval strategies.
• Optimize latency and cost efficiency of large-scale LLM inference workloads.
• Support Retrieval-Augmented Generation (RAG) architectures.
• Architect and implement end-to-end encryption for cached AI content.
• Integrate customer-managed encryption keys (CMEK) within cloud environments.
• Ensure secure multi-tenant data isolation and compliance standards.
• Develop enterprise-ready vector similarity search systems.
• Optimize Approximate Nearest Neighbor (ANN) algorithms for scale and latency.
• Build ranking models for personalization, recommendation, and monetization.
• Contribute to highly scalable embedding search infrastructure.
• Design and maintain petabyte-scale distributed storage systems.
• Implement materialized views with consistent cross-datacenter updates.
• Support high-update throughput systems with low-latency point queries.
• Optimize large-scale table scans and distributed data processing.

Philippines
Site Reliability Engineer - AI Infrastructure

Andromeda Cluster

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster — but it filled almost instantly. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest that work in AI infrastructure, research and engineering.

Other · Remote · Team 11-50

Site Reliability Engineer - AI Infrastructure

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute: a marketplace that moves the infrastructure and workloads powering AGI, not dissimilar to the flows of capital in the world's financial markets. We are expanding to new frontiers to find the brightest that work in AI infrastructure, research and engineering.

What You’ll Do

- Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
- Build automation and tooling to streamline cluster deployments and integrations.
- Debug customer issues across networking, storage, scheduling, and system layers.
- Improve reliability and scalability of both training and inference infrastructure.
- Design and implement monitoring, alerting, and observability for critical systems.
- Collaborate with engineering and product teams to plan and deliver infrastructure for new services.
- Participate in on-call and incident response, leading postmortems and reliability improvements.

What We’re Looking For

- 5+ years experience in SRE, DevOps, or infrastructure engineering roles.
- Strong Linux systems and networking fundamentals.
- Deep experience with Kubernetes and container orchestration at scale.
- Proficiency with Infrastructure-as-Code (Terraform, Helm, Ansible, etc.).
- Strong automation and scripting skills (Python, Go, or Bash).
- Experience with observability stacks (Prometheus, Grafana, Loki, Datadog, etc.).
- Track record of operating production systems and leading incident response.

Nice to Have

- Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton, etc.).
- Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).
- Customer-facing support or consulting experience.

Why You’ll Love It Here

This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.

United States
Performance Engineer - AI Infrastructure

Andromeda Cluster

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster — but it filled almost instantly. Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest that work in AI infrastructure, research and engineering.

Other · Remote · Team 11-50

Performance Engineer - AI Infrastructure

Location: Global Remote / San Francisco · Full-Time

About Andromeda

Andromeda Cluster was founded by Nat Friedman and Daniel Gross to give early-stage startups access to the kind of scaled AI infrastructure once reserved only for hyperscalers. We began with a single managed cluster, but it filled almost instantly. Since then, we’ve been quietly building the systems, network, and orchestration layer that makes the world’s AI infrastructure more accessible.

Today, Andromeda works with leading AI labs, data centers, and cloud providers to deliver compute when and where it’s needed most. Our platform routes training and inference jobs across global supply, unlocking flexibility and efficiency in one of the fastest-growing markets on earth.

Our long-term vision is to build the liquidity layer for global AI compute. We are expanding to new frontiers to find the brightest that work in AI infrastructure, research and engineering.

The Opportunity

We are hiring a Performance Engineer to join our Growth team. In this role, your "product" is the efficiency and throughput of our massive-scale AI clusters. As we scale our network, the difference between a "working" cluster and an "optimized" one represents millions of dollars in value and weeks of saved research time for our customers. You will sit at the intersection of systems engineering and research, profiling end-to-end training runs to hunt down bottlenecks in compute, communication, and storage.

What You’ll Do

- Profile & Optimize: Conduct end-to-end profiling of training workloads to identify bottlenecks across GPU kernels, NCCL communication, and storage I/O.
- System Refinement: Collaborate with systems engineers to improve scheduling efficiency, collective communication performance, and kernel execution.
- Observability: Build and maintain high-fidelity tooling to monitor and visualize MFU, throughput, and cluster uptime.
- Process Design: Design technical processes (e.g., postmortem reviews, incident response) that help the team operate effectively and avoid repeating performance regressions.

What We’re Looking For

- Systems Intuition: You love optimizing performance and digging into systems to understand how every layer interacts, from the training loop to the hardware.
- Distributed Training Experience: Proven experience running distributed training jobs on multi-GPU systems or HPC clusters.
- Coding Proficiency: Strong programming skills in Python and C++ (Rust or CUDA experience is a major plus).
- ML Framework Depth: Solid understanding of PyTorch, JAX, or TensorFlow, and how large-scale training loops are built.
- Infrastructure Knowledge: Familiarity with modern cloud infrastructure, including Kubernetes and Infrastructure as Code.
- Rigor: A passion for measuring efficiency rigorously and translating raw profiling data into practical engineering improvements.

Strong Candidates May Have

- Low-Level Mastery: Experience with Linux kernel tuning, eBPF, and understanding systems design tradeoffs at the hardware level.
- Specialized AI Infra: Hands-on experience with GPUs, TPUs, or Trainium, and the networking libraries that power them (NCCL, MPI, UCX).
- Security & Privacy: Expertise in security best practices for high-scale infrastructure.
- Observability: Familiarity with monitoring tools like Prometheus and Grafana.

Why You’ll Love It Here

This is a builder’s role. You’ll have ownership and autonomy to shape how our systems run, working directly with customers and providers while building the foundation for reliable, scalable AI infrastructure.

Andromeda Cluster is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

United States
Job Closed