Job Closed

This listing is no longer active.

The AI Factory. Accelerating the Future.

Lead Infrastructure Engineer

Infrastructure Engineer | Full Time | Remote | Senior | Team 51-200 | Since 2020 | H1B No Sponsor

Location

United Kingdom

Posted

93 days ago

Salary

Seniority

Senior

Bachelor Degree | English | Cloud | Kubernetes | Linux | OpenStack

Job Description

• Own and drive the design, deployment, and operation of OpenStack and Kubernetes clusters optimised for GPU workloads
• Lead and develop a team of 4–5 infrastructure engineers, setting clear direction and standards
• Build and improve infrastructure through automation: IaC, GitOps, and CI/CD pipelines
• Ensure platform reliability through strong monitoring, observability, and incident management practices
• Collaborate closely with DevOps, Product, and Support teams to align infrastructure with real-world customer needs
• Take ownership of operational governance including incident, problem, and change management
• Identify opportunities to simplify, standardise, and scale systems as the platform grows
• Communicate clearly with leadership on platform performance, risks, and improvements

Job Requirements

Strong hands-on experience operating OpenStack in production environments
Experience running production-grade Kubernetes clusters — ideally bare-metal or private cloud
Solid Linux, networking, and storage fundamentals with a pragmatic troubleshooting approach
Experience with infrastructure automation, CI/CD, and Git-based workflows
Proven leadership or mentoring experience within infrastructure or platform teams
Experience managing incidents and coordinating response during critical service events
Strong communication skills, particularly translating technical issues for non-technical stakeholders.
Experience integrating Kubernetes with OpenStack (Nice to Have)
Exposure to GPU infrastructure, HPC, or large-scale compute platforms (Nice to Have)
Familiarity with advanced networking or cloud-native ecosystems (Nice to Have)
Contributions to open-source or cloud-native communities (Nice to Have)

Benefits

Competitive salary and annual discretionary bonus scheme
Employee wellbeing benefits
25 days of holiday, plus public holidays
Flexible working arrangements (remote or hybrid, depending on role and location)
Real ownership and autonomy, with the trust to take initiative and experiment
The opportunity to make a visible, meaningful impact as we scale
Clear career progression and growth opportunities in a fast-growing company
A collaborative, international culture built on trust, transparency, and ownership
The chance to help shape NexGen Cloud's team, culture, and future alongside ambitious, mission-driven colleagues

More Infrastructure Engineer Jobs

Founding Cloud Infrastructure Engineer

Henosia

Infrastructure Engineer | 93 days ago

Full Time Remote

The Opportunity

Henosia has product-market fit. People are building real apps with us. They're automating work. They're integrating systems and connecting data. We've proven that vibe coding works, not just as a toy, but for real businesses, in production. You can describe software in plain text and get production-ready code.

Here's the challenge: our current infrastructure doesn't scale to what's coming. We need to go from supporting hundreds of concurrent sandboxes to tens of thousands. And we need to do it without our cloud bills spiralling out of control. Fully isolated development environments that start in a second.

That's where you come in. You'll own the entire cloud infrastructure. You'll architect how we scale our sandbox environment using micro VMs. You'll build the systems that keep users isolated from each other. You'll make sure we can handle 100x growth without everything catching fire.

There's no infrastructure team. No senior engineer to guide you. It's you, the founders, and the product engineers. You'll make the calls on what tech to use, how to architect it, and when to ship.

If you get this right, you'll have built the foundation that lets millions of people build software on Henosia. No pressure.

What You'll Do

You'll own infrastructure. Specifically:
- Build and scale our Micro VM-based sandbox infrastructure from the ground up
- Design and build resilient systems for rolling out new releases
- Design isolation and security systems so users can run untrusted code safely
- Architect for massive scale: think 10x, 100x, 1000x current load
- Optimize costs: every Micro VM costs money, every second matters
- Work with product engineers to make sure infra integrates smoothly
- Monitor, debug, and fix infrastructure issues before users notice them
- Make hard calls on technical tradeoffs
- Ship fast: this is infrastructure, but it's still a startup

Who You Are

You've been building infrastructure for 5-10+ years. You know how to scale systems. You've dealt with the pain of production outages at 3am and learned from them.

You understand Micro VMs, Linux networking, and containerization deeply. Maybe you've worked with Cloud-hypervisor, Firecracker, gVisor, or similar tech.

You know or want to learn TypeScript and Go. TypeScript is our main language, but we're open to adopting Go for key infrastructure components. Ideally, you know both languages, or you're willing to learn them.

You don't need someone to tell you what to build. You see a scaling problem, you architect a solution. You see a security hole, you fix it. You're comfortable making big technical decisions with incomplete information.

You're paranoid about security. Not in a theoretical way, but in a "users will run potentially malicious code in our sandbox, and I need to make sure they can't escape" way.

You're fine with chaos. Infrastructure at early-stage startups is held together with duct tape and prayers. You know how to move fast without breaking everything.

Requirements

Must have:
- 5-10+ years building and scaling cloud infrastructure (you've been on-call)
- Deep experience with Linux containerization and Linux VMs
- Track record of architecting systems for scale
- Experience with one or more cloud platforms (AWS, GCP, Azure, Hetzner)
- Self-starter mindset: you figure things out

Nice to have:
- Experience with Cloud-hypervisor, Firecracker, or similar Micro VM technologies
- Experience managing a fleet of bare metal servers at scale, e.g. Hetzner Robot
- Advanced level experience with Linux networking, namespaces, file systems, and memory management
- TypeScript and Go knowledge
- Worked at an early-stage startup before
- Built coding sandbox systems
- Dealt with security in multi-tenant environments

What We Offer
- Base salary is 70-80K DKK/month (depends on experience)
- ESOP. For the right profile, real equity in a fast-growing startup
- Remote friendly. Denmark-based is a plus but not required
- Own the infrastructure. Founding role means you decide how we scale
- Work with experienced founders. Jim and Janne have been building products for 20 years
- Solve real problems. This isn't maintaining someone else's infra - you're building it from scratch

Linux Docker/Containers TypeScript AWS GCP Azure

Denmark

Lead Infrastructure Engineer

NexGen Cloud

The AI Factory. Accelerating the Future.

Infrastructure Engineer | 93 days ago

Full Time | Remote | Team 51-200 | Since 2020 | H1B No Sponsor

Cloud Kubernetes Linux OpenStack

Australia

Job Closed

AWS Cloud Infrastructure Engineer

Leidos

Leidos is an innovation company rapidly addressing the world’s most vexing challenges in national security and health.

Infrastructure Engineer | 93 days ago

Full Time | Remote | Team 10,001+ | Since 1969 | H1B Sponsor

Leidos was awarded the U.S. Air Force Cloud One Architecture and Common Shared Services contract, and currently has an opening for Cloud Engineers across AWS, Azure, Google, and Oracle clouds. This is an exciting opportunity to use your experience to modernize a leading, global-scale multi-cloud environment in support of a critical mission, supporting USAF system resiliency, security, and cost effectiveness.

Location: This position will be remote. Preferred candidates will be located near Hanscom AFB (Boston, MA) or work in Huntsville, AL.

Primary Responsibilities

We are seeking an AWS Cloud Operations and Support Engineer with expertise in multiple cloud platforms. The successful individual will be responsible for developing scalable cloud-native solutions and ensuring best practices across architecture, development, deployment, and security, from design, test, and integration through production, sustainment, and maintenance. This is a hands-on technical role that requires rolling up your sleeves to architect, code, debug, and mentor.
- Perform cloud operations and engineering tasks to enhance, sustain, and maintain scalable, resilient, and secure cloud solutions for the AWS cloud environment
- Perform AWS cloud operations, sustainment, and maintenance activities to maintain optimum cloud
- Adopt and utilize DevSecOps practices, infrastructure as code, and automation frameworks
- Through development and sustainment activities, optimize application performance and reliability in cloud environments
- Design, implement and sustain secure cloud architectures and networks implementing zero-trust principles and defense-in-depth strategies
- Maintain compliance with industry standards (SOC 2, HIPAA, PCI-DSS, etc.) and regulatory requirements
- Architect, implement and maintain cloud networking security controls including STIG requirements
- Implement identity and access management solutions and security monitoring frameworks
- Support development of migration methodologies and ensure minimal organizational disruption during transitions
- Utilize CI/CD workflows and infrastructure-as-code development using Jenkins, Terraform, Ansible, Kubernetes, Jira, Confluence, Artifactory, and Guacamole to support DevSecOps practices.
- Containerize applications to enhance scalability and deployment efficiency.
- Support the design and development of Shared Services.
- Configure and troubleshoot cloud, virtual, and physical hardware and software systems.
- Establish and maintain SQL and NoSQL databases, ensuring their performance and reliability.
- Support preparation of detailed technical documentation of development and operational processes.
- Work in cross-functional teams including development, operations, security, and product management

Minimum Qualifications
- Bachelors and 4+ years or more of experience; Masters and 2+ years or more of experience. Additional experience may be accepted in lieu of degree.
- Secret clearance required
- US citizenship required
- Certifications: CompTIA Security+ or equivalent (IAT-2)
- Practiced verbal and written communications skills
- Ability to participate in team efforts to accomplish assigned tasks
- Demonstrated experience in cloud operations and sustainment and performing tasks and actions described in the primary responsibilities section

Preferred Qualifications
- Experience with USAF Cloud One or Platform 1
- Knowledge of Zero Trust Architecture. Experience a plus.
- Capable of working in high powered teams and maintaining positive interpersonal relationships while delivering products and services to the customer
- Understanding Active Directory, AWS AD, SAML and the standards, procedures, and processes
- Experience with Ansible, AWS console, Elastic, AWS, Jira, Confluence, Git, Bitbucket and various cloud Software as a Service (SaaS) offerings to conduct DEV/SEC/OPS pipeline development activities
- Administration experience with cloud-based applications (MS O365, SharePoint, AWS AD, AWS)
- Experience administering Windows Server, and related services
- Cloud certifications in AWS, Azure, Google, or Oracle clouds
- Certification examples: AWS Certified Solutions Architect (Professional), Azure Solutions Architect (Expert), MCSE (Server), Certified AWS SysAdmin, AWS Certified Cloud Practitioner, AWS Certified Developer, AWS Certified Solutions Architect (Dev/Associate), AWS Certified DevOps Engineer, AWS Certified Advanced Networking, AWS Certified Security, Azure Developer Associate, Azure Solutions Architec

If you're looking for comfort, keep scrolling. At Leidos, we outthink, outbuild, and outpace the status quo, because the mission demands it. We're not hiring followers. We're recruiting the ones who disrupt, provoke, and refuse to fail. Step 10 is ancient history. We're already at step 30, and moving faster than anyone else dares.

Original Posting: April 13, 2026

For U.S. Positions: While subject to change based on business needs, Leidos reasonably anticipates that this job requisition will remain open for at least 3 days with an anticipated close date of no earlier than 3 days after the original posting date as listed above.

Pay Range: The Leidos pay range for this job level is a general guideline only and not a guarantee of compensation or salary. Additional factors considered in extending an offer include (but are not limited to) responsibilities of the job, education, experience, knowledge, skills, and abilities, as well as internal equity, alignment with market data, applicable bargaining agreement (if any), or other law.

AWS Azure Oracle Database Infrastructure as Code Observability/Monitoring CI/CD Jenkins Terraform Ansible Kubernetes JIRA Confluence Artifactory SQL NoSQL Active Directory SAML Git Bitbucket Windows Server

United States

Job Closed

Senior or Staff AI Infrastructure Engineer

TRM Labs

TRM Labs specializes in blockchain investigations and risk management, empowering organizations to detect, investigate, and prevent crypto-related fraud and fin

Infrastructure Engineer | 93 days ago

Full Time | Remote

Build a Safer World.

TRM Labs provides blockchain analytics and AI solutions to help law enforcement and national security agencies, financial institutions, and cryptocurrency businesses detect, investigate, and disrupt crypto-related fraud and financial crime. TRM’s blockchain intelligence and AI platforms include solutions to trace the source and destination of funds, identify illicit activity, build cases, and construct an operating picture of threats. TRM is trusted by leading agencies and businesses worldwide who rely on TRM to enable a safer, more secure world for all.

The AI Engineering Team is chartered with enabling next-generation AI applications, with a special focus on Large Language Models (LLMs) and agentic systems. Our mission is to build robust pipelines, high-performance infrastructure, and operational tooling that allow AI systems to be deployed with speed, safety, and scale. We manage petabyte-scale pipelines, serve models with millisecond-level latency, and provide the observability and governance needed to make AI production-ready. We’re also deeply involved in evaluating and integrating cutting-edge tools in the LLM and agent space, including open-source stacks, vector databases, evaluation frameworks, and orchestration tools that unlock TRM’s ability to innovate faster than the market.

As a Senior or Staff AI Infrastructure Engineer, you’ll be at the core of building and scaling the technical infrastructure for AI/ML systems. You will:
- Build reusable CI/CD workflows for model training, evaluation, and deployment, integrating Langfuse, GitHub Actions, and experiment tracking, etc.
- Automate model versioning, approval workflows, and compliance checks across environments.
- Build out a modular and scalable AI infrastructure stack, including vector databases, feature stores, model registries, and observability tooling.
- Partner with engineering and data science to embed AI models and agents into real-time applications and workflows.
- Continuously evaluate and integrate state-of-the-art AI tools (e.g. LangChain, LlamaIndex, vLLM, MLflow, BentoML, etc.).
- Drive AI reliability and governance, enabling experimentation while ensuring compliance, security, and uptime.
- Build and enhance AI/ML Model Performance
- Ensure data accuracy, consistency and reliability, leading to better model training and inferencing
- Deploy infrastructure to support offline and online evaluation of LLMs and agents, including regression testing, cost monitoring, and human-in-the-loop workflows.
- Enable researchers to iterate quickly by providing sandboxes, dashboards, and reproducible environments.

What We’re Looking For
- Write high-quality, maintainable software, primarily in Python, but we value engineering ability over language familiarity.
- Have a strong background in scalable infrastructure, including:
  - Containerization and orchestration (e.g. Docker, Kubernetes)
  - Infrastructure-as-code and deployment (e.g. Terraform, CI/CD pipelines)
  - Monitoring and logging frameworks (e.g. Datadog, Prometheus, OpenTelemetry)
- Understand and implement ML Ops best practices, including:
  - Model versioning and rollback strategies
  - Automated evaluation and drift detection
  - Scalable model and agent serving infrastructure (e.g. vLLM, Triton, BentoML)
- Deploy and maintain LLM and agentic workflows in production, including:
  - Monitoring cost, latency, and performance
  - Capturing traces for analysis and debugging
  - Optimizing prompt/response flows with real-time data access
- Demonstrate strong ownership and pragmatism, balancing infrastructure elegance with iterative delivery and measurable impact.

Learn about TRM Speed in this position:
- Rapid Issue Resolution. TRM Engineers identify and resolve critical onsite issues in minutes to hours, not weeks. We create virtual war rooms, implement fixes, and share lessons with both customer stakeholders and internal teams within 48 hours.
- Navigating Bureaucracy. We anticipate and address procedural hurdles, build trust with key stakeholders, and find alternative pathways to approvals. This keeps projects moving even in complex environments.
- Efficient Knowledge Transfer. Engineers document and share updates in real time, ensuring the entire team, onsite and remote, has full visibility into plans, blockers, and resolutions. Knowledge sharing sessions and clear documentation reduce friction and accelerate delivery.

About TRM's Engineering Levels:

Engineer: Responsible for helping to define project milestones and executing small decisions independently with the appropriate tradeoffs between simplicity, readability, and performance. Provides mentorship to junior engineers, and enhances operational excellence through tech debt reduction and knowledge sharing.

Senior Engineer: Successfully designs and documents system improvements and features for an OKR/project from the ground up. Consistently delivers efficient and reusable systems, optimizes team throughput with appropriate tradeoffs, mentors team members, and enhances cross-team collaboration through documentation and knowledge sharing.

Staff Engineer: Drives scoping and execution of one or more OKRs/projects that impact multiple teams. Partners with stakeholders to set the team vision and technical roadmaps for one or more products. Is a role model and mentor to the entire engineering organization. Ensures system health and quality with operational reviews, testing strategies, and monitoring rigor.

The following represents the expected range of compensation for this role:
- Individual pay is determined by skills, qualifications, experience, and location. The compensation details listed in this posting reflect the US base salary only.
- The estimated base salary range for this role is $200,000 - $275,000.
- Additionally, this role may be eligible to participate in TRM’s equity plan.
- Please note: we factor in the different costs for geographies outside the United States.

Life at TRM

We are building a safer world. That promise shows up in how we work every day.

TRM moves quickly. We are a high velocity, high ownership team that expects clarity, follow-through, and impact. People who thrive here are energized by hard problems, experimentation, and continuous feedback. If something takes months elsewhere, it will ship here in days.

Our work sits at the intersection of AI, national security, and fighting financial crime. The problems are complex, the stakes are real, and the environment evolves quickly. The pace and intensity of the work reflect the importance of the mission. As a result, the way we operate requires a high level of ownership, adaptability, collaboration, and creative problem-solving.

At TRM, you should expect:
- Priorities and targets to change quickly as we experiment and iterate
- Work that often requires operating with a high degree of ambiguity
- A high level of personal ownership and accountability
- Close collaboration across teams and functions
- Frequent, high-touch communication
- Creative problem solving and out-of-the-box thinking
- A pace that rewards urgency, adaptability, and outcomes

This environment is energizing for people who enjoy building, solving hard problems, and making progress in situations that are not always fully defined. It also requires comfort navigating ambiguity, adjusting course as new information emerges, and maintaining focus and positivity in a fast-moving and intense environment.

We also recognize that this style of operating is not for everyone. If you are primarily optimizing for predictability or a consistently balanced workload, we encourage you to use the interview process to pressure test whether this environment is truly the right fit. We want teammates who thrive here, not just survive here.

At the same time, many people find this work deeply rewarding. If you are excited by meaningful problems, motivated by ambitious goals, and energized by working alongside mission-driven colleagues, there is a good chance you will find TRM to be an exceptional place to grow and contribute.

Learn more: Interviewing at TRM: How We Hire and What Success Looks Like

AI Fluency at TRM

AI fluency is a baseline expectation at TRM. We believe AI meaningfully changes how top performers operate. We expect every team member to use AI to accelerate and reimagine their craft, not just automate surface tasks.

At TRM, AI fluency means you are among the top 10 percent of operators in your function in how you apply AI to:
- Accelerate repeatable workflows
- Structure and solve problems
- Improve output quality
- Increase speed and leverage

You will be evaluated on applied AI fluency during the interview process.

Leadership Principles

We hire and grow against three leadership principles. They’re the standards for how we operate, treat each other, and make decisions.
- Impact-Oriented Trailblazer: We put customers first and move with speed, focus, and adaptability. We treat every plan like an experiment: test, ship, measure, and iterate quickly.
- Master Craftsperson: We care deeply about our craft. We balance speed with high standards, own outcomes end-to-end, and invest in getting better every day.
- Inspiring Colleague: We add clarity and energy, not noise. We bring humility, candor, and a one-team mindset, giving and receiving feedback to make the team stronger.

Join our Mission

At TRM we care deeply about our craft. We are looking for individuals who want their work to matter, who experiment with speed and rigor, and who take pride in building a safer world for billions of people. If you’re excited by TRM’s mission but don’t check every box, we encourage you to apply: we hire for slope, judgment, and the will to learn fast.

TRM is a Series C company with $220M in total funding, backed by Blockchain Capital, Goldman Sachs, Bessemer, Y Combinator, Thoma Bravo, and others. Headquartered in San Francisco, TRM operates as a distributed-first company with hubs in Los Angeles, San Francisco, New York, Washington D.C., London, and Singapore.

Privacy Policy and Additional Information

By submitting your application, you are agreeing to allow TRM to process your personal information in accordance with the TRM Privacy Policy. Our typical hiring cycles for specialized roles span 24 to 36 months. Accordingly, we retain your personal information for up to 36 months to evaluate your application and to consider you for current and future employment opportunities, unless you request earlier deletion or a different retention period is required or permitted by law.

To notify TRM Labs that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

The use of AI tools of any kind (including but not limited to notetakers, interview assistants, and real-time coaching tools such as Otter.ai, Fireflies, Fathom, Cluey, or similar) during TRM interviews is not permitted without prior approval from TRM. TRM uses its own internal tools for note-taking to ensure a consistent and confidential experience for all candidates.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this form.

Recruitment agencies

TRM Labs does not accept unsolicited agency resumes. Please do not forward resumes to TRM employees. TRM Labs is not responsible for any fees related to unsolicited resumes and will not pay fees to any third-party agency or company without a signed agreement.

Learn More: Company Values | Interviewing | FAQs

Blockchain/Web3 AI Observability/Monitoring LLM AI/ML CI/CD GitHub Actions LangChain LlamaIndex MLflow Python Docker/Containers Docker Kubernetes Terraform Datadog Prometheus OpenTelemetry C

United States

$200K - $275K / year
