Job Closed

This listing is no longer active.

Secure your enterprise with the autonomous cybersecurity platform. Endpoint. Cloud. Identity. XDR. Now.

Staff AI Infrastructure Engineer

Infrastructure Engineer | Other | Remote | Lead | Team 1,001-5,000 | Since 2013 | H1B Sponsor

Location

United States

Posted

111 days ago

Salary

$170.2K - $234.6K / year

Seniority

Lead

Bachelor Degree | 7 yrs exp | English

AWS Azure GCP Grafana Jenkins Kubernetes Prometheus Python Terraform

Job Description

• Architect, build, and maintain scalable infrastructure to host and serve AI products and models reliably.
• Automate infrastructure deployment and management using Helm, ArgoCD and Terraform.
• Manage and optimize Kubernetes clusters to support high-performance AI workloads.
• Implement and manage CI/CD pipelines utilizing GitHub Actions and Jenkins.
• Ensure infrastructure compliance with security standards including FedRAMP and related guidelines.
• Collaborate closely with AI engineering, product teams, and DevOps to meet infrastructure requirements.
• Monitor infrastructure health and performance, implementing optimizations proactively.
• Drive infrastructure best practices and mentor team members to foster technical excellence.

Job Requirements

A degree in Computer Science, Information Technology, or related field, or equivalent practical experience.
7+ years of experience managing scalable, secure, and resilient infrastructure for AI and machine learning applications.
Deep proficiency with infrastructure-as-code tools like Helm, Terraform and ArgoCD.
Extensive hands-on experience with Kubernetes for deploying containerized workloads.
Demonstrated experience with major cloud platforms (AWS, GCP, Azure), specifically with services related to AI model hosting (e.g., Azure OpenAI).
Experience implementing and managing CI/CD pipelines (GitHub Actions, Jenkins).
Familiarity with compliance frameworks, particularly FedRAMP, and security best practices.
Strong scripting and automation skills using Python, Bash, or similar languages.
Excellent problem-solving skills, creativity, and self-driven motivation.
Previous experience as a Site Reliability Engineer (SRE), particularly in AI or ML contexts.
Experience with monitoring and logging tools (Prometheus, Grafana, Datadog, Jaeger).
Knowledge of networking concepts and security best practices within cloud infrastructure.
Professional certifications in Kubernetes or cloud platforms (AWS, Azure, GCP).

Benefits

Medical, Vision, Dental, 401(k), Commuter, Health and Dependent FSA
Unlimited PTO
Industry-leading gender-neutral parental leave
Paid Company Holidays
Paid Sick Time
Employee stock purchase program
Disability and life insurance
Employee assistance program
Gym membership reimbursement
Cell phone reimbursement
Numerous company-sponsored events, including regular happy hours and team-building events

More Infrastructure Engineer Jobs

Infrastructure Engineer

Learning Technologies Group plc

LTG is a leader in corporate digital learning and talent management.

Infrastructure Engineer | 111 days ago

Full Time | Remote | Team 5,001-10,000 | Since 2013 | H1B No Sponsor

• Combine software and systems engineering to help build and run large-scale, distributed and fault-tolerant systems
• Use automation and Infrastructure as Code (IaC) to continuously improve the reliability, scalability, and performance of services deployed on AWS
• Performance tuning and configuration of both Linux system and application parameters supporting highly concurrent web stacks
• Manage infrastructure through code using configuration management and IaC templating software such as Terraform and Puppet
• Document procedures and knowledge base articles throughout problem resolution and architecture development processes
• Monitor the availability, performance and health of production systems in support of meeting service level objectives using monitoring systems such as Icinga, Prometheus, Grafana, CloudWatch, and Loki
• Participate in emergency incident response on-call rosters
• Practice blameless postmortems that lead to improvements in resiliency and reductions in alert fatigue

Ansible Apache HTTP Server AWS Chef DNS Amazon EC2 Grafana LAMP Linux MariaDB MySQL PHP PostgreSQL Prometheus Puppet Python SMTP SQL Terraform

Colombia

Senior Infrastructure Engineer

EXANTE

Global prime broker backed by proprietary technology and dedicated service.

Infrastructure Engineer | 111 days ago

Other | Remote | Team 501-1,000 | Since 2011 | H1B No Sponsor

• Operate and maintain on-premises infrastructure based on bare-metal Debian Linux servers
• Manage OS-level configuration, networking, and system services
• Automate infrastructure provisioning and lifecycle management using Chef
• Manage infrastructure state via Git repositories following GitOps principles
• Standardize and continuously improve infrastructure deployment workflows
• Manage and operate hybrid connectivity across GCP, AWS, Megaport, and on-prem data centers
• Design and maintain VPC networking, routing, and firewall rules
• Operate hybrid connectivity (Direct Connect, Cloud Interconnect, VPN)
• Configure and support BGP routing between cloud and on-prem environments
• Ensure high availability, redundancy, and fault tolerance of network connectivity
• Troubleshoot network and connectivity issues across cloud and on-prem layers

AWS Chef GCP Grafana HAProxy Kubernetes Linux Nginx Prometheus Puppet Python Ruby SaltStack TCP/IP Terraform VMware

United States

Job Closed

Senior Infrastructure Support Engineer

Nscale

Nscale is the Hyperscaler engineered for AI.

Infrastructure Engineer | 111 days ago

Full Time | Remote | Team 201-500 | Since 2024 | H1B No Sponsor

• You’ll join the Support duty rotation and, as a Senior, will collaborate with Engineering on incidents and changes.
• Proactively improve dashboards, alerts, and runbooks to prevent repeat incidents.
• Contribute to knowledge sharing across Operations and Engineering, including training content, workshops, and PR reviews. Drive to upskill: better the team and yourself.
• Accurately record, update, manage, and resolve tickets using the call tracking system whilst keeping all parties (internal or external) informed of ticket progression via phone and email.
• Demonstrate a solid understanding of the underlying Platform to our customers and help them leverage the service and products.
• Respond to incoming monitoring alerts, resolving or escalating as required in accordance with priorities and agreed service levels.
• Take decisive actions, and calculated risks, on technically complex incidents and tasks to ensure business speed and efficiency.
• Lead by earning trust, speaking candidly, and benchmarking against the best to identify where we can improve.
• Disagree when appropriate and challenge the status quo. Commit wholly to decisions and plans once in motion. Be a technical expert, and drive the team to make the best decisions.
• Deliver project tasks, improvements, and technical assessments to the right quality and in a timely fashion.
• Handle escalated customer support issues, providing solutions aligned with business SLA requirements.
• Design and implement automation scripts and tools to optimize processes.
• Conduct root cause analysis for major incidents and recommend long-term fixes.
• Collaborate with cross-functional teams on service improvements.
• Respond to critical incidents outside business hours, and be on call as required.

Ansible Chef Cloud Firewalls JavaScript Kubernetes Linux OpenStack Puppet Python Terraform

United Kingdom

Lead Cloud Infrastructure Engineer – Azure, AWS

Creative Chaos

Your innovation delivery partner.

Infrastructure Engineer | 111 days ago

Full Time | Remote | Team 201-500 | Since 2000 | H1B No Sponsor

**Job Summary:** We are hiring a Lead Infrastructure & Cloud Engineer with a strong Wintel infrastructure foundation and current, hands-on capability in modern cloud infrastructure across Azure (primary) and AWS. This role exists to close a capability gap: we have deep on-prem expertise, and we need a leader who can define and drive modern cloud standards, guide technical direction, and uplift the team. You’ll operate as a technical lead with an architecture mindset: creating reference designs, setting guardrails, making pragmatic trade-offs (security, resilience, cost), and leading delivery across infrastructure and hybrid cloud. This is not a DevOps role: you will collaborate with DevOps and engineers, but your focus is infrastructure/platform, governance, reliability, and technical leadership.

**Job Responsibilities:**

**Cloud & Hybrid Architecture (Azure & AWS)**

- Own the target-state hybrid cloud architecture and roadmap (12–24 months), aligning security, resilience, and cost requirements.
- Define reference architectures and standards: landing zones, network patterns, identity patterns, logging/monitoring, backup/DR, and environment separation.
- Lead design and implementation of secure cloud networking: VNets/VPCs, routing, VPN, ExpressRoute/Direct Connect, Private Link/Endpoints, load balancers, WAF where needed.
- Own cloud governance foundations: subscriptions/accounts, management groups, RBAC, naming/tagging, logging, budgets and policy guardrails.

**Modern Cloud Operations (Hands-on Leadership)**

- Ensure cloud platforms, services, and workloads remain on supported, secure versions; implement drift detection and lifecycle management.
- Establish platform observability: Azure Monitor/Log Analytics/App Insights, CloudWatch, OpenTelemetry where used; improve alert quality and operational readiness.
- Build and maintain backup/DR posture with tested RTO/RPO, runbooks, and regular restore/DR exercises.
- Drive FinOps discipline: cost allocation, tagging compliance, rightsizing, reservations/savings plans, and cost anomaly detection.

**Security, Governance & Incident Readiness**

- Ensure security controls are in place and effective (least privilege, secure baselines, encryption, key management, vulnerability/patch posture).
- Log & telemetry onboarding: own onboarding of data/log sources and integration with the SIEM (e.g., Microsoft Sentinel/Splunk) in partnership with Security.
- Lead incident response for infrastructure/cloud events: triage, investigation, reporting, RCA, and implementation of preventative controls and guardrails.
- Manage, document, and audit configuration changes; champion “repeatable by design” changes and reduce configuration drift.

**Wintel & Core Infrastructure Leadership**

- Provide technical leadership across core infrastructure services: Windows Server, AD DS, DNS/DHCP, certificates/PKI, and integration with Entra ID.
- Guide virtualisation/storage teams (VMware/Hyper-V, SAN/storage) towards cloud-aligned standards for resilience, security, and lifecycle.

**Leadership and Uplift**

- Act as the technical authority for infrastructure and hybrid cloud; lead technical decisions and drive outcomes.
- Mentor and upskill engineers on modern cloud infrastructure practices; run knowledge sessions and codify standards into reusable patterns.
- Provide input during design and architectural discussions with DevOps and software teams; unblock delivery with clear, pragmatic guidance.

AWS Azure DNS Python Splunk Terraform VMware

Pakistan
