Job Closed

This listing is no longer active.

SentinelOne

Secure your enterprise with the autonomous cybersecurity platform. Endpoint. Cloud. Identity. XDR. Now.

Staff AI Infrastructure Engineer

Infrastructure Engineer | Other | Remote | Lead | Team 1,001-5,000 | Since 2013 | H1B Sponsor

Location

United States

Posted

111 days ago

Salary

$170.2K - $234.6K / year

Seniority

Lead

Job Description

  • Architect, build, and maintain scalable infrastructure to host and serve AI products and models reliably.
  • Automate infrastructure deployment and management using Helm, ArgoCD and Terraform.
  • Manage and optimize Kubernetes clusters to support high-performance AI workloads.
  • Implement and manage CI/CD pipelines utilizing GitHub Actions and Jenkins.
  • Ensure infrastructure compliance with security standards including FedRAMP and related guidelines.
  • Collaborate closely with AI engineering, product teams, and DevOps to meet infrastructure requirements.
  • Monitor infrastructure health and performance, implementing optimizations proactively.
  • Drive infrastructure best practices and mentor team members to foster technical excellence.

Job Requirements

  • A degree in Computer Science, Information Technology, or related field, or equivalent practical experience.
  • 7+ years of experience managing scalable, secure, and resilient infrastructure for AI and machine learning applications.
  • Deep proficiency with infrastructure-as-code tools like Helm, Terraform and ArgoCD.
  • Extensive hands-on experience with Kubernetes for deploying containerized workloads.
  • Demonstrated experience with major cloud platforms (AWS, GCP, Azure), specifically with services related to AI model hosting (e.g., Azure OpenAI).
  • Experience implementing and managing CI/CD pipelines (GitHub Actions, Jenkins).
  • Familiarity with compliance frameworks, particularly FedRAMP, and security best practices.
  • Strong scripting and automation skills using Python, Bash, or similar languages.
  • Excellent problem-solving skills, creativity, and self-driven motivation.
  • Previous experience as a Site Reliability Engineer (SRE), particularly in AI or ML contexts.
  • Experience with monitoring and logging tools (Prometheus, Grafana, Datadog, Jaeger).
  • Knowledge of networking concepts and security best practices within cloud infrastructure.
  • Professional certifications in Kubernetes or cloud platforms (AWS, Azure, GCP).

Benefits

  • Medical, Vision, Dental, 401(k), Commuter, Health and Dependent FSA
  • Unlimited PTO
  • Industry-leading gender-neutral parental leave
  • Paid Company Holidays
  • Paid Sick Time
  • Employee stock purchase program
  • Disability and life insurance
  • Employee assistance program
  • Gym membership reimbursement
  • Cell phone reimbursement
  • Numerous company-sponsored events, including regular happy hours and team-building events

More Infrastructure Engineer Jobs

Infrastructure Engineer

Learning Technologies Group plc

LTG is a leader in corporate digital learning and talent management.

Full Time | Remote | Team 5,001-10,000 | Since 2013 | H1B No Sponsor

  • Combine software and systems engineering to help build and run large-scale, distributed and fault-tolerant systems
  • Use automation and Infrastructure as Code (IaC) to continuously improve the reliability, scalability, and performance of services deployed on AWS
  • Tune performance and configure both Linux system and application parameters supporting highly concurrent web stacks
  • Manage infrastructure through code using configuration management and IaC templating software such as Terraform and Puppet
  • Document procedures and knowledge base articles throughout problem resolution and architecture development processes
  • Monitor the availability, performance and health of production systems in support of meeting service level objectives using monitoring systems such as Icinga, Prometheus, Grafana, CloudWatch, and Loki
  • Participate in emergency incident response on-call rosters
  • Practice blameless postmortems that lead to improvements in resiliency and reductions in alert fatigue

Colombia

Senior Infrastructure Engineer

EXANTE

Global prime broker backed by proprietary technology and dedicated service.

Other | Remote | Team 501-1,000 | Since 2011 | H1B No Sponsor

  • Operate and maintain on-premises infrastructure based on bare-metal Debian Linux servers
  • Manage OS-level configuration, networking, and system services
  • Automate infrastructure provisioning and lifecycle management using Chef
  • Manage infrastructure state via Git repositories following GitOps principles
  • Standardize and continuously improve infrastructure deployment workflows
  • Manage and operate hybrid connectivity across GCP, AWS, Megaport, and on-prem data centers
  • Design and maintain VPC networking, routing, and firewall rules
  • Operate hybrid connectivity (Direct Connect, Cloud Interconnect, VPN)
  • Configure and support BGP routing between cloud and on-prem environments
  • Ensure high availability, redundancy, and fault tolerance of network connectivity
  • Troubleshoot network and connectivity issues across cloud and on-prem layers

United States
Job Closed

Senior Infrastructure Support Engineer

Nscale

Nscale is the Hyperscaler engineered for AI.

Full Time | Remote | Team 201-500 | Since 2024 | H1B No Sponsor

  • You’ll join the Support duty rotation and, as a Senior, will collaborate with Engineering on incidents and changes.
  • Proactively improve dashboards, alerts, and runbooks to prevent repeat incidents.
  • Contribute to knowledge sharing across Operations and Engineering, including training content, workshops, and PR reviews. Drive upskilling to better the team and yourself.
  • Accurately record, update, manage and resolve tickets using the call tracking system while keeping all parties (internal or external) informed of each ticket's progress via phone and email.
  • Demonstrate a solid understanding of the underlying Platform to our customers and help them leverage the service and products.
  • Respond to incoming monitoring alerts, resolving or escalating as required in accordance with priorities and agreed service levels.
  • Take decisive actions, and calculated risks, on technically complex incidents and tasks to ensure business speed and efficiency.
  • Lead by earning trust, speaking candidly, and benchmarking against the best to identify where we can improve.
  • Disagree when appropriate and challenge the status quo. Commit wholly to decisions and plans once in motion. Be a technical expert, and drive the team to make the best decisions.
  • Deliver project tasks, improvements, and technical assessments at the right quality and in a timely fashion.
  • Handle escalated customer support issues, providing solutions aligned with business SLA requirements.
  • Design and implement automation scripts and tools to optimize processes.
  • Conduct root cause analysis for major incidents and recommend long-term fixes.
  • Collaborate with cross-functional teams for service improvements.
  • Respond to critical incidents outside business hours, and be on-call as required.

United Kingdom
Full Time | Remote | Team 201-500 | Since 2000 | H1B No Sponsor

**Job Summary:**

We are hiring a Lead Infrastructure & Cloud Engineer with a strong Wintel infrastructure foundation and current, hands-on capability in modern cloud infrastructure across Azure (primary) and AWS. This role exists to close a capability gap: we have deep on-prem expertise, and we need a leader who can define and drive modern cloud standards, guide technical direction, and uplift the team.

You’ll operate as a technical lead with an architecture mindset: creating reference designs, setting guardrails, making pragmatic trade-offs (security, resilience, cost), and leading delivery across infrastructure and hybrid cloud. This is not a DevOps role: you will collaborate with DevOps and engineers, but your focus is infrastructure/platform, governance, reliability, and technical leadership.

**Job Responsibilities:**

**Cloud & Hybrid Architecture (Azure & AWS)**

- Own the target-state hybrid cloud architecture and roadmap (12–24 months), aligning security, resilience, and cost requirements.
- Define reference architectures and standards: landing zones, network patterns, identity patterns, logging/monitoring, backup/DR, and environment separation.
- Lead design and implementation of secure cloud networking: VNets/VPCs, routing, VPN, ExpressRoute/Direct Connect, Private Link/Endpoints, load balancers, WAF where needed.
- Own cloud governance foundations: subscriptions/accounts, management groups, RBAC, naming/tagging, logging, budgets and policy guardrails.

**Modern Cloud Operations (Hands-on Leadership)**

- Ensure cloud platforms, services, and workloads remain on supported, secure versions; implement drift detection and lifecycle management.
- Establish platform observability: Azure Monitor/Log Analytics/App Insights, CloudWatch, OpenTelemetry where used; improve alert quality and operational readiness.
- Build and maintain backup/DR posture with tested RTO/RPO, runbooks, and regular restore/DR exercises.
- Drive FinOps discipline: cost allocation, tagging compliance, rightsizing, reservations/savings plans, and cost anomaly detection.

**Security, Governance & Incident Readiness**

- Ensure security controls are in place and effective (least privilege, secure baselines, encryption, key management, vulnerability/patch posture).
- Log & telemetry onboarding: own onboarding of data/log sources and integration with the SIEM (e.g., Microsoft Sentinel/Splunk) in partnership with Security.
- Lead incident response for infrastructure/cloud events: triage, investigation, reporting, RCA, and implementation of preventative controls and guardrails.
- Manage, document, and audit configuration changes; champion “repeatable by design” changes and reduce configuration drift.

**Wintel & Core Infrastructure Leadership**

- Provide technical leadership across core infrastructure services: Windows Server, AD DS, DNS/DHCP, certificates/PKI, and integration with Entra ID.
- Guide virtualisation/storage teams (VMware/Hyper-V, SAN/storage) towards cloud-aligned standards for resilience, security, and lifecycle.

**Leadership and Uplift**

- Act as the technical authority for infrastructure and hybrid cloud; lead technical decisions and drive outcomes.
- Mentor and upskill engineers on modern cloud infrastructure practices; run knowledge sessions and codify standards into reusable patterns.
- Provide input during design and architectural discussions with DevOps and software teams; unblock delivery with clear, pragmatic guidance.

Pakistan