Nscale is the Hyperscaler engineered for AI.
Principal Site Reliability Engineer - AI Infrastructure Operations
Location
Worldwide
Posted
5 days ago
Salary
$150K - $2,150K / year
Seniority
Lead
Job Description
Nscale
Role Description
At Nscale, our AI Infrastructure Operations team is responsible for the reliability and scalability of one of the most demanding AI platforms in the industry. We value engineers who think in systems, lead through influence, and raise the bar for operational excellence across the organisation.

We’re looking for a Principal Site Reliability Engineer (SRE) to provide technical leadership across our AI Infrastructure Operations domain. This is a senior, highly impactful role focused on setting reliability strategy, designing foundational systems, and driving cross-team improvements at scale. You will operate as a technical authority for reliability, automation, and operational architecture across Nscale’s GPU, network, and control-plane platforms.

- Owning and evolving the long-term reliability strategy for Nscale’s AI and HPC infrastructure
- Designing and leading the development of large-scale control-plane systems, automation frameworks, and operational tooling
- Defining reliability standards, SLO frameworks, and operational best practices used across multiple teams
- Acting as a senior technical escalation point during critical incidents, guiding resolution and ensuring systemic fixes
- Identifying structural reliability risks and driving cross-functional initiatives to address them at the architectural level
- Partnering with Engineering, Network Operations, and Fleet Operations leadership to influence platform design and operational maturity
- Mentoring senior and mid-level engineers, raising the overall quality and effectiveness of SRE practices
- Driving measurable improvements in availability, MTTR, cost efficiency, and operational scalability

Qualifications
- 10+ years of experience in Site Reliability Engineering, Systems Engineering, or Software Engineering roles operating complex, large-scale infrastructure
- Expert-level software engineering skills, with a strong track record of building production-grade automation and systems
- Deep expertise in Linux, networking, and distributed systems design at scale
- Extensive experience debugging and resolving failures across hardware, OS, networking, and application layers
- Proven ability to lead technical initiatives across teams without direct authority
- Strong systems-thinking mindset, with the ability to balance reliability, velocity, and cost

Requirements
- Deep hands-on experience with AI or HPC platforms, including GPUs, high-speed interconnects (InfiniBand/RDMA), and workload schedulers (e.g. SLURM)
- Experience designing observability systems for high-cardinality, high-throughput environments
- Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures
- A history of driving step-change improvements in reliability, scalability, or operational efficiency

Benefits
- Collaborative, supportive, and innovative environment where your contributions spark real impact
- Highly competitive package (base + equity) with reviews every 12 months
- Opportunity to join the fastest-growing tech startup, pushing boundaries and collaborating with brilliant minds
- Dynamic progression plan tailored to your ambitions
- Human-First Flexibility: autonomy to shape your day around life's moments
- Thriving remote-first team with seamless virtual collaboration
More Infrastructure Engineer Jobs
• Lead the administration and evolution of Microsoft Active Directory in a complex enterprise environment.
• Own Exchange and Exchange Hybrid (on-premises and Exchange Online), ensuring reliability, security, and seamless coexistence.
• Design, operate, and maintain Public Key Infrastructure (PKI), including certificate lifecycle management.
• Administer and develop Microsoft 365 / Entra ID identity services, roles, and access models.
• Implement and support ADFS / SSO and federation scenarios for internal and external applications.
• Ensure secure access control, authentication, and authorization across platforms.
• Collaborate with network, platform, and application teams on identity-related integrations.
• Drive continuous improvement of identity and messaging architectures and operational practices.

• Supports the Case Management Modernization (CMM) Program for the U.S. Courts by designing, implementing, and managing secure authentication and authorization frameworks
• Ensures compliance with federal identity governance, FedRAMP, and Zero Trust Architecture (ZTA) principles
• Collaborates with architecture, security, and DevSecOps teams to ensure access control and credential management are integrated across all layers of the CMM application ecosystem
• Designs and maintains the identity architecture utilizing Keycloak
• Implements federated identity and single sign-on (SSO) solutions using modern protocols (SAML, OAuth2.0, OIDC)
• Configures directory services and identity providers (AWS Cognito, AWS IAM Identity Center, Azure AD, etc.)
• Conducts access audits, user entitlement reviews, and anomaly detection to ensure least-privilege compliance

• Craft creative scalable cloud solutions for running millions of jobs, thousands of systems, and petabytes of storage.
• Address exciting challenges in infrastructure such as Kubernetes, job scheduling, multi-region services, resource management, and automated recovery.
• Create agentic workflows for infrastructure.
• Collaborate with customers to understand their needs and develop innovative solutions that cater to their requirements.
Senior Infrastructure Engineer
Definitive Healthcare, US

Definitive Healthcare (NASDAQ: DH) is passionate about turning data, analytics, and expertise into meaningful intelligence that helps our customers achieve success and shape the future of healthcare. We empower them to uncover the right markets, opportunities, and people, paving the way for smarter decisions and greater impact.

- Headquartered just outside of Boston, Massachusetts.
- Operates across North America, Europe, and India.
- Supports a growing global client base of more than 2,400 customers since our founding in 2011.
- Earned multiple workplace honors, including Built In’s 100 Best Places to Work in Boston (2024 and 2025), a Stevie Bronze Award for Great Employers, and recognition as a Great Place to Work in India.
- Fosters a collaborative, inclusive culture where diverse perspectives drive innovation.
Role Description
We are looking for an experienced and versatile Infrastructure Engineer to join our team. This is a broad, hands-on role for someone who is comfortable operating across the modern infrastructure stack, spanning cloud platforms, virtualization, systems administration, automation, and network engineering. This role replaces a traditional network-only function and reflects how we think about infrastructure today: as an interconnected discipline where networking, computing, security, and automation are inseparable. You will be a key contributor to the reliability, scalability, and security of our platforms, and a go-to escalation point for complex infrastructure challenges.

What You'll Do

Cloud & Hybrid Infrastructure
- Support the maintenance of both on-prem and cloud infrastructure across AWS, Azure, or GCP including compute, storage, networking, and identity services.
- Manage and optimize hybrid connectivity between on-premises environments and cloud platforms (VPN, ExpressRoute/Direct Connect, Transit Gateway).
- Govern infrastructure as code (IaC) using tools such as Terraform or Pulumi, ensuring environments are reproducible and version-controlled.

Networking
- Participate in ongoing management of network infrastructure including routers, switches, firewalls, and load balancers.
- Manage and troubleshoot LAN/WAN, SD-WAN, VPN, BGP/OSPF, and VLAN environments.
- Administer DNS, DHCP, and IP address management (IPAM) across hybrid environments.
- Review and enforce network security policies, firewall rules, and segmentation strategies in collaboration with the security team.

Systems & Virtualization
- Administer virtualized environments (VMware vSphere, Hyper-V, or equivalent) and container platforms (Kubernetes, Docker).
- Manage server operating systems at scale across Linux (RHEL/Ubuntu) and Windows Server.

Monitoring, Reliability & Security
- Maintain observability tooling including infrastructure monitoring, alerting, and log aggregation (e.g. Datadog, Prometheus/Grafana, Splunk, or similar).
- Participate in an on-call rotation for infrastructure-level incidents, driving timely resolution and thorough post-incident reviews.
- Contribute to backup, disaster recovery, and business continuity planning and testing.
- Partner with the security team on vulnerability management, patching cadences, and hardening standards.

Automation & Continuous Improvement
- Identify and drive automation opportunities to reduce toil and improve consistency across the infrastructure estate.
- Contribute to capacity planning and infrastructure roadmap discussions.
- Mentor P1/P2 engineers and share knowledge through internal documentation and team sessions.

Qualifications
- 4–7 years of experience in infrastructure, network engineering, or a related discipline.
- Strong networking fundamentals and hands-on experience with enterprise network platforms (Cisco, Juniper, Palo Alto, Fortinet, or similar).
- Proven experience with at least one major cloud platform (AWS, Azure, or GCP), ideally with a cloud associate-level certification or equivalent practical experience.
- Experience managing Linux and Windows Server environments in production.
- Familiarity with containerization concepts and platforms (Docker, Kubernetes).
- Experience with monitoring and observability tooling in a production environment.
- Strong troubleshooting skills across network, compute, and storage layers.
- Comfortable working in an on-call capacity for infrastructure incidents.

Nice to Have
- Working knowledge of infrastructure as code and configuration management tools (Terraform, Ansible, or similar).
- Experience with SD-WAN platforms (e.g. Meraki, Viptela/Cisco SDWAN, VMware VeloCloud).
- Scripting proficiency in Python, Bash, or PowerShell for automation.
- Exposure to DevOps practices and CI/CD pipelines.
- Experience in a zero-trust network architecture project.
- Relevant certifications: CCNA/CCNP, AWS Solutions Architect, Azure Administrator (AZ-104), CKA, or equivalent.

Compensation and Benefits
- The salary range for this position is $115,000 – $173,000 per year, which represents the base pay the company reasonably and in good faith expects to pay for this role.
- Actual pay within this range will be determined based on factors such as relevant experience, skills, and qualifications.
- Depending on the position, employees may also be eligible to participate in a company bonus or commission plan.
- All employees are eligible for a comprehensive benefits package, including medical, dental, and vision coverage, unlimited paid time off, and participation in the company’s 401(k) plan with employer contribution.



