Nscale

Nscale is the Hyperscaler engineered for AI.

Senior Infrastructure Support Engineer

Infrastructure Engineer | Full Time | Remote | Senior | Team 201-500 | Since 2024 | H1B No Sponsor

Location

United Kingdom

Posted

110 days ago

Salary

0

Seniority

Senior

Job Description

  • You’ll join the Support duty rotation and, as a Senior, will collaborate with Engineering on incidents and changes.
  • Proactively improve dashboards, alerts, and runbooks to prevent repeat incidents.
  • Contribute to knowledge sharing across Operations and Engineering, including training content, workshops, and PR reviews. Drive to upskill: better the team and yourself.
  • Accurately record, update, manage and resolve tickets using the call tracking system whilst keeping all parties (internal or external) informed of each ticket's progress via phone and email.
  • Demonstrate a solid understanding of the underlying Platform to our customers and help them leverage the service and products.
  • Respond to incoming monitoring alerts, resolving or escalating as required in accordance with priorities and agreed service levels.
  • Take decisive actions, and calculated risks, on technically complex incidents and tasks to ensure business speed and efficiency.
  • Lead by earning trust, speaking candidly, and benchmarking against the best to identify where we can improve.
  • Disagree when appropriate and challenge the status quo. Commit wholly to decisions and plans once in motion. Be a technical expert, and drive the team to make the best decisions.
  • Deliver project tasks, improvements, and technical assessments to the right quality in a timely fashion.
  • Handle escalated customer support issues, providing solutions aligned with business SLA requirements.
  • Design and implement automation scripts and tools to optimize processes.
  • Conduct root cause analysis for major incidents and recommend long-term fixes.
  • Collaborate with cross-functional teams on service improvements.
  • Respond to critical incidents outside business hours, and be on call as required.

Job Requirements

  • Ability to adapt to customer-driven demands, such as providing specialist support after core business hours, with availability to travel to Nscale or Customer locations to provide onsite technical expertise and guidance.
  • Disciplined, organised and self-motivated. Able to motivate, support and mentor other team members.
  • Strong leadership principles, with a bias for taking decisive action, working independently, and driving the team and wider organisation to improve.
  • Understanding of how datacentres operate and the core datacentre technologies: Servers, Networks, Storage and Virtualisation, ideally gained through an operational support background.
  • Good organisational and time management skills, with strong interpersonal skills: able to deal effectively with people at all levels, with good written and verbal communication.
  • Linux systems engineering at scale. Strong command over modern Linux distributions, kernel modules, systemd, networking stack, and filesystem tooling. Proven troubleshooting across compute, storage and network layers in production.
  • Kubernetes. Operate and troubleshoot K8s clusters, and understand how physical resources are abstracted up the stack to K8s.
  • GPU platforms (NVIDIA and AMD). Practical experience with GPU drivers and GPU log investigation tools, e.g. nvidia-smi. Performance diagnostics using NCCL on large-scale clusters.
  • Observability and incident response. Build and use alerting stacks and dashboards, interpret metrics and alerts, and drive runbooks to resolution; contribute to SLOs and post‑incident reviews.
  • Strong Networking fundamentals. Solid grasp of L2/L3, routing, BGP, VLANs, VXLAN, firewalls, load balancing. Understanding of high‑performance fabrics (RDMA/NVLink basics) for cluster‑to‑cluster traffic.
  • SRE‑style operations. Write and maintain runbooks, automate diagnostics, and reduce human intervention using scripts or small tools.
  • Automation and Git. Scripting or software skills in Bash, Python, or JavaScript (or equivalent) for operational tooling and integrations, and experience with infrastructure automation tools (Ansible, Puppet, Terraform, Chef).
  • Cloud Infrastructure Administration and Troubleshooting. Strong familiarity with virtualisation technologies and with investigating the issues that arise, including deep-dive investigation for root cause analysis. OpenStack operations experience preferred.

Benefits

  • Highly competitive package (base + equity) with reviews every 12 months. 🚀
  • Join the fastest-growing tech startup: your chance to push boundaries, collaborate with brilliant minds, and make your mark on cutting-edge AI. ✨
  • Expect a dynamic progression plan tailored to your ambitions. Grow by trying new things, leading, challenging the status quo, and owning your impact, always with our full support.
  • Human-First Flexibility: We treat you as humans first. 🫶🏽 Our flexible workplace trusts Nscalers to deliver, giving you the autonomy to shape your day around life's moments.

More Infrastructure Engineer Jobs

Full Time | Remote | Team 201-500 | Since 2000 | H1B No Sponsor

**Job Summary:** We are hiring a Lead Infrastructure & Cloud Engineer with a strong Wintel infrastructure foundation and current, hands-on capability in modern cloud infrastructure across Azure (primary) and AWS. This role exists to close a capability gap: we have deep on-prem expertise, and we need a leader who can define and drive modern cloud standards, guide technical direction, and uplift the team. You’ll operate as a technical lead with an architecture mindset: creating reference designs, setting guardrails, making pragmatic trade-offs (security, resilience, cost), and leading delivery across infrastructure and hybrid cloud. This is not a DevOps role: you will collaborate with DevOps and engineers, but your focus is infrastructure/platform, governance, reliability, and technical leadership.

**Job Responsibilities:**

**Cloud & Hybrid Architecture (Azure & AWS)**

- Own the target-state hybrid cloud architecture and roadmap (12–24 months), aligning security, resilience, and cost requirements.
- Define reference architectures and standards: landing zones, network patterns, identity patterns, logging/monitoring, backup/DR, and environment separation.
- Lead design and implementation of secure cloud networking: VNets/VPCs, routing, VPN, ExpressRoute/Direct Connect, Private Link/Endpoints, load balancers, WAF where needed.
- Own cloud governance foundations: subscriptions/accounts, management groups, RBAC, naming/tagging, logging, budgets and policy guardrails.

**Modern Cloud Operations (Hands-on Leadership)**

- Ensure cloud platforms, services, and workloads remain on supported, secure versions; implement drift detection and lifecycle management.
- Establish platform observability: Azure Monitor/Log Analytics/App Insights, CloudWatch, OpenTelemetry where used; improve alert quality and operational readiness.
- Build and maintain backup/DR posture with tested RTO/RPO, runbooks, and regular restore/DR exercises.
- Drive FinOps discipline: cost allocation, tagging compliance, rightsizing, reservations/savings plans, and cost anomaly detection.

**Security, Governance & Incident Readiness**

- Ensure security controls are in place and effective (least privilege, secure baselines, encryption, key management, vulnerability/patch posture).
- Log & telemetry onboarding: own onboarding of data/log sources and integration with the SIEM (e.g., Microsoft Sentinel/Splunk) in partnership with Security.
- Lead incident response for infrastructure/cloud events: triage, investigation, reporting, RCA, and implementation of preventative controls and guardrails.
- Manage, document, and audit configuration changes; champion “repeatable by design” changes and reduce configuration drift.

**Wintel & Core Infrastructure Leadership**

- Provide technical leadership across core infrastructure services: Windows Server, AD DS, DNS/DHCP, certificates/PKI, and integration with Entra ID.
- Guide virtualisation/storage teams (VMware/Hyper-V, SAN/storage) towards cloud-aligned standards for resilience, security, and lifecycle.

**Leadership and Uplift**

- Act as the technical authority for infrastructure and hybrid cloud; lead technical decisions and drive outcomes.
- Mentor and upskill engineers on modern cloud infrastructure practices; run knowledge sessions and codify standards into reusable patterns.
- Provide input during design and architectural discussions with DevOps and software teams; unblock delivery with clear, pragmatic guidance.

Pakistan

Staff ML Infrastructure Engineer

Playlab

Build, remix and share AI-powered educational tools.

Other | Remote | Team 1-10 | Since 2023 | H1B No Sponsor

  • Build data pipelines that scrub PII, create research datasets, and power the research portal for educational AI studies
  • Architect the path toward self-hosted and on-device model deployments for privacy and global accessibility
  • Design and implement model orchestration systems that intelligently route requests across multiple AI providers (OpenAI, Anthropic, AWS Bedrock, open-source models)
  • Build cost optimization infrastructure: implement conversation compression, prompt caching, and smart model selection to keep AI accessible
  • Create comprehensive observability systems for ML operations: track costs, latency, quality, and usage patterns across thousands of applications
  • Design and implement infrastructure for fine-tuning and deploying custom models
  • Build monitoring and alerting systems that help us maintain reliability as AI interactions scale

United States
$180K - $240K / year
Job Closed
Other | Remote | Team 51-200 | Since 1988

  • Implement and support infrastructure technologies such as Microsoft Azure, VMware and networking technologies
  • Execute migrations of on-premises platforms to cloud infrastructure
  • Manage enterprise support requests from clients subscribing to Kraft Kennedy’s enterprise managed services
  • Execute planned evening and weekend maintenance tasks in support of Kraft Kennedy’s enterprise managed services clients, when necessary
  • Participate in weekly on-call rotation for evening and weekend support assistance, as requested by enterprise managed services clients
  • Escalate to internal and, when necessary, external resources in an appropriate time frame to manage the resolution of complex client issues
  • Provide on-site support, as necessary

All locations: Connecticut | District of Columbia | Florida | Illinois | Kentucky | New York | North Carolina | Ohio | Maryland | Massachusetts | Pennsylvania | South Carolina | Tennessee | Texas | Utah | Vermont | Virginia | Washington
$85K - $140K / year

IT Infrastructure Support Engineer

Apogee Global RMS

Taking People, Process and Technology to the Next Level

Other | Remote | Team 1-10 | Since 2018 | H1B No Sponsor

- IT Support & Helpdesk
  - Provide Tier 1–2 technical support for desktops, laptops, printers, and mobile devices
  - Troubleshoot hardware, software, OS, and application issues
  - Set up, configure, and maintain user accounts, email, and access permissions
  - Respond to tickets, document issues, and ensure timely resolution
  - Support onboarding/offboarding of employees (devices, accounts, access)
- Systems Administration
  - Install, configure, and maintain Windows/Linux servers and workstations
  - Manage Active Directory, user/group policies, and permissions
  - Monitor system performance, backups, patches, and updates
  - Maintain virtualization environments (VMware/Hyper-V or similar)
  - Ensure security best practices, antivirus, patching, and access controls
  - Document systems, procedures, and configurations
- Network Administration
  - Configure and maintain LAN/WAN, switches, routers, firewalls, and Wi-Fi
  - Monitor network performance, uptime, and security
  - Troubleshoot network connectivity and performance issues
  - Manage VPNs, DNS, DHCP, and basic firewall rules
  - Assist with network upgrades, expansions, and improvements

California