Job Closed
This listing is no longer active.
Head of Network Reliability Engineering
Location
United States
Posted
74 days ago
Salary
$208K - $286K / year
Seniority
Lead
Job Description
Google Fiber
- Lead the Reliability Engineering and Metro Engineering functions, overseeing both the physical expansion of metro networks and the observability systems that support them.
- Own the end-to-end Tier 3 escalation lifecycle, working with NOC and Incident Management teams to drive a blameless engineering culture focused on systemic improvement and data-driven root cause analysis.
- Define the roadmap for Infrastructure-as-Code and GitOps workflows, collaborating with software and network teams to ensure configurations are version-controlled, auditable, and deployed via CI/CD.
- Drive the strategy for closed-loop automation by partnering with software engineering teams to implement systems that leverage real-time streaming telemetry for autonomous fault detection and remediation.
- Champion the elimination of operational toil; work across the organization to automate change verification and routine maintenance, allowing the NRE team to focus on high-value reliability engineering.
Job Requirements
- Bachelor’s in Computer Science, Electrical Engineering, or equivalent practical experience.
- 10 years of experience in network engineering, with direct experience in operations, site reliability, or network reliability.
- Experience in IP networking (BGP, OSPF, MPLS), optical transport, and access networks (PON/Wireless).
- Experience managing high-stakes incidents and designing high-availability systems.
- Experience managing engineering teams and driving cross-functional outcomes.
Benefits
- health insurance
- retirement plans
- paid time off
More DevOps Engineer Jobs
Site Reliability Engineer (SRE)
Radiance Technologies
Radiance Technologies, Inc. is an employee-owned small business prime contractor. Radiance leads the way in developing government and commercial customer-focused solutions. Leveraging its record of technical innovation and operational expertise, Radiance Technologies offers:
- Cyber Solutions
- Systems Engineering
- Technology Development, Production, Testing, and Evaluation
- Technology Application
- Intelligence Community Support
- Government Program Support
The company’s 900+ employees in 15+ U.S. and international offices serve customers in the Department of Defense (DOD), National Aeronautics and Space Administration (NASA), the national intelligence community, the Department of Homeland Security (DHS), other government organizations, and selected commercial customers.
Radiance Technologies continues to attract and retain talented motivated employees by being an employee-owned company – founded with the idea of providing an environment, a benefits package, and a stock ownership plan that are second to none. For more information, visit www.radiancetech.com.
Radiance Technologies, Inc. – Concepts to Capabilities®
Salary Range: $75,000 - $100,000
At Radiance our SREs own the reliability of systems they don't write - defining what "reliable enough" means from the user’s perspective, instrumenting and measuring against those targets, and building the tooling and runbooks that make failure recoverable. They partner with dev teams pushing operational quality upstream before code ships, and they lead the resolution in production when things go wrong. SREs are comfortable debugging distributed systems, resolving incidents, and translating findings into lasting reliability improvements.
Day-to-day responsibilities fall into four categories: Incident Response, Toil Reduction, Reliability Evaluations, Platform Enablement.
Required Qualifications
- 1+ years of experience in Operations, Sys Admin, DevOps, or Software engineering
- Bachelor’s Degree in CS, Computer Engineering, or related technical field
- US Citizenship & must have or be able to obtain a Top Secret Clearance
- Systems thinking – understanding how systems fail together, blast radius, and more
- Observability Fundamentals – not just the 3 signals, but knowing why and how to use telemetry to optimize services and engineering quality of life
- Basic software engineering – building automation & non-trivial APIs, git workflows, effectively engaging in code reviews
- Linux/networking fundamentals
- Strong Communication, Collaboration, and Organizational Skills
Specialty Skills: (1 or more)
- Platform & Infrastructure - Kubernetes, ArgoCD/GitOps, disaster recovery, capacity planning
- Observability - OTel standards, Grafana/Perses, Tempo, Clickhouse, VictoriaMetrics
- Automation & Toil Reduction - scripting, CI/CD, runbook automation, “DevOps”
- Developer Enablement - instrumentation SDKs, SRE practice onboarding
- Data & Alerting - dashboard quality, alert design, anomaly detection
Desired Qualifications
- SRE Certifications from The DevOps Institute, AWS Solution Architect, or similar
- Hands-on experience with: Python, Go, Kubernetes, Argo CD, GitLab/GitHub, Jenkins, Docker, Locust/Gatling, Prometheus, Grafana/Perses
Radiance Technologies is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or protected veteran status.
Site Reliability Engineer
Precisely US Jobs
Precisely is the leader in data integrity. We empower businesses to make more confident decisions based on trusted data through a unique combination of software, data enrichment products and strategic services.
- Focused on delivering outstanding innovation and support that helps customers increase revenue, lower costs and reduce risk
- Powers better decisions for more than 12,000 global organizations, including 95 of the Fortune 100
- 2500 employees unified by four core values: Openness, Determination, Individuality, and Collaboration
- Committed to career development for employees with opportunities for growth, learning, and building community
- "Work from anywhere" culture celebrating diversity in a distributed environment with a presence in 30 countries and 20 offices across 5 continents
Role Description
This position is 100% remote and can be located anywhere in the United States. You help keep our platforms reliable, available, and easy to operate for teams across the company. We rely on you to improve system stability through automation, monitoring, and thoughtful operational practices. You primarily support cloud-based services and reliability initiatives, while also helping maintain a stable network environment when needed. You work closely with others to reduce incidents, improve resilience, and support systems that people depend on every day. You are trusted to balance reliability engineering with network support as part of a broader infrastructure role.
What you will do:
- You improve system reliability through monitoring, automation, and operational improvements.
- You support cloud and platform environments to ensure services remain available and resilient.
- You respond to incidents, help restore service, and reduce the chance of repeat issues.
- You build and maintain monitoring, alerting, and operational tooling.
- You support production changes and infrastructure improvements using established processes.
- You provide secondary support for network systems, ensuring connectivity remains stable.
- You assist with routine network tasks such as maintenance, upgrades, and troubleshooting.
- You support secure connectivity between cloud services, offices, and remote users.
- We rely on you to document systems, changes, and operational practices.
- You are trusted to protect critical services and improve reliability over time.
Qualifications
- 5 years of experience supporting production systems, platforms, or infrastructure.
- Experience supporting reliable systems in a production environment.
- Experience responding to incidents and restoring service.
- Experience working with cloud or virtual environments.
- Ability to automate, monitor, and improve system operations.
- Comfort supporting infrastructure changes and upgrades.
- No travel required.
- Familiarity with network concepts such as Fortinet firewalls, Cisco routing, F5 Load balancing or virtual private connectivity.
- Familiarity with cloud networking or hybrid environments.
- Bonus points for experience with certificates or infrastructure automation.
Company Description
Precisely is the leader in data integrity. We empower businesses to make more confident decisions based on trusted data through a unique combination of software, data enrichment products and strategic services. What does this mean to you? For starters, it means joining a company focused on delivering outstanding innovation and support that helps customers increase revenue, lower costs and reduce risk. In fact, Precisely powers better decisions for more than 12,000 global organizations, including 95 of the Fortune 100. Precisely's 2500 employees are unified by four company core values that are central to who we are and how we operate: Openness, Determination, Individuality, and Collaboration. We are committed to career development for our employees and offer opportunities for growth, learning and building community. With a "work from anywhere" culture, we celebrate diversity in a distributed environment with a presence in 30 countries as well as 20 offices in over 5 continents. Learn more about why it's an exciting time to join Precisely!
About Satsuma
Satsuma is a commerce iPaaS that builds merchant-specific APIs, MCP Servers, and MCP Apps, enabling retailers to connect their full commerce stack once and deploy branded shopping experiences across every AI channel. We work with enterprise retailers and move fast. Our infra has to match.
The role
We're looking for a Senior SRE to own the reliability, scalability, and operational posture of Satsuma's multi-cloud infrastructure. You'll be the person who keeps things running, builds the systems that prevent fires, and makes on-call not terrible. This is an infra-first role. But we're an AI-native company, and we expect you to use AI-assisted development (Claude Code) as a core part of your workflow: writing tooling, automating runbooks, building internal utilities.
What you'll do
- Own infrastructure across AWS, GCP, and Azure environments
- Build and maintain CI/CD pipelines, observability stacks, and incident response workflows
- Define and enforce SLOs/SLIs; lead postmortems
- Author and maintain IaC (Terraform preferred)
- Write internal tooling and automation using AI-assisted development workflows
- Partner closely with engineering on reliability reviews and architecture decisions
- Manage .NET Core deployments in a k8s environment
- Work in a cloud-native environment
- Collaborate with software engineers and leadership on infrastructure management
- Make independent decisions on infrastructure stack




