Site Reliability Engineer
Location
United Kingdom
Posted
74 days ago
Salary
0
Seniority
Senior
Job Description
Site Reliability Engineer
Intermedia Cloud Communications
- Build and operate metrics/monitoring platforms: **Prometheus and/or VictoriaMetrics** (scrape configs, exporters, recording rules)
- Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
- Integrate monitoring/alerting and events with **BigPanda** (correlation, enrichment, routing, incident workflows)
- Create and maintain dashboards and operational visibility (Grafana or equivalent)
- Develop and maintain runbooks, operational playbooks, and incident response procedures
- Participate in **on-call shifts**: triage alerts, manage incidents, coordinate response, and lead communication during outages
- Perform root-cause analysis, postmortems, and implement corrective/preventive actions
- Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
- Support monitoring for core infrastructure and services on **Windows and Linux**, including HA components and clusters
- Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)
Job Requirements
- Experience in **SRE / Operations / DevOps** with production incident ownership
- Hands-on experience with **Prometheus and/or VictoriaMetrics** (exporters, alert rules, recording rules, troubleshooting)
- Experience integrating alerting/event pipelines with **BigPanda** (or similar event correlation tools)
- Strong troubleshooting skills across **Linux and Windows** systems (networking, OS, services)
- Ability to build reliable alerting with minimal noise (correlation, grouping, suppression, maintenance windows)
- Experience with Git-based workflows for monitoring-as-code and configuration management
**Nice to have**
- Grafana administration and dashboard design standards
- Log management (ELK/EFK, Loki) and/or tracing (OpenTelemetry)
- Automation skills (Python, PowerShell, Bash) and configuration tools (Ansible)
- Messaging/cache/proxy operations: **RabbitMQ**, **Redis**, **Nginx**
- Experience with Windows clustering or HA environments
- Experience defining SLOs/SLIs and operational KPIs
- Experience in managing VoIP components and protocols (SIP, FreeSWITCH, OpenSIPS, session border controllers)
- Experience with load balancing components (F5 LTM, F5 GTM)
- Experience with virtualization platforms such as VMware or Hyper-V
- Experience with administering AWS or Azure tenants
Benefits
- We hire, promote, and compensate employees based on their ability to perform their job responsibilities, without regard to race, color, creed, religion, sex, gender, marital status, national origin, ancestry, age, citizenship, physical or mental disability, sexual orientation, or any other basis protected by applicable law (collectively referred to in our Code of Conduct as “Protected Classes”). We do not tolerate employment discrimination in the workplace, and we are committed to making reasonable accommodations for identified disabilities or other limitations as required by all applicable laws. We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.
More DevOps Engineer Jobs
DevOps Engineer
Happy Returns LLC
Happy Returns is committed to providing a workplace free of discrimination, harassment, and retaliation. Happy Returns is an equal opportunity employer. Happy Returns does not discriminate on the basis of race/color/religion/sex/national origin/veteran/disability/age/sexual orientation/gender identity or any other characteristic protected by law.
Role Description
We’re not solving a small problem, and we’re not addressing a small market. We’re tackling returns, the part of the online shopping experience shoppers say they hate most. Our customers, i.e., top online retailers, use Happy Returns’ returns software and reverse logistics to offer shoppers a genuinely delightful return experience while at the same time reducing costs, retaining sales, and making their supply chains more sustainable. We’re making returns legitimately better for everyone, and we’re having fun doing it.
- Collaborate on and deploy cloud infrastructure on AWS using infrastructure-as-code (Terraform, Pulumi) that is secure, scalable, and highly available.
- Actively collaborate with software engineering to define infrastructure, build, release and deployment tooling.
- Collaborate with information security for requirements and integration with security and compliance tooling.
- Troubleshoot problems across a wide array of services and function areas.
- Build and maintain operational tools for deployment, monitoring, and analysis of AWS infrastructure and systems to ensure availability, performance, security, and scalability.
- Provide recommendations for architecture and process improvements.
- Help us build and maintain a world class technology system so we can achieve our mission of making returns beautiful for Shoppers, Retailers, and the Planet.
Qualifications
- At least 1 year of experience in designing, provisioning and maintaining infrastructure using Infrastructure as Code.
- Experience in code development in at least one high-level programming language.
- Knowledge of Kubernetes and containerized applications.
- Proficiency working with a cloud provider (preferably AWS).
- Proficiency with Docker, Git and software development processes for deploying applications.
- Familiarity with using GitHub Actions (or similar) to build, test and release software.
- Familiarity with database technology such as PostgreSQL or MySQL.
- Previous experience using cloud observability platforms such as Datadog or New Relic.
- A strong interest in joining a highly collaborative environment and working daily with Product Managers and Engineers.
- Interest in e-commerce, logistics and/or sustainability.
- Comfort with leveraging the latest available AI-powered tools (IDEs, agents, automated reviewers) to accelerate and assist with daily development, debugging, and technical tasks.
Requirements
- This is a fully remote position and is not limited to candidates located in Georgia.
EEO Statement
Integrated into our shared values is Happy Returns’ commitment to diversity. Happy Returns is committed to being a globally inclusive company where all people are treated fairly, recognized for their individuality, promoted based on performance and encouraged to strive to reach their full potential. We believe in understanding and respecting differences among all people. This concept encompasses but is not limited to human differences with regard to race, ethnicity, religion, gender, culture and physical ability. Every individual at Happy Returns has an ongoing responsibility to respect and support a globally diverse environment.
Visa Sponsorship
We do not offer visa sponsorship or transfer for this role.
Statement to Third Party Agencies
To all recruitment agencies: Happy Returns only accepts resumes from agencies on the UPS preferred supplier list. Happy Returns is not responsible for any fees or charges associated with unsolicited resumes.
- Define a unified vision for observability across all platforms, with golden signals as the foundation for monitoring and alerting
- Develop and maintain a comprehensive roadmap to improve observability, reduce tool redundancy, and standardize practices across platforms
- Establish and track key performance indicators (KPIs) to measure progress and ensure accountability for roadmap milestones
- Partner with the ZEIT SRE team and engineering leads to break down silos and promote consistent observability practices
- Standardize the implementation of golden signals across applications to improve system reliability and incident detection
- Identify and address gaps in existing observability practices, prioritizing long-term scalability and reliability
- Measure and report on observability success metrics, including actionable alert volume and reduced issue escalations

- Writing Ansible and Terraform to expand and automate a large Elastic Stack implementation
- Scripting in Python or Ruby to automate tool integration and processes
- Automating the development of security controls, including firewall rules and policy and IPS policy.
- Automating new integration with the Elastic Stack SOAR automation
- Developing custom enhancements to COTS tools to improve their functionality and enrich data.
- Automating server configuration for security, including logging, key changes, and system hardening.
- Automate and enhance CI/CD pipelines and environments.
- Automating the implementation of security controls in Amazon Web Services (AWS) via the AWS API.
- Build out auto-provisioning and auto-scaling of security infrastructure.
- Developing security enhancements to improve the security posture of our Government clients.
- Building blue team defenses to detect and block the adversary.

- Lead the Reliability Engineering and Metro Engineering functions, overseeing both the physical expansion of metro networks and the observability systems that support them.
- Own the end-to-end Tier 3 escalation lifecycle, working with NOC and Incident Management teams to drive a blameless engineering culture focused on systemic improvement and data-driven root cause analysis.
- Define the roadmap for Infrastructure-as-Code and GitOps workflows, collaborating with software and network teams to ensure configurations are version-controlled, auditable, and deployed via CI/CD.
- Drive the strategy for closed-loop automation by partnering with software engineering teams to implement systems that leverage real-time streaming telemetry for autonomous fault detection and remediation.
- Champion the elimination of operational toil; work across the organization to automate change verification and routine maintenance, allowing the NRE team to focus on high-value reliability engineering.



