Arbor MIS helps schools and MATs work more easily and collaboratively. Join a free webinar: http://bit.ly/Arbor-webinars
Site Reliability Engineer
Location
United Kingdom
Posted
20 days ago
Salary
£60K - £70K / year
Seniority
Senior
Job Description
Site Reliability Engineer
Arbor Education
• Proactively monitor and analyse platform performance.
• Collaborate with engineering teams to address performance bottlenecks and ensure scalability.
• Assist engineering teams with implementing and reviewing SLOs.
• Continually improve observability through monitoring, alerting and dashboards, using tools such as DataDog or Prometheus.
• Work with other teams to ensure observability is effective and provides full coverage.
• Ensure the service is highly available and resilient.
• Champion best practices in design for high availability.
• Devise runbooks and run game sessions to test our DR plan, H/A and backups.
• Conduct capacity assessments and plan for scaling to meet current and future business needs.
• Work closely with the Head of Platform Engineering and Head of SRE to strategise and implement scalable solutions.
• Work closely with the Platform team, feature teams, 2nd line support and other stakeholders to ensure a good level of service for our customers and to embed SRE practices.
• Act as a key player in incident response and troubleshooting, ensuring rapid resolution and minimising downtime.
• Participate in blameless postmortems to identify root causes and corrective actions.
• Develop and maintain playbooks and documentation.
Job Requirements
- Experience in performance monitoring and analysis
- Capacity planning experience
- Scripting and automation skills, with experience in relevant technologies.
- Experience with Infrastructure as Code, in particular, Terraform
- Understanding of relational database technologies and their cloud versions (e.g. AWS Aurora)
- Experience with messaging and distributed asynchronous workloads
- Experience with nginx or similar technologies
- Familiarity with SRE processes.
- Awareness of DevOps principles such as the Three Ways and the Five Ideals.
Bonus Skills
- Experience with other database technologies and cloud platforms.
- Past experience with Enterprise solutions running at scale
- Familiarity with Kanban and Agile development processes
- Experience with containerisation, for example Docker
- Familiarity with software best practices such as Refactoring, Clean Code, Domain-Driven Design and Test-Driven Development.
Benefits
- A dedicated wellbeing team who champion initiatives such as mindfulness, lunch-and-learns, manager training, mental health first aid training and much more!
- 32 days holiday (plus Bank Holidays). This is made up of 25 days annual leave plus 7 extra company-wide days given over Easter, Summer & Christmas
- Life Assurance paid out at 3x annual salary
- Comprehensive wellness benefit provided by AIG Smart Health, which provides a 24/7 virtual GP service, Mental health support, Counselling, and personalised Health Checks
- Private Dental Insurance with Bupa
- Salary sacrifice Pension provided by Scottish Widows
- Enhanced maternity and adoption leave (20 weeks full pay) and paternity leave (6 weeks full pay)
- 5 free return to work maternity coaching sessions, helping you adapt to this new exciting time of life!
- Access to services such as Calm and Bippit (financial wellbeing coaching)
- All of our roles champion flexible working and we are happy to discuss what this means to you
- Social committees that plan team, office and company-wide events to bring people together and celebrate success
- Dedicated professional development training budget (CPD courses, upskilling resources, professional memberships etc)
- Volunteer with a charity of your choice for a day each year
- Dog friendly offices!
More DevOps Engineer Jobs
• Work primarily with on‑premise infrastructure (bare metal and VMs): setup, maintenance, troubleshooting
• Drive clarity in ambiguous situations by defining requirements, assumptions, and next steps
• Own automation projects end‑to‑end (design → rollout → maintenance)
• Improve how we operate: harden and tune systems and also improve the way the team works in terms of operational hygiene
• Keep the platform stable, fast, and secure: servers, web servers, databases, queues
• Investigate production incidents across OS / networking / infrastructure layers, apply temporary mitigations, coordinate with developers and participate in post‑mortems
• Participate in on‑call rotations
• Use AI in all aspects of day‑to‑day work: researching, troubleshooting, developing
Site Reliability Engineer
Qlik
Founded in 1993, Qlik is an award-winning, market-leading software company that specializes in business intelligence technology. Qlik provides tools that make d
Role Description
Join our dynamic team at Qlik as a Site Reliability Engineer, where you'll play a crucial role in ensuring the security, stability, and scalability of our Qlik and Talend Cloud services. This exciting role offers hands-on experience with cutting-edge technology and scale challenges as we expand to support millions of transactions across our cloud environment.
- Exciting Challenges: Maintain the reliability and availability of our cloud platforms, tackling complex problems and driving improvements to enhance performance and scalability.
- Collaborative Environment: Work closely with our Engineering organization, collaborating with Architecture, Platforms, and Domains teams to design and develop new infrastructure features and optimize cloud-related practices.
- Innovative Solutions: Design and develop effective tooling, alerts, and responses to identify and address reliability risks, utilizing your expertise in cloud technology and backend systems.
- Professional Growth: Act as a resource for fellow engineers, sharing your knowledge and expertise in cloud engineering, production service operations, incident management, and troubleshooting.
- Continuous Learning: Stay updated on the latest industry trends and technologies, contributing to the adoption of best practices and driving continuous improvement within our cloud environment.
Here’s how you’ll be making an impact:
- Reliability and Scalability: Ensure high reliability and availability of our cloud platforms, collaborating with cross-functional teams to implement new infrastructure features and optimize performance.
- Cloud Optimization: Define and evangelize cloud-related optimizations and best practices, driving improvements in reliability, scalability, and performance.
- Problem Solving: Analyze complex issues at the infrastructure, systems, network, and application levels, making recommendations and decisions to resolve them effectively.
- Knowledge Sharing: Share your expertise with fellow engineers, providing guidance on cloud technologies, automation, security, and best practices.
- On-Call Support: Participate in on-call duties to maintain the availability and performance of our cloud infrastructure, providing regular updates on project status and activities.
Qualifications
- Bachelor’s or master’s degree in computer science or a relevant field.
- Self-motivated with the ability to work autonomously and multitask effectively.
- Strong analytical skills for solving complex problems and driving innovative solutions.
- 3+ years’ experience with Infrastructure as Code (IaC) tools such as Terraform, Crossplane, Ansible, or similar.
- 3+ years’ experience working alongside a production system running on Kubernetes.
- 3+ years of professional experience in cloud engineering, preferably with AWS and Azure.
- 3+ years of professional experience with operating and/or building microservices.
- Proficiency in scripting and automation (e.g., Bash, Python, Go) and software engineering concepts.
- Experience with CI/CD, cloud and microservice autoscaling.
- Experience with networking security and secret-management tools (e.g. Vault, AWS SSM).
- Proficiency with observability tooling such as Prometheus, OpenTelemetry, distributed tracing, and SIEM such as Splunk.
- Experience with Helm, including but not limited to managing Helm charts, creating custom charts from existing ones, and building new ones.
- Excellent English communication skills, both oral and written.
- Curiosity and desire to learn.
- Ability to take a rotating on-call shift.
- Knowledge of infrastructure security review and compliance frameworks.
- Experience working with database concepts and tooling such as MongoDB.
- Demonstrated ability to collaborate with development teams and provide expert guidance on implementing reliability best practices, ensuring systems are robust, scalable, and highly available.
- Where applicable, experience with or interest in learning other tools such as Temporal, ClickHouse, FireHydrant, Grafana, Solace, Gloo, Istio, and other cloud-native tools.
- Certifications such as CKD, CKS, AWS Certified Solutions Architect Associate/Professional, AWS Certified Advanced Networking Specialty, AWS Certified Security Specialty, all considered assets.
- Ability to obtain sufficient clearance status to work on IL5 systems with Qlik support.
Benefits
- Named in Newsweek’s ‘America’s Greatest Workplaces 2025’.
- Genuine career progression pathways and mentoring programs.
- Culture of innovation, technology, collaboration, and openness.
- Flexible, diverse, and international work environment.
- Participation in Corporate Responsibility Employee Programs.
- Comprehensive benefits, including medical, dental, and vision coverage, life and AD&D, short and long-term disability coverage, paid time off, paid parental/maternity leave, participation in a 401(k) program that includes company match, and many other additional voluntary benefits.
Senior ServiceDesk Reliability Engineer – SDRE
Tabby
On a mission to create financial freedom. No interest. No fees. Shariah-Compliant.
• Tabby creates financial freedom in the way people shop, earn and save by reshaping their relationship with money.
• The company’s flagship offering allows shoppers to split their payments online and in-store with no interest or fees.
• Tabby generates over $10 billion in annual transaction volume for its partner brands and is the highest-rated, most-reviewed, largest, and fastest-growing FinTech in the GCC region.
• Tabby launched in 2019 and has since raised +$1 billion in equity and debt funding from global and regional investors, and is now valued at $4.5 billion.
Cloud DevSecOps Engineer – III
Regions Bank
Do what is right. Put people first. Reach higher. Focus on your customer. Enjoy life.
• Partners with other engineers and information technology staff to orchestrate code builds, quality and security analyses, deployments, and automated testing through CI/CD release candidacy pipelines
• Articulates business needs and translates them into technology solutions
• Models release candidate CI/CD pipelines as a mechanism to communicate the states and steps necessary to determine a release candidate for each application and service
• Designs and develops fully autonomous CI/CD pipelines that facilitate cloud deployments, including automation of all infrastructure, services and application build and deployment
• Ensures that all parts of the pipeline follow good software engineering practices, including automated tests and infrastructure tests
• Researches new technologies that will improve efficiency and effectiveness
• Implements highly scalable CI/CD platforms to support high change volumes and fast feedback
• Automates operational activities and tasks
• Responds to performance issues identified by alerts and reported incidents related to CI/CD platforms
• Builds tools which reduce errors and improve our overall customer experiences
• Assists in troubleshooting of production issues and ensures pipeline and infrastructure produce clear documentation and metrics which enable Root Cause Analysis
• Develops and tests Ansible Playbooks, Terraform Scripts and Packer Scripts, and establishes immutable infrastructure such that patches are an artifact of the past
• Works with Enterprise Architecture, Information Security (InfoSec), Software Delivery, and Quality Assurance to enable the organization to move to the cloud using complete automation
• Partners across Technology, Operations, Digital, and Data (TODD) to ensure controls are designed, implemented, and monitored to strengthen risk management, compliance, and cyber security, effectively mitigating risk to levels within the company’s risk appetite
• Practices disciplined change management by evaluating risk and control impacts when designing or implementing changes to processes, systems, products, and/or services and ensures appropriate updates to procedures, training, and controls are made accordingly



