Deliciously simple way to run a business and empower your team 💫
Site Reliability Engineer, Remote – UK & EU
Location
United Kingdom
Posted
71 days ago
Salary
Not specified
Seniority
Senior
Job Description
CAKE.com
- Scale and secure infrastructure to handle increasing user demand.
- Define and deploy monitoring, alerting, and logging systems.
- Respond to and resolve production incidents; conduct post-mortems.
- Monitor server logs and design tools to automate operational processes.
Job Requirements
- 5+ years of relevant work experience.
- Working experience with AWS, Docker, Git, and CI/CD tools such as GitLab CI and Jenkins.
- Experience with IaC tools like Terraform, CloudFormation, Ansible, Puppet, Packer.
- Proficiency with Linux and other Unix-based systems (including writing shell scripts).
- Experience setting up build automation and repositories.
- Excellent understanding of security and safety best practices.
- Bachelor’s degree in Computer Science or equivalent work experience.
- Excellent written and verbal English communication skills.
- Ability to work with mixed US- and EU-based teams.
Benefits
- Innovative Environment – Work with top tech talent on cutting-edge products.
- Work-Life Balance – No overtime, no weekend work, and no late working hours.
- Continuous Learning – In-house learning programs, tech lectures, knowledge sharing.
- Remote Work Set-up – Enjoy the flexibility of working from home with a provided MacBook to support your work.
More DevOps Engineer Jobs
Site Reliability Engineer
Tillster
We’re a unified commerce platform that enables QSR restaurants to deliver personalized brand experiences & drive sales.
- Analyzing and troubleshooting large-scale distributed systems in the public cloud
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity
- Improve and maintain monitoring and logging solutions that measure availability, latency and overall system health of production systems
- Provision and manage cloud Infrastructure through automation and infrastructure as code
- Restore healthy operation of applications and services through sustainable incident response and blameless postmortems
- Follow and monitor security and compliance best practices
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks

- Architect, build, and operate resilient, scalable, and self-healing cloud infrastructure on AWS.
- Lead the evolution of Kubernetes and platform services to enable secure, automated, and multi-region operations.
- Define and enforce Infrastructure as Code (IaC) standards using Terraform, AWS CDK, and Crossplane to ensure consistency, security, and auditability.
- Drive automation across provisioning, configuration, and monitoring pipelines to reduce manual effort and operational risk.
- Establish and champion reliability, observability, and performance standards across Tier-1 services, ensuring alignment with regulatory and partner requirements.
- Partner with product engineering to enhance CI/CD velocity, service resilience, and visibility through shared tooling, SLOs, and platform patterns.
- Lead incident reviews, root-cause analyses, and systemic reliability improvements, embedding learnings into runbooks and design practices.
- Optimize cloud infrastructure for cost, performance, and fault tolerance, driving data-driven operational excellence.
- Mentor and upskill engineers, shaping architectural direction and influencing design decisions across multiple teams.
- Contribute to the technical strategy and roadmap for Paxos’ infrastructure platform, aligning platform scalability with business growth and compliance objectives.

- Design, build, and operate scalable, highly available cloud infrastructure primarily on AWS.
- Manage and evolve our Kubernetes environments to support the deployment and operation of modern, containerized applications.
- Define and implement Infrastructure as Code (IaC) using tools like Terraform, CDK, or Crossplane.
- Automate infrastructure provisioning, configuration, maintenance, and monitoring to reduce manual effort and improve reliability.
- Apply best practices around security, observability, and cost optimization across infrastructure and services.
- Manage and optimize database technologies, with a focus on Amazon RDS and Aurora.
- Partner with development teams to ensure seamless deployment and integration of new features and updates.
- Investigate and resolve incidents, perform root cause analysis, and implement long-term fixes.
- Participate in on-call rotations and provide support for critical production systems.
- Contribute to SRE best practices, internal tooling, and team knowledge sharing.

- Provide solutions to customers to make them successful using our products.
- Troubleshoot customer environments and engage in active triaging with customers
- Build out our monitoring and alerting systems.
- Build and maintain automation to ensure daily operational tasks are handled as efficiently as possible.
- Help direct the architecture of the products and contribute where possible.
- Own the customer experience, working directly with customers to prioritize and solve issues, meet SLAs, and provide “white glove” guidance on the path to production.
- Participate remotely within a fully distributed team.
- Enhance and enrich customer documentation
- Work with the latest technology and multi-cloud implementations


