Deliciously simple way to run a business and empower your team 💫
Site Reliability Engineer, SRE
Location
United States
Posted
70 days ago
Salary
Not specified
Seniority
Senior
Job Description
CAKE.com
- Scale and secure our rapidly growing infrastructure
- Automate critical processes
- Ensure a seamless experience for new users
- Make sure the infrastructure keeps up with the growth
- Ensure system scalability and high traffic handling
- Define and deploy monitoring, alerting, and logging systems
- Respond to and resolve production incidents
- Conduct thorough post-mortems
- Monitor server logs for abnormalities
- Design, manage and maintain automation tools for operational processes
Job Requirements
- 5+ years of relevant work experience
- Working experience with AWS
- Docker
- Git
- CI/CD tools like GitLab CI, Jenkins, etc.
- Experience with IaC tools like Terraform, CloudFormation, Ansible, Puppet, Packer
- Proficiency with Linux and other Unix-based systems
- Experience setting up build automation
- Excellent understanding of security and safety best practices
- Bachelor’s degree in Computer Science or equivalent work experience
- Excellent written and verbal English communication skills
- Ability to work with mixed US- and EU-based teams
Benefits
- No overtime
- No work on weekends
- No late working hours
- In-house learning programs
- Tech lectures
- Knowledge sharing
- Remote work with provided MacBook
More DevOps Engineer Jobs
Site Reliability Engineer
Tillster
We’re a unified commerce platform that enables QSR restaurants to deliver personalized brand experiences & drive sales.
• Analyzing and troubleshooting large-scale distributed systems in the public cloud • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity • Improve and maintain monitoring and logging solutions that measure availability, latency and overall system health of production systems • Provision and manage cloud Infrastructure through automation and infrastructure as code • Restore healthy operation of applications and services through sustainable incident response and blameless postmortems • Follow and monitor security and compliance best practices • A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
• Architect, build, and operate resilient, scalable, and self-healing cloud infrastructure on AWS. • Lead the evolution of Kubernetes and platform services to enable secure, automated, and multi-region operations. • Define and enforce Infrastructure as Code (IaC) standards using Terraform, AWS CDK, and Crossplane to ensure consistency, security, and auditability. • Drive automation across provisioning, configuration, and monitoring pipelines to reduce manual effort and operational risk. • Establish and champion reliability, observability, and performance standards across Tier-1 services, ensuring alignment with regulatory and partner requirements. • Partner with product engineering to enhance CI/CD velocity, service resilience, and visibility through shared tooling, SLOs, and platform patterns. • Lead incident reviews, root-cause analyses, and systemic reliability improvements, embedding learnings into runbooks and design practices. • Optimize cloud infrastructure for cost, performance, and fault tolerance, driving data-driven operational excellence. • Mentor and upskill engineers, shaping architectural direction and influencing design decisions across multiple teams. • Contribute to the technical strategy and roadmap for Paxos’ infrastructure platform, aligning platform scalability with business growth and compliance objectives.
• Design, build, and operate scalable, highly available cloud infrastructure primarily on AWS. • Manage and evolve our Kubernetes environments to support the deployment and operation of modern, containerized applications. • Define and implement Infrastructure as Code (IaC) using tools like Terraform, CDK, or Crossplane. • Automate infrastructure provisioning, configuration, maintenance, and monitoring to reduce manual effort and improve reliability. • Apply best practices around security, observability, and cost optimization across infrastructure and services. • Manage and optimize database technologies, with a focus on Amazon RDS and Aurora. • Partner with development teams to ensure seamless deployment and integration of new features and updates. • Investigate and resolve incidents, perform root cause analysis, and implement long-term fixes. • Participate in on-call rotations and provide support for critical production systems. • Contribute to SRE best practices, internal tooling, and team knowledge sharing.
• Provide solutions to customers to make them successful using our products. • Troubleshoot customer environments and engage in active triaging with customers • Build out our monitoring and alerting systems. • Build and maintain automation to ensure daily operational tasks are handled as efficiently as possible. • Help direct the architecture of the products and contribute where possible. • Own the customer experience, working directly with customers to prioritize and solve issues, meet SLAs, and provide “white glove” guidance on the path to production. • Participate remotely within a fully distributed team. • Enhance and enrich customer documentation • Work with the latest technology and multi-cloud implementations


