Hazelcast modernizes applications with a unified real-time data platform.
Senior Cloud SRE
Location
United Kingdom
Posted
54 days ago
Seniority
Senior
Job Description
• Keep Hazelcast cloud-based production systems running smoothly 24/7/365
Design and Development:
• Design, develop, and maintain our cloud infrastructure to support both our end-user management center and microservice-based platform
• Implement new solutions using AWS and Terraform, improving scalability, throughput, and reliability.
• Support and manage our Keycloak IdP, ensuring it provides appropriate security while meeting the needs of the development team
Security and Integration:
• Implement security measures to protect data integrity and confidentiality, including encryption, access control, and compliance with relevant regulations.
• Work with our operations team to maintain our SOC2 & ISO27001 compliance and keep our environment secure
Monitoring and Maintenance:
• Monitor the system for performance issues, errors, and potential failures, and implement maintenance procedures such as backups, data recovery, and disaster recovery plans.
• Troubleshoot issues related to data storage, including performance bottlenecks, data corruption, or compatibility issues with other software components.
Collaboration:
• Collaborate with cross-functional teams, including software developers, architects, and product managers, to ensure the effective integration and operation of the components within the overall software infrastructure.
• Document design decisions, implementation details, and operational procedures to facilitate collaboration among team members and ensure the maintainability of the system.
Continuous Learning:
• Stay updated with the latest developments in storage technologies, the Java programming language, and software engineering best practices, and apply this knowledge to improve existing storage systems and develop new solutions.
On-call participation:
• Be part of our on-call rotation to respond to availability incidents and work with support and engineers on customer incidents
Job Requirements
- Experience of distributed systems, Kubernetes & microservices
- Infrastructure as Code (Terraform)
- Modern DevOps stack (K8s, Prometheus, Grafana, OpenTelemetry, ArgoCD, Helm)
- Experience with at least one programming language, preferably Golang or Python
- Experience with CI and building CD pipelines (Jenkins, GitHub Actions)
- A passion for automation and keeping our software delivery fast and efficient
- Knowledge of the following is desirable:
- Multi-cloud (AWS, GCP and/or Azure)
- Experience working with software engineers in designing cloud-native applications or troubleshooting them
- Experience as part of an on-call rota
- Bachelor's degree in a relevant field of study (Computer Science or a related discipline), or equivalent experience.
Benefits
- 25 days annual leave
- Group Company Pension Plan
- Private Medical Insurance
- Private Dental Insurance
- Life Insurance
- EAP (Employee Assistance Program)
More DevOps Engineer Jobs
• Oversee, lead, and manage the projects of a cross-functional team of 5 DevSecOps and cloud engineers in delivering secure, reliable, and scalable infrastructure and deployment pipelines across Azure and AWS.
• Apply knowledge of Infrastructure as Code with Terraform, automated CI/CD (Azure DevOps), and container-oriented architectures to drive 3E modernization initiatives, embed security throughout the software lifecycle, and accelerate AI adoption to improve delivery velocity and operational insight.
• Lead, guide, and develop engineering talent; set objectives and key results, plan and oversee projects, timelines, and objectives, conduct performance reviews, and foster an inclusive, high-performance culture.
• Define, plan, and own the DevSecOps roadmap aligned to business objectives, cloud modernization, and AI-enabled capabilities.
• Design, develop, and maintain resilient cloud solutions in collaboration with application, security, data, and product teams that meet evolving requirements while continuously improving existing cloud applications.
• Provide hands-on technical leadership to design, build, and operate secure CI/CD pipelines using Azure DevOps and Terraform across multi-cloud (Azure & AWS) environments.
• Architect, implement, and maintain containerized workloads (Docker, Kubernetes, ECS) with built-in security scanning, policy enforcement, and automated scaling.
• Establish, monitor, and improve SRE-driven metrics, dashboards, and incident response practices to meet SLOs, compliance mandates, and operational excellence targets.
• Drive adoption of AI/ML for predictive anomaly detection, generative code analysis, and AI-driven infrastructure optimization.
• Stay current on industry trends, emerging cloud services, and regulatory changes; disseminate knowledge through brown bags, documentation, and internal communities.
• Ensure documentation, runbooks, and knowledge transfer are maintained and continuously improved to support on-call engineers and audit requirements.
DevOps Architect / SME, MultiCloud
EITACIES Inc.
EITACIES, The Edge where we bring the difference. Accelerating performance. Achieving #business goals.
• Architect and standardize DevOps practices across AWS, GCP, and hybrid cloud environments
• Build and maintain Infrastructure as Code frameworks using Terraform, CDK, or Pulumi
• Design and implement scalable Kubernetes-based platforms (Helm, deployments, orchestration)
• Develop reusable CI/CD modules and automation frameworks
• Lead cloud architecture for large-scale systems including compute, storage, networking, and databases
• Debug complex multi-cloud and third-party integration issues
• Implement observability, incident management, and reliability systems
• Define and enforce security best practices across infrastructure and pipelines
• Drive performance tuning, scalability, and system optimization
• Establish disaster recovery and backup strategies
• Mentor engineers and drive adoption of DevOps best practices across teams
• Collaborate with cross-functional teams and offshore engineering teams
Engineer Sr Lead, Site Reliability
Zensar
At Zensar, we’re “experience-led everything”. We are committed to conceptualizing, designing, engineering, marketing, and managing digital solutions and experiences for over 130 leading enterprises. We are a company driven by a bold purpose: Together, we shape experiences for better futures. Whether for our clients, our people, or the world around us, this belief powers everything we do. At the heart of our culture is ONE with Client - a set of four core values that reflect who we are and how we work: One Zensar, Nurturing, Empowering, and Client Focus. Part of the $4.8 billion RPG Group, we’re a community of 10,000+ innovators across 30+ global locations, including Milpitas, Seattle, Princeton, Cape Town, London, Zurich, Singapore, and Mexico City. We believe the best work happens when individuality is celebrated, growth is encouraged, and well-being is prioritized. We are an equal employment opportunity (EEO) and affirmative action employer, committed to creating an inclusive workplace. All qualified applicants will be considered without regard to race, creed, color, ancestry, religion, sex, national origin, citizenship, age, sexual orientation, gender identity, disability, marital status, family medical leave status, or protected veteran status.
JD - Engineer Sr Lead, Site Reliability
What you will be doing:
Software Engineer/Site Reliability Engineer will play a critical role in driving innovation and growth for the Banking Solutions, Payments and Capital Markets business. In this role, the candidate will have the opportunity to make a lasting impact on the company's transformation journey, drive customer-centric innovation and automation, and position the organization as a leader in the competitive banking, payments and investment landscape. Specifically, the Site Reliability Engineer will be responsible for the following:
• Design and maintain monitoring solutions for infrastructure, application performance, and user experience.
• Implement automation tools to streamline tasks, scale infrastructure, and ensure seamless deployments.
• Ensure application reliability, availability, and performance, minimizing downtime and optimizing response times.
• Lead incident response, including identification, triage, resolution, and post-incident analysis.
• Conduct capacity planning, performance tuning, and resource optimization.
• Collaborate with security teams to implement best practices and ensure compliance.
• Manage deployment pipelines and configuration management for consistent and reliable app deployments.
• Develop and test disaster recovery plans and backup strategies.
• Collaborate with development, QA, DevOps, and product teams to align on reliability goals and incident response processes.
• Participate in on-call rotations and provide 24/7 support for critical incidents.
What you bring:
• Proficiency in development technologies, architectures, and platforms (web, API).
• Experience with cloud platforms (AWS, Azure, Google Cloud) and IaC tools.
• Hands-on experience with Docker, Kubernetes.
• Knowledge of monitoring tools (Prometheus, Grafana, DataDog) and logging frameworks (Splunk, ELK Stack).
• Experience in incident management and post-mortem reviews.
• Strong troubleshooting skills for complex technical issues.
• Proficiency in scripting languages (Python, Bash) and automation tools (Terraform, Ansible).
• Experience with CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
• Ownership approach to engineering and product outcomes.
• Excellent interpersonal communication, negotiation, and influencing skills.
Explore Life at Zensar and join us to Grow. Own. Achieve. Learn. to be the best version of yourself.
Senior Site Reliability Engineer, API Platform Engineer
ELSA, Corp
World's leading A.I. app in English speaking and communication
• Join the AI Infrastructure & Platform team to build, operate, and scale the production systems that power ELSA’s APIs, platform services, and AI-enabled applications.
• This Senior Site Reliability Engineer / API Platform Engineer role bridges software engineering, cloud infrastructure, and operational excellence, requiring a pragmatic, highly productive individual who can use modern AI tools and automation to accelerate delivery and improve reliability.
• Collaborate closely with engineering, AI, and product teams to ensure our services are secure, scalable, observable, and resilient in real-world production environments.
• Design, build, and operate reliable, scalable infrastructure for APIs, platform services, and AI-enabled applications on AWS and Kubernetes.
• Own and enhance CI/CD pipelines, deployment workflows, and operational tooling to enable safe and fast software delivery.
• Build and maintain robust observability systems across metrics, logging, tracing, alerting, and service health.
• Lead incident response, root cause analysis, postmortems, and remediation efforts to continuously improve production reliability.
• Automate repetitive operational work through software, infrastructure-as-code, and AI-assisted workflows.
• Use AI-native engineering tools including copilots, intelligent automation, and agentic operational tooling to improve debugging, response time, analysis, and team productivity.
• Partner with backend, platform, and AI engineering teams to productionize new services and ensure they meet reliability, security, and scalability standards.
• Optimize infrastructure and runtime performance across latency, throughput, availability, and cost.
• Define and enforce engineering standards for reliability, security, observability, and operational excellence across services.
• Contribute production-grade software and internal tools that reduce toil and improve platform leverage across the organization.



