At Zensar, we’re “experience-led everything”. We are committed to conceptualizing, designing, engineering, marketing, and managing digital solutions and experiences for over 130 leading enterprises. We are a company driven by a bold purpose: Together, we shape experiences for better futures. Whether for our clients, our people, or the world around us, this belief powers everything we do. At the heart of our culture is ONE with Client - a set of four core values that reflect who we are and how we work: One Zensar, Nurturing, Empowering, and Client Focus. Part of the $4.8 billion RPG Group, we’re a community of 10,000+ innovators across 30+ global locations, including Milpitas, Seattle, Princeton, Cape Town, London, Zurich, Singapore, and Mexico City. We believe the best work happens when individuality is celebrated, growth is encouraged, and well-being is prioritized. We are an equal employment opportunity (EEO) and affirmative action employer, committed to creating an inclusive workplace. All qualified applicants will be considered without regard to race, creed, color, ancestry, religion, sex, national origin, citizenship, age, sexual orientation, gender identity, disability, marital status, family medical leave status, or protected veteran status.

Senior Engineer, Site Reliability

DevOps Engineer | Full Time | Remote | Senior | Team 10,001

Location

India

Posted

39 days ago

Seniority

Senior

Job Description

Role Description

The Software Engineer / Site Reliability Engineer (SRE) will play a critical role in driving reliability, scalability, and performance for the Banking Solutions, Payments, and Capital Markets platforms. This role blends core SRE principles, performance engineering, and service health management to support large-scale, mission-critical systems. The ideal candidate will help modernize platforms through automation-first practices, data-driven reliability metrics, and proactive performance optimization, ensuring exceptional customer experience and business continuity in a highly regulated environment.

What You Will Be Doing

Core SRE & Reliability Engineering
- Design, implement, and operate highly available, resilient, and scalable systems aligned with SRE best practices.
- Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets to balance reliability and delivery velocity.
- Build and maintain service health dashboards to provide real-time visibility into platform stability and customer experience.
- Reduce toil through extensive automation of operational workflows, alerts, and remediation activities.

Monitoring, Observability & Service Health
- Design and maintain end-to-end monitoring and observability solutions covering infrastructure, applications, APIs, and user journeys.
- Implement advanced alerting strategies to reduce noise and improve mean time to detect (MTTD) and mean time to resolution (MTTR).
- Leverage metrics, logs, and traces to drive root cause analysis and proactive incident prevention.
- Enable reliability reporting for stakeholders using SLO compliance and service health metrics.

Performance Engineering & Testing
- Lead performance engineering initiatives, including load testing, stress testing, endurance testing, and capacity validation.
- Identify performance bottlenecks across application, middleware, database, and infrastructure layers.
- Conduct capacity planning and performance tuning to support business growth and peak traffic scenarios.
- Partner with development and QA teams to embed performance testing into CI/CD pipelines.

Incident Management & Operations
- Lead and participate in incident response activities, including triage, mitigation, recovery, and post-incident reviews.
- Drive blameless post-mortems and ensure corrective actions are tracked to completion.
- Participate in on-call rotations, providing 24x7 support for critical production systems.
- Continuously improve operational readiness and resilience.

Automation, CI/CD & Cloud Operations
- Design and manage deployment pipelines, configuration management, and environment consistency across lower and production environments.
- Implement Infrastructure as Code (IaC) practices for repeatable and secure cloud provisioning.
- Collaborate with DevOps teams to improve deployment reliability, rollback mechanisms, and release safety.
- Develop and test disaster recovery plans, backup strategies, and failover mechanisms.

Collaboration & Governance
- Work closely with Development, QA, DevOps, Security, and Product teams to align on reliability and performance goals.
- Ensure platforms meet security, compliance, and regulatory requirements common in financial services.
- Act as a reliability and performance advocate throughout the SDLC.

Qualifications
- Strong experience in Core SRE practices, including reliability engineering, incident management, and automation.
- Proven hands-on experience in Performance Engineering / Performance Testing for large-scale distributed systems.
- Deep understanding and implementation experience with SLI / SLO / Error Budget frameworks.
- Proficiency in cloud platforms (AWS, Azure, or Google Cloud).
- Hands-on experience with containerization and orchestration (Docker, Kubernetes).
- Strong background in monitoring, observability, and logging tools such as Prometheus, Grafana, Datadog, Splunk, ELK Stack.
- Experience with CI/CD pipelines (Jenkins, GitLab CI/CD, Azure DevOps).
- Proficiency in scripting and automation using Python, Bash, Terraform, Ansible.
- Strong troubleshooting skills across application, infrastructure, and network layers.
- Experience designing and running incident response and post-mortem reviews.
- Ownership mindset with accountability for service reliability and customer outcomes.
- Excellent communication, collaboration, and stakeholder management skills.

Nice to Have (SRE+ Skills)
- Experience with Keptn or similar tools for automated SLO-based quality gates and continuous delivery.
- Programming experience in Java, especially for debugging, performance profiling, or building automation tools.
- Familiarity with chaos engineering practices and tools.
- Experience working in banking, payments, or capital markets domains.
- Knowledge of security best practices and regulatory compliance in enterprise environments.

More DevOps Engineer Jobs

Site Reliability Engineer, SRE Team

Semrush

Your competitors' favorite marketing platform used by 10,000,000 marketers

DevOps Engineer | 40 days ago

Full Time | Remote | Team 1,001-5,000 | Since 2008 | H1B Sponsor

• Collaborate with development teams to design and implement scalable, reliable, and efficient system architectures
• Establish and refine SLOs in partnership with stakeholders to guarantee service reliability and performance
• Read and write code in Python/Go
• Induce application failure and work to recover it from that state
• Debug applications using metrics and add traces/metrics as needed
• Participate in on-call duties to provide constant support
• Lead the changes in common engineering practices in the Company
• Possible night shifts (on-call)

Cloud, Google Cloud Platform, Kubernetes, Python, Go

Cyprus

Job Closed

Principal DevOps Engineer

Lodgify

Everything you need to run your vacation rental business.

DevOps Engineer | 40 days ago

Full Time | Remote | Team 51-200 | H1B No Sponsor

• Define and drive long-term infrastructure/platform strategy aligned with business objectives and R&D needs; lead cross-org initiatives (e.g., multi-cloud strategy, disaster recovery architecture, platform modernization).
• Design robust, scalable, secure architectures; run architectural reviews to identify systemic issues/tech debt and optimization opportunities; define rollout strategies for high-impact changes (DB migrations, Kubernetes upgrades, network redesigns) to ensure continuity.
• Evolve internal developer platforms and tools (self-service infrastructure, golden paths/paved roads); establish IaC + GitOps standards and reusable patterns that balance productivity, safety, compliance, and cost.
• Establish org-wide reliability practices (SLOs/SLIs/error budgets); drive observability/alerting strategy; reduce toil and improve MTTR through automation and self-healing; identify bottlenecks and implement efficiency improvements.
• Lead platform-wide security and compliance evolutions (zero trust, encryption in transit/at rest, regulatory compliance); strengthen controls (segmentation, secrets, access, vulnerability management) and secure the software supply chain.
• Improve cost visibility and governance; identify cost optimizations across cloud, databases, and services without compromising reliability or performance.
• Coach DevOps Engineers, Senior DevOps Engineers, and Tech Leads; contribute to promotion committees; build learning paths and knowledge-sharing for the infrastructure community.
• Provide escalation support for complex incidents; lead post-incident reviews and drive improvements to incident response processes and tooling.

AWS, Cloud, Distributed Systems, Google Cloud Platform, Kubernetes, Terraform

Spain

Reliability & Scale Engineer – DevOps/Cloud

RedSky

Mission Focused | Beyond The Horizon

DevOps Engineer | 40 days ago

Full Time | Remote | Team 11-50 | Since 2016 | H1B No Sponsor

• Build teams that push boundaries
• Create startups from the ground up (0→1)
• Enter Talent Pipeline for project matches

Poland

Junior DevOps Engineer

Element 8

Crafting Tomorrow's Answers, Today: Where Technology Meets Imagination

DevOps Engineer | 40 days ago

Full Time | Remote | Team 51-200 | H1B No Sponsor

• Design, implement, and maintain CI/CD pipelines using GitHub Actions.
• Deploy and manage applications across AWS and Azure cloud environments.
• Work under senior team members to understand workflows and processes.
• Use tools like Git and GitLab for code versioning and collaboration.
• Learn and work with containerization technologies like Docker and Kubernetes.
• Assist in managing configurations using Ansible and deploying infrastructure with Terraform.
• Use tools like CloudWatch and Nagios to monitor system health and performance.
• Work with teams to manage tasks and projects using JIRA.

Ansible, AWS, Azure, Cloud, Docker, Kubernetes, Terraform

India

₹15K - ₹25K / month

Job Closed
