Set data in motion.
Staff Software Engineer I – SRE
Location
India
Posted
128 days ago
Seniority
Lead
Job Description
Confluent
- Analyze systemic failure patterns and design improvements that prevent incident recurrence
- Define and maintain SLO/SLA frameworks; use error budgets to guide reliability investments
- Build tooling and automation to reduce incident response toil and scale team impact
- Own Rootly configuration, workflows, and integrations with PagerDuty, Jira, Confluence, and Slack
- Analyze reliability data to identify systemic improvements; build dashboards that drive action
- Explore AI-assisted approaches to documentation quality and incident analysis
- Design scalable reliability standards that reduce reactive workload over time
- Own standards, practices, and continuous improvement of incident response
- Define incident commander eligibility criteria and manage the rotation
- Available as escalation IC when incidents exceed a team's management chain
- Develop and deliver training programs for engineering teams at all levels
- Coach teams through post-mortems and on developing actionable corrective actions
- Edit and review customer-facing incident documents to ensure quality and clarity
- Drive turnaround SLAs while maintaining technical accuracy
- Ensure clear explanation of what happened, why, and how we'll prevent recurrence
- Partner with engineering leaders to elevate reliability practices
- Be the expert who teams proactively engage for guidance
Job Requirements
- 10+ years in SRE, incident management, or reliability engineering
- Cloud experience with at least one of AWS, GCP, or Azure
- Deep expertise with incident management tooling (Rootly, PagerDuty, or similar platforms)
- Strong understanding of distributed systems and failure modes at scale—Kafka/event streaming expertise preferred, or demonstrated rapid mastery of complex systems
- Deep experience with observability: metrics, logging, tracing—ability to diagnose complex issues
- Kubernetes and container orchestration experience
- Understanding of CI/CD pipelines and release processes
- Systems thinking: understanding how infrastructure design choices affect failure modes and recovery
- Familiarity with SLO/SLA frameworks
- Track record as a trusted advisor across engineering organizations
- Experience driving org-wide process and cultural changes
- Strong written communication (design docs, one-pagers, runbooks)
- Post-mortem facilitation experience
- Experience with async collaboration across time zones
- Large company experience navigating reliability/incident programs at 500+ engineer organizations
Benefits
- Belonging isn’t a perk here. It’s the baseline. We work across time zones and backgrounds, knowing the best ideas come from different perspectives. And we make space for everyone to lead, grow, and challenge what’s possible.
- We’re proud to be an equal opportunity workplace. Employment decisions are based on job-related criteria, without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or any other classification protected by law.
More DevOps Engineer Jobs
DevOps Engineer
Veradigm®
Driving value through its unique combination of platforms, data, expertise, connectivity, and scale.
- Veradigm is expanding its DevOps Engineering team and is seeking a highly skilled and enthusiastic DevOps Engineer to support and evolve our platforms and systems.
- This role is critical to the success of our VEHR/VPM/VIE products and will be responsible for building and deploying solutions and services in on-premises and hosted environments.
- It will also support the Azure environments used by the Dev/QA teams.
- Knowledge of secure DevOps practices (secrets management, compliance, scanning tools).
- Exposure to and understanding of container technologies like Docker and/or Kubernetes.
- Experience with configuration management tools (e.g., Ansible, Chef) is a plus.
- Able to work with developers supporting both modern and legacy applications.
- Comfortable with CI/CD, including debugging build failures and deployment issues.
- Self-driven and motivated, with the ability to work independently and prioritize tasks effectively.
- Strong communication and interpersonal skills, with the ability to collaborate and communicate effectively with cross-functional teams.
- Excellent troubleshooting and problem-solving skills, with keen attention to detail.
- Excellent documentation skills.
Senior DevOps Engineer – Data & Integration Platform
Cuculus GmbH
Affordable energy and water for everyone.
- Lead the installation, automation, and operational reliability of a modern open-source data and integration platform.
- Install, configure, upgrade, and maintain distributed open-source components including Apache Airflow, Apache NiFi, Apache Spark, Apache Kafka, PostgreSQL, and MQTT brokers.
- Ensure platform stability, scalability, high availability, and fault tolerance.
- Design, deploy, and operate containerized workloads using Docker and Kubernetes.
- Implement Infrastructure as Code (IaC) using Terraform and build configuration management and automation workflows using Ansible.
- Deploy and operate workloads on public cloud platforms (AWS, Azure, GCP) and private/on-prem infrastructure.
- Design and implement comprehensive monitoring, logging, and alerting for infrastructure and applications.
- Implement security best practices across containers, Kubernetes, and networks.
- Design, implement, and maintain Azure-based infrastructure and CI/CD pipelines using Azure DevOps (ADO).
- Deploy and operate containerized applications using Azure Container Apps, with readiness to transition to AKS as scale and complexity increase.
- Support microservices-based applications built on .NET APIs with React front-end components.
- Enable and support AI-enabled workloads leveraging Azure OpenAI and related Azure services.
- Partner with the client’s Cloud Technology and Security teams to provision subscriptions, resource groups, and security configurations.
- Implement and maintain DevSecOps best practices aligned with the client’s governance, compliance, and security standards.
- Support production environments with a focus on availability, scalability, and operational resilience.
- Collaborate with multiple product and engineering teams across concurrent projects.
- Participate in knowledge transfer and onboarding to the client’s security, compliance, and operating procedures.
Staff Software Engineer – SAP BTP, CPI, SRE
Scratch Financial
Scratch Financial is the world's simplest patient financing solution.
- Collaborate with onsite architects and onsite leads to improve SAP BTP engineering development strategy, architecture, and governance of application development and engineering-related activities for NBCU
- Lead the offshore team and work with product and other engineering teams to architect, design, and evaluate technical design options for custom development, keeping in mind industry best practices and NBCU development standards
- Take a lead role in creating procedures and documentation for proper rollout and implementation
- Support production deployments
- Manage relationships with Product Teams, Change Management, Testing, and other Engineering teams
- Prioritize and schedule enhancements stemming from business needs
- Implement procedures to communicate effectively with external and internal stakeholders during service interruptions




