Job Closed
This listing is no longer active.
Two years in a row: Innovaccer Awarded Best in KLAS Data & Analytics Platforms Category.
Site Reliability Engineer II
Location
United States
Posted
96 days ago
Seniority
Senior
Job Description
Site Reliability Engineer II
Innovaccer
- Take ownership of SRE pillars: Deployment, Reliability, Scalability, Service Availability (SLA/SLO/SLI), Performance, and Cost.
- Lead production rollouts of new releases and emergency patches using CI/CD pipelines while continuously improving deployment processes.
- Establish robust production promotion and change management processes with quality gates across Dev/QA teams.
- Roll out a complete observability stack across systems to proactively detect and resolve outages or degradations.
- Analyze production system metrics, optimize system utilization, and drive cost efficiency.
- Manage autoscaling of the platform during peak usage scenarios.
- Perform triage and RCA by leveraging observability toolchains across the platform architecture.
- Reduce escalations to higher-level teams through proactive reliability improvements.
- Participate in the 24x7 OnCall Production Support team.
- Lead monthly operational reviews with executives covering KPIs such as uptime, RCA, CAP (Corrective Action Plan), PAP (Preventive Action Plan), and security/audit reports.
- Operate and manage production and staging cloud platforms, ensuring uptime and SLA adherence.
- Collaborate with Dev, QA, DevOps, and Customer Success teams to drive RCA and product improvements.
- Implement security guidelines (e.g., DDoS protection, vulnerability management, patch management, security agents).
- Manage least-privilege RBAC for production services and toolchains.
- Build and execute Disaster Recovery plans and actively participate in Incident Response.
Job Requirements
- 4–7 years in production engineering, site reliability, or related roles.
- Solid hands-on experience with at least one cloud provider (AWS, Azure, GCP) with automation focus (certifications preferred).
- Strong expertise in Kubernetes and Linux.
- Proficiency in scripting/programming (Python required).
- Observability is critical at the scale of our systems, both for finding insights into system behavior and for detecting problems and failures. We are looking for leads to drive this charter across logs, metrics, mesh, tracing, etc.
- Knowledge of CI/CD pipelines and toolchains (Jenkins, ArgoCD, GitOps).
- Familiarity with persistence stores (Postgres, MongoDB), data warehousing (Snowflake, Databricks), and messaging (Kafka).
- Exposure to monitoring/observability tools such as ElasticSearch, Prometheus, Jaeger, NewRelic, etc.
- Proven experience in production reliability, scalability, and performance systems.
- Experience in 24x7 production environments with process focus.
- Familiarity with ticketing and incident management systems.
- Security-first mindset with knowledge of vulnerability management and compliance.
- Excellent judgment, analytical thinking, and problem-solving skills.
- Strong sense of personal responsibility and accountability for delivering high quality work.
Benefits
- Generous Paid Time Off: Recharge and relax with 22 days of fixed time off per year, in addition to company holidays—because we believe work-life balance fuels performance.
- Best-in-Class Parental Leave: Spend quality time with your growing family. We offer one of the industry’s most generous parental leave policies to support you during life’s most important moments.
- Recognition & Rewards: We celebrate wins—big and small. Get rewarded with monetary incentives and company-wide recognition for your impact and dedication. Your hard work won’t go unnoticed.
- Comprehensive Insurance Coverage: Stay covered with medical, dental, and vision insurance, plus 100% company-paid short- and long-term disability and basic life insurance. Optional perks include discounted legal aid and pet insurance.
More DevOps Engineer Jobs
Senior DevOps Engineer / Site Reliability Engineer
Divio
Operate, orchestrate, protect, mirror, backup, and migrate your web apps on Divio's external development platform.
- Keeping the lights on: patching, tuning, incident response, and keeping the stack healthy
- Pushing long-term improvements: migrations, internal tooling, security hardening, monitoring revamps
- Shaping the infrastructure: evolving our setup to stay modern, secure, and developer-friendly
- Helping out: supporting our internal support crew with technical questions and the external dev team with their day-to-day work
3954- Site Reliability Engineer II
Innovaccer
Two years in a row: Innovaccer Awarded Best in KLAS Data & Analytics Platforms Category.
About the Role
We at Innovaccer are looking for a Site Reliability Engineer II to build secure, modern healthcare cloud infrastructure and a massive data stack, with the aim of writing everything as code.
A Day in the Life
- Take ownership of SRE pillars: Deployment, Reliability, Scalability, Service Availability (SLA/SLO/SLI), Performance, and Cost.
- Lead production rollouts of new releases and emergency patches using CI/CD pipelines while continuously improving deployment processes.
- Establish robust production promotion and change management processes with quality gates across Dev/QA teams.
- Roll out a complete observability stack across systems to proactively detect and resolve outages or degradations.
- Analyze production system metrics, optimize system utilization, and drive cost efficiency.
- Manage autoscaling of the platform during peak usage scenarios.
- Perform triage and RCA by leveraging observability toolchains across the platform architecture.
- Reduce escalations to higher-level teams through proactive reliability improvements.
- Participate in the 24x7 OnCall Production Support team.
- Lead monthly operational reviews with executives covering KPIs such as uptime, RCA, CAP (Corrective Action Plan), PAP (Preventive Action Plan), and security/audit reports.
- Operate and manage production and staging cloud platforms, ensuring uptime and SLA adherence.
- Collaborate with Dev, QA, DevOps, and Customer Success teams to drive RCA and product improvements.
- Implement security guidelines (e.g., DDoS protection, vulnerability management, patch management, security agents).
- Manage least-privilege RBAC for production services and toolchains.
- Build and execute Disaster Recovery plans and actively participate in Incident Response.
- Work with a cool head under pressure and avoid shortcuts during production issues.
- Collaborate effectively across teams with excellent verbal and written communication skills.
- Build strong relationships and drive results without direct reporting lines.
- Take ownership, be highly organized, self-motivated, and accountable for high-quality delivery.
What You Need
- 4–7 years in production engineering, site reliability, or related roles.
- Solid hands-on experience with at least one cloud provider (AWS, Azure, GCP) with automation focus (certifications preferred).
- Strong expertise in Kubernetes and Linux.
- Proficiency in scripting/programming (Python required).
- Observability is critical at the scale of our systems, both for finding insights into system behavior and for detecting problems and failures. We are looking for leads to drive this charter across logs, metrics, mesh, tracing, etc.
- Knowledge of CI/CD pipelines and toolchains (Jenkins, ArgoCD, GitOps).
- Familiarity with persistence stores (Postgres, MongoDB), data warehousing (Snowflake, Databricks), and messaging (Kafka).
- Exposure to monitoring/observability tools such as ElasticSearch, Prometheus, Jaeger, NewRelic, etc.
- Proven experience in production reliability, scalability, and performance systems.
- Experience in 24x7 production environments with process focus.
- Familiarity with ticketing and incident management systems.
- Security-first mindset with knowledge of vulnerability management and compliance.
- Advantageous: hands-on experience with Kafka, Postgres, and Snowflake.
- Excellent judgment, analytical thinking, and problem-solving skills.
- Ability to quickly identify and drive to the optimal solution when presented with a series of constraints.
- Ability to lead least-privilege RBAC for various production services and toolchains.
- Ability to perform with a cool head under pressure without taking shortcuts.
- Strong collaboration with solid verbal and written communication skills, which are critical to this role.
- Strong cross-functional collaboration and relationship-building skills, and the ability to achieve results without direct reporting relationships.
- Self-motivated, with excellent time management and organizational skills.
- Strong sense of personal responsibility and accountability for delivering high quality work.
We offer competitive benefits to set you up for success in and outside of work.
Here’s What We Offer
- Generous Paid Time Off: Recharge and relax with 22 days of fixed time off per year, in addition to company holidays, because we believe work-life balance fuels performance.
- Best-in-Class Parental Leave: Spend quality time with your growing family. We offer one of the industry’s most generous parental leave policies to support you during life’s most important moments.
- Recognition & Rewards: We celebrate wins, big and small. Get rewarded with monetary incentives and company-wide recognition for your impact and dedication. Your hard work won’t go unnoticed.
- Comprehensive Insurance Coverage: Stay covered with medical, dental, and vision insurance, plus 100% company-paid short- and long-term disability and basic life insurance. Optional perks include discounted legal aid and pet insurance.
Innovaccer Inc. is an equal opportunity employer. We celebrate diversity and are committed to fostering an inclusive workplace where all employees feel valued and empowered regardless of any characteristic protected by federal, state or local law including, without limitation, race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, medical condition, disability, age, marital status, or veteran status.
Innovaccer Inc. participates in the E-Verify program to confirm employment eligibility of all newly hired employees based out of the U.S. and employed by Innovaccer Inc. For additional information, please visit the following websites: E-Verify, Right to Work (English), Right to Work (Spanish).
Disclaimer: Innovaccer does not charge fees or require payment from individuals or agencies for securing employment with us. We do not guarantee job spots or engage in any financial transactions related to employment. If you encounter any posts or requests asking for payment or personal information, we strongly advise you to report them immediately to our HR department at px@innovaccer.com. Additionally, please exercise caution and verify the authenticity of any requests before disclosing personal and confidential information, including bank account details.
Forward Deployment Engineer
Toast
We empower the restaurant community to delight guests, do what they love, and thrive.
- Hands-on Execution: Design and ship production-grade agentic solutions tailored specifically for high-velocity GTM motions
- Rapid Prototyping: Transition from ambiguous business problems to functional AI Proof-of-Concepts (PoCs) within 1–2 week sprint cycles
- Process Discovery: Embed directly with Sales, Service, and Marketing leaders to deconstruct high-friction manual workflows ripe for AI-driven transformation
- Strategic Integration: Architect the flow between LLMs and the GTM stack (Salesforce, Clay, Snowflake) with a "Golden Record" mindset to ensure data integrity at scale
- ROI Modeling: Translate technical metrics into business-outcome metrics
- Architectural Resilience: Act as the domain expert for complex synchronization challenges, ensuring GTM systems are built for concurrency, scale, and reliability
DevSecOps Engineer
Typeform
Typeform is a Barcelona, Spain-based online SaaS (Software-as-a-Service) organization that creates software for building online forms. Typeform allows companies to create beautiful
- Embed security into the software development lifecycle, enabling teams to ship features safely at high velocity.
- Partner with engineering and AI teams to assess and mitigate security risks for new AI features, infrastructure, and pipelines.
- Develop and maintain secure CI/CD pipelines, tooling, and automation to support rapid deployment.
- Conduct threat modeling, vulnerability assessments, and code reviews for new services and AI workloads.
- Advise on secure architecture patterns for infrastructure, agent systems, and model-serving environments.
- Implement monitoring, alerting, and incident response practices for critical security events.
- Define best practices, standards, and policies for secure feature delivery across engineering teams.
- Act as the internal security advocate, balancing risk management with product velocity.



