Clover is a healthcare technology company helping members live their healthiest lives with our Medicare Advantage plans.
Senior Manager, Site Reliability Engineering
Location
United States
Posted
4 days ago
Salary
$187K - $243K / year
Seniority
Senior
Job Description
Senior Manager, Site Reliability Engineering
Clover Health
- Lead and grow our SRE team of ~10 engineers, including hiring, retention, career development, and performance management across multiple time zones (US, HK, NZ).
- Build strategic partnerships with product engineering pillars, shifting SRE from reactive, ticket-based support to proactive co-ownership of reliability outcomes.
- Scale our multi-tenant infrastructure to support new customer onboarding and growing patient populations.
- Own cloud cost management and FinOps practices, building frameworks that balance cost control with reliability and performance.
- Champion developer self-service and platform engineering. Build self-service capabilities so product teams can manage routine operations without filing SRE tickets. Establish SLOs/SLIs for critical services and improve alert quality so every page is meaningful.
- Ensure the SRE team is fully leveraging AI tooling in their workflows (using tools like Claude Code for IaC generation, log analysis, root cause investigation, and automating repetitive work) at the same level as the rest of engineering.
Job Requirements
- You have 6+ years managing an SRE team and 10+ years of hands-on SRE or infrastructure engineering experience.
- You're deeply comfortable with our core stack: Kubernetes, GCP (GKE, Cloud SQL, Pub/Sub, GCS), Terraform, Helm, ArgoCD, PostgreSQL, and Prometheus/Grafana.
- You have strong programming skills in Python and/or Go, and you're comfortable writing and reviewing infrastructure tooling code — including using AI coding tools to do so.
- You have experience with CI/CD pipelines (GitHub Actions) and a track record of building or improving developer tooling and automation.
- You have sound build vs. buy judgment — you default to the right answer, not the easiest one, and you're comfortable building internal tooling when existing solutions don't fit.
- You have experience leading teams across multiple time zones and a track record of developing engineers into strong technical contributors.
Benefits
- Financial Well-Being: Our commitment to attracting and retaining top talent begins with a competitive base salary and equity opportunities. Additionally, we offer a performance-based bonus program, 401k matching, and regular compensation reviews to recognize and reward exceptional contributions.
- Physical Well-Being: We prioritize the health and well-being of our employees and their families by providing comprehensive medical, dental, and vision coverage. Your health matters to us, and we invest in ensuring you have access to quality healthcare.
- Mental Well-Being: We understand the importance of mental health in fostering productivity and maintaining work-life balance. To support this, we offer initiatives such as No-Meeting Fridays, monthly company holidays, access to mental health resources, and a generous flexible time-off policy. Additionally, we embrace a remote-first culture that supports collaboration and flexibility, allowing our team members to thrive from any location.
- Professional Development: Developing internal talent is a priority for Clover. We offer learning programs, mentorship, professional development funding, and regular performance feedback and reviews.
- Additional Perks: Employee Stock Purchase Plan (ESPP) offering discounted equity opportunities
- Reimbursement for office setup expenses
- Monthly cell phone & internet stipend
- Remote-first culture, enabling collaboration with global teams
- Paid parental leave for all new parents
- And much more!
More DevOps Engineer Jobs
Senior Manager, Site Reliability Engineering
Counterpart Health
In 2018, Clover Health set out to build a clinically intuitive, AI-enabled solution that fits within physicians' workflows to help support the earlier diagnosis and management of chronic conditions. Years later, that vision is a reality, with thousands of practitioners using Counterpart Assistant during patient visits. Counterpart Health is a subsidiary of Clover Health, committed to Diversity & Inclusion as key to our success. We are an Equal Opportunity Employer, valuing diverse strengths, experiences, perspectives, and backgrounds.
Role Description
We're looking for a Senior Manager of Site Reliability Engineering to join our team. You'll lead a team of ~10 SREs across North America, UK, HK, and New Zealand, owning both the day-to-day operations and the long-term technical direction of the SRE organization. This role sits at the intersection of people leadership, technical depth, and strategic partnership: you're here to make Counterpart's infrastructure reliable, scalable, and cost-efficient, and to transform the SRE team's engagement model from reactive support to proactive collaboration with our product engineering pillars.
- Lead and grow our SRE team of ~10 engineers, including hiring, retention, career development, and performance management across multiple time zones (US, HK, NZ).
- Build strategic partnerships with product engineering pillars, shifting SRE from reactive, ticket-based support to proactive co-ownership of reliability outcomes.
- Scale our multi-tenant infrastructure to support new customer onboarding and growing patient populations.
- Own cloud cost management and FinOps practices, building frameworks that balance cost control with reliability and performance.
- Champion developer self-service and platform engineering. Build self-service capabilities so product teams can manage routine operations without filing SRE tickets. Establish SLOs/SLIs for critical services and improve alert quality so every page is meaningful.
- Ensure the SRE team is fully leveraging AI tooling in their workflows (using tools like Claude Code for IaC generation, log analysis, root cause investigation, and automating repetitive work) at the same level as the rest of engineering.
Qualifications
- You have 6+ years managing an SRE team and 10+ years of hands-on SRE or infrastructure engineering experience.
- You're deeply comfortable with our core stack: Kubernetes, GCP (GKE, Cloud SQL, Pub/Sub, GCS), Terraform, Helm, ArgoCD, PostgreSQL, and Prometheus/Grafana.
- You have strong programming skills in Python and/or Go, and you're comfortable writing and reviewing infrastructure tooling code, including using AI coding tools to do so.
- You have experience with CI/CD pipelines (GitHub Actions) and a track record of building or improving developer tooling and automation.
- You have sound build vs. buy judgment: you default to the right answer, not the easiest one, and you're comfortable building internal tooling when existing solutions don't fit.
- You have experience leading teams across multiple time zones and a track record of developing engineers into strong technical contributors.
Benefits
- Financial Well-Being: Competitive base salary and equity opportunities, performance-based bonus program, 401k matching, and regular compensation reviews.
- Physical Well-Being: Comprehensive medical, dental, and vision coverage.
- Mental Well-Being: Initiatives such as No-Meeting Fridays, monthly company holidays, access to mental health resources, and a generous flexible time-off policy.
- Professional Development: Learning programs, mentorship, professional development funding, and regular performance feedback and reviews.
- Additional Perks: Employee Stock Purchase Plan (ESPP), reimbursement for office setup expenses, monthly cell phone & internet stipend, remote-first culture, paid parental leave for all new parents, and much more!
Senior Platform Engineer – DevOps, Infrastructure and Platform
OZmap
Discover the best solution for documenting fiber optic networks.
- Design, operate and evolve AWS (EC2) and on-premises environments with containers (Docker), ensuring availability, security and scalability;
- Operate and administer Linux production environments (systemd, kernel/network tuning, I/O, process troubleshooting);
- Build and evolve CI/CD pipelines from scratch, including quality and security gates;
- Develop end-to-end observability (instrumentation, exporters, PromQL, SLI/SLO, alerts);
- Lead advanced troubleshooting, root cause analysis and blameless post-mortems, driving structural change afterwards and not just producing a report;
- Implement automation using Infrastructure as Code;
- Analyze and optimize cloud costs: rightsizing, usage analysis and proposing data-driven alternatives;
- Act as a technical reference for developers and engineers, influencing architecture without relying on formal authority.
- Synthesize program requirements (financial, weight, complexity) aligned with the thematic style.
- Evaluate advanced topics and escalate incompatibilities.
- Define part breakdown and assembly sequence.
- Develop source packages (PBOM, eBOM, technical sheets, DVP&R).
- Support technical reviews and supplier evaluations.
- Oversee master section development and 3D releases.
- Manage design practice analysis and critical issue resolution.
- Maintain issue logs and present them weekly.
- Conduct manufacturing and assembly feasibility studies.
- Create technical sheets and DVP&R for new components.
Senior Site Reliability Engineer
Nexthink
Unparalleled Visibility Into Issue Detection, Diagnosis, and Remediation
Company Description
Nexthink is the leader in digital employee experience management software. The company provides IT leaders with unprecedented insight, allowing them to see, diagnose and fix issues at scale impacting employees anywhere, with any application or network, before employees notice the issue. As the first solution to allow IT to progress from reactive problem solving to proactive optimization, Nexthink enables its more than 1,300 customers to provide better digital experiences to more than 18 million employees. Dual headquartered in Lausanne, Switzerland and Boston, Massachusetts, Nexthink has 9 offices worldwide.
#LI-Hybrid
Job Description
At Nexthink, we empower our customers with industry-leading solutions to enable continuous improvement of employee experience. We deliver unmatched visibility across all environments, so IT teams can consistently see, diagnose, and fix digital workplace issues. As a SaaS provider, our commitment is to deliver a seamless, resilient, and scalable platform around the clock.
We are looking for an experienced, proactive and innovative professional who is keen to join as a Senior Site Reliability Engineer!
The mission of Nexthink's SRE team is to strengthen our infrastructure and enhance our ability to deploy, monitor, and scale systems effectively and reliably. They work closely with over 50 Product Engineering teams that develop our products and services, as well as with the Technical Platform Engineering, Security and Architecture teams to understand the reliability requirements, design and implement solutions, and promote them for adoption and usage.
Join our vibrant team of diverse and experienced engineers where cutting-edge technology meets innovation. Be a part of Nexthink's Digital Employee Experience technological revolution, ensuring our global customers enjoy a seamless user experience. Apply now and become a key player in our dynamic SRE organisation.
As a Senior Site Reliability Engineer, you will:
- Implement and manage cloud-native systems (AWS) using best-in-class tools and automation.
- Operate and enhance Kubernetes clusters, deployment pipelines, and service meshes to support rapid delivery cycles.
- Design, build, and maintain the infrastructure powering our multi-tenant SaaS platform with reliability, security, and scalability in mind.
- Define and maintain SLOs, SLAs, and error budgets, and proactively address availability and performance issues.
- Develop infrastructure-as-code (Terraform or similar) for repeatable and auditable provisioning.
- Build internal platform tools and automation to support provisioning, monitoring, and operational efficiency.
- Monitor infrastructure and applications, ensuring high-quality user experiences.
- Participate in a shared on-call rotation, responding to incidents, troubleshooting outages, and driving timely resolution and communication.
- Act as an Incident Commander during on-call duty and coordinate cross-team responses effectively to maintain an SLA.
- Drive and refine incident response processes, reducing Mean Time to Detect (MTTD) and Mean Time to Recovery (MTTR).
- Diagnose and resolve complex issues independently, minimizing the need for external escalation.
- Work closely with software engineers to embed observability, fault tolerance, and reliability principles into service design.
- Automate runbooks, health checks, and alerting to support reliable operations with minimal manual intervention.
- Support automated testing, canary deployments, and rollback strategies to ensure safe, fast, and reliable releases.
- Contribute to security best practices, compliance automation, and cost optimization.
Qualifications
- Minimum Bachelor's degree in Computer Science or equivalent practical experience.
- 5+ years of experience as a Site Reliability Engineer or Platform Engineer with strong knowledge of software development best practices.
- Strong hands-on experience with public cloud services (AWS, GCP, Azure) and supporting SaaS products.
- Strong programming or scripting skills (e.g., Python, Go, Bash), and experience with infrastructure-as-code (e.g., Terraform).
- Proficiency with Kubernetes, container-based deployment (e.g., Docker) and related ecosystems (e.g., Helm).
- Experience supporting multi-tenant microservices architectures.
- Experience with CI/CD pipelines & tools (e.g., Jenkins, GitHub Actions, GitLab CI, FluxCD, Crossplane).
- Experience with managing monitoring solutions (e.g., Datadog).
- Comfortable participating in a rotating on-call schedule, managing critical incidents, and leading post-incident reviews.
- At ease with operating and managing production systems, striking the right balance between urgency and methodology.
- Strong system-level troubleshooting skills and a proactive mindset toward incident prevention.
- Deep understanding of Linux systems, networking, and common troubleshooting practices.
- Solid understanding of the network stack (e.g., TCP/IP, VPN), cloud architectures (VPC, subnets, firewalls, load balancers), service mesh (e.g., Istio) and storage (e.g., S3, EBS).
- Knowledge of zero-downtime deployment strategies, blue/green and canary releases.
- Exposure to compliance standards such as SOC 2, ISO 27001, or HIPAA. FedRAMP experience is a big plus.
- Experience with chaos engineering or resilience testing practices.
- Excellent problem-solving skills, collaborative mindset, and a strong grasp of agile, iterative development.
- Self-driven, highly organised, and capable of independently managing priorities.
- Curiosity to learn new things and discover new technologies.
- Strong communication, presentation, and team collaboration skills.
- Excellent written and verbal skills in English.
Prior experience with any of the above-mentioned tools is a bonus, but not a must! We encourage you to apply even if you do not meet every single requirement.
We welcome candidates with different levels of background and experience. If you are excited about this role, please apply and our recruiters will assess your application.
Additional Information
We are the pioneers and trailblazers of a global IT Market Category (DEX) that is shaping the future of how the world works, giving our customers' IT Teams total digital visibility across their enterprise. Our innovative solutions integrate real-time analytics, automation, and employee feedback across all endpoints. This enables our IT teams to solve complex technical challenges, create ever more productive workplaces, and deliver happy, satisfied employees in the digital workplace.
With over 1000 employees across 5 continents, Nexthink operates as One Team, connecting, collaborating and innovating to continuously grow. We call our employees 'Nexthinkers' and our commitment to diversity, inclusion, and equity is second to none. We currently have over 75 nationalities working with us, from all cultures and backgrounds, speaking many different languages.
If you are looking for a change and like a nice atmosphere, lots of challenges, and having fun while working, this is a great opportunity for you! Check what we offer:
- Permanent Contract and a competitive compensation package.
- Health insurance through our partnership with ACKO, including OPD coverage for dental, vision, health check-ups, consultations, and pharmacy expenses.
- Hybrid work model balancing office and remote work, with a structured approach for new hires to foster connections and onboarding.
- Flexible Hours and unlimited vacation (employees have unlimited paid time off on top of the 22 days of holidays we offer). Plus, company-paid bank holidays (12), sick days (10-30), bereavement leave (5), and 3 days per year for volunteering.
- Free access to professional training platforms to explore your interests and enhance your skills.
- Stay covered against accidents, bodily injuries, and disabilities with our personal accident insurance policy, providing assurance with coverage up to three times your annual CTC.
- New mothers are entitled to up to 26 weeks of maternity leave, with the flexibility to use up to 8 weeks before the expected delivery and the remaining 18 weeks after. Birth fathers can take 6 weeks of paternity leave, while adoptive parents are eligible for 26 weeks of leave for mothers and 6 weeks for fathers.
- Under the Payment of Gratuity Act, receive gratuity at the rate of 15 days of basic pay for every completed year of service, provided you've been employed by the company for a minimum of 5 years. Gratuity is payable at retirement or resignation based on your last drawn basic pay.
- Bonuses for referring successful hires after three months of continuous employment.
Please note that not all the benefits listed above are available for temporary, contract, and internship roles. To ensure you have the most up-to-date information, we recommend checking with your Recruitment Partner.
The base salary for this role is €60,000 - €85,500 gross per year, with a total on-target earnings (OTE) range of €66,000 - €93,000 including an annual performance bonus. You'll also be part of our broader total rewards package, including benefits tailored to where you live and how you work best.
We set our pay ranges using objective criteria: the scope and level of the role, the skills it takes to do it well, and the relevant market data. Ranges are reviewed every year to remain competitive and fair. We're transparent about this because we think you deserve to know what you're working towards from day one.
In accordance with the EU Pay Transparency Directive (2023/970), we publish salary ranges on every Nexthink role. We won't ask what you currently earn or your previous salary. What matters to us is what this role is worth and whether it works for you.
Nexthinkers come from all kinds of backgrounds, and that's what makes us stronger. We welcome applications from everyone.



