Job Closed
This listing is no longer active.
Bringing our heart to every moment of your health.
Senior Site Reliability Engineer – Metrics and Observability
Location
Louisiana | Montana | North Carolina | Mississippi | Rhode Island
Posted
65 days ago
Salary
$83.4K - $203.9K / year
Seniority
Senior
Job Description
CVS Health
- Define, implement, and maintain key performance metrics, SLOs, and SLIs to measure system reliability and performance
- Manage error budgets effectively, collaborating with development teams to balance reliability and feature delivery
- Design and implement comprehensive monitoring solutions to provide real-time visibility into system health
- Develop and implement automated quality gates that ensure all releases meet defined reliability and performance standards
- Assist in incident response efforts by providing insights from metrics and monitoring tools
- Drive initiatives to enhance monitoring, observability, and reliability practices
Job Requirements
- 5+ years of experience in Site Reliability Engineering and/or DevOps
- 3+ years of experience defining and implementing metrics, SLOs, and SLIs
- 2+ years of experience with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack)
- 2+ years of experience with cloud platforms (AWS, Azure, GCP) and container orchestration (Docker, Kubernetes)
Benefits
- Affordable medical plan options
- 401(k) plan (including matching company contributions)
- Employee stock purchase plan
- No-cost programs for all colleagues including wellness screenings, tobacco cessation, and weight management programs
- Confidential counseling and financial coaching
- Paid time off
- Flexible work schedules
- Family leave
- Dependent care resources
- Colleague assistance programs
- Tuition assistance
- Retiree medical access
More DevOps Engineer Jobs
Principal DevOps Engineer
Perforce Software
The DevOps Edge for the Outperformers: Enable teams to build, manage & maintain apps, from code to business-ready.
- Build platforms and frameworks to create consistent, verifiable, and automatic management of applications and infrastructure between non-production and production environments
- Mentor exceptional engineers and DevOps developers on cloud technology and practice
- Implement application security best practices throughout the agile SDLC
- Foster and advocate for a DevOps culture at Perforce to ensure efficient testing, delivery, and deployment of all software artifacts
- Lead the development and enhancement of our CI/CD pipeline infrastructure and tools
- Establish technical design principles and practices and drive them across all product portfolios to make operation design a must-have phase of the development lifecycle
- Your daily tasks will include developing a technical design for our cloud platform, developing a framework, and working with Dev and DevOps teams to ensure that the product meets our quality standards
Senior Autonomy Release Engineer
May Mobility
Transforming cities through autonomous technology to create a safer, greener, more accessible world.
- Release ownership and release execution end-to-end across:
  - Major autonomy releases
  - Incremental/performance releases
  - Hotfix/safety patches
- Manage branching strategy, versioning, and release cut processes
- Drive release readiness and go/no-go decisions
- Partner cross-functionally with Autonomy, Infra, Validation, and Fleet Ops
- Act as a system owner for release readiness
- Investigate and resolve complex issues arising from:
  - Software/hardware interactions
  - Distributed systems behavior
  - On-vehicle vs simulation discrepancies
- Develop deep understanding of:
  - Sensor stack, middleware, autonomy stack
  - Compute platforms, networking, configurations
- Be the go-to for:
  - “Why does this fail in the real world?”
  - “What changed between releases?”
- Enforce stage-gated release framework:
  - Feature Complete → Code Freeze → Validation → Release Candidate
- Integrate validation signals:
  - Simulation corpus results
  - Regression testing
  - Vehicle testing (HIL / on-road)
- Ensure safety-critical issues are identified, tracked, and gated
- Take initiative to find and permanently solve challenging system-level issues caused by the interplay between different software and hardware components
- Collaborate and lead system-wide improvements when working with other teams without having direct ownership or management responsibility
- Assess and develop approaches that scale and improve performance in a variety of ways (e.g. CPU performance, memory usage, disk usage, network usage)
- Implements complex solutions
- Mentors junior engineers
- Manages technical risks
- Demonstrates a clear understanding of business impact

- Monitor and analyze cloud spend across Azure, Databricks, and Snowflake (if applicable)
- Build cost tracking and reporting dashboards using native tools (Azure Cost Management, Databricks Billing APIs & Dashboards, Power BI) and SQL/Python
- Develop tagging and cost-allocation frameworks to support chargeback and forecasting models
- Partner with Engineering and Infrastructure to optimize compute usage, storage tiers, and licensing
- Recommend query tuning, partitioning, and caching strategies to improve performance and reduce cost-per-job on Databricks
- Support monthly budget reviews and identify trends, anomalies, and optimization opportunities
- Manage governance policies for resource provisioning and cost alerts
- Partner with Information Security to ensure FinOps practices and cost controls align with SOC2 and HIPAA compliance requirements
- Provide insights into FinOps maturity and recommend process improvements




