Site Reliability Engineer
Location
Florida
Posted
21 days ago
Seniority
Mid Level
Job Description
Site Reliability Engineer
Zelis
Role Description
We are seeking a strategic and results-oriented Site Reliability Engineer (Golden Signals Lead) to define and drive the observability roadmap across all platforms.
- Job Title: Site Reliability Engineer
- Location: Remote, In-office, or Hybrid
- Department: IT Operations
- Reports To: Manager of Observability & Reliability
- Job Type: Full-Time Employee (FTE)
This role is responsible for establishing a consistent and scalable approach to monitoring and alerting, leveraging golden signals to enhance system reliability and operational efficiency. The successful candidate will collaborate closely with the ZEIT SRE team, engineering leads, and India-based resources to build a unified observability strategy aligned with organizational goals.
Key Responsibilities
- Observability Roadmap Development:
  - Define a unified vision for observability across all platforms, with golden signals as the foundation for monitoring and alerting.
  - Develop and maintain a comprehensive roadmap to improve observability, reduce tool redundancy, and standardize practices across platforms.
  - Establish and track key performance indicators (KPIs) to measure progress and ensure accountability for roadmap milestones.
- Collaboration and Alignment:
  - Partner with the ZEIT SRE team and engineering leads to break down silos and promote consistent observability practices.
  - Drive cross-platform collaboration to reduce operational inconsistencies and define a 'north star' approach for observability.
  - Facilitate knowledge sharing to ensure alignment on current and future observability initiatives.
- Monitoring and Alerting:
  - Standardize the implementation of golden signals across applications to improve system reliability and incident detection.
  - Optimize alerting tools and reduce redundant or ineffective monitoring interfaces ('panes of glass').
  - Lead efforts to enhance observability while minimizing operational overhead for platform teams.
  - Maintain and enhance observability dashboards, delivering actionable insights into application health and performance.
- Operational Support and Improvement:
  - Identify and address gaps in existing observability practices, prioritizing long-term scalability and reliability.
  - Collaborate with India-based resources to execute observability build-outs efficiently and with high quality.
  - Reduce issues raised by clients, providers, and print facilities through proactive monitoring and early detection.
- Reporting and Continuous Improvement:
  - Measure and report on observability success metrics, including actionable alert volume and reductions in issue escalations.
  - Continuously evaluate and refine observability strategies based on stakeholder feedback and evolving organizational needs.
Qualifications
- Educational Background:
  - Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent experience).
- Experience:
  - Minimum of 5 years of experience in Site Reliability Engineering, DevOps, or a related role with a strong focus on observability.
  - 5+ years of hands-on experience with .NET (C#), including advanced knowledge of ASP.NET Core, Web APIs, and performance optimization.
  - Demonstrated success in designing and implementing monitoring and alerting solutions across complex IT environments.
- Technical Skills:
  - Deep understanding of SRE principles and golden signals for system monitoring.
  - Proficiency with observability tools such as Prometheus, Grafana, Splunk, New Relic, or Datadog.
  - Familiarity with cloud platforms (AWS, Azure, GCP) and containerization technologies (Docker, Kubernetes).
  - Advanced proficiency in scripting languages such as PowerShell.
  - Experience in front-end development using React.js.
  - Advanced knowledge of .NET.
- Soft Skills:
  - Strong leadership and collaboration abilities, with a proven ability to align diverse teams toward common goals.
  - Excellent analytical and problem-solving skills, with a proactive approach to identifying and resolving issues.
  - Clear and effective communication skills, capable of conveying technical concepts to stakeholders at all levels.
Preferred Qualifications
- Experience building observability roadmaps and scaling solutions in enterprise environments.
- Certifications in cloud or DevOps-related disciplines (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator).
Location and Workplace Flexibility
We have offices in Atlanta GA, Boston MA, Morristown NJ, Plano TX, St. Louis MO, St. Petersburg FL, and Hyderabad, India. We foster a hybrid and remote-friendly culture, and each employee's work location is based on the needs of the position and determined by the Leadership team. In-office work and activities, if applicable, vary based on work and team objectives in accordance with Company policies.
Equal Employment Opportunity
Zelis is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state, or local laws. We welcome applicants from all backgrounds and encourage you to apply even if you don’t meet 100% of the qualifications for the role. We believe in the value of diverse perspectives and experiences and are committed to building an inclusive workplace for all.
Accessibility Support
We are dedicated to ensuring our application process is accessible to all candidates. If you are a qualified individual with a disability or a disabled veteran and require a reasonable accommodation with any part of the application and/or interview process, please email TalentAcquisition@zelis.com.
Disclaimer
The above statements are intended to describe the general nature and level of work being performed by people assigned to this classification. They are not to be construed as an exhaustive list of all responsibilities, duties, and skills required of personnel so classified. All personnel may be required to perform duties outside of their normal responsibilities, duties, and skills from time to time.