Job Closed
This listing is no longer active.
Elevating Autism & IDD Care through Technology
Senior Site Reliability Engineer
Location: United States
Posted: 79 days ago
Salary: $160K - $180K / year
Seniority: Senior
Job Description
CentralReach
This description is a summary of our understanding of the job description.
Role Description
As a Sr. SRE, you will work closely with the key stakeholders in Software Engineering to drive adoption of modern reliability practices like SLOs, error budget policies, actionable alerts, incident retrospectives, chaos testing, and end-to-end ownership.
- Take responsibility for availability, latency, performance, efficiency, monitoring/observability, emergency response, and capacity planning; set and maintain SLOs, SLIs, and error budgets; create dashboards.
- Analyze, troubleshoot, and resolve operational challenges contributing to defined SLOs.
- Manage site stability, performance, and reliability, and maintain uptime for production environments.
- Develop a fully automated multi-environment observability stack based on the existing system and extend it to predict capacity needs based on usage patterns.
- Strive for automation to reduce toil and increase development velocity.
- Perform application-specific production support, incident management, change management, problem management, RCAs, and service restoration as needed.
- Identify changes to the product architecture from a reliability, performance, and availability perspective, using a data-driven approach.
- Document resolution runbooks and standard operating procedures.
- Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.
- Collaborate with software development teams on the release management process, shape the future roadmap, and establish strong operational readiness across teams.
- Implement reliability and observability tools (such as New Relic, Prometheus, and Grafana).
- Collaborate with the Security team and other platform engineering teams to build reliable, maintainable, and scalable solutions that improve our security posture.
Job Requirements
- Strong background as an SRE supporting a 24x7 highly available production environment for a SaaS or cloud service provider.
- Solid experience with Monitoring/APM/Observability tools (Splunk, New Relic, etc.).
- Experience implementing observability plans around logs, metrics, and traces.
- Experience in an agile development team developing software.
- Experience with cloud infrastructure environments, preferably AWS, and Infrastructure as code (Terraform, CloudFormation).
- Extensive experience with Docker, Kubernetes, Helm, CI/CD, and config management tools such as Ansible and Chef.
- Strong experience with containerization technology and/or Kubernetes.
- Experience with Release automation, system administration, configuration management.
- Experience with programming languages (Java, Python, Go, etc.).
- Strong understanding of Linux, Windows, software development, systems, networking, and cloud concepts.
- Strong interpersonal and teaming skills - ability to set and enforce process and influence engineers who are not direct reports.
- Strong analytical and programming skills (Python, Go, Java, etc.).
- Deep understanding of best practices for modern cloud security.
- Proven experience building observability for security concerns, such as privilege escalations and bot detection.
- Location: Hybrid from Holmdel, New Jersey, or Fort Lauderdale, Florida; remote from other U.S. states is possible for the right individual.
- In-person interview or face-to-face meeting required for fully remote roles prior to the first day of employment.
Benefits
- Competitive compensation.
- Comprehensive health benefits.
- Generous PTO.
- 401(k) matching.
- Paid parental leave for full-time employees.
- Hybrid work schedules.
- Career development support.
- Wellness programs.
- Opportunities to give back through CR Cares™, our community engagement initiative.
More DevOps Engineer Jobs
Role Description
Do you thrive on solving tough problems, even under pressure? Are you motivated by fast-paced environments with continuous learning opportunities? Do you enjoy collaborating with a team of peers who push you to constantly up your game? At Pythian, we are building a next-generation Site Reliability Engineering team. We need motivated and talented individuals on our teams, and we want you! You’ll act as a technology leader, advisor for our clients, and mentor for other team members. Projects would include:
- Infrastructure architecture
- Automation
- Intelligent monitoring systems from design through implementation
If you Love Your Data and want to Love Your Career, this could be the job for you! If this is you, and you wonder what it would be like to work at Pythian, reach out to us and find out! Intrigued to see what life is like at Pythian? Check out #pythianlife on LinkedIn and follow @loveyourdata on Instagram! Not the right job for you? Check out what other great jobs Pythian has open around the world!
Qualifications
- Experience working with Google and AWS Clouds (including infrastructure as code deployment with CloudFormation, Terraform, OpsWorks, etc.)
- Scripting and automation of administrative tasks using Python and Scala is a must
- Solid understanding of microservices architecture and container technologies (Kubernetes is a must; Docker, LXC, etc.)
- Clear understanding of software development lifecycles and best practices from an infrastructure point of view (PRs, merge, rebase, etc.)
- Understanding the end-to-end operations of a ‘Business System’ vs components
- Comprehensive systems hardware and network troubleshooting experience
- Common Linux distribution platform installation, configuration, performance tuning, and cloud migration
- TCP/IP networking, NIC bonding, and network services configuration (DNS, NTP, DHCP, SMTP, etc.)
- Operation and administration of virtual infrastructure, including experience with at least one hypervisor (VMware, Hyper-V, KVM, etc.)
- Ability to describe IaaS, PaaS, and SaaS, the pros and cons of each, and use cases for virtualization and cloud
- Administration of web servers and supporting technologies, including network load balancers
- Experience with the design, development, and deployment of Puppet
- System and application error investigation, troubleshooting of access/availability issues including deep multi-system root cause analysis
- Experience managing networking devices, such as switches and firewalls from a variety of vendors
- Solid understanding of DevOps tools, processes, and culture
- Ability to pick up new technologies quickly
- Ability to provide accurate work scheduling and task estimations for work delivery
Benefits
- Competitive total rewards package
- Work remotely from your home; there’s no daily travel requirement to an office
- Collaborate with some of the best and brightest in the industry
- Hone your skills or learn new ones with our substantial training allowance
- We provide all the equipment you need to work from home, including a laptop with your choice of OS
- Annual wellness budget to prioritize your health and well-being
- Generous amount of paid vacation and sick days, as well as a day off to volunteer for your favorite charity
Hiring Disclaimer
The successful applicant will need to fulfill the requirements necessary to obtain a background check. Accommodations are available upon request for candidates taking part in any aspect of the selection process.
AI Disclaimer
Pythian may utilize Enterprise Generative Artificial Intelligence (AI) tools or features throughout its hiring process. These tools help us manage high volumes of applications efficiently and may be employed to review applications, analyze resumes, and assist with other recruitment steps. While Pythian uses AI in its hiring process, it does not substitute for human judgment. Our Talent Acquisition Team reviews all AI-generated recommendations, and the system is subject to regular bias audits to ensure fairness and compliance with all applicable employment and human rights laws. All final hiring decisions are made by, and remain the responsibility of, human decision-makers. By applying for this position, you consent to Pythian’s use of these AI tools in the evaluation of your application. You have the right to request a human review of any solely AI-driven decision or to request an accommodation. Should you require further details regarding the processing of your data, please reach out to us.
DevOps Engineer – Google Cloud Platform, Terraform
Smart Working
Empowering companies to work with the best engineers in the world
- Design, implement, and maintain cloud-native infrastructure on Google Cloud Platform (GCP) using Terraform across multiple environments (production, staging, sandbox, and customer deployments).
- Architect and operate serverless container workloads using Cloud Run, ensuring efficient scaling, resource management, and cost optimisation.
- Design and manage event-driven systems using Pub/Sub, including message retention, acknowledgement deadlines, dead-letter queues (DLQ), and monitoring.
- Build and maintain CI/CD pipelines using GitHub Actions and Cloud Build, including automated Terraform deployments and GitOps-based workflows.
- Develop reusable Terraform modules and manage infrastructure across multiple GCP projects using best practices for remote state and environment separation.
- Manage containerized workloads and cloud networking using services such as GKE, VPC, Load Balancers, Cloud Armor, IAM, and Secret Manager.
- Collaborate with software engineers on architecture design decisions, including scaling strategies, service separation (HTTP vs WebSockets), and performance optimisation.
- Implement monitoring, alerting, and observability using Google Cloud Monitoring, Cloud Logging, Sentry, and OpenTelemetry.
- Administer and optimise data infrastructure, including MongoDB Atlas, Redis, BigQuery, and Cloud Storage.
- Perform incident response and root cause analysis, implementing long-term improvements to increase reliability and resilience.
- Own infrastructure end-to-end, including architecture decisions, performance optimisation, cost management, and operational excellence.
- Create and maintain documentation, operational runbooks, and best practices.
- Mentor engineers and promote DevOps and cloud architecture best practices across the organisation.
DevSecOps Engineer
Weekday (YC W21)
We are a Y-Combinator-backed startup building your AI-powered Recruiter Agent
- Responsible for integrating security practices into the DevOps lifecycle.
- Build and maintain scalable, secure, and reliable cloud infrastructure.
- Collaborate closely with software engineers, security teams, and infrastructure specialists.
- Design and manage cloud environments, automating infrastructure provisioning.
- Strengthen CI/CD pipelines and embed security controls throughout the software development lifecycle.
DevOps Lead
Resolve Tech Solutions
ERP/SAP Modernization | Managed Cloud Delivery Services | Advanced Tech - AI / ML | Cyber Security | Digital Signature
- Lead the design and implementation of scalable, resilient cloud infrastructure across AWS, Azure, or GCP environments
- Architect, build, and optimize CI/CD pipelines using tools such as Jenkins, GitLab CI, GitHub Actions, or Azure DevOps
- Champion infrastructure-as-code practices using Terraform, Ansible, or similar automation tools
- Design and manage containerized environments using Docker and orchestrate workloads with Kubernetes or managed Kubernetes services
- Establish and enhance monitoring, logging, and observability platforms using tools such as Prometheus, Grafana, Datadog, or cloud-native monitoring solutions
- Lead DevOps team members by providing technical guidance, mentorship, and performance support
- Collaborate cross-functionally with engineering, security, and product teams to streamline release cycles and improve deployment reliability
- Implement and enforce cloud security best practices, governance standards, and compliance requirements
- Drive cloud cost optimization strategies and infrastructure efficiency initiatives
- Promote a culture of automation, reliability, and continuous improvement across platform and engineering teams
- Troubleshoot complex infrastructure and deployment issues, ensuring minimal disruption to business operations
- Contribute to documentation, standards development, and long-term platform architecture strategy




