Job Closed
This listing is no longer active.
Top rated business phone solution and personalized service to help your business thrive.
Site Reliability Engineer
Location
United States
Posted
89 days ago
Salary
$110K - $175K / year
Seniority
Mid Level
Job Description
Site Reliability Engineer
Ooma, Inc.
Here at Ooma we empower people to connect in smarter ways. We do this by creating powerful communication experiences through our cloud-based platform to bring people together at work and at home. Our solutions help small business owners stay connected with their customers and manage their businesses from anywhere. For larger companies we provide customized unified communications solutions to meet their unique needs. At home, we help our customers connect with their loved ones by providing the #1 rated VoIP phone service available. We also provide them with peace of mind through our innovative smart home security solution. At Ooma, all our products and services are priced competitively, because we believe advanced technology should be accessible to all.

About the Role:
At Ooma, we are innovators driven by operational excellence and passionate about delivering outstanding customer experiences. As part of our transformation into a truly international team of Site Reliability Engineers (SREs), we are expanding our capabilities to meet the demands of our worldwide customer base. We are looking for a talented SRE, Systems Administrator, Operations Engineer, DevOps Engineer, or IT professional who thrives on solving challenging problems and ensuring service reliability at scale. If you are motivated by working with the latest technologies, mentoring others, and building highly available systems, we’d love to talk to you.

What You’ll Do:
- Become a subject matter expert in applications supporting Ooma customers.
- Collaborate with Development, QA, and other SREs to evaluate, deploy, and debug applications.
- Improve observability by implementing, refining, and adjusting application monitoring and thresholds.
- Mentor team members to enhance application management practices.
- Act as an escalation path and backup for junior team members, providing guidance during alerts and incidents.
- Write automation scripts, set up CI pipelines, and review/evaluate software solutions and best practices.
- Participate in on-call rotations, providing 24/7 support for Ooma services.

Experience We’re Looking For:
- Strong background in production (24/7) support for large-scale environments required.
- 6+ years of Linux administration and troubleshooting experience with full-stack, application support focus; 8+ years overall working experience as an IT professional.
- Proven expertise in advanced scripting using Python, Perl, and Bash.
- Database administration experience with MySQL, MongoDB, or PostgreSQL required.
- Must have experience with configuration management tools such as Ansible, Puppet, etc.
- Hands-on experience with cloud platforms such as OCI, AWS, or GCP required.
- Proven ability to lead technical projects from inception to completion.
- Experience with DevOps tools like Docker, K8s, GitLab CI/CD, Jenkins, and Terraform preferred.
- Experience with monitoring best practices using ELK stack, Prometheus, Nagios, or Grafana preferred.
- Experience with Agile tools like Jira, Confluence, or any similar tool preferred.
- Strong collaboration skills and empathy for the end-user experience.
- Excellent troubleshooting, communication (written and verbal), and cross-functional leadership skills.
- Ability to work effectively in fast-paced, dynamic environments and manage ambiguity.
- Quick learner with a self-starter mindset and ownership of outcomes.
- Sound judgment in escalation and decision-making.
- Bachelor's degree in Engineering/Computer Science or equivalent experience.

What We Offer:
Working at Ooma means being a team player, while allowing your individual voice to come through. You'll also receive competitive compensation, benefits, and generous company perks.
- Comprehensive Medical/Dental/Vision insurance for you and eligible dependents: HMO, PPOs, or a PPO with an HDHP (including HSA, which Ooma helps fund)
- Employer Paid Income Protection Benefits (Basic Life and AD&D, Short- and Long-term disability)
- FSA Healthcare & Dependent Care
- Commuter Benefits
- Voluntary Accident, Critical Illness, Hospital Indemnity and Legal
- 401(k), including employer match, and Roth
- Employee Stock Purchase Plan (ESPP)
- Paid Time off, Sick Time, as well as corporate holidays observed
- Employee Assistance Program
- Life Balance benefits with Travel Assistance Services and Identity Theft
- Additional Benefits include a Discount Program, Credit Union, Medicare Assistance, etc.

Ooma is an equal-opportunity employer committed to recruiting, employing, retaining, promoting, and otherwise treating all employees on the basis of merit, qualifications, and competence. We do not discriminate on the basis of any trait or characteristic protected by applicable federal, state, or local laws.

The base salary range for candidates within the United States is listed below. Actual base pay will depend on a variety of factors such as education, skills, experience, specific location, etc. The base pay range is subject to change and may be modified in the future. Regular employees may also be eligible for bonus(es), sales incentive(s) (target included in OTE) and/or stock in the form of Restricted Stock Units (RSUs).

United States Pay Range
$110,000 - $175,000 USD
More DevOps Engineer Jobs
Senior Full Stack Developer, Strategic Forward Deployment
Monday.com
Formerly known as dapulse, Monday.com, also known as Monday, is a privately held computer software company offering a team management and collaboration tool des
As a Forward-Deployed Engineer in the Services organization, you'll be a key technical partner for our most strategic customers. Your focus is high impact: driving AI transformation, helping drive growth for our strategic customers, and translating real-world complexity into product insights. You will deeply embed with a single high-value customer, working with C-level and technical teams to architect solutions that redefine their work. You'll lead customer transformations leveraging AI capabilities in complex, mission-critical use cases. You'll own the full delivery cycle: understanding customer needs and business pains, architecting and designing the solution, building it, and rolling it out to the teams. This foundational role will help define the future of forward-deployed engineering at monday.com.

Please note that this is a hybrid position of 3 days/week in our NYC office.

- Experience leading hands-on projects from A to Z in a production environment, preferably client-facing
- 5+ years of previous experience building web applications from scratch using TypeScript/Node.js/React
- Previous experience in Full-stack development
- Background in Solution Architecture or Software Engineering with a strong grasp of AI/LLM workflows and systems thinking
- Experience navigating C-suite relationships and the ability to influence executive decision-making in ambiguous environments
- Entrepreneurial, fast-moving, and focused on long-term impact rather than just immediate fixes

What monday.com can offer you:
- Opportunity to join a well-funded, proven company with big ambitions, competitive salary and benefits package, bonus potential, and eligibility to take part in the company equity incentive program
- An amazing company culture that values transparency and collaboration while never forgetting to have fun while we work!
- Monthly stipends for food, wellness, and commuter/remote work
- Fully dedicated learning and development team that provides opportunities for our employees to hone and gain new skills
- Award-winning work environment: named a “Best Place to Work” by BuiltIn as well as “Great Place To Work” certified
- We foster diversity, inclusion, and belonging through our Employee Resource Groups in addition to providing access to resources and education to support our team, facilitate conversations, and encourage understanding
- A global work environment with employees in Tel Aviv, New York, San Francisco, Denver, London, Kiev, Sydney, São Paulo, and Tokyo

Visa sponsorship for this role is currently not available.

monday.com is proud to be an equal opportunity employer. We hire talented individuals, regardless of gender, race, ethnicity, ancestry, age, disability, sexual orientation, gender identity or expression, military or veteran status, cultural background, religious beliefs, or any other characteristic protected by federal, state, or local laws.

For New York City-based hires only: Compensation Range: $180,000-$230,000 base salary, subject to standard withholding and applicable taxes. In addition to base salary, the role includes the opportunity to receive and/or earn a discretionary bonus and/or equity based on Company’s plans and in accordance with Company’s policies. Compensation finally awarded to the candidate will be commensurate with the candidate’s skills and experience. Compensation ranges for candidates in locations outside of New York City may differ based on the cost of labor and such additional factors for such other locations.
DevOps Engineer
Team Up - We Build Teams🧩
We provide everything EU companies need to build in-house teams, hire freelancers, or EOR talents from the Caucasus.
• Lead complex infrastructure and platform projects from research and planning to implementation and testing
• Own the lifecycle of AWS infrastructure, including EKS and EC2 provisioning, upgrades, autoscaling, and cost optimization
• Design and maintain CI/CD pipelines and GitOps workflows using modern deployment tools
• Standardize infrastructure provisioning using Terraform and infrastructure-as-code practices
• Build and maintain platform tooling, environments, and internal automation systems
• Develop scalable self-service infrastructure solutions for product engineering teams
• Collaborate with platform engineering, security, and development teams on operational practices and incident response
• Improve observability and monitoring across systems using modern monitoring tools
• Maintain documentation and establish platform engineering standards
• Mentor engineers and support adoption of best practices across teams
Senior Site Reliability Engineer 5, CORE (Resilience Operations)
Netflix
Described as the world's top internet television network, Netflix is a publicly-traded entertainment company offering video-on-demand and streaming media. As an
Role Description
The Critical Operations and Reliability Engineering team sits at the heart of Netflix’s ability to deliver a high-quality streaming experience to hundreds of millions of members worldwide. We build, operate, and evolve the systems and practices that keep Netflix resilient in the face of failures, traffic spikes, and constant change. We are looking for an experienced Site Reliability Engineer to help us deepen the reliability, observability, and operational excellence of Netflix’s systems across Streaming, Games, Ads, and our large-scale platform. You will partner closely with engineering teams across these verticals to design resilient architectures, build automation, and continuously improve how we learn from incidents.

What You’ll Do
- Design and evolve resilient infrastructure for Netflix Streaming services, ensuring our systems are scalable, fault-tolerant, and operable at a global scale.
- Design and run opinionated resilience tests at scale that intentionally induce failures in order to validate system behavior, uncover weaknesses, and proactively prove (or disprove) resilience.
- Partner with engineering and product teams to embed reliability, observability, and security into the full software development lifecycle, from design and readiness reviews through rollout and ongoing operations.
- Define and measure Service Level Objectives (SLOs) and other reliability metrics that matter to the member experience, using them to guide capacity planning, operational priorities, and tradeoffs between reliability, feature velocity, and cost.
- Build and improve automated processes for deployment, monitoring, capacity management, and incident response to ensure our operations are fast, reliable, and repeatable.
- Participate in on-call rotations for critical Streaming services, helping ensure 24/7 availability and a great member experience.
- Lead and contribute to incident response, from triage and mitigation through follow-ups, focusing on learning, systemic fixes, and avoiding repeat issues.
- Proactively identify and reduce sources of instability in distributed systems by analyzing how our systems actually fail in production and driving architectural or operational improvements.
- Champion a culture of reliability across business domains, acting as a force multiplier: creating clear documentation, developing best-practice guides, and building tooling that enables other teams to adopt reliability improvements at scale.

Qualifications
- 5+ years of experience in an SRE, Production Engineering, or similar role operating business-critical, high-traffic services in production.
- Strong coding skills in one or more languages such as Python, Go, or Java, with a focus on automating solutions instead of relying on manual operations.
- Fluency in modern cloud infrastructure: hands-on experience with large-scale environments on AWS/Azure/GCP, along with abstracted compute and platform orchestration systems.
- Deep understanding of large-scale distributed systems, including common failure modes, performance bottlenecks, and how to design for resilience and graceful degradation.
- Track record of proactively identifying reliability risks and gaps through metrics, incidents, architecture reviews, or resilience testing, and implementing pragmatic, scalable solutions to mitigate them.
- Strong observability and performance tuning skills: you can use metrics, logs, and traces to debug issues in complex systems, and you’re comfortable profiling and optimizing services to meet latency, availability, or efficiency goals.
- Experience with incident management and response: you can navigate ambiguous, high-pressure production issues, drive coordinated response, and follow through with durable improvements.
- Strong collaboration and influence skills: you communicate clearly, build trust with partner teams, and can guide engineering teams toward better reliability practices without relying on authority.
- Ability to balance reliability, velocity, and cost: you’re comfortable making and explaining tradeoffs, and using data (SLOs, error budgets, performance metrics) to guide decision-making.
- Growth mindset and curiosity: you are eager to learn, comfortable challenging assumptions (including your own), and motivated by continuous improvement of systems, processes, and yourself.

Benefits
- Health Plans
- Mental Health support
- 401(k) Retirement Plan with employer match
- Stock Option Program
- Disability Programs
- Health Savings and Flexible Spending Accounts
- Family-forming benefits
- Life and Serious Injury Benefits
- Paid leave of absence programs
- Full-time hourly employees accrue 35 days annually for paid time off to be used for vacation, holidays, and sick paid time off.
- Full-time salaried employees are immediately entitled to flexible time off.
Site Reliability Engineer – Dynatrace
Abstra
Personalized solutions and expert guidance from your trusted nearshore partner.
• Monitor, Troubleshoot, and Optimize Performance
• Use Dynatrace to continuously monitor the health of digital systems, applications, and infrastructure
• Proactively investigate anomalies, perform deep-dive analysis, troubleshoot issues, and identify root causes of performance bottlenecks or failures
• Implement solutions to improve system uptime, stability, and response times
• Implement and analyze key metrics using Google Analytics to track and optimize digital user experiences
• Use insights to identify areas of improvement and ensure seamless digital interactions for end-users
• Analyze NPS data, detractors, and feedback from digital surveys to identify opportunities for improvement in user experience and platform performance
• Proactively identify issues in production environments and work with cross-functional teams to troubleshoot, triage, and resolve system outages or service disruptions in real time
• Lead or contribute to formal root cause analysis (RCA) activities and ensure preventive measures are implemented
• Develop scripts and automation tools to streamline monitoring, alerting, troubleshooting workflows, and recovery processes, reducing manual intervention and improving operational efficiency
• Work closely with developers, product managers, and other engineers to enhance application performance, scalability, and reliability
• Leverage data analysis to identify trends and patterns that influence site reliability and performance
• Establish and enforce reliability engineering best practices (including monitoring, incident response, and RCA documentation) across the development lifecycle to ensure high availability and fault tolerance



