Job Closed

This listing is no longer active.

SonicWall

Delivering real-time breach detection and prevention solutions backed by SonicWall Capture Threat Network.

Principal Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Lead | Team 1,001-5,000 | Since 1991 | H1B Sponsor

Location

United States

Posted

4 days ago

Salary

0

Seniority

Lead

Job Description

  • Own the reliability, scalability, and operational excellence of our Cloud-based services.
  • Define and enforce reliability standards.
  • Drive the adoption of SRE practices across engineering teams.
  • Build the systems and tooling that keep our production infrastructure healthy.
  • Define, publish, and continuously refine Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) for all critical services, partnering with product and engineering leadership.
  • Own the error budget framework: track consumption, enforce error budget policies, and drive reliability investments when budgets are at risk.
  • Lead the design and implementation of comprehensive observability platforms (metrics, structured logging, and distributed tracing) to ensure full visibility into production systems.
  • Drive toil reduction initiatives by identifying and automating repetitive, manual operational work, targeting measurable reduction in operational burden across teams.
  • Design and execute chaos engineering programs to proactively uncover reliability weaknesses in our infrastructure and services before they impact customers.
  • Lead blameless postmortem culture: facilitate incident retrospectives, extract systemic learnings, and track corrective action items to completion.
  • Build and improve on-call incident response processes, runbooks, and escalation paths; manage and optimize on-call rotation health to prevent burnout.
  • Help design, build, and support infrastructure and security technologies within the cloud that offer resiliency, observability, and optimized cost.
  • Develop solutions for automated deployment of software and services on our production infrastructure hosted on AWS, applying reliability engineering principles throughout.
  • Shape how mission-critical enterprise software solutions are developed and deployed using optimized CI/CD pipelines that embed reliability and quality gates.
  • Develop management solutions for services across multiple cloud platforms and data centers, with a focus on fault tolerance and graceful degradation.
  • Collaborate with developers to bring new features and services into production using production-readiness reviews and launch checklists.
  • Champion reliability engineering best practices across the organization, embedding SRE principles into the software development lifecycle.
  • Mentor team members on SRE philosophy, technical decision-making, code reviews, and cloud engineering best practices.
  • Participate in roadmap planning, identify areas of improvement, and perform technology evaluation and selection.

Job Requirements

  • 7+ years of experience in scalable, distributed systems architecture.
  • 3+ years of hands-on Site Reliability Engineering experience, including ownership of SLOs and error budget management.
  • 4+ years of experience with Cloud Platforms, including AWS.
  • 4+ years of experience in infrastructure as code (Terraform, AWS CDK).
  • 5+ years of experience in scripting using Python, Shell, or a similar language.
  • 3+ years of experience with containerization technologies, including Docker.
  • 4+ years of experience with orchestration technologies, including Kubernetes.
  • Demonstrated experience designing and operating observability stacks (e.g., Prometheus, Grafana, Datadog, OpenTelemetry, Jaeger, or equivalent).
  • Experience with incident management platforms and on-call tooling (e.g., PagerDuty, OpsGenie).
  • Experience defining and implementing automated service deployments, including provisions for networking, security, reliability, management, reporting, and configuration management.
  • Experience with chaos engineering principles and tools (e.g., Chaos Monkey, LitmusChaos, Gremlin, or equivalent).
  • Experience managing databases — PostgreSQL, Redis, DynamoDB, MongoDB.
  • In-depth understanding of best practices for deployment automation and production-readiness reviews.
  • Experience using Git in a team environment (merge requests, branching, pushes, and pulls).
  • CS Degree or equivalent experience.

Benefits

  • Health insurance
  • Professional development opportunities

More DevOps Engineer Jobs

Full Time | Remote | Team 10,001+ | Since 1986 | H1B No Sponsor

  • Collaborate with Technology Infrastructure teams to build and operate reusable, cloud-native platforms that abstract complexity and accelerate delivery while incorporating reliability from design through operations.
  • Work with business units and technical teams to improve application availability, observability, and reliability as our business applications are migrated to the Private Cloud.
  • Enhance platform reliability through automatic problem detection, self-healing systems, and well-architected notification and escalation protocols.
  • Use SLOs, SLIs, and KPIs to guide prioritization, measure impact, and drive continuous improvement.
  • Eliminate toil using intelligent automation and agentic workflows.
  • Conduct blameless retrospectives and share learnings across the organization.
  • Foster a culture of ownership, positive thinking, and continuous learning while remaining grounded in practicality, experimentation, and engineering excellence.
  • Integrate DevSecOps, zero-trust principles, and policy-as-code into every pipeline.
  • Produce and promote Architecture Decision Records (ADRs) and Cloud Well-Architected Frameworks that our business units can adopt to improve our technology standardization.
  • Maintain 24x5 active coverage with seamless regional handoffs and weekend escalation protocols.

All locations: Arizona | Florida | North Carolina | Tennessee

Forward Deployment Engineer

Okta

The World's Identity Company

DevOps Engineer | 4 days ago
Full Time | Remote | Team 5,001-10,000 | Since 2010 | H1B Sponsor

Role Description

As a Forward Deployed Engineer (FDE), you will operate at the intersection of product engineering and customer impact.

  • Be embedded within some of Okta’s most strategic customers, working side-by-side with their engineering teams.
  • Write production code, shape technical architecture, and influence how Okta's products are used at scale.
  • Act as a key liaison to understand customer needs and what engineering teams can build at scale.

This role is ideal for engineers who want to:

  • Lead the development of Lab/Demo assets and AI prototyping.
  • Create "The Engine" for sales velocity by aligning pricing strategies and technical execution with field reality.
  • Develop reference architectures, whitepapers, and standardized POC guides.
  • Create centralized hubs and reporting dashboards to scale AI expertise globally.
  • Provide strategic guidance to SEs, TAMs, and Professional Services.
  • Act as a feedback loop, funneling field insights back to Product.
  • Represent Okta's AI strategy at conferences and analyst briefings.
  • Partner with the CPO, CISO, and Engineering to influence roadmaps with "AI-first" innovations.
  • Serve as a trusted advisor to CISOs and CIOs.
  • Lead high-impact POCs to secure "The Win."
  • Identify technical upsell opportunities through successful usage.
  • Partner with Post-Sales teams to develop runbooks and implementation guides.

Qualifications

  • 10+ years in identity, cybersecurity, AI, or Data Science.
  • Experience driving technical customer success beyond the initial sale, including implementation, health monitoring, and expansion.
  • Deep understanding of AI/ML technologies (anomaly detection, LLMs) and Agentic Frameworks (A2A, MCP, and relevant protocols).
  • Strong credibility with enterprise CISOs and boards, with a proven track record of shaping cybersecurity strategy.
  • Ability to create repeatable playbooks and "backbone" assets that allow a sales organization to scale.
  • Prior experience as a CTO, Field CTO, or Distinguished Architect in a cybersecurity or SaaS company.
  • Advanced degree in a technical or business-related field.

Requirements

  • Strong operational mindset.
  • Proven leadership background.

Benefits

  • Health, dental, and vision insurance.
  • 401(k) plan.
  • Flexible spending account.
  • Paid leave, including PTO and parental leave.

United States
$206K - $383K / year

Site Reliability Engineer

AlpacaDB

AlpacaDB, Inc., also known as Alpaca and Alpaca Securities, is an API stock and crypto brokerage platform that enables services to embed investing and developer

DevOps Engineer | 4 days ago

  • Operate production day-to-day: on-call, incident response, postmortems, and the follow-ups that actually close the loop.
  • Own reliability practice: define and refine SLIs/SLOs and error budgets, and help product teams live within them.
  • Strengthen our observability across metrics, logs, traces, and alerting.
  • Ship infrastructure through code in a GitOps workflow, for cloud resources and Kubernetes workloads alike.
  • Look after PostgreSQL: performance tuning, schema and migration review, online migrations on large tables, HA/DR, and CDC pipelines.
  • Mentor engineers on reliability and database fundamentals through code review, design review, and pairing.

Europe
$150+ / month

Site Reliability Engineer - Storage Engineer

GoDaddy

GoDaddy is a web services platform that helps individuals and businesses worldwide start, grow, and manage their online presence. GoDaddy employs team members across North America,

DevOps Engineer | 4 days ago

Role Description

GoDaddy is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our dynamic team. This role will focus on automating and maintaining our storage infrastructure with a focus on Ceph, ensuring the reliability, scalability, and performance of our systems.

  • Automate and maintain day-to-day operations of storage systems to support application demands
  • Develop and maintain tools and automation scripts to streamline storage operations and improve efficiency
  • Monitor system performance, identify issues, and implement solutions to ensure high availability and reliability
  • Participate in agile practices such as daily stand-up meetings, task tracking boards, design and code reviews, automated testing, continuous integration, and deployment
  • Continuously improve system reliability, performance, and capacity through proactive monitoring, automation, and optimization

Qualifications

  • 2+ years of professional experience with Ceph in a production environment, including deployment, configuration, and management of Ceph clusters and systems
  • 2+ years of experience in site reliability engineering or a similar role
  • Experience working on Linux/Unix systems, with a focus on automation and operating at scale
  • Proficiency in Python or Bash
  • Experience with Ansible, Terraform, or SaltStack
  • Experience with Nagios-based monitoring tools, such as Icinga2
  • Experience with observability tooling, such as Prometheus, Grafana, Mimir, and Loki
  • Solid understanding of core networking concepts and protocols, particularly in relation to Linux/Unix systems

Requirements

  • Experience with containerization and orchestration tools (e.g., Docker, Kubernetes)
  • Exposure to and experience working with compute platforms (e.g., OpenStack, AWS)
  • Familiarity with, and ability to contribute to, CI/CD pipelines and automation workflows

Benefits

  • Competitive pay
  • Generous time off
  • Parental and wellness leave
  • Healthcare
  • Retirement savings program
  • Comprehensive benefits package, including:
      • Medical, dental, and vision insurance
      • 401(k) retirement plan
      • Paid sick time
      • Paid flexible time off
      • Paid parental leave
      • Life insurance
      • Short- and long-term disability
      • AD&D insurance
      • Mental health or EAP programs
      • Remote or hybrid work options
      • Paid holidays
      • Paid Wellness days
      • Tuition assistance
      • Adoption, surrogacy, and fertility benefits
      • Dependent daycare and backup care benefits
      • Employee stock purchase plan
      • Financial education and advice

Compensation

  • Bay Area (Santa Clara, San Francisco) and Los Angeles: $128,000 - $192,000 USD
  • Austin, D.C. Metro, CA (non-Bay Area), HI, IL, MA, NH, OR, VA, WA: $110,500 - $165,500 USD
  • New York City Metro, Kirkland/Seattle: $117,200 - $175,800 USD
  • All other US locations not previously listed: $98,500 - $147,500 USD

United States
$98.5K - $192K / year