Backblaze External Website

Sr. Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 201-500

Location

United States

Posted

84 days ago

Salary

$150K - $200K / year

Seniority

Senior

Job Description

About Backblaze

Backblaze is the object storage leader in the open cloud movement, fueling customer success with cloud storage built purposefully to unlock budgets, unburden administrators, and unleash innovators. Together with our partners, we’re helping customers break free from the restrictive, overpriced legacy solutions that hold them back, and blaze forward with the full power of the open cloud in their hands.

Founded in 2007, we scaled the business with less than $3 million in outside funding until 2021, when we did a traditional IPO on the Nasdaq stock exchange. Today, Backblaze generates over $100M in revenue and is the leading specialized storage cloud, managing over three billion gigabytes of data storage for 500K+ customers in 175+ countries, including businesses, developers, IT professionals, and individuals. But while there is a lot to celebrate in our past, there is almost as much opportunity ahead of us. We’re seeking a Sr. Site Reliability Engineer to join our team!

About the Role

We are seeking a Senior Site Reliability Engineer (SRE) to help ensure the stability, scalability, and reliability of our services and infrastructure. This role focuses on building automation, maintaining observability, and supporting incident response to keep customer-facing systems performing at their best. The SRE will collaborate with engineering, product, and operations teams to embed reliability practices into day-to-day development and operations while contributing to tools and processes that improve efficiency and reduce manual effort.

What You'll Do

Service Reliability & Operations

  • Own and drive the availability, durability, and performance of critical services across all production environments.
  • Lead and champion complex projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
  • Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
  • Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
  • Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, problem, and capacity management).

Automation & Tooling

  • Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
  • Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
  • Build, maintain, and secure advanced CI/CD pipelines, configuration management, and complex infrastructure as code solutions (Terraform, Ansible, Jenkins).
  • Write production-grade code (Bash, Python, Go, etc.) to develop new reliability tools and enhance existing systems.

Collaboration

  • Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
  • Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational handoff for all new services and features.
  • Lead capacity planning and disaster recovery strategy across critical infrastructure components.
  • Manage the relationship with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLA performance.
  • Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.

Continuous Improvement

  • Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
  • Proactively identify systemic, recurring issues and architect and drive the implementation of long-term improvements and strategic design action plans.
  • Be a leading voice in promoting and embedding reliability-focused practices within development and operations teams.

Qualifications

Education & Experience

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
  • 8+ years of progressive experience in site reliability, systems engineering, or operations.
  • Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.

Technical Skills

  • Expert-level Linux systems administration and advanced troubleshooting skills.
  • Lead security-minded operations, focusing on system-wide patching, hardening, and proactive vulnerability identification.
  • Deep mastery of service reliability concepts, including advanced monitoring, complex alerting strategy, leading incident response, and in-depth root cause analysis.
  • Advanced proficiency in at least one modern scripting/programming language (Python or Go strongly preferred).
  • Expert knowledge of incident response methodologies and operational best practices.
  • Proven experience designing and operating container orchestration (Kubernetes, Docker) and microservices concepts required.
  • Expert experience with HashiCorp products (Nomad, Vault, Terraform) in a production environment.

Preferred Attributes

  • Significant experience in a SaaS, service provider, or hyper-scale distributed systems environment.
  • Deep familiarity with ITIL/OSS practices and experience defining/enforcing SLOs/SLAs.
  • Exceptional problem-solving skills and a strong drive to learn and apply new, complex technologies.
  • Advanced experience with cloud platforms (AWS, GCP, or Azure) in a production setting.

Backblaze Perks

  • Healthcare for family, including dental and vision
  • Competitive compensation and 401K
  • RSU grants for full-time employees
  • ESPP program
  • Flexible vacation policy
  • Maternity & paternity leave
  • MacBook Pro to use for work, plus a generous stipend to personalize your workstation
  • Childcare bonus (human children only)
  • Fertility treatment and support
  • Learning & development program
  • Commuter benefits
  • Culture that supports a healthy work-life balance

To provide greater transparency to candidates, we share base pay ranges for all US-based job postings regardless of state. We set standard base pay ranges for all roles based on function, level, and country location, benchmarked against similar-stage growth companies. Final offer amounts are determined by multiple factors, including candidate location, skills, depth of work experience, and relevant licenses/credentials, and may vary from the amounts listed below.

The expected salary range for this role is $150,000 - $200,000.

At Backblaze, we value being fair and good to our customers, partners, and employees. That’s why diversity, equity, and inclusion are at the core of our values. We are committed to fostering a workforce where all employees feel a sense of belonging regardless of race, ethnicity, nationality, gender, sexual orientation, age, religion, socio-economic status, ability, veteran status, and education. We believe that our dedication to cultivating a diverse workspace not only allows us to better serve our customers in over 175 countries but further reinforces our commitment to doing the right thing. We are proud to be an Equal Opportunity Employer.

To understand more about the data we collect and process as part of your application, please view our Backblaze Employee Privacy Notice.

Job Requirements

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience).
  • 8+ years of progressive experience in site reliability, systems engineering, or operations.
  • Extensive experience designing, scaling, and operating large-scale, production-grade distributed systems.
  • Expert-level Linux systems administration and advanced troubleshooting skills.
  • Lead security-minded operations, focusing on system-wide patching, hardening, and proactive vulnerability identification.
  • Deep mastery of service reliability concepts, including advanced monitoring, complex alerting strategy, leading incident response, and in-depth root cause analysis.
  • Advanced proficiency in at least one modern scripting/programming language (Python or Go strongly preferred).
  • Expert knowledge of incident response methodologies and operational best practices.
  • Proven experience designing and operating container orchestration (Kubernetes, Docker) and microservices concepts required.
  • Expert experience with HashiCorp products (Nomad, Vault, Terraform) in a production environment.

Preferred Attributes

  • Significant experience in a SaaS, service provider, or hyper-scale distributed systems environment.
  • Deep familiarity with ITIL/OSS practices and experience defining/enforcing SLOs/SLAs.
  • Exceptional problem-solving skills and a strong drive to learn and apply new, complex technologies.
  • Advanced experience with cloud platforms (AWS, GCP, or Azure) in a production setting.

Benefits

  • Healthcare for family, including dental and vision.
  • Competitive compensation and 401K.
  • RSU grants for full-time employees.
  • ESPP program.
  • Flexible vacation policy.
  • Maternity & paternity leave.
  • MacBook Pro to use for work, plus a generous stipend to personalize your workstation.
  • Childcare bonus (human children only).
  • Fertility treatment and support.
  • Learning & development program.
  • Commuter benefits.
  • Culture that supports a healthy work-life balance.

Salary Range

  • The expected salary range for this role is $150,000 - $200,000.

More DevOps Engineer Jobs

Senior Site Reliability Engineer

Centene Corporation

Transforming the health of the communities we serve, one person at a time.

DevOps Engineer | 84 days ago
Full Time | Remote | Team 10,001+ | Since 1984 | H1B No Sponsor

  • Helps lead projects that are focused on managing and maintaining optimum platform infrastructure performance, reliability, and security using SRE practices, observability tools, manual and automated procedures, documentation, people and processes and continuous delivery (CI/CD) tools, processes, and designs.
  • Develops complex services to automate monitoring activities and provide critical information to facilitate response and resolution of performance and availability issues and incidents.
  • Troubleshoots and analyzes service disruptions to determine the root cause of issues and develop solutions for improved reliability.
  • Support multiple applications and schedule batch jobs for a large number of transactions weekly.
  • Leads more complex projects focused on building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility.

All locations: Nebraska | Ohio | Tennessee | Wisconsin
$87K - $161.3K / year
Job Closed

Senior DevOps Engineer

OpenVPN Inc.

OpenVPN® helps businesses of all sizes create secure, virtualized, reliable networks that scale with your team.

DevOps Engineer | 84 days ago
Full Time | Remote | Team 51-200 | Since 2002 | H1B No Sponsor

  • Design, implement, and maintain highly scalable, fault-tolerant systems that leverage cluster orchestration and containerization technologies
  • Work alongside Software Engineering and QA teams to refine and implement deployment processes that support microservices-based architectures
  • Build and oversee CI/CD pipelines that accommodate container-based application deployment and rollback capabilities
  • Ensure systems are consistently available, performing automated health checks and coordinating zero-downtime deployments
  • Participate in an on-call rotation to rapidly diagnose and resolve critical system outages
  • Collaborate with information security teams to guarantee that industry best practices and compliance requirements are met

Albania
Job Closed

DevOps Engineer — AWS, Observability & Security

Collaboration.Ai

Connecting the right people around the right ideas.

DevOps Engineer | 84 days ago
Full Time | Remote | Team 11-50 | Since 2016 | H1B No Sponsor

Who we are

Collaboration.Ai is a mission-focused, AI-powered software and services company based in Minnesota, with employees, partners, and customers around the world. We unite people, technology, and purpose to accelerate breakthroughs that transform industries, empower communities, and create a more sustainable future. We collaborate with customer teams across a broad spectrum of public and private sector organizations, helping them navigate complex challenges and drive transformative change. To learn more about us, visit collaboration.ai

Our product lineup

  • NetworkOS is an AI-powered platform that aligns people, purpose, ideas, and expertise in real-time, generating actionable insights to propel movements forward.
  • CrowdVector is an integrated solution marketplace and innovation management platform that rapidly uncovers new ideas and advances breakthroughs to fuel movements.

About the Role

You’ll own CAI’s cloud infrastructure end-to-end: the AWS footprint, CI/CD pipelines, container orchestration, and observability platforms that let our engineering teams ship with confidence. This isn’t a “keep the lights on” role. You’ll architect infrastructure that meets federal compliance requirements by design, build out our DataDog monitoring from the ground up, and lay the foundation for AI/ML workloads. You’ll work closely with our Product Dev teams, Security and Compliance, and Platform Architect to keep our infrastructure secure, efficient, and developer-friendly, supporting customers who turn ideas, networks, and data into real outcomes.

High autonomy. Real ownership. Infrastructure that matters.

What You’ll Do

  • Architect AWS infrastructure: Design and manage our cloud footprint for scalability, high availability, and security controls aligned with NIST 800-53 and SOC 2
  • Build reliable CI/CD pipelines: GitHub Actions with automated quality gates (testing, coverage, security scanning) that let developers ship to Kubernetes with high confidence
  • Codify everything in Terraform: Establish IaC standards, enforce peer review, detect drift, and maintain consistent patterns across dev, nonprod, and prod
  • Run Kubernetes at scale: Operate and optimize EKS clusters with zero-downtime upgrades, right-sized resources, and security policies
  • Build out DataDog observability: Create dashboards, alerts, and integrations that give teams actionable insight into infrastructure health
  • Embed security in pipelines: Snyk scanning, container image validation, AWS security baselines, and compliance-ready documentation
  • Support AI workloads: Build infrastructure for LLM Ops, including compute scaling, model deployment pipelines, and cost optimization

Our Tech Stack

  • Cloud: AWS (EKS, RDS, S3, KMS, Secrets Manager, VPC, ALB, CloudTrail, Security Hub)
  • IaC: Terraform, GitOps workflows
  • Containers: Kubernetes (EKS), Docker, Helm, ArgoCD
  • CI/CD: GitHub Actions, Codecov, Amazon ECR
  • Security: Snyk (SAST, container scanning, dependency scanning), AWS security controls, TLS 1.3
  • Observability: DataDog (infrastructure monitoring, dashboards, alerts), OpenTelemetry
  • Compliance: NIST 800-53, NIST 800-171, SOC 2, CMMC Level 2, FedRAMP High readiness
  • Future: AWS GovCloud (IL2/IL4)

What We’re Looking For

Must-Haves

  • 5+ years of hands-on DevOps or Platform Engineering experience with AWS
  • Production Terraform (or OpenTofu) expertise for infrastructure as code
  • Strong Docker and Kubernetes (EKS) experience in production
  • Experience implementing security controls aligned with NIST 800-171 or NIST 800-53
  • Hands-on CI/CD pipeline design (GitHub Actions preferred)
  • Experience with observability platforms (DataDog strongly preferred)
  • Understanding of GitOps practices and deployment automation
  • US citizenship (required for DoD contracting and FedRAMP compliance)

Nice-to-Haves

  • Candidates located in the Minneapolis/Saint Paul, MN area for coworking opportunities. However, remote candidates are encouraged to apply.
  • LLM Ops or AI/ML infrastructure experience
  • Advanced AWS certifications (DevOps Engineer Professional, Security Specialty)
  • High-growth startup or B2B SaaS background

Why Join Collaboration AI?

Modern stack, real problems. Kubernetes, Terraform, DataDog, GitHub Actions: no legacy infrastructure, no manual deployments. You’ll build on tools you actually want to use.

AI-native culture. We build with AI, not just for AI. Claude Code, agentic workflows, and AI-assisted infrastructure automation are how we work daily. You’ll be expected to push the boundaries of what’s possible with AI tooling in your domain and you’ll have the freedom to do it.

Work that matters. Our customers span defense, pharma, aerospace, global enterprises, and universities; complex organizations where the stakes are real. The infrastructure you build will meet the most demanding compliance requirements in the industry: FedRAMP High, CMMC Level 2, SOC 2.

Own it. Small, senior team. High autonomy. Your infrastructure decisions are visible, valued, and directly tied to company outcomes.

United States
Job Closed
Full Time | Remote | Team 1,001-5,000

ABOUT VEG

In 2014, VEG was born with a mission to help people and their pets when they need it most by challenging norms and fixing the ER experience. Since then, we’ve expanded rapidly, with hospitals nationwide open 24/7/365, and created an ER experience that focuses on what our pets and pet parents really need. We’ve done the same for our people (VEGgies), finding a way to say YES so they are empowered to achieve great things, grow in unexpected ways, and find a place where they truly belong.

We’re rethinking emergency care from every angle, from how we run our hospitals to how we support the people working inside them. That’s where our headquarters team comes in. Whether building technology to make our hospitals more efficient, recruiting and growing incredible VEGgies, or bringing our brand to life through marketing, our VQ (VEG Headquarters) team makes it all possible, ensuring our hospitals and people have everything they need to help pets and their families. VEG is a 2025 and 2026 certified Great Place to Work®.

THE JOB

We are looking for a Senior Site Reliability Engineer who understands that at VEG, "reliability" is a medical necessity: if our proprietary platform, DogByte, goes down, a pet's life could be at risk. You will be the primary lead for our platform's resilience, transforming our infrastructure into a self-healing system that empowers our medical teams to provide 24/7/365 life-saving care. You will spend your time bridging the gap between high-level architectural strategy and hands-on technical "surgery," ensuring our engineering teams can build at pace while the foundation remains rock-solid. You will evolve and strengthen an existing system that must meet the demands of VEG’s hospital expansion, ensuring our infrastructure never limits our ability to open new hospitals or provide medical care.
You will own the ongoing stability of DogByte, scaling it from its current state into a robust enterprise platform where one hospital's traffic is isolated and does not impact another's experience. This role can be based at our VQ in White Plains or may be open to remote work.

WHAT YOU’LL DO

  • Formulate short- and long-term strategies to ensure DogByte withstands year-over-year volume increases, specifically solving for hospital-to-hospital traffic isolation
  • Work with engineers to ensure data flows (from client to API to database) are configured for high concurrency and maximum reliability
  • Build automated processes to handle high-traffic spikes and automatically remediate common system errors
  • Set up monitoring and alerting to identify latency throughout the stack and resolve issues before they impact hospital operations
  • Establish and meet SLOs for high availability, ensuring our engineers can build products without worrying if the system can support them

WHAT YOU NEED

  • Bachelor’s Degree preferred or equivalent experience
  • 5+ years in SRE/DevOps roles, expertly handling high-concurrency environments
  • Deep understanding of the AWS ecosystem managed entirely through Infrastructure as Code
  • Expertise in traffic management, including load balancing techniques, Nginx configuration, and autoscaling to handle volatile patterns
  • Technical leadership in observability, establishing the tracing frameworks and monitoring required to diagnose latency issues and ensure high availability across the entire request lifecycle
  • Direct experience with technologies relevant to our technical stack, which currently includes: AWS ECS, Terraform, Nginx, PostgreSQL (RDS), Python

WHO YOU ARE

  • Empathetic, instinctively taking a people-centric approach, whether supporting your colleagues or making an effort to understand different perspectives
  • Have a sense of humility; acknowledging mistakes, sharing credit with others, and lifting up your team’s accomplishments
  • Feel a strong sense of ownership over your work, taking responsibility for outcomes and staying committed to achieving long-term, impactful results
  • Curious by nature; you ask insightful questions and continuously seek out opportunities to learn and grow your skills and knowledge

HOW WE INVEST IN YOU

  • Competitive compensation ($170,000 - $200,000) plus bonus and benefits
  • Comprehensive health and wellness benefits that start on day one, and access to free therapy or counseling
  • Paid parental leave, up to 10 weeks at 100% of regular salary, and inclusive fertility and family-building care for all types of families
  • Unlimited PTO to use for vacation or sick days, however you need it!
  • Generous employee referral program, so our awesome people can bring in more awesome people
  • And the little (big) things, like casual office attire, ability to bring your fur baby to work, cool VEG swag, food in the fridge for when you’re hungry, and free lunches twice a week!
  • Company laptop and a monthly cell phone reimbursement

DEI

At VEG, diversity is not just a word; it's a strength that fuels innovation and kindness. Our mission is “Helping people and their pets when they need it most.” And we do that better when our VEGgies (employees) feel valued, respected, and empowered to bring their authentic selves to work. That's why we're devoted to creating an environment that reflects the diverse communities we serve, where different perspectives are not only welcomed but celebrated. We are focused on providing equitable opportunities for growth, promoting inclusive decision-making, and ensuring that everyone's perspective is considered. Saying yes to VEG means helping us build a culture where your unique experiences and background contribute to a shared vision: being the world’s veterinary emergency company.

United States
$170K - $200K / year