GoDaddy

GoDaddy is a web services platform that helps individuals and businesses worldwide start, grow, and manage their online presence. GoDaddy employs team members across North America,

Site Reliability Engineer - Storage Engineer

Location

United States

Posted

6 days ago

Salary

$98.5K - $192K / year

Seniority

Mid Level

Job Description

Role Description

GoDaddy is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our dynamic team. This role focuses on automating and maintaining our storage infrastructure, primarily Ceph, to ensure the reliability, scalability, and performance of our systems.

- Automate and maintain day-to-day operations of storage systems to support application demands
- Develop and maintain tools and automation scripts to streamline storage operations and improve efficiency
- Monitor system performance, identify issues, and implement solutions to ensure high availability and reliability
- Participate in agile practices such as daily stand-up meetings, task tracking boards, design and code reviews, automated testing, continuous integration, and deployment
- Continuously improve system reliability, performance, and capacity through proactive monitoring, automation, and optimization

Qualifications

- 2+ years of professional experience with Ceph, working in a production environment
- 2+ years of experience in site reliability engineering or a similar role
- 2+ years of professional experience with Ceph, including deployment, configuration, and management of Ceph clusters and systems
- Experience working on Linux/Unix systems, with a focus on automation and operating at scale
- Proficiency in Python or Bash
- Experience with Ansible, Terraform, or SaltStack
- Experience with Nagios-based monitoring tools, such as Icinga2
- Experience with observability tooling, such as Prometheus, Grafana, Mimir, and Loki
- Solid understanding of core networking concepts and protocols, particularly in relation to Linux/Unix systems

Requirements

- Experience with containerization and orchestration tools (e.g., Docker, Kubernetes)
- Exposure to and experience working with compute platforms (e.g., OpenStack, AWS)
- Familiarity with, and ability to contribute to, CI/CD pipelines and automation workflows

Benefits

- Competitive pay
- Generous time off
- Parental and wellness leave
- Healthcare
- Retirement savings program
- Comprehensive benefits package, including:
  - Medical, dental, and vision insurance
  - 401(k) retirement plan
  - Paid sick time
  - Paid flexible time off
  - Paid parental leave
  - Life insurance
  - Short- and long-term disability
  - AD&D insurance
  - Mental health or EAP programs
  - Remote or hybrid work options
  - Paid holidays
  - Paid wellness days
  - Tuition assistance
  - Adoption, surrogacy, and fertility benefits
  - Dependent daycare and backup care benefits
  - Employee stock purchase plan
  - Financial education and advice

Compensation

- Bay Area (Santa Clara, San Francisco) and Los Angeles: $128,000 - $192,000 USD
- Austin, D.C. Metro, CA (non-Bay Area), HI, IL, MA, NH, OR, VA, WA: $110,500 - $165,500 USD
- New York City Metro, Kirkland/Seattle: $117,200 - $175,800 USD
- All other US locations not previously listed: $98,500 - $147,500 USD

More DevOps Engineer Jobs

Junior DevOps Engineer

OpenVPN Inc.

OpenVPN® helps businesses of all sizes create secure, virtualized, reliable networks that scale with your team.

DevOps Engineer | 6 days ago
Full Time | Remote | Team 51-200 | Since 2002 | H1B No Sponsor

• Assist in designing, implementing, and maintaining scalable, fault-tolerant systems that leverage cluster orchestration and containerization technologies, with guidance from senior engineers.
• Work alongside the Software Engineering and QA teams to support deployment processes for microservices-based architectures.
• Help build and maintain CI/CD pipelines that support container-based application deployment and rollback.
• Support system availability by running health checks and contributing to zero-downtime deployments.
• Participate in a supported on-call rotation, escalating critical system issues appropriately as you build incident-response experience.
• Collaborate with information security teams to follow industry best practices and compliance requirements.
• Use AI-assisted tooling (e.g., LLM-powered coding assistants, chat-ops agents, scripted automations) to accelerate routine tasks such as log triage, runbook execution, ticket drafting, and code review.

Bosnia And Herzegovina

Senior Systems Site Reliability Engineer, B2B

Jamf

The Standard in Apple Enterprise Management

DevOps Engineer | 6 days ago
Full Time | Remote | Team 1,001-5,000 | Since 2002 | H1B Sponsor

• Site Reliability Engineers are responsible for helping balance development velocity against customer-centric stability of systems through the use of SRE best practices and the creation of new processes with automation.
• The Senior Site Reliability Engineer is responsible for creating and leading projects around how services and other workflows should be measured, and for participating in the observability of production systems and services through day-to-day operational responsibilities, with the aim of gaining the production insight needed to decide what toil should be automated next.
• The Senior Site Reliability Engineer is expected to operate with a DevOps mindset at the convergence point of Cloud Operations, Engineering, and Technical Support within the framework of the Agile process.
• Identify improvements in both the platform and processes by implementing established SRE concepts with the goal of improving product and system reliability.
• Proactively engage and collaborate with other individuals and teams as issues arise by serving as an escalation point for customer issues to ensure successful outcomes.
• Perform root cause analysis for customer-impacting issues, clearly document the solution, and advise others based on the findings.
• Create technical documentation based upon new technology proofs of concept, project work, root cause analysis, and identification of alerting patterns, and proactively share this knowledge with other teams as part of the Continuous Improvement Model.
• Participate in team ceremonies to identify and refine potential work, communicate findings, and drive opportunities to collaborate.
• Assign and communicate the business value and benefit hypothesis of new projects, initiatives, and strategies while being able to break down the technical work required to achieve a successful outcome.
• Lead cross-team and cross-department technical collaboration in critical customer escalations.
• Advise stakeholders and senior leadership on critical customer escalations.
• Occasionally provide off-hours support for deployments and customer escalations.

Poland

Senior DevOps Engineer

Smartsheet

Modern work management platform

DevOps Engineer | 6 days ago
Full Time | Remote | Team 1,001-5,000 | Since 2005 | H1B Sponsor

• Own and evolve the edge proxy platform: Maintain, upgrade, and extend a high-performance reverse proxy, including maintaining the proxy binary and its configuration tooling, writing Go and Python automation, managing the full container image lifecycle on hardened Linux base images, and working across the broader edge layer, including CDN, WAF, and traffic management capabilities.
• Build and maintain cloud infrastructure as code: Design and implement Terraform/Terragrunt modules and live environment configurations managing EKS clusters, load balancers, IAM roles, VPC networking, ECR registries, and supporting AWS services across multiple regions including GovCloud.
• Operate Kubernetes clusters at scale: Manage multi-region, multi-cluster EKS deployments via FluxCD GitOps workflows and Helm charts, including node AMI rotation, add-on lifecycle management, and horizontal pod autoscaling.
• Build and own CI/CD pipelines: Design, maintain, and improve shared GitLab CI/CD pipeline templates used across all team repositories; build and operate alternative pipeline workflows for isolated government cloud environments.
• Automate operational toil: Build and maintain tooling for tasks such as container image patching, EKS AMI rotation, air-gapped ECR image sync to GovCloud, and automated MR creation for monthly version-bump patching cycles.
• Manage observability and on-call: Provision and maintain Datadog SLOs, monitors, and dashboards via Terraform; participate in the team's on-call rotation responding to edge proxy incidents across production and GovCloud environments.
• Support FedRAMP/GovCloud operations: Operate the GovCloud environment with its unique constraints: air-gapped image distribution, infrastructure automation in isolated networks, and alert management with compliance-aware data handling.
• Evaluate and adopt internal developer tooling: Research, prototype, and drive the adoption of internal tools that improve engineering productivity across the company, including developer portals, platform self-service capabilities, and other tooling that raises the bar for the developer experience at Smartsheet.
• Mentor and collaborate: Share knowledge across the team through code reviews, architecture discussions, and runbook authorship; foster a culture of engineering excellence and operational rigour.
• Strategically apply AI tools: Apply and champion AI tools within your team's domain to improve project execution, infrastructure design, quality, and debugging, leading adoption of AI best practices.

Bulgaria

DevOps / SRE Engineer - AI Platform

Makro PRO

Makro PRO is an exciting new digital venture by the iconic Makro. Our proud purpose is to build a technology platform that will help make business possible for restaurant owners, hotels, and independent retailers, and open the door for sellers. We welcome bold, energetic, and thoughtful people who share our belief in collaboration, diversity, excellence, and putting customers at the heart of our work.

- Clear focus
- Diverse workplace (our members are from around the world!)
- Non-hierarchical and agile environment
- Growth opportunity and career path

DevOps Engineer | 6 days ago

Role Description

The DevOps / SRE Engineer owns the operational substrate of an AI-native retail decisioning platform: infrastructure, CI / CD, observability, cost meter, and incident response for a system that runs production agents taking real business actions. The role builds on the enterprise Terraform standard, CI / CD spine, and FinOps tagging policy rather than reinventing parallel infrastructure. Remote candidates outside of Thailand are welcome to apply.

- Adopt the enterprise Terraform standard and module library for all platform infrastructure; author platform-specific modules where needed (agent runtime, vector DB, knowledge graph); run drift detection weekly.
- Build platform-specific CI / CD pipelines on the enterprise spine: service deploys, agent deploys, eval-gate enforcement; integrate eval gates so no agent reaches production without an eval pass.
- Operate rollback orchestration with sub-15-minute recovery; quarterly game days.
- Own the platform observability stack: OpenTelemetry, Langfuse for LLM traces, custom dashboards for per-agent cost.
- Implement the per-agent cost meter end-to-end: token counts, vector queries, model inference, downstream LLM Gateway costs; surface cost data to the enterprise GenAI cost dashboard.
- Stand up the platform on-call rotation; author runbooks for every production agent and service; lead incident response with measurable corrective actions.
- Implement platform cost-tagging policy consistent with the enterprise standard (team, domain, environment, project, agent, suite, persona); report monthly to Cost Review.
- Drive cost optimisation: right-sizing, caching, model routing decisions, reserved compute.

Qualifications

- Bachelor's or Master's degree in Computer Science, Engineering, or a related discipline.
- 5+ years SRE / DevOps with production ownership.
- Terraform at scale: modules, state, drift, environment promotion.
- CI / CD for data + ML / AI services (GitLab CI / CD or comparable).
- Cloud platform (Azure preferred; AWS / GCP transferable).
- Observability: OpenTelemetry, Langfuse (or comparable LLM traces), custom dashboards.
- FinOps: tagging policies, attribution, optimisation.
- Incident response: on-call, post-mortems, runbook authorship.

Preferred Qualifications

- AI / agent platform SRE experience; cost-meter / chargeback systems built or operated.
- Multi-cloud production experience; open-source contributions to IaC / observability tooling.
- AI / ML / agent system observability instrumentation (LLM cost, agent cost, eval scores).
- Vendor certifications such as HashiCorp Terraform Associate / Professional, Azure Solutions Architect Associate, or Databricks Data Engineer Professional.

Worldwide