Job Closed
This listing is no longer active.
GetBlock provides developers with instant connections to 40+ blockchain nodes via JSON-RPC, REST, and WebSocket APIs.
SRE Lead
Location
United States
Posted
160 days ago
Salary
0
Seniority
Senior
Job Description
- Lead and grow the SRE team: hiring, onboarding, 1:1s, performance reviews, and career development.
- Own SRE operating cadence: prioritization, planning, execution, and visibility of reliability work.
- Maintain high standards for production readiness: runbooks, operational checklists, change management, and quality gates.
- Own production reliability end-to-end across gateways, clusters, and blockchain node fleets.
- Define and evolve SLIs/SLOs for uptime, response time, RPS, and time-to-resolve; partner with engineering teams to meet targets.
- Own incident management standards: alerting strategy, escalation, incident coordination, and communications.
- Run and improve postmortems: ensure follow-ups are executed and reliability debt is reduced over time.
- Lead capacity planning and performance work across regions and chains; balance reliability, speed, and cost.
- Lead design reviews and set engineering standards for reliability, scalability, and operational excellence.
- Drive architecture decisions across Nomad + Kubernetes environments, gateways, and observability stack.
- Build and evolve internal tooling that improves reliability and operational efficiency (automation, health systems, diagnostics, self-service).
Job Requirements
- 3+ years in SRE / infrastructure / production engineering, including 1+ year leading people
- Strong Linux, networking, and production incident debugging skills
- Experience running and scaling distributed, multi-region, high-load systems
- Hands-on with orchestration (Nomad and/or Kubernetes) and modern gateways/proxies
- Solid observability practices (metrics, logs, traces, alerting, incident response)
- Experience using AI agents to improve operational efficiency and reliability automation
- Strong communication and ability to lead technical decisions end to end
Nice to have:
- Web3 / RPC infrastructure and blockchain node operations
- HashiCorp stack (Nomad, Consul, Vault), Prometheus ecosystem
- Terraform / IaC, capacity & cost modeling, DDoS and abuse protection
- Building internal platforms: self-service tools, runbooks, reliability automation
Benefits
- 20 days of annual leave, plus an additional 12 days off to use for your holidays or personal days.
- Well-being programs to support your health and balance.
- Coworking space compensation for a productive work environment.
- Paid sick leave to ensure you can rest when needed.
- A company that invests in your growth, with personalized roadmaps to guide your professional development.
- An actively growing company with great opportunities for both horizontal and vertical career development.
- Opportunity to shape the initiatives you’re working on and make a real impact.
More DevOps Engineer Jobs
Senior Site Reliability Engineer – Golang, OpenShift, AWS, Linux
Red Hat
The leading provider of enterprise open source solutions.
- Develop, scale, and operate OpenShift managed cloud services
- Enable customer self-service and improve monitoring systems
- Eliminate work through automation
- Participate in a regular on-call schedule, including occasional paid weekends and holidays
- Resolve customer issues escalated from the Red Hat Global Support team
- Work within a small agile team to develop and improve SRE software
Engineer III – CICD DevOps
CrowdStrike
CrowdStrike has redefined security with the world’s most advanced cloud-native platform that protects and enables the people, processes and technologies that drive modern enterprise. Tested and proven, the world's largest organizations trust CrowdStrike to stop breaches with unparalleled protection against the most sophisticated cyberattacks. The CrowdStrike culture has been built upon our Core Values since the day we began. We are Fanatical About the Customer, Relentlessly Focused on Innovation and believe that our Limitless Passion drives Unlimited Potential for every CrowdStriker. As a purpose-built remote-first company, we believe cultivating a connected culture for every employee, no matter where they are in the world, is a key ingredient in building a high-performing, diverse team. We don’t have a mission statement. We’re on a mission to stop breaches. Ready to join a mission that matters?
- On-premise provisioning and administration
- Experience with Kubernetes (k8s), ArgoCD, FluxCD, and containers
- Jenkins with JCasC
- Artifact repository services such as JFrog Artifactory, Nexus, or Quay.io
- Atlassian stack (Jira, Confluence, Bitbucket)
- IaaS provisioning tools such as Ansible, Chef, Salt, Puppet, etc.
- Experience with common scripting languages and interfaces: Python, REST APIs, Groovy
- Experience with Linux and Windows server administration in hybrid environments
- Knowledge of proper monitoring, maintenance, and disaster recovery of critical services
- Ability to document processes and procedures
Senior DevOps Engineer
ScaleUP Week
Four transformational days of best practices, impact and inspiration.
- Own and scale ZayZoon’s AWS infrastructure to ensure reliability, performance, and security.
- Design and automate infrastructure provisioning using CloudFormation, build CI/CD pipelines, and manage existing infrastructure stacks.
- Optimize our PostgreSQL databases for high availability and performance.
- Improve monitoring and observability where possible, ensuring detection and proactive resolution of issues.
- Design secure (SOC-2 and cybersecurity compliance), scalable cloud services in AWS.
- Evaluate and remediate Critical and High CVEs across all our services at a moment’s notice.
- Work cross-functionally with teams such as development, data, testing, and security to convey concepts, build understanding, and improve deployment processes.
- Apply DevOps best practices in everything you do: there are many ways to do things, and you bring the best of them to our environment.
Senior Site Reliability Engineer – Chaos Engineering
Articul8 AI
Solving the world's toughest problems with Generative AI.
- Architect and maintain scalable, highly available infrastructure for our GenAI platform.
- Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
- Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
- Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
- Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
- Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
- Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
- Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
- Implement and enforce security best practices across all systems and environments.
- Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.




