Honeycomb.io

The fastest way to visualize, understand and debug software. Find the critical issues that logs and metrics can’t see.

Senior Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 51-200 | Since 2016 | H1B No Sponsor

Location

Ireland

Posted

4 days ago

Salary

€140.6K - €165.4K / year

Seniority

Senior

Job Description

  • Help Honeycomb scale our backend systems to support our highest-volume customers.
  • Build organizational trust through transparent communication, giving and receiving direct and kind feedback.
  • Work with other backend teams to dive deep into our stack to make sure we’re getting the most out of our infrastructure.
  • Be trained as an Incident Commander, serve as one, and then train others.
  • Help SRE and Honeycomb develop a healthy cross-Atlantic engineering culture.
  • Participate in the team’s on-call rotation as the EU side of a new follow-the-sun rotation.
  • Help the organization navigate tradeoffs between reliability and its other goals and priorities.
  • Optional: act as an external ambassador through blog posts, conference talks, and presentations with support from our DevRel team.

Job Requirements

  • Strong experience in AWS and Kubernetes.
  • Experience performing cost analysis and reduction.
  • Solid Helm, Terraform, and CI/CD experience.
  • Project management skills.
  • Software engineering experience (Golang is a plus, and so is performance engineering).
  • Experience with Kafka or another high-volume distributed system.
  • Excellent written and spoken communication skills, with the ability to tailor your communication for your audience and give direct feedback when you notice something wrong.
  • A curiosity to learn how people and systems work, and the willingness to make them partners in your initiatives.
  • Familiarity with observability concepts (SLOs, instrumentation) and data-driven decision making.
  • Comfort operating in ambiguity, with a bias for action and experimentation.
  • Interest in both the technical and human sides of reliability engineering.
  • Experience working in geographically distributed teams.

Benefits

  • A stake in our success - generous equity with employee-friendly stock program
  • It’s not about how strong of a negotiator you are - our pay is based on transparent levels relative to experience
  • Time to recharge with unlimited PTO
  • A distributed-first mindset and culture (really!)
  • Home office, co-working, and internet stipend
  • Full benefits coverage for employees, with additional coverage available for dependents
  • Up to 16 weeks of paid parental leave, regardless of path to parenthood
  • Annual development allowance
  • And much more...

More DevOps Engineer Jobs

Senior Site Reliability Engineer - Investments (She/ He/ They)

Capco

Capco, a Wipro company, is a management & technology consultancy dedicated to the financial services & energy industries

DevOps Engineer | 4 days ago
Full Time | Remote | Team 1,001-5,000 | Since 1998 | H1B Sponsor

CAPCO POLAND

We are looking for a Poland-based candidate.

At Capco Poland, we’re not just another consultancy - we’re the spark behind digital transformation in the financial world. As a global leader in technology and management consulting, we thrive on helping clients tackle the toughest challenges across banking, payments, capital markets, wealth, and asset management.

ROLE OVERVIEW:

We are looking for a Site Reliability Engineer (SRE) to act as an embedded reliability engineering partner supporting critical digital platforms and business services within a leading global financial services environment. Working alongside application, platform, and infrastructure teams, you will help improve the reliability, scalability, observability, and operational maturity of systems that support investment, advisory, and customer-facing services.

You will apply Site Reliability Engineering principles to reduce operational risk, improve service availability, and enhance customer experience across a complex technology landscape. The role combines hands-on engineering with operational leadership, leveraging automation, AI-driven capabilities, and modern observability practices to accelerate incident response, reduce manual effort, and continuously improve service resilience. You will play a key role in driving reliability outcomes while collaborating with stakeholders across technology and business functions.

WHAT YOU'LL DO:

  • Partner with engineering, platform, and business-aligned teams to improve the reliability and performance of critical financial services applications.
  • Define, measure, and manage SLIs, SLOs, and error budgets to drive data-driven reliability improvements.
  • Lead and support incident management activities, participating in a 24x7 on-call rotation and driving effective post-incident reviews.
  • Build automation, self-healing capabilities, and operational tooling that reduce manual intervention and improve service recovery times.
  • Analyse application, infrastructure, and platform performance to identify reliability risks and deliver continuous improvements across the technology estate.
  • Partner with an India-based team of engineers.

WHAT WE'RE LOOKING FOR:

  • Proven experience in Site Reliability Engineering, Production Engineering, DevOps, Platform Engineering, or a similar operationally focused role.
  • Strong knowledge of observability, monitoring, incident management, reliability engineering, and service operations best practices.
  • Experience supporting business-critical applications within complex enterprise environments.
  • Hands-on experience with automation, scripting, infrastructure management, and cloud or hybrid technology platforms.
  • Excellent communication skills with the ability to collaborate effectively with engineering teams, operational stakeholders, and business partners.

BONUS POINTS FOR:

  • Experience supporting wealth management, investment management, banking, or other financial services platforms.
  • Knowledge of regulatory, security, and operational resilience requirements within highly governed environments.
  • Experience implementing AIOps, intelligent alerting, automated remediation, or predictive monitoring capabilities.
  • Familiarity with Kubernetes, container platforms, cloud-native architectures, and distributed systems.
  • Experience driving service reliability programmes using SLOs, error budgets, and operational excellence frameworks.

We offer a flexible collaboration model based on a B2B contract, with the opportunity to work on diverse projects.

Recruitment Process:

  • HR Interview with the recruiter
  • Technical Interview
  • Client Interview
  • Feedback and offer

#LI-HYBRID

Poland

Senior DevSecOps Engineer

Vanilla

Making Estate Planning Simple for Financial Advisors. Built for advisors, loved by clients.

DevOps Engineer | 4 days ago
Full Time | Remote | Team 51-200 | Since 2019 | H1B No Sponsor

  • Own and operate security tooling, manage key vendor relationships, and drive application and cloud security programs forward
  • Secure AWS infrastructure, systems, and networking
  • Review infrastructure-as-code (Terraform) changes for security implications
  • Monitor and triage security alerts across dedicated channels
  • Manage the vCISO relationship and own the annual penetration test lifecycle
  • Run tabletop exercises and maintain the incident response playbook

All locations: Arizona | California | Colorado | Connecticut | Florida | Idaho | Illinois | Kentucky | Maine | New Jersey | New York | Ohio | Massachusetts | Minnesota | Pennsylvania | Texas | Utah | Washington
$180K - $210K / year

Associate Site Reliability Engineer

UnitedHealth Group

UnitedHealth Group is a healthcare and well-being company that’s dedicated to improving the health outcomes of millions around the world. We are comprised of two distinct and com

DevOps Engineer | 4 days ago

Role Description

As a member of our team, you will:

  • Design, develop, and deploy AI-powered solutions using no-code, low-code, and advanced platforms, translating business needs into scalable applications that enhance products, workflows, and decision-making.
  • Design, deploy, and maintain Kubernetes-based infrastructure to ensure high availability and scalability of applications.
  • Build and manage CI/CD pipelines using GitHub Actions to enable fast and reliable deployments.
  • Use Terraform to provision and manage infrastructure in Google Cloud Platform (GCP).
  • Manage and optimize Apache Kafka-based systems to ensure reliable message streaming and data processing.
  • Monitor and improve system performance and reliability using Prometheus and Grafana.
  • Collaborate with developers to automate workflows and implement best practices for infrastructure-as-code (IaC).
  • Write Python scripts for automation and tooling to enhance operational efficiency.
  • Troubleshoot and resolve system issues to minimize downtime and impact on users.
  • Participate in on-call rotations and incident response to ensure high service reliability.

Qualifications

  • 1+ years of experience with Google Cloud Platform (GCP) services such as Compute Engine, Kubernetes Engine, and Cloud Storage.
  • 1+ years of hands-on experience with Kubernetes for deploying and managing containerized applications.
  • 1+ years of experience with GitHub Actions for creating and maintaining CI/CD pipelines.
  • 1+ years of experience with Python for scripting, automation, and tooling.
  • 1+ years of experience with Apache Kafka for building, maintaining, and troubleshooting message-driven systems.
  • 1+ years of experience using Prometheus and Grafana for monitoring and observability.
  • Basic knowledge of Terraform for infrastructure provisioning and management.

Requirements

  • Familiarity with other cloud providers (e.g., AWS or Azure).
  • Knowledge of Helm for Kubernetes package management.
  • Experience with debugging and optimizing distributed systems.
  • Exposure to security best practices for cloud infrastructure.
  • Knowledge of Java for developing and troubleshooting backend systems.
  • Familiarity with DataHub or similar data cataloging and metadata management platforms.
  • Understanding of Artificial Intelligence (AI) concepts and tools, such as building or managing machine learning pipelines, integrating AI models, or working with ML platforms like TensorFlow, PyTorch, or Vertex AI.
  • Experience with Golang for developing infrastructure tools or cloud-native applications.

Benefits

  • Comprehensive benefits package.
  • Incentive and recognition programs.
  • Equity stock purchase.
  • 401k contribution (subject to eligibility requirements).

United States
$60.2K - $107.4K / year

Senior Site Reliability Engineer

Amwell

Amwell (previously known as American Well): digital care delivery will transform healthcare

DevOps Engineer | 4 days ago
Full Time | Remote | Team 501-1,000 | H1B No Sponsor

  • Support production systems on platforms such as ESXi, Azure, AWS, and GCP
  • Utilize configuration management tools for scalable and repeatable systems management including Ansible and Puppet
  • Design, develop, and maintain automation frameworks, scripts, and operational tooling to improve scalability, reliability, and operational efficiency across infrastructure and platform services.
  • Configure, maintain, patch, and troubleshoot Linux operating systems with basic knowledge of Windows operating systems
  • Ensure compliance with security and data handling policies to meet PCI, HIPAA, and other standards
  • Develop and maintain Infrastructure-as-Code (IaC) solutions using tools such as Terraform, Ansible, and Puppet to support repeatable and standardized deployments.
  • Collaborate with peers as an accountable and supportive member of Amwell technology teams
  • Participate in 24/7 call rotation and scheduled maintenance tasks

United States
$129.3K - $140K / year