Job Closed
This listing is no longer active.
Boosting offensive security with AI
SRE – Incident Response
Location
Ireland
Posted
117 days ago
Salary
0
Seniority
Senior
Job Description
XBOW
- Keeping XBOW’s production systems stable, observable, and resilient as the product scales.
- Building and maintaining automated reliability tooling covering monitoring, alerting, and self-healing.
- Defining and tracking service level goals for both production and development environments.
- Collaborating closely with infrastructure and feature teams to manage cloud systems through IaC.
- Conducting root-cause investigations and incident analysis across the organization.
- Helping maintain internal and customer-facing status dashboards that clearly communicate system health and uptime.
Job Requirements
- Strong experience with TypeScript
- Hands-on experience with AWS
- Solid expertise in Linux, plus experience with infrastructure & DevOps tooling such as Kubernetes, Docker, Terraform, and CI/CD pipelines (especially GitHub Actions)
- Background in infrastructure automation and/or incident response (depth may vary by candidate)
- Familiarity with monitoring and observability tools such as OpenTelemetry, Prometheus, VictoriaMetrics, Grafana, and Datadog
- Experience with Python and/or Go (advantageous)
- Experience with additional cloud providers beyond AWS (advantageous)
Benefits
- Competitive salary and a generous equity package, making you a true owner of the company.
- Shape your role, lead the function, and grow with the company as we redefine cybersecurity.
- You will tackle technically complex challenges and play a pivotal role in the growth of our business, working alongside an amazing team and some of the world’s leading experts to shape how AI transforms cybersecurity.
More DevOps Engineer Jobs
- Automate infrastructure with Terraform and ARM templates
- Build CI/CD pipelines and automation with PowerShell and YAML
- Manage Azure IaaS & PaaS resources efficiently
- Configure Azure networking: App Gateways, Firewalls, API Management
- Monitor infrastructure using Azure tools and third-party observability platforms
- Implement OAuth2/OpenID Connect with Entra ID
- Govern resources using Azure Policies
- Design geo-redundant, highly available systems
- Apply basic SQL in deployment and support tasks
- Ensure infrastructure security and compliance (e.g., PCI DSS)
- Optimize cloud costs using Azure Cost Management tools
- On-call availability (if needed)
- Contribute to a collaborative team environment while working independently
- Participate in Agile Scrum teams, continuously improving delivery pipelines
Senior Site Reliability Engineer
Airalo
Airalo is an eSIM store where travelers can access more than 200 eSIMs at affordable, local rates from around the world while using an eSIM-compatible tablet, smartphone, or PC. Th
- Lead the design of scalable, fault-tolerant and self-healing systems in a multi-region AWS environment.
- Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to drive architectural decisions and error budget policies.
- Conduct blameless post-incident reviews to uncover systemic root causes and implement long-term preventive measures.
- Identify patterns of manual work and lead the development of internal tools/automation to permanently eliminate them.
- Develop and maintain automated runbooks and playbooks for common operational tasks and complex incident response.
- Shift from simple monitoring to deep observability, ensuring high cardinality data leads to proactive actionable insights.
- Proactively identify and mitigate operational risks through chaos engineering and architecture reviews.
- Work with software engineers to design systems for reliability, scalability, and maintainability from the early stages of the SDLC.
- Continuously evaluate and optimize system performance, capacity, and cost efficiency.
- Beyond just participating, you will refine the on-call experience to reduce alert fatigue, improve MTTR, and ensure sustainable rotation health.
- Provide strong leadership, mentoring, and sound judgment as the Reliability Engineering lead on your team.
- Design and maintain autonomous systems for building, deploying, testing, and operating all Filevine products.
- Act as the authoritative voice of reliability across the full software development lifecycle (SDLC).
- Monitor, aggregate, dashboard, and alert on software/infrastructure events to ensure visibility and fast response.
- Continuously enhance CI/CD pipelines, automation scripts, playbooks, and tools to streamline processes and reduce resolution time.
- Proactively identify and resolve gaps in system availability, performance, and security while defending overall security posture.
- Document processes, architecture, procedures, and best practices; research, adopt, or build reliable tools to boost engineer productivity.
- Collaborate within your team (or independently), mentor junior engineers, participate in 24/7 on-call rotation for production support and emergency response, and communicate clearly with technical and management stakeholders.
Responsibilities:
- Build and maintain CI and CD pipelines that support OpenShift workloads
- Manage OpenShift clusters, including deployment, upgrades, scaling, logging, and monitoring
- Automate everything possible using tools like Ansible, GitOps workflows, or scripting languages
- Troubleshoot cluster, container, and pipeline issues across the full stack
- Work with development teams to containerize apps and optimize performance
- Implement security best practices for OpenShift and Kubernetes environments
- Maintain strong documentation around workflows, pipelines, and cluster operations
- Improve reliability, resiliency, and performance of platform services




