Lead DevOps Engineer
Location
United Kingdom
Posted
51 days ago
Salary
0
Seniority
Senior
Job Description
Lead DevOps Engineer
Recruiting.com
• Define and drive the DevOps vision using Agile best practices
• Set direction, standards, and best practices for the team
• Lead the design of scalable, secure, and reliable infrastructure and delivery pipelines
• Establish and maintain CI/CD pipelines for multiple applications and services
• Align DevOps initiatives with engineering, product, and business goals
• Ensure high-quality engineering across the team
• Design, deploy, and maintain cloud infrastructure (Azure)
• Mentor engineers and promote knowledge sharing
• Facilitate clear communication between departments
• Advocate DevOps culture across teams, shifting left wherever possible
Job Requirements
- 6+ years in DevOps/Cloud/Platform Engineering roles
- 2+ years in a technical leadership or team lead capacity
- Strong hands-on experience with Microsoft Azure (VMs, networking, Storage, AKS, App Services, Key Vault)
- Deep experience with Kubernetes in production environments
- Strong infrastructure as code experience, mainly Terraform/Helm
- Strong CI/CD experience, ideally with Azure DevOps
- Candidates must be based in Poland (ideally located near Lublin) or the United Kingdom
- Functional development/coding ability in at least one language: Golang, Python, PowerShell, or Bash
- Strong understanding of networking, security, and cloud architecture
- Experience working in cross-functional engineering teams
- Familiar with observability platforms
- Cloudflare/edge/WAF knowledge is a plus
- LLM experience is a plus
- Minimum of 5 GCSEs at grades A–C (9–4)* including Maths and English, or equivalent high school diploma.
Benefits
- Benefit offerings outside the US may vary by country and will be aligned to local market practice. The eligibility and effective date may differ for some benefits and for team members covered under collective bargaining agreements.
More DevOps Engineer Jobs
Site Reliability Operations Analyst II
Centene Corporation
Transforming the health of the communities we serve, one person at a time.
• Collaborate with cross-functional teams to proactively monitor, maintain, and enhance production systems.
• Bridge the gap between software development and IT operations, with an emphasis on incident triage, systems operations and documentation, business user support and communications, toil reduction, work automation, and improving the reliability of services.
• Provide timely and accurate user support and troubleshooting, and manage escalations within enterprise SLAs (Service Level Agreements).
• Research and analyze enrollment/provider data and Utilization Management/Case Management business processes to support application issues, while building a more advanced skill set to assist with product design and solutions.
• Perform on-call post-deployment validation of application enhancements and code fixes, and monitor for escalations.
• Act as a key player in application incident management, ensuring prompt detection, triage, and resolution of production incidents.
• Monitor and manage technology business cycle processing within the application environment.
• Establish, document, and refine incident management processes to reduce downtime and service degradation.
• Participate in post-incident reviews, ensuring that lessons learned are incorporated into operational practices and automation efforts.
• Develop, improve, and review runbooks and documentation, and keep them up to date to ensure consistency across SRE teams.
• Document operational processes, including workflows, system configurations, troubleshooting guides, and incident reports, to improve the team's ability to respond quickly to incidents and system failures.
• Review and provide feedback and process improvement recommendations on topics such as automation, monitoring gaps, toil reduction, and system resiliency.
• Comply with all policies and standards.
TL;DR
We're hiring a Site Reliability Engineer to own and evolve deepset's cloud and customer infrastructure end to end. You'll work across SaaS, private cloud, and on-prem environments to make our self-hosted platform production-ready, drive CI/CD and GitOps maturity, and reduce complexity at scale. Your work will directly shape how deepset's AI platform is built, deployed, and scaled for our own cloud and for customers running it in their own environments.
Why deepset
At deepset, we're on a mission to make custom AI solutions accessible to every organization. With Haystack, thousands of developers build advanced LLM applications every day, while our enterprise-ready AI Platform helps companies turn large language models into business value. We're remote-first, flexible, and built on a culture of trust and ownership. You'll collaborate with top-tier tech talent, tackle meaningful challenges, and help transform complex AI into solutions that are simple, powerful, and ready for the real world.
What you will do
You won't just "keep things running"; you'll help define how our platform is built, deployed, and scaled across cloud and customer environments.
- Build and operate real-world infrastructure: Design, configure, and evolve infrastructure that runs both in our cloud and inside customer environments (SaaS, private cloud, on-prem).
- Make self-hosted production-ready: Help us deliver a production-grade, self-hosted platform that can be deployed on any Kubernetes setup in weeks, not months.
- Drive automation and platform maturity: Improve CI/CD pipelines, GitHub workflows, and GitOps setups so teams can ship faster with confidence.
- Reduce complexity and cost: Continuously simplify systems and optimize infrastructure spend without compromising performance or reliability.
- Shape how we build: Champion best practices in reliability, scalability, and security across the organization, not as rules, but as working systems.
Requirements
- 2-5 years of experience working with large-scale production infrastructure
- Fluent German language skills
- Experience with distributed or service-oriented architectures
- Hands-on expertise with:
  - AWS
  - Kubernetes
  - CI/CD and GitOps (e.g. ArgoCD)
- Working knowledge of Infrastructure as Code (Terraform preferred)
- Solid troubleshooting skills: you can debug across systems, not just within one layer
- A pragmatic mindset: you balance speed, simplicity, and reliability
- Ownership and accountability: you take responsibility for systems end to end
- Ability to work independently while staying aligned with the team's goals
Nice to have
- Familiarity with observability stacks (e.g. Datadog, Prometheus)
- Experience optimizing cloud costs at scale
- Interest or experience in Machine Learning / LLM systems
- Experience improving developer experience and platform tooling using AI agents
- Contributions to SRE practices like postmortems, SLIs/SLOs, and reliability engineering culture
Benefits
- Remote-first setup with flexible hours and tech of your choice
- 30 days vacation + extra days for family sick leave
- Competitive salary and stock options for every team member
- Monthly sports and mental health support allowance with Oliva
- Annual learning and development budget
- Monthly team socials and in-person meetups
- Dog-friendly Berlin HQ
Senior DevOps Engineer
Abacum
Abacum is the leading business planning platform that empowers Finance teams to drive performance.
• Design and implement our systems to be efficient, scalable, accountable, and secure
• Team up with other engineers to run experiments and test new ideas
• Build a strong DevOps culture and tooling that enable our delivery teams to be autonomous while following best practices (security, observability, scalability, performance, etc.)
• Deploy and manage our infrastructure provisioning
• Develop and drive real-time observability solutions that provide visibility into system health
• Provide technical guidance and educate team members and coworkers on operations and cloud best practices
• Continuously improve development and delivery through CI/CD
• Develop and implement security measures for development processes and operational needs, driven by our security and compliance team
• Build and scale our Kubernetes clusters and workloads
• Manage and scale our cloud databases
• Participate in a 24x7 on-call rotation



