Generative media in the blink of an API.
Staff DevOps Engineer
Location
United Kingdom
Posted
37 days ago
Salary
0
Seniority
Lead
Job Description
Runware
- Build and scale the infrastructure that powers real-time AI inference across GPU fleets, bare-metal servers, serverless and containerised production systems
- Help evolve Runware’s platform toward more elastic, on-demand infrastructure that can scale quickly with customer traffic and model demand
- Make Runware faster, more reliable and more resilient by improving the critical paths behind our request entrypoints, inference services, queues, storage, load balancers and networking layer
- Automate the hard parts of infrastructure operations, from provisioning and configuration through to CI/CD, deployment safety, progressive rollouts and rapid rollback
- Build the observability backbone for a high-performance AI platform, with the signals needed to spot issues early, understand capacity and fix problems before customers feel them
- Play a leading role in production operations, incident response, debugging and post-incident improvements, helping us turn operational challenges into a stronger platform
- Strengthen the security and compliance foundations of our infrastructure through patching, secrets management, access controls, hardening, auditability, documentation and repeatable operational processes
Job Requirements
- Strong experience as a DevOps Engineer, SRE, Infrastructure Engineer, Platform Engineer or similar, with a track record of running production systems at scale
- Deep Linux knowledge and confidence debugging real production issues across networking, storage, performance, services and system behaviour
- Hands-on experience building automation, Infrastructure-as-Code, CI/CD pipelines and deployment workflows that make infrastructure safer and easier to operate
- Experience operating high-availability, low-latency or high-throughput platforms where reliability and performance directly affect customers
- Strong networking fundamentals across TCP/IP, DNS, load balancing, routing, firewalls, proxies, TLS and HTTP
- A calm and pragmatic approach under pressure, with strong communication, good judgement and a bias toward automation over manual toil
Bonus
- Experience operating GPU infrastructure for AI/ML inference, including NVIDIA drivers, CUDA, container runtimes, GPU monitoring, capacity planning and workload isolation
- Familiarity with inference serving and optimisation frameworks such as vLLM, TensorRT, Triton or similar
Benefits
- We’re a remote-first collective, meeting in person twice a year to plan, brainstorm, celebrate wins, and enjoy some face-to-face time. We have core hours for cooperative working and calls, but outside of that your calendar is yours. Work the hours that let you perform at your peak while also building a healthy life.
- Our release cycles are fast and intense, but they’re followed by real downtime. After big pushes we expect the team to unplug, recharge, and come back ready and stronger than ever for the next leap.
- Generous paid time off – vacation, sick days, public holidays
- Meaningful stock options – share in the upside you create
- Remote-first setup – work from home anywhere we can employ you
- Flexible hours – own your schedule outside core collaboration blocks
- Family leave – paid maternity, paternity, and caregiver time
- Company retreats – twice-yearly gatherings in inspiring locations
More DevOps Engineer Jobs
Senior DevOps Engineer
Sagent
Sagent powers banks and lenders to make loans and homeownership simpler and safer for millions of consumers.
- Operate and improve multi-region GKE clusters hosting hundreds of microservices across multiple environments from development through production
- Manage the Kubernetes platform layer: Istio service mesh, cert-manager, external-dns, RBAC, HPA/KEDA autoscaling, HashiCorp Vault secret injection, and Helm-based deployments
- Develop and maintain Terraform modules across multiple IaC repositories covering GKE, networking (Shared VPC, Cloud NAT, Private Service Connect), Cloud SQL, Cloud Storage, Dataproc, Cloud Composer, Vault, and web hosting
- Maintain and extend Azure DevOps CI/CD pipelines using shared Terraform templates with multi-environment deployment workflows
- Support Confluent Kafka infrastructure including Connect workers with JDBC source connectors, consumer group health monitoring, and Kafka-lag-based autoscaling with KEDA
- Manage Redis Enterprise clusters on Kubernetes with operator-managed lifecycle and replication
- Operate the observability stack: Grafana Cloud (Alloy, Loki, Mimir, Tempo, Pyroscope via Private Service Connect), kube-prometheus-stack, Google Managed Prometheus, OpenTelemetry Operator/Collector, Beyla, and Kubecost
- Harden cluster security posture: NetworkPolicies, Pod Security Standards, admission policy enforcement, CrowdStrike Falcon, Lacework, kube-bench, and cert-manager with Let’s Encrypt ACME
- Support data infrastructure including Cloud SQL (PostgreSQL), Dataproc (Spark), Cloud Composer (Airflow), Matillion CDC pipelines, Snowflake, and BigQuery
- Manage DNS across multiple providers (Azure DNS, Cloudflare, GCP Cloud DNS) via external-dns, and support Azure APIM and Cloudflare CDN/WAF
- Partner directly with application development teams to troubleshoot deployment failures, tune resource limits and autoscaling, and resolve Kafka consumer lag and connectivity issues
- Contribute to the Internal Developer Portal (Backstage) and internal CLI tooling that enables self-service for product engineers.

- Design, implement, and evolve GCP-based infrastructure using Infrastructure as Code with Terraform and Google Cloud deployment automation patterns.
- Build and maintain scalable CI/CD pipelines using Cloud Build, GitHub Actions, Jenkins, or equivalent platforms for application, infrastructure, and platform workloads.
- Administer and optimize GCP delivery workflows including Cloud Build triggers, Artifact Registry, source integrations, deployment approvals, and service account access patterns.
- Partner with engineering teams to improve build, release, and deployment workflows across microservices and cloud-native applications.
- Implement robust observability across systems using Google Cloud Operations Suite, Cloud Logging, Cloud Monitoring, and related telemetry tooling.
- Strengthen platform security by integrating secrets management, policy enforcement, vulnerability scanning, and least-privilege access control.
- Manage and optimize containerized environments using Kubernetes, Helm, and Google Kubernetes Engine (GKE).
- Drive reliability engineering practices including incident response, root cause analysis, SLO thinking, and automated remediation where appropriate.
- Standardize reusable templates, modules, and platform patterns that improve developer productivity and consistency.
- Mentor engineers and provide technical leadership on GCP architecture, deployment automation, release governance, and DevSecOps practices.

- Design, implement, and evolve Azure-based infrastructure using Infrastructure as Code with Terraform, Bicep, or ARM templates.
- Build and maintain scalable CI/CD pipelines using Azure DevOps Pipelines for application, infrastructure, and platform workloads.
- Administer and optimize Azure DevOps services, including Azure Repos, Pipelines, Artifacts, Boards, and service connections.
- Partner with engineering teams to improve build, release, and deployment workflows across microservices and cloud-native applications.
- Implement robust observability across systems using Azure Monitor, Log Analytics, Application Insights, and related monitoring tooling.
- Strengthen platform security by integrating secrets management, policy enforcement, vulnerability scanning, and least-privilege access controls.
- Manage and optimize containerized environments using Kubernetes, Helm, and Azure Kubernetes Service (AKS).
- Drive reliability engineering practices, including incident response, root cause analysis, SLO thinking, and automated remediation where appropriate.
- Standardize reusable templates, modules, and platform patterns that improve developer productivity and consistency.
- Mentor engineers and provide technical leadership on Azure architecture, deployment automation, release governance, and DevSecOps practices.

- Contributing to long-term initiatives such as, but not limited to, unlocking on-premises deployment of space algorithms with Kubernetes
- Continuously improving the developer experience of your teammates (software and space)
- Collaborating with space engineers and other software engineers to develop algorithms as services
- Designing and developing core algorithms and infrastructure from early-stage prototypes to final deployment
- Enhancing and extending the observability of our systems
- Reviewing PRs and design documents, contributing to code via business-driven implementations and bug fixes, and reducing tech debt



