Site Reliability Engineer, K8s
Location
United Kingdom
Posted
40 days ago
Salary
0
Seniority
Mid Level
Job Description
PulsePoint
WebMD and its affiliates are an Equal Opportunity/Affirmative Action employer and do not discriminate on the basis of race, ancestry, color, religion, sex, gender, age, marital status, sexual orientation, gender identity, national origin, medical condition, disability, veteran status, or any other basis protected by law.

Position Overview
Our BI team runs a set of GCP-based APIs and data services that many internal products depend on. As we've grown, keeping things running has increasingly become a side responsibility for engineers who are primarily building features, and that's not sustainable. We're looking for an SRE to own that space: service health, incident response, infrastructure monitoring, and making sure we're not blindly burning cloud budget.

The Site Reliability Engineer will ensure the availability, performance, and security of the Business Intelligence team's GCP-hosted APIs and data infrastructure. This role is responsible for proactive monitoring, incident response, and continuous improvement of platform reliability across a cloud-native stack. The engineer will work closely with backend and data engineers to maintain service health and drive operational excellence. This position also carries responsibility for GCP cost visibility, helping the team track and optimize cloud spend through structured monitoring and alerting.
Responsibilities
- Monitor and maintain uptime of GCP-hosted APIs and services, keeping performance within agreed targets
- Lead incident response for BI platform services: triage, resolve, and follow up with post-mortems that actually prevent recurrence
- Build and manage observability infrastructure: dashboards, alerts, and logging across GCP services
- Track GCP cloud spend and set up cost alerting to flag anomalies before they become problems
- Review and fix security gaps: IAP configs, service account permissions, API access controls
- Work with data and backend engineers to shore up reliability of data pipelines and BigQuery workflows
- Contribute to infrastructure-as-code and help keep deployments documented and reproducible

Qualifications
- 2+ years in a Site Reliability, DevOps, or Cloud Infrastructure role in a production environment
- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent hands-on experience
- Practical experience with GCP, in particular Cloud Run, API Gateway, and BigQuery
- Experience with monitoring and observability tooling (Cloud Monitoring, Datadog, or similar)
- Solid grasp of cloud security fundamentals: IAM, network controls, access management
- Proficiency with Git and version control in a team setting

Preferred Skills
- CI/CD pipelines and deployment automation (GitHub Actions, Cloud Build, or similar)
- Terraform or other infrastructure-as-code tools
- Python for scripting or automation
- MySQL, Spanner, or BigQuery at any meaningful depth
- GCP cost management and spend optimization
- Experience with dbt or Looker
- Comfortable working across CET/EST hours in a distributed team
More DevOps Engineer Jobs
DevOps Engineer
RE Partners
We make the Aspirational Attainable. We Do Better Together to Deliver Real Change.
We are looking for a DevOps Engineer with basic Golang skills to support a team focused on improving development processes and the product. In this role, you will work closely with a Staff Engineer, with a focus on designing, migrating, implementing, and maintaining the infrastructure and cloud environments that enable these systems to operate reliably and at scale. Your work will involve containerized services, AWS Lambda, Terraform-managed infrastructure, and secure data transfer between AWS accounts, with a strong emphasis on scalability, security, and operational excellence.

You will collaborate with engineering, product, and operational teams, while also coordinating technical requirements with the vendor. Strong communication skills are important, as you’ll often act as the bridge between the team and other internal teams, ensuring infrastructure and systems are aligned with internal standards and best practices.

What we’re looking for:
● Strong AWS and DevOps experience (infrastructure, deployments, operations, observability)
● Solid experience with infrastructure as code (Terraform preferred)
● Experience with containerized environments (Docker, ECS/EKS, etc.)
● Hands-on experience with AWS services such as Lambda and related cloud tooling
● Golang and Bash skills for scripting, automation, and backend support
● Ability to work independently while collaborating with multiple stakeholders
● Strong communication and coordination skills

Nice to have:
● Familiarity with TypeScript or Node.js

This role is ideal for engineers who enjoy working towards engineering excellence and helping the team excel. You will collaborate across infrastructure and development teams to deliver impactful solutions.

Join Our Global Team: We invite you to apply for the position at RE Partners. Join us in shaping the future of business technology consulting and transforming the way organizations thrive in a digital world.
As a diverse, woman-owned global business, we pride ourselves on keeping talent happy, and our 7% attrition rate speaks volumes. Bring your talented friends along and earn a referral bonus.

Equal Opportunity Employer: We are an equal opportunity employer and welcome applications from all qualified individuals regardless of race, color, religion, gender, sexual orientation, gender identity or expression, national origin, age, disability, or veteran status.
About BoxCast
Launched in 2013, BoxCast is a complete professional live production platform trusted by thousands of organizations delivering over one million broadcasts annually. BoxCast serves a wide range of clients including houses of worship, sports teams, local government, and live sound engineers. The platform makes it easy to stream high-quality video, mix audio remotely, and distribute content via custom streaming apps. BoxCast’s patented streaming protocol ensures reliable delivery even on challenging networks, while features like built-in multi-streaming, automation, and AI simplify the entire broadcasting workflow for organizations of all sizes.

About the Role
As a Full Stack DevOps Software Engineer for the BoxCast Internal Systems team, you will be responsible for developing, testing, deploying, and maintaining our software systems, and for running our cloud operations. Your DevOps experience will help you monitor systems, troubleshoot issues, optimize performance, and build scalable infrastructure. Leveraging your robust development expertise, you'll navigate a variety of technologies, make well-considered architecture and design decisions, and deliver user value.

BoxCast is proud to offer a hybrid work model. You'll enjoy the flexibility of working remotely or from our office, based on your preference.
What You’ll Do
- Architect, design, develop, test, deploy, and maintain: Work on full-stack systems and microservices that support our products and the overall BoxCast experience.
- Monitor, troubleshoot, optimize, and tune: Keep our systems running smoothly during volume peaks, address escalations, and resolve deprecations and outages. Our rotation currently includes being on-call every fifth Sunday morning.
- Integrate BoxCast systems and our suite of SaaS applications, including Stripe, HubSpot, Avalara, ChurnZero, Intercom, PandaDoc, Panoply, Okta, ShipStation, Atlassian, Duda, Datadog, and JumpCloud.
- Stay up to date with key technologies and keep an eye on the horizon: The rapid pace of Cloud and AI development makes it crucial to keep up with emerging technologies and assess their potential benefit.
- Execute excellently: Actively participate in planning, status, and review meetings to deliver value at a high velocity.
- Mentor and learn: Share your expertise and learn from colleagues.
- Work together: Collaborate seamlessly with a diverse team of product designers, software engineers, hardware engineers, and business visionaries to drive the BoxCast vision.
How You’ll Display Our Values
- Integrity: You are a beacon of trust and ethical engineering among a team of talented engineers
- Collaboration: You routinely interact with employees at all levels of the company, including company leadership and employees in other departments
- Judgment: You make the best architectural decisions for the given supporting data and requirements
- Achievement: You build reliable, high-quality systems that inspire colleagues and empower the organization
- Innovation: You find new ways to reduce cost, improve efficiency, and make our systems more secure

Must Haves
- 7+ years of professional experience or equivalent
- Back-end languages: Go, Java, TypeScript, Ruby, and C/C++
- Infrastructure-as-code: Terraform and CloudFormation/CDK
- Virtualization/containerization technologies: Docker and Kubernetes
- Cloud technologies: AWS EC2, ECS, S3, and Lambda
- AWS networking: ALB/NLB, VPC, security groups, and CloudFront
- Monitoring/alerting/telemetry systems: Datadog, InfluxDB, Grafana, and OpenSearch
- SaaS app integration: API, webhooks, and logging
- SQL and databases: MySQL, PostgreSQL, MSSQL, and Oracle
- Experience with personal and/or team DevX enhancements
- Front-end frameworks: Vue.js, Angular, and React
- A desire to solve complex problems at scale

Nice to Haves
- Knowledge of the live production ecosystem, with specific expertise in video and audio live streaming technologies
- Experience in a post-startup tech environment

Location: This role can work remotely in the U.S. from one of the following states: Arkansas, Colorado, Connecticut, Florida, Georgia, Indiana, Maine, Maryland, Michigan, Minnesota, Missouri, North Carolina, Ohio, Oklahoma, Pennsylvania, South Carolina, Tennessee, or Wisconsin. Employees located near our Cleveland, OH office are welcome to work on-site, but this isn't required for this role. Sponsorship for work authorization for foreign national candidates is not available for this position.
Compensation: The salary range for this role is $115,000 - $140,000 annually. Final compensation is determined by a combination of factors, including job-related experience, skills, knowledge, internal pay equity, and overall market conditions. Because of this, every offer is unique. Additional details on total compensation and benefits will be discussed during the hiring process.

Benefits: We offer a comprehensive benefits package including Medical, Vision, and Dental coverage, a generous time-off policy, a parental leave policy for birthing and non-birthing parents, and the opportunity to contribute to your 401(k) plan to support your long-term goals.

Equal Opportunity Statement: BoxCast is proud to be an Equal Employment Opportunity employer. All qualified applicants receive equal consideration regardless of race, place of origin, color, age, marital status, religion, sex, sexual orientation, gender expression or identity, protected veteran status, disability status, or any other status protected by law.

Candidate Accommodations: We’re committed to providing reasonable accommodations for individuals with disabilities. If you need assistance or an accommodation during the interview process, please contact our People Operations team during the recruitment process.

A Final Note: You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.
• CI/CD & Pipeline Management: Design, implement, and optimize robust CI/CD pipelines to enable fast, secure, and automated deployments across environments.
• Infrastructure Leadership: Define and lead infrastructure strategy, architecture, and scalability planning.
• Stakeholder Engagement: Collaborate with clients and stakeholders to gather infrastructure requirements, provide technical guidance, and ensure delivery alignment.
• Monitoring & Observability: Implement and manage comprehensive logging, monitoring, and alerting systems for operational visibility and incident response.
• Cloud Infrastructure Management: Architect and manage scalable, secure systems on AWS and GCP, following best practices and cost optimization strategies.
• Cross-Functional Collaboration: Partner with engineering, product, and support teams to align infrastructure with evolving product needs.
• Security & Compliance: Define, enforce, and audit infrastructure security policies and ensure compliance with organizational and regulatory standards.
• Operational Ownership: Oversee the reliability, cost-efficiency, and uptime of cloud and on-prem systems.
• Team Development & Hiring: Recruit, mentor, and scale the infrastructure team. Define team responsibilities and delegate effectively to meet goals.
• Owning the reliability, uptime, and scalability of critical production services 24/7.
• Participating in the on-call rotation to respond to incidents, troubleshoot live production issues, and lead post-incident analysis.
• Building robust operational playbooks and escalation paths, and improving Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
• Ensuring operational excellence by proactively detecting and addressing reliability risks through SLO monitoring, chaos testing, and capacity planning.
• Automating operational tasks to minimize human intervention.
• Architecting, implementing, and managing infrastructure across AWS, Oracle Cloud Infrastructure (OCI), and OpenStack environments.
• Optimizing cloud resources to balance performance, security, and cost-efficiency.
• Managing Kubernetes clusters (EKS, OKE, Rancher RKE2) for scalability, availability, and performance.
• Managing and optimizing high-performance messaging and caching systems including Kafka, RabbitMQ, and Redis.
• Managing and optimizing production-grade MySQL and PostgreSQL databases.
• Leading the planning and execution of comprehensive disaster recovery strategies.
• Implementing advanced observability solutions (Prometheus, Grafana, CloudWatch).
• Driving automation initiatives using Terraform, Helm, Jenkins, Tekton, or GitLab CI/CD.
• Integrating security best practices into infrastructure and applications.
• Collaborating with cross-functional teams to foster SRE culture and mentoring junior engineers.



