GitLab, founded in 2011 and based in San Francisco, California, maintains a distributed team of professionals who work remotely across multiple continents. GitLab advocates for pr
Intermediate Site Reliability Engineer, Cloud Cost Utilization
Location
United Kingdom
Posted
37 days ago
Salary
0
Seniority
Senior
Job Description
- Design and maintain cloud resource tagging and labeling strategies across GCP and AWS to support accurate cost attribution
- Develop tooling and pipelines to ingest, normalize, and report on cloud billing data using the FOCUS specification
- Automate cost anomaly detection, forecasting, and alerting so engineering teams can respond quickly to changes in infrastructure spend
- Contribute to GitLab's observability and monitoring stacks, including Prometheus, LGTM (Loki, Grafana, Tempo, and Mimir), and ELK, with a focus on surfacing cost efficiency signals
- Partner with Finance and Engineering leadership to support cloud cost forecasting for planning and budget discussions
- Act as a subject matter expert for cloud cost attribution, tagging strategy, and FOCUS adoption across GitLab Infrastructure
- Collaborate with Finance and Compliance teams on audits, certifications, and financial reporting needs related to cloud infrastructure usage
- Contribute to infrastructure-as-code efforts, including Terraform and Ansible, so cost controls and tagging requirements are built into provisioning workflows from the start.
Job Requirements
- Hands-on experience with cloud cost management in GCP and/or AWS, including billing data, pricing models, and optimization approaches
- Familiarity with, or interest in adopting, the FinOps FOCUS specification for multi-cloud cost analysis
- Experience designing or implementing cloud resource tagging and labeling strategies and improving adoption across teams
- Comfort working across technical and business functions, including Engineering, Finance, and other stakeholders
- Experience with infrastructure as code, including Terraform and Ansible
- Familiarity with observability tooling, including Grafana, and an understanding of how reliability and cost signals can be connected
- Ability to explain technical cost data clearly to non-engineering audiences and support informed decision-making
- A self-directed approach to work, with comfort operating in a fully remote and asynchronous environment.
Benefits
- Benefits to support your health, finances, and well-being
- Flexible Paid Time Off
- Team Member Resource Groups
- Equity Compensation & Employee Stock Purchase Plan
- Growth and Development Fund
- Parental leave
- Home office support


