Devsu

Devsu is a technology agency that provides software development services, IT augmentation and staffing.

Senior Site Reliability Engineer (SRE) - (GCP)

DevOps Engineer | Full Time | Remote | Senior | Team 51-200 | H1B No Sponsor

Location

PST (UTC-8)

Posted

7 days ago

Salary

Not specified

Seniority

Senior


Job Description

Role Description

We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP). This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.

As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.

Responsibilities

Monitoring & Observability (Core Focus)
- Own and operate the monitoring and observability stack across on-prem and GCP environments
- Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
- Define, tune, and maintain alerts to ensure high signal-to-noise ratio
- Establish observability standards and best practices across teams
- Improve visibility into system health, performance, and reliability

Site Reliability Engineering
- Apply SRE principles to improve availability, performance, and resilience
- Define and track SLIs, SLOs, and error budgets
- Participate in on-call rotations and SEV incident response
- Lead or contribute to incident investigations and root cause analysis (RCA)
- Drive preventative actions to reduce repeat incidents

Kubernetes & Platform Reliability
- Support and monitor Kubernetes environments (GKE and on-prem clusters)
- Monitor cluster health, capacity, and resource utilization
- Troubleshoot platform-level issues impacting application reliability
- Collaborate with Platform and Engineering teams on reliability improvements

Secondary Responsibilities (Backup Application Support)
- Provide L2/L3 application support coverage during:
  - Support team resource shortages
  - High-severity incidents (SEVs)
  - Peak support periods or escalations
- Triage and troubleshoot application issues using existing runbooks and dashboards
- Collaborate with Application Support and Engineering teams during incidents
- Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

Qualifications
- Strong experience as a Site Reliability Engineer or Reliability Engineer
- Deep hands-on expertise with Grafana (dashboards, alerting, troubleshooting)
- Solid experience with monitoring and observability systems
- Production experience operating Kubernetes environments
- Experience supporting systems in GCP and on-prem environments (mandatory)
- Strong Linux systems and troubleshooting skills
- Fluent English (written and spoken)
- Ability to work in PST time zone
- Ability to participate in an on-call rotation that includes coverage for one weekend day

Requirements

Technology Stack:
- Observability: Grafana, Prometheus, logging platforms
- Containers: Kubernetes (GKE and on-prem)
- Cloud: Google Cloud Platform (GCP)
- Operations: Linux, networking, infrastructure monitoring
- Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)

Nice to have:
- Experience supporting application teams during SEV incidents
- Knowledge of capacity planning and performance tuning
- Scripting skills (Python, Bash, etc.)
- Experience with hybrid infrastructure environments

Benefits
- A stable, long-term contract with opportunities for career growth
- Private health insurance
- A remote-friendly culture that promotes work-life balance
- Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
- Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
- A flexible Paid Time Off (PTO) policy as well as paid holiday days
- Challenging, world-class software projects for clients in the US and LatAm
- Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment

More DevOps Engineer Jobs

Senior Software Engineer with SRE background (AWS, Azure, GCP)

Dynatrace

Dynatrace is a global application performance management software firm and a former member of Compuware. As an employer, the company is in support of helping its team achieve a hea

DevOps Engineer | 7 days ago
Full Time | Remote | Team 5,200 | Since 2005

Your role at Dynatrace

Are you a passionate Senior Software Engineer with SRE background ready to shape the future of product development? Do you thrive in a collaborative, international environment and want to make a real impact on our customers? If you're excited about observability platforms and want to contribute to a globally leading product, this is your opportunity.

This role is designed for a Senior Software Engineer from an SRE background, experienced in Cloud monitoring, who wants to move from operating to building. Come and help us advance our Observability Solution.

Our engineering culture is built on technical excellence, ownership, and continuous feedback. We live by the principle of "You build it, you run it", and work in agile iterations to deliver high-quality customer value.

As Senior Software Engineer with SRE background you will:
- Leverage your SRE experience to enhance our cloud observability product and create tailored solutions that empower our users, such as other SREs, to monitor, diagnose, and maintain systems with greater experience.
- Ensure best practices across AWS, Azure, and GCP integrations that efficiently ingest cloud telemetry data, ensuring customers receive the highest quality and most relevant datasets.
- Utilize AI capabilities for elevating features like anomaly detection and correlate multidirectional signals for faster root cause analysis.
- Work with cloud technologies (AWS, Azure, GCP), researching and building knowledge around modern cloud architectures. You will build visualizations that help our users understand the complexity of cloud observability data.
- Drive architectural decisions and contribute to the evolution of our platform.
- Collaborate with stakeholders and drive decisions aligned with the product strategy.
- Foster high-quality software engineering practices, automation and optimization of tooling and processes (CI/CD integrations).

What will help you succeed
- 5+ years of hands-on experience in Site Reliability Engineering, working with Cloud Infrastructure on AWS, Azure or GCP
- Experience in software engineering, developing with JavaScript/TypeScript and/or working with backend languages such as Python and Java
- Technical studies related to Software Engineering or equivalent experience
- Hands-on experience with monitoring, logging, and observability tools like Dynatrace, Datadog, Splunk, Grafana or CloudWatch, Azure Monitor
- Experience working closely with development teams to improve application delivery and build efficient, automated pipelines
- Excellent verbal and written communication skills, with the ability to convey complex technical concepts clearly
- Strong analytical skills with the ability to understand end-to-end use cases and map system flows
- Good English communication

Why you will love being a Dynatracer
- Dynatrace is a leader in unified observability and security.
- We provide a culture of excellence with competitive compensation packages designed to recognize and reward performance.
- Our employees work with the largest cloud providers, including AWS, Microsoft, and Google Cloud, and other leading partners worldwide to create strategic alliances.
- The Dynatrace platform uses cutting-edge technologies, including our own Davis hypermodal AI, to help our customers modernize and automate cloud operations, deliver software faster and more securely, and enable flawless digital experiences.
- Over 50% of the Fortune 100 companies are current customers of Dynatrace.

Compensation and Rewards
- We offer only employment contracts and a remote working setup. This is a permanent role and not a B2B contract.
- We offer attractive compensation packages and stock purchase options with numerous benefits and advantages.
- Base salary range: 21.300 - 26.700 PLN gross per month, with higher pay based on experience and qualifications.

Equal Employment Opportunity

Dynatrace provides equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, veteran status, or any other protected characteristic. We actively foster an inclusive workplace that celebrates differences and promotes accessibility, collaboration, and growth for all.

Poland

Senior Data Site Reliability Engineer

Encora Digital

Encora, a leader in digital engineering, drives innovation by crafting cutting-edge, cloud-first, data-first, and AI-first solutions that redefine industries. Since its inception i

DevOps Engineer | 7 days ago

Role Description

We at Coforge are hiring a Senior Data Site Reliability Engineer (#20669) with the following skill set.
- Ensure the reliability, scalability, and performance of data platforms built on Databricks and Microsoft Azure.
- Design, implement, and maintain CI/CD pipelines, cloud infrastructure, and automated deployment processes using Azure DevOps and Terraform.
- Monitor, troubleshoot, and optimize data workloads, cloud environments, and platform stability while supporting incident and release management processes.
- Collaborate with data engineering and platform teams to improve operational excellence, production reliability, cost optimization, and continuous platform improvements.

Qualifications
- Bachelor’s degree in Computer Science, Engineering, Information Technology, Data Engineering, or equivalent practical experience.
- 5+ years of experience in Site Reliability Engineering, Data Engineering, Cloud Engineering, or related technical roles.
- Strong SQL expertise including data querying, optimization, troubleshooting, and performance tuning.
- Strong scripting skills in Python and Bash for automation and operational workflows.
- Hands-on experience designing and managing CI/CD pipelines using Azure DevOps, including environments, triggers, permissions, and project administration.
- Strong experience with Terraform and infrastructure-as-code practices.
- Strong expertise in Microsoft Azure services including compute, storage, networking, and containerization technologies.
- Experience working with Databricks technologies including Unity Catalog, Delta Lake, and Lakehouse Federation.
- Solid experience in incident management, troubleshooting, post-mortems, runbooks, and stakeholder communication.
- Experience managing production releases, release coordination, and ensuring stable deployment cycles.

Requirements
- Experience with observability and monitoring tools such as Prometheus or Grafana.
- Background in data engineering, distributed systems, or big data technologies such as Apache Spark.
- Experience optimizing cloud cost, platform performance, and scalability.
- Familiarity with DevOps culture and Infrastructure as Code best practices.

Company Description

At Coforge, we hire professionals based solely on their skills and do not discriminate based on age, disability, religion, gender, sexual orientation, socioeconomic status, or nationality.

Bolivia

DevOps Team Lead

Datassential

Datassential is committed to revolutionizing how food and beverage companies plan for the future through innovative marketing intelligence tools. Dedicated to empowering industry o

DevOps Engineer | 7 days ago

Role Description

The DevOps Team Lead reports to the Sr. Director, DevOps & IT. It is a hands-on technical lead position with no direct people-management responsibilities. The position is responsible for building and maintaining the infrastructure that powers Datassential’s applications and data platforms. We focus on automation, reliability, and scalability to ensure our systems remain secure, efficient, and resilient. Our goal is to provide a strong, stable foundation that allows our engineering teams to deliver high-quality products quickly and reliably.
- Build, manage, and enhance automated CI/CD pipelines on Buildkite and on AWS- and GCP-based infrastructure.
- Implement and manage cloud infrastructure using Terraform.
- Collaborate closely with development teams to understand requirements and troubleshoot application and infrastructure issues.
- Own uptime management and platform observability.
- Partner with other senior engineers on design, architecture, implementation, and security.
- Help organize the team's day-to-day operations.
- Own the health of our Jira boards.
- Represent DevOps in stakeholder meetings.
- Champion best practices and contribute to architectural direction.

Qualifications
- 8–12 years of experience in DevOps.
- Extensive hands-on experience managing AWS cloud infrastructure (GCP exposure is a plus).
- Expert-level proficiency with Infrastructure as Code (Terraform).
- Strong experience with modern CI/CD pipeline tooling (Buildkite preferred).
- Comfortable deploying and managing containerized workloads on AWS ECS.
- Solid Linux administration skills.
- Experienced with secure networking concepts.
- Hands-on experience with monitoring and observability platforms.
- Demonstrated experience coordinating technical work across a team.

Requirements
- Experience with security and compliance tools (Snyk, SonarCloud, Intruder.io, Vanta).
- Experience managing AWS IAM and familiarity with JumpCloud.
- Experience integrating Atlassian tools into CI/CD workflows.
- Experience supporting ML workloads and MLOps pipelines.
- Hands-on experience with AI developer tools.
- Previous experience using AWS Serverless services.
- Working knowledge of Kubernetes.

Benefits
- Competitive salary plus eligibility for a performance-linked annual bonus plan.
- Affordable Medical, Dental, and Vision Insurance, including a no-cost employee-only plan.
- Paid parental leave.
- 401K with dollar-for-dollar company match up to 5%.
- Unlimited PTO, recharge days, and Christmas through New Year break.
- 100% business travel reimbursement.
- Wellness Reimbursement or HQ Gym access.
- Remote stipend for 100% remote roles.

United States
$165K - $185K / year

SRE Specialist I

Inmetrics

We make a difference, solve outstanding problems and make the digital transformation of our clients possible.

DevOps Engineer | 7 days ago
Full Time | Remote | Team 501-1,000 | Since 2002 | H1B No Sponsor

• Technical Leadership and Best Practices: Serve as a technical reference for the team, supporting development and promoting Site Reliability Engineering (SRE) best practices.
• Availability and Performance: Ensure the availability, scalability, performance and security of the company's systems and infrastructure.
• Maintain a stable, reliable and secure environment for all users and services.
• DevOps Culture and Integration: Promote a DevOps culture, encouraging collaboration between development, infrastructure and information security teams.
• Automation and Monitoring: Implement and manage tools and processes for automation, monitoring and orchestration of infrastructure and applications.
• Incident Management and Continuous Improvement: Analyze incidents, identify root causes and propose preventive solutions to avoid recurrence.
• Related Activities: Perform other duties inherent to the role, contributing to the efficiency and continuous improvement of services and processes.

Brazil