Cribl, the Data Engine for IT and Security, empowers organizations to transform their data strategy.
Senior Site Reliability Engineer
Location
Poland
Posted
15 days ago
Seniority
Senior
Job Description
Cribl
- Engage with teams and improve service delivery and reliability across their entire lifecycle
- Measure and monitor all production systems with an eye towards availability, latency and overall system health
- Seek out the cause of errors and instability in our production cloud services and drive teams towards better operational excellence
- Engage with product and platform teams to improve and evolve systems by lobbying for changes that improve reliability, resilience, and observability
- Help identify and drive down toil with creative innovation and automation
- This position will require stand-by, on-call, or off-hours duties
Job Requirements
- Proven experience designing, implementing, and operating observability systems for complex cloud-based platforms
- Experience with configuration management and Infrastructure as Code tools such as Terraform (preferred) or Ansible
- Knowledge of cloud platforms (AWS and Azure preferred)
- Experience with APM, observability, and related tools such as New Relic, Splunk, CloudWatch, Prometheus, Grafana/Kibana, and Sentry
- Extensive experience with enterprise scale continuous delivery environments
- Development with JavaScript/Node.js/TypeScript in a Linux/Mac environment
- Experience with sustainable incident response in a blameless environment
- Background in Linux Systems Engineering
- Experience with incident response tools such as PagerDuty, FireHydrant, and Blameless
- Comfortable with a high level of autonomy and working with a distributed team
- Knowledge of Cloud and application security best practices
- Strong knowledge of cloud design patterns for scale, data management, resiliency, etc.
- A love for high quality and a knack for testing
- Opinions about business metrics and SLOs
Benefits
- Diversity drives innovation and better decisions
- Remote-first culture
- Welcoming and valuing differences