Capital.com logo
Capital.com

We are making the world of finance more accessible, engaging, and useful with an award-winning trading platform and app.

Senior SRE Engineer – Observability Focus

DevOps EngineerDevOps EngineerFull TimeRemoteSeniorTeam 501-1,000H1B No SponsorCompany SiteLinkedIn

Location

Poland

Posted

1 day ago

Salary

0

Seniority

Senior

Job Description

Senior SRE Engineer – Observability Focus

Capital.com

• Own the full observability stack: metrics (VictoriaMetrics), logs (OpenSearch), and traces (OpenTelemetry) — from pipeline design to day-2 operations. • Architect and run VictoriaMetrics cluster topology (vmstorage/vminsert/vmselect), including vmagent scraping, remote write configuration, vmalert rules, and cardinality control. • Operate OpenSearch clusters: index lifecycle management (ISM), hot-warm-cold architecture, shard tuning, and ingest pipelines via Data Prepper. • Build and maintain OTEL Collector pipelines — receivers, processors, exporters — and instrument services across Java, Python, and JS/TS stacks (auto and manual). • Run Kafka as the telemetry transport layer (OTEL Collector → Kafka → backends), including topic design, partition strategy, consumer group lag monitoring, and throughput tuning for high-volume telemetry. • Manage log shipping infrastructure using Fluent Bit, Vector, or Fluentd; define structured logging standards and field normalization across services. • Build Grafana dashboards and alerting that engineers actually use — clear, actionable, with well-structured variables and thresholds. • Work with platform and application teams to improve sampling strategies (head/tail), batching, and context propagation across distributed services. • Contribute to incident response, post-mortems, and reliability improvements driven by observability signals. • Mentor engineers on observability practices, tooling, and structured logging standards.

Job Requirements

  • 6+ years in a DevOps, SRE, or platform engineering role, with at least 2 years focused on observability tooling at production scale.
  • Deep hands-on experience with VictoriaMetrics (or Prometheus) — MetricsQL/PromQL, exporters, service discovery, remote write, downsampling, and retention management.
  • Solid OpenSearch or Elasticsearch skills: cluster operations, Query DSL, ISM policies, and ingest pipeline design.
  • Production experience with OpenTelemetry: Collector configuration, OTLP, context propagation, and instrumentation across multiple languages.
  • Strong Kafka skills — producer/consumer patterns, consumer group management, Kafka Connect, Schema Registry, and JMX-based monitoring. Strimzi experience a plus if you've run Kafka on Kubernetes.
  • Proficiency with log shippers (Fluent Bit, Vector, Fluentd) and structured log parsing/normalization.
  • Working knowledge of Kubernetes (operators, Helm), Argo CD/GitOps, and Terraform/Ansible.
  • Comfortable in a hybrid AWS + on-prem environment; solid understanding of networking as it applies to scraping and shipping pipelines.
  • Scripting ability in Bash or Python for automation and tooling.
  • Strong communication skills — you can explain observability tradeoffs clearly to engineers and non-engineers alike.
  • English proficiency.

Benefits

  • Competitive Salary: We believe great work deserves great pay! Your skills and talents will be rewarded with a salary that makes you feel valued and motivated.
  • Work-Life Harmony: Join a company that genuinely cares about you - because your life outside of work matters just as much as your time on the clock. #LI-Hybrid
  • Generous Time Off: Need a breather? Our annual leave policy lets you recharge and enjoy life outside of work without a worry.
  • Employee Referral Program: Love working here? Share the love! Bring your talented friends on board and get rewarded for growing our awesome team.
  • Comprehensive Health & Pension Benefits: From medical insurance to pension plans, we’ve got your back. Plus, location-specific benefits and perks!
  • Workation Wonderland: Live your digital nomad dreams with 30 extra days to work remotely from anywhere in the world (some restrictions apply). Adventure awaits!
  • Volunteer Days: Make a difference! Take two additional paid days each year to support causes you care about and give back to the community.

Related Categories

Related Job Pages

More DevOps Engineer Jobs

Full TimeRemoteTeam 51-200Since 2020H1B No Sponsor

• Join fast-paced teams powering an AI-driven digital insurance marketplace • Designing and implementing advanced cloud architectures on AWS and GCP • Building internal developer tools • Advancing Infrastructure-as-Code practices • Driving automation across the full workload lifecycle • Eliminating single points of failure • Continuously optimizing cloud infrastructure • Evaluating emerging technologies with real influence over architectural decisions

Portugal
€2K - €2.5K / month
Full TimeRemoteTeam 501-1,000Since 2005H1B No Sponsor

• Partner with Ads Engineering teams to improve reliability, scalability, and operational excellence of ad-serving, auction, targeting, measurement, and billing systems. • Design, build, and maintain infrastructure, tooling, and automation that improve service reliability and engineering productivity. • Improve observability through monitoring, alerting, tracing, logging, and dashboards. • Participate in on-call rotations and lead incident response efforts for critical production systems. • Run root cause analysis and drive corrective actions following incidents. • Collaborate with software engineers throughout the service lifecycle, from design reviews through production operations. • Drive adoption of SRE best practices including SLIs, SLOs, error budgets, capacity planning, and operational readiness reviews. • Reduce operational toil through automation and self-service tooling. • Help define and measure advertiser-critical user journeys such as campaign creation, ad delivery, reporting, and billing. • Scale Ads systems to support continued traffic growth, increased advertiser demand, and evolving business requirements.

Netherlands
Full TimeRemoteTeam 501-1,000Since 2005H1B No Sponsor

• Partner with Ads Engineering teams to improve reliability, scalability, and operational excellence of ad-serving, auction, targeting, measurement, and billing systems. • Design, build, and maintain infrastructure, tooling, and automation that improve service reliability and engineering productivity. • Improve observability through monitoring, alerting, tracing, logging, and dashboards. • Participate in on-call rotations and lead incident response efforts for critical production systems. • Run root cause analysis and drive corrective actions following incidents. • Collaborate with software engineers throughout the service lifecycle, from design reviews through production operations. • Drive adoption of SRE best practices including SLIs, SLOs, error budgets, capacity planning, and operational readiness reviews. • Reduce operational toil through automation and self-service tooling. • Help define and measure advertiser-critical user journeys such as campaign creation, ad delivery, reporting, and billing. • Scale Ads systems to support continued traffic growth, increased advertiser demand, and evolving business requirements.

Ireland
Full TimeRemoteTeam 501-1,000Since 2005H1B No Sponsor

• Partner with Ads Engineering teams to improve reliability, scalability, and operational excellence of ad-serving, auction, targeting, measurement, and billing systems. • Design, build, and maintain infrastructure, tooling, and automation that improve service reliability and engineering productivity. • Improve observability through monitoring, alerting, tracing, logging, and dashboards. • Participate in on-call rotations and lead incident response efforts for critical production systems. • Run root cause analysis and drive corrective actions following incidents. • Collaborate with software engineers throughout the service lifecycle, from design reviews through production operations. • Drive adoption of SRE best practices including SLIs, SLOs, error budgets, capacity planning, and operational readiness reviews. • Reduce operational toil through automation and self-service tooling. • Help define and measure advertiser-critical user journeys such as campaign creation, ad delivery, reporting, and billing. • Scale Ads systems to support continued traffic growth, increased advertiser demand, and evolving business requirements.

United Kingdom