We are making the world of finance more accessible, engaging, and useful with an award-winning trading platform and app.
Senior SRE Engineer – Observability Focus
Location
Poland
Posted
1 day ago
Seniority
Senior
Job Description
Capital.com
- Own the full observability stack: metrics (VictoriaMetrics), logs (OpenSearch), and traces (OpenTelemetry), from pipeline design to day-2 operations.
- Architect and run VictoriaMetrics cluster topology (vmstorage/vminsert/vmselect), including vmagent scraping, remote write configuration, vmalert rules, and cardinality control.
- Operate OpenSearch clusters: Index State Management (ISM), hot-warm-cold architecture, shard tuning, and ingest pipelines via Data Prepper.
- Build and maintain OTEL Collector pipelines (receivers, processors, exporters) and instrument services across Java, Python, and JS/TS stacks (auto and manual).
- Run Kafka as the telemetry transport layer (OTEL Collector → Kafka → backends), including topic design, partition strategy, consumer group lag monitoring, and throughput tuning for high-volume telemetry.
- Manage log shipping infrastructure using Fluent Bit, Vector, or Fluentd; define structured logging standards and field normalization across services.
- Build Grafana dashboards and alerting that engineers actually use: clear, actionable, with well-structured variables and thresholds.
- Work with platform and application teams to improve sampling strategies (head/tail), batching, and context propagation across distributed services.
- Contribute to incident response, post-mortems, and reliability improvements driven by observability signals.
- Mentor engineers on observability practices, tooling, and structured logging standards.
Job Requirements
- 6+ years in a DevOps, SRE, or platform engineering role, with at least 2 years focused on observability tooling at production scale.
- Deep hands-on experience with VictoriaMetrics (or Prometheus) — MetricsQL/PromQL, exporters, service discovery, remote write, downsampling, and retention management.
- Solid OpenSearch or Elasticsearch skills: cluster operations, Query DSL, ISM policies, and ingest pipeline design.
- Production experience with OpenTelemetry: Collector configuration, OTLP, context propagation, and instrumentation across multiple languages.
- Strong Kafka skills — producer/consumer patterns, consumer group management, Kafka Connect, Schema Registry, and JMX-based monitoring. Strimzi experience a plus if you've run Kafka on Kubernetes.
- Proficiency with log shippers (Fluent Bit, Vector, Fluentd) and structured log parsing/normalization.
- Working knowledge of Kubernetes (operators, Helm), Argo CD/GitOps, and Terraform/Ansible.
- Comfortable in a hybrid AWS + on-prem environment; solid understanding of networking as it applies to scraping and shipping pipelines.
- Scripting ability in Bash or Python for automation and tooling.
- Strong communication skills — you can explain observability tradeoffs clearly to engineers and non-engineers alike.
- English proficiency.
Benefits
- Competitive Salary: We believe great work deserves great pay! Your skills and talents will be rewarded with a salary that makes you feel valued and motivated.
- Work-Life Harmony: Join a company that genuinely cares about you - because your life outside of work matters just as much as your time on the clock.
- Generous Time Off: Need a breather? Our annual leave policy lets you recharge and enjoy life outside of work without a worry.
- Employee Referral Program: Love working here? Share the love! Bring your talented friends on board and get rewarded for growing our awesome team.
- Comprehensive Health & Pension Benefits: From medical insurance to pension plans, we’ve got your back. Plus, location-specific benefits and perks!
- Workation Wonderland: Live your digital nomad dreams with 30 extra days to work remotely from anywhere in the world (some restrictions apply). Adventure awaits!
- Volunteer Days: Make a difference! Take two additional paid days each year to support causes you care about and give back to the community.


