Job Closed
This listing is no longer active.
Observability Engineer (Prometheus / Grafana / Datadog)
Location
United States
Posted
7 days ago
Salary
0
Seniority
Mid Level
Job Description
Observability Engineer (Prometheus / Grafana / Datadog)
Bright Vision Technologies
Role Description
We are looking for an Observability Engineer to design and operate the metrics, logging, tracing, and alerting platforms that give engineering teams confidence in the systems they run. The role spans the full observability stack, from collection agents and pipelines to long-term storage, dashboards, and alerting workflows, with a strong focus on usability, signal quality, and operational ROI. The ideal candidate has built and operated observability platforms at scale, understands the trade-offs between open-source and SaaS approaches, and can translate noisy telemetry into actionable insight for both engineers and business stakeholders.
Key Responsibilities
- Design and operate enterprise-grade observability platforms covering metrics, logs, traces, events, and synthetic monitoring.
- Architect Prometheus / Thanos / Mimir, Grafana, Loki, Tempo, OpenTelemetry, and Datadog deployments for high availability and scale.
- Develop standards for service instrumentation, including OpenTelemetry adoption, metric naming, label cardinality, and structured logging conventions.
- Define and enforce SLOs, SLIs, and error budgets, and build the dashboards and alerts that operationalize them.
- Build alerting strategies that minimize noise, surface actionable signals, and integrate cleanly with on-call workflows in PagerDuty, Opsgenie, or similar tools.
- Operate large-scale time-series and log storage platforms, balancing retention, query performance, and cost.
- Design distributed tracing pipelines and help teams use traces to diagnose latency and reliability issues.
- Develop self-service tooling, paved-road libraries, and templates that make adoption of observability standards easy for product teams.
- Drive cost management and label-cardinality discipline across the observability estate.
- Lead incident response readiness improvements through better dashboards, alerting hygiene, and post-incident analysis tooling.
- Partner with SRE and platform teams to integrate observability into deployment pipelines, canary analysis, and progressive delivery workflows.
- Evaluate and recommend observability vendors and open-source tools based on cost, capability, and operational maturity.
- Mentor engineering teams on observability fundamentals, debugging techniques, and SLO-driven operations.
- Maintain documentation, onboarding guides, and runbooks for the observability platform.
Qualifications
- Bachelor’s degree in Computer Science or a related field.
- Five or more years of experience in SRE, platform engineering, or observability roles.
- Deep hands-on experience with Prometheus, Grafana, and at least one major commercial observability platform such as Datadog, New Relic, or Splunk.
- Strong understanding of OpenTelemetry, distributed tracing, and structured logging.
- Proficiency in at least one general-purpose language such as Go, Python, or Java.
- Experience operating high-cardinality, high-throughput metrics and log pipelines.
- Strong understanding of SLOs, error budgets, and SRE principles.
- Experience integrating observability with CI/CD and incident management tooling.
- Solid grasp of Linux internals, networking, and container platforms.
- Excellent communication and collaboration skills.
Preferred Qualifications
- Experience with Thanos, Mimir, Cortex, Loki, or Tempo at scale.
- Contributions to OpenTelemetry or observability open-source projects.
- Familiarity with eBPF-based observability tooling.
- Experience driving observability cost optimization initiatives.
- Exposure to regulated environments with audit-grade logging requirements.
Requirements
- No new H1B sponsorship available. H1B transfers welcomed for qualified candidates.
- This is a 100% remote, full-time, direct W2 position with Bright Vision Technologies.
- Candidates must be willing to work directly as a full-time W2 employee of Bright Vision Technologies and contribute to our in-house SOW deliverables.
- For every role, a technical coding assessment is mandatory.
How to Apply
Would you like to know more about this opportunity? For immediate consideration, please send your resume to [email protected] or contact us at (908) 698-4899. Learn more about Bright Vision Technologies at www.bvteck.com.
More Engineer Jobs
Senior Site Reliability Engineer
Devsu
Devsu is a technology agency that provides software development services, IT augmentation and staffing.
Role Description
We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP). This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments. As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.
Responsibilities
- Monitoring & Observability (Core Focus)
  - Own and operate the monitoring and observability stack across on-prem and GCP environments
  - Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
  - Define, tune, and maintain alerts to ensure high signal-to-noise ratio
  - Establish observability standards and best practices across teams
  - Improve visibility into system health, performance, and reliability
- Site Reliability Engineering
  - Apply SRE principles to improve availability, performance, and resilience
  - Define and track SLIs, SLOs, and error budgets
  - Participate in on-call rotations and SEV incident response
  - Lead or contribute to incident investigations and root cause analysis (RCA)
  - Drive preventative actions to reduce repeat incidents
- Kubernetes & Platform Reliability
  - Support and monitor Kubernetes environments (GKE and on-prem clusters)
  - Monitor cluster health, capacity, and resource utilization
  - Troubleshoot platform-level issues impacting application reliability
  - Collaborate with Platform and Engineering teams on reliability improvements
- Secondary Responsibilities (Backup Application Support)
  - Provide L2/L3 application support coverage during:
    - Support team resource shortages
    - High-severity incidents (SEVs)
    - Peak support periods or escalations
  - Triage and troubleshoot application issues using existing runbooks and dashboards
  - Collaborate with Application Support and Engineering teams during incidents
  - Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)
Qualifications
- Strong experience as a Site Reliability Engineer or Reliability Engineer
- Deep hands-on expertise with Grafana (dashboards, alerting, troubleshooting)
- Solid experience with monitoring and observability systems
- Production experience operating Kubernetes environments
- Experience supporting systems in GCP and on-prem environments (mandatory)
- Strong Linux systems and troubleshooting skills
- Fluent English (written and spoken)
- Ability to work in PST time zone
- Ability to participate in an on-call rotation that includes coverage for one weekend day
Requirements
- Technology Stack:
  - Observability: Grafana, Prometheus, logging platforms
  - Containers: Kubernetes (GKE and on-prem)
  - Cloud: Google Cloud Platform (GCP)
  - Operations: Linux, networking, infrastructure monitoring
  - Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)
- Nice to have:
  - Experience supporting application teams during SEV incidents
  - Knowledge of capacity planning and performance tuning
  - Scripting skills (Python, Bash, etc.)
  - Experience with hybrid infrastructure environments
Benefits
- A stable, long-term contract with opportunities for career growth
- Private health insurance
- A remote-friendly culture that promotes work-life balance
- Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
- Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
- A flexible Paid Time Off (PTO) policy as well as paid holiday days
- Challenging, world-class software projects for clients in the US and LatAm
- Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment
Senior QA Engineer
Encora Digital
Encora, a leader in digital engineering, drives innovation by crafting cutting-edge, cloud-first, data-first, and AI-first solutions that redefine industries. Since its inception i
Role Description
We at Coforge are hiring a Senior QA Engineer (#20628) with the following skill set.
- Design, execute, and maintain automated and manual testing strategies to ensure software quality, reliability, and performance across applications.
- Develop and maintain automation frameworks using Playwright with TypeScript/JavaScript and perform API testing using Postman collections, environments, and assertions.
- Validate data integrity through SQL queries and collaborate with engineering teams to integrate testing across SDLC and STLC phases.
- Support quality engineering best practices by contributing to CI/CD-enabled testing workflows, performance testing initiatives, and continuous quality improvements.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or equivalent practical experience.
- 5+ years of experience in software quality assurance, including manual and automation testing.
- Strong hands-on experience with Playwright automation using TypeScript and/or JavaScript.
- Practical experience with API testing using Postman, including collections, environments, assertions, and validation workflows.
- Strong understanding of Software Development Life Cycle (SDLC) and Software Testing Life Cycle (STLC).
- Experience integrating testing practices throughout software development and release processes.
- Intermediate to advanced SQL skills for data validation, querying, troubleshooting, and database verification.
- Familiarity with DevOps concepts, CI/CD pipelines, and automated testing integration practices.
- Strong analytical, debugging, troubleshooting, and problem-solving skills.
- Strong communication and collaboration skills with experience working in Agile environments.
Requirements
- Exposure to performance testing methodologies and load testing concepts.
- Familiarity with security testing principles and practices.
- Experience supporting automated testing strategies in CI/CD environments.
- Knowledge of software quality engineering best practices and scalable testing frameworks.
Company Description
At Coforge, we hire professionals based solely on their skills and do not discriminate based on age, disability, religion, gender, sexual orientation, socioeconomic status, or nationality.
Data Engineer
ília
We are specialists in technology, data, and design, and for more than 10 years we have driven the digital transformation of major market players in the financial, insurance, and mobility sectors. With more than 450 professionals, we operate in Brazil and Europe, serving the Latin American, European, and North American markets and developing high-quality digital products focused on business results. Certified as a Great Place to Work for the 7th consecutive year, we at ília believe that people change the world, and we invest in them. Our awesome deliveries are made by people for people; after all, awesome people make awesome deliveries!
Role Description
This position is strategic for a technically complex initiative at one of the largest players in the banking sector. The professional will work in both a consulting and a hands-on capacity, serving as a reference for implementing large-scale data solutions and optimizing use of the Databricks platform in critical areas such as CRM and Security.
Responsibilities
- Develop and optimize complex data pipelines on the Databricks platform.
- Act as a consultant in architecting solutions that unblock business challenges through data.
- Ensure the technical excellence, scalability, and performance of deliverables in line with strict governance standards.
- Implement Lakehouse architectures and manage the data lifecycle (Bronze, Silver, and Gold layers).
- Design and maintain CI/CD workflows applied to data engineering and data quality monitoring.
Qualifications
- Solid experience with and deep command of Databricks (hands-on experience with the platform).
- Proven experience in data engineering and Big Data architectures (Spark, Delta Lake, Python/PySpark).
- Databricks certifications are mandatory (e.g., Data Engineering Associate or Professional).
- Experience with complex data integrations and cloud architecture (AWS or Azure).
- Track record on high-criticality, high-complexity projects.
- Advanced SQL knowledge for performance tuning and query optimization in distributed environments.
Benefits
- CLT employment contract, 40 hours per week with flexible hours, performed remotely.
- SulAmérica health and dental plan, extended to dependents.
- Food/meal allowance on a flexible Caju benefits card.
- Life insurance.
- Home-office allowance on a flexible Caju benefits card.
- Wellhub (Gympass).
- Sesc: extended to dependents, with access to services throughout Brazil.
- TotalPass.
- ília University: corporate university with more than 20,000 courses available for personal and professional development.
- Language Academy: language school for ílians.
- í-talks and Chapter: moments and ceremonies in which the team shares practices, studies, projects, and ideas.
- Pet health plan: Guapeco.
- Onhappy: leisure travel benefit, with the freedom to travel with whomever you like.
- BYOD: we rent your personal notebook, paying you a monthly amount for you to use it.
- Your birthday, your cake!
- Seu Networking Vale Prêmio: reward program for referring new employees.
Senior Support Engineer
Redpanda Data
[formerly Vectorized] The streaming platform for developers. Kafka compatible. Safe. 10x faster.
Role Description
As a Technical Support Engineer at Redpanda, you will help our organization embody our commitment to delivering exceptional customer-centric technical support. Reporting directly to the Director, Technical Support, you will be a vital contributor to our growing support team in the Customer Success organization. In this role, you will play a fundamental part in ensuring our customers' success, fostering their confidence in our solutions, and elevating their overall experience. Your primary focus will be to leverage your technical expertise to provide world-class support for Redpanda’s range of products and services. Your ability to understand and address our customers' needs and technical challenges will be at the heart of our customer-centric approach.
- Be the primary face of our organization to our customers to ensure we meet and/or exceed customer expectations on the Redpanda operation.
- Work with engineering to drive and solve customer challenges from creation through resolution.
- Partner with product engineering groups on periodic root cause analysis on customer issues, and distill lessons learned for the rest of the organization.
- Build tools & services to create and improve support infrastructure, from issue life cycles to trending on root causes.
- Participate in on-call rotations to follow the sun in support of our customers.
- Ensure customer satisfaction through strong relationships with our Customer Success team.
Qualifications
- 3+ years of experience in L3 support of enterprise products, with a significant focus on distributed systems.
- Strong understanding of Linux troubleshooting commands and regular expressions (grep/awk/sed).
- Experience with deploying and troubleshooting applications in Kubernetes.
- Strong experience with public cloud providers and containerization.
- Proficiency in bash scripting and/or Python.
- Willingness to participate in an on-call rotation.
- Excellent written communication skills.
- Comfortable working with a 100% distributed engineering team and remote-first company.
- Experience using AI tooling to automate repetitive tasks or enhance troubleshooting.
Nice to Have
- Proficiency with Go.
- Experience supporting a SaaS platform.
- Experience supporting a streaming platform.
Benefits
- Join Redpanda if you’d enjoy being part of a fast-moving, diverse, people-first organization with team members around the globe and a culture based on trust, transparency, communication, and kindness.

