Quantiphi

Pioneering AI-first solutions, solving complex business challenges through expertise, cloud, data engineering, and AI.

DevOps/Observability Engineer

DevOps Engineer · Full Time · Remote · Mid Level · Team 1,001-5,000 · Since 2013 · H1B Sponsor

Location

United States

Posted

3 days ago

Salary

Not specified

Seniority

Mid Level

Job Description

Role Description

We are seeking a highly experienced Senior DevOps/Observability Engineer with over 8 years of experience to lead the design and implementation of our next-generation, unified observability platform. This pivotal role will focus on architecting a sophisticated observability pipeline from the ground up, leveraging a modern, open-source-centric stack on Amazon Web Services (AWS). The ideal candidate will have deep expertise in designing and deploying observability solutions, with a strong emphasis on OpenTelemetry (OTel) and Kubernetes observability. You will be responsible for deploying, configuring, and integrating a suite of tools including Prometheus, Grafana, and Splunk to provide comprehensive insights into our complex, distributed systems. This is a hands-on role for a technical leader who is passionate about building scalable, reliable, and efficient monitoring and logging systems.

Qualifications

- Unified Pipeline Architecture: Proven ability to design and implement end-to-end observability pipelines using OpenTelemetry, Prometheus, and Grafana on centralized infrastructure.
- Cross-Account AWS Observability: Deep expertise in centralizing AWS telemetry, including multi-account CloudTrail organization trails, cross-account CloudWatch metrics/logs, and VPC Flow Logs.
- Log Aggregation & Routing: Strong experience designing log aggregation strategies, implementing noise reduction/filtering at the collector level, and configuring Splunk HTTP Event Collector (HEC) integrations.
- Advanced Alerting & Dashboarding: Hands-on experience building comprehensive alerting frameworks using Alertmanager and CloudWatch Alarms, coupled with advanced dashboard engineering in Grafana (using PromQL).
- Infrastructure as Code (IaC): Advanced proficiency in writing Terraform modules specifically for deploying and managing observability stacks and EC2 infrastructure.

Requirements

- Enterprise Scale Log Management: Demonstrated experience managing, routing, and optimizing log pipelines at massive scale (TB/day).
- Kubernetes/Container Observability: Experience deploying Prometheus and OTel within Kubernetes (EKS) or containerized (ECS) environments.
- Cost Optimization: Proven track record of reducing observability spend through strategic metric dropping, log filtering, and efficient storage tiering.

Benefits

- Join one of the world’s fastest-growing AI-first digital engineering companies and make a real impact at scale.
- Lead and collaborate with a high-energy team of talented, driven individuals solving complex, meaningful challenges.
- Work with Fortune 500 companies and disruptive innovators in a research-driven environment with 60+ patents.
- Stay ahead of the curve by gaining hands-on experience with cutting-edge AI, ML, data, and cloud technologies while continuously upskilling.
- If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!

More DevOps Engineer Jobs

Senior Cloud DevOps Engineer

OneStream Software

A comprehensive cloud-based platform to modernize the Office of the CFO.

DevOps Engineer · 3 days ago
Full Time · Remote · Team 1,001-5,000 · H1B No Sponsor

• Develop and maintain Infrastructure-as-Code using tools and languages such as Terraform, PowerShell, ARM, Bicep, Bash, and YAML
• Deliver high-quality implementations in a timely manner
• Design and maintain CI/CD pipelines supporting secure, reliable, and repeatable deployments
• Update technical documentation, workflows, and knowledge base articles
• Build knowledge in focused areas of the OneStream platform and deployment stack
• Participate in collaborative engineering, peer reviews, and knowledge sharing initiatives
• Collaborate with other teams to define, estimate, and implement requirements for new automations or services needed for development
• Apply software engineering best practices to infrastructure and automation development
• Optimize cloud environments for scalability, reliability, and cost efficiency
• Participate in troubleshooting and resolution of complex production issues across cloud platforms and services
• Work with Compliance and Security teams to ensure compliance with required controls

United States
$131K - $170K / year

Staff DevSecOps Engineer

Redox

Welcome to composable healthcare.

DevOps Engineer · 3 days ago
Full Time · Remote · Team 201-500 · Since 2014 · H1B Sponsor

• Champion a security-first mindset within Engineering to help set the security posture of our platform infrastructure: supply chain hardening, secrets management, IAM/IRSA, container image integrity, and vulnerability remediation across our AWS/EKS environment
• Design and build automation that makes compliance evidence continuous rather than manual, translating HITRUST controls into passing tests and structured outputs that flow into our compliance tooling (Vanta)
• Embed security into the platform by default: make the secure path the easy path for application engineers, through guardrails, policy-as-code, and well-documented patterns
• Partner with our Security team to translate threat assessments and control gaps into engineering proposals with clear scope, tradeoffs, and recommended paths forward
• Lead platform security initiatives from design to operationalization: requirements, technical design, code and code review, deployment, and documentation
• Contribute hands-on to the broader platform (CI/CD pipelines, container orchestration, observability, and developer tooling); this is an IC role, not a governance role
• Participate in on-call rotation and own the systems you build, including production incidents
• Mentor engineers on security practices and raise the security baseline across the team

United States
$190K - $199K / year

Senior Reliability Operations Engineer

Serve Robotics

Meet the future of sustainable, self-driving delivery.

DevOps Engineer · 3 days ago
Full Time · Remote · Team 51-200 · Since 2017 · H1B Sponsor

• Serve as the primary incident lead during your region’s daytime hours, coordinating technical investigations, centralizing communication, and engaging the appropriate engineering and SRE teams when escalation is required.
• Respond to escalations from Tier 1 support, using runbooks, metrics, logs, and system diagnostics to investigate and remediate issues or determine when escalation to Tier 3 is necessary.
• Develop and update runbooks, workflows, and operational documentation to ensure consistent and reliable responses to recurring issues, collaborating with product teams to expand coverage over time.
• Write, maintain, and enhance automation scripts and tools that streamline common remediation steps, improve response times, and reduce manual operational overhead.
• Use metrics, logs, and tracing tools (Grafana/Prometheus, GCP Monitoring, OpenTelemetry) to proactively identify problems, validate system behavior, and support continuous improvement of detection mechanisms.
• Act as the central point of communication during active incidents, ensuring timely updates and clear routing to the correct product engineering and SRE stakeholders.
• Collaborate with reliability and product teams to share insights, recommend improvements, and help refine processes that enhance the stability and operability of our systems.
• Participate in a shared weekend on-call rotation to help maintain operational coverage for production systems, responding to incidents and escalations as needed and coordinating with engineering teams when issues arise.
• Help establish operational best practices, refine workflows, and prepare the foundation for a broader reliability operations function.

Malaysia
RM90K - RM110K / year

Reliability Operations Engineer

Serve Robotics

Meet the future of sustainable, self-driving delivery.

DevOps Engineer · 3 days ago
Full Time · Remote · Team 51-200 · Since 2017 · H1B Sponsor

• Lead incident investigations during your region’s daytime hours, providing timely updates, escalating appropriately, and supporting senior engineers leading the response.
• Respond to escalations from Tier 1 support using established runbooks, metrics, logs, and diagnostics to remediate issues or escalate to Tier 3 when needed.
• Update runbooks and operational documentation based on new issues, discoveries, and feedback, ensuring clarity and consistency across all procedures.
• Run existing automations and collaborate with senior team members to enhance tooling and scripts that streamline troubleshooting and remediation tasks.
• Use observability tools such as Grafana/Prometheus, GCP Monitoring, and OpenTelemetry to interpret metrics, logs, and traces, helping identify anomalies and validate system performance.
• Provide concise, accurate updates during incidents, ensuring information reaches the correct engineering and SRE contacts and supporting structured incident coordination.
• Participate in discussions around root causes, share operational insights, and contribute to process improvements that enhance system stability and supportability.
• Participate in a shared weekend on-call rotation to help maintain operational coverage for production systems, responding to incidents and escalations as needed and coordinating with engineering teams when issues arise.
• Proactively strengthen workflows, adopt best practices, and build the foundation of the Reliability Operations function as it evolves.

Malaysia
RM80K - RM100K / year