Quantiphi

Pioneering AI-first solutions, solving complex business challenges through expertise, cloud, data engineering, and AI.

DevOps/Observability Engineer

DevOps Engineer | Full Time | Remote | Mid Level | Team 1,001-5,000 | Since 2013 | H1B Sponsor

Location

United States

Posted

3 days ago

Seniority

Mid Level

Job Description

Role Description

We are seeking a highly experienced Senior DevOps/Observability Engineer with over 8 years of experience to lead the design and implementation of our next-generation, unified observability platform. This pivotal role will focus on architecting a sophisticated observability pipeline from the ground up, leveraging a modern, open-source-centric stack on Amazon Web Services (AWS). The ideal candidate will have deep expertise in designing and deploying observability solutions, with a strong emphasis on OpenTelemetry (OTel) and Kubernetes observability. You will be responsible for deploying, configuring, and integrating a suite of tools including Prometheus, Grafana, and Splunk to provide comprehensive insights into our complex, distributed systems. This is a hands-on role for a technical leader who is passionate about building scalable, reliable, and efficient monitoring and logging systems.

Qualifications

- Unified Pipeline Architecture: Proven ability to design and implement end-to-end observability pipelines using OpenTelemetry, Prometheus, and Grafana on centralized infrastructure.
- Cross-Account AWS Observability: Deep expertise in centralizing AWS telemetry, including multi-account CloudTrail organization trails, cross-account CloudWatch metrics/logs, and VPC Flow Logs.
- Log Aggregation & Routing: Strong experience designing log aggregation strategies, implementing noise reduction/filtering at the collector level, and configuring Splunk HTTP Event Collector (HEC) integrations.
- Advanced Alerting & Dashboarding: Hands-on experience building comprehensive alerting frameworks using Alertmanager and CloudWatch Alarms, coupled with advanced dashboard engineering in Grafana (using PromQL).
- Infrastructure as Code (IaC): Advanced proficiency in writing Terraform modules specifically for deploying and managing observability stacks and EC2 infrastructure.

Requirements

- Enterprise Scale Log Management: Demonstrated experience managing, routing, and optimizing log pipelines at massive scale (TB/day).
- Kubernetes/Container Observability: Experience deploying Prometheus and OTel within Kubernetes (EKS) or containerized (ECS) environments.
- Cost Optimization: Proven track record of reducing observability spend through strategic metric dropping, log filtering, and efficient storage tiering.

Benefits

- Join one of the world’s fastest-growing AI-first digital engineering companies and make a real impact at scale.
- Lead and collaborate with a high-energy team of talented, driven individuals solving complex, meaningful challenges.
- Work with Fortune 500 companies and disruptive innovators in a research-driven environment with 60+ patents.
- Stay ahead of the curve by gaining hands-on experience with cutting-edge AI, ML, data, and cloud technologies while continuously upskilling.
- If you like wild growth and working with happy, enthusiastic over-achievers, you'll enjoy your career with us!

More DevOps Engineer Jobs

Senior Cloud DevOps Engineer

OneStream Software

A comprehensive cloud-based platform to modernize the Office of the CFO.

DevOps Engineer | Posted 3 days ago

Full Time | Remote | Team 1,001-5,000 | H1B No Sponsor

• Develop and maintain Infrastructure-as-Code using Terraform, PowerShell, ARM, Bicep, Bash, and YAML
• Deliver high-quality implementations in a timely manner
• Design and maintain CI/CD pipelines supporting secure, reliable, and repeatable deployments
• Update technical documentation, workflows, and knowledge base articles
• Build knowledge in focused areas of the OneStream platform and deployment stack
• Participate in collaborative engineering, peer reviews, and knowledge sharing initiatives
• Collaborate with other teams to define, estimate, and implement requirements for new automations or services needed for development
• Apply software engineering best practices to infrastructure and automation development
• Optimize cloud environments for scalability, reliability, and cost efficiency
• Participate in troubleshooting and resolution of complex production issues across cloud platforms and services
• Work with Compliance and Security teams to ensure compliance with required controls

Ansible, AWS, Azure, Chef, Cloud, Google Cloud Platform, Kubernetes, MS SQL Server, OpenShift, Puppet, Python, SQL, Terraform

United States

$131K - $170K / year

Staff DevSecOps Engineer

Redox

Welcome to composable healthcare.

DevOps Engineer | Posted 3 days ago

Full Time | Remote | Team 201-500 | Since 2014 | H1B Sponsor

• Champion a security-first mindset within Engineering to help set the security posture of our platform infrastructure: supply chain hardening, secrets management, IAM/IRSA, container image integrity, and vulnerability remediation across our AWS/EKS environment
• Design and build automation that makes compliance evidence continuous, not manual, translating HITRUST controls into passing tests and structured outputs that flow into our compliance tooling (Vanta)
• Embed security into the platform by default: make the secure path the easy path for application engineers, through guardrails, policy-as-code, and well-documented patterns
• Partner with our Security team to translate threat assessments and control gaps into engineering proposals with clear scope, tradeoffs, and recommended paths forward
• Lead platform security initiatives from design to operationalization: requirements, technical design, code and code review, deployment, and documentation
• Contribute hands-on to the broader platform (CI/CD pipelines, container orchestration, observability, and developer tooling); this is an IC role, not a governance role
• Participate in on-call rotation and own the systems you build, including production incidents
• Mentor engineers on security practices and raise the security baseline across the team

AWS, Cloud, JavaScript, Kubernetes, Node.js, Python, Terraform, TypeScript, Go

United States

$190K - $199K / year

Senior Reliability Operations Engineer

Serve Robotics

Meet the future of sustainable, self-driving delivery.

DevOps Engineer | Posted 3 days ago

Full Time | Remote | Team 51-200 | Since 2017 | H1B Sponsor

• Serve as the primary incident lead during your region’s daytime hours, coordinating technical investigations, centralizing communication, and engaging the appropriate engineering and SRE teams when escalation is required.
• Respond to escalations from Tier 1 support, using runbooks, metrics, logs, and system diagnostics to investigate and remediate issues or determine when escalation to Tier 3 is necessary.
• Develop and update runbooks, workflows, and operational documentation to ensure consistent and reliable responses to recurring issues, collaborating with product teams to expand coverage over time.
• Write, maintain, and enhance automation scripts and tools that streamline common remediation steps, improve response times, and reduce manual operational overhead.
• Use metrics, logs, and tracing tools (Grafana/Prometheus, GCP Monitoring, OpenTelemetry) to proactively identify problems, validate system behavior, and support continuous improvement of detection mechanisms.
• Act as the central point of communication during active incidents, ensuring timely updates and clear routing to the correct product engineering and SRE stakeholders.
• Collaborate with reliability and product teams to share insights, recommend improvements, and help refine processes that enhance the stability and operability of our systems.
• Participate in a shared weekend on-call rotation to help maintain operational coverage for production systems, responding to incidents and escalations as needed and coordinating with engineering teams when issues arise.
• Help establish operational best practices, refine workflows, and prepare the foundation for a broader reliability operations function.

Cloud, Distributed Systems, Google Cloud Platform, Grafana, Linux, Microservices, Prometheus

Malaysia

RM90K - RM110K / year

Reliability Operations Engineer

Serve Robotics

Meet the future of sustainable, self-driving delivery.

DevOps Engineer | Posted 3 days ago

Full Time | Remote | Team 51-200 | Since 2017 | H1B Sponsor

• Lead incident investigations during your region’s daytime hours, providing timely updates, escalating appropriately, and supporting senior engineers leading the response.
• Respond to escalations from Tier 1 support using established runbooks, metrics, logs, and diagnostics to remediate issues or escalate to Tier 3 when needed.
• Update runbooks and operational documentation based on new issues, discoveries, and feedback, ensuring clarity and consistency across all procedures.
• Run existing automations and collaborate with senior team members to enhance tooling and scripts that streamline troubleshooting and remediation tasks.
• Use observability tools such as Grafana/Prometheus, GCP Monitoring, and OpenTelemetry to interpret metrics, logs, and traces, helping identify anomalies and validate system performance.
• Provide concise, accurate updates during incidents, ensuring information reaches the correct engineering and SRE contacts and supporting structured incident coordination.
• Participate in discussions around root causes, share operational insights, and contribute to process improvements that enhance system stability and supportability.
• Participate in a shared weekend on-call rotation to help maintain operational coverage for production systems, responding to incidents and escalations as needed and coordinating with engineering teams when issues arise.
• Proactively strengthen workflows, adopt best practices, and build the foundation of the Reliability Operations function as it evolves.

Cloud, Google Cloud Platform, Grafana, Linux, Prometheus

Malaysia

RM80K - RM100K / year
