iCapital

Site Reliability Engineer - Vice President

DevOps Engineer | Full Time | Remote | Mid Level | Team 1,001-5,000

Location

Worldwide

Posted

72 days ago

Salary

Seniority

Mid Level

AWS, Kubernetes, Terraform, OpenTelemetry, Prometheus, Grafana, New Relic, Splunk, Amazon CloudWatch, PostgreSQL, MongoDB, DynamoDB

Job Description

Role Description

The Site Reliability Engineering team at iCapital is fundamental to ensuring our platform delivers consistent, reliable service to our client base. As a Site Reliability Engineer, you'll work at the intersection of software engineering and operations, applying engineering principles to infrastructure challenges. You'll be responsible for designing and implementing systems that scale efficiently, architecting observability solutions that provide actionable insights, and building automation that enhances our platform's reliability. This role requires someone who thinks systematically about reliability, can translate business requirements into technical implementations, and thrives on making complex systems more robust.

- Define, implement, and iterate service level objectives (SLOs) and service level indicators (SLIs) that reflect customer and business expectations.
- Lead monitoring and alerting standardization through “monitors as code” (Terraform preferred), including quality gates such as severity, ownership, and runbook links.
- Develop observability standards across metrics, logs, and traces, including instrumentation and dependency mapping patterns (OpenTelemetry where applicable).
- Lead technical evaluations and PoCs for observability platforms and integrations; define success criteria and migration approach for adoption.
- Define and implement reliability and operability standards for Kubernetes-based services, including scaling patterns, resource constraints, rollout safety, and baseline dashboards and alerts as part of service onboarding.
- Drive automation to eliminate toil, improve repeatability, and accelerate recovery (incident workflows, runbooks, and remediation where appropriate).
- Serve as Incident Commander for high-severity incidents, lead postmortems, and drive systemic improvements through action items and measurable follow-through using established tooling workflows.
- Participate in on-call rotations with a focus on improving reliability, reducing alert noise, and increasing signal quality over time.

Qualifications

- 7+ years in SRE or related roles, with evidence of technical seniority across multiple services and teams.
- Strong experience with AWS and container orchestration (Kubernetes) in production environments.
- Demonstrated experience defining SLOs/SLIs and using them to drive operational and engineering decisions.
- Proven ability to design and implement observability solutions that produce actionable insights while reducing alert fatigue and operational noise.
- Strong IaC skills (Terraform preferred) and the ability to build reusable automation and standards (monitoring as code, configuration patterns).
- Familiarity with common data stores and managed services (e.g., Postgres, MongoDB, DynamoDB) and how they fail in distributed systems.
- Experience with at least two observability stacks (Prometheus/Grafana, New Relic, Splunk, CloudWatch, ELK, etc.) and driving standardization across them.
- Strong incident response skills, including leading retrospectives/postmortems and improving reliability through systematic follow-up.
- Strong debugging skills across distributed systems and production environments, including performance and reliability investigations.
- Clear written and verbal communication skills with the ability to influence engineering teams through standards, tooling, and practical guidance.

Benefits

- Comprehensive benefits package including competitive salary, annual performance bonus, and equity for all full-time employees.
- Healthcare with 100% employer-paid health and dental insurance.
- Generous paid time off (PTO).

More DevOps Engineer Jobs

Senior Database Site Reliability Engineer

TherapyNotes, LLC

TherapyNotes™ is the industry-preferred online EHR for behavioral health. Try one month free!

DevOps Engineer | 72 days ago

Full Time | Remote | Team 51-200 | Since 2010 | H1B No Sponsor

• Design, implement, and maintain high-availability, high-throughput, data- and compute-intensive critical database systems running PostgreSQL that support a growing 24x7 SaaS platform.
• Define and improve database service reliability through monitoring/alerting, SLO-oriented metrics, and operational readiness.
• Participate in and help drive incident response, root cause analysis, and post-incident corrective actions for database-related production events.
• Partner with other technical leaders to ensure all newly introduced systems are supportable and maintainable by both development and operations.
• Provide escalated technical guidance and support to other technology teams throughout the organization.
• Provide on-call coverage for production support and other duties as required.
• Comply with HIPAA security policies within the database platform.
• Ensure all solutions and operational activities adhere to the security and operating policies established by the organization.
• Own and continuously improve our Datadog database observability by building actionable dashboards, alerts, and service-level views using an observability stack (e.g., Prometheus, Grafana, New Relic, or equivalent). Familiarity with pganalyze or Percona a plus.
• Automate system maintenance tasks using Bash, PowerShell, Python, or Ansible. Manage infrastructure as code (IaC) by writing Ansible playbooks. Some exposure to Terraform a plus.
• Experience writing and designing ETL pipelines using Python a plus.
• Experience understanding and maintaining PostgreSQL ecosystem components such as PgBouncer, pgBackRest, HAProxy, and repmgr a plus.
• Excellent communication and interpersonal skills.

Ansible, Azure, ETL, Grafana, HAProxy, ITSM, Kubernetes, Linux, PostgreSQL, Prometheus, Python, Terraform

United States

$120K - $160K / year

Senior Cloud DevOps Network Engineer – FedRAMP, Azure, Advanced Networking

OneStream Software

A comprehensive cloud-based platform to modernize the Office of the CFO.

DevOps Engineer | 72 days ago

Full Time | Remote | Team 1,001-5,000 | H1B No Sponsor

• Lead the design, continuous monitoring, implementation, and security operations of Azure cloud solutions, ensuring they meet industry best practices and comply with FedRAMP High and IL4 requirements
• Lead the team in developing modular Infrastructure-as-Code using Terraform, PowerShell, ARM, Bicep, and YAML
• Lead projects of moderate complexity to completion
• Sustain a high level of reliability for key automated systems
• Lead teams to define, estimate, and implement requirements for new automations or services of moderate complexity that need development
• Stay up to date with the latest Azure and FedRAMP regulatory changes and industry trends, advising teams on potential impacts and necessary adjustments
• Update technical documentation, workflows, and knowledge base articles
• Provide feedback in pull requests and peer code reviews
• Solid knowledge in focused areas of OneStream Software
• Participate in on-call rotation to support production systems
• Assist in efforts to debug problems that arise in production
• Ability to mentor others in several technical areas
• Understanding of the practical use of FedRAMP/SOC controls to assist Compliance and Security teams

Ansible, AWS, Azure, Chef, Firewalls, GCP, Kubernetes, Microsoft SQL Server, OpenShift, Puppet, Python, SQL, Terraform

United States

$140K - $172.3K / year

Job Closed

Senior DevOps

journaway

Bucket list Moments in over 100 countries

DevOps Engineer | 72 days ago

Full Time | Remote | Team 51-200 | H1B No Sponsor

• Design, implement, and manage cloud infrastructure for high availability and performance
• Improve pipelines, automate workflows, and drive innovation in infrastructure setup
• Monitor platform performance and manage incident responses
• Apply and enforce security best practices and ensure compliance
• Collaborate with cross-functional teams to optimize application performance and deployments

AWS, Azure, Docker, GCP, Grafana, Jenkins, Kubernetes, Linux, Prometheus, Terraform

Germany

Job Closed

DevOps Engineer

smartclip Europe GmbH

smartclip strives to create a diverse and inclusive environment. All qualified applicants will be considered for employment regardless of ethnicity, nationality, age, gender, gender identity, religion, sexual orientation, disability, or other diverse characteristics.

DevOps Engineer | 72 days ago

Full Time | Remote | Team 51-200

Role Description

Remote first, with real team spirit on site:

- We work remotely but meet regularly in Berlin for workshops, brainstorming sessions, and everything that simply works better in person: tinkering together, direct exchange, and real team spirit.
- You are not here to work through tickets.
- You are here to build infrastructure that holds up and to enable teams to ship faster.
- No ticket waterfall. No change freeze. No "That's how we've always done it."

Your playground:

- Developer Platforms: Design and automate platforms for fast feedback, high quality, and maximum flexibility.
- CI/CD & Toolchain: Integrate modern tools into DevOps pipelines, always with functional safety in mind.
- GKE / Kubernetes: Direct access to state-of-the-art K8s infrastructure on Google Cloud, which you help shape.
- Security Engineering: You harden systems before a pen test does it for you: proactive, not reactive.

Qualifications

- 5+ years of DevOps engineering, not as a spectator but as a doer.
- GCP, Kubernetes, and Linux are your natural habitat.
- Infrastructure as Code, Docker/CRI-O, and shell scripting are part of your daily routine.
- You are the point of contact for your stack: no ping-pong between teams, no alibi tickets.

Requirements

- Nice-to-have: Graylog, Grafana, Prometheus, Python, or Jenkins
- Ansible/Terraform or AWS experience

Benefits

- No bullshit, real responsibility: Ownership instead of bureaucracy. We trust you to get things done.
- Fail Fast, Learn Faster: We try what actually works, not what sounds good in theory.
- Tech mindset & humor: We take our work seriously, but not always ourselves. That makes the difference.
- Always on the pulse: Hackathons, conferences, community. We invest so that you stay ahead.
- Remote-first, anywhere in Germany: You work from wherever you want, as long as it is in Germany. Our offices in Berlin, Hamburg, and the Gütersloh area are open to you.
- Every 2–3 months the team meets in person, usually in Berlin. Real people, real conversations, and then back into the flow.

Company Description

smartclip strives to create a diverse and inclusive environment. All qualified applicants will be considered for employment regardless of ethnicity, nationality, age, gender, gender identity, religion, sexual orientation, disability, or other diverse characteristics.

Google Kubernetes Engine, Kubernetes, GCP, Linux, Docker, Grafana, Prometheus, Python, Jenkins, Ansible, AWS

Germany
