AlpacaDB, Inc., also known as Alpaca and Alpaca Securities, is an API stock and crypto brokerage platform that enables services to embed investing and developers to build applicati
Site Reliability Engineer
Location
Europe
Posted
5 days ago
Salary
$0 - $150 / month
Seniority
Senior
Job Description
• Operate production day-to-day - on-call, incident response, postmortems, and the follow-ups that actually close the loop.
• Own reliability practice - define and refine SLIs/SLOs and error budgets, and help product teams live within them.
• Strengthen our observability across metrics, logs, traces, and alerting.
• Ship infrastructure through code in a GitOps workflow - cloud resources and Kubernetes workloads alike.
• Look after PostgreSQL: performance tuning, schema and migration review, online migrations on large tables, HA/DR, and CDC pipelines.
• Mentor engineers on reliability and database fundamentals through code review, design review, and pairing.
Job Requirements
- 4+ years in SRE, DevOps, Platform/Infrastructure, or backend engineering with significant production operations ownership.
- Hands-on experience operating production services on Kubernetes, and shipping infrastructure as code in a GitOps workflow.
- Solid working knowledge of PostgreSQL in production — query plans, pg_stat_*, indexing and schema trade-offs, and what a safe online migration looks like on a non-trivial table.
- Cloud networking fundamentals (VPCs, routing, L4/L7 load balancing, DNS, TLS) and comfort debugging cross-service connectivity.
- Comfortable with a modern observability stack and proficient with Linux at the operator level.
- Practiced in incident response - calm under pressure, structured debugging, postmortems that drive change.
- At least working proficiency in Go or Python, plus strong written and verbal communication.
- Genuine interest in databases and in growing your PostgreSQL/DBA expertise.
Benefits
- Competitive Salary & Stock Options
- Health Benefits
- New Hire Home-Office Setup: One-time USD $500
- Monthly Stipend: USD $150 per month via a Brex Card
More DevOps Engineer Jobs
Site Reliability Engineer - Storage Engineer
GoDaddy
GoDaddy is a web services platform that helps individuals and businesses worldwide start, grow, and manage their online presence. GoDaddy employs team members across North America,
Role Description
GoDaddy is seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our dynamic team. This role will focus on automating and maintaining our storage infrastructure, particularly Ceph, ensuring the reliability, scalability, and performance of our systems.
- Automate and maintain day-to-day operations of storage systems to support application demands
- Develop and maintain tools and automation scripts to streamline storage operations and improve efficiency
- Monitor system performance, identify issues, and implement solutions to ensure high availability and reliability
- Participate in agile practices such as daily stand-up meetings, task tracking boards, design and code reviews, automated testing, continuous integration, and deployment
- Continuously improve system reliability, performance, and capacity through proactive monitoring, automation, and optimization
Qualifications
- 2+ years of professional experience with Ceph, working in a production environment
- 2+ years of experience in site reliability engineering or a similar role
- 2+ years of professional experience with Ceph, including deployment, configuration, and management of Ceph clusters and systems
- Experience working on Linux/Unix systems, with a focus on automation and operating at scale
- Proficiency in Python or Bash
- Experience with Ansible, Terraform, or SaltStack
- Experience with Nagios-based monitoring tools, such as Icinga2
- Experience with observability tooling, such as Prometheus, Grafana, Mimir, and Loki
- Solid understanding of core networking concepts and protocols, particularly in relation to Linux/Unix systems
Requirements
- Experience with containerization and orchestration tools (e.g., Docker, Kubernetes)
- Exposure to and experience working with compute platforms (e.g., OpenStack, AWS)
- Familiarity with, and ability to contribute to, CI/CD pipelines and automation workflows
Benefits
- Competitive pay
- Generous time off
- Parental and wellness leave
- Healthcare
- Retirement savings program
- Comprehensive benefits package, including:
- Medical, dental, and vision insurance
- 401(k) retirement plan
- Paid sick time
- Paid flexible time off
- Paid parental leave
- Life insurance
- Short- and long-term disability
- AD&D insurance
- Mental health or EAP programs
- Remote or hybrid work options
- Paid holidays
- Paid wellness days
- Tuition assistance
- Adoption, surrogacy, and fertility benefits
- Dependent daycare and backup care benefits
- Employee stock purchase plan
- Financial education and advice
Compensation
- Bay Area (Santa Clara, San Francisco) and Los Angeles: $128,000 - $192,000 USD
- Austin, D.C. Metro, CA (non-Bay Area), HI, IL, MA, NH, OR, VA, WA: $110,500 - $165,500 USD
- New York City Metro, Kirkland/Seattle: $117,200 - $175,800 USD
- All other US locations not previously listed: $98,500 - $147,500 USD
Junior DevOps Engineer
OpenVPN Inc.
OpenVPN® helps businesses of all sizes create secure, virtualized, reliable networks that scale with your team.
• Assist in designing, implementing, and maintaining scalable, fault-tolerant systems that leverage cluster orchestration and containerization technologies, with guidance from senior engineers.
• Work alongside the Software Engineering and QA teams to support deployment processes for microservices-based architectures.
• Help build and maintain CI/CD pipelines that support container-based application deployment and rollback.
• Support system availability by running health checks and contributing to zero-downtime deployments.
• Participate in a supported on-call rotation, escalating critical system issues appropriately as you build incident-response experience.
• Collaborate with information security teams to follow industry best practices and compliance requirements.
• Use AI-assisted tooling (e.g., LLM-powered coding assistants, chat-ops agents, scripted automations) to accelerate routine tasks such as log triage, runbook execution, ticket drafting, and code review.
Senior Systems Site Reliability Engineer, B2B
Jamf
Jamf is an IT solutions company focusing on enterprise management of the Apple suite of products. The company was founded in 2002 by Zach Halmstad and Chip Pear
• Site Reliability Engineers are responsible for helping balance development velocity against customer-centric stability of systems through the use of SRE best practices and the creation of new processes with automation.
• The Senior Site Reliability Engineer is responsible for creating and leading projects around how services and other workflows should be measured, and for participating in the observability of production systems and services through day-to-day operational responsibilities, gaining the production knowledge needed to decide what toil should be automated next.
• The Senior Site Reliability Engineer is expected to operate with a DevOps mindset at the convergence point of Cloud Operations, Engineering, and Technical Support within the framework of the Agile process.
• Identify improvements in both the platform and processes by implementing established SRE concepts with the goal of improving product and system reliability.
• Proactively engage and collaborate with other individuals and teams as issues arise by serving as an escalation point for customer issues to ensure successful outcomes.
• Perform root cause analysis for customer-impacting issues, clearly document the solution, and advise others based on the findings.
• Create technical documentation based on new technology proofs of concept, project work, root cause analysis, and identification of alerting patterns, and proactively share this knowledge with other teams as part of the Continuous Improvement Model.
• Participate in team ceremonies to identify and refine potential work, communicate findings, and drive opportunities to collaborate.
• Assign and communicate the business value and benefit hypothesis of new projects, initiatives, and strategies while being able to break down the technical work required to achieve a successful outcome.
• Lead cross-team and cross-department technical collaboration in critical customer escalations.
• Advise stakeholders and senior leadership on critical customer escalations.
• Occasionally provide off-hours support for deployments and customer escalations.
• Own and evolve the edge proxy platform: Maintain, upgrade, and extend a high-performance reverse proxy, including maintaining the proxy binary and its configuration tooling, writing Go and Python automation, managing the full container image lifecycle on hardened Linux base images, and working across the broader edge layer, including CDN, WAF, and traffic management capabilities.
• Build and maintain cloud infrastructure as code: Design and implement Terraform/Terragrunt modules and live environment configurations managing EKS clusters, load balancers, IAM roles, VPC networking, ECR registries, and supporting AWS services across multiple regions including GovCloud.
• Operate Kubernetes clusters at scale: Manage multi-region, multi-cluster EKS deployments via FluxCD GitOps workflows and Helm charts, including node AMI rotation, add-on lifecycle management, and horizontal pod autoscaling.
• Build and own CI/CD pipelines: Design, maintain, and improve shared GitLab CI/CD pipeline templates used across all team repositories; build and operate alternative pipeline workflows for isolated government cloud environments.
• Automate operational toil: Build and maintain tooling for tasks such as container image patching, EKS AMI rotation, air-gapped ECR image sync to GovCloud, and automated MR creation for monthly version-bump patching cycles.
• Manage observability and on-call: Provision and maintain Datadog SLOs, monitors, and dashboards via Terraform; participate in the team's on-call rotation responding to edge proxy incidents across production and GovCloud environments.
• Support FedRAMP/GovCloud operations: Operate the GovCloud environment with its unique constraints: air-gapped image distribution, infrastructure automation in isolated networks, and alert management with compliance-aware data handling.
• Evaluate and adopt internal developer tooling: Research, prototype, and drive the adoption of internal tools that improve engineering productivity across the company, including developer portals, platform self-service capabilities, and other tooling that raises the bar for the developer experience at Smartsheet.
• Mentor and collaborate: Share knowledge across the team through code reviews, architecture discussions, and runbook authorship; foster a culture of engineering excellence and operational rigour.
• Strategically apply AI tools: Apply and champion AI tools within your team's domain to improve project execution, infrastructure design, quality, and debugging, leading adoption of AI best practices.




