Job Closed
This listing is no longer active.
Premium dedicated GPU servers and clusters. Raw performance at an unmatched price.
Senior – Principal Site Reliability Engineer
Location
Germany
Posted
70 days ago
Salary
Not disclosed
Seniority
Senior
Job Description
DataCrunch
- Ensure the reliability, scalability, and performance of HPC and cloud systems.
- Build and maintain automation, observability, and monitoring frameworks for compute clusters.
- Collaborate with ML, data, and infrastructure teams to deliver high-availability systems.
- Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes.
- Participate in architecture design and long-term infrastructure strategy discussions.
- Participate in a 24/7 on-call rotation, with at least one full on-call week per month.
Job Requirements
- 7+ years in SRE, DevOps, or Infrastructure Engineering—preferably in HPC or large-scale distributed systems.
- Linux expertise (Ubuntu or Debian preferred).
- Strong experience with scripting and automation (Python, Go, Bash).
- Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius).
- Deep understanding of networking (DNS/TCP) and infrastructure-as-code tools (Terraform, Ansible).
- Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs.
Benefits
- Generous cash + equity compensation
- Various fringe benefits (e.g., healthcare, lunch, wellbeing)
More DevOps Engineer Jobs
- Design, build and operate our AWS- and Kubernetes-based platform
- Own one or more areas and act as the go-to person in the team
- Operate production AWS environments and Kubernetes clusters
- Maintain observability stack: Metrics, Logs, Traces, Instrumentation
- Define SLOs, dashboards and alerting for teams
- Work on Kubernetes networking, Ingress controllers and traffic routing
- Build and maintain Terraform modules for AWS and Kubernetes
- Support connectivity between cloud and on-prem systems
- Participate in design reviews, incident reviews and on-call
API Reliability Engineer
Empower
We are an equal opportunity employer with a commitment to diversity. All individuals, regardless of personal characteristics, are encouraged to apply. All qualified applicants will receive consideration for employment without regard to age, race, color, national origin, ancestry, sex, sexual orientation, gender, gender identity, gender expression, marital status, pregnancy, religion, physical or mental disability, military or veteran status, genetic information, or any other status protected by applicable state or local law.
- Own and improve the reliability, performance, and scalability of API services in production.
- Troubleshoot and resolve P1/P2 production incidents end-to-end, analyzing issues across application, infrastructure, and integrations.
- Work closely with API developers to identify and address reliability issues and application-level security vulnerabilities in service design and implementation.
- Contribute targeted code-level or configuration fixes to resolve issues and prevent recurrence.
- Participate in root cause analysis (RCA) and drive durable, long-term fixes.
- Improve API resilience through patterns such as timeouts, retries, circuit breakers, and graceful degradation.
- Establish and enhance observability and service health metrics, including logs, metrics, traces, and SLOs, using Datadog and Splunk.
- Define and monitor SLAs/SLOs for API performance and availability.
- Work with API Gateway and ALB/NLB for traffic management, routing, and system reliability.
- Contribute to CI/CD pipelines using Jenkins to ensure safe and consistent deployments.
- Contribute to disaster recovery readiness and system resilience planning.
- Collaborate across engineering teams to improve system design and operational readiness.
- Participate in an on-call rotation for critical incidents (P1/P2).
- Build the automation tools that ensure our internal and external customers receive resources quickly and painlessly while making our team’s lives easier
- Work closely with engineering teams to deliver a high quality product to our customers that meets all of their needs
- Aim for at least 99.9% uptime across all of our managed customers
- Work on several major projects including automating parts of our infrastructure, creating new monitors and alerts, creating new tooling for both team consumption and company consumption, etc.
- Take ownership of Lucidworks’ company-wide cloud-first initiative by making the onboarding process for new customers as smooth as possible for them.



