Job Closed

This listing is no longer active.

DataCrunch

Premium dedicated GPU servers and clusters. Raw performance at an unmatched price.

Senior – Principal Site Reliability Engineer

DevOps Engineer, Full Time, Remote, Senior, Team 11-50, H1B No Sponsor

Location

Germany

Posted

70 days ago

Salary

Seniority

Senior

Bachelor Degree, 7 yrs exp, English

Ansible, AWS, Azure, Distributed Systems, DNS, GCP, Linux, Python, Terraform

Job Description

• Ensure the reliability, scalability, and performance of HPC and cloud systems.
• Build and maintain automation, observability, and monitoring frameworks for compute clusters.
• Collaborate with ML, data, and infrastructure teams to deliver high-availability systems.
• Develop and enhance CI/CD pipelines, deployment workflows, and on-call processes.
• Participate in architecture design and long-term infrastructure strategy discussions.
• Participate in a 24/7 on-call rotation, with at least one full on-call week per month.

Job Requirements

7+ years in SRE, DevOps, or Infrastructure Engineering, preferably in HPC or large-scale distributed systems.
Linux expertise (Ubuntu or Debian preferred).
Strong experience with scripting and automation (Python, Go, Bash).
Proven ability with cloud platforms (AWS, GCP, Azure, or modern HPC providers such as CoreWeave, Lambda, Nebius).
Deep understanding of networking (DNS/TCP) and infrastructure-as-code tools (Terraform, Ansible).
Experience managing Slurm-based HPC GPU clusters, diagnosing performance issues, and designing efficient HPC jobs.

Benefits

Generous cash + equity compensation
Various fringe benefits (e.g., healthcare, lunch, wellbeing)

More DevOps Engineer Jobs

Senior Platform, DevOps Engineer

beyonnex.io

Pioneer of smart real estate

DevOps Engineer, 71 days ago

Full Time, Remote, Team 51-200, H1B No Sponsor

• Design, build and operate our AWS- and Kubernetes-based platform
• Own one or more areas and act as the go-to person in the team
• Operate production AWS environments and Kubernetes clusters
• Maintain observability stack: Metrics, Logs, Traces, Instrumentation
• Define SLOs, dashboards and alerting for teams
• Work on Kubernetes networking, Ingress controllers and traffic routing
• Build and maintain Terraform modules for AWS and Kubernetes
• Support connectivity between cloud and on-prem systems
• Participate in design reviews, incident reviews and on-call.

AWS, DNS, Firewalls, Grafana, Kubernetes, Prometheus, TCP/IP, Terraform

Germany

API Reliability Engineer

Empower

We are an equal opportunity employer with a commitment to diversity. All individuals, regardless of personal characteristics, are encouraged to apply. All qualified applicants will receive consideration for employment without regard to age, race, color, national origin, ancestry, sex, sexual orientation, gender, gender identity, gender expression, marital status, pregnancy, religion, physical or mental disability, military or veteran status, genetic information, or any other status protected by applicable state or local law.

DevOps Engineer, 71 days ago

Full Time, Remote, Team 10,001+, H1B Sponsor

• Own and improve the reliability, performance, and scalability of API services in production.
• Troubleshoot and resolve P1/P2 production incidents end-to-end, analyzing issues across application, infrastructure, and integrations.
• Work closely with API developers to identify and address reliability issues and application-level security vulnerabilities in service design and implementation.
• Contribute targeted code-level or configuration fixes to resolve issues and prevent recurrence.
• Participate in root cause analysis (RCA) and drive durable, long-term fixes.
• Improve API resilience through patterns such as timeouts, retries, circuit breakers, and graceful degradation.
• Establish and enhance observability and service health metrics, including logs, metrics, traces, and SLOs, using Datadog and Splunk.
• Define and monitor SLAs/SLOs for API performance and availability.
• Work with API Gateway and ALB/NLB for traffic management, routing, and system reliability.
• Contribute to CI/CD pipelines using Jenkins to ensure safe and consistent deployments.
• Contribute to disaster recovery readiness and system resilience planning.
• Collaborate across engineering teams to improve system design and operational readiness.
• Participate in an on-call rotation for critical incidents (P1/P2).

AWS, Distributed Systems, DynamoDB, EC2, Java, Jenkins, Splunk, Spring, Spring Boot

United States

$87.4K - $123.4K / year

Job Closed

Senior DevOps Engineer

Lucidworks

Leaders in AI-Powered Search

DevOps Engineer, 71 days ago

Full Time, Remote, Team 201-500, H1B Sponsor

• Build the automation tools that ensure our internal and external customers receive resources quickly and painlessly while making our team’s lives easier
• Work closely with engineering teams to deliver a high quality product to our customers that meets all of their needs
• Aim for at least 99.9% uptime across all of our managed customers
• Work on several major projects including automating parts of our infrastructure, creating new monitors and alerts, creating new tooling for both team consumption and company consumption, etc.
• Take ownership of Lucidworks’ company-wide cloud-first initiative by making the onboarding process for new customers as smooth as possible for them.

Distributed Systems, Kubernetes

United States

$128K - $176K / year

Job Closed