Senior Site Reliability Engineer, Compute Node Team
Location
Netherlands
Posted
134 days ago
Seniority
Senior
Job Description
Nebius Group
- Ensure reliability, availability and performance of compute nodes running VMs
- Analyze and debug Linux systems across user space and kernel space, understanding capabilities, limitations and trade-offs at each layer
- Troubleshoot complex production issues involving CPU, memory, NUMA, cgroups and scheduling
- Work hands-on with virtualization and containerization, primarily using QEMU/KVM and Linux-native technologies
- Design and evolve observability as a core capability of the node layer: metrics, logs, traces, alerts, SLIs and SLOs
- Lead incident response, root-cause analysis, and postmortems, driving long-term reliability improvements
- Collaborate closely with platform, kernel/hypervisor, GPU and infrastructure teams to improve system design and operability.
Job Requirements
- Strong Linux expertise:
  - deep understanding of Linux user space and kernel space
  - knowledge of kernel subsystems (scheduler, memory management, filesystems, cgroups, namespaces)
  - clear understanding of system boundaries and constraints at different layers
- Virtualization experience:
  - hands-on experience with QEMU/KVM
  - understanding of VM lifecycle, performance characteristics and failure modes
- Containerization knowledge:
  - practical experience with containers, namespaces and cgroups
  - strong understanding of resource isolation and control
- Strong debugging skills:
  - ability to reason about complex system failures
  - structured, hypothesis-driven approach to incident analysis
- SRE mindset:
  - clear understanding of the SRE role in system design and operations
  - experience building and operating observability stacks, not just consuming them
  - ability to turn system behavior into actionable reliability signals.
Benefits
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth within Nebius.
- Flexible working arrangements.
- A dynamic and collaborative work environment that values initiative and innovation.
More DevOps Engineer Jobs
Evolver is looking for a DevOps Engineer to join our team in support of our federal health IT customer. The DevOps Engineer will play a pivotal role as the bridge between the development and operations teams, with the primary goal of enhancing the efficiency, reliability, and collaboration of the software development lifecycle. They will be responsible for automating and streamlining the processes of building, testing, deploying, and monitoring software applications. They will leverage their technical expertise to implement Infrastructure as Code (IaC), containerization, and orchestration solutions, making it easier to manage and scale infrastructure. They will design and maintain Continuous Integration/Continuous Deployment (CI/CD) pipelines, enabling rapid and reliable software releases. Additionally, they will focus on monitoring and logging, ensuring that the system's performance and health are continuously tracked and analyzed so that issues can be addressed quickly. This is a remote position requiring the person to work an EST schedule and be based within the United States.
Responsibilities:
- CI/CD Pipeline Management: Design, implement, and maintain continuous integration and continuous deployment (CI/CD) pipelines. Automate build, test, and deployment processes to ensure reliable and rapid software delivery.
- Infrastructure as Code (IaC): Use tools like Terraform, CloudFormation, or Ansible to provision and manage infrastructure in AWS. Maintain version-controlled infrastructure for reproducibility and scalability. Automate environment setups across development, staging, and production.
- Monitoring, Logging, and Incident Response: Implement and manage monitoring tools (e.g., Prometheus, Grafana, CloudWatch, New Relic, Splunk). Set up alerting and reporting for performance and reliability issues. Participate in on-call rotations and incident management to ensure uptime and reliability.
- Containerization and Orchestration: Build, deploy, and manage containerized applications using Docker and ECS.
- Security and Compliance: Integrate DevSecOps practices into the CI/CD pipeline (e.g., vulnerability scanning, secret management). Manage access controls, IAM roles, and encryption policies.
- Scripting and Automation: Develop automation scripts in languages like Python, Bash, or PowerShell. Automate repetitive operational tasks to improve efficiency and reliability.
- Performance Optimization: Analyze system bottlenecks and optimize application and infrastructure performance. Implement caching, load balancing, and scaling strategies.
- Documentation and Knowledge Sharing: Maintain detailed documentation for infrastructure, automation, and deployment processes. Train and support development teams on DevOps tools and workflows.
Basic Qualifications:
- Bachelor's Degree, or 10 years of equivalent experience in a related field may be substituted for the degree.
- 5 years of experience in the IT industry, comprising DevOps/Cloud Engineering, Software Configuration Management (SCM), Cloud Management, Containerization, Deployment and Tool Engineering in an Agile environment.
- 5 years of experience as a DevOps / Build & Release Engineer automating, building, deploying, and managing Configuration Management, Continuous Integration (CI), and Continuous Deployment (CD).
- 5 years of experience in Infrastructure Development and Operations, designing and deploying on the AWS stack, including EC2, EBS, EFS, IAM, S3, VPC, RDS, SES, ELB, ECS, SQS, Auto Scaling, CloudFront, CloudFormation, CloudWatch, SNS, and Route 53.
- 3 years of experience managing servers on the Amazon Web Services (AWS) platform using Ansible configuration management tools and creating instances in AWS.
- 3 years of experience designing AWS CloudFormation templates to create custom-sized VPCs, subnets, and NAT to ensure successful deployment of web applications and database templates.
- 3 years of experience with database management tools like Liquibase.
- 3 years of experience with application deployment and environment configuration tools like Chef, Puppet, or Ansible.
- 3 years of experience writing Ansible playbooks for configuration management and multi-machine deployment.
- 3 years of experience in branching, tagging, and maintaining version control and source code management tools like Git and SVN (Subversion) on Linux and Windows platforms.
- 3 years of experience using build tools like Maven or NPM to build deployable artifacts.
- 3 years of experience managing artifact repositories like Nexus or Artifactory.
- 3 years of experience creating Jenkins CI pipelines and good experience in automating deployment pipelines.
- 3 years of experience working with Docker components like Docker Engine, Hub, Machine, Compose, Docker Registry, ECR, and ECS.
- 3 years of experience working with protocols like HTTP, HTTPS, POP, FTP, TCP/IP, and SMTP.
- 3 years of experience working with monitoring systems and tools like New Relic, Splunk, and CloudWatch.
- 3 years of experience in Bash, Perl, Python, Ruby, and PowerShell scripting on Linux and Windows.
- 3 years of experience configuring and maintaining network services such as LDAP, DNS, NIS, DHCP, NFS, Webmail, and FTP.
- 3 years of experience deploying system stacks for different environments like Dev, UAT, and Prod on AWS cloud infrastructure.
- 3 years of experience managing users and groups using Amazon Identity and Access Management (IAM) (with MFA) and IAM policies to meet security audit and compliance requirements.
- 3 years of experience with Apache, Nginx, and JBoss configurations.
- Bachelor's Degree required. Equivalent years of experience in a related field may be substituted for the degree.
- US Citizen or Permanent Resident required, and all applicants shall have lived in the United States for at least three (3) out of the last five (5) years.
- Must be able to pass a comprehensive background check that includes a client-specific Public Trust background investigation.
Preferred Qualifications:
- AWS Cloud Practitioner or DevOps Engineer certifications.
- Excellent written and verbal communication skills, strong organizational skills, and a hard-working team player.
- Able to prioritize and execute tasks in a high-pressure environment. Highly self-motivated and directed.
Evolver Federal is an equal opportunity employer and welcomes all job seekers. It is the policy of Evolver Federal not to discriminate based on race, color, ancestry, religion, gender, age, national origin, gender identity or expression, sexual orientation, genetic factors, pregnancy, physical or mental disability, military/veteran status, or any other factor protected by law. Actual salary will depend on factors such as skills, qualifications, experience, market and work location. Evolver Federal offers competitive benefits, including health, dental and vision insurance, 401(k), flexible spending account, and paid leave (including PTO and parental leave) in accordance with our applicable plans and policies.
Senior Release Engineer
Red Cell Partners
Red Cell Partners, founded in 2020, is a dynamic and rapidly growing firm specializing in launching and scaling innovative companies across various industries.
- Design and maintain the pipelines that produce our Single-Tenant SaaS updates and our Self-Managed customer bundles.
- Ensure "Build Once, Deploy Anywhere" consistency across standard cloud and restricted GovCloud environments.
- Manage artifact lifecycle, including versioning, container registries, and software signing to meet federal security standards.
- Act as the primary technical point of contact for ISSM (Information System Security Manager) approvals for GovCloud deployments.
- Maintain the "Version Map", tracking which customers are on which versions and managing the complexities of "version lag" for those who opt out of weekly updates.
- Coordinate across teams to validate bundles before they are shipped to customer-managed environments.
- Continually improve our release operations and processes through automation.
- Develop and track metrics for release operations; recommend and develop improvements alongside the engineering team.
- Lead "Go/No-Go" decisions, synthesizing input from QA, Support, and Product.
- Empower Customer Support and Sales Engineering by providing them with clear "Known Issues" lists and migration paths for each release, with input from the engineering and product teams.
- Architect and operate high-availability, low-latency cloud infrastructure for real-time AI inference services.
- Build and maintain CI/CD pipelines that support rapid, safe iteration across backend and ML systems.
- Own production reliability: monitoring, alerting, incident response, and capacity planning.
- Lead infrastructure automation using Infrastructure as Code (Terraform, Pulumi, etc.).
- Partner closely with backend and ML engineers to productionize new models and services.
- Harden our systems for security, compliance, and data protection.
- Optimize cost, performance, and scalability as usage grows.
- Establish best practices around observability, deployment strategies, and disaster recovery.


