Job Closed
This listing is no longer active.
Senior Software Engineer - DevOps and Infrastructure
Location
United States
Posted
78 days ago
Salary
Not specified
Seniority
Senior
Job Description
Senior Software Engineer - DevOps and Infrastructure
Cornelis Networks, Inc.
Cornelis Networks delivers the world’s highest performance scale-out networking solutions for AI and HPC datacenters. Our differentiated architecture seamlessly integrates hardware, software and system level technologies to maximize the efficiency of GPU, CPU and accelerator-based compute clusters at any scale. Our solutions drive breakthroughs in AI & HPC workloads, empowering our customers to push the boundaries of innovation. Backed by top-tier venture capital and strategic investors, we are committed to innovation, performance and scalability - solving the world’s most demanding computational challenges with our next-generation networking solutions.

We are a fast-growing, forward-thinking team of architects, engineers, and business professionals with a proven track record of building successful products and companies. As a global organization, our team spans multiple U.S. states and six countries, and we continue to expand with exceptional talent in onsite, hybrid, and fully remote roles.

We are seeking an experienced Senior Software Engineer to enhance Cornelis Networks' development and testing infrastructure. This role focuses on the design and maintenance of onsite and cloud infrastructure, including CPU and GPU-accelerated systems, automation, and CI/CD pipelines. The ideal candidate will have deep Linux systems expertise and hands-on experience managing server hardware and software stacks in Enterprise IT, Cloud or HPC environments.

Responsibilities
- Design, build, and maintain CI/CD pipelines using GitHub Actions, including custom workflows, across multiple Linux distributions (RHEL, Rocky, SLES, Ubuntu).
- Integrate HPC/AI workload and Cornelis software build and test stages into CI pipelines, including driver compatibility checks and GPU-accelerated test suites.
- Interface with SW teams to validate and test new packages and ensure test hardware environments are provisioned with the correct package versions.
- Collaborate with developers to help debug, interpret, and communicate testing results, ensuring issues are triaged and resolved efficiently.
- Monitor pipeline health, troubleshoot failures, and continuously improve build reliability and speed.
- Administer and maintain onsite Linux-based development and test servers across a heterogeneous multi-distro environment.
- Perform OS provisioning, patching, and lifecycle management, including kernel module and driver management.
- Manage hardware and software stacks for both CPU and GPU systems, including installation, upgrade, and troubleshooting of AMD (ROCm) and NVIDIA (CUDA/driver) environments.
- Maintain driver and package compatibility matrices to ensure consistent environments across development, CI, and test infrastructure.
- Manage test hardware environments to ensure they are provisioned with the correct packages and configurations for ongoing test campaigns.

Minimum Qualifications
- 5+ years of experience in DevOps or infrastructure engineering in a Linux-based environment.
- Strong proficiency with Git and GitHub Actions, including designing and maintaining custom CI/CD workflows.
- Deep Linux systems mastery, especially across RHEL/Rocky, SLES, and Ubuntu, such as package management (RPM, DEB, zypper, dnf, apt), systemd, kernel module management, and system troubleshooting.
- Hands-on experience managing server systems, including installation and maintenance of drivers, software, and firmware across multiple Linux distributions.
- Proficiency in Linux scripting for automation and tooling.
- Strong troubleshooting and debugging skills across hardware, OS, driver, and application layers.
- Excellent communication skills and ability to work effectively in a remote, collaborative environment.

Preferred Qualifications
- Experience with ReFrame or similar HPC regression testing frameworks.
- Experience with HPC, high-performance networking, or RDMA technologies (InfiniBand, Omni-Path, Ethernet RDMA).
- Familiarity with kernel driver development or kernel module packaging (DKMS, kmod).
- Experience managing multi-GPU server configurations (e.g., DGX, MI300X-class systems, or similar).
- Knowledge of GPU monitoring and management tools (nvidia-smi, rocm-smi, DCGM, GPU operator frameworks).
- Experience with infrastructure-as-code tools such as Ansible.
- Familiarity with monitoring and observability stacks (Grafana, Loki, Prometheus).
- Proficiency in Python for automation and tooling.
- Familiarity with C/C++.
- Experience in a networking, semiconductor, or systems software company.

Location: This is a remote position for employees residing within the United States.

We offer a competitive compensation package that includes equity, cash, and incentives, along with health and retirement benefits. Our dynamic, flexible work environment provides the opportunity to collaborate with some of the most influential names in the semiconductor industry.

At Cornelis Networks your base salary is only one component of your comprehensive total rewards package. Your base pay will be determined by factors such as your skills, qualifications, experience, and location relative to the hiring range for the position. Depending on your role, you may also be eligible for performance-based incentives, including an annual bonus or sales incentives. In addition to your base pay, you’ll have access to a broad range of benefits, including medical, dental, and vision coverage, as well as disability and life insurance, a dependent care flexible spending account, accidental injury insurance, and pet insurance. We also offer generous paid holidays, 401(k) with company match, and Open Time Off (OTO) for regular full-time exempt employees. Other paid time off benefits include sick time, bonding leave, and pregnancy disability leave.

Cornelis Networks does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services.

Cornelis Networks is an equal opportunity employer, and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity or expression, pregnancy, age, national origin, disability status, genetic information, protected veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

Location
Austin, Texas (Remote)
Department
Software Engineering
Employment Type
Full-Time
Job Requirements
- 5+ years of experience in DevOps or infrastructure engineering in a Linux-based environment.
- Strong proficiency with Git and GitHub Actions, including designing and maintaining custom CI/CD workflows.
- Deep Linux systems mastery, especially across RHEL/Rocky, SLES, and Ubuntu, such as package management (RPM, DEB, zypper, dnf, apt), systemd, kernel module management, and system troubleshooting.
- Hands-on experience managing server systems, including installation and maintenance of drivers, software, and firmware across multiple Linux distributions.
- Proficiency in Linux scripting for automation and tooling.
- Strong troubleshooting and debugging skills across hardware, OS, driver, and application layers.
- Excellent communication skills and ability to work effectively in a remote, collaborative environment.
Preferred Qualifications
- Experience with ReFrame or similar HPC regression testing frameworks.
- Experience with HPC, high-performance networking, or RDMA technologies (InfiniBand, Omni-Path, Ethernet RDMA).
- Familiarity with kernel driver development or kernel module packaging (DKMS, kmod).
- Experience managing multi-GPU server configurations (e.g., DGX, MI300X-class systems, or similar).
- Knowledge of GPU monitoring and management tools (nvidia-smi, rocm-smi, DCGM, GPU operator frameworks).
- Experience with infrastructure-as-code tools such as Ansible.
- Familiarity with monitoring and observability stacks (Grafana, Loki, Prometheus).
- Proficiency in Python for automation and tooling.
- Familiarity with C/C++.
- Experience in a networking, semiconductor, or systems software company.
Location
- This is a remote position for employees residing within the United States.
Benefits
- Competitive compensation package that includes equity, cash, and incentives.
- Health and retirement benefits.
- Access to a broad range of benefits, including medical, dental, and vision coverage.
- Disability and life insurance.
- Dependent care flexible spending account.
- Accidental injury insurance and pet insurance.
- Generous paid holidays.
- 401(k) with company match.
- Open Time Off (OTO) for regular full-time exempt employees.
- Other paid time off benefits include sick time, bonding leave, and pregnancy disability leave.
More DevOps Engineer Jobs
Middle DevOps Engineer
hireforyou.pro
Role Description
We are looking for a Middle DevOps Engineer to join a high-performing engineering team providing infrastructure and DevOps support for fast-growing technology companies across fintech, SaaS, e-commerce, and entertainment. This role is ideal for engineers who enjoy working in a dynamic environment with multiple projects, solving complex infrastructure challenges, and collaborating closely with both internal teams and client engineers.

What You’ll Do
- Infrastructure Management:
  - Manage hybrid environments including cloud infrastructure and physical servers
  - Work with platforms such as AWS, DigitalOcean, Hetzner, Azure, or GCP
- Infrastructure as Code:
  - Build and maintain infrastructure using Terraform and Ansible
- Containerized Environments:
  - Work with Kubernetes in production environments
- Monitoring & Incident Response:
  - Configure and maintain monitoring and alerting systems (Prometheus, Grafana, ELK/Graylog)
  - Participate in incident analysis and root cause investigations
- Technical Collaboration:
  - Communicate with engineering teams and participate in architectural discussions
- Engineers participate in on-call rotations and may occasionally respond to incidents outside standard hours.

Tech Stack
- Core Requirements:
  - Linux administration (Debian / Ubuntu / CentOS)
  - Terraform & Ansible
  - Kubernetes (production experience)
  - Cloud infrastructure (AWS / DigitalOcean / Hetzner)
  - Monitoring and alerting systems (Prometheus, Grafana, ELK / Graylog)
- Nice to Have:
  - Bare metal infrastructure or virtualisation (Proxmox)
  - Experience with distributed databases (ClickHouse, Cassandra, ScyllaDB)
  - Networking knowledge (BGP, VLAN, L2/L3 troubleshooting)
  - Experience with Azure or GCP
  - Automation for hybrid or Windows environments

Qualifications
- 3+ years of experience in DevOps, infrastructure engineering, or system administration
- Experience working in fast-paced environments with multiple projects
- Strong troubleshooting and incident management skills
- Ability to collaborate with engineers and communicate technical solutions clearly
- Technical English (B1+) for written communication

Benefits
- Remote full-time position
- Market-level compensation
- 21 days of paid vacation
- Paid sick leave
- Support for professional development and certifications

We look forward to receiving your CV and learning more about your experience! Dear Candidates, due to a high volume of applications, only selected candidates will be contacted for interviews. We appreciate your understanding. Thank you for considering a career with us.
Senior DevOps/SRE Engineer
Capital.com
We are making the world of finance more accessible, engaging, and useful with an award-winning trading platform and app.
- Design, deploy, and maintain scalable cloud infrastructure on AWS, ensuring high availability, performance, and security across all environments.
- Own and evolve Kubernetes cluster management, including bare-metal deployments, and ensure reliable containerised workloads using Docker and Helm.
- Build and maintain CI/CD pipelines using GitLab CI, incorporating GitOps principles with FluxCD or ArgoCD to streamline and automate delivery workflows.
- Define and manage Infrastructure as Code using Terraform, ensuring all infrastructure changes are version-controlled, repeatable, and reviewed.
- Lead monitoring and observability initiatives: implement and maintain dashboards, alerting, and log pipelines using VictoriaMetrics/Prometheus, Grafana, and the ELK stack.
- Operate and optimize Apache Kafka ecosystems, including Strimzi, Kafka Connect, and MirrorMaker, to support real-time data pipelines.
- Drive incident response, root cause analysis, and post-mortem culture to continuously improve system reliability.
- Collaborate closely with Engineering, Security, and Product teams to embed DevOps best practices across the organisation.
- Mentor and guide junior engineers, raising the overall engineering bar for infrastructure reliability and automation.
Senior Site Reliability Engineer, Linux, Kubernetes, Go, Python
Red Hat
Founded in 1993, Red Hat is an award-winning technology firm working to serve as the go-to company for communities of contributors, customers, and partners in c
- Contribute code to increase the scalability and reliability of the service.
- Contribute software tests and participate in peer review to increase the quality of our codebase.
- Help and develop peers’ capabilities through knowledge sharing, mentoring, and collaboration.
- Participate in a regular on-call schedule, including occasional paid weekends and holidays.
- Practice sustainable incident response and blameless postmortems.
- Resolve customer issues escalated from the Red Hat Global Support team.
- Work within a small agile team to develop and improve SRE software, support your peers, plan and self-improve.
- Collaborate with cross-functional teams to identify opportunities for AI integration within the software development lifecycle, driving continuous improvement and innovation in engineering practices; share use cases for successful experiments with stakeholders for broader use.
Senior Site Reliability Engineer – Self-Hosted Infrastructure
Tines
No-code automation for security teams
- Changing the way customers provision and operate our self-hosted product offering along with the services and infrastructure powering self-hosted installations.
- Identifying and fixing availability risks and monitoring gaps to ensure our services stay healthy and available.
- Enabling software engineers to build new product features that work seamlessly across cloud and self-hosted environments: observability, logging, and simplifying deployments.
- Using our own product extensively to automate infrastructure maintenance and to build DevOps tooling for customer deployments.
- Identifying areas for improvement in our containerized architecture and deployment strategies.
- Using your knowledge and experience to mentor other engineers in container orchestration and Kubernetes best practices across all of Tines: Sales, Support, and Engineering.
- Acting as a subject matter expert for critical self-hosted customer issues.



