ML Infrastructure Engineer
Location
California
Posted
23 days ago
Salary
0
Seniority
Senior
Job Description
ML Infrastructure Engineer
Nebius Group
• Work closely with hardware and development teams to profile and analyse GPU performance at the system and kernel level.
• Evaluate and compare GPU performance across different platforms, architectures, and software stacks (e.g., CUDA, ROCm).
• Debug and optimise ML workloads to run efficiently on GPU hardware, identifying and resolving performance bottlenecks.
• Perform acceptance testing for new GPU clusters, ensuring hardware and software meet performance, stability, and compatibility requirements for AI workloads.
• Perform experiments across diverse GPU system configurations to assess the impact of varying interconnect strategies and system-level optimisations on performance and scalability.
• Develop tools and dashboards to visualise performance metrics, bottlenecks, and trends.
• Contribute to internal tooling, frameworks, and best practices.
Job Requirements
- A profound understanding of theoretical foundations of machine learning
- Deep understanding of the performance aspects of large neural network training and inference (data/tensor/context/expert parallelism, offloading, custom kernels, hardware features, attention optimisations, dynamic batching, etc.)
- Deep experience with modern deep learning frameworks (PyTorch, JAX, Megatron-LM, TensorRT-LLM)
- Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries
- Familiarity with containerized environments (e.g., Docker, Kubernetes)
- Strong communication and ability to work independently
Benefits
- Competitive compensation
- Career growth and learning opportunities
- Flexibility and work-life balance
- Collaborative and innovative culture
- Opportunity to work on impactful AI projects
- International environment and talented teams
More Infrastructure Engineer Jobs
Cloud Infrastructure Engineer – Azure, Windows, Linux
Tietoevry
We create purposeful technology that reinvents the world for good. #purposefultechnology #tietoevry
• Support and maintain Azure cloud infrastructure, including virtual machines, networking, backups, monitoring, and access-related services
• Administer Windows and Linux servers
• Deploy and configure cloud resources
• Troubleshoot infrastructure and application issues
• Support monitoring, backup, patching, and lifecycle activities
• Participate in migrations and infrastructure improvement initiatives
• Contribute to automation activities using scripting and standard tools
• Create and maintain technical documentation, SOPs, and runbooks
• Support incident, problem, and change management processes
• Access Management: Provision and manage user access, accounts, and permissions across public and private cloud infrastructure
• Operations: Triage and resolve day-to-day infrastructure issues, escalating to senior engineers when appropriate
• Deploy and configure infrastructure resources such as virtual machines, app services, containers, storage, and virtual networks
• Create and maintain documentation including runbooks, standard operating procedures, and knowledge base articles
• Systems Administration & Scripting: Administer and troubleshoot Linux and Windows-based systems across public and private cloud environments
• Assist with patching, hardening, and basic performance tuning activities
• Write and maintain scripts (PowerShell, Bash, Python) to support routine operational tasks
• Support basic monitoring and alerting, escalating issues as needed
• Growth Opportunities: Gain exposure to Infrastructure as Code (IaC) development and CI/CD pipeline improvements
• Engage in incident response and root cause analysis
• Take on security analysis and engineering initiatives
• Assist with infrastructure security and compliance efforts aligned to organizational requirements
• Shape infrastructure design decisions and identify opportunities to automate operational workflows
• Drive progress toward SOC 2 and regulatory compliance objectives
Staff Infrastructure Software Engineer
BNSF Railway
For more than 170 years, BNSF Railway has worked to connect its users with the global marketplace, playing “a vital role in building and sustaining this natio
Role Description
Be part of a team that values safety, inclusion, and excellence. As a member of our team, you will play a role in supporting the movement of essential products and materials that help feed, clothe, supply, and power communities throughout America and the world. This is a full stack infrastructure engineering role, involving backend services, APIs, and developer-facing services for our physical/digital infrastructure and platforms.
Key responsibilities may include:
- Develop and operate cloud-native systems and services using GitOps workflows.
- Build and operate IaaS (compute, storage, and networking), PaaS (DBs, messages, API, and serverless), identity/access, observability/alerting/remediation, and developer services.
- Collaborate with internal application teams to improve platform usability and feedback.
- Help design and run systems that scale across our data centers, edge devices, and public cloud environments powering always-on services.
- Demonstrate operational excellence by monitoring, troubleshooting, and resolving production issues, including participating in a 24/7 on‑call rotation.
The duties and responsibilities in this posting are representative categories to be used in deciding whether to apply for this position. This is not an exhaustive list of the position’s duties.
Qualifications
- Authorized to work in the US.
- 6+ years of proficiency with modern programming languages used in infrastructure (e.g. Go, Python, and Java).
- 5+ years of experience with IaaS, PaaS, Service Mesh, block/object storage, and Linux.
- 3+ years of experience with streaming data services and SQL/NoSQL/Graph DBs.
- Experience with CI/CD pipelines, Git workflows, and DevOps practices.
- Interest in open-source infrastructure and developer platforms.
- Experience with observability stacks (e.g., Prometheus, Grafana, OpenTelemetry).
- Experience with secure coding practices and infrastructure security principles.
- Experience with secrets management and identity and access control.
- Proven ability to independently deliver features that improve developer workflows or platform capabilities.
- Experience collaborating across teams and mentoring junior engineers.
- Familiarity with designing scalable systems and contributing to architectural decisions.
- Experience participating in design reviews, incident retrospectives, or RFC processes.
- Demonstrated ability to learn new technical concepts and to adapt to new technologies quickly.
- Strong communication and collaboration skills.
Requirements
- Bachelor’s degree or higher in computer science or a related field (preferred).
- Engineering experience with a public IaaS company (AWS, DigitalOcean, GCP, CoreWeave, etc.) (preferred).
- Experience running infrastructure platforms in production such as OpenStack, Ceph, Kubernetes, DBaaS, service mesh, and API gateways (preferred).
- GitHub commits to open-source infrastructure and developer platform projects (preferred).
- Experience with edge systems and hybrid environments (preferred).
- Experience with developer tooling as a builder and/or user (preferred).
- Proficiency with container-based development (preferred).
- Interest in sustainable infrastructure and cost/resource awareness (preferred).
- Familiarity with frameworks like React, Angular, Node.js, Spring Boot (preferred).
- Able to work now and in the future without BNSF’s assistance in obtaining, maintaining, or extending employment authorization (preferred).
Benefits
- An industry-leading 401(k) and renowned Railroad Retirement program.
- A range of robust health care options for you and your dependents (including domestic partners), including medical, dental, vision, telemedicine, mental health, cancer support, and high-quality care network options.
- Health care spending accounts (HSA) with employer contributions, as well as life and disability insurance, provided at no cost.
- Family benefits including parental, pediatric and family building support, adoption and surrogacy reimbursement, and dependent care spending account (with employer match).
- Access to discounts on travel, gym memberships, counseling services and wellness support.
- Annual bonus (Incentive Compensation Program).
- Generous leave / time off policies.
Senior Cloud Infrastructure Engineer
FUJIFILM
FUJIFILM is a publicly traded, multinational photography and imaging company with global headquarters in Tokyo, Japan and regional headquarters in Valhalla, New York. Established i
Role Description
Join us to shape and run a resilient, secure, and high‑performing global infrastructure that powers life‑changing work. As a Senior Infrastructure Engineer, you’ll be the technical lead ensuring the availability, scalability, and security of our enterprise platforms across data centers and cloud. If you love solving complex problems at scale, mentoring others, and turning architecture into robust, operational reality, this is your next challenge.
Qualifications
- Deep hands‑on expertise operating global enterprise infrastructure, including:
  - Virtualization and platforms: VMware, Windows Server, Citrix, Linux, hyperconverged
  - Cloud: AWS and/or Azure
  - Storage: design, build, and administration
  - Backup/DR: Veeam or similar replication technologies
  - Networking fundamentals: TCP/IP, DNS, DHCP, VPN
  - Security: implementing controls aligned to frameworks such as ISO 27000
  - Scripting/automation for efficiency and reliability
- Strength in producing clear technical documentation, SOPs, and validation evidence; at ease representing systems in audits/inspections.
- Excellent communication with both technical and non‑technical audiences; ability to influence stakeholders and lead through delivery.
Requirements
- Bachelor’s degree in Computer Science, Information Technology, or relevant technical certifications.
- 8+ years’ experience in a 3rd/4th‑line role within an enterprise IT infrastructure environment, or
- 5+ years’ experience in a biomedical IT service environment,
- Any combination of education and experience that provides an equivalent background to deliver against role expectations.
Benefits
This is a global position that will support all our FLBG sites. This position can be based at any of our locations around the globe. Benefits and compensation will be governed by the location that you are based from and considered your home site.




