Job Closed
This listing is no longer active.
Premium dedicated GPU servers and clusters. Raw performance at an unmatched price.
Principal Cluster Engineer, Training Infrastructure
Location
Finland
Posted
75 days ago
Seniority
Lead
Job Description
DataCrunch
- Design, deploy, and continuously improve large-scale InfiniBand-connected GPU training clusters
- Drive cluster-level storage performance, translating customer SLAs into internal throughput and IOPS performance targets
- Build and maintain automation for cluster provisioning, OS imaging, firmware management, and day-two operations using Python
- Contribute to infrastructure-as-code and CI/CD pipelines for cluster and platform management
- Establish and own performance baselines across compute, network fabric, and storage layers
- Identify, diagnose, and resolve performance bottlenecks across the full cluster stack
- Implement and maintain observability tooling including metrics, alerting, and anomaly detection systems
- Work closely with datacenter operations, cloud platform teams, ML researchers, and procurement to translate requirements into infrastructure architecture
- Participate in the on-call rotation and help maintain production reliability of the training clusters
Job Requirements
- 7+ years of hands-on infrastructure or systems engineering experience
- Experience operating large-scale HPC or AI training clusters (1000+ GPU nodes)
- Strong production experience with InfiniBand fabrics
- Experience working with NVIDIA GPU hardware in training workloads (Hopper or newer preferred)
- Proven experience leading or tech-leading engineering teams, setting technical direction, reviewing work, and mentoring engineers
- Experience with automation and scripting (Python preferred)
- Experience working with infrastructure-as-code tools such as Terraform, Ansible, or Salt
Benefits
- Competitive cash and equity package
- Benefits (healthcare, lunch, wellbeing, etc.)
More Engineer Jobs
- In this role, you will be instrumental in shaping Chainlink’s developer-focused narrative in the blockchain ecosystem through clear, compelling, and technically accurate content.
- As part of the Docs Team, you will create, refine, and enhance a wide range of materials that communicate Chainlink’s impact on onchain finance and asset tokenization.
- Your work will span technical documentation, code deep dives, developer-focused tooling, and more.
- You will collaborate closely with product managers, product engineers, developer relations, and other cross-functional teams to ensure content is both engaging and informative for global audiences.
- This is a unique opportunity to contribute to one of the fastest-growing projects in the blockchain ecosystem and shape how the industry understands the future of decentralized finance.
- Leading large-scale migrations from legacy environments (ESXi 7.x) to VMware vSphere 8 across multiple geographic sites using VMware HCX for workload mobility and minimal downtime.
- Designing, deploying, and maintaining VMware Cloud Foundation (VCF) environments, including the use of SDDC Manager for automated lifecycle management, patching, and upgrades of the full stack.
- Architecting and managing NSX (NSX-T) infrastructures, including the configuration of Tier-0/Tier-1 gateways, load balancing, and Distributed Firewall (DFW) policies for micro-segmentation.
- Extending existing PowerShell/PowerCLI and vRealize Orchestrator (vRO) workflows to automate VCF workload domain deployments and NSX object creation.
- Serving as the VM Template Manager for modern operating systems, utilizing automated patching and vulnerability remediation tailored for vSphere 8 secure baseline configurations.
- Managing high-performance computing clusters and MSCS VMware clustering to ensure high availability for SQL Server and other critical workloads during vMotion and site-to-site migrations.
- Utilizing HCX to establish network extensions (L2) and execute live (vMotion), bulk, or cold migrations between on-premises data centers and private/hybrid cloud environments.
- Creating and updating detailed design documentation (HLD/LLD), migration runbooks, and process guides to ensure compliance with federal security standards.
- Instrumentation Hands-on: Make code changes to implement distributed tracing, custom metrics, and structured logs using OpenTelemetry SDKs.
- Time Series Data Architecture: Design and maintain high-performance metrics pipelines using modern metrics storage solutions (time-series databases) capable of handling large volume and high cardinality.
- Visualization Ecosystem: Create analytical dashboards in Grafana and advanced monitors in Datadog, focusing on the Golden Signals (Latency, Errors, Traffic, and Saturation).
- Multi-Cloud Operation: Configure metrics and traces collection across AWS, GCP, and Azure environments, ensuring a unified view of the infrastructure.
- Business Monitoring: Create time series that reflect the health of the SaaS platform and real-time betting behavior (e.g., bets/sec vs. API latency).
- Error Culture: Define and implement technical SLIs/SLOs, ensuring the engineering team has actionable alerts and avoids alert fatigue.
IT Service Desk Tech 2
Trulieve
We strive to bring you the relief you need in a product you can trust.
If you are interested in being part of one of the fastest-growing industries in the nation, you may want to consider working for Trulieve! If you have a desire to help others in need through your efforts, this may be the role for you! At Trulieve, we strive to bring our patients the relief they need in a product they can trust. Our plants are hand-grown in an environment specially designed to reduce unwanted chemicals and pests, keeping the process as natural as possible at every turn. Our products are designed to alleviate seizures, severe and persistent muscle spasms, pain, nausea, loss of appetite, and other symptoms associated with serious medical conditions such as cancer. Our specially trained staff works hand-in-hand with physicians to provide the right products and the correct dosage to ensure patients get the compassionate care they need. To learn more about our company, please visit our website: https://www.trulieve.com
Requisition ID: 18862
Remote Work Available: Yes
Job Title: Service Desk Tech Level 2
Department: Information Technology
Reports to: IT Service Desk Supervisor
Location: Remote
Role Summary: The IT (Information Technology) Service Desk Tech role requires a customer service approach to help keep Trulieve technology running strong. Service Desk Techs focus on First Contact Resolution of all inbound support tickets from Trulieve departments nationwide. They are responsible for assisting our internal customers by providing world-class customer service with attention to detail as we enable the business to operate without limitations from IT. Service Desk Techs work as a team to manage all planned and unplanned tech communications for the organization. They work directly with Support Techs and IT leadership to design documented solutions to issues that arise as the organization grows. The Service Desk is the key position for upward mobility within IT, as Techs touch all systems throughout the organization.
Key Duties and Responsibilities:
- Provide outstanding customer service to our internal employees.
- Work within the service desk to process support request tickets and meet target SLAs.
- Properly utilize documentation for IT procedures, policies, and troubleshooting guides while providing feedback for improvements to each.
- Possess a good understanding of Windows operating systems, Microsoft print management, and standard O365 applications.
- Answer incoming support calls from grow, production, and retail facilities nationwide.
- Work with vendor engineers, project managers, and support techs to solve application-related issues.
- Possess basic network troubleshooting skills as well as an understanding of TCP/IP addressing.
- Assist in the management and tracking of all digital assets.
- Assist in the management of all existing and legacy applications (AD Groups).
- Create and remove user accounts across various systems as part of new hires and terminations.
- Work with the IT Service Desk Manager and IT Service Desk Leads to provide a seamless customer experience.
- Provide innovative solutions to uncommon problems.
- Work scheduled hours both inside and outside of regular working hours. Occasional emergency scheduling may occur. You are responsible for checking your schedule and being on time.
- Perform any other tasks assigned by the IT Service Desk Manager and/or IT Service Desk Supervisor.
SKILLS AND QUALIFICATIONS:
- Relevant associate degree or 2+ years' experience in a related field
- Effective time-management skills and ability to multi-task
- Ability to make educated decisions quickly
- Ability to pass a Level 2 Background Screening
- A passion for customer service, a customer-obsessed mindset, and strong problem-solving skills
- Excellent communication skills, both written and verbal
Work Schedule:
- 40+ hours weekly with flexible hours depending on department needs.
- Must be available to work occasional evenings, weekends, and holidays.
Equal Opportunity Employer | Trulieve Supports a Drug Free Workplace
Salary will be commensurate with experience. A comprehensive benefits package including paid time off is offered with this position.
Trulieve provides equal employment opportunities to all employees and prohibits discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, pregnancy, or any other characteristic protected by federal, state, or local laws.




