Andromeda

Where technology meets empathy – pioneering the future of human-robot interaction.

Senior Site Reliability Engineer – AI Infrastructure

Location

California

Posted

57 days ago

Salary

Not specified

Seniority

Senior

Job Description

  • Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training
  • Serve as the primary technical point of contact for customers running large-scale training workloads
  • Define SLOs and error budgets that account for the unique failure modes of GPU infrastructure
  • Ensure the health and performance of high-speed interconnects
  • Build deep visibility into GPU utilization, memory pressure, interconnect throughput
  • Build production-grade automation for cluster provisioning, GPU health checks, job scheduling
  • Lead incident response for complex failures spanning hardware, networking, orchestration

Job Requirements

  • Deep, hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200 or equivalent)
  • Production experience with InfiniBand, RoCE, or NVLink fabrics in the context of distributed training
  • Working knowledge of NCCL, CUDA, PyTorch distributed, DeepSpeed, Megatron, FSDP, or similar
  • Expert-level Linux knowledge
  • Strong experience running Kubernetes in production with GPU workloads
  • Strong engineering skills in Python, Go, or Bash
  • Hands-on experience building monitoring and alerting for GPU infrastructure
  • Proven track record leading incident response for complex distributed systems

Benefits

  • Health insurance
  • Retirement plans
  • Paid time off
  • Flexible work arrangements
  • Professional development

More DevOps Engineer Jobs

Team Lead DevOps – Operations

conology GmbH

IT Consulting & Development | We design your path into the future

DevOps Engineer | 57 days ago
Full Time | Remote | Team 1-10 | Since 2009 | H1B No Sponsor

  • You build and lead a small DevOps and operations team
  • You ensure stable, scalable IT operations (services, platforms)
  • You work closely with development and product management
  • You introduce DevOps standards and continue to develop them
  • You are responsible for CI/CD, monitoring, security, and infrastructure
  • Your hands-on involvement in day-to-day operations is your greatest strength

Germany
€70K - €90K / year
Job Closed

DevOps & Cloud Architect

National Debt Relief, LLC.

National Debt Relief was founded in 2009 with the goal of helping an expanding number of consumers deal with overwhelming debt. We are one of the most-trusted and best-rated consumer debt relief providers in the United States. As a leading debt settlement organization, we have helped over 450,000 people settle over $10 billion of debt, while empowering them to lead a healthier financial lifestyle and feel free to live their best life. At National Debt Relief, we treat our clients like real people. Our purpose is to elevate, empower, and transform their lives. Rated A+ by the Better Business Bureau, our goal is to help individuals and families get out of debt with the least possible cost through conducting financial consultations, educating the consumer and recommending the appropriate solution. We become our clients' number one advocate to help them reestablish financial stability as quickly as possible.

DevOps Engineer | 57 days ago
Full Time | Remote | Team 1,001-5,000

Overview

We are seeking a highly skilled DevOps & Cloud Architect to design, implement, and manage scalable, secure, and high-performance cloud infrastructure. This role will bridge development and operations, ensuring efficient CI/CD processes, robust system architecture, and adherence to best practices across cloud and Salesforce environments, with a strong emphasis on AI-driven automation.

Responsibilities

  • Design and architect cloud-native and hybrid solutions across platforms such as AWS, Azure, or GCP
  • Define and implement DevOps strategies, frameworks, and best practices
  • Build and maintain scalable, resilient, and secure infrastructure
  • Lead the design and optimization of CI/CD pipelines for faster and reliable deployments
  • Implement and manage Salesforce DevOps processes using Salesforce DX (SFDX)
  • Design and automate Salesforce deployment pipelines (metadata/API-based deployments, CI/CD integration)
  • Manage version control and branching strategies for Salesforce and cloud applications
  • Establish infrastructure as code (IaC) practices using tools like Terraform, CloudFormation, or ARM templates
  • Ensure system reliability, performance, and cost optimization
  • Implement monitoring, logging, and alerting solutions
  • Leverage AI-assisted development tools to improve developer productivity, code quality, and delivery speed
  • Design and implement AI-driven automation for CI/CD pipelines, testing, code reviews, and operational workflows
  • Define guardrails and governance for safe and controlled use of AI tools to prevent unnecessary refactoring and risk
  • Integrate AI-based insights into monitoring, anomaly detection, and incident management
  • Evaluate and adopt emerging AI/ML tools for DevOps, cloud operations, and Salesforce development
  • Collaborate with development, QA, Salesforce teams, and architecture teams to streamline delivery workflows
  • Define and enforce security, compliance, and governance standards across cloud and Salesforce platforms
  • Evaluate and integrate new tools, technologies, and automation practices
  • Provide technical leadership and mentorship to DevOps engineers and teams

Qualifications

Education/Experience:

  • Bachelor’s degree in Computer Science, Engineering, or related field (or equivalent experience)
  • 8+ years of experience in DevOps, cloud engineering, or infrastructure roles

Required Skills/Abilities:

Required

  • Strong hands-on experience with at least one major cloud provider (AWS, Azure, or GCP)
  • Proven experience with Salesforce DX (SFDX) and Salesforce deployment processes
  • Experience with AI-assisted development tools and automation frameworks
  • Expertise in CI/CD tools (e.g., Jenkins, GitHub Actions, GitLab CI, Azure DevOps)
  • Proficiency in Infrastructure as Code (Terraform, CloudFormation, etc.)
  • Strong scripting skills (Python, Bash, or similar)
  • Experience with containerization and orchestration (Docker, Kubernetes)
  • Solid understanding of networking, security, and system design principles
  • Experience with monitoring tools (Prometheus, Grafana, ELK, Datadog, etc.)
  • Experience with Git-based version control and branching strategies

Preferred

  • Cloud certifications (e.g., AWS Certified Solutions Architect, Azure Solutions Architect Expert)
  • Salesforce certifications (e.g., Salesforce Platform Developer, DevOps or Architect certifications)
  • Experience with Salesforce environments (Scratch Orgs, Sandboxes, Packaging)
  • Knowledge of DevSecOps and secure deployment practices
  • Familiarity with cost optimization and FinOps practices
  • Experience implementing AI-driven DevOps (AIOps) or intelligent automation solutions

National Debt Relief Role Qualifications:

  • Computer competency and ability to work with a computer.
  • Prioritize multiple tasks and projects simultaneously.
  • Exceptional written and verbal communication skills.
  • Punctuality expected, ready to report to work on a consistent basis.
  • Attain and maintain high performance expectations on a monthly basis.
  • Work in a fast-paced, high-volume setting.
  • Use and navigate multiple computer systems with exceptional multi-tasking skills.
  • Remain calm and professional during difficult discussions.
  • Take constructive feedback.
  • Available for full-time position, overtime eligible if classified non-exempt.

Compensation Information

Our salary ranges are determined by role, level, and location. The range displayed on each job posting reflects the minimum and maximum target for each position across the US. Within the range, individual pay is determined by work location, job-related skills, experience, and relevant education or training. This good faith pay range is provided in compliance with NYC law and the laws of other jurisdictions that may require a salary range in job postings. The salary for this position is $164,000 - $188,500 annually.

Benefits

National Debt Relief is a team-oriented environment full of rewards and growth opportunities for our employees. We are dedicated to our employees' success and growth within the company, through our employee mentorship and leadership programs. Our extensive benefits package includes:

  • Generous Medical, Dental, and Vision Benefits
  • 401(k) with Company Match
  • Paid Holidays, Volunteer Time Off, Sick Days, and Vacation
  • 12 weeks Paid Parental Leave
  • Pre-tax Transit Benefits
  • No-Cost Life Insurance Benefits
  • Voluntary Benefits Options
  • ASPCA Pet Health Insurance Discount
  • Wellness Incentive Program

National Debt Relief is a certified Great Place to Work®!

National Debt Relief is an equal opportunity employer and makes employment decisions without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, protected veteran status, disability status, or any other status protected by law.

For information about our Employee Privacy Policy, please see here. For information about our Applicant Terms, please see here.

United States
$164K - $188K / year

Principal Software Engineer-C#/AKS/Azure/DevOps

H&R Block

Since 1955, we have been leaders in tax preparation, financial services, and small business solutions. With 70,000 associates and 9,000 retail tax locations across North America, Australia, Ireland, and India, we have helped millions of clients and countless communities. If you embrace challenges as opportunities, value winning as a team, and seek to make a meaningful difference, join us on our journey.

DevOps Engineer | 57 days ago
Full Time | Remote | Team 10,001

Our Company

We care about helping people. Our purpose is to provide help and inspire confidence in our clients and communities everywhere. Our associates feel a sense of belonging in an inclusive place with an amazing history and a sharp focus on our future. Our connected culture is who we are and how we work together to achieve our strategies, accelerate our transformation, and achieve extraordinary results. It’s an exciting time to be a part of H&R Block!

What you'll do...

As a Principal Engineer, you will be a senior technical leader operating horizontally across multiple teams and platforms. You will serve as a high-impact individual contributor focused on elevating engineering practices, strengthening operational reliability, and enabling teams to deliver scalable, maintainable, and observable solutions.

You will influence technical direction and architectural decisions across initiatives, guide teams through complex technical challenges, and partner closely with Product, Engineering, Architecture, and Platform leaders to translate business needs into sustainable technical approaches. This role emphasizes how engineering work is done (design quality, operational readiness, consistency, and long-term system health) rather than ownership of a single product or domain.

Day to day, you'll...

  • Define and influence technical strategy and architecture across multiple teams and platforms.
  • Collaborate with Product Owners to ensure Epics and initiatives clearly capture business intent, technical outcomes, and operational considerations.
  • Lead design discussions and proofs of concept (POCs) to evaluate solution options and architectural trade-offs.
  • Break down complex business and technical problems into practical, scalable, and supportable solution approaches.
  • Guide teams in resolving highly complex, ambiguous, or cross-cutting technical challenges.
  • Ensure consistent implementation of architectural standards, frameworks, conventions, and engineering best practices.
  • Drive alignment in system design, testing strategies, quality practices, and operational readiness across teams.
  • Foster a strong culture of observability, operational excellence, and data-driven decision making.
  • Actively engage in incident response, production issue resolution, and post-incident reviews, with a focus on systemic improvements and resilience.
  • Partner with platform, infrastructure, and SRE teams to improve system reliability, availability, and performance.
  • Evaluate and guide adoption of new technologies, tools, and design patterns aligned with enterprise standards and long-term needs.
  • Anticipate future use cases and guide design decisions that minimize long-term cost of change.
  • Mentor senior engineers and technical leads, elevating technical judgment, system thinking, and operational awareness.
  • Identify and address cross-team friction in tools, processes, or architectural approaches to improve engineering effectiveness at scale.

What you'll bring to the team...

Education

  • Bachelor’s degree in a related field or the equivalent combination of education and relevant work experience.
  • 12+ years of progressive experience in software engineering, systems design, or related technical roles.
  • Strong experience with the Microsoft technology ecosystem, including modern .NET-based services and Angular or comparable frontend frameworks.
  • Hands-on experience designing, deploying, and operating cloud-based solutions with an emphasis on scalable, secure, and resilient architectures.
  • Experience building and supporting containerized workloads using Kubernetes, including platforms such as Azure Kubernetes Service (AKS) or equivalent.
  • Familiarity with cloud-native and DevOps practices, including CI/CD pipelines, infrastructure as code, and automated testing strategies.
  • Demonstrated experience operating at enterprise scale and influencing architecture and technical direction beyond a single team.
  • Deep technical expertise with the ability to connect business strategy to sustainable technology solutions.
  • Strong system design and architectural skills, with a focus on scalability, reliability, operability, and long-term maintainability.
  • Proven ability to influence without direct authority and drive organization-wide alignment on engineering and operational practices.
  • Experience mentoring engineers and serving as a role model for technical excellence.
  • Excellent written and verbal communication skills, with the ability to convey complex technical concepts to diverse audiences.
  • Broad technical perspective and willingness to contribute beyond a single area of specialization.

It would be even better if you also had...

  • Experience with modern observability practices, including monitoring, logging, tracing, and using operational signals to guide system improvements.
  • Exposure to AI-enabled solutions or applied machine learning, including evaluating where such capabilities add meaningful business value.
  • Experience supporting high-availability, high-volume, or transaction-heavy systems in production environments.
  • A track record of contributing to engineering culture, communities of practice, or cross-team technical initiatives.
  • Experience guiding teams through architectural modernization and platform evolution while managing and reducing long-term technical debt.

Why work for us

You’ll reap the rewards of helping others along with competitive compensation and benefits to support your health and well-being. Specific benefits may vary based on your role. For detailed eligibility requirements and benefits information, visit blockbenefits.com.

Equal Opportunity Employer: H&R Block does not tolerate discrimination based on a person’s race, color, religion, ancestry, age, sex/gender (including pregnancy, childbirth, related medical conditions and sex-based stereotypes and transgender status), sexual orientation, gender identity or expression, service in the Armed Forces, national origin, physical or mental disability, genetic information, citizenship status or any other status protected by law.

Sponsored Job

United States

Sr Claim System Ops Analyst

Crawford & Company

We’re Crawford, a global leader in claims management, where every claim represents a person and a community we help restore. At Crawford, employees are empowered to grow, emboldened to act and inspired to innovate. Our industry-leading team pioneers new solutions for the industries and customers we serve. We’re looking for the next generation of leaders to take this journey with us. We hail from more than 70 countries and speak dozens of languages, reflecting the global fabric of the audience we serve. Though our reach is vast, we proudly operate as One Crawford: united in purpose, vision and values.

DevOps Engineer | 57 days ago
Full Time | Remote | Team 10,001

🚀 Join Crawford & Company as a Sr Claim System Ops Analyst! (Remote)

Under general supervision, you’ll provide a variety of critical operational support services to the Claims Department, its business partners, and clients. Become the go-to expert for optimizing claims systems and processes. In this role, you’ll lead critical projects, analyze and enhance system functionality, and provide subject matter expertise that shapes the future of claims operations.

✨ Why Crawford & Company?

  ✅ Remote Flexibility – Work from anywhere in the U.S.
  ✅ Excellent Crawford Benefits programs that empower financial, physical, and mental wellness
  ✅ Generous Employee Referral Bonus Program
  ✅ Multiple Employee Discounts

If you thrive on problem-solving, project management, and collaboration, this is your chance to make a difference while enjoying remote flexibility and outstanding perks.

📩 Apply today and help us transform claims operations!

Why Crawford?

Because a claim is more than a number: it’s a person, a child, a friend. It’s anyone who looks to Crawford on their worst days. And by helping to restore their lives, we are helping to restore our community – one claim at a time. Learn more at www.crawco.com.

When you accept a job with Crawford, you become a part of the One Crawford family. Our total compensation plans provide each of our employees with far more than just a great salary:

  • Pay and incentive plans that recognize performance excellence
  • Benefit programs that empower financial, physical, and mental wellness
  • Training programs that promote continuous learning and career progression while enhancing job performance
  • Sustainability programs that give back to the communities in which we live and work
  • A culture of respect, collaboration, entrepreneurial spirit and inclusion

Crawford & Company participates in E-Verify and is an Equal Opportunity Employer. M/F/D/V

Crawford & Company is not accepting unsolicited assistance from search firms for this employment opportunity. All resumes submitted by search firms to any employee at Crawford via email, the Internet or in any form and/or method without a valid written Statement of Work in place for this position from Crawford HR/Recruitment will be deemed the sole property of Crawford. No fee will be paid in the event the candidate is hired by Crawford as a result of the referral or through other means.

United States