The platform to build ML models & build/publish Lightning Apps that “glue” together your favorite ML lifecycle tools.
Infrastructure Engineer - Observability
Location
United States
Posted
7 days ago
Salary
$180K - $200K / year
Seniority
Senior
Job Description
Infrastructure Engineer - Observability
Lightning AI
Locations: New York, New York, United States; Remote; San Francisco, California, United States; Seattle, Washington, United States
Who We Are
Lightning AI is the company behind PyTorch Lightning. Founded in 2019, we build an end-to-end platform for developing, training, and deploying AI systems, designed to take ideas from research to production with less friction.
Through our merger with Voltage Park, a neocloud and AI Factory, Lightning AI combines developer-first software with cost-efficient, large-scale compute. Teams get the tools they need for experimentation, training, and production inference, with security, observability, and control built in.
We serve solo researchers, startups, and large enterprises. Lightning AI operates globally with offices in New York City, San Francisco, Seattle, and London, and is backed by Coatue, Index Ventures, Bain Capital Ventures, and Firstminute.
What We’re Looking For
Lightning AI is seeking an Observability Infrastructure Engineer to join our Infrastructure Engineering team. In this role, you will own and evolve observability systems across large-scale, GPU-enabled bare-metal infrastructure. You’ll operate at the intersection of infrastructure, data, and product, building platforms for metrics, logs, traces, and alerting that power both internal operations and customer-facing visibility.
You will play a key role in productizing observability, enabling scalable, multi-tenant monitoring experiences while keeping pace with rapid infrastructure buildouts. This includes designing telemetry pipelines, improving signal quality, and delivering actionable insights that ensure reliability and transparency across our platform.
We’re flexible on location for this team. This role can work hybrid out of one of our US-based hubs (Seattle, NYC, or SF) or fully remote within the U.S., with occasional company and team offsites.
We are not able to provide visa sponsorship for this position at this time.
What You’ll Do
Observability Platform & Productization
- Own and evolve a scalable observability platform spanning metrics, logs, traces, and events
- Drive the productization of observability capabilities for both internal teams and external customers
- Design multi-tenant observability systems with scoped access, RBAC, and customer-facing visibility
- Continuously improve observability systems to keep pace with rapid infrastructure buildouts
Telemetry & Data Pipelines
- Design and operate telemetry pipelines ingesting data from GPUs, CPUs, networking (Ethernet & InfiniBand), containers, APIs, and BMC/Redfish
- Build systems to correlate signals across infrastructure layers to enable faster debugging and root cause analysis
- Implement streaming and real-time data pipelines using tools such as Kafka, OTEL, Promtail, or similar
Alerting, Reliability & Insights
- Design and implement noise-resistant alerting systems to improve signal quality and reduce operational load
- Create dashboards and alerting for InfraOps, Engineering, and Customer Success teams
- Build automated insights and enable proactive detection, forecasting, and system health visibility at scale
Systems & Infrastructure Engineering
- Contribute to broader infrastructure engineering projects beyond observability
- Partner with infrastructure and platform teams to embed observability into core systems and workflows
- Support large-scale, distributed systems across compute, networking, and storage environments
Cross-Functional Collaboration
- Work closely with customer-facing teams to deliver external observability experiences
- Collaborate with engineering, operations, and support teams to improve system transparency and reliability
- Help define best practices for observability across the organization
What You’ll Need
Required Qualifications
- 5+ years of experience in infrastructure engineering, SRE, or observability-focused roles
- Strong experience with monitoring systems such as Prometheus, Grafana, ELK, or VictoriaMetrics
- Experience building and operating observability platforms at scale
- Proficiency in Python, Go, or bash for automation and data integration
- Familiarity with containerized environments and Kubernetes observability
- Experience with streaming telemetry pipelines (Kafka, OTEL, Promtail, or equivalent)
- Experience with multi-tenant monitoring architectures
- Strong written and verbal communication skills
Ideal Experience
- Experience with GPU observability, particularly NVIDIA DCGM
- Experience monitoring large-scale GPU or HPC clusters
- Familiarity with InfiniBand fabric observability
- Experience building customer-facing or productized infrastructure systems
- Experience with correlation engines, RCA workflows, or predictive alerting systems
- Broad exposure to infrastructure domains including networking, storage, and provisioning
Compensation
We are committed to offering competitive compensation that reflects the value each team member brings to our mission. Final offers are based on factors such as experience, skills, geographic location, and role expectations. In addition to base salary, our total rewards package for eligible roles includes a discretionary bonus, a meaningful equity component, and comprehensive benefits.
The anticipated annual base salary range for this role is: $180,000 - $200,000 USD
Benefits and Perks
We offer a comprehensive and competitive benefits package designed to support our employees’ health, well-being, and long-term success. Benefits may vary by location, team, and role. Benefits include:
- Comprehensive medical, dental and vision coverage (U.S.); Private medical and dental insurance (U.K.)
- Retirement and financial wellness support (U.S.); Pension contribution (U.K.)
- Generous paid time off, plus holidays
- Paid parental leave
- Professional development support
- Wellness and work-from-home stipends
- Flexible work environment
At Lightning AI, we are committed to fostering an inclusive and diverse workplace. We believe that diverse teams drive innovation and create better products. We provide equal employment opportunities to all employees and applicants without regard to race, color, religion, gender, sexual orientation, gender identity, national origin, age, disability, veteran status, or any other protected characteristic. We are dedicated to building a culture where everyone can thrive and contribute to their fullest potential.
More Infrastructure Engineer Jobs
Infrastructure Automation Engineer
Bright Vision Technologies
Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications. We recognize that our people are our strength. We are an equal opportunity employer and place a high value on diversity and inclusion. We do not discriminate on the basis of any protected attribute. We make reasonable accommodations for applicants’ and employees’ religious practices and beliefs, as well as mental health or physical disability needs. Bright Vision Technologies is an Equal Opportunity Employer, including Disability/Veterans.
Role Description
We are seeking an Infrastructure Automation Engineer with deep Terraform expertise to design, build, and maintain the infrastructure-as-code foundations that power our cloud and hybrid environments. This role focuses on creating reusable Terraform modules, hardening pipelines, enforcing policy-as-code, and standardizing infrastructure delivery across multiple teams and cloud providers. The ideal candidate brings strong software engineering discipline to infrastructure work, has shipped production-grade Terraform at scale, and understands the operational realities of managing thousands of resources across many environments and accounts.
Key Responsibilities
- Design, develop, and maintain modular, composable Terraform code that codifies the entire infrastructure estate across cloud accounts and environments.
- Build a library of well-tested, reusable Terraform modules with clear interfaces, semantic versioning, and comprehensive documentation.
- Implement Terraform automation pipelines using GitHub Actions, GitLab CI, Atlantis, Terraform Cloud, or Spacelift, with plan/apply gating, drift detection, and policy enforcement.
- Define and enforce policy-as-code using Sentinel, Open Policy Agent (OPA), Conftest, or Checkov to prevent insecure or non-compliant infrastructure changes.
- Manage Terraform state at scale with appropriate backend strategies, state locking, workspace organization, and disaster recovery patterns.
- Drive multi-account, multi-region, and multi-cloud infrastructure provisioning strategies with clear isolation, naming, and tagging standards.
- Implement infrastructure testing including unit tests with terraform-compliance, integration tests with Terratest, and policy tests across pull requests.
- Collaborate with security, networking, and platform teams to embed guardrails directly into reusable modules and pipelines.
- Standardize patterns for secrets management, identity federation, and least-privilege IAM through reusable Terraform abstractions.
- Lead migrations from legacy, ClickOps, or non-IaC infrastructure into managed Terraform footprints with minimal disruption.
- Drive cost optimization, tagging hygiene, and lifecycle management across the Terraform-managed estate.
- Mentor engineering teams on Terraform best practices, anti-patterns, and pull-request review standards.
- Maintain comprehensive runbooks, architecture diagrams, and onboarding materials for the infrastructure platform.
- Stay current with Terraform, OpenTofu, and broader IaC ecosystem developments and recommend adoption where beneficial.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- Five or more years of experience in cloud infrastructure or DevOps engineering, with significant Terraform focus.
- Deep, hands-on expertise authoring and maintaining production Terraform across at least one major cloud provider.
- Strong experience designing reusable Terraform modules with clean APIs and version discipline.
- Hands-on experience with Terraform state management, backends, and large-scale workspace organization.
- Strong scripting skills in Python, Go, or Bash.
- Experience with CI/CD pipelines for infrastructure code and automated policy enforcement.
- Solid understanding of cloud networking, identity, and security primitives.
- Strong Git-based workflows including code review, branching, and release management.
- Excellent troubleshooting and root-cause analysis skills.
Preferred Qualifications
- Experience with multi-cloud Terraform (AWS + Azure or AWS + GCP).
- Familiarity with Terragrunt, Atlantis, Spacelift, or env0.
- Experience with policy-as-code engines (Sentinel, OPA, Checkov).
- Contributions to public Terraform modules or providers.
- Exposure to FinOps practices and tagging-driven cost governance.
How to Apply
Would you like to know more about this opportunity? For immediate consideration, please send your resume to [email protected] or contact us at (908) 650-6699. Learn more about Bright Vision Technologies at www.bvteck.com.
Cloud & Infrastructure Engineer
Switzerland Global Enterprise
We support Swiss SMEs in their international business and help innovative foreign companies to establish themselves in Switzerland.
Role Description
To strengthen our IT team, we are looking for a committed and technically skilled Cloud & Infrastructure Engineer at our Zurich office, starting immediately or by arrangement, with a focus on Microsoft Cloud, infrastructure, and modern workplace technology.
- Design, operation, further development, and lifecycle management of our IT infrastructure in the cloud, including AVD
- Design, expansion, and operational support of the Microsoft cloud environment, with a focus on Azure, Microsoft 365, Entra ID, Teams, SharePoint, and Intune
- Operation and further development of central infrastructure and platform services, in particular AD, DNS, DHCP, GPO, and server and domain services
- Implementation, maintenance, and optimization of network and security components such as firewall, WAN, LAN, WLAN, VPN, VLAN, and SD-WAN
- Responsibility for identity and access management topics, including Conditional Access, Enterprise Applications, App Registrations, and identity, mail, and AD security
- Contribution to technical security and ensuring adherence to standards, architecture principles, and security and compliance requirements
- Engineering of forward-looking solutions and automation of existing and new infrastructure
- Digitalization and standardization of processes, including with PowerShell
- Monitoring of infrastructure and services with PRTG
- Support for and participation in projects, change requests, requirements engineering, and the further development of IT architectures
- Responsibility for license management in the Microsoft environment, in particular under Cloud Solution Provider (CSP) models
- Advising and coaching internal stakeholders and ensuring knowledge transfer on methods, processes, and technologies
- Participation in 1st-, 2nd-, and 3rd-level support and technical coordination with application owners and external partners
Qualifications
- Completed education or further training in computer science, business informatics, or a comparable field
- Several years of experience in the Microsoft environment with a focus on cloud, server administration, infrastructure, and platform development
- Solid knowledge of Azure, Microsoft 365, storage, backup, virtualization, and modern infrastructure and cloud technologies
- Very good knowledge of IAM, Entra ID, Conditional Access, and security-related topics in the Microsoft environment
- Experience developing, optimizing, and evaluating IT architectures and solutions with security and compliance in mind
- Confident use of PowerShell and enthusiasm for automation and standardization
- Experience as a System Engineer, Cloud Engineer, or IT Architect
- Structured, solution-oriented, and independent working style, with strong communication and presentation skills
- Very good written and spoken German and English
- Microsoft certifications and experience with methodological frameworks in the Microsoft environment are an advantage
Benefits
- S-GE offers you fascinating work at the interface between politics and international business, in a modern working environment in the heart of Zurich.
- S-GE values flexible working arrangements and fosters a collegial environment as well as the professional and personal development of its employees.
Company Description
We look forward to receiving your online application. We are happy to answer any questions at hr@s-ge.com. Please note, however, that we can only accept applications through our online tool.
Senior Infrastructure Systems Engineer
Ascend Technologies
Innovation & Technology Enabling Business Growth
- Provide escalated support and operational maintenance to client environments.
- Work on project-based initiatives for clients, including upgrades to existing infrastructure and deployment of new technology.
- Serve as escalation point for the Ascend Technologies Network Operations Center (NOC) and support engineers
Senior Infrastructure Engineer
CoinTracker
The gold standard in crypto portfolio tracking and taxes. Increasing the world's financial freedom and prosperity.
- Build and operate the platform that powers consumer tax product and Nino, our AI-driven personal finance tool
- Scale databases, automate deployments, and stand up infrastructure foundation for AI-powered features
- Maintain Redis caching layers for session and data caching at scale
- Operate GitOps pipelines and image-driven promotion workflows from staging to production
- Develop and maintain Helm charts for consistent deployments across environments
- Implement observability with OpenTelemetry: monitors, dashboards, metric routing with PII stripping, and Pub/Sub export
- Secure the platform with Cloudflare integration, Cloud Armor WAF, CloudOrigin CA TLS certificates, and GCP Secret Manager via External Secrets


