Job Closed

This listing is no longer active.

Dandy Dental Lab

Dandy oversees a platform created to help modernize the dental lab process. The company’s platform is designed to make the entire process digital from start t

Senior Data Infrastructure Engineer – Tech Lead

Location

United States

Posted

73 days ago

Salary

$201.5K - $237K / year

Seniority

Senior

Job Description

  • Set the vision and technical direction for Dandy’s Data Infrastructure team, collaborating with stakeholders.
  • Own Dandy’s data pipelines and warehouse, ingesting data from hundreds of sources and scaling the tooling and platform to support the data engineering and analytics teams’ needs as the business grows.
  • Develop and maintain infrastructure, systems, and tooling to support Dandy’s data engineering...
  • Improve developer experience and productivity across our range of software repositories.
  • Collaborate with stakeholders within the tech org to influence overall objectives and long-term goals of your team.
  • Advocate for improvements to product quality, security, and performance that have a particular impact across your team and others.
  • Design improvements to infrastructure quality, security, and performance.
  • Craft code that meets our internal standards for style, maintainability, and best practices.
  • Foster a culture of proactive collaboration and systems resiliency and performance. Drive scalable improvements to resiliency and effectiveness through automation.
  • Recognize impediments to our efficiency as a team ("technical debt"), propose and implement solutions.

Job Requirements

  • 5+ years of data infrastructure engineering experience, preferably in a high-growth startup environment
  • Proficiency in DBT, Dagster/Airflow, Python, SQL
  • Experience in Public Cloud Infrastructure (GCP preferred, AWS, Azure)
  • Experience creating and maintaining CI/CD pipelines for DBT models
  • Experience with infrastructure as code platforms (Terraform, Pulumi)
  • Experience designing the architecture and automation of infrastructure within a cloud environment
  • Hands-on expertise in designing and implementing data pipelines, ETL processes, and data platforms
  • A collaborative, pragmatic, and growth-oriented mindset
  • The ability to clearly and concisely communicate about complex technical, architectural, and/or organizational problems
  • Experience with performance and optimization problems and a demonstrated ability to both diagnose and prevent these problems
  • Comfort working in a highly agile, intensely iterative software development process
  • Self-motivated, self-managing and takes ownership, with excellent organizational skills.

Benefits

  • healthcare
  • dental
  • mental health support
  • parental planning resources
  • retirement savings options
  • generous paid time off

More Infrastructure Engineer Jobs

SRE Observability SLO Engineer

General Electric - GE

Built on more than 130 years of experience, GE Vernova, a division of General Electric (GE), is leading a new era of energy by electrifying the world while work

Role Description

GE Vernova's GridOS Platform Engineering team is building the next generation of SaaS reliability for critical energy infrastructure. The Observability & SLO Engineer is the eyes and ears of the GridOS SRE team. In this role you will build and own the full telemetry stack, from instrumentation standards to SLO dashboards to synthetic monitors, that gives GE Vernova and its utility customers real-time confidence in the reliability of mission-critical energy management systems. This is a cyclical, high-impact position: you will drive an intensive initial ramp to establish v1.0 observability coverage across all customer environments, then shift into an ongoing improvement cadence aligned to new product releases and customer onboarding.

Qualifications

  • 2–3 years in SRE, observability engineering, or infrastructure reliability roles.
  • Deep expertise with at least one major observability platform: Datadog, Grafana + Prometheus, AWS CloudWatch, Dynatrace, or New Relic.
  • Hands-on experience implementing SLIs, SLOs, and error budget burn-rate alerting in a production SaaS environment.
  • Strong understanding of distributed systems telemetry: metrics (Prometheus/CloudWatch), structured logging (CloudWatch Logs Insights, ELK), and distributed tracing (OpenTelemetry, AWS X-Ray).
  • Experience with Kubernetes observability: kube-state-metrics, node exporters, Helm-deployed monitoring stacks, and namespace-level resource metrics.
  • Proficiency in at least one query/visualization language: PromQL, Splunk SPL, Datadog Query Language, or CloudWatch Logs Insights query syntax.
  • Experience designing alerting strategies that minimize alert fatigue through symptom-based and burn-rate approaches.
  • Scripting skills in Python and/or Bash for automation of monitoring configuration and report generation.

Requirements

  • Cloud Technologies: AWS Cloud Infrastructure (EKS, RDS, MSK, S3, EC2, EBS, SQS, etc.)
  • Kubernetes: EKS, Rancher
  • Infrastructure as Code: Terraform
  • Deployment and Configuration Tools: Ansible, Chef or Puppet
  • Telemetry standards and tools: OpenTelemetry, CloudWatch, CloudTrail
  • Observability tools and technology: Datadog, Splunk, New Relic, etc.
  • Alerting and notification: AWS and Azure alerting notification
  • Scripting: Go, Python, Groovy, Bash
  • Strong Linux administration skills
  • Strong analytical and problem-solving skills

Benefits

  • Relocation Assistance Provided: Yes
  • This is a remote position

Leadership

  • Influences through others; builds direct and "behind the scenes" support for ideas.
  • Preemptively sees downstream consequences and effectively tailors influencing strategy to support a positive outcome.
  • Able to verbalize what is behind decisions and downstream implications.
  • Continuously reflecting on success and failures to improve performance and decision-making.
  • Understands and encourages change when needed.
  • Proactively identifies and removes project obstacles or barriers on behalf of the team.
  • Able to navigate accountability in a matrixed organization.
  • Self-starter; communicates and demonstrates a shared sense of purpose. Learns from failure.

Personal Attributes

  • Critical thinker; able to quickly adapt to changing environments.
  • A hacker or tinkerer at heart.
  • Risk taker, not afraid to think outside the box or challenge the status quo.
  • Emotional Intelligence, ability to influence up and out and the ability to work independently.
  • Must be a team player with a strong desire to win.
  • Passionate about continuously learning.
  • Highly organized and efficient; able to balance competing priorities and execute accordingly.
  • Strong oral and written communication skills.

Worldwide
Job Closed

SRE Platform Engineer

General Electric - GE

Built on more than 130 years of experience, GE Vernova, a division of General Electric (GE), is leading a new era of energy by electrifying the world while work

Role Description

The Platform System Reliability Engineer is the primary operations engineer and operator of our EKS Kubernetes environment, which serves as the foundation for our global grid software SaaS products. This role focuses on the "middle-mile" of software delivery, ensuring that the underlying compute, networking, and storage layers are secure, hardened, scalable, and resilient to support critical energy infrastructure in the cloud. You will be responsible for the full lifecycle of production clusters, from initial bootstrapping through performance tuning, patching, and securing.

Qualifications

  • Bachelor's Degree in Computer Science or “STEM” Majors (Science, Technology, Engineering and Math) with advanced experience.
  • 6–8 years in SRE or Platform Engineering roles supporting mission-critical, 24/7 cloud environments.

Requirements

  • 5 years of experience operating production-grade Kubernetes clusters at scale.
  • Expert-level knowledge of multi-cluster management, performance tuning and experience implementing observability tools such as Prometheus/Grafana, Dynatrace, Splunk, Datadog, etc.
  • Deep hands-on experience with AWS core services (EKS, EC2, ALB, S3, RDS, MSK).
  • Proficiency in Terraform, Ansible, and Python or Go for infrastructure automation and deployment tools like ArgoCD or Flux.
  • Strong understanding and hands-on experience of cloud networking concepts such as VPCs, routing, load balancing and security configurations such as encryption, certificate management.

Benefits

  • Relocation Assistance Provided: Yes
  • This is a remote position

Roles and Responsibilities

Day 0: Provision & Infrastructure Hardening

  • Kubernetes Cluster Orchestration: Help design and deploy hardened EKS clusters across multiple AWS regions, ensuring consistent security baselines.
  • Infrastructure as Code (IaC): Build and maintain reusable Terraform and Ansible modules for automated provisioning of cloud infrastructure services including networking services, compute, storage, queue and cache, etc.
  • Security Architecture: Implement "Policy as Code" guardrails and secure network perimeters (ESPs) in alignment with NERC CIP and IEC 62443 standards.
  • Operationalize Cloud Infrastructure: Standardize run books and operating processes required to run critical infrastructure with the highest reliability.

Day 1: Platform Readiness & Scaling

  • Resource Governance: Define and enforce Kubernetes resource quotas, limit ranges, and Pod Priority classes to ensure mission-critical services receive prioritized compute resources.
  • Connectivity & Ingress: Manage the ingress strategy and service mesh architecture to facilitate secure, performant connectivity between distributed microservices.
  • Acceptance Testing: Lead platform-level smoke testing, load testing, and disaster recovery exercises to validate that the infrastructure can meet 99.99% uptime targets.
  • Sizing & Optimization: Partner with application teams to right-size containerized workloads, optimizing for both performance and cloud cost (FinOps).

Day 2: Operational Excellence & Tier 3 Support

  • L3 Escalation: Act as the highest technical escalation point for complex Kubernetes internals, troubleshooting issues such as failed pods, memory leaks, and network partitions.
  • Incident Response: Lead root cause analysis (RCA) for platform-level outages, implementing systemic fixes to prevent recurring failures.
  • Toil Elimination: Proactively identify and automate repetitive operational tasks (such as cluster upgrades and OS patching) to ensure the team spends at least 50% of their time on engineering improvements.
  • Observability Integration: Institutionalize platform monitoring using Prometheus and Grafana, creating dashboards that surface the "Golden Signals" of cluster health.

Preferred Qualifications

  • Practical knowledge of NERC CIP, SOC2, ISO 27001, or IEC 62443 compliance standards in a SaaS context.
  • AWS Certified DevOps Engineer – Professional, CKA (Certified Kubernetes Administrator), or SRE Practitioner Certification.
  • Experience supporting mission-critical systems in energy, utilities, or other high-stakes industrial sectors.

Personal Attributes

  • High level of energy and enthusiasm with the ability to thrive in a rapidly changing environment.
  • Demonstrated customer focus: evaluates decisions through the eyes of the customer; builds strong customer relationships; creates processes with customer viewpoint; partners with customers.
  • Change oriented: actively generates process improvements; champions and drives change initiatives; confronts.
  • Ability to work with global teams, act independently and as part of a team.
  • Strong analytical and problem-solving skills: communicates in a clear and succinct manner and effectively evaluates information/data to make decisions; anticipates obstacles and develops plans to resolve.

Worldwide
Job Closed

Role Description

We are looking for a Systems and Infrastructure Engineer with experience in NOC and SOC environments, responsible for designing, implementing, monitoring, and maintaining the organization's technology infrastructure. This role is key to ensuring the availability, performance, and security of systems in both on-premise and cloud environments. You will collaborate with multidisciplinary teams to ensure operational continuity, proactive incident detection, and a stronger cybersecurity posture.

Responsibilities

  • Design, implement, and maintain IT infrastructure, including servers, storage, networks, and virtualization platforms (on-premise and cloud).
  • Operate in NOC/SOC environments, monitoring systems, networks, and security events to ensure service continuity and timely incident handling.
  • Configure and administer physical and virtual infrastructure components aligned with business requirements.
  • Monitor system performance, capacity, and availability, implementing improvements to guarantee high availability and reliability.
  • Carry out system administration tasks: installation, configuration, maintenance, and upgrades.
  • Automate infrastructure processes through scripting, automation tools, and Infrastructure as Code (IaC).
  • Manage backup and disaster recovery processes.
  • Implement security controls, access management, and encryption mechanisms to protect information.
  • Perform vulnerability assessments and security scans, and support incident response.
  • Administer and optimize cloud services (compute, storage, networking, and identity).
  • Monitor cloud consumption and costs, proposing optimization strategies.
  • Stay current with technology trends and propose continuous improvements.

Qualifications

  • Solid knowledge of server, network, and infrastructure administration.
  • Experience in NOC (Network Operations Center) and SOC (Security Operations Center) environments.
  • Knowledge of virtualization (VMware, Hyper-V, or similar).
  • Experience with cloud platforms (AWS, Azure, or GCP).
  • Proficiency in PowerShell, Bash, or other scripting languages.
  • Knowledge of automation and Infrastructure as Code (Terraform, Ansible, etc.).
  • Knowledge of cybersecurity: access controls, encryption, vulnerability management, and compliance.
  • Analytical and problem-solving ability.
  • Communication and teamwork skills.
  • Organization, attention to detail, and the ability to handle multiple tasks.

Requirements

  • Bachelor's degree in Systems, Information Technology, or a related field (desirable).
  • Demonstrable experience in systems administration, infrastructure engineering, or similar roles.
  • Experience implementing and supporting complex infrastructure.
  • Experience with automation and version control tools.

Benefits

  • Direct hire with the company.
  • 100% payroll scheme.
  • Statutory benefits.
  • Savings fund.
  • 30-day year-end bonus (aguinaldo).
  • Life insurance.
  • Major medical insurance.
  • Grocery vouchers.

Mexico
Full Time, Remote, Team 10,001+, Since 1915, H1B Sponsor

  • Operate as part of a team responsible for the 24x7 availability of Geisinger's network, cloud, and data center infrastructure
  • Monitor and maintain the IT infrastructure
  • Deploy and maintain the IT infrastructure
  • Inform appropriate personnel of new features, limitations, and considerations from upgrades or new products
  • Participate in the evaluation of new hardware products and upgrades
  • Assist with troubleshooting and diagnosing errors in equipment

Pennsylvania
Job Closed