The Enterprise MLOps platform powering over 20% of the Fortune 100
Staff Site Reliability Engineer
Location
Argentina
Posted
19 days ago
Salary
0
Seniority
Lead
Job Description
Staff Site Reliability Engineer
Domino Data Lab
- Lead the development of Domino's internal AI-assisted reliability tooling, including systems that analyze tickets, logs, traces, and documentation to help teams resolve outages faster with less recurring toil
- Improve the observability coverage and signal quality for our most critical customer-facing systems, so engineers have more to work with throughout the development and support lifecycle
- Own incident response end-to-end, from detection to remediation, and leave each problem space better documented, better understood, and less likely to recur
- Guide the development of customer and user-facing observability tools within our products
- Define and mature SLO/SLI frameworks for priority services, turning abstract reliability goals into measurable, actionable standards
- Scale cloud operations practices for Domino’s single-tenant SaaS offering, and work with engineering teams to improve the reliability and repeatability of customer deployments and upgrades
- Mentor other engineers and shape how SRE is practiced at Domino, including incident response workflows, operational readiness expectations, and post-incident learning culture
Job Requirements
- Deep experience in Site Reliability Engineering, platform engineering, or a software engineering role with genuine, hands-on operational ownership
- Fluency with Kubernetes, Linux, cloud platforms, and observability tooling, and the ability to use them to investigate complex, real-world production problems
- A strong ability to perceive and close reliability gaps in technical products, tools and processes
- Strong software engineering skills in Python or Go, with a track record of building internal tools or services that people actually rely on
- Comfort leading technically ambiguous work and influencing direction across teams without needing direct authority to get things done
- A history of improving reliability through engineering and automation, not just putting out fires manually
- Strong communication skills and real experience mentoring engineers or shaping technical decision-making on your team
- Sound judgment about AI/LLM tooling: you know where it genuinely helps in operational workflows and where it adds noise instead of signal
- Bonus: Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or building tooling for support or developer teams
Benefits
- We strongly believe in the value of growing a diverse team and encourage people of all backgrounds, genders, ethnicities, abilities, and sexual orientations to apply
- We value a growth mindset: high-performing, creative individuals who dig into problems and see the opportunities for success.
- We believe in individuals who seek truth and speak the truth and can be their whole selves at work.
- We value everyone who believes that improving is always possible. At Domino, everything is a work in progress – we can do better at everything.
- We emphasize an environment of teaching and learning to equip employees with the tools needed to be successful in their function and the company.
More DevOps Engineer Jobs
Senior Site Reliability Engineer – B2B Contract
futureproof consulting
Data, analytics and cybersecurity staffing. We connect professionals and companies to deliver successful projects.
- Lead reliability initiatives across production platforms and services
- Define and manage SLOs, SLIs, SLAs, error budgets, and availability targets
- Design and implement scalable, resilient cloud-native architectures
- Automate infrastructure and deployments using IaC and CI/CD best practices
- Build and maintain monitoring, logging, tracing, and alerting solutions
- Drive incident management, troubleshooting, root cause analysis, and postmortems
- Improve operational maturity through automation, runbooks, and best practices
- Mentor engineers and support knowledge sharing across teams
- Collaborate with product, security, platform, and vendor teams globally
- Perform capacity planning, performance optimization, and reliability analysis
- Maintain technical documentation and compliance-related artifacts where required
Senior Site Reliability Engineer
Shippo
Founded in 2013, Shippo is a logistics and supply company that provides shipping services to retailers, ecommerce platforms, marketplaces, and more. Operating from its headquarters
Role Description
Shipping & handling responsibilities
- Design, scale, and secure infrastructure to stay ahead of business needs through:
  - Fault-tolerant architecture design
  - Performance testing, profiling, and tuning
  - Capacity planning
- Design, build, deploy, and maintain automation, monitoring, and alerting systems, as well as:
  - Design, implement, and test disaster recovery solutions
- Ensure scalability and maintainability through:
  - Microservices adoption
  - Decoupling of concerns and data model
  - Queuing of jobs and application layering
- Enhance and maintain our CI/CD pipeline for smooth and safe production releases via automated testing and verification
- Verify and ensure performance and correctness of systems in response time and throughput
- Participate in peer reviews and testing and contribute to automated test suites and in design reviews for new features, products, and systems
- Participate in an on-call rotation
Qualifications
- Experience developing, managing, and troubleshooting highly available distributed systems, including operational experience with Kubernetes in a production environment
- Extensive expertise with at least one public cloud provider (AWS, GCP, Azure)
- Exceptional verbal, written, and interpersonal communication skills
- Interest in and understanding of best-in-class security practices, and automation and testing methods
- Familiarity with configuration and maintenance of common infrastructure components such as Redis, Elasticsearch, and Hadoop
- Deep understanding of customer needs and passion for customer success
- BS or MS degree in Computer Science or equivalent experience
Requirements
- Advanced knowledge of managing and optimizing PostgreSQL server configuration
- 3+ years of experience in software development
- Experience with:
  - Managing service meshes (e.g. Istio)
  - Defining and monitoring Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs) to ensure that systems meet reliability and performance targets
  - Monitoring Tools like New Relic, Prometheus, Grafana, and/or Datadog
  - OpenTelemetry knowledge for distributed tracing and metrics collection and experience on using it in production environments
  - Managing Python and Golang applications in production
  - Microservices architectures
  - DevOps tooling such as Docker, Terraform, ArgoCD, ArgoWorkflows, CircleCI, Github Actions, New Relic, PagerDuty, etc.
  - AWS/Cloud services such as EKS, EC2, S3, Lambda, Route 53, CloudFront, Cloudflare, IAM, etc.
Benefits
- Here at Shippo, we celebrate inclusivity and are committed to creating equal access to opportunities for people from all backgrounds, perspectives, and geographies. These values define who we are and everything we do.
- All qualified individuals are encouraged to apply. If you need assistance, or a reasonable accommodation during the application and recruiting process, please contact us at accommodations@goshippo.com
Company Description
- Our people, much like the packages we help ship, are all over the world.
- Through our remote-first program, “Shippos Everywhere”, our roles can be based anywhere in the US with the exception of Delaware, Nevada, Ohio, Oregon, Hawaii, New Mexico, and West Virginia.
- Many roles can be based internationally.
- For locations outside of the US and Ireland, the employment contracts are powered by Rippling.com.
- Build, deploy safely and incrementally, and operate critical production systems with a focus on scalability, reliability, observability, performance, and security.
- Monitor, support, and enhance developer experience across services.
- Build automation to remove toil and efficiently operate production systems.
- Proactively monitor, respond to, and enhance alerts, and set up automated alert handling.
- Create and maintain the incident response runbooks.
- Triage platform and infrastructure issues and help Arista software engineers with their triage.
- Engage with third-party vendor support.
- Write postmortem documents and build solutions to prevent incidents from repeating.
- Plan and communicate maintenance windows on production systems.
- Work with Arista’s product development teams to identify infrastructure issues that are causing bottlenecks and limitations in their workflows.
- Design and implement solutions to resolve them.
- Survey and adopt best practices around infrastructure and platform to maintain secure, scalable, and fault-tolerant systems.
- Study the design and enough of the implementation details of OSS systems to triage and resolve issues more effectively.
Role Description
- You advise our clients as a peer on cloud architectures, DevOps strategies, and platform concepts, with a focus on sustainable, viable solutions rather than short-term workarounds.
- You design and guide the build-out of cloud service management platforms and integrate them into existing operating and governance models.
- You design target architectures for CI/CD, container, and infrastructure landscapes and translate business requirements into viable technical concepts.
- You advise on the introduction of DevOps practices, platform engineering approaches, and self-service models, from strategy through to operationalization.
- You develop architecture decision papers, reference architectures, and governance concepts for monitoring, logging, and security in multicloud environments.
- You facilitate architecture workshops, present at decision-maker level, and win over both IT leaders and technical teams with your concepts.
- You take on subject-matter leadership in complex consulting projects and contribute to the strategic development of our cloud and DevOps consulting portfolio.
Qualifications
- You hold a completed degree (e.g. in computer science or business informatics) or a comparable qualification with relevant professional experience.
- Several years of consulting experience as a cloud or DevOps architect, ideally in the context of regulated industries.
- Deep conceptual understanding of cloud service management and brokerage platforms, plus experience integrating them into complex IT landscapes.
- Solid architectural knowledge of Kubernetes (OpenShift, Rancher), Infrastructure-as-Code (Terraform, Ansible), GitLab CI, the major hyperscalers (AWS, Azure, GCP), and sovereign/private cloud (OpenStack).
- Experience designing security architectures and identity concepts based on modern authentication methods (OAuth, OpenID Connect, SAML).
- Methodological confidence in architecture work, for example following arc42, TOGAF, or comparable frameworks, and experience producing architecture documentation and decision papers.
- Programming understanding in languages such as Python, Go, or Java, sufficient to hold technical discussions as a peer and to put concepts in context.
- You communicate and present confidently in German (C2) and English (at least B2).
- Your way of working is characterized by a hands-on mentality, a high degree of self-motivation, and a service-oriented mindset.
- You are a team player, document clearly and comprehensibly, and actively share your knowledge.
- You actively follow trends and drivers in cloud, DevOps, and platform engineering and bring new ideas into our consulting work.
- You may already hold the following certifications, or you will earn them with us: AWS Certified SysOps Administrator, AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), PRINCE2 Agile Foundation, ITIL Practitioner.
Benefits
- We are a remote-first company. Really. We also make working from abroad possible for you.
- If you do come into the office, look forward to a modern office with all the amenities, such as free drinks, height-adjustable desks, and a great view of Lohsepark. This is where we meet for regular gatherings and events, as a team or with the whole company.
- You have a 40-hour week with us; overtime is recorded and either taken as time off in lieu or paid out.
- Your professional development is essential to us, so we naturally cover the cost of your training.
- Like doing sports? So do we, which is why we give you an Urban Sportsclub membership.
- Use the Corporate Benefits platform for your personal needs.
- We are proud of having very low staff turnover on our team. It matters to us not only to find great characters but also to keep and develop them.




