Senior SRE, Ads
Location
Ireland
Posted
1 day ago
Seniority
Senior
Job Description
Reddit, Inc.
• Partner with Ads Engineering teams to improve reliability, scalability, and operational excellence of ad-serving, auction, targeting, measurement, and billing systems.
• Design, build, and maintain infrastructure, tooling, and automation that improve service reliability and engineering productivity.
• Improve observability through monitoring, alerting, tracing, logging, and dashboards.
• Participate in on-call rotations and lead incident response efforts for critical production systems.
• Run root cause analysis and drive corrective actions following incidents.
• Collaborate with software engineers throughout the service lifecycle, from design reviews through production operations.
• Drive adoption of SRE best practices including SLIs, SLOs, error budgets, capacity planning, and operational readiness reviews.
• Reduce operational toil through automation and self-service tooling.
• Help define and measure advertiser-critical user journeys such as campaign creation, ad delivery, reporting, and billing.
• Scale Ads systems to support continued traffic growth, increased advertiser demand, and evolving business requirements.
Job Requirements
- 5+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or related roles operating large-scale distributed systems.
- Strong experience supporting high-traffic, user-facing production environments.
- Good understanding of distributed systems, networking, Linux systems, and cloud-native architectures.
- Good programming skills in languages such as Go, Python, or similar.
- Demonstrated ability to troubleshoot complex issues across applications, infrastructure, networking, and services.
- Experience with observability platforms, monitoring systems, alerting, and incident response.
- Experience driving automation and operational improvements.
Benefits
- Global Benefit programs that fit your lifestyle, from workspace to professional development to caregiving support
- Family Planning Support
- Gender-Affirming Care
- Mental Health & Coaching Benefits
- Private Medical, Dental, and Vision Benefits
- Personal Retirement Savings Account with matching contribution
- Cycle to Work and Tax Saver schemes
- Flexible Vacation & Paid Volunteer Time Off
- Generous Paid Parental Leave
More DevOps Engineer Jobs
Senior Site Reliability Engineer
Synthesia
Create studio-quality videos with AI avatars and voiceovers in 140+ languages. Trusted by Reuters, BBC, Amazon and more.
• We're hiring a dedicated SRE to take real ownership of operational excellence across Cloud Infrastructure.
• Your mission is to take genuine ownership of those domains, make them resilient to any single person, and raise the bar on how reliably we run.
• You'll own domains end to end: understand them deeply, operate them well, and build the automation and tooling that make them boring.
• Take custody of the incident process: on-call quality, response, post-mortems, and driving down incident count, time-to-detect, and time-to-resolve.
• Automate low-frequency, high-consequence operations (the certificate-renewal class of problem), not just the high-frequency toil.
• Over time, take deep ownership of a domain such as Temporal, observability, or Kubernetes operations, partnering with the engineers building in it.
• Own key external relationships and integrations (e.g. LLM API providers, third-party services), today managed manually and informally.
• Own cloud and platform cost visibility and efficiency, and the mechanics of how usage maps to billing.
Role Description
Software Mind is building a private, tenant-isolated AI assistant for the real estate title and settlement industry. The platform is a retrieval-first (RAG) system that ingests historical email, documents, and structured metadata into a per-tenant vector index, and serves grounded, cited, expert-weighted answers through a chat-style Q&A interface with single sign-on and full audit logging. The platform is AWS-native with a Python/FastAPI backend, Vue.js frontend, OpenSearch/Pinecone vector store, and OpenAI/Anthropic/Bedrock as LLM provider.
You will join a senior, cross-functional LATAM-based team where hands-on AI delivery experience, not just familiarity, is the baseline expectation. You stand up and own the cloud infrastructure and CI/CD foundation the entire project runs on. Your work is on the critical path from day one: delivery begins with environment provisioning. You design for tenant isolation, observability, and security from the outset, not as an afterthought. This role requires prior experience operating infrastructure for production AI or LLM-based workloads.
Responsibilities
- Provision and configure a dedicated VPC and segmented cloud environment on AWS
- Build the baseline CI/CD pipeline, then maintain and evolve it across all delivery phases
- Configure and manage the vector store infrastructure (OpenSearch/Pinecone on AWS)
- Set up and manage the observability stack: CloudWatch, X-Ray, alerting thresholds, and LLM-specific monitoring
- Implement infrastructure-as-code for all environments (dev, staging, production) using Terraform or CDK
- Manage secrets, KMS encryption key configuration, and tenant-scoped access controls
- Configure LLM provider connectivity (OpenAI / Anthropic / Amazon Bedrock enterprise tier, zero-data-retention)
- Define and implement an environment promotion strategy aligned with the 2-week sprint cadence
- Support incremental ingestion pipeline infrastructure requirements and nightly scheduling
Qualifications
- 90%+ English, written and oral (at least B2 level), with excellent communication skills
- 6+ years in DevOps or cloud infrastructure engineering; strong AWS specialization required
- Infrastructure-as-code: Terraform, CloudFormation, or AWS CDK
- CI/CD tooling: GitHub Actions, AWS CodePipeline, or equivalent
- Core AWS services: VPC, ECS, Lambda, S3, DynamoDB, API Gateway, Cognito, CloudWatch, X-Ray
- Experience designing and operating multi-tenant cloud environments with tenant-level data isolation
- At least one project operating infrastructure for a production AI/ML or LLM-integrated system, not just general cloud workloads
- Experience configuring and managing vector store infrastructure (OpenSearch, Pinecone, Weaviate, or equivalent) in a production environment
- Familiarity with LLM provider APIs (OpenAI, Anthropic, or Amazon Bedrock) in a production/enterprise configuration, including zero-data-retention tier setup
- Understanding of AI-specific observability concerns: token usage monitoring, latency profiling for LLM calls, and model response logging
Additional Information
- Experience with enterprise SSO and identity federation: Cognito, Okta, or Azure AD
- Background in HIPAA, SOC 2, or regulated-data cloud environment configuration
- Familiarity with OCR or document processing service infrastructure (AWS Textract, etc.)
- We are accepting applications from LATAM countries
• Design, implement, and maintain scalable Kubernetes infrastructure on GKE/EKS
• Develop and manage Infrastructure as Code using Terraform, Helm, and Ansible
• Build and improve CI/CD pipelines for fast and reliable deployments
• Implement and maintain monitoring, logging, and alerting solutions
• Support PostgreSQL and Kafka environments
• Automate operational tasks using Python and Bash scripting
• Troubleshoot production issues across cloud and Kubernetes environments
• Collaborate with developers to improve deployment and operational processes
• Participate in on-call rotation and production support
Senior Network Deployment Engineer
CodiLime
A strategic partner for technology-driven companies | Network engineering | Software engineering
• Leading design, architecture, and optimization for networking infrastructure and devices in large-scale DCs and/or offices
• Overseeing entire site deployment cycles from initial requirements to operational handover
• Collaborating with cross-functional IT, security, and facility teams plus external vendors
• Enforcing and maintaining robust technical documentation and architectural standards
• Developing Python and/or Ansible scripts to drive network automation initiatives




