Synthesia

Create studio-quality videos with AI avatars and voiceovers in 140+ languages. Trusted by Reuters, BBC, Amazon and more.

Senior Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 501-1,000 | Since 2017 | H1B No Sponsor

Location

United States

Posted

23 hours ago

Salary

0

Seniority

Senior

Bachelor Degree | English | AWS | Cloud | Kubernetes | MongoDB | Python

Job Description

  • We're hiring a dedicated SRE to take real ownership of operational excellence across Cloud Infrastructure.
  • Your mission is to take genuine ownership of those domains, make them resilient to any single person, and raise the bar on how reliably we run.
  • You'll own domains end to end: understand them deeply, operate them well, and build the automation and tooling that make them boring.
  • Take custody of the incident process: on-call quality, response, post-mortems, and driving down incident count, time-to-detect, and time-to-resolve.
  • Automate low-frequency, high-consequence operations (the certificate-renewal class of problem), not just the high-frequency toil.
  • Over time, take on deep ownership of a domain such as Temporal, observability, or Kubernetes operations, partnering with the engineers building in it.
  • Own key external relationships and integrations (e.g. LLM API providers, third-party services), which today are managed manually and informally.
  • Own cloud and platform cost visibility and efficiency, and the mechanics of how usage maps to billing.

Job Requirements

  • Strong production operations experience on AWS and Kubernetes; comfortable with MongoDB and scripting/automation in Python.
  • An operations-and-reliability mindset — you take pride in systems that run quietly — paired with the instinct to engineer the problem away rather than absorb it manually.
  • Sound judgement on incidents and risk; calm and clear under pressure.
  • Influences through relationships and evidence, not escalation; comfortable owning a domain and partnering across teams.
  • Bonus: vendor/cost management exposure, Temporal, observability tooling.

Benefits

  • Remote (US East Coast preferred, for timezone coverage)

More DevOps Engineer Jobs

Senior DevOps Engineer

Software Mind

Software House focused on results since 1999

Full Time | Remote | Team 1,001-5,000 | Since 1999 | H1B No Sponsor

Role Description

Software Mind is building a private, tenant-isolated AI assistant for the real estate title and settlement industry. The platform is a retrieval-first (RAG) system that ingests historical email, documents, and structured metadata into a per-tenant vector index, and serves grounded, cited, expert-weighted answers through a chat-style Q&A interface with single sign-on and full audit logging. The platform is AWS-native with a Python/FastAPI backend, Vue.js frontend, OpenSearch/Pinecone vector store, and OpenAI/Anthropic/Bedrock as LLM providers.

You will join a senior, cross-functional LATAM-based team where hands-on AI delivery experience, not just familiarity, is the baseline expectation. You stand up and own the cloud infrastructure and CI/CD foundation the entire project runs on. Your work is on the critical path from day one: delivery begins with environment provisioning. You design for tenant isolation, observability, and security from the outset, not as an afterthought. This role requires prior experience operating infrastructure for production AI or LLM-based workloads.

Responsibilities

  • Provision and configure a dedicated VPC and segmented cloud environment on AWS
  • Build the baseline CI/CD pipeline and maintain and evolve it across all delivery phases
  • Configure and manage the vector store infrastructure (OpenSearch/Pinecone on AWS)
  • Set up and manage the observability stack: CloudWatch, X-Ray, alerting thresholds, and LLM-specific monitoring
  • Implement infrastructure-as-code for all environments (dev, staging, production) using Terraform or CDK
  • Manage secrets, KMS encryption key configuration, and tenant-scoped access controls
  • Configure LLM provider connectivity (OpenAI / Anthropic / Amazon Bedrock enterprise tier, zero-data-retention)
  • Define and implement environment promotion strategy aligned with the 2-week sprint cadence
  • Support incremental ingestion pipeline infrastructure requirements and nightly scheduling

Qualifications

  • +90% English written and oral (at least B2 level) with excellent communication skills
  • 6+ years in DevOps or cloud infrastructure engineering; strong AWS specialization required
  • Infrastructure-as-code: Terraform, CloudFormation, or AWS CDK
  • CI/CD tooling: GitHub Actions, AWS CodePipeline, or equivalent
  • Core AWS services: VPC, ECS, Lambda, S3, DynamoDB, API Gateway, Cognito, CloudWatch, X-Ray
  • Experience designing and operating multi-tenant cloud environments with tenant-level data isolation
  • At least one project operating infrastructure for a production AI/ML or LLM-integrated system, not just general cloud workloads
  • Experience configuring and managing vector store infrastructure (OpenSearch, Pinecone, Weaviate, or equivalent) in a production environment
  • Familiarity with LLM provider APIs (OpenAI, Anthropic, or Amazon Bedrock) in a production/enterprise configuration, including zero-data-retention tier setup
  • Understanding of AI-specific observability concerns: token usage monitoring, latency profiling for LLM calls, and model response logging

Additional Information

  • Experience with enterprise SSO and identity federation: Cognito, Okta, or Azure AD
  • Background in HIPAA, SOC 2, or regulated-data cloud environment configuration
  • Familiarity with OCR or document processing service infrastructure (AWS Textract, etc.)
  • We are accepting applications from LATAM countries

Latin America (LATAM)
Full Time | Remote | Team 11-50 | Since 2006 | H1B No Sponsor

  • Design, implement, and maintain scalable Kubernetes infrastructure on GKE/EKS
  • Develop and manage Infrastructure as Code using Terraform, Helm, and Ansible
  • Build and improve CI/CD pipelines for fast and reliable deployments
  • Implement and maintain monitoring, logging, and alerting solutions
  • Support PostgreSQL and Kafka environments
  • Automate operational tasks using Python and Bash scripting
  • Troubleshoot production issues across cloud and Kubernetes environments
  • Collaborate with developers to improve deployment and operational processes
  • Participate in on-call rotation and production support

Europe

Senior Network Deployment Engineer

CodiLime

A strategic partner for technology-driven companies | Network engineering | Software engineering

Full Time | Remote | Team 201-500 | Since 2011 | H1B No Sponsor

  • Leading design, architecture, and optimization for networking infrastructure and devices in large-scale DCs and/or offices
  • Overseeing entire site deployment cycles from initial requirements to operational handover
  • Collaborating with cross-functional IT, security, and facility teams plus external vendors
  • Enforcing and maintaining robust technical documentation and architectural standards
  • Developing Python and/or Ansible scripts to drive network automation initiatives

Egypt

Senior Consultant – Site Reliability Engineering

Fabric Group

Good Problems. Unlocking value from business challenges

Contract | Remote | Team 51-200 | Since 2006 | H1B No Sponsor

  • Consultative Ownership: Work with autonomy to own problems and deliver solutions, acting as a bridge between development and operations.
  • Observability Architecture: Design and implement robust monitoring solutions using the LGTM stack to ensure system health and performance.
  • Reliability Strategy: Advise clients on defining meaningful SLOs/SLIs and managing error budgets to balance innovation with stability.
  • AI Assistance: Drive use of AI Agents or AI tools for intelligent automation and improving operational efficiency.
  • Incident Leadership: Lead post-incident reviews (Blameless Post-Mortems) to identify systemic improvements and reduce future toil.
  • Mentorship: Coach less experienced engineers within Fabric and our client teams on SRE principles and modern infrastructure patterns.
  • Advising our clients on the right technical decisions and advocating for the right practices to use.
  • Participate in interviewing and recruitment based on business needs.
  • Thought Leadership: Contribute to the SRE community through blog posts, meetups, or internal knowledge sharing.

Operational Support & Availability

  • Rotational Support Coverage: Participate in a sustainable team rotation to provide extended service coverage (including weekends) for business-critical systems.
  • Incident Response: Act as a primary responder for high-priority (P1/P2) incidents during your rostered shift, focusing on rapid restoration and clear stakeholder communication.

India