Create studio-quality videos with AI avatars and voiceovers in 140+ languages. Trusted by Reuters, BBC, Amazon and more.

Senior Site Reliability Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 501-1,000 | Since 2017 | H1B No Sponsor

Location

United States

Posted

23 hours ago

Salary

Seniority

Senior

Bachelor Degree | English | AWS | Cloud | Kubernetes | MongoDB | Python

Job Description

• We're hiring a dedicated SRE to take real ownership of operational excellence across Cloud Infrastructure.
• Your mission is to take genuine ownership of those domains, make them resilient to any single person, and raise the bar on how reliably we run.
• You'll own domains end to end: understand them deeply, operate them well, and build the automation and tooling that make them boring.
• Take custody of the incident process: on-call quality, response, post-mortems, and driving down incident count, time-to-detect, and time-to-resolve.
• Automate low-frequency, high-consequence operations (the certificate-renewal class of problem), not just the high-frequency toil.
• Over time, deep ownership of a domain such as Temporal, observability, or Kubernetes operations, partnering with the engineers building in it.
• Own key external relationships and integrations (e.g. LLM API providers, third-party services), today managed manually and informally.
• Own cloud and platform cost visibility and efficiency, and the mechanics of how usage maps to billing.

Job Requirements

Strong production operations experience on AWS and Kubernetes; comfortable with MongoDB and scripting/automation in Python.
An operations-and-reliability mindset — you take pride in systems that run quietly — paired with the instinct to engineer the problem away rather than absorb it manually.
Sound judgement on incidents and risk; calm and clear under pressure.
Influences through relationships and evidence, not escalation; comfortable owning a domain and partnering across teams.
Bonus: vendor/cost management exposure, Temporal, observability tooling.

Benefits

Remote (US East Coast preferred, for timezone coverage)

More DevOps Engineer Jobs

Senior DevOps Engineer

Software Mind

Software House focused on results since 1999

DevOps Engineer | 1 day ago

Full Time | Remote | Team 1,001-5,000 | Since 1999 | H1B No Sponsor

Role Description

Software Mind is building a private, tenant-isolated AI assistant for the real estate title and settlement industry. The platform is a retrieval-first (RAG) system that ingests historical email, documents, and structured metadata into a per-tenant vector index, and serves grounded, cited, expert-weighted answers through a chat-style Q&A interface with single sign-on and full audit logging. The platform is AWS-native with a Python/FastAPI backend, Vue.js frontend, OpenSearch/Pinecone vector store, and OpenAI/Anthropic/Bedrock as LLM providers.

You will join a senior, cross-functional LATAM-based team where hands-on AI delivery experience, not just familiarity, is the baseline expectation. You stand up and own the cloud infrastructure and CI/CD foundation the entire project runs on. Your work is on the critical path from day one: delivery begins with environment provisioning. You design for tenant isolation, observability, and security from the outset, not as an afterthought. This role requires prior experience operating infrastructure for production AI or LLM-based workloads.

Responsibilities

- Provision and configure a dedicated VPC and segmented cloud environment on AWS
- Build the baseline CI/CD pipeline and maintain and evolve it across all delivery phases
- Configure and manage the vector store infrastructure (OpenSearch/Pinecone on AWS)
- Set up and manage the observability stack: CloudWatch, X-Ray, alerting thresholds, and LLM-specific monitoring
- Implement infrastructure-as-code for all environments (dev, staging, production) using Terraform or CDK
- Manage secrets, KMS encryption key configuration, and tenant-scoped access controls
- Configure LLM provider connectivity (OpenAI / Anthropic / Amazon Bedrock enterprise tier, zero-data-retention)
- Define and implement environment promotion strategy aligned with the 2-week sprint cadence
- Support incremental ingestion pipeline infrastructure requirements and nightly scheduling

Qualifications

- +90% English written and oral (at least B2 level) with excellent communication skills
- 6+ years in DevOps or cloud infrastructure engineering; strong AWS specialization required
- Infrastructure-as-code: Terraform, CloudFormation, or AWS CDK
- CI/CD tooling: GitHub Actions, AWS CodePipeline, or equivalent
- Core AWS services: VPC, ECS, Lambda, S3, DynamoDB, API Gateway, Cognito, CloudWatch, X-Ray
- Experience designing and operating multi-tenant cloud environments with tenant-level data isolation
- At least one project operating infrastructure for a production AI/ML or LLM-integrated system, not just general cloud workloads
- Experience configuring and managing vector store infrastructure (OpenSearch, Pinecone, Weaviate, or equivalent) in a production environment
- Familiarity with LLM provider APIs (OpenAI, Anthropic, or Amazon Bedrock) in a production/enterprise configuration, including zero-data-retention tier setup
- Understanding of AI-specific observability concerns: token usage monitoring, latency profiling for LLM calls, and model response logging

Additional Information

- Experience with enterprise SSO and identity federation: Cognito, Okta, or Azure AD
- Background in HIPAA, SOC 2, or regulated-data cloud environment configuration
- Familiarity with OCR or document processing service infrastructure (AWS Textract, etc.)
- We are accepting applications from LATAM countries

Latin America (LATAM)

DevOps Engineer

Traffic Label Limited

DevOps Engineer | 1 day ago

Full Time | Remote | Team 11-50 | Since 2006 | H1B No Sponsor

• Design, implement, and maintain scalable Kubernetes infrastructure on GKE/EKS
• Develop and manage Infrastructure as Code using Terraform, Helm, and Ansible
• Build and improve CI/CD pipelines for fast and reliable deployments
• Implement and maintain monitoring, logging, and alerting solutions
• Support PostgreSQL and Kafka environments
• Automate operational tasks using Python and Bash scripting
• Troubleshoot production issues across cloud and Kubernetes environments
• Collaborate with developers to improve deployment and operational processes
• Participate in on-call rotation and production support

Ansible | AWS | Cloud | Docker | Google Cloud Platform | Kafka | Kubernetes | PostgreSQL | Prometheus | Python | Terraform

Europe

Senior Network Deployment Engineer

CodiLime

A strategic partner for technology-driven companies | Network engineering | Software engineering

DevOps Engineer | 1 day ago

Full Time | Remote | Team 201-500 | Since 2011 | H1B No Sponsor

• Leading design, architecture, and optimization for networking infrastructure and devices in large-scale DCs and/or offices
• Overseeing entire site deployment cycles from initial requirements to operational handover
• Collaborating with cross-functional IT, security, and facility teams plus external vendors
• Enforcing and maintaining robust technical documentation and architectural standards
• Developing Python and/or Ansible scripts to drive network automation initiatives

Ansible | Linux | Python | Switching

Egypt

Senior Consultant – Site Reliability Engineering

Fabric Group

Good Problems. Unlocking value from business challenges

DevOps Engineer | 1 day ago

Contract | Remote | Team 51-200 | Since 2006 | H1B No Sponsor

• Consultative Ownership: Work with autonomy to own problems and deliver solutions, acting as a bridge between development and operations.
• Observability Architecture: Design and implement robust monitoring solutions using the LGTM stack to ensure system health and performance.
• Reliability Strategy: Advise clients on defining meaningful SLOs/SLIs and managing error budgets to balance innovation with stability.
• AI Assistance: Drive use of AI Agents or AI tools for intelligent automation and improving operational efficiency.
• Incident Leadership: Lead post-incident reviews (Blameless Post-Mortems) to identify systemic improvements and reduce future toil.
• Mentorship: Coach less experienced engineers within Fabric and our client teams on SRE principles and modern infrastructure patterns.
• Advising our clients on the right technical decisions and advocating for the right practices to use.
• Participate in interviewing and recruitment based on business needs.
• Thought Leadership: Contribute to the SRE community through blog posts, meetups, or internal knowledge sharing.
• Operational Support & Availability: Rotational Support Coverage: Participate in a sustainable team rotation to provide extended service coverage (including weekends) for business-critical systems.
• Incident Response: Act as a primary responder for high-priority (P1/P2) incidents during your rostered shift, focusing on rapid restoration and clear stakeholder communication.

AWS | Cloud | Google Cloud Platform | Grafana | Kubernetes | Python | Terraform | Go

India
