Senior Site Reliability Engineer
Location
Texas
Posted
71 days ago
Salary
0
Seniority
Senior
Job Description
Senior Site Reliability Engineer
Satsuma Technology Ltd
- Own infrastructure across AWS, GCP, and Azure environments
- Build and maintain CI/CD pipelines, observability stacks, and incident response workflows
- Define and enforce SLOs/SLIs; lead postmortems
- Author and maintain IaC (Terraform preferred)
- Write internal tooling and automation using AI-assisted development workflows
- Partner closely with engineering on reliability reviews and architecture decisions
Job Requirements
- 5-8 years in SRE, DevOps, or infrastructure engineering
- Hands-on experience across at least two major cloud providers
- Strong Kubernetes, Terraform, and observability tooling (Datadog, Grafana, or equivalent)
- Comfortable reading and editing code; able to ship scripts and internal tools
- Experience with AI-assisted development (Copilot, Cursor, Claude Code)
- On-call maturity -- you've owned incidents end-to-end and made systems better afterward
- Prior experience at a startup or high-growth SaaS company
- Familiarity with API gateway infrastructure or commerce tech stacks
- Hands-on experience with MCP or agentic AI infrastructure
Benefits
- Unlimited PTO
- 401(K)
- Healthcare Stipend
- Gym stipend
More DevOps Engineer Jobs
Attentive® is the AI marketing platform for 1:1 personalization redefining the way brands and people connect. We’re the only marketing platform that combines powerful technology with human expertise to build authentic customer relationships. By unifying SMS, RCS, email, and push notifications, our AI-powered personalization engine delivers bespoke experiences that drive performance, revenue, and loyalty through real-time behavioral insights. Recognized as the #1 provider in SMS Marketing by G2, Attentive partners with more than 8,000 customers across 70+ industries. Leading global brands like Crate and Barrel, Urban Outfitters, and Carter’s work with us to enable billions of interactions that power tens of billions in revenue for our customers. With a distributed global workforce and employee hubs in New York City, San Francisco, London, and Sydney, Attentive’s team has been consistently recognized for its performance and culture. We’re proud to be included in Deloitte’s Fast 500 (four years running!), LinkedIn’s Top Startups, Forbes’ Cloud 100 (five years running!), Inc.’s Best Workplaces, and the Human Rights Campaign Foundation's Corporate Equality Index! 
About the Role
What You’ll Accomplish
- Design and deliver high-impact solutions: Design and implement systems that enhance reliability, observability, traceability, and incident management, ensuring the platform scales effectively
- Lead execution on key projects: Take ownership of projects, driving them from discovery through execution
- Partner across teams: Collaborate with engineers from AI/ML, Data, Platform, and Product teams to develop best-in-class platforms and services
- Establish standards and best practices: Define and enforce production standards, processes, and tools to ensure operational excellence
- Champion reliability goals: Advocate for and implement SLIs, SLOs, and other reliability-focused metrics across the engineering organization
- Mentor and knowledge share: Guide and mentor junior team members, fostering technical growth and helping to develop the next generation of engineering leaders
- Innovate and inspire: Drive continuous improvement by bringing creative ideas and challenging the status quo
Your Expertise
- 5+ years of experience in Production Engineering, SRE, Platform Engineering, DevOps, Backend Engineering, or similar roles
- Strong coding ability in at least one language (e.g., Golang, Python, Java, TypeScript) with the capability to solve complex issues through code
- Experience with cloud-native technologies and Infrastructure-as-Code (e.g., Kubernetes, Terraform, AWS)
- Demonstrated experience delivering medium to large-scale projects that drive meaningful improvements in platform reliability and scalability
- Deep understanding of production reliability concepts, including SLIs, SLOs, and incident management
- Proficient in designing and maintaining CI/CD pipelines, deployment strategies, and release automation to enable fast, safe delivery
- Fluency in frontier AI-assisted development tools and agents (Claude Code, Codex, Cursor, or similar)
- Excellent verbal and written communication skills with the ability to collaborate across technical and non-technical teams
- Familiarity with working in dynamic, reliability-focused production environments (preferred)
What We Use
- Our services run primarily in Kubernetes, hosted on AWS EKS
- Our tooling includes Terraform, Helm, ArgoCD, Istio, Cloudflare, Datadog, and Incident.io
- Our backend is primarily Java / Spring Boot microservices, built with Gradle, coupled with things like DynamoDB, Kinesis, Airflow, Postgres, and Redis
- Our frontend is built with React and TypeScript, and uses best practices like GraphQL, Storybook, Radix UI, Vite, esbuild, and Playwright
- Our automation is driven by custom and open source machine learning models and lots of data, and is built with Python, Metaflow, HuggingFace 🤗, PyTorch, TensorFlow, and Pandas
You'll get competitive perks and benefits, from health & wellness to equity, to help you bring your best self to work.
For US based applicants:
- The US base salary range for this full-time position is $220,000 - $275,000 annually + equity + benefits
- Our salary ranges are determined by role, level and location
By applying for this position, your data will be processed as per Attentive's Privacy Policy.
Attentive Company Values
- Default to Action: Move swiftly and with purpose
- Be One Unstoppable Team: Rally as each other’s champions
- Champion the Customer: Our success is defined by our customers' success
- Act Like an Owner: Take responsibility for Attentive’s success
Learn more about AWAKE, Attentive’s collective of employee resource groups.
If you do not meet all the requirements listed here, we still encourage you to apply! No job description is perfect, and we may also have another opportunity that closely matches your skills and experience.
At Attentive, we know that our Company's strength lies in the diversity of our employees. Attentive is an Equal Opportunity Employer and we welcome applicants from all backgrounds. Our policy is to provide equal employment opportunities for all employees, applicants and covered individuals regardless of protected characteristics. We prioritize and maintain a fair, inclusive and equitable workplace free from discrimination, harassment, and retaliation.
Attentive is also committed to providing reasonable accommodations for candidates with disabilities. If you need any assistance or reasonable accommodations, please let your recruiter know.
- Build and operate metrics/monitoring platforms: **Prometheus and/or VictoriaMetrics** (scrape configs, exporters, recording rules)
- Design and maintain alerting strategy: thresholds, anomaly detection where applicable, alert routing, deduplication, and noise reduction
- Integrate monitoring/alerting and events with **BigPanda** (correlation, enrichment, routing, incident workflows)
- Create and maintain dashboards and operational visibility (Grafana or equivalent)
- Develop and maintain runbooks, operational playbooks, and incident response procedures
- Participate in **on-call shifts**: triage alerts, manage incidents, coordinate response, and lead communication during outages
- Perform root-cause analysis, postmortems, and implement corrective/preventive actions
- Improve service reliability via SLOs/SLIs, capacity planning, and automation to reduce toil
- Support monitoring for core infrastructure and services on **Windows and Linux**, including HA components and clusters
- Collaborate with DevOps/Engineering to instrument applications and standardize telemetry (metrics, logs, traces where applicable)
DevOps Engineer
Happy Returns LLC
Happy Returns is committed to providing a workplace free of discrimination, harassment, and retaliation. Happy Returns is an equal opportunity employer. Happy Returns does not discriminate on the basis of race/color/religion/sex/national origin/veteran/disability/age/sexual orientation/gender identity or any other characteristic protected by law.
Role Description
We’re not solving a small problem, and we’re not addressing a small market. We’re tackling returns, the part of the online shopping experience shoppers say they hate most. Our customers, i.e., top online retailers, use Happy Returns’ returns software and reverse logistics to offer shoppers a genuinely delightful return experience while at the same time reducing costs, retaining sales, and making their supply chains more sustainable. We’re making returns legitimately better for everyone, and we’re having fun doing it.
- Collaborate on and deploy cloud infrastructure on AWS using infrastructure-as-code (Terraform, Pulumi) that is secure, scalable, and highly available.
- Actively collaborate with software engineering to define infrastructure, build, release and deployment tooling.
- Collaborate with information security for requirements and integration with security and compliance tooling.
- Troubleshoot problems across a wide array of services and function areas.
- Build and maintain operational tools for deployment, monitoring, and analysis of AWS infrastructure and systems to ensure availability, performance, security, and scalability.
- Provide recommendations for architecture and process improvements.
- Help us build and maintain a world class technology system so we can achieve our mission of making returns beautiful for Shoppers, Retailers, and the Planet.
Qualifications
- At least 1 year of experience in designing, provisioning and maintaining infrastructure using Infrastructure as Code.
- Experience in code development in at least one high-level programming language.
- Knowledge of Kubernetes and containerized applications.
- Proficiency working with a cloud provider (preferably AWS).
- Proficiency with Docker, Git and software development processes for deploying applications.
- Understanding of/familiarity with using GitHub Actions (or similar) to build, test and release software.
- Familiarity with database technology such as PostgreSQL or MySQL.
- Previous experience using cloud observability platforms such as Datadog or New Relic.
- A strong interest in joining a highly collaborative environment and working daily with Product Managers and Engineers.
- Interest in e-commerce, logistics and/or sustainability.
- Comfort with leveraging the latest available AI-powered tools (IDEs, agents, automated reviewers) to accelerate and assist with daily development, debugging, and technical tasks.
Requirements
- This is a fully remote position and is not limited to candidates located in Georgia.
EEO Statement
Integrated into our shared values is Happy Returns’ commitment to diversity. Happy Returns is committed to being a globally inclusive company where all people are treated fairly, recognized for their individuality, promoted based on performance and encouraged to strive to reach their full potential. We believe in understanding and respecting differences among all people. This concept encompasses but is not limited to human differences with regard to race, ethnicity, religion, gender, culture and physical ability. Every individual at Happy Returns has an ongoing responsibility to respect and support a globally diverse environment.
Visa Sponsorship
We do not offer visa sponsorship or transfer for this role.
Statement to Third Party Agencies
To all recruitment agencies: Happy Returns only accepts resumes from agencies on the UPS preferred supplier list. Happy Returns is not responsible for any fees or charges associated with unsolicited resumes.
- Define a unified vision for observability across all platforms, with golden signals as the foundation for monitoring and alerting
- Develop and maintain a comprehensive roadmap to improve observability, reduce tool redundancy, and standardize practices across platforms
- Establish and track key performance indicators (KPIs) to measure progress and ensure accountability for roadmap milestones
- Partner with the ZEIT SRE team and engineering leads to break down silos and promote consistent observability practices
- Standardize the implementation of golden signals across applications to improve system reliability and incident detection
- Identify and address gaps in existing observability practices, prioritizing long-term scalability and reliability
- Measure and report on observability success metrics, including actionable alert volume and reduced issue escalations



