Spend is the fuel to help your company deliver performance, profitability, and purpose!
Sr. Site Reliability Engineer - 11293
Location
Mexico
Posted
74 days ago
Salary
0
Seniority
Senior
Job Description
Coupa Software
Coupa makes margins multiply through its community-generated AI and industry-leading total spend management platform for businesses large and small. Coupa AI is informed by trillions of dollars of direct and indirect spend data across a global network of 10M+ buyers and suppliers. We empower you with the ability to predict, prescribe, and automate smarter, more profitable business decisions to improve operating margins.

Why join Coupa?
🔹 Pioneering Technology: At Coupa, we're at the forefront of innovation, leveraging the latest technology to empower our customers with greater efficiency and visibility in their spend.
🔹 Collaborative Culture: We value collaboration and teamwork, and our culture is driven by transparency, openness, and a shared commitment to excellence.
🔹 Global Impact: Join a company where your work has a global, measurable impact on our clients, the business, and each other.

Learn more on the Life at Coupa blog and hear from our employees about their experiences working at Coupa.

The Impact of a Sr. Site Reliability Engineer at Coupa:
As a Senior Site Reliability Engineer, you will play a crucial role in the development of solutions for our Contract platform.

Coupa Contract (Standard) enables customers to author, approve, and operationalize contracts, making them easily available for purchasing by employees across the organization. Contract compliance delivers savings as employees make purchases using negotiated rates and helps to mitigate risk by ensuring that appropriate terms are in place. Contract enforcement and spend visibility are provided through embedded dashboards at both the contract and summary level.

Coupa Contract Advanced is an enterprise-class contract management solution to help companies improve contract visibility, risk management, and operational efficiency at scale. Contract Advanced is designed to handle the creation, storage, and optimization of any contract across any industry or department.

At a business level, together with the product management and development teams, you will change the way our customers handle the contract lifecycle management ecosystem and build best-in-class hosting infrastructure in the cloud. At a technical level, we will jointly scale our Business Spend Management platform on public cloud by following site reliability engineering (SRE) best practices.

What You'll Do:
• Administer Linux machines, web servers, application servers, and databases.
• Provide application and cloud infrastructure support for customer environments.
• Provide application support for Java and Ruby applications.
• Own end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence.
• Develop tools and automation to increase availability and performance.
• Ensure that data, services, and infrastructure are reliable, fault-tolerant, efficiently scalable, and cost-effective.
• Collaborate with Product and Release Engineering on new product releases and maintenance.
• Coordinate incident, problem, and change management.
• Participate in the on-call rotation for after-hours and weekend emergencies.

What You Will Bring to Coupa:
- Bachelor's degree with 8+ years of professional experience handling large-scale production systems.
- Experience with AWS or comparable cloud providers, with certification.
- Experience designing new services on AWS or a comparable cloud provider, migrating services to the cloud, and deploying new services on AWS or a comparable cloud provider.
- Hands-on experience with Terraform and configuration management tools like Chef, Ansible, or equivalent.
- Experience in application support/development on Java or Ruby.
- Hands-on scripting experience with either Python or Bash.
- Excellent knowledge of large-scale web applications/distributed systems.
- Experience in Kubernetes, Docker, and/or cloud deployment technologies.
- Experience with observability tools like New Relic, Datadog, etc.
- Expertise in problem solving and analyzing global-scale distributed systems.
- Excellent written and verbal communication skills.
- Critical thinking, continuously challenging how and why we do things to help us improve.

#LI-REMOTE #LI-AA2

Coupa complies with relevant laws and regulations regarding equal opportunity and offers a welcoming and inclusive work environment. Decisions related to hiring, compensation, training, or evaluating performance are made fairly, and we provide equal employment opportunities to all qualified candidates and employees.

Please be advised that inquiries or resumes from recruiters will not be accepted.

By submitting your application, you acknowledge that you have read Coupa’s Privacy Policy and understand that Coupa receives/collects your application, including your personal data, for the purposes of managing Coupa's ongoing recruitment and placement activities, including for employment purposes in the event of a successful application and for notification of future job opportunities if you did not succeed the first time. You will find more details about how your application is processed, the purposes of processing, and how long we retain your application in our Privacy Policy.
More DevOps Engineer Jobs
• You will be the first person who owns this area entirely.
• Your mission is to make MAIA's platform reliable, secure, auditable, and developer-friendly, at a stage where every decision you make has lasting impact.
• You take full ownership of our infrastructure and establish the standards that will govern how it grows - Hetzner first, AWS, GCP, and Azure for selected services.
• You own and evolve our CI/CD pipelines (GitHub Actions) and deployment workflows - improving rollout strategies, versioning, and rollback procedures from where they stand today.
• You build out and mature our observability stack (Grafana, Loki, Sentry, PostHog) so that problems surface before customers notice them.
• You implement and own our security fundamentals: IAM, secrets management, TLS, vulnerability scanning, and patch management.
• You drive the technical controls required for our ISO 27001 certification and build the systems that produce auditable evidence continuously.

• Defining the reliability architecture for Akamai's AI compute and platform services, including SLO frameworks, fault tolerance patterns, and capacity planning models
• Hands-on building of automation and tooling that reduces operational toil and scales the SRE team's impact
• Designing observability strategy by leveraging Akamai's existing platform to build the telemetry, dashboards, alerts, and GPU-specific monitoring needed for AI workloads
• Architecting deployment safety practices including progressive rollouts, canary analysis, rollback automation, and change safety processes
• Influencing product engineering architecture and design decisions, embedding reliability into the development lifecycle at the system level
• Mentoring and elevating other SREs through design reviews, code reviews, and hands-on problem-solving, setting the technical bar for the team

• Lead the team responsible for reliability across Akamai's AI compute and platform services
• Build the team, owning hiring strategy, candidate evaluation, and interview coordination for AI SRE roles
• Partner with product engineering teams to embed reliability into the development lifecycle
• Define and implement SRE practices for Akamai's AI compute and platform services
• Ensure operational readiness for AI products by establishing quality gates, on-call rotations, runbooks, and escalation paths for AI infrastructure failure modes
• Scale operations through software and automation, reducing toil and driving the team toward programmatic solutions over manual intervention
• Own incident management integration for AI workloads, including post-incident analysis and driving systemic improvements that prevent recurrence

We're looking for a senior Site Reliability Engineer to join our small, high-ownership SRE team. In this hands-on individual contributor role, you'll own the reliability, scalability, and security of AbsenceSoft's production infrastructure on AWS, supporting a B2B SaaS platform that processes sensitive employee leave data for enterprise customers. You'll work closely with infrastructure, application engineering, product leadership, and cross-functional partners in Security and Compliance, with a clear path to grow toward a Tech Lead opportunity as our team and platform continue to mature.

WHAT YOU'LL DO
- Architect, implement, and operate scalable, resilient, and secure AWS infrastructure, including GuardDuty, Lambda, EventBridge, SNS, SES, S3, ALB, and ECS container workloads.
- Lead infrastructure-as-code initiatives to ensure all environments are reproducible, auditable, and consistently configured in support of SOC 2 change management controls.
- Design, maintain, and improve CI/CD pipelines using Jenkins and GitHub to enable reliable, repeatable software delivery, partnering with application engineering to reduce release risk and increase deployment frequency.
- Own the Datadog observability platform, including dashboards, monitors, alerting thresholds, and log management; define and maintain SLOs, SLIs, and error budgets to guide reliability investment and reduce alert fatigue.
- Serve as a senior technical responder across the full incident lifecycle (detection, containment, resolution, and postmortem) within a shared on-call rotation, and lead blameless postmortems to drive down incident frequency and MTTR.
- Refine, implement, and test disaster recovery plans to meet RTO/RPO objectives, while contributing to SOC 2 audit readiness with a focus on access controls, incident response, and risk mitigation.
- Mentor junior SREs through code reviews, incident pairing, and documentation of runbooks and engineering standards.

WHAT YOU'LL BRING
- 5+ years of experience in SRE, DevOps, or a related engineering role, with advanced hands-on expertise in AWS production environments and core services including Lambda, ECS, S3, ALB, and GuardDuty.
- Strong proficiency in infrastructure-as-code tooling such as Terraform, CloudFormation, or CDK, paired with experience building and operating CI/CD pipelines using Jenkins and GitHub.
- Proficiency in Python, Go, or Bash for automation, alongside hands-on experience with Datadog or a comparable observability platform for monitoring, alerting, and log management.
- Demonstrated experience leading incident response in complex, distributed systems, with working knowledge of SLO/SLI frameworks, error budgets, and disaster recovery planning against defined RTO/RPO objectives.
- Familiarity with SOC 2 compliance frameworks and experience contributing to audit readiness, access controls, and security control evidence collection.
- A collaborative, ownership-driven mindset with strong communication skills, a passion for mentoring junior engineers, and a commitment to reducing toil through automation and AI-assisted tooling.

At AbsenceSoft, we LEAD with our values:
Lead with Innovation - We create meaningful change through intelligence, focus and passion. We embrace curiosity, data, and insight to shape the future of our industry. Always innovating, learning and evolving.
Elevate Every Voice - Every perspective matters. We listen, learn, and build a culture where diversity of thought and experience drives better solutions and smarter decisions.
Achieve Together - The customer fuels everything we do. We share knowledge, collaborate, celebrate wins, and face challenges as one team because success is always a collective achievement.
Drive Outcome - Every action we take delivers measurable value to our teams, our customers, and the employees they support. Accountability is non-negotiable. We honor our commitments, take responsibility for results, and see every success and setback as a chance to grow stronger.

We offer:
- Impact that matters. You’ll do work that shapes the future of the modern workplace.
- Flexibility and trust. We’re remote-first and results driven. You’ll have the freedom and flexibility to do your best work, wherever you do it best.
- Growth and development. We believe the best work happens when people are growing. You’ll have access to learning resources, leadership programs, and real opportunities to take on new challenges and expand your impact.
- Competitive rewards. We offer comprehensive benefits, a performance-based bonus program, and equity opportunities, because when we grow, you should too.
- Time for life. Recharge and reconnect with flexible time off, paid holidays, and flexible leave programs designed to support every season of life.
- Belonging and balance. We’re building an inclusive culture where every voice is valued, collaboration is celebrated, and success is shared.

We’re committed to building a team as diverse as the customers we serve. If your experience doesn’t align perfectly with every qualification, we still encourage you to apply. You might be exactly what we’re looking for. If this sounds like a fit, apply today. We’d love to meet you!


