The Simpler, Safer Way to Connect MCPs
Senior Site Reliability Engineer
Location
United States
Posted
68 days ago
Seniority
Senior
Job Description
Runlayer
- Own reliability and performance of our cloud infrastructure across AWS (ECS, Aurora, CloudWatch) and GCP
- Manage and optimize Kubernetes clusters and container orchestration
- Drive database reliability engineering, including performance tuning and scaling
- Build and maintain CI/CD pipelines for rapid, safe deployments
- Run incident response and on-call rotations
- Partner with product engineers to design scalable, resilient systems
Job Requirements
- Strong AWS experience, particularly ECS, Aurora, and CloudWatch
- GCP experience as we expand cross-cloud
- Kubernetes and container orchestration expertise
- DBRE experience with database performance tuning
- CI/CD pipeline ownership and incident response experience
- Background at a B2B SaaS company serving enterprise customers, ideally in infrastructure
- Bonus qualifications: experience deploying and supporting on-prem or hybrid environments; Python backend familiarity (our platform is Python-based); experience at an early-stage or high-growth company
Benefits
- Competitive salary and equity — compensation that reflects your expertise and customer-facing responsibilities.
- Paid time off — 4 weeks paid vacation, paid sick leave, and paid parental leave.
- Professional development — budget for conferences, courses, and certifications in AI, enterprise software, and customer success.
- Top-tier equipment — your choice of laptop and accessories to create your ideal work environment.
- Health benefits — comprehensive health, dental, and vision coverage.
- Customer interaction opportunities — work directly with innovative companies and see the immediate impact of your work.
More DevOps Engineer Jobs
About Muvr
Muvr is building the future of on-demand logistics and moving services. Our platform powers real-time booking, pricing, matching, payments, and fulfillment across customers, drivers, and partners. As we scale, infrastructure reliability and operational excellence become product requirements. This role exists to keep production stable, observable, secure, and scalable so engineering teams can ship quickly without sacrificing uptime, correctness, or customer trust.
Role Overview
The DevOps / Site Reliability Engineer (SRE) owns the reliability foundations of Muvr’s platform. You will design and operate cloud infrastructure, improve deployment speed and safety, strengthen observability, and lead incident practices that prevent repeat failures. This is a hands-on, production-ownership role for someone who values automation, low-toil systems, and practical guardrails that make delivery faster and safer at the same time. You will partner closely with Engineering, Security, Product, and adjacent teams to harden the platform as usage grows.
Key Responsibilities
1) Platform Reliability and Production Ownership
- Own uptime, latency, availability, and error-rate outcomes for core services.
- Establish SLOs, SLIs, and alerting aligned to customer impact and service health.
- Improve reliability through resilient patterns such as retries, timeouts, circuit breakers, load shedding, and queue protections.
- Reduce operational toil by building automation and self-service tools that improve engineering velocity and operational safety.
2) Cloud Infrastructure and Infrastructure as Code
- Design, build, and maintain scalable cloud infrastructure across AWS, GCP, or Azure environments.
- Automate provisioning, configuration, and change management using Infrastructure as Code, preferably Terraform.
- Improve disaster recovery readiness through backups, restore validation, redundancy, and failover planning.
- Maintain strong environment consistency across development, staging, and production to reduce deployment surprises and configuration drift.
3) CI/CD and Release Engineering
- Build and improve CI/CD pipelines to increase deployment frequency while reducing release risk.
- Standardize deployment practices, including versioning, environment promotion, staged rollouts, canary releases, and rollback mechanisms.
- Implement release guardrails such as required test gates, policy checks, dependency scanning, and secrets detection.
- Improve developer experience through faster builds, clearer failure signals, and more reliable deployment workflows.
4) Observability and Operational Excellence
- Build and maintain observability across logs, metrics, tracing, dashboards, and service-level visibility.
- Design alerting that catches critical failures early while minimizing noise and paging fatigue.
- Create runbooks and playbooks that are actionable under pressure and linked to specific alerts or operational scenarios.
- Improve MTTR through better instrumentation, faster diagnosis paths, and clearer service ownership.
5) Incident Management and Root-Cause Discipline
- Lead or coordinate incident response, including triage, communication, mitigation, recovery, and follow-through.
- Run blameless postmortems with clear root-cause narratives, contributing factors, and prevention actions.
- Ensure corrective actions are tracked to completion and meaningfully reduce recurrence.
- Establish incident severity levels, escalation paths, and communication templates that improve consistency during outages or degradation events.
6) Security and Compliance Baselines
- Partner with Engineering to implement security best practices, including least privilege, secrets management, encryption, and audit logging.
- Improve access hygiene through MFA coverage, key rotation, access reviews, and break-glass procedures.
- Identify infrastructure risks and drive remediation with clear prioritization, ownership, and operational follow-through.
- Support audit and compliance readiness through clear documentation, logging, and evidence-friendly processes when needed.
7) AI-Enabled Productivity and Execution
- Use AI tools thoughtfully to improve productivity, troubleshooting speed, documentation quality, and automation efficiency.
- Apply AI responsibly to support analysis, scripting, incident investigation, and workflow improvement while maintaining security, accuracy, and sound operational judgment.
Qualifications
Required
- 3+ years of experience in DevOps, Site Reliability Engineering, Infrastructure Engineering, or similar roles supporting production systems.
- Strong experience with at least one major cloud provider: AWS, GCP, or Azure.
- Experience building or maintaining CI/CD pipelines using GitHub Actions, Jenkins, CircleCI, or similar tools.
- Familiarity with containerization using Docker and orchestration platforms such as Kubernetes.
- Strong troubleshooting skills across infrastructure, core networking concepts, deployments, and service operations.
- Ability to write automation scripts and tooling using Bash, Python, or similar languages.
- Comfortable using AI tools to improve efficiency and work quality, with a willingness to learn emerging AI workflows and apply them responsibly.
Preferred
- Experience supporting marketplace, logistics, dispatch, delivery, or other real-time operational platforms.
- Experience with observability tools such as Prometheus and Grafana, Datadog, New Relic, or similar platforms.
- Strong Infrastructure as Code experience using Terraform, CloudFormation, or equivalent tooling.
- Experience scaling distributed systems in production, including autoscaling, queue management, caching strategies, and traffic spike handling.
- Familiarity with security best practices and compliance expectations for production systems.
- Familiarity with tools and systems such as Slack, Google Workspace, ChatGPT, ClickUp, Hubstaff, GitHub, CI/CD platforms, Kubernetes, Terraform, Datadog, Grafana, cloud consoles, ticketing tools, and other infrastructure or reliability platforms.
Why Join Muvr
- Own reliability and infrastructure for a fast-growing real-time logistics marketplace.
- Take on a high-impact role shaping scalability, operational readiness, and production discipline.
- Partner directly with engineering leadership to build systems that scale safely and sustainably.
- Work on meaningful infrastructure problems where uptime, speed, and correctness directly affect real-world outcomes.
- Competitive compensation.

- Design, implement, and maintain robust, scalable infrastructure and automation solutions.
- Automate and optimize infrastructure provisioning, configuration management, deployment processes, and repetitive operational tasks.
- Execute infrastructure and application deployments to upper cloud environments.
- Develop, maintain, and optimize CI/CD pipelines and operational workflows.
- Monitor and optimize system performance and resource utilization, and identify bottlenecks.
- Implement scalable and reliable systems that support product growth and team agility.
- Troubleshoot, determine root causes, and resolve technical issues.
- Participate in code reviews and provide feedback on best practices.
- Collaborate cross-functionally with cloud, firmware, and hardware development teams to integrate and align solutions.
- Document operational processes, tools, and workflows to ensure knowledge sharing and smooth transitions.

- Administer and maintain project infrastructures for optimal performance and reliability
- Deploy and manage infrastructure services, including Kubernetes clusters and cloud environments
- Manage cloud resources (VMs, storage, databases) using Infrastructure as Code (IaC)
- Ensure high availability, disaster recovery, and cost-effective optimization of cloud deployments
- Implement tools for building, deploying, securing, and monitoring infrastructure and services
- Design, create, and manage CI/CD pipelines for streamlined software delivery and deployment
- Configure monitoring and logging systems for continuous operation and quick issue resolution
- Collaborate with developers, operations teams, QA, Architecture, IT, and project management
- Provide first-line DevOps support and troubleshoot issues for developers

- Design, implement, and maintain CI/CD pipelines
- Automate deployment, monitoring, and infrastructure processes
- Manage cloud environments (AWS, Azure, GCP)
- Monitor applications and infrastructure, ensuring high availability
- Collaborate with development and security teams, promoting DevOps best practices
- Implement infrastructure-as-code (IaC) best practices
- Optimize system performance, costs, and scalability, working in partnership with FinOps practices
- Define, apply, and sustain code versioning strategies using GitFlow and branch management best practices
- Ensure the security and compliance of environments




