Order.co, formerly known as Negotiatus, has developed cloud-based spend management software for its customers to “centralize and streamline the purchasing process.”
Senior Site Reliability Engineer
Location: New York
Posted: 19 days ago
Salary: $175K - $200K / year
Seniority: Senior
Job Description
Senior Site Reliability Engineer
Order.co
• Ensure software systems are reliable, scalable, performant, and operationally efficient
• Design, build, and operate highly available, scalable, and fault-tolerant infrastructure and platform services
• Define and maintain service level objectives (SLOs), service level indicators (SLIs), and error budgets across platform systems
• Lead incident response efforts for complex production outages; drive root-cause analysis and long-term remediation actions
• Develop infrastructure automation and self-service tooling to reduce operational toil and improve engineering velocity
• Build and maintain CI/CD pipelines, deployment automation, and release engineering workflows
• Design and maintain comprehensive monitoring, logging, tracing, and alerting systems for distributed services
Job Requirements
- Strong foundation in computer science fundamentals: data structures, algorithms, and system design
- Familiarity with building production-grade applications and services using Ruby and Ruby on Rails
- Deep expertise with Linux systems administration and production troubleshooting
- Strong experience operating cloud infrastructure at scale, particularly within AWS environments
- Experience with Kubernetes, container orchestration, and cloud-native infrastructure patterns
- Proficiency with infrastructure as code tools such as Terraform or CloudFormation
- Expertise designing and operating CI/CD pipelines and deployment automation systems
- Deep understanding of observability tooling including Datadog, OpenTelemetry, or similar platforms
- Strong knowledge of distributed systems reliability patterns including redundancy, failover, autoscaling, rate limiting, and graceful degradation
- Experience supporting distributed microservices architectures and event-driven systems
Benefits
- Competitive compensation including base salary, bonus, and equity
- Employer-sponsored 401(k) with match
- Comprehensive medical, dental, and vision coverage
- Flexible time off and hybrid work environment
More DevOps Engineer Jobs
Senior DevOps Engineer
Hunt St
We help Aussie companies find top 3% remote talent in the Philippines & Nepal for a single finder's fee.
Role Description
We are seeking an experienced and highly skilled Senior DevOps Engineer to join our engineering team. This role is critical to the development and deployment of our infrastructure, ensuring robust CI/CD pipelines, infrastructure as code (IaC), cloud environment optimization, and seamless collaboration across development, QA, and operations teams. The ideal candidate is passionate about automation, performance, scalability, and reliability.
Key Responsibilities
- Maintaining and improving the resiliency of our core applications and our hybrid infrastructure platform
- Providing continued improvement to the platform infrastructure through automation and standardisation
- Providing complementary skills and expertise to the teams and continuously learning from peers and seniors
- Ensuring that all of our core services are up to date and security patched
- Working closely with development teams to ensure applications are configured for security, efficiency and scalability
Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field
- 5+ years of experience in DevOps, Systems Engineering, or a related field
- Linux native; if you do not use Linux as your preferred OS this may not be the role for you
- Great communication skills (verbal and written)
- Strong experience with the following:
  - Linux administration
  - Bash scripting
  - Kubernetes
  - Docker
  - AWS
- Good knowledge of networking, DNS, load balancing and CDNs
Preferred Qualifications
- Experience with Terraform and Ansible
- Experience working on and supporting container-based CI/CD pipelines
- Keen interest in SecOps practices
- AWS Certifications (we will fully support any AWS certification you are seeking)
- Experience configuring observability platforms for monitoring and alerting (including Prometheus and New Relic)
- Experience with HashiCorp Vault, Redis, RabbitMQ or MSSQL
- Experience with any programming languages (e.g. Node.js, PHP, TypeScript, Python)
Work Arrangement & Expectations
This is a remote role that will be set up as an independent contractor engagement. To ensure alignment and transparency, successful candidates will be expected to:
- Be available for meetings and collaboration during core [AEST or PHT] business hours
- Disclose any existing ongoing roles or client work
- Reflect this engagement on their LinkedIn profile (clearly marked as “Independent Contractor”)
• Design, deploy, and manage containerized workloads using Amazon ECS (Elastic Container Service) and Amazon EKS (Elastic Kubernetes Service).
• Build and maintain CI/CD pipelines to automate software delivery workflows.
• Develop and manage Docker container images, registries (ECR), and container lifecycle best practices.
• Implement Infrastructure as Code (IaC) using tools such as Terraform, CloudFormation, or CDK.
• Monitor, troubleshoot, and optimize cloud infrastructure performance, availability, and cost.
• Enforce security best practices across containerized environments (IAM roles, network policies, secrets management).
• Collaborate with software engineers to containerize applications and migrate workloads to ECS/EKS.
• Manage Kubernetes cluster configurations, namespaces, Helm charts, and service mesh integrations.
• Define and maintain observability standards using tools like CloudWatch, Prometheus, Grafana, or Datadog.
• Participate in on-call rotations and incident response processes.
Lead Site Reliability Engineer
Akka (formerly Lightbend)
Responsive by Design, Akka apps are elastic, agile, and resilient.
• Own Service Level Objectives/Service Level Indicators (SLOs/SLIs) and error budgets across multi-cloud clusters (EKS, GKE, AKS); drive blameless post-mortems and systemic remediation.
• Lead capacity planning with our customers, cluster lifecycle management, and Kubernetes and database upgrade cycles.
• Define and enforce runbooks, on-call rotations, and escalation paths for the wider engineering organisation.
• Own and evolve the IaC layer: Helm charts, Crossplane compositions, and FluxCD GitOps pipelines.
• Design and maintain cloud-resource provisioning workflows that span all three cloud providers, with consistent policy controls.
• Architect and operate connectivity patterns: AWS PrivateLink / Transit Gateway, GCP NCC, Azure VNet Peering, and cross-region ingress with Contour/Envoy.
• Maintain and evolve the Linkerd service mesh for mTLS, workload identity (OIDC), and zero-trust authorisation policies.
• Drive PKI hygiene with cert-manager: root/intermediate CA rotation, ACME certificate lifecycle, and secret management via KMS-backed Kubernetes vaults.
• Own the observability stack: Prometheus, Cortex (multi-tenant metrics), OpenTelemetry sidecars, centralised log pipelines, and Groundcover / Grafana dashboards.
• Establish alerting standards and SLO-based alerting rules; ensure distributed traces are actionable across JVM, Rust, and Go workloads.
• Actively participate in on-call and lead the technical response for platform-level incidents.
• Set engineering standards and review infrastructure changes across the team.
• Partner with Security, Product, and Application Engineering to translate reliability requirements into platform capabilities.
• Grow a team of 3–5 SREs through code review, architecture sessions, and career conversations.
• Lead Reliability Engineering for User Experience
• Drive reliability, scalability, and operational excellence for critical user-facing systems and services. Improve performance and resiliency across APIs, content delivery, feed generation, search, messaging, and real-time experiences.
• Partner with product and infrastructure engineering teams to design systems that remain highly available and performant under massive global load. Guide architectural decisions around failover, redundancy, graceful degradation, traffic management, and capacity planning.
• Identify systemic risks and reliability bottlenecks across services, dependencies, deployments, and infrastructure. Build proactive mitigation strategies and drive engineering improvements that reduce incidents and improve service health.
• Eliminate repetitive operational work through automation and tooling. Build systems that improve deployment safety, incident response, remediation workflows, and reliability guardrails.
• Lead complex incident response efforts across engineering teams. Drive blameless postmortems, identify root causes, and ensure sustainable long-term fixes are implemented.
• Define and champion best practices around reliability engineering, SLIs/SLOs, capacity management, release engineering, and operational maturity across the company.
• Provide technical leadership and mentorship to engineers across SRE and software engineering teams. Help shape reliability culture and raise the operational excellence bar across the organization.




