The top Telecom-as-a-Service Platform in the world
Site Reliability Engineer
Location
United States
Posted
13 days ago
Salary
Not specified
Seniority
Senior
Job Description
OXIO
- Design and implement a cloud platform to support OXIO backend services.
- Automate technical operations: deployments, scaling, recovery, etc.
- Monitor and maintain mission-critical production infrastructure to ensure maximum uptime.
- Participate in an on-call rotation and a culture of continuous improvement through blameless postmortems.
- Enable the Engineering, Telecom, and Data Engineering teams by providing the tools they need to operate the services they build.
Job Requirements
- Understanding of Linux/Unix systems (most systems are Linux-based).
- Familiarity with Linux/Unix system internals like process management, filesystems, memory management, and networking.
- Proficiency in at least one programming language (Python, Go, or Ruby) and strong skills in scripting (Bash, Perl).
- Experience with infrastructure provisioning tools such as Terraform, CloudFormation, or Ansible.
- Familiarity with containerization (Docker) and orchestration tools (Kubernetes).
- Familiarity with monitoring tools like Prometheus, Grafana, or Datadog.
- Knowledge of setting up alerts, analyzing logs, and creating dashboards for observability.
- Familiarity with incident management practices (e.g., runbooks, postmortems).
- Experience participating in an on-call rotation and handling incidents.
- Experience in setting up and maintaining Continuous Integration/Continuous Delivery pipelines (Jenkins, GitLab CI, CircleCI, etc.).
- Hands-on experience with cloud providers (AWS, Google Cloud, Azure).
- Knowledge of virtualization technologies (VMware, KVM) and cloud-native architecture.
- Understanding of TCP/IP, DNS, HTTP/HTTPS, load balancing, and firewalls.
- Strong understanding of deployment strategies (canary releases, blue-green deployments, etc.).
- Familiarity with high availability and failover mechanisms.
- Familiarity with IAM (Identity and Access Management) and zero trust principles.
- Experience working with distributed systems (e.g., Kafka, Cassandra, Elasticsearch).
- Experience building custom monitoring tools or writing complex automation scripts.
- Functional knowledge of database management (SQL and NoSQL).
- Familiarity with distributed tracing (Jaeger, OpenTelemetry) and advanced log aggregation strategies (ELK stack, Splunk).
- Familiarity with performance profiling tools and optimizing application performance under heavy load.
- Familiarity with load testing and identifying bottlenecks.
- Familiarity with Configuration Management using SaltStack for maintaining server configurations.
Benefits
- N/A


