Job Closed
This listing is no longer active.
System Reliability Engineer – DevOps
Location
United States
Posted
84 days ago
Salary
Not specified
Seniority
Senior
Job Description
Growe Talents
• Ensure availability, performance, and scalability of infrastructure and services through monitoring, automation, and operational best practices;
• Lead incident response, perform root cause analysis, and implement recovery and long-term fixes;
• Manage infrastructure using Terraform, Terragrunt, and automation tools for consistency and repeatability;
• Implement and maintain metrics, logs, and tracing solutions (Prometheus, Grafana, Loki, VictoriaMetrics, CloudWatch) to ensure system visibility;
• Identify bottlenecks, tune systems, and improve infrastructure performance;
• Monitor resources, forecast growth, and implement scaling strategies;
• Integrate security best practices into IaC, CI/CD pipelines, and deployments;
• Support vulnerability management;
• Participate in 24/7 rotations (once a week) for timely resolution of critical incidents;
• Work with DevOps, PRE, development, and security teams to improve reliability and design resilient systems;
• Maintain operational runbooks, incident reports, and system documentation.
Job Requirements
- 3+ years in a DevOps, SRE, or related role;
- Strong hands-on experience with AWS services including EC2, ECS, EKS, RDS, DocumentDB, ElastiCache, Keyspaces, S3, EBS, VPC, Route53, KMS, ACM, and CloudWatch;
- Proficiency with Terraform, Terragrunt, and Atlantis for reproducible and version-controlled infrastructure;
- Experience with GitLab CI, FluxCD, Argo Rollouts, and automation tools (Ansible, Python, Bash);
- Solid experience with Docker, Kubernetes (AWS EKS), and Helm (including custom templates, ChartMuseum);
- Familiarity with cluster add-ons such as KEDA, VPA, Karpenter, External-DNS, ingress-nginx, aws-alb-controller, and ebs-csi-driver;
- Experience with Grafana, VictoriaMetrics stack, Tempo, metrics exporters, Pingdom, AWS CloudWatch, and alerting systems like PagerDuty, VMAlert, and Alertmanager;
- Proficiency with OpenSearch and Vector Agent for centralized logging;
- Strong understanding of networking concepts, AWS networking (VPC, Network Firewall, Transit Gateway, Site-to-Site VPN), identity and access management, certificate management (ACM, Vault, SOPS), and application security best practices;
- Familiarity with Cloudflare services, including caching, DNS, and Workers;
- Exposure to AWS Cost Explorer, KubeCost, and custom cost export tools;
- AWS, Terraform, Kubernetes, or Helm certifications are a plus.
Benefits
- Health & Wellness Focus;
- Global Medical Coverage;
- Growth Opportunities;
- Benefits Programs (compensation for gym, dental, and psychological services, etc.);
- Performance-Driven Rewards;
- Dynamic Work Environment.




