Job Closed
This listing is no longer active.
Premium boutique software development company that helps brands with big ideas to make a difference in people’s lives.
Senior Reliability Engineer, AWS/Python
Location
Latin America
Posted
65 days ago
Salary
0
Seniority
Senior
Job Description
Truelogic Software
- Designs, implements, and continuously improves observability strategies across services, including metrics, logs, traces, alerts, and dashboards.
- Focuses on understanding system behavior in production, identifying failure modes, performance bottlenecks, and reliability risks.
- Evolves and maintains shared AWS CDK and CDK8s constructs, with emphasis on observability, autoscaling, and operational safeguards rather than basic infrastructure provisioning.
- Maintains and operates core platform components such as VPC, EKS clusters, RDS, OpenSearch, and MSK, ensuring they expose meaningful operational signals.
- Operates and enhances Kubernetes cluster addons such as ingress controllers, cert-manager, autoscalers, and monitoring, logging, and tracing stacks.
- Defines and maintains SLIs, SLOs, and alerting strategies that clearly distinguish between symptoms, root causes, and actionable operational events.
- Improves automated operational responses, including autoscaling, self-healing mechanisms, and runbook-driven remediation.
- Ensures high reliability through structured alerting systems (Prometheus, CloudWatch), noise reduction, alert quality improvements, and recovery mechanisms.
- Collaborates with engineering teams to investigate production incidents, perform root cause analysis, and drive long-term reliability improvements.
- Owns CI/CD pipelines for Infrastructure as Code (IaC) and observability-related platform components.
- Applies Site Reliability Engineering (SRE) principles, including observability-first design, error budgets, and operational readiness, to shared platform services.
- Supports IAM roles, secrets management, and tenant isolation best practices.
Job Requirements
- Has 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure roles, with significant hands-on experience operating and supporting production systems.
- Demonstrates strong experience in observability operations, including defining metrics, logs, traces, dashboards, alerts, and reliability indicators for complex systems.
- Has hands-on experience with AWS services such as VPC, IAM, RDS, MSK, S3, and CloudWatch, as well as Kubernetes components like Helm, RBAC, and ServiceAccounts.
- Demonstrates fluency in Python and experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks.
- Possesses a strong understanding of Prometheus, Grafana, alert tuning, alert fatigue reduction, and incident-driven monitoring improvements.
- Has experience improving existing systems rather than building greenfield infrastructure, with a focus on operational excellence and system reliability.
- Shows a proven track record of using observability data to drive automation, scaling decisions, and operational improvements.
- Has experience designing reusable infrastructure or observability patterns, or contributing to internal developer or platform tooling.
- Has experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines (nice to have).
Benefits
- 100% Remote Work: Enjoy the freedom to work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
- Highly Competitive USD Pay: Earn excellent, market-leading compensation in USD that goes beyond typical market offerings.
- Paid Time Off: We value your well-being. Our paid time off policies ensure you have the chance to unwind and recharge when needed.
- Work with Autonomy: Enjoy the freedom to manage your time as long as the work gets done. Focus on results, not the clock.
- Work with Top American Companies: Grow your expertise working on innovative, high-impact projects with industry-leading U.S. companies.
More DevOps Engineer Jobs
Senior DevOps Engineer
Stablecore
Stablecore enables banks and credit unions to offer digital assets to their customers. We build a side core that connects traditional financial technology like cores and digital banking platforms to a variety of digital asset and compliance platforms.
Infrastructure / Site Reliability Engineer (SRE)
Bank-Grade Cloud Infrastructure & Platform Reliability
Location: US-based (Remote friendly)
Experience: Senior-level preferred (5–10+ years)
Company: Stablecore
About Stablecore
Stablecore is building the digital-asset side-core for banks. We help regulated financial institutions safely launch and operate crypto-native products, stablecoins, tokenized deposits, custody, and on-chain payments, while integrating directly with their existing banking cores and compliance frameworks. Our customers are banks. Our counterparts are regulators. Our systems must be correct, auditable, resilient, and secure by default.
Role Overview
We are looking for an Infrastructure / SRE Engineer to design, operate, and harden the cloud and platform foundations that Stablecore runs on. This role owns reliability, security posture, deployment safety, and operational maturity across a multi-region, bank-grade environment. You will work closely with backend engineers, security, and compliance to ensure that the systems moving real money remain available, correct, and defensible under audit. This is not “keep the lights on” IT. You’ll be building and operating infrastructure that regulators, bank risk teams, and third-party auditors scrutinize in detail.
What You’ll Work On
- Design and operate multi-region AWS infrastructure with strong isolation and failure boundaries
- Own production EKS clusters (multi-region, multi-AZ) including:
  - Cluster lifecycle, upgrades, and node management
  - Network policies, ingress/egress, and service isolation
- Operate Aurora PostgreSQL (Global Database):
  - Writer / reader topology
  - Cross-region replication, failover, and DR testing
- Build and maintain CI/CD pipelines in GitLab:
  - Secure build pipelines
  - Environment promotion and deployment safety
  - Artifact integrity and traceability
- Manage container supply chain security using Harbor:
  - Image scanning
  - Provenance and access control
- Operate identity and access control via Keycloak:
  - Realm design and lifecycle
  - Integration with internal services and gateways
- Design and maintain Cloudflare protections:
  - WAF, rate limiting, bot protection
  - Zero Trust access patterns
- Implement observability, alerting, and incident response:
  - Metrics, logs, traces
  - On-call readiness and runbooks
- Partner with security and compliance teams on:
  - SOC 2 / ISO / bank TPRM requirements
  - Evidence generation and audit readiness
- Participate in architecture decisions around:
  - Multi-tenancy vs isolation
  - Blast-radius reduction
  - Availability vs consistency tradeoffs
Technical Stack
- Cloud: AWS (EKS, Aurora Global, IAM, KMS, VPC, networking)
- CI/CD: GitLab
- Edge & Security: Cloudflare
- Identity: Keycloak
- Container Registry: Harbor
- Platform: Kubernetes
- Observability: Metrics, logs, tracing (tool-agnostic, but production-grade)
What We’re Looking For
- Strong experience operating production AWS infrastructure at scale
- Deep hands-on experience with Kubernetes / EKS
- Experience running PostgreSQL or Aurora in high-availability, regulated environments
- Comfort owning production reliability, including on-call and incident response
- Strong security instincts around:
  - IAM and least privilege
  - Network boundaries
  - Secrets management
- Experience designing systems that survive audits, failures, and human error
- Clear communication and calm judgment during incidents
Nice to Have
- Experience supporting fintech, banking, or regulated platforms
- Prior exposure to SOC 2, ISO 27001, FFIEC, or bank TPRM processes
- Experience with multi-region DR testing and failover exercises
- Familiarity with GitOps or infrastructure-as-code patterns
- Experience supporting platforms that move money or other irreversible assets
Why This Role Is Interesting
- You’ll build infrastructure that banks trust with real money
- Reliability, security, and correctness actually matter here
- Problems are concrete, high-stakes, and non-theoretical
- Close collaboration with backend, security, and compliance teams
- Opportunity to shape foundational platform decisions early
- Own the DevSecOps roadmap: Define and execute the strategy for integrating security across our SDLC (SAST, DAST, dependency scanning, secrets detection, container security), ensuring controls are comprehensive without becoming delivery bottlenecks.
- Lead complex, cross-functional programs: Manage a portfolio of interdependent security and infrastructure initiatives. Map dependencies, hold delivery cadences accountable, and escalate the right things at the right time.
- Build paved roads: Design shared pipeline templates, hardened base images, and reusable IaC modules that embed security as a default, reducing cognitive load on developers and eliminating per-team reinvention of compliance.
- Own risk and compliance: Maintain a clear view of technical security risk across your portfolio. Keep teams continuously audit-ready against relevant frameworks (SOC 2, ISO 27001, HIPAA, HITRUST) through automation, not heroics.
- Communicate across all levels: Translate security risk into business language for executives, and compliance requirements into engineering priorities for teams.
- Define product infrastructure according to the architecture guidelines
- Ensure environment resilience
- Align and manage SLIs, SLAs, and SLOs
- Troubleshoot application infrastructure (understand problems, participate in investigations, and propose solutions)
- Assist with application troubleshooting when requested by developers
- Drive monitoring, logging, and automation solutions
- Document product infrastructure
- Understand and participate in capacity and cost planning for the infrastructure
- Analyze application trends
- Propose new solutions for the product
- Participate in POCs and tests for new solutions
- IaC: Infrastructure as Code
- Deploy/create cloud infrastructure (Azure, OCI, AWS, and GCP)
- Request and follow up on on-premises infrastructure work with the respective teams
- Help build and scale the internal development platform.
- Build tools, services, and automation for the engineering team.
- Provide autonomy and a self-serve culture for teams.
- Foster adoption of AI and agentic development while ensuring security and architectural standards.



