Job Closed

This listing is no longer active.

Truelogic Software

Premium boutique software development company that helps brands with big ideas to make a difference in people’s lives.

Senior Reliability Engineer, AWS/Python

DevOps Engineer | Full Time | Remote | Senior | Team 501-1,000 | Since 2004 | H1B No Sponsor

Location

Latin America

Posted

65 days ago

Salary

Seniority

Senior

5 yrs exp | English

AWS Grafana Apache Kafka Kubernetes Prometheus Python Apache Spark

Job Description

• Designs, implements, and continuously improves observability strategies across services, including metrics, logs, traces, alerts, and dashboards.
• Focuses on understanding system behavior in production, identifying failure modes, performance bottlenecks, and reliability risks.
• Evolves and maintains shared AWS CDK and CDK8s constructs, with emphasis on observability, autoscaling, and operational safeguards rather than basic infrastructure provisioning.
• Maintains and operates core platform components such as VPC, EKS clusters, RDS, OpenSearch, and MSK, ensuring they expose meaningful operational signals.
• Operates and enhances Kubernetes cluster addons such as ingress controllers, cert-manager, autoscalers, and monitoring, logging, and tracing stacks.
• Defines and maintains SLIs, SLOs, and alerting strategies that clearly distinguish between symptoms, root causes, and actionable operational events.
• Improves automated operational responses, including autoscaling, self-healing mechanisms, and runbook-driven remediation.
• Ensures high reliability through structured alerting systems (Prometheus, CloudWatch), noise reduction, alert quality improvements, and recovery mechanisms.
• Collaborates with engineering teams to investigate production incidents, perform root cause analysis, and drive long-term reliability improvements.
• Owns CI/CD pipelines for Infrastructure as Code (IaC) and observability-related platform components.
• Applies Site Reliability Engineering (SRE) principles, including observability-first design, error budgets, and operational readiness, to shared platform services.
• Supports IAM roles, secrets management, and tenant isolation best practices.

Job Requirements

Has 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure roles, with significant hands-on experience operating and supporting production systems.
Demonstrates strong experience in observability operations, including defining metrics, logs, traces, dashboards, alerts, and reliability indicators for complex systems.
Has hands-on experience with AWS services such as VPC, IAM, RDS, MSK, S3, and CloudWatch, as well as Kubernetes components like Helm, RBAC, and ServiceAccounts.
Demonstrates fluency in Python and experience with Infrastructure-as-Code using AWS CDK, CDK8s, or equivalent frameworks.
Possesses a strong understanding of Prometheus, Grafana, alert tuning, alert fatigue reduction, and incident-driven monitoring improvements.
Has experience improving existing systems rather than building greenfield infrastructure, with a focus on operational excellence and system reliability.
Shows a proven track record of using observability data to drive automation, scaling decisions, and operational improvements.
Has experience designing reusable infrastructure or observability patterns, or contributing to internal developer or platform tooling.
Has experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines (nice to have).

Benefits

100% Remote Work: Enjoy the freedom to work from the location that helps you thrive. All it takes is a laptop and a reliable internet connection.
Highly Competitive USD Pay: Earn excellent, market-leading compensation in USD that goes beyond typical market offerings.
Paid Time Off: We value your well-being. Our paid time off policies ensure you have the chance to unwind and recharge when needed.
Work with Autonomy: Enjoy the freedom to manage your time as long as the work gets done. Focus on results, not the clock.
Work with Top American Companies: Grow your expertise working on innovative, high-impact projects with industry-leading U.S. companies.

More DevOps Engineer Jobs

Senior DevOps Engineer

Stablecore

Stablecore enables banks and credit unions to offer digital assets to their customers. We build a side core that connects traditional financial technology like cores and digital banking platforms to a variety of digital asset and compliance platforms.

DevOps Engineer | 65 days ago

Full Time | Remote | Team 20 | Since 2025

Infrastructure / Site Reliability Engineer (SRE)
Bank-Grade Cloud Infrastructure & Platform Reliability

Location: US-based (Remote friendly)
Experience: Senior-level preferred (5–10+ years)
Company: Stablecore

About Stablecore

Stablecore is building the digital-asset side-core for banks. We help regulated financial institutions safely launch and operate crypto-native products, stablecoins, tokenized deposits, custody, and on-chain payments, while integrating directly with their existing banking cores and compliance frameworks. Our customers are banks. Our counterparts are regulators. Our systems must be correct, auditable, resilient, and secure by default.

Role Overview

We are looking for an Infrastructure / SRE Engineer to design, operate, and harden the cloud and platform foundations that Stablecore runs on. This role owns reliability, security posture, deployment safety, and operational maturity across a multi-region, bank-grade environment. You will work closely with backend engineers, security, and compliance to ensure that the systems moving real money remain available, correct, and defensible under audit. This is not “keep the lights on” IT. You’ll be building and operating infrastructure that regulators, bank risk teams, and third-party auditors scrutinize in detail.

What You’ll Work On

- Design and operate multi-region AWS infrastructure with strong isolation and failure boundaries
- Own production EKS clusters (multi-region, multi-AZ) including:
  - Cluster lifecycle, upgrades, and node management
  - Network policies, ingress/egress, and service isolation
- Operate Aurora PostgreSQL (Global Database):
  - Writer / reader topology
  - Cross-region replication, failover, and DR testing
- Build and maintain CI/CD pipelines in GitLab:
  - Secure build pipelines
  - Environment promotion and deployment safety
  - Artifact integrity and traceability
- Manage container supply chain security using Harbor:
  - Image scanning
  - Provenance and access control
- Operate identity and access control via Keycloak:
  - Realm design and lifecycle
  - Integration with internal services and gateways
- Design and maintain Cloudflare protections:
  - WAF, rate limiting, bot protection
  - Zero Trust access patterns
- Implement observability, alerting, and incident response:
  - Metrics, logs, traces
  - On-call readiness and runbooks
- Partner with security and compliance teams on:
  - SOC 2 / ISO / bank TPRM requirements
  - Evidence generation and audit readiness
- Participate in architecture decisions around:
  - Multi-tenancy vs isolation
  - Blast-radius reduction
  - Availability vs consistency tradeoffs

Technical Stack

- Cloud: AWS (EKS, Aurora Global, IAM, KMS, VPC, networking)
- CI/CD: GitLab
- Edge & Security: Cloudflare
- Identity: Keycloak
- Container Registry: Harbor
- Platform: Kubernetes
- Observability: Metrics, logs, tracing (tool-agnostic, but production-grade)

What We’re Looking For

- Strong experience operating production AWS infrastructure at scale
- Deep hands-on experience with Kubernetes / EKS
- Experience running PostgreSQL or Aurora in high-availability, regulated environments
- Comfort owning production reliability, including on-call and incident response
- Strong security instincts around:
  - IAM and least privilege
  - Network boundaries
  - Secrets management
- Experience designing systems that survive audits, failures, and human error
- Clear communication and calm judgment during incidents

Nice to Have

- Experience supporting fintech, banking, or regulated platforms
- Prior exposure to SOC 2, ISO 27001, FFIEC, or bank TPRM processes
- Experience with multi-region DR testing and failover exercises
- Familiarity with GitOps or infrastructure-as-code patterns
- Experience supporting platforms that move money or other irreversible assets

Why This Role Is Interesting

- You’ll build infrastructure that banks trust with real money
- Reliability, security, and correctness actually matter here
- Problems are concrete, high-stakes, and non-theoretical
- Close collaboration with backend, security, and compliance teams
- Opportunity to shape foundational platform decisions early

PostgreSQL AWS Cloudflare GitLab Keycloak Kubernetes

Texas + 1 more

$120K - $190K / year

Principal TPM, DevSecOps

Prescryptive Health, Inc.

Let's Rewrite the Script

DevOps Engineer | 65 days ago

Full Time | Remote | Team 201-500 | Since 2017 | H1B No Sponsor

• Own the DevSecOps roadmap
• Define and execute the strategy for integrating security across our SDLC (SAST, DAST, dependency scanning, secrets detection, container security), ensuring controls are comprehensive without becoming delivery bottlenecks
• Lead complex, cross-functional programs
• Manage a portfolio of interdependent security and infrastructure initiatives
• Map dependencies, hold delivery cadences accountable, and escalate the right things at the right time
• Build paved roads
• Design shared pipeline templates, hardened base images, and reusable IaC modules that embed security as a default, reducing cognitive load on developers and eliminating per-team reinvention of compliance
• Own risk and compliance
• Maintain a clear view of technical security risk across your portfolio
• Keep teams continuously audit-ready against relevant frameworks (SOC 2, ISO 27001, HIPAA, HITRUST) through automation, not heroics
• Communicate across all levels
• Translate security risk into business language for executives, and compliance requirements into engineering priorities for teams

Ansible AWS Azure GCP SDLC Terraform

Virginia + 4 more

$148K - $205K / year

Job Closed

SRE Specialist

credsystem

Making new achievements possible.

DevOps Engineer | 65 days ago

Full Time | Remote | Team 201-500 | Since 1996 | H1B No Sponsor

• Define product infrastructure according to the architecture guidelines;
• Ensure environment resilience;
• Align and manage SLIs, SLAs, and SLOs;
• Troubleshoot application infrastructure (understand, participate, and propose solutions);
• Assist with application troubleshooting when requested by developers;
• Drive monitoring, logging, and automation solutions;
• Document product infrastructure;
• Understand and participate in capacity and cost planning for the infrastructure;
• Analyze application trends;
• Propose new solutions for the product;
• Participate in POCs and tests for new solutions;
• IaC: Infrastructure as Code;
• Deploy/create cloud infrastructure (Azure, OCI, AWS, and GCP);
• Request and follow up on on-premises infrastructure work with the respective teams.

Ansible AWS Azure Docker GCP Grafana Apache Kafka Kubernetes Linux Prometheus Terraform

Brazil

Job Closed

Senior DevOps / DevEx Engineer

RevenueCat

The subscription platform for mobile apps

DevOps Engineer | 65 days ago

Full Time | Remote | Team 51-200 | Since 2017 | H1B No Sponsor

• Help build and scale the internal development platform.
• Build tools, services, and automation for the engineering team.
• Provide autonomy and a self-serve culture for teams.
• Foster adoption of AI and agentic development while ensuring security and architectural standards.

AWS Docker Kubernetes Python

Australia

$227K / year
