Job Closed
This listing is no longer active.
Building the future
Senior SRE/Platform Engineer
Location
Brazil
Posted
124 days ago
Salary
0
Seniority
Senior
Job Description
Senior SRE/Platform Engineer
Ryz Labs
• Cloud & Systems Troubleshooting: Solve complex performance bottlenecks by navigating the entire stack, from GNU/Linux kernel tuning and networking (TCP/IP, DNS) to modern AWS architectural trends. You understand how cloud-native abstractions interact with the underlying hardware. You never stop at a "restart"; you have the tenacity to find the true root cause.
• Performance Engineering: Partner with developers as a peer to architect and execute rigorous load and stress tests. You take ownership of the results, helping teams refactor code or infrastructure to ensure we stay up when traffic spikes.
• Infrastructure as Code: Maintain and scale our platform using Terraform. (Experience with Pulumi is welcome.)
• FinOps as a Product: Treat "Cost" as a primary engineering metric. You will lead cloud cost optimization, ensuring every dollar spent translates to performance and reliability.
• Modern Observability: Stay ahead of the curve on observability practices. You will ensure we have deep, actionable visibility across our services, regardless of the underlying tool (Datadog, OpenTelemetry, etc.).
• Engineering Culture: Act as a force multiplier. You build "paved roads" and self-service tools that allow developers to own their reliability without friction. Work closely with Engineering, Product, Project Managers, and CX to troubleshoot time-sensitive production issues. You act as a mentor, sharing knowledge to raise the collective expertise of the entire organization.
Job Requirements
- Hybrid Background: 4+ years in SRE/DevOps with a background in Software Development, building and operating production systems at scale.
- Systems Knowledge: Experience with large-scale, business-critical Linux environments on AWS. You know how to debug a system when abstraction layers fail.
- Programming: High proficiency in Python, Go, and Bash. Experience with Java or TypeScript/Next.js is a significant plus.
- Foundations: Solid cloud foundations (networking, identity/IAM, storage, compute, databases, and AWS Cloud caching services) and managed Kubernetes (AWS EKS) fundamentals.
- Security: Familiarity with access boundaries, guardrails, secrets management, and encryption.
- E-commerce Domain: Experience with e-commerce, billing/fulfillment systems, or the Shopify ecosystem is a big plus.
Benefits
- Customer First Mentality - Every decision we make should be made through the lens of the customer.
- Bias for Action - Urgency is critical; expect timelines to be accelerated.
- Ownership - Step up if you see an opportunity to help, even if it's not your core responsibility.
- Humility and Respect - Be willing to learn, be vulnerable, and treat everyone who interacts with RYZ with respect.
- Frugality - Being frugal and cost-conscious helps us do more with less.
- Deliver Impact - Get things done in the most efficient way.
- Raise our Standards - always be looking to improve our processes, our team, and our expectations. The status quo is not good enough and never should be.
More DevOps Engineer Jobs
Senior Site Reliability Engineer
Aspire Software
We never stop building. A vertical acquisition software company that owns, operates and manages a diverse portfolio.
• Own and operate a production cloud platform running on Microsoft Azure and Cloud Foundry (or comparable platforms)
• Ensure availability, performance, and reliability across infrastructure and platform components
• Serve as the primary escalation point for platform-level incidents
• Lead incident response, root cause analysis, and post-incident remediation
• Use modern monitoring, alerting, and AI-assisted observability tools to improve detection, diagnosis, and resolution of incidents
• Drive continuous improvements to reduce operational risk, after-hours incidents, and manual intervention
• Own certificate and secrets lifecycle management, including TLS automation and secure secrets handling (e.g., CredHub, Vault)
• Ensure secure and compliant practices around identity, access, and credential management
• Partner with engineering teams to embed security and reliability best practices into platform workflows
• Automate common operational tasks using Bash and/or PowerShell
• Support and extend infrastructure-as-code using Terraform and/or Bicep
• Improve platform consistency and repeatability through Git-driven, automation-first workflows
• Leverage AI-assisted tooling to support scripting, troubleshooting, and operational documentation
• Support PCI and other compliance activities, including technical control implementation, audit support, and remediation tracking
• Maintain clear runbooks, diagrams, and documentation to enable repeatable operations and knowledge transfer
• Partner with internal teams and external auditors to support compliance requirements
• Work closely with application engineers, junior SRE/support staff, and vendor partners
• Provide technical guidance and mentorship to junior teammates
• Act as a trusted partner to engineering teams on reliability, performance, and operational readiness
• Design, operate, and continuously improve automated CI/CD pipelines using GitLab CI to support zero-downtime deployments across multiple environments.
• Support development teams with standardized deployment tooling, automation, and operational best practices.
• Produce monthly CI/CD pipeline performance reports, identifying risks, trends, and optimization opportunities.
• Administer and support containerized workloads using Kubernetes (EKS) and Docker-based container platforms.
• Configure and manage Linux-based servers and systems.
• Implement Infrastructure as Code (IaC) using Terraform and/or AWS CDK for repeatable, auditable deployments.
• Support provisioning and configuration of AWS services including EC2, EKS, ECS, S3, RDS, VPC, Lambda, and related services.
• Coordinate infrastructure changes without performing AWS account provisioning or organizational administration.
• Integrate security scanning into CI/CD pipelines using tools such as Trivy, AWS Inspector, and AWS Security Hub.
• Perform vulnerability triage and coordinate remediation with development teams in accordance with defined timelines.
• Implement and manage IAM least-privilege policies, secrets, and encryption using AWS KMS, Secrets Manager, and SSM.
• Ensure encryption in transit and at rest across all in-scope systems.
• Configure and maintain monitoring and observability using CloudWatch, Prometheus, Grafana, and centralized logging solutions.
• Support Tier 2 and Tier 3 incident response for production systems, meeting SLA requirements.
• Participate in root-cause analysis and continuous improvement initiatives.
• Participate in Agile sprints, including backlog grooming, sprint planning, stand-ups, and retrospectives.
• Track work in JIRA, using story-point estimation and sprint metrics.
• Support reprioritization of backlog items in coordination with the COR and Product Owner.
• Produce and maintain technical documentation covering architecture, pipelines, monitoring, security, and disaster recovery.
• Conduct knowledge transfer and mentoring sessions for staff and contractor teams.
• Support Business Continuity and Disaster Recovery (BCDR) planning, documentation, and exercises.
• Ensure all deliverables comply with ADA, Section 508, WCAG 2.2 A/AA, and digital accessibility standards.
Public Trust Eligibility Required
This is a contingent position, meaning employment is dependent upon the successful award of the associated contract to Aretum and completion of any required background investigation or security clearance verification.
About Aretum
Aretum is a mission-driven organization committed to delivering innovative, technology-enabled solutions to our customers across defense, civilian, and homeland security sectors. Our teams work at the intersection of strategy, technology, and transformation, helping agencies solve their most critical challenges. We believe in investing in our people and creating a culture where collaboration, inclusion, and professional growth are at the forefront.
Job Summary
Aretum is seeking a skilled and motivated Sr. DevSecOps Engineer. As a Sr. DevSecOps Engineer you will provide your insight and expertise relating to the client's cloud and systems operations and management. Due to the nature of our work as a federal consulting organization, employees may be expected to handle Controlled Unclassified Information (CUI) and must adhere to applicable safeguarding and compliance requirements.
• Collaborate with a diverse team of software engineers, engaging in iterative processes and effective task planning to drive our projects forward.
• Take ownership of the availability, scalability, and performance of our services, proactively identify issues, and implement automation to prevent problems from recurring.
• Participate in the on-call rotation, responding to incidents and working with the team to restore service and prevent recurrence.
• Contribute to automating infrastructure provisioning, configuration, and management using IaC principles with tools like Terragrunt and Ansible.
• Help design and enhance monitoring, logging, and alerting systems to improve observability and ensure system health.
• Participate in blameless post-mortems, documenting issues and following up on action items to foster a culture of learning and continuous improvement.
• Foster collaboration with other engineering teams, promoting the reuse of existing frameworks and gaining insights into their operation.
• Stay current with industry trends, emerging technologies, and best practices in SRE, DevOps, and automation.



