Job Closed

This listing is no longer active.

AM53 Smart Solutions

The right technology. The ideal talent. At the exact moment.

Senior SRE Engineer – AWS Cloud

DevOps Engineer | Full Time | Remote | Senior | Team 11-50 | Since 2010 | H1B No Sponsor

Location

Brazil

Posted

78 days ago

Salary

Not specified

Seniority

Senior

Job Description

  • Develop, maintain and evolve CI/CD pipelines, ensuring continuous, stable and secure deliveries
  • Automate infrastructure and application deployment processes, reducing toil and increasing reliability
  • Continuously monitor and optimize the performance, availability and security of production environments
  • Administer and support AWS cloud environments, ensuring resilience and scalability
  • Serve as a technical reference for the development team, promoting best practices for continuous delivery
  • Ensure end-to-end observability with robust practices for metrics, logs, tracing, versioning and rollback
  • Manage and ensure availability and performance of MongoDB and PostgreSQL databases
  • Act as a FinOps mentor and point of reference, fostering a culture of cloud cost efficiency and governance
  • Lead the response to critical incidents: rapidly diagnose issues, coordinate resolution and ensure clear communication during crises
  • Conduct blameless post-mortems, turning incidents into lessons and concrete improvements

Job Requirements

  • Proven experience with CI/CD — Jenkins and similar tools
  • Expertise with containers and orchestration — Docker and Kubernetes
  • Strong experience with Infrastructure as Code — Terraform
  • Experience with observability and monitoring — Datadog
  • Experience working in AWS cloud environments
  • Knowledge of FinOps — cloud cost optimization and governance
  • Experience with security automation — DevSecOps
  • AWS certifications (Solutions Architect, DevOps Engineer or SysOps)
  • Participation in cloud migration projects
  • Knowledge of Python for automation and scripting

Benefits

  • Not specified

More DevOps Engineer Jobs


DevOps/MLOps Engineer

Kyivstar

Kyivstar.Tech is a Ukrainian hybrid IT company and a resident of Diia.City. We are a subsidiary of Kyivstar, one of Ukraine's largest telecom operators. Our mission is to change lives in Ukraine and around the world by creating technological solutions and products that unleash the potential of businesses and meet users' needs. More than 600 KS.Tech specialists work daily in various areas: mobile and web solutions, as well as design, development, support, and technical maintenance of high-performance systems and services. We believe in innovations that bring real improvements in quality, and we constantly challenge conventional approaches and solutions. Each of us embraces an entrepreneurial culture, which allows us never to stop, to evolve, and to create something new.

DevOps Engineer | 78 days ago | Full Time | Remote | Team 1,001-5,000

Role Description

We are looking for a DevOps Engineer to design, build, and operate the infrastructure behind our LLM platform. You will be responsible for keeping our ML infrastructure reliable, scalable, and efficient, from data pipelines to training and inference. In this role, you will develop and maintain CI/CD pipelines, orchestration workflows, and observability for distributed ML workloads across GPU/TPU/CPU environments.

This is a DevOps-first role with strong exposure to ML infrastructure. You will work closely with ML Engineers and Data Engineers, while focusing on building a robust, automated, and production-grade platform that accelerates model development and delivery.

Responsibilities

  • Design, build, and operate scalable ML infrastructure on GCP (GKE), supporting both experimentation and production workloads for LLMs and NLP systems.
  • Manage Kubernetes-based environments (GKE): deployment, scaling, upgrades, and reliability of training and inference workloads across GPU/TPU/CPU pools.
  • Build and maintain CI/CD pipelines (GitHub Actions, Jenkins) to automate testing, training, and deployment of ML services and infrastructure.
  • Implement infrastructure as code (Terraform, Ansible) to provision and manage cloud resources in a reproducible, secure, and cost-efficient way.
  • Ensure observability of ML systems: monitoring, logging, and alerting for infrastructure, pipelines, and production inference workloads.
  • Collaborate with ML Engineers and Data Engineers to design and support reliable training and inference pipelines.
  • Optimize resource utilization and cost, improving efficiency of training and serving infrastructure.
  • Troubleshoot and resolve issues across the ML platform, from data pipelines to distributed training and production deployments.
  • Contribute to engineering best practices: code reviews, automation, and continuous improvement of platform reliability and developer experience.

Qualifications

  • Experience: 4+ years in DevOps, Platform Engineering, or ML Infrastructure roles, with a strong understanding of production systems and distributed workloads.
  • Cloud & Infrastructure: Hands-on experience with GCP. Experience with other major cloud platforms is a plus. Strong understanding of cloud-native architectures and experience designing scalable systems for compute- and data-intensive workloads.
  • Kubernetes & Containers: Solid experience with Docker and Kubernetes (preferably GKE), including deploying, scaling, and operating production workloads. Familiarity with Helm and Kubernetes networking fundamentals.
  • CI/CD & Automation: Experience building and maintaining CI/CD pipelines (GitHub Actions, Jenkins, or similar) to automate testing, deployment, and infrastructure changes.
  • Workflow Orchestration: Experience with Airflow (or similar tools).
  • Infrastructure as Code: Strong experience with Terraform (preferred) or similar tools for provisioning and managing infrastructure in a reproducible way.
  • Programming: Strong hands-on experience with scripting languages (Bash and/or Python).
  • Observability & Reliability: Experience with monitoring and logging systems (e.g., Prometheus, Grafana). Understanding of reliability, alerting, and debugging in distributed systems.
  • ML Infrastructure Understanding: Familiarity with the ML lifecycle (training, evaluation, inference) and experience supporting ML workloads in production environments.
  • Collaboration: Ability to work closely with ML Engineers and Data Engineers, translating ML requirements into reliable and scalable infrastructure solutions.

Benefits

  • Office or remote: it's up to you.
  • Remote onboarding.
  • Performance bonuses.
  • Employee training, with the opportunity to learn through the company's library, internal resources, and programs from partners.
  • Health and life insurance.
  • Wellbeing program and corporate psychologist.
  • Reimbursement of expenses for Kyivstar mobile communication.

Worldwide
Job Closed

Senior Site Reliability Engineer

Coterie

A modern baby care brand changing everything about changing.

DevOps Engineer | 78 days ago | Full Time | Remote | Team 11-50 | H1B Sponsor

  • Manage and maintain cloud infrastructure on Azure, including Azure Kubernetes Service (AKS) clusters and supporting resources
  • Build, improve, and maintain CI/CD pipelines using GitHub Actions to support reliable and repeatable deployments
  • Own and enhance our Grafana implementation: designing dashboards, configuring alerts, and supporting incident management workflows
  • Monitor system health, triage incidents, and drive root cause analysis to prevent recurrence
  • Collaborate with development teams to define and track SLIs, SLOs, and error budgets that align with business goals
  • Contribute to infrastructure-as-code practices using Pulumi
  • Identify and resolve reliability risks through capacity planning, performance tuning, and proactive system improvements
  • Participate in an on-call rotation to support production systems and respond to incidents
  • Document runbooks, operational procedures, and architectural decisions to support team knowledge sharing

United States
$140K - $170K / year

Junior DevOps Engineer

Testsieger.de

Simply the right decision.

DevOps Engineer | 78 days ago | Full Time | Remote | Team 11-50 | H1B No Sponsor

  • Build and maintain infrastructure as code using Ansible, Docker, and cloud services (AWS), treating infrastructure like a software project: version-controlled, tested, and reviewable
  • Develop internal tools and automation in Python or Bash to streamline deployments, monitoring, and data workflows
  • Own CI/CD pipelines: set up, improve, and maintain continuous integration and deployment processes so our teams can ship with confidence
  • Manage and optimize databases (MySQL), including backups, performance tuning, and query optimization
  • Operate and improve our Linux-based server infrastructure, including our on-premise Proxmox environment and cloud resources
  • Implement monitoring, logging, and alerting to ensure high availability and fast incident response (e.g. Grafana, NewRelic)
  • Strengthen security and compliance, from firewall management and system hardening to access control and update policies
  • Collaborate closely with Engineering, Data Science, and Product to support data pipelines, internal tools, and deployment workflows

Portugal

Senior Site Reliability Engineer - Pacific Time Zone

NinjaOne

The world’s best IT teams and MSPs use NinjaOne.

DevOps Engineer | 78 days ago | Full Time | Remote | Team 1,001-5,000 | H1B Sponsor

Description

About the Role

At NinjaOne we are passionate about building unified IT solutions that simplify the way IT organizations work. We are currently looking for a Senior Site Reliability Engineer to join our SRE team in the Platform Engineering organization and help us scale our products to millions of end-users. We are looking for individuals with a passion for automation and observability, ensuring the quality and availability of our services. We are looking for candidates located in the Pacific time zone at this time.

Location

We are flexible on remote work from home if you are located in the USA and reside in one of the following states: CA, CO, CT, FL, GA, *IL, KS, MA, MD, ME, NJ, NC, NY, OR, TN, TX, VA, and WA. We have physical offices in Austin, TX and Tampa, FL, if you prefer a hybrid option.

We hire the best software engineers, but experience in our stack can't hurt: NinjaOne is built on Java, Kotlin, C++, Golang and Postgres, supporting millions of user endpoints and running as a scalable cloud service in AWS. Knowing large-scale datastore bottlenecks, asynchronous application design and client-server architecture will help you.

What You'll Be Doing

  • Diagnose and resolve complex application and infrastructure issues
  • Participate in our 24x7 on-call rotation, SCRUM, and deployment planning
  • Perform Root Cause Analysis (RCA) and provide recommendations for application teams
  • Improve availability and reduce customer impact using industry-best observability tools
  • Ensure best-practice and security-minded architecture by influencing design decisions
  • Create and maintain technical documentation and SOPs
  • Develop software, scripts, or tooling to improve efficiency and reduce delivery time of applications and infrastructure
  • Other duties as needed

About You

  • 10+ years' experience in DevOps and/or Site Reliability Engineering roles
  • 3+ years' experience with an object-oriented language (preferably Java, .NET or C++)
  • Intermediate+ level Linux administration, scripting, and troubleshooting
  • Demonstrable knowledge of observability tools (New Relic, Splunk, DataDog)
  • Comprehensive experience with AWS (Amazon Web Services) and its core capabilities (VPC, EC2, ECS, Route53, Fargate, ALB/NLB distributions, etc.)
  • Experience with cloud automation and infrastructure-as-code (IaC) toolsets, primarily CloudFormation but also including Terraform, Helm and Ansible. CDK a plus.
  • Good understanding of containers, Fargate, Kubernetes, and overall distributed microservice architectures
  • Passionate about automation, security, and self-service environments/portals
  • Hands-on experience with CI/CD and SDLC (Software Development Life Cycle) processes
  • Effective communication skills, both verbal and written

About Us

NinjaOne automates the hardest parts of IT to deliver visibility, security, and control over all endpoints for more than 30,000 customers. The NinjaOne automated endpoint management platform is proven to increase productivity, reduce security risk, and lower costs for IT teams and managed service providers. NinjaOne is obsessed with customer success and provides free and unlimited onboarding, training, and support. NinjaOne is #1 on G2 in endpoint management, patch management, remote monitoring and management, and mobile device management.

What You'll Love

  • We are a collaborative, kind, and curious community.
  • We honor your flexibility needs with full-time work that is hybrid remote.
  • We have you covered with our comprehensive benefits package, which includes medical, dental, and vision insurance.
  • We help you prepare for your financial future with our 401(k) plan.
  • We prioritize your work-life balance with our unlimited PTO.
  • We reward your work with opportunity for growth and advancement.

Additional Information

This position is NOT eligible for Visa sponsorship. Due to federal government security requirements associated with our FedRAMP-authorized environment, candidates must be U.S. citizens or lawful permanent residents.

*Due to operational policies, NinjaOne is unable to hire for this role within the city limits of Chicago. We will consider all qualified candidates who reside outside of the city proper or are willing to self-relocate.

Starting pay for the successful applicant depends on a variety of job-related factors, including but not limited to location, market demands, experience, job-related knowledge, and skills. The benefits available for this position include medical, dental, vision, 401(k) plan, life insurance coverage and PTO. For roles based in California, Colorado, Maryland, New Jersey, or Washington, the base salary hiring range for this position is $130,000 - $180,000 per year. For roles based in New York, the base salary hiring range for this position is $130,000 - $180,000 per year.

All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, disability, genetic information, marital status, veteran status, or any other status protected by applicable law. We are committed to providing an inclusive and diverse work environment.

All locations: Texas | Maine | Kansas | Maryland | Oregon | Florida | Georgia | Illinois | Virginia | Colorado | New Jersey | Tennessee | Connecticut | Massachusetts | New York | North Carolina | District Of Columbia
$130K - $180K / year
Job Closed