Compass

AWS DevOps Engineer – Senior

DevOps Engineer, Full Time, Remote, Senior, Team 10,001+, H1B Sponsor

Location

Brazil

Posted

13 days ago

Seniority

Senior

Bachelor Degree, Experience accepted, Portuguese

Ansible AWS Docker EC2 Kubernetes Linux Python Terraform

Job Description

• Create, provision, and evolve infrastructure resources on AWS, using IaC practices to ensure standardized, secure, and scalable environments;
• Develop, implement, and evolve CI/CD pipelines, ensuring automation, traceability, security, and efficiency in deliveries;
• Take part in reviewing and evolving application architecture, contributing good practices for performance, availability, continuity, security, and scalability;
• Track deliveries and keep environments supported, promoting operational stability and adherence to DevOps best practices;
• Investigate, propose, and implement improvements, automations, and standards for infrastructure and operations, aiming at operational efficiency and standardized environments;
• Work on MLOps initiatives, supporting the operationalization and evolution of Machine Learning pipelines in production;
• Take part in investigating, understanding, and planning new technical demands, promoting alignment between business needs and technology solutions;
• Collaborate with multidisciplinary teams, maintaining clear communication, a collaborative attitude, and a focus on problem solving.

Job Requirements

Solid experience with AWS, including services such as EC2, S3, IAM, CloudWatch, Lambda, and SageMaker;
Hands-on experience creating, evolving, and delivering CI/CD pipelines in AWS environments;
Experience with IaC, especially using Terraform, Ansible, and/or CloudFormation;
Knowledge of and experience with containers (Docker and Kubernetes) in production environments;
Solid knowledge of Linux, including security, troubleshooting, metrics collection, and performance analysis;
Experience with monitoring and observability, especially using AWS CloudWatch;
Hands-on experience in MLOps, including Amazon SageMaker, automation, and operating Machine Learning pipelines;
Experience with AWS services for automation and orchestration, such as Lambda, ECR, and Step Functions;
Knowledge of Python, Bash, or similar languages is desirable;
A plus: familiarity with MLflow, DVC, or Machine Learning pipelines in production;
Mid-level/Senior profile, with consolidated experience in Information Technology;
Bachelor's degree in Information Technology or a postgraduate degree in the IT field.

More DevOps Engineer Jobs

Site Reliability Engineer - Remote

Optum

Optum, part of the UnitedHealth Group family of businesses, is a global organization that delivers care, aided by technology to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health optimization on a global scale. Join us to start Caring. Connecting. Growing together. At Optum, we support your well-being with an understanding team, extensive benefits and rewarding opportunities. By joining us, you’ll have the resources to drive system transformation while we help you take care of your future. We recognize the power of connection to drive change, improve efficiency and make a difference in health care. Join a team where your skills and ideas can make an impact and where collaboration is key to creating technology that produces healthier outcomes.

DevOps Engineer, 13 days ago

Full Time, Remote, Team 160,000, Since 2011

Requisition Number: 2361712

Opportunities with Logistics Health Incorporated (LHI), part of the Optum family of businesses. We're dedicated to simplifying the logistics of complex workforce health programs with cost-effective solutions and a seamless distribution process. With offices in La Crosse, Wis., a satellite office in Chicago and remote employees throughout the country, we have a variety of rewarding career opportunities for you. Elevate your career as you help us create a healthier tomorrow for everyone and discover the meaning behind Caring. Connecting. Growing together.

We are seeking a Site Reliability Engineer to join our Optum Serve team. In this remote role, you will build, maintain, and operate our AWS-hosted platform to support critical government healthcare services. You will work closely with development teams to identify and measure SLOs, SLAs, and SLIs, ensuring high availability, performance, and scalability. This is an exciting opportunity to implement advanced automation, self-healing mechanisms, and robust monitoring to improve production systems and ensure seamless service delivery.

You'll enjoy the flexibility to work remotely* from anywhere within the U.S. as you take on some tough challenges. For all hires in the Minneapolis or Washington, D.C. area, you will be required to work in the office a minimum of four days per week.

Primary Responsibilities:
- Build, maintain, and operate the AWS-hosted platform
- Work closely with development teams to identify and measure SLOs, SLAs, and SLIs
- Contribute to the development of platform services including architecture, provisioning, configuration, deployment, and support
- Integrate applications with centralized logging, metrics dashboards, instrumentation, incident monitoring, and management tools
- Participate in an on-call rotation for incident resolution for the platform and any dependent components
- React to production deficiencies by continuously implementing automation, self-healing systems, and real-time monitoring
- Maintain and improve operational tooling and frameworks
- Perform root cause analysis and deliver swift resolution for tools and automation failures
- Build, integrate, and administer systems and tools that enable engineering teams to observe their applications in production with autonomy (Dashboards, APMs)
- Automate alerts for metrics on performance, cost, vulnerabilities, risk, and compliance violations
- Conduct comprehensive postmortems after production issues to drive continuous platform improvement

You'll be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role as well as provide development for other roles you may be interested in.

Required Qualifications:
- 2+ years of experience building, maintaining, and operating platform infrastructure within AWS (specifically with EC2, VPC, IAM, Lambda, S3, and CloudWatch)
- 2+ years of experience with Infrastructure-as-Code (IaC) using Terraform, AWS CloudFormation, or CDK
- 2+ years of experience with Linux system administration and shell scripting
- 1+ years of experience building or managing CI/CD pipelines using Git and GitLab
- 1+ years of experience monitoring and alerting with tools such as CloudWatch or Dynatrace
- 1+ year of scripting experience in Python or PowerShell

Preferred Qualifications:
- Bachelor's degree in Information Technology, Computer Science or related field
- Experience utilizing AI-driven anomaly detection in CloudWatch for proactive issue resolution
- Experience with automation of patching and scaling using predictive models
- Experience supporting infrastructure for AI-based applications
- Demonstrated understanding of federal security and compliance frameworks, such as FedRAMP Moderate or NIST 800-171
- Familiarity with containerized workloads (e.g., ECS, EKS)

*All employees working remotely will be required to adhere to UnitedHealth Group's Telecommuter Policy.

Pay is based on several factors including but not limited to local labor markets, education, work experience, certifications, etc. In addition to your salary, we offer benefits such as a comprehensive benefits package, incentive and recognition programs, equity stock purchase and 401k contribution (all benefits are subject to eligibility requirements). No matter where or when you begin a career with us, you'll find a far-reaching choice of benefits and incentives. The salary for this role will range from $72,800 to $130,000 annually based on full-time employment. We comply with all minimum wage laws as applicable.

Application Deadline: This will be posted for a minimum of 2 business days or until a sufficient candidate pool has been collected.
Job posting may come down early due to volume of applicants.

At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.

UnitedHealth Group is an Equal Employment Opportunity employer under applicable law and qualified applicants will receive consideration for employment without regard to race, national origin, religion, age, color, sex, sexual orientation, gender identity, disability, or protected veteran status, or any other characteristic protected by local, state, or federal laws, rules, or regulations.

UnitedHealth Group is a drug-free workplace. Candidates are required to pass a drug test before beginning employment.

Minnesota

$72.8K - $130K / year

Job Closed

Data Center Reliability Engineer

Phaidra

Phaidra is an industrial AI company. We create self-learning, intelligent control systems for industrial facilities. The physical world today is filled with static infrastructure. Factories, power plants, and other industrial facilities are frozen in time — they operate in the same way they've operated for decades because their control systems are hard-coded. And hard-coded systems cannot change, leading to performance degradation and a lack of resiliency. Phaidra creates AI-powered control systems that automatically learn, adapt, and get better over time. Just as autonomous vehicles get smarter over time, so too will future industrial facilities. Our team has already delivered 40% energy savings at Google's data centers, and we're rapidly bringing AI technology to other types of industrial facilities.

DevOps Engineer, 13 days ago

Full Time, Remote, Team 51-200, Since 2019, H1B No Sponsor

• Utilize existing data ingestion and delivery platforms to "teach" models to understand the physical world, filling a critical expertise gap in the data center industry.
• Use telemetry tools to analyze sensor data across mechanical (chillers, pumps) and electrical (UPS, switchgear, power feeds) systems to identify "failure signatures" for an LLM-driven monitoring tool.
• Act as a primary user of platforms, identifying gaps in current mechanisms and collaborating with Engineering to influence future features and data quality.
• Translate raw telemetry into "SME-level" logic and directions used by the LLM tool to guide data center operators in real time.
• Cultivate deep domain expertise in all facets of data center infrastructure.
• Move from shadowing peers to directly supporting customers, using the platform to provide clear, data-backed direction on complex problems.
• Oversee pilot projects to test how the AI-driven SME tool interprets real-world stressors, ensuring the output is operationally realistic, accurate, and actionable.
• Remain agile and proactive in a fast-moving team environment.

Numpy Pandas Python

Washington

$101.3K - $163.9K / year

DevOps Engineer – Secret

Xcelerate Solutions

Xcelerate Solutions is a mission-driven company specializing in security, management, and IT solutions to strengthen America’s national security, safeguard cr

DevOps Engineer, 13 days ago

Full Time Remote

• Automating, optimizing, and securing the delivery pipeline for enterprise-grade, mission-critical systems.
• Integrating DevSecOps best practices and innovative toolsets.
• Working with application development teams to refactor or create solutions that leverage the DevSecOps CI/CD pipeline and tools.
• Facilitating the development of new software solutions and transition of existing solutions from monolithic structures to micro-service structure operating within hardened containers.
• Deploying and sustaining microservices factory utilizing COTS and open-source solutions.

Ansible AWS Azure Chef Cloud Docker Microservices OpenShift OpenStack Puppet SaltStack TFS

Virginia

Senior Site Reliability Engineer

DraftKings Inc.

Defining what it means to build and deliver the most extraordinary sports & entertainment experiences. The Crown is Yours

DevOps Engineer, 13 days ago

Full Time, Remote, Team 1,001-5,000, Since 2012

At DraftKings, AI is becoming an integral part of both our present and future, powering how work gets done today, guiding smarter decisions, and sparking bold ideas. It's transforming how we enhance customer experiences, streamline operations, and unlock new possibilities. Our teams are energized by innovation and readily embrace emerging technology. We're not waiting for the future to arrive. We're shaping it, one bold step at a time. To those who see AI as a driver of progress, come build the future together.

The Crown Is Yours

As a Senior Site Reliability Engineer, you'll build and scale the critical infrastructure behind every product. In this role, you'll take on complex challenges across global data centers, multiple cloud platforms, and on-premise systems, designing automation-first solutions that elevate performance and eliminate operational friction. You'll be trusted to drive stability at scale, influence architectural decisions, and build tools that empower our teams to move fast and deliver reliably. This is where your impact won't just be felt, it'll be foundational.

What You'll Do
- Drive stability and scalability across our global compute platform spanning numerous data centers, multiple public clouds, and on-premise environments, serving as the foundation for every product.
- Operate and evolve our GitOps delivery model, using Rancher Fleet and Flux with Helm to deploy core cluster services and application workloads declaratively and repeatably.
- Build self-healing, fault-tolerant infrastructure and internal tooling that eliminates repetitive operational work and reduces toil for both platform and application teams.
- Own cluster autoscaling and capacity strategy, including Karpenter, HPA and KEDA, and predictive scaling driven by event and calendar data.
- Define SLOs and reliability metrics for platform components, using Datadog and our logging pipeline to surface cluster and workload health.
- Support technical growth by sharing knowledge, participating in design discussions, and contributing to a collaborative team culture, including on-call rotation.

What You'll Bring
- Bachelor's degree in Computer Science or relevant education, experience, and training.
- At least 4 years managing distributed cloud and on-premise environments at scale, with strong hands-on AWS experience. Exposure to GCP, vSphere, or Nutanix is a plus.
- Deep expertise in container orchestration with Kubernetes, including the ability to design, scale, and troubleshoot complex workloads.
- Strong experience developing software for automation and infrastructure tooling such as Go and Python.
- Working knowledge of networking and Linux-based systems, including container runtimes such as Docker and containerd, packet-level debugging, and kernel troubleshooting.
- Experience with Infrastructure as Code (IaC) and configuration management tools to ensure scalable and repeatable infrastructure provisioning.

Join Our Team

We're a publicly traded (NASDAQ: DKNG) technology company headquartered in Boston. As a regulated gaming company, you may be required to obtain a gaming license issued by the appropriate state agency as a condition of employment. Don't worry, we'll guide you through the process if this is relevant to your role.

The US base salary range for this full-time position is 128,000.00 USD - 160,000.00 USD, plus bonus, equity, and benefits as applicable. Our ranges are determined by role, level, and location. The compensation information displayed on each job posting reflects the range for new hire pay rates for the position across all US locations. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training. Your recruiter can share more about the specific pay range and how that was determined during the hiring process.

It is unlawful in Massachusetts to require or administer a lie detector test as a condition of employment or continued employment. An employer who violates this law shall be subject to criminal penalties and civil liability.

United States

$128K - $160K / year

Job Closed
