Join PROS, a dedicated travel technology company with nearly 40 years of proven airline expertise and a long runway for future growth, now powering the future of AI-driven airline retailing. If you want to be part of something exceptional, help us shape how airlines compete, innovate, and win. We are Owners: We look for every opportunity to create a better PROS and a better experience for our customers – and we hold ourselves accountable. We are Innovators: We think creatively to find new paths to success – for our people, our customers, and our business. We Care: We are centered on caring for the people, businesses, and communities we serve.

Site Reliability Engineer II

DevOps Engineer | Full Time | Remote | Mid Level | Team 1,001-5,000

Location

United States

Posted

3 days ago

Seniority

Mid Level

Job Description

Role Description

The Site Reliability Engineer II optimizes service performance, actively participates in reliability improvements, and conducts in-depth SLO and capacity analysis. This position exists to enhance system reliability and scalability while contributing to automation and self-service tool development.

- Performance Monitoring: Monitor service performance, assist in troubleshooting production issues, and learn system architecture.
- Reliability Participation: Monitor service reliability, participate in resolving basic issues, and learn disaster recovery testing procedures.
- SLO Implementation: Understand SLO concepts, monitor and analyze SLO patterns, and assist in implementing SLO visualization and alerting.
- Capacity Analysis: Perform basic capacity analysis, identify trends in system capacity, and participate in capacity planning.
- Automation Deployment: Deploy and maintain existing automation tools, create simple scripts, and troubleshoot automation scripts.

Qualifications

- 5+ years of experience in enterprise networking, including hands-on work with routing, switching, firewalls, load balancers, and VPN technologies.
- Strong understanding of cloud networking architectures, including VPC/VNet design, peering, private link, and hybrid connectivity models.
- Experience with network security technologies such as security groups, NACLs, firewall policies, WAF, IDS/IPS, and micro-segmentation.
- Proficiency in Layer 2 and Layer 3 network protocols, including BGP, OSPF, EIGRP, DNS, DHCP, NAT, and IP addressing/subnetting.
- Hands-on experience with load balancers and ingress technologies, including F5, NGINX, Azure Application Gateway, ALB/NLB, or equivalent.
- Strong troubleshooting skills using packet analyzer tools, flow logs, and network monitoring platforms.
- Skilled in analyzing performance trends and identifying optimization opportunities.
- Collaborates with teams to improve monitoring coverage.
- Ability to participate in structured reliability testing and analysis.
- Able to evaluate system components for resilience.
- Contributes to reliability-focused design discussions.
- Skilled in analyzing trends to inform service improvements.
- Collaborates with teams to align SLOs with user expectations.
- Develops moderately complex automation tools.
- Skilled in building internal self-service capabilities.
- Evaluates automation opportunities for operational efficiency.
- Skilled in analyzing capacity data to inform scaling decisions.
- Able to recommend improvements for resource utilization.
- Ensures scalability is considered in feature development.
- Follows predefined procedures to deploy PROS products and third-party applications to cloud environments.
- Contributes to the release management documentation.
- Gains an understanding of application architecture and the interaction between system components.

Requirements

- Highly Preferred: Bachelor’s Degree in Computer Science, Information Technology, or a related field.
- Practical experience with FortiGate firewalls and F5 appliances is highly desirable.

Benefits

- PROS culture and its extraordinary people are at the core of our success.
- We are passionate about what we do and relentless in delivering on our promises.
- Our commitment to customer success inspires us to think smarter and dream bigger.
- We foster a culture of care, where people feel supported to grow, innovate, and bring their best selves to work.
- From flexible ways of working to continuous learning, we empower our teams to thrive both personally and professionally.
- Join PROS, a dedicated travel technology company with nearly 40 years of proven airline expertise.

Company Description

At PROS, we help airlines deliver seamless retail experiences designed to maximize revenue and margin growth. Powered by AI, the PROS Platform enables commercial teams to align capacity with demand and coordinate pricing, merchandising, and offer strategies to construct and market optimal offers in real time.

- We are Owners: We look for every opportunity to create a better PROS and a better experience for our customers.
- We are Innovators: We think creatively to find new paths to success.
- We Care: We are centered on caring for the people, businesses, and communities we serve.

More DevOps Engineer Jobs

Cloud Operations Engineer

Made4net

Made4net is a global leader in supply chain execution and warehouse management systems (WMS), delivering agile, scalable, and unified solutions that help companies streamline opera

DevOps Engineer | 3 days ago

Full Time | Remote

Role Description

We’re looking for a hands-on Cloud Operations Engineer to join our global operations team. This is a technical role at the intersection of cloud infrastructure, production support, and security/ops monitoring. We operate a follow-the-sun support model across global regions, with shift coverage expected on weekends and holidays. You’ll support mission-critical systems across AWS, Windows, and Linux, ensuring reliability, performance, and rapid incident response.

Key Responsibilities

- Provide production support as part of a global follow-the-sun operations team: monitor systems, triage alerts, respond to incidents, and ensure service continuity across all environments.
- Monitor infrastructure and application health across security and operations dashboards; detect, triage, and respond to performance, availability, and security alerts; produce and maintain SLA reporting.
- Deploy, manage, and maintain AWS infrastructure with a primary focus on EC2-based workloads across Windows and Linux environments.
- Configure and manage load balancers (ALB/NLB), including URL rewrite rules, routing policies, and SSL termination for web applications.
- Design and maintain high-availability architectures on AWS, ensuring redundancy, multi-AZ deployments, and tested failover procedures.
- Own backup and disaster recovery operations: scheduling, retention policies, monitoring, and regular restoration testing.
- Automate configuration management and application update deployments using Ansible, ensuring consistency and minimal downtime across all environments.
- Manage database infrastructure across environments: provisioning, patching, performance tuning, and backup/recovery operations.
- Troubleshoot and configure networking components, including VPC, subnets, routing tables, DNS, security groups, and firewall rules.
- Handle incident escalation, maintain shift handover documentation, and contribute to post-mortems and continuous improvement of support processes.
- Maintain infrastructure runbooks and operational documentation, and contribute to automation and tooling improvements.
- Support the company’s ongoing transition toward microservices and containerized architectures, contributing operational knowledge and helping ensure smooth adoption in production environments.

Qualifications

- Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent experience through certifications, vocational training, or hands-on work.
- 3-5 years of hands-on experience in a cloud infrastructure, systems engineering, or production support role.
- Comfortable with a follow-the-sun support model; availability for occasional weekend coverage is expected as part of the global team rotation.
- Strong hands-on experience with AWS, specifically EC2, ALB/NLB, VPC, S3, and AWS Backup services.
- Solid experience administering Windows and Linux servers in an enterprise IaaS environment.
- Hands-on experience with Ansible for configuration management and automated application deployments, including managing update pipelines.
- Proven experience designing and maintaining high-availability architectures on AWS (multi-AZ, auto-scaling, failover).
- Proficiency in configuring ALB/NLB, including URL rewrite rules, path-based and host-based routing, and listener rules for web applications.
- Solid understanding of AWS networking: VPC design, subnets, routing tables, security groups, DNS, and VPN connectivity.
- Hands-on experience managing relational database infrastructure (MSSQL Server, Postgres, Oracle): provisioning, patching, monitoring, and performance tuning.
- Experience with security and ops monitoring: interpreting alerts, identifying anomalies, triaging incidents, and escalating appropriately.
- Experience with infrastructure and application performance tracking; able to interpret metrics and respond to anomalies.
- Scripting proficiency in Bash, Python, or PowerShell for automation and operational tasks.

Preferred Qualifications

- AWS certifications (Solutions Architect, SysOps Administrator, or equivalent).
- Experience with infrastructure-as-code tools such as Terraform or CloudFormation.
- Experience with ITSM / ticketing systems in a production support context (Zendesk preferred).
- Experience with monitoring and observability platforms (Grafana preferred).
- Familiarity with CI/CD pipelines and DevOps practices.

Benefits

- Health insurance (medical, dental, vision) with a robust wellness program to support your physical and mental well-being.
- Generous paid time off policy.
- Company-matched 401(k) retirement plan to help you secure your future.
- Tuition reimbursement program to support your continued education and career advancement.
- Employee assistance program providing confidential counseling and support services for personal challenges.
- Discretionary employee bonus program.
- Employee discounts and perks through our PEO.
- Pay range: starting from $120,000 per year.

United States

$120K / year

Principal Site Reliability Engineer, SRE

SoluStaff

People Powering Technology

DevOps Engineer | 3 days ago

Full Time | Remote | Team 51-200 | H1B No Sponsor

• Serve as the primary technical owner for production reliability across U.S. customer environments.
• Investigate and resolve complex issues spanning web applications, APIs, backend services, data pipelines, cloud infrastructure, and customer integrations.
• Lead production incident response efforts, coordinating cross-functional teams to restore service and minimize customer impact.
• Perform root cause analysis and drive corrective actions that improve long-term system stability and resilience.
• Partner with software engineering and platform teams to identify recurring reliability risks and implement sustainable solutions.
• Design, configure, and validate secure customer connectivity solutions including Site-to-Site VPNs, Transit Gateway integrations, routing configurations, and secure network paths.
• Support customer onboarding initiatives by troubleshooting connectivity challenges and ensuring consistent implementation processes.
• Enhance platform observability through improvements in monitoring, logging, alerting, tracing, and operational dashboards.
• Contribute to CI/CD, infrastructure automation, and deployment processes that improve release safety and operational consistency.
• Develop operational tooling that supports incident response, troubleshooting, onboarding, and system monitoring activities.
• Collaborate with engineering leadership to improve cloud architecture, scalability, security, and operational readiness.
• Partner with customer-facing teams to communicate technical issues, remediation plans, and reliability improvements in a clear and effective manner.
• Support compliance, security, and risk management initiatives within highly regulated healthcare environments.

AWS Cloud Django Grafana Kubernetes Python Terraform

United States

DevOps Internship

Viasoft Korp | Industry ERP

The management system born in industry that lives and breathes industrial processes and distribution 💙

DevOps Engineer | 3 days ago

Internship | Remote | Team 51-200 | Since 1999 | H1B No Sponsor

• Assist with routines involving:
  - microservices;
  - containers;
  - observability;
  - high availability;
  - continuous integration;
  - continuous delivery;
  - monitoring;
  - continuous deployment.
• Assist with infrastructure automation routines in cloud and on-premise environments.
• Assist with deploying and evolving Kubernetes environments.
• Assist with automating processes using Ansible.
• Assist with implementing and evolving monitoring and observability with Grafana.
• Help resolve incidents and troubleshoot across services and environments.
• Help ensure the stability, availability, and performance of the environments.
• Assist with evolving CI/CD pipelines and tools.
• Document procedures, workflows, and configurations.
• Actively participate in the technological evolution of the Korp platform.

Ansible Cloud Grafana Jenkins Kubernetes Linux

Brazil

R$1.3K / month

Senior Site Reliability Engineer

Akamai Technologies

DevOps Engineer | 4 days ago

Full Time | Remote | Team 5,001-10,000 | H1B Sponsor

• Owning the SRE infrastructure lifecycle from design reviews and pre-rollout readiness assessments through production sign-off and ongoing reliability management
• Designing and implementing frameworks that reflect customer experience for load balancing services and driving action when error budgets are at risk
• Building and maintaining observability pipelines from load-balancing components and system-level sources to dashboards that enable rapid incident triage
• Leading technical incident response for complex NB/NLB failures, acting as the technical commander and driving root cause analysis and preventive follow-through
• Developing and automating safe deployment workflows for phased releases, including bake-period monitoring, feature flag management, and validation across global datacenter rollouts
• Reviewing design documents, product-requirement documents and producing actionable SRE input on operational risks, capacity implications, Day-2 concerns, and product strategy gaps
• Building automation and tooling using Python or Go that reduces operational toil and improves team-wide operational capability

Ansible Distributed Systems Kubernetes Linux Python SaltStack Terraform Go

Canada

$120.4K - $216.6K / year