Site Reliability Engineer II

DevOps Engineer | Full Time | Remote | Senior | Team 5,001-10,000 | H1B Sponsor

Location

Poland

Posted

65 days ago

Salary

0

Seniority

Senior

Job Description

Site Reliability Engineer II

Akamai Technologies

  • Building and maintaining dashboards, alerts, and monitoring for inference workloads using Akamai's existing observability platform
  • Writing automation and tooling in Python or Go to reduce operational toil and improve system reliability
  • Participating in on-call rotations, responding to production incidents, and contributing to post-mortem analysis
  • Building and improving runbooks for inference-specific operational procedures, integrating into Akamai's existing incident management processes
  • Contributing to SLO tracking and reporting, identifying trends and areas for improvement
  • Supporting CI/CD pipeline maintenance, deployment safety checks, and rollback procedures
  • Collaborating with product engineering teams to troubleshoot complex problems across the stack

Job Requirements

  • Have commercial experience in Site Reliability Engineering
  • Show proficiency in a programming language such as Python or Go, with experience creating automation solutions
  • Have experience with Linux systems administration and the ability to troubleshoot complex infrastructure issues
  • Show familiarity with Kubernetes and containerization concepts
  • Have experience with monitoring and observability tools such as Prometheus, Grafana, or similar
  • Have exposure to CI/CD pipelines and infrastructure-as-code tools (Terraform, SaltStack, or equivalent)
  • Show a willingness to learn and grow, with genuine curiosity about AI infrastructure and distributed systems

Benefits

  • Your health
  • Your finances
  • Your family
  • Your time at work
  • Your time pursuing other endeavors

More DevOps Engineer Jobs

Sr. Site Reliability Engineer I

Axon

Formerly known as TASER International, Axon is a leading safety technology company offering smart weapons, cameras, evidence management, and automated reporting solutions for law e

DevOps Engineer | 65 days ago

Join Axon and be a Force for Good.

At Axon, we’re on a mission to Protect Life. We’re explorers, pursuing society’s most critical safety and justice issues with our ecosystem of devices and cloud software. Like our products, we work better together. We connect with candor and care, seeking out diverse perspectives from our customers, communities and each other. Life at Axon is fast-paced, challenging and meaningful. Here, you’ll take ownership and drive real change. Constantly grow as you work hard for a mission that matters at a company where you matter.

Your Impact

As a senior contributor in the APX SRE organization, you are passionate about delivering solutions to the real-time problems our mission-critical cloud native services encounter, and you’re relentless about the high quality, reliability, and security our customers demand. You will work closely not only with the APX SRE organization, but your technical deliverables will reach the entire engineering organization to enable product teams to continuously deliver features on the vanguard of innovation.

In this role, you’ll have a meaningful impact in Identity and Security, helping teams build and operate systems that protect user identity, strengthen authentication/authorization flows, and meet compliance expectations. You’ll partner closely with engineering, security, and identity stakeholders to raise the bar on secure-by-default reliability practices across the organization.

Location: Remote, anywhere in Canada

What You’ll Do

  • Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision services rapidly, consistently, securely, and with high test confidence.
  • Exemplify cloud-native site reliability best practices with a strong emphasis on testability, automation, and resilience in distributed systems.
  • Design and implement test strategies and frameworks that validate reliability, performance, and security requirements (e.g., integration, end-to-end, chaos/resiliency, and regression suites).
  • Write code that is performant, maintainable, clear, and concise.
  • Employ strong problem-solving skills, with the ability to debug problems in cloud-native distributed systems using logs, metrics, traces, and automated diagnostics.
  • Contribute to platform features and tooling by designing clear, well-tested APIs in Go or Python.
  • Partner with Identity and Security teams to support and strengthen user identity lifecycle and access controls, authentication and authorization patterns, Okta integrations (e.g., SSO, OIDC/SAML flows, group/role mapping, policy enforcement), and compliance-aligned controls and security best practices.
  • Help ensure systems meet operational and regulatory expectations through secure engineering practices (least privilege, secrets management, auditability, change control, and secure SDLC).
  • Author design docs, test plans, threat-aware test cases, and usage guides to promote self-service.
  • Take calculated risks, champion new ideas, and cultivate your craft.

What You Bring

  • 6+ years of applicable software engineering or SRE experience
  • 3+ years experience managing cloud platforms such as Azure, AWS, or similar.
  • Experience operating in Kubernetes platforms like AKS, EKS, or similar.
  • Experience using managed languages such as Python, Go, C#, Java, or similar with demonstrable API and unit/integration testing experience.
  • Experience utilizing CI/CD platforms to automate provisioning infrastructure, software builds, releases, and automated test pipelines.
  • Experience using observability tools such as APM, logging, and metrics to assist with debugging issues and reliability improvements.
  • Experience designing tooling to simplify the operational management of SaaS/PaaS systems, including test automation and validation tooling.
  • Familiarity with building flexible and testable Infrastructure as Code modules.
  • Experience or strong working knowledge of identity and access management (IAM) concepts and secure authentication/authorization patterns.
  • Experience integrating with or supporting Okta (or similar IdP) and identity standards like OIDC/SAML is a strong asset.
  • Empathy to support the needs of software engineers.
  • Ability to obtain RCMP Enhanced Reliability Status and Secret clearance.

Axon is a total compensation company, meaning compensation is made up of base pay, bonus, and stock awards. The actual base pay is dependent upon many factors, such as: level, function, training, transferable skills, work experience, business needs, geographic market, and often a combination of all these factors. Our benefits offer an array of options to help support you physically, financially and emotionally through the big milestones and in your everyday life. To see more details on our benefits offerings please visit https://www.axon.com/careers.

Base Pay Range: $184,088 to $294,540 CAD

Don’t meet every single requirement? That's ok. At Axon, we Aim Far. We think big with a long-term view because we want to reinvent the world to be a safer, better place. We are also committed to building diverse teams that reflect the communities we serve. Studies have shown that women and people of color are less likely to apply to jobs unless they check every box in the job description. If you’re excited about this role and our mission to Protect Life but your experience doesn’t align perfectly with every qualification listed here, we encourage you to apply anyways. You may be just the right candidate for this or other roles.

Important Notes

The above job description is not intended as, nor should it be construed as, exhaustive of all duties, responsibilities, skills, efforts, or working conditions associated with this job. The job description may change or be supplemented at any time in accordance with business needs and conditions. Some roles may also require legal eligibility to work in a firearms environment.

We collect personal information from applicants to evaluate candidates for employment. You may request access, deletion, or exercise other CCPA rights at axongreenhousesupport@axon.com or via our Axon Privacy Web Form. For more information, please see the Your California Privacy Rights section of our Applicant and Candidate Privacy Notice.

Axon’s mission is to Protect Life and is committed to the well-being and safety of its employees as well as Axon’s impact on the environment. All Axon employees must be aware of and committed to the appropriate environmental, health, and safety regulations, policies, and procedures. Axon employees are empowered to report safety concerns as they arise and activities potentially impacting the environment.

We are an equal opportunity employer that promotes justice, advances equity, values diversity and fosters inclusion. We’re committed to hiring the best talent, regardless of race, creed, color, ancestry, religion, sex (including pregnancy), national origin, sexual orientation, age, citizenship status, marital status, disability, gender identity, genetic information, veteran status, or any other characteristic protected by applicable laws, regulations and ordinances, and empowering all of our employees so they can do their best work. If you have a disability or special need that requires assistance or accommodation during the application or the recruiting process, please email recruitingops@axon.com. Please note that this email address is for accommodation purposes only. Axon will not respond to inquiries for other purposes.

Phishing alert: Axon will never ask you to pay for any part of the hiring process, including training, equipment, or background checks. We do not make job offers via text message, WhatsApp, or instant messaging platforms without a formal interview process. All legitimate job openings are listed on our official careers page at https://www.axon.com/careers. If you receive a suspicious offer or outreach from an email address that is not @axon.com, or if you are asked for sensitive personal information (bank details, Social Security Number) prematurely, please ignore the message and report it to recruitingops@axon.com.

Canada
184K - 294K / year
Full Time | Remote | Team 201-500

About This Role: We are hiring a hands-on DevOps Engineer to manage and support production-grade cloud infrastructure for Kibo’s commerce platform. This role focuses on Kubernetes (EKS), Terraform, and real-time production troubleshooting in a 24/7 on-call environment.

ABOUT KIBO

KIBO is a composable digital commerce platform for B2C, D2C, and B2B organizations who want to simplify the complexity in their businesses and deliver modern customer experiences. KIBO is the only modular, modern commerce platform that supports experiences spanning B2B and B2C Commerce, Order Management, and Subscriptions. Companies like Ace Hardware, Zwilling, Jelly Belly, Nivel, and Honey Birdette trust Kibo to bring simplicity and sophistication to commerce operations and deliver experiences that drive value. KIBO's cutting-edge solution is MACH Alliance Certified and has been recognized by Forrester, Gartner, IDC, Internet Retailer, and TrustRadius. KIBO has been named a leader in The Forrester Wave™: Order Management Systems, Q1 2025 and in the IDC MarketScape report “Worldwide Enterprise Headless Digital Commerce Applications 2024 Vendor Assessment”.

By joining KIBO, you will be part of a team of Kibonauts all over the world in a remote-friendly environment. Whether your job is to build, sell, or support KIBO’s commerce solutions, we tackle challenges together with the approach of trust, growth mindset, and customer obsession. If you’re seeking a unique challenge with amazing growth potential, then come work with us!

WHAT YOU’LL DO

  • Manage and operate production-grade Kubernetes clusters (EKS preferred), ensuring high availability and scalability
  • Troubleshoot real-time production issues across distributed systems and microservices
  • Diagnose and resolve issues such as:
      • Pod failures (CrashLoopBackOff, Pending, OOMKilled)
      • Node failures, autoscaling, and resource constraints
      • Networking, ingress, and service connectivity issues
  • Build, maintain, and debug infrastructure using Terraform (modules, remote state, locking, drift handling)
  • Implement and enhance monitoring & alerting systems using Prometheus, Grafana, and related tools
  • Perform root cause analysis (RCA) for incidents and drive permanent fixes to improve system reliability
  • Participate in a 24/7 on-call rotation, owning incidents and resolving them independently
  • Collaborate with engineering teams to improve system performance, resilience, and deployment processes
  • Automate deployments, infrastructure provisioning, and operational workflows to reduce manual effort
  • Ensure adherence to security best practices across infrastructure and deployments

India
Full Time | Remote | Team 5,001-10,000 | H1B Sponsor

  • Taking responsibility for observability strategy, designing telemetry, dashboards, alerts, defining SLO/SLI frameworks, and implementing improvements when targets are missed
  • Building production-grade automation and tooling that reduces operational toil, improves incident response, and sets patterns that other SREs adopt
  • Owning incident management integration for inference workloads, designing frameworks, leading incident response during on-call rotations, and driving systemic improvements from post-mortems
  • Defining and implementing deployment safety practices including progressive rollouts, canary analysis, and rollback automation, establishing standards for the team
  • Partnering with product engineering teams to influence architecture decisions, ensure operational readiness, and represent the SRE perspective in design reviews
  • Mentoring Senior and mid-level SREs through code reviews, design discussions, and hands-on problem-solving

Poland

DevOps Engineer

Virtuos

Founded in 2004, Virtuos is one of the largest independent video game development companies. We are headquartered in Singapore with offices in Asia, Europe, and North America. Specializing in full-cycle game development and art production, we have delivered high-quality content for more than 1,500 console, PC, and mobile games. Our clients include 23 of the top 25 gaming companies worldwide. Volmi - A Virtuos Studio specializes in game development and game content creation. Over the past 9 years, we have successfully completed over a thousand different projects. We have gained considerable experience in creating 2D and 3D art, as well as developing turnkey games. Since 2022, we have been a part of Virtuos, the world's leading game developer. We have contributed to the development of popular games such as Diablo 2: Resurrected, Metro Exodus, Sniper: Ghost Warrior Contracts 2, Smite, Paladins, Gwent: The Witcher Card Game, and Marvel Snap.

DevOps Engineer | 65 days ago
Full Time | Remote | Team 1,001-5,000 | Since 2004 | H1B No Sponsor

  • Design, implement, and maintain CI/CD pipelines supporting cross-platform builds (Android, iOS, and PC) using Jenkins, GitHub Actions, and Unreal Automation Tool;
  • Automate build, packaging, and artifact delivery for mobile clients, PC builds, and headless multiplayer servers;
  • Develop and maintain deployment pipelines for multiplayer infrastructure using Docker and Kubernetes;
  • Support orchestration of dedicated game servers and matchmaking services via the Agones game server platform;
  • Manage multiple runtime environments for development, testing, demos, and release candidates;
  • Maintain and improve infrastructure reliability and deployment reproducibility across environments;
  • Implement monitoring and observability systems using Prometheus and Grafana to track infrastructure health, build reliability, and server performance;
  • Collaborate with backend engineers and game teams to ensure smooth integration of multiplayer services, telemetry pipelines, and backend systems;
  • Support infrastructure running in cloud environments (AWS, GCP, or Azure);
  • Participate in infrastructure security practices, including management of secrets, credentials, and access permissions in collaboration with the client infrastructure team;
  • Optimize build pipelines and infrastructure to support large-scale multiplayer environments (100k+ CCU);
  • Document DevOps processes and infrastructure configurations to support long-term maintainability.

Ukraine
Job Closed