We put the power in your hands to buy, sell, and trade digital currency 🌏
Site Reliability Engineer – AI Agents
Location
United Kingdom
Posted
4 days ago
Salary
0
Seniority
Senior
Job Description
Kraken Digital Asset Exchange
- Design, build, and operate the infrastructure layer supporting AI agent workflows in production
- Ensure reliability, scalability, and observability of agentic systems across internal and external products
- Design and develop platform services, APIs, SDKs, and self-service capabilities that allow engineering teams to easily consume AI infrastructure and agent platform services
- Manage and maintain the compute, orchestration, and serving infrastructure powering model inference and agent execution
- Implement robust monitoring, alerting, and incident response procedures tailored to AI/ML workloads
- Utilize Infrastructure as Code (IaC) tools such as Terraform to provision and manage cloud (AWS) infrastructure components
- Build and maintain CI/CD pipelines that support rapid, reliable deployment of AI services and agent workflows
- Define and implement guardrails, failure handling, and recovery patterns specific to agentic and LLM-powered systems
- Collaborate with AI and Data Engineering teams to translate experimental agent prototypes into hardened production systems
- Manage containerized workloads using Kubernetes, ensuring efficient deployment, scaling, and orchestration of AI services
- Implement access controls and security best practices across AI infrastructure environments
- Document architecture, runbooks, and best practices to support knowledge sharing across the team
Job Requirements
- 5+ years of experience as a Site Reliability Engineer, Infrastructure Engineer, Platform Engineer, or similar role in a production environment
- Hands-on experience supporting ML infrastructure, model serving, or MLOps workflows in production
- Experience building developer platforms, internal tooling, APIs, or SDKs consumed by engineering teams at scale
- Strong understanding of platform engineering principles, including developer experience, self-service infrastructure, and API-driven platform design
- Proficiency with Infrastructure as Code tools, particularly Terraform
- Experience with containerization and orchestration, particularly Kubernetes and Docker
- Solid understanding of cloud infrastructure, preferably AWS
- Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
- Experience designing and operating observability, monitoring, and alerting systems
- Experience implementing incident response procedures and participating in on-call rotations
- Strong collaboration skills working across data, AI, and engineering teams
- High ownership mindset in a fast-moving, high-stakes production environment
Benefits
- Please note, applicants are permitted to redact or remove information on their resume that identifies age, date of birth, or dates of attendance at or graduation from an educational institution.
- We consider qualified applicants with criminal histories for employment on our team, assessing candidates in a manner consistent with the requirements of the San Francisco Fair Chance Ordinance.
- Payward is powered by people from around the world and we celebrate the diverse talents, backgrounds, contributions, and unique perspectives that everyone brings to the table. We hire based on merit, seeking out people with the right abilities, knowledge, and skills for the job. We encourage you to apply for roles where you don't fully meet the listed requirements, especially if you're passionate or knowledgeable about crypto.
- Unless a specific application deadline is stated in the job posting, applications are accepted on an ongoing basis.
More DevOps Engineer Jobs
Role Description
The DevOps Engineer will support the design, implementation, and maintenance of scalable and secure infrastructure and DevOps processes at Akkadian Labs. You will work with development, QA, and product teams to enable reliable deployments, automate workflows, and improve system observability across Rocky OS-based, AWS-hosted, and on-premises solutions. This is a hands-on technical role focused on execution, continuous improvement, and operational excellence within the DevOps function led by the DevOps Manager.
Key Responsibilities
- Infrastructure and Environment Management
  - Support deployment and maintenance of scalable infrastructure in AWS and hybrid cloud environments.
  - Assist in managing infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar tools.
  - Help maintain Linux-based environments.
  - Contribute to containerization efforts using Docker and orchestration via Kubernetes.
- AI and Agent Infrastructure Implementation & Support
  - Work on the design, deployment, and management of AI agent workloads, including provisioning compute instances and managing resource scaling for inference-heavy tasks.
  - Play a key role in building and maintaining model deployment pipelines, including versioning, testing, and rollback of AI models in production environments.
  - Monitor AI API consumption and infrastructure costs, implementing alerting and controls to prevent runaway usage and support budget visibility.
  - Coordinate the implementation of infrastructure-level security guardrails for AI systems, including access controls and data isolation for model inputs and outputs.
- Observability and Reliability
  - Manage monitoring and observability efforts using tools such as Prometheus, Grafana, and the ELK stack.
  - Troubleshoot system issues and contribute to incident response and root cause analysis.
  - Develop and execute strategies for improving system reliability, performance, and uptime.
- CI/CD and Automation
  - Build, maintain, and optimize CI/CD pipelines using tools such as Jenkins, Bitbucket CI/CD, or similar.
  - Automate routine operational tasks including builds, testing, deployments, and system updates.
  - Work with engineering teams to integrate pipelines with Akkadian tools.
- Security and Compliance
  - Follow secure DevOps practices and assist in implementing security controls.
  - Support compliance initiatives and vulnerability remediation efforts.
- Collaboration and Documentation
  - Work closely with DevOps, engineering, QA, and product teams to support deployments and releases.
  - Maintain documentation for infrastructure, processes, and operational procedures.
  - Participate in team ceremonies and continuous improvement initiatives.
Qualifications
- Experience: 5+ years of experience in DevOps, Site Reliability Engineering (SRE), or a related role.
- Cloud Expertise: Hands-on experience with AWS (e.g., EC2, ECS, S3, IAM, Lambda, CloudWatch).
- Linux Knowledge: Working knowledge of Linux environments.
- Containerization: Familiarity with Docker and Kubernetes.
- Scripting: Basic to intermediate scripting ability in Python, Bash, or similar languages.
- CI/CD: Experience building or maintaining CI/CD pipelines and related tools.
- Observability: Exposure to monitoring and observability tools such as Prometheus, Grafana, and ELK.
- Security: Understanding of secure DevOps practices and basic compliance concepts.
Preferred Qualifications
- Experience supporting AI or machine learning workloads and compute environments.
- Exposure to AI model deployment pipelines and model versioning practices.
- Experience with infrastructure-as-code tools such as Terraform or CloudFormation.
- Familiarity with hybrid cloud or on-premises environments.
- Exposure to security best practices in DevOps contexts, including AI-specific concerns such as data isolation and access controls.
- Experience supporting production systems and participating in on-call rotations.
Benefits
- Fully remote environment.
- Competitive benefits package including medical, dental, vision.
- Company-paid life insurance and disability policies.
- 401(k) with a generous matching program.
- Paid time off.
Role Description
Do you thrive on solving tough problems, even under pressure? Are you motivated by fast-paced environments with continuous learning opportunities? Do you enjoy collaborating with a team of peers who push you to constantly up your game? At Pythian, we are building a next-generation Site Reliability Engineering team. We need motivated and talented individuals on our teams, and we want you! You’ll act as a technology leader, advisor for our clients, and mentor for other team members. Projects would include infrastructure architecture, automation, and intelligent monitoring systems from design through implementation. If you Love Your Data and want to Love Your Career, this could be the job for you! If this is you, and you wonder what it would be like to work at Pythian, reach out to us and find out!
What you will be doing:
- Operate, maintain, and administer solutions contributing to customer infrastructure's operational efficiency, availability, and visibility.
- Plan maintenance activity, design documentation, and standard procedures.
- Provide Root Cause Analysis reports for outages/incidents (ITIL - Problem Management).
- Observe and provide feedback on the current state of the client’s infrastructure, and identify opportunities to improve resiliency, reduce incident occurrence, and automate repetitive administrative and operational tasks.
- Contribute to, improve, and maintain team documentation about client systems and infrastructure, procedures, policies, and schedules.
- Gather and document information about client environments through audit activities, and analyze the information to identify opportunities for improvement and application of best practices.
- Work collaboratively with teammates to contribute to the continuous improvement of our working culture.
- Act as a technology leader for clients, as well as drive client discussions on technology road maps.
- Participate in an on-call rotation in an escalation capacity.
Qualifications
- Experience working with Google and AWS Clouds (including infrastructure as code deployment with CloudFormation, Terraform, OpsWorks, etc).
- Scripting and automation of administrative tasks using Python and Scala is a must.
- Solid understanding of microservices architecture and container technologies (Kubernetes is a must, Docker, LXC, etc).
- Clear understanding of software development lifecycles and best practices from an infrastructure point of view (PRs, merge, rebase, etc).
- Understanding the end-to-end operations of a ‘Business System’ vs components.
- Comprehensive systems hardware and network troubleshooting experience.
- Common Linux distribution platform installation, configuration, performance tuning, and cloud migration.
- TCP/IP networking, NIC bonding, and network services configuration (DNS, NTP, DHCP, SMTP, etc).
- Operation and administration of virtual infrastructure, including experience with at least one hypervisor (VMware, Hyper-V, KVM, etc).
- Ability to describe IaaS, PaaS, SaaS, pros and cons of each, use cases for virtualization and cloud.
- Administration of web servers and supporting technologies, including network load balancers.
- Experience with the design, development, and deployment of Puppet.
- System and application error investigation, troubleshooting of access/availability issues including deep multi-system root cause analysis.
- Experience managing networking devices, such as switches and firewalls from a variety of vendors.
- Solid understanding of DevOps tools, processes, and culture.
- Ability to pick up new technologies quickly.
- Ability to provide accurate work scheduling and task estimations for work delivery.
Benefits
- Competitive total rewards package.
- Work remotely from your home; there’s no daily travel requirement to an office! All you need is a stable internet connection.
- Collaborate with some of the best and brightest in the industry!
- Hone your skills or learn new ones with our substantial training allowance; participate in professional development days, attend training, become certified, whatever you like!
- We give you all the equipment you need to work from home including a laptop with your choice of OS, and an annual budget to personalize your work environment!
- You will have an annual wellness budget to make yourself a priority (use it on gym memberships, massages, fitness and more).
- Generous amount of paid vacation and sick days, as well as a day off to volunteer for your favorite charity.
- Operate and maintain on-premise infrastructure environments across DEV, TEST, STAGING, UAT, and PROD.
- Ensure network zoning and environment segregation in line with External / DMZ / Internal architecture.
- Configure and support Web Application Firewalls (WAFs) and controlled traffic flows between zones.
- Operate and maintain External and Internal API Gateways.
- Support enterprise integrations via the Software AG integration platform.
- Operate identity and access management infrastructure, including miniOrange IdP, MFA, and OIDC integrations.
- Design, maintain, and operate CI/CD pipelines using Azure DevOps, including secure release promotion.
- Implement and operate Secure SDLC controls (SAST, SCA, DAST).
- Implement and maintain monitoring, logging, and audit capabilities (Prometheus, Grafana, Graylog, Sentry, SIEM forwarding).
- Support backup, replication, and disaster recovery activities, including DR testing.
DevSecOps Engineer
Blueprint Technologies
Blueprint Technologies, LLC is an equal employment opportunity employer. Qualified applicants are considered without regard to race, color, age, disability, sex, gender identity or expression, orientation, veteran/military status, religion, national origin, ancestry, marital, or familial status, genetic information, citizenship, or any other status protected by law. If you need assistance or a reasonable accommodation to complete the application process, please reach out to: recruiting@bpcs.com
This role is fully remote and part-time (25 hours per week).
Role Description
We are looking for a DevSecOps Engineer to join us as we build cutting-edge technology solutions! This is your opportunity to be part of a team that is committed to delivering best in class service to our customers. In this role, you will support secure cloud infrastructure, deployment automation, and operational reliability initiatives for enterprise analytics platforms and applications. You’ll help improve scalability, automation, monitoring, and security posture across development and production environments.
Responsibilities
- Build and maintain CI/CD pipelines and automation workflows
- Support cloud infrastructure and infrastructure-as-code initiatives
- Implement security monitoring and vulnerability remediation
- Manage containerized workloads and orchestration environments
- Support deployment, monitoring, and incident response activities
- Collaborate with development teams to streamline release processes
- Maintain operational and security documentation
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or related field
- 5+ years of DevOps or DevSecOps experience
- Experience with AWS or comparable cloud platforms
- Experience with Docker, Kubernetes, or OpenShift
- Strong scripting and automation experience
Preferred Qualifications
- Experience with Terraform, Jenkins, ArgoCD, or GitHub Actions
- Familiarity with cloud security and compliance frameworks
- Experience supporting analytics or data platforms
Salary Range
At Blueprint, we strive to offer competitive pay that reflects the value of our team members. Compensation for this role is influenced by a variety of factors, including skills, education, responsibilities, experience, and geographic market. For candidates based in Washington State, the anticipated salary range is $95,000 to $105,000 annually. Please note that we typically do not hire new employees at the top of the posted range. Actual starting pay will be determined based on experience, skills, and internal equity. The final salary and job title may vary depending on the selected candidate’s qualifications and could fall outside the stated range.
Benefits
- Medical, dental, and vision coverage
- Flexible Spending Account
- 401k program
- Competitive PTO offerings
- Parental Leave
- Opportunities for professional growth and development
Location
Remote



