Make experiences flow.
Senior Cloud Operations Engineer
Location: United Kingdom
Posted: 6 days ago
Salary: 0
Seniority: Senior
Job Description
NICE
- Design, implement, and operate scalable, secure, and highly available AWS cloud infrastructure leveraging services such as EC2, EKS, ECS, RDS, S3, VPC, Lambda, and IAM.
- Drive the reliability and performance of containerized applications by managing Amazon EKS and ECS environments, including cluster operations, networking, scaling, and troubleshooting.
- Ensure the stability, security, and efficiency of production Linux environments through system administration, performance tuning, storage management, networking, and incident resolution.
- Maintain and optimize relational databases (PostgreSQL, MySQL, Aurora) and NoSQL platforms (DynamoDB, Redis), ensuring high availability, performance, and disaster recovery readiness.
- Strengthen the organization's cloud security posture through effective management of IAM, network security controls, secrets management, and compliance best practices.
- Enhance platform observability and operational excellence by implementing and improving monitoring, logging, alerting, and performance analytics using CloudWatch, Prometheus, and Grafana.
- Take ownership of production incidents by participating in on-call rotations, leading troubleshooting efforts, performing root cause analysis, and driving continuous improvement initiatives.
- Partner closely with software engineering, DevOps, and platform teams to improve deployment processes, application reliability, and operational efficiency.
- Identify and implement cloud cost optimization opportunities through resource right-sizing, capacity planning, automation, and governance best practices.
Job Requirements
- 4–5 years in a cloud operations, infrastructure engineering, or SRE role with a strong hands-on technical focus
- Deep hands-on experience with core AWS services: EC2, EKS, ECS, RDS/Aurora, S3, VPC, IAM, Lambda, CloudWatch, Route 53, and ALB/NLB
- Proven ability to design and troubleshoot complex AWS networking topologies (VPCs, subnets, transit gateways, security groups)
- Solid understanding of AWS IAM — roles, policies, permission boundaries, and cross-account access
- Hands-on production experience managing workloads on Amazon EKS and ECS — cluster lifecycle, node group management, networking (CNI, service mesh basics), and autoscaling
- Strong Docker fundamentals: image builds, registries (ECR), multi-stage builds, and container security
- Strong Linux administration skills: Bash/Python scripting, process and memory management, filesystem and storage operations, kernel parameters, and network diagnostics
- Experience managing and hardening Linux servers in production environments (RHEL, Ubuntu, or Amazon Linux)
- Proficient in Terraform — module design, state management, remote backends, and workspace strategies
- Hands-on experience with Puppet for configuration management, node classification, and enforcing system state at scale
- Hands-on experience with relational databases: PostgreSQL, MySQL, or AWS RDS/Aurora — schema management, query optimisation, replication, backups, and failover
- Familiarity with NoSQL databases: DynamoDB, Redis, or MongoDB — data modelling, performance tuning, and operational monitoring
- Familiarity with CI/CD pipelines (GitHub Actions, Jenkins, or AWS CodePipeline)
- Experience with observability tooling: CloudWatch, Datadog, Prometheus, or Grafana
Benefits
- Flexible working arrangements
- Professional development opportunities