Elevate contact center agent productivity through conversational workflow software
Senior DevOps – Platform Reliability Engineer
Location
New York
Posted
18 days ago
Seniority
Senior
Job Description
Zingtree
- Own and evolve CI/CD pipelines using GitHub Actions and OIDC-based authentication for microservices and agentic workloads, with safe, fast, and reversible deployments.
- Automate infrastructure provisioning using Infrastructure as Code (IaC) tools such as Terraform and CloudFormation.
- Operate and scale our Kubernetes platform (EKS + Argo CD), including autoscaling, ingress, external-dns, cert-manager, External Secrets Operator, backups, runtime guardrails, and multi-tenant isolation for enterprise customers.
- Manage the edge and network perimeter, including Cloudflare (CDN, WAF, Bot Management, DDoS protection, Zero Trust / Access), CloudFront, API Gateway, ALB/NLB, Route 53, and network security controls.
- Operate the data and event tier, including Aurora MySQL, ElastiCache/Redis, S3, and MSK (Kafka), with responsibility for backups, point-in-time recovery (PITR), and multi-AZ disaster recovery aligned to defined RTO/RPO objectives.
- Build and maintain Lambda workloads where event-driven or serverless architectures are the right fit.
- Build observability as a product using Prometheus, Grafana, and OpenTelemetry, including telemetry for LLM and agentic systems such as token cost, tool-call latency, evaluation signals, and prompt/version tracking.
- Strengthen our security and compliance posture for SOC 2 and HIPAA, including least-privilege IAM, SCPs, secrets management, SAST/DAST, dependency and container scanning, image signing, AWS Config, Security Hub, GuardDuty, Inspector, and evidence automation.
- Drive FinOps initiatives, including tagging standards, Savings Plans and Reserved Instances, per-tenant and per-workload cost attribution, and LLM cost controls.
- Build and evolve our AI-native DevOps capabilities.
Job Requirements
- 5+ years of experience in DevOps, SRE, or Platform Engineering operating production systems on AWS.
- Strong experience with CI/CD pipelines and tools such as GitHub Actions, GitLab CI, Jenkins, or CircleCI.
- Hands-on experience operating production EKS environments, including autoscaling, ingress, secrets management, and cluster upgrades.
- Strong AWS networking experience, including multi-account VPC design, subnets, routing, security groups, NACLs, Route 53, ACM, and load balancers.
- Deep experience with Terraform and GitHub Actions, ideally using OIDC-based cloud authentication.
- Experience with Aurora/RDS MySQL, Redis (ElastiCache), and S3, including backups, PITR, migrations, and lifecycle management.
- Strong observability experience using Prometheus, Grafana, and OpenTelemetry.
- Experience operating Argo CD at scale.
- Experience with Infrastructure as Code tools such as Terraform, CloudFormation, or Ansible.
- Experience managing Cloudflare services including WAF, Bot Management, Rate Limiting, and Zero Trust / Access, along with CloudFront.
- Experience operating Kafka/MSK at scale, including topics, consumer groups, and schema registries.
- Experience with Lambda and event-driven architectures.
- Comfortable working with Python, Bash, and Linux systems.
- Strong understanding of security best practices across IAM, KMS, secrets management, networking, and software supply chain security.
- Familiarity with vulnerability scanning and compliance tooling.
Benefits
- Competitive compensation packages
- Comprehensive health benefits:
  - 100% of employee premiums covered
  - 75%–80% of dependent premiums covered for most health, dental, and vision plans
- 401(k) plans to support retirement planning (no employer matching currently)
- Paid parental leave
- Unlimited PTO
- Flexible remote work from anywhere
- Up to $200/month co-working reimbursement
- Home office stipend:
  - Up to $500 for home office setup
  - $100/month for internet, phone, and related expenses
More DevOps Engineer Jobs
Site Reliability Engineer
Arbor Education
Arbor MIS helps schools and MATs work more easily and collaboratively. Join a free webinar: http://bit.ly/Arbor-webinars
- Proactively monitor and analyse platform performance.
- Collaborate with engineering teams to address performance bottlenecks and ensure scalability.
- Assist engineering teams with implementing and reviewing SLOs.
- Continually improve observability through monitoring, alerting, and dashboards, using tools such as Datadog or Prometheus.
- Work with other teams to ensure observability is effective and provides full coverage.
- Ensure the service is highly available and resilient.
- Champion best practices in design for high availability.
- Devise runbooks and run game sessions to test our DR plan, HA, and backups.
- Conduct capacity assessments and plan for scaling to meet current and future business needs.
- Work closely with the Head of Platform Engineering and Head of SRE to strategize and implement scalable solutions.
- Work closely with the Platform team, feature teams, 2nd line support, and other stakeholders to ensure a good level of service for our customers and to embed SRE practices.
- Act as a key player in incident response and troubleshooting, ensuring rapid resolution and minimising downtime.
- Participate in blameless postmortems to identify root causes and corrective actions.
- Develop and maintain playbooks and documentation.

- Work primarily with on-premise infrastructure (bare metal and VMs): setup, maintenance, troubleshooting
- Drive clarity in ambiguous situations by defining requirements, assumptions, and next steps
- Own automation projects end-to-end (design → rollout → maintenance)
- Improve how we operate: harden and tune systems, and improve the team's operational hygiene
- Keep the platform stable, fast, and secure: servers, web servers, databases, queues
- Investigate production incidents across OS / networking / infrastructure layers, apply temporary mitigations, coordinate with developers, and participate in post-mortems
- Participate in on-call rotations
- Use AI in all aspects of day-to-day work: researching, troubleshooting, developing
Site Reliability Engineer
Qlik
Founded in 1993, Qlik is an award-winning, market-leading software company that specializes in business intelligence technology. Qlik provides tools that make d
Role Description
Join our dynamic team at Qlik as a Site Reliability Engineer, where you'll play a crucial role in ensuring the security, stability, and scalability of our Qlik and Talend Cloud services. This exciting role offers hands-on experience with cutting-edge technology and scale challenges as we expand to support millions of transactions across our cloud environment.
- Exciting Challenges: Maintain the reliability and availability of our cloud platforms, tackling complex problems and driving improvements to enhance performance and scalability.
- Collaborative Environment: Work closely with our Engineering organization, collaborating with Architecture, Platforms, and Domains teams to design and develop new infrastructure features and optimize cloud-related practices.
- Innovative Solutions: Design and develop effective tooling, alerts, and responses to identify and address reliability risks, utilizing your expertise in cloud technology and backend systems.
- Professional Growth: Act as a resource for fellow engineers, sharing your knowledge and expertise in cloud engineering, production service operations, incident management, and troubleshooting.
- Continuous Learning: Stay updated on the latest industry trends and technologies, contributing to the adoption of best practices and driving continuous improvement within our cloud environment.
Here’s how you’ll be making an impact:
- Reliability and Scalability: Ensure high reliability and availability of our cloud platforms, collaborating with cross-functional teams to implement new infrastructure features and optimize performance.
- Cloud Optimization: Define and evangelize cloud-related optimizations and best practices, driving improvements in reliability, scalability, and performance.
- Problem Solving: Analyze complex issues at the infrastructure, systems, network, and application levels, making recommendations and decisions to resolve them effectively.
- Knowledge Sharing: Share your expertise with fellow engineers, providing guidance on cloud technologies, automation, security, and best practices.
- On-Call Support: Participate in on-call duties to maintain the availability and performance of our cloud infrastructure, providing regular updates on project status and activities.
Qualifications
- Bachelor's or master’s degree in computer science or a relevant field.
- Self-motivated with the ability to work autonomously and multitask effectively.
- Strong analytical skills for solving complex problems and driving innovative solutions.
- 3+ years’ experience with Infrastructure as Code (IaC) tools such as Terraform, Crossplane, Ansible, or similar.
- 3+ years’ experience working alongside a production system running on Kubernetes.
- 3+ years of professional experience in cloud engineering, preferably with AWS and Azure.
- 3+ years of professional experience with operating and/or building microservices.
- Proficiency in scripting and automation (e.g., Bash, Python, Go) and software engineering concepts.
- Experience with CI/CD, cloud and microservice autoscaling.
- Experience with networking security and secret-management tools (e.g., Vault, AWS SSM).
- Proficiency with observability tooling such as Prometheus, OpenTelemetry, distributed tracing, and SIEM such as Splunk.
- Experience with Helm, including managing Helm charts, creating custom charts from existing ones, and building new ones.
- Excellent English communication skills, both oral and written.
- Curiosity and desire to learn.
- Ability to take a rotating on-call shift.
- Knowledge of infrastructure security review and compliance frameworks.
- Experience working with database concepts and tooling such as MongoDB.
- Demonstrated ability to collaborate with development teams and provide expert guidance on implementing reliability best practices, ensuring systems are robust, scalable, and highly available.
- Where applicable, experience with or interest in learning other tools such as Temporal, ClickHouse, FireHydrant, Grafana, Solace, Gloo, Istio, and other cloud-native tools.
- Certifications such as CKD, CKS, AWS Certified Solutions Architect Associate/Professional, AWS Certified Advanced Networking Specialty, and AWS Certified Security Specialty are all considered assets.
- Ability to obtain sufficient clearance status to work on IL5 systems with Qlik support.
Benefits
- Named in Newsweek’s ‘America’s Greatest Workplaces 2025’.
- Genuine career progression pathways and mentoring programs.
- Culture of innovation, technology, collaboration, and openness.
- Flexible, diverse, and international work environment.
- Participation in Corporate Responsibility Employee Programs.
- Comprehensive benefits, including medical, dental, and vision coverage, life and AD&D, short and long-term disability coverage, paid time off, paid parental/maternity leave, participation in a 401(k) program that includes company match, and many other additional voluntary benefits.
Senior ServiceDesk Reliability Engineer – SDRE
Tabby
On a mission to create financial freedom. No interest. No fees. Shariah-Compliant.
- Tabby creates financial freedom in the way people shop, earn, and save by reshaping their relationship with money.
- The company’s flagship offering allows shoppers to split their payments online and in-store with no interest or fees.
- Tabby generates over $10 billion in annual transaction volume for its partner brands and is the highest-rated, most-reviewed, largest, and fastest-growing FinTech in the GCC region.
- Tabby launched in 2019 and has since raised more than $1 billion in equity and debt funding from global and regional investors, and is now valued at $4.5 billion.



