Job Closed
This listing is no longer active.
Engineering new possibilities with platforms, data, and generative AI
Cloud Infrastructure Engineer – GCP
Location
United States
Posted
135 days ago
Salary
$100K - $145K / year
Seniority
Senior
Job Description
Cloud Infrastructure Engineer – GCP
Egen
• Implement cloud-based IaC solutions
• Develop and implement automation to support continuous delivery and continuous integration solutions
• Use GCP services to deploy highly available, scalable, and secure applications
• Implement workflows to automate the release and upgrade process for applications in Development, Test, and Production environments
• Implement secure integrations using GCP security and networking technologies
• Administration and engineering of IAM user Role-Based Access Controls and processes
• Create and update support documentation and standards
• Develop automated methodologies for deployment activities, configuration management, supporting systems, and business processes
• Investigate and contribute to solving various issues in production environments
Job Requirements
- 3+ years of professional experience managing infrastructure on AWS, GCP, or Azure including networking and access security
- Experienced in Infrastructure as Code (IaC) frameworks like Terraform or AWS CloudFormation
- Experienced in CI/CD Pipeline automation and integration using Jenkins, AWS Code{Pipeline, Build, Deploy}, Azure DevOps, or other relevant tooling
- Strong experience with shell scripts, editors, SSH, awk/sed, git, and other Linux toolkits
- Experienced in deploying containers and container orchestration using Docker, Kubernetes, and its components
- Experience with Kubernetes components like Ingress Controllers, Cert Managers, Custom Resource Definitions, and RBAC access security
- Implement secure integrations using security and networking technologies (IAP, VPC, and PSC)
- Administration and engineering of IAM user Role-Based Access Controls and processes
- Experienced in monitoring, alerting, and observability stack using Elastic Stack, Splunk, Prometheus, Grafana, CloudWatch
- Is self-directed, can work independently and make decisions autonomously at a high level.
More Infrastructure Engineer Jobs
Role Description
We are looking for a Site Reliability Engineer to join our Network and Security Operations Center (NOC), a team at the heart of platform reliability for mission-critical SaaS environments. You will help maintain, optimize, and ensure the reliability and performance of the systems that power our cloud infrastructure across AWS and Kubernetes, with a strong focus on automation, observability, and continuous improvement. This role blends reliability engineering with incident command, giving you real ownership over uptime, performance, and innovation. You will be part of a highly skilled team that values creative problem-solving, operational excellence, and continuous improvement through automation and resilience engineering.
Your Responsibilities
- Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Innovate relentlessly: Identify pain points, propose creative solutions, and drive initiatives that simplify, scale, and strengthen the platform.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Own observability: Enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes.
- Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Act as an agent orchestrator using Amazon Kiro: run multiple activities in parallel by leveraging AI agents to accelerate execution, while personally validating results and completing selected tasks manually when needed.
- Be on-call.
- Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes that improve stability, reliability, and developer experience.
- Implement monitoring, logging, alerting, and SLA reporting.
- Create and maintain technical documentation.
- Implement, maintain and mature SRE best practices.
- Lead incidents: Act as Incident Commander for incidents; coordinate cross-team response, manage communications, and ensure rapid service restoration.
- Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth.
- Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment.
- Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining the high level of performance, availability, and reliability for our users.
Qualifications
- 5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.
- Experience designing and deploying large scale systems, multi-vendor platforms and globally distributed infrastructure.
- Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale.
- Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar).
- Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable).
- Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards.
- Experience with incident management, on-call participation, escalation, and structured postmortems.
- Scripting skills in Python, Bash, Java or equivalent for automation and diagnostics.
- Curiosity, ownership, and a bias for action; you see a problem, you solve it, and you share the lessons learned.
- Experience with FedRAMP compliance is a strong asset.
- Basic knowledge of Java- or .NET-based development required.
- Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues beyond the province of Quebec.
Requirements
- Escalation on-call rotation.
- Occasional travel (quarterly offsites, conferences – less than 10%).
Benefits
- We understand that experience comes in many forms and that careers are not always linear. If you don't meet every requirement in this posting, we still encourage you to apply.
- At Tecsys, we are committed to fostering a diverse and inclusive workplace where all employees feel valued, respected, and empowered.
- We believe that diversity drives innovation and strengthens our ability to deliver exceptional solutions.
- We welcome and encourage applicants from all backgrounds, experiences, and perspectives to join our team.
- Tecsys is an equal opportunity employer. Accommodation is available for applicants selected for an interview.
- NB: if you are applying to this position, you must be a Canadian Citizen or a Permanent Resident of Canada, OR, have a valid Canadian work permit.
Senior Infrastructure Engineer
Bitrise
Mobile DevOps and Continuous Integration & Delivery for faster, better and more efficient app development 🚀
• Maintain, improve, and scale our existing infrastructure (both data center and cloud services) to meet growing demands
• Manage Mac-based and Linux-based build infrastructures alongside network infrastructure
• Streamline workflows through automation, enhancing efficiency and reducing manual intervention
• Proactively monitor infrastructure performance and identify potential issues to prevent customer impact
• Handle incidents effectively and conduct thorough post-mortems to prevent recurrence
• Develop systematic procedures for troubleshooting and maintenance
• Refine deployment practices to enhance quality and velocity
• Demonstrate the ability to work independently and take initiative
• Thrive in leading projects and project teams to drive innovation and achieve our mission
• Work closely with various teams within the company
• Document tribal knowledge and promote knowledge sharing across the engineering department
• Plan and execute work in an iterative, agile manner
Staff Infrastructure Engineer
Netwrix Corporation
Data security starts with identity, #1 attack vector. Fast, cost-effective solutions trusted by 13,500 organizations
• Design, own, and evolve the on-prem Kubernetes architecture used to deploy the Netwrix platform.
• Define best practices for cluster layout, networking, storage, security, and lifecycle management.
• Support self-managed Kubernetes environments across a wide range of customer infrastructure setups.
• Diagnose and resolve complex platform-level issues spanning Kubernetes, networking, storage, and application layers.
• Own the design and implementation of the product installer and deployment tooling (e.g. Helm charts, custom installers, scripts).
• Define packaging and dependency management strategies for on-prem environments.
• Design and maintain reliable upgrade, rollback, and migration workflows with minimal customer downtime.
• Improve installation validation, pre-flight checks, and post-install verification to reduce support burden.
• Establish standards for logging, monitoring, diagnostics, and supportability in customer-managed environments.
• Partner with Support and Customer Success to improve troubleshooting workflows and reduce deployment-related incidents.
• Ensure on-prem installations are secure, observable, and operationally mature.
• Act as the technical authority for on-prem infrastructure and deployment concerns.
• Review and influence architectural decisions that impact platform operability and reliability.
• Mentor engineers on Kubernetes, infrastructure design, and operational best practices.
• Drive improvements in documentation, runbooks, and operational readiness across the organization.
• Work with a team of highly skilled Staff and Senior Engineers
• Evolve and optimize our hybrid AWS and bare metal infrastructure to securely run sandboxed code and AI Agents with industry leading cost efficiency
• Investigate customer and infrastructure problems down to the packet capture and process memory level together with the team to ensure customers can trust Checkly
• Contribute to infrastructure reliability and ensure systems stay snappy for ad hoc and scheduled workloads without breaking or exploding costs
• Collaborate with product engineers to improve developer experience, support our strong shipping culture and provide observability they need




