Senior Cloud Operations Engineer
Location
California
Posted
7 days ago
Salary
$75K - $115K / year
Seniority
Senior
Job Description
Senior Cloud Operations Engineer
PTC
• Maintain and optimize existing monitoring and automation solutions
• Collaborate with stakeholders to gather requirements
• Define monitoring strategies and engineer solutions
• Design and implement cloud automation and orchestration workflows
• Develop and maintain integrations with RESTful APIs
• Create and maintain technical documentation
• Continuously analyze and improve monitoring KPIs and incident response processes
Job Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field
- 5+ years of experience in software or cloud engineering roles
- Proficiency in AWS and Azure cloud platforms
- Hands-on experience with Docker containerization
- Expertise in enterprise monitoring tools such as Zabbix, Catchpoint, and Sumo Logic
- Strong Python development skills
- Experience in designing and supporting RESTful APIs
- Knowledge of Infrastructure-as-Code tools like Terraform and SaltStack
- Familiarity with agile development methodologies and DevOps practices
Benefits
- Medical, dental and vision insurance
- Paid time off and sick leave
- Tuition reimbursement
- 401(k) contributions and employer match
- Flexible spending accounts
- Life insurance
- Disability coverage
- Commuter subsidy for office-assigned employees
More DevOps Engineer Jobs
Senior Site Reliability Engineer (SRE)
Devsu
Devsu is a technology agency that provides software development services, IT augmentation and staffing.
Role Description
We are seeking a Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP). This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments. As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.
Responsibilities
- Monitoring & Observability (Core Focus)
  - Own and operate the monitoring and observability stack across on-prem and GCP environments
  - Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
  - Define, tune, and maintain alerts to ensure high signal-to-noise ratio
  - Establish observability standards and best practices across teams
  - Improve visibility into system health, performance, and reliability
- Site Reliability Engineering
  - Apply SRE principles to improve availability, performance, and resilience
  - Define and track SLIs, SLOs, and error budgets
  - Participate in on-call rotations and SEV incident response
  - Lead or contribute to incident investigations and root cause analysis (RCA)
  - Drive preventative actions to reduce repeat incidents
- Kubernetes & Platform Reliability
  - Support and monitor Kubernetes environments (GKE and on-prem clusters)
  - Monitor cluster health, capacity, and resource utilization
  - Troubleshoot platform-level issues impacting application reliability
  - Collaborate with Platform and Engineering teams on reliability improvements
- Secondary Responsibilities (Backup Application Support)
  - Provide L2/L3 application support coverage during:
    - Support team resource shortages
    - High-severity incidents (SEVs)
    - Peak support periods or escalations
  - Triage and troubleshoot application issues using existing runbooks and dashboards
  - Collaborate with Application Support and Engineering teams during incidents
  - Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)
Qualifications
- Strong experience as a Site Reliability Engineer or Reliability Engineer
- Deep hands-on expertise with Grafana (dashboards, alerting, troubleshooting)
- Solid experience with monitoring and observability systems
- Production experience operating Kubernetes environments
- Experience supporting systems in GCP and on-prem environments (mandatory)
- Strong Linux systems and troubleshooting skills
- Fluent English (written and spoken)
- Ability to work in PST time zone
- Ability to participate in an on-call rotation that includes coverage for one weekend day
Requirements
- Technology Stack:
  - Observability: Grafana, Prometheus, logging platforms
  - Containers: Kubernetes (GKE and on-prem)
  - Cloud: Google Cloud Platform (GCP)
  - Operations: Linux, networking, infrastructure monitoring
  - Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)
- Nice to have:
  - Experience supporting application teams during SEV incidents
  - Knowledge of capacity planning and performance tuning
  - Scripting skills (Python, Bash, etc.)
  - Experience with hybrid infrastructure environments
Benefits
- A stable, long-term contract with opportunities for career growth
- Private health insurance
- A remote-friendly culture that promotes work-life balance
- Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
- Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
- A flexible Paid Time Off (PTO) policy as well as paid holiday days
- Challenging, world-class software projects for clients in the US and LatAm
- Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment
• Own the technical direction of Emergent’s DevOps practice, including the IaC standards, pipeline patterns, code review expectations, and reference architectures that the team applies across client engagements.
• Author and maintain the internal documentation, templates, and modules that the team relies on day to day. This is a core part of the role, not a side task.
• Mentor a team of engineers from mixed backgrounds, including software developers growing into cloud roles and traditional cloud engineers building DevOps fluency. Pair, review pull requests, run office hours, and raise the technical bar across the group.
• Lead the most complex DevOps engagements end to end, from discovery through delivery, with a focus on environments where scale, regulatory requirements, or legacy constraints demand strong technical judgment. This includes grooming backlog work, assigning features and tasks to the project team, and influencing technical direction across practice areas.
• Represent the DevOps practice in pre-sales. Partner with sales and account teams to scope work, lead technical discovery sessions, contribute to proposals, and build client confidence in Emergent’s approach.
• Continuously evolve Emergent’s DevOps practice by evaluating new tools, patterns, and approaches, and bringing the right ones into the team’s standard playbook.
• Diagnose and improve existing client systems that may be partially implemented or inconsistently managed, balancing pragmatic delivery with long-term maintainability.
• Accurately track time and work performed to support client billing, project planning, and overall project health, as part of a professional services delivery model.
• Apply strong time management skills to prioritize workload, meet deadlines, and provide technical support and guidance for other DevOps-focused team members.
• Share your knowledge at regular talk shop and lunch & learn sessions to help build a stronger team.
SRE / DevOps / Cloud Platform Engineer
Built Technologies
An award-winning FinTech startup, Built Technologies is on a mission to power smarter construction by transforming the construction finance ecosystem. Past flexible jobs at Built T
Role Description
Built is hiring a Mid or Senior Cloud Platform Engineer to join our Cloud Platform Team in Mendoza, Argentina. This role is for the hands-on engineer who keeps AWS environments running smoothly, improves Terraform every week, and is the person product teams want in the room when an infrastructure problem needs a practical, durable fix.
You will help operate and improve the cloud platform that product engineering teams build on:
- Triaging AWS issues
- Hardening infrastructure as code
- Supporting production databases
- Improving delivery pipelines
- Raising the bar on reliability
This is a strong fit for someone who enjoys digging into AWS console errors, tracing IAM denials, untangling Terraform state, and tuning the database or deployment path behind a production slowdown. At the L3 end of the band, this person will increasingly lead well-scoped platform initiatives while staying close to the operational craft.
What You'll Do
- Solve AWS problems across the services Built depends on, including EC2, EKS, ECS, RDS, S3, IAM, VPC, Route 53, CloudWatch, KMS, Secrets Manager, and the surrounding service ecosystem.
- Author, review, and improve Terraform across Built's AWS estate by extending modules, resolving drift, managing state safely, and lifting the quality of reusable infrastructure patterns.
- Support production databases as a hands-on partner to engineering teams, including:
  - Schema reviews
  - Index tuning
  - Slow-query investigations
  - Backup and restore validation
  - Replication or failover checks
  - Capacity planning
  - Migration support across RDS PostgreSQL, MySQL, Aurora, DynamoDB and related data stores.
- Own practical improvements to GitHub Actions CI/CD workflows, making day-to-day builds and deploys faster, safer, and easier for product teams to debug.
- Improve observability by building useful Datadog or CloudWatch dashboards, writing meaningful monitors, tuning noisy alerts, and helping teams instrument services with metrics, logs, traces, and basic SLO thinking.
- Participate in incident response and on-call for AWS-related failures: investigate quickly, communicate clearly, write clean postmortems, and turn findings into Terraform-backed or operational fixes.
- Partner with product engineering teams on infrastructure design by reviewing TRDs, recommending AWS patterns, and helping teams choose services that fit the problem, reliability needs, and operational cost.
- Help raise the standard for security and compliance through least-privilege IAM, secret hygiene, patching cadence, vulnerability remediation, and data-protection practices appropriate for a regulated fintech environment.
- Grow into larger ownership over time by leading bounded reliability, infrastructure, database, or developer-experience initiatives from problem definition through delivery.
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- 3-7 years of hands-on experience in DevOps, platform engineering, SRE, or cloud infrastructure in a production environment.
- Strong AWS troubleshooting skills, including comfort with CloudTrail, CloudWatch Logs, VPC flow logs, IAM policy analysis, AWS Support tooling, and root-cause work across compute, networking, storage, and identity.
- Solid working knowledge of core AWS services, including EC2, RDS, S3, IAM, VPC networking, Security Groups, Route 53, KMS, Secrets Manager, CloudWatch, and at least one of EKS or ECS.
- Hands-on Terraform experience writing and refactoring modules, working with remote state, managing environments or workspaces, and resolving drift and provider issues with care.
- Experience building or maintaining CI/CD pipelines in GitHub Actions and shipping production changes safely through reusable workflows, secrets management, and failed-run debugging.
- Working knowledge of at least one observability platform such as Datadog, Grafana, or CloudWatch, including dashboards, monitors, alert quality, and basic SLO concepts.
- Ability to script in Python, Go, Node or Bash for automation, operational tooling, and cloud-platform glue work.
- Clear written communication, with the ability to explain an AWS issue, Terraform change, database finding, or incident timeline to engineers outside the platform team.
Requirements
- Production DBA or database engineering experience, especially with MySQL and/or Postgres: query tuning, EXPLAIN plan analysis, indexing strategy, backup and restore, replication, failover testing, schema migrations, and partitioning.
- Experience operating DynamoDB in production, including data modeling, partition and sort key strategy, GSIs and LSIs, capacity mode selection, streams, DAX, and tuning hot-partition or throttling issues.
- Experience supporting database migrations and major version upgrades on RDS or Aurora.
- Familiarity with data security and compliance practices around PII, PCI, and regulated production environments.
Benefits
- The rare opportunity to radically disrupt a $1.5T industry.
- Competitive benefits including: uncapped vacation [US ONLY], health, dental & vision insurance.
- Robust compensation package, including equity in the form of stock options.
- Learning Grant program to support ongoing professional development.
- 401k with match and expedited vesting [US ONLY].
Operational Details
- Team size: You will join a team of 6+ platform engineers, with senior ICs on the team to learn from and partner with.
- On-call: Rotations are a week at a time, with frequency based on team size, expected roughly once every two months. The team works continuously to reduce incident volume so on-call stays manageable and low-noise.
- Location: Mendoza, Argentina, with flexible working hours.
• Owning the SRE lifecycle for NodeBalancer and Network Load Balancer, from design reviews and pre-rollout readiness assessments through production sign-off and ongoing reliability management
• Designing and implementing SLO/SLI frameworks that reflect true customer experience for L4 and L7 load balancing services, and driving action when error budgets are at risk
• Building and maintaining observability pipelines for NB/NLB infrastructure, including Prometheus metrics from load balancing components and system-level sources, and Grafana dashboards that enable rapid incident triage
• Leading technical incident response for complex NB/NLB failures (BGP/VIP issues, failover failures, data plane degradations, and configuration problems), acting as the technical commander and driving root cause analysis and preventive follow-through
• Developing and automating safe deployment workflows for phased NB/NLB releases, including bake period monitoring, feature flag management, and GO/NO-GO validation across global datacenter rollouts
• Reviewing design documents and product requirement documents, and producing actionable SRE input on operational risks, capacity implications, Day-2 concerns, and product strategy gaps
• Building automation and tooling using Python or Go that reduces operational toil and improves team-wide operational capability
• Mentoring SRE II engineers on the NB team, providing hands-on technical guidance, code/config reviews, and raising the bar for the team's SRE practice
• Participating in an on-call rotation for NB/NLB production systems, responding to incidents and driving resolution for customer-facing load balancing infrastructure
• Participating in a scheduled, daytime-only on-call rotation to spearhead technical incident response and resolve complex NB/NLB failures




