Job Closed
This listing is no longer active.
Built on more than 130 years of experience, GE Vernova, a division of General Electric (GE), is leading a new era of energy by electrifying the world while work
SRE Production DevOps
Location
Worldwide
Posted
84 days ago
Salary
0
Seniority
Mid Level
Job Description
SRE Production DevOps
General Electric - GE
Role Description
The Production DevOps Engineer serves as a critical link in the "Middle-Mile" of software delivery for GE Vernova’s Grid Software SaaS products. This role is responsible for ensuring that software moves from development to production environments through a standardized, secure, and highly observable path. You will own the Change Management Process, serving as a primary authority for production deployments to ensure that new SaaS product versions do not compromise the stability of global energy grid operations. This position requires a strong technical background in automation and a disciplined approach to release safety in a 24/7 operational environment. The engineer works independently and is seen as a Technical Leader. The role requires a deep understanding of concurrent software development and its effect on build management and on releasing builds across versions and environments.
Job Requirements
- 3–5 years of experience in DevOps, SRE, or Release Engineering roles for cloud-native SaaS applications.
- Bachelor's Degree in Computer Science or “STEM” Majors (Science, Technology, Engineering and Math) with advanced experience.
- Hands-on experience with Jenkins, Artifactory, GitHub Actions and ArgoCD for automated software delivery.
- Proficiency in managing workloads on Kubernetes, specifically with EKS clusters.
- Strong skills in Ansible and Terraform for configuration management and infrastructure-as-code.
- Solid understanding of AWS cloud services (VPC, IAM, EKS, RDS, S3, MSK, etc.) in a production setting.
- Experience using Prometheus, Grafana, Splunk, Datadog or Dynatrace to monitor deployment health and system performance.
- Experience building dynamic build pipelines using Groovy Script, Python, Bash or Go languages.
- Proven ability to manage production changes and troubleshoot under pressure in a high-stakes environment.
- Familiarity with regulated industries and security frameworks such as NERC CIP, SOC2, ISO 27001, IEC 62443 is highly preferred.
- Strong ability to document technical procedures and communicate clearly with stakeholders during global shift handovers.
Benefits
- Relocation Assistance Provided: No
- This is a remote position
Key Performance Indicators (KPIs)
- Contribution towards the 4-hour SLA target for Customer Onboarding Speed.
- Help maintain 99.99% availability of mission-critical grid SaaS products.
- Maintaining a low rate of failed production deployments through improved quality gates for Change Failure Rate.
- Ensuring fast restoration of service through automated rollbacks and clear runbooks for Mean Time to Recover (MTTR).
- Automating repetitive manual tasks to ensure at least 50% of time is spent on engineering improvements for Toil Reduction.
Business Acumen
- Strong problem-solving abilities and the ability to articulate specific technical topics or assignments.
- Experience in building scalable and highly available distributed systems.
- Skilled in breaking down problems and estimating time for development tasks.
- Evangelizes how our technology solves customer problems from a technology and business perspective.
Leadership
- Demonstrates clarity of thinking to work through limited information and vague problem definitions.
- Influences through others; builds direct and "behind the scenes" support for ideas.
- Proactively identifies and removes project obstacles or barriers on behalf of the team.
- Shares knowledge, power, and credit, establishing trust, credibility, and goodwill.
Personal Attributes
- Able to work under minimal supervision.
- Excellent communication skills and the ability to interface with senior leadership with confidence and clarity.
- Skilled in providing oversight and mentoring team members. Shows ability to effectively delegate work.
- Applies values, business strategy, policies, precedent, and experience to make complex decisions in ambiguity and with uncertain consequences.
More DevOps Engineer Jobs
Senior DevSecOps Consultant – AWS, Kubernetes, Terraform
Trility Consulting
Start delivering technology solutions that simplify, automate, and secure your business.
• Support and optimize cloud infrastructure for a data lake environment within AWS
• Develop and maintain Infrastructure as Code using Terraform to ensure scalable, repeatable deployments
• Manage and support Kubernetes-based workloads, including deployment and configuration using Helm
• Collaborate with data and platform teams to ensure infrastructure supports data ingestion, processing, and reporting needs
• Write and maintain Python scripts to support automation, integration, and operational tasks
• Monitor and troubleshoot infrastructure and platform issues across cloud and containerized environments
• Implement and maintain security best practices across cloud resources, Kubernetes, and data platform components
• Contribute to documentation, runbooks, and operational standards to support long-term platform sustainability
• Partner with cross-functional teams to support ongoing enhancements and stabilization of the data platform
Senior Site Reliability Engineer
Backblaze
Backblaze is the cloud storage innovator delivering a modern alternative to traditional cloud providers.
• Own and drive the availability, durability, and performance of critical services across all production environments.
• Lead and champion complex projects from problem discovery through complete, cross-functional resolution, demonstrating high-level technical ownership.
• Define, establish, and enforce service health standards, including working with engineering leadership to implement SLIs, SLOs, and error budget policies for multiple services.
• Lead critical incident response and post-incident reviews, translating findings into strategic, long-term service improvements and architectural changes.
• Mentor others and act as a subject matter expert in following and evolving established ITIL/OSS processes (incident, change, problem, and capacity management).
• Design and architect scalable automation solutions to eliminate toil and improve the efficiency of operational tasks across the entire platform.
• Drive the strategic direction of monitoring, logging, and alerting frameworks (e.g., Prometheus, Grafana, Catchpoint, ELK), and integrate them for comprehensive observability.
• Build, maintain, and secure advanced CI/CD pipelines, configuration management, and complex infrastructure as code solutions (Terraform, Ansible, Jenkins).
• Write production-grade code (Bash, Python, Go, etc.) to develop new reliability tools and enhance existing systems.
• Act as a principal partner to engineering, product, and operations teams, consulting on resilient system design, architecture, and operation.
• Lead and formalize the Production Readiness Review (PRR) process, ensuring robust operational handoff for all new services and features.
• Lead capacity planning and disaster recovery strategy across critical infrastructure components.
• Manage the relationship with vendors and service providers to troubleshoot systemic issues and ensure strict adherence to SLA performance.
• Drive the creation of high-quality documentation, proactively share advanced learnings, and cultivate a reliability-first engineering culture across teams.
• Own the creation, maintenance, and dissemination of operational playbooks, runbooks, and detailed system documentation.
• Proactively identify systemic, recurring issues and architect and drive the implementation of long-term improvements and strategic design action plans.
• Be a leading voice in promoting and embedding reliability-focused practices within development and operations teams.
DevOps – Site Reliability Engineer
Oowlish
We make innovation simple, convenient and right...we just make it HAPPEN
• Join a growing AI-focused SaaS startup as a DevOps & Site Reliability Engineer
• Responsible for maintaining, optimizing, and scaling infrastructure supporting the platform
• Work closely with development and product teams to improve deployment processes
• Monitor systems and respond proactively to incidents
SRE Analyst – Mid-level
Vivo (Telefônica Brasil)
Through connection, we want you to discover new points of view and enjoy everything that really matters.
• Perform troubleshooting and functional analysis of incidents in non-production environments;
• Provide support for applications in testing environments;
• Implement and manage monitoring tools to ensure visibility into system performance and proactively detect issues;
• Lead incident response, conducting post-incident (postmortem) analyses to identify root causes and prevent recurrence;
• Develop scripts and tools to automate repetitive tasks, improving operational efficiency and reducing human error;
• Analyze system capacity and plan scalability to meet demand, ensuring services remain available and responsive;
• Collaborate with development teams to implement changes safely and efficiently, minimizing impact on the staging environment;
• Work closely with security teams to ensure security practices are integrated into the testing lifecycle;
• Create and maintain technical documentation and operational runbooks, and train teams on best practices and tools;
• Work together with QA analysts to continuously improve system reliability and efficiency.




