A comprehensive cloud-based platform to modernize the Office of the CFO.
Senior Cloud DevOps Engineer
Location
United States
Posted
3 days ago
Salary
$131K - $170K / year
Seniority
Senior
Job Description
OneStream Software
- Develop and maintain Infrastructure-as-Code using tools and languages such as Terraform, PowerShell, ARM, Bicep, Bash, and YAML
- Deliver high-quality implementations in a timely manner
- Design and maintain CI/CD pipelines supporting secure, reliable, and repeatable deployments
- Update technical documentation, workflows, and knowledge base articles
- Build knowledge in focused areas of the OneStream platform and deployment stack
- Participate in collaborative engineering, peer reviews, and knowledge-sharing initiatives
- Collaborate with other teams to define, estimate, and implement requirements for new automations or services needed for development
- Apply software engineering best practices to infrastructure and automation development
- Optimize cloud environments for scalability, reliability, and cost efficiency
- Participate in troubleshooting and resolution of complex production issues across cloud platforms and services
- Work with Compliance and Security teams to ensure compliance with required controls
Job Requirements
- BS/BA in computer science, engineering, or technology-related field (or equivalent work experience)
- 8+ years of cloud infrastructure experience
- Advanced understanding of Infrastructure-as-Code concepts and tooling (Terraform, CloudFormation templates, Bicep, or ARM templates) on Microsoft Azure, Amazon Web Services (AWS), or Google Cloud Platform (GCP)
- Deep knowledge of Configuration Management/Orchestration utilities such as Ansible, PowerShell DSC, Chef, and Puppet
- Advanced understanding of cloud concepts including elasticity, security, and identity management
- Well versed in Agile development methodologies using Jira or Azure DevOps Boards
- Strong understanding of Azure Kubernetes Service (AKS) and container-based deployments, or of comparable platforms such as OpenShift, GKE, or EKS
- Proficient knowledge of the software development lifecycle (SDLC)
- Experience with AI-focused Azure resources such as API Management, Azure OpenAI, or Cognitive Services
- Experience collaborating with software development teams focused on implementing Large Language Model (LLM), Predictive AI, or other AI solutions using Azure cloud resources
- 8+ years of hands-on experience with the following technologies, tools, and concepts: automating processes using PowerShell, Bash, CLI, REST APIs, Python, ARM templates, or other scripting languages
- Comfortable using source control tools such as Git, Bitbucket, or GitHub
- Knowledge of container orchestration platforms and tooling such as Kubernetes, OpenShift, AKS, GKE, or Helm
- Microsoft Azure
- Microsoft Windows 11, Windows Server, IIS, Microsoft SQL Server, Entra ID
Benefits
- Vision
- Medical
- Life
- Dental
- 401K



