Job Closed
This listing is no longer active.
A.I.-driven customer intelligence tools that give companies the power to discover & engage the humans in their data.
DevOps Manager
Location
United States
Posted
108 days ago
Salary
$140K - $170K / year
Seniority
Senior
Job Description
BlastPoint
• Ensure high availability, fault tolerance, and scalability of cloud services
• Optimize performance and cost efficiency across AWS environments
• Lead and mentor a small team of DevOps engineers, fostering a culture of innovation, collaboration, and accountability
• Balance hands-on contributions with strategic leadership, leading by example to ensure smooth execution of DevOps initiatives
• Design, deploy, and maintain BlastPoint’s AWS-based infrastructure using Terraform
• Own the SOC 2 certification and compliance monitoring process
• Implement security best practices, including IAM policies, encryption, vulnerability management, and incident response
• Enhance and maintain CI/CD pipelines using GitHub Actions to improve developer productivity and deployment speed
• Collaborate with software engineers to streamline build, testing, and release processes
• Implement observability, logging, and monitoring solutions to proactively detect and resolve issues
• Establish best practices for disaster recovery, data backup, and infrastructure resilience
Job Requirements
- Bachelor’s degree or equivalent experience in computer science, mathematics, statistics, economics, or a similar field of study
- 5-8 years minimum total experience in DevOps, Site Reliability Engineering (SRE), or Cloud Infrastructure roles
- 2+ years of experience managing DevOps, SRE, and/or Information Security teams
- Experience managing SOC 2 compliance efforts, including working with Vanta or similar security auditing/monitoring tools
- Proficiency using Terraform to deploy and configure virtual infrastructure in an AWS cloud environment
- Experience building CI/CD pipelines using GitHub Actions
- Expert-level knowledge of AWS cloud services, including but not limited to: RDS, ECS/Fargate, ECR, Lambda, Step Functions, SageMaker, S3, IAM, VPC, Elastic MapReduce, EC2, AWS Backup, CloudWatch, and CloudTrail
- Excellent leadership and communication skills
- A willingness to travel to the Pittsburgh, PA office periodically (roughly 2-4 times per year)
- Authorized to work without sponsorship in the United States.
Benefits
- Health insurance
- 401K
- Three weeks of PTO
- Schedule and work-from-home flexibility
- Tailored growth opportunities, from skills training to industry conferences
More DevOps Engineer Jobs
• Proactively explore and implement AI tools, LLM integrations, and MCP (Model Context Protocol) to reduce routine database toil, optimize query performance, and accelerate incident resolution.
• Support our data warehouse ecosystem by optimizing Snowflake performance, including application packaging and testing.
• Own the deep-level optimization of MSSQL (crucial for on-call stability) and PostgreSQL at the server, database, and query levels.
• Forecast resource utilization across platforms. Identify cost-saving opportunities, optimize Snowflake credit usage, and right-size AWS infrastructure.
• Automate all data infrastructure using Terraform, AWS, Docker, and Kubernetes. You will manage containerized data services and stateful workloads.
• Manage and optimize deployment pipelines using GitLab and Octopus Deploy, ensuring safe, repeatable database schema changes.
• Create technical documentation, including runbooks, "how-to" guides for developer self-service, and clear architectural diagrams.
• Serve as the subject matter expert for SQL Server, Postgres, and Snowflake in a 24/7/365 on-call rotation.
• Deliver the ADO Environment Current-State Assessment Report identifying gaps in configurations, pipelines, and workflow structures (Deliverable A1)
• Develop and execute the ADO Configuration Modernization Plan; implement updated ADO configurations including work item hierarchies, custom fields, sprint boards, and Kanban views (Deliverables A2/A3)
• Design and deploy reusable CI/CD pipeline templates for Azure Databricks notebook deployment, data validation, and automated reporting (Deliverable A5)
• Configure end-to-end DataOps integration: ADO Repos → Databricks notebooks → automated Power BI dashboard refresh, reducing manual effort by 80%+ through workflow automation
• Build Power Automate workflows for governance approvals, policy triggers, and document routing integrated with ADO and SharePoint
• Design and deploy GMCB's SharePoint Knowledge Management Library including architecture, document taxonomy, metadata schema, and content migration (Deliverables C1/C2)
• Develop ADO Analytics dashboards for sprint velocity, governance compliance, data quality, and operational KPIs
• Implement traceable work item linkage between ADO Epic-Feature-Story-Task structures and Azure Databricks development artifacts
• Develop the ADO Integration Deployment Package including configuration documentation, runbooks, and administrator guides (Deliverable A5)
• Support Agile pilot sprints by configuring and validating ADO Board workflows; provide hands-on technical support during adoption phase
Senior DevOps Engineer
Slingshot Aerospace
We build space simulation and analytics solutions to bring clarity to complex environments and create a safer world.
• Partner with offshore and onshore engineering teams to design, implement, and scale cloud-native infrastructure supporting a new customer portal and ongoing platform refactoring efforts
• Architect, build, and maintain Kubernetes-based environments that power production systems, ensuring scalability, resilience, and security
• Lead Infrastructure as Code initiatives (primarily Terraform) to automate provisioning, configuration, and environment consistency across AWS
• Design, implement, and optimize CI/CD pipelines to improve deployment velocity, reliability, and developer experience
• Integrate and operationalize MLOps practices, enabling efficient deployment, monitoring, and lifecycle management of machine learning workflows
• Embed DevSecOps best practices across the platform, incorporating security controls, compliance requirements, and monitoring into the development lifecycle
• Drive automation initiatives that reduce manual processes and increase system reliability and repeatability
• Collaborate closely with Platform, Engineering, and cross-functional stakeholders to gather requirements, troubleshoot issues, and continuously improve system architecture
• Monitor system performance, identify bottlenecks, and proactively implement improvements to optimize availability and cost efficiency
• Support incident response and root cause analysis efforts, driving long-term fixes and ensuring lessons learned translate into system improvements
Senior Site Reliability Engineer, AI Factory
NVIDIA
NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. If you're creative and autonomous, we want to hear from you! Applications for this job will be accepted at least until June 15, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes. NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
• Running commissioning and provisioning for GPU systems
• Running the firmware versions of equipment and components, and communicating the supported versions across the organization
• Through Day-2 operations, keeping tight SLOs around efficiency, performance, and availability
• Monitoring the hardware state of the cluster, finding bottlenecks and hot spots, and helping users attain peak performance constantly
• Triaging the HW break-fix issues and making constant improvements using open-source break-fix solutions
• Collaborate with programming and technical divisions to define and implement repeatable procedures
• Develop and implement operations strategy & processes, maintaining consistency with SLAs across critically important infrastructure
• Develop and apply procedures for minimal downtime and quality controls to strive to achieve continuous uptime
• Feeding requirements to software and hardware teams
• Creation of documentation that the ecosystem can use to run its own AI Data Centers




