Job Closed

This listing is no longer active.

Parallel Domain

Synthetic data for computer vision and perception.

Principal Site Reliability Engineer

DevOps Engineer · Full Time · Remote · Lead · Team 51-200 · H1B No Sponsor

Location

Canada

Posted

103 days ago

Seniority

Lead

Job Description

About the Role

Parallel Domain is looking for a Principal Site Reliability Engineer to own the reliability, scalability, and security of our cloud infrastructure - the backbone that runs simulation workloads for some of the most demanding customers in autonomous vehicle development.

This is a hands-on, high-ownership role. You'll be the primary infrastructure owner across our multi-region AWS/EKS platform, working closely with a small platform engineering team and partnering with engineering leads across simulation and ML and with our customer-facing teams.

What You'll Do

Infrastructure Ownership & Cloud Operations

- Own and evolve our AWS-based infrastructure, improving platform performance and availability today, and building toward deployable configurations that support enterprise customer environments tomorrow.
- Own EKS cluster operations across production regions: node pool strategy, AMI lifecycle, autoscaling, and Kubernetes workload health.
- Support the GitOps deployment pipeline - define, deploy, and manage applications across clusters using infrastructure-as-code.
- Manage complex networking: VPC design, cross-region connectivity, DNS, and load balancing.
- Lead infrastructure deprecation and migration efforts with minimal disruption.

Reliability Engineering & Incident Response

- Own SLO measurement infrastructure; enable proactive triage of emerging issues before they impact customers.
- Lead incident investigation, root cause analysis, and postmortems, driving systemic fixes rather than one-off patches.
- Design and improve automated remediation systems to reduce MTTR.

Security & Access Management

- Review and provide security-conscious feedback on platform architecture decisions.
- Own cloud IAM governance - roles, policies, and access boundaries across accounts and services.
- Lead compliance-adjacent work including audit-readiness, partner certification requirements, and supporting responses to customer security questionnaires.

Cross-Functional Collaboration

- Partner with application development teams to build an inherently secure platform and drive next-generation deployment architecture.
- Partner with customer teams to ensure availability for expected utilization.
- Partner with Finance on cloud cost optimization - lifecycle policies, right-sizing, and spend visibility.
- Support GPU and batch workloads in collaboration with simulation and ML engineering teams.

Platform Tooling & Developer Experience

- Improve CI/CD pipelines and automated infrastructure validation.
- Support engineering teams with infra-side debugging, log analysis, and environment configuration.

What We're Looking For

Technical Depth

- 5+ years in SRE, DevOps, or infrastructure engineering roles.
- Infrastructure-as-code proficiency - Terraform modules, state management, and multi-environment patterns.
- Deep AWS experience - EKS, EC2, IAM, S3, Storage Gateway, VPC networking, Transit Gateway, CloudFront, KMS, and IRSA.
- Kubernetes expertise - cluster operations, node pools, probes, cordoning, pod scheduling, RBAC, Helm, node autoscaling (Karpenter experience a plus); solid understanding of containerization and AMI lifecycle management.
- CI/CD - experience with GitOps workflows and pipeline tooling (ArgoCD, GitHub Actions, Jenkins).
- Solid networking fundamentals - CIDR design, security groups, DNS, load balancing, VPN, cross-region connectivity.
- Experience with monitoring and observability tooling - Prometheus, Grafana, Elasticsearch.
- Comfort with Python and Bash for tooling and automation.
- Familiarity working across Linux and Windows environments. Operational familiarity with Windows Server is a meaningful advantage.

Communication & Ownership

- You communicate clearly across engineering, product, and customer-facing teams, flagging issues with urgency proportional to customer impact.
- You advocate for SRE best practices and can effectively operationalize an informed and principled view on security.
- You take end-to-end ownership of complex, multi-team efforts - from planning through execution and post-change verification.
- You know when to push for a clean solution vs. when to accept a pragmatic one, and you communicate that tradeoff clearly.

Nice to Have

- Experience with Windows-based workloads on EKS.
- Experience supporting simulation, ML, or rendering workloads in cloud infrastructure; running GPU workloads on Kubernetes, including NVIDIA and DirectX device plugin configuration.
- Experience with AWS Storage Gateway or Transfer Family integrations.
- Familiarity with Envoy Gateway or similar.
- Experience with container-optimized OS images (e.g., Bottlerocket, Packer).
- Experience with cloud cost optimization at scale.

Core Tools

Terraform · AWS · Kubernetes · Helm · ArgoCD · Kustomize · Grafana · Prometheus · Elasticsearch · VictoriaLogs · Fluent Bit · GitHub Actions · Jenkins · Docker · Python · Bash

Why This Role

PD's simulation platform runs at the intersection of high-performance compute, distributed systems, and customer-critical reliability. The infrastructure problems here are genuinely interesting: multi-region GPU scheduling, Windows workloads on Kubernetes, startup latency optimization, and an enterprise product direction that will require rethinking how we deploy and manage the platform entirely. The Principal SRE at PD is not a ticket-taker - it's a high-trust, high-autonomy position where you'll have genuine influence over infrastructure architecture, cross-team process, and customer experience.

More DevOps Engineer Jobs

Senior Site Reliability Engineer

Centene Corporation

Transforming the health of the communities we serve, one person at a time.

DevOps Engineer · 103 days ago

Full Time · Remote · Team 10,001+ · Since 1984 · H1B No Sponsor

You could be the one who changes everything for our 28 million members by using technology to improve health outcomes around the world. As a diversified, national organization, Centene's technology professionals have access to competitive benefits including a fresh perspective on workplace flexibility.

Position Purpose: Helps lead projects that are focused on managing and maintaining optimum platform infrastructure performance, reliability, and security using SRE practices, observability tools, manual and automated procedures, documentation, people and processes, and continuous delivery (CI/CD) tools, processes, and designs. Develops complex services to automate monitoring activities and provide critical information to facilitate response and resolution of performance and availability issues and incidents. Understands and advocates for standardized and scalable software tools to ensure that systems operate without interruption at optimum performance and leads project teams throughout the deployment process. Troubleshoots and analyzes service disruptions to determine the root cause of issues and develop solutions for improved reliability.

- Supports multiple applications and schedules batch jobs for a large number of transactions weekly
- Troubleshoots and resolves more complex problems with systems and services and initiates regular deployment of new versions of the systems and their subcomponents
- Leads more complex projects focused on building and maintaining observability/monitoring for the application, monitoring key performance indicators, maintaining alerting, and continuously improving visibility
- Helps make decisions around periodic system validation and testing, service monitoring, and standing up new services/tools
- Uses knowledge and experience to identify strategies that increase system reliability and performance through on-call rotation and process optimization
- Identifies and implements necessary manual and automated procedures for improved collaborative response in real-time
- Leads lower-level engineers in stress, security, and performance testing
- Resolves issues that come up through support escalation
- Keeps documentation and runbooks up to date to effectively deal with new incidents that might arise
- Leads post-incident reviews and documents findings for future informed decision making
- Reviews proposals to optimize the Software Development Life Cycle (SDLC) to boost service reliability and makes decisions around which proposals should move forward
- Communicates complex topics with development teams to investigate and document issues and leads internal team to develop solutions to mitigate them
- Performs other duties as assigned
- Complies with all policies and standards

Education/Experience: A Bachelor's degree in a quantitative or business field (e.g., statistics, mathematics, engineering, computer science) and 4 – 6 years of related experience, or equivalent experience acquired through accomplishments of applicable knowledge, duties, scope and skill reflective of the level of this position.

Technical Skills: One or more of the following skills are desired.

- Experience with SRE or DevOps
- Batch scheduling
- Monitoring experience
- SQL

Pay Range: $87,000.00 - $161,300.00 per year

Centene offers a comprehensive benefits package including: competitive pay, health insurance, 401K and stock purchase plans, tuition reimbursement, paid time off plus holidays, and a flexible approach to work with remote, hybrid, field or office work schedules. Actual pay will be adjusted based on an individual's skills, experience, education, and other job-related factors permitted by law, including full-time or part-time status. Total compensation may also include additional forms of incentives. Benefits may be subject to program eligibility.

Centene is an equal opportunity employer that is committed to diversity, and values the ways in which we are different. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, veteran status, or other characteristic protected by applicable law. Qualified applicants with arrest or conviction records will be considered in accordance with the LA County Ordinance and the California Fair Chance Act.

United States + 1 more

$87K - $161K / year

Job Closed

DevOps Engineer

TalentNeuron

Navigating the future of work.

DevOps Engineer · 103 days ago

Full Time · Remote · Team 201-500 · Since 1999 · H1B Sponsor

• Collaborating with our web development, data engineering, and business teams to design and deploy reliable, scalable, and secure infrastructure
• Ensuring high availability and uptime of our applications by implementing monitoring and alerting systems
• Automating tasks and workflows using scripts, CI/CD pipelines, and other tools to streamline development and deployment processes
• Maintaining optimal use of AWS infrastructure and databases to ensure scalability and cost-effectiveness
• Staying up-to-date with the latest DevOps tools and technologies to continuously improve our processes and infrastructure
• Participating in code reviews, architecture discussions, and other activities to help maintain best practices across our engineering teams
• Providing technical guidance and mentorship to other team members, helping to grow their skills and improve our overall engineering capabilities
• Having knowledge of networking concepts such as DNS, VPN, load balancing, and security groups

AWS · Cloud · DNS · Grafana · JavaScript · Kubernetes · Node.js · PostgreSQL · Prometheus · Python · React · Terraform

United States

Job Closed

Senior DevOps Engineer (Kafka Specialist)

Fugro

Founded in 1962, Fugro is the world’s top independent provider of geo-intelligence and asset integrity solutions for the oil and gas, construction, and mining

DevOps Engineer · 103 days ago

Full Time · Remote

Job Description

Who we are

Do you want to join our Geo-data revolution? Fugro’s global reach and unique know-how will put the world at your fingertips. Our love of exploration and technical expertise help us to provide our clients with invaluable insights. We source and make sense of the most relevant Geo-data for their needs, so they can design, build and operate their assets more safely, sustainably and efficiently. But we’re always looking for new talent to take the next step with us. For bright minds who enjoy meaningful work and want to push our pioneering spirit further. For individuals who can take the initiative, but work well within a team.

Job Purpose

We are building the Common Data Backbone (CDB), Fugro’s strategic data platform, which enables discovery, governance, and integration across our global geospatial data ecosystem. The CDB connects multiple cloud services and end-user applications through Apache Kafka, serving as the integration solution within the CDB for event orchestration and integration services. To further develop and deploy our CDB, we want to strengthen the team with an experienced Kafka DevOps Engineer who will expand and mature the Kafka infrastructure on AWS. This role focuses on secure cluster setup, lifecycle management, performance tuning, DevOps and reliability engineering to ensure Kafka runs at enterprise-grade standards.

Key Responsibilities

• Design, deploy, and maintain secure, highly available Kafka clusters in AWS (MSK or self-managed).
• Perform capacity planning, performance tuning, and proactive scaling.
• Automate infrastructure and configuration using Terraform and GitOps principles.
• Implement observability: metrics, Grafana dashboards, CloudWatch alarms.
• Develop runbooks for incident response, disaster recovery, and rolling upgrades.
• Ensure compliance with security and audit requirements (ISO27001).
• Collaborate with development teams to provide Kafka best practices for .NET microservices and Databricks streaming jobs.
• Conduct resilience testing and maintain documented RPO/RTO strategies.
• Drive continuous improvement in cost optimization, reliability, and operational maturity.

Required Skills & Experience

• 5-6 years in DevOps/SRE roles, with 4 years of hands-on Kafka operations at scale.
• Strong knowledge of Kafka internals: partitions, replication, ISR, controller quorum, KRaft.
• Expertise in AWS services: VPC, EC2, MSK, IAM, Secrets Manager, networking.
• Proven experience with TLS/mTLS, SASL/SCRAM, ACLs, and secure cluster design.
• Proficiency in Infrastructure as Code (Terraform preferred).
• Familiarity with CI/CD pipelines for cluster and topic configuration.
• Monitoring and alerting using Grafana, CloudWatch, and log aggregation.
• Disaster recovery strategies.
• Strong scripting skills (Bash, Python) for automation and tooling.
• Excellent documentation and communication skills.
• Kafka Certification.

Nice-to-Have

• Experience with AWS MSK advanced features.
• Knowledge of Schema Registry (Protobuf) and schema governance.
• Familiarity with Databricks Structured Streaming and Kafka Connect.
• Certifications: Confluent Certified Administrator, AWS SysOps/Architect Professional.
• Databricks data analyst/engineering and certification.
• Geo-data experience.

What we offer

Fugro provides a positive work environment as well as projects that will satisfy the most curious minds. We also offer great opportunities to stretch and develop yourself. By giving you the freedom to grow faster, we think you’ll be able to do what you do best, better. Which should help us to find fresh ways to get to know the earth better. We encourage you to be yourself at Fugro. So bring your energy and enthusiasm, your keen eye and can-do attitude. But bring your questions and opinions too. Because to be the world’s leading Geo-data specialist, we need the strength in depth that comes from a diverse, driven team.

Our view on diversity, equity and inclusion

At Fugro, our people are our superpower. Their variety of viewpoints, experiences, knowledge and talents give us collective strength. Distinctive beliefs and diverse backgrounds are therefore welcome, but discrimination, harassment, inappropriate behavior and unfair treatment are not. Everybody is to be well-supported and treated fairly. And everyone must be valued and have their voice heard. Crucially, we believe that getting this right brings a sense of belonging, of safety and acceptance, that makes us feel more connected to Fugro’s purpose ‘together create a safe and livable world’ – and to each other.

HSE Responsibilities

• Responsible for ensuring safety of self and others at site.
• Prevent damage of equipment and assets.
• Responsible for following all safety signs, procedures, and safe working practices.
• Responsible for using appropriate PPE.
• Responsible for participating in mock drills.
• Entitled to refuse to undertake any activity considered unsafe.
• Responsible for filling out a hazard observation card wherever a hazard has been noticed at site.
• Responsible for safe housekeeping of their workplace.
• To stop any operation that is deemed unsafe.
• To be able to operate a fire extinguisher in case of fire.
• To report an incident as soon as possible to the immediate supervisor and HSE manager.
• To complete HSE trainings as instructed.

Disclaimer for recruitment agencies: Fugro does not accept any unsolicited applications from recruitment agencies. Acquisition to Fugro Recruitment or any Fugro employee is not appreciated.

Brazil

Senior DevOps Engineer – Public Cloud

Ensono

Ensono delivers complete Hybrid IT solutions, from mainframe to cloud, tailored to each client’s journey.

DevOps Engineer · 103 days ago

Full Time · Remote · Team 1,001-5,000 · H1B Sponsor

• Act as the lead engineer on large or complex client deploy projects
• Act as the lead engineer on product engineering projects to further develop our service
• Provide Public Cloud subject matter expertise advice and assistance to pre-sales architects
• Installation, configuration, and ongoing management of O/S and Public Cloud services.
• Implementing and maintaining custom client machine images and centralized configuration management to enable quick repeatable builds and auto-scaling.
• Maintaining high availability through proactive measures.
• Troubleshooting and resolving complex technical issues.
• Ensuring clients & partners are updated on current status of work and issues.
• Raise, investigate and resolve problems and known errors in line with the problem management process.
• Ability and willingness to proactively improve ways of working, automation and processes via our continual improvement Kanban board.
• Partake in an on-call rotation to provide 24x7 4th line support for Ensono’s Public Cloud clients.
• Proactively keep up to date on Public Cloud services and developments.
• Buddy new members of the team and train other teams
• Act as team leader and technical escalation point in Manager’s absence

Ansible · AWS · Azure · Cloud · Docker · EC2 · Google Cloud Platform · Kubernetes · Linux · Perl · Python · Terraform

United States

$112K - $163K / year
