HighlightTA

Your on-demand talent team.

Senior DevOps Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 1-10 | Since 2024 | H1B No Sponsor

Location

Canada

Posted

50 days ago

Salary

$122.7K - $150.0K / year

Seniority

Senior

Job Description

  • Partner with product and platform engineering teams to improve system reliability, scalability, and developer experience
  • Build, maintain, and evolve CI/CD pipelines to support safe, fast, and reliable deployments
  • Improve observability through better monitoring, alerting, logging, and telemetry
  • Implement and maintain Infrastructure as Code (Terraform) to manage cloud resources safely and reproducibly
  • Operate and scale containerized workloads (Kubernetes, Docker)
  • Support and evolve Zipline's AWS-based cloud infrastructure (experience with GCP or Azure is a plus)
  • Assist our Data team in codifying and maintaining our data warehouse and ML infrastructure on GCP
  • Participate in an on-call rotation, responding to and resolving production incidents
  • Contribute to incident follow-ups and postmortems by helping implement durable fixes and reducing operational toil
  • Collaborate with Rails-focused product teams to improve reliability, performance, and deployment workflows

Job Requirements

  • Production experience operating and supporting cloud-based systems
  • Proficiency with at least one programming or scripting language (Ruby preferred; TypeScript/JavaScript a plus)
  • Experience working with a major cloud provider (AWS preferred)
  • Hands-on experience with Infrastructure as Code tools (Terraform, Pulumi, or similar)
  • Familiarity with CI/CD systems (CircleCI, GitHub Actions, GitLab CI, Jenkins, etc.)
  • Experience with containerization and orchestration (Docker, Kubernetes, ECS)
  • Working knowledge of observability practices and tools (Datadog, Sentry, Prometheus, etc.)
  • Strong communication skills and a collaborative, service-oriented mindset
  • Comfort working as a remote contributor—managing time, communicating clearly, and delivering reliably.
  • Pride in building systems and tooling that are maintainable, well-documented, and easy for others to use.

Benefits

  • Remote-first culture: Join a high-performing, fully remote team and work where you're comfortable.
  • Stock Options: Get meaningful ownership in a fast-growing, venture-backed company shaping the future of retail.
  • Time off: Our flexible time-off policy gives you the freedom to take the breaks you need, when you need them.
  • Benefits: World-class medical, dental, and vision policies.
  • Team Connection: Annual company off-sites in fun locations.
  • Volunteering: Every quarter, Zipliners get a paid day off to volunteer for a nonprofit of their choice.
  • Learning: We support continuous learning and provide unlimited access to our Udemy Business account.
  • Great humans, great work: Work with kind, collaborative teammates who care about doing meaningful work and making a real impact.

More DevOps Engineer Jobs

Full Time | Remote | Team 11-50

About 10a Labs

10a Labs is the safety and threat-intelligence layer trusted by frontier AI labs, AI unicorns, Fortune 10 companies, and leading global technology platforms. Our adversarial red teaming, model evaluations, and intelligence collection enable engineering, safety, and security teams to stay ahead of evolving threats and deploy AI systems safely.

3–8 Years of Industry Experience | Remote | High-Impact

About the Role

We’re looking for an infrastructure-focused engineer who thrives at the intersection of machine learning, systems, and product delivery. This is a hands-on role responsible for deploying, monitoring, and scaling a real-time ML-powered content moderation system used to detect and triage abuse, threats, and edge-case language. You’ll work closely with ML engineers, researchers, and clients to build infrastructure that makes high-performance models accessible and reliable in the wild.

In This Role, You Will

  • Design and maintain cloud infrastructure (GCP or AWS) to support real-time model serving, data ingestion, and evaluation workflows.
  • Deploy and optimize APIs for low-latency access to ML models and embedding search systems.
  • Manage and optimize the end-to-end training data flow, from sourcing and cleaning datasets to preparing them for model consumption, ensuring accuracy, scalability, and efficiency.
  • Build observability tooling for production ML pipelines (monitor latency, error rates, request volumes, drift).
  • Automate model deployment, retraining, and evaluation pipelines (CI/CD for ML).
  • Work with ML engineers to package models for serving.
  • Help manage vector databases and semantic search infrastructure (e.g., Pinecone, FAISS, Vertex Matching Engine).
  • Ensure security, compliance, and uptime of infrastructure supporting safety-critical systems.

We’re Looking for Someone Who

  • Has 3–8 years of experience deploying machine learning systems or high-availability backend systems.
  • Has shipped and maintained production infrastructure at scale, supporting ML workflows.
  • Has experience with GCP, AWS, or similar platforms (including managed ML services).
  • Is proficient in Terraform, Docker, Kubernetes, or similar infra tools.
  • Understands performance tradeoffs in serving models and embedding search pipelines.
  • Can work cross-functionally with ML, security, and product teams to deploy safely and iterate fast.
  • Brings a builder's mindset and bias for ownership in ambiguous environments.

Nice to Have Experience With

  • Experience with vector databases or ANN systems, preferably within GCP (or AWS).
  • Experience serving LLMs or embedding-based models via API.
  • Experience with model monitoring, logging, and metrics platforms (e.g., Prometheus, Grafana, Sentry).
  • Familiarity with trust & safety infrastructure, abuse detection, or policy enforcement systems.

What Success Looks Like in the First 3 Months

  • You’ve deployed and monitored a real-time ML inference system with well-defined observability.
  • You’ve implemented an API with latency under 200ms for embedding or classifier-based inference.
  • You’ve partnered with ML engineers to streamline deployment and retraining workflows.
  • You’ve built logging and monitoring that gives insight into system performance and classifier behavior.

Compensation & Benefits

  • Salary Range: $130K–$230K, depending on experience and location.
  • Bonus: Performance-based annual bonus.
  • Professional Development: Support for continuing education, conferences, or training.
  • Work Environment: Fully remote, U.S.-based.
  • Health Benefits: Comprehensive health, dental, and vision coverage.
  • Time Off: Generous PTO and paid holiday schedule.
  • Retirement: 401(k) plan.

United States
$130K - $230K / year

Location: Austin, Texas; Reston, Virginia; or Fully Remote

About the Opportunity

We are seeking a DevOps Manager with deep expertise in Kubernetes, Terraform, and Ansible to help scale Seekr’s AI platform across on-premises, cloud, and SaaS environments. You’ll be highly hands-on, juggling multiple projects, mentoring engineers, and driving complex initiatives to deliver robust, scalable, and reliable systems. On-prem experience is highly preferred.

This role demands a strong foundation in Linux, networking (both traditional and Kubernetes), container technologies, and automation. You’ll collaborate closely with software engineering teams, own critical infrastructure, and solve challenging operational and scalability problems in fast-paced, dynamic environments.

From your first day, you will make a valuable, and valued, contribution. We are a fast-growing company where no one is a bystander. We offer you the opportunity to delight millions of consumers around the world while gaining meaningful experience across a variety of disciplines.

Duties and Responsibilities

  • Lead development of solutions to complex reliability, performance, and scaling challenges.
  • Design, architect, and implement systems, networks, and services powering Seekr’s platform.
  • Provide hands-on leadership and mentorship to the team.
  • Partner with software engineering teams to build scalable, efficient, and reliable services.
  • Identify and resolve operational inefficiencies through automation.
  • Troubleshoot and lead response to deployment and production incidents.
  • Implement and enforce security best practices, ensuring infrastructure, deployments, and data are protected at every stage.

Skills and Qualifications

  • Technical Leadership: 12+ years of experience, proven ability to deliver results in a high-pressure, dynamic environment, communication skills, roadmap and long-term strategy, mentoring senior engineers.
  • Kubernetes & Distributed Systems: Enterprise-scale K8s with custom operators/controllers, multi-platform clusters, hybrid fleet orchestration across cloud & edge, K8s control plane, K8s upgrades, Docker, containerd, CRI-O, Ingress Controllers (Istio, NGINX, Traefik), K8s Databases, Helm charts.
  • Database Management: Postgres, ElasticSearch/OpenSearch, Kubernetes databases, Stateful sets.
  • Networking: L2/L3 protocols (BGP, OSPF, VLANs, IPSec), VPNs, firewalls, redundancy paths, bare-metal Linux networking, CoreDNS, Calico, K8s service mesh (Istio).
  • Infrastructure Automation: Ansible, Terraform, CI/CD Pipelines, GitLab, ArgoCD, MAAS, scripting (Python, Golang, Bash), AWS, Azure.
  • Observability: Grafana, Prometheus, Loki, Tempo, ELK, OTEL.
  • Security: Zero-trust architecture, PKI, mTLS, SPIFFE/SPIRE, certificate automation, CVE remediation, secrets management, IAM.
  • Incident Management & RCA: End-to-end incident lifecycle, root cause analysis, corrective action ownership.

About the Company

Seekr is a leader in explainable and trustworthy artificial intelligence designed to power mission-critical decisions in enterprises, government, and regulated industries. SeekrFlow™, our end-to-end AI platform, provides secure, auditable AI solutions tailored to sectors where transparency, accuracy, and compliance are paramount. Available across cloud, on-premises, and edge environments, SeekrFlow reduces bias, strengthens data integrity, and simplifies model oversight so organizations can rely on trusted AI decisions in high-stakes settings that impact society’s most sensitive and vital systems.

Trusted by leading enterprises and government agencies, we partner with defense, finance, telecom, and critical infrastructure leaders to enable AI solutions that drive real-world results with unmatched transparency and control. We are a team of strategic thinkers and problem-solvers tackling the toughest challenges facing critical infrastructure and global enterprises through best-in-class AI models and customer deployment.

Our team operates with unwavering commitment to our core values and mission:

  • We are driven by outcomes: our customers' success is what we strive for every day.
  • We believe trust is earned, which is why we build explainability and transparency into the entire AI lifecycle.
  • We take our responsibility to deliver secure AI seriously.
  • We believe innovation drives progress: we are building the technologies that power the systems our society depends on.

Company Benefits

  • Meaningful Mission & Impact: Work with a deeply talented, collaborative team solving some of the toughest AI challenges that matter.
  • Equity Ownership: RSUs that let you share directly in Seekr’s long-term success and growth.
  • Time Off That Respects Real Life: Unlimited PTO plus 14 paid company holidays to truly recharge.
  • Work Your Way: A flexible hybrid work environment with offices in Reston, VA and Austin, TX, plus remote options and flexible working hours.
  • Competitive Total Rewards: A role-appropriate compensation structure that supports long-term growth, including base salary, bonuses, or commission plans depending on role.
  • 401(k) with Company Match: Build your future with a retirement plan that includes employer matching.
  • Comprehensive Health & Wellness: Medical, dental, vision, and life insurance coverage starting day one, for you and your family.
  • Parental Leave: Paid parental leave to support employees as they welcome a new child through birth, adoption, or foster placement.

All locations: Texas | Virginia

Senior Engineer – Platform/DevOps

CloudWalk, Inc.

The interplanetary payment network.

DevOps Engineer | 50 days ago
Full Time | Remote | Team 201-500 | H1B No Sponsor

  • Design and evolve Kubernetes architectures across multi-cluster environments
  • Build and improve platform foundations involving networking, storage, scalability, and resilience
  • Manage and standardize deployments using Helm Charts, Kustomize, and GitOps practices
  • Provision and maintain infrastructure using Terraform
  • Drive performance tuning and security hardening across clusters and workloads
  • Support cloud-native environments on GCP with a strong architecture mindset
  • Help define and improve networking and security architecture for distributed systems
  • Work with AI beyond simple chat assistant usage, applying it as a practical accelerator for analysis, operations, and problem-solving
  • Create and maintain development and operational scripts to improve efficiency and reliability
  • Read and interpret code when needed to troubleshoot behavior, assess risks, and collaborate with engineering teams
  • Communicate clearly across technical and non-technical stakeholders, building trust and strong collaboration across teams

Brazil

DevOps Engineer

Inframark

Inframark's Operations and Maintenance team is an award-winning team that delivers cutting-edge water, wastewater, and public works services to municipalities, utility districts, and industries. We are dedicated to supporting our employees as well as protecting the environment and the communities we serve. You would be empowered to thrive in a dynamic, supportive, and innovative environment. Our dedication to sustainability and community impact drives us to ensure clean, safe water for future generations. Whether you're at the start of your career or looking for advancement, Inframark offers purpose-driven work and opportunities for growth.

DevOps Engineer | 50 days ago
Full Time | Remote | Team 1,001-5,000

Join Inframark: Pioneering Automation and Intelligence

Step into the future with Inframark's award-winning Automation and Intelligence team. We deliver cutting-edge solutions in instrumentation and controls, industrial cybersecurity, data analysis, and remote network operations center services for water and wastewater plants. Elevate your career and join us at Inframark. Apply today!

Why Work for Inframark?

Our dedication to sustainability and community impact drives us to ensure clean, safe water for future generations. Whether you're at the start of your career or looking for advancement, Inframark offers purpose-driven work and opportunities for growth. We offer an attractive salary package, including a generous benefits package with health, dental, and life insurance, 401(k) plan, paid time off, sick leave, holidays, and wellness plan.

Job Title: DevOps Engineer
Location: Remote (Eastern Time zone preferred - AWS GovCloud requirement)
Reports To: Sr. Director of Technology and Architecture

Position Overview

We're looking for a DevOps Engineer who takes ownership of infrastructure. You'll stabilize and modernize the infrastructure supporting WaterMinds, our cloud-based platform for water and wastewater utilities: implementing proper monitoring and alerting, upgrading production environments, establishing operational discipline, and enabling our engineering teams to ship with confidence. You'll follow DevOps best practices, proactively identify and solve problems, and drive infrastructure improvements with minimal direction.

The challenge: build and maintain infrastructure that can reliably serve hundreds of utility customers at scale. Your immediate focus is moving our infrastructure from reactive firefighting to proactive maintenance mode. As the platform matures and our data science team ramps up, you'll have the opportunity to transition into MLOps, building the infrastructure that enables machine learning at scale.

Key Responsibilities

  • Take ownership of production monitoring and alerting using Prometheus, Grafana, and CloudWatch; proactively identify issues before they become incidents.
  • Modernize production EKS cluster with GitOps practices (ArgoCD), comprehensive monitoring, and proper deployment workflows following industry best practices.
  • Streamline staging deployment process; eliminate branch-based workarounds and establish clean GitOps patterns.
  • Design infrastructure patterns that scale to hundreds of customers and own AWS infrastructure operations including patching, maintenance, cost optimization, and security compliance; stay ahead of requirements.
  • Expand into MLOps, building the infrastructure that enables data scientists to deploy models at scale across multiple utility customers once DevOps operations are automated.
  • Manage Kubernetes clusters (EKS) including pod migrations, resource optimization, troubleshooting, and security updates (proactively, not reactively).
  • Maintain infrastructure as code using Terraform and Ansible following best practices; all changes tested in non-production before deployment.
  • Support engineering teams with infrastructure needs, unblock them quickly, and establish self-service patterns where possible; anticipate needs, don't wait for requests.
  • Manage message queue infrastructure (Kafka/Redpanda) including retention policies, storage optimization, and performance tuning.
  • Document infrastructure, create runbooks, and automate operational tasks to move systems into maintenance mode.
  • Clean up technical debt: proactively identify infrastructure to decommission, resources to consolidate, and costs to optimize.

Qualifications

  • 5+ years of experience in DevOps, infrastructure, or site reliability engineering.
  • Demonstrated ability to take ownership and initiative: you see what needs to be done and do it without waiting for direction.
  • Deep knowledge of DevOps and infrastructure best practices: you know what good looks like and implement it proactively.
  • Strong Kubernetes experience (EKS preferred) including cluster management, deployments, services, and troubleshooting.
  • Hands-on AWS experience (EC2, EKS, ECS, RDS, VPC, IAM, CloudWatch, S3).
  • Infrastructure as code proficiency (Terraform and Ansible).
  • GitOps experience (ArgoCD, Flux, or similar).
  • CI/CD pipeline experience (Bitbucket Pipelines, Jenkins, GitHub Actions, or similar).
  • Monitoring and observability experience (Prometheus and Grafana preferred).
  • Python scripting ability for automation and tooling.
  • US citizenship (required for AWS GovCloud access).
  • Self-starter mentality: you identify problems and opportunities, then drive solutions to completion.
  • Proven track record of delivering tested, high-quality infrastructure changes on schedule.
  • Excellent communication skills: proactive about sharing status, raising blockers, and documenting decisions.

Bonus Points For

  • Curiosity about machine learning and interest in transitioning to MLOps as the platform matures.
  • Any MLOps or ML infrastructure experience (KServe, Kubeflow, SageMaker, model serving).
  • Experience with data pipelines, feature engineering, or supporting data science teams.
  • AWS GovCloud experience and understanding of compliance requirements (FedRAMP).
  • Experience with message queue systems (Kafka, Redpanda).
  • Container security and vulnerability scanning (Snyk).
  • Background in SaaS platforms, IoT, or critical infrastructure.

Inframark is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, age, religion, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against based on disability.

Learn more about us at Automation and Intelligence - Inframark

United States
Job Closed