Job Closed

This listing is no longer active.

Collaboration.Ai

Connecting the right people around the right ideas.

DevOps Engineer — AWS, Observability & Security

DevOps Engineer | Full Time | Remote | Mid Level | Team 11-50 | Since 2016 | H1B No Sponsor

Location

United States

Posted

85 days ago

Salary

Seniority

Mid Level

AWS Terraform Kubernetes Amazon EKS Docker Helm Argo CD GitHub Actions Datadog OpenTelemetry TLS AWS Secrets Manager Amazon RDS

Job Description

Who we are

Collaboration.Ai is a mission-focused, AI-powered software and services company based in Minnesota, with employees, partners, and customers around the world. We unite people, technology, and purpose to accelerate breakthroughs that transform industries, empower communities, and create a more sustainable future. We collaborate with customer teams across a broad spectrum of public and private sector organizations, helping them navigate complex challenges and drive transformative change. To learn more about us, visit collaboration.ai.

Our product lineup

- NetworkOS is an AI-powered platform that aligns people, purpose, ideas, and expertise in real time, generating actionable insights to propel movements forward.
- CrowdVector is an integrated solution marketplace and innovation management platform that rapidly uncovers new ideas and advances breakthroughs to fuel movements.

About the Role

You’ll own CAI’s cloud infrastructure end-to-end: the AWS footprint, CI/CD pipelines, container orchestration, and observability platforms that let our engineering teams ship with confidence. This isn’t a “keep the lights on” role. You’ll architect infrastructure that meets federal compliance requirements by design, build out our Datadog monitoring from the ground up, and lay the foundation for AI/ML workloads.

You’ll work closely with our Product Dev teams, Security and Compliance, and Platform Architect to keep our infrastructure secure, efficient, and developer-friendly, supporting customers who turn ideas, networks, and data into real outcomes.

High autonomy. Real ownership. Infrastructure that matters.

What You’ll Do

- Architect AWS infrastructure: Design and manage our cloud footprint for scalability, high availability, and security controls aligned with NIST 800-53 and SOC 2
- Build reliable CI/CD pipelines: GitHub Actions with automated quality gates (testing, coverage, security scanning) that let developers ship to Kubernetes with high confidence
- Codify everything in Terraform: Establish IaC standards, enforce peer review, detect drift, and maintain consistent patterns across dev, nonprod, and prod
- Run Kubernetes at scale: Operate and optimize EKS clusters with zero-downtime upgrades, right-sized resources, and security policies
- Build out Datadog observability: Create dashboards, alerts, and integrations that give teams actionable insight into infrastructure health
- Embed security in pipelines: Snyk scanning, container image validation, AWS security baselines, and compliance-ready documentation
- Support AI workloads: Build infrastructure for LLM Ops, including compute scaling, model deployment pipelines, and cost optimization

Our Tech Stack

- Cloud: AWS (EKS, RDS, S3, KMS, Secrets Manager, VPC, ALB, CloudTrail, Security Hub)
- IaC: Terraform, GitOps workflows
- Containers: Kubernetes (EKS), Docker, Helm, Argo CD
- CI/CD: GitHub Actions, Codecov, Amazon ECR
- Security: Snyk (SAST, container scanning, dependency scanning), AWS security controls, TLS 1.3
- Observability: Datadog (infrastructure monitoring, dashboards, alerts), OpenTelemetry
- Compliance: NIST 800-53, NIST 800-171, SOC 2, CMMC Level 2, FedRAMP High readiness
- Future: AWS GovCloud (IL2/IL4)

What We’re Looking For

Must-Haves

- 5+ years of hands-on DevOps or Platform Engineering experience with AWS
- Production Terraform (or OpenTofu) expertise for infrastructure as code
- Strong Docker and Kubernetes (EKS) experience in production
- Experience implementing security controls aligned with NIST 800-171 or NIST 800-53
- Hands-on CI/CD pipeline design (GitHub Actions preferred)
- Experience with observability platforms (Datadog strongly preferred)
- Understanding of GitOps practices and deployment automation
- US citizenship (required for DoD contracting and FedRAMP compliance)

Nice-to-Haves

- Location in the Minneapolis/Saint Paul, MN area for coworking opportunities; remote candidates are still encouraged to apply
- LLM Ops or AI/ML infrastructure experience
- Advanced AWS certifications (DevOps Engineer Professional, Security Specialty)
- High-growth startup or B2B SaaS background

Why Join Collaboration.Ai?

Modern stack, real problems. Kubernetes, Terraform, Datadog, GitHub Actions: no legacy infrastructure, no manual deployments. You’ll build on tools you actually want to use.

AI-native culture. We build with AI, not just for AI. Claude Code, agentic workflows, and AI-assisted infrastructure automation are how we work daily. You’ll be expected to push the boundaries of what’s possible with AI tooling in your domain, and you’ll have the freedom to do it.

Work that matters. Our customers span defense, pharma, aerospace, global enterprises, and universities: complex organizations where the stakes are real. The infrastructure you build will meet the most demanding compliance requirements in the industry: FedRAMP High, CMMC Level 2, SOC 2.

Own it. Small, senior team. High autonomy. Your infrastructure decisions are visible, valued, and directly tied to company outcomes.

More DevOps Engineer Jobs

Senior Site Reliability Engineer

Veterinary Emergency Group (VEG)

DevOps Engineer | 85 days ago

Full Time | Remote | Team 1,001-5,000

ABOUT VEG

In 2014, VEG was born with a mission to help people and their pets when they need it most by challenging norms and fixing the ER experience. Since then, we’ve expanded rapidly, with hospitals nationwide open 24/7/365, and created an ER experience that focuses on what our pets and pet parents really need. We’ve done the same for our people (VEGgies), finding a way to say YES so they are empowered to achieve great things, grow in unexpected ways, and find a place where they truly belong.

We’re rethinking emergency care from every angle, from how we run our hospitals to how we support the people working inside them. That’s where our headquarters team comes in. Whether building technology to make our hospitals more efficient, recruiting and growing incredible VEGgies, or bringing our brand to life through marketing, our VQ (VEG Headquarters) team makes it all possible, ensuring our hospitals and people have everything they need to help pets and their families.

VEG is a 2025 and 2026 certified Great Place to Work®.

THE JOB

We are looking for a Senior Site Reliability Engineer who understands that at VEG, "reliability" is a medical necessity: if our proprietary platform, DogByte, goes down, a pet's life could be at risk. You will be the primary lead for our platform's resilience, transforming our infrastructure into a self-healing system that empowers our medical teams to provide 24/7/365 life-saving care.

You will spend your time bridging the gap between high-level architectural strategy and hands-on technical "surgery," ensuring our engineering teams can build at pace while the foundation remains rock-solid. You will evolve and strengthen an existing system that must meet the demands of VEG’s hospital expansion, ensuring our infrastructure never limits our ability to open new hospitals or provide medical care.

You will own the ongoing stability of DogByte, scaling it from its current state into a robust enterprise platform where one hospital's traffic is isolated and does not impact another's experience.

This role can be based at our VQ in White Plains or may be open to remote work.

WHAT YOU’LL DO

- Formulate short- and long-term strategies to ensure DogByte withstands year-over-year volume increases, specifically solving for hospital-to-hospital traffic isolation
- Work with engineers to ensure data flows (from client to API to database) are configured for high concurrency and maximum reliability
- Build automated processes to handle high-traffic spikes and automatically remediate common system errors
- Set up monitoring and alerting to identify latency throughout the stack and resolve issues before they impact hospital operations
- Establish and meet SLOs for high availability, ensuring our engineers can build products without worrying whether the system can support them

WHAT YOU NEED

- Bachelor’s Degree preferred or equivalent experience
- 5+ years in SRE/DevOps roles, expertly handling high-concurrency environments
- Deep understanding of the AWS ecosystem managed entirely through Infrastructure as Code
- Expertise in traffic management, including load balancing techniques, Nginx configuration, and autoscaling to handle volatile patterns
- Technical leadership in observability, establishing the tracing frameworks and monitoring required to diagnose latency issues and ensure high availability across the entire request lifecycle
- Direct experience with technologies relevant to our technical stack, which currently includes: AWS ECS, Terraform, Nginx, PostgreSQL (RDS), Python

WHO YOU ARE

- Empathetic, instinctively taking a people-centric approach, whether supporting your colleagues or making an effort to understand different perspectives
- Have a sense of humility: acknowledging mistakes, sharing credit with others, and lifting up your team’s accomplishments
- Feel a strong sense of ownership over your work, taking responsibility for outcomes and staying committed to achieving long-term, impactful results
- Curious by nature; you ask insightful questions and continuously seek out opportunities to learn and grow your skills and knowledge

HOW WE INVEST IN YOU

- Competitive compensation ($170,000 - $200,000) plus bonus and benefits
- Comprehensive health and wellness benefits that start on day one, and access to free therapy or counseling
- Paid parental leave, up to 10 weeks at 100% of regular salary, and inclusive fertility and family-building care for all types of families
- Unlimited PTO to use for vacation or sick days, however you need it!
- Generous employee referral program, so our awesome people can bring in more awesome people
- And the little (big) things, like casual office attire, the ability to bring your fur baby to work, cool VEG swag, food in the fridge for when you’re hungry, and free lunches twice a week!
- Company laptop and a monthly cell phone reimbursement

DEI

At VEG, diversity is not just a word; it's a strength that fuels innovation and kindness. Our mission is “Helping people and their pets when they need it most.” And we do that better when our VEGgies (employees) feel valued, respected, and empowered to bring their authentic selves to work. That's why we're devoted to creating an environment that reflects the diverse communities we serve, where different perspectives are not only welcomed but celebrated. We are focused on providing equitable opportunities for growth, promoting inclusive decision-making, and ensuring that everyone's perspective is considered. Saying yes to VEG means helping us build a culture where your unique experiences and background contribute to a shared vision: being the world’s veterinary emergency company.

AWS Terraform Nginx PostgreSQL Python Amazon ECS Infrastructure as Code

United States

$170K - $200K / year

Senior SRE Analyst

Inmetrics

We make a difference, solve outstanding problems and make the digital transformation of our clients possible.

DevOps Engineer | 85 days ago

Full Time | Remote | Team 501-1,000 | Since 2002 | H1B No Sponsor

• Responsible for providing technical leadership for the team, guiding the technical execution of SRE activities to ensure the availability, scalability, performance, and security of the company’s systems and infrastructure.
• Promote DevOps culture and implement and manage automation, monitoring, and orchestration tools and processes for infrastructure and applications, working closely with development, infrastructure, and information security teams.
• Perform incident analysis, identify problems and implement solutions to prevent recurrence, keeping the environment stable and reliable for users, as well as performing other related duties inherent to the role.
• Act as an SRE responsible for maintenance, observability, troubleshooting, developing solutions and automations, reducing incidents, and lowering MTTR.

AWS Docker Grafana Apache Kafka Kubernetes Linux Oracle Database Prometheus Python RabbitMQ SQL

Brazil

Job Closed

Site Reliability Engineer

GEOTAB

The world’s #1 telematics provider, committed to advancing technology, empowering businesses and making the roads safer!

DevOps Engineer | 85 days ago

Full Time | Remote | Team 1,001-5,000 | Since 2000 | H1B Sponsor

• Ensure the availability, reliability, and performance of Geotab's core products for our customers.
• Act as a primary escalation point for critical production application/product issues.
• Rapidly troubleshoot complex problems across the application stack, utilizing observability tools to identify root causes.
• Coordinate effectively with development, infrastructure, and other technical teams during incidents to implement fixes and restore service swiftly.
• Clearly communicate incident status, impact, and resolution steps to internal stakeholders.
• Collaborate with team members to improve monitoring tools, dashboards, and alerting mechanisms for proactive detection of issues impacting Critical User Journeys (CUJs) within the application/product and computing architecture.
• Monitor application/product and system health proactively using a combination of tools to ensure high availability and adherence to Service Level Objectives (SLOs) / Service Level Agreements (SLAs).
• Identify opportunities and implement automation tools/scripts to streamline routine operational tasks, reduce manual effort (toil), and improve response times.
• Conduct system tests to validate performance, reliability, and successful remediation of issues.
• Recommend design and process enhancements based on operational experience to improve overall application reliability and maintainability.
• Participate in post major incident reviews (PMIRs) to analyze disruptions, document findings, track corrective actions to prevent recurrence, and identify areas of improvement for incident response processes.
• Contribute to building a culture of learning from incidents.
• Participate in a 24x7 on-call rotation to provide timely support for critical issues outside of business hours.

Ansible Apache HTTP Server AWS Azure BigQuery DNS GCP Grafana Kubernetes Linux PostgreSQL Prometheus Python TCP/IP .NET

United States

Job Closed

DevOps & Cloud Engineer | REMOTE

OnTrac

Headquartered in Chandler, Arizona, OnTrac is a package delivery company that provides overnight delivery services at ground rates to millions of consumers. This company offers a f

DevOps Engineer | 85 days ago

Full Time | Remote

OnTrac is hiring a DevOps & Cloud Engineer!

Are you eager to join a dynamic and expanding company where you can both learn and make a meaningful impact? If you possess a strong sense of empathy, enjoy assisting others, thrive in a fast-paced environment, and excel at problem-solving, we encourage you to apply today to connect with a recruiter!

Founded in 1986, OnTrac has evolved into the leading provider of same-day and next-day delivery services in the U.S. for premier e-commerce and product-supply businesses, including five of the largest retailers in the U.S.

Location: REMOTE
Pay: $104,800 to $131,000 / year depending on experience and qualifications
Shift: Monday through Friday from 8:00am to 5:00pm (participation in the after-hours on-call rotation is required to ensure 24/7 system reliability)

Employment Logistics:

The DevOps & Cloud Engineer collaborates closely with the dev teams to design our CI/CD pipelines while serving as a core member of the Cloud Engineering team, and will help establish a true DevOps culture, ensuring that our infrastructure is resilient, scalable, and follows a "Security-First" philosophy. This versatile engineer joins our IT Engineering team during a pivotal cloud transformation. As we consolidate our workloads into GCP, the DevOps & Cloud Engineer will be a critical contributor in evolving our Azure DevOps environment into a mature, automated CI/CD ecosystem.

Unpacking the Benefits:

We offer a comprehensive benefits package designed to support your health, financial security, and life outside of work.

Health & Protection
- Medical, Dental, and Vision insurance; HSA and FSA options
- Life and Disability coverage (basic and voluntary)
- Voluntary Accident, Critical Illness, Identity & Fraud Protection, Auto & Home, and Pet Insurance

Financial & Future
- Competitive benefits and 401(k) with company match
- Referral Bonus Program: up to $500 per referral!

Time Away & Leave
- Paid Vacation, Sick Leave, Floating Holidays, and Parental Leave
- Paid Holidays

Work & Life Support
- Employee Assistance Program
- Safe and clean work environment

The Must-Haves:

- Experience: 5+ years in DevOps, Systems Engineering, or Cloud Architecture.
- Cloud Expertise: Strong experience with GCP is highly preferred, specifically regarding IAM and VPCs. Experience with GKE (Google Kubernetes Engine) is a plus, as we look to explore containerization in the future.
- Linux Proficiency: Strong Linux administration skills are required. We value a Linux-first approach to automation and troubleshooting.
- Pipeline Expertise: Proven experience building and managing pipelines in Azure DevOps (pipelines, releases, boards).
- Automation Tools: High proficiency with Terraform and Ansible.
- Development Knowledge: Familiarity with PHP and .NET environments. The ability to understand application dependencies and troubleshoot build errors is essential.
- Security Mindset: Understanding of "Secure-by-Design" principles, including secret management and identity-based access.

Your Mission in Motion:

- CI/CD Development: Collaborate on the design, implementation, and maintenance of robust CI/CD pipelines in Azure DevOps for both application code (PHP, .NET) and Infrastructure as Code (Terraform).
- GCP Cloud Engineering: Play a key role in our migration from other platforms into Google Cloud Platform (GCP), focusing on stability and performance.
- Infrastructure as Code (IaC): Standardize and scale our infrastructure using Terraform for provisioning and Ansible for configuration management and orchestration.
- Mentorship & Culture: Act as a subject matter expert, helping to upskill the team on DevOps best practices and modern deployment strategies.
- Automation & Consistency: Eliminate manual configuration drift by enforcing "hardened" infrastructure baselines through automation and consistent orchestration.
- Reliability & On-Call: Partner with the team to define critical system metrics and participate in the critical incident on-call rotation to ensure 24/7 system reliability.

Paving your way to your success:

- Flexible and adaptable to learning and understanding new technologies.
- Strong written and oral communication and interpersonal skills.
- Highly self-motivated and self-directed with a keen attention to detail.
- Openness to direction and constructive criticism for self-improvement.
- Proven analytical and problem-solving abilities.
- Ability to effectively prioritize and execute tasks in a high-pressure environment.
- Ability to work both independently and in a team-oriented, collaborative environment.
- Ability to proactively seek solutions, take ownership of tasks, and drive personal and professional growth without constant external direction.

If you are excited to be part of our team and grow with our OnTrac family, we invite you to apply!

OnTrac is proud to be an Equal Opportunity Employer

OnTrac is an equal-opportunity employer. We value diversity and welcome applications from individuals of all backgrounds, abilities, and experiences. We do not discriminate based on race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or age. Join us in our commitment to creating a diverse and inclusive workplace. If you are excited to be part of our team and contribute to our talent acquisition efforts, we invite you to apply.

Lasership, Inc. dba OnTrac Final Mile with its affiliates, including OnTrac Logistics, Inc. (collectively, "OnTrac" or the "Company") is an equal opportunity employer.

GCP IAM Google Kubernetes Engine Linux Azure DevOps Terraform Ansible PHP .NET

United States

$104K - $131K / year
