Senior DevOps Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 51-200

Location

Worldwide

Posted

2 days ago

Salary

0

Seniority

Senior

Job Description

Acclaim

Role Description

We are looking for a DevOps/SRE Engineer to strengthen our team.

Qualifications

- Minimum 5 years of experience in a DevOps and/or Site Reliability Engineering role
- Strong hands-on experience with Linux system administration
- Extensive experience deploying, operating, and scaling Kubernetes in both cloud and bare-metal environments
- Deep expertise and practical experience with at least one major cloud provider (preferably Google Cloud Platform)
- Experience with ML inference on GPU/CPU is a strong plus
- Proven experience implementing SRE practices and building observability stacks using Grafana, Prometheus, and Loki
- Strong adherence to GitOps, Infrastructure as Code (IaC), and CI/CD principles
- Advanced expertise in Terraform, Ansible, and Python
- Comfortable working in high-uncertainty environments
- Proactive mindset: ability to look beyond DevOps tasks and actively debug and understand the product
- Strategic thinking: ability to choose technologies and architectural approaches based on long-term goals rather than short-term compromises

Requirements

- Deploy, operate, and evolve a microservices-based platform running in Kubernetes clusters across AWS, GCP, and on-prem (Rancher)
- Operate and support GPU-based ML inference services (Triton Inference Server, vLLM) deployed on RunPod, Scaleway, and Nebius
- Build and maintain Docker images for all microservices and ensure a stable service lifecycle
- Maintain and scale development and production Kubernetes clusters; actively participate in deployment debugging, incident investigation, and performance troubleshooting
- Develop, maintain, and evolve custom Helm charts for each service
- Design and operate CI/CD pipelines using GitHub (code and pipelines) and GitLab for on-prem customer deployments
- Ensure platform compliance with SOC 2 requirements and actively contribute to improving security and compliance processes
- Manage cluster access via NetBird VPN, implementing role-based access control using group policies
- Deploy and manage infrastructure using IaC practices with Terraform and Ansible
- Develop and continuously improve observability systems:
  - Grafana & Prometheus for metrics
  - ELK stack for centralized log storage and analysis
- Continuously optimize infrastructure in the areas of IaC, IAM, Observability, and CI/CD
- Work with a technology stack including Python, Kubernetes, Linux, Docker, GitHub CI/CD, PostgreSQL, ClickHouse, Kafka, Superset, Terraform, and Ansible

Benefits

- The team has built award-winning AI products for tech corporations: devices, voice assistants, products that are actually in the world
- Cutting-edge tech stack: Speech Technologies, NLP, Generative AI (LLMs, diffusion models), voice-first agentic architecture with privacy-first and on-premises deployment
- High engineering bar and real ownership: the team cares about what actually works in production, not what looks good in a demo, and you'll see the impact of your work directly
- Fast career progression: a senior-heavy team and a high volume of real problems mean you grow faster than you would anywhere else
- Startup pace with enterprise stability: real clients, real revenue, no bureaucracy
- Fully remote
- 21 vacation days + public holidays + 5 sick days
- Private English lessons via Preply

More DevOps Engineer Jobs

Design for Safety and Reliability Practitioner

Schneider Electric

With a foundation that dates back to 1836, Schneider Electric has developed into a worldwide specialist in energy management. In the past, the company has hired

DevOps Engineer | 2 days ago

Role Description

Join us as a Senior Specialist, Project Quality and play a pivotal role in delivering safe, reliable, and customer-focused products. You'll lead quality initiatives across the product lifecycle, from design through launch, ensuring excellence at every stage.

- Drive quality planning and risk management throughout new product development to ensure successful, compliant launches.
- Partner with design, industrialization, and supply chain teams to embed quality standards and customer insights.
- Establish and monitor quality goals for products, suppliers, and manufacturing sites using structured OLM methodologies.
- Provide expert guidance on quality tools, risk mitigation, and continuous improvement to project teams.

Qualifications

- Strong collaboration skills with the ability to influence and align cross-functional teams.
- Strategic mindset that balances quality standards with business objectives and customer needs.
- Proactive approach to identifying risks and driving preventive actions before issues escalate.
- Comfort working independently in ambiguous situations while knowing when to seek guidance.

Requirements

- Customer Experience Improvement (advanced level): translating field insights into actionable design and quality enhancements.
- Industrialization Quality Preparation (intermediate level): ensuring readiness for pre-series and launch phases.
- Risk Management (intermediate level): identifying and mitigating product and process risks across the project lifecycle.
- Quality Management Systems (intermediate level): implementing standards and ensuring compliance with OLM processes.
- Supplier Quality Management (intermediate level): driving quality performance with external partners and vendors.
- Problem Solving (intermediate level): applying structured methodologies to resolve complex quality challenges.
- Project Management (intermediate level): coordinating quality activities across multiple functions and timelines.
- Design for Safety and Reliability (intermediate level): embedding safety and reliability principles into offer development.

Benefits

- Meaningful work where your quality leadership directly impacts customer satisfaction and product safety.
- Collaborative culture that values expertise, innovation, and continuous learning.
- Opportunity to influence quality strategy across global product launches.
- Professional growth through exposure to advanced methodologies and cross-functional projects.

India

Site Reliability Engineer

Bright Vision Technologies

Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications. We recognize that our people are our strength. We are an equal opportunity employer and place a high value on diversity and inclusion. We do not discriminate on the basis of any protected attribute. We make reasonable accommodations for applicants’ and employees’ religious practices and beliefs, as well as mental health or physical disability needs. Bright Vision Technologies is an Equal Opportunity Employer, including Disability/Veterans.

DevOps Engineer | 2 days ago

Role Description

We are seeking an experienced Site Reliability Engineer to ensure the availability, performance, and operational excellence of large-scale distributed systems in production. As an SRE you will live at the boundary between development and operations, applying strong software engineering principles to infrastructure and operations problems, and continually pushing the platform toward higher reliability with lower operational toil. The ideal candidate will combine deep systems knowledge with strong programming skills, a measurement-driven mindset, and the discipline to design, automate, and operate complex services so that reliability becomes a first-class engineering deliverable rather than a reactive concern.

Key Responsibilities

- Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services.
- Lead incident response and resolution for production issues, acting as a calm and effective incident commander when needed.
- Ensure high-quality post-incident reviews that drive lasting improvements.
- Design and implement comprehensive monitoring, logging, and tracing strategies using tools like Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar.
- Build and maintain robust on-call processes, runbooks, and escalation paths.
- Automate operational toil aggressively by writing production-grade tooling in Python, Go, Bash, or similar languages.
- Architect and operate large-scale Kubernetes clusters and container-based workloads.
- Design CI/CD pipelines that promote safe, frequent, and observable releases.
- Lead capacity planning and performance engineering activities.
- Partner closely with application development teams to embed reliability practices early in design.
- Strengthen the platform’s resiliency through chaos engineering and fault injection.
- Drive continuous improvement of security posture in collaboration with security teams.
- Contribute to the technical roadmap for reliability tooling and observability platforms.
- Mentor engineers across the organization on SRE practices.

Qualifications

- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
- Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
- Strong programming skills in at least one of Python, Go, or Java.
- Deep, hands-on experience operating Linux at scale.
- Production experience operating Kubernetes and container-based workloads.
- Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
- Hands-on experience designing and operating CI/CD pipelines.
- Solid understanding of distributed system design.
- Demonstrated experience leading incident response and conducting effective post-incident reviews.
- Excellent communication and documentation skills.

Preferred Qualifications

- Experience defining and operationalizing SLOs and error budgets in real production environments.
- Exposure to chaos engineering practices and tools such as Chaos Monkey, Gremlin, or Litmus.
- Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
- Background in capacity planning, performance engineering, or large-scale load testing.
- Familiarity with service mesh technologies such as Istio, Linkerd, or Consul.

How to Apply

Would you like to know more about this opportunity? For immediate consideration, please send your resume to [email protected] or contact us at (908) 505-3899. Learn more about Bright Vision Technologies at www.bvteck.com.

United States
$100K - $150K / year
Full Time | Remote | Team 51-200 | Since 2018 | H1B Sponsor

- Build and operate the AWS GovCloud environment for federal customers
- Design and implement infrastructure-as-code for dedicated environments
- Own the container image pipeline for government deployment
- Identify and address availability risks and monitoring gaps
- Collaborate with assessment partners for FedRAMP documentation
- Enable product engineers to enhance features across environments
- Define separation of compliance functions from engineering operations
- Support federal customers in CMMC environment with escalations and support issues

Massachusetts
$210K - $220K / year
Senior Video Streaming DevOps Engineer

NBCUniversal

Here you can create the extraordinary. Join us.

DevOps Engineer | 2 days ago
Full Time | Remote | Team 10,001+ | Since 2004 | H1B Sponsor

- Automation and reliability of NBCU’s Live sources
- Delivery of 200+ NBC and Telemundo stations and live events
- Build automation solutions to deploy and maintain applications
- Work on high-visibility projects
- Collaborate with 3rd party vendors
- Develop Infrastructure as Code
- Participate in troubleshooting activities
- Analyze current technology and develop improvement processes
- Develop robust CI/CD pipelines
- Participate in an on-call rotation for L2 support

New York
$110K - $135K / year