Founded in 2005, Smartsheet offers collaborative work management and process automation to empower greater enterprise productivity. A leading cloud-based platform for work executio
Senior DevOps Engineer
Location
Bulgaria
Posted
4 days ago
Seniority
Senior
Job Description
Smartsheet
- Own and evolve the edge proxy platform: Maintain, upgrade, and extend a high-performance reverse proxy, including maintaining the proxy binary and its configuration tooling, writing Go and Python automation, managing the full container image lifecycle on hardened Linux base images, and working across the broader edge layer, including CDN, WAF, and traffic management capabilities.
- Build and maintain cloud infrastructure as code: Design and implement Terraform/Terragrunt modules and live environment configurations managing EKS clusters, load balancers, IAM roles, VPC networking, ECR registries, and supporting AWS services across multiple regions including GovCloud.
- Operate Kubernetes clusters at scale: Manage multi-region, multi-cluster EKS deployments via FluxCD GitOps workflows and Helm charts, including node AMI rotation, add-on lifecycle management, and horizontal pod autoscaling.
- Build and own CI/CD pipelines: Design, maintain, and improve shared GitLab CI/CD pipeline templates used across all team repositories; build and operate alternative pipeline workflows for isolated government cloud environments.
- Automate operational toil: Build and maintain tooling for tasks such as container image patching, EKS AMI rotation, air-gapped ECR image sync to GovCloud, and automated MR creation for monthly version-bump patching cycles.
- Manage observability and on-call: Provision and maintain Datadog SLOs, monitors, and dashboards via Terraform; participate in the team's on-call rotation responding to edge proxy incidents across production and GovCloud environments.
- Support FedRAMP/GovCloud operations: Operate the GovCloud environment with its unique constraints (air-gapped image distribution, infrastructure automation in isolated networks, and alert management with compliance-aware data handling).
- Evaluate and adopt internal developer tooling: Research, prototype, and drive the adoption of internal tools that improve engineering productivity across the company, including developer portals, platform self-service capabilities, and other tooling that raises the bar for the developer experience at Smartsheet.
- Mentor and collaborate: Share knowledge across the team through code reviews, architecture discussions, and runbook authorship; foster a culture of engineering excellence and operational rigour.
- Strategically apply AI tools: Apply and champion AI tools within your team's domain to improve project execution, infrastructure design, quality, and debugging, leading adoption of AI best practices.
Job Requirements
- 5+ years of experience in DevOps, platform engineering, or site reliability engineering.
- A BS or MS in Computer Science, Engineering, or a related field, or equivalent industry experience.
- Deep proficiency with Terraform and Terragrunt for managing production cloud infrastructure at scale across multiple environments and regions.
- Strong Kubernetes expertise, including EKS cluster operations and Helm chart authoring.
- Hands-on experience with AWS networking and container workload services: EKS, ALB/NLB, VPC, IAM, ECR, Route53, CloudWatch, and EventBridge.
- Proficiency in at least one general-purpose programming language — Go or Python preferred — for building operational tooling and automation.
- Solid understanding of reverse proxies, API gateways, or load balancers (NGINX, HAProxy, or equivalent).
- Experience designing and maintaining CI/CD pipelines (GitLab CI preferred), including shared template libraries across multiple repositories.
- Experience with container image security practices: hardened base images, vulnerability scanning, and image promotion workflows.
- Strong operational instincts: comfort with on-call responsibilities, incident response, runbook authorship, and postmortems in production environments.
- 1 year of professional experience leveraging AI-based workflows to author, maintain, review, and deploy infrastructure or code.
- Fluency in English is required.
- Legally eligible to work in Bulgaria on an ongoing basis.
Benefits
- Health insurance
- Retirement plans
- Paid time off
- Flexible work arrangements
- Professional development
More DevOps Engineer Jobs
DevOps / SRE Engineer - AI Platform
Makro PRO
Makro PRO is an exciting new digital venture by the iconic Makro. Our proud purpose is to build a technology platform that will help make business possible for restaurant owners, hotels, and independent retailers, and open the door for sellers. We welcome bold, energetic, and thoughtful people who share our belief in collaboration, diversity, excellence, and putting customers at the heart of our work.
- Clear focus
- Diverse workplace (our members are from around the world!)
- Non-hierarchical and agile environment
- Growth opportunity and career path
Role Description
The DevOps / SRE Engineer owns the operational substrate of an AI-native retail decisioning platform: infrastructure, CI / CD, observability, cost meter, and incident response for a system that runs production agents taking real business actions. The role builds on the enterprise Terraform standard, CI / CD spine, and FinOps tagging policy rather than reinventing parallel infrastructure. Remote candidates outside of Thailand are welcome to apply.
- Adopt the enterprise Terraform standard and module library for all platform infrastructure; author platform-specific modules where needed (agent runtime, vector DB, knowledge graph); run drift detection weekly.
- Build platform-specific CI / CD pipelines on the enterprise spine (service deploys, agent deploys, eval-gate enforcement); integrate eval gates so no agent reaches production without eval pass.
- Operate rollback orchestration with sub-15-minute recovery; quarterly game days.
- Own the platform observability stack: OpenTelemetry, Langfuse for LLM traces, custom dashboards for per-agent cost.
- Implement the per-agent cost meter end-to-end (token counts, vector queries, model inference, downstream LLM Gateway costs); surface cost data to the enterprise GenAI cost dashboard.
- Stand up the platform on-call rotation; author runbooks for every production agent and service; lead incident response with measurable corrective actions.
- Implement platform cost-tagging policy consistent with the enterprise standard (team, domain, environment, project, agent, suite, persona); report monthly to Cost Review.
- Drive cost optimisation: right-sizing, caching, model routing decisions, reserved compute.
Qualifications
- Bachelor's or Master's degree in Computer Science, Engineering, or a related discipline.
- 5+ years SRE / DevOps with production ownership.
- Terraform at scale: modules, state, drift, environment promotion.
- CI / CD for data + ML / AI services (GitLab CI / CD or comparable).
- Cloud platform (Azure preferred; AWS / GCP transferable).
- Observability: OpenTelemetry, Langfuse (or comparable LLM traces), custom dashboards.
- FinOps: tagging policies, attribution, optimisation.
- Incident response: on-call, post-mortems, runbook authorship.
Preferred Qualifications
- AI / agent platform SRE experience; cost-meter / chargeback systems built or operated.
- Multi-cloud production experience; open-source contributions to IaC / observability tooling.
- AI / ML / agent system observability instrumentation (LLM cost, agent cost, eval scores).
- Vendor certifications such as HashiCorp Terraform Associate / Professional, Azure Solutions Architect Associate, or Databricks Data Engineer Professional.
Senior Platform Engineer – SRE
Filigran
Uncover Threats. Take Action. Home of OpenCTI, OpenBAS and more.
- Design, build, and operate production-grade Kubernetes clusters on bare metal and cloud
- Industrialize, automate, improve observability & monitoring
- Continue to create a culture of service delivery excellence
- Participate in on-call rotation, incident management, and post-incident reviews
- Design and drive projects around DevSecOps practices in the company
DevOps Engineer
RemotePro.ph
We are a US-based IT services firm with a consistently growing and fully remote PH team.
In this role, your main responsibility is to own our Linux systems and their deployments, and to support the team that maintains and monitors the applications running on them. These include custom applications as well as more standard open source and proprietary ones. Our ideal candidate will also be able to support at least basic Windows DevOps needs, and more.
Responsibilities
- Plan, design, build, and define the monitoring of our Linux and Windows applications and the systems that run them.
- Implement client-requested integrations.
- Design and plan updates and fixes, and support the team in deploying them.
- Conduct root cause analysis when issues arise.
- Investigate and resolve technical issues, and create or provide resources that help the team prevent them.
- Develop scripts to automate processes and updates.
- Provide technical support and design procedures for system troubleshooting and maintenance.
Senior Service Reliability Engineer
Thoughtworks
Thoughtworks is a dynamic and inclusive community of bright and supportive colleagues who are revolutionizing tech. As a leading technology consultancy, we’re pushing boundaries through our purposeful and impactful work. Over 30 years of delivering extraordinary impact with clients. Helping clients solve complex business problems with technology as the differentiator.
Role Description
As a Senior Service Reliability Engineer (SRE) you will take a multifaceted approach to ensure technical excellence and operational efficiency within the infrastructure domain. Specializing in reliability, resilience and system performance, you take a lead role in championing the principles of Site Reliability Engineering. By strategically integrating automation, monitoring and incident response, you facilitate the evolution from traditional operations to a more customer-focused and agile approach. Emphasizing shared responsibility and a commitment to continuous improvement, you cultivate a collaborative culture, enabling organizations to meet and exceed their reliability and business objectives.
- You will improve site reliability by building mechanisms/architectures that enable fault tolerance and faster median time to respond and median time to detect.
- You will drive the integration of observability automation into the CI/CD pipeline.
- You will handle production incidents, manage incident communication with clients and draft root cause analysis documents.
- You will monitor performance of production systems and improve their scaling to ensure business goals are met within expected SLA and SLO metrics.
- You will work closely with application development teams as an advisor on improving system reliability and assist in implementing reliability improvements.
- You will improve system observability across multiple facets such as logging and metrics, reducing false alarms to eliminate unnecessary toil and improving process efficiency.
- You will implement chaos engineering practices as necessary to test system reliability, setting up processes for such testing to be done regularly.
- You will develop a clear understanding of client goals and business needs and set direction for site reliability in line with them, for example achieving application availability with minimal or no disruption (99.999%) where the business requires it.
Qualifications
- You have hands-on experience in programming and scripting languages such as Python, Go or Bash.
- You have a good understanding of at least one public cloud, e.g. AWS, Azure or GCP.
- You have had exposure to observability tools such as Grafana, Datadog, NewRelic, ELK Stack, Dynatrace or equivalent and you are proficient in using data from these tools to dissect and identify root causes of system and infrastructure issues.
- You are familiar with DevOps and GitOps practices.
- You have a good knowledge of container-based architecture and orchestration tools such as Kubernetes, AWS EKS, Docker Swarm, Nomad, etc.
- You understand technical architecture and modern design patterns, including microservices, serverless functions, NoSQL and RESTful APIs, with experience in fixing bugs, analyzing logs, building metrics and operational dashboards.
- You are familiar with creating infrastructure resources that improve system reliability and follow the cloud provider's Well-Architected Framework principles: reliability, security, cost optimization, performance efficiency and operational excellence.
Requirements
- You have strong communication and articulation skills, and are proficient in English.
- You have good people skills with an emphasis on negotiation and close collaboration with multiple cross-functional teams from the client side and/or Thoughtworks.
- You solve challenging problems and difficult-to-debug issues with a never-give-up attitude.
- You have the ability to work under pressure and with composure during production incidents.
- You can confidently recommend improvements backed by strong technical arguments to client stakeholders or application development teams.
- You are able to understand requirements provided by the client on both technical and business aspects and break them down for successful implementation.
- You have a strong drive and ownership mentality, with a willingness to sign up for and deliver work when called upon, without being too concerned about role boundaries.
- You’re willing to be part of a rotation- and need-based 24x7 available team.
Benefits
- There is no one-size-fits-all career path at Thoughtworks: however you want to develop your career is entirely up to you.
- Your career is supported by interactive tools, numerous development programs and teammates who want to help you grow.
- We see value in helping each other be our best and that extends to empowering our employees in their career journeys.


