An independent platform for cutting-edge, progressive, legal, and political opinion.
Senior Site Reliability Engineer
Location
Florida
Posted
4 days ago
Salary
0
Seniority
Senior
Job Description
The Leaflet
• Ensure the availability, reliability, and performance of high-traffic Java-based applications in a distributed environment.
• Troubleshoot and resolve complex issues across production and non-production environments.
• Participate in pre- and post-deployment performance testing and monitoring to continuously improve application performance.
• Optimize Java application performance with a focus on JVM tuning, efficient resource utilization, and horizontal scaling.
• Deploy and manage the Grafana stack (Grafana, Prometheus, Loki, Mimir, Alloy) to deliver real-time monitoring, logging, and alerting.
• Implement and refine observability strategies that enhance visibility into application and infrastructure health.
• Create and maintain dashboards, alerts, and log queries for comprehensive system health monitoring.
• Integrate AI/ML models into the observability pipeline for anomaly detection, predictive alerting, and intelligent alert correlation and noise reduction.
• Design, build, and operate agentic AI workflows that automate operational tasks such as alert triage, root cause analysis, runbook execution, and incident summarization.
• Develop tool-calling LLM agents that interact with infrastructure APIs (Kubernetes, Grafana, Jira, Slack, PagerDuty) to execute diagnostic and remediation actions autonomously or with human-in-the-loop approval.
• Build and maintain MCP (Model Context Protocol) servers and integrations that expose internal systems as tool surfaces for AI agents.
• Evaluate, select, and operationalize LLM frameworks and orchestration platforms (e.g., LangChain, LangGraph, CrewAI, n8n, or custom solutions) for production-grade agentic systems.
• Implement guardrails, evaluation harnesses, and feedback loops to ensure AI agent outputs are accurate, safe, and continuously improving.
• Champion the adoption of AI-assisted development and operations practices across the SRE and broader engineering organization.
• Support the operations team’s incident response efforts, conduct post-mortems, and identify root causes to prevent recurrence.
• Leverage AI tools to accelerate incident timelines, auto-generate post-mortem drafts, and surface patterns across historical incidents.
• Document and share lessons learned, contributing to a culture of continuous improvement.
• Identify repetitive operational workflows and engineer AI-augmented or fully automated replacements.
• Build self-service tools and chatbot interfaces that allow engineering teams to query system status, retrieve logs, and execute standard operating procedures through natural language.
• Measure and report on toil reduction metrics to quantify the impact of automation initiatives.
• Work closely with developers, architects, and data/ML engineers to design solutions that improve reliability and leverage AI capabilities.
• Collaborate with DevOps and NOC teams to support the application platform.
• Communicate SRE practices, AI/automation capabilities, and operational insights to technical and non-technical stakeholders.
• Provide feedback on application performance, potential improvements, and observability metrics.
Job Requirements
- Degree in Computer Science or a related field, or equivalent professional experience.
- 5+ years in SRE, DevOps, or similar infrastructure roles with experience managing large-scale, high-availability production systems.
- 3+ years hands-on experience managing production Kubernetes clusters, including deep understanding of architecture, networking, storage, and security.
- Experience with cluster autoscaling (Karpenter), upgrades, and multi-cluster management.
- Proficiency with kubectl, Helm, Kubernetes operators, and container orchestration troubleshooting.
- Advanced expertise with the Grafana observability stack: dashboards, alerting, visualization, and Grafana Alloy for telemetry collection.
- Proficiency in PromQL and experience with Loki for log aggregation and analysis.
- Hands-on experience managing Java-based applications in distributed environments, including JVM tuning and optimization.
- Cloud platform expertise (AWS preferred; GCP or Azure also valued).
- Familiarity with Infrastructure as Code tools such as Terraform/Terragrunt or Ansible.
- ArgoCD proficiency for GitOps workflows and continuous deployment.
- Strong scripting abilities in Python, Bash, or Go, with experience building CI/CD pipelines and deployment automation.
- Proven track record with on-call rotations, incident response, and root cause analysis.
- 1+ years of practical experience building or operating AI/LLM-powered tools, agents, or workflows in a production or production-adjacent context.
- Demonstrated ability to design agentic systems that use tool calling, retrieval-augmented generation (RAG), or multi-step reasoning to accomplish operational tasks.
- Experience integrating LLM APIs (e.g., Anthropic Claude, OpenAI, or open-source models) into backend services or automation pipelines.
- Familiarity with at least one agentic orchestration framework or workflow engine (LangChain, LangGraph, CrewAI, n8n, Temporal, or equivalent).
- Understanding of prompt engineering best practices, including structured outputs, system prompts, and few-shot examples.
- Familiarity with AI-assisted coding tools (Claude Code, Codex, Cursor) and their integration into engineering workflows.
- Experience building or consuming MCP (Model Context Protocol) servers to expose internal tools to AI agents.
- Awareness of AI safety, hallucination mitigation, and human-in-the-loop design patterns for autonomous systems.
Benefits
- Competitive pay and benefits
- Flexible vacation allowance
- A hybrid / remote working environment
- Startup culture backed by a secure, global brand
More DevOps Engineer Jobs
Role Description
We are looking for a Senior DevOps Engineer - Multi - Cloud.
Qualifications
- Experience in DevOps practices and tools.
- Strong knowledge of cloud platforms (AWS, Azure, GCP).
- Proficiency in scripting and automation.
- Experience with CI/CD pipelines.
- Familiarity with containerization technologies (Docker, Kubernetes).
Requirements
- Ability to work in a remote-first environment.
- Strong communication skills.
- Problem-solving mindset.
- Team player with a collaborative approach.
Benefits
- Flexible hours and remote-first mode.
- Competitive compensation.
- Complete Hardware/Software setup – anything you need for work.
- Open-door culture, transparent communication, and top management at a handshake distance.
- Health insurance, vacation, sick leaves, holidays, paid maternity/paternity leave.
- Access to our learning & development center: workshops, webinars, training platform, and edutainment events.
- Virtual team buildings and social activities.
Company Description
Innovecs is a global digital services company with a presence in the US, the UK, the EU, Israel, Australia, and Ukraine. Specializing in software solutions, the Innovecs team has experience in Supply Chain, Healthtech, Collaboration Tech, and Gaming.
- Included in the Inc. 5000, the list of fastest-growing private companies in the US.
- Ranked as one of the best global outsourcing service providers by IAOP.
- Honored with the Global Good Awards for Employee Engagement & Wellbeing.
- Won gold at the Employer Brand Management Awards.
- Included in the Global Top 100 Inspiring Workplaces Ranking.
DevOps Engineer, Fluent Ukrainian
SupportYourApp
SupportYourApp is an industry leader in premium outsourced customer support that provides tech companies with reliable, cost-effective services. A multinational
• Build, maintain, and optimize CI/CD pipelines for the company's web products, websites, and internal services in Jenkins and GitLab CI/CD
• Support the gradual migration of deployment processes from Jenkins to GitLab CI
• Ensure stable, repeatable, and predictable deployments with rollback mechanisms and a minimal number of manual steps
• Configure and maintain Docker-based runtime environments for web applications and services
• Standardize Docker, docker-compose, deployment scripts, and runtime configurations so that solutions do not require regular rework
• Administer Linux servers in production: configuration, patch management, troubleshooting, performance analysis
• Automate infrastructure setup, configuration management, and maintenance processes with Ansible and Bash
• Maintain web infrastructure: Nginx, SSL/TLS, reverse proxy, routing, Cloudflare, DNS, caching, and basic security rules
• Configure, maintain, and improve monitoring, logging, and alerting for production systems
• Analyze deployment failures and production incidents, determine the root cause, and propose preventive actions
• Support backup/restore, monitoring, and basic troubleshooting for MySQL/PostgreSQL
• Ensure the reliability and stability of production systems
• Analyze production incidents, conduct root cause analysis, and implement preventive actions
• Participate in post-incident reviews and prepare technical findings after incidents
• Implement and maintain security practices for Linux and web infrastructure: hardening, access control, updates, and vulnerability remediation
• Document infrastructure decisions, deployment workflows, configurations, and important changes
• Coordinate production changes with the team, warn about risks, and do not make critical changes without transparent communication
• Proactively identify weak points in deployment, infrastructure, and application architecture that could lead to instability, and initiate their remediation.
DevOps Engineer(4)
Ericsson
We create limitless connectivity to improve lives, redefine business and pioneer a sustainable future. #ImaginePossible
Join our Team
About this opportunity:
As a DevOps Engineer, you will be responsible for developing sophisticated systems and software solutions based on the customer's business goals, needs, and general business environment.
What you will do:
- Participate in developing functionalities, starting from requirement analysis, system design, architecture design, software design, software testing, integration, Product Lifecycle Management support and product documentation.
- Ensure that all architectural and design documents, diagrams, non-functional requirements, standards, and best practices are followed and completed on time and with high quality.
- Manage infrastructure services such as Kubernetes clusters, databases, analytics platforms, etc.
- Monitor system performance and troubleshoot issues.
- Restore services and provide first-line technical troubleshooting within service level agreements.
- Act as a liaison between customers and the software development team.
- Work with the team and Product Owner on requirement/user story analysis.
- Work side by side with developers, product managers, the product owner, program managers and key executives to plan ongoing feature development and product maintenance.
- Participate in Agile ceremonies, and do not be afraid to identify what we're doing wrong so we can fix it, and what we're doing right so we can improve on it.
The skills you bring:
- Bachelor's Degree in Computer Science, Computer Engineering or a related technical field
- Hands-on programming and scripting experience in one or more of the following: OOP language (preferably Java), JavaScript, TypeScript, Node.js, Bash, YAML or REST API
- Familiar with Relational Database concepts and SQL
- Familiar with Microservices architecture and Cloud Technologies
- Familiar with Agile software development methodologies.
- Knowledge of Windows and Linux operating systems
- Ability to execute research tasks and generate practical results and recommendations.
- Proven problem-solving skills and innovation.
- Excellent verbal and written communication skills including the ability to produce usable and maintainable documentation.
- Result-driven analytical mindset and ability to identify and report patterns.
- Ability to work across multiple cultures.
- 2-4 Years of Experience
Why join Ericsson?
At Ericsson, you'll have an outstanding opportunity. The chance to use your skills and imagination to push the boundaries of what's possible. To build solutions never seen before to some of the world's toughest problems. You'll be challenged, but you won't be alone. You'll be joining a team of diverse innovators, all driven to go beyond the status quo to craft what comes next.
What happens once you apply?
Click Here to find all you need to know about what our typical hiring process looks like.
Encouraging a diverse and inclusive organization is core to our values at Ericsson, that's why we champion it in everything we do. We truly believe that by collaborating with people with different experiences we drive innovation, which is essential for our future growth. We encourage people from all backgrounds to apply and realize their full potential as part of our Ericsson team. Ericsson is proud to be an Equal Opportunity Employer.
Primary country and city: Egypt (EG) || Cairo
Req ID: 785303
• Lead the project, ensuring stability
• Coordinate the team of two developers together with the scrum master
• Manage stakeholders
• Lead and participate in the technical development and implementation of solutions
• Manage migration initiatives
• Perform technical feasibility analyses
• Act as Scrum Master to ensure adherence to agile methodologies
• Update documentation in ServiceNow
• Manage incidents
• Monitor potential security vulnerabilities and create plans to remediate them
• Participate in audits
• Act as the main point of contact with international vendors




