Job Closed
This listing is no longer active.
Senior Engineer - Site Reliability
Location
United States
Posted
34 days ago
Salary
$109.6K - $164.4K / year
Seniority
Senior
Job Description
Core42 US Services LLC
Role Description
As a Senior Site Reliability Engineer, you will be responsible for designing, implementing, and operating scalable, reliable, and secure infrastructure to support large-scale AI and HPC workloads. You will play a key role in building and maintaining CI/CD pipelines, Kubernetes-based environments, and observability systems that ensure high availability and performance across globally distributed platforms. Working closely with engineering, product, and operations teams, you will drive automation, enforce SRE best practices, and contribute to a resilient and efficient infrastructure ecosystem that supports mission-critical applications.
Your Key Responsibilities
- CI/CD & Automation: Design, build, and maintain robust CI/CD pipelines using tools such as GitLab CI, Azure DevOps, and/or Jenkins to enable rapid and secure software delivery.
- Kubernetes Operations: Operate, manage, and optimize Kubernetes clusters, ensuring scalability, performance, and resilience.
- Infrastructure as Code: Develop and maintain infrastructure using Terraform, Helm, Ansible, or similar tools to automate provisioning and configuration.
- Observability & Monitoring: Implement and manage monitoring solutions using Prometheus, VictoriaMetrics, Grafana, and ELK/EFK to ensure system health and performance.
- Incident Management: Lead root cause analysis (RCA), post-mortems, and continuous improvement initiatives to enhance system reliability.
- Reliability Engineering: Define and implement SRE best practices, including SLAs, SLOs, and error budgets.
- Logging & Alerting: Build and maintain logging, alerting, and tracing systems for proactive issue detection and rapid troubleshooting.
- Security & Compliance: Enforce security best practices and compliance standards across CI/CD pipelines and runtime environments; support audit readiness.
- Collaboration: Work cross-functionally with engineering, product, and infrastructure teams to align platform capabilities with business needs.
- Mentorship: Provide guidance and mentorship to junior engineers and contribute to knowledge sharing across teams.
- On-call Support: Participate in on-call rotations to support critical platform services.
Qualifications
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- 5+ years of experience in DevOps, Site Reliability Engineering, or platform engineering roles in production environments.
- Proven experience managing Kubernetes clusters (e.g., GKE, EKS, AKS, or self-managed).
- Hands-on experience with CI/CD tools and automation frameworks.
- Strong experience with infrastructure-as-code tools such as Terraform, Helm, or Ansible.
- Proficiency in container technologies (Docker, containerd) and orchestration with Kubernetes.
- Strong scripting/programming skills (e.g., Python, Bash, Go).
- Experience with observability and monitoring stacks (Prometheus, Grafana, ELK/EFK).
- Solid understanding of Linux systems, networking concepts, and cloud-native security best practices.
Preferred Skills/Qualifications
- Experience supporting AI/ML or HPC workloads in production environments.
- Knowledge of GPU resource management, workload schedulers, and performance tuning.
- Familiarity with distributed systems and large-scale infrastructure environments.
- Experience with incident management frameworks and reliability engineering practices.
- Strong collaboration and communication skills across cross-functional teams.
Compensation
The U.S. base salary range for this full-time role is $109,600 to $164,400, with bonus, and benefits on top. Salary ranges are set according to the role, level, and location. The range listed represents the minimum and maximum target salary for new hires across all U.S. locations. Actual pay within this range will depend on factors such as work location, job-related skills, experience, and relevant education or training.
Benefits
- With a diverse team of 1,100+ employees from 68 nationalities, we foster an inclusive, innovative, and collaborative environment.
- We foster a culture grounded in trust, accountability, and high performance.
- Our values include:
  - Grit – overcoming challenges with resilience and determination.
  - Passion – striving for excellence in everything we do.
  - Impact – driving meaningful change and progress.
- Our team members thrive in an environment where each contribution matters, and together, we achieve extraordinary results.
More DevOps Engineer Jobs
Senior Consultant – Site Reliability Engineering
Fabric Group
Good Problems. Unlocking value from business challenges
• Consultative Ownership: Work with autonomy to own problems and deliver solutions, acting as a bridge between development and operations.
• Observability Architecture: Design and implement robust monitoring solutions using the LGTM stack to ensure system health and performance.
• Reliability Strategy: Advise clients on defining meaningful SLOs/SLIs and managing error budgets to balance innovation with stability.
• AI Assistance: Drive use of AI Agents or AI tools for intelligent automation and improving operational efficiency.
• Incident Leadership: Lead post-incident reviews (Blameless Post-Mortems) to identify systemic improvements and reduce future toil.
• Mentorship: Coach less experienced engineers within Fabric and our client teams on SRE principles and modern infrastructure patterns.
• Advising our clients on the right technical decisions and advocating for the right practices to use.
• Being an ambassador for Fabric, promoting our values and the practices we use to make sure we build the software right.
• Participate in interviewing and recruitment based on business needs.
• Thought Leadership: Contribute to the SRE community through blog posts, meetups, or internal knowledge sharing.
Operational Support & Availability:
• Rotational Support Coverage: Participate in a sustainable team rotation to provide extended service coverage (including weekends) for business-critical systems.
• Incident Response: Act as a primary responder for high-priority (P1/P2) incidents during your rostered shift, focusing on rapid restoration and clear stakeholder communication.
DevOps (Cloud Engineering)
APAC
Xebia is a trusted advisor in the modern era of digital transformation, serving hundreds of leading brands worldwide with end-to-end IT solutions. The company has experts specializing in:
- Technology consulting
- Software engineering
- AI
- Digital products and platforms
- Data
- Cloud
- Intelligent automation
- Agile transformation
- Industry digitization
In addition to providing high-quality digital consulting and state-of-the-art software development, Xebia has a host of standardized solutions that substantially reduce the time-to-market for businesses. The company has a strong presence across 16 countries with development centres across the US, Latin America, Western Europe, Poland, the Nordics, the Middle East, and Asia Pacific.
Role Description
We are looking to bring on a hands-on Cloud DB Platform Automation Engineer (CWR) to support our Database Platforms team at BCG. The preferred location is India, with willingness to work during UK business hours. This role is focused on Terraform-based automation and standardization of database platform offerings across AWS, Azure, and GCP, with emphasis on:
- Self-service (Terraform Cloud, GitHub Actions)
- Consistency
- Reducing manual effort
We are specifically looking for someone who is:
- Strong in Terraform (module development, not just usage)
- Comfortable with CI/CD and automation workflows
- Able to work independently and deliver from defined requirements
- Easy to collaborate with and provides clear status updates
Backend Ops Engineer Role
Weekday (YC W21)
We are a Y-Combinator-backed startup building your AI-powered Recruiter Agent
Role Description
We are looking for a DevOps / Site Reliability Engineer to take full ownership of infrastructure and platform operations in a fast-scaling, AI-first environment. This role is central to building a secure, scalable, and cost-efficient cloud-native ecosystem while enabling fast and reliable deployments. You will focus on automating infrastructure, improving system reliability, and embedding intelligent, AI-driven operations into DevOps workflows. As a key contributor, you will work closely with backend and product teams to ensure seamless performance, reduce operational risks, and support rapid growth. This is a high-impact role with the opportunity to shape platform architecture, drive efficiency, and contribute to next-generation engineering practices.
Key Responsibilities
- Design, implement, and manage scalable cloud infrastructure using AWS services such as ECS/Fargate, RDS, S3, and IAM
- Build and maintain infrastructure as code using Terraform for consistent and automated deployments
- Develop and manage CI/CD pipelines using GitHub Actions to ensure fast and reliable releases
- Implement observability and monitoring systems using tools like Prometheus, Grafana, OpenTelemetry, and Sentry
- Manage containerized environments using Docker and optimize performance under high-load conditions
- Drive cost optimization initiatives across cloud infrastructure
- Integrate AI-driven solutions into DevOps workflows, such as automated log analysis and predictive scaling
- Collaborate with engineering teams to improve system performance, scalability, and reliability
- Ensure infrastructure security, compliance readiness, and best practices
- Continuously improve deployment pipelines, reduce downtime, and enhance system resilience
Qualifications
- 2–3+ years of experience in DevOps, SRE, or backend infrastructure roles
- Strong hands-on experience with AWS infrastructure and cloud-native architectures
- Expertise in Terraform and infrastructure as code practices
- Proven experience with CI/CD pipelines and containerization tools like Docker
- Strong understanding of observability, monitoring, and incident management
- Experience troubleshooting and optimizing systems under production load
- Exposure or strong interest in integrating AI/LLMs into DevOps workflows
- Knowledge of security, compliance standards (SOC 2, GDPR), and infrastructure best practices
- Familiarity with multi-cloud environments (GCP, Azure) is a plus
- Strong ownership mindset, problem-solving ability, and clear communication skills
Requirements
- Min Experience: 3 years
- Location: Remote (India)
- Job Type: full-time
- Salary range: Rs 2000000 - Rs 3500000 (ie INR 20-35 LPA)
Senior DBA Engineer
Tabby
On a mission to create financial freedom. No interest. No fees. Shariah-Compliant.
Role Description
We are seeking a skilled IT professional to join our team in Saudi Arabia. The role involves a variety of responsibilities, including:
- Design, construct, install, and maintain large relational databases.
- Maintain the integrity and security of the database, including backups and recovery procedures.
- Implement and manage disaster recovery and failover systems.
- Monitor database performance, implement changes, and apply new patches and versions when required.
- Optimize queries for performance.
- Collaborate with development teams to optimize database usage.
- Set up and maintain database replication, clustering, mirroring, and other high availability strategies.
- Use and understand tools like pgbouncer and modern monitoring systems.
- Stay updated with the latest database technologies and best practices.
Qualifications
- Experience with PostgreSQL is mandatory.
- Proficiency in PostgreSQL setup, replication, upgrade, monitoring, and performance tuning.
- Experience with Clickhouse is a plus.
- Can read and write complex and very complex queries.
- Experience with backup and recovery procedures, as well as PITR.
- Strong knowledge of database design, documentation, and coding.
- Familiarity with database management tools and performance tuning techniques.
- Strong problem-solving and communication skills.
- Familiarity with programming/scripting languages like bash, Python, Go, etc.
- Experience with DbaaS on cloud platforms such as GCP or AWS (would be a plus).
Requirements
- Certification in database management or equivalent training (would be a plus).
- Experience in migrating large databases between cloud platforms.
- Knowledge of the latest trends in database administration.
- Familiarity with modern DevOps practices such as Kubernetes, Terraform, Helm.
- Experience in real-time data streaming technologies such as Debezium/Flink.
Benefits
- Work alongside a high-performing, international engineering team in a global fintech unicorn.
- Stock options (ESOP) in a fast-scaling, pre-IPO company.
- Health Insurance.
- Competitive salary and other bonuses.


