Job Closed
This listing is no longer active.
When It Matters®
Senior Software Engineer, Site Reliability
Location
United States
Posted
129 days ago
Salary
Not listed
Seniority
Senior
Job Description
Benchmark
• Contribute to the design, development, and delivery of features that enhance system reliability and scalability.
• Define, measure, and improve SLIs, SLOs, and error budgets in collaboration with engineering teams.
• Participate in building a culture of reliability through knowledge sharing, documentation, and process improvements.
• Implement and improve observability tooling and practices to monitor the health and performance of production systems.
• Participate in incident management, including on-call rotations, root cause analysis, and postmortem reviews.
• Lead smaller initiatives or components of larger projects, ensuring technical quality and operational readiness.
• Collaborate with software engineering, security, and product teams to ensure resilient and secure system design.
• Mentor junior engineers, sharing expertise in SRE principles and AWS best practices.
• Contribute to automation efforts to reduce toil and improve efficiency of operational processes.
Job Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering with a focus on production operations.
- Strong knowledge of AWS cloud services and cloud-native architectures.
- Proficiency in scripting or programming languages (e.g., Python, Bash).
- Experience with observability tools (e.g., CloudWatch, Datadog, Prometheus, Grafana).
- Familiarity with infrastructure-as-code tools (e.g., Terraform, CloudFormation) and CI/CD pipelines.
- Strong problem-solving skills and ability to work cross-functionally.
- Some experience mentoring or coaching junior engineers.
Benefits
- Health insurance
- Retirement plans
- Paid time off
- Flexible work arrangements
- Professional development
More DevOps Engineer Jobs
Senior DevOps Engineer, AWS
RecruityTalent
Connecting top IT and executive talent with great companies in EMEA/LATAM through tailored recruitment solutions.
• Lead operations for multi-tenant SaaS workloads on AWS, ensuring scalability, high availability, and cost efficiency
• Design, implement, and maintain reliable infrastructure for production, data, and AI/ML workloads
• Own incident response, postmortems, and operational runbooks to improve system reliability and reduce MTTR
• Manage and enhance CI/CD pipelines supporting both application and ML deployment workflows
• Build and maintain infrastructure automation using Infrastructure as Code (AWS CDK or Terraform)
• Enable self-service capabilities for engineering and data science teams
• Monitor and optimize cloud usage across compute, GPU, and storage resources, implementing cost controls and forecasting
• Support and automate ML pipelines, including training, testing, and deployment using AWS SageMaker, Kubeflow, or MLflow
• Manage GPU and compute clusters (EKS, ECS, EC2) for model training and inference workloads
• Develop and maintain monitoring, alerting, observability, and security best practices
• Collaborate closely with Engineering, Data, AI/ML, and PlatformOps teams to ensure smooth cross-team delivery
Spalding, a Saalex Company is seeking a DevOps Engineer in Patuxent River, MD. Spalding, a Saalex Company is a professional services company that has delivered cutting-edge solutions to the Department of Defense since 2001. Our expert-level solutions include software development, information technology, program management, financial management, and business intelligence services. Spalding, a Saalex Company offers competitive compensation, career development, flexible work schedules, and excellent benefits.
Position Type: Full-Time
Salary: $115K-$135K (depending on experience)
Work Location: This is a remote position.
On-Site Requirements: Onboarding will require 1-2 visits to Patuxent River, MD for candidates who are local to the area. Out-of-state candidates will be onboarded virtually. Training will be virtual, and telework will be permitted to the greatest extent possible; however, for local candidates, training or tasking may require on-site work a few hours per week. Future on-site and telework requirements and schedules may change as additional client direction is received.
Essential Functions:
- Develops DevOps functionality for CI/CD pipeline solutions.
- Improves and maintains GitLab pipeline configurations.
- Collaborates with and assists software engineers in the design, configuration, implementation, and maintenance of CI/CD pipelines.
- Assists with GitLab upgrades as received from the vendor (e.g., bi-weekly or monthly; requires evening support).
- Onboards new applications/customers to the CI/CD environment.
- Provides recommendations for technology advancement to streamline CI/CD tools and processes.
- Provides technical assistance and troubleshooting for applications and systems deployed within a DevOps CI/CD pipeline.
- Identifies, troubleshoots, and resolves pipeline issues.
- Other duties as assigned or required.
SRE – Clickhouse Team
PostHog
Product analytics, session replay, feature flags, A/B testing, data warehouse, CDP, surveys. PostHog does that.
• Manage large fleets of EC2-based VMs, disks, and networking for data-intensive workloads
• Improve operational tooling around deploys, schema changes, backups, restores, and incident response
• Work closely with ClickHouse engineers to turn database-level needs into infra-level solutions
• Reduce operational load by identifying repeat pain points and eliminating them through code and self-healing automation
• Participate in on-call and incident response, with a strong focus on making incidents rarer over time
• You’ll have room to design and automate, not just respond to alerts.
• Seeking a Lead AI DevOps Engineer to oversee design and delivery of advanced AI/ML/GenAI solutions.
• Combines cloud engineering and automation with hands-on leadership in deploying and integrating LLM/SLM models into enterprise applications.
• Leading architecture and deployment of AI/ML/GenAI solutions (LLM/SLM at scale).
• Driving automation of infrastructure, model lifecycle and inference pipelines.
• Overseeing CI/CD processes for AI/ML/GenAI workloads.
• Designing secure, scalable cloud infrastructures (Azure-focused).
• Acting as technical advisor for stakeholders and client-facing solution design.
• Mentoring engineers, promoting best practices, and fostering innovation in GenAI adoption.
• Coordinating cross-functional teams to align AI engineering with business outcomes.
• Ensuring cost optimization, monitoring and compliance across environments.




