Job Closed
This listing is no longer active.
We are a leader in AIOps providing modern IT operations with actionable insights to predict and resolve problems faster.
Senior Site Reliability Engineer, Observability
Location
Virginia
Posted
132 days ago
Seniority
Senior
Job Description
ScienceLogic
- Be a key contributor on an Agile development team, collaboratively realizing business value through an iterative software development lifecycle
- Build and execute the monitoring strategy for ScienceLogic SaaS infrastructure
- Define, deploy, and maintain system and service monitors
- Be the authority on monitoring technologies such as Prometheus, AWS CloudWatch, Scylla Manager, and New Relic to provide next-generation monitoring solutions for ScienceLogic SaaS
- Employ advanced monitoring practices and technologies to detect and automatically resolve platform issues before they impact the customer’s experience
- Participate in architecture and operations reviews
- Identify and automate measurement of operations SLAs and SLOs using SLIs
- Triage incident response, document SOPs and runbooks, and train NOC team members
- Participate in a shared on-call manager rotation for escalations during incidents and outages, occasionally during off hours
- Provide dashboarding and analytics solutions to internal teams based on requirements
Job Requirements
- 8+ years of software development or site reliability engineering or equivalent experience
- Skilled at problem solving, algorithms, and data structures
- Building tools and scripting frameworks from scratch
- Working with cloud automation tools such as CloudFormation, Terraform, CDK, and aws-cli
- Scripting languages such as Python, Groovy, PowerShell, Bash, and Perl
- Configuration automation using Ansible or equivalent tools
- Exposure to Windows and Linux administration
- Project management tools such as Jira and Trello
- Prior experience with datastore technologies such as Postgres, MySQL, SQL, and DynamoDB is desirable
- Familiarity with basic networking, security and cloud engineering concepts
- Team player who is eager to help others succeed through mentoring and leading by example
- Highly collaborative, with effective written and verbal communication skills
Benefits
- Comprehensive medical, dental and vision plans
- 401(k) plan with employer match
- Flexible Paid Time Off (FTO) so that you can take the time that you need to re-energize
- Volunteer Time Off (VTO) - take two days off per calendar year to volunteer with your preferred charitable organization
- 5-year Service Milestone Sabbatical
- Paid parental leave
- Generous employee referral bonus program
- Pet insurance
- HQ Office centrally located in Reston Town Center featuring a well-stocked kitchen with rotating snacks and beverages, and catered lunch on Thursdays
- Regular virtual company-wide events, including cooking classes, yoga, meditation and more
- Mentorship and professional development opportunities with experienced product marketing leaders
- The opportunity to learn and develop from some of the best and brightest minds in the industry!
More DevOps Engineer Jobs
SRE Analyst, Senior
Pottencial Seguradora S.A.
We are the largest insurtech in Brazil and leaders in the surety insurance (Seguro Garantia) market!
- Collaborate with development, infrastructure, and security teams to design, build, and maintain reliable and scalable systems
- Participate in planning and executing load, chaos, and failover tests, focusing on risk mitigation and identification of bottlenecks
- Develop and maintain automation tools for monitoring, deployment, rollback, and incident response
- Monitor and respond to critical incidents, conducting root cause analysis (RCA) and proposing preventive actions
- Support the evolution of CI/CD processes, infrastructure as code (IaC), and security
- Lead automation, observability, and performance initiatives for critical systems
- Design, implement, and evolve monitoring, metrics, distributed tracing, and logging solutions
- Conduct incident reviews (postmortems) with root cause analysis and structured action plans
- Identify and apply continuous improvements to SLOs, SLIs, and SLAs
- Act as a focal point for failure mitigation, recovery, and continuity plans
- Drive a culture of reliability and resilience across the organization
- Mentor junior and mid-level professionals, promoting technical training and best practices

- Establish the DevSecOps function at Playson, defining best practices and security standards across the Platform Tribe
- Integrate security into CI/CD pipelines (SAST, DAST, dependency scanning, container scanning)
- Harden infrastructure and runtime environments (Linux, Docker, Kubernetes/EKS, RBAC)
- Design and enforce cloud security controls in AWS (IAM least-privilege, GuardDuty, Security Hub, encryption at rest/in transit)
- Define and maintain IaC security policies (Terraform/Terragrunt, drift detection, policy-as-code)
- Implement and manage secrets management solutions (Vault, AWS Secrets Manager)
- Build centralized security monitoring and alerting (Datadog, ELK, CloudWatch, SIEM/SOAR)
- Lead vulnerability management and threat modeling practices
- Automate workflows through scripting (Python, Bash)
- Partner with backend, infrastructure, and platform engineers to embed security in design and delivery
- Contribute to compliance readiness (ISO 27001, GDPR, PCI-DSS)
- Act as a security subject-matter expert, mentoring engineers and raising awareness
- Continuously evaluate and implement new security tools and approaches

- Redefine how public funds reach the people who need them
- Serve as the primary architect of the security posture
- Own the critical intersection of security and compliance
- Drive the continuous hardening and evolution of systems at every layer
- Ensure the platform remains a fortress, maintaining the highest levels of trust with government partners
- Collaborate closely with DevOps, Engineering teams, and the CISO

- Ensure reliability, availability, and performance of compute nodes running VMs
- Analyze and debug Linux systems across user space and kernel space, understanding capabilities, limitations, and trade-offs at each layer
- Troubleshoot complex production issues involving CPU, memory, NUMA, cgroups, and scheduling
- Work hands-on with virtualization and containerization, primarily using QEMU/KVM and Linux-native technologies
- Design and evolve observability as a core capability of the node layer: metrics, logs, traces, alerts, SLIs, and SLOs
- Lead incident response, root-cause analysis, and postmortems, driving long-term reliability improvements
- Collaborate closely with platform, kernel/hypervisor, GPU, and infrastructure teams to improve system design and operability




