Secure your enterprise with the autonomous cybersecurity platform. Endpoint. Cloud. Identity. XDR. Now.
Senior Site Reliability Engineer, Government
Location: United States
Posted: 3 days ago
Salary: $132K - $182K / year
Seniority: Senior
Job Description
SentinelOne
- Drive continuous software delivery, resolve incidents, run post mortems, and create automation strategies for deployment, self-testing, and alerting.
- Lead and execute incident management for production issues, ensuring rapid recovery, root cause analysis, and preventative follow-up actions.
- Improve and optimize the observability strategy by collaborating with application engineering teams to design monitoring solutions that enhance alerting capabilities and reduce noise.
- Define, implement, and monitor SLOs, SLIs, and SLAs in collaboration with product and engineering teams to align with business objectives.
- Design, develop, and maintain software solutions that address operational, compliance, and pipeline challenges.
- Own and coordinate all government environment releases, driving process improvements to enhance the release pipeline's efficiency, reliability, and visibility.
- Partner cross-functionally with engineering, product, SecOps, compliance, and leadership teams to align priorities, define testing strategies, and resolve challenges.
- Ensure all infrastructure and deployments meet FedRAMP, government regulations, and industry standards, while maintaining required release documentation and risk assessments.
Job Requirements
- 5+ years of experience in SRE, DevOps, or Infrastructure Engineering for SaaS products, with 4+ years running operations at a large scale.
- 2+ years of production experience with a container orchestration system (Kubernetes preferred) and Continuous Delivery.
- Strong understanding of compliance frameworks relevant to government deployments (e.g., FedRAMP, DoD, NIST 800-53, NIST 800-137).
- Multi-cloud experience in AWS/GCP (expertise within AWS preferred).
- Demonstrated experience with at least one main programming language (Python, Go, Ruby, etc.) and proficiency in bash scripting to improve operational workflows.
- Familiarity with GitOps frameworks, IaC tooling (Terraform or Pulumi), and deployment strategies (blue-green, rolling deploys, canary deploys).
- Experience with industry-standard observability stacks (Prometheus, Grafana, ELK, OpenTelemetry, etc.) and incident management processes.
- Proven background implementing and supporting FedRAMP, security, risk management, and compliance processes for software releases.
- Experience working directly with government agencies or in highly regulated industries.
- Familiarity with testing strategies and automation in large-scale environments.
Benefits
- Restricted Stock Units (RSUs)
- Employee Stock Purchase Plan (ESPP)
- Flexible time off
- Paid company holidays and paid sick time
- Gender-neutral parental leave
- Grandparent leave
- Medical, dental, and vision coverage
- 401(k) retirement plan with company match
- Life and disability insurance
- Health and dependent care FSA
- Voluntary benefits (hospital, accident, critical illness)
- Employee Assistance Program (EAP)
- ARAG pre-paid legal
- Nationwide pet insurance
- Cancer Care program
- Global business travel medical insurance
- Home office allowance
- Mobile phone reimbursement
- Wellness coach
- Wellness/gym reimbursement
- Fertility coverage
- Adoption & surrogacy reimbursement




