Job Closed
This listing is no longer active.
Take charge of access management. Tools you need. Security you deserve.
Senior Site Reliability Engineer
Location: Maryland
Posted: 114 days ago
Salary: $115K - $140K / year
Seniority: Senior
Job Description
Provision IAM
- Own and execute infrastructure projects, including migrations, automation, and tooling improvements
- Manage and troubleshoot Kubernetes clusters across multiple environments
- Maintain and improve GitOps deployment pipelines
- Build and maintain CI/CD pipelines
- Manage Google Cloud Platform infrastructure (GKE, IAM, networking, storage)
- Implement and maintain secrets and configuration management systems
- Write and maintain automation (infrastructure as code, configuration management, scripting)
- Participate in an on-call rotation supporting production infrastructure as needed
- Communicate with internal teams and occasionally with clients when infrastructure matters impact delivery
- Collaborate with developers on deployment, reliability, and performance
- Use AI tools appropriately to enhance engineering productivity and workflow
Job Requirements
- Authorized to work in the United States
- Hands-on Kubernetes production experience
- Experience with GitOps workflows (ArgoCD, Flux, or similar)
- Strong cloud infrastructure experience (Google Cloud preferred; AWS/Azure transferable)
- CI/CD pipeline design and maintenance (GitLab CI/CD or equivalent)
- Infrastructure as Code (Terraform, OpenTofu, Pulumi, or similar)
- Enterprise secrets management tools (HashiCorp Vault or equivalent)
- Advanced Linux command-line and system administration
- Monitoring and observability tools (Prometheus, Grafana, Datadog, etc.)
- Understanding of SLIs/SLOs and incident response practices
- Automation and scripting (Bash, Python, or similar)
Benefits
- Company-paid health insurance (employee and family coverage)
- Generous paid time off
- SIMPLE IRA retirement plan (IRS-compliant eligibility and company participation)
- Fully remote work environment
- Meaningful technical ownership and growth opportunities




