Peraton Corporation, a national security company headquartered in Herndon, Virginia, supplies solutions for mission-critical programs and systems. Founded in 2017, Peraton's missio
Senior AWS Cloud Site Reliability Engineer
Location
United States
Posted
88 days ago
Salary
$104K - $166K / year
Seniority
Senior
Job Description
Senior AWS Cloud Site Reliability Engineer
Peraton Corporation
Role Description
We are seeking an experienced and motivated Senior AWS Cloud Site Reliability Engineer (SRE) to join our dynamic team. As an AWS Site Reliability Engineer, you will play a critical role in ensuring the reliability, scalability, and performance of our cloud infrastructure on Amazon Web Services (AWS). The ideal candidate will have a strong background in AWS services, a deep understanding of infrastructure as code, and deep expertise with relational databases. The AWS Site Reliability Engineer (SRE) will collaborate closely with cross-functional teams, including development, quality assurance, and operations, to ensure seamless software releases and continuous improvement of our release processes.
Infrastructure Automation
- Design, implement, and manage infrastructure as code (IaC) solutions using tools like AWS CloudFormation, Terraform, or Helm Charts to automate continuous database deployment and scaling processes.
- Collaborate with development teams to integrate continuous deployment practices and ensure the reliability of applications and databases.
Monitoring and Alerting
- Implement robust monitoring and alerting systems to proactively identify and address potential issues before they impact system performance.
- Analyze system metrics, logs, and alerts to troubleshoot and resolve issues promptly.
Performance Optimization
- Conduct performance analysis and optimization of AWS infrastructure components to enhance system efficiency and reduce latency.
- Identify and implement improvements to enhance system reliability and resilience.
Incident Response
- Participate in on-call rotations to respond to and resolve incidents promptly.
- Conduct post-incident reviews to identify root causes and implement preventive measures.
Security and Compliance
- Work closely with security teams to implement and enforce best practices for securing AWS environments.
- Ensure compliance with industry standards and regulations related to cloud infrastructure.
Communication
- Facilitate clear communication across teams, providing updates on release status, known issues, and any potential impact on stakeholders.
- Coordinate communication of release schedules and changes to all relevant parties.
Release Planning and Coordination
- Collaborate with development, QA, and operations teams to plan and coordinate database schema releases.
- Define release scope, schedule, and dependencies to ensure timely and smooth deployments.
- Create and submit change records as required for process and audit compliance.
- Participate in Technical Change Advisory and Review Boards as required.
Release Automation
- Develop and maintain automated deployment pipelines using industry-standard tools such as GitLab CI/CD, Liquibase, or similar.
- Automate and streamline release processes to improve efficiency and reduce manual errors.
Continuous Improvement
- Proactively identify areas for process improvement within the release management lifecycle.
- Implement feedback loops to capture lessons learned from each release and apply improvements iteratively.
- Stay up to date with industry best practices, emerging technologies, and trends related to database automation.
Quality Assurance
- Collaborate with QA teams to establish and execute release validation procedures.
- Ensure releases are thoroughly tested and meet quality standards before deployment.
- Drive continuous improvement by analyzing release management trends, identifying recurring issues, and working with teams to implement solutions.
Qualifications
- Bachelor's Degree and 8 years of experience or 12 years of experience and a HS Degree/Diploma.
- Proven experience as a Site Reliability Engineer or similar role with a strong emphasis on relational databases.
- In-depth knowledge of AWS services like RDS and DynamoDB and expertise in managing cloud infrastructure.
- Advanced level programming and/or scripting in 3 or more of the following languages: Python, Java, Chef, Helm, Playwright, Bash, JavaScript, Terraform.
- Strong understanding of DevOps principles and continuous integration/continuous deployment (CI/CD) pipelines.
- Proficiency in CI/CD tools such as GitLab CI/CD, Liquibase, or others.
- Familiarity with infrastructure as code (IaC) tools like CloudFormation, Terraform, Helm Charts, or similar technologies.
- Hands-on experience with version control systems (GitLab, GitHub, AWS CodeCommit) and branching strategies.
- Experience with containerization and orchestration tools (e.g., Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), Docker, Kubernetes).
- Familiarity with monitoring tools (e.g., CloudWatch, Prometheus, Grafana, Datadog) and log analysis.
- Attention to detail, with a focus on maintaining high-quality software releases.
- Solid understanding of Agile methodologies and their application in release management.
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration skills.
- Must be a US Citizen.
- Must be able to obtain and maintain the required agency clearance (6C Public Trust).
Requirements
- Relevant certifications in DevOps or related fields are a plus.
- High Risk Public Trust or Secret Clearance preferred.
- 3 or more years in SRE or Platform Engineering group for high availability/critical platforms/applications.
- 2 or more years managing relational databases.
Company Description
Peraton is a next-generation national security company that drives missions of consequence spanning the globe and extending to the farthest reaches of the galaxy. As the world’s leading mission capability integrator and transformative enterprise IT provider, we deliver trusted, highly differentiated solutions and technologies to protect our nation and allies. Peraton operates at the critical nexus between traditional and nontraditional threats across all domains: land, sea, space, air, and cyberspace. The company serves as a valued partner to essential government agencies and supports every branch of the U.S. armed forces. Each day, our employees do the can’t be done by solving the most daunting challenges facing our customers. Visit peraton.com to learn how we’re keeping people around the world safe and secure.
Target Salary Range
$104,000 - $166,000. This represents the typical salary range for this position. Salary is determined by various factors, including but not limited to, the scope and responsibilities of the position, the individual’s experience, education, knowledge, skills, and competencies, as well as geographic location and business and contract considerations. Depending on the position, employees may be eligible for overtime, shift differential, and a discretionary bonus in addition to base pay.
EEO
Equal opportunity employer, including disability and protected veterans, or other characteristics protected by law.
More DevOps Engineer Jobs
About the Company
CardioOne partners with independent cardiologists to provide innovative solutions that improve patient outcomes and reduce costs. Our platform helps our physician partners thrive in today’s fee-for-service environment and prepare for success in value-based care. In February 2024, we partnered with WindRose Health Investors as well as top physician services and payor executives to grow our team and invest in our next phase of growth. CardioOne offers a magnificent work environment, good working conditions, and competitive pay. We offer medical, dental, vision, and a 401k plan with a match to benefit eligible employees. We offer PTO (Personal Time Off) and sick time to full-time employees. We take pride in creating a culture of employee engagement that translates into an exemplary patient experience. Join us in our mission to positively impact US cardiology.
About the Job
We are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, security, and performance of our production systems and services. The SRE will bridge the gap between software development and operations, implementing automation, monitoring, and best practices to enable rapid, reliable delivery of applications. You will report directly to the Senior Director of Engineering.
What you’ll do:
Reliability & Performance
- Ensure high availability, scalability, and performance of production systems.
- Implement and maintain SLIs, SLOs, and SLAs for critical services.
- Conduct capacity planning and performance tuning.
Automation & Tooling
- Automate infrastructure provisioning using IaC tools such as Terraform, Terragrunt, and Ansible.
- Develop automation to minimize manual operations and improve deployment workflows.
- Build CI/CD pipelines to support rapid and reliable deployments.
Monitoring & Incident Response
- Design and maintain monitoring, logging, and alerting systems (Datadog).
- Participate in on-call rotations and lead incident response efforts.
- Perform root-cause analysis and develop postmortems to prevent recurring issues.
Systems Engineering
- Manage cloud infrastructure (AWS, Azure) and container orchestration platforms (Kubernetes, ECS).
- Optimize system architecture for reliability and fault tolerance.
- Implement best practices for security, networking, and service resilience.
Collaboration & Leadership
- Work closely with development teams to design reliable microservices and distributed systems.
- Advocate for SRE principles and drive operational excellence across engineering teams.
- Mentor engineers on reliability practices, tooling, and automation strategies.
What you’ll need:
- Bachelor’s degree in Computer Science, Engineering, or equivalent experience.
- 3–7 years of experience in SRE, DevOps, or Systems Engineering roles.
- Strong proficiency with Linux systems and shell scripting.
- Experience with cloud platforms (AWS, Azure).
- Hands-on experience with Kubernetes/ECS and container technologies (Docker).
- Proficiency in at least one programming language: Python or Java.
- Experience with CI/CD pipelines and DevOps tooling.
- Strong understanding of distributed systems, networking, and security fundamentals.
Preferred Qualifications
- Experience with observability stacks (OpenTelemetry).
- Knowledge of database management (PostgreSQL).
- Experience with configuration management tools (Ansible, Chef, Puppet).
- Familiarity with zero-downtime deployments and chaos engineering practices.
Soft Skills
- Strong analytical and problem-solving skills.
- Excellent communication and cross-team collaboration.
- Ability to thrive in fast-paced, high-stakes environments.
- A mindset focused on continuous improvement and operational excellence.
Work Location:
- Remote: Colorado, Delaware, Florida, New Hampshire, New Jersey, New York, Pennsylvania, Texas.
Additional Information
Full-time base salary range of $130,000 to $150,000, plus medical, dental, and vision benefits and a matching 401K.
Site Reliability Engineer
Amwell
Amwell (previously known as American Well): digital care delivery will transform healthcare
Role Description
The Site Reliability Engineer builds and operates the paved roads that service teams use every day. You take shared infrastructure from idea to module to production, then you keep it boring. This is not a research role and not a hero role. It is delivery with discipline. You build with intention. You do not just make things work, you make them make sense. You challenge assumptions, question defaults, and tighten bolts others ignore. You move fast, but not recklessly. You are becoming the engineer others trust to take ownership and deliver cleanly. This is a hands-on engineering role for someone who can work independently on well-scoped problems with guidance, follow established patterns, and improve them when the evidence supports change. You partner closely with Security, Networking, and SRE because the platform is where constraints become real. As a Site Reliability Engineer, you help determine whether the platform feels chaotic or calm to everyone else. Your work directly affects developer velocity, operational safety, and trust in the system. When the platform is boring, predictable, and resilient, it is because engineers like you did the work carefully and well.
Core Responsibilities
Cloud Foundations
- Implement cloud infrastructure in AWS using approved patterns and guardrails.
- Support EKS-based runtime foundations, including cluster add-ons and shared services.
- Build environment parity across nonprod and prod, and flag any required divergence early with evidence.
- Help make cloud primitives predictable, supportable, and easy to consume.
Infrastructure Patterns and Modules
- Develop and maintain reusable platform modules and templates using Terraform or CDKTF where applicable.
- Contribute to baseline building blocks: VPC patterns, IAM primitives, EKS base clusters, ingress patterns, secrets, and shared data stores as assigned.
- Keep modules consumable through sane defaults, versioning, changelogs, and upgrade guidance.
- Reduce drift by enforcing standards through code, not documentation alone.
Automation and Delivery Enablement
- Improve CI workflows for infrastructure changes: plan and apply safety, policy checks, drift detection, and promotion across environments.
- Remove manual steps from provisioning and onboarding by turning them into pipelines and documented runbooks.
- Support internal module consumption patterns, including examples and reference implementations.
- Favor repeatability and clarity over clever one-off solutions.
Operations and Reliability
- Operate platform-owned services with an ownership mindset. Ownership is not optional.
- Participate in on-call for platform services and follow incident procedures.
- Write and maintain runbooks, dashboards, and alerts for what you ship.
- Drive post-incident follow-ups that reduce repeat failures.
Security, Compliance, and Governance
- Implement least-privilege IAM patterns and secure-by-design defaults.
- Partner with Security to integrate controls into pipelines and platform defaults.
- Treat auditability as a feature: logs, approvals, traceability, and evidence.
- Follow established governance and exception processes and document deviations.
Qualifications
- 3+ years of experience in platform engineering, DevOps, SRE, or infrastructure engineering.
- Working experience with AWS and infrastructure as code (Terraform preferred, CDKTF acceptable).
- Practical Kubernetes experience, preferably EKS (deploying, operating, debugging).
- Comfort with networking fundamentals: DNS, TLS, routing, load balancers, and security groups.
- Ability to debug pipelines and distributed failures without guessing.
- Strong written communication: design notes, runbooks, and crisp status updates.
Benefits
- Flexible Personal Time Off (Vacation time)
- 401K match
- Competitive healthcare, dental and vision insurance plans
- Paid Parental Leave (Maternity and Paternity leave)
- Employee Stock Purchase Program
- Free access to Amwell’s Telehealth Services, SilverCloud and The Clinic by Cleveland Clinic’s second opinion program
- Free Subscription to the Calm App
- Tuition Assistance Program
- Pet Insurance
Staff Site Reliability Engineer
Jobgether
We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team. We appreciate your interest and wish you the best!
Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.
We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
Role Description
This is a senior, hands-on role within a small, high-leverage SRE team, responsible for ensuring the reliability, scalability, and security of a high-growth digital financial platform. The Staff SRE will architect, automate, and optimize cloud infrastructure, focusing on operational excellence and system resilience. You will collaborate closely with engineering, product, and security teams to embed reliability into every layer of the platform while mentoring fellow engineers and shaping long-term infrastructure strategy. This role provides the opportunity to directly impact platform performance, member trust, and product velocity through robust monitoring, incident prevention, and automation. You will lead initiatives across GCP environments, cloud networking, Kubernetes, and IaC, while exploring innovative automation solutions, including LLM-driven tooling, to reduce toil and improve operational efficiency. This position is ideal for a systems thinker who thrives in ambiguous, high-impact environments and wants to build resilient, scalable services for millions of users.
Accountabilities
- Lead architecture and automation across cloud infrastructure, ensuring reliability, scalability, security, and cost-effectiveness.
- Define and operate SLIs, SLOs, and error budgets, translating reliability goals into measurable business outcomes.
- Design and optimize multi-region, disaster recovery, and capacity planning strategies to support platform growth.
- Manage and optimize cloud networking, including VPC architecture, ingress/egress, Cloud Armor, VPN, and DNS.
- Drive infrastructure-as-code and GitOps practices using Terraform, Kubernetes, Helm, and ArgoCD to enable repeatable, predictable deployments.
- Mentor SREs and infrastructure engineers through hands-on collaboration, design reviews, and incident retrospectives.
- Partner with cross-functional teams to align platform decisions with product velocity, security, and long-term durability.
Qualifications
- 8+ years of experience in software, infrastructure, or site reliability engineering.
- 5+ years of hands-on experience operating production systems in GCP (compute, networking, storage, IAM, observability).
- Deep experience with Kubernetes (GKE), Helm, containerization, Terraform (IaC), and ArgoCD.
- Strong programming skills in Python, Go, or TypeScript/JavaScript for automation and internal tooling.
- Proven ability to define and operate against SLIs, SLOs, and error budgets.
- Strong knowledge of relational and distributed databases (e.g., MySQL, Cloud SQL, Cloud Spanner, Redis) including performance tuning and HA strategies.
- Experience leading incident response, root cause analysis, and systemic remediation.
- Bonus: Experience in fintech or regulated environments, CI tooling familiarity, and high-growth startup experience.
Benefits
- Competitive compensation and benefits package.
- Premium Medical, Dental, and Vision Insurance plans.
- 401(k) savings plan with matching contributions.
- Flexible PTO and generous company holidays, including Juneteenth and Winter Break.
- Paid parental and caregiver leave.
- Flexible hours with a virtual-first work culture and home office stipend.
- Opportunities for professional growth, mentorship, and impactful work on a high-growth platform.
- Company-sponsored in-person and virtual events for team connection.
Associate Reliability Engineer
Chomps
Protein-packed meat snacks that deliver on taste, simple ingredients and powerful nutrition!
- Support continued management of maintenance of fixed assets.
- Focus on equipment reliability, uptime, and tracking of OEE to ensure effective asset utilization.
- Drive predictive and preventative maintenance to support asset health.
- Collaborate with internal and external partners to drive to zero loss.
- Benchmark equipment reliability and maintenance performance across multiple sites.
- Analyze breakdown data and maintenance trends to implement corrective actions.
- Develop and implement preventive maintenance programs for packaging machines.
- Provide technical support to third-party facility staff on packaging equipment operations.



