Software Engineer- Site Reliability Engineering (SRE)
Location
United States
Posted
2 days ago
Salary
$106.5K - $177.5K / year
Seniority
Mid Level
Job Description
Software Engineer- Site Reliability Engineering (SRE)
Noctua Technology
Role Description

The Site Reliability Engineering discipline at Noctua Technology, LLC is a strategic force driving digital transformation. We treat operations as a software engineering challenge, focusing on the seamless integration, scalability, and long-term reliability of cloud native systems. Our SREs don’t just manage infrastructure; they build it using Infrastructure as Code (IaC), monitor it through advanced observability stacks, and protect it by engineering for failure. We work closely with clients to bridge the gap between development and operations.

We are seeking a motivated Site Reliability Engineer (SRE) to join our dynamic team. As a key contributor, you will apply software engineering principles to operations, focusing on the reliability, scalability, and performance of production systems. You will play a crucial role in reducing toil through automation, defining and monitoring Service Level Objectives (SLOs), and implementing best practices for system stability and incident response. This role requires working with modern cloud technologies to ensure the high availability and efficiency of applications and infrastructure.

Location: Primarily Remote. Candidates must be based in CA or DC Metro Area for proximity to project and client teams.

Security Clearance Requirement: Applicants must be US citizens and eligible to obtain and maintain an active Secret security clearance or above.

Key Responsibilities

- Site Reliability Engineering
  - Define, measure, and report on Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure system reliability and uptime.
  - Develop and deploy Infrastructure as Code (IaC) using Terraform, CloudFormation, or similar tools, with an emphasis on repeatability and change management.
  - Implement and manage containerized and serverless architectures using Docker, Kubernetes, and cloud-native services, focusing on performance and error budgets.
  - Build and maintain reliable and self-healing CI/CD pipelines to automate deployments and improve development workflows.
- Toil Reduction and Incident Management
  - Implement and refine comprehensive monitoring, alerting, and logging to detect and address performance and availability issues proactively.
  - Eliminate toil by extensively automating operational tasks, including provisioning, patching, and deployments, using scripting and configuration management tools such as Python, Bash, or Ansible.
  - Conduct post-incident reviews (blameless postmortems) to drive continuous improvement in system reliability and operational processes.
- Testing and Service Resiliency
  - Implement cloud security best practices, including identity and access management (IAM), encryption, and compliance controls.
  - Proactively identify and address system weaknesses and ensure performance under stress.
  - Support disaster recovery and high availability strategies through backup and failover planning.
- Collaboration and Knowledge Sharing
  - Collaborate with development teams to improve the operability and production readiness of applications from design through deployment.
  - Create and maintain documentation for cloud architectures, deployment processes, and best practices.
  - Contribute to internal knowledge-sharing initiatives, ensuring continuous learning within the team.
- Stakeholder Communication
  - Provide technical guidance and support to clients and internal teams on cloud infrastructure and reliability best practices, with a focus on defining Service Level Agreements (SLAs).
  - Act on client feedback to refine and enhance cloud solutions.
  - Conduct training and knowledge-sharing sessions to help clients manage their cloud environments effectively.
- Continuous Learning and Innovation
  - Stay updated on the latest developments in cloud infrastructure and technology trends.
  - Drive innovation by proposing and implementing new techniques and technologies.
Qualifications

- 1-5 years of experience in site reliability engineering, cloud engineering, or related fields.
- Strong software engineering skills with an emphasis on writing clean, modular, and maintainable code, specifically for automation and system management.
- Proficiency in Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Experience with containerization and orchestration tools like Docker and Kubernetes.
- Knowledge of networking concepts, cloud security best practices, and identity management.
- Experience with programming or scripting languages such as Python, Bash, or Go.
- Familiarity with CI/CD pipelines and DevOps methodologies.
- Strong problem-solving skills and the ability to troubleshoot complex cloud environments.
- Effective communication skills and a willingness to learn and collaborate.

Preferred Qualifications

- Bachelor's or advanced degree in Computer Science or a related field.
- Any of the below cloud certifications:
  - Google Cloud Professional Cloud Architect
  - Google Cloud Professional Cloud DevOps Engineer
  - AWS Certified Solutions Architect
  - AWS Certified Developer
  - AWS Certified SysOps Administrator
  - Azure Solutions Architect Expert
- CompTIA Security+ certification or an equivalent DoD 8140/8570 IAT Level II baseline certification.

Salary Range: $106,500 - $177,500
More DevOps Engineer Jobs
Role Description

The Site Reliability Engineering discipline at Noctua Technology, LLC is a strategic force driving digital transformation. We treat operations as a software engineering challenge, focusing on the seamless integration, scalability, and long-term reliability of cloud native systems. Our SREs don’t just manage infrastructure; they build it using Infrastructure as Code (IaC), monitor it through advanced observability stacks, and protect it by engineering for failure. We work closely with clients to bridge the gap between development and operations.

We are seeking a highly experienced and autonomous Senior Site Reliability Engineer (SRE) to join our dynamic team. As a technical leader, you will:

- Define the strategy and apply advanced software engineering principles to operations.
- Focus on the architecture, reliability, and long-term performance of large-scale production systems.
- Play a crucial role in reducing toil through automation.
- Define and monitor Service Level Objectives (SLOs).
- Implement best practices for system stability and incident response.

This role requires working with modern cloud technologies to ensure the high availability and efficiency of applications and infrastructure.

Location: Primarily Remote. Candidates must be based in CA or DC Metro Area for proximity to project and client teams.

Security Clearance Requirement: Applicants must be US citizens and eligible to obtain and maintain an active Secret security clearance or above.

Qualifications

- 5+ years of experience in site reliability engineering, cloud engineering, or related fields.
- Strong software engineering skills with an emphasis on writing clean, modular, and maintainable code, specifically for automation and system management.
- Deep experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Deep experience with containerization and orchestration tools like Docker and Kubernetes.
- Deep knowledge of networking concepts, cloud security best practices, and identity management.
- Experience with programming or scripting languages such as Python, Bash, or Go.
- Experience with CI/CD pipelines and DevOps methodologies.
- Strong problem-solving skills and the ability to troubleshoot complex cloud environments.
- Demonstrated ability to influence technical decision-making across organizational boundaries.

Preferred Qualifications

- Bachelor's or advanced degree in Computer Science or a related field.
- Any of the below cloud certifications:
  - Google Cloud Professional Cloud Architect
  - Google Cloud Professional Cloud DevOps Engineer
  - AWS Certified Solutions Architect
  - AWS Certified Developer
  - AWS Certified SysOps Administrator
- CompTIA Security+ certification or an equivalent DoD 8140/8570 IAT Level II baseline certification.

Requirements

- Drive the definition and adoption of SLIs and SLOs across multiple services or entire platforms, ensuring alignment with business goals.
- Design and architect Infrastructure as Code (IaC) solutions for large-scale, complex environments, establishing standards and best practices.
- Implement and manage containerized and serverless architectures using Docker, Kubernetes, and cloud-native services, focusing on performance and error budgets.
- Build and maintain reliable and self-healing CI/CD pipelines to automate deployments and improve development workflows.
- Implement and refine comprehensive monitoring, alerting, and logging to detect and address performance and availability issues proactively.
- Lead the strategic effort to eliminate toil, identifying and championing major automation projects that deliver significant organizational efficiency.
- Lead high-severity incident response and coordinate blameless postmortems for major outages, driving the resulting remediation and systemic improvements.
- Implement cloud security best practices, including identity and access management (IAM), encryption, and compliance controls.
- Proactively identify and address system weaknesses and ensure performance under stress.
- Support disaster recovery and high availability strategies through backup and failover planning.
- Serve as a primary SRE liaison for development teams, influencing application architecture and design to meet reliability and scalability targets from inception.
- Create and maintain documentation for cloud architectures, deployment processes, and best practices.
- Contribute to internal knowledge-sharing initiatives, ensuring continuous learning within the team.
- Act as a subject matter expert and trusted advisor to clients and internal leadership on cloud infrastructure, reliability strategy, and Service Level Agreement (SLA) negotiations.
- Act on client feedback to refine and enhance cloud solutions.
- Conduct training and knowledge-sharing sessions to help clients manage their cloud environments effectively.
- Stay updated on the latest developments in cloud infrastructure and technology trends.
- Drive innovation by proposing and implementing new techniques and technologies.

Benefits

- Salary Range: $149,400 - $202,000
Senior SRE
Gorillas Group
Gorillas Group is a world leader in innovation, with a strong focus on pioneering advances in the dynamic fields of iGaming.
• Maintain platform reliability in production, focusing on availability, incidents, monitoring, deployments, costs, security, and capacity.
• Help structure and evolve alerts, runbooks, incident response rituals, SLOs, SLIs, metrics, and operational processes.
• Improve platform observability by separating useful alerts from noise and creating visibility for engineering, operations, and the business.
• Work with cloud, containers, pipelines, databases, networks, CDN/WAF, APIs, and automations, without listing the full architectural details here.
• Support decisions on scaling, resilience, isolation, rollback, security, cost, and technical evolution.
• Automate repetitive tasks and turn operational knowledge into simple, useful, and actionable documentation.
• Collaborate with development, security, product, support, and operations so that reliability is part of the product, not an island at the end of the delivery pipeline.
Sr. Video Streaming DevOps Engineer
Scratch Financial
Scratch Financial is the world's simplest patient financing solution.
Company Description

NBCUniversal is one of the world's leading media and entertainment companies. We create world-class content, which we distribute across our portfolio of film, television, and streaming, and bring to life through our global theme park destinations, consumer products, and experiences. We own and operate leading entertainment and news brands, including NBC, NBC News, NBC Sports, Telemundo, NBC Local Stations, Bravo, and Peacock, our premium ad-supported streaming service. We produce and distribute premier filmed entertainment and programming through our powerhouse film and television studios, including Universal Pictures, DreamWorks Animation, and Focus Features, and the four global television studios under the Universal Studio Group banner, and operate industry-leading theme parks and experiences around the world through Universal Destinations & Experiences, including Universal Orlando Resort, home to Universal Epic Universe, and Universal Studios Hollywood. NBCUniversal is a subsidiary of Comcast Corporation. Visit www.nbcuniversal.com for more information.

Our impact is rooted in improving the communities where our employees, customers, and audiences live and work. We have a rich tradition of giving back and ensuring our employees have the opportunity to serve their communities. We champion an inclusive culture and strive to attract and develop a talented workforce to create and deliver a wide range of content reflecting our world.

Job Description

NBCU's Distribution Engineering is responsible for the automation and reliability of NBCU's Live sources. It is responsible for the delivery of 200+ NBC and Telemundo stations and cable networks, and of live events such as sports, news and entertainment content, to third-party providers and Peacock.
NBCU's Distribution Engineering is looking to add a talented junior DevOps Engineer to be part of our Video Streaming team to build automation solutions to deploy and maintain applications, focusing on scalability, reliability and resiliency. We are looking for a dynamic team player who is an owner, accountable and passionate about technology. You will be working on high-visibility projects serving millions of viewers for industry-leading sporting and entertainment events, such as the Super Bowl, Olympics, Premier League, and America's Got Talent.

This position is eligible for company sponsored benefits, including medical, dental and vision insurance, 401(k), paid leave, tuition reimbursement, and a variety of other discounts and perks. Learn more about the benefits offered by NBCUniversal by visiting the Benefits page of the Careers website.

You will become a part of a team that deploys and supports applications at facilities across the country and business-to-business delivery of 24/7 Live content. You will work with 3rd party vendors to understand their systems and develop the automation to deploy, manage and monitor these systems. This is a fast-paced environment which requires you to think on your feet and pivot as needed to achieve our goals. Projects that seem easy will sometimes turn into hackathons.
Some of our tasks include:

- Deploy and maintain remote systems across the country and in the Cloud
- Develop Infrastructure as Code to instantiate / provision cloud resources
- Participate, facilitate, and/or lead large or complex troubleshooting activities that require significant judgment and reasoning outside area of expertise
- Work collaboratively with operations, engineers and vendors to gather requirements and develop solutions
- Analyze current technology utilized within the company and develop steps and processes to improve and expand upon them by evaluating new technology alternatives and vendor products
- Develop robust CI/CD pipelines that integrate security, code quality and automated testing
- Participate in an on-call rotation for L2 support off-hours

Qualifications/Requirements

- Bachelor's degree in computer science, IT or related experience
- 5+ years of prior DevOps experience or equivalent, modern infrastructure deployment and maintenance
- Strong experience with Linux/Unix administration, IP networking, and interfacing with cloud-based networks; good understanding of core networking fundamentals
- Experience designing and delivering cloud-hosted services (e.g., in AWS, GCP or Azure)
- Comfortable managing code with Git and Git CLI
- Working knowledge of CloudFormation, Terraform, Ansible and Chef to automate deployments of software/infrastructure
- Working knowledge of scripting languages and data formats such as Python, Bash, JavaScript, JSON, XML and YAML
- Experience with developing and deploying containerized systems, Kubernetes, Docker
- Experience with CI/CD build and deployment practices
- Experience with Splunk, Grafana, ELK to create dashboards and alerts
- Familiar with cyber security best practices and tooling such as CrowdStrike, Qualys, Airlock
- Experience using and/or developing AI agents to code and troubleshoot issues

Desired Characteristics

- Understanding of Broadcast and Media distribution workflows / technologies
- Experience developing software/scripts that leverage Microsoft Graph API
- Familiarity with streaming protocols and codecs (TS, HEVC, H.264, HLS, CMAF, DASH, SCTE-35, SCTE-224, ESAM, SRT/RIST)
- Good communicator who can clearly articulate complex issues and technologies
- Great design and problem-solving skills
- Willing to take ownership of problems and see them through to resolution
- Comfortable working in a fast-paced agile environment. Requirements change quickly and our team needs to constantly adapt to moving targets
- Experience with video encoding vendors such as Harmonic, Elemental, Synamedia, Evertz, Nevion

Additional Requirements:

- Fully Remote: This position has been designated as fully remote, meaning that the position is expected to contribute from a non-NBCUniversal worksite, most commonly an employee's residence.

This position is eligible for company sponsored benefits, including medical, dental and vision insurance, 401(k), paid leave, tuition reimbursement, and a variety of other discounts and perks. Learn more about the benefits offered by NBCUniversal by visiting the Benefits page of the Careers website.

Salary range: $110,000 - $135,000

We are accepting applications for this position on an ongoing basis.

Additional Information

As part of our selection process, external candidates may be required to attend an in-person interview with an NBCUniversal employee at one of our locations prior to a hiring decision. NBCUniversal's policy is to provide equal employment opportunities to all applicants and employees without regard to race, color, religion, creed, gender, gender identity or expression, age, national origin or ancestry, citizenship, disability, sexual orientation, marital status, pregnancy, veteran status, membership in the uniformed services, genetic information, or any other basis protected by applicable law.
If you are a qualified individual with a disability or a disabled veteran, you have the right to request a reasonable accommodation if you are unable or limited in your ability to use or access nbcunicareers.com as a result of your disability. You can request reasonable accommodations by emailing AccessibilitySupport@nbcuni.com.

For LA County and City Residents Only: NBCUniversal will consider for employment qualified applicants with criminal histories, or arrest or conviction records, in a manner consistent with relevant legal requirements, including the City of Los Angeles' Fair Chance Initiative For Hiring Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act, where applicable.
Senior DevOps Engineer
Zafran Security
Zafran's Threat Exposure Management Platform integrates with your security tools to reveal, remediate, and mitigate risk
Role Description

Zafran is looking for a Senior DevOps Engineer to own our US-only production environment end-to-end and lead our growing US DevOps presence. This is a hands-on role combining infrastructure ownership, deployment automation, and close collaboration with engineering on application-level investigations. You will be our first DevOps hire on the ground in the US, working alongside our Tel Aviv DevOps team while establishing the US site. The ideal candidate has a strong development background, thinks like an engineer, and is comfortable moving between Terraform, Kubernetes manifests, and application code to debug a production issue.

- Own the prod-us-only environment end-to-end: infrastructure deployment, maintenance, scaling, and reliability
- Lead and grow the US-based DevOps team as the site expands
- Design, implement, and manage scalable, highly available AWS infrastructure
- Build and maintain CI/CD pipelines that let developers ship safely and quickly
- Partner with engineering on investigations into application errors
- Operate and improve monitoring, logging, and alerting to ensure availability and performance; lead production incident response
- Coordinate with the Tel Aviv DevOps team on shared platform decisions and standards

Qualifications

- Must be located in the US, with a strong preference for the New York area; US remote considered
- Prior experience leading or mentoring a small team
- U.S. citizenship or lawful permanent resident status (Green Card) required due to access restrictions for a U.S.-only environment
- 5+ years of DevOps / SRE / platform engineering experience
- Strong development background in Go, Python, or similar; comfortable reading and contributing to application code
- Deep AWS experience across compute, networking, IAM, and data services
- Strong Kubernetes and Docker experience in production
- Infrastructure as Code with Terraform / Terragrunt
- CI/CD pipeline design with GitHub Actions, CircleCI, Jenkins, or similar
- Hands-on with Datadog or equivalent
- Experience with disaster recovery and high-availability architectures
- Excellent troubleshooting skills across the full stack

Nice to have

- Experience working across distributed teams and time zones
- Relevant certifications

Benefits

- Flexible PTO
- Health insurance plans (medical, dental, vision)
- Monthly stipend for phone and internet
- 401k
- Flexible spending account
- Home office stipend when joining
- Access to frontier AI models, including Claude

Company Description

At Zafran, people matter! We’re proud to be an equal opportunity employer. We believe the best teams are built by people who think differently, come from all kinds of backgrounds, and aren’t afraid to challenge the status quo. We welcome everyone across race, religion, gender, gender identity or expression, sexual orientation, age, disability, national origin, and veteran status, because what matters most is what you bring to the table. If you’re curious, fun, and someone who gets things done, we’d love to meet you.


