Site Reliability Engineer (SRE)
Location
United States
Posted
1 day ago
Salary
$100K - $150K / year
Seniority
Mid Level
Job Description
Bright Vision Technologies
Role Description
We are seeking an experienced Site Reliability Engineer to ensure the availability, performance, and operational excellence of large-scale distributed systems in production. As an SRE, you will live at the boundary between development and operations, applying strong software engineering principles to infrastructure and operations problems, and continually pushing the platform toward higher reliability with lower operational toil. The ideal candidate will combine deep systems knowledge with strong programming skills, a measurement-driven mindset, and the discipline to design, automate, and operate complex services so that reliability becomes a first-class engineering deliverable rather than a reactive concern.
Key Responsibilities
- Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services.
- Lead incident response and resolution for production issues, acting as a calm and effective incident commander when needed.
- Ensure high-quality post-incident reviews that drive lasting improvements.
- Design and implement comprehensive monitoring, logging, and tracing strategies using Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar tooling.
- Build and maintain robust on-call processes, runbooks, and escalation paths.
- Automate operational toil aggressively by writing production-grade tooling in Python, Go, Bash, or similar languages.
- Architect and operate large-scale Kubernetes clusters and container-based workloads.
- Design CI/CD pipelines that promote safe, frequent, and observable releases.
- Lead capacity planning and performance engineering activities.
- Partner closely with application development teams to embed reliability practices early in design.
- Strengthen the platform’s resiliency through chaos engineering, fault injection, and well-tested failover paths.
- Drive continuous improvement of security posture in collaboration with security teams.
- Contribute to the technical roadmap for reliability tooling and observability platforms.
- Mentor engineers across the organization on SRE practices.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
- Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
- Strong programming skills in at least one of Python, Go, or Java.
- Deep, hands-on experience operating Linux at scale.
- Production experience operating Kubernetes and container-based workloads.
- Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
- Hands-on experience designing and operating CI/CD pipelines.
- Solid understanding of distributed system design.
- Demonstrated experience leading incident response and conducting effective post-incident reviews.
- Excellent communication and documentation skills.
Preferred Qualifications
- Experience defining and operationalizing SLOs and error budgets in real production environments.
- Exposure to chaos engineering practices and tools such as Chaos Monkey, Gremlin, or Litmus.
- Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
- Background in capacity planning, performance engineering, or large-scale load testing.
- Familiarity with service mesh technologies such as Istio, Linkerd, or Consul.
How to Apply
Would you like to know more about this opportunity? For immediate consideration, please send your resume to [email protected] or contact us at (908) 505-3899. Learn more about Bright Vision Technologies at www.bvteck.com.
More DevOps Engineer Jobs
Site Reliability Engineer
MLabs LTD
Founded in 2018, MLabs is a private software engineering consultancy specializing in Haskell and Rust development with a focus on blockchain, artificial intelli
Role Description
We are hiring on behalf of our client, a high-performance financial technology organization specializing in advanced integration service products. The successful candidate will join a multi-disciplinary Site Reliability Engineering (SRE) team that actively champions a comprehensive automation culture.
- Automated Infrastructure Provisioning: Architect and build automated provisioning systems for global server and network architectures across both physical bare-metal environments and public cloud infrastructure (AWS, GCP).
- Continuous Delivery Pipeline Management: Evolve, maintain, and optimize the Continuous Delivery (CD) pipeline responsible for provisioning servers, configuring network switches, and deploying core software updates.
- External Stakeholder Interfacing: Interact directly with hardware vendors, telecommunications providers, and external financial institutions to manage connectivity and optimize remote operations.
- Collective System Ownership: Take shared ownership of the stability, performance, and monitoring of the end-to-end production environment alongside a tight-knit engineering team.
Qualifications
- Practical experience with software engineering is required.
- Proficiency in basic programming within a language of choice is necessary.
- Demonstrated experience working with software automation tools (specifically Ansible).
- Proven experience operating as a Systems Administrator or Network Engineer.
- A constructive, open-minded, and self-motivated approach to technical problem-solving.
- A strong belief in lifelong learning and an awareness of evolving technologies.
- High professional autonomy, with a proven ability to take the initiative within a collaborative, team-oriented ecosystem.
- Experience managing third-party relationships, negotiating with vendors, procuring hardware, and supporting remote workforce initiatives via phone and email.
Requirements
- Familiarity with enterprise Authentication Protocols (e.g., SAML, OAuth2, Active Directory, Kerberos).
- Technical proficiency in Microsoft and Azure Active Directory.
- Hands-on exposure to Database High Availability (HA) frameworks, including clustering and replication models (e.g., PostgreSQL).
Benefits
- Competitive Base Compensation: Up to £110,000 per annum, commensurate with experience and skill set.
- Equity Participation: Allocation of company share options.
- Corporate Benefits: A comprehensive standard benefits package.
- Workplace Flexibility: Remote-first working philosophy with flexible arrangements within the UK and Europe.
Interview Process
- Initial Screening Call: An introductory conversation with the Talent Acquisition/HR team.
- Hiring Manager Interview: A 1-hour in-depth technical domain screen.
- Technical Evaluation: A 2-to-3-hour practical interview incorporating a live coding and infrastructure automation exercise.
- Executive Interview: A final strategic conversation with the firm’s three Founders.
Commitment to Equality and Accessibility
At MLabs, we are committed to offering equal opportunities to all candidates. We ensure no discrimination, accessible job adverts, and information provided in accessible formats. Our goal is to foster a diverse, inclusive workplace with equal opportunities for all.
DevOps Engineer
General Dynamics
General Dynamics is a global aerospace and defense company offering products designed to provide safety and security to people around the world. In the past, General Dynamics has p
Role Description
Transform technology into opportunity as a DevOps Engineer at GDIT. Shape what’s next for mission-critical government projects, while shaping what’s next for your engineering career. As a Senior DevOps Engineer, the work you’ll do at GDIT will be impactful to the mission of the Defense Health Agency (DHA). You will play a crucial role on the ABACUS program, supporting Revenue Cycle Management Operations across 136 Medical Treatment Facilities (MTF) worldwide.
You will:
- Lead/Manage/Support the ABACUS front end (Development) while supporting the Cloud Management/Engineering team
- Collaborate between the GDIT ABACUS team and the customer to achieve the mission
- Drive efficiencies within AWS GovCloud to increase performance while lowering costs
Role requirements: This is a flex position that must support the front-end development of the SaaS application (~30%) and support the Engineering team as required (~70%).
Qualifications
- Bachelor of Arts/Bachelor of Science
- 5+ years of related experience
- CompTIA Security+ certification is required
- Experience working in DHA/DoD
- .NET and SQL development
- Amazon Web Services (AWS)
- SDLC, troubleshooting, integration and test strategies and implementation
- Optimizing systems operations
- Windows, Cloud Operations, AWS GovCloud, System Optimization
- Must be a team player and have good communication with all levels of staff
Requirements
- 5+ years of related experience
- Must pass a T3 background investigation; Secret eligible
- US citizenship required
Benefits
- Comprehensive benefits and wellness packages
- 401K with company match
- Competitive pay and paid time off
- Full-flex work week to own your priorities at work and at home
- Award-winning culture of innovation and a military-friendly workplace
Company Description
We are GDIT, a global technology and professional services company that delivers consulting, technology and mission services to every major agency across the U.S. government, defense and intelligence community. Our 26,000 experts extract the power of technology to create immediate value and deliver solutions at the edge of innovation. We operate across 50 countries worldwide, offering leading capabilities in digital modernization, AI/ML, Cloud, Cyber and application development. Together with our clients, we strive to create a safer, smarter world by harnessing the power of deep expertise and advanced technology.
Java Support Engineer
Accenture
Accenture Federal Services, a division of Accenture, provides technology and consulting services to U.S. federal agencies, delivering solutions that enhance performance and efficie
Role Description
Dare to be a part of the challenge! Come and join our team. Together we can make the difference! Did you know that Accenture is leading the digital transformation in the world?
Qualifications
- 3-6 years of experience.
- Mandatory Skills:
  - Java
  - Spring Boot
  - Cloud AWS
  - English C1
- Nice to have:
  - OpenText / Content Management Systems
  - SQL (any RDBMS or NoSQL)
Benefits
- Career development according to your profile and interests.
- Work in one of the best companies and feel proud.
- Access to an innovative methodology and tools.
- Direct contact with experts worldwide.
- Use of work schemes and cutting-edge technologies.
- Constant training.
- Work environment based on teamwork and collaboration.
- Participation in International Projects.
Company Description
Accenture is a leading global professional services company that helps the world’s leading businesses, governments and other organizations build their digital core, optimize their operations, accelerate revenue growth and enhance citizen services, creating tangible value at speed and scale. We are a talent- and innovation-led company with approximately 791,000 people serving clients in more than 120 countries. Technology is at the core of change today, and we are one of the world’s leaders in helping drive that change, with strong ecosystem relationships. We combine our strength in technology and leadership in cloud, data and AI with unmatched industry experience, functional expertise and global delivery capability. Our broad range of services, solutions and assets across Strategy & Consulting, Technology, Operations, Industry X and Song, together with our culture of shared success and commitment to creating 360° value, enable us to help our clients reinvent and build trusted, lasting relationships. We measure our success by the 360° value we create for our clients, each other, our shareholders, partners and communities. Visit us at www.accenture.com
