Bright Vision Technologies is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications. We recognize that our people are our strength. We are an equal opportunity employer and place a high value on diversity and inclusion. We do not discriminate on the basis of any protected attribute. We make reasonable accommodations for applicants’ and employees’ religious practices and beliefs, as well as mental health or physical disability needs. Bright Vision Technologies is an Equal Opportunity Employer, including Disability/Veterans.

Site Reliability Engineer

Engineer, Full Time, Remote, Mid Level

Location: United States
Posted: 4 days ago
Salary: $100K - $150K / year
Seniority: Mid Level

Distributed Systems, Observability/Monitoring, Prometheus, Grafana, OpenTelemetry, Datadog, Python, Shell, Kubernetes, CI/CD, Java, Linux, AWS, Azure, GCP, Istio, Linkerd, Consul

Job Description

Role Description

We are seeking an experienced Site Reliability Engineer to ensure the availability, performance, and operational excellence of large-scale distributed systems in production. As an SRE you will live at the boundary between development and operations, applying strong software engineering principles to infrastructure and operations problems, and continually pushing the platform toward higher reliability with lower operational toil. The ideal candidate will combine deep systems knowledge with strong programming skills, a measurement-driven mindset, and the discipline to design, automate, and operate complex services so that reliability becomes a first-class engineering deliverable rather than a reactive concern.

Key Responsibilities

- Define, instrument, and continually refine service-level objectives (SLOs), service-level indicators (SLIs), and error budgets for critical services.
- Lead incident response and resolution for production issues, ensuring high-quality post-incident reviews.
- Design and implement comprehensive monitoring, logging, and tracing strategies using tools like Prometheus, Grafana, OpenTelemetry, ELK/EFK, Datadog, or similar.
- Build and maintain robust on-call processes, runbooks, and escalation paths.
- Automate operational toil by writing production-grade tooling in Python, Go, Bash, or similar languages.
- Architect and operate large-scale Kubernetes clusters and container-based workloads.
- Design CI/CD pipelines that promote safe, frequent, and observable releases.
- Lead capacity planning and performance engineering activities.
- Partner closely with application development teams to embed reliability practices early in design.
- Strengthen the platform’s resiliency through chaos engineering and fault injection.
- Drive continuous improvement of security posture in collaboration with security teams.
- Contribute to the technical roadmap for reliability tooling and observability platforms.
- Mentor engineers across the organization on SRE practices.

Qualifications

- Bachelor’s degree in Computer Science, Engineering, or a related technical discipline.
- Five or more years of SRE, DevOps, or production engineering experience supporting large-scale distributed systems.
- Strong programming skills in at least one of Python, Go, or Java.
- Deep, hands-on experience operating Linux at scale.
- Production experience operating Kubernetes and container-based workloads.
- Strong working knowledge of observability tooling such as Prometheus, Grafana, OpenTelemetry, ELK/EFK, or commercial equivalents.
- Hands-on experience designing and operating CI/CD pipelines.
- Solid understanding of distributed system design.
- Demonstrated experience leading incident response and conducting effective post-incident reviews.
- Excellent communication and documentation skills.

Preferred Qualifications

- Experience defining and operationalizing SLOs and error budgets in real production environments.
- Exposure to chaos engineering practices and tools such as Chaos Monkey, Gremlin, or Litmus.
- Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
- Background in capacity planning, performance engineering, or large-scale load testing.
- Familiarity with service mesh technologies such as Istio, Linkerd, or Consul.

How to Apply

For immediate consideration, please send your resume to [email protected] or contact us at (908) 650-6699. Learn more about Bright Vision Technologies at www.bvteck.com.

More Engineer Jobs

AI Engineer - GCP

IRIUM

Leaders in integrated service management for IT infrastructure and platforms.

Engineer, 4 days ago

Full Time, Remote, Team 501-1,000, Since 2002, H1B No Sponsor

• Design, develop, and implement advanced AI/ML and agent solutions on Google Cloud Platform (GCP).
• Collaborate on an international project in a fully remote setting.

Cloud, Google Cloud Platform, Kubernetes, Python, Terraform

Spain

€45K - €65K / year

Petroleum Engineer

ComboCurve

Year-end reserves, A&D, type curve, and scheduling workflows all on one cloud-based platform.

Engineer, 4 days ago

Full Time, Remote, Team 51-200, Since 2017, H1B Sponsor

• Contributing to ComboCurve’s client retention goals by driving product adoption and expanding usage through proactive, regularly scheduled touchpoints.
• Serving as the primary owner of the consultative journey for an assigned portfolio of customers.
• Advising clients on industry best practices, reservoir engineering workflows, and how to maximize value using ComboCurve.
• Identifying opportunities for user growth and account expansion within existing client organizations.
• Developing, executing, and maintaining strategic success plans tailored to each allocated customer.
• Acting as a trusted advisor and thought leader within the oil and gas industry by contributing to relevant content, discussions, and insights.
• Serving as the source of truth for all facets of your customer base, including technical workflows, business objectives, and account strategy.
• Driving client adoption by aligning ComboCurve solutions with evolving business needs and operational goals.
• Supporting customer inquiries through established support processes and ensuring timely resolution.
• Advocating for customers internally by providing strategic feedback to Product, Sales, and Leadership teams to continuously improve the customer experience.
• Traveling intermittently for on-site customer engagements and trade shows.

Texas

Forward Deployed Engineer

SHI International Corp.

Engineer, 4 days ago

Full Time, Remote, Team 5,001-10,000, H1B No Sponsor

• Deliver production workflows, ontologies, pipelines, and AIP-powered applications inside customer Foundry environments.
• Integrate customer data sources (ERP, CRM, legacy systems, APIs) into Foundry and build operational workflows on top of them.
• Partner with the Account Strategist to translate business priorities into concrete deployment plans.
• Execute rapid-value deployments in the first 90 days of every engagement.
• Contribute to reusable templates and delivery patterns that scale across customers and verticals.
• Work directly with customer operational leaders and subject matter experts to refine and operate solutions.
• Support account expansion by surfacing new use cases during delivery.
• Work across priority verticals including manufacturing, logistics and supply chain, financial services, and healthcare.

ERP, Python, SQL

Texas

Omnibus Engineer

Everforth

Everforth Apex, a division of Everforth and formerly Apex Systems, an IT staffing and workforce solutions firm, provides recruiting and staffing services to lar

Engineer, 4 days ago

Other

Location: Richmond, United States
Employee Type: Contract

Qualifications: Senior IBM Netcool/Omnibus Engineer

Position Overview

We are seeking an experienced Senior IBM Netcool/Omnibus Engineer to join our team. The ideal candidate will possess deep technical expertise in IBM Netcool/Omnibus event management platforms, strong Linux administration skills, and Oracle database knowledge. This role is critical in maintaining and optimizing our enterprise event management infrastructure and ensuring seamless integration with our broader IT ecosystem. This Engineer will be a critical resource in an active migration and deployment of the environment.

Key Responsibilities

Event Management & Platform Administration
• Design, implement, and maintain IBM Netcool/Omnibus infrastructure for enterprise-wide event management
• Develop and optimize ObjectServer configurations, triggers, and automation procedures
• Create and maintain event correlation rules, filters, and enrichment policies to reduce alert noise and improve signal-to-noise ratio
• Monitor platform performance and implement tuning strategies to ensure optimal operation
• Manage probe configurations for multi-vendor monitoring tool integration

Technical Operations
• Administer Linux systems (RHEL) hosting Netcool/Omnibus components
• Apply an understanding of Netcool/Omnibus Oracle database schemas to data management, integration, and reporting tasks for Netcool backend databases
• Implement and maintain high availability and disaster recovery solutions
• Execute system upgrades, patching, and maintenance activities with minimal downtime
• Troubleshoot complex technical issues across the event management stack

Integration & Automation
• Design and implement webhook integrations with third-party monitoring and ITSM platforms
• Develop custom integration solutions using APIs, scripting (PHP, JavaScript, Bash, Python, etc.), and middleware technologies
• Create automation workflows to streamline event processing and incident management
• Collaborate with Observability teams to onboard new monitoring sources

Best Practices & Documentation
• Apply ITIL-aligned event management best practices
• Develop and maintain comprehensive technical documentation, runbooks, and operational procedures
• Establish and enforce naming conventions, coding standards, and configuration management practices
• Participate in change advisory board reviews and ensure proper change management protocols are followed

Required Qualifications

Technical Skills
• 5+ years of hands-on experience with IBM Netcool/Omnibus (ObjectServer, Probes, Gateways)
• Strong understanding of Netcool architecture, including OMNIbus, Impact, WebGUI/Dash, and Message Bus
• 3+ years of Linux system administration experience (user management, file systems, networking, security)
• Working knowledge of Oracle database administration (SQL queries, basic DBA tasks, performance monitoring)
• Proficiency in scripting languages: Bash, Python, or Perl
• Experience with event correlation techniques and alert management strategies

Professional Knowledge
• Solid understanding of event management best practices and ITIL principles
• Knowledge of network protocols (SNMP, Syslog, HTTPS/REST APIs)
• Familiarity with monitoring concepts: thresholds, baselines, anomaly detection
• Understanding of enterprise IT infrastructure (networks, servers, applications, databases)

Soft Skills
• Excellent problem-solving and analytical abilities
• Strong communication skills with ability to explain technical concepts to non-technical stakeholders
• Ability to work independently and collaboratively in a team environment
• Experience working in 24/7 production environments with on-call rotation participation

Preferred Qualifications

Highly Valued Experience
• Dynatrace integration: Experience integrating the Dynatrace monitoring platform with Netcool via webhooks or API, leveraging Dynatrace's AI-powered problem detection
• ServiceNow integration: Hands-on experience with bidirectional integration between Netcool and ServiceNow for incident and event management workflows
• Experience with webhook architectures and RESTful API integrations
• Knowledge of additional Observability and Enterprise Data Center tools (Splunk, Control-M)
• IBM Netcool Impact or OMNIbus Process Control scripting
• Experience with IBM Cloud Pak for AIOps
• ITIL v3 Foundation certification or higher

Additional Assets
• Bachelor's degree in Computer Science, Information Technology, or related field (or equivalent experience)
• IBM certifications related to Netcool/Tivoli products

Work Environment
• Hybrid/Remote options available depending on location
• Participation in on-call rotation as needed
• Occasional after-hours maintenance windows
• Fast-paced, mission-critical environment

Everforth Apex Benefits Overview

Everforth Apex offers a range of supplemental benefits, including medical, dental, vision, life, disability, and other insurance plans that offer an optional layer of financial protection. We offer an ESPP (employee stock purchase program) and a 401K program which allows you to contribute typically within 30 days of starting, with a company match after 12 months of tenure. Everforth Apex also offers an HSA (Health Savings Account on the HDHP plan), a SupportLinc Employee Assistance Program (EAP) with up to 8 free counseling sessions, a corporate discount savings program, and other discounts. In terms of professional development, Everforth Apex hosts an on-demand training program, provides access to certification prep and a library of technical and leadership courses/books/seminars once you have 6+ months of tenure, and offers certification discounts and other perks to associations that include CompTIA and IIBA.

Remote: Yes
Pay Range: $80 - $80 per hour