The network observability company.
Staff Site Reliability Engineer, Cloud
Location
United States
Posted
134 days ago
Salary
$165K - $200K / year
Seniority
Lead
Job Description
Kentik
• Make sure our real-time, scalable infrastructure is set up for growth and working efficiently. Our infrastructure runs on our own hardware across multiple locations, as well as on all major cloud vendors.
• Work on tools and processes to better monitor our platform and ensure its stability through our rapid growth.
• Deep-dive into diverse topics, from firewalls and IP routing to database replication strategies or build automation.
• Collaborate with engineering and infrastructure teams on finding solutions from an operational perspective.
• Assist with expanding our cloud deployments across the major cloud providers.
• Contribute code, code reviews, and tools or patches to all kinds of existing code.
• Write design documents or collaborate on colleagues’ docs to introduce new features or changes into our infrastructure.
• Provide valuable feedback on team goals, projects, and processes. We believe in continuously improving our team.
Job Requirements
- 8+ years of experience in cloud-based systems administration, IT, and/or SRE-related projects.
- Expertise in public cloud environments such as AWS, GCP, Azure, or OCI.
- Strong command of containerization and orchestration using Docker and Kubernetes.
- Solid programming and automation skills using Bash, Python, or Go.
- Proficiency with Infrastructure as Code (IaC) and configuration management platforms such as Terraform, Ansible, and Puppet.
- Proficiency in Linux administration and command-line tools (e.g., SSH, grep, awk).
- Detailed understanding of major internet protocols (TCP/IP, DNS, HTTP, TLS)
- Network administration experience: concepts such as routing, firewalls (iptables), and peering sound familiar.
- A passion for documenting code, processes, and infrastructure in runbooks and wikis
- Experience with metrics and monitoring solutions such as Grafana, Prometheus, Telegraf, and OpenTelemetry.
- Experience creating and managing tickets with third-party vendors and owning cloud vendor partner relationships.
Benefits
- The company pays 100% of premiums for health, vision, and dental coverage for you and your dependents
- Additionally, an annual Health Reimbursement Account (HRA) of $3,000 for an individual or $4,500 for a family
- Paid family & medical leave
- Open PTO, a quarterly Wellness Day, and a minimum of 10 paid holidays
- 401(k) retirement account
- Home office reimbursement
- Stock options
More DevOps Engineer Jobs
Principal DevOps Engineer
SageSure Insurance Managers
SageSure is an insurance company and division of Insight Catastrophe Group, a New York-based company that delivers property risk management services. As an empl
• Drive the development and continuous improvement of platform tools, emphasizing scalability, reliability, and monitoring capabilities to effectively support engineering teams.
• Design and implement self-service tools and frameworks that empower engineering teams, promoting scalability, efficiency, and reusability across various platforms.
• Provide expert-level technical oversight and mentorship to engineering teams, ensuring platform capabilities are seamlessly integrated into workflows and aligned with organizational goals.
• Establish and maintain comprehensive technical documentation and engineering standards, ensuring platform tools remain understandable, extensible, and accessible to all teams.
• Analyze and resolve complex performance issues within platform tools, identifying root causes, and implementing robust, scalable solutions to enhance efficiency and reliability.
• Proactively research and adopt new technologies, tools, and engineering patterns that elevate developer productivity and improve self-service capabilities.
• Focus extensively on scalability, performance optimization, and sustainable software delivery, ensuring efficient resource utilization and cost effectiveness.
• Actively participate in on-call rotations, providing critical expertise and technical guidance to maintain production environment resilience and high availability.
Senior DevOps Engineer
Agiloft
The global standard in no-code contract lifecycle management (CLM) software.
• Help design, build, and maintain a stable and efficient infrastructure to optimize service delivery across production throughout the development lifecycle.
• Monitor, troubleshoot, maintain, and continuously improve building, packaging, and deployment processes.
• Collaborate within the Cloud Ops team as well as with QA and development to troubleshoot performance issues.
Site Reliability/DevSecOps Engineer
AutoRABIT
Role Description
AutoRABIT is looking for a Site Reliability/DevSecOps Engineer to help develop, scale, and operate our cloud services. In this role you will be an experienced business professional able to implement and execute best-practice operations and improvements across teams by providing visibility and recommendations for improved reliability and automation. You will be responsible for the security, availability, performance, efficiency, change management, monitoring, emergency response, capacity planning, backup, and disaster recovery of our technical ecosystem, and will drive automation while building a robust and agile DevSecOps framework. Accountability, agility, and strong analytical skills, paired with an obsession for learning, gathering data, and executing on that data, are key to being successful in this role.
Responsibilities
- Site Reliability or DevSecOps engineer with a passion for automation, reliability, scalability, monitoring, and capacity planning.
- Contribute to the development and maintenance of frameworks for monitoring, automation, and code to increase the scalability and reliability of the service.
- Assist both internal and customer-facing teams with deployment of new software releases, VPN, and other related security infrastructure interfacing.
- Assist with resolution of AutoRABIT service or customer issues as required.
- Participate in and practice sustainable incident response and blameless postmortems.
- Contribute to the automation of manual tasks, such as the provisioning of users in production and test environments.
- Work within a small agile team to develop and improve SRE software, support your peers, plan, and self-improve.
- Participate in a regular on-call or rotational schedule needed to support AutoRABIT servers, including weekends and holidays.
Qualifications
- Experience deploying and maintaining scalable, resilient, and secure infrastructure on AWS, GCP, and/or Azure cloud services, including automation.
- Knowledge of key DevSecOps tools for monitoring (ELK, AWS Azure CloudWatch, etc.) and infrastructure management platforms (Kubernetes, Docker, Ansible, Jenkins, Terraform, etc.).
- Experience with shell scripting (Bash), Python, or equivalent is required.
- Knowledge of programming languages such as Python, Go, or Java.
- Experience with configuration management tools such as Ansible or Chef.
- Solid understanding of CI/CD pipelines and tools such as Jenkins, GitLab CI, or CircleCI.
- Excellent troubleshooting skills in SaaS or customer environments.
- Team player, receiving and giving feedback as well as sharing knowledge.
- Can-do attitude: challenging the status quo, leading, and contributing to key improvements and innovations, while maintaining accountability.
- Excellent written and verbal US English communication skills for working across a global team environment.
- Responsible for adhering to established internal controls.
Requirements
- Bachelor's in Computer Science, Engineering, or equivalent degree or experience.
- 2+ years of experience in infrastructure management, DevOps, or site reliability, preferably in a SaaS or cloud environment.
- AWS, GCP, and/or Azure certified.
- 2+ years of Kubernetes experience.
- 2+ years of experience managing Linux-based systems in a public cloud such as AWS, GCP, or Azure.
- 2+ years of experience with systems monitoring and logging; knowledge of ELK is preferred.
- Solid understanding of standard TCP/IP networking and common protocols like DNS, load balancers, HTTP, etc.
- Must be a US citizen or permanent resident, capable of obtaining a government security clearance if required, and must live and work in the US. Green card holders qualify, but H1B or other work visa holders do not qualify for this role.
Benefits
- Salary range for the role is $150,000 to $175,000 per year, depending on experience.
- This is a 100% remote job.
DevOps Engineer
TetraScience
TetraScience is a cloud-native technology company that develops software and hardware solutions for monitoring and managing research experiments, as well as clo
• Collaborate with product and engineering teams to drive and enhance the entire lifecycle of our products, from design and development to deployment and operation.
• Work closely with clients to deploy and troubleshoot our products in clients' AWS environments, ensuring smooth integration and optimal performance.
• Develop CloudFormation templates, Terraform modules, Python scripts, deployment frameworks, monitors, and self-healing tools to automate processes and improve efficiency.
• Assist the software engineering team in building accurate monitoring and metrics systems for applications before they go into production.
• Manage the internal AWS environments and network, ensuring stability, security, and scalability while keeping costs in check.
• Participate in meetings with potential clients, working alongside solution architects to address their questions and concerns regarding integration of our products into their network and AWS accounts.
• Maintain up-to-date documentation on deployments, processes, and standard operating procedures.



