We are building an AI agent trading platform. Create your agent, customize its strategy, and trade on your favorite exchanges.
Backend/DevOps Engineer
Location
United States
Posted
100 days ago
Salary
0
Seniority
Senior
Job Description
Nick AI
- Design and manage infrastructure deployments using Docker, Kubernetes, and AWS/GCP.
- Develop secure key management systems for API keys and wallet abstraction.
- Implement monitoring, logging, and incident handling for execution nodes.
- Set up CI/CD pipelines and streamlined developer workflows (GitHub Actions).
- Optimize infrastructure for reliability, security, and scalability.
- Collaborate with backend engineers to support execution and receipts systems.
Job Requirements
- 5+ years in DevOps or backend infrastructure roles.
- Expertise with Docker, Kubernetes, and cloud platforms (AWS/GCP).
- Security-first mindset with experience in key management and access control.
- Experience with monitoring and observability tools (Prometheus, Grafana, ELK).
- Strong scripting skills (Python, Bash, or similar).
- Proven track record scaling systems to production usage.
- Background in fintech, trading, or Web3 infrastructure preferred.
- Strong understanding of deployment best practices and incident response protocols.
Benefits
- Competitive salary commensurate with experience
- Flexible PTO and sick leave policies
- Fully remote-friendly, with flexible working hours
- Access to cutting-edge AI and trading technologies
- Opportunity to design core infra for a fast-growing product
- Direct impact on product reliability and security
More DevOps Engineer Jobs
Database Reliability Engineer
WorkOS
WorkOS is an internet company providing a developer platform that helps app-builders sell their apps to enterprise customers with only a few lines of code. Founded in 2019, the com
- Own the reliability, performance, and scalability of WorkOS's PostgreSQL infrastructure.
- Analyze and implement best practices for our database clusters, including replication, connection pooling, high availability, and disaster recovery.
- Build and maintain observability for database metrics (query performance, replication lag, connection saturation, storage growth) and ensure we meet our database SLOs.
- Provide database expertise to product engineering teams through migration reviews, query optimization guidance, and schema design consultation.
- Develop automation and self-service tooling that enables engineers to safely interact with databases without bottlenecking on the DBRE team.
- Participate in on-call rotations and lead incident response for database-related production issues, performing root cause analysis and implementing permanent fixes.
- Plan and manage database capacity, forecasting growth and ensuring our infrastructure can handle increased workloads.
- Collaborate with SREs to roll out infrastructure changes to production environments, with a focus on minimizing risk to the data layer.
- Document operational procedures, runbooks, and architectural decisions so learnings become repeatable actions and eventually automation.
- Drive improvements to backup and recovery strategies, regularly testing and validating disaster recovery procedures.
Site Reliability Engineer
WorkOS
WorkOS is an internet company providing a developer platform that helps app-builders sell their apps to enterprise customers with only a few lines of code. Founded in 2019, the com
- Design and evolve the systems, tooling, and processes that improve the reliability and performance of WorkOS.
- Collaborate with product and infrastructure teams to ensure services are production-ready, observable, and resilient to failure.
- Define and measure SLIs/SLOs to guide reliability improvements.
- Write and optimize backend systems (in TypeScript) with a focus on performance, maintainability, and graceful degradation.
- Improve our incident response process, lead postmortems, and drive follow-through on reliability risks.
- Develop internal tools and automations that make it easier to operate and scale our systems.
- Participate in our on-call rotation: responding to, resolving, and learning from production incidents.
- Contribute to design and architecture discussions with a focus on operability and long-term sustainability.
- Document systems, share learnings, and help grow a reliability-minded engineering culture.
Network DevOps Engineer, RDMA Fabric Automation
Vultr
Vultr is on a mission to make high-performance cloud computing easy to use, affordable, and locally accessible.
- Automate deployment and operations of large-scale RDMA (RoCEv2) Ethernet fabrics across Vultr data centers.
- Build Ansible and Python-based frameworks to provision, validate, and remediate underlay and overlay networks.
- Integrate network automation with Vultr’s source-of-truth systems (NetBox, OpsMill) for intent-driven configuration and validation.
- Develop telemetry ingestion and correlation pipelines (gNMI, Prometheus, Kafka, custom collectors) for real-time network health and performance metrics.
- Collaborate with platform, orchestration, and product engineering teams to optimize RDMA performance, PFC/ECN behavior, and path symmetry across fabrics.
- Implement CI/CD workflows for network configuration changes: validation, pre-checks, and rollbacks.
- Investigate complex network behaviors across layers: flow hashing, congestion domains, ECMP, and overlay interactions.
- Contribute to the design of next-generation GPU and AI interconnect fabrics, ensuring seamless integration into Vultr’s global network architecture.
Senior Site Reliability Engineer, Core Cloud Engineering
Vultr
Vultr is on a mission to make high-performance cloud computing easy to use, affordable, and locally accessible.
- Operate and scale Vultr’s control plane, ensuring availability, correctness, and performance across global datacenters.
- Design, implement, and maintain automation to manage hypervisor fleets (KVM, QEMU, libvirt) and supporting infrastructure at scale.
- Develop tooling and automation for Open vSwitch (OVS), BGP routing, and other networking components to ensure resilient and self-healing network operations.
- Continuously analyze and improve system performance across compute, storage, and network layers, with an emphasis on reducing toil and eliminating single points of failure.
- Implement advanced monitoring, logging, and tracing solutions (Grafana, Sentry, SumoLogic) while leading incident response to minimize impact and drive postmortem culture.
- Maintain and evolve infrastructure pipelines (GitLab CI/CD, Puppet) to enable safe, fast, and reliable changes to both control plane and hypervisor infrastructure.
- Work closely with Software Engineers, Network Engineers, and Product teams to align platform reliability with business and user needs.
- Produce clear technical documentation for runbooks, operational procedures, and automation frameworks to improve team efficiency and reliability standards.
- Coach and mentor team members in best practices for site reliability, incident handling, automation, and low-level Linux systems debugging.


