Software Mind

Software House focused on results since 1999

Senior DevOps Engineer

DevOps Engineer | Full Time | Remote | Senior | Team 1,001-5,000 | Since 1999 | H1B No Sponsor

Location

Latin America (LATAM)

Posted

1 day ago

Salary

Not specified

Seniority

Senior

Job Description

Role Description

Software Mind is building a private, tenant-isolated AI assistant for the real estate title and settlement industry. The platform is a retrieval-first (RAG) system that ingests historical email, documents, and structured metadata into a per-tenant vector index, and serves grounded, cited, expert-weighted answers through a chat-style Q&A interface with single sign-on and full audit logging. The platform is AWS-native with a Python/FastAPI backend, Vue.js frontend, OpenSearch/Pinecone vector store, and OpenAI/Anthropic/Bedrock as the LLM provider.

You will join a senior, cross-functional, LATAM-based team where hands-on AI delivery experience, not just familiarity, is the baseline expectation. You stand up and own the cloud infrastructure and CI/CD foundation the entire project runs on. Your work is on the critical path from day one: delivery begins with environment provisioning. You design for tenant isolation, observability, and security from the outset, not as an afterthought. This role requires prior experience operating infrastructure for production AI or LLM-based workloads.

Responsibilities

- Provision and configure a dedicated VPC and segmented cloud environment on AWS
- Build the baseline CI/CD pipeline, then maintain and evolve it across all delivery phases
- Configure and manage the vector store infrastructure (OpenSearch/Pinecone on AWS)
- Set up and manage the observability stack: CloudWatch, X-Ray, alerting thresholds, and LLM-specific monitoring
- Implement infrastructure-as-code for all environments (dev, staging, production) using Terraform or CDK
- Manage secrets, KMS encryption key configuration, and tenant-scoped access controls
- Configure LLM provider connectivity (OpenAI / Anthropic / Amazon Bedrock enterprise tier, zero-data-retention)
- Define and implement an environment promotion strategy aligned with the 2-week sprint cadence
- Support incremental ingestion pipeline infrastructure requirements and nightly scheduling

Qualifications

- English at 90%+ proficiency, written and oral (at least B2 level), with excellent communication skills
- 6+ years in DevOps or cloud infrastructure engineering; strong AWS specialization required
- Infrastructure-as-code: Terraform, CloudFormation, or AWS CDK
- CI/CD tooling: GitHub Actions, AWS CodePipeline, or equivalent
- Core AWS services: VPC, ECS, Lambda, S3, DynamoDB, API Gateway, Cognito, CloudWatch, X-Ray
- Experience designing and operating multi-tenant cloud environments with tenant-level data isolation
- At least one project operating infrastructure for a production AI/ML or LLM-integrated system, not just general cloud workloads
- Experience configuring and managing vector store infrastructure (OpenSearch, Pinecone, Weaviate, or equivalent) in a production environment
- Familiarity with LLM provider APIs (OpenAI, Anthropic, or Amazon Bedrock) in a production/enterprise configuration, including zero-data-retention tier setup
- Understanding of AI-specific observability concerns: token usage monitoring, latency profiling for LLM calls, and model response logging

Additional Information

- Experience with enterprise SSO and identity federation: Cognito, Okta, or Azure AD
- Background in HIPAA, SOC 2, or regulated-data cloud environment configuration
- Familiarity with OCR or document processing service infrastructure (AWS Textract, etc.)
- We are accepting applications from LATAM countries

More DevOps Engineer Jobs

Full Time | Remote | Team 11-50 | Since 2006 | H1B No Sponsor

• Design, implement, and maintain scalable Kubernetes infrastructure on GKE/EKS
• Develop and manage Infrastructure as Code using Terraform, Helm, and Ansible
• Build and improve CI/CD pipelines for fast and reliable deployments
• Implement and maintain monitoring, logging, and alerting solutions
• Support PostgreSQL and Kafka environments
• Automate operational tasks using Python and Bash scripting
• Troubleshoot production issues across cloud and Kubernetes environments
• Collaborate with developers to improve deployment and operational processes
• Participate in on-call rotation and production support

Europe

Senior Network Deployment Engineer

CodiLime

A strategic partner for technology-driven companies | Network engineering | Software engineering

Full Time | Remote | Team 201-500 | Since 2011 | H1B No Sponsor

• Leading design, architecture, and optimization for networking infrastructure and devices in large-scale DCs and/or offices
• Overseeing entire site deployment cycles from initial requirements to operational handover
• Collaborating with cross-functional IT, security, and facility teams plus external vendors
• Enforcing and maintaining robust technical documentation and architectural standards
• Developing Python and/or Ansible scripts to drive network automation initiatives

Egypt

Senior Consultant – Site Reliability Engineering

Fabric Group

Good Problems. Unlocking value from business challenges

Contract | Remote | Team 51-200 | Since 2006 | H1B No Sponsor

• Consultative Ownership: Work with autonomy to own problems and deliver solutions, acting as a bridge between development and operations.
• Observability Architecture: Design and implement robust monitoring solutions using the LGTM stack to ensure system health and performance.
• Reliability Strategy: Advise clients on defining meaningful SLOs/SLIs and managing error budgets to balance innovation with stability.
• AI Assistance: Drive use of AI Agents or AI tools for intelligent automation and improving operational efficiency.
• Incident Leadership: Lead post-incident reviews (Blameless Post-Mortems) to identify systemic improvements and reduce future toil.
• Mentorship: Coach less experienced engineers within Fabric and our client teams on SRE principles and modern infrastructure patterns.
• Advising our clients on the right technical decisions and advocating for the right practices to use.
• Participate in interviewing and recruitment based on business needs.
• Thought Leadership: Contribute to the SRE community through blog posts, meetups, or internal knowledge sharing.
• Operational Support & Availability: Rotational Support Coverage: Participate in a sustainable team rotation to provide extended service coverage (including weekends) for business-critical systems.
• Incident Response: Act as a primary responder for high-priority (P1/P2) incidents during your rostered shift, focusing on rapid restoration and clear stakeholder communication.

India

Site Reliability Engineer – Google Cloud Platform

Quzara LLC

Cybersecurity & Managed Services firm providing Technical Advisory support to Federal and Commercial customers.

Full Time | Remote | Team 11-50 | Since 2015 | H1B No Sponsor

• Design, build, and operate secure GCP cloud foundations and landing zones for federal and regulated environments, including organization hierarchy, policy guardrails, Assured Workloads, and Cloud Foundation Toolkit-based deployment patterns.
• Engineer and maintain secure GCP network architectures, including Shared VPC, hub-and-spoke topology, VPC Service Controls, Access Context Manager, Private Google Access, Private Service Connect, Cloud NGFW, Cloud Armor, load balancing, DNS, NAT, VPN, and Interconnect under least-exposure principles.
• Implement and administer identity, access, privileged access, and encryption controls, including least-privilege IAM, custom roles, IAM Conditions, deny policies, service-account hygiene, Workload Identity Federation, Privileged Access Manager, Access Approval, Access Transparency, BeyondCorp Enterprise, IAP, Cloud KMS, Cloud HSM, CMEK, and Cloud EKM.
• Develop and operate security monitoring, threat detection, and response capabilities using Chronicle/Google Security Operations, Security Command Center, curated detections, YARA-L, threat intelligence, SOAR playbooks, telemetry pipelines, and integration with MDR/SOC workflows.
• Build and maintain logging, audit, observability, and reliability capabilities using Cloud Audit Logs, aggregated log sinks, retention policies, BigQuery/Chronicle exports, Cloud Monitoring, Cloud Logging, dashboards, uptime checks, SLIs/SLOs, alerting, on-call operations, incident response, and blameless postmortems.
• Secure and operate cloud workloads and platforms, including Sensitive Data Protection/Cloud DLP for CUI discovery and de-identification, hardened GKE environments, Workload Identity, Shielded/Confidential nodes, network policy, GKE Policy Controller, Binary Authorization, and secure Artifact Registry image promotion.
• Automate infrastructure, security, compliance, and reliability operations using Terraform, Infrastructure Manager, Cloud Foundation Toolkit, policy-as-code, secure CI/CD pipelines, Cloud Build, Cloud Deploy, and scripting in Python, Go, or Bash to reduce manual work and operational toil.
• Translate federal security and compliance requirements into GCP configurations and audit-ready evidence, including NIST SP 800-53, NIST SP 800-171, FedRAMP, CMMC, control inheritance, customer responsibility matrices, RMF/FedRAMP authorization support, and assessor/AO documentation.
• Partner directly with customers and internal stakeholders to communicate technical requirements, operational risks, compliance expectations, and implementation status to both technical and non-technical audiences.

United States