Senior ML Engineer - Kimchi (LLM Inference Optimization)
Location
Worldwide
Posted
2 days ago
Seniority
Senior
Job Description
CAST AI
Role Description
Throughput. Latency. KV cache utilization. Move those three numbers in the right direction, and two things happen: customers get faster, cheaper inference, and our margins improve. That's the entire thesis of this role. Every kernel you tune, every quantization scheme you ship, every scheduler tweak you land shows up directly in a customer's p99 and on our P&L.

This is a high-impact seat. It is also a high-autonomy seat as you'll be given the room to lead the technical direction of inference optimization at Kimchi, not execute someone else's roadmap.

The problem: running LLMs in production is a moving target. The "right" model and serving configuration for a workload depend on:
- Traffic shape
- Sequence-length distribution
- Batch dynamics
- GPU SKU
- Memory bandwidth
- Quantization tolerance
- A dozen other variables that shift week to week

Most teams pick a model once, over-provision GPUs, and absorb the cost. Kimchi is the system that makes that decision automatically - continuously matching workloads to the most cost-efficient, best-performing LLM and serving configuration on a customer's infrastructure. We're building the optimization layer between the model and the hardware, and we need engineers who understand both sides deeply.

Qualifications
- 5+ years building real ML systems, with a portfolio that shows depth in inference or training infrastructure (not just model training notebooks).
- Strong Python - production services, not scripts.
- Hands-on experience with at least one of vLLM, SGLang, or TensorRT-LLM, and a working mental model of why an inference engine performs the way it does on a given GPU.
- Fluency with quantization tradeoffs - you've measured quality regressions, not just compression ratios.
- Comfort with distributed systems: collective communication, sharding strategies, and the practical failure modes of multi-GPU and multi-node setups.
- A bias toward measurement. You instrument before you optimize, and you can tell the difference between a real win and a benchmark artifact.
- Self-direction. This role comes with a wide mandate; you should be excited by that, not unsettled by it.

Requirements
- Push throughput. Continuous batching, speculative decoding, chunked prefill, kernel-level tuning across vLLM, SGLang, and TensorRT-LLM. Find the ceiling on each GPU SKU, then raise it.
- Cut latency. Attack TTFT and TPOT separately. Profile, identify the actual bottleneck (compute, memory bandwidth, scheduling, networking), and fix it - not the bottleneck you assumed.
- Get more out of the KV cache. Paged attention, prefix caching, eviction policies, cache reuse across requests, quantized KV. This is where a lot of the unrealized throughput lives, and it's an area you'll own.
- Quantize without regressing quality. INT8, INT4, FP8 across weights, activations, and KV. Empirical work: measure quality on real workloads, not just perplexity benchmarks.
- Shrink cold starts and memory footprint. Faster init, smarter weight loading, tighter memory accounting - the difference between a model that scales and one that doesn't.
- Scale across nodes. Distributed inference topologies, network-aware placement, checkpointing strategies that don't bottleneck on storage or interconnect.
- Set the technical direction. Decide what we benchmark, what we adopt, and what we build ourselves. Bring the team along with strong writeups and reproducible experiments.

Benefits
- Competitive salary (depending on the level of experience).
- Enjoy a flexible, remote-first global environment.
- Collaborate with a global team of cloud experts and innovators, passionate about pushing the boundaries of Kubernetes technology.
- Equity options.
- Get quick feedback with a fast-paced workflow. Most feature projects are completed in 1 to 4 weeks.
- Spend 10% of your work time on personal projects or self-improvement.
- Learning budget for professional and personal development - including access to international conferences and courses that elevate your skills.
- Annual hackathon to spark new ideas and strengthen team bonds.
- Team-building budget and company events to connect with your colleagues.
- Equipment budget to ensure you have everything you need.
- Extra days off to help maintain a healthy work-life balance.

Hiring process
- Screening call with Recruiter
- Hiring Manager interview
- Technical interview (system design)
- Live coding
- Culture Check interview with an executive

As part of our standard hiring process, we would like to inform you that a background check may be conducted at the final stage of recruitment through our third-party provider, Checkr.

Please note that Cast AI does not provide any form of visa sponsorship/work permit.


