Senior AI Research Scientist at Cerence Inc.

Bringing real world currency to the blockchain.

AI Research Scientist | 57 days ago

Full Time | Remote | Team 11-50 | Since 2014 | H1B No Sponsor

View details: AI Research Engineer, Model Compression, Quantization

• Apply low-bit quantization to reduce model size and inference latency for generative AI models (LLMs, VLMs, multimodal) while maintaining accuracy and output quality
• Leverage knowledge distillation to transfer capabilities from larger teacher models to smaller student models, enabling efficient multimodal reasoning across text, image, and audio inputs
• Implement pruning techniques to remove redundant parameters and attention heads, reducing computational overhead without sacrificing task performance
• Analyze trade-offs between model efficiency (size, latency, memory) and accuracy across quantization, distillation, and pruning methods; propose improvements based on empirical findings
• Research and apply mixed-precision quantization and other advanced compression strategies (e.g., adaptive pruning schedules, distillation with intermediate feature matching) to optimize the accuracy–performance balance
• Stay current with the latest research in model compression, including emerging techniques for multimodal and generative architectures
• Document methodologies, experiments, and results clearly to support reproducibility, internal collaboration, and stakeholder communication
• Author technical papers and publish findings in top-tier conferences (e.g., NeurIPS, ICML, ICLR, CVPR, ACL, AAAI) to advance the field of model compression for multimodal AI

PyTorch

Brazil

AI Research Engineer, Model Compression – Quantization

Bringing real world currency to the blockchain.

AI Research Scientist | 57 days ago

Full Time | Remote | Team 11-50 | Since 2014 | H1B No Sponsor


• Drive innovation in model compression and efficient deployment for advanced multimodal AI systems, including large language models (LLMs) and vision-language models (VLMs).
• Focus on reducing model footprint and computational cost while preserving accuracy.
• Apply and advance compression techniques such as quantization, knowledge distillation, and pruning.
• Develop, test, and implement novel compression strategies that balance model size, latency, throughput, and accuracy.
• Build robust compression pipelines, establish performance and fidelity metrics, and address bottlenecks in production inference.

PyTorch

Switzerland

AI Research Engineer – Kernel, Inference Optimization

Bringing real world currency to the blockchain.

AI Research Scientist | 57 days ago

Full Time | Remote | Team 11-50 | Since 2014 | H1B No Sponsor


• Drive innovation in model serving and inference architectures for advanced AI systems
• Focus on optimizing model deployment and inference strategies to deliver highly responsive, efficient, and scalable performance across real-world applications
• Work on a wide spectrum of systems, ranging from resource-efficient models designed for limited hardware environments to complex, multimodal architectures that integrate data such as text, images, and audio
• Adopt a hands-on, research-driven approach to develop, test, and implement novel serving strategies and inference algorithms
• Engineer robust inference pipelines, establish comprehensive performance metrics, and identify and resolve bottlenecks in production environments
• Enable high-throughput, low-latency, memory-efficient, and scalable AI performance that delivers tangible value in dynamic, real-world scenarios

Flash

Switzerland

AI Research Engineer – Model Compression, Quantization

Bringing real world currency to the blockchain.

AI Research Scientist | 57 days ago

Full Time | Remote | Team 11-50 | Since 2014 | H1B No Sponsor


• Drive innovation in model compression and efficient deployment for advanced multimodal AI systems, including large language models (LLMs) and vision-language models (VLMs).
• Apply and advance compression techniques such as quantization, knowledge distillation, and pruning to streamline complex multimodal architectures that integrate text, images, and audio.
• Build robust compression pipelines, establish performance and fidelity metrics, and address bottlenecks in production inference.

PyTorch

India