Distillation, in the context of training large language models (LLMs), is a technique for training a smaller model to reproduce the behavior of a larger, more complex one. This process, called knowledge distillation, transfers knowledge from a powerful teacher model (typically a massive LLM) to a smaller, faster student model. The goal is to retain as much of the teacher’s performance as possible while significantly reducing computational cost and memory requirements.
The key idea behind distillation is that instead of training the smaller model from scratch on raw data alone, the student learns from the soft labels, the probability distributions the teacher model produces over possible outputs. These distributions carry richer information than hard labels, capturing not only the correct answer but also the relationships between plausible alternatives. For example, if a teacher model predicts that the word “intelligent” has a 90% probability of being the right word in a sentence but also assigns 8% to “clever” and 2% to “smart,” the student learns from this nuanced ranking rather than from a single correct label.
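As a concrete sketch of how this works in practice, the snippet below shows a standard distillation loss in PyTorch. This is a minimal illustration rather than code from any particular library: the `temperature` and `alpha` hyperparameters, the three-word vocabulary, and the specific logit values are all made up for the example. The teacher’s logits are softened with a temperature so the student sees the full ranking of alternatives, and that soft-label loss is blended with the usual cross-entropy on the correct token.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-label (teacher) loss with a hard-label loss.

    `temperature` flattens the teacher's distribution so the relative
    ranking of alternatives (e.g. "clever" vs. "smart") stays visible;
    `alpha` weights the two terms. Both are illustrative defaults.
    """
    # Soften teacher and student distributions with the same temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence pulls the student's distribution toward the teacher's;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth label.
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy usage: one prediction position with a three-word vocabulary,
# mirroring the "intelligent" / "clever" / "smart" example above.
teacher_logits = torch.tensor([[4.0, 1.6, 0.2]])          # teacher's ranking
student_logits = torch.randn(1, 3, requires_grad=True)    # untrained student
labels = torch.tensor([0])                                # "intelligent" is correct
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # gradients flow into the student only
```

In real LLM distillation the logits would span the full vocabulary at every sequence position, but the loss has the same shape: the student is rewarded for matching the teacher’s entire distribution, not just for picking the single right token.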
Distillation is widely used to create lightweight AI models that can run efficiently on edge devices like smartphones or in cloud environments with limited resources. It allows organizations to deploy AI-powered applications while keeping costs and energy consumption low. Despite their smaller size, well-distilled models often retain most of the teacher’s accuracy while running much faster.
This technique is especially valuable in AI applications that require real-time responses, such as chatbots, voice assistants, and recommendation systems. By using distillation, companies can make advanced AI more accessible, scalable, and sustainable without compromising too much on intelligence or capabilities.