November 23, 2023

How Large Language Models (LLMs) Learn

"Have you ever wondered how AI assistants have gotten so good at understanding and responding to human language?"

Some innovations mark the start of new eras. The internet, which burst into prominence in the early 2000s, transformed how we access information. Smartphones in the early 2010s reshaped the way we communicate with each other. It is my firm belief that we're living through another such era, led by Large Language Models (LLMs).

ChatGPT is an iconic example: its utility in various areas, from enhancing productivity to offering empathetic responses in customer service, has led to its widespread adoption. A common compliment is that talking to it feels almost like speaking to another human, albeit one that sometimes hallucinates facts.

But have you ever wondered how LLMs got to this point? The answer lies in the way that they are trained on a wide range of topics.

The Journey from Data to Dialogue: How LLMs Learn

Imagine teaching a child to speak. You start with the basics – words, then sentences. Over time, with guidance and feedback, the child learns not just to speak, but to converse. This process is very similar to the current approach to training LLMs.

The process can be broken down into 3 stages:

Stage 1 - Data collection & pre-processing

Modern LLMs often draw on a massive knowledge base (see table below) built from hundreds of gigabytes of text data or more. If you’re familiar with the maxim “garbage in, garbage out”, you’ll know that it’s in the best interests of LLM developers to make sure that the data at this stage is of the highest possible quality.

This process starts with collection of the raw data, which could be a blend of open-source datasets and scraped text data. This is what’s known as the LLM’s corpus, the sum of all its knowledge. The raw corpus is then put through multiple filters in a laborious pre-processing approach that aims to remove as much noise as possible from the data.

The objective at this stage is to give the LLM high-quality information to work with. This is often easier said than done as there is a vast amount of information to filter through. However, the quality of the data at this stage will play a pivotal role in the LLM’s later performance.
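To make the filtering idea concrete, here is a minimal sketch of a cleaning pass. It is an illustration only: the `clean_corpus` function, the `min_words` threshold and the three filters (tag stripping, a length check and exact de-duplication) are assumptions chosen for the example, not the pipeline of any particular LLM, and real pipelines use many more filters.

```python
import re

def clean_corpus(documents, min_words=5):
    """Apply a few simple, illustrative filters to raw text documents."""
    seen = set()
    cleaned = []
    for doc in documents:
        # Strip HTML-like tags and collapse runs of whitespace.
        text = re.sub(r"<[^>]+>", " ", doc)
        text = re.sub(r"\s+", " ", text).strip()
        # Drop documents too short to carry useful information.
        if len(text.split()) < min_words:
            continue
        # Drop exact duplicates (compared case-insensitively).
        key = text.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned

# Hypothetical raw documents: a tagged page, its duplicate, noise, and a clean one.
raw = [
    "<p>Large language models learn from text.</p>",
    "Large language   models learn from text.",
    "Click here!",
    "Transformers use attention to weigh the words in a sentence.",
]
cleaned = clean_corpus(raw)
for doc in cleaned:
    print(doc)
```

Of the four raw documents, only two survive: the duplicate and the short "Click here!" snippet are treated as noise.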

Stage 2 - Pre-training

Once the corpus has been collected and cleaned, LLMs start with a phase of pre-training, a form of self-supervised learning in which they consume and analyze the data in the corpus without needing human-written labels. The objective here is to learn the patterns in the corpus well enough to reason with the information it contains.

At their core, LLMs are built on a neural network architecture known as the transformer. Transformers are very good at processing human language, loosely mimicking the way humans discern meaning from words and context. The model uses this architecture to make sense of the data in the corpus, learning factual information, grammar, sentence structure and so on.

The training objective that drives this is called next-token prediction: given the text so far, the model predicts what comes next. A "token" is the smallest unit handled by the model, which can be a word, part of a word, or even a character, depending on the model's design.
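As a rough sketch of what next-token prediction looks like as training data, the snippet below turns one sentence into (context, target) pairs. The `tokenize` and `next_token_pairs` helpers are hypothetical, and the whitespace tokenizer is a deliberate simplification; real models typically use subword tokenizers.

```python
def tokenize(text):
    """Toy tokenizer: lowercase and split on whitespace (an assumption for illustration)."""
    return text.lower().split()

def next_token_pairs(tokens):
    """Build (context, target) pairs: each token is predicted from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = tokenize("The cat sat on the mat")
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text yields one training example, which is why a large corpus produces such an enormous number of predictions to learn from.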

To illustrate this, imagine you're reading a long essay for English class. As you read through the paragraphs, certain sentences are really important for understanding the main points. Other sentences provide background details or supporting evidence, so you don't need to focus on them as much.

The "transformer" architecture works similarly in AI systems for understanding language. As it processes sentences word by word, it learns to pay more "attention" to the important words that matter most for figuring out the overall meaning.

It's a bit like highlighting all the key topic sentences in a reading passage while skimming over descriptions and examples. This lets you focus on the critical ideas. The transformer does something akin to this highlighting, but automatically.

Figure: An example of how the self-attention mechanism helps the model discern language

Specifically, the transformer uses something called a self-attention mechanism. This means that as it reads through a text, the model refers back to previous parts of what it has seen to better determine which words require more focus. It asks itself questions like:

"Have I seen this word multiple times before that tells me it's significant?"

"Is this word essential for the tense of the sentence or the overall meaning?"

"Does this word suggest a connection to a previous concept I need to track?"

By self-attending to the text in this way, the model figures out how to weigh and select the words most useful for understanding language structure, accuracy, continuity and other patterns. It's able to remember or ignore certain words appropriately as needed.
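The weighting described above can be sketched in a few lines. This is a simplified version of scaled dot-product self-attention under stated assumptions: the queries, keys and values are the input vectors themselves (a real transformer learns separate projections for each), the two-dimensional `embeddings` are made up, and each position may only look at itself and earlier positions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(vectors):
    """Each position attends to itself and to earlier positions only."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # the text seen so far
        # Similarity between the current token and each visible token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        weights = softmax(scores)
        # Output is a weighted average of the visible vectors.
        output = [
            sum(w * v[d] for w, v in zip(weights, visible)) for d in range(dim)
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up 2-dimensional embeddings for a three-token sequence.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(embeddings)
for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 3) for w in row]}")
```

Each row of `weights` is the "highlighting" for one token: larger values mark the earlier tokens that matter most for interpreting it.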

Put together, the model refers back to important context, focuses on the relevant words (the attention mechanism) and predicts the next word probabilistically. In combination, these steps loosely mimic how humans learn to understand and generate language.

“I challenge the claim that next-token prediction cannot surpass human performance” - Ilya Sutskever, Chief Scientist at OpenAI

Now, some of you might be thinking, surely human reasoning can’t be surpassed by simple statistics. But, as Ilya Sutskever so succinctly puts it, if the pre-trained model is strong enough, you can simply ask it to extrapolate what a person with great insight, wisdom & capability would do and it would do that task very well.

Pre-training is computationally intensive due to the sheer amount of data involved. Next-token prediction involves calculating a probability for every possible next token. Given the size of the corpus, this adds up to an enormous number of mathematical computations, especially since modern LLMs choose from vocabularies of tens of thousands of tokens.
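To give a feel for the computation, the sketch below converts a model's raw scores into a probability for each candidate token and measures how wrong the prediction was using cross-entropy, a loss commonly used for this objective. The four-token `vocab` and the `logits` values are invented for the example; a real model would produce one score for each of the tens of thousands of tokens in its vocabulary.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over the vocabulary."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability the model assigned to the correct next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

vocab = ["mat", "dog", "moon", "sofa"]  # hypothetical tiny vocabulary
logits = [2.0, 0.5, -1.0, 1.0]          # made-up scores after "the cat sat on the"
probs = softmax(logits)

loss_if_mat = cross_entropy(logits, vocab.index("mat"))
loss_if_moon = cross_entropy(logits, vocab.index("moon"))
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
print(f"loss if the true next token is 'mat':  {loss_if_mat:.3f}")
print(f"loss if the true next token is 'moon': {loss_if_moon:.3f}")
```

The loss is small when the model gave the true token a high probability and large when it did not; training nudges the weights to reduce this loss across the whole corpus.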

Stage 3 - Supervised Fine-Tuning

In this phase, the LLM undergoes a process called fine-tuning, in which it is trained on smaller, more specific datasets so that it can adapt and refine its responses for particular scenarios. The goal is to specialize the model's abilities for specific tasks (e.g. sentiment analysis, classification, question answering).

Why do this? Pre-training gives you a model with generally good performance across a wide range of activities. To get the LLM to use its information in a more specific manner, people generally attempt extensive prompt engineering, e.g. “Act as a fitness coach with 20 years of experience.” However, there’s only so much that prompt engineering can do to change the model’s behavior. For example, it would take more prompt engineering than it’s worth to get the LLM to reliably extract all the information from an introduction into a JSON object.

A more long-term and elegant solution would be fine-tuning, where:

The model is improved for a specific task. Fine-tuning steers the LLM in a specific direction based on the examples it is trained on.
The model is more efficient for that task. Due to its specificity, the model needs fewer tokens to arrive at your desired behavior, i.e. less prompting effort, lower latency and faster responses.

During fine-tuning, the weights of the pre-trained model are adjusted to minimize the prediction error on the task-specific dataset. This step is far less computationally intensive than pre-training, as the dataset is much smaller and the weight adjustments are smaller and more targeted. It also involves fewer training iterations than pre-training, though great care should be taken to avoid overfitting, where the model performs well on training data but poorly on unseen data.
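The mechanics can be sketched with a deliberately tiny stand-in: a two-parameter linear model instead of an LLM. Everything here is an assumption made for illustration, including the `fine_tune` function, the learning rate, the `patience` setting and the data. It shows the two ideas from the paragraph above: starting from pre-trained weights and adjusting them by gradient descent, and watching a held-out validation set to stop before overfitting (a common technique known as early stopping).

```python
def mse(w, b, data):
    """Mean squared prediction error of the model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def fine_tune(w, b, train, val, lr=0.05, max_epochs=200, patience=5):
    """Adjust pre-trained weights on `train`; keep the weights that do best on `val`."""
    best_loss, best_w, best_b = mse(w, b, val), w, b
    bad_epochs = 0
    for _ in range(max_epochs):
        # Gradient of the training error with respect to each weight.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / len(train)
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / len(train)
        w -= lr * grad_w
        b -= lr * grad_b
        val_loss = mse(w, b, val)
        if val_loss < best_loss:
            best_loss, best_w, best_b = val_loss, w, b
            bad_epochs = 0
        else:
            # Validation error stopped improving: a warning sign of overfitting.
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_w, best_b

pretrained_w, pretrained_b = 1.0, 0.0  # stand-in for pre-trained weights
train = [(0.0, 0.5), (1.0, 2.4), (2.0, 4.6), (3.0, 6.5)]  # small task-specific dataset
val = [(0.5, 1.5), (2.5, 5.5)]                            # held-out examples

w, b = fine_tune(pretrained_w, pretrained_b, train, val)
print(f"validation error before: {mse(pretrained_w, pretrained_b, val):.3f}")
print(f"validation error after:  {mse(w, b, val):.3f}")
```

The same loop structure applies at LLM scale, only with billions of weights and a language-modeling loss in place of the squared error used here.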

This fine-tuning process can take various forms based on the specific requirements of the task, the nature & availability of the data and the desired level of model customization. A non-exhaustive list of examples follows:

Adversarial Fine-Tuning:

Concept: This involves fine-tuning the model using data that includes challenging or adversarial examples, designed to improve the model's robustness and ability to handle tricky inputs.
Applications: Useful for improving model performance in scenarios where it might face deceptive or confusing inputs.

Domain-Specific Fine-Tuning:

Concept: This involves fine-tuning the model on a dataset from a specific domain (like legal, medical, or technical texts) to adapt its language understanding to that domain.
Applications: Useful for specialized applications where expertise in a particular field is required.

Reinforcement Learning from Human Feedback (RLHF):

Concept: This technique involves a combination of supervised learning and reinforcement learning, where the model is initially trained on human-labeled data and then refined using reinforcement learning techniques. It usually starts with a process called reward modeling, where human raters provide feedback on model outputs, which is used to train a reward model. The LLM is then fine-tuned to maximize its expected reward.
Applications: Useful for tasks that require alignment with complex, subjective human judgments, such as ethical decision-making or creative writing.
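The reward modeling step can be illustrated with a small sketch. It assumes human raters have compared pairs of responses and picked the one they prefer, and it trains a toy linear reward model with a pairwise (Bradley-Terry style) loss, one common formulation. The two features per response (say, a helpfulness score and a rudeness score), the `train_reward_model` function and all the numbers are hypothetical; real reward models are themselves neural networks that read the response text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(weights, features):
    """Toy reward model: a weighted sum of hand-picked response features."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Learn weights so that human-preferred responses receive higher rewards."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the pairwise loss -log(sigmoid(margin)) w.r.t. the margin.
            scale = sigmoid(margin) - 1.0
            for d in range(dim):
                weights[d] -= lr * scale * (chosen[d] - rejected[d])
    return weights

# Hypothetical rater comparisons: (features of preferred, features of rejected),
# with features = [helpfulness, rudeness].
pairs = [
    ([0.9, 0.1], [0.4, 0.2]),
    ([0.7, 0.0], [0.8, 0.9]),
    ([0.6, 0.2], [0.2, 0.3]),
]
weights = train_reward_model(pairs, dim=2)
for chosen, rejected in pairs:
    print(f"preferred: {reward(weights, chosen):.2f}  rejected: {reward(weights, rejected):.2f}")
```

Once trained, a reward model like this scores new outputs, and the reinforcement learning step adjusts the LLM to produce responses that score highly.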

Conclusion

As we've explored, the journey of an LLM from a vast corpus of raw data to a finely tuned, conversationally adept AI involves several stages – from pre-processing and pre-training to supervised fine-tuning. Each stage is integral in shaping the LLM's capabilities, ensuring it not only understands and processes language but also adapts to specific tasks with accuracy.

As we move forward, we expect LLMs to continue evolving, becoming more sophisticated and integrated into various aspects of our daily lives.
