December 14, 2023

Reinforcement Learning from Human Feedback (RLHF): Revolutionizing Large Language Models

Rapid learning is characterized by repeated attempts without fear of failure. Just as a child who touches a hot stove learns to avoid the pain, or a student who works through practice problems masters a subject, trial and error and feedback are integral to how humans improve their skills. It comes as no surprise that the best large language models (LLMs) are built using similar patterns.

LLMs are artificial intelligence systems trained on vast datasets to generate coherent, human-like text. Fine-tuning is a technique often used during the training process to get the model to behave in a certain way, e.g. to be conversationally adept, like a chatbot.

Basic fine-tuning adapts models like ChatGPT for specific tasks. However, it is an advanced approach called reinforcement learning from human feedback (RLHF) that has enabled the impressive performance gains allowing models like ChatGPT to outperform other models on select benchmarks.

RLHF incorporates both reinforcement learning principles as well as human input to steer model advancement. To understand this, we must first delve into the mechanics of “normal” reinforcement learning.

What is Reinforcement Learning?

Reinforcement learning refers to a machine learning technique in which the learning model (the agent) learns how to behave in an environment by taking actions and receiving positive or negative feedback, called rewards and penalties. The goal of the agent is to maximize rewards over time through its actions.

Source: Deepsense AI

In reinforcement learning, the agent starts out with no prior knowledge about the environment. It then repeatedly observes the current state of the environment and chooses an action out of all possible actions available to it. Upon taking an action, the agent receives an immediate numeric reward or penalty and finds itself in a new state.

The agent's goal is to learn which actions lead to higher long-term rewards across changing environmental states. It aims to develop an optimal policy for mapping states to actions that maximize cumulative future reward. The agent essentially uses trial-and-error experiences interacting with the environment to discern action patterns that yield the greatest rewards.

For example, consider an AI learning to play chess through repeated matches. The AI tries different moves and learns over time that certain actions like protecting its king lead to better outcomes and higher scores. This demonstrates the reinforcement learning concept of taking actions in an environment and learning which choices yield the greatest rewards. There’s even someone who used RL to train an agent to play Pokemon.

The training process continues until the agent has compiled experiences allowing it to reliably select highly rewarding actions. 
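
To make this loop concrete, here’s a minimal sketch of tabular Q-learning, one of the simplest reinforcement learning algorithms, applied to a made-up “corridor” where the agent earns a reward only by walking right to the goal. Everything here (the environment, the reward values, the hyperparameters) is an illustrative assumption rather than anything from a real system.

```python
# A minimal sketch of tabular Q-learning on a made-up "corridor" environment.
# The environment, reward values, and hyperparameters are illustrative
# assumptions, not taken from any particular system.
import random

N_STATES = 5          # positions 0..4; position 4 is the goal
ACTIONS = [-1, +1]    # move left or move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate

# Q-table: the agent's estimate of long-term reward for each (state, action) pair
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True     # reached the goal
    return next_state, -0.01, False      # small penalty for every extra step

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge the estimate toward reward + discounted future value
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

# The learned greedy policy should move right (+1) from every non-goal state
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})
```

The agent starts with no knowledge (all Q-values at zero) and, purely through trial and error plus the reward signal, converges on the policy that maximizes cumulative reward.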

How does Human Feedback Enhance the Process?

In standard reinforcement learning, the agent learns solely based on algorithmic rewards and penalties received for actions taken in the environment. So progress relies completely on the quality of programmatic feedback. However, in many real-world settings these environmental signals fail to capture the full nuance of desired behavior.

Source: Deepchecks AI

RLHF bridges this gap. It supplements the automated feedback the agent pulls directly from environmental interactions with additional human input meant to correct shortcomings in the reward function.

For example, say we wanted to train a virtual assistant to have pleasant conversations. The environment itself provides no inherent feedback signals for user satisfaction. But with RLHF we enable human raters to review conversations and supply extra ratings to directly shape assistant responses toward more positive outcomes.

In this way, RLHF creates a symbiotic relationship. The environment grounds the agent in basic task completion, while human oversight steers overall behavior toward real-world standards like quality, safety, ethics, and user preferences. This combined guidance allows for efficient learning of more complex and nuanced skills.
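
As a toy illustration of that symbiosis, the sketch below blends a programmatic task-completion signal with a human rating into a single reward. The 1-5 rating scale and the 50/50 weighting are assumptions made purely for illustration; real RLHF pipelines learn a reward model from human preferences (covered in the next section) rather than hand-coding a blend like this.

```python
# A toy illustration (not any production system) of blending a programmatic
# task signal with a human rating into a single reward. The 1-5 rating scale
# and the 50/50 weighting are assumptions made purely for illustration.

def combined_reward(task_completed: bool, human_rating: int,
                    task_weight: float = 0.5, human_weight: float = 0.5) -> float:
    """Blend a binary task-completion signal with a 1-5 human rating."""
    env_signal = 1.0 if task_completed else 0.0      # did the assistant finish the task?
    human_signal = (human_rating - 1) / 4.0          # normalize the 1-5 rating to 0-1
    return task_weight * env_signal + human_weight * human_signal

# A response that completes the task but annoys the user scores lower than one
# that completes the task *and* is rated highly by a human reviewer.
print(combined_reward(task_completed=True, human_rating=2))   # 0.625
print(combined_reward(task_completed=True, human_rating=5))   # 1.0
```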

Implementing RLHF in Large Language Models

Adding human feedback to the RL loop can be very effective when done correctly. ChatGPT’s ability to maintain context, understand implied meanings, and even exhibit a sense of humor can be largely attributed to RLHF. Here’s an overview of the process:

Creating a Reward Model

Overview of Training a Reward Model

1. Choosing a Pretrained Language Model

The process starts with a language model that has already been pre-trained on vast amounts of data. Companies like OpenAI, Anthropic, and DeepMind have used pre-trained models with millions to hundreds of billions of parameters for this stage.

2. Generating a Reward Model (with human feedback)

After pretraining, the next step is to develop what's known as a reward model. This model is separate from the pretrained model and acts a bit like a teacher's grading system, but instead of marking essays, it evaluates AI responses based on human feedback.

This model is key to integrating human preferences into the system and is trained via the contribution of human evaluators. They chat with the pretrained AI and rate its responses. These ratings are then used to train the reward model, helping it understand what makes the base model’s responses “good” based on predetermined criteria.
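
For a rough idea of how this training works, here’s a sketch of a pairwise preference loss in the style of the ranking loss described in OpenAI’s InstructGPT paper: given two responses to the same prompt, the reward model is pushed to score the human-preferred one higher. The tiny random feature vectors and the linear “reward model” below are stand-ins for real response encodings and a transformer head, purely for illustration.

```python
# A sketch of training a reward model from pairwise human preferences, in the
# style of the ranking loss described in OpenAI's InstructGPT paper. The tiny
# random feature vectors and the linear "reward model" below are stand-ins for
# real response encodings and a transformer head, purely for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Pretend each response has already been encoded as a 16-dimensional vector.
# Each training pair is (encoding of the preferred response, encoding of the rejected one).
chosen = torch.randn(32, 16)
rejected = torch.randn(32, 16)

reward_model = nn.Linear(16, 1)   # maps a response encoding to a single scalar score
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-2)

for step in range(100):
    r_chosen = reward_model(chosen)       # score of the response the human preferred
    r_rejected = reward_model(rejected)   # score of the response the human liked less
    # Pairwise ranking loss: push the preferred response's score above the rejected one's
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final loss: {loss.item():.3f}")
```

In a real pipeline, the encoder would be the language model itself and the comparison pairs would come from human labelers ranking actual responses.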

3. Fine-tuning with RL

Now, it's time to tune the model via reinforcement learning. This is where the AI learns to improve its responses based on the reward model's feedback. The direction of the AI’s learning is determined by something called a policy, which is a set of rules & constraints for the model to follow while being trained. This is essentially how the base model learns using the reward model’s feedback.

For example, ChatGPT uses techniques like Proximal Policy Optimization (PPO) to facilitate this. PPO is an algorithm that limits how much the model’s behavior can change in any single update, e.g. so it doesn’t forget the basics while it’s being fine-tuned for new skills.
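
The heart of PPO is a clipped objective that stops the updated policy from earning extra credit once it drifts too far from its previous version. Here’s a minimal sketch of that objective on dummy tensors; it is not ChatGPT’s actual training code, and the clip range is an assumed value.

```python
# A minimal sketch of PPO's clipped objective on dummy tensors. This is not
# ChatGPT's actual training code; it just illustrates the "no big or abrupt
# changes" idea: updates get no extra credit once the new policy drifts too
# far from the old one. EPS_CLIP is an assumed value.
import torch

EPS_CLIP = 0.2   # how far the new policy's probabilities may move from the old ones

def ppo_loss(new_logprobs, old_logprobs, advantages):
    """Clipped surrogate objective from the PPO paper (to be minimized)."""
    ratio = torch.exp(new_logprobs - old_logprobs)                   # pi_new / pi_old per action
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - EPS_CLIP, 1 + EPS_CLIP) * advantages
    # Take the more pessimistic of the two, so large policy jumps give no extra reward
    return -torch.min(unclipped, clipped).mean()

# Dummy example: log-probabilities under the old and updated policy, plus advantages
# (how much better than expected each action turned out to be)
old_lp = torch.tensor([-1.2, -0.8, -2.0])
new_lp = torch.tensor([-1.0, -0.9, -1.5])
adv    = torch.tensor([ 0.5, -0.2,  1.0])
print(ppo_loss(new_lp, old_lp, adv))
```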

4. Balancing & Optimizing Responses

The final step is all about balance. The reward function, which guides the AI's learning, is a mix of what the reward model suggests and certain constraints dictated by the policy. These constraints are like safety rails, ensuring that the AI’s responses make sense and stay on topic.

Complex mathematical metrics, like the Kullback–Leibler (KL) divergence, are used here. KL divergence measures how much the fine-tuned model’s output distribution has drifted from that of the original model. It helps the person fine-tuning the model ensure that the LLM’s existing knowledge is not disrupted too much by the new behavior it is learning.
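
A common way this shows up in RLHF write-ups is a KL penalty folded into the reward: the reward model’s score is reduced in proportion to how far the fine-tuned model’s token distribution drifts from the original (reference) model. The sketch below assumes toy logits and an illustrative penalty weight BETA.

```python
# A sketch of the KL-penalized reward often described for RLHF fine-tuning:
# the reward model's score is reduced in proportion to how far the fine-tuned
# model's token distribution drifts from the original (reference) model.
# BETA and the toy logits below are illustrative assumptions.
import torch
import torch.nn.functional as F

BETA = 0.1   # strength of the KL penalty (assumed value)

def kl_penalized_reward(reward_score, policy_logits, reference_logits):
    """reward_model_score - BETA * KL(policy || reference), summed over tokens."""
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    reference_logprobs = F.log_softmax(reference_logits, dim=-1)
    # Per-token KL divergence between the two distributions, summed over the sequence
    kl_per_token = (policy_logprobs.exp() * (policy_logprobs - reference_logprobs)).sum(dim=-1)
    return reward_score - BETA * kl_per_token.sum()

# Toy example: 4 generated tokens over a 10-word vocabulary
reference_logits = torch.randn(4, 10)                          # the original, frozen model
policy_logits = reference_logits + 0.1 * torch.randn(4, 10)    # fine-tuned model, slightly drifted
print(kl_penalized_reward(reward_score=2.0,
                          policy_logits=policy_logits,
                          reference_logits=reference_logits))
```

The bigger the drift, the bigger the penalty, which keeps the fine-tuned model anchored to what it already knew.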

The success of OpenAI’s ChatGPT was partly attributed to the usage of RLHF. They demonstrated in their paper that human evaluators preferred the outputs of a 1.3 billion parameter model fine-tuned with RLHF over those of a foundational model with 175 billion parameters. By directly involving human feedback in the fine-tuning process, OpenAI was able to build some of the top-ranked foundational models available today.

However, it is important to note that the specifics of how involved RLHF was in the fine-tuning process are proprietary. GPT-4 and most proprietary models tend not to share specifics of how the fine-tuning was done. Open-source models like Llama have a bit more transparency, but the exact details of the fine-tuning process are often not documented.

Moreover, RLHF has its own share of issues. While the efficacy of the process has been touted as a key driver of OpenAI’s success, the method itself can be tricky to implement, as I’ll cover in the next section.

The Challenges & Limitations of RLHF

RLHF is a powerful approach to ensuring alignment in LLMs. It can be used to minimise a lot of the issues we face in fine-tuning a model, such as hallucination. However, it’s a complex technique that can be difficult to get right. There’s a lot of documentation on the challenges faced by AI developers when attempting this approach to fine-tuning. Some of them are:

  • Quality Feedback Collection: RLHF integrates human input into the training of large language models. Often, this involves hiring a diverse group of human reviewers to rate data, which is needed to avoid biased outcomes and can be super expensive. Despite this cost, there's no guarantee of achieving the desired objectivity and quality, as poor curation of the training data or limited diversity among human labelers may propagate systemic issues of unfairness or discrimination within the model.
  • Accurate Reward Modeling: Constructing a reward model that accurately reflects human preferences is difficult due to the need to represent many viewpoints. There’s also a risk of the reward model exhibiting incentivized deception, which is when the model assigns higher scores to responses that sound good but are not necessarily rooted in fact.
  • Effective Policy Training: Policies are how the AI adapts its performance based on the reward model’s feedback. Preventing the policy from “gaming the system” can get tricky.
    Consider a game where points are lost for wrong moves. An AI might learn to never play, avoiding loss entirely — a clever but unhelpful solution. This illustrates the fine line in training AI policies: guiding them to solve problems without taking shortcuts.

Closing Thoughts: Alternatives to RLHF when Building with LLMs

While LLMs are still relatively new, companies like OpenAI and Anthropic have brought foundational models to a level of sophistication that has enabled businesses across many different kinds of use cases. Whether you’re a new business venturing into AI, a more established one seeking to build AI functionality, or a dev who is curious about how LLMs work, you might have wondered: where do I start?

As a company that has helped businesses build better AI and is now building AI apps ourselves, what worked for us was starting small and understanding our use case. RLHF can be a powerful tool to shape the desired behavior of your AI app, but there are other approaches that can be attempted first (based on your use case).

If you’re just experimenting, prompt engineering can be a quick and easy way to start using LLMs. It’s useful for validating your use case and identifying where you want to dig deeper.

If your use case is more specific or you’re experiencing hallucination, you can attempt approaches like retrieval-augmented generation (RAG), which grounds the model’s answers in your own data and can help for use cases that need a lot of precision.

Last but not least, there’s also normal fine-tuning, which is a lot easier to set up compared to RLHF. RLHF is essentially a more advanced way of performing fine-tuning and only works well if set up correctly. Fine-tuning requires a dataset that matches the behavior you’re training the model for, and a lot of patience when tuning and iterating.

Example of how OpenAI recommends you use the above approaches. Source: OpenAI

Depending on your use case, you might need a blend of the above techniques based on whether your model needs more context or if you need it to act a certain way.

In summary, RLHF stands as a pivotal innovation in the field of AI, transforming LLMs into more accurate, responsive, and human-aligned systems. It's not just a training methodology; it's a bridge connecting human intuition and machine intelligence. It is resource-intensive and can be challenging to implement effectively, but the results—more sophisticated, reliable, and ethically aligned AI systems—speak for themselves. As we continue to explore and refine this methodology, the potential for further groundbreaking advancements in AI is immense.

Whether you're a burgeoning AI startup, an established tech giant, or a curious developer, understanding the intricacies of RLHF and considering its application in your projects is crucial. While RLHF presents a cutting-edge approach, remember that a thorough understanding of your specific use case and exploring various techniques can lead to a better-performing model. The future of AI, enriched by RLHF, holds exciting possibilities for innovation and advancement in technology.

RLHF is a complex but effective technique to tune LLMs. Building new products using LLMs & don’t know where to start? We’ve been in the AI industry for the past 5 years helping clients build better AI. Check out our use cases or book a call to find out more!