
What is GRPO? Group Relative Policy Optimization Explained

Explore what GRPO is, how it works, the essential components needed for its implementation, and when it is most appropriate to use.
Jul 1, 2025  · 12 min read

Group Relative Policy Optimization (GRPO) is a cutting-edge reinforcement learning (RL) technique powering the impressive performance of the latest large language models (LLMs). While it gained widespread attention after the release of DeepSeek-R1, GRPO was first introduced in DeepSeekMath, an LLM fine-tuned for advanced mathematical reasoning. GRPO was originally designed to improve efficiency in fine-tuning, and it has proven to be a cost-effective and versatile method embraced by the community.

In this article, we will take a deep dive into GRPO. We will explore what it is, how it works, the essential components needed for its implementation, and when it is most appropriate to use. The aim of this guide is to walk you through the key insights behind GRPO and what justifies its growing popularity.

If you’re eager to learn more about DeepSeek, make sure to check out our course, Working with DeepSeek in Python.

Introduction to Reinforcement Learning

As we explore in our guide to fine-tuning LLMs, supervised fine-tuning (SFT) is the traditional training technique that consists of training a model on labeled data, that is, on examples that show the expected completions or outputs for given inputs.

One of the limitations of SFT is that it heavily relies on large, labeled datasets, which can be costly and time-consuming to produce. Moreover, models trained through SFT have the risk of overfitting to the training examples, meaning they perform well on seen data but struggle to generalize to new or unexpected situations.

An alternative to SFT is reinforcement learning, in which, instead of learning from fixed examples, an agent learns by interacting with its environment and trying different actions to complete a task. After each action, the agent receives feedback in the form of rewards or penalties. The goal is to maximize the total reward over time by discovering the strategies that work best.

We can depict a simple RL workflow as follows:

Diagram of a simple RL workflow.
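To make this loop concrete, here is a tiny, self-contained toy example in Python: a two-armed bandit in which an agent learns, from rewards alone, which action pays off more often. The reward probabilities and the 10% exploration rate are made up purely for illustration; this is not an LLM training setup.

```python
import random

# Toy two-armed bandit: the agent learns from rewards alone which action pays off more.
reward_prob = {"A": 0.2, "B": 0.8}   # hidden from the agent (the "environment")
value = {"A": 0.0, "B": 0.0}         # the agent's running estimate of each action's reward
counts = {"A": 0, "B": 0}

for step in range(500):
    # Explore a random action 10% of the time; otherwise exploit the best-known one
    if random.random() < 0.1:
        action = random.choice(list(value))
    else:
        action = max(value, key=value.get)

    # The environment returns a reward (1 or 0) for the chosen action
    reward = 1.0 if random.random() < reward_prob[action] else 0.0

    # The agent updates its estimate from the feedback (running average)
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]

print(value)  # the estimate for "B" ends up higher, so the agent learns to prefer it
```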

If you wish to start hands-on with RL in Python, the tutorial Reinforcement Learning: An Introduction With Python Examples is for you!

Reinforcement learning made easy

To illustrate the concept of reinforcement learning, imagine you are teaching your cousin to ride a bike. In the beginning, she struggles with pedaling and balancing, often wobbling or falling.

Each time she rides a little farther, you cheer her on. This positive feedback encourages her to keep going. But when she tries to ride down the stairs, you quickly stop her and explain the risks. That serves as a negative signal to discourage such actions.

Similarly, reinforcement learning allows models to explore different actions. Positive rewards reinforce desirable outcomes, while negative rewards discourage unwanted behavior. Over time, the model learns to make better decisions through this feedback.

Approaches to reinforcement learning

There are different techniques for applying reinforcement learning to a model. Concretely, GRPO can be seen as an evolution of proximal policy optimization (PPO) and direct preference optimization (DPO).

Have you ever heard of PPO and DPO before?

Proximal policy optimization

PPO is a widely used RL algorithm designed to optimize a model’s behavior to maximize rewards, using a separate reward model. A great example of PPO in action is OpenAI’s reinforcement learning from human feedback (RLHF). In RLHF, human feedback is first collected on model outputs, and this data is then used to train a reward model that predicts the feedback. The reward model is a way to scale human feedback without requiring a human to rate every single output.

Finally, PPO uses the reward model during training to adjust the model’s parameters, encouraging it to generate responses that better align with human preferences. The following diagram illustrates the PPO workflow:

Diagram of the PPO workflow.
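The heart of PPO is a clipped surrogate objective that keeps each update close to the previous policy. Below is a minimal sketch of that objective in PyTorch with made-up numbers; a full RLHF pipeline also involves the reward model, a value function, and a KL penalty, all omitted here.

```python
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss for a batch of actions (illustrative sketch)."""
    ratio = torch.exp(new_logprobs - old_logprobs)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Take the pessimistic (minimum) objective and negate it to get a loss
    return -torch.min(unclipped, clipped).mean()

# Made-up log-probs under the new and old policy, and advantages derived
# from the reward model's scores
new_lp = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
old_lp = torch.tensor([-1.1, -0.7, -1.8])
adv = torch.tensor([0.5, 1.2, -0.3])

loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()  # the gradients would be used to update the policy's parameters
```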

If you would like to implement the PPO workflow in Python, consider the tutorial Proximal Policy Optimization with PyTorch and Gymnasium.

Direct preference optimization

Training a separate reward model can be complex and resource-intensive. To simplify this process, DPO was introduced, changing how human feedback is collected.

Rather than asking humans to rate responses with numerical scores, DPO relies on preference comparisons. Human annotators are shown two responses and asked to choose the one they prefer. This creates a dataset of preferred and less preferred examples.

The model is then fine-tuned directly on these preference pairs, instead of relying on the reward model. Through this approach, the model learns to increase the likelihood of generating the preferred response while reducing the likelihood of the less preferred one. 

This approach allows the model to align with human preferences without needing a separate reward model. Let’s have a look at this new workflow:

Diagram of the DPO workflow.
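Under the hood, DPO turns each preference pair into a simple loss: it pushes the policy to assign a higher (reference-adjusted) log-probability to the preferred response than to the rejected one. Here is a rough sketch with made-up log-probabilities; in practice these come from the model being trained and a frozen reference copy.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for a batch of preference pairs (sketch with summed log-probs)."""
    # How much more the policy prefers each answer than the reference model does
    chosen_margin = policy_chosen_lp - ref_chosen_lp
    rejected_margin = policy_rejected_lp - ref_rejected_lp
    # Push the chosen margin above the rejected margin
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Made-up log-probabilities of the preferred and non-preferred responses
policy_chosen = torch.tensor([-12.0, -8.5], requires_grad=True)
policy_rejected = torch.tensor([-11.0, -9.0], requires_grad=True)
ref_chosen = torch.tensor([-12.5, -8.7])
ref_rejected = torch.tensor([-10.8, -9.1])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()
```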

The tutorial OpenAI's Preference Fine-Tuning: A Guide With Examples will help you implement DPO in practice.

Challenges of PPO and DPO

DPO offers several benefits compared to PPO. Starting with the data required: PPO first needs to collect feedback data, then train the reward model (with all the technical challenges of training an auxiliary model), and only then start training the target model. DPO simplifies the process by removing the need for a separate reward model. Nevertheless, it still requires a substantial amount of preference data.

At this point, do you foresee a way of eliminating these limitations?

What if instead of relying on external feedback, we could find an automatic way to validate and rate model responses?

That is exactly what GRPO brings to the table!

What is GRPO?

Group Relative Policy Optimization is an RL technique that doesn’t require labeled data, just a means to “verify” correctness and rank responses accordingly. Verification is normally achieved with programmable reward functions, i.e., functions that take the model’s response as input and output a score for some aspect of that response.

Some GRPO setups use an LLM-as-a-judge to verify and rate responses, but the core idea can be applied without any external model in domains such as software development, since different aspects of the generated code can be verified by external tools (a sketch of such a reward function follows the list below). For example:

  • Does the code compile? Here, we just need to use the compiler.
  • Does it have a runtime error? Here we just need to run the code.
  • Does it pass unit tests? Here, we just need unit tests.
  • Is the output of the code linter clean? Here we just need a linter.

As you can see, no humans or preference data are needed!
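As a sketch, a programmable reward function for generated Python code could look roughly like the following. It covers only the first two checks from the list (parsing as a stand-in for compiling, and running the code), and the 0.5-point scoring scheme is made up for illustration; unit tests and a linter could be added as extra checks in the same spirit.

```python
import ast
import subprocess
import sys
import tempfile

def code_reward(code: str) -> float:
    """Toy reward function for generated Python code (scoring scheme is made up)."""
    score = 0.0

    # Check 1: does the code parse? (a rough stand-in for "does it compile?")
    try:
        ast.parse(code)
        score += 0.5
    except SyntaxError:
        return score  # no point running code that doesn't even parse

    # Check 2: does it run without a runtime error?
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=5)
        if result.returncode == 0:
            score += 0.5
    except subprocess.TimeoutExpired:
        pass  # hanging code earns no extra credit

    return score

print(code_reward("print('hello')"))    # 1.0: parses and runs cleanly
print(code_reward("print(undefined)"))  # 0.5: parses but crashes at runtime
```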

Reward functions

In DeepSeek’s work (GRPO was introduced in DeepSeekMath and later used to train DeepSeek-R1), the reward functions were crafted to assess the correctness and formatting of the model’s solutions. If you have ever asked an LLM for structured output, you have probably seen the model follow the desired format for most of the completion, only for a corner case to break your pipeline, right?

Concretely, the reward functions focused on correctness and formatting:

  • Accuracy rewards evaluate whether the model’s final answer is correct. For deterministic math problems, the model is required to present the final answer in a specified format (e.g., within a box), enabling automated verification against the ground truth.
  • Format rewards ensure that the model’s responses adhere to a predefined structure. Specifically, the model is encouraged to enclose its reasoning process within designated tags (e.g., <think> and </think>). This formatting facilitates the extraction and analysis of the model's thought process, promoting clarity and consistency in its outputs.
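For illustration, assuming the model is asked to reason inside <think> tags and to wrap its final answer in \boxed{} (the DeepSeek papers describe these rewards but don’t publish their exact code), simplified versions of the two checks could look like this:

```python
import re

def format_reward(completion: str) -> float:
    """Reward completions that put the reasoning inside <think>...</think> tags."""
    pattern = r"<think>.*?</think>\s*.+"
    return 1.0 if re.fullmatch(pattern, completion.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward a correct final answer given inside a \\boxed{...} wrapper."""
    match = re.search(r"\\boxed\{(.+?)\}", completion)
    return 1.0 if match and match.group(1).strip() == ground_truth else 0.0

completion = "<think>2 + 2 equals 4.</think> The answer is \\boxed{4}"
print(format_reward(completion))          # 1.0: the required structure is present
print(accuracy_reward(completion, "4"))   # 1.0: the boxed answer matches the ground truth
```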

GRPO workflow

At this point, you might be asking yourself: where does GRPO fit in the training workflow of a model?

Let’s review the process step by step!

  1. Send a prompt to the LLM and sample multiple candidate responses.
  2. Write one or more programmable functions that take the prompt and response pairs and assign a score to each.
  3. Use these scores to update the LLM weights, increasing the probability of producing responses with above-average scores and decreasing it for those with below-average scores.

By following this loop, GRPO fine-tunes the model directly based on the output of the reward functions, without the need to collect preference data.

Diagram of the GRPO workflow with score rewards computed by an external actor.
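Numerically, the heart of a GRPO update is turning the group’s raw reward scores into group-relative advantages and weighting each response’s log-probability by them. The sketch below uses made-up numbers for a group of four responses; a real implementation also applies PPO-style ratio clipping and a KL penalty against a reference model, which are omitted here.

```python
import torch

# Rewards returned by the reward functions for a group of 4 sampled responses
rewards = torch.tensor([1.0, 0.0, 0.5, 1.5])

# Group-relative advantages: compare each response to the group's own average
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # above-average responses get positive values, below-average negative

# Log-probabilities the current policy assigns to each sampled response (made up)
logprobs = torch.tensor([-20.0, -18.0, -22.0, -19.0], requires_grad=True)

# Policy-gradient-style loss: raise the likelihood of above-average responses,
# lower it for below-average ones
loss = -(advantages * logprobs).mean()
loss.backward()  # gradients would then be used to update the LLM's weights
```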

Finally, it is interesting to note that GRPO also brings the benefit of teaching the model new skills, such as longer and more structured reasoning, instead of only steering it towards existing preferences, as in PPO or DPO.

Benefits of GRPO

As we can observe from the diagram above, the major benefit of GRPO is that it does not require labeled data, just a means to “verify” correctness, achieved by the usage of programmable reward functions.

Another benefit is that it requires far fewer examples than supervised fine-tuning, making this technique a cost-effective alternative.

Additionally, the model learns actively from feedback rather than fixed labeled examples, which reduces the risk of overfitting. Training models with GRPO enables them to organically discover better strategies and improve their chain of thought.

GRPO Use-Cases

As discussed earlier, the primary use case of GRPO arises when you have no labeled data but are able to verify the correctness of the output. It is also highly effective when you have limited labeled data, though not enough to perform traditional supervised fine-tuning. This makes GRPO particularly valuable in scenarios where labeling is costly or impractical.

Some domains where GRPO has demonstrated significant advantages include:

  • Mathematical skills: For example, in the case of the DeepSeek-Math model, GRPO effectively enhanced the model’s ability to solve complex math problems without extensive labeled datasets.
  • Code generation: GRPO helps improve the accuracy and reliability of generated code by allowing the system to self-verify outputs and iteratively refine them.
  • Multi-step reasoning: GRPO has been shown to enhance models’ performance in tasks requiring sequential reasoning and the integration of multiple logical steps.

Advanced GRPO

There are some advanced tips and tricks when implementing GRPO in practice that you should be aware of. 

Advanced reward functions

Reward functions provide feedback to the model about how well it is achieving its objective. Two components are critical in this process:

  • Diversity in responses: Generating a wide range of candidate outputs increases the chances of discovering higher-quality solutions or strategies.
  • Diversity in rewards: Design reward functions that can differentiate between varying levels of success or partial achievement, rather than providing only a binary pass/fail signal. These are known as partial-credit rewards: they give more nuanced feedback by assigning credit for different aspects of the response, such as the output format being correct, the generated code compiling, or the code passing a subset of unit tests. This kind of graded reward encourages the model to improve incrementally, even when the response is not fully correct.

Additionally, setting a baseline for the whole group can play an important role in stabilizing and improving the training process. By subtracting this baseline from individual rewards, the model receives feedback relative to the overall group performance, which reduces variance in reward estimates and encourages incremental improvements over the average.
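To make the idea of partial credit concrete, here is a toy grading scheme for generated Python code. The aspect weights (0.25/0.25/0.5) and the way the unit tests are represented are hypothetical choices for illustration only.

```python
import ast

def partial_credit_reward(code: str, tests: list) -> float:
    """Graded reward: each verifiable aspect of the code contributes to the score."""
    score = 0.0
    try:
        ast.parse(code)               # aspect 1: the code parses
        score += 0.25
    except SyntaxError:
        return score

    namespace = {}
    try:
        exec(code, namespace)         # aspect 2: it runs without errors
        score += 0.25
    except Exception:
        return score

    passed = sum(1 for test in tests if test(namespace))  # aspect 3: unit tests
    score += 0.5 * passed / max(len(tests), 1)            # credit for the subset passed
    return score

# Usage with a toy snippet and two tiny "unit tests"
code = "def add(a, b):\n    return a + b"
tests = [
    lambda ns: ns["add"](1, 2) == 3,
    lambda ns: ns["add"](-1, 1) == 0,
]
print(partial_credit_reward(code, tests))  # 1.0: parses, runs, and passes both tests
```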

Temperature 

In LLMs, the temperature parameter controls the randomness of the sampling process during output generation. Setting the temperature to 0 results in deterministic (greedy) decoding, meaning the model always chooses the most likely next token.

While this ensures consistency, it often leads to generating the same output repeatedly, limiting diversity in responses.

On the other hand, increasing the temperature introduces more randomness, allowing the model to explore a wider range of possibilities. This diversity can be beneficial for discovering different or unexpected solutions. 

However, higher temperatures come with a trade-off: the quality of each individual guess tends to be lower because the model samples less probable tokens more frequently. Because of this, the overall learning process may slow down.

Choosing the right temperature is sometimes an art!
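Mechanically, temperature rescales the model’s logits before sampling. The small, self-contained sketch below uses made-up logits for four candidate tokens to show the effect:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index from logits after temperature scaling (greedy at T=0)."""
    if temperature == 0:
        return int(np.argmax(logits))          # deterministic: always the top token
    scaled = np.array(logits) / temperature
    probs = np.exp(scaled - scaled.max())      # softmax, shifted for numerical stability
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5, -1.0]                 # made-up scores for 4 candidate tokens

print([sample_with_temperature(logits, 0, rng) for _ in range(5)])    # always token 0
print([sample_with_temperature(logits, 1.5, rng) for _ in range(5)])  # more varied picks
```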

Reward hacking

Models are sneaky and sometimes exploit reward functions in unintended ways to maximize rewards without truly achieving the goal. 

As an example, let’s say the model gets rewarded for producing tests for a given code snippet. The model might produce a test function that doesn’t actually test anything, bypassing the real objective of verifying code correctness while still collecting the reward.

One should be aware of these hacks when writing the reward functions. For example, in the case of test generation, it is normally required that the model generates at least one ‘assert’ statement inside the test. Otherwise, it gets penalized.
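As a sketch of that safeguard, a reward function for generated tests could parse the code and penalize test functions that contain no assert statement; the reward and penalty values here are illustrative.

```python
import ast

def test_reward(test_code: str) -> float:
    """Reward generated tests only if they contain at least one real assert."""
    try:
        tree = ast.parse(test_code)
    except SyntaxError:
        return 0.0
    has_assert = any(isinstance(node, ast.Assert) for node in ast.walk(tree))
    return 1.0 if has_assert else -1.0   # penalize "empty" tests that check nothing

print(test_reward("def test_add():\n    pass"))                   # -1.0: reward hack
print(test_reward("def test_add():\n    assert add(1, 2) == 3"))  #  1.0: real test
```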

Conclusion

To conclude, I would like to apply GRPO to a real-world analogy to make sure all the concepts are well understood.

Imagine you and your friends participate in a fitness competition where rewards are given based on performance in running, push-ups, and rowing. Initially, the gym rewards only the person with the best absolute results. 

Under this scheme, if Alice runs farther and does more push-ups than Ben, she always wins, even if Ben has shown significant improvement.

This feels unfair to Ben, doesn’t it?

To address this, the gym tries a different approach: personal goals based on past performance

Now, you only earn rewards if you beat your own previous records. While this seems fairer, it introduces new problems. New members, like Charlie, don’t have any past data to compare against, making it hard for them to participate. Additionally, trainers must constantly track everyone’s progress individually, which becomes inefficient.

Finally, the gym offers a better solution: GRPO

Before the workout begins, the gym instructor activates a system that analyzes participants' characteristics and groups them based on similar conditions. During the workout, the system tracks each participant's performance and computes an average score within each group to serve as a baseline. Afterward, rewards are given based on how much each person exceeds their group’s average performance from that same day. 

So if Ben performs significantly better than others in his group, he earns a reward, even if Alice still has the highest overall score. This method is fair to newcomers like Charlie, too, since they’re evaluated relative to their peers in the same session, not based on prior history.

An important factor in this process is temperature, which in this analogy controls how much variety participants put into their attempts. If the temperature is too low, everyone repeats the same safe routine and little new is discovered. If it is too high, people try wildly different routines, which encourages experimentation but can lead to erratic progress.

GRPO aims to find the right balance, ensuring steady improvement while allowing for exploration.

Finally, there’s the risk of reward hacking. That would mean that participants find ways to game the system without truly improving. For example, Ben might focus only on the easiest exercises to inflate his score without real effort. 

To prevent this, the gym adds safeguards, like requiring a balanced mix of exercises or penalizing repetitive, low-effort exercises. These constraints ensure that rewards reflect genuine progress.

If you’re keen to learn more about how LLMs work and how to develop your own, check out our course, Developing Large Language Models.

GRPO FAQs

How much data is needed for GRPO?

Generally fewer than 1,000 examples. The prompts don’t need labels; what you need are reward functions that can score the model’s responses.

Does GRPO require past performance tracking?

No. GRPO only uses information from the current training step.

What is a Group in GRPO?

A group is a collection of model responses to the same prompt.

How does GRPO prevent reward hacking?

By incorporating constraints, GRPO discourages models from exploiting easy paths to inflated rewards.

How are rewards assigned in GRPO?

Each response is rewarded based on how much it outperforms the group baseline, not on the absolute reward.


Author

Andrea Valenzuela

Andrea Valenzuela is currently working on the CMS experiment at the particle accelerator (CERN) in Geneva, Switzerland. With expertise in data engineering and analysis for the past six years, her duties include data analysis and software development. She is now working towards democratizing the learning of data-related technologies through the Medium publication ForCode'Sake.

She holds a BS in Engineering Physics from the Polytechnic University of Catalonia, as well as an MS in Intelligent Interactive Systems from Pompeu Fabra University. Her research experience includes professional work with previous OpenAI algorithms for image generation, such as Normalizing Flows.
