GRPO experiment - I trained a Language Model to schedule events

anakin87 · 5/10/2025, 9:26:34 AM · github.com

Comments (1)

anakin87 · 3h ago
I have been experimenting with GRPO (Group Relative Policy Optimization) lately, since I am fascinated by models learning from prompts and rewards alone - no example answers needed, unlike in Supervised Fine-Tuning.

After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game, but I wanted a different challenge.

So I opted for teaching a model to create a schedule from a list of events and priorities.

Choosing an original problem forced me to think about the problem setting, generate data, choose the base model, design reward functions, and run multiple rounds of training, hoping that my model would learn something.

A fun and rewarding experience :-)

I learned a lot of things that I want to share with you.
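To make the setup concrete, here is a minimal sketch in the style of TRL's GRPOTrainer (which Unsloth builds on). The model id, the toy dataset, and the reward function are illustrative assumptions, not the exact configuration from the repo below - the point is simply that the dataset contains only prompts, and the reward functions provide the whole training signal.

```python
# Minimal GRPO sketch with TRL's GRPOTrainer. Everything concrete here
# (model id, dataset contents, reward) is an assumption for illustration.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Hypothetical prompt-only dataset: task descriptions, no gold answers.
train_dataset = Dataset.from_list([
    {"prompt": "Create the best schedule for these events: standup (9-10, priority 2), "
               "review (9-11, priority 5), lunch (12-13, priority 1). "
               "Wrap your answer in <schedule> tags."}
])

def format_reward(completions, **kwargs):
    # Reward completions that wrap the answer in <schedule> tags.
    return [
        1.0 if "<schedule>" in c and "</schedule>" in c else 0.0
        for c in completions
    ]

training_args = GRPOConfig(
    output_dir="qwen-scheduler-grpo",
    num_generations=8,  # completions sampled per prompt to form the GRPO group
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # hypothetical base model choice
    reward_funcs=[format_reward],      # one or more reward functions, scored per completion
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```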

---

- Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo

- Code: https://github.com/anakin87/qwen-scheduler-grpo

- Hugging Face collection (dataset and model): https://huggingface.co/collections/anakin87/qwen-scheduler-g...

---

Some hot takes from my experiment

- GRPO is cool for verifiable tasks, but it is more about eliciting desired behaviors from the trained model than teaching it completely new skills. https://arxiv.org/abs/2504.13837

- Choosing the right base model (and size) matters.

- "Aha moment" might be over-hyped. https://oatllm.notion.site/oat-zero

- Reward function design is crucial. If your rewards are not robust, you might experience reward hacking (as happened to me) - see the sketch after this list.

- Unsloth is great for saving GPU memory, but beware of bugs.
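To illustrate the reward-hacking point, here is a sketch of the kind of check that makes a scheduling reward harder to game: only score a proposed schedule if every event comes from the input and no two chosen events overlap. The `events` column, the toy parser, and the exact scoring are hypothetical, not the repo's actual reward functions; in TRL, extra dataset columns such as this hypothetical `events` column are forwarded to reward functions as keyword arguments, which is what makes validity checks against the original input possible.

```python
import re

def parse_schedule(text):
    # Toy parser (assumption): reads lines like "9-10 standup" inside
    # <schedule>...</schedule> and returns (start, end, name) tuples.
    block = re.search(r"<schedule>(.*?)</schedule>", text, re.DOTALL)
    if block is None:
        return []
    chosen = []
    for line in block.group(1).strip().splitlines():
        m = re.match(r"\s*(\d+)-(\d+)\s+(.+)", line)
        if m:
            chosen.append((int(m.group(1)), int(m.group(2)), m.group(3).strip()))
    return chosen

def overlaps(a, b):
    # (start, end, name) tuples: True if the two time ranges intersect.
    return a[0] < b[1] and b[0] < a[1]

def schedule_reward(completions, events, **kwargs):
    # "events" is assumed to be a dataset column: one dict per prompt,
    # mapping "start-end name" -> priority for the allowed events.
    rewards = []
    for completion, allowed in zip(completions, events):
        chosen = parse_schedule(completion)
        keys = [f"{s}-{e} {name}" for s, e, name in chosen]
        invented = any(k not in allowed for k in keys)
        clashing = any(
            overlaps(a, b)
            for i, a in enumerate(chosen)
            for b in chosen[i + 1:]
        )
        if not chosen or invented or clashing:
            # Hard zero: otherwise the model can inflate its score by
            # duplicating events or inventing high-priority ones.
            rewards.append(0.0)
        else:
            rewards.append(float(sum(allowed[k] for k in keys)))
    return rewards
```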