I've been experimenting with GRPO lately. I'm fascinated by models learning from prompts and rewards alone - no example answers needed, unlike in Supervised Fine-Tuning.
After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game, but I wanted a different challenge.
So I opted for teaching a model to create a schedule from a list of events and priorities.
Choosing an original problem forced me to think about the problem setting, generate data, choose the base model, design reward functions,
and run multiple rounds of training, hoping that my model would learn something.
A fun and rewarding experience. :-)
I learned a lot along the way, and I want to share it with you.
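To give a flavor of the reward design part: here is a minimal sketch of one possible reward for this kind of scheduling task. The event format, weights, and function name are my own simplifications for illustration - not the actual reward functions used in the project (see the blog post for those).

    # Hypothetical sketch: score a proposed schedule of (start, end, priority)
    # tuples - penalize overlapping events, reward total scheduled priority.
    def schedule_reward(events):
        events = sorted(events, key=lambda e: e[0])  # sort by start time
        overlaps = sum(
            1 for (_, end1, _), (start2, _, _) in zip(events, events[1:])
            if start2 < end1  # next event starts before the previous one ends
        )
        total_priority = sum(p for _, _, p in events)
        return total_priority - 5.0 * overlaps

    # Example: the last two events overlap, so the reward is penalized.
    print(schedule_reward([(9, 10, 2), (10, 11, 3), (10.5, 12, 1)]))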
Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
Code: https://github.com/anakin87/qwen-scheduler-grpo
Hugging Face collection (dataset and model): https://huggingface.co/collections/anakin87/qwen-scheduler-g...