Polaris: A Post-training recipe for scaling RL on Advanced Reasoning models

Really cool paper, lots of examples of what worked, lots of interesting ideas. Some things I got from a first read-through:

- sample selection while training - filtering out 0/8 and 8/8 problems (ones where none or all of the 8 sampled rollouts are correct) had been done before, but I think it's interesting that they also do it during training: as the model learns to solve some problems, their pass rate shifts from x/8 toward 8/8, and in this paper they remove them dynamically. Cool idea.

- increasing temp after an "entropy decrease" in the model - as the model "learns" new patterns, the entropy of its answers decreases (measured on n-grams), so they dynamically increase the sampling temperature to encourage discovery of more diverse answers.

- RoPE gives you free gains.

- each model is different and what works at one scale doesn't necessarily work at other scales - I think this was "known", but cool to see it applies to RL as well.
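
To make the first two points concrete, here's a rough Python sketch of how I picture them. This is my own illustration, not the paper's code: the function names, the distinct n-gram ratio used as a stand-in diversity measure, and the `step` / `max_temp` / `target_diversity` values are all assumptions I made up for the example.

```python
from typing import Dict, List


def filter_informative(pass_counts: Dict[str, int], n_rollouts: int = 8) -> List[str]:
    """Keep only problems the model sometimes, but not always, solves.

    pass_counts maps a problem id to the number of correct rollouts out of
    n_rollouts. Problems at 0/n or n/n are dropped. Re-running this between
    training stages is what makes the filtering "dynamic".
    """
    return [pid for pid, correct in pass_counts.items() if 0 < correct < n_rollouts]


def distinct_ngram_ratio(samples: List[str], n: int = 4) -> float:
    """Unique n-grams divided by total n-grams across all samples.

    A simple diversity proxy (hypothetical choice for this sketch); lower
    values mean the sampled answers look more alike.
    """
    ngrams = []
    for text in samples:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def next_temperature(
    temp: float,
    diversity: float,
    target_diversity: float,
    step: float = 0.05,      # assumed increment, not from the paper
    max_temp: float = 1.6,   # assumed cap, not from the paper
) -> float:
    """Raise the sampling temperature when diversity drops below the target."""
    if diversity < target_diversity:
        return min(temp + step, max_temp)
    return temp


if __name__ == "__main__":
    counts = {"p1": 0, "p2": 3, "p3": 8, "p4": 7}
    print(filter_informative(counts))

    rollouts = ["x = 2 so y = 4", "x = 2 so y = 4"]
    diversity = distinct_ngram_ratio(rollouts, n=2)
    print(diversity, next_temperature(1.0, diversity, target_diversity=0.8))
```

The filter is meant to be re-run on fresh pass counts between training stages, so problems drop out of the pool as they drift toward 8/8. How often to re-filter and how aggressively to bump the temperature are exactly the knobs I'd expect to need retuning per model.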
