Polaris: A Post-Training Recipe for Scaling RL on Advanced Reasoning Models

3 points by limoce · 1 comment · 7/9/2025, 6:58:42 AM · hkunlp.github.io

Comments (1)

NitpickLawyer · 4m ago
Really cool paper, lots of examples of what worked, lots of interesting ideas. Some things I got from a first read-through:

- sample selection while training - removing 0/8 and 8/8 problems has been done before, but I think it's interesting that they also do it during training: as the model learns to solve some problems, they drift from x/8 closer to 8/8, and in this paper those are removed dynamically (rough sketch after this list). Cool idea.

- increasing temperature after an "entropy decrease" in the model - as the model "learns" new patterns, the entropy of its answers (measured over n-grams) decreases, so they dynamically raise the sampling temperature to encourage more diverse answers (see the second sketch below).

- RoPE gives you free gains.

- each model is different and what works at one scale doesn't necessarily work at other scales - I think this was "known", but cool to see it applies to RL as well.
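For the dynamic sample-selection point, here's a minimal Python sketch of the idea as I read it: periodically re-score each problem's pass rate over 8 rollouts and drop anything sitting at 0/8 or 8/8. The `model.solve` call and the re-filtering cadence are my assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of dynamic sample selection during RL training.
# `model.solve` and the re-filtering cadence are assumptions, not the paper's pipeline.

def pass_rate(model, problem, n_rollouts=8):
    """Fraction of rollouts that produce a correct answer (the x/8 above)."""
    return sum(model.solve(problem) for _ in range(n_rollouts)) / n_rollouts

def filter_pool(model, problems):
    """Keep only problems the model sometimes solves (0/8 < x < 8/8).

    Rerun between training stages: as problems drift toward 8/8 they stop
    contributing a learning signal and get dropped dynamically.
    """
    return [p for p in problems if 0.0 < pass_rate(model, p) < 1.0]
```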
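And for the entropy-triggered temperature bump, a rough sketch of one way it could work: measure n-gram entropy over a batch of sampled answers and raise the temperature when it drops. The threshold, step size, and cap below are illustrative guesses, not values from the paper.

```python
# Rough sketch of entropy-triggered temperature increases.
# Thresholds, step size, and cap are illustrative, not the paper's values.
import math
from collections import Counter

def ngram_entropy(answers, n=3):
    """Shannon entropy over n-grams pooled from a batch of sampled answers."""
    counts = Counter()
    for a in answers:
        toks = a.split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def adjust_temperature(temp, answers, prev_entropy, drop=0.05, step=0.1, max_temp=1.5):
    """Raise the sampling temperature when answer diversity (entropy) falls."""
    h = ngram_entropy(answers)
    if prev_entropy is not None and h < prev_entropy - drop:
        temp = min(temp + step, max_temp)
    return temp, h
```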