DeepSeek-Prover-V2

396 meetpateltech 77 4/30/2025, 4:23:28 PM github.com ↗

Comments (77)

islewis · 140d ago

> The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals

It feels pretty intuitive to me that the ability for an LLM to break a complex problem down into smaller, more easily solvable pieces will unlock the next level of complexity.

This pattern feels like a technique often taught to junior engineers: how to break up a multi-week project into bite-sized tasks. This model is obviously math-focused, but I see no reason why this wouldn't be incredibly powerful for code-based problem solving.

bearjaws · 139d ago

It's actually pretty hilarious how far into detail they can go.

For example, I made a bot that you could give a problem statement, and it would return an array of steps to accomplish it.

Then you could take the steps, and click on them to break them down and add them to the list. If you just kept clicking you would get to excruciating detail.

For example, taking out the trash can become ~70 individual steps if you really drill into the details.

Some of the steps:

Stand close to the trash can – Position yourself so you have stable footing and easy access.

Place one hand on the rim of the can – Use your non-dominant hand to press lightly on the edge of the trash can to hold it in place.

Grip the top edge of the bag with your other hand – Find the part of the bag that extends past the rim.

Gently lift the bag upward – While your one hand stabilizes the can, slowly pull the bag up with the other.

Tilt the can slightly if needed – If the bag sticks or creates suction, rock or tilt the can slightly while continuing to lift.

Avoid jerking motions – Move steadily to prevent tears or spills
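
A minimal sketch of that click-to-expand idea, assuming a tree of steps where expanding a node swaps it for finer sub-steps. The `Step` class and `canned_decompose` are hypothetical names for illustration; the canned table stands in for the LLM call, and nothing here is the commenter's actual bot.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for the LLM call: maps one step to finer-grained sub-steps.
Decomposer = Callable[[str], List[str]]


@dataclass
class Step:
    text: str
    children: List["Step"] = field(default_factory=list)

    def expand(self, decompose: Decomposer) -> None:
        """Replace this step's children with its sub-steps (the "click")."""
        self.children = [Step(t) for t in decompose(self.text)]

    def leaves(self) -> List[str]:
        """Flatten the tree into the current list of most detailed steps."""
        if not self.children:
            return [self.text]
        out: List[str] = []
        for child in self.children:
            out.extend(child.leaves())
        return out


def canned_decompose(step: str) -> List[str]:
    # Canned answers so the sketch runs offline; a real bot would prompt a model here.
    table = {
        "Take out the trash": ["Remove the bag from the can", "Carry the bag outside"],
        "Remove the bag from the can": [
            "Stand close to the trash can",
            "Place one hand on the rim of the can",
            "Gently lift the bag upward",
        ],
    }
    return table.get(step, [step])


root = Step("Take out the trash")
root.expand(canned_decompose)              # first click: two coarse steps
root.children[0].expand(canned_decompose)  # second click: drill into the first step
print(root.leaves())
```

Each further click only grows the leaf list, which is how a chore ends up as dozens of steps.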

eightysixfour · 139d ago

This used to be part of one of the intro to engineering courses at my school - write an XX page document describing how to make a peanut butter and jelly sandwich.

larrysalibra · 139d ago

This was a homework assignment in my second grade class!

The next day we had to follow our instructions exactly in class to make the sandwich which was hilarious. A formative experience for me!

amelius · 139d ago

A dad trying this out on his kids:

https://www.youtube.com/watch?v=cDA3_5982h8

voiper1 · 139d ago

I've been using that as a test of new LLMs - and do it in a specific style.

lugu · 139d ago

This is how I imagine LLMs are used in robotics, with one or two more levels of description.

thrance · 139d ago

This feels like a manual for infiltrated aliens: "How to pass as humans, Vol. I"

roywiggins · 139d ago

or for goblins:

https://goblin.tools/

jrvarela56 · 139d ago

http://www.drawtoast.com/

Alifatisk · 139d ago

Is the bot something I can try?

otabdeveloper4 · 139d ago

Yes, an LLM can generate infinite amounts of bullshit if you ask it to.

criley2 · 140d ago

Imo current models can already break things up into bite-sized pieces. The limiter I've seen is twofold:

1) Maintaining context of the overall project and goals while working in the weeds on a subtask of a task on an epic (so to speak) both in terms of what has been accomplished already and what still needs to be accomplished

and 2) Getting an agentic coding tool which can actually handle the scale of doing 50 small projects back to back. With these agentic tools I find they start projects off really strong but by task #5 they're just making a mess with every change.

I've played with keeping basically a dev-progress.md file and implementation-plan.md file that I keep in context for every request and end each task by updating files. But me manually keeping all this context isn't solving all my problems.

And all the while, tools like Cline are gobbling up 2M tokens to make small changes.

jhrmnn · 140d ago

> Maintaining context of the overall project and goals while working in the weeds on a subtask of a task on an epic (so to speak) both in terms of what has been accomplished already and what still needs to be accomplished

This is a struggle for every human I’ve ever worked with

mmis1000 · 139d ago

This is probably the biggest difference between people who write code and people who should never write code. Some people just can't write several connected program files without logical conflicts. It's almost like their brain context is only capable of holding one file.

drob518 · 139d ago

True, but if AI only gets as useful as an average developer, it isn’t that useful.

pertymcpert · 140d ago

Yes. I wonder if the path forward will be to create systems of agents that work as a team, with an "architect" or "technical lead" AI directing the work of more specialized execution AIs. This could alleviate the issue of context pollution as the technical lead doesn't have to hold all of the context when working on a small problem, and vice versa.

Shit. Do we need agile AI now?

Rudybega · 140d ago

This is kind of what the modes in roo code do now. I'm having great success with them and having them as a default just rolled out a couple days ago.

There is a default set of modes (orchestrator, code, architect, debug, and ask) and you can create your own custom ones (or have roo do it for you, which is kind of a fun meta play).

Orchestrator basically consults the others and uses them when appropriate, feeding in a sensible amount of task definition and context into the sub task. You can use different LLMs for different modes as well (I like Gemini 2.5 Pro for most of the thinking style ones and gpt o4-mini for the coding).

I've done some reasonably complicated things and haven't really had an orchestrator task creep past ~400k tokens before I was finished and able to start a new task.

There are some people out there who do really cool stuff with memory banks (basically logging and progress tracking), but I haven't played a ton with that yet.

Basic overview: https://docs.roocode.com/features/boomerang-tasks

Custom Modes: https://docs.roocode.com/features/custom-modes

dataviz1000 · 139d ago

Here is the tippy top of my copilot-instructions.md file

```

# Copilot Instructions

## Prompts

### General Coding

- *Boyd’s Law of Iteration: speed of iteration beats quality of iteration*: First and foremost, break every problem into smaller atomic parts. Then make a plan to start with one small part, build it, give the user an opportunity to run the code to quickly check the part works, and then move on to the next part. After all the parts are completed independently, check that they all work together in harmony. Each part should be minimal.

```

With any big problem the LLM responds first with ..... Boyd's Law of Iteration ..... and proceeds to break the problem into smaller parts.

I've discovered keeping file size under 300 or 400 lines helps. The AI is great at refactoring.

ethbr1 · 139d ago

Everything that is 1950s is new again: dynamic programming https://en.m.wikipedia.org/wiki/Dynamic_programming#Computer...

cadamsdotcom · 140d ago

And it should be powerful for breaking down reasoning chains of thought too.

qoez · 140d ago

The best part about these is that I know the weights are static so I know I won't have to deal with a sassy unusable update for a week suddenly.

Implicated · 140d ago

Or, like with Claude, it being effectively lobotomized during North American 'business' hours. 3am PST? Cracked. 8am PST? ... mentally challenged.

navanchauhan · 140d ago

This is pretty interesting. Do you have more information about this?

devoutsalsa · 139d ago

I’m pretty sure the parent comment is referring to capacity constraints. When the Americans come online in the morning, Claude frequently can’t keep up with demand and error messages saying the system is at capacity are common.

aitchnyu · 138d ago

I used OpenRouter for Claude and DeepSeek, which chooses between alternative hosts for a given model. I excluded the DeepSeek provider as it's underperforming.

https://openrouter.ai/deepseek/deepseek-chat-v3-0324

whatshisface · 139d ago

I've noticed it too. When it started the overcapacity messages went away. I think they are switching to models with fewer parameters during oversubscribed hours.

ekez · 140d ago

I wonder if the authors have tried incorporating error feedback from Lean into their models.

Work from 2023 [1] showed that general-purpose models did better when they were able to incorporate error feedback. Humans incorporate error feedback too, but none of the SOTA models on miniF2F seem to.

[1]: https://arxiv.org/abs/2310.04353
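
To make "incorporating error feedback" concrete, here is a rough sketch of the loop, assuming the checker's error message is simply passed into the next attempt. `propose` and `check` are hypothetical stand-ins for a model call and a Lean invocation; this is not how the cited work or DeepSeek-Prover is implemented.

```python
from typing import Callable, Optional, Tuple

# Hypothetical stand-ins: `propose` would wrap a model call, `check` would wrap
# a Lean run and return (accepted, error_message).
Propose = Callable[[str, Optional[str]], str]
Check = Callable[[str], Tuple[bool, str]]


def prove_with_feedback(
    goal: str, propose: Propose, check: Check, max_rounds: int = 4
) -> Tuple[Optional[str], int]:
    """Ask for a proof; on failure, feed the checker's error into the next attempt."""
    error: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        candidate = propose(goal, error)
        ok, message = check(candidate)
        if ok:
            return candidate, round_no
        error = message  # the next attempt gets to see what went wrong
    return None, max_rounds


def toy_propose(goal: str, error: Optional[str]) -> str:
    # The first attempt is wrong; once an error is visible the toy "repairs" it.
    return "by simp" if error is None else "by omega"


def toy_check(candidate: str) -> Tuple[bool, str]:
    return (candidate == "by omega", "simp made no progress")


print(prove_with_feedback("n + 0 = n", toy_propose, toy_check))
```

Without the `error` argument this collapses into plain resampling, which is roughly the contrast being drawn here.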

johnmcd3 · 139d ago

In a way, DeepSeek Prover's subgoal decomposition is a partial-step towards error/proof-state feedback. (DS Prover breaks down a proof into subgoals and attacks each subgoal separately with batched sampling, then puts the pieces back together.)
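
The decompose, sample, and reassemble loop described in the parenthetical can be sketched like this. `sample` and `verify` are hypothetical placeholders for a model call and a proof checker; the actual pipeline is certainly more involved.

```python
from typing import Callable, Dict, List, Optional


def solve_by_subgoals(
    subgoals: List[str],
    sample: Callable[[str, int], List[str]],  # hypothetical: k candidate proofs for one subgoal
    verify: Callable[[str, str], bool],       # hypothetical: does the checker accept it?
    batch_size: int = 8,
) -> Optional[Dict[str, str]]:
    """Return one verified proof per subgoal, or None if any subgoal stays unsolved."""
    solved: Dict[str, str] = {}
    for goal in subgoals:
        batch = sample(goal, batch_size)
        winner = next((c for c in batch if verify(goal, c)), None)
        if winner is None:
            return None  # one failed subgoal blocks reassembly of the full proof
        solved[goal] = winner
    return solved


def toy_sample(goal: str, k: int) -> List[str]:
    return [f"{goal}: try {i}" for i in range(k)]


def toy_verify(goal: str, candidate: str) -> bool:
    return candidate.endswith("try 3")


print(solve_by_subgoals(["h1", "h2"], toy_sample, toy_verify))
```

The point of the design is that each subgoal is checked independently, so a failure is localized to one piece instead of sinking a whole-proof sample.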

This is distinct from the approach of the previous SOTA for an open-weights model (Kimina Prover) which generated at the full-proof level.

While it was very impressive to see Kimina's ability to generate medium-length proofs (think AIME-level problems) without sub-goals or feedback at intermediate steps, it's likely that at least subgoal decomposition will be required for longer proofs (think IMO-level problems.)

I certainly agree that where and how error/proof state feedback is best incorporated (training data synthesis / reward function / CoT during inference / etc.) is a fascinating area of research. (It's rumored that GDM's AlphaProof does use proof state / Lean feedback already.)

MJGrzymek · 139d ago

That's surprising to learn.

I'm surprised those even use actual Lean code instead of, like, raw type theory.

mcshicks · 139d ago

You can run the model for free on openrouter.ai. I have played around with Lean, slowly working my way through Mathematics in Lean. I was stuck on this problem in section 3.6:

theorem convergesTo_unique {s : ℕ → ℝ} {a b : ℝ} (sa : ConvergesTo s a) (sb : ConvergesTo s b) :

For fun I tried it on the free model on openrouter.ai. Got the answer the first time.

https://leanprover-community.github.io/mathematics_in_lean/m...

Here's the answer just to give you a feel.

  by_contra h
  have h₁ : a ≠ b := h
  have h₂ : |a - b| > 0 := by
    apply abs_pos.mpr
    exact sub_ne_zero.mpr h₁
  -- Use the definition of convergence to find N₁ and N₂
  have h₃ := sa (|a - b| / 2) (by linarith)
  have h₄ := sb (|a - b| / 2) (by linarith)
  cases' h₃ with N₁ h₃
  cases' h₄ with N₂ h₄
  -- Choose N to be the maximum of N₁ and N₂
  let N := max N₁ N₂
  have h₅ := h₃ N (by simp [N, le_max_left])
  have h₆ := h₄ N (by simp [N, le_max_right])
  -- Derive a contradiction using the triangle inequality
  have h₇ : |s N - a| < |a - b| / 2 := by simpa using h₅
  have h₈ : |s N - b| < |a - b| / 2 := by simpa using h₆
  have h₉ : |a - b| < |a - b| := by
    calc
      |a - b| = |a - s N + (s N - b)| := by ring_nf
      _ ≤ |a - s N| + |s N - b| := by
        apply abs_add
      _ = |s N - a| + |s N - b| := by
        rw [abs_sub_comm]
      _ < |a - b| / 2 + |a - b| / 2 := by
        linarith
      _ = |a - b| := by ring
  linarith
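
For readers without the book open, this is the setup the proof relies on, as far as I recall it from Mathematics in Lean. Treat the definition and the goal `a = b` as reproduced from memory (assumptions to check against the book), and the broad `import Mathlib` as a convenience choice.

```lean
import Mathlib

-- Definition as I recall it from Mathematics in Lean (check the book).
def ConvergesTo (s : ℕ → ℝ) (a : ℝ) :=
  ∀ ε > 0, ∃ N, ∀ n ≥ N, |s n - a| < ε

-- The exercise: limits are unique. The tactic script above would replace `sorry`.
theorem convergesTo_unique {s : ℕ → ℝ} {a b : ℝ}
    (sa : ConvergesTo s a) (sb : ConvergesTo s b) :
    a = b := by
  sorry
```

With that definition, `sa (|a - b| / 2) (by linarith)` in the answer is just instantiating ε with half the distance between the two candidate limits.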

simianwords · 140d ago

related: I imagine in the future we might have several "expert" LLMs and a wrapper can delegate tasks as needed, as if each were a "tool". That way we can have segregation of expertise - each individual model can excel at one single thing.

A prover model might be used as a tool in the coming future.

Arcuru · 140d ago

For a concrete example today, see https://openrouter.ai/openrouter/auto

simianwords · 140d ago

That's nice, but imagine first having models that are experts in specific domains. Routing seems to be the easy part (just feed the available models as tools to your wrapper LLM).
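
A toy version of that wrapper, assuming the routing decision is a function that returns an expert's name. The experts and `keyword_pick` are hypothetical; in the setup imagined here, the wrapper LLM's tool choice would take the place of the keyword check.

```python
from typing import Callable, Dict

Expert = Callable[[str], str]


def make_router(experts: Dict[str, Expert], pick: Callable[[str], str]) -> Expert:
    """Build a wrapper that asks `pick` for an expert name, then delegates to it."""

    def route(task: str) -> str:
        name = pick(task)
        if name not in experts:
            raise KeyError(f"router chose unknown expert: {name}")
        return experts[name](task)

    return route


# Hypothetical experts; each would be a separate domain model in practice.
experts: Dict[str, Expert] = {
    "prover": lambda task: f"[prover] formal proof for: {task}",
    "coder": lambda task: f"[coder] patch for: {task}",
}


def keyword_pick(task: str) -> str:
    # Stand-in for the wrapper LLM's tool choice.
    return "prover" if "prove" in task.lower() else "coder"


route = make_router(experts, keyword_pick)
print(route("Prove that limits are unique"))
print(route("Fix the off-by-one in the parser"))
```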

samvaran · 140d ago

Is that not what MoE models already do?

AlexCoventry · 139d ago

MoE models route each token, in every transformer layer, to a set of specialized feed-forward networks (fully-connected perceptrons, basically), based on a score derived from the token's current representation.
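
As a rough illustration of that per-token routing, here is a pure-Python sketch of one MoE layer with top-k gating. The router matrix and the scaling "experts" are toy assumptions; real models differ in details such as where the softmax is applied and how load is balanced.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(
    token: Vector,
    router_weights: List[Vector],
    experts: List[Callable[[Vector], List[float]]],
    top_k: int = 2,
) -> List[float]:
    """Route one token representation to its top_k experts and mix their outputs."""
    # One score per expert: dot product of the token with that expert's router row.
    scores = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Gate weights are renormalised over the chosen experts only.
    gates = softmax([scores[i] for i in chosen])
    out = [0.0] * len(token)
    for gate, i in zip(gates, chosen):
        y = experts[i](token)
        out = [o + gate * v for o, v in zip(out, y)]
    return out


# Toy "feed-forward networks": each expert just scales its input.
experts = [lambda v, c=c: [c * x for x in v] for c in (1.0, 10.0, 100.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

print(moe_layer([3.0, 1.0], router_weights, experts, top_k=2))
```

Note that the choice is made per token and per layer from the token's own representation, which is why it is not the same thing as picking a domain model per request.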

neom · 139d ago

Good visual explainer in here: https://deepgram.com/learn/mixture-of-experts-ml-model-guide

oofbaroomf · 140d ago

No. Each expert is not separately trained, and while they may store different concepts, they are not meant to be different experts in specific domains. However, there are certain technologies to route requests to different domain expert LLMs or even fine-tuning adapters, such as RouteLLM.

woah · 139d ago

Why do you think that a hand-configured selection between "different domains" is better than the training-based approach in MoE?

oofbaroomf · 139d ago

First off, they are basically completely different technologies, so it would be disingenuous to act like it's an apples-to-apples comparison.

But a simple way to see it is that when you pick between multiple large models that have different strengths, you have a larger number of parameters to work with (e.g. DeepSeek R1 + V3 + Qwen + LLaMA ends up being 2 trillion total parameters to pick from), whereas "picking" the experts in an MoE gives you a smaller total number of distinct parameters to work with (e.g. R1 is 671 billion, Qwen is 235 billion).
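
The 2 trillion figure checks out as back-of-envelope arithmetic if you assume the 405B LLaMA. That choice is my assumption, since the comment doesn't say which LLaMA; the other sizes are the ones quoted in this thread.

```python
# Total parameter counts in billions. R1 and Qwen are from the comment, V3 at 671
# is quoted elsewhere in the thread, and 405 for LLaMA is an assumption.
pool = {"DeepSeek R1": 671, "DeepSeek V3": 671, "Qwen": 235, "LLaMA": 405}

total_billion = sum(pool.values())
print(f"{total_billion}B total, about {total_billion / 1000:.1f} trillion")
```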

retinaros · 140d ago

That might already happen behind what they call test time compute

oofbaroomf · 139d ago

Many models that use test time compute are MoEs, but test-time compute is generally meant to refer to reasoning about the prompt/problem the model is given, not about reasoning about which model to pick, and I don't think anyone has released an LLM router under that name.

retinaros · 139d ago

We don't know what OAI does to find the best answer when reasoning, but I am pretty sure that having variations of the same model is part of it.

someguy101010 · 140d ago

The No Free Lunch Theorem implies that something like this is inevitable https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...

repsilat · 140d ago

A system of n experts is no different to a single expert wrt the NFLT. The theorem is entirely indifferent to (ie "equally skeptical of") the idea.

koakuma-chan · 140d ago

> related: I imagine in the future we might have several "expert" LLMs and a wrapper can delegate tasks as needed, as if each were a "tool". That way we can have segregation of expertise - each individual model can excel at one single thing.

In the future? I'm pretty sure people do that already.

simianwords · 140d ago

No I disagree. I would want ChatGPT to abstract away expert models - biochemistry model, coding model, physics model and maybe O3 would use these models as tools to come up with an answer.

The point being that a separate expert model would be better at its own field than a single model that tries to be good at everything. Intuitively it makes sense; in practice, I have seen anecdotes where finetuning a small model on domain data makes the model lose coherence on other topics.

koakuma-chan · 139d ago

> have seen anecdotes where finetuning a small model on domain data makes the model lose coherence on other topics

This is expected behaviour.

simianwords · 139d ago

I know. So why don't we have domain-specific models as tools in consumer LLM products?

energy123 · 140d ago

It's crudely done though.

kratom_sandwich · 140d ago

Mistral's model is a mixture-of-experts model.

revskill · 140d ago

The way intelligence works, to me, is more about:

- Making correct and smart assumptions. Currently all LLM bots are too stupid to make good assumptions. I don't want to explicitly repeat my own assumptions again and again when the context is clear enough. Hey bots, try harder.

- LLM bots need to bring their own secondary and contextual memory to their reasoning. I don't want to do it for you, OK? You're the bot.

- Thinking outside the box. This is the final stage of intelligence: adapting old techniques into your own technique to solve problems that did not exist before.

nthingtohide · 139d ago

I propose that human-AI interaction data be made public. It is our collective Wikipedia of the AI era. Otherwise our progress will be a blank line after 2022, just as the Egyptians didn't write down their process for moving giant rocks.

jasonjmcghee · 140d ago

Super interesting that they chose 671B and 7B, with nothing like 32B, which feels like a "sweet spot".

versteegen · 139d ago

Likely because they haven't got their own suitable SoTA base models of any other size to build on. DeepSeek V3 is 671B, and DeepSeek-Prover-v1.5 [1] is 7B only, based on DeepSeekMath which is 7B, which is based on DeepSeekCoder-Base-7B-v1.5. Maybe DeepSeek-Coder-V2 (16B and 236B) would be a good start but it's merged into DeepSeek V2.5, and V2.5 is inferior to V3. Or some version of Qwen.

[1] https://github.com/deepseek-ai/DeepSeek-Prover-V1.5

bredren · 140d ago

Also notable: the earliest planning for a well-received release of a new model might include both parameter-based and skill-based market segmentation.

--> "In an increasingly crowded field of LLMs, how will our (costly to produce) model stand out?"

SweetSoftPillow · 139d ago

I feel like this is a very logical way to do things. Test the hypothesis on a small model, play around, get it working, then apply the findings to the big model.

ddlsmurf · 139d ago

Or they did and it wasn't sweet? (No idea, but it seems they would have tried before writing up a publication.)

smusamashah · 140d ago

That Putnam bench graph (middle one) is showing 49/658 solve rate.

> The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench.

Which is 0.07% (edit: 7%) for PutnamBench

darkmighty · 140d ago

49/658 is 7%
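
Spelled out, since the two numbers differ only by the factor of 100 (figures are the ones quoted from the README above):

```python
solved, total = 49, 658

ratio = solved / total
print(f"{ratio:.4f} as a fraction, {ratio * 100:.1f}% as a percentage")
```

So 0.07 is the raw fraction, and the solve rate is a bit over 7%.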

smusamashah · 140d ago

Sorry, forgot to multiply by 100

booi · 140d ago

I bet DeepSeek-Prover-V2 wouldn't have made that mistake

gallerdude · 140d ago

classic human hallucination

HappyPanacea · 140d ago

How likely is it that Putnam answers were in DeepSeek's training data?

EvgeniyZh · 139d ago

The solutions weren't published anywhere. There is also no good automatic way to generate solutions as far as I know, even expensive ones (the previous SOTA was 10 solutions, and the one before was 8 using pass@3200 for a 7B model). Potentially the developers could've paid some people who are good at Putnam-level math problems and Lean to write solutions for LLMs. It is hard to estimate the likelihood of that, but it sounds like a waste of money given a relatively marginal problem/benchmark.

HappyPanacea · 139d ago

AoPS seems to have a forum dedicated to Putnam (including 2024): https://artofproblemsolving.com/community/c3249_putnam and here is a pdf with solutions to Putnam 2023: https://kskedlaya.org/putnam-archive/2023s.pdf

EvgeniyZh · 139d ago

These still need to be formalized in Lean, which can sometimes be harder than solving the problem.

Alifatisk · 139d ago

Is this model hosted at DeepSeek chat too? I couldn't find it yesterday, and I prefer not to self-host because I lack good hardware.

rfoo · 139d ago

It's not something you should talk to. In concept it's more like AlphaProof, just with some of their research artifacts (and probably a paper / tech report later) shared with the community.

Alifatisk · 139d ago

Ohhh

whatshisface · 139d ago

How much education would a human need to perform at this level on the benchmarks?

pama · 139d ago

Learning to formalize math and then prove Putnam competition problems rigorously in Lean would require a mid-to-advanced college-level math and CS background. (Learning to do a small fraction of the Putnam competition without using Lean probably only needs strong high school math and early undergrad math, with training for competitions a strong bonus.)

hartator · 139d ago

Any way to install via ollama?

Like ollama run deepseek-ai/DeepSeek-Prover-V2-7B

amelius · 139d ago

Is someone working on similar ideas for compiler back-ends?

Fokamul · 139d ago

>"an open-source large language model"

Is it really opensource? Something changed?

eightysixfour · 139d ago

Everyone calls "open weight" models "open source" at this point. It is wrong, but we'll have to find another way to fight this fight. Maybe "open data and pipeline" or something.

