Ask HN: How Does DeepSeek "Think"?

JPLeRouzic · 6/26/2025, 8:12:03 AM
There is a useful feature in DeepSeek that isn't present in other commercial LLMs: it displays its internal "thinking" process. I wonder what technological aspect makes this possible. Do several LLMs communicate with each other before providing a solution? Are there different roles within these LLMs, such as some proposing solutions, others contradicting them or offering alternative viewpoints, or reminding the others of overlooked aspects?

Comments (2)

123yawaworht456 · 8h ago
>Do several LLMs communicate with each other before providing a solution?

no

>I wonder what technological aspect makes this possible.

one of its training datasets (prioritized somehow over the rest of them) contains a large number of examples emulating the thinking process within <think></think> tags before providing an output. the model then emulates it at runtime.
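to be clear, the visible "thinking" is just ordinary text the model emits before its answer; the UI only splits it out. a minimal sketch of how a chat front-end might do that split (the <think> tags come from the training format quoted above; the function name and example string here are purely illustrative):

```python
import re

def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Separate the model's <think>...</think> block from its final answer."""
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    thinking = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", raw_output, flags=re.DOTALL).strip()
    return thinking, answer

raw = "<think>The user asks 2+2. That is 4.</think>The answer is 4."
thinking, answer = split_reasoning(raw)
print(thinking)  # The user asks 2+2. That is 4.
print(answer)    # The answer is 4.
```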

JPLeRouzic · 2h ago
Thank you for taking the time to answer. However, I am not sure the answer is "no", because DeepSeek uses a particular technique in its architecture. To cite this blog [0]:

"Modern large language models (LLMs) started introducing a layer called “Mixture of Experts” (MoE) in their Transformer blocks to scale parameter count without linearly increasing compute. This is typically done through top-k (often k=2) “expert routing”, where each token is dispatched to two specialized feed-forward networks (experts) out of a large pool.

A naive GPU cluster implementation would be to place each expert on a separate device and have the router dispatch to the selected experts during inference. But this would have all the non-active experts idle on the expensive GPUs.

GShard, 2021 introduced the concept of sharding these feed-forward (FF) experts across multiple devices, so that each device"

[0] https://www.kernyan.com/hpc,/cuda/2025/02/26/Deepseek_V3_R1_...
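For anyone unfamiliar with the mechanism the blog describes, here is a minimal sketch of top-k (k=2) expert routing: a small gating network scores the experts, each token is sent to its two highest-scoring feed-forward experts, and their outputs are combined with the normalized gate weights. The module, dimensions, and names are illustrative only, not DeepSeek's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy Mixture-of-Experts layer: a router picks the top-2 experts per token."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)          # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                     # x: (tokens, d_model)
        scores = self.router(x)                               # (tokens, n_experts)
        top_w, top_idx = scores.topk(self.k, dim=-1)          # top-2 experts per token
        top_w = F.softmax(top_w, dim=-1)                      # normalize the two gate weights
        out = torch.zeros_like(x)
        for slot in range(self.k):                            # weighted sum of expert outputs
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * self.experts[e](x[mask])
        return out

x = torch.randn(10, 64)                                       # 10 token embeddings
print(TinyMoE()(x).shape)                                     # torch.Size([10, 64])
```

Note that this routing is about scaling parameter count without running every expert on every token; whether it has anything to do with the displayed "thinking" text is exactly the question above.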