Ask HN: How Does DeepSeek "Think"?
Posted by JPLeRouzic on 6/26/2025, 8:12:03 AM
There is a useful feature in DeepSeek that isn't present in other commercial LLMs: it displays its internal "thinking" process. I wonder what technological aspect makes this possible. Do several LLMs communicate with each other before providing a solution? Are there different roles within these LLMs, such as some proposing solutions, others contradicting them or offering alternative viewpoints, or pointing out overlooked aspects?
no
>I wonder what technological aspect makes this possible.
One of its training datasets (prioritized somehow over the rest of them) contains a large number of examples that emulate a thinking process inside <think></think> tags before giving the final output. The model then reproduces that pattern at inference time.
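To make that concrete, here is a minimal sketch of how a chat front end could separate the generated trace from the final answer. The raw string and the exact tag handling are illustrative assumptions, not DeepSeek's actual serialization; the point is that the "thinking" is just ordinary generated text delimited by tags, which the UI renders differently.

```python
import re

# Hypothetical raw completion from a reasoning-tuned model: the "thinking"
# shown in the UI is plain generated text wrapped in <think></think> tags.
raw_output = (
    "<think>The user asks for 17 * 24. 17 * 24 = 17 * 20 + 17 * 4 "
    "= 340 + 68 = 408.</think>"
    "17 x 24 = 408."
)

def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the <think>...</think> trace from the final answer."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not match:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(raw_output)
print("thinking:", reasoning)
print("answer:", answer)
```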
"Modern large language models (LLMs) started introducing a layer called “Mixture of Experts” (MoE) in their Transformer blocks to scale parameter count without linearly increasing compute. This is typically done through top-k (often k=2) “expert routing”, where each token is dispatched to two specialized feed-forward networks (experts) out of a large pool.
A naive GPU cluster implementation would be to place each expert on a separate device and have the router dispatch to the selected experts during inference. But this would have all the non-active experts idle on the expensive GPUs.
GShard, 2021 introduced the concept of sharding these feed-forward (FF) experts across multiple devices, so that each device…"
[0] https://www.kernyan.com/hpc,/cuda/2025/02/26/Deepseek_V3_R1_...
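To make the top-k routing step in the quoted passage concrete, here is a toy top-2 MoE router in NumPy. Everything is a simplification for illustration (random weights, single-matrix "experts", no load balancing); real models learn the router inside each Transformer block and, as the article describes, shard the experts across devices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only): 4 tokens, model dim 8, 4 experts, top-2 routing.
num_tokens, d_model, num_experts, top_k = 4, 8, 4, 2

tokens = rng.standard_normal((num_tokens, d_model))

# Router: a linear layer scoring each token against each expert.
router_w = rng.standard_normal((d_model, num_experts))

# Each "expert" is reduced to a single weight matrix here; in a real model it
# is a full feed-forward block.
expert_ws = rng.standard_normal((num_experts, d_model, d_model))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

logits = tokens @ router_w                       # (tokens, experts)
probs = softmax(logits)
top2 = np.argsort(-probs, axis=-1)[:, :top_k]    # the 2 chosen experts per token

output = np.zeros_like(tokens)
for t in range(num_tokens):
    # Renormalize the gate weights over the selected experts only.
    gates = probs[t, top2[t]]
    gates = gates / gates.sum()
    for gate, e in zip(gates, top2[t]):
        # Only the selected experts run for this token; the others stay idle
        # (or, with expert parallelism, live on entirely different devices).
        output[t] += gate * (tokens[t] @ expert_ws[e])

print(top2)          # which 2 experts each token was routed to
print(output.shape)  # same shape as the input tokens
```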