The 96GB (HBM2e) SKU is the PPU from T-Head Semiconductor (basically a subsidiary of Alibaba). Its spec is very similar to the H20. Other chips they were using include the Huawei Ascend 910B (64GB) and possibly other domestically designed chips.
boulos · 32d ago
I was surprised not to see a Kunlun P800 there.
rahen · 33d ago
I'm pretty surprised by the claimed memory usage for 300B parameters (table 1).
If we compare similar models:
- Llama 3.1 with 405B parameters: 2 TB of memory (FP32), 500 GB (FP8)
- DeepSeek R1 with 671B parameters: 1.3 TB (scaling linearly, around 600 GB for 300B parameters)
Ling claims no more than 96 GB of memory, most likely for inference. That's far more than a 20% reduction. Am I missing something?
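For reference, here's a rough weight-only back-of-the-envelope check (a sketch; it ignores KV cache, activations, and optimizer state):

```python
# Weight-only memory estimate: params * bytes per parameter.
def weight_memory_gb(params_billion, bits_per_param):
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for label, bits in [("FP32", 32), ("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"300B @ {label}: {weight_memory_gb(300, bits):.0f} GB")

# 300B @ FP32: 1200 GB
# 300B @ BF16:  600 GB
# 300B @ FP8:   300 GB
# 300B @ INT4:  150 GB
```

Even at 4-bit, the weights of a 300B-parameter model alone don't fit in 96 GB, which is why the single-device reading seems off to me.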
cavisne · 33d ago
I think they only claim that their "Ling-Lite" 17B model fits on a single 96GB GPU; their 300B model needs 8 of them (768GB of HBM).
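A rough sanity check of why eight 96GB devices are enough (a sketch, assuming BF16 weights sharded evenly across devices; the numbers are my own arithmetic, not from the paper):

```python
# Per-device weight footprint for a 300B-parameter model sharded across 8 GPUs.
params = 300e9
bytes_per_param = 2          # BF16
num_devices = 8

total_gb = params * bytes_per_param / 1e9        # ~600 GB of weights
per_device_gb = total_gb / num_devices           # ~75 GB per device

print(f"total {total_gb:.0f} GB, per device {per_device_gb:.0f} GB of 96 GB")
```

That leaves roughly 20 GB per device for KV cache and activations.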
fxtentacle · 33d ago
Some of these models still produce great results at something as low as 2.7 bits per weight.
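For scale, a quick sketch of what that implies for a 300B-parameter model (my own arithmetic, not a figure from the paper):

```python
# Weight footprint of a 300B-parameter model at 2.7 bits per weight.
params = 300e9
bits_per_weight = 2.7

gb = params * bits_per_weight / 8 / 1e9
print(f"{gb:.0f} GB")   # ~101 GB, still just over a single 96 GB card
```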
vednig · 31d ago
They've shared some interesting optimization techniques for bigger LLMs, that's all; it's not about low-powered devices in the power-consumption sense. Still a good read.
osti · 33d ago
I think this is the one where they train LLMs without NVIDIA GPUs.
cavisne · 33d ago
They talk about CUDA-level tracing in their framework. I assume it's just consumer GPUs that Nvidia says aren't meant to be used in datacenters.
Table 1 is the closest thing. Device specs for six devices: 120-989 TFLOPS and 64-96 GB RAM.
An RTX 5090 is about 105 TFLOPS (FP32).
https://www.techpowerup.com/gpu-specs/geforce-rtx-5090.c4216