Nvidia trains 10T model in 4 bit precision (NVFP4)
6 points by opcode84 8/26/2025, 4:54:51 PM | developer.nvidia.com | 3 comments
This is a 12B parameter model trained on 10T tokens.
It's also editorialized, which is against HN guidelines.
Title is: "NVFP4 Trains with Precision of 16-Bit and Speed and Efficiency of 4-Bit"
A version of the 12B Hybrid Mamba-Transformer model was initially trained with 8-bit precision—FP8, which has been shown in previous studies to closely match 16-bit precision, and hence served as our baseline for comparison. We then successfully trained this same 12B model from scratch using NVFP4, demonstrating that this new low-precision format can support full pretraining at trillion-token scale. The NVFP4 run exhibited stable convergence without the training instabilities or divergence issues that typically plague ultra-low precision training.
Figure 3 below shows that NVFP4’s validation loss curve closely matches the loss curves from the higher-precision baseline (i.e., FP8) throughout the entire duration of training. The quantization techniques outlined above ensure that even with aggressive bit-width reduction, the 4-bit pretraining dynamics closely resemble those of higher-precision runs.
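For intuition about what block-scaled 4-bit quantization looks like numerically, here is a minimal sketch in the spirit of NVFP4: 4-bit E2M1 values with a shared scale per block of 16 elements. This is illustrative only, not NVIDIA's implementation; in particular the real format stores compact FP8-style block scales, while the sketch below just uses a plain float scale and "fake-quantizes" back to float so you can measure the error.

```python
import numpy as np

# Illustrative sketch of block-scaled FP4 (E2M1) quantization, loosely modeled
# on the NVFP4 description (4-bit values, 16-element blocks). Assumptions:
# round-to-nearest, a plain float per-block scale, and fake-quantization
# (dequantize back to float) so the error can be inspected directly.

# Magnitudes representable in E2M1 (2 exponent bits, 1 mantissa bit).
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block: np.ndarray) -> np.ndarray:
    """Quantize one block to FP4 with a shared scale, then dequantize."""
    amax = np.abs(block).max()
    if amax == 0:
        return np.zeros_like(block)
    scale = amax / E2M1_GRID[-1]          # map the block max onto the largest FP4 magnitude
    scaled = block / scale
    # Pick the nearest representable E2M1 value for each element, preserving sign.
    idx = np.abs(scaled[:, None] - np.sign(scaled)[:, None] * E2M1_GRID).argmin(axis=1)
    return np.sign(scaled) * E2M1_GRID[idx] * scale

def fake_quantize_fp4(x: np.ndarray, block_size: int = 16) -> np.ndarray:
    """Fake-quantize a flat tensor block by block."""
    blocks = x.reshape(-1, block_size)
    return np.stack([quantize_block_fp4(b) for b in blocks]).reshape(x.shape)

x = np.random.randn(1024).astype(np.float32)
xq = fake_quantize_fp4(x)
print("mean abs quantization error:", np.abs(x - xq).mean())
```

The per-block scale is what keeps the error tolerable despite only eight representable magnitudes: each block of 16 values is renormalized to its own local maximum before snapping to the FP4 grid.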