VaultGemma: The most capable differentially private LLM

46 meetpateltech 10 9/12/2025, 4:14:50 PM research.google ↗

Comments (10)

Workaccount2 · 1h ago

If I am understanding this correctly, this is pretty damn cool. I got 15 minutes of research on it, but no better way to get corrected than be wrong on the internet.

Essentially, it seems they can use statistical magic to "fuzz" the training set in such a way that it becomes very difficult for the model to leak information from the training set, while still producing the same output whether or not that exact info was in the training set. So I suppose the goal would be something like the ability to train on medical data while making sure the model can't complete the prompt "Workaccount2 has a serious medical condition called ______", and would give the same response regardless of whether or not I was present in the database.

porridgeraisin · 58m ago

Yes.

prob(training_process(data)(Work account 2 has a serious medical condition called) = anaemia) <= e^epsilon * prob(training_process(data without that piece of information)(Work account 2 has a serious medical condition called) = anaemia) + delta

Here epsilon = 2, and delta is small. Basically, there is a theoretical guarantee that if it had trained on that sentence, it would be no more than roughly 7x (e^2 ≈ 7.4, plus the small delta) as likely to output that in response to any prompt, compared to when it hadn't trained on that sentence at all. A "sentence" here is defined to be 1024 tokens long[1].

You might think 7x is not that big of a deal, but note that this is a theoretical guarantee (and with some mathematics it's possible to get an even tighter bound; see Rényi DP). In practice, actually getting private data out of a DP-trained model is difficult even for epsilon = 8 (which corresponds to roughly 3000x as likely, since e^8 ≈ 2981!).

Edit: [1] This can be problematic: if a piece of information longer than 1024 tokens gets split across two sequences, there is no theoretical guarantee across those sequences. However, this is an implementation detail of this model; I've yet to see the effect of increasing this number to a more reasonable value.
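To make the arithmetic in the comment above concrete, here is a minimal Python sketch of the (epsilon, delta) bound. The function name `dp_bound`, the baseline probability, and the delta value are hypothetical and chosen only for illustration; the thread says only that delta is small.

```python
import math

def dp_bound(p_without: float, epsilon: float, delta: float = 0.0) -> float:
    """Upper bound on P(output | record in training set) implied by
    (epsilon, delta)-DP, given P(output | record not in training set)."""
    return min(1.0, math.exp(epsilon) * p_without + delta)

# Multiplicative factors for the two epsilon values mentioned in the thread.
for eps in (2, 8):
    print(f"epsilon={eps}: e^epsilon = {math.exp(eps):.1f}")

# Hypothetical numbers: the output has probability 1e-6 without the record,
# and delta is assumed to be 1e-10 purely for illustration.
print(dp_bound(1e-6, epsilon=2, delta=1e-10))
```

The bound is capped at 1.0 because it is a probability; for large epsilon and a non-tiny baseline the guarantee becomes vacuous, which is why smaller epsilon values are considered stronger.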

freedomben · 50m ago

Thanks, that's quite exciting, because personally the thing I'm most excited about with AI is the medical and scientific research capabilities. Exciting times!

diggan · 1h ago

The actual weights: https://huggingface.co/google/vaultgemma-1b

> VaultGemma is a variant of the Gemma family of lightweight, state-of-the-art open models from Google. It is pre-trained from the ground up using Differential Privacy (DP). This provides strong, mathematically-backed privacy guarantees for its training data, limiting the extent to which the model's outputs can reveal information about any single training example.

> VaultGemma was trained using Tensor Processing Unit (TPU) hardware TPUv6e. Training large language models with the significant computational overhead of differential privacy requires specialized hardware. TPUs are designed to handle the massive computations involved, offering the performance, memory, and scalability necessary to train models like VaultGemma efficiently and sustainably.

Seems like it requires TPUs to run, as DP has a huge performance impact, so we're unlikely to see this in homelabs and similar environments, as far as I understand.

Edit: On second read, the TPUs were only used for training. There's no mention of any specific hardware being needed to run it, so I'm assuming it's fine with a regular GPU?

HenryMulligan · 1h ago

Ignoring what this model architecture could do and just considering what this model does do, why would I (or anyone) want to run this model (locally) to do <insert use-case>? Is it entirely a proof-of-concept for future training on medical data? Are they looking to use this to attempt to ethically justify training on (free-tier) users' personal data via the application of noise to the training data?

floridianfisher · 1h ago

The purpose is research

porridgeraisin · 46m ago

It's the last option.

The whole framing of DP is:

The probability that you reveal private info is (nearly) the same whether or not you train on a particular user's data.

It is useful in many cases, but Google, the product company, is specifically going to use it for ads.

ForHackernews · 2h ago

Can someone explain what this actually means? I assume this still runs on Google's cloud so it's not 'private' in any meaningful sense.

stephantul · 1h ago

It does not run on Google’s cloud. You can download the model and host it yourself, locally or using a provider you trust.

porridgeraisin · 1h ago

Differentially private means that:

training_algorithm(training data with a row that has "ForHackernews blood test report...") is hard to distinguish from training_algorithm(training data without that), up to a factor governed by epsilon. They have explained this further in the article itself, with concrete values for epsilon.
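As a toy illustration of "hard to distinguish up to a factor of epsilon", the sketch below uses randomized response, a classic textbook epsilon-DP mechanism. This is not the training procedure behind VaultGemma; it is a swapped-in, single-bit stand-in where the bit plays the role of "record present vs. absent". All function names are hypothetical.

```python
import math

def randomized_response_probs(true_bit: int, epsilon: float) -> dict:
    """Output distribution of randomized response for one private bit:
    report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {true_bit: p_truth, 1 - true_bit: 1.0 - p_truth}

def worst_case_ratio(epsilon: float) -> float:
    """Largest ratio between the probabilities of any output under the
    two neighboring inputs (bit = 1 vs. bit = 0)."""
    with_record = randomized_response_probs(1, epsilon)
    without_record = randomized_response_probs(0, epsilon)
    return max(with_record[out] / without_record[out] for out in (0, 1))

epsilon = 2.0
# The worst-case ratio should match e^epsilon, the DP bound (delta = 0 here).
print(worst_case_ratio(epsilon), math.exp(epsilon))
```

An observer who sees the reported bit can shift their belief about the true bit by at most a factor of e^epsilon, which is the same shape of guarantee described above for a trained model's outputs.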
