Analyze the most common responses of a website on their platform, build an efficient dictionary from that data, and then automatically inject a link to that site-specific dictionary so future responses are optimally compressed and save on bandwidth. All transparent to the customers and end users.

pornel · 53m ago

Per-URL dictionaries (where a URL is its own dictionary) are great, because they allow updating to a new version of a resource incrementally, and an old version of the same resource is the best template, and there's no extra cost when you already have it.

However, I'm sceptical about the usefulness of multi-page shared dictionaries (where you construct one for a site or group of pages). They're a gamble that can backfire.

The extra dictionary needs to be downloaded, so it starts as an extra overhead. It's not enough for it to just match something. It has to beat regular (per-page) compression to be better than nothing, and it must be useful enough to repay its own cost before it even starts being a net positive. This basically means everything in the dictionary must be useful to a user, and has to be used more than once, otherwise it's just an unnecessary upfront slowdown.
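The break-even argument above can be put in rough numbers. This is only a back-of-the-envelope sketch: the function name and the byte counts are hypothetical, and it assumes every page saves the same amount, which real traffic will not.

```python
import math

def pages_to_break_even(dictionary_bytes: int, saved_bytes_per_page: int) -> float:
    """Page loads needed before the dictionary download has paid for itself.

    saved_bytes_per_page is the saving relative to ordinary per-page
    compression, not relative to the uncompressed size.
    """
    if saved_bytes_per_page <= 0:
        # The dictionary never beats regular compression, so it never pays off.
        return math.inf
    return math.ceil(dictionary_bytes / saved_bytes_per_page)

# Hypothetical numbers: a 100 kB dictionary that saves 4 kB per page.
print(pages_to_break_even(100_000, 4_000))
```

Under these made-up numbers, a visitor who leaves before the 25th page view has downloaded more bytes than they would have without the shared dictionary.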

Standard (per-page) compression is already very good at removing simple repetitive patterns, and Brotli even comes with a default built-in dictionary of random HTML-like fragments. This further narrows down the usefulness of shared dictionaries, because generic page-like content is not enough to be an advantage. They need to contain more specific content to beat standard compression, but the more specific the dictionary is, the lower the chance of it fitting what the user browses.

creatonez · 24m ago

Excited to see access control mishaps where the training data includes random data from other users

divbzero · 1h ago

This seems like a lot of added complexity for limited gain. Are there cases where gzip and br at their highest compression levels aren’t good enough?

ks2048 · 33m ago

Some examples here: https://github.com/WICG/compression-dictionary-transport/blo...

They show a significant gain from using a dictionary over compression without one.

It seems like instead of sites reducing bloat, they will just shift the bloat to your hard drive. Some of the examples mention a 1 MB dictionary, which doesn't seem big, but it could add up if everyone is doing this.

pmarreck · 1h ago

Every piece of information or file that is compressed sends a dictionary along with it. In the case of, say, many HTML or CSS files, this dictionary data is likely nearly completely redundant.

There's almost no added complexity since zstd already handles separate compression dictionaries quite well.

pornel · 43m ago

The standard compressed formats don't literally contain a dictionary. The decompressed data becomes its own dictionary while it's being decompressed. This makes the first occurrence of any pattern less efficiently compressed (but usually it's still compressed thanks to entropy coding), and then it becomes cheap to repeat.

Brotli has a default dictionary with bits of HTML and scripts. This is built into the decompressor, and not sent with the files.

The decompression dictionaries aren't magic. They're basically a prefix for decompressed files, so that a first occurrence of some pattern can be referenced from the dictionary instead of built from scratch. This helps only with the first occurrences of data near the start of the file, and for all the later repetitions the dictionary becomes irrelevant.
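The "prefix" behaviour can be tried out with the preset-dictionary support in Python's `zlib` module. Note the swap: this uses DEFLATE rather than Brotli or Zstandard, only because it ships in the standard library, and the `SHARED` and `PAGE` byte strings are made-up examples. Treat it as a sketch of the idea, not of what browsers do.

```python
import zlib

def deflate(data: bytes, dictionary: bytes = b"") -> bytes:
    """Compress with DEFLATE, optionally priming the window with a dictionary."""
    if dictionary:
        comp = zlib.compressobj(level=9, zdict=dictionary)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

def inflate(blob: bytes, dictionary: bytes = b"") -> bytes:
    """Decompress; the same dictionary must be supplied if one was used."""
    if dictionary:
        decomp = zlib.decompressobj(zdict=dictionary)
    else:
        decomp = zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

# Hypothetical site boilerplate that the client is assumed to already have.
SHARED = (
    b'<!doctype html><html lang="en"><head><meta charset="utf-8">'
    b'<link rel="stylesheet" href="/static/site.css"></head>'
    b'<body><nav class="top-bar">'
)
# Hypothetical page: the boilerplate followed by page-specific content.
PAGE = SHARED + b"<h1>Order #1234</h1><p>Thanks for shopping with us.</p></body></html>"

plain = deflate(PAGE)            # first occurrence of the boilerplate is spelled out
primed = deflate(PAGE, SHARED)   # boilerplate is a back-reference into the dictionary

assert inflate(primed, SHARED) == PAGE
print(len(PAGE), len(plain), len(primed))
```

The primed stream is smaller because the boilerplate's first occurrence becomes a reference into the dictionary. Anything the page repeats later would have been cheap either way.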

The dictionary needs to be downloaded too, and you're not going to have dictionaries all the way down, so you pay the cost of decompressing the data without a dictionary whether it's a dictionary + dictionary-using-file, or just the full file itself.

bsmth · 45m ago

If you're shipping a JS bundle, for instance, that has small, frequent updates, this should be a good use case. There's a test site here that accompanies the explainer which looks interesting for estimates: https://use-as-dictionary.com/generate/

Y-bar · 4h ago

    Available-Dictionary: :    =:

It seems very odd to use a colon as starting and ending delimiter when the header name is already using a colon. Wouldn’t a comma or semicolon work better?

judofyr · 3h ago

It’s encoded using the spec that binary data in headers should be enclosed by colons: https://www.rfc-editor.org/rfc/rfc8941.html#name-byte-sequen...

Y-bar · 3h ago

Oh, thanks, it looked like a string such as a hash or base64 encoded data, not binary. Don’t think I have ever seen a use case for binary data like this in a header before.
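For anyone else puzzled by the colons: an RFC 8941 byte sequence is just base64 text wrapped in `:` characters. A minimal sketch follows, assuming the header value is the SHA-256 digest of the dictionary contents. The helper names are hypothetical, and the dictionary bytes are a placeholder.

```python
import base64
import hashlib

def sf_byte_sequence(raw: bytes) -> str:
    """Serialize raw bytes as an RFC 8941 byte sequence: ':' + base64 + ':'."""
    return ":" + base64.b64encode(raw).decode("ascii") + ":"

def available_dictionary_header(dictionary: bytes) -> str:
    """Build the request header line for a dictionary the client already holds.

    Assumption: the value is the SHA-256 digest of the dictionary contents.
    """
    digest = hashlib.sha256(dictionary).digest()
    return "Available-Dictionary: " + sf_byte_sequence(digest)

print(sf_byte_sequence(b"hello"))
print(available_dictionary_header(b"placeholder dictionary contents"))
```

A 32-byte digest always encodes to 44 base64 characters ending in a single `=`, which is why the value looks like `:…=:`.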

o11c · 4h ago

That `Link:` header broke my brain for a moment.

Compression Dictionary Transport

Comments (13)