Modern Minimal Perfect Hashing: A Survey

83 points | matt_d | 6/10/2025, 9:46:44 PM | arxiv.org

Comments (22)

tmostak · 19h ago
We've made extensive use of perfect hashing in HeavyDB (formerly MapD/OmniSciDB), and it has definitely been a core part of achieving strong group by and join performance.

You can use perfect hashes not only for the usual suspects of contiguous integer and dictionary-encoded string ranges, but also for use cases like binned numeric and date ranges (epoch seconds binned per year can use a perfect hash range of one bin per year across a very wide range of timestamps), and you can even handle arbitrary expressions if you propagate the ranges correctly.

Obviously you need a good "baseline" hash path to fall back on, but it's surprising how many real-world use cases you can profitably cover with perfect hashing.
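
A minimal sketch of the year-binning idea (not HeavyDB's actual code; the function names and the fallback convention here are made up):

    /* One hash table slot per year in [min_year, max_year]; anything outside
       the propagated range falls back to the baseline hash path. */
    #include <stdint.h>
    #include <time.h>

    static int64_t year_of_epoch(int64_t epoch_sec) {
        time_t t = (time_t)epoch_sec;
        struct tm tm;
        gmtime_r(&t, &tm);                        /* POSIX */
        return (int64_t)tm.tm_year + 1900;
    }

    /* Returns a slot in [0, max_year - min_year], or -1 for "take the
       baseline hash path". */
    static int64_t year_bin_slot(int64_t epoch_sec,
                                 int64_t min_year, int64_t max_year) {
        int64_t y = year_of_epoch(epoch_sec);
        return (y < min_year || y > max_year) ? -1 : y - min_year;
    }

Because every year in the propagated range owns exactly one slot, the group-by hash table needs no collision handling on this path.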

anitil · 18h ago
So in HeavyDB do you build perfect hashes on the fly for queries? I've only ever seen perfect hashes used at 'build time', when the keys are already known and fixed (like keywords in a compiler)
TheTaytay · 8h ago
I had the same question! I have never heard of runtime perfect hashing. (Admittedly, I haven’t read the paper yet.)
senderista · 6h ago
In the DSA theory literature there is so-called “dynamic perfect hashing”, but I don’t think it’s ever been implemented, and its use case is served by high-load-factor techniques like bucketized cuckoo hashing.
bytehamster · 5h ago
In the appendix of the survey, there are 3 references on dynamic perfect hashing. I think the only actual implementation of a dynamic PHF is a variant of perfect hashing through fingerprinting in the paper "Perfect Hashing for Network Applications". However, that implementation is not fully dynamic and needs to be rebuilt if the key set changes too much.
o11c · 17h ago
I'm only vaguely aware of how other people do perfect hashing (the generators I've used always seem to produce arrays to load from), but I dabbled in a purely arithmetic toy problem recently.

As an exercise for the reader:

  There are exactly 32 symbols in ASCII:

    !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

  Taking input in a register, uniquely hash them to the range 0-31.
  Any other input values may return any number, but must not
  trap or exhibit undefined behavior. The obvious approach of
  "just make a 256-element array" isn't allowed.

  This can be done in few enough cycles that you need to
  carefully consider if there's any point to including:

    loads (even hitting L1)
    branches (even ones that are always predicted correctly)
    multiplication (unless just using lea/add/shift)

  I found that out-of-order only helps a little; it's
  difficult to scatter-gather very wide in so few cycles.

  Writing C code mostly works if you can persuade the compiler
  not to emit an unwanted branch.
mananaysiempre · 15h ago
I’m tempted to emulate a conventional SIMD solution and go with a small lookup table:

  uint8_t classify(uint8_t c) {
      /* (c & 0x70) >> 1 lands on a byte boundary: one subtraction offset per
         16-character column of ASCII (the uint8_t return keeps only the low byte). */
      return c - (0x5F45452B2B210000ULL >> ((c & 0x70) >> 1));
  }
The AND-then-shift sequence on the right is annoying and it feels like one should be able to avoid it, but I don't see how. Overall, this is more expensive than it looks: neither a full 64-bit constant nor a variable 64-bit shift is exactly free. So I'm probably missing something here.
thomasmg · 10h ago
A brute-force approach can quickly find a minimal perfect hash here. E.g. the RecSplit approach can be used for this case: first split the keys into small buckets (here, 8 buckets of 4 symbols each), and then find another hash seed for each bucket, or, as in this case, the same seed for every bucket:

    (hash(c + 4339104) & 7) * 4 + (hash(c + 201375) & 3)
Here, "hash" can be any generic hash function, e.g. the Murmur hash, or the simple one from [1]:

    long hash(long x) {
        x = ((x >>> 16) ^ x) * 0x45d9f3b;
        x = ((x >>> 16) ^ x) * 0x45d9f3b;
        x = (x >>> 16) ^ x;
        return x;
    }
As described in the linked paper, the fastest way to find such an MPHF for larger sets is nowadays Consensus-RecSplit.

[1] https://stackoverflow.com/questions/664014/what-integer-hash...
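
A sketch of that bucketed brute-force search in C (not the actual search code behind the constants above; the mixer is the same one ported to C, and here a seed is found per bucket, which is the usual RecSplit shape, whereas the expression above happens to reuse one seed for all buckets):

    #include <stdint.h>
    #include <stdio.h>

    /* Same mixer as the Java snippet above, ported to C. */
    static uint64_t hash64(uint64_t x) {
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        return (x >> 16) ^ x;
    }

    int main(void) {
        const char *sym = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";
        for (uint64_t s1 = 0; ; s1++) {
            /* Step 1: does seed s1 split the 32 symbols into 8 buckets of 4? */
            int cnt[8] = {0}, balanced = 1;
            for (const char *p = sym; *p; p++)
                if (++cnt[hash64((uint8_t)*p + s1) & 7] > 4) { balanced = 0; break; }
            if (!balanced) continue;

            /* Step 2: for each bucket, find a seed that maps its 4 symbols
               onto the 4 slots 0..3 without collision. */
            uint64_t seed[8];
            for (int b = 0; b < 8; b++) {
                for (uint64_t s2 = 0; ; s2++) {
                    unsigned seen = 0;
                    for (const char *p = sym; *p; p++)
                        if ((int)(hash64((uint8_t)*p + s1) & 7) == b)
                            seen |= 1u << (hash64((uint8_t)*p + s2) & 3);
                    if (seen == 0xF) { seed[b] = s2; break; }
                }
            }

            /* The MPHF is then: bucket(c)*4 + (hash64(c + seed[bucket(c)]) & 3) */
            printf("s1=%llu, per-bucket seeds:", (unsigned long long)s1);
            for (int b = 0; b < 8; b++)
                printf(" %llu", (unsigned long long)seed[b]);
            printf("\n");
            return 0;
        }
    }

Step 1 typically succeeds after a few tens of thousands of seeds, and each per-bucket search needs only about a dozen tries, so the whole thing finishes in well under a second.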

tmyklebu · 8h ago
Two fast (today) instructions:

  /* _pext_u32 (BMI2, <immintrin.h>) gathers the bits at the 5 mask positions into 0..31 */
  unsigned h = _pext_u32(1264523 * x, 0x1020a01);
tmyklebu · 7h ago
Same idea, but without BMI2:

  unsigned h = (1264523 * x & 0x1020a01) * 134746240 >> 27;
Alternatively:

  unsigned h = (1639879 * x & 0x1038040) * 67375104L >> 32 & 31;
The multiplication by 67375104L can be a usual 32x32 IMUL where the high half goes to edx, though I'm not sure that conveys a benefit over a 64x64 IMUL in serial code these days.
duskwuff · 15h ago
This set of symbols has some interesting properties which allow for this solution:

    func symbolHash(c byte) byte {
        // Which 32-character block c falls in (see the extra-credit question below for the -1).
        x := (c - 1) >> 5
        // x<<3 picks one byte of 0x1b150000 as an additive offset; the &31 below
        // keeps only the low 5 bits of the sum.
        y := int(c) + 0x1b150000>>(x<<3)
        return byte(y & 31)
    }
But this doesn't generalize - it depends on the input consisting of a couple runs of consecutive characters which can be made continuous. (Extra credit: why is "c-1" necessary?)
hinkley · 4h ago
I’ve always considered this a bit of an academic conversation because, as others have pointed out, the up-front costs are often too much to bear. However, we now have languages that can run functions at build time, so a static lookup table is possible.

And where else would one use a static lookup table to great effect? When I actively followed the SIGPLAN (programming languages) proceedings, one of the papers that really stood out for me was about making interface/trait-based function calls as fast as inheritance by using perfect hashing on all vtable entries.

bytehamster · 3h ago
That sounds super interesting! Do you remember the title or the authors of the paper?
throwaway81523 · 12h ago
This is a really good article. The subject area has seen a lot of new developments in the past few years (wonder why), and the survey discusses them. I always thought of perfect hashing as a niche optimization, good for some things and of theoretical interest, but it's apparently more important than I thought.

vlovich123 · 15h ago
Interesting that they don’t cover boomphf, which is the fastest MPHF I’ve encountered.
judofyr · 15h ago
Boomph is a Rust re-implementation of BBHash, which is included (and dominated by three other implementations). AFAIK there’s no reason to think it would perform any better than BBHash.
bytehamster · 5h ago
Addition: BBHash, in turn, is a re-implementation of FiPHa (perfect hashing through fingerprinting). There are quite a few re-implementations of FiPHa: BBHash, Boomph, FiPS, FMPH, etc. As shown in the survey, BBHash is by far the slowest; even though it implements exactly the same algorithm, FMPH is much faster. Its paper [1] also compares to Boomph. The beauty of the fingerprinting technique is that it is super simple. That's probably the reason why there are so many implementations of it.

[1] https://dl.acm.org/doi/pdf/10.1145/3596453
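
For a flavor of how simple it is, a toy sketch of the fingerprinting construction (one possible reading of the FiPHa/BBHash idea, not the actual code of any of these libraries):

    /* Per level: hash every remaining key into a bit array of about GAMMA
       bits per key, keep the positions hit by exactly one key, and recurse
       on the keys that collided. The MPHF value of a key is the rank of its
       bit among all set bits across levels. */
    #include <stdint.h>
    #include <stdlib.h>

    #define LEVELS 32
    #define GAMMA  2   /* bits per key and level; the main tuning knob */

    static struct { uint64_t *bits; uint64_t size; } lv[LEVELS];
    static int nlevels;

    static uint64_t mix(uint64_t x, uint64_t seed) {   /* any decent mixer */
        x += seed * 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    /* Note: reorders `keys` in place while recursing on the colliding ones. */
    static void build(uint64_t *keys, size_t n, int level) {
        if (n == 0 || level >= LEVELS) { nlevels = level; return; }
        uint64_t size = (GAMMA * n + 63) / 64 * 64;
        uint8_t *count = calloc(size, 1);
        lv[level].bits = calloc(size / 64, sizeof(uint64_t));
        lv[level].size = size;
        for (size_t i = 0; i < n; i++) {
            uint64_t p = mix(keys[i], level) % size;
            if (count[p] < 2) count[p]++;
        }
        size_t m = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t p = mix(keys[i], level) % size;
            if (count[p] == 1) lv[level].bits[p / 64] |= 1ULL << (p % 64);
            else keys[m++] = keys[i];      /* collided: retry on next level */
        }
        free(count);
        build(keys, m, level + 1);
    }

    /* MPHF value = rank of the key's set bit over all levels. A real
       implementation replaces the popcount loops with a succinct rank
       structure. Keys outside the build set return arbitrary values. */
    static uint64_t query(uint64_t key) {
        uint64_t rank = 0;
        for (int level = 0; level < nlevels; level++) {
            uint64_t size = lv[level].size;
            uint64_t p = mix(key, level) % size;
            if (lv[level].bits[p / 64] >> (p % 64) & 1) {
                for (uint64_t w = 0; w < p / 64; w++)
                    rank += __builtin_popcountll(lv[level].bits[w]);  /* GCC/Clang */
                return rank + __builtin_popcountll(lv[level].bits[p / 64]
                                                   & ((1ULL << (p % 64)) - 1));
            }
            for (uint64_t w = 0; w < size / 64; w++)
                rank += __builtin_popcountll(lv[level].bits[w]);
        }
        return (uint64_t)-1;   /* key was not in the build set */
    }

Back-of-the-envelope: with GAMMA = 2, roughly 39% of the keys fall through at each level, so the bit arrays sum to about 3.3 bits per key before the rank structure's overhead.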

rurban · 15h ago
They completely forget the startup time in the query time, which dominates by a factor of 1000.

Some PHFs can be pre-compiled, while most need to be deserialized at run-time. I worked on a pre-compiled pthash variant, but ran into C++ bugs.

There's a huge overhead for ordered variants in some, to check for false positives.

For small n gperf is still the fastest by far. And it is pre-compiled only.

thomasmg · 8h ago
From what you describe, I think you have a somewhat special use case: it sounds like you are compiling it. The experiments in the survey don't do that: instead, the hash function is used in the form of a data structure, similar to a Bloom filter ("deserialized", in your words). Do you use it for a parser? How many entries do you have? The survey uses millions of entries. "Startup time in the query time": I'm not quite sure what you mean, I'm afraid. Could you describe it?

I'm also not sure what you mean by "check for false positives"... each key not in the set returns a random value; maybe for you this is a "false positive"? A typical solution is to add a "fingerprint" for each entry (just a hash value per entry) - that way there is still some false-positive probability, which may or may not be acceptable (it is typically considered acceptable if the hash function is cryptographically secure and the fingerprint size is big enough: e.g. 128 bits of the SHA-256 sum). Depending on the data, it may be faster to compare the value itself (e.g. in a parser).
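
A sketch of that fingerprint check (the names and the 16-bit size here are placeholders; the MPHF and the fingerprint hash are assumed to come from elsewhere):

    #include <stdint.h>
    #include <stdbool.h>

    extern uint64_t mphf(uint64_t key);     /* minimal perfect hash, 0..n-1 */
    extern uint64_t hash_fp(uint64_t key);  /* any independent hash */

    /* Filled at build time: fp[mphf(k)] = (uint16_t)hash_fp(k) for every key k. */
    extern uint16_t fp[];

    /* Keys in the set always pass; keys outside it pass with probability
       about 2^-16 for 16-bit fingerprints. */
    static bool probably_contains(uint64_t key) {
        return fp[mphf(key)] == (uint16_t)hash_fp(key);
    }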

rurban · 7h ago
Using it like gperf is certainly not a special case. If your keys are fixed and known at compile time, you can also pre-compile the hash function with its data structures, and not be forced to deserialize it at startup.

When comparing those MPHF query times, the startup time, i.e. the deserialization from disk, is 1000x higher than the actual query time. When you compile those data structures, the load time is instant. Also, memory usage is twice as low.

thomasmg · 6h ago
> Using it like gperf is certainly not a special case.

Well... let's put it like this: in this survey (where I am one of the co-authors), "parsers" are not mentioned explicitly in the "Applications" section. They are a subset of "Hash Tables and Retrieval". There are many other uses: Approximate Membership, Databases, Bioinformatics, Text Indexing, Natural Language Processing. Yes, parsers are mentioned in "The Birth of Perfect Hashing". Maybe we can conclude that parsers are not the "main" use case nowadays.

> When you compile those data structures, the load time is instant.

Well, in this case I would recommend using a static lookup table in the form of source code. That way, it is available to the compiler at compile time and doesn't need to be loaded and parsed at runtime: it is available as a data structure at runtime. All modern MPHF implementations in this survey support this usage. But, yes, they are not optimized for this use case: you typically don't have millions of keywords in a programming language.

bytehamster · 5h ago
Many modern perfect hash functions are super close to the space lower bound, often having just between 3% and 50% overhead. So your claim that the space consumption is "twice as low" is information-theoretically impossible. With gperf, the space consumption is in the machine code instead of a data structure, but it's definitely not "for free". In fact, I'm pretty sure that the size of the machine code generated by gperf is very far from optimal. The only exception is tiny input sets with just a couple of hundred keys, where gperf probably wins due to lower constant overheads.
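
(For reference, the lower bound for a minimal perfect hash function is log2(e) ≈ 1.44 bits per key, so 3% to 50% overhead means roughly 1.49 to 2.17 bits per key, regardless of how the function is represented.)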