Modern Minimal Perfect Hashing: A Survey

83 points | matt_d | 6/10/2025, 9:46:44 PM | arxiv.org

Comments (22)

tmostak · 19h ago
We've made extensive use of perfect hashing in HeavyDB (formerly MapD/OmniSciDB), and it has definitely been a core part of achieving strong group by and join performance.

You can use perfect hashes not only for the usual suspects of contiguous integer and dictionary-encoded string ranges, but also for use cases like binned numeric and date ranges (epoch seconds binned per year can use a perfect hash range of one bin per year across a very wide range of timestamps), and you can even handle arbitrary expressions if you propagate the ranges correctly.

Obviously you need a good "baseline" hash path to fall back on, but it's surprising how many real-world use cases you can profitably cover with perfect hashing.
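
A minimal sketch of the year-binning idea (not HeavyDB's actual code; the function names and the fallback convention here are made up):

    /* One hash table slot per year in [min_year, max_year]; anything outside
       the propagated range falls back to the baseline hash path. */
    #include <stdint.h>
    #include <time.h>

    static int64_t year_of_epoch(int64_t epoch_sec) {
        time_t t = (time_t)epoch_sec;
        struct tm tm;
        gmtime_r(&t, &tm);                        /* POSIX */
        return (int64_t)tm.tm_year + 1900;
    }

    /* Returns a slot in [0, max_year - min_year], or -1 for "take the
       baseline hash path". */
    static int64_t year_bin_slot(int64_t epoch_sec,
                                 int64_t min_year, int64_t max_year) {
        int64_t y = year_of_epoch(epoch_sec);
        return (y < min_year || y > max_year) ? -1 : y - min_year;
    }

Because every year in the propagated range owns exactly one slot, the group-by hash table needs no collision handling on this path.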

anitil · 18h ago
So in HeavyDB do you build perfect hashes on the fly for queries? I've only ever seen perfect hashes used at 'build time', when the keys are already known and fixed (like keywords in a compiler)
TheTaytay · 8h ago
I had the same question! I have never heard of runtime perfect hashing. (Admittedly, I haven’t read the paper yet.)
senderista · 6h ago
In the DSA theory literature there is so-called “dynamic perfect hashing”, but I don’t think it’s ever been implemented, and its use case is served by high-load-factor techniques like bucketized cuckoo hashing.
bytehamster · 5h ago
In the appendix of the survey, there are 3 references on dynamic perfect hashing. I think the only actual implementation of a dynamic PHF is a variant of perfect hashing through fingerprinting in the paper "Perfect Hashing for Network Applications". However, that implementation is not fully dynamic and needs to be rebuilt if the key set changes too much.
o11c · 17h ago
I'm only vaguely aware of how other people do perfect hashing (the generators I've used always seem to produce arrays to load from), but I dabbled in a purely arithmetic toy problem recently.

As an exercise for the reader:

  There are exactly 32 symbols in ASCII:

    !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

  Taking input in a register, uniquely hash them to the range 0-31.
  Any other input values may return any number, but must not
  trap or exhibit undefined behavior. The obvious approach of
  "just make a 256-element array" isn't allowed.

  This can be done in few enough cycles that you need to
  carefully consider if there's any point to including:

    loads (even hitting L1)
    branches (even ones that are always predicted correctly)
    multiplication (unless just using lea/add/shift)

  I found that out-of-order only helps a little; it's
  difficult to scatter-gather very wide in so few cycles.

  Writing C code mostly works if you can persuade the compiler
  not to emit an unwanted branch.
mananaysiempre · 15h ago
I’m tempted to emulate a conventional SIMD solution and go with a small lookup table:

  uint8_t classify(uint8_t c) {
      /* (c & 0x70) >> 1 lands on a byte boundary: one subtraction offset per
         16-character column of ASCII (the uint8_t return keeps only the low byte). */
      return c - (0x5F45452B2B210000ULL >> ((c & 0x70) >> 1));
  }
The AND-then-shift sequence on the right is annoying and it feels like one should be able to avoid it, but I don't see how. Overall, this is more expensive than it looks: neither a full 64-bit constant nor a variable 64-bit shift is exactly free. So I'm probably missing something here.
thomasmg · 10h ago
A brute-force approach can quickly find a minimal perfect hash here. E.g. the RecSplit approach can be used for this case: first split the keys into small buckets (here, 8 buckets of 4 symbols each), and then find another hash seed for each bucket, or, as in this case, the same seed for every bucket:

    (hash(c + 4339104) & 7) * 4 + (hash(c + 201375) & 3)
Here, "hash" can be any generic hash function, e.g. the Murmur hash, or the simple one from [1]:

    long hash(long x) {
        x = ((x >>> 16) ^ x) * 0x45d9f3b;
        x = ((x >>> 16) ^ x) * 0x45d9f3b;
        x = (x >>> 16) ^ x;
        return x;
    }
As described in the linked paper, the fastest way to find such an MPHF for larger sets is nowadays Consensus-RecSplit.

[1] https://stackoverflow.com/questions/664014/what-integer-hash...
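
A sketch of that bucketed brute-force search in C (not the actual search code behind the constants above; the mixer is the same one ported to C, and here a seed is found per bucket, which is the usual RecSplit shape, whereas the expression above happens to reuse one seed for all buckets):

    #include <stdint.h>
    #include <stdio.h>

    /* Same mixer as the Java snippet above, ported to C. */
    static uint64_t hash64(uint64_t x) {
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        x = ((x >> 16) ^ x) * 0x45d9f3b;
        return (x >> 16) ^ x;
    }

    int main(void) {
        const char *sym = "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~";
        for (uint64_t s1 = 0; ; s1++) {
            /* Step 1: does seed s1 split the 32 symbols into 8 buckets of 4? */
            int cnt[8] = {0}, balanced = 1;
            for (const char *p = sym; *p; p++)
                if (++cnt[hash64((uint8_t)*p + s1) & 7] > 4) { balanced = 0; break; }
            if (!balanced) continue;

            /* Step 2: for each bucket, find a seed that maps its 4 symbols
               onto the 4 slots 0..3 without collision. */
            uint64_t seed[8];
            for (int b = 0; b < 8; b++) {
                for (uint64_t s2 = 0; ; s2++) {
                    unsigned seen = 0;
                    for (const char *p = sym; *p; p++)
                        if ((int)(hash64((uint8_t)*p + s1) & 7) == b)
                            seen |= 1u << (hash64((uint8_t)*p + s2) & 3);
                    if (seen == 0xF) { seed[b] = s2; break; }
                }
            }

            /* The MPHF is then: bucket(c)*4 + (hash64(c + seed[bucket(c)]) & 3) */
            printf("s1=%llu, per-bucket seeds:", (unsigned long long)s1);
            for (int b = 0; b < 8; b++)
                printf(" %llu", (unsigned long long)seed[b]);
            printf("\n");
            return 0;
        }
    }

Step 1 typically succeeds after a few tens of thousands of seeds, and each per-bucket search needs only about a dozen tries, so the whole thing finishes in well under a second.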

tmyklebu · 8h ago
Two fast (today) instructions:

  /* _pext_u32 (BMI2, <immintrin.h>) gathers the bits at the 5 mask positions into 0..31 */
  unsigned h = _pext_u32(1264523 * x, 0x1020a01);
tmyklebu · 7h ago
Same idea, but without BMI2:

  unsigned h = (1264523 * x & 0x1020a01) * 134746240 >> 27;
Alternatively:

  unsigned h = (1639879 * x & 0x1038040) * 67375104L >> 32 & 31;
The multiplication by 67375104L can be a usual 32x32 IMUL where the high half goes to edx, though I'm not sure that conveys a benefit over a 64x64 IMUL in serial code these days.
duskwuff · 15h ago
This set of symbols has some interesting properties which allow for this solution:

    func symbolHash(c byte) byte {
        // Which 32-character block c falls in (see the extra-credit question below for the -1).
        x := (c - 1) >> 5
        // x<<3 picks one byte of 0x1b150000 as an additive offset; the &31 below
        // keeps only the low 5 bits of the sum.
        y := int(c) + 0x1b150000>>(x<<3)
        return byte(y & 31)
    }
But this doesn't generalize - it depends on the input consisting of a couple runs of consecutive characters which can be made continuous. (Extra credit: why is "c-1" necessary?)
hinkley · 4h ago
I’ve always considered this a bit of an academic conversation because, as others have pointed out, the up-front costs are often too much to bear. However, we now have languages that can run functions at build time, so a static lookup table is possible.

And where else would one use a static lookup table to great effect? When I actively followed the SIGPLAN (programming languages) proceedings, one of the papers that really stood out for me was about making interface/trait-based function calls as fast as inheritance by using perfect hashing on all vtable entries.

bytehamster · 3h ago
That sounds super interesting! Do you remember the title or the authors of the paper?
throwaway81523 · 12h ago
This is a really good article. The subject area has seen a lot of new developments in the past few years (wonder why), and the survey discusses them. I always thought of perfect hashing as a niche optimization, good for some things and of theoretical interest, but it's apparently more important than I thought.

vlovich123 · 15h ago
Interesting that they don’t cover boomphf, which is the fastest MPHF I’ve encountered.
judofyr · 15h ago
Boomph is a Rust re-implementation of BBHash, which is included (and dominated by three other implementations). AFAIK there’s no reason to think it would perform any better than BBHash.
bytehamster · 5h ago
Addition: BBHash, in turn, is a re-implementation of FiPHa (perfect hashing through fingerprinting). There are quite a few re-implementations of FiPHa: BBHash, Boomph, FiPS, FMPH, etc. As shown in the survey, BBHash is by far the slowest; even though it implements exactly the same algorithm, FMPH is much faster. Its paper [1] also compares to Boomph. The beauty of the fingerprinting technique is that it is super simple. That's probably the reason why there are so many implementations of it.

[1] https://dl.acm.org/doi/pdf/10.1145/3596453
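
For a flavor of how simple it is, a toy sketch of the fingerprinting construction (one possible reading of the FiPHa/BBHash idea, not the actual code of any of these libraries):

    /* Per level: hash every remaining key into a bit array of about GAMMA
       bits per key, keep the positions hit by exactly one key, and recurse
       on the keys that collided. The MPHF value of a key is the rank of its
       bit among all set bits across levels. */
    #include <stdint.h>
    #include <stdlib.h>

    #define LEVELS 32
    #define GAMMA  2   /* bits per key and level; the main tuning knob */

    static struct { uint64_t *bits; uint64_t size; } lv[LEVELS];
    static int nlevels;

    static uint64_t mix(uint64_t x, uint64_t seed) {   /* any decent mixer */
        x += seed * 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    /* Note: reorders `keys` in place while recursing on the colliding ones. */
    static void build(uint64_t *keys, size_t n, int level) {
        if (n == 0 || level >= LEVELS) { nlevels = level; return; }
        uint64_t size = (GAMMA * n + 63) / 64 * 64;
        uint8_t *count = calloc(size, 1);
        lv[level].bits = calloc(size / 64, sizeof(uint64_t));
        lv[level].size = size;
        for (size_t i = 0; i < n; i++) {
            uint64_t p = mix(keys[i], level) % size;
            if (count[p] < 2) count[p]++;
        }
        size_t m = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t p = mix(keys[i], level) % size;
            if (count[p] == 1) lv[level].bits[p / 64] |= 1ULL << (p % 64);
            else keys[m++] = keys[i];      /* collided: retry on next level */
        }
        free(count);
        build(keys, m, level + 1);
    }

    /* MPHF value = rank of the key's set bit over all levels. A real
       implementation replaces the popcount loops with a succinct rank
       structure. Keys outside the build set return arbitrary values. */
    static uint64_t query(uint64_t key) {
        uint64_t rank = 0;
        for (int level = 0; level < nlevels; level++) {
            uint64_t size = lv[level].size;
            uint64_t p = mix(key, level) % size;
            if (lv[level].bits[p / 64] >> (p % 64) & 1) {
                for (uint64_t w = 0; w < p / 64; w++)
                    rank += __builtin_popcountll(lv[level].bits[w]);  /* GCC/Clang */
                return rank + __builtin_popcountll(lv[level].bits[p / 64]
                                                   & ((1ULL << (p % 64)) - 1));
            }
            for (uint64_t w = 0; w < size / 64; w++)
                rank += __builtin_popcountll(lv[level].bits[w]);
        }
        return (uint64_t)-1;   /* key was not in the build set */
    }

Back-of-the-envelope: with GAMMA = 2, roughly 39% of the keys fall through at each level, so the bit arrays sum to about 3.3 bits per key before the rank structure's overhead.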

rurban · 15h ago
They completely forget the startup time in the query time, which dominates by a factor of 1000.

Some PHFs can be pre-compiled, while most need to be deserialized at run-time. I worked on a pre-compiled pthash variant, but ran into C++ bugs.

There's a huge overhead for ordered variants in some, to check for false positives.

For small n gperf is still the fastest by far. And it is pre-compiled only.

thomasmg · 8h ago
From what you describe, I think you have a somewhat special use case: it sounds like you are compiling it. The experiments in the survey don't do that: instead, the hash function is used in the form of a data structure, similar to a Bloom filter ("deserialized", in your words). Do you use it for a parser? How many entries do you have? The survey uses millions of entries. "Startup time in the query time": I'm not quite sure what you mean, I'm afraid. Could you describe it?

I'm also not sure what you mean by "check for false positives"... each key not in the set returns a random value; maybe for you this is a "false positive"? A typical solution is to add a "fingerprint" for each entry (just a hash value per entry) - that way there is still some false-positive probability, which may or may not be acceptable (it is typically considered acceptable if the hash function is cryptographically secure and the fingerprint size is big enough: e.g. 128 bits of the SHA-256 sum). Depending on the data, it may be faster to compare the value itself (e.g. in a parser).
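
A sketch of that fingerprint check (the names and the 16-bit size here are placeholders; the MPHF and the fingerprint hash are assumed to come from elsewhere):

    #include <stdint.h>
    #include <stdbool.h>

    extern uint64_t mphf(uint64_t key);     /* minimal perfect hash, 0..n-1 */
    extern uint64_t hash_fp(uint64_t key);  /* any independent hash */

    /* Filled at build time: fp[mphf(k)] = (uint16_t)hash_fp(k) for every key k. */
    extern uint16_t fp[];

    /* Keys in the set always pass; keys outside it pass with probability
       about 2^-16 for 16-bit fingerprints. */
    static bool probably_contains(uint64_t key) {
        return fp[mphf(key)] == (uint16_t)hash_fp(key);
    }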

rurban · 7h ago
Using it like gperf is certainly not a special case. If your keys are fixed and known at compile time, you can also pre-compile the hash function with its data structures, and not be forced to deserialize it at startup.

When comparing those MPHF query times, the startup time, i.e. the deserialization from disk, is 1000x higher than the actual query time. When you compile those data structures, the load time is instant. Also, memory usage is twice as low.

thomasmg · 6h ago
> Using it like gperf is certainly not a special case.

Well... let's put it like this: in this survey (where I am one of the co-authors), "parsers" are not mentioned explicitly in the "Applications" section. They are a subset of "Hash Tables and Retrieval". There are many other uses: Approximate Membership, Databases, Bioinformatics, Text Indexing, Natural Language Processing. Yes, parsers are mentioned in "The Birth of Perfect Hashing". Maybe we can conclude that parsers are not the "main" use case nowadays.

> When you compile those data structures, the load time is instant.

Well, in this case I would recommend using a static lookup table in the form of source code. That way, it is available to the compiler at compile time and doesn't need to be loaded and parsed at runtime: it is available as a data structure at runtime. All modern MPHF implementations in this survey support this usage. But, yes, they are not optimized for this use case: you typically don't have millions of keywords in a programming language.

bytehamster · 5h ago
Many modern perfect hash functions are super close to the space lower bound, often having just between 3% and 50% overhead. So your claim that the space consumption is "twice as low" is information-theoretically impossible. With gperf, the space consumption is in the machine code instead of a data structure, but it's definitely not "for free". In fact, I'm pretty sure that the size of the machine code generated by gperf is very far from optimal. The only exception is tiny input sets with just a couple of hundred keys, where gperf probably wins due to lower constant overheads.
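
(For reference, the lower bound for a minimal perfect hash function is log2(e) ≈ 1.44 bits per key, so 3% to 50% overhead means roughly 1.49 to 2.17 bits per key, regardless of how the function is represented.)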