I recently discovered "compact approximators", which can be seen as a generalisation of Bloom filters. Instead of telling you "element X is probably in the set", they tell you "a lower bound on the value associated with key X".
If there are no collisions, you get the true value. The data structure also doesn't "fill up" the way a Bloom filter does — the older data just gets probabilistically worse, so in some use cases you can keep using the one structure continuously.
My particular use case (which led me to "invent" the structure and then discover that of course it's already been described in the literature, haha) is implicit cache invalidation. The approximator structure(s) store keys like "latest time member X was removed from group Y". I can then use that to lazily decide which cached values I can still use, instead of doing a potentially expensive search to see what I need to invalidate at the time the group member was removed.
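Roughly, the trick looks like this - a minimal Python sketch with illustrative names and sizes, not the exact structure from the literature:

    import hashlib

    class ApproximateLatestRemoval:
        """Bloom-filter-like map: each slot keeps the max timestamp that
        hashed into it, so a query returns a conservative bound -
        collisions can only make the reported time later, never earlier."""
        def __init__(self, m=1 << 20, k=4):
            self.m, self.k = m, k
            self.slots = [0] * m   # 0 means "never removed"

        def _indexes(self, key):
            for i in range(self.k):
                h = hashlib.blake2b(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(h[:8], "little") % self.m

        def record_removal(self, key, ts):
            for i in self._indexes(key):
                self.slots[i] = max(self.slots[i], ts)

        def latest_removal(self, key):
            # Every slot the key hashes to is >= the true timestamp,
            # so the min over them is the tightest bound available.
            return min(self.slots[i] for i in self._indexes(key))

Whether you call the result a lower or an upper bound depends on which side you make conservative; max-on-write/min-on-read, as here, errs toward reporting a later removal, which is the safe direction for cache invalidation.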
I'm new to using them, so I keep getting excited about new use cases — much like when I was first introduced to Bloom filters.
abetusk · 50m ago
Can you provide a link to a paper or reference?
taneq · 3h ago
Ooh, this seems like it could be useful for collision avoidance (where you need to know a lower bound on your time to impact).
ozgrakkurt · 32m ago
Would recommend reading the RocksDB implementations of the Bloom filter and ribbon filter to anyone wanting to learn more about the production-level implementation side. They are extremely well explained in the comments and are, as far as I know, the state-of-the-art implementations:
https://github.com/facebook/rocksdb/blob/main/util/bloom_imp...
https://github.com/facebook/rocksdb/blob/main/util/ribbon_al...
Sam Rose has also written a great article on Bloom filters, with some great animations to help illustrate them. Definitely worth checking out: https://samwho.dev/bloom-filters/
celeritascelery · 2h ago
The author says that with a 1.2GB Bloom filter and 7 hash functions, lookup is only 30ns. I have to assume this is because the cache has loaded all the benchmark values. My guess is that the benchmark is only testing a few elements. In real-world scenarios with a Bloom filter this big, most of those 7 hash lookups will hit cold cache, since 1.2GB is far too large to fit in cache. That means lookups are much longer than 30ns. Probably still faster than going to the network or a database, though.
eliben · 2h ago
Updated the result to 80ns - thanks for flagging this. It grows with the size of the data (because of more cache misses), and running the benchmark on the full billion takes a while.
[That said, on a hot production Bloom filter, much can be loaded into caches anyway, so it's not an entirely unrealistic scenario that some of these are cache hits.]
The benchmark alternates between ~1 million different keys to check in the filter, explicitly to account for cache effects.
Tuna-Fish · 2h ago
A single lookup is going to take more than 30ns; the reason they see only 30ns is that the out-of-order machinery of their CPU is good enough to run those lookups in parallel.
Tuna-Fish · 2h ago
Yes, 30ns means it's in cache. But Bloom filters are surprisingly fast for the number of lookups they do, because the lookups all happen in parallel and there is a lot of parallelism in modern memory subsystems, so you essentially pay only the cost of a single random read for the entire lookup. If using 1GB pages, you can still realistically talk about <100ns lookups.
fooker · 25m ago
Here's my use case for it: I want really small filters (say 64 bits or 128 bits), and use these to represent 50-100 element subsets of a ~1000-2000 element set.
I know that with a traditional Bloom filter this would give me a lot of false positives. Is there an alternative that would fare better?
noelwelsh · 7h ago
Hash functions are kinda magic. Bloom filters are one of the fun tricks you can play with them (though Bloom filters themselves are quite old at this point; check the literature for various alternatives that are better in specific situations.) The field of data sketches and streaming algorithms has many others. k-Minimum values is a nice example of a set cardinality estimator that is very easy to understand.
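To illustrate, here's a minimal KMV sketch (illustrative k, not tuned): keep the k smallest distinct hash values seen; if hashes are uniform in [0, 1), the k-th smallest is around k/|S|, and the standard (k - 1)/h_k form corrects the bias.

    import hashlib, heapq

    def kmv_estimate(stream, k=256):
        # Max-heap of size k (negatives stored, so -heap[0] is the
        # largest of the k smallest hashes kept so far).
        heap, inheap = [], set()
        for x in stream:
            h = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big") / 2**64
            if h in inheap:
                continue
            if len(heap) < k:
                heapq.heappush(heap, -h)
                inheap.add(h)
            elif h < -heap[0]:
                evicted = -heapq.heappushpop(heap, -h)
                inheap.discard(evicted)
                inheap.add(h)
        if len(heap) < k:
            return len(heap)        # fewer than k distinct values: exact count
        return (k - 1) / -heap[0]   # k-th smallest hash h_k => |S| ~ (k-1)/h_k

For example, kmv_estimate(f"user-{i}" for i in range(50000)) typically lands within a few percent of 50000, while storing only 256 hash values.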
f_devd · 6h ago
> check the literature for various alternatives that are better in specific situations
For direct replacements, see Xor(+)-Filters: https://arxiv.org/abs/1912.08258 and the subsequent evolution into Binary Fuse Filters: https://arxiv.org/abs/2201.01174
But these xor/binary-fuse filters are only direct replacements for Bloom filters in some problems. There are still problems for which Bloom filters are a better solution.
Bloom filters have the advantage that the filter produced by adding two elements is the same as the bitwise-or of a pair of filters made by adding each element separately.
bloom([x, y]) == bloom([x]) | bloom([y])
Also, the filter produced by bitwise-and, `bloom([x]) & bloom([y])`, cannot have any bits set in it which are not also set by `bloom([x, y])`. We can assert that `bloom([x]) & bloom([y]) & bloom([x, y]) == bloom([x]) & bloom([y])`.
There are practical applications of this which are not satisfied by the "optimized" variants. The bitwise-or does a set union (or least upper bound), and the bitwise-and does a set intersection (or greatest lower bound), though the filter produced by `bloom([x, y])` has a higher false positive rate than the filter produced by `bloom([x]) & bloom([y])`.
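Both identities are easy to check with a toy filter (illustrative sizes; bloom() here builds a whole filter from a list of items, matching the pseudocode above):

    import hashlib

    M, K = 1 << 16, 5   # illustrative filter size and hash count

    def bloom(items):
        # Build a filter, represented as a Python int bitmask.
        bits = 0
        for x in items:
            for i in range(K):
                h = hashlib.sha256(f"{i}:{x}".encode()).digest()
                bits |= 1 << (int.from_bytes(h[:8], "big") % M)
        return bits

    x, y = "alice", "bob"
    assert bloom([x, y]) == bloom([x]) | bloom([y])   # OR is exactly set union
    # AND can never set a bit that the combined filter lacks:
    assert bloom([x]) & bloom([y]) & bloom([x, y]) == bloom([x]) & bloom([y])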
Also based on Ribbon: the BuRR filters (https://arxiv.org/pdf/2109.01892), which seem to get very close to the theoretical optimum, with comparatively 0.1%-0.5% space overhead.
samuel · 6h ago
Interesting, but not quite the same:
Xor and binary fuse filters require access to the full set of keys at construction time. In this sense, they are immutable. Alternatives typically have a fixed memory usage and a maximal capacity, but they also allow more flexibility, such as progressive construction (adding keys one by one).
virexene · 6h ago
I may have missed something when skimming the paper, but it sounds like Xor filters are constructed offline and can't be modified efficiently afterwards, whereas Bloom filters can be inserted into efficiently. So they don't seem to be an exact replacement.
f_devd · 4h ago
No, you are correct. I misremembered - I was thinking of cuckoo filters.
Snawoot · 5h ago
Cuckoo filters are a more efficient alternative. The displacement algorithm is also very notable: it uses random swaps to place data in cells near-optimally.
If you have a whole cluster of machines holding data and you need to ask "does this machine probably have this data or not?", a Bloom filter will give you an answer. It can save a ton of time, since its answer is either "probably yes" or "definitely no".
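A minimal sketch of the displacement idea (illustrative parameters; real implementations pack fingerprints far more tightly): each item has a small fingerprint and two candidate buckets, and insertion evicts a random resident to its alternate bucket when both are full.

    import hashlib, random

    def _h(data):
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    class CuckooFilter:
        def __init__(self, nbuckets=1 << 16, slots=4, max_kicks=500):
            # nbuckets must be a power of two for the xor trick below.
            self.n, self.slots, self.max_kicks = nbuckets, slots, max_kicks
            self.buckets = [[] for _ in range(nbuckets)]

        def _fp_and_buckets(self, item):
            h = hashlib.sha256(item.encode()).digest()
            fp = h[0] | 1                          # non-zero one-byte fingerprint
            i1 = int.from_bytes(h[1:9], "big") % self.n
            # Partial-key cuckoo hashing: the alternate bucket is derived
            # from the fingerprint alone, so entries can be moved later
            # without knowing the original item.
            i2 = (i1 ^ _h(bytes([fp]))) % self.n
            return fp, i1, i2

        def insert(self, item):
            fp, i1, i2 = self._fp_and_buckets(item)
            for i in (i1, i2):
                if len(self.buckets[i]) < self.slots:
                    self.buckets[i].append(fp)
                    return True
            i = random.choice((i1, i2))
            for _ in range(self.max_kicks):        # the displacement loop
                j = random.randrange(len(self.buckets[i]))
                fp, self.buckets[i][j] = self.buckets[i][j], fp   # evict a resident
                i = (i ^ _h(bytes([fp]))) % self.n                # its other bucket
                if len(self.buckets[i]) < self.slots:
                    self.buckets[i].append(fp)
                    return True
            return False                           # table is effectively full

        def contains(self, item):
            fp, i1, i2 = self._fp_and_buckets(item)
            return fp in self.buckets[i1] or fp in self.buckets[i2]

    cf = CuckooFilter()
    cf.insert("alice")
    assert cf.contains("alice")

Because the alternate bucket is computed from the fingerprint alone, entries can be relocated after insertion, and deletion is possible by removing one copy of a fingerprint - something a plain Bloom filter can't do.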
andrewstuart · 3h ago
Bloom filters can be extremely fast to tell you if something is not in a dataset.
They can give false positives, incorrectly indicating an element might be in the set when it's not, but never false negatives.
Knowing (fast) if something is not in a dataset can be very useful.
leeoniya · 4h ago
permission checks (allow/denylists)
sparkie · 3h ago
Using them for whitelists is probably not a great idea because they can give false positives. An attacker could potentially flood the filter with fake accounts, driving up the false positive rate and, with it, the chance that they're granted access.
For blacklists they're potentially more suitable, but since the filter can still give false positives, it could deny permission to people who should have it. An attacker might target this too: by flooding the filter with accounts that deliberately get blacklisted, they could lock out people who should have access.
Obviously this is very use-case specific - it's probably not the best approach to permissions if security is paramount.
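A toy simulation of that degradation (deliberately small filter and made-up key names, just to make the effect show quickly):

    import hashlib

    M, K = 1 << 15, 7   # deliberately undersized filter
    bits = set()

    def probes(x):
        return [int.from_bytes(hashlib.sha256(f"{i}:{x}".encode()).digest()[:8], "big") % M
                for i in range(K)]

    def add(x):
        bits.update(probes(x))

    def maybe_contains(x):
        return all(p in bits for p in probes(x))

    for n in range(1, 20001):
        add(f"flood-{n}")          # attacker-controlled junk insertions
        if n in (1000, 5000, 20000):
            # Probe with keys that were never inserted: any hit is a false positive.
            fp_rate = sum(maybe_contains(f"probe-{i}") for i in range(2000)) / 2000
            print(f"{n} inserts: false positive rate ~ {fp_rate:.3f}")

With these sizes the false positive rate climbs from roughly zero at 1,000 insertions to most queries returning "probably yes" by 20,000 - the filter never refuses inserts, it just silently gets worse.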
withinboredom · 3h ago
No, but they can tell you a user is definitely not in an allowlist or blocklist. That is useful, especially if it can save a database lookup on every check.
sparkie · 3h ago
That may work, but there are potential issues with timing attacks. If an attacker can make many attempts to access a resource, they may be able to figure out who (probably) has access via a brute-force timing test, and narrow down an attack target.
withinboredom · 2h ago
I'm not sure I understand. Generally, an allow-list/block-list is for authorized resources? By the time you are doing this check, the user is already authenticated and this is part of authorization. So, the user shouldn't be able to authenticate as arbitrary users to do a timing attack. If they can, you have bigger problems.
andrewstuart · 3h ago
Bloom filters - when I eventually learned about them - were nothing at all like what I had assumed from the name. And very interesting too.
eimrine · 3h ago
"Bloom's filter" that's a more correct name which doesn't let to make any assumptions.
"Hashmap" - maybe a less correct name but it lets to make more correct assumption.
londons_explore · 6h ago
A whole load of data structures, like bloom filters, are 'efficiency tricks'.
For example, "check if we might have Mr Smith's data cached by looking in this bloom filter". As Long as the answer is usually right, that's good enough.
I do wonder if in the future we'll use a mini online-trained AI to achieve the same thing. "Ask the AI if we have Mr Smith's data cached."
Rather like you might ask an office clerk: "Is Mr Smith that troublesome customer you were dealing with last week? Do you still have his file on your desk?"
ozgrakkurt · 40m ago
There are actually "learned" Bloom filters, if anyone is interested in the machine learning/Bloom filter relation. But they are not related to chatbots.
esafak · 1h ago
The way that would work is that the LLM would translate that sentence into a tool call, which would query a data store that does the heavy lifting. Also, there is an ML task called "Learning to hash", which is about optimizing the hash for the task: https://learning2hash.github.io/
nkrisc · 4h ago
Why would that be better than a bloom filter (or similar)?
A few non-exhaustive real-world use cases that come to mind (most reduce to the guard pattern sketched after this list):
- Databases: To quickly check if a record might exist before doing a costly disk lookup.
- Spell Checkers: To check if a word might be in the dictionary.
- Spam Filters: To check if an email sender is on a list of known spammers.
- Browser Security: Chrome uses Bloom filters to check if a site might be malicious.
- Password Checker: To check if a password is known to be leaked.
- Web Caches: To check if a URL or resource is definitely not in the cache.
- Distributed Systems: To avoid sending data that another system definitely doesn’t need.
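The common shape: pay for the slow path only when the filter says "probably yes". A sketch with hypothetical bloom/cache/db interfaces, purely to show the pattern:

    def get_user(user_id, bloom, cache, db):
        # "Definitely not cached" is answered for free; only on
        # "probably cached" do we pay for the cache round-trip.
        if bloom.maybe_contains(user_id):
            hit = cache.get(user_id)     # may still miss: false positive
            if hit is not None:
                return hit
        row = db.query(user_id)          # authoritative, slower path
        cache.put(user_id, row)
        bloom.add(user_id)
        return row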
Discussion of how Bloom filters made SQLite much faster.