I'm surprised at how normal some of the unseen words are. I expected them to all be archaic or niche, but many are pretty reasonable: 'congregant', 'definer', 'stereoscope'.
The dictionary site has only checked 4,920,000 posts, which is 0.28% of all messages.
wantlotsofcurry · 4h ago
I'm very curious as to how this works in the backend. I realize it uses Bluesky's firehose to get the posts, but I'm more curious about how it checks whether a post contains any of the available words. Any guesses?
avibagla1 · 3h ago
Hey! this is my site - it's not all that complex, i'm just using a sqlite db with two tables - one for stats, the other for all the words that's just word | count | first use | last use | post.
I... did not expect this to be so popular
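A rough sketch of what that two-table layout could look like, using Python's built-in sqlite3 module. Only the word | count | first use | last use | post columns come from the comment above; the table names, the stats columns, the `use_count` spelling, and the choice to keep the first post that used a word are all assumptions for illustration, not the site's actual schema.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the two hypothetical tables: one for stats, one for words."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS stats (
            name  TEXT PRIMARY KEY,
            value INTEGER NOT NULL
        );
        CREATE TABLE IF NOT EXISTS words (
            word      TEXT PRIMARY KEY,
            use_count INTEGER NOT NULL DEFAULT 0,
            first_use TEXT,
            last_use  TEXT,
            post      TEXT
        );
        """
    )
    return conn


def record_use(conn, word, seen_at, post_uri):
    """Bump a dictionary word's counter; return False if it is not in the table."""
    cur = conn.execute(
        "UPDATE words SET use_count = use_count + 1, "
        "first_use = COALESCE(first_use, ?), "  # keep the earliest timestamp
        "last_use = ?, "
        "post = COALESCE(post, ?) "  # assumption: remember the first post only
        "WHERE word = ?",
        (seen_at, seen_at, post_uri, word),
    )
    conn.commit()
    return cur.rowcount == 1
```

The words table would be pre-filled with every dictionary word at a count of 0, so an unseen word is simply a row whose `use_count` is still 0.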
gumboshoes · 28m ago
What is your source dictionary to compare to? Seems kind of small. Also, how are you handling inflected forms?
f311a · 3h ago
You can probably fit all words under 10-15MB of memory, but memory optimisations are not even needed for 250k words...
Trie data structures are memory-efficient for storing such dictionaries (2-4x better than hashmaps), although they are not as fast as hashmaps for retrieving items. You can hash the top 1k of the most common words and check the rest using a trie.
The most CPU-intensive task here is text tokenizing, but there are a ton of optimized options developed by orgs that work on LLMs.
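The hybrid lookup described above could be sketched like this. The class names are made up for illustration, and a trie built from nested Python dicts will not show the memory savings mentioned; those would need a compact trie implementation. This only shows the lookup logic.

```python
class Trie:
    """Minimal trie built from nested dicts."""

    _END = ""  # sentinel key; never collides with a single-character key

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True  # mark that a complete word ends here

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node


class HybridLookup:
    """Hash set for the most common words, trie for everything else."""

    def __init__(self, common_words, other_words):
        self.common = frozenset(common_words)
        self.rest = Trie(other_words)

    def __contains__(self, word):
        # Check the cheap hash set first, then fall back to the trie.
        return word in self.common or word in self.rest
```

With `lookup = HybridLookup(["the", "and"], ["eluvium", "alluvium"])`, the check `"eluvium" in lookup` walks the trie, while `"the" in lookup` is answered by the hash set alone.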
gpm · 4h ago
Probably just a big hashtable mapping word -> the number of times it's been seen, and another hashset of all the words it hasn't seen. When a post comes in you hash all the words in it and look them up in the hashtable, increment it, and if the old value was 0 remove it from the hash set.
250k words at a generous 100 bytes per word is only 25MB of memory...
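A minimal sketch of that guess: a dict mapping each word to its count, plus a set of words not yet seen. The function names and the lowercase `[a-z]+` tokenizer are assumptions for illustration; nothing here is confirmed to match the site.

```python
import re

# Assumption: words are runs of ASCII letters, compared in lowercase.
WORD_RE = re.compile(r"[a-z]+")


def build_index(dictionary_words):
    """Return (counts, unseen) for a list of dictionary words."""
    counts = {word: 0 for word in dictionary_words}
    unseen = set(dictionary_words)
    return counts, unseen


def record_post(text, counts, unseen):
    """Update counts for one post; return the words seen for the first time."""
    newly_seen = []
    for word in WORD_RE.findall(text.lower()):
        if word not in counts:
            continue  # not a dictionary word
        if counts[word] == 0:
            # Old value was 0, so this is the word's first sighting.
            unseen.discard(word)
            newly_seen.append(word)
        counts[word] += 1
    return newly_seen
```

For example, after `counts, unseen = build_index(["congregant", "definer"])`, calling `record_post("A congregant spoke.", counts, unseen)` returns `["congregant"]` and leaves only `"definer"` in `unseen`.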
stwrzn · 3h ago
I very much hope that the backend uses one of the bluesky jetstream endpoints.
When you only subscribe to new posts, it provides a stream of around 20mbit/s last time I checked, while the firehose was ~200mbit/s.
avibagla1 · 3h ago
yes it does!
bangaladore · 4h ago
Maybe I'm being naive, but with only ~275k words to check against, this doesn't seem like a particularly hard problem. Ingest post, split by words, check each word via some db, hashmap, etc... and update metadata.
neaden · 5h ago
Is this not working, or am I missing something? It just shows as seeing 0 words for me. Firefox on a PC.
accrual · 4h ago
You may need to allow scripts from the domain avibagla.com, it shows 0 when the scripts are blocked.
zem · 3h ago
ugh, it ought to be building the results on the server and serving up static pages.
rafram · 3h ago
But it updates live...
forgotmypw17 · 51m ago
It could do both...
AgentME · 3h ago
For me it took a minute to start loading data and switch from just showing 0.
SirFatty · 4h ago
Same... maybe you need a Bluesky account, which I don't have.
gpm · 4h ago
It doesn't... I can open it in a private browsing window.
GalaxyNova · 4h ago
It's working fine for me on Firefox
pona-a · 3h ago
For a moment I thought it would be an AT-Proto based Urban Dictionary clone.
GalaxyNova · 4h ago
Fascinating! I think it's really cool that this is possible, and at the same time kind of sad that the norm is slowly moving towards more locked-down APIs.
timeon · 3h ago
> slowly moving towards
Depends what we accept as norm.
spullara · 4h ago
I did this against a pretty large tweet archive and got hits on about 125k of the words in the unix dictionary.
refreeze654 · 2h ago
I've wondered how Bluesky affords the bandwidth to let anyone stream the full firehose.
psionides · 26m ago
From what they say, it is a lot, but it's generally on the order of a few hundred connections total at the moment.
dgacmu · 2h ago
Not an answer to your question, but I suspect most people don't -- my bot (a pi searcher bot, of course) just runs on Jetstream, which is pretty lightweight and heavily compressed.
(The website in question uses jetstream also.)
75345d4c · 4h ago
I just saw it indexed "eluvium," but the post was referring to a band with that same name
Kye · 4h ago
GeologySky will get to it soon enough.
k7sune · 8m ago
Thanks to this I just learned about alluvium, eluvium, illuvium, and colluvium.
atlgator · 3h ago
I checked out the author's other projects and this is a common issue. For example, he has a "lean checker" for Bluesky that claims it is right-leaning simply because of all the people saying "That's right," "He was right," etc. None of the supposed right-leaning posts were actually conservative in nature. They just used the word "right" to mean "correct".
avibagla1 · 3h ago
one, thank you for checking my website. two, that is the joke, 100% - at the time people kept talking about how "left leaning" bsky was and that idea came to mind
- Search unseen words
made me chuckle