The Bluesky Dictionary

67 gaws 27 8/6/2025, 8:43:08 PM avibagla.com

Comments (27)

wantlotsofcurry · 2h ago
I'm very curious as to how this works in the backend. I realize it uses Bluesky's firehose to get the posts, but I'm more curious about how it checks whether a post contains any of the available words. Any guesses?
avibagla1 · 1h ago
Hey! this is my site - it's not all that complex, i'm just using a sqlite db with two tables - one for stats, the other for all the words that's just word | count | first use | last use | post.

I... did not expect this to be so popular
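
For readers wondering what that word table could look like, here is a minimal sketch using Python's built-in sqlite3 module. Only the column list (word | count | first use | last use | post) comes from the comment above. The column types, the `open_db`, `load_dictionary` and `record_use` helpers, and the choice to keep the first post rather than the latest are all assumptions. The stats table is left out because its columns were not described.

```python
import sqlite3

def open_db(path=":memory:"):
    """Create a hypothetical `words` table mirroring the columns named above."""
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS words (
            word      TEXT PRIMARY KEY,
            count     INTEGER NOT NULL DEFAULT 0,
            first_use TEXT,   -- timestamp of the first sighting
            last_use  TEXT,   -- timestamp of the latest sighting
            post      TEXT    -- assumed: reference to the first post using the word
        )
    """)
    return conn

def load_dictionary(conn, words):
    """Preload every dictionary word with a count of zero."""
    conn.executemany("INSERT OR IGNORE INTO words (word) VALUES (?)",
                     ((w,) for w in words))
    conn.commit()

def record_use(conn, word, seen_at, post_uri):
    """Bump the counter for a dictionary word; return False for unknown words."""
    cur = conn.execute(
        """UPDATE words
           SET count = count + 1,
               first_use = COALESCE(first_use, ?),
               last_use = ?,
               post = COALESCE(post, ?)
           WHERE word = ?""",
        (seen_at, seen_at, post_uri, word),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the dictionary is preloaded, a word that is not in the table matches no row and is simply ignored, so the lookup and the update happen in one statement.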

f311a · 1h ago
You can probably fit all words under 10-15MB of memory, but memory optimisations are not even needed for 250k words...

Trie data structures are memory-efficient for storing such dictionaries (2-4x better than hashmaps), although they are not as fast as hashmaps for retrieving items. You can hash the top 1k most common words and check the rest using a trie.

The most CPU-intensive task here is text tokenizing, but there are a ton of optimized options developed by orgs that work on LLMs.
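
A rough sketch of the hybrid lookup described above: a hash set for the most common words, and a trie for everything else. The `Trie` and `HybridDictionary` classes are hypothetical names. A dict-of-dicts trie in Python will not show the memory savings mentioned above; this only illustrates the structure.

```python
END = ""  # end-of-word marker; never collides with a real character key

class Trie:
    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return END in node

class HybridDictionary:
    """Hash set for the hot words, trie for the long tail."""

    def __init__(self, words, common):
        self.common = set(common)
        self.trie = Trie()
        for w in words:
            if w not in self.common:
                self.trie.add(w)

    def __contains__(self, word):
        # Cheap set lookup first, trie walk only on a miss.
        return word in self.common or word in self.trie
```

Usage would look like `"theory" in HybridDictionary(all_words, common=top_1k)`, where `all_words` and `top_1k` stand in for whatever word lists are available.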

gpm · 1h ago
Probably just a big hashtable mapping word -> the number of times it's been seen, and another hashset of all the words it hasn't seen. When a post comes in you hash all the words in it and look them up in the hashtable, increment it, and if the old value was 0 remove it from the hash set.

250k words at a generous 100 bytes per word is only 25MB of memory...
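
The approach above could be sketched like this, with a dict for the counts and a set for the words not yet seen. The function names `build_index` and `ingest` are made up for illustration, and the words are assumed to be already split and lowercased.

```python
def build_index(dictionary_words):
    """Return (counts, unseen): every word starts at 0 and is initially unseen."""
    counts = {w: 0 for w in dictionary_words}
    unseen = set(counts)
    return counts, unseen

def ingest(post_words, counts, unseen):
    """Count each dictionary word in a post and drop it from the unseen set."""
    for w in post_words:
        old = counts.get(w)
        if old is None:
            continue  # not a dictionary word
        counts[w] = old + 1
        if old == 0:
            unseen.discard(w)

# The memory estimate from the comment: 250k words at 100 bytes each.
ESTIMATED_BYTES = 250_000 * 100  # 25,000,000 bytes, i.e. 25 MB
```

Both lookups are constant time on average, so the cost per post is proportional to the number of words in it.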

stwrzn · 1h ago
I very much hope that the backend uses one of the bluesky jetstream endpoints. When you only subscribe to new posts, it provides a stream of around 20mbit/s last time I checked, while the firehose was ~200mbit/s.
avibagla1 · 1h ago
yes it does!
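
For anyone curious what "only subscribe to new posts" might look like on the client side, here is a small sketch that builds a Jetstream subscription URL and pulls the text out of one event. The host name, the `wantedCollections` query parameter and the event shape (`kind`, `commit.operation`, `commit.collection`, `commit.record.text`) are written from memory of the public Jetstream documentation and should be treated as assumptions. The helper names are hypothetical, and the WebSocket connection itself is left out.

```python
import json
from urllib.parse import urlencode

# Assumption: one of the public Jetstream instances.
JETSTREAM_HOST = "jetstream2.us-east.bsky.network"
POST_COLLECTION = "app.bsky.feed.post"

def subscribe_url(host=JETSTREAM_HOST):
    """Build a subscription URL that asks only for post records."""
    query = urlencode({"wantedCollections": POST_COLLECTION})
    return f"wss://{host}/subscribe?{query}"

def post_text(raw_event):
    """Return the text of a newly created post, or None for any other event."""
    event = json.loads(raw_event)
    if event.get("kind") != "commit":
        return None
    commit = event.get("commit", {})
    if commit.get("operation") != "create":
        return None
    if commit.get("collection") != POST_COLLECTION:
        return None
    return commit.get("record", {}).get("text")
```

Each message received on the socket would be passed to `post_text`, and anything that comes back as None (likes, deletes, account events) is skipped.
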
bangaladore · 2h ago
Maybe I'm being naive, but with only ~275k words to check against, this doesn't seem like a particularly hard problem. Ingest post, split by words, check each word via some db, hashmap, etc... and update metadata.
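
The "split by words" step could be as simple as the sketch below, which lowercases a post and pulls out runs of ASCII letters (with an optional apostrophe part). The regex and the function names are assumptions, not what the site does. It will also split URLs and handles into word-like pieces and truncate accented words.

```python
import re

# Runs of a-z, optionally followed by an apostrophe part ("that's").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def tokenize(text):
    """Lowercase the post and return its word-like tokens in order."""
    return WORD_RE.findall(text.lower())

def dictionary_hits(text, dictionary):
    """Return the set of dictionary words that appear in the post."""
    return {w for w in tokenize(text) if w in dictionary}
```

`dictionary` can be any container that supports `in`, such as a set of the ~275k words.
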
neaden · 2h ago
Is this not working, or am I missing something? It just shows as seeing 0 words for me. Firefox on a PC.
accrual · 2h ago
You may need to allow scripts from the domain avibagla.com, it shows 0 when the scripts are blocked.
zem · 1h ago
ugh, it ought to be building the results on the server and serving up static pages.
rafram · 1h ago
But it updates live...
AgentME · 1h ago
For me it took a minute to start loading data and switch from just showing 0.
SirFatty · 2h ago
Same... maybe you need a Bluesky account, which I don't have.
gpm · 2h ago
It doesn't... I can open it in a private browsing window.
GalaxyNova · 2h ago
It's working fine for me on Firefox
pona-a · 1h ago
For a moment I thought it would be an AT-Proto based Urban Dictionary clone.
refreeze654 · 34m ago
I've wondered how Bluesky affords the bandwidth to let anyone stream the full firehose.
dgacmu · 27m ago
Not an answer to your question, but I suspect most people don't -- my bot (a pi searcher bot, of course) just runs on Jetstream, which is pretty lightweight and heavily compressed.

(The website in question uses jetstream also.)

spullara · 1h ago
I did this against a pretty large tweet archive and got hits on about 125k of the words in the unix dictionary.
GalaxyNova · 2h ago
Fascinating! I think it's really cool that this is possible, and at the same time kind of sad that the norm is slowly moving towards more locked-down APIs.
timeon · 1h ago
> slowly moving towards

Depends what we accept as norm.

75345d4c · 2h ago
I just saw it indexed "eluvium," but the post was referring to a band with that same name
Kye · 2h ago
GeologySky will get to it soon enough.
atlgator · 1h ago
I checked out the author's other projects and this is a common issue. For example, he has a "lean checker" for Bluesky that claims it is right-leaning simply because of all the people saying "That's right," "He was right," etc. None of the supposed right-leaning posts were actually conservative in nature. They just used the word "right" to mean correct.
avibagla1 · 1h ago
one, thank you for checking my website. two, that is the joke, 100% - at the time people kept talking about how "left leaning" bsky was and that idea came to mind
tough · 1h ago
Words We Haven't Seen

- Search unseen words

made me chuckle

crm9125 · 54m ago
I've found content for all of my future skeets.