Ask HN: What do you dislike about ChatGPT and what needs improving?

I'm surprised at how normal some of the unseen words are. I expected them to all be archaic or niche, but many are pretty reasonable: 'congregant', 'definer', 'stereoscope'.

gkoberger · 16m ago

For what it's worth, there's 1.7bn posts on Bluesky according to this: https://bsky.jazco.dev/stats

The dictionary site has only checked 4,920,000 posts, which is about 0.29% of all messages.

wantlotsofcurry · 4h ago

I'm very curious as to how this works in the backend. I realize it uses Bluesky's firehose to get the posts, but I'm more curious about how it checks whether a post contains any of the available words. Any guesses?

avibagla1 · 3h ago

Hey! this is my site - it's not all that complex, i'm just using a sqlite db with two tables - one for stats, the other for all the words that's just word | count | first use | last use | post.
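For readers wondering what a two-table layout like that might look like, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the `record_word` helper are all hypothetical, guessed from the "word | count | first use | last use | post" description; the actual site may well differ, including in whether `post` holds the first or the latest post.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stats (
        name  TEXT PRIMARY KEY,
        value INTEGER NOT NULL
    );
    CREATE TABLE words (
        word       TEXT PRIMARY KEY,
        seen_count INTEGER NOT NULL DEFAULT 0,
        first_use  TEXT,
        last_use   TEXT,
        post       TEXT
    );
""")


def record_word(conn, word, seen_at, post_uri):
    """Bump the counter for a dictionary word.

    Returns True if the word is in the dictionary table, False otherwise.
    Assumes `post` keeps the first post that used the word.
    """
    cur = conn.execute(
        """
        UPDATE words
        SET seen_count = seen_count + 1,
            first_use  = COALESCE(first_use, ?),
            last_use   = ?,
            post       = COALESCE(post, ?)
        WHERE word = ?
        """,
        (seen_at, seen_at, post_uri, word),
    )
    return cur.rowcount == 1
```

With the dictionary preloaded as rows with a zero count, the "unseen" list is simply `SELECT word FROM words WHERE seen_count = 0`.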

I... did not expect this to be so popular

gumboshoes · 28m ago

What is your source dictionary to compare to? Seems kind of small. Also, how are you handling inflected forms?

f311a · 3h ago

You can probably fit all words under 10-15MB of memory, but memory optimisations are not even needed for 250k words...

Trie data structures are memory-efficient for storing such dictionaries (2-4x better than hashmaps), although they are not as fast as hashmaps for lookups. You can hash the top 1k most common words and check the rest using a trie.
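A rough sketch of that hybrid idea, assuming a plain character trie plus a small hash set for the most common words. The class names are made up for illustration, and a dict-per-node trie in Python will not show the memory savings mentioned above; those depend on a compact trie implementation. This only illustrates the lookup logic.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False


class HybridDictionary:
    """Hash set for a few common words, trie for everything else."""

    def __init__(self, words, common_words=()):
        self.common = set(common_words)
        self.root = TrieNode()
        for word in words:
            if word not in self.common:
                self._insert(word)

    def _insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def __contains__(self, word):
        # Fast path: common words are answered by the hash set.
        if word in self.common:
            return True
        # Slow path: walk the trie one character at a time.
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```

Usage would look like `"stereoscope" in HybridDictionary(word_list, common_words=top_1k)`, where `word_list` and `top_1k` are whatever lists you have on hand.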

The most CPU-intensive task here is text tokenizing, but there are a ton of optimized options developed by orgs that work on LLMs.

gpm · 4h ago

Probably just a big hashtable mapping word -> the number of times it's been seen, and another hashset of all the words it hasn't seen. When a post comes in you hash all the words in it and look them up in the hashtable, increment it, and if the old value was 0 remove it from the hash set.

250k words at a generous 100 bytes per word is only 25MB of memory...
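The approach described above can be sketched in a few lines of Python. This is only an illustration of the hashtable-plus-hashset idea, not the site's actual code; the `UnseenTracker` name and the lowercase-letters-only tokenizer are assumptions.

```python
import re

# Simplistic tokenizer (an assumption): runs of ASCII letters, lowercased.
WORD_RE = re.compile(r"[a-z]+")


class UnseenTracker:
    def __init__(self, dictionary_words):
        # word -> number of times it has been seen
        self.counts = {word: 0 for word in dictionary_words}
        # words that have not been seen yet
        self.unseen = set(self.counts)

    def ingest(self, post_text):
        """Count dictionary words in a post; return words seen for the first time."""
        newly_seen = []
        for token in WORD_RE.findall(post_text.lower()):
            old = self.counts.get(token)
            if old is None:
                continue  # not a dictionary word
            self.counts[token] = old + 1
            if old == 0:
                self.unseen.discard(token)
                newly_seen.append(token)
        return newly_seen
```

Each post costs one dictionary lookup per token, so throughput is dominated by tokenizing rather than by the lookups themselves.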

stwrzn · 3h ago

I very much hope that the backend uses one of the Bluesky Jetstream endpoints. When you only subscribe to new posts, it provides a stream of around 20 Mbit/s last time I checked, while the firehose was ~200 Mbit/s.

avibagla1 · 3h ago

yes it does!

bangaladore · 4h ago

Maybe I'm being naive, but with only ~275k words to check against, this doesn't seem like a particularly hard problem. Ingest post, split by words, check each word via some db, hashmap, etc... and update metadata.

neaden · 5h ago

Is this not working, or am I missing something? It just shows 0 words seen for me. Firefox on a PC.

accrual · 4h ago

You may need to allow scripts from the domain avibagla.com, it shows 0 when the scripts are blocked.

zem · 3h ago

ugh, it ought to be building the results on the server and serving up static pages.

rafram · 3h ago

But it updates live...

forgotmypw17 · 51m ago

It could do both...

AgentME · 3h ago

For me it took a minute to start loading data and switch from just showing 0.

SirFatty · 4h ago

Same... maybe you need a Bluesky account, which I don't have.

gpm · 4h ago

It doesn't... I can open it in a private browsing window.

GalaxyNova · 4h ago

It's working fine for me on Firefox

pona-a · 3h ago

For a moment I thought it would be an AT-Proto based Urban Dictionary clone.

GalaxyNova · 4h ago

Fascinating! I think it's really cool that this is possible, and at the same time kind of sad that the norm is slowly moving towards more locked-down APIs.

timeon · 3h ago

> slowly moving towards

Depends what we accept as norm.

spullara · 4h ago

I did this against a pretty large tweet archive and got hits on about 125k of the words in the unix dictionary.

refreeze654 · 2h ago

I've wondered how Bluesky affords the bandwidth to let anyone stream the full firehose.

psionides · 26m ago

From what they say, it is a lot, but it's generally on the order of a few hundred connections total at the moment.

dgacmu · 2h ago

Not an answer to your question, but I suspect most people don't -- my bot (a pi searcher bot, of course) just runs on Jetstream, which is pretty lightweight and heavily compressed.

(The website in question uses jetstream also.)

75345d4c · 4h ago

I just saw it indexed "eluvium," but the post was referring to a band with that same name

Kye · 4h ago

GeologySky will get to it soon enough.

k7sune · 8m ago

Thanks to this I just learned about alluvium, eluvium, illuvium, and colluvium.

atlgator · 3h ago

I checked out the author's other projects and this is a common issue. For example, he has a "lean checker" for Bluesky that claims it is right-leaning simply because of all the people saying "That's right," "He was right," etc. None of the supposed right-leaning posts were actually conservative in nature. They just used the word "right" to mean correct.

avibagla1 · 3h ago

one, thank you for checking my website. two, that is the joke, 100% - at the time people kept talking about how "left leaning" bsky was and that idea came to mind

OneDeuxTriSeiGo · 34m ago

lmao that's fantastic

tough · 3h ago

Words We Haven't Seen

- Search unseen words

made me chuckle

crm9125 · 2h ago

I've found content for all of my future skeets.