A simple search engine from scratch

175 bertman 24 5/20/2025, 9:58:56 AM bernsteinbear.com

Comments (24)

franczesko · 5h ago
On the topic of search engines, I really liked classes by David Evans. The task was also building a simple search engine from scratch. It's really for beginners, as the emphasis is on coding in general, but I've found it to be very approachable.

https://www.cs.virginia.edu/~evans/courses/

franczesko · 1h ago
Due to dead links, this is a more appropriate URL:

https://www.cs.virginia.edu/~evans/courses/cs101/

marginalia_nu · 2h ago
The SEIRP book (Search Engines: Information Retrieval in Practice), free online as a PDF, is also a fantastic resource on traditional search engines and information retrieval in general [1].

[1] https://ciir.cs.umass.edu/irbook/

snowstormsun · 1h ago
Nice idea, but this approach does not handle out-of-vocabulary words well, which is one major motivation for using a vector-based search. It might not perform significantly better than lexical matching like tf-idf or BM25, while being slower because of its linear complexity. But cool regardless.
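To make the two concerns above concrete, here is a minimal sketch of an averaged-word-vector search. It is not the article's code: the toy vectors, `embed`, `cosine`, and `search` are all hypothetical names and values chosen for illustration, and it assumes unknown words are simply skipped.

```python
import math

# Toy 2-dimensional word vectors, made up for illustration. A real model such
# as word2vec has a far larger vocabulary and many more dimensions.
TOY_VECTORS = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "car": [0.0, 1.0],
}


def embed(text):
    """Average the vectors of known words; out-of-vocabulary words are skipped."""
    known = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not known:
        return None  # nothing in the text is in the vocabulary
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, docs):
    """Linear scan: score every document against the query, best match first."""
    q = embed(query)
    if q is None:
        return []  # the whole query is out of vocabulary
    scored = []
    for doc in docs:
        d = embed(doc)
        if d is not None:
            scored.append((cosine(q, d), doc))
    return sorted(scored, reverse=True)
```

In this sketch a query made up entirely of unknown words returns nothing, and every query touches every document, which is the linear cost mentioned above. A real implementation would at least precompute the document vectors instead of re-embedding them per query.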
haasisnoah · 57m ago
How would you handle those in word2vec?

And isn’t a big advantage that synonyms are handled correctly? This implementation still has that advantage.

netdevphoenix · 1h ago
It is supposed to be a simple search engine. Keyword: simple.

As long as it does what it is meant to do as a simple search engine, it seems fine.

snowstormsun · 1h ago
Using tf-idf or BM25 would actually be simpler than a vector search.

I understand this is just for fun, just wanted to point that out.
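For a sense of how little code lexical ranking needs, here is a rough BM25 sketch using only the standard library. The function name `bm25_scores` and the whitespace tokenizer are assumptions for illustration; `k1` and `b` are the usual BM25 tuning parameters, set here to commonly used defaults, and the IDF uses the variant with `1 +` inside the logarithm so scores stay non-negative.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a term absent from this document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

No pretrained model or vocabulary file is needed, and any word that appears in the documents can be matched. The trade-off is that matching is exact, so synonyms score zero.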

ktallett · 3h ago
I always wonder if the days of search engines for specific topics could return. With LLMs providing less than accurate results in some areas, and Google, Bing, etc. being taken over by adverts or well-organised SEO, there feels like a place for accurate, specialised search.
datadrivenangel · 3h ago
The curation of an index of resources is what's needed for niche search.
cosmicgadget · 50m ago
My hope is that content self-indexes, so instead of curation it just has to be aggregated.
dcist · 3h ago
Westlaw and LexisNexis provide this for legal search, but quite frankly, these services are subpar. It's amazing that these two companies rake in hundreds of millions but they are both slower than Google, Bing, Yandex, or any LLM service (ChatGPT, Claude, Gemini, etc.) while scouring a universe of text that is orders of magnitude smaller. The user experience is also terrible (you have to log in and specify a client each and every time you attempt to use the service, and both services log you out after a short -- in my opinion -- period of inactivity, creating friction and needless annoyance for the user). There's an opportunity there.
ahi · 2h ago
LN and Westlaw's real service is their ubiquity. Every law student has access to them and every firm expects proficiency. While they generally suck, the last time I used them (looong time ago), their boolean search was quite nice. That kind of text search has mostly been replaced by non-deterministic black boxes, which aren't great for legal research.
piker · 1h ago
You forgot to mention their claim of copyright over the bulk of, e.g. obscure state case law.
ehecatl42 · 8m ago
So, you have to pay to access the law that you are subject to?
ktallett · 3h ago
I haven't personally used the mentioned services, as they aren't in my field. However, what is the accuracy of their results? Are they double-checked? I don't find LLMs particularly accurate in my field (that's being kind); if anything, I find they make up sources that simply don't exist.

I mean, poor UX has no excuse, but slow speed can be justified if it makes the quality of the service better.

ordersofmag · 3h ago
Here’s a place to start if you want to go down the rabbit hole of how search at places like this is approached. https://haystackconf.com/us2022/talk-12/

https://www.youtube.com/watch?v=9vCMFIJRiKk

raydenvm · 2h ago
Which is not scalable, right?
cosmicgadget · 56m ago
It's scalable if you are okay with not searching exhaustively.
fanwood · 3h ago
I already search directly on Wikipedia for most topics (with a search shortcut in the URL bar).
ktallett · 3h ago
Wikipedia is useful up to a point for sure. I wonder whether it could be an expansion of Wikipedia in its current use case, but for emerging research and niche topics it can sometimes be less useful.
swyx · 34m ago
This embeds words with word2vec, which is like 10 years old. At least use BERT or sentence-transformers :)
cosmicgadget · 1h ago
This was a really nice read. Now I have no excuse not to upgrade my blog search. I do feel that I'll have a ton of long tail words like 'prank'.
sp0rk · 2h ago
The SVG equation is very difficult to read if you're using a dark OS theme because the blog uses the OS preference for dark/light theme (and doesn't seem to give an option to change it manually, either.)
tekknolagi · 2h ago
Fixed, I think? Let me know