Just wow. My greatest respect! Also an incredible write-up. I like the takeaway that an essential ingredient of a search engine is curated, well-filtered data (garbage in, garbage out). I feel like this has been a big lesson from LLM training too: better to work with less but much higher-quality data. I'm curious how a search engine would perform where all content has been judged by an LLM.
rkunnamp · 4m ago
I couldn't get the search working (there was some CORS error). But what a feat and write-up. Wonderstruck!
voiper1 · 8m ago
Wow, looks like a tremendous commitment and depth of knowledge went into this one-man project. I couldn't even read the whole write up, I had to skim part of it. I'm super impressed.
ccgreg · 2h ago
At the end, the author thinks about adding Common Crawl data. Our ranking information, generated from our web graph, would probably be a big help in picking which pages to crawl.
I love seeing the worked out example at scale -- I'm surprised at how cost effective the vector database was.
Imustaskforhelp · 2h ago
This is really, really cool. I had wanted to run all my searches on it, and though that seems possible, I feel like it would sadly waste a bit more time per search. Still, I'll maybe try running some of my searches against it and share my thoughts afterwards. It is hit or miss: it will almost land you in the right spot, but not exactly.
For example, I searched lemmy hoping to find the fediverse and it gave me their liberapay page though.
Please actually follow up on that Common Crawl promise, and maybe even archive.org or other sources too. People are spending billions in this AI industry; I just hope that you can, whether through funding or just community crowdwork, actually succeed in creating such an alternative. People are honestly fed up with the current near-monopoly in search.
Wasn't Ecosia trying to roll out their own search engine? They should definitely take your help or have you on their team.
I just want a decentralized search engine, man. I understand that you want to make it sustainable and that's why you haven't open sourced it, but please, there is honestly so much money going into potholes doing nothing but making our society worse, and this project almost works well enough and has insane potential...
Please open source it, and let's hope that the community figures out some way of monetization or crowdfunding to actually make it sustainable.
But still, I haven't read the blog post in its entirety since I was so excited that I just started using the search engine. The article feels super in-depth, though, and this idea can definitely help others create their own proofs of concept or actually create a decent open source search engine once and for all.
Not going to lie, this feels like a little magic and I am all for it. The more I think about it, the more excited I get; I haven't been this excited about a project in actual months!
I know open source is tough, and I come from a third-world country, but this is actually so cool that I will donate as much as I can spare right now. It's not much, around $50, but this is coming from a guy who has never spent a single penny online and now wants to donate to ya. Please, I beg ya to open source it and use that Common Crawl data. Either way, I wish you all the best in your life and career, man.
jobswithgptcom · 2h ago
I've been doing a smaller version of the same idea for just the domain of job listings. Initially I looked at HNSW but couldn't reason about how to scale it with a predictable compute cost. I ended up using IVF because I am a bit memory-starved. I will have to take a look at coreNN.
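For readers unfamiliar with IVF (inverted file) indexes, here is a minimal pure-Python sketch of the general idea, not the commenter's actual setup. The names `IVFIndex`, `n_lists`, and `nprobe` are illustrative assumptions (the last two follow common naming in vector search libraries). Vectors are grouped under k-means centroids, and a query only scans the lists of its `nprobe` nearest centroids, so the work per query is roughly `n_lists` centroid comparisons plus the size of the probed lists.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids (the coarse quantizer)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist2(v, centroids[c]))
            buckets[nearest].append(v)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster went empty
                dim = len(members[0])
                centroids[c] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return centroids


class IVFIndex:
    """Hypothetical minimal IVF index: one inverted list of vector ids per centroid."""

    def __init__(self, vectors, n_lists, seed=0):
        self.vectors = vectors
        self.centroids = kmeans(vectors, n_lists, seed=seed)
        self.lists = [[] for _ in range(n_lists)]
        for i, v in enumerate(vectors):
            c = min(range(n_lists), key=lambda j: dist2(v, self.centroids[j]))
            self.lists[c].append(i)

    def search(self, query, k=5, nprobe=2):
        """Return ids of the k closest vectors found in the nprobe nearest lists."""
        order = sorted(
            range(len(self.centroids)),
            key=lambda j: dist2(query, self.centroids[j]),
        )
        candidates = [i for j in order[:nprobe] for i in self.lists[j]]
        candidates.sort(key=lambda i: dist2(query, self.vectors[i]))
        return candidates[:k]
```

Raising `nprobe` trades speed for recall; with `nprobe` equal to `n_lists` the search degenerates into an exact brute-force scan.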
randomcatuser · 2h ago
This is so cool. A question on the service mesh - is building your own typically the best way to do things?
I'm new to networking.
A_Stefan · 30m ago
Such a big inspiration! One of the few times where I genuinely read and liked the work - didn't even notice how the time flew by.
Feels like it's more and more about consuming data & outputting the desired result.
1gn15 · 1h ago
This is incredibly, incredibly cool. Creating a search engine that beats Google in quality in just 2 months and less than a thousand dollars.
Really great idea about the federated search index too! YaCy has it but it's really heavy and never really gave good results for me.
giancarlostoro · 3h ago
This then raises the question for me: without an LLM, what is the approach to building a search engine? Google search used to be razor sharp, then it degraded in the late 2000s and early 2010s, and now it's meh. They filter out so much content for a billion different reasons and the results are just not what they used to be. I've found better results from some LLMs like Grok (surprisingly), but I can't understand why what was once a razor-exact search engine like Google cannot find verbatim or near-verbatim quotes of content I remember seeing on the internet.
yorwba · 2h ago
When I encounter the "cannot find verbatim quote I remember" problem and then later find what I was looking for in some other way, I usually discover that I misremembered and the actual quote was different. I do prefer getting zero results in that case, though.
andai · 3h ago
My understanding was that every few months Google was forced to adjust their algorithms because the search results would get flooded by people using black hat SEO techniques. At least that's the excuse I heard for why it got so much worse over time.
Not sure if that's related to it ignoring quotes and operators though. I'd imagine that to be a cost saving measure (and very rarely used, considering it keeps accusing me of being a robot when I do...)
From what I understand, that good old Google from the 2000s was built entirely without any kind of machine learning. Just a keyword index and PageRank. Everything they added since then seems to have made it worse (though it did also degrade "organically" from the SEO spam).
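As a rough illustration of the PageRank half of that description, here is a minimal power-iteration sketch. It is a textbook version under stated assumptions, not Google's production algorithm: the function name, the tiny link graph in the usage note, and the damping factor of 0.85 (a commonly cited default) are all illustrative.

```python
def pagerank(links, damping=0.85, iters=50):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict of page -> score; scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Every page gets the "random jump" share first.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page (no outgoing links): spread its rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new[q] += share
        rank = new
    return rank
```

For example, with `links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}`, the page that everything links to ends up with the highest score. In a search engine of that style, such a score would be combined with keyword matches to order the results.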
masfuerte · 29m ago
Google certainly had to update their algorithms to cope with SEO, but that's not why their results have become so poor in the last five years or so. They made a conscious decision to prioritize profit over search quality. This came out in internal emails that were published as part of discovery for one of the antitrust suits.
To reiterate: Google search results are shit because shit ad-laden results make them more money in the short term.
That's it. And it's sad that so many people continue to give them the benefit of the doubt when there is no doubt.
xnx · 3h ago
The majority of the public internet shifted to "SEO optimized" garbage while the real user-generated content shifted to walled gardens like Instagram, Facebook, and Reddit (somewhat open). More recently, even user-generated content is poisoned by wannabe influencers shilling some snake oil or scam.
reactordev · 1h ago
This is correct. Marketing and advertising people manipulated pages to gain higher rankings because they figured out the algorithm behind it, forcing Google to change the algorithm. Originally, prior to the flood of <meta> garbage and hidden <div>s, it was very good at linking content together. Now, it’s a weighted database.
ASalazarMX · 2h ago
This is my take as well. When websites were few, directories were awesome. When websites multiplied, Google was awesome. When websites became SEO trash, social networks were awesome. Now that social networks have become trash, I'm hoping the Fediverse becomes the next awesome.
I don't see AI in any form becoming the next awesome.
Imustaskforhelp · 2h ago
I wish the Fediverse all the best too.
I'd like to take this one step further: communities have gone through a similar transition, from forums to mostly Discord now, and I wish they would move to something like Matrix, which is federated (yes, I know it has issues, but trust me, sacrifices must be made).
What are your thoughts on things like Bluesky, Nostr, and Matrix?
Bluesky does seem centralized in its current stage, but its idea of (PDS?) makes it fundamentally hack-proof in the sense that if you are on a server which gets hacked, your account is still safe. Or at least that's the plan; I'm not sure about its current implementation.
I also agree with AI not being the next awesome. Maybe for coding, sure, but not in general. But even in coding, man, I feel like it's good enough, it's hard to squeeze out more progress from here, and it's just not worth it. But honestly, that's just me.
ASalazarMX · 1h ago
I think BlueSky still needs to prove itself. It is what Twitter/X was a decade ago, before the enshittification, and I enjoy the content a lot, with my reservations.
The weakness of Mastodon (and the Fediverse, IMO) is that you can join one of many instances, and it becomes easier to form an echo chamber. Your feed will be the Fediverse hose (lots of irrelevant content), your local instance (an echo chamber), or your subscriptions (curating them takes effort). Nevertheless, that might as well be a strength I'm not truly appreciating.
mwcz · 46s ago
There was a Neal Stephenson novel where curated feeds had become a big business because it was the only tolerable way to browse the Internet. Lately I've been thinking that's more likely to happen.
Imustaskforhelp · 49m ago
I mean, both Bluesky and the Fediverse are just decentralized technologies. So let's say that you are worried about Bluesky "enshittening":
I doubt that will happen because of its decentralized-enough nature.
I also agree with the subscription curation part, at least the last time I checked, but I didn't use Mastodon as often as I used Lemmy, and it was less of an issue on Lemmy.
Still, I feel like Bluesky as a technology is goated and doesn't feel like it can be enshittened.
Nostr, on the other hand, does seem to me like an echo chamber of crypto bros, but honestly, that's the most decentralization you can ever get. Shame that we are going to get mostly nothing meaningful out of it, IMO. In that case Bluesky seems good enough to me, although things like search, and the current Bluesky in general, are definitely centralized. But honestly, the same problems kept coming up on the Fediverse too: lemmy.world was getting too bloated with too many members, and even Mastodon had only one really famous home server, afaik (iirc mastodon.social, right?).
Also, I may be wrong (I usually am), but iirc Mastodon only allows you to comment on or interact with posts on your own server. I wanted to comment on mastodon.social from some other server, but I don't remember being able to do so. Maybe a skill issue on my side.
h2zizzle · 1h ago
This has always been the explanation, but I've always wondered if it wasn't so much battling SEO as balancing the appearance of battling SEO while not killing some factor related to their revenue.
giancarlostoro · 3h ago
That begs the question, if you can recreate their engine from the 2000s with high quality search results, would investors even fund you? Lol
entropie · 2h ago
> if you can recreate their engine from the 2000s
Seriously, how? I'm pretty sure you would need a very different approach than Google had in its best times. The web is a very different place now.
mike_hearn · 1h ago
The internet itself has changed over time, and a lot of content has just disappeared. It shouldn't appear in search because it's just not there anymore, it'd be a 404.
cosmic_cheese · 2m ago
A search engine that kept dead entries but maybe put them in a “missing” tab or something would’ve been monstrously useful for me in so many situations. There have been numerous times I’ve remembered looking at something N years ago only for all but the faintest traces of it to have disappeared from the internet. With a “missing” tab I’d at least have former URLs, page titles, etc. to work with (archive.org, etc.).
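A "missing" tab like this could be sketched as an index that tombstones dead pages instead of deleting them. Everything below is a hypothetical illustration (the `TombstoneIndex` class, its fields, and the choice to treat HTTP 404 and 410 as "gone" are assumptions), with a naive title substring match standing in for real search.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    url: str
    title: str
    last_seen: str  # ISO date of the last successful fetch
    missing: bool = False


class TombstoneIndex:
    """Hypothetical index that marks dead pages as missing instead of dropping them."""

    def __init__(self):
        self.entries = {}

    def record_fetch(self, url, title, status, date):
        """Update the index with the result of one crawl attempt."""
        entry = self.entries.get(url)
        if status == 200:
            # Page is live: store (or refresh) its metadata.
            self.entries[url] = Entry(url, title, date)
        elif status in (404, 410) and entry is not None:
            # Page is gone: keep the old URL, title, and date, just flag it.
            entry.missing = True

    def search(self, term):
        """Return (live, missing) entries whose title contains the term."""
        term = term.lower()
        hits = [e for e in self.entries.values() if term in e.title.lower()]
        live = [e for e in hits if not e.missing]
        missing = [e for e in hits if e.missing]
        return live, missing
```

The `missing` list still carries the former URL, title, and last-seen date, which is the kind of trace that could then be looked up on archive.org.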
thr0w · 3h ago
I see you’re also having trouble coping with this. Fact is, “that” internet is simply gone.
giancarlostoro · 3h ago
Nah, it's a series of tubes, just gotta get the right tubes together.
msgodel · 2h ago
I wish there was an old fashioned n-gram + page rank search engine for those of us who don't mind the issues the older Google had. I've thought about making my own a few times.
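A toy version of the n-gram half of such an engine might look like the sketch below. The `NgramIndex` class and its methods are illustrative assumptions, and it uses word n-grams with whitespace tokenization only. Note that intersecting n-gram postings yields candidate documents: for phrases longer than `n` words, a real engine would still verify word positions to confirm a verbatim match.

```python
from collections import defaultdict


def ngrams(text, n):
    """Lowercased word n-grams of a text, as tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


class NgramIndex:
    """Hypothetical inverted index from word n-grams to document ids."""

    def __init__(self, n=2):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for gram in ngrams(text, self.n):
            self.postings[gram].add(doc_id)

    def search(self, phrase):
        """Documents containing every n-gram of the phrase (candidate matches)."""
        grams = ngrams(phrase, self.n)
        if not grams:  # phrase shorter than n words
            return set()
        result = set(self.postings.get(grams[0], set()))
        for gram in grams[1:]:
            result &= self.postings.get(gram, set())
        return result

    def ranked(self, phrase, scores):
        """Candidates ordered by an externally supplied score, highest first."""
        return sorted(self.search(phrase), key=lambda d: scores.get(d, 0.0), reverse=True)
```

The `scores` argument of `ranked` is where a link-based score such as PageRank would plug in; any mapping from document id to a number works.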
AndrewKemendo · 1h ago
That stack element is amazing
I wish more people showed their whole exploded stack like that and in an elegant way
Really well done writeup!
tmelm · 13m ago
Incredibly cool. What a write-up. What an engineer.
abraxas · 3h ago
Very nice project. Do you have plans to commercialize it next?