What a nice project. What inspired this initially?
FYI there's a broken link in your readme:
https://rumca-js.github.io/internet full internet search
renegat0x0 · 21m ago
thanks, I replaced it with another demo link
hobs · 4h ago
Can’t you just request ICANN’s zone files and have the canonical list of the day?
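For anyone who wants to try the zone-file route, here is a minimal sketch of pulling registered domains out of a TLD zone file. It assumes the usual master-file layout (name, TTL, class, type, data) and counts a name as registered if it has at least one NS record. The function name `domains_from_zone` and the sample lines are made up for illustration, and real zone files may need more careful parsing.

```python
def domains_from_zone(lines, tld="com"):
    """Return the sorted set of delegated names found in zone-file lines.

    Assumes whitespace-separated fields: name, TTL, class, type, data.
    """
    suffix = "." + tld.lower() + "."
    seen = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        fields = line.split()
        if len(fields) < 5:
            continue
        name = fields[0].lower()
        record_type = fields[3].lower()
        # A delegation (NS record) below the TLD is treated as a registered name.
        if record_type == "ns" and name.endswith(suffix):
            seen.add(name.rstrip("."))
    return sorted(seen)


sample = [
    "; hypothetical excerpt",
    "example.com.\t172800\tin\tns\tns1.example.net.",
    "example.com.\t172800\tin\tns\tns2.example.net.",
    "other.com. 172800 IN NS ns1.other.com.",
]
print(domains_from_zone(sample))
```

This only yields names that could resolve; it says nothing about what the sites contain.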
renegat0x0 · 23m ago
Any link list or domain list is not worth much without ratings or metadata. I run a hobby project and I am not an expert, so I provide ratings based on what kind of data pages provide (title, social, description) and on my own manual voting system. It is not ideal, but it is something. I also provide tags, so it is easy to see what a domain offers, and domains can be filtered by tags.
I know that you cannot count and visit every domain, so the list will never be finished, but I am happy with the results.
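A rough sketch of what a metadata-based rating with manual votes and tag filtering could look like. The field names (`title`, `description`, `social`, `tags`), the weights, and both functions are hypothetical illustrations, not the project's actual scheme.

```python
def rate_page(meta: dict, votes: int = 0) -> int:
    """Score a page by which metadata it provides, plus manual votes.

    The weights are arbitrary assumptions chosen for illustration.
    """
    score = 0
    if meta.get("title"):
        score += 10
    if meta.get("description"):
        score += 10
    if meta.get("social"):  # e.g. social card data found on the page
        score += 5
    return score + votes


def filter_by_tag(domains: list[dict], tag: str) -> list[dict]:
    """Keep only the domain entries that carry the given tag."""
    return [d for d in domains if tag in d.get("tags", [])]
```

Something like `filter_by_tag(entries, "news")` would then narrow the list before sorting by `rate_page`.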
egberts1 · 1h ago
Avoiding GIGO (Garbage In, Garbage Out).
This is why we have computer-variants of Library Science and Archeology, Forensic Science and a bunch of other advanced knowledge (not AI, mind you).
hobs · 59m ago
I don't see how this applies, as it's aggregating a bunch of stuff from random crawlers. If you want to crawl a list of actual domains, the zone files are generally considered the list of things that could resolve, so they seem like a good starting place.
didip · 3h ago
This is amazing. Thanks for sharing!
luizfelberti · 5h ago
I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching though, it is (like others here have pointed out), building your index and crawling the (extremely adversarial) internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs.
I hope this guy succeeds and becomes another reference in the community like the marginalia dude. This makes me want to give my project another go...
While the index is currently not open source, it should be at some point. Maybe when they get out of the beta stage (?); details are still unclear.
3RTB297 · 2h ago
You know, it's possible the cure to an adversarial internet is to just have some non-profit serve as a repo for a universal clearnet index that anyone can access to build their own search engine. That way we don't have endless captchas and anubis and Cloudflare tests every time I try and look for a recipe online. Why send AI scrapers to crawl literally everything when you're getting the data for free?
I'll add it to the mile-long list of things that should exist and be online public goods.
moduspol · 4h ago
Is the common crawl usable for something like this?
Too bad it doesn't support android. It is much more energy efficient than anything else I can spare (for 100% uptime contribution)
ge96 · 5h ago
The IP thing is interesting, I was trying to make this CSGO bot one time to scrape steam's prices and there are proxy services out there you rent, tried at least one and it was blocked by steam. So I wonder if people buy real IPs.
kccqzy · 5h ago
Yeah people buy residential IPs on the black market. They are essentially infected home PCs and botnets.
you can get paid about $0.10/GB in cryptocurrency (at a few GB per month) to run one on your PC. Apparently they also just buy actual connections sometimes. It's not even unethical - it's just two groups of equally bad businesspeople trying to spend money to block the other one.
6510 · 4h ago
The crawl seems hard but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200 and do those results still count as having them?
cheema33 · 6h ago
I tried the search site at https://searcha.page/ by searching for something random and got the following message:
"An error has occurred building the search results."
authnopuz · 6h ago
hug of death? I fear the temperature will get very high in his laundry room
DannyBee · 6h ago
I'm sure it depends on how much laundry he is doing - his dryer is probably heated entirely by servers.
He can then exhaust the remaining server heat through the dryer vent stack.
debo_ · 6h ago
Keep going. I love dry humor.
egberts1 · 1h ago
Its dryer sheets soften the soul.
ArekDymalski · 5h ago
Until the exhaust starts "Feeling leaky" I guess.
robofanatic · 6h ago
Might not even need a dryer :-)
ape4 · 5h ago
Change it to a sauna?
doublerabbit · 1h ago
I thought of this a while ago when I was a Datacentre monkey. In the winter it was pleasant to walk down the hot aisles.
However the exhausted hot air never had the same feel of a sauna. It left the air stale and dry.
eschulz · 2h ago
Before this happened to me, my first search returned an impressive SERP.
It claims I reached the article limit. The last time I saw a fastcompany link must have been a decade ago! I was nostalgically looking forward to reading another article of theirs. Alas...
> The secret to making it all happen? Large language models. “What I’m doing is actually very traditional search,” Pearce says. “It’s what Google did probably 20 years ago, except the only tweak is that I do use AI to do keyword expansion and assist with the context understanding
> Fellow ambitious hobbyist Wilson Lin, who on his personal blog <https://blog.wilsonl.in/search-engine/> recently described his efforts to create a search engine of his own, took the opposite approach from Pearce.
> And then there’s the concept of doing a small-site search, along the lines of the noncommercial search engine Marginalia <https://marginalia-search.com>, which favors small sites over Big Tech
And the obvious answer to the title: "Why the laundry room? Two reasons: Heat and noise." It runs on a 32-core AMD EPYC 7532, half a terabyte of RAM, and "all in, cost $5,000, with about $3,000 of that going toward storage"
udkl · 3h ago
I absolutely devoured Wilson Lin's articles recently ... they are very high quality and informative for any amateur interested in search engines and LLMs! - https://blog.wilsonl.in/search-engine/
wvenable · 1h ago
Reader mode in Firefox (plus sometimes a page refresh) gets me past most paywalls -- including this article.
phendrenad2 · 3h ago
This is a cool project, and I hope he has fun with it.
I've daydreamed about how I'd create my own search engine so, so many times. But I always run into an impassable wall: The internet now isn't at all the same as the internet in 1999.
Discovery isn't really that useful. If you find someone's self-hosted blog about dinosaurs, it probably hasn't been updated since 2004, all the links and images are broken, and it's just thoroughly upstaged by Wikipedia and the Smithsonian. Sure, it's fun to find these quirky sites, but they aren't as valuable as they once were.
We've basically come full circle to the AOL model, where there are "hubs" of content that cater to specific categories. YouTube has ALL the long-form essays. Tiktok has ALL the humorous videos. Medium has ALL the opinion pieces. Reddit has ALL the flame wars. Mayo Clinic has ALL the drug side-effects. Amazon has ALL the shopping. Ebay has ALL the collectables.
None of these big companies want nasty little web crawlers poking and prodding their site. But they accept Google crawlers, because Google brings them users. Are they going to be that friendly to your crawler?
Of course, I still dream. Maybe a hub-based internet needs a hub-aware search engine?
ofrzeta · 5h ago
"The beefy CPU running this setup, a 32-core AMD EPYC 7532, underlines just how fast technology moves. At the time of its release in 2020, the processor alone would have cost more than $3,000. It can now be had on eBay for less than $200"
why do I never get deals like that when I am shopping for the homelab on eBay?
progval · 5h ago
You need to spend a lot of time looking through badly labeled offers, and be willing to buy from sellers with no reputation.
robrtsql · 5h ago
I searched "AMD EPYC 7532" and there are a ton of listings for $150-$200. Are you just regretful that it wasn't like this when you were shopping parts for your homelab?
throwawayffffas · 3h ago
I got a 7551p plus motherboard and ram for about 600 bucks from China this January. I may have overpaid but it works great, and gets the job done.
_fat_santa · 5h ago
Not for a CPU but earlier this year I bought a Thinkpad workstation off eBay for $500. It's a machine from 2020 and when it was new cost $5,700.
I see this for pretty much all hardware out on eBay, just go back 5 years and watch the price fall 10x.
saalweachter · 4h ago
Has eBay fixed their "and then they ship you a box of rocks" problem?
I feel like there was a five year span where everyone I talked to said buying or selling electronics on eBay was a nightmare, so I'm a little curious if I need to re-evaluate my priors.
buildbot · 4h ago
Yes, it’s extremely rare to be stuck with a broken/wrong/missing item as a buyer on eBay. Selling is quite risky in some ways because eBay will nearly always side with a buyer. Every missing or broken thing I have purchased has been refunded or replaced. On the other hand, 3 things I have sold were claimed to not arrive. The only case where eBay decided in my favor was when the buyer had signed for the package in a literal USPS office :)
apetresc · 4h ago
My understanding is that eBay sides with the buyer on all disputes, to the point of ridiculousness. So you should be fine.
The real issue is being a seller and solving the "and then the customer claims I shipped them a box of rocks" problem.
buildbot · 4h ago
Yep selling is way more risky. Ebay might be the most safe (refund wise) marketplace for buyers… I have more trouble with amazon.
throwawayffffas · 3h ago
You don't get that with used old stuff, you get it with unrealistic low prices for new stuff.
A 7532 CPU is now e-waste for all the datacenters out there, so 1/10 of the original price is reasonable, but the latest Nvidia GPU for 200 bucks is obviously a scam.
accrual · 3h ago
> Has eBay fixed their "and then they ship you a box of rocks" problem?
I've personally never had that problem after over a decade and hundreds of purchases on eBay. I've had some defective parts, but never outright fraud. IME eBay favors buyers.
Gormo · 2h ago
TheServerStore.com often has good deals. I actually bought a brand new 64-core EPYC 7702 server with 256 GB RAM and 8TB NVMe storage for about $3K fully assembled earlier this year.
ThatMedicIsASpy · 4h ago
Epyc7000+MB+256GB-512GB RAM (from china) usually starts at 800 euros + import tax
OJFord · 3h ago
'Google rival' is quite a stretch, surely 'search engine' is not just more accurate, but clearer too with all that Google does today, as if that's new.
lxe · 2h ago
This is a cool hobby project, but why is this notable? Why a FastCompany article? I'm trying to figure out anything that sets this apart from thousands of other little hobby search projects.
I understand companies like Perplexity or Brave or DuckDuckGo "rivaling Google", but building a hobby index and crawler is nice, and worthy of a "Show HN: "... but an actual media article?
gowld · 2h ago
It's only notable as a clickbait narrative for ignorant readers -- FastCompany's target market
> Why the laundry room? Two reasons: Heat and noise. Pearce’s server was initially in his bedroom, but the machine was so hot, it actually made it too uncomfortable to sleep.
This is a rite of passage and a badge of honor for homelabbers/tinkerers/hackers to discover for themselves IMHO. If you haven't tried it, you should. The heat is bad enough to warrant moving it, but add the noise too, sprinkle in a few nights of bad sleep, and it becomes an effective form of torture :-D
Just don't decide to move it to a closet unless you also install some fans in there. I ended up finding a cozy spot under the staircase which worked quite well
BLKNSLVR · 6h ago
Great innovation plus cloud-skeptic self-hosting. There should be much much more of this!
The great thing about this is that with the decentralization/recentralization of the Web, it may become easier for certain people to roll their own search engines for their respective communities and crawl/index pages only according to their shared tastes.
The bad thing about this is...read above.
iam_saurabh · 5h ago
I love stories like this—tech history is full of scrappy beginnings. Even if this project doesn’t succeed, it reminds us that giant companies aren’t unshakable.
mips_avatar · 1h ago
It’s amazing what indie builders are doing with vector search, but I’m not sure how long it will last. Pure vector search works well today largely because no one is seriously trying to game it yet. Once adversaries start targeting it like they do SEO, we could see the same problems. You can already glimpse the risk in Pinterest, where roughly half the results for many queries are AI slop - since their primary search is image vectors
When I started using it (~ 2 years) , it was necessary. Google was simply not solving any of my actual issues (software related).
Now, It seems that google might have improved a bit. I check from time to time and the gap isn't as huge, as when Kagi started
shayway · 5h ago
How does your experience with Searcha compare? It seems to be down at the moment.
the_third_wave · 5h ago
Do Kagi users get paid for shilling the company? Nearly all threads relating to the subject of search have a few mentions of the glory of Kagi, often including links to the site. I suspect this is not as effective as the Kagi crew thinks since there is likely to be a large overlap between their potential customers and those who are really turned off by such shilling.
dawnerd · 5h ago
Flip side how much does Google pay you to defend their monopoly? Kagi is a solid product with a team that clearly cares about what they’re building. They’re transparent and post change logs when things update. I simply trust them infinitely more than Google.
hamdingers · 5h ago
Have you considered it's a good product that causes its users to become advocates?
> The effect is most likely to occur when there are no obvious reasons for performing the task. Because expending effort to perform a useless or unenjoyable task, or experiencing unpleasant consequences in doing so, is cognitively inconsistent (see cognitive dissonance), people are assumed to shift their evaluations of the task in a positive direction to restore consistency.
It's not limited to physical effort. Wikipedia's example has embarrassment in place of effort; presumably, money could also work.
glenstein · 3h ago
TIL about effort justification! I think signing up for Kagi is not particularly effort-intensive however.
datadrivenangel · 5h ago
Kagi customer here. Not getting paid to shill. I think it's worth occasionally mentioning alternatives that are good enough to pay for so that other people know there are other people using other options.
But full disclosure, sometimes I'm using DuckDuckGo and it's also good enough most of the time that I occasionally forget until I go down some rabbit hole and realize that I'm using the wrong search engine.
jasonvorhe · 2h ago
Whenever I fall back to Google and see how terrible it has become I feel sorry for everyone still using it as their main search engine, so I tend to link people to Kagi because it's just so much better. Especially the customization aspects. I also like the idea of mainstreaming paying for critical services like search. No paid shilling whatsoever. Back in the early 2000s people used to drop links to Google whenever search engines were discussed because the alternatives were mostly bad.
Today we have Brave and the alternative Bing frontends but Kagi is still unrivaled because of how easy it is to remove shitty results.
testdelacc1 · 5h ago
Disclaimer: Not a Kagi user. Unlikely to use it.
I just don’t understand people who get so upset that someone might like something enough to talk about liking it. So upset that they won’t ever try the thing. Like … ok I guess? You do you. It’s just a strange way to make decisions.
At least this is just a consumer product. Worse is when people here say they make technical decisions using the same process. They’d black list certain tech because they’ve heard people talking about how it solved their problems. Also ok, but now I know I should avoid them professionally.
mdaniel · 5h ago
I get the impression it's the volume of the folks who sing its praises. There was a web3 crowd for a while, Bitwarden champions would show up to any mention of a password manager, and (ahem) some AI champions can be over the top
In all of these cases, a reasonable counterpoint is that if it were that applicable for all audiences, one wouldn't need to sing its praises, it would sing its own praises
ufmace · 4h ago
It sings its own praises... how exactly? Maybe by a bunch of happy users talking about how they like it and it's a better solution to the problem that the thread or article is about without being explicitly paid? Which is exactly what's happening here and some people are complaining about it?
testdelacc1 · 5h ago
How does a password manager sing its own praises?
koakuma-chan · 5h ago
I tried it, it's slow and bad and free tier is only 100 requests, and it's too expensive, and price is unjustified. I use gemini with google search grounding.
alexjplant · 5h ago
I understand skepticism in the age of LLM-generated content and CAPTCHA-solving bots. What I don't understand is why people choose such weird hills to die on and think that posting about it will accomplish anything. Do you think people will read your comment and go "gee, I was going to use Kagi but now I won't because this random person has a bad feeling about a series of comments they remember seeing"?
I signed up for a specialist forum not too long ago and posted an honest review of a product because I hadn't been able to find one anywhere on the internet. Immediately a bunch of people accused me of being a "shill" for a direct-to-consumer business that's been powered by a Yahoo storefront for the last 20 years, as though a business that's run by a guy with an AOL e-mail address is sophisticated enough to figure out Fiverr and astroturf their reputation on a phpBB forum.
Think about it for just a moment - do you really think that the Hacker News audience is large enough or full of enough tastemakers to sway an alternative search engine's market share? It isn't. If Kagi wanted to do that they'd hire TikTok influencers.
throwaway290 · 4h ago
no one else would pay for search. people on HN is probably 90% of their total possible market.
lelandbatey · 5h ago
Nope, it's just a nice thing I like. It is nearly the platonic ideal of a search engine for me. It causes me no problems and doesn't try to sell me garbage.
It's like discovering that there's a better pair of shoes that's more comfortable. Everybody can use a slightly improved, more comfortable pair of shoes, so it comes up frequently.
tmdetect · 5h ago
Kagi is a polished product. This is drying someone's laundry.
Google was invented many years ago by two guys in a dorm room and since then there's been so many white papers and advancements in the public sphere and the actual underlying problem has not changed that much, that it seems like it could be done by a small group or independent person.
dec0dedab0de · 6h ago
Crawling is much more difficult than it used to be. Significantly more content is behind a login, Javascript is required for way more than it should be, and almost the entire web is behind cloudflare or another type of captcha.
marginalia_nu · 3h ago
These things are actually fairly small problems.
The parts that absolutely require JS can't be reliably linked to and nobody indexes that stuff. Most apparent SPA:s serve a HTML alternative if you don't claim to be a web browser in the UA.
Cloudflare and the like are also fairly easy to deal with as long as your crawler is well behaved. You can register the fingerprint and mostly get access to cf:ed websites.
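One common ingredient of a "well behaved" crawler is honoring robots.txt and any crawl delay it declares. A minimal sketch using Python's standard `urllib.robotparser` follows; the user agent string `ExampleHobbyCrawler` and the helper names are placeholders, and this covers only the robots.txt part, not any bot registration with Cloudflare.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleHobbyCrawler"  # hypothetical crawler name


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str) -> bool:
    """True if the rules permit our user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def delay_seconds(parser: RobotFileParser, default: float = 1.0) -> float:
    """Use the site's Crawl-delay if declared, else a conservative default."""
    declared = parser.crawl_delay(USER_AGENT)
    return float(declared) if declared is not None else default
```

A crawler loop would call `allowed()` before each request and sleep for `delay_seconds()` between requests to the same host.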
non_aligned · 5h ago
I think there are two factors that helped Google. First, the search engine landscape back then was absolutely abysmal. I'm sure someone will chime in saying that it's abysmal today as well, but the reality is that 99%+ of consumer searches get good results today. And that's simply because the nature of search has changed: we have billions of people using the internet, and they overwhelmingly just search for products to buy, local restaurants that offer takeout, or for familiar pop content to watch or listen to. And there's some SEO spam there, but also pretty fierce quality assurance by search engines.
Second, the internet was different: when all nerds declared that Google is good, that was CNN-grade newsworthy (and CNN used to matter a lot more back then), simply because the internet seemed kinda important, but there was no other authority on the topic. Today, that's not the case. If you need someone to opine on the internet on air, you invite some political pundit or a business analyst.
So no, I don't think you can repeat the success of Google the same way. It was a product of its time.
snek_case · 3h ago
Google maps is probably a big moat that's very hard to replicate. You can't as easily just crawl all of that data. It's not easy to generate directions. The average user doesn't want to use your search engine for one thing and Google for everything else, they just want a one stop shop for search.
That's what I was expecting this submission to be about, although to be honest I'm not certain that Marginalia would want the influx of a fastcompany sized tire kicking
marginalia_nu · 3h ago
To be fair I'm on a colocated server now. No more apartment hosting for me.
jrm4 · 5h ago
More to the point, it's a shame that we can't collectively grok (dammit, they took that from us too) concepts like "personal" and/or "curated" directories, e.g. individual and group wikis and so forth on perhaps more directed topics with lists of good links.
cosmicgadget · 4h ago
Other than the obvious (but surmountable) technical challenges with crawling and indexing, trying to establish "goodness" for a given user is tough. For a blogger it will be "hey, you are reading this so you probably like what I like". That's often true but as soon as you try to have a centralized service with arbitrary users, it is hard to do anything better than filtering purely commercial content.
sdf4j · 5h ago
what you mean we can't? there are a lot of curated content directories out there.
jrm4 · 4h ago
Right, I suppose I mean "getting more people to think about why a few of these bookmarked for your favorite topics, especially tied to a trustworthy person, is a million times better than just hitting up Google."
Or, perhaps, a "a better Google should just take you to these."
Something like that.
CalRobert · 6h ago
Among other things, I think crawling is a lot harder now.
ambicapter · 5h ago
Google basically invented the modern cloud in order to efficiently use the hardware necessary to actually build those search engine indices. It's not really a question of implementing a good algorithm and away we go.
lif · 5h ago
Provided they have the kind of massive government support Google has had from the get-go, sure!
OutOfHere · 6h ago
The actual underlying problem has changed altogether. Pagerank is easily gamed by SEO.
Search candidates and rankings now require assessment by LLM. Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.
Crawling too requires innovative approaches to bypass server filters.
I doubt any independent person can afford to run a vector database or LLMs at immense scale.
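To make the "easily gamed" point concrete, here is a sketch of textbook PageRank by power iteration, followed by a toy link farm. This is the classic published formulation with a damping factor, not whatever Google runs today, and the graph and page names are invented for illustration.

```python
def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Textbook PageRank via power iteration.

    `links` maps each page to the list of pages it links to.
    Pages with no outgoing links spread their rank evenly over all pages.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# A small honest graph, then the same graph plus five farm pages linking to "c".
honest = {"a": ["b"], "b": ["a"], "c": ["a"]}
farmed = dict(honest, **{f"farm{i}": ["c"] for i in range(5)})
print(pagerank(honest)["c"], pagerank(farmed)["c"])
```

In the farmed graph, page "c" gains rank relative to "b" purely from pages that exist only to link to it, which is the basic shape of link-based SEO manipulation.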
kcbanner · 6h ago
> users want the results intelligently synthesized into a text response with references rather than as raw results.
The reason I pay for Kagi is that I specifically don't want this to occur.
OutOfHere · 6h ago
If you pay for a service (web search) that 99.9% use for free, you're an extreme outlier, and not necessarily a justifiable one either. After all, DDG, Google and various others still have raw results for free.
Workaccount2 · 6h ago
How much do you technologically relate to the average person on the street though?
Every person I have seen (outside the tiny tech bubble) google something has just read the AI overview without skipping a beat.
yepitwas · 6h ago
That's worrisome since I've seen those be for-sure wrong a pretty high percentage of the time.
[EDIT] Incidentally, are there any sites that do actual web search any more, better than Yandex? I'd rather avoid a Russian site if I can, but there are whole topics where it's impossible to find anything useful on heavily "massaged" allegedly-Web-search-but-not-really sites like Google and DDG (Bing), but I can find what I want on page 1 or 2 of a Yandex search. Is Kagi as good as that, or is their index simply ignoring a whole bunch of the Web like so many others? I don't mind paying.
degamad · 5h ago
Google "Web" results (not the default results you get when you search) still seem okay for me. You can force them with the udm=14 url trick, or select the "Web" tab in the results. No AI, no images or shopping results, and slightly better text results.
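The URL trick is just an extra query parameter. A small sketch that builds such a URL with the standard library; the function name is made up, and `udm=14` is the parameter mentioned above rather than something documented here.

```python
from urllib.parse import urlencode


def google_web_url(query: str) -> str:
    """Build a Google search URL that requests the plain "Web" results tab."""
    return "https://www.google.com/search?" + urlencode({"q": query, "udm": "14"})


print(google_web_url("venison tenderloin"))
```

Most browsers let you register a URL like this as a custom search engine with the query as a placeholder.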
franktankbank · 5h ago
Yep, same here. Ask it "should I wash venison tenderloin" and you get an initial "No, because" followed by a general "yes it's important to clean including with water" in the longer description. Wow, a self-contradictory answer! Good job!
jkestner · 6h ago
We’re being force fed them. I’m an AI hater and I catch myself reading those sometimes.
Yes, people want the answer directly. Google wants you to stay on their site to read some mishmash. I think the ideal would be to immediately go to the source’s site.
throwmeaway222 · 6h ago
At this point the web is also so centralized you only need 3 bookmarks these days (your news, youtube and Amazon)
A search is just learning what you don't know and AI does a better job than search has ever done for me - and I'm in tech.
freeopinion · 3h ago
> users want the results intelligently synthesized into a text response with references rather than as raw results
This leads directly to another big change.
People used to submit their sites to search engines and now they might actively block search engines. So a search engine author might have to spend a lot of effort in adversarial games.
ricardo81 · 5h ago
>Pagerank
Also a lot of site owners are reluctant to link out. So much so that 'nofollow' has been reduced to a hint rather than a directive.
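For a hobby crawler, the practical question is what to do with `rel="nofollow"` links when building a link graph. A minimal sketch that separates them using the standard `html.parser`; the class and function names are illustrative, and whether to count nofollow links as a weak signal or ignore them is left to the crawler author.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect anchor hrefs, split by whether rel contains "nofollow"."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel is a space-separated token list, e.g. "nofollow ugc".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def split_links(html: str):
    """Return (followed, nofollow) lists of hrefs found in the HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.followed, collector.nofollow
```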
iamacyborg · 6h ago
> Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.
Citation needed
OutOfHere · 5h ago
You mean all the users of chat services aren't evidence? Chat services increasingly incorporate web links for references in their responses, and this is as the users seek. The tide continues to shift from traditional search to LLM synthesis.
iamacyborg · 5h ago
I suspect there are more users of traditional search than there are of llm chat apps.
freeopinion · 3h ago
I suspect that chat apps dominate (80+%?) the under-20 demographic, and have a sizable chunk of the under-30 demographic. Within the next five years it will probably represent 50+% of total search traffic. Maybe it already does. It makes sense that any search site that wants to be in the game tomorrow would keep racing down the AI chat path.
vlucas · 5h ago
> “I think it’s definitely lowered the barrier,” Lin says of the LLM’s role in enabling DIY search engines. “To me, it seems like the only barrier to actually competing with Google, creating an alternate search engine, is not so much the technology, it’s mostly the market forces.”
Oh sweet summer child
HardCodedBias · 4h ago
I know that Google engineers have a cushy life but I actually find it unlikely that a guy, who isn't attempting some radical new type of search (like pagerank back in the day) can hope to compete with the orgs in Google who support search.
Again, those orgs are likely too comfortable and less productive than people would like, but we're talking about many-many thousands and depending upon how you define "the work" of search upwards of 10k.
I didn't see any new secret sauce in the article and Google has said that since 2015 (?) Google Brain has been involved in search.
This is not to say that Google couldn't be dislodged by search via LLM or similar, that is "new" research.
freeopinion · 3h ago
If you wrote that 100 people could outwork one person, I'd nod my head. If you wrote that 10k people could outwork 1k people, I'd shrug. If you tell me that 100 people can combine to tie my shoe faster than I can, I'd question that.
Building a state-of-the-art search engine is not shoelaces. But upwards of 10k workers is not impressive in the right direction.
One person starting out with anything at all can quickly grow into one person with one or two really innovative ideas. One or two good ideas can catch fire pretty quickly. Don't be too dismissive.
p3rls · 4h ago
i've been thinking that google could use its own AI to evaluate URLs instead of relying on pagerank and backlinks which are almost completely valueless as a signal in 2025. in my niche there's more slop than ever being produced daily and it's all hitting rank 1. it's tragic what google is doing to the internet.
Oarch · 5h ago
I'm sure there's a money laundering joke in here somewhere
mooiedingen · 4h ago
Nothing new as it has been done before, the concept is simple enough:
step 1: indexer, solr/lucene
Step 2: crawler of which there are several foss, build one yourself?
or you just run yacy, which is a combo of the above; combine it with an oldschool searx instance and you will be granted the title as seeker by the spirit of Fravia+ who was elder of the searchlores!!! Not only will you filter crap made by machine learning models, but thou shall find what thou seek! I refuse to call a 16 line long for loop triggering in memory loaded tokenized data where data can be anything from a scientific paper hallucinated by a chatbot to a message between two lovers anything intelligent for it is not intelligence but a blob of tokenized fcking data in memory getting triggered for an output by a derp with a 16 line long for loop!!!
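For readers who have not seen the indexer half of that recipe, here is a toy inverted index in plain Python. It is only a sketch of the idea behind tools like Lucene/Solr, with no ranking, stemming, or persistence, and the document URLs and function names are invented.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict) -> dict:
    """Map each term to the set of document URLs containing it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index: dict, query: str) -> set:
    """Return the URLs that contain every term of the query (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "https://example.org/dinos": "A blog about dinosaurs and fossils",
    "https://example.org/laundry": "Running a search engine from the laundry room",
}
index = build_index(docs)
print(search(index, "search engine"))
```

A crawler feeds `docs`; everything hard about search (ranking, spam, scale) starts after this point.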
I have 1542766 domains. Might not be much, but it is an honest work.
It is available as a github repo, so anybody that wants to start crawling has some initial data to kick off.
Links
https://github.com/rumca-js/Internet-Places-Database
https://commoncrawl.org
https://www.proxyrack.com/residential-proxies/
https://archive.is/HA7y4
Some bits and pieces:
> his new search engine, the robust Search-a-Page <https://searcha.page>, which has a privacy-focused variant called Seek Ninja <https://seek.ninja>
> The secret to making it all happen? Large language models. “What I’m doing is actually very traditional search,” Pearce says. “It’s what Google did probably 20 years ago, except the only tweak is that I do use AI to do keyword expansion and assist with the context understanding
> Fellow ambitious hobbyist Wilson Lin, who on his personal blog <https://blog.wilsonl.in/search-engine/> recently described his efforts to create a search engine of his own, took the opposite approach from Pearce.
> And then there’s the concept of doing a small-site search, along the lines of the noncommercial search engine Marginalia <https://marginalia-search.com>, which favors small sites over Big Tech
And the obvious answer to the title: "Why the laundry room? Two reasons: Heat and noise." It runs on a 32-core AMD EPYC 7532 with half a terabyte of RAM, and "all in, cost $5,000, with about $3,000 of that going toward storage".
I've daydreamed about how I'd create my own search engine so, so many times. But I always run into an impassable wall: The internet now isn't at all the same as the internet in 1999.
Discovery isn't really that useful. If you find someone's self-hosted blog about dinosaurs, it probably hasn't been updated since 2004, all the links and images are broken, and it's just thoroughly upstaged by Wikipedia and the Smithsonian. Sure, it's fun to find these quirky sites, but they aren't as valuable as they once were.
We've basically come full circle to the AOL model, where there are "hubs" of content that cater to specific categories. YouTube has ALL the long-form essays. Tiktok has ALL the humorous videos. Medium has ALL the opinion pieces. Reddit has ALL the flame wars. Mayo Clinic has ALL the drug side-effects. Amazon has ALL the shopping. Ebay has ALL the collectables.
None of these big companies want nasty little web crawlers poking and prodding their site. But they accept Google crawlers, because Google brings them users. Are they going to be that friendly to your crawler?
Of course, I still dream. Maybe a hub-based internet needs a hub-aware search engine?
why do I never get deals like that when I am shopping for the homelab on eBay?
I see this for pretty much all hardware out on eBay, just go back 5 years and watch the price fall 10x.
I feel like there was a five year span where everyone I talked to said buying or selling electronics on eBay was a nightmare, so I'm a little curious if I need to re-evaluate my priors.
The real issue is being a seller and solving the "and then the customer claims I shipped them a box of rocks" problem.
A 7532 CPU is now e-waste for all the datacenters out there, so 1/10 of the original price is reasonable, but the latest Nvidia GPU for 200 bucks is obviously a scam.
I've personally never had that problem after over a decade and hundreds of purchases on eBay. I've had some defective parts, but never outright fraud. IME eBay favors buyers.
I understand companies like Perplexity or Brave or DuckDuckGo "rivaling Google", and building a hobby index and crawler is nice, and worthy of a "Show HN: "... but an actual media article?
This is a rite of passage and a badge of honor for homelabbers/tinkerers/hackers to discover for themselves IMHO. If you haven't tried it, you should. The heat is bad enough to warrant moving it, but add the noise too, sprinkle in a few nights of bad sleep, and it becomes an effective form of torture :-D
Just don't decide to move it to a closet unless you also install some fans in there. I ended up finding a cozy spot under the staircase which worked quite well
- SearchaPage - Web Search Engine https://searcha.page/
- Seek Ninja - Stealthy Search Engine https://seek.ninja/
Both of them are erroring out right now?
What are some good practices these days to ensure a good crawl/scrape? Invest in proxies, preferably residential?
The bad thing about this is...read above.
When I started using it (~2 years ago), it was necessary. Google was simply not solving any of my actual issues (software related).
Now it seems that Google might have improved a bit. I check from time to time, and the gap isn't as huge as when Kagi started.
[1] https://en.wikipedia.org/wiki/Effort_justification
I’m not following you.
https://dictionary.apa.org/effort-justification
But full disclosure, sometimes I'm using DuckDuckGo and it's also good enough most of the time that I occasionally forget until I go down some rabbit hole and realize that I'm using the wrong search engine.
Today we have Brave and the alternative Bing frontends, but Kagi is still unrivaled because of how easy it is to remove shitty results.
I just don’t understand people who get so upset that someone might like something enough to talk about liking it. So upset that they won’t ever try the thing. Like … ok I guess? You do you. It’s just a strange way to make decisions.
At least this is just a consumer product. Worse is when people here say they make technical decisions using the same process. They'd blacklist certain tech because they've heard people talking about how it solved their problems. Also ok, but now I know I should avoid them professionally.
In all of these cases, a reasonable counterpoint is that if it were that applicable for all audiences, one wouldn't need to sing its praises, it would sing its own praises
I signed up for a specialist forum not too long ago and posted an honest review of a product because I hadn't been able to find one anywhere on the internet. Immediately a bunch of people accused me of being a "shill" for a direct-to-consumer business that's been powered by a Yahoo storefront for the last 20 years, as though a business that's run by a guy with an AOL e-mail address is sophisticated enough to figure out Fiverr and astroturf their reputation on a phpBB forum.
Think about it for just a moment - do you really think that the Hacker News audience is large enough or full of enough tastemakers to sway an alternative search engine's market share? It isn't. If Kagi wanted to do that they'd hire TikTok influencers.
It's like discovering that there's a better pair of shoes that's more comfortable. Everybody can use a slightly improved, more comfortable pair of shoes, so it comes up frequently.
Google was invented many years ago by two guys in a dorm room. Since then there have been so many white papers and advancements in the public sphere, and the actual underlying problem has changed so little, that it seems like it could be done by a small group or an independent person.
The parts that absolutely require JS can't be reliably linked to, and nobody indexes that stuff. Most apparent SPAs serve an HTML alternative if you don't claim to be a web browser in the UA.
Cloudflare and the like are also fairly easy to deal with as long as your crawler is well behaved. You can register the fingerprint and mostly get access to Cloudflare-protected websites.
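For anyone wondering what "well behaved" can look like in practice, here is a minimal sketch, assuming it means at least honoring robots.txt rules and crawl delays (that reading is an assumption, and this does not cover registering a crawler with Cloudflare). The user agent name and the robots.txt content are made up for illustration; only Python's standard `urllib.robotparser` is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name and robots.txt body, for illustration only.
USER_AGENT = "example-hobby-crawler"
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was already downloaded elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(policy: RobotFileParser, url: str) -> bool:
    """True if the rules allow our user agent to fetch this URL."""
    return policy.can_fetch(USER_AGENT, url)


def delay_seconds(policy: RobotFileParser, default: float = 1.0) -> float:
    """Use the site's Crawl-delay if present, else a conservative default."""
    delay = policy.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


if __name__ == "__main__":
    policy = build_policy(ROBOTS_TXT)
    for url in ("https://example.com/blog/post", "https://example.com/private/notes"):
        print(url, may_fetch(policy, url))
    print("wait between requests:", delay_seconds(policy))
```

A real crawler would fetch each host's robots.txt once, cache the parsed policy, and sleep for `delay_seconds` between requests to that host.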
Second, the internet was different: when all nerds declared that Google is good, that was CNN-grade newsworthy (and CNN used to matter a lot more back then), simply because the internet seemed kinda important, but there was no other authority on the topic. Today, that's not the case. If you need someone to opine on the internet on air, you invite some political pundit or a business analyst.
So no, I don't think you can repeat the success of Google the same way. It was a product of its time.
Or, perhaps, "a better Google should just take you to these."
Something like that.
Search candidates and rankings now require assessment by an LLM. Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.
Crawling too requires innovative approaches to bypass server filters.
I doubt any independent person can afford to run a vector database or LLMs at immense scale.
The reason I pay for Kagi is that I specifically don't want this to occur.
Every person I have seen (outside the tiny tech bubble) google something has just read the AI overview without skipping a beat.
[EDIT] Incidentally, are there any sites that do actual web search any more, better than Yandex? I'd rather avoid a Russian site if I can, but there are whole topics where it's impossible to find anything useful on heavily "massaged" allegedly-Web-search-but-not-really sites like Google and DDG (Bing), but I can find what I want on page 1 or 2 of a Yandex search. Is Kagi as good as that, or is their index simply ignoring a whole bunch of the Web like so many others? I don't mind paying.
Yes, people want the answer directly. Google wants you to stay on their site to read some mishmash. I think the ideal would be to immediately go to the source’s site.
A search is just learning what you don't know and AI does a better job than search has ever done for me - and I'm in tech.
This leads directly to another big change.
People used to submit their sites to search engines and now they might actively block search engines. So a search engine author might have to spend a lot of effort in adversarial games.
Also a lot of site owners are reluctant to link out. So much so that 'nofollow' has been reduced to a hint rather than a directive.
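To make the "hint rather than a directive" point concrete, here is a rough sketch of how a hobby crawler might treat `rel="nofollow"`: the link is still collected, just given a lower weight instead of being dropped. The 0.25 weight and the function names are arbitrary assumptions for illustration, not how any particular search engine does it.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, is_nofollow) pairs from anchor tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[tuple[str, bool]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return  # anchors without a target are not links
        # rel is a space-separated token list, e.g. "nofollow ugc"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


def extract_links(html: str) -> list[tuple[str, bool]]:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def link_weight(is_nofollow: bool, nofollow_weight: float = 0.25) -> float:
    """Treat nofollow as a hint: down-weight the link rather than ignore it.

    The 0.25 default is an arbitrary illustrative value.
    """
    return nofollow_weight if is_nofollow else 1.0


if __name__ == "__main__":
    sample = (
        '<p><a href="https://a.example/">A</a> '
        '<a rel="nofollow ugc" href="https://b.example/">B</a></p>'
    )
    for href, nofollow in extract_links(sample):
        print(href, link_weight(nofollow))
```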
Citation needed
Oh sweet summer child
Again, those orgs are likely too comfortable and less productive than people would like, but we're talking about many-many thousands and depending upon how you define "the work" of search upwards of 10k.
I didn't see any new secret sauce in the article, and Google has said that Google Brain has been involved in search since 2015 (?).
This is not to say that Google couldn't be dislodged by search via LLM or similar, that is "new" research.
Building a state-of-the-art search engine is not shoelaces. But upwards of 10k workers is not impressive in the right direction.
One person starting out with anything at all can quickly grow into one person with one or two really innovative ideas. One or two good ideas can catch fire pretty quickly. Don't be too dismissive.