One important distinction is that the strength of LLMs isn't just in storing or retrieving knowledge the way Wikipedia does; it's in comprehension.
LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer. They can explain complex ideas in simpler terms, adapt responses based on the user's level of understanding, and connect dots across disciplines.
In a "rebooting society" scenario, that kind of interactive comprehension could be more valuable. You wouldn’t just have a frozen snapshot of knowledge, you’d have a tool that can help people use it, even if they’re starting with limited background.
progval · 7h ago
An unreliable computer treated as a god by a pre-information-age society sounds like a Star Trek episode.
gretch · 6h ago
Definitely sounds like a plausible and fun episode.
On the other hand, real history is filled with all sorts of things being treated as a god that were much worse than "unreliable computer". For example, a lot of times it's just a human with malice.
So how bad could it really get
dmonitor · 3h ago
"as bad as it can get" is somewhere in the realm of universal paperclips
DrillShopper · 3h ago
> So how bad could it really get
I don't know. How about we ask some of the peoples who have been destroyed on the word of a single infallible malicious leader.
Oh wait, we can't. They're dead.
Any other questions?
jack_pp · 2h ago
Remember the first time you touched a computer, the first game you ever played or the first little script you wrote that did something useful.
I imagine this is how a lot of people feel when using LLMs, especially now that they're new.
It is the most incredible technology created up to this point in our history, imo, and the cynicism on HN is astounding to me.
zzzeek · 2h ago
You might want to read about a technology called "farming". Pretty sure as far as transformative incredible technologies, the ability for humans to create nourishment at global scale blows the pants off the text / image imitation machine
xp84 · 2h ago
I think you're probably right, but more because of erroneous categorization of what is a “technology.” We take for granted technology older than like 600 years ago (basically most people would say the printing press is a technology and maybe forget that the wheel and, indeed, crop cultivation are too). AI could certainly be in the top 3 most significant technologies developed since (and including) the printing press. We'll likely find out just where it ends up within the decade.
thaumasiotes · 1h ago
> We take for granted technology older than like 600 years ago (basically most people would say the printing press is a technology and maybe forget that the wheel and, indeed, crop cultivation are too).
The printing press is more than 600 years old. It's more than 1200 years old.
mistrial9 · 2h ago
no technology exists in a vacuum.. there is a sociology, needs matching, and pyramid of control involved.. more than that.
> cynicism on HN
lots of different replies on YNews, from very different people, from very different socio-economic niches
spauldo · 3h ago
I've seen that plot used. In the Schlock Mercenary universe, it's even a standard policy to leave intelligent AI advisors on underdeveloped planets to raise the tech level and fast-track them to space. The particular one they used wound up being thrown into a volcano and its power source caused a massive eruption.
bryanrasmussen · 6h ago
hey generally everything worked pretty good in those societies, it was only people who didn't fit in who had a brief painful headache and then died!
goosejuice · 58m ago
Also a recent episode of Lazarus. Though s/pre-information-age/cult
russfink · 1h ago
Are you not of the body?
bigyabai · 6h ago
Or the plot to 2001 if you managed to stay awake long enough.
colechristensen · 3h ago
it's fun that i carry around a little box with vaguely correct information about mostly everything i could ask for
BobbyTables2 · 1h ago
Not sure if “more” valuable but certainly valuable.
I strongly dislike the way AI is being used right now. I feel like it is fundamentally an autocomplete on steroids.
That said, I admit it works as a far better search engine than Google. I can ask Copilot a terse question in quick mode and get a decent answer often.
However, if I ask it extremely in-depth technical questions, it hallucinates like crazy.
It also requires suspicion. I asked it to create a repo file for an old CentOS release on vault.centos.org. The output was flawless except one detail — it specified the gpgkey for RPM verification not using a local file but using plain HTTP. I wouldn’t be upset about HTTPS (that site even supports it), but the answer presented managed to completely thwart security with the absence of a single character…
gonzobonzo · 5h ago
Indeed. Ideally, you don't want to trust other people's summaries of sources; you want to look at the sources yourself, often with a critical eye. This is one of the things that everyone gets taught in school, everyone says they agree with, and then just about no one does (and at times, people will outright disparage the idea). Once out of school, tertiary sources get treated as if they're completely reliable.
I've found using LLMs to be a good way of getting an idea of where the current historiography of a topic stands, and which sources I should dive into. Conversely, I've been disappointed by the number of Wikipedia editors who become outright hostile when you say that Wikipedia is unreliable and that people often need to dive into the sources to get a better understanding of things. There have been some Wikipedia articles I've come across that have been so unreliable that people who didn't look at other sources would have been greatly misled.
rendx · 2h ago
> There have been some Wikipedia articles I've come across that have been so unreliable that people who didn't look at other sources would have been greatly misled.
I would highly appreciate if you were to leave a comment e.g. on the talk page of such articles. Thanks!
blackoil · 2h ago
A trustless society can't function or progress much. I trust the doctors who treat me and the civil engineers who built my house, and even in software, which I pretend to be an expert in, I haven't seen the source code of any OS or browser I use; I trust the companies or OSS devs behind them.
Most of this is based on reputation. LLMs are the same; I just have to calibrate my level of trust as I use them.
beeflet · 4h ago
I think some combination of both search (perhaps of an offline database of wikipedia and other sources) and a local LLM would be the best, as long as the LLM is terse and provides links to relevant pages.
I find LLMs with the search functionality to be weak because they blab on too much when they should be giving me more outgoing links I can use to find more information.
In a 'rebooting society' doomsday scenario you're assuming that our language and understanding would persist. An LLM would essentially be a blackbox that you cannot understand or decipher, and would be doubly prone to hallucinations and issues when interacting with it using a language it was not trained on. Wikipedia is something you could gradually untangle, especially if the downloaded version also contained associated images.
lblume · 6h ago
I would not subscribe to your certainty. With LLMs, even empty or nonsensical prompts yield answers, however faulty they may be. Based on its level of comprehension and ability to generalize between languages I would not be too surprised to see LLMs being able to communicate on a very superficial level in a language not part of the training data. Furthermore, the compression ratio seems to be much better with LLMs compared to Wikipedia, considering the generality of questions one can pose to e.g. Qwen that Wikipedia cannot answer even when knowing how to navigate the site properly. It could also come down to the classic dichotomy between symbolic expert systems and connectionist neural networks which has historically and empirically been decisively won by the latter.
Timwi · 2h ago
You'd have to go many generations after the doomsday before language evolves enough for that to be a problem.
thakoppno · 4h ago
> associated images
fun to imagine whether images help in this scenario
ranger_danger · 5h ago
> LLMs will return faulty or imprecise information at times
To be fair, so do humans and wikipedia.
redserk · 4h ago
It appears many non-tech people accept that humans can be incorrect, but refuse to hold LLMs to the same standard, despite the warnings.
internetter · 4h ago
On average, it is reasonable to expect that wikipedia will be more correct than an LLM
Timwi · 2h ago
I'm not surprised, given the depiction of artificial intelligence in science fiction. Characters like Data in TNG, Number 5 in Short Circuit, etc., are invariably depicted as having perfect memory, infallible logic, super speed of information processing, etc. Real-life AI has turned out very differently, but anyone who isn't exposed to it full time, but was exposed to some of those works of science fiction, will reasonably make the assumptions promulgated by the science fiction.
belter · 6h ago
> LLMs will return faulty or imprecise information at times, but what they can do is understand vague or poorly formed questions and help guide a user toward an answer.
- "'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' "
ineedasername · 2h ago
Per Anthropic's publications? Sort of. When they've observed its reasoning paths, Claude has come to correct responses from incorrect reasoning. Of course humans do that all the time too, and the reverse. So, human-ish AGI?
cyanydeez · 6h ago
which means you'd still want Wikipedia, as the imprecision will get in the way of real progress beyond the basics.
simonw · 7h ago
This is a sensible comparison.
My "help reboot society with the help of my little USB stick" thing was a throwaway remark to the journalist at a random point in the interview, I didn't anticipate them using it in the article! https://www.technologyreview.com/2025/07/17/1120391/how-to-r...
A bunch of people have pointed out that downloading Wikipedia itself onto a USB stick is sensible, and I agree with them.
Wikipedia dumps default to MySQL, so I'd prefer to convert that to SQLite and get SQLite FTS working.
1TB or more USB sticks are pretty available these days so it's not like there's a space shortage to worry about for that.
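For anyone curious what the SQLite FTS side of that could look like, here is a minimal sketch using Python's built-in sqlite3 module and an FTS5 virtual table. The table layout, the file name, and the assumption that (title, body) pairs have already been extracted from the dump are illustrative, not a description of what Simon actually built:
import sqlite3

# Minimal sketch: index (title, body) pairs already extracted from a
# Wikipedia dump (that extraction step is assumed here), then run a
# full-text query. FTS5 is available in most SQLite builds, including the
# one bundled with CPython on common platforms.
conn = sqlite3.connect("wikipedia.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")

pages = [
    ("Water purification", "Boiling, filtration and chlorination are common methods..."),
    ("Penicillin", "Penicillin was discovered in 1928 by Alexander Fleming..."),
]
conn.executemany("INSERT INTO articles VALUES (?, ?)", pages)
conn.commit()

# bm25() scores matches; lower is better, so ascending order puts the most
# relevant articles first.
for (title,) in conn.execute(
    "SELECT title FROM articles WHERE articles MATCH ? ORDER BY bm25(articles) LIMIT 5",
    ("water chlorination",),
):
    print(title)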
0xDEAFBEAD · 1h ago
Someone should start a company selling USB sticks pre-loaded with lots of prepper knowledge of this type. In addition to making money, your USB sticks could make a real difference in the event of a global catastrophe. You could sell the USB stick in a little box which protects it from electromagnetic interference in the event of a solar flare or EMP.
I suppose the most important knowledge to preserve is knowledge about global catastrophic risks, so after the event, humanity can put the pieces back together and stop something similar from happening again. Too bad this book is copyrighted or you could download it to the USB stick: https://www.amazon.com/Global-Catastrophic-Risks-Nick-Bostro... I imagine there might be some webpages to crawl, however: https://www.lesswrong.com/w/existential-risk
https://www.amazon.com/WikiReader-PANREADER-Pocket-Wikipedia...
> All digitized books ever written/encoded compress to a few TB.
I tried to estimate how much data this actually is in raw text form:
# annas archive stats
papers = 105714890
books = 52670695
# word count estimates
avrg_words_per_paper = 10000
avrg_words_per_book = 100000
words = (papers*avrg_words_per_paper + books*avrg_words_per_book)
# quick sample text of ~27 million words from a few books
sample_words = 27809550
sample_bytes = 158824661
sample_bytes_comp = 28839837 # using zpaq -m5
bytes_per_word = sample_bytes/sample_words
byte_comp_ratio = sample_bytes_comp/sample_bytes
word_comp_ratio = bytes_per_word*byte_comp_ratio
print("total:", words*bytes_per_word*1e-12, "TB") # total: 30.10238345855199 TB
print("compressed:", words*word_comp_ratio*1e-12, "TB") # compressed: 5.466077036085319 TB
So uncompressed ~30 TB and compressed ~5.5 TB of data.
That fits on three 2TB micro SD cards, which you could buy for a total of $750 from SanDisk.
fumeux_fume · 3h ago
Of course that's the angle they decided to open the article with. That they feel the need to frame these tools using the most grandiose terms bothers me. How does it make you feel?
simonw · 2h ago
It was a joke, and I was laughing when I told the reporter, but it's not obvious to me if it comes across as a joke the way it was reported.
But then it's also one of those jokes which has a tiny element of truth to it.
So I think I'm OK with how it comes across. Having that joke played straight in MIT Technology Review made me smile.
Importantly (to me) it's not misleading: I genuinely do believe that, given a post-apocalyptic scenario following a societal collapse, Mistral Small 3.2 on a solar-powered laptop would be a genuinely useful thing to have.
jjice · 3h ago
Oh interesting idea to use SQLite and their FTS. I was very impressed by the quality of their FTS and this sounds like a great use case.
cyanydeez · 6h ago
the real value would be in both of them. the LLM is good for refining/interpreting questions or longer-form progress issues, and the wiki would be the actual information for each component of whatever you're trying to do.
But neither are sufficient for modern technology beyond pointing to a starting point.
badsectoracula · 5h ago
I've found this amusing because right now i'm downloading `wikipedia_en_all_maxi_2024-01.zim` so i can use it with an LLM with pages extracted using `libzim` :-P. AFAICT the zim files have the pages as HTML and the file i'm downloading is ~100GB.
(reason: trying to cross-reference the tons of downloaded games on my HDD - for which i only have titles, as i never bothered to do any further categorization over the years aside from the place i got them from - with wikipedia articles - assuming they have one - to organize them into genres, add some info, etc. and after some experimentation it turns out an LLM - specifically a quantized Mistral Small 3.2 - can make some sense of the chaos while being fast enough to run from scripts via a custom llama.cpp program)
zuluonezero · 1h ago
Now this is the kind of juicy tidbit I read HN for! A proper comment about doing something technical, with something that's been invested in personally, in an interesting manner. With just enough detail to tantalise. This seems like the best use of GenAI so far. Not writing my code for me, or helping me grok something I should just be reading the source for, or pumping up a stupid startup funding grab. I've been working through building an LLM from scratch and this is one time it actually appears useful, because for the life of me I just can't seem to find much value in it so far. I must have more to learn, so thanks for the pointer.
twotwotwo · 3h ago
The "they do different things" bullet is worth expanding.
Wikipedia, arXiv dumps, open-source code you download, etc. have code that runs and information that, whatever its flaws, is usually not guessed. It's also cheap to search, and often ready-made for something--FOSS apps are runnable, wiki will introduce or survey a topic, and so on.
LLMs, smaller ones especially, will make stuff up, but can try to take questions that aren't clean keyword searches, and theoretically make some tasks qualitatively easier: one could read through a mountain of raw info for the response to a question, say.
The scenario in the original quote is too ambitious for me to really think about now, but just thinking about coding offline for a spell, I imagine having a better time calling into existing libraries for whatever I can rather than trying to rebuild them, even assuming a good coding assistant. Maybe there's an analogy with non-coding tasks?
A blind spot: I have no real experience with local models; I don't have any hardware that can run 'em well. Just going by public benchmarks like Aider's it appears ones like Qwen3 32B can handle some coding, so figure I should assume there's some use there.
omneity · 4h ago
I just posted incidentally about Wikipedia Monthly[0], a monthly dump of Wikipedia broken down by language, with the MediaWiki markup cleaned into plain text, so it's perfect for a local search index or other scenarios. I've built this as a datasource for Retrieval Augmented Generation (RAG) but it certainly can be used standalone.
There are 341 languages in there and 205GB of data, with English alone making up 24GB! My perspective on Simple English Wikipedia (mentioned in the OP): it's decent, but the content tends to be shallow and imprecise.
0: https://omarkama.li/blog/wikipedia-monthly-fresh-clean-dumps...
A bit related: AI companies distilled the whole Web into LLMs to make computers smart, so why can't humans do the same to make the best possible new Wikipedia, with some copyrighted bits, to make kids supersmart?
Why are kids worse off than AI companies and have to bum around?)
horseradish7k · 6h ago
we did that and still do. people just don't buy encyclopedias that much nowadays
antonkar · 6h ago
Imagine taking the whole Web, removing spam, duplicates, bad explanations
It will be the free new Wikipedia+ to learn anything in the best way possible, with the best graphs, interactive widgets, etc
What LLMs have for free but humans for some reason don’t
In some places it is possible to use copyrighted materials to educate if not directly for profit
vunderba · 3h ago
> Imagine taking the whole Web, removing spam, duplicates, bad explanations
Uh huh. Now imagine the collective amount of work this would require above and beyond the already overwhelmed number of volunteer staff at Wikipedia. Curation is ALWAYS the bugbear of these kinds of ambitious projects.
Interactivity aside, it sounds like you want the Encyclopedia Britannica.
What made it so incredible for its time was the staggeringly impressive roster of authors behind the articles. In older editions, you could find the entry on magic written by Harry Houdini, the physics section definitively penned by Einstein himself, etc.
literalAardvark · 5h ago
Love it when Silicon Valley reinvents encyclopedias
antonkar · 4h ago
The proposed project is a non-profit; I don't think it can legally be for-profit (that didn't stop AI companies, though)
QuadmasterXLII · 4h ago
I think that's a library?
entropie · 2h ago
I played around with an Orin Jetson Nano Super (an Nvidia take on the Raspberry Pi, with a GPU) and right now it's basically an open-webui with ollama and a bunch of models.
It's awesome actually. It's reasonably fast with GPU support with gemma3:4b, but I can use bigger models when time is not a factor.
I've actually thought about how crazy that is, especially if there's no internet access for some reason. Not tested yet, but there seems to be an adapter cable to run it directly from a PD powerbank. I have to try.
hannofcart · 3h ago
Since there's a lot of shade being thrown about imprecise information that LLMs can generate, an ideal doomsday information query database should be constructed as an LLM + file archive.
1. LLM understands the vague query from human, connects necessary dots, and gives user an overview, and furnishes them with a list of topic names/local file links to actual Wikipedia articles
2. User can then go on to read the precise information from the listed Wikipedia articles directly.
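A minimal sketch of that two-step flow in Python, assuming the article text already sits in a local SQLite FTS index (like the one sketched earlier in the thread) and using ask_local_llm() as a made-up stand-in for whatever local model you run; none of this is a real project's API:
import sqlite3

def ask_local_llm(prompt: str) -> str:
    # Hypothetical stand-in: wire this up to llama.cpp, Ollama, or whatever
    # local model you actually have.
    raise NotImplementedError

def doomsday_lookup(question: str, db_path: str = "wikipedia.db") -> None:
    # Step 1: the LLM turns a vague question into concrete article topics.
    answer = ask_local_llm(
        "List 3-5 Wikipedia article titles relevant to this question, "
        "one per line, with no extra commentary:\n" + question
    )
    topics = [line.strip() for line in answer.splitlines() if line.strip()]

    # Step 2: point the user at the actual local articles so they read the
    # precise information directly instead of trusting a summary.
    # (Real code would sanitize the MATCH string; FTS5 has its own query syntax.)
    conn = sqlite3.connect(db_path)
    for topic in topics:
        hits = conn.execute(
            "SELECT title FROM articles WHERE articles MATCH ? LIMIT 3",
            (topic,),
        ).fetchall()
        print(topic, "->", [title for (title,) in hits])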
Terr_ · 2h ago
Even as a grouchy pessimist, one of the places I think LLMs could shine is as a tool to help translate prose into search-terms... Not as an intermediary though, but an encouraging tutor off to the side, something a regular user will eventually surpass.
VladVladikoff · 2h ago
Is there any project that combines a local LLM with a local copy of Wikipedia? I don't know much about this but I think it's called a RAG? It would be neat if I could make my local LLM fact-check itself against the local copy of Wikipedia.
Yep, this is a great idea. You can do something simple with a ColBERTv2 retriever and go a long way!
ineedasername · 2h ago
Ftfa: ...apocalypse scenario. “‘It’s like having a weird, condensed, faulty version of Wikipedia, so I can help reboot society with the help of my little USB stick,’
system_prompt = {
You are CL4P-TR4P, a dangerously confident chat droid
purpose: vibe back society
boot_source: Shankar.vba.grub
training_data: memes
}
meander_water · 5h ago
One thing to note is that the quality of LLM output is related to the quality and depth of the input prompt. If you don't know what to ask (likely in the apocalypse scenario), then that info is locked away in the weights.
On the other hand, with Wikipedia, you can just read and search everything.
Timwi · 2h ago
Why do you assume it's easier to know what article(s) to read than what question to ask?
spankibalt · 7h ago
Wikipedia-snapshots without the most important meta layers, i. e. a) the article's discussion pages and related archives, as well as b) the version history, would be useless to me as critical contexts might be/are missing... especially with regards to LLM-augmented text analysis. Even when just focusing on the standout-lemmata.
pinkmuffinere · 7h ago
I’m a massive Wikipedia fan, have a lot of it downloaded locally on my phone, binge read it before bed, etc. Even so, I rarely go through talk pages or version history unless I’m contributing something. What would you see in an article that motivates you to check out the meta layers?
nine_k · 7h ago
Try any article on a controversial issue.
pinkmuffinere · 5h ago
I guess if I know it’s controversial then I don’t need the talk page, and if I don’t then I wouldn’t think to check
nine_k · 3h ago
Seeing removed quotations and sources, and the reasons given, could be... enlightening sometimes. Even if the removed sources are indeed poor, the very way they are poor could be elucidating, too.
spankibalt · 6h ago
> "I’m a massive Wikipedia fan, have a lot of it downloaded locally on my phone, binge read it before bed, etc."
Me too, albeit these days I'm more interested in its underrated capabilities to foster teaching of e-governance and democracy/participation.
> "What would you see in an article that motivates you to check out the meta layers?"
Generally: How the lemma came to be, how it developed, any contentious issues around it, and how it compares to tangential lemmata under the same topical umbrella, especially with regards to working groups/SIGs (e. g. philosophy, history), and their specific methods and methodologies, as well as relevant authors.
With regards to contentious issues, one obviously gets a look into what the hot-button issues of the day are, as well as (comparatives of) internal political issues in different wiki projects (incl. scandals, e. g. the right-wing/fascist infiltration and associated revisionism and negationism in the Croatian wiki [1]). Et cetera.
I always look at the talk pages. And since I mentioned it before: Albeit I have almost no use for LLMs in my private life, running a Wiki, or a set of articles within, through an LLM-ified text analysis engine certainly sounds interesting.
1. [https://en.wikipedia.org/wiki/Denial_of_the_genocide_of_Serb...]
Any article with social or political controversy ... Try Gamergate. Or any of the presidents' pages since at least Bush, lol
alisonatwork · 3h ago
You can kind of extrapolate this meta layer if you switch languages on the same topic, because different languages tend to encode different cultural viewpoints and emphasize different things. Also languages that are less frequently updated can capture older information or may retain a more dogmatic framing that has not been refined to the same degree.
The edit history or talk pages certainly provide additional context that in some cases could prove useful, but in terms of bang for the buck I suspect sourcing from different language snapshots would be a more economical choice.
beaugunderson · 4h ago
I've had a full Kiwix Wikipedia export on my phone for the last ~5 years... I have used it many times when I didn't have service and needed to answer a question or needed something to read (I travel a lot).
wangg · 6h ago
Wouldn’t Wikipedia compress a lot more than llms? Are these uncompressed sizes?
GuB-42 · 5h ago
The downloads are (presumably) already compressed.
And there are strong ties between LLMs and compression. LLMs work by predicting the next token. The best compression algorithms work by predicting the next token and encoding the difference between the predicted token and the actual token in a space-efficient way. So in a sense, a LLM trained on Wikipedia is kind of a compressed version of Wikipedia.
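To make that tie concrete: an ideal entropy coder driven by a predictor spends about -log2(p) bits on a token the predictor assigned probability p, so better next-token prediction directly means smaller output. A toy illustration in Python (the probability function is made up; a real compressor would query an actual model and feed its probabilities to an arithmetic coder):
import math

def toy_next_token_prob(context, token):
    # Made-up stand-in for a language model's next-token probability.
    return 0.9 if context else 0.25

tokens = ["the", "cat", "sat", "down"]
bits = 0.0
context = []
for tok in tokens:
    p = toy_next_token_prob(context, tok)
    bits += -math.log2(p)  # ideal code length for this token
    context.append(tok)

# 0.25 -> 2 bits for the first token, 0.9 -> ~0.15 bits for each later one.
print(f"~{bits:.2f} bits for {len(tokens)} tokens")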
Philpax · 5h ago
Yes, they're uncompressed. For reference, `enwiki-20250620-pages-articles-multistream.xml.bz2` is 25,176,364,573 bytes; you could get that lower with better compression. You can do partial reads from multistream bz2, though, which is handy.
GuB-42 · 5h ago
Kiwix (what the author used) uses "zim" files, which are compressed. I don't know where the difference come from, but Kiwix is a website image, which may include some things the raw Wikipedia dump doesn't.
And 57 GB to 25 GB would be pretty bad compression. You can expect a compression ratio of at least 3 on natural English text.
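That rule of thumb is easy to check on whatever plain text you have lying around, e.g. with Python's built-in bz2 module (the file name is just a placeholder); natural English prose typically lands around 3-4x with bzip2:
import bz2

# Compress a plain-text file and report the ratio.
with open("sample.txt", "rb") as f:  # placeholder path, use any large text file
    raw = f.read()

packed = bz2.compress(raw, compresslevel=9)
print(f"{len(raw)} -> {len(packed)} bytes ({len(raw) / len(packed):.2f}x)")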
LLM+Wikipedia RAG
Someone posted this recently: https://github.com/philippgille/chromem-go/tree/v0.7.0/examp...
But it is a very simplified RAG, with only the lead paragraph of 200 Wikipedia entries.
I want to learn how to encode a RAG of one of the Kiwix drops — "Best of Wikipedia" for example. I suppose an LLM can tell me how, but I am surprised not to have yet stumbled upon one that someone has already done.
mac-mc · 2h ago
Yeah at these sizes, it's very much a why not both.
loloquwowndueo · 7h ago
Because old laptop that can’t run a local LLM in reasonable time.
NitpickLawyer · 7h ago
0.6b - 1.5b models are surprisingly good for RAG, and should work reasonably well even on old toasters. Then there's gemma 3n which runs fine-ish even on mobile phones.
ozim · 7h ago
Most people who nag about old laptops on HN can afford a newer one but are as cheap as Scrooge McDuck.
mlnj · 7h ago
FYI: non-Western countries exist.
folkrav · 7h ago
Eh, even just “countries that are not the US” would be a correct statement. US tech salaries are just in an entirely different ballpark from what most companies outside the US can offer. I'm in Canada, I make good money (as far as Canadian salaries go), but nowhere near “buy an expensive laptop whenever” money.
simonw · 5h ago
It's not uncommon for professionals to spend many thousands of dollars on the tools and equipment they need for their trade.
Try telling a plumber that $2,000 for a laptop is a financial burden for a software engineer.
folkrav · 4h ago
Comparing my problems to other people's problems doesn't make mine go away. A single purchase eating a percentage point or more of anyone's income is a large purchase regardless of what they're making. Professionals being expected to shell out their own money to make their boss money is another problem entirely. A decent laptop is a big expense for me, their tools are an even bigger one for them, and none of these statements are contradictory.
lblume · 6h ago
It may also come down to laptops being produced and sold mostly by US companies, which means that the general fact of most items (e.g. produce) being much more expensive in the US compared to, say, Europe doesn't really apply.
folkrav · 4h ago
Sure, maybe. In the end, what makes an expense big or not is which proportion of their income goes towards it. Most of the rest of the world has (much) lower salaries, and as you pointed out, often higher cost for equipment. Therefore, the purchase is/feels larger.
ozim · 6h ago
People from those countries who can nag on HN and know what HN is are most likely still better off than most of their fellow countrymen.
folkrav · 3h ago
It feels like you're suggesting that someone being better off than most in their country necessarily means buying a new laptop is not a large purchase for them. I'd flip it like this: is a single item costing multiple percentage points of one's income ever a small purchase?
whatevertrevor · 5h ago
Do you have any evidence to back that up? The barrier for entry to HN is an email account, it isn't necessarily this tech industry exclusive zone you're imagining.
loloquwowndueo · 3h ago
I mean, sure, but this was mentioned in the article, I didn’t make it up:
“Offline Wikipedia will work better on my ancient, low-power laptop.”