Just curious - what is the future of services like these? More and more content will be AI-generated, to some degree. Should that content therefore be aggregated as well?
boombapoom · 55m ago
fuck those guys, annas archive is one of the last good things about the internet.
Koshkin · 13m ago
> the last good things
Last but not least?
lysace · 24m ago
1. Information wants to be free. :-)
2. I used to think that way about The Pirate Bay guys until they hacked into the Swedish equivalent of the US Social Security number database and then fled to Cambodia. (Or did it from Cambodia. I don’t remember the exact timeline.)
What I mean to say is: I have been disappointed by my heroes before.
tzs · 12m ago
If #1 is a reference to a famous quote from Stewart Brand, founder of the Whole Earth Catalog, it's only part of the quote. The rest is relevant:
> On the one hand you have—the point you’re making Woz—is that information sort of wants to be expensive because it is so valuable—the right information in the right place just changes your life. On the other hand, information almost wants to be free because the costs of getting it out is getting lower and lower all of the time. So you have these two things fighting against each other
He stated later more succinctly:
> Information Wants To Be Free. Information also wants to be expensive. ...That tension will not go away
gjsman-1000 · 23m ago
> Information should be free
I'm sick and tired of this misquote; as it was merely an observation of trends, and was never meant to be a moral maxim or mandate. If you truly believe information needs to be free as a moral mandate, share your company's source code first.
danielPort9 · 19m ago
I see it as “everyone deserves respect”. No need to overanalyse it. It’s one of those few things in life that are simply true, no proof needed.
Ar-Curunir · 12m ago
People can do good things and bad things simultaneously. Unless me supporting the good things directly enables also the bad things, I don't see a reason to throw out the good thing.
Davidzheng · 20m ago
was the alternative for the pirate bay people jailtime?
justin66 · 18m ago
"Anna’s Archive itself has organized some of the largest scrapes: we acquired tens of millions of files from IA Controlled Digital Lending"
Not really helping in the big picture, here, guys.
thorn · 49m ago
Kudos to the team behind this project! It looks like they have improved the UI in the last year. The crucial problem right now is to remain accessible, or simply to survive. I have no idea how much effort is being put into that. I wonder whether it is possible to stay afloat despite all the efforts to take them down.
jauntywundrkind · 23m ago
There was a pretty major UI update in the past 2-5 days-ish.
Apologies for the minor grumble, but on mobile I used to be able to browse search results much more effectively; the new design only fits ~4-5 results on a screen.
dulpo · 1h ago
This is surprising. I thought last I heard they'd arrested the guy who was suspected of running the site, about a year or so ago. Guess I'm misremembering.
Also I'm surprised Cloudflare hasn't shut them down like they do for other dodgy sites.
lode · 1h ago
When accessing from Belgium the link is blocked by Cloudflare:
Error HTTP 451
Unavailable For Legal Reasons
In response to a legal order, Cloudflare has taken steps to limit access to this website through Cloudflare's pass-through security and CDN services within Belgium
dulpo · 1h ago
Interesting. Seems to be only certain jurisdictions. I can access it no problem from the UK Vodafone network.
teekert · 1m ago
Set proton VPN to Albania and enjoy the full internet is my experience.
camtarn · 34m ago
I'm unable to resolve the domain on EE UK - looks like it's DNS blocked.
By comparison, on my work network (TalkTalk) I can resolve the domain but I get a connection reset from the site.
I think this might be the first time I've hit a DNS block. It feels rather eerie seeing people talking about a site that, from my point of view, doesn't even exist...
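For anyone who wants to tell these two failure modes apart, here is a minimal Python sketch. It assumes a DNS block shows up as a resolution error (some ISPs return a sinkhole address instead, which this would not catch) and that the reset arrives during the TCP connect or TLS handshake. The names `diagnose` and `_open_tls` are made up for illustration.

```python
import socket
import ssl


def _open_tls(host, port, timeout):
    """Open a TCP connection and complete a TLS handshake, then close it."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with ctx.wrap_socket(raw, server_hostname=host):
            pass


def diagnose(host, port=443, timeout=5.0,
             resolve=socket.getaddrinfo, open_tls=_open_tls):
    """Classify how reaching `host` fails (hypothetical helper).

    `resolve` and `open_tls` are injectable so the logic can be
    exercised without touching the network.
    """
    try:
        resolve(host, port)
    except socket.gaierror:
        return "dns-failure"        # name does not resolve at all
    try:
        open_tls(host, port, timeout)
    except ConnectionResetError:
        return "connection-reset"   # resolved, but the connection was reset
    except (TimeoutError, socket.timeout):
        return "timeout"            # resolved, but nothing ever answered
    except OSError:
        return "other-error"        # refused, TLS error, unreachable, ...
    return "reachable"


if __name__ == "__main__":
    import sys
    print(diagnose(sys.argv[1]))
```

Running it against the same domain on two networks should give different labels if one ISP blocks at the DNS level and the other interferes with the connection itself.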
spacedcowboy · 1h ago
Hmm. Even the title link above doesn't work for me on Virgin's cable, in the UK
dulpo · 53m ago
Do you see an error page / blocked page?
I used to get archive.org blocked and had to contact my provider to have the filters taken off.
spacedcowboy · 44m ago
Nope, it just takes forever, then eventually shows a blank screen...
barrell · 58m ago
Yep blocked by Ziggo in NL as well
telesilla · 51m ago
Whenever I'm in the Netherlands I need to set my DNS to 1.1.1.1 or similar, lots of blocks.
borski · 1m ago
Except that that’s CloudFlare, which is also blocking Anna’s Archive.
noble-lombax · 50m ago
I actually didn't know there were more error codes beyond error code 429
Mogzol · 43m ago
There's "431 Request Header Fields Too Large" which you will see occasionally. But after that 451 is the only other 400-level error code above 429. It was chosen as a reference to the book Fahrenheit 451.
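A quick way to check this, assuming Python 3.8 or later (where the standard library's `http.HTTPStatus` enum includes 451) and assuming the enum tracks the registered codes. The helper name `client_errors_above` is just for illustration.

```python
from http import HTTPStatus


def client_errors_above(code: int) -> list[int]:
    """Return the 4xx status codes known to the stdlib that are > `code`."""
    return sorted({s.value for s in HTTPStatus if code < s.value < 500})


if __name__ == "__main__":
    # Print each code together with its standard reason phrase.
    for value in client_errors_above(429):
        print(value, HTTPStatus(value).phrase)
```

This only reflects what the standard library knows about; servers are free to send unregistered codes.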
mariusor · 31m ago
451 is kind of a novelty code, its meaning being related to Bradbury's "Fahrenheit 451" SciFi novel.
The two behind Z-Library were arrested in late 2022.
dulpo · 56m ago
Thank you, I think I must have got the details of that confused with the OCLC lawsuit.
baal80spam · 53m ago
annas-archive.li/blog, 2025-08-17
About recent events.
We are still alive and kicking. In recent weeks we’ve seen increased attacks on our mission. We are taking steps to harden our infrastructure and operational security. The work of securing humanity’s legacy is worth fighting for.
Since we started in 2022, we have liberated tens of millions of books, scientific articles, magazines, newspapers, and more. These are now forever protected from destruction by natural disasters, wars, budget cuts, and other catastrophes, thanks to everyone who helps with torrenting.
Anna’s Archive itself has organized some of the largest scrapes: we acquired tens of millions of files from IA Controlled Digital Lending, HathiTrust, DuXiu, and many more.
We have also scraped and published the largest book metadata collections in history: WorldCat, Google Books, and others. With this we’ll be able to identify which books are still missing from our collections, and prioritize saving the rarest ones.
Much thanks to all of our volunteers for making these projects happen.
We’ve forged some incredible partnerships. We’ve partnered with two LibGen forks, STC/Nexus, Z-Library. We’ve secured tens of millions additional files through these partnerships. And they are helping the mission by mirroring our files.
Unfortunately we have seen the disappearance of one of the LibGen forks. We don’t have further information about what happened there, but are saddened by this development.
There is a new entrant: WeLib. They appear to have mirrored most of our collection, and use a fork of our codebase. We have copied some of their user interface improvements, and are grateful for that push. Sadly, we are not seeing them share any new collections, nor share their codebase improvements. Since they haven’t shown commitment to contributing back to the ecosystem, we advise extreme caution. We recommend not using them.
In the meantime, we have some exciting projects in the works. We have hundreds of terabytes in new collections sitting on our servers, waiting to be processed. If you’re at all interested in helping out, feel free to check out our Volunteering and Donate pages. We run all of this on a minimal budget, so any help is greatly appreciated.
Keep fighting.
stonecharioteer · 1h ago
Please remain up. Libgen no longer works. I've used IRC for fiction and non-fiction, but tech books need Anna's Archive and Libgen. I buy the physical copy with company budget to pay the author, but I need DRM-free ebooks to read comfortably on my Tab S9 Ultra.
DyslexicAtheist · 20m ago
libgen is still there
gregorygoc · 1m ago
What’s the url?
slt2021 · 1h ago
Anna's Archive is possibly the greatest site ever.
Infinite love to the team <3
xtracto · 58m ago
Kind of... the fact that they have the actual data behind a "soft" paywall (waiting times and terribly slow transfers otherwise) makes me a bit skeptical of their "goodwill".
SimianSci · 18m ago
No such thing as free when bandwidth costs money.
Any online service that hands things out for free without restriction is getting its return through unscrupulous means and shouldn't be trusted.
Anna's Archive straddles the line enough to allow people to download books for free but not at too great an expense to the volunteers who pay out of pocket to support the project.
nulld3v · 8m ago
I believe you only hit the paywall when you try to use the search engine & download individual files. They still offer the underlying data for free archival/mirroring via torrents.
0cf8612b2e1e · 53m ago
Their backdoor plan to get rich! Not going to fool me this time VCs!!
Everyone involved is taking on significant personal liability and hosting expenses. Not sure what more you expect.
klik99 · 20m ago
Yes spot on, crazy that asking for an optional pittance for less bandwidth throttling on such a huge and risky project can be seen as exploitative.
exe34 · 9m ago
you should ask for a refund!
mattl · 55m ago
Bandwidth isn’t free of charge
bibelo · 41m ago
and hosting
oguz-ismail · 42m ago
> We recommend not using them
I've been using WeLib since April and had a good experience so far
SimianSci · 21m ago
If efforts like this are to be sustainable in any lasting way, participants need to be cooperative, not parasitic.
I agree with the Anna's Archive team: it serves no one to have one of these players in the space hoarding their own collections and not sharing them with other archiving projects. It makes the collection extremely vulnerable and at risk of becoming lost knowledge as time goes on.
jeron · 15m ago
I disagree with how this is framed. Shadow libraries thrive on decentralization; any other server mirroring a collection is better than no mirrors at all.
carlosjobim · 7m ago
No honour among thieves.
keroro · 33m ago
Why use them over annas archive?
oguz-ismail · 2m ago
cleaner interface
max_ · 1h ago
The entire internet needs to be re-designed to stand up against attacks.
- DDOS attacks
- Spamming
- UK-like surveillance laws
- LLM scraping
Why is it that there is almost no initiative for this?
grues-dinner · 57m ago
The Internet has been redesigned. It's just not been redesigned with your interests in mind and at least some of the "attacks" are features to the right people.
theturtletalks · 54m ago
The precursor to Bitcoin was an interesting project called Hashcash. It was built to combat email spam: the sender is forced to spend compute solving a moderately hard hash puzzle and to put the result in a header. The recipient can then cheaply verify that the sender "paid" the cost.
progval · 55m ago
There are, but they each have their tradeoffs.
Proof-of-work and micropayment schemes (e.g. Xanadu or Internet Mail 2000) solve spamming and LLM scraping, but are more expensive or more CPU-intensive.
P2P systems like FreeNet too, but they are harder to use and more storage intensive and make it easier to spy on individual users.
Tor solves UK-like surveillance laws but it's slower and makes it easier to spam.
freefaler · 1h ago
Decentralization and interoperability, including TCP/IP and its routing protocols, give the network the ability to grow freely, but make those kinds of attacks easier.
The easiest way to mitigate those problems would be to decrease the openness and centralize more. That might lead to even worse things than DDoS.
GuB-42 · 20m ago
RFC-3514 [1] proposed an effective solution against attacks.
So see, there are initiatives, but people treat it as a joke, maybe because of when it was released.
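For context, RFC 3514 is dated 1 April 2003 and defines the "evil bit": the otherwise unused high-order bit of the IPv4 fragment offset field, which malicious packets are supposed to set. A tongue-in-cheek Python sketch of a compliant check, with the helper names `is_evil` and `make_header` made up for illustration:

```python
import struct

# High-order bit of the 16-bit flags/fragment-offset field (RFC 3514).
EVIL_BIT = 0x8000


def is_evil(ipv4_header: bytes) -> bool:
    """Return True if the RFC 3514 security flag is set in an IPv4 header."""
    if len(ipv4_header) < 20:
        raise ValueError("IPv4 header must be at least 20 bytes")
    # The flags/fragment-offset field lives at byte offset 6.
    (flags_and_offset,) = struct.unpack_from("!H", ipv4_header, 6)
    return bool(flags_and_offset & EVIL_BIT)


def make_header(evil: bool) -> bytes:
    """Build a minimal 20-byte IPv4 header for testing (checksum left at 0)."""
    flags = EVIL_BIT if evil else 0
    return struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20,            # version/IHL, TOS, total length
        0, flags,               # identification, flags + fragment offset
        64, 6, 0,               # TTL, protocol (TCP), checksum placeholder
        bytes([192, 0, 2, 1]),  # source address (documentation range)
        bytes([192, 0, 2, 2]),  # destination address
    )


if __name__ == "__main__":
    print(is_evil(make_header(evil=True)), is_evil(make_header(evil=False)))
```

The scheme depends on attackers setting the bit voluntarily, which is the joke.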
Out of curiosity, do you see the archive in question as being part of the problem or that it needs protection from the issues you raise?
butchkass · 1h ago
Go right ahead
ilovefood · 1h ago
I fully agree. It's difficult though, because I genuinely believe that the solution space overlaps with cryptography, which is quickly discounted as a viable option because it is now laden with negative connotations.
goku12 · 25m ago
Cryptography has negative connotations? Like what? Do you mean cryptocurrency by any chance? (If so, it's feasible to practice cryptography without touching cryptocurrency).
gia_ferrari · 2m ago
Not op, but in my bubble:
- DRM.
- Owner-unfriendly device locks (such as manufacturer-controlled secure boot or locked-down OSes).
- Inability to audit network traffic from one's own devices, e.g. an IoT device.
- Remote attestation, when in opposition to open computing.
I could also see folks seeing the use of cryptography as "having something to hide" - I don't personally agree.
vpribish · 56m ago
nah. cryptography is not seriously held back by cryptocurrency
monster_truck · 1h ago
I'll start the wiki
meindnoch · 55m ago
I'll design the logo!
IAmBroom · 22m ago
I'll make a GUI in Visual Basic!
exe34 · 7m ago
I'll bring my axe!
anon191928 · 1h ago
because they will come after new design? how do you not see this?
dulpo · 1h ago
Redesigned like how?
exe34 · 7m ago
the problem is that anybody who does that work will be targeted very quickly by the people in power.
even if it's decentralised, it'll be banned one way or another and you'll be hunted down.
random3 · 1h ago
"Be the change you want to see in the world"
NoMoreNicksLeft · 1h ago
I dread these. I still remember the rarbg announcement from a few years back I saw here. Do I even dare click the link?
HedgeMage · 1h ago
Not that scary. Click it.
crest · 1h ago
They just announced that they're still in the fight.
ronsor · 1h ago
I think you'll be happy if you do
revskill · 1h ago
OpenAI needs to train their models on these books, not Stack Overflow or Reddit.
The tweet only names Meta, but it would be very surprising if OpenAI didn't do the same thing.
CamperBob2 · 1h ago
Anyone who doesn't train on all material available, legal or otherwise, will be outcompeted by teams that do, including those based in countries that don't respect Western copyright law. It's that simple.
Either this practice is judged (or legislated) to be fair use, or copyright is done. It's also that simple.
atrettel · 44m ago
I'm not convinced that LLMs and other AI models need to train on all material available. A representative sample is better.
I'll ignore the legality aspects in my response. I think coming up with a representative sample of all relevant information would be better in the long term (teams will not be outcompeted on long time horizons). Why don't the companies do this? Because it is easier to just "carpet bomb the parameter space" and worry about the potential confounding [1] and sampling bias [2] later. Coming up with a representative sample requires domain expertise and that is expensive in terms of time and money. But it reduces the total amount of training data and should reduce the amount of time and resources it takes to build the models. That may matter now that models are quite large.
This is definitely a design decision with tradeoffs on both sides. I can entertain the notion that we don't have time to sample things, but I think we are all too often dismissing the long-term benefits of proper sampling.
(In terms of the legality aspects, judges are trying to "split the baby" [3] in my opinion by saying that training on stuff you got legally is OK but training on pirated material isn't. So nobody is going to recommend training on pirated material in the first place.)
So, what? Authors and rights holders are supposed to just take it?
Copyright law exists for a reason. Trying to improve an LLM doesn't give you the right to flout our legal system. Yes, other countries might have an advantage in LLM training as a result but so be it.
crazygringo · 46m ago
> Authors and rights holders are supposed to just take it?
If it's judged as fair use, then yes. And then it's not flouting anything.
Remember the whole point of fair use is to benefit society by allowing reuse of material in ways that don't directly copy large portions of the material verbatim.
For example, nonfiction authors already "just take it" when reviews describe the main points of their book without paying them a cent. The justification is that it's for the greater good, and rights are limited.
atrettel · 33m ago
Judges have recently ruled [1] that training on legally obtained materials constitutes fair use, but we will have to see in the long term if that ruling holds up.
>the whole point of fair use is to benefit society
I'll stop you right there - I really don't think that applies at all. Does 'society' really benefit when the whole thing is a funnel for enormous amounts of wealth to go to already-gigantic companies like Microsoft?
bee_rider · 20m ago
It seems like it could conceivably be fair in some sense, as long as the models were actually released as open-weights (for the benefit of society).
bfrankline · 37m ago
> Remember the whole point of fair use is to benefit society by allowing reuse of material in ways that don't directly copy large portions of the material verbatim.
How do you think masked language models work?
bugufu8f83 · 1h ago
They do, don't they? I think OpenAI uses libgen.
Meta managed to get into a private ebook torrent tracker called Bibliotik a few years ago to use for training Llama and the resulting publicity essentially killed the tracker.
They are even offering decent bounties: https://software.annas-archive.li/AnnaArchivist/annas-archiv...
Whoever is running it must be doing really well for themselves laundering all that crypto.
https://annas-archive.org/torrents
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
[1] https://www.ietf.org/rfc/rfc3514.txt
[1] https://en.wikipedia.org/wiki/Confounding
[2] https://en.wikipedia.org/wiki/Sampling_bias
[3] https://www.404media.co/judge-rules-training-ai-on-authors-b...
[1] https://www.404media.co/judge-rules-training-ai-on-authors-b...