Institutional Books: A 242B token dataset from Harvard Library's collections

73 strangecasts 21 6/11/2025, 9:36:06 PM arxiv.org ↗

Comments (21)

SloopJon · 21h ago
Although this is characterized as 1.0, it is governed by the Terms of Use for Early-Access, which are quite limiting, including: "You may use the Service solely for noncommercial purposes."
ninjin · 20h ago
It really is rather peculiar to me. They frame it like this (emphasis mine):

"With the preliminary publication of this dataset, we further seek to establish a community-led process to grow, improve, and use institutional data in ways that strengthen the knowledge ecosystem and assert the importance of ongoing stewardship of training data from the originating knowledge institutions themselves. To this end, we are experimenting to find the best way to release this data in a manner that facilitates collaboration. We encourage input on this process to guide the full publication of this and future dataset dataset releases, beginning with the following decisions:

* At preliminary launch, we have published the metadata, including experimental metadata, in full for anyone to access and use.

* At preliminary launch, we have published the dataset including OCR-extracted text under a noncommercial license, and with a 'click-through' that requires users to accept this license, additional terms of use, and to share basic contact information with us so that we can engage the community in its early use.

* At preliminary launch, we have chosen to postpone the release of the raw scan images, though we will share them liberally with researchers and libraries who wish to review them. While we know AI developers and researchers are eager for more raw materials, we believe this minor friction can help build the relationships and norms necessary to grow a collaborative community."

It is the fruit of their labour (well, the digitisation is), so it is up to them to license it as they see fit. But it feels odd to me that they seem to want this degree of control. In open source and in my own research field, the pattern we tend to follow is to release freely, observe, and then build relationships, rather than holding a "license gun" to the heads of potential collaborators.

Lastly, I have only skimmed the pre-print, but I noted no commitment to a final license either, not even a direction for it. Thus, as a natural language processing researcher, I will steer clear of this dataset for the time being and hope the licensing situation improves.

ks2048 · 19h ago
“noncommercial” seems pernicious to me lately. I can see why people reach for it, but it really seems hard to define (there are many ways to profit off of something without simply selling it directly).
DaSHacka · 18h ago
This has always been my problem with CC-NC: it's just not clear to me what counts as "commercial" and what doesn't.

Can't sell the item itself? Okay, makes sense.

What about a downstream manufactured item, such as a CC-NC STL that you have since 3D-printed? You can't sell the STL, but what about the printed object? If you can't profit, must you necessarily take a loss, or could you sell the items at cost?

Or offering a CC-NC item for free, in the same place you are selling other products for profit? Where the CC-NC item may be acting as a "loss-leader" to get customers to purchase your commercial offerings?

Or giving everything away, the CC-NC item and all other items, but while representing a commercial entity who is doing such for marketing purposes with the end-goal of generating more revenue for the business?

I much prefer GPL/CC-SA licenses; they're much clearer about where the line sits with regard to usage.

Incipient · 9h ago
Don't most of these licenses also cover "derived works"? Take the trivial cases: you get an STL and print the object, and it's clearly derived; you get some code and edit parts of it for a new application, and it's clearly derived.

Personally, I feel it's also fairly trivial that an AI model is a derived work, but there is so much money at stake that people risk it (e.g. early Spotify and its sourcing of music) and hope it becomes a non-issue.

HOWEVER, as China and co are going to wholesale ignore any IP/copyright to train AI models, the choice we have...may not be much of a choice at all.

otoburb · 8h ago
>>Can't sell the item itself? Okay, makes sense.

IANAL, but I think it goes even one step beyond that, which is that the item and derived works can't even be used to support a commercial enterprise, even if the (derived) work isn't being sold or seen by the outside public.

DaSHacka · 4h ago
Interesting. If true, that effectively means the answer to all my questions would be "no".
ronsor · 18h ago
That's exactly why it's problematic in licenses unless explicitly defined.
xhkkffbf · 4h ago
I'm sure Harvard doesn't consider its use as commercial, even though some people there get big salaries. Claudine Gay, for instance, makes more than $1m/year even after losing the job of President in the scandal. There are only a few "commercial" businesses that pay that well.
Frummy · 19h ago
AI's lizard brain will apparently be 60% 1800s. It might act like a villainous steampunk Anglo-Saxon twirling a mustache in moments of survival, or at least some blend of those values, while playing 5D chess. Read it H. G. Wells's "World Brain" to calm it down, like a fond childhood memory.
DaSHacka · 18h ago
This would be the funniest possible future, and a very distinct possibility depending on how the NYT lawsuit turns out with regard to IP holder rights versus AI "copyright laundering".
adt · 16h ago