Storing abstracts of academic papers as inverted indexes for legal reasons

2 rossant 2 9/2/2025, 9:50:20 AM law.stackexchange.com ↗

Comments (2)

rossant · 5h ago
I was exploring the OpenAlex database [1] when I discovered that paper abstracts were not stored in plain text but as an inverted index, e.g.:

"abstract_inverted_index": { "To": [ 0 ], "determine": [ 1 ], "whether": [ 2, 154 ], "certain": [ 3 ], "computed": [ 4, 44 ], ... }

At first I thought this was for compression or something similar, but it turns out it is for legal reasons: to avoid infringing copyright on the plain-text abstracts. Of course, it only takes a few lines of code to reconstruct the plain text from this inverted index, so I was surprised that this would hold up in court.
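To illustrate how little code the reconstruction takes, here is a minimal Python sketch. It assumes the index maps each word to the list of positions where it occurs, as in the excerpt above; the function name `reconstruct_abstract` and the toy `sample` index are my own, not part of OpenAlex.

```python
def reconstruct_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild plain text from a word -> positions mapping."""
    # Flatten the index into (position, word) pairs and sort by position.
    pairs = sorted(
        (pos, word)
        for word, positions in inverted_index.items()
        for pos in positions
    )
    # Join the words in order; the original whitespace is assumed to be single spaces.
    return " ".join(word for _, word in pairs)


# Hypothetical toy index for illustration (not a real OpenAlex record).
sample = {"To": [0], "be": [1, 5], "or": [2], "not": [3], "to": [4]}
print(reconstruct_abstract(sample))  # To be or not to be
```

Line breaks and any unusual spacing in the original would not survive, since the index only records word order.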

I couldn’t find a definitive answer, but I believe the closest one is this comment on StackExchange: [2]

> It's possible that they could include the plain text abstracts legally, but some publication disagrees and they don't want to fight it in court. It's also possible that the publisher believes that the inverted index is indeed infringing their copyright, but they're not sufficiently confident that they would prevail in court to actually bring a suit. The ultimate answer to this question is that it's a close call, and until a court rules on it, nobody knows for sure whether it's illegal in any given jurisdiction. I don't know whether a court has ruled on it.

[1] https://en.wikipedia.org/wiki/OpenAlex

[2] https://law.stackexchange.com/questions/110313/how-does-inve...

Someone · 4h ago
If it does hold up, then other bijections, such as compression, encryption, or conversion to images or movies, would also avoid copyright infringement.

⇒ I don’t see that defense holding up in any reasonable court. It does make the infringement harder to find, though.