Show HN: Getting full-text scientific content into LLMs+Agents is stupidly hard

3 points by zk108 | 2 comments | 5/27/2025, 8:37:18 PM | valyu.network
Most APIs don’t return actual content. You get metadata, maybe an abstract, maybe a snippet... never the thing itself. And if you want proper sources like arXiv, PubMed, or major publishers? Good luck. You’re stuck scraping tens of millions of PDFs or Semantic Scholar and building your own ingestion pipeline.
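
A quick illustration of that gap (a minimal sketch, not part of our stack): the public arXiv API hands back an Atom feed whose summary field is the abstract; the body of the paper only exists inside the PDF.

  # Query the public arXiv API and show that only metadata + abstract come back.
  # Illustrative only; error handling omitted.
  import requests
  import xml.etree.ElementTree as ET

  resp = requests.get(
      "http://export.arxiv.org/api/query",
      params={"search_query": "all:retrieval augmented generation", "max_results": 1},
      timeout=30,
  )
  ns = {"atom": "http://www.w3.org/2005/Atom"}
  entry = ET.fromstring(resp.text).find("atom:entry", ns)
  print(entry.find("atom:title", ns).text.strip())
  print(entry.find("atom:summary", ns).text.strip())  # the abstract; the full text lives only in the PDF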

We hit this building agentic workflows and RAG backends. What we needed wasn’t “search”; it was a way to retrieve real, structured full text with enough metadata to plug straight into a reasoning system. So we built a system that does that: multimodal inputs (text, math, figures), clean citations, reference chaining, and filters that work (by date, by source, etc.).
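
Roughly the shape of call we wanted to be able to make (a hypothetical sketch: the endpoint and field names below are placeholders for illustration, not our actual API):

  # Hypothetical full-text retrieval call; endpoint and schema are illustrative only.
  import requests

  resp = requests.post(
      "https://api.example.com/v1/fulltext-search",   # placeholder endpoint
      json={
          "query": "CRISPR off-target effects in primary T cells",
          "sources": ["arxiv", "pubmed"],              # source filter
          "published_after": "2023-01-01",             # date filter
          "max_results": 5,
      },
      timeout=30,
  )
  for doc in resp.json()["results"]:
      print(doc["title"], doc["doi"])
      print(doc["full_text"][:500])    # actual content, not just an abstract
      print(doc["references"])         # structured references for chaining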

The hard part wasn’t retrieval but preprocessing at scale: figuring out how to analyse, chunk, and structure tens of millions of docs without taking months or breaking the bank. Not to mention dealing with licensed content, where formats vary wildly, or building retrieval systems that hold up at this scale.
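
For a sense of what that preprocessing involves, here is a stripped-down sketch of structure-aware chunking (illustrative only; the real pipeline also has to cope with math, figures, and the licensed formats mentioned above):

  # Split parsed document sections into overlapping chunks while keeping metadata.
  # Assumes each section is already a dict like {"heading": ..., "text": ..., "doi": ...}.
  def chunk_sections(sections, max_chars=2000, overlap=200):
      chunks = []
      for sec in sections:
          text = sec["text"]
          start = 0
          while start < len(text):
              end = min(start + max_chars, len(text))
              chunks.append({
                  "heading": sec["heading"],   # carried along so retrieval can cite the section
                  "doi": sec["doi"],
                  "text": text[start:end],
              })
              if end == len(text):
                  break
              start = end - overlap            # overlap preserves context across chunk boundaries
      return chunks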

Still a work in progress, with more updates on the way. But it’s miles better than duct-taping together PDFs, AI search engines, etc., and hoping the relevant context turns up.

Comments (2)

yorkeccak · 20h ago
Aligns very well with what Anthropic researchers said on a recent podcast: even if AI progress stalls, current AI models are already capable of automating all white-collar jobs; the only missing pieces are better access to information and the infra/workflows around the models themselves.