InfoSeek: The First Open-Source Framework for Deep Research Data Synthesis
BAAIBeijing · 9/17/2025, 9:00:44 AM
- The First Open-Source Dataset Purpose-Built for Deep Research Tasks
  - InfoSeek is the industry’s first dataset systematically designed for Deep Research tasks. It goes beyond the limitations of traditional QA and multi-hop QA by focusing on complex, hierarchical Deep Research problems, filling a critical gap in high-quality training data.
- End-to-End Open Source: Dataset + Data Synthesis Framework
  - Both the dataset and its generation framework are fully open-sourced, enabling researchers to freely extend and adapt it.
  - Leveraging tree-structured generation and backtracking verification, InfoSeek can automatically synthesize complex, multi-level questions while ensuring correctness.
- 50,000+ High-Quality, Multi-Step Reasoning Samples
  - The dataset contains over 50,000 high-quality samples, each requiring on average 4–6 reasoning steps.
  - Even advanced models such as Qwen2.5-72B + CoT still fail 91.6% of the time on the test set, highlighting the difficulty and rigor of InfoSeek.
- Resource Links
  - https://huggingface.co/datasets/Lk123/InfoSeek
  - https://github.com/VectorSpaceLab/InfoSeek
  - https://arxiv.org/abs/2509.00375
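
To give a rough feel for what "tree-structured generation" plus a verification pass could look like, here is a toy Python sketch. It is not the InfoSeek framework's code. The fact table and every name in it (`FACTS`, `Node`, `build_tree`, `describe`, `candidates`, `verify`) are made up for illustration, and the verification step is a simple stand-in: it re-solves the question from the leaves up and accepts it only if the intended answer is the single match.

```python
from dataclasses import dataclass, field

# Hypothetical toy knowledge base: entity -> {relation: value}.
FACTS = {
    "Marie Curie": {"born_in": "Warsaw", "field": "physics"},
    "Frederic Chopin": {"born_in": "Warsaw", "field": "music"},
    "Albert Einstein": {"born_in": "Ulm", "field": "physics"},
    "Warsaw": {"capital_of": "Poland"},
    "Ulm": {"located_in": "Germany"},
}


@dataclass
class Node:
    """One tree node: an entity plus the constraints that describe it."""
    entity: str
    children: list = field(default_factory=list)  # list of (relation, Node)


def build_tree(entity, depth):
    """Expand an entity into constraints, up to `depth` levels deep."""
    node = Node(entity)
    if depth == 0:
        return node
    for relation, value in FACTS.get(entity, {}).items():
        node.children.append((relation, build_tree(value, depth - 1)))
    return node


def describe(node):
    """Render the question: expanded entities are hidden behind their constraints."""
    if not node.children:
        return node.entity
    parts = [f"{rel} = {describe(child)}" for rel, child in node.children]
    return "[the entity with " + " and ".join(parts) + "]"


def candidates(node):
    """Solve the tree bottom-up: every entity consistent with the constraints."""
    if not node.children:
        return {node.entity}
    solved = [(rel, candidates(child)) for rel, child in node.children]
    return {
        entity
        for entity, facts in FACTS.items()
        if all(facts.get(rel) in allowed for rel, allowed in solved)
    }


def verify(node):
    """Accept a question only if its constraints pin down exactly the intended answer."""
    return candidates(node) == {node.entity}


if __name__ == "__main__":
    tree = build_tree("Marie Curie", depth=2)
    print(describe(tree))
    print("unique answer:", verify(tree))

    # A question with too few constraints is ambiguous and gets rejected.
    ambiguous = Node("Marie Curie", [("born_in", Node("Warsaw"))])
    print("ambiguous candidates:", sorted(candidates(ambiguous)))
    print("unique answer:", verify(ambiguous))
```

In this toy setup, deeper trees hide more intermediate entities, so answering takes more lookup steps. The real framework works over web-scale sources rather than a five-entry dictionary, so treat this only as a mental model.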
Comments (1)
zephyrfalcon · 3h ago
THIS InfoSeek? https://en.wikipedia.org/wiki/Infoseek
Probably not...