InfoSeek: The First Open-Source Framework for Deep Research Data Synthesis

BAAIBeijing, 9/17/2025, 9:00:44 AM

  - The First Open-Source Dataset Purpose-Built for Deep Research Tasks
    - InfoSeek is the industry’s first dataset systematically designed for Deep Research tasks. It goes beyond the limitations of traditional QA and multi-hop QA by focusing on complex, hierarchical Deep Research problems, filling a critical gap in high-quality training data.
  - End-to-end Open Source: Dataset + Data Synthesis Framework 
    - Both the dataset and its generation framework are fully open-sourced, enabling researchers to freely extend and adapt it.  
    - Leveraging tree-structured generation and backtracking verification, InfoSeek can automatically synthesize complex, multi-level questions while ensuring correctness.  
  - 50,000+ High-Quality, Multi-Step Reasoning Samples
    - The dataset contains over 50,000 high-quality samples, each requiring on average 4–6 reasoning steps.  
    - Even an advanced model such as Qwen2.5-72B with chain-of-thought (CoT) prompting fails on 91.6% of the test-set questions, highlighting the difficulty and rigor of InfoSeek.
  - Resource Links
    - https://huggingface.co/datasets/Lk123/InfoSeek
    - https://github.com/VectorSpaceLab/InfoSeek
    - https://arxiv.org/abs/2509.00375
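
To make the idea of tree-structured generation with backtracking verification more concrete, here is a minimal toy sketch in Python. It is not the InfoSeek implementation: the knowledge base `KB`, the `Node` class, and the `resolve` and `verify` helpers are all hypothetical names invented for illustration. The sketch assumes that a question is a tree in which each node hides one entity behind a set of clues, and that a question counts as correct only if walking back from the leaves to the root pins down exactly one answer at every node.

```python
from dataclasses import dataclass, field

# Toy knowledge base (hypothetical data, for illustration only).
KB = {
    "Marie Curie": {"field": "physics", "born_in": "Warsaw"},
    "Niels Bohr": {"field": "physics", "born_in": "Copenhagen"},
    "Frederic Chopin": {"field": "music", "born_in": "Warsaw"},
    "Warsaw": {"country": "Poland", "type": "city"},
    "Copenhagen": {"country": "Denmark", "type": "city"},
}


@dataclass
class Node:
    """One vertex of the question tree: a hidden entity plus its clues."""
    answer: str                                    # ground-truth entity
    literals: dict = field(default_factory=dict)   # attribute -> known value
    children: dict = field(default_factory=dict)   # attribute -> sub-question


def resolve(node):
    """Return every KB entity consistent with the clues at this node."""
    constraints = dict(node.literals)
    for attr, child in node.children.items():
        matches = resolve(child)
        if len(matches) != 1:
            return set()  # an ambiguous sub-question invalidates the parent
        constraints[attr] = next(iter(matches))
    return {
        entity
        for entity, attrs in KB.items()
        if all(attrs.get(a) == v for a, v in constraints.items())
    }


def verify(node):
    """Check that each node resolves to exactly its intended answer."""
    if resolve(node) != {node.answer}:
        return False
    return all(verify(child) for child in node.children.values())


# Two-level question: "Which physicist was born in the Polish city?"
city = Node("Warsaw", literals={"country": "Poland", "type": "city"})
question = Node("Marie Curie", literals={"field": "physics"},
                children={"born_in": city})

# Under-constrained question: "Which physicist?" has two candidates.
loose = Node("Marie Curie", literals={"field": "physics"})

print(verify(question))
print(verify(loose))
```

In this sketch the two-level question passes verification because the city clue narrows the physicists to a single person, while the under-constrained question is rejected because more than one entity fits. A real pipeline would work over a far larger corpus and would need more careful handling of ambiguity than this exact-match filter.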
