DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation

Yibo Wang1,2, Lei Wang1†, Yue Deng1, Keming Wu1, Yao Xiao1, Huanjin Yao2, Liwei Kang1, Hai Ye1, Yongcheng Jing2, Lidong Bing1
1Infinity Lab, Shanda Group 2Nanyang Technological University

Abstract

Deep research systems iterate through multi-step web research, analysis, and cross-source synthesis, yet evaluating them remains challenging: task construction is annotation-intensive, and evaluation dimensions are typically fixed in advance.

We present DeepResearchEval, an automated framework consisting of: (1) a persona-driven task construction pipeline that generates realistic research tasks anchored in diverse user profiles and applies a two-stage filter (Task Qualification and Search Necessity), (2) an adaptive point-wise quality evaluation framework that dynamically derives task-specific evaluation dimensions and criteria, and (3) an active fact-checking agent that proactively retrieves external evidence to verify report statements even when citations are missing.

Key Results: Gemini 2.5 Pro achieves the highest quality score (8.51), while Manus demonstrates the highest factual accuracy (82.30%). Our framework provides a scalable, automated, and interpretable baseline for monitoring and evaluating deep research systems.

Figure: Domain distribution of the benchmark tasks.

Figure: Overview of deep research systems' performance on our benchmark across Quality and Factual Correctness.

Methodology

Task Generation Pipeline

  • Generate diverse personas conditioned on specific domains
  • Construct candidate tasks requiring multi-round search and cross-source synthesis
  • Apply a two-stage filter (Task Qualification and Search Necessity) to ensure deep research requirements (see the sketch below)
  • Result: 100 high-quality tasks validated by domain experts, enabling dynamic "live" benchmarking
Figure: Task generation pipeline.
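
Below is a minimal sketch of how this construction-and-filtering loop could be orchestrated. The helpers generate_personas, propose_task, and llm_judge are hypothetical stand-ins for the underlying LLM calls, and the prompts are illustrative rather than the pipeline's actual prompts.

# Minimal sketch of persona-driven task construction with the two-stage filter.
# generate_personas, propose_task, and llm_judge are hypothetical stand-ins for
# the underlying LLM calls; prompts are illustrative only.
from dataclasses import dataclass

@dataclass
class Task:
    persona: str
    domain: str
    prompt: str

def generate_personas(domain: str) -> list[str]:
    """Hypothetical LLM call producing diverse user personas for a domain."""
    raise NotImplementedError

def propose_task(persona: str, domain: str) -> Task:
    """Hypothetical LLM call drafting a candidate deep research task."""
    raise NotImplementedError

def llm_judge(question: str) -> bool:
    """Hypothetical LLM-as-judge call returning a yes/no verdict."""
    raise NotImplementedError

def passes_two_stage_filter(task: Task) -> bool:
    # Stage 1: Task Qualification -- is the task well-posed and realistic
    # for its persona and domain?
    if not llm_judge(
        "Is the following research task well-posed and realistic for persona "
        f"'{task.persona}' in domain '{task.domain}'?\n{task.prompt}"
    ):
        return False
    # Stage 2: Search Necessity -- does the task genuinely require multi-round
    # web search and cross-source synthesis (not answerable from memory alone)?
    return llm_judge(
        "Does answering this task require multi-round web search and synthesis "
        f"across multiple sources?\n{task.prompt}"
    )

def build_candidates(domains: list[str]) -> list[Task]:
    """Generate persona-conditioned candidates and keep those passing both stages."""
    kept = []
    for domain in domains:
        for persona in generate_personas(domain):
            candidate = propose_task(persona, domain)
            if passes_two_stage_filter(candidate):
                kept.append(candidate)
    return kept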

Agentic Evaluation

Our evaluation framework consists of two core components designed to assess both the creative quality and the factual grounding of deep research reports.

Figure: Evaluation pipeline.

Adaptive Point-wise Quality Evaluation

  • Combines fixed general dimensions with automatically generated task-specific dimensions
  • Assigns normalized weights to all dimensions based on task requirements
  • Hierarchical scoring aggregates criterion-level scores into a final quality metric (see the sketch below)
  • Provides interpretable, fine-grained scoring tailored to each research challenge
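
A minimal sketch of the hierarchical aggregation, assuming criterion-level scores are averaged within each dimension and dimension scores are combined with normalized weights; the framework's exact weighting and scoring scheme may differ, and the dimension names and numbers below are illustrative.

# Sketch of hierarchical score aggregation: average criterion-level scores
# within each dimension, then take a weighted sum over dimensions using
# normalized weights. Dimension names and values are illustrative.

def aggregate_quality(dim_weights: dict[str, float],
                      criterion_scores: dict[str, list[float]]) -> float:
    """Return a final quality score from per-dimension criterion scores."""
    total = sum(dim_weights.values())
    weights = {dim: w / total for dim, w in dim_weights.items()}  # normalize
    dim_scores = {dim: sum(s) / len(s) for dim, s in criterion_scores.items()}
    return sum(weights[dim] * dim_scores[dim] for dim in weights)

# Example with made-up numbers on a 0-10 scale:
final = aggregate_quality(
    dim_weights={"coverage": 0.3, "insight": 0.3, "clarity": 0.2, "task_specific": 0.2},
    criterion_scores={
        "coverage": [9.0, 8.5],
        "insight": [8.0, 7.5, 9.0],
        "clarity": [9.0],
        "task_specific": [7.0, 6.5],
    },
)
print(round(final, 2))  # weighted aggregate of the dimension averages

With task-dependent weights, the same aggregation can emphasize whichever dimensions a given research task actually requires, which is what makes the scoring adaptive.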

Fact-Checking Agent

  • Operates independently of report citations, actively invoking MCP tools to retrieve external evidence
  • Segments each report into manageable parts for parallel verification
  • Extracts verifiable statements (numbers, events, dates, etc.) from each segment for checking
  • Labels each statement as Right, Wrong, or Unknown, with detailed reasoning and supporting URLs (see the sketch below)
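
A minimal sketch of this verification flow, assuming hypothetical wrappers extract_statements, search_evidence, and judge_statement around the underlying LLM and MCP search-tool calls; the segmentation heuristic and thread-based parallelism are illustrative choices, not necessarily those of the actual agent.

# Sketch of the fact-checking flow: segment the report, extract verifiable
# statements from each segment, and verify them in parallel against externally
# retrieved evidence. extract_statements, search_evidence, and judge_statement
# are hypothetical wrappers around the LLM / MCP search-tool calls.
from concurrent.futures import ThreadPoolExecutor
from typing import Literal

Verdict = Literal["Right", "Wrong", "Unknown"]

def extract_statements(segment: str) -> list[str]:
    """Hypothetical LLM call listing verifiable claims (numbers, events, dates)."""
    raise NotImplementedError

def search_evidence(statement: str) -> list[dict]:
    """Hypothetical MCP search-tool call returning evidence snippets with URLs."""
    raise NotImplementedError

def judge_statement(statement: str, evidence: list[dict]) -> tuple[Verdict, str, list[str]]:
    """Hypothetical LLM call returning (label, reasoning, supporting URLs)."""
    raise NotImplementedError

def segment_report(report: str, max_chars: int = 4000) -> list[str]:
    # Naive paragraph-based segmentation into roughly equal-sized chunks.
    chunks, current = [], ""
    for para in report.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current:
        chunks.append(current)
    return chunks

def check_segment(segment: str) -> list[dict]:
    # Verify every extracted statement in this segment against retrieved evidence.
    results = []
    for stmt in extract_statements(segment):
        label, reasoning, urls = judge_statement(stmt, search_evidence(stmt))
        results.append({"statement": stmt, "label": label,
                        "reasoning": reasoning, "urls": urls})
    return results

def fact_check(report: str, workers: int = 8) -> list[dict]:
    segments = segment_report(report)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_segment = pool.map(check_segment, segments)  # parallel verification
    return [item for seg in per_segment for item in seg]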

Evaluation Results

Overall Quality Scores

Model Avg. Coverage Insight Instruction-Following Clarity Task-Specific
Gemini-2.5-Pro DR 8.51 9.2 9.0 9.7 9.1 8.0
Claude-Sonnet-4.5 DR 7.53 8.8 8.0 9.2 7.8 6.8
OpenAI Deep Research 7.28 8.6 7.3 9.0 7.6 6.7
Qwen-3-235B DR 7.17 8.0 7.9 8.7 8.3 6.6
Doubao DR 7.06 8.6 7.0 9.2 7.7 6.3
Grok4 DR 6.92 8.5 6.6 9.6 8.2 6.0
Perplexity DR 6.86 8.2 6.6 9.3 8.6 5.9
Manus 5.95 7.2 5.8 8.3 7.1 5.2
DeepSeek DR 5.25 5.9 5.2 7.2 8.4 4.3

Factual Accuracy

Model Statements Right Wrong Unknown Correctness Ratio
Manus 57.90 47.65 2.23 8.02 82.30%
Gemini-2.5-Pro DR 86.99 66.65 4.16 16.18 76.62%
DeepSeek DR 25.08 19.17 1.81 4.10 76.44%
OpenAI Deep Research 45.98 35.04 2.72 8.22 76.21%
Qwen-3-235B DR 37.45 27.11 3.36 6.34 72.39%
Doubao DR 80.75 56.12 7.43 17.20 69.50%
Grok4 DR 47.16 29.15 5.44 12.57 61.81%
Claude-Sonnet-4.5 DR 57.30 34.79 6.16 16.35 60.72%
Perplexity DR 61.34 36.16 9.08 16.10 58.94%
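
The statement counts appear to be averages per report (hence the non-integer values), and the correctness ratio is the share of extracted statements labeled Right; for Manus, for example, 47.65 / 57.90 ≈ 82.3%.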

Key Findings

Quality: Gemini 2.5 Pro leads with an 8.51 average score, excelling across all general dimensions. Task-specific scores remain lower across all systems, highlighting the difficulty of optimizing for adaptive research criteria.

Factual Accuracy: Manus achieves the highest correctness ratio (82.30%). Factual risks are primarily driven by weakly grounded or insufficiently verifiable claims rather than outright factual errors.

BibTeX

@misc{wang2026deepresearchevalautomatedframeworkdeep,
      title={DeepResearchEval: An Automated Framework for Deep Research Task Construction and Agentic Evaluation}, 
      author={Yibo Wang and Lei Wang and Yue Deng and Keming Wu and Yao Xiao and Huanjin Yao and Liwei Kang and Hai Ye and Yongcheng Jing and Lidong Bing},
      year={2026},
      eprint={2601.09688},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.09688}, 
}