Deep research systems iterate through multi-step web research, analysis, and cross-source synthesis, yet evaluating them remains challenging: it requires annotation-intensive task construction and typically relies on static evaluation dimensions.
We present DeepResearchEval, an automated evaluation framework consisting of: (1) a
persona-driven task construction pipeline that generates realistic research tasks
anchored in diverse user profiles and applies a two-stage filter (Task Qualification and Search
Necessity); (2) an adaptive point-wise quality evaluation scheme that dynamically
derives task-specific evaluation dimensions and criteria; and (3) an active fact-checking
agent that proactively retrieves external evidence to verify report statements even when
citations are missing.
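A minimal sketch of how these three components could compose is shown below; all class, function, and dimension names here (ResearchTask, construct_tasks, evaluate_report, fact_check, the placeholder filters and scores) are illustrative assumptions, not the released implementation.

```python
from dataclasses import dataclass, field

# Illustrative sketch only; names and placeholder logic are assumptions,
# not the actual DeepResearchEval code.

@dataclass
class ResearchTask:
    persona: str
    prompt: str

@dataclass
class EvaluationResult:
    dimension_scores: dict = field(default_factory=dict)
    factual_accuracy: float = 0.0

def construct_tasks(personas):
    """Stage 1: persona-driven task construction with a two-stage filter."""
    candidates = [ResearchTask(p, f"Research question grounded in: {p}") for p in personas]
    # Task Qualification: keep only well-formed, answerable tasks (stubbed here).
    qualified = [t for t in candidates if is_qualified(t)]
    # Search Necessity: keep only tasks that genuinely require web search.
    return [t for t in qualified if needs_search(t)]

def is_qualified(task: ResearchTask) -> bool:
    return bool(task.prompt)          # placeholder for an LLM-judged filter

def needs_search(task: ResearchTask) -> bool:
    return True                       # placeholder for an LLM-judged filter

def evaluate_report(task: ResearchTask, report: str) -> EvaluationResult:
    """Stage 2: adaptive point-wise evaluation with task-specific dimensions."""
    dimensions = derive_dimensions(task)               # e.g. coverage, depth, insight
    scores = {d: score_dimension(report, d) for d in dimensions}
    # Stage 3: active fact-checking, retrieving evidence even without citations.
    accuracy = fact_check(report)
    return EvaluationResult(scores, accuracy)

def derive_dimensions(task: ResearchTask):
    return ["coverage", "depth", "insight"]            # placeholder dimensions

def score_dimension(report: str, dimension: str) -> float:
    return 0.0                                         # placeholder LLM-judge score

def fact_check(report: str) -> float:
    return 0.0                                         # placeholder retrieval + verification

if __name__ == "__main__":
    tasks = construct_tasks(["a climate policy analyst", "an indie game developer"])
    for t in tasks:
        print(evaluate_report(t, report="(system-generated research report)"))
```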
Key Results: Gemini 2.5 Pro achieves the highest quality score (8.51), while Manus
demonstrates superior factual accuracy (82.30%). Our framework provides a scalable, automated, and
interpretable baseline for monitoring and evaluating deep research systems.