PaperOrchestra

A Multi-Agent Framework for Automated AI Research Paper Writing
Yiwen Song, Yale Song, Tomas Pfister, and Jinsung Yoon
Google Cloud AI Research
Read the Paper (Coming Soon)

Generated Manuscript Gallery

Below are full-length manuscripts generated by PaperOrchestra using sparse idea summaries and raw experimental logs as input. These examples demonstrate the framework's ability to natively render submission-ready papers using provided LaTeX templates, with generated tables and synthesized visuals integrated cohesively.

Abstract

Synthesizing unstructured research materials into manuscripts is an essential yet under-explored challenge in AI-driven scientific discovery. Existing autonomous writers are rigidly coupled to specific experimental pipelines and produce superficial literature reviews. We introduce PaperOrchestra, a multi-agent framework for automated AI research paper writing. It flexibly transforms unconstrained pre-writing materials into submission-ready manuscripts, featuring a comprehensive literature review with API-grounded citations, alongside seamlessly generated visuals such as plots and conceptual diagrams.

Key Highlight: In side-by-side human evaluations, PaperOrchestra significantly outperforms autonomous baselines, achieving an absolute win rate margin of 50%-68% in literature review quality, and 14%-38% in overall manuscript quality.

PaperWritingBench Dataset

To evaluate writing performance independently of any experimental pipeline, we present PaperWritingBench, the first standardized benchmark of reverse-engineered raw materials from 200 top-tier AI conference papers (100 papers per venue from CVPR 2025 and ICLR 2025). This venue selection ensures high academic standards while rigorously testing the adaptability of AI writers to distinct conference formats, namely the double-column CVPR layout versus the single-column ICLR layout. Our dataset isolates the writing task by providing unconstrained, pre-writing inputs. This approach mimics the authentic research phase where experiments are completed but drafting has not yet begun, challenging AI writing systems to rely purely on sparse ideas and lab notes.

These raw materials consist of an Idea Summary (distilling the core methodology and theoretical foundation) and an Experimental Log (containing all numeric data extracted from tables, with insights from figures converted into standalone factual observations), alongside venue-specific LaTeX templates and conference guidelines.
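The inputs above can be pictured as a simple record per benchmark entry. The sketch below is illustrative only; the field names and schema are our assumptions, not the dataset's actual format.

```python
from dataclasses import dataclass

@dataclass
class BenchEntry:
    """One reverse-engineered PaperWritingBench entry (hypothetical schema)."""
    idea_summary: str      # core methodology and theoretical foundation
    experimental_log: str  # numeric table data plus factual figure observations
    latex_template: str    # venue-specific template identifier
    guidelines: str        # conference formatting guidelines

# A toy entry with placeholder content:
entry = BenchEntry(
    idea_summary="A contrastive pretraining objective for ...",
    experimental_log="Table 1: method A 72.4, method B 75.1 ...",
    latex_template="cvpr2025",
    guidelines="Double-column layout; page limit excludes references.",
)
```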

CVPR 2025 Dataset Statistics
Figure 1: CVPR 2025 Dataset Statistics.
ICLR 2025 Dataset Statistics
Figure 2: ICLR 2025 Dataset Statistics.

The Multi-Agent Pipeline

PaperOrchestra strategically decouples the writing process across specialized agents to enable parallel execution and iterative self-reflection. The Outline Agent synthesizes inputs into a structured plan; the Plotting Agent generates conceptual diagrams and statistical plots; the Literature Review Agent conducts targeted web searches to discover candidate papers and verifies their existence and relevance via the Semantic Scholar API to build a robust citation graph; the Section Writing Agent authors the full LaTeX manuscript; and the Content Refinement Agent iteratively optimizes the draft based on simulated peer-review feedback.
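The Literature Review Agent's verification step can be sketched as a fuzzy title match against Semantic Scholar search results. The matching heuristic below (`difflib` similarity with a 0.9 threshold) is our assumption for illustration, not the paper's actual algorithm; the `data` list shape follows the public Semantic Scholar Graph API paper-search response.

```python
import difflib

def title_matches(candidate_title, api_results, threshold=0.9):
    """Return the best-matching record from a Semantic Scholar search
    response, or None if no result's title is close enough.

    `api_results` is the `data` list returned by
    GET https://api.semanticscholar.org/graph/v1/paper/search
    """
    best, best_score = None, 0.0
    for rec in api_results:
        score = difflib.SequenceMatcher(
            None, candidate_title.lower(), rec.get("title", "").lower()
        ).ratio()
        if score > best_score:
            best, best_score = rec, score
    return best if best_score >= threshold else None

# Offline example with a mocked API response (no network call):
mock_results = [{"title": "Attention Is All You Need", "paperId": "abc123"}]
hit = title_matches("Attention is all you need", mock_results)
miss = title_matches("A Totally Different Paper", mock_results)
```

A verified hit would then be added to the citation graph; candidates returning `None` are discarded rather than cited, which is one way to keep citations API-grounded.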

PaperOrchestra Framework Overview
Figure 3: Overview of the PaperOrchestra framework. Specialized agents systematically parse raw materials, synthesize plots and literature, compile a full draft, and iteratively refine the manuscript into a submission-ready PDF.

Human Evaluation

We benchmarked PaperOrchestra against two primary autonomous pipelines: a Single Agent baseline (which processes all raw materials and executes drafting in a single monolithic LLM call) and AI Scientist-v2 (a state-of-the-art system featuring multi-round citation gathering and iterative self-reflection).

To rigorously compare PaperOrchestra against these baselines, we conducted side-by-side (SxS) human evaluations with 11 AI researchers. Evaluators blindly compared manuscripts generated by PaperOrchestra against those from the AI baselines, as well as against the human-written Ground Truth (GT).

50%–68% absolute win margin in literature review quality
14%–38% absolute win margin in overall manuscript quality
Human Side-by-Side Evaluation Bar Chart
Figure 4: Human Side-by-Side (SxS) Evaluation Results. The charts display the win, tie, and loss percentages of PaperOrchestra against baselines. PaperOrchestra consistently outperforms both AI baselines (Single Agent and AI Scientist), though a quality gap remains compared to the human-written ground truth (GT).
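The reported margins are absolute win-rate gaps from win/tie/loss verdicts. A minimal sketch of this aggregation, assuming one verdict per pairwise comparison (the paper's exact tallying may differ):

```python
from collections import Counter

def sxs_margin(judgments):
    """Compute win/tie/loss percentages and the absolute win-rate
    margin (win% minus loss%) from per-comparison verdicts, each one
    of "win", "tie", or "loss" for the evaluated system."""
    n = len(judgments)
    counts = Counter(judgments)
    pct = {k: 100.0 * counts[k] / n for k in ("win", "tie", "loss")}
    return pct, pct["win"] - pct["loss"]

# Toy example: 7 wins, 2 ties, 1 loss out of 10 comparisons
pct, margin = sxs_margin(["win"] * 7 + ["tie"] * 2 + ["loss"] * 1)
# margin = 70.0 - 10.0 = 60.0 percentage points
```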

Ethics Statement

We position our system as an advanced assistive tool designed to accelerate the drafting process of AI research papers, rather than an independent entity capable of claiming authorship. Human researchers must retain full accountability for the factual accuracy, originality, and validity of the claims presented in any generated manuscript. While PaperOrchestra incorporates robust programmatic safeguards—such as API-grounded citation validation—to minimize hallucinations and ensure scholarly rigor, users are responsible for verifying the outputs to prevent the propagation of LLM-derived biases or misinformation.