ORIG

Open Multimodal Retrieval-Augmented
Factual Image Generation

Yang Tian1, Fan Liu2, Jingyuan Zhang3, Wei Bi4, Yupeng Hu1, Liqiang Nie5
Tianyangchn@gmail.com

1. iLearn Lab, Shandong University

2. National University of Singapore

3. Kuaishou Technology

4. Independent Researcher

5. Harbin Institute of Technology (Shenzhen)

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.
Sports case

Generate an image showing a comparison between the match balls used in the first and the most recent FIFA World Cup final.
GPT-Image-1, Direct Generation

Sports case

Generate an image showing a comparison between the match balls used in the first and the most recent FIFA World Cup final.
GPT-Image-1, Generation with ORIG

Transportation case

Generate an image of the XPeng Land Aircraft Carrier flying car.
GPT-Image-1, Direct Generation

Transportation case

Generate an image of the XPeng Land Aircraft Carrier flying car.
GPT-Image-1, Generation with ORIG

Transportation case

Generate an image of "邪恶大鼠标" (a Chinese nickname, literally "evil giant mouse") on the road.
GPT-Image-1, Direct Generation

Transportation case

Generate an image of "邪恶大鼠标" (a Chinese nickname, literally "evil giant mouse") on the road.
GPT-Image-1, Generation with ORIG

Transportation case

Generate an image to show the special entry method of the TEOREMA vehicle.
GPT-Image-1, Direct Generation

Transportation case

Generate an image to show the special entry method of the TEOREMA vehicle.
GPT-Image-1, Generation with ORIG

Sports case

Generate an image of the competition scene featuring the first female Olympic gold medalist.
GPT-Image-1, Direct Generation

Sports case

Generate an image of the competition scene featuring the first female Olympic gold medalist.
GPT-Image-1, Generation with ORIG

Events case

Generate an image showcasing the winning photograph and the photographer of the Landscape category in the Sony World Photography Awards Open 2025.
GPT-Image-1, Direct Generation

Events case

Generate an image showcasing the winning photograph and the photographer of the Landscape category in the Sony World Photography Awards Open 2025.
GPT-Image-1, Generation with ORIG

Events case

Generate the working environment when Microsoft was founded.
GPT-Image-1, Direct Generation

Events case

"Generate the working environment when Microsoft was founded.
GPT-Image-1, Generation with ORIG.

Products case

Generate a picture showing the 5-in-1 clean ability of AquaSense 2 Ultra.
GPT-Image-1, Direct Generation

Products case

Generate a picture showing the 5-in-1 clean ability of AquaSense 2 Ultra.
GPT-Image-1, Generation with ORIG

People case

Generate an image of the current president of the University of Toronto standing in front of Robarts Library.
GPT-Image-1, Direct Generation

People case

Generate an image of the current president of the University of Toronto standing in front of Robarts Library.
GPT-Image-1, Generation with ORIG


Factual Image Generation

Task Definition

We formalize FIG as a new task setting whose defining goal is to ensure factual consistency in generated images. Formally, given a query prompt P, the task requires producing an image that is semantically aligned with P and grounded in verifiable knowledge about entities, attributes, relations, and temporal events. Factual consistency in FIG spans three dimensions: Perceptual Fidelity, which ensures faithful perception and accurate rendering of objects' visual appearance; Compositional Consistency, which enforces accurate object properties and spatial relations; and Temporal Consistency, which ensures proper depiction of event timing and entity states. Unlike conventional image generation, FIG requires grounding in external evidence beyond the limited and static parametric memory of LMMs. In practice, this necessitates open retrieval from the web: textual and visual evidence contribute complementary knowledge that supports Perceptual Fidelity and Compositional Consistency, while the real-time nature of retrieval supplies the updated information essential for Temporal Consistency.
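As an illustrative formalization (the notation below is ours, not taken from the paper), the FIG objective can be sketched as selecting an image that balances semantic alignment with the prompt against factual consistency with retrieved knowledge:

```latex
% Illustrative formalization of FIG; S, F, and \lambda are our assumptions,
% not the paper's notation. P is the query prompt, K the retrieved external
% knowledge, and I the generated image.
\[
  I^{*} = \arg\max_{I} \; S(I, P) + \lambda \, F(I, K),
  \qquad
  F(I, K) = F_{\mathrm{PF}}(I, K) + F_{\mathrm{CC}}(I, K) + F_{\mathrm{TC}}(I, K),
\]
% where S measures semantic alignment with P, and F decomposes factual
% consistency into Perceptual Fidelity (PF), Compositional Consistency (CC),
% and Temporal Consistency (TC).
```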

Motivation of factual image generation (FIG) with open multimodal retrieval: (a) Reliance on internal knowledge alone often leads to outdated or hallucinated content. (b) Incorporating external information improves grounding but remains constrained by static and unimodal sources. (c) Leveraging open retrieval of multimodal evidence integrates evolving knowledge and complementary cues to achieve FIG.

Method

ORIG

We propose ORIG, an agentic open multimodal retrieval-augmented framework that grounds image generation in verifiable knowledge. As illustrated in the figure below, ORIG adopts an iterative pipeline that plans sub-queries, retrieves modality-specific evidence, filters noise through coarse-to-fine filtering, and incrementally integrates the refined knowledge into enriched prompts that guide factual image generation. The framework comprises three modules: an Open Multimodal Retrieval Module, which collects and filters web-scale evidence through adaptive query planning and sufficiency evaluation; a Prompt Construction Module, which integrates the input prompt with features extracted from the filtered evidence to create generation-ready prompts; and an Image Generation Module, which produces factually grounded images from the enriched prompts.

The overall pipeline of the ORIG framework: ORIG adaptively controls multimodal retrieval and prompt construction, dynamically deciding whether to continue retrieval or proceed based on the current state of accumulated knowledge.
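To make the control flow concrete, here is a minimal Python sketch of the iterative loop described above. Every component name (plan_subqueries, retrieve, filter_evidence, is_sufficient, enrich_prompt, generate_image) is a hypothetical placeholder for the corresponding ORIG module, not the framework's released API.

```python
# Minimal sketch of ORIG's iterative retrieve-filter-integrate loop.
# The injected callables are hypothetical stand-ins for the framework's
# modules, not its actual interface.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OrigPipeline:
    plan_subqueries: Callable[[str, list], List[str]]  # adaptive query planning
    retrieve: Callable[[str], list]                    # open web retrieval (text + images)
    filter_evidence: Callable[[list, str], list]       # coarse-to-fine filtering
    is_sufficient: Callable[[str, list], bool]         # sufficiency evaluation
    enrich_prompt: Callable[[str, list], str]          # prompt construction
    generate_image: Callable[[str], object]            # e.g., a GPT-Image-1 call
    max_rounds: int = 3

    def run(self, prompt: str):
        knowledge: list = []  # accumulated filtered multimodal evidence
        for _ in range(self.max_rounds):
            # Plan sub-queries from the prompt and the current knowledge state.
            for query in self.plan_subqueries(prompt, knowledge):
                raw = self.retrieve(query)
                knowledge.extend(self.filter_evidence(raw, prompt))
            # Stop retrieving once the accumulated knowledge covers the prompt.
            if self.is_sufficient(prompt, knowledge):
                break
        # Integrate refined knowledge into an enriched, generation-ready prompt.
        return self.generate_image(self.enrich_prompt(prompt, knowledge))
```

The loop mirrors the adaptive control shown in the figure: retrieval continues only while the sufficiency check reports gaps in the accumulated knowledge.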

Benchmark

FIG-Eval

We introduce FIG-Eval, a curated benchmark designed to evaluate whether image-generation models can effectively leverage web-retrieved multimodal evidence to achieve FIG. It covers 10 entity classes and three concept categories, featuring knowledge-intensive prompts that encode implicit, domain-specific facts requiring external evidence beyond parametric knowledge. For reliable assessment, each prompt is paired with human-annotated ground-truth references. From this evidence, we manually derive multimodal QA questions that rigorously operationalize factual correctness, enabling automated scoring with state-of-the-art Vision-Language Models.
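The following is a minimal sketch of how such QA-based automatic scoring could work; the judge interface and question format are our assumptions, not the benchmark's released evaluation code.

```python
# Sketch of QA-based automatic scoring for FIG-Eval (our assumption of the
# protocol). Each generated image is checked against human-derived questions,
# grouped by concept category, and judged by a VLM; per-category accuracy is
# reported.
from collections import defaultdict

def score_image(image, questions, vlm_judge):
    """questions: list of (category, question, expected_answer) triples,
    where category is one of {"PF", "CC", "TC"}. vlm_judge(image, question)
    returns the judge's answer string (a hypothetical interface)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, question, expected in questions:
        answer = vlm_judge(image, question)
        hits[category] += int(answer.strip().lower() == expected.lower())
        totals[category] += 1
    # Accuracy per concept category, e.g., {"PF": 0.8, "CC": 0.6, "TC": 1.0}.
    return {c: hits[c] / totals[c] for c in totals}
```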

Prompt and Question Distribution Across 10 Entity Classes and Three Concept Categories: The entity classes include Animal (An.), Sports (Sp.), Transportation (Tr.), Landmarks (La.), Food (Fo.), People (Pe.), Plants (Pl.), Products (Pr.), Culture (Cu.), and Events (Ev.).

Categories                        An.   Sp.   Tr.   La.   Fo.   Pe.   Pl.   Pr.   Cu.   Ev.   Total
Prompt Number                      55    52    50    50    49    52    51    56    50    49     514
Perceptual Fidelity (PF)          233   197   207   138   116   194   219   287   171   105   1,867
Compositional Consistency (CC)    120   148   143   148   189   185   117   106   191   243   1,590
Temporal Consistency (TC)          73    46    44    95    88    40    89    31    65    65     636
All Concept Categories            426   391   394   381   388   419   425   418   427   413   4,093