ORIG

Open Multimodal Retrieval-Augmented
Factual Image Generation

Yang Tian1, Fan Liu2, Jingyuan Zhang3, Wei Bi4, Yupeng Hu1, Liqiang Nie5
Tianyangchn@gmail.com

1. iLearn Lab, Shandong University

2. National University of Singapore

3. Kuaishou Technology

4. Independent Researcher

5. Harbin Institute of Technology (Shenzhen)

Abstract

Large Multimodal Models (LMMs) have achieved remarkable progress in generating photorealistic and prompt-aligned images, but they often produce outputs that contradict verifiable knowledge, especially when prompts involve fine-grained attributes or time-sensitive events. Conventional retrieval-augmented approaches attempt to address this issue by introducing external information, yet they are fundamentally incapable of grounding generation in accurate and evolving knowledge due to their reliance on static sources and shallow evidence integration. To bridge this gap, we introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG), a new task that requires both visual realism and factual grounding. ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. To support systematic evaluation, we build FIG-Eval, a benchmark spanning ten categories across perceptual, compositional, and temporal dimensions. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines, highlighting the potential of open multimodal retrieval for factual image generation.
Sports case

Generate an image showing a comparison between the match balls used in the first and the most recent FIFA World Cup final.
GPT-Image-1, Direct Generation

Sports case

Generate an image showing a comparison between the match balls used in the first and the most recent FIFA World Cup final.
GPT-Image-1, Generation with ORIG

Transportation case

Generate an image of the XPeng Land Aircraft Carrier flying car.
GPT-Image-1, Direct Generation

Transportation case

Generate an image of the XPeng Land Aircraft Carrier flying car.
GPT-Image-1, Generation with ORIG

Transportation case

Generate an image of "邪恶大鼠标" (a Chinese nickname, literally "evil giant mouse") on the road.
GPT-Image-1, Direct Generation

Transportation case

Generate an image of "邪恶大鼠标" (a Chinese nickname, literally "evil giant mouse") on the road.
GPT-Image-1, Generation with ORIG

Transportation case

Generate an image to show the special entry method of the TEOREMA vehicle.
GPT-Image-1, Direct Generation

Transportation case

Generate an image to show the special entry method of the TEOREMA vehicle.
GPT-Image-1, Generation with ORIG

Sports case

Generate an image of the competition scene featuring the first female Olympic gold medalist.
GPT-Image-1, Direct Generation

Sports case

Generate an image of the competition scene featuring the first female Olympic gold medalist.
GPT-Image-1, Generation with ORIG

Events case

Generate an image showcasing the winning photograph and the photographer of the Landscape category in the Sony World Photography Awards Open 2025.
GPT-Image-1, Direct Generation

Events case

Generate an image showcasing the winning photograph and the photographer of the Landscape category in the Sony World Photography Awards Open 2025.
GPT-Image-1, Generation with ORIG

Events case

Generate the working environment when Microsoft was founded.
GPT-Image-1, Direct Generation

Events case

"Generate the working environment when Microsoft was founded.
GPT-Image-1, Generation with ORIG.

Products case

Generate a picture showing the 5-in-1 clean ability of AquaSense 2 Ultra.
GPT-Image-1, Direct Generation

Products case

Generate a picture showing the 5-in-1 clean ability of AquaSense 2 Ultra.
GPT-Image-1, Generation with ORIG

People case

Generate an image of the current president of the University of Toronto standing in front of Robarts Library.
GPT-Image-1, Direct Generation

People case

Generate an image of the current president of the University of Toronto standing in front of Robarts Library.
GPT-Image-1, Generation with ORIG


Factual Image Generation

Task Definition

We formalize FIG as a new task setting whose defining goal is to ensure factual consistency in generated images. Formally, given a query prompt P, the task requires producing an image that is semantically aligned with P and grounded in verifiable knowledge about entities, attributes, relations, and temporal events. Factual consistency in FIG spans three dimensions: Perceptual Fidelity, which ensures faithful perception and accurate rendering of objects' visual appearance; Compositional Consistency, which enforces accurate object properties and spatial relations; and Temporal Consistency, which ensures proper depiction of event timing and entity states. Unlike conventional image generation, FIG requires grounding in external evidence beyond the limited and static parametric memory of LMMs. In practice, this necessitates open retrieval from the web: textual and visual evidence contribute complementary knowledge that supports Perceptual Fidelity and Compositional Consistency, while the real-time nature of retrieval supplies the updated information essential for Temporal Consistency.
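As an illustrative formalization (the notation below is ours, not taken from the paper), the FIG objective can be sketched as selecting an image that balances semantic alignment with the prompt against factual consistency with retrieved knowledge:

```latex
% Illustrative formalization of FIG; S, F, and \lambda are our assumptions,
% not the paper's notation. P is the query prompt, K the retrieved external
% knowledge, and I the generated image.
\[
  I^{*} = \arg\max_{I} \; S(I, P) + \lambda \, F(I, K),
  \qquad
  F(I, K) = F_{\mathrm{PF}}(I, K) + F_{\mathrm{CC}}(I, K) + F_{\mathrm{TC}}(I, K),
\]
% where S measures semantic alignment with P, and F decomposes factual
% consistency into Perceptual Fidelity (PF), Compositional Consistency (CC),
% and Temporal Consistency (TC).
```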

Motivation of factual image generation (FIG) with open multimodal retrieval: (a) Reliance on internal knowledge alone often leads to outdated or hallucinated content. (b) Incorporating external information improves grounding but remains constrained by static and unimodal sources. (c) Leveraging open retrieval of multimodal evidence integrates evolving knowledge and complementary cues to achieve FIG.

Method

ORIG

We propose ORIG, an agentic open multimodal retrieval-augmented framework that grounds image generation in verifiable knowledge. As illustrated in the figure below, ORIG adopts an iterative pipeline that plans sub-queries, retrieves modality-specific evidence, filters noise through coarse-to-fine filtering, and incrementally integrates the refined knowledge into enriched prompts that guide factual image generation. The framework comprises three modules: an Open Multimodal Retrieval Module, which collects and filters web-scale evidence through adaptive query planning and sufficiency evaluation; a Prompt Construction Module, which integrates the input prompt with features extracted from the filtered evidence to create generation-ready prompts; and an Image Generation Module, which produces factually grounded images from the enriched prompts.

The overall pipeline of the ORIG framework: ORIG adaptively controls multimodal retrieval and prompt construction, dynamically deciding whether to continue retrieval or proceed based on the current state of accumulated knowledge.
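To make the control flow concrete, here is a minimal Python sketch of the iterative loop described above. Every component name (plan_subqueries, retrieve, filter_evidence, is_sufficient, enrich_prompt, generate_image) is a hypothetical placeholder for the corresponding ORIG module, not the framework's released API.

```python
# Minimal sketch of ORIG's iterative retrieve-filter-integrate loop.
# The injected callables are hypothetical stand-ins for the framework's
# modules, not its actual interface.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class OrigPipeline:
    plan_subqueries: Callable[[str, list], List[str]]  # adaptive query planning
    retrieve: Callable[[str], list]                    # open web retrieval (text + images)
    filter_evidence: Callable[[list, str], list]       # coarse-to-fine filtering
    is_sufficient: Callable[[str, list], bool]         # sufficiency evaluation
    enrich_prompt: Callable[[str, list], str]          # prompt construction
    generate_image: Callable[[str], object]            # e.g., a GPT-Image-1 call
    max_rounds: int = 3

    def run(self, prompt: str):
        knowledge: list = []  # accumulated filtered multimodal evidence
        for _ in range(self.max_rounds):
            # Plan sub-queries from the prompt and the current knowledge state.
            for query in self.plan_subqueries(prompt, knowledge):
                raw = self.retrieve(query)
                knowledge.extend(self.filter_evidence(raw, prompt))
            # Stop retrieving once the accumulated knowledge covers the prompt.
            if self.is_sufficient(prompt, knowledge):
                break
        # Integrate refined knowledge into an enriched, generation-ready prompt.
        return self.generate_image(self.enrich_prompt(prompt, knowledge))
```

The loop mirrors the adaptive control shown in the figure: retrieval continues only while the sufficiency check reports gaps in the accumulated knowledge.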

Benchmark

FIG-Eval

We introduce FIG-Eval, a curated benchmark designed to evaluate whether image-generation models can effectively leverage web-retrieved multimodal evidence to achieve FIG. It covers 10 entity classes and three concept categories, featuring knowledge-intensive prompts that encode implicit, domain-specific facts requiring external evidence beyond parametric knowledge. For reliable assessment, each prompt is paired with human-annotated ground-truth references. From this evidence, we manually derive multimodal QA questions that rigorously operationalize factual correctness, enabling automated scoring with state-of-the-art Vision-Language Models.
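The following is a minimal sketch of how such QA-based automatic scoring could work; the judge interface and question format are our assumptions, not the benchmark's released evaluation code.

```python
# Sketch of QA-based automatic scoring for FIG-Eval (our assumption of the
# protocol). Each generated image is checked against human-derived questions,
# grouped by concept category, and judged by a VLM; per-category accuracy is
# reported.
from collections import defaultdict

def score_image(image, questions, vlm_judge):
    """questions: list of (category, question, expected_answer) triples,
    where category is one of {"PF", "CC", "TC"}. vlm_judge(image, question)
    returns the judge's answer string (a hypothetical interface)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for category, question, expected in questions:
        answer = vlm_judge(image, question)
        hits[category] += int(answer.strip().lower() == expected.lower())
        totals[category] += 1
    # Accuracy per concept category, e.g., {"PF": 0.8, "CC": 0.6, "TC": 1.0}.
    return {c: hits[c] / totals[c] for c in totals}
```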

Prompt and Question Distribution Across 10 Entity Classes and Three Concept Categories: The entity classes include Animal (An.), Sports (Sp.), Transportation (Tr.), Landmarks (La.), Food (Fo.), People (Pe.), Plants (Pl.), Products (Pr.), Culture (Cu.), and Events (Ev.).

Categories                        An.   Sp.   Tr.   La.   Fo.   Pe.   Pl.   Pr.   Cu.   Ev.   Total
Prompt Number                      55    52    50    50    49    52    51    56    50    49     514
Perceptual Fidelity (PF)          233   197   207   138   116   194   219   287   171   105   1,867
Compositional Consistency (CC)    120   148   143   148   189   185   117   106   191   243   1,590
Temporal Consistency (TC)          73    46    44    95    88    40    89    31    65    65     636
All Concept Categories            426   391   394   381   388   419   425   418   427   413   4,093