Motivation for factual image generation (FIG) with open multimodal retrieval: (a) Relying on internal knowledge alone often leads to outdated or hallucinated content. (b) Incorporating external information improves grounding but remains constrained by static, unimodal sources. (c) Open retrieval of multimodal evidence integrates evolving knowledge and complementary cues to achieve FIG.
The overall pipeline of the ORIG framework: ORIG adaptively controls multimodal retrieval and prompt construction, dynamically deciding whether to continue retrieval or proceed based on the current state of accumulated knowledge.
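The control loop described in this caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the ORIG implementation: the names `Evidence`, `RetrievalState`, `adaptive_retrieval`, and `build_enriched_prompt`, the `retrieve` and `is_sufficient` callbacks, and the `max_rounds` cap are all hypothetical, and the toy stubs stand in for real retrievers and a real sufficiency judge.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Evidence:
    modality: str  # e.g. "text" or "image" (hypothetical labels)
    content: str


@dataclass
class RetrievalState:
    prompt: str
    evidence: List[Evidence] = field(default_factory=list)
    rounds: int = 0


def adaptive_retrieval(
    prompt: str,
    retrieve: Callable[[RetrievalState], List[Evidence]],
    is_sufficient: Callable[[RetrievalState], bool],
    max_rounds: int = 3,  # assumed safety cap, not taken from the source
) -> RetrievalState:
    """Keep retrieving until the accumulated evidence is judged sufficient."""
    state = RetrievalState(prompt=prompt)
    while state.rounds < max_rounds and not is_sufficient(state):
        # Continue retrieval: add the new evidence to the accumulated state.
        state.evidence.extend(retrieve(state))
        state.rounds += 1
    return state


def build_enriched_prompt(state: RetrievalState) -> str:
    """Append the gathered evidence to the original prompt, one item per line."""
    lines = [state.prompt]
    lines += [f"[{e.modality}] {e.content}" for e in state.evidence]
    return "\n".join(lines)


# Toy stand-ins so the sketch runs end to end.
def toy_retrieve(state: RetrievalState) -> List[Evidence]:
    # Alternate modalities across rounds to mimic multimodal retrieval.
    modality = "text" if state.rounds % 2 == 0 else "image"
    return [Evidence(modality, f"evidence from round {state.rounds + 1}")]


def toy_is_sufficient(state: RetrievalState) -> bool:
    # Placeholder criterion: stop once two pieces of evidence are collected.
    return len(state.evidence) >= 2
```

Calling `adaptive_retrieval("some prompt", toy_retrieve, toy_is_sufficient)` runs the loop with the toy stubs. Passing the stopping rule in as a callback keeps the continue-or-proceed decision separate from how evidence is fetched.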
Prompt and Question Distribution Across 10 Entity Classes and Three Concept Categories: The entity classes include Animal (An.), Sports (Sp.), Transportation (Tr.), Landmarks (La.), Food (Fo.), People (Pe.), Plants (Pl.), Products (Pr.), Culture (Cu.), and Events (Ev.).
| Categories | An. | Sp. | Tr. | La. | Fo. | Pe. | Pl. | Pr. | Cu. | Ev. | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Prompt Number | 55 | 52 | 50 | 50 | 49 | 52 | 51 | 56 | 50 | 49 | 514 |
| Perceptual Fidelity (PF) | 233 | 197 | 207 | 138 | 116 | 194 | 219 | 287 | 171 | 105 | 1,867 |
| Compositional Consistency (CC) | 120 | 148 | 143 | 148 | 189 | 185 | 117 | 106 | 191 | 243 | 1,590 |
| Temporal Consistency (TC) | 73 | 46 | 44 | 95 | 88 | 40 | 89 | 31 | 65 | 65 | 636 |
| All Concept Categories | 426 | 391 | 394 | 381 | 393 | 419 | 425 | 424 | 427 | 413 | 4,093 |