Recent Posts | SJTU Media Lab

Unified Multimodal Retrieval Framework for Multimodal RAG

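Source: PAKDD 2026

Multimodal retrieval-augmented generation (Multimodal RAG) mitigates hallucination in large models and strengthens document understanding, but it remains limited by fragmented modalities, semantic fragmentation introduced by OCR, and inconsistent cross-modal similarity scores. This paper proposes a unified multimodal retrieval framework that performs end-to-end retrieval of text, image, and mixed image-text chunks in a single semantic space. The model consists of a unified multimodal encoder, a post-encoding residual fusion module, and a scaled training strategy. These respectively remove modality barriers, preserve unimodal semantic consistency while capturing cross-modal interactions, and correct imbalanced learning across modalities. During training, balancing the sample distribution across the three modalities strengthens visual representation learning, so the unified encoder can handle retrieval for each modality on its own and improves further when combined with the fusion mechanism. Experiments on six multimodal document question-answering benchmarks show that the framework outperforms existing methods on Recall@1 and Recall@3 across the board, achieving better multimodal retrieval with a smaller model. The unified architecture also significantly reduces the complexity of multimodal RAG systems and offers a new way to analyze and evaluate cross-modal alignment in mixed image-text documents.

Introduction

Figure 1. Overview of unified multimodal retrieval.

Retrieval-augmented generation (RAG) has become a core technique for improving the factual accuracy and knowledge coverage of large language models, and it is widely used in document question answering, information retrieval, and similar scenarios. In the real world, however, high-quality knowledge often takes the form of mixed text-image documents, such as figures and tables in academic papers, industrial reports, flowcharts, and presentation slides, in which text and images are strongly coupled semantically. Traditional RAG systems either process text only, relying on OCR to extract information, which causes semantic fragmentation and loss of visual information; or they use separate text and image retrieval pipelines, which leaves cross-modal similarity scores without a common scale and leads to inconsistent retrieval results and complex system deployment.

In recent years, multimodal retrieval and vision-language models (VLMs) have opened a new path for understanding mixed documents. Existing work such as VisRAG and GME retrieves by encoding documents as image features, or adopts multi-branch independent encoding, and has improved performance on specific datasets. These approaches still do not resolve modality fragmentation at its root. On the one hand, separate encoding frameworks break the intrinsic semantic links between text and images in mixed documents. On the other hand, the representation spaces of different modalities are inconsistent, which makes it hard to achieve both fairness and accuracy in cross-modal retrieval. In addition, existing training strategies generally use uniform sampling, so the text modality converges quickly because it accounts for a large share of the parameters, while the image and mixed modalities are under-represented because of insufficient learning capacity. This further aggravates the imbalance in cross-modal retrieval and limits the overall performance ceiling.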
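The post reports results in Recall@1 and Recall@3 without restating how the metric is computed. The following is a minimal sketch of one common definition, not the authors' evaluation code: it assumes Recall@k is the fraction of a query's relevant chunks that appear in the top k retrieved results, averaged over queries. The function names and the chunk identifiers (`txt_7`, `img_9`, and so on) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant chunks found in the top-k results (assumed definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant chunk")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Recall@k over (ranked results, relevant set) pairs, one per query."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Hypothetical retrieval output: text, image, and mixed chunks ranked in one list,
# as they would be when all modalities share a single similarity scale.
runs = [
    (["img_2", "txt_7", "mix_1"], {"txt_7"}),
    (["txt_3", "mix_4", "img_9"], {"img_9"}),
]

for k in (1, 3):
    print(f"Recall@{k} = {mean_recall_at_k(runs, k):.2f}")
```

When every query has exactly one relevant chunk, as in this toy data, this definition coincides with the hit rate at k.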

Haitao Huang, Tianyi Feng, Ruiyan Wang, Wei Xiong, Fei Huang, Zhengxue Cheng, Rong Xie, Li Song

2026-03-17 2 min read

People-Object Interaction Dataset

Introduction The people-object interaction dataset comprises 38 series of 30-view multi-person or single-person RGB-D video sequences, complemented by the corresponding camera parameters, foreground masks, SMPL models, and some point clouds and mesh files.

2024-01-28 2 min read

Digital Human Video Generation

Medialab Digital Human Video Generation. Now supports: text-driven generation and voice-driven generation. Try it: Digital Human Video Generation

Jun Ling, Han Xue, Anni Tang, Wentao Ma, Lisa Zhu

2023-03-28 1 min read

Job Wanted: Fundamental Model Design and Realization of 3D Digital Figures of Metaverse

Location: Electronic Department, Shanghai Jiao Tong University, 800 DongChuan Road, MinHang District, Shanghai, China. Department: Institute of Image Communication and Network Engineering. Form of work: on-site for candidates who are able to work in Shanghai; online work is also possible for candidates who cannot reach the Institute within the specified time or who are overseas.

Li Song

2022-12-08 2 min read

Free Viewpoint RGB-D Video Dataset

Introduction Our free viewpoint RGB-D video dataset is a dynamic, synchronized RGB-D HD dataset for free viewpoint synthesis that is close to practical use. The dataset consists of 14 groups of video sequences: 13 groups of human motion sequences, such as dancing, playing basketball and playing football, and 1 group of empty-scene sequences, as shown below.

2022-02-16 2 min read

Evolve Proposal Search Engine

Search engine for public JVET/JCTVC/JCT3V proposals. Now supports: instant search by title, number or author; filtering by meeting number and tags; one-click download or BibTeX citation. Try it: Evolve Proposal Search Engine

Donghui Feng

2021-07-20 1 min read

Learning An Inverse Tone Mapping Network with A Generative Adversarial Regularizer

Introduction With the development of ultra-high-definition television techniques, the demand for high-dynamic-range (HDR) content has increased rapidly in recent years. This demand requires us to convert a large amount of existing low-dynamic-range (LDR) images and videos to HDR content efficiently and effectively.

Shiyu Ning, Hongteng Xu, Li Song, Rong Xie, Wenjun Zhang

2018-04-20 2 min read

SJTU 8K 360-Degree Video Sequences

Ultra-high-resolution 8K 360-degree immersive video generally has a resolution of 8192 x 4096. It is still an emerging technology, so a great number of challenges remain to be solved. Video sequences play an important role in the corresponding research.

2017-10-21 1 min read

The second place winner for 2017 IEEE ICME-Twitch Grand Challenge

Congratulations to Yue Ma, Yan Huang, Hao Wang, Zhengyi Luo and Xuchao Lu for their outstanding work “A Segment Constraint ABR Algorithm for HEVC Encoder”.

2017-07-01 1 min read

The second place winner for 2017 IEEE ICME-Twitch Grand Challenge

Our papers behind i4K were presented at GTC'16, BMSB'16 and IBC'16.

2016-12-01 0 min read