In the field of artificial intelligence, large language models (LLMs) have made remarkable strides in reasoning capabilities. However, when these capabilities are extended to multimodal scenarios—where models must process both text and images—researchers face considerable challenges. These challenges are especially pronounced for small multimodal language models with limited parameter sizes.
A research team led by Professor Hongxia Yang at The Hong Kong Polytechnic University has proposed a training framework called Infi-MMR, which leverages an innovative three-phase reinforcement learning strategy. This framework successfully unlocks the multimodal reasoning potential of small language models, achieving state-of-the-art (SOTA) performance across several mathematical reasoning benchmarks—even surpassing some larger models in the process. The team's findings are detailed in their recent preprint titled “Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models”, now available on arXiv.
The paper lists Zeyu Liu, a research assistant at The Hong Kong Polytechnic University, and Yuhang Liu, a master's student at Zhejiang University, as co-first authors. Professor Hongxia Yang is the corresponding author. The team aims to extend rule-based reinforcement learning achievements from the text domain (such as those from DeepSeek-R1) to the multimodal domain, while addressing inherent challenges in multimodal reinforcement learning.
Owing to their limited parameter counts, small language models (SLMs) face three core challenges:
- Low-Quality Multimodal Reasoning Data: Rule-based reinforcement learning requires verifiable answers (a minimal reward sketch follows this list). However, most multimodal tasks focus on image captioning, description, or visual question answering, which lack rigorous reasoning elements, and existing datasets rarely offer complex reasoning tasks paired with verifiable outputs.
- Degradation of Core Reasoning Abilities: When multimodal LLMs integrate visual and textual data, they often compromise their core reasoning skills, a problem that is especially severe in smaller models. Moreover, the complexity of cross-modal fusion can disrupt structured reasoning and reduce task performance.
- Complex but Unreliable Reasoning Paths: When trained directly on multimodal data with reinforcement learning, models tend to generate overly complex and often inaccurate reasoning processes.
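To make the first challenge concrete, here is a minimal sketch of the kind of rule-based, verifiable reward that DeepSeek-R1-style reinforcement learning depends on. The \boxed{...} answer convention and the function name are illustrative assumptions, not the paper's actual implementation:

```python
# Minimal sketch of a rule-based, verifiable reward: extract the model's
# final answer and compare it against a ground-truth label, with no learned
# reward model involved. Formats here are illustrative assumptions.
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the label, else 0.0."""
    # Assume answers are wrapped in \boxed{...}, a common math-RL convention.
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable answer -> no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0

# Example: a caption-style VQA answer ("a dog on a couch") has no single
# verifiable form, which is why such data is hard to use for this kind of RL.
print(rule_based_reward(r"... so the area is \boxed{12}", "12"))  # 1.0
```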
The Infi-MMR framework addresses these issues through its three-stage curriculum learning approach:
Stage 1: Foundational Reasoning Activation
Instead of using multimodal inputs directly, this phase uses high-quality textual reasoning data to activate the model's reasoning capabilities through reinforcement learning. This approach builds a solid logical reasoning foundation and mitigates the degradation seen in standard multimodal models.
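This summary does not spell out the exact RL algorithm, but a Stage 1 rollout plausibly looks like the following GRPO-style sketch (the group-relative scheme popularized by DeepSeek-R1): sample several solutions to a text-only problem, score them with the verifiable reward, and weight updates by group-normalized advantage. `generate` and `reward_fn` are hypothetical stand-ins, not the paper's API:

```python
# A minimal sketch of Stage 1's idea, assuming a GRPO-style objective as in
# DeepSeek-R1 (the paper's exact algorithm and hyperparameters may differ).
from statistics import mean, pstdev
from typing import Callable

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against the group mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

def stage1_rollout(generate: Callable[[str], str],
                   reward_fn: Callable[[str, str], float],
                   problem: str, answer: str, num_samples: int = 8):
    """One text-only rollout: sampled responses paired with their advantages."""
    responses = [generate(problem) for _ in range(num_samples)]
    rewards = [reward_fn(r, answer) for r in responses]
    return list(zip(responses, group_advantages(rewards)))
```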
Stage 2: Cross-Modal Reasoning Adaptation
With the foundation in place, this phase gradually transitions the model to the multimodal domain using question-answer pairs supplemented with explanatory textual information. This helps the model adapt its reasoning skills to handle multimodal inputs.
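As an illustration of what such caption-supplemented question-answer data might look like, consider the sketch below; the field names, prompt template, and `<image>` tag are assumptions rather than the paper's actual schema:

```python
# Illustrative sketch of Stage 2 data: each multimodal sample pairs the image
# with an explanatory textual description, so the model can lean on its
# text-reasoning foundation while it adapts to visual input.
from dataclasses import dataclass

@dataclass
class CaptionAugmentedSample:
    image_path: str   # raw visual input
    caption: str      # explanatory text bridging vision and language
    question: str     # reasoning question with a verifiable answer
    answer: str       # ground truth for the rule-based reward

def build_prompt(sample: CaptionAugmentedSample) -> str:
    """Compose a Stage 2 prompt: the image plus its textual description."""
    return (f"<image>{sample.image_path}</image>\n"
            f"Description: {sample.caption}\n"
            f"Question: {sample.question}")
```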
Stage 3: Multimodal Reasoning Enhancement
To simulate real-world multimodal scenarios—where image descriptions may be missing—this stage removes textual hints and trains the model to perform reasoning directly from raw visual inputs. This reduces linguistic bias and promotes robust multimodal reasoning. Notably, the team introduced caption-augmented multimodal data, which aids the model in transferring its text-based reasoning skills to multimodal contexts and enables more reliable cross-modal reasoning.
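Putting the three stages together, the curriculum amounts to a schedule that progressively removes textual scaffolding. The sketch below is a simplification under assumed data fields and prompt formats; only the phase names mirror the stages above:

```python
# High-level sketch of the three-phase curriculum: text-only data first,
# caption-augmented multimodal data second, caption-free multimodal data
# last. Field names and the prompt format are hypothetical; the point is
# the progressive removal of textual hints.
PHASES = [
    # (phase name, include image?, include caption?)
    ("foundational_reasoning_activation", False, False),  # text-only RL
    ("cross_modal_reasoning_adaptation",  True,  True),   # image + caption
    ("multimodal_reasoning_enhancement",  True,  False),  # raw image only
]

def format_sample(phase_idx: int, sample: dict) -> str:
    """Format one training prompt according to the current phase."""
    _, use_image, use_caption = PHASES[phase_idx]
    parts = []
    if use_image:
        parts.append(f"<image>{sample['image_path']}</image>")
    if use_caption:
        parts.append(f"Description: {sample['caption']}")
    parts.append(f"Question: {sample['question']}")
    return "\n".join(parts)
```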
Using the Infi-MMR framework, the team fine-tuned Qwen2.5-VL-3B into Infi-MMR-3B, a small multimodal model focused on mathematical reasoning. The results are striking:
- On the MathVerse benchmark, which spans domains such as algebra and geometry, Infi-MMR-3B achieved 43.68% accuracy, outperforming models of the same scale and even surpassing some 8-billion-parameter models.
- On the MathVista benchmark, which assesses comprehensive reasoning ability, it achieved 67.2% accuracy, a 3.8% improvement over the baseline.
- Impressively, its MathVerse score even exceeds that of some proprietary models, such as GPT-4o (39.4%).
These achievements validate the effectiveness of the Infi-MMR framework and demonstrate the successful transfer of reasoning capabilities to the multimodal domain. The team emphasizes that while Infi-MMR-3B is tailored for mathematical reasoning, its core reasoning abilities are generalizable to other fields that require complex decision-making, such as education, healthcare, and autonomous driving.
Looking ahead, the team will continue exploring ways to enhance reasoning in multimodal models, aiming to empower small models with robust and transferable reasoning capabilities.