Small Team, Big Breakthrough: Prof. Hongxia Yang's Team Overcomes the Three Major Barriers to Large Model Fusion with the InfiFusion Framework
Using only about a hundred GPU hours, scientists have dismantled the "three walls" of large model fusion, making it possible to build stronger models from any open-source foundation. This breakthrough comes from the team led by Prof. Hongxia Yang, and is being hailed as a milestone in scalable, efficient large model integration.
I. Breaking Through the “Three Walls” of Large Model Fusion
According to the team, early attempts at large model fusion in the AI community often revolved around naïvely “stitching together” the parameters of multiple models. However, this approach quickly ran into three major barriers:
Distillation mismatch caused by differing vocabularies across models.
Semantic noise resulting from conflicting styles among multiple teacher models.
Persistent concerns over values and safety even after capabilities were distilled.
To address this, the team introduced a three-part fusion strategy:
InfiFusion: Tackles the vocabulary mismatch using Universal Logit Distillation (ULD) with Top-K selection and logit standardization, achieving stable and effective cross-vocabulary distillation with minimal computational cost.
InfiGFusion: Recognizes that aligning probability distributions is not enough—teacher models often encode different “syntactic skeletons.” This method treats logits as graphs and uses the Gromov-Wasserstein distance to perform structure-level alignment, resolving the second barrier.
InfiFPO: Focuses on preference alignment in the final stage using a modified RLHF (Reinforcement Learning from Human Feedback) framework. By introducing multi-source probability fusion, length normalization, and probability truncation, it ensures the resulting model is not only capable and coherent but also safe and aligned with human values.
“The trilogy of papers was designed to strengthen the three pillars of fusion: capability, structure, and value,” the team explained.
II. From “Reinforcing Foundations” to “Correcting Course”
Why were the three papers released in the order of distillation → structure → preference, rather than bundled together? According to the team, this reflects the rhythm of reinforcing foundations before correcting course.
Initially, the team set out to fuse the strengths of three stylistically distinct teacher models—Qwen-Coder, Qwen-Instruct, and Mistral-Small—into a central model, Phi-4. But their first experiments revealed a major roadblock: vocabulary mismatches. The same Chinese idiom would be tokenized completely differently by each teacher, often using obscure suffix tokens.
They focused first on the foundational distillation problem. In InfiFusion, they systematically swept the Top-K parameter and found that K = 10 captured almost all probability mass while minimizing gradient noise. They also applied Z-score standardization to logits before distillation, allowing the student model to focus on relative rankings rather than absolute values. “These technical details may seem trivial, but they’re what turn a ‘working’ distillation into a robust one,” the team noted.
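To make these two steps concrete, the sketch below shows one way to implement them in PyTorch: Z-score standardization over the vocabulary dimension, followed by Top-K truncation (K = 10, as in the sweep above) with renormalization. The function names, tensor shapes, and the renormalization choice are illustrative assumptions, not the team's released code.

```python
import torch

def standardize_logits(logits: torch.Tensor) -> torch.Tensor:
    """Z-score standardization over the vocabulary dimension, so the student
    learns relative rankings rather than absolute logit values."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return (logits - mean) / (std + 1e-6)

def top_k_distribution(logits: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Keep only the top-K probabilities per position (K = 10 in the sweep
    described above) and renormalize; the long tail is treated as noise."""
    probs = torch.softmax(logits, dim=-1)
    vals, idx = probs.topk(k, dim=-1)
    kept = torch.zeros_like(probs).scatter(-1, idx, vals)
    return kept / kept.sum(dim=-1, keepdim=True)

# Example: a batch of 2 sequences, 5 positions, vocabulary of 32,000 tokens
teacher_logits = torch.randn(2, 5, 32000)
teacher_targets = top_k_distribution(standardize_logits(teacher_logits), k=10)
```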
Once capability was firmly established, the next hurdle emerged: conflicting reasoning structures. For instance, in a multi-step reasoning task, one teacher model might filter sets first before calculating values, while another does the reverse. Though probabilities aligned, the solution paths clashed. InfiGFusion addressed this by modeling logits as graph structures and aligning them using Gromov-Wasserstein distance, helping the student learn not just probabilities but reasoning chains.
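The sketch below illustrates the underlying idea using the open-source POT optimal-transport library: build a similarity graph from each model's truncated output distributions, then measure how far apart the two graphs are under the Gromov-Wasserstein distance. The graph construction here (positions as nodes, cosine similarity as edges) is a deliberate simplification for illustration, not necessarily InfiGFusion's exact formulation, and a training objective would additionally need a differentiable approximation of this distance.

```python
import torch
import ot  # POT: Python Optimal Transport (pip install pot)

def logits_to_graph(logits: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Build a similarity graph over sequence positions from top-K truncated
    probabilities: node i is position i, edge (i, j) is the cosine similarity
    of their truncated distributions. A simplification for illustration."""
    probs = torch.softmax(logits, dim=-1)
    vals, idx = probs.topk(k, dim=-1)
    sparse = torch.zeros_like(probs).scatter(-1, idx, vals)
    normed = torch.nn.functional.normalize(sparse, dim=-1)
    return normed @ normed.T  # (positions, positions) similarity matrix

def structural_distance(teacher_logits, student_logits, k: int = 10) -> float:
    """Gromov-Wasserstein distance between the two graphs: small when the
    teacher's and student's relational structure matches, even if the
    individual per-position distributions differ."""
    C1 = logits_to_graph(teacher_logits, k).double().numpy()
    C2 = logits_to_graph(student_logits, k).double().numpy()
    p = ot.unif(C1.shape[0])
    q = ot.unif(C2.shape[0])
    return ot.gromov.gromov_wasserstein2(C1, C2, p, q, loss_fun="square_loss")

# Example: one 12-token sequence from each model, vocabulary of 32,000 tokens
teacher = torch.randn(12, 32000)
student = torch.randn(12, 32000)
print(structural_distance(teacher, student))
```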
With capability and structure integrated, they turned to preference alignment, a stage often ignored in model fusion. Existing techniques like RLHF and DPO focus on optimizing outputs using human preference data but don’t consider how to fuse preferences from multiple teacher models.
To solve this, InfiFPO fuses probabilistic preferences from all teachers, applies length normalization and max-margin stabilization, and achieves safer, more aligned outputs. As a result, the fused Phi-4 model improved its aggregate score from 79.95 to 83.33.
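A toy sketch of those ingredients is shown below: length-normalized sequence log-probabilities from several teachers, fused and clipped, feeding a DPO-style preference loss in which the fused teachers play the role of the reference model. The mean-then-clip fusion rule and all function names are illustrative assumptions, not InfiFPO's exact recipe.

```python
import torch

def seq_logprob(token_logprobs: torch.Tensor) -> torch.Tensor:
    """Length-normalized sequence log-probability: sum of per-token log-probs
    divided by length, so long responses are not penalized for being long."""
    return token_logprobs.sum(dim=-1) / token_logprobs.shape[-1]

def fuse_teacher_logprobs(per_teacher: list, clip_min: float = -5.0) -> torch.Tensor:
    """Fuse length-normalized log-probs from several teachers. The
    mean-then-clip rule is an illustrative choice; clipping bounds the
    influence of a teacher that assigns near-zero probability to a response."""
    fused = torch.stack([seq_logprob(t) for t in per_teacher]).mean(dim=0)
    return fused.clamp(min=clip_min)

def preference_loss(policy_chosen, policy_rejected,
                    fused_chosen, fused_rejected, beta: float = 0.1):
    """DPO-style loss where the fused teachers act as the reference model
    for chosen vs. rejected responses."""
    margin = beta * ((policy_chosen - fused_chosen)
                     - (policy_rejected - fused_rejected))
    return -torch.nn.functional.logsigmoid(margin).mean()

# Example: 3 teachers, a batch of 4 preference pairs, 16 tokens per response.
# Random negative values stand in for per-token log-probabilities.
teachers_chosen = [-torch.rand(4, 16) for _ in range(3)]
teachers_rejected = [-torch.rand(4, 16) for _ in range(3)]
policy_chosen = seq_logprob(-torch.rand(4, 16))
policy_rejected = seq_logprob(-torch.rand(4, 16))
loss = preference_loss(policy_chosen, policy_rejected,
                       fuse_teacher_logprobs(teachers_chosen),
                       fuse_teacher_logprobs(teachers_rejected))
print(loss)
```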
“We didn’t split the trilogy just for show—each stage exposed new bottlenecks that informed the next step,” the team said. “Every improvement fed directly into the following phase.”
They also recalled the night they finalized the distillation loss function. After testing over 20 loss variants—from temperature-scaled KL divergence to OT-based Wasserstein-KL hybrids—they realized the flashy methods couldn’t scale due to memory and time constraints on large models. Ultimately, they returned to a more elegant and practical solution: Universal Logit Distillation (ULD) loss, which converges faster than KL and boosts training speed by nearly 30%, without increasing GPU memory usage.
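For reference, a common formulation of a ULD-style loss compares the two models' sorted probability vectors, which is what makes it usable across mismatched vocabularies. The sketch below follows that formulation; whether it matches the team's exact variant is an assumption.

```python
import torch
import torch.nn.functional as F

def uld_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """ULD-style loss sketch: sort each probability vector in descending
    order and take the L1 distance between the sorted vectors. Only ranked
    probability mass is compared, so the two models do not need to share a
    vocabulary or a tokenizer."""
    s = F.softmax(student_logits, dim=-1).sort(dim=-1, descending=True).values
    t = F.softmax(teacher_logits, dim=-1).sort(dim=-1, descending=True).values
    # Pad the smaller vocabulary with zeros so the sorted vectors line up.
    if s.shape[-1] != t.shape[-1]:
        pad = abs(s.shape[-1] - t.shape[-1])
        if s.shape[-1] < t.shape[-1]:
            s = F.pad(s, (0, pad))
        else:
            t = F.pad(t, (0, pad))
    return (s - t).abs().sum(dim=-1).mean()

# Student and teacher with deliberately different vocabulary sizes (illustrative)
student = torch.randn(4, 32000)
teacher = torch.randn(4, 152064)
print(uld_loss(student, teacher))
```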
III. Building a Fused Phi-4 in 20 Hours—Democratizing Model Fusion for SMEs
In practical terms, the team reports that using an 8×H800 NVIDIA server, it took only 20 hours to transform Phi-4 into a fused version using their pipeline.
On math reasoning tasks (GSM8K and MATH), the fused Phi-4 achieved 3% higher accuracy than the standalone InfiFusion model.
In code generation, its pass rate improved by about 2%.
In multi-turn instruction following, refusal rates dropped dramatically—from nearly 50% to under 10%.
Most importantly, compute costs fell from millions of GPU hours to on the order of a hundred (the 8×H800, 20-hour run above works out to roughly 160 GPU hours), enabling smaller teams to integrate "expert collectives" into a single model deployable on a single 80GB GPU.
Two main application routes have emerged:
Vertical industries like finance, healthcare, and law, which have proprietary expert models but need a unified generalist interface. The three-step fusion packs capability, structure, and values without requiring shared weights.
Small and medium enterprises (SMEs) with limited compute and annotation resources. With this pipeline, they can simply plug in open-source teacher models and a small amount of domain-specific data to obtain a “custom expert team.”
Looking ahead, the team aims to extend this approach beyond text models into vision and speech, allowing cross-modal fusion through the same streamlined pipeline. They are also working on tensor-level plug-and-play distillation, reducing inference costs to under 70% of the original model's, which would make mobile deployment feasible.
Will “fusion” become a product? The answer is yes. Prof. Yang’s team has already developed a “Fuse-as-a-Service” middleware platform, where users can upload models and minimal domain data, and the system automatically completes the three-stage pipeline, returning a lightweight fused model.
“We’re currently piloting with three industry partners and aiming for a public beta of PI next year,” the team told DeepTech.
In their view, the ultimate future of large models may not lie in training a single all-knowing behemoth—but in fusing thousands of specialized experts into one unified force.
“Our InfiFusion series is just the first brick laid,” they concluded. “The true path to infinite fusion still lies ahead.”