A novel multimodal agent advances understanding of long video
Understanding long videos is a significant challenge for AI models. A PolyU research team has developed VideoMind, an AI video-language agent that mimics human thought processes to improve reasoning and question-answering capabilities. Its innovative Chain-of-LoRA (Low-Rank Adaptation) strategy also reduces the demand for computational resources and power, advancing generative AI in video analysis.
Role-based workflow improves video understanding
Videos longer than 15 minutes often contain complex information, including sequences of events, causality and scene transitions. To understand this content, AI must identify objects and track their changes over time. However, the extensive visual data requires substantial computational resources, creating challenges for traditional AI models.
To address these challenges, Professor Chen Changwen, Interim Dean of the Faculty of Computer and Mathematical Sciences and Chair Professor of Visual Computing, led a research team that developed a framework incorporating a role-based workflow inspired by human cognitive processes. This approach enhances temporal reasoning and addresses a significant obstacle in AI video understanding.
The four key roles for VideoMind are:
- Planner: coordinates all other roles for each query
- Grounder: localises and retrieves moments in the video
- Verifier: validates the accuracy of the retrieved moments and selects the most reliable one
- Answerer: generates the query-aware answer
In designing VideoMind, the researchers drew inspiration from human-like video understanding processes and built these four roles into a single framework, chained together as sketched below.
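Read as a pipeline, the Planner decides which roles a query needs, the Grounder proposes candidate moments, the Verifier keeps the most reliable one, and the Answerer responds from that segment. The Python sketch below illustrates this control flow only; the role names follow the article, but the function bodies are simplified placeholders rather than VideoMind's actual models.

```python
# Conceptual sketch of VideoMind's role-based workflow (placeholder logic, not the real models).
from dataclasses import dataclass


@dataclass
class Moment:
    start: float   # segment start, in seconds
    end: float     # segment end, in seconds
    score: float   # reliability score assigned by the Verifier


def planner(query: str) -> list[str]:
    """Decide which roles this query needs (simplified here to a fixed plan)."""
    return ["grounder", "verifier", "answerer"]


def grounder(video: str, query: str) -> list[Moment]:
    """Localise and retrieve candidate moments for the query (placeholder values)."""
    return [Moment(60.0, 95.0, 0.0), Moment(300.0, 340.0, 0.0)]


def verifier(video: str, query: str, candidates: list[Moment]) -> Moment:
    """Score each candidate moment and keep the most reliable one (dummy heuristic)."""
    for m in candidates:
        m.score = 1.0 / (1.0 + abs((m.end - m.start) - 30.0))
    return max(candidates, key=lambda m: m.score)


def answerer(video: str, query: str, moment: Moment) -> str:
    """Generate a query-aware answer grounded in the selected moment (placeholder)."""
    return f"Answer derived from segment {moment.start:.0f}s-{moment.end:.0f}s of {video}."


def videomind_pipeline(video: str, query: str) -> str:
    """Chain the four roles for a single query."""
    plan = planner(query)
    candidates = grounder(video, query)
    moment = verifier(video, query, candidates) if "verifier" in plan else candidates[0]
    return answerer(video, query, moment)


print(videomind_pipeline("lecture.mp4", "When does the speaker explain LoRA?"))
```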
Chain-of-LoRA enhances performance
A core innovation of VideoMind is its Chain-of-LoRA strategy. LoRA is a parameter-efficient fine-tuning technique that adapts AI models to specific tasks without full retraining. The team's approach attaches four lightweight LoRA adapters, one per role, to a single unified model, enabling dynamic role-switching during inference. This eliminates the need to load multiple separate models, improving efficiency and reducing costs.
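As a rough illustration of the mechanism (not VideoMind's released code), libraries such as Hugging Face PEFT allow several LoRA adapters to be registered on one backbone and switched at inference time. The sketch below assumes each role's adapter has already been trained and saved as a local PEFT directory; the adapter paths and the run_role helper are hypothetical placeholders, and generation itself is omitted.

```python
# Illustrative Chain-of-LoRA-style adapter switching with Hugging Face PEFT.
# Adapter directories are hypothetical placeholders, not VideoMind's published weights.
from transformers import Qwen2VLForConditionalGeneration
from peft import PeftModel

BASE = "Qwen/Qwen2-VL-2B-Instruct"        # open-source backbone named in the article
ADAPTERS = {                              # hypothetical local LoRA adapter directories
    "planner": "./adapters/planner",
    "grounder": "./adapters/grounder",
    "verifier": "./adapters/verifier",
    "answerer": "./adapters/answerer",
}

# Load the shared backbone once ...
base_model = Qwen2VLForConditionalGeneration.from_pretrained(BASE)

# ... attach the first LoRA adapter, then register the remaining roles on the same model.
model = PeftModel.from_pretrained(base_model, ADAPTERS["planner"], adapter_name="planner")
for role, path in ADAPTERS.items():
    if role != "planner":
        model.load_adapter(path, adapter_name=role)


def run_role(role: str, prompt: str) -> str:
    """Activate one role's LoRA weights and run a single step (generation call omitted)."""
    model.set_adapter(role)               # dynamic role switch; no second model is loaded
    # ... tokenize `prompt`, call model.generate(...), and decode the output here ...
    return f"[{role}] would process: {prompt}"


# One query flows through all four roles on a single backbone.
for role in ("planner", "grounder", "verifier", "answerer"):
    print(run_role(role, "When does the speaker explain LoRA?"))
```

Because only the small adapter weights change between roles, the large backbone stays resident in memory, which is where the efficiency and cost savings described above come from.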
VideoMind is available as open source on GitHub and Hugging Face, with experimental results demonstrating its effectiveness across 14 benchmarks. The framework outperformed state-of-the-art AI models, including GPT-4o and Gemini 1.5 Pro, particularly in tasks involving videos averaging 27 minutes in length. Remarkably, the smaller two-billion-parameter (2B) version of VideoMind matched the performance of many larger seven-billion-parameter (7B) models.
Emulating human efficiency
Humans switch between different thinking modes to understand videos, with the brain consuming about 25 watts of power – roughly a million times less than a supercomputer with equivalent capabilities. Professor Chen explained, “Inspired by this, we designed the role-based workflow that allows AI to understand videos like humans while leveraging the chain-of-LoRA strategy to minimise the need for computing power and memory in this process.”
The VideoMind framework, built on the open-source Qwen2-VL model and enhanced with optimisation tools, lowers technological costs and the threshold for deployment. It offers a viable way to reduce the power consumption of AI models, which are often constrained by limited computing resources and high energy demands.
Professor Chen added, “VideoMind not only overcomes the performance limitations of AI models in video processing, but also serves as a modular, scalable and interpretable multimodal reasoning framework. We envision that it will expand the application of generative AI to various areas, such as intelligent surveillance, sports and entertainment video analysis, video search engines and more.”