Video Comprehension
Toward a World Model: Building a Foundational Video Understanding Model
Summary
This project focuses on building advanced algorithms for intelligent video comprehension, enabling efficient search and retrieval of specific events across large, complex video datasets through Multimodal Learning (vision, language, and knowledge graphs) and Retrieval-Augmented Generation (RAG).
Funding Agency:
South Dakota State University | Startup Fund
Team:
Chulwoo Pack (PI) | McComish Dept. of EECS, SDSU
Harsh Dubey (M.S. Student) | McComish Dept. of EECS, SDSU
Muktiar Ali (Ph.D. Student) | McComish Dept. of EECS, SDSU
Sugam Mishura (M.S. Student) | McComish Dept. of EECS, SDSU
Omeshamisu Anigala (Ph.D. Student) | McComish Dept. of EECS, SDSU
Duration:
2023-2026
Total Funding:
$73,000
External Resources:
- Video Comprehension Score (VCS)
- Dense Caption Dataset (CLIP-CC)
- Dense Caption Generator (Forthcoming)
- Multimodal Video Anomaly Detection (Forthcoming)
Related Publications:
2025
- Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)
Harsh Dubey, and Chulwoo Pack
In Proceedings of the AAAI Conference on Artificial Intelligence, 2025
To address the limitations of current Large-scale Video-Language Models (LVLMs) in fine-grained understanding and long-term temporal memory, we propose a novel video understanding approach that integrates a Vision Language Model (VLM) and a Large Language Model (LLM) with a textual memory mechanism to ensure continuity and contextual coherence. In addition, we introduce a novel evaluation metric, VAD-Score (Video Automated Description Score), to assess precision, recall, and F1 scores for events, subjects, and objects. Our approach delivers competitive results on a diverse set of videos from the DREAM-1K dataset, spanning categories such as live-action, animation, shorts, stock, and YouTube, with a focus on fine-grained comprehension.
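The VAD-Score described above reports precision, recall, and F1 for events, subjects, and objects. The abstract does not give the exact formulation, so the following is only a minimal sketch of that idea, assuming each category is reduced to a set of extracted items that can be matched exactly against a reference set (the real metric likely uses softer matching):

```python
def prf(pred, ref):
    """Precision, recall, and F1 between a predicted and a reference item set."""
    pred, ref = set(pred), set(ref)
    tp = len(pred & ref)  # items present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def vad_score_sketch(predicted, reference):
    """Illustrative aggregate: average P/R/F1 over the three categories
    named in the abstract (events, subjects, objects)."""
    cats = ("events", "subjects", "objects")
    scores = [prf(predicted[c], reference[c]) for c in cats]
    p, r, f = (sum(s[i] for s in scores) / len(cats) for i in range(3))
    return {"precision": p, "recall": r, "f1": f}
```

The function names and the averaging scheme here are assumptions for illustration; the published VAD-Score definition should be taken from the paper itself.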
- PEARL: Perceptual and Analytical Representation Learning for Video Anomaly Detection
Omeshamisu Anigala, Kwanghee Won, and Chulwoo Pack
SIGAPP Appl. Comput. Rev., Apr 2025
Video anomaly detection is crucial for applications like surveillance and autonomous systems. Traditional methods often rely solely on visual cues, missing valuable contextual data. This paper presents Perceptual and Analytical Representation Learning (PEARL), a novel method that combines perceptual (raw sensory input) and analytical (higher-level context) modalities. Specifically, we integrate visual information with object tracking data, together with DOT-Norm, a normalization method specialized for tracking data that leverages ID switching to capture the high-level context of abnormal movements. We evaluate early- and late-fusion strategies to enhance anomaly detection, particularly for irregular movements marked by frequent track ID switches. Our approach, tested on the UCSD-Ped1 dataset, outperforms the state of the art, improving precision (+0.082), recall (+0.104), F1 score (+0.149), and AUC (+0.053). These findings highlight the potential of integrating analytical tracking data with perceptual video frames in a multimodal learning approach for anomaly detection, paving the way for future applications and research where knowledge-driven analytical modalities are crucial.
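The PEARL abstract contrasts early fusion (combining modality features before classification) with late fusion (combining per-modality anomaly scores), and uses frequent track ID switches as an analytical cue. The sketch below illustrates those two fusion patterns plus an ID-switch-rate stand-in; all function names and shapes are assumptions, and the real DOT-Norm formulation is defined only in the paper:

```python
import numpy as np


def early_fusion(visual_feat, track_feat):
    """Early fusion: concatenate modality feature vectors before the
    downstream anomaly classifier sees them."""
    return np.concatenate([visual_feat, track_feat], axis=-1)


def late_fusion(visual_score, track_score, w=0.5):
    """Late fusion: each modality produces its own anomaly score and the
    scores are combined, here as a simple weighted average."""
    return w * visual_score + (1 - w) * track_score


def id_switch_rate(track_ids):
    """Illustrative analytical cue (not the paper's DOT-Norm): fraction of
    consecutive frames in a window where the track ID changes, since
    frequent switches suggest irregular movement."""
    switches = sum(a != b for a, b in zip(track_ids, track_ids[1:]))
    return switches / max(len(track_ids) - 1, 1)
```

Early fusion lets the classifier learn cross-modal interactions at the feature level, while late fusion keeps the modalities independent until scoring; the paper evaluates both strategies.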