Video Comprehension
Toward a World Model: Building a Foundational Video Understanding Model
Summary
This project focuses on building advanced algorithms for intelligent video comprehension, enabling efficient search and retrieval of specific events across large, complex video datasets through Multimodal Learning (vision, language, and knowledge graphs) and Retrieval-Augmented Generation (RAG).
Funding Agency:
South Dakota State University | Startup Fund
Team:
Chulwoo Pack (PI) | McComish Dept. of EECS, SDSU
Harsh Dubey (M.S. Student) | McComish Dept. of EECS, SDSU
Muktiar Ali (Ph.D. Student) | McComish Dept. of EECS, SDSU
Sugam Mishura (M.S. Student) | McComish Dept. of EECS, SDSU
Omeshamisu Anigala (Ph.D. Student) | McComish Dept. of EECS, SDSU
Duration:
2023-2026
Total Funding:
$73,000
External Resources:
- Video Comprehension Score (VCS)
- Dense Caption Dataset (CLIP-CC)
- Dense Caption Generator (Forthcoming)
- Multimodal Video Anomaly Detection (Forthcoming)
Related Publications:
2025
- Leveraging Textual Memory and Key Frame Reasoning for Full Video Understanding Using Off-the-Shelf LLMs and VLMs (Student Abstract)
Harsh Dubey, and Chulwoo Pack
In Proceedings of the AAAI Conference on Artificial Intelligence, 2025
To address the limitations of current Large-scale Video-Language Models (LVLMs) in fine-grained understanding and long-term temporal memory, we propose a novel video understanding approach that integrates a Vision Language Model (VLM) and a Large Language Model (LLM) with a textual memory mechanism to ensure continuity and contextual coherence. In addition, we introduce a novel evaluation metric, VAD-Score (Video Automated Description Score), to assess precision, recall, and F1 scores for events, subjects, and objects. Our approach delivers competitive results on a diverse set of videos from the DREAM-1K dataset, spanning categories such as live-action, animation, shorts, stock, and YouTube, with a focus on fine-grained comprehension.
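The VAD-Score described above reports precision, recall, and F1 for events, subjects, and objects. The abstract does not give the exact formulation, so the following is only a minimal sketch of that idea, assuming each category is reduced to a set of extracted items that can be matched exactly against a reference set (the real metric likely uses softer matching):

```python
def prf(pred, ref):
    """Precision, recall, and F1 between a predicted and a reference item set."""
    pred, ref = set(pred), set(ref)
    tp = len(pred & ref)  # items present in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def vad_score_sketch(predicted, reference):
    """Illustrative aggregate: average P/R/F1 over the three categories
    named in the abstract (events, subjects, objects)."""
    cats = ("events", "subjects", "objects")
    scores = [prf(predicted[c], reference[c]) for c in cats]
    p, r, f = (sum(s[i] for s in scores) / len(cats) for i in range(3))
    return {"precision": p, "recall": r, "f1": f}
```

The function names and the averaging scheme here are assumptions for illustration; the published VAD-Score definition should be taken from the paper itself.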
- PEARL: Perceptual and Analytical Representation Learning for Video Anomaly Detection
Omeshamisu Anigala, Kwanghee Won, and Chulwoo Pack
SIGAPP Appl. Comput. Rev., Apr 2025
Video anomaly detection is crucial for applications like surveillance and autonomous systems. Traditional methods often rely solely on visual cues, missing valuable contextual data. This paper presents Perceptual and Analytical Representation Learning (PEARL), a novel method that combines perceptual (raw sensory input) and analytical (higher-level context) modalities. Specifically, we integrate visual information with object tracking data, together with DOT-Norm, a normalization method specialized for tracking data that leverages ID switching to capture the high-level context of abnormal movements. We evaluate early- and late-fusion strategies to enhance anomaly detection, particularly for irregular movements marked by frequent track ID switches. Our approach, tested on the UCSD-Ped1 dataset, outperforms the state of the art, improving precision (+0.082), recall (+0.104), F1 score (+0.149), and AUC (+0.053). These findings highlight the potential of integrating analytical tracking data with perceptual video frames in a multimodal learning approach for anomaly detection, paving the way for future applications and research where knowledge-driven analytical modalities are crucial.
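The PEARL abstract contrasts early fusion (combining modality features before classification) with late fusion (combining per-modality anomaly scores), and uses frequent track ID switches as an analytical cue. The sketch below illustrates those two fusion patterns plus an ID-switch-rate stand-in; all function names and shapes are assumptions, and the real DOT-Norm formulation is defined only in the paper:

```python
import numpy as np


def early_fusion(visual_feat, track_feat):
    """Early fusion: concatenate modality feature vectors before the
    downstream anomaly classifier sees them."""
    return np.concatenate([visual_feat, track_feat], axis=-1)


def late_fusion(visual_score, track_score, w=0.5):
    """Late fusion: each modality produces its own anomaly score and the
    scores are combined, here as a simple weighted average."""
    return w * visual_score + (1 - w) * track_score


def id_switch_rate(track_ids):
    """Illustrative analytical cue (not the paper's DOT-Norm): fraction of
    consecutive frames in a window where the track ID changes, since
    frequent switches suggest irregular movement."""
    switches = sum(a != b for a, b in zip(track_ids, track_ids[1:]))
    return switches / max(len(track_ids) - 1, 1)
```

Early fusion lets the classifier learn cross-modal interactions at the feature level, while late fusion keeps the modalities independent until scoring; the paper evaluates both strategies.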