Hacking LLM Chat and Building a Local Video Summarization Tool
At VAST Data, innovation moves at the speed of AI itself. With our recent integration of NVIDIA’s Video Search and Summarization (VSS) technology into the VAST Data Platform, we’ve quickly gained recognition for enabling large-scale, high-performance video summarization. In a field where breakthroughs happen daily—not quarterly—we’re committed to pushing boundaries and pioneering what’s next.
TLDR:
We wrote a video summarization tool using Gemma 3 + Ollama; the code is available at github.com/ramborogers/mattsvlm.
The New Tool
Released by Google in March, the Gemma 3 model offers outstanding edge performance, including image processing within chat contexts. Its 27B variant delivers particularly impressive capabilities, further enhanced by the April 18th, 2025 release of the Gemma 3 27B QAT version, which brings state-of-the-art AI performance to consumer GPUs.
Per Google:
To make Gemma 3 even more accessible, we are announcing new versions optimized with Quantization-Aware Training (QAT) that dramatically reduces memory requirements while maintaining high quality. This enables you to run powerful models like Gemma 3 27B locally on consumer-grade GPUs like the NVIDIA RTX 3090.
We now have a robust, locally runnable model comparable to much larger models and on par with DeepSeek R1. Gemma 3 demonstrates strong image understanding, chat functionality, and comprehension of 140 languages. However, it currently lacks the ability to process video.
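Before we get to video, here is what single-image prompting looks like against a local Gemma 3 QAT model. This is a minimal sketch using the Ollama Python client; the model tag and image path are assumptions about your local setup rather than anything taken from the tool itself.

```python
# Minimal single-image prompt against a locally pulled Gemma 3 QAT model.
# Assumes `pip install ollama` and a pulled tag such as "gemma3:27b-it-qat";
# substitute whatever tag and image path exist on your machine.
import ollama

response = ollama.chat(
    model="gemma3:27b-it-qat",
    messages=[
        {
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": ["sample/frame_0001.jpg"],  # hypothetical example path
        }
    ],
)
print(response["message"]["content"])
```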
The Problem with Video
While video is essentially a sequence of images, current large language models (LLMs) are typically limited to describing a single image per prompt. This constraint presents challenges in maintaining temporal consistency across video frames. Although individual frames can be summarized, establishing context and coherence across a series of images becomes exceedingly difficult with existing LLM capabilities.
Hacking the System
What if we took the new multi-image prompting in Gemma 3 QAT, added some prompt engineering, and used the chat context window to build up a view of the video over time? You'd have a local video summarization tool.
We need to stitch together some logic for our video-to-frame pipeline:
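Here is a rough sketch of that stage, assuming OpenCV for decoding; the sample rate and 32-frame batch size are illustrative guesses, not necessarily the exact values used in mattsvlm.

```python
# Sketch of the video-to-frame stage: sample frames from a clip and group
# them into batches small enough for multi-image prompting.
# Assumes OpenCV (`pip install opencv-python`); the sample rate and batch
# size here are illustrative, not taken verbatim from mattsvlm.
import cv2


def extract_frames(video_path: str, sample_fps: float = 8.0) -> list[bytes]:
    """Decode the clip and keep roughly `sample_fps` frames per second as JPEG bytes."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(native_fps / sample_fps)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            # JPEG bytes can be passed directly in an Ollama chat message's `images` list.
            encoded, buf = cv2.imencode(".jpg", frame)
            if encoded:
                frames.append(buf.tobytes())
        index += 1
    cap.release()
    return frames


def batch_frames(frames: list[bytes], size: int = 32) -> list[list[bytes]]:
    """Group sampled frames into batches of at most `size` for one prompt each."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]
```

Sampling keeps the frame count manageable; the performance statistics further down correspond to 150 frames split into 5 batches.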
We’re running this demo on an RTX 3090 Ti in an AMD-based Arch Linux system with Ollama.
We run the following command:
uv run python app.py sample/chunk_0004.mp4 "describe what you see"
We’re using 15 seconds of some public domain video here: LOONEY TUNES (Looney Toons): BUGS BUNNY - The Wacky Wabbit (1942) (Remastered) (HD 1080p)
This is the summary output; we're skipping the more detailed per-scene output for brevity.
## Description of What I See
Here's a description of the video content based on the provided frame analysis:
**1. Chronological Timeline:**
* **Time 0.0s - 3.88s:** A large, cartoonish man with a red hat, white shirt, and red vest is present, holding a pickaxe. He faces the viewer in a fixed pose with a wide, excited expression and makes hand gestures. The background consists of a blue sky and a brown, textured surface.
* **Time 4.00s - 7.88s:** A large, cartoonish skull with two horns is present, facing the viewer in a fixed pose. The background consists of a blue sky, a brown textured surface, and several cacti.
* **Time 8.00s - 11.88s:** A large, cartoonish skull with two horns is present, facing the viewer in a fixed pose. The background consists of a blue sky, a brown textured surface, and several cacti.
* **Time 12.00s - 15.88s:** A large, cartoonish skull with two horns is present, facing the viewer in a fixed pose. The background consists of a blue sky, a brown textured surface, and several cacti.
* **Time 16.00s - 18.62s:** A large, cartoonish skull with two horns is present, facing the viewer in a fixed pose. The background consists of a blue sky, a brown textured surface, and several cacti.
**2. Key Event Highlights:**
* A transition from the man to the skull begins (~4.0s).
**3. Overall Summary:**
The video depicts a transition from a cartoon man holding a pickaxe to a cartoon skull. The man is initially present, making gestures, and is then replaced by the skull. Throughout the majority of the analyzed duration, the skull remains static, facing the viewer against a consistent background of a blue sky and a desert-like landscape. There is minimal movement or change observed after the initial transition.
--------------------------------------------------------------------------------
--- Performance Statistics ---
Total runtime: 83.76 seconds
Frames processed: 150 in 5 batches of up to 32 frames each via ollama
Processing approach: Batched processing with temporal context
Average time per batch: 13.45 seconds
Summary generation time: 12.47 seconds
--------------------------------------------------------------------------------
Frame-by-frame analysis allows for temporally aligned video summarization. Batches of frames are summarized together, and each batch summary is appended to the same chat context so later batches can build on what came before. This approach enables high-quality summaries from a standard video-ingest prompt and can be further improved by incorporating metadata.
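As a hedged sketch of that batched, context-carrying loop (the model tag and prompt wording are our assumptions; the actual implementation lives in app.py in the repo):

```python
# Sketch of batched summarization with temporal context: each batch of frames
# is described inside one growing chat history, then a final prompt asks for
# the overall summary. The model tag and prompts are assumptions, not the
# exact ones used by mattsvlm.
import ollama

MODEL = "gemma3:27b-it-qat"  # assumed local QAT tag


def summarize_video(batches: list[list[bytes]], question: str) -> str:
    messages = []
    for i, frames in enumerate(batches):
        messages.append({
            "role": "user",
            "content": (
                f"Batch {i + 1} of {len(batches)}: describe these frames in "
                "chronological order, noting any changes from earlier batches."
            ),
            "images": frames,
        })
        reply = ollama.chat(model=MODEL, messages=messages)
        # Keep the model's own description in the history so later batches
        # inherit the temporal context.
        messages.append({"role": "assistant", "content": reply["message"]["content"]})

    messages.append({
        "role": "user",
        "content": (
            f"Using everything above, {question}. Include a chronological "
            "timeline, key event highlights, and an overall summary."
        ),
    })
    return ollama.chat(model=MODEL, messages=messages)["message"]["content"]
```

Carrying the assistant replies forward is what gives the final summary its timeline: each batch is described relative to what the model has already said about the previous ones.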
This output demonstrates that even a local language model, accessible through chat, can now perform video summarization. While current processing speed may not match dedicated systems, the rapid pace of innovation suggests a future where such capabilities become seamlessly integrated and significantly faster, unlocking exciting new possibilities.
Try the code yourself! The prompt is critical, and you may want to adjust the temperature if you experience hallucinations.
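If you do see hallucinated frames or events, lowering the sampling temperature through Ollama's request options is a sensible first knob to turn; the 0.2 below is a starting point, not a tuned value from the repo.

```python
# Lower the sampling temperature to make frame descriptions more deterministic.
# The 0.2 value is a starting point, not a tuned setting from mattsvlm.
import ollama

reply = ollama.chat(
    model="gemma3:27b-it-qat",  # assumed local QAT tag
    messages=[{
        "role": "user",
        "content": "Describe these frames factually; do not invent details.",
        "images": ["sample/frame_0001.jpg"],  # hypothetical frame path
    }],
    options={"temperature": 0.2},
)
print(reply["message"]["content"])
```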