Dimensionality Reduction in Cinematography: Mapping Visual Style with UMAP

We often describe cinema with subjective adjectives like "gritty," "vibrant," or "chaotic," but a film is fundamentally a high-dimensional data signal—millions of pixels shifting over time. This project was born from a desire to bridge the gap between Film Theory and Computer Vision by asking a simple question: Can we mathematically quantify "cinematographic style"?

My goal was to build a pipeline that extracts the unique visual fingerprint of any movie, transforming abstract concepts like color grading, pacing, and entropy into analyzable vectors. To achieve this, I developed moviesigdb, an open-source library that digests video content into temporal signals. I used the visually distinct world of Arcane as a proof of concept to demonstrate that algorithms can map the "visual geography" of a story without ever understanding the plot.

The Engine: Introducing moviesigdb

To handle the massive scale of video processing required for this project, I built and published moviesigdb (Movie Signal Database), an open-source Python library designed to extract "color fingerprints" from video files efficiently.

The library is available on PyPI and GitHub, serving as the foundational ETL (Extract, Transform, Load) tool for my cinematic analysis pipeline.

  • PyPI: Official open-source library

pip install moviesigdb

import moviesigdb as msdb

How It Works: The "Barcode" Pipeline

At its core, moviesigdb converts the temporal dimension of video into spatial data. It uses OpenCV to scan video files, but rather than processing every pixel of every frame (which is computationally expensive), it samples frames at a fixed rate, reduces each one to a single color, and stores the result inside the visualization itself:

  1. Fast Frame Extraction: The library calculates exact step sizes to sample frames at a specific frequency (e.g., 1.0 FPS or 0.5 FPS) regardless of the source video's native framerate. This ensures consistent data density across different media formats.
  2. Spatial Compression: For every sampled frame, the library collapses the 2D image (Height × Width) into a single RGB tuple (1 × 1) by calculating the mean color vector. This reduces gigabytes of video data into a lightweight "signal."
  3. The "Image is the Database" Architecture: The most distinctive feature of moviesigdb is how it stores data. When generating a "Movie Barcode" (a visualization of the timeline), the library injects the raw numerical data, serialized as JSON, directly into the PNG image’s metadata (the Description field).
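
Steps 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not moviesigdb's actual code: the function names `sample_indices`, `mean_color`, and `color_signal` are hypothetical, and frames are assumed to arrive as Height × Width × 3 arrays (for example, decoded with OpenCV).

```python
import numpy as np


def sample_indices(total_frames, native_fps, target_fps):
    """Frame indices that sample a video at target_fps, whatever its native rate."""
    if native_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    step = native_fps / target_fps  # e.g. 24 fps sampled at 0.5 fps -> every 48th frame
    indices = []
    k = 0
    while True:
        # Round k * step instead of accumulating an integer step, so fractional
        # native rates such as 23.976 fps do not drift over a long video.
        idx = int(round(k * step))
        if idx >= total_frames:
            break
        indices.append(idx)
        k += 1
    return indices


def mean_color(frame):
    """Collapse one H x W x 3 frame into a single mean color vector."""
    pixels = np.asarray(frame, dtype=float)
    return pixels.reshape(-1, pixels.shape[-1]).mean(axis=0)


def color_signal(frames, native_fps, target_fps):
    """Return an (n_samples, 3) array: one mean color per sampled frame."""
    indices = sample_indices(len(frames), native_fps, target_fps)
    return np.array([mean_color(frames[i]) for i in indices])


# Tiny synthetic "video": 96 flat frames at 24 fps, i.e. 4 seconds of footage.
frames = [np.full((4, 6, 3), i, dtype=np.uint8) for i in range(96)]
signal = color_signal(frames, native_fps=24.0, target_fps=0.5)
print(signal.shape)  # (2, 3): one sample every 2 seconds
```

Note that OpenCV decodes frames in BGR channel order, so a real pipeline would need to reorder the channels before treating the result as RGB.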

The following image represents a "barcode":

Figure 1: Barcode Season 2 Episode 1 Arcane
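
The idea of storing the data inside the barcode image can be sketched with Pillow's PNG text chunks. This is an assumption-laden illustration rather than moviesigdb's implementation: the helper names `save_barcode` and `load_signal` and the JSON key `"rgb"` are hypothetical; only the use of the Description field comes from the description above.

```python
import io
import json

import numpy as np
from PIL import Image
from PIL.PngImagePlugin import PngInfo


def save_barcode(signal, target, height=64):
    """Render one vertical stripe per sample and embed the raw values as JSON."""
    colors = np.asarray(signal, dtype=np.uint8)           # shape (n_samples, 3)
    stripes = np.tile(colors[np.newaxis, :, :], (height, 1, 1))  # (height, n_samples, 3)
    meta = PngInfo()
    # "rgb" is a hypothetical key; the real schema is not documented in this article.
    meta.add_text("Description", json.dumps({"rgb": colors.tolist()}))
    Image.fromarray(stripes).save(target, format="PNG", pnginfo=meta)


def load_signal(source):
    """Recover the numeric signal from the PNG metadata without reading any pixels."""
    with Image.open(source) as img:
        return json.loads(img.text["Description"])["rgb"]


buffer = io.BytesIO()  # stands in for a file on disk
save_barcode([[255, 0, 0], [0, 0, 255]], buffer)
buffer.seek(0)
print(load_signal(buffer))  # [[255, 0, 0], [0, 0, 255]]
```

The appeal of this design is that a single PNG serves as both the visualization and the dataset, so the two cannot drift apart.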


Project Application: Processing 1.6 Million Frames of Arcane

I tasked moviesigdb with digesting the entirety of Arcane (Seasons 1 & 2). The show contains approximately 1.6 million raw frames of animation.

Using the library's optimized extract_frames_fast module, I sampled the series at a fixed rate of 0.5 frames per second. This process converted roughly 18 hours of 4K animation into a precise, temporal color signal.
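
As a rough sanity check on these figures, the arithmetic can be written out. The 24 fps native frame rate is an assumption (the text does not state it); the frame count and sampling rate come from the paragraphs above.

```python
def sampled_frame_count(raw_frames, native_fps, sample_fps):
    """Return (duration in seconds, number of frames kept after sampling)."""
    duration_s = raw_frames / native_fps
    return duration_s, int(duration_s * sample_fps)


# 24.0 fps is an assumed native rate, not a figure taken from the article.
duration_s, kept = sampled_frame_count(1_600_000, native_fps=24.0, sample_fps=0.5)
print(f"{duration_s / 3600:.1f} hours of footage -> {kept} sampled frames")
# prints: 18.5 hours of footage -> 33333 sampled frames
```

Under that assumption, sampling keeps one frame in every 48, which is what makes the downstream analysis tractable on ordinary hardware.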

These extracted signals—visualized as the barcodes you see above—served as the raw input for the unsupervised learning (UMAP) model. By first compressing the show into these efficient color signals, I was able to map the series' entire visual trajectory without needing a supercomputer.

The following visual represents every episode of the show:

Figure 2: Every Frame Of Arcane

Final Graph

Extracting the raw colors with moviesigdb was just the first step. A "Movie Barcode" is beautiful, but it is linear—it traps you in the timeline. I wanted to break the timeline and see the structure.

To do this, I needed to group similar scenes together, regardless of when they happened in the show. If a scene in Episode 1 looks identical to a scene in Episode 9, they should be neighbors on the map.

1. Feature Engineering: Beyond Just Color

A simple average color isn't enough to define a style. A static red wall and a frantic red explosion might have the same average pixel value, but they feel completely different.

I built a custom VisualTrajectoryAnalyzer that slides a 30-second window across the entire series. For every window, it calculates an 8-dimensional feature vector:

  • Perceptual Color (CIELAB): How the human eye perceives the scene's palette.
  • Visual Entropy: A measure of complexity. (Is the image flat and foggy? Or sharp and detailed?)
  • Pacing Velocity: By measuring the rate of change between frames, we can mathematically quantify "action." High velocity means fast cuts and rapid movement; low velocity means stillness.
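
A sliding-window feature extractor along these lines might look like the sketch below. The internals of VisualTrajectoryAnalyzer are not shown in this article, so everything here is an assumption: the 8 dimensions are taken to be the mean and standard deviation of L, a, b (6), plus one entropy value and one velocity value. Entropy is approximated by the Shannon entropy of lightness values inside the window, a stand-in for a true per-frame image entropy, and the function names are hypothetical.

```python
import numpy as np


def srgb_to_lab(rgb):
    """Convert sRGB values in [0, 255] to CIELAB (D65 white point)."""
    c = np.asarray(rgb, dtype=float) / 255.0
    linear = np.where(c <= 0.04045, c / 12.92, ((c + 0.055) / 1.055) ** 2.4)
    to_xyz = np.array([[0.4124564, 0.3575761, 0.1804375],
                       [0.2126729, 0.7151522, 0.0721750],
                       [0.0193339, 0.1191920, 0.9503041]])
    xyz = linear @ to_xyz.T / np.array([0.95047, 1.0, 1.08883])
    delta = 6.0 / 29.0
    f = np.where(xyz > delta ** 3, np.cbrt(xyz), xyz / (3 * delta ** 2) + 4.0 / 29.0)
    lightness = 116 * f[..., 1] - 16
    a = 500 * (f[..., 0] - f[..., 1])
    b = 200 * (f[..., 1] - f[..., 2])
    return np.stack([lightness, a, b], axis=-1)


def window_features(lab, entropy_bins=16):
    """8 values for one window: Lab mean (3), Lab std (3), entropy (1), velocity (1)."""
    mean = lab.mean(axis=0)
    spread = lab.std(axis=0)
    # Stand-in entropy: how evenly lightness values spread over the histogram bins.
    hist, _ = np.histogram(np.clip(lab[:, 0], 0, 100), bins=entropy_bins, range=(0, 100))
    p = hist[hist > 0] / hist.sum()
    entropy = -(p * np.log2(p)).sum()
    # Velocity: average Lab-space jump between consecutive samples in the window.
    velocity = np.linalg.norm(np.diff(lab, axis=0), axis=1).mean()
    return np.concatenate([mean, spread, [entropy, velocity]])


def trajectory_features(rgb_signal, sample_fps=0.5, window_s=30.0):
    """Slide a window over the color signal and return an (n_windows, 8) matrix."""
    lab = srgb_to_lab(rgb_signal)
    size = max(2, int(round(window_s * sample_fps)))  # 15 samples at 0.5 fps
    return np.array([window_features(lab[i:i + size])
                     for i in range(len(lab) - size + 1)])


# A perfectly static red signal: spread, entropy, and velocity should all be zero.
static_red = np.tile([[200, 30, 30]], (20, 1))
features = trajectory_features(static_red)
print(features.shape)  # (6, 8)
```

This makes the red wall versus red explosion distinction concrete: both can share the first three values while differing sharply in the last five.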

2. The Algorithm: UMAP

Humans can't visualize 8-dimensional data. To make sense of these vectors, I used UMAP (Uniform Manifold Approximation and Projection).

UMAP is a dimensionality reduction algorithm that acts like a translator. It takes complex, high-dimensional relationships and flattens them into a 2D map. As a first approximation, it operates on a simple rule: Distance equals Difference.

  • If two dots on the graph are close together, those two scenes are visually very similar (e.g., two conversations in Piltover).
  • If two dots are far apart, they are visually alien to one another (e.g., a bright Hextech lab vs. a dark Zaun alley). UMAP preserves local neighborhoods more faithfully than global distances, so the exact gap between distant clusters should be read loosely.

Mathematically, UMAP constructs a high-dimensional graph by calculating the conditional probability p_{j|i} that a scene x_j is similar to scene x_i. It uses a local radius to ensure that even sparse regions (like the rare "Hextech" scenes) remain connected:

Figure 3: Similarity Equation


Where d(x_i, x_j) is the distance between the two feature vectors, ρ_i is the distance from x_i to its nearest neighbor, and σ_i acts as a normalization factor for the local density.
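
For reference, the standard UMAP form of this similarity, written with the same symbols, is given below. It is reconstructed from the published UMAP formulation rather than copied from Figure 3; the neighbor count k and the symmetrized weight p_{ij} are notation introduced here.

```latex
p_{j \mid i} = \exp\!\left( -\frac{\max\bigl(0,\; d(x_i, x_j) - \rho_i\bigr)}{\sigma_i} \right),
\qquad
\sum_{j=1}^{k} p_{j \mid i} = \log_2 k

p_{ij} = p_{j \mid i} + p_{i \mid j} - p_{j \mid i}\, p_{i \mid j}
```

Subtracting ρ_i guarantees that every scene has a similarity of 1 to its nearest neighbor, which is what keeps sparse regions attached to the graph. The sum, taken over the k nearest neighbors of x_i, is the condition used to choose σ_i, and the last line merges the two directed similarities into one symmetric edge weight.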

To generate the final 2D coordinates, the algorithm minimizes the fuzzy set cross-entropy (C) between the high-dimensional graph (P) and the low-dimensional embedding (Q). This forces the 2D layout to respect the complex relationships of the original 8D data:

Figure 4: Optimisation Equation
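
In the standard UMAP formulation, this objective takes the form below. It is a reconstruction consistent with the description above, not a copy of Figure 4. Here p_{ij} is the symmetrized high-dimensional similarity between scenes i and j and q_{ij} is its low-dimensional counterpart; the symbols y_i (the 2D coordinates of scene i) and a, b (curve parameters derived from UMAP's min_dist setting) are not named in the text.

```latex
C(P, Q) = \sum_{i \neq j} \left[ p_{ij} \log \frac{p_{ij}}{q_{ij}}
        + (1 - p_{ij}) \log \frac{1 - p_{ij}}{1 - q_{ij}} \right],
\qquad
q_{ij} = \left( 1 + a \, \lVert y_i - y_j \rVert^{2b} \right)^{-1}
```

The first term pulls together points that are similar in the original space, and the second pushes apart points that are not, which is why visually distinct scenes end up in separate regions of the map.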


The result is the final color map (Figure 5): a generated "Atlas" of Arcane, where every dot is a scene, and the continents are defined not by land and water, but by light, color, and kinetic energy.


Figure 5: The Final Color Map

I am considering publishing a Hugging Face dataset for other researchers to use.




