
V2V: Efficiently Synthesizing Video Results for Video Queries

Dominik Winecki

February 16, 2026

Transcript

  1. V2V: Efficiently Synthesizing Video Results for Video Queries

     Dominik Winecki (The Ohio State University, USA), Arnab Nandi (The Ohio State University, USA)
  2. Issues with ad hoc Visualization Scripts

     • User-Managed Data/State: the user must manage source videos and relational results, including across multiple iterations of a query.
     • Required Knowledge: the user must know how to write ad hoc visualization scripts and have a development environment in which to run them.
     • Slow Execution: imperative code cannot be transparently optimized, so performance is the user's responsibility.
  3. VDBMSs which can Return Videos

     • Provide ad hoc Scripts for Bounding Boxes
     • Only Return Clips of Videos

     [Slide shows screenshots of four VDBMS papers: EVA ("EVA: A Symbolic Approach to Accelerating Exploratory Video Analytics with Materialized Views," SIGMOD '22), VIVA ("VIVA: An End-to-End System for Interactive Video Analytics," CIDR '22), Spatialyze ("Spatialyze: A Geospatial Video Analytics System with Spatial-Aware Optimizations"), and VOCAL ("VOCAL: Video Organization and Interactive Compositional AnaLytics," CIDR '22).]
  4. Problem Statement

     Create a VDBMS module which allows synthesizing video results for video queries:
     • Support broad use cases, not domain-specific tasks
     • Answer every query that starts with "show me…"
     • Video edits should be able to use relational data

     [Diagram: a "show me" query flows into the VDBMS, which returns a result video.]
  5. Our Proposed Solution: V2V

     V2V uses a Data-Oriented Declarative Video Editor (DVE) for synthesizing video results for video queries.

     [Diagram: the query flows into the VDBMS, which produces a DVE spec and a relational result; the DVE editor inside V2V renders the final result video.]
  6. SQL queries are a declarative transformation over relational data.
     V2V specs are a declarative transformation over videos.
  7. Specification DSL for Editing Videos with Data

     Spec = <TimeDomain, Render,
             videos: {"vid1": "video1.mp4", ...},        -- source videos
             data_arrays: {"vid1_bb": "annot1.json"}>    -- data
     TimeDomain = Range(0, 600, 1/30)                    -- result video length & frame rate
     Render(t) = match t {                               -- instructions to render each frame
       t in Range(0, 300, 1/30) => BoundingBox(vid1[t], vid1_bb[t]),
       t in Range(300, 600, 1/30) => Grid(
           vid1[t + 13463/30],
           Overlay(vid2[t], "overlay.png"),
           Zoom(vid3[t], 10.0),
           vid4[t + 9952/30])}
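     To make the spec's structure concrete, here is a minimal sketch of one possible programmatic encoding. This is a hypothetical Python representation for illustration, not V2V's actual API; the `Spec` class, field names, and `frames` helper are all assumptions.

     ```python
     from dataclasses import dataclass
     from fractions import Fraction

     # Hypothetical encoding of the slide's spec: a time domain
     # (length and frame step) plus named source videos and data arrays.
     @dataclass
     class Spec:
         time_domain: tuple   # (start, end, frame_step) in seconds
         videos: dict         # logical name -> source file
         data_arrays: dict    # logical name -> data file

     def frames(spec):
         """Enumerate the rational timestamps the spec renders."""
         start, end, step = spec.time_domain
         t = Fraction(start)
         while t < Fraction(end):
             yield t
             t += Fraction(step)

     spec = Spec(
         time_domain=(0, 600, Fraction(1, 30)),
         videos={"vid1": "video1.mp4"},
         data_arrays={"vid1_bb": "annot1.json"},
     )
     # A 600 s domain at a 1/30 s frame step yields 18000 frames.
     print(sum(1 for _ in frames(spec)))  # 18000
     ```

     The rational frame step is what lets the Render function be defined pointwise over exact timestamps rather than approximate floating-point times.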
  8. Data Model

     • A video is an array of frames
     • A data array is an array of data values
     • Arrays are indexed by a rational number timestamp
     • Arrays can be backed by JSON, SQL, etc.

     Example array:
       Index: 0/24  1/24  2/24  3/24  4/24  …
       Value: 3     4     5     2     4     …
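     The indexing scheme above can be sketched as follows. This is an illustrative implementation under stated assumptions (a JSON-backed array sampled at a fixed rate), not V2V's actual data layer; the `DataArray` class is hypothetical.

     ```python
     from fractions import Fraction
     import io
     import json

     # Sketch of a data array indexed by rational timestamps,
     # backed here by a JSON list sampled at a fixed frame rate.
     class DataArray:
         def __init__(self, values, fps):
             self.values = values
             self.fps = fps  # samples per second

         def __getitem__(self, t):
             # Map a rational timestamp to an integer index: t * fps.
             idx = Fraction(t) * self.fps
             if idx.denominator != 1:
                 raise KeyError(f"no sample at t={t}")
             return self.values[int(idx)]

     # The slide's example array, sampled at 24 fps.
     arr = DataArray(json.load(io.StringIO("[3, 4, 5, 2, 4]")), fps=24)
     print(arr[Fraction(2, 24)])  # 5
     ```

     Because indices are exact rationals, a lookup either lands on a stored sample or fails loudly; there is no silent nearest-frame rounding in this sketch.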
  9. Spec Execution & Optimizations

     We build on existing data and video processing optimizations:
     • Temporal sharding
     • Operator merging
     • Stream copying & smart cuts
     All plans are reduced to FFmpeg commands for execution.
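     As a rough illustration of what "reducing a plan to FFmpeg commands" could look like, consider a clip plan. The command shapes below are ordinary FFmpeg usage (`-ss`/`-to` for trimming, `-c copy` for stream copying); the `clip_command` helper itself is a hypothetical sketch, not V2V's planner.

     ```python
     # Sketch: lower a simple clip plan to an FFmpeg invocation.
     # When no filter touches pixel data, the plan can stream-copy
     # (a "smart cut") instead of decoding and re-encoding.
     def clip_command(src, start, end, out, reencode=False):
         cmd = ["ffmpeg", "-ss", str(start), "-to", str(end), "-i", src]
         if reencode:
             cmd += ["-c:v", "libx264"]  # filters modified frames
         else:
             cmd += ["-c", "copy"]       # smart cut: no transcode
         return cmd + [out]

     print(" ".join(clip_command("vid1.mp4", 0, 300, "out.mp4")))
     ```

     The stream-copy path is where the data-dependent rewrites on the following slides pay off: any time range proven equivalent to the identity never needs re-encoding.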
  10. Data-Dependent Rewrites

     We use the specific values of data to further optimize editing:
     • Every filter has a data-dependent equivalence (DDE) function
     • Specs are run twice: first the DDEs run with data values and symbolic frames, then the actual filters run with data values and actual frames
  11. Example of Data-Dependent Rewrite

     The BoundingBox filter is equivalent to the identity filter iff there are no objects drawn on that frame:

       BoundingBox_dde(x: Frame, b: List<ObjectBound>) =
           x                   if |b| = 0
           BoundingBox(x, b)   if |b| > 0
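     In code, this DDE might look like the sketch below. The symbolic-frame representation (a plain string here) and the tuple encoding of the unreduced filter are illustrative assumptions.

     ```python
     # Sketch of the BoundingBox DDE: during the symbolic pass, the
     # filter reduces to the identity whenever the frame's object
     # list is empty, so that frame needs no pixel processing.
     def bounding_box_dde(x, bounds):
         """Return the expression the filter reduces to at this frame."""
         if len(bounds) == 0:
             return x                         # identity: nothing to draw
         return ("BoundingBox", x, bounds)    # filter must actually run

     frame = "vid1[t]"  # symbolic frame, never decoded in this pass
     print(bounding_box_dde(frame, []))                  # vid1[t]
     print(bounding_box_dde(frame, [(0, 0, 10, 10)])[0]) # BoundingBox
     ```

     Note that the DDE consumes real data values but only a symbolic frame, which is what makes the rewrite pass cheap: no video is decoded.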
  12. Example of Data-Dependent Rewrite

     Given a spec:

       TimeDomain = Range(0, 300)
       Render(t) = BoundingBox(vid1[t], vid1_bb[t])

     We can evaluate it with DDE functions to find an equivalent optimized spec on the specific data values:

       TimeDomain = Range(0, 300)
       Render(t) = match t {
         t in Range(0, 100)   => vid1[t],
         t in {101, 102, 103} => BoundingBox(vid1[t], vid1_bb[t]),
         t in Range(104, 300) => vid1[t],
       }

     The start and end of the video can now be stream-copied.
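     The rewrite pass above amounts to evaluating the DDE at every frame and grouping consecutive frames with the same outcome into ranges. A minimal sketch, assuming bounding-box data is available as a per-frame list of object lists (`partition_by_dde` is a hypothetical helper):

     ```python
     from itertools import groupby

     # Evaluate the BoundingBox DDE per frame and merge consecutive
     # frames with the same outcome into (needs_filter, lo, hi) ranges;
     # ranges with needs_filter=False can later be stream-copied.
     def partition_by_dde(bb_data):
         """bb_data[i] is the object list at frame i."""
         keyed = [(i, len(b) > 0) for i, b in enumerate(bb_data)]
         for needs_filter, grp in groupby(keyed, key=lambda p: p[1]):
             grp = list(grp)
             yield (needs_filter, grp[0][0], grp[-1][0])

     # Frames 0-2 have no objects, 3-4 have boxes, 5 is empty again.
     data = [[], [], [], [(1, 2)], [(3, 4)], []]
     print(list(partition_by_dde(data)))
     # [(False, 0, 2), (True, 3, 4), (False, 5, 5)]
     ```

     Each `False` range corresponds to an identity region of the spec, exactly the kind of span the slide says "can now be stream-copied."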
  13. Evaluation Datasets & Tasks

     Datasets:
     • Tears of Steel
     • KABR Drone Video

     Tasks:
     • Q1: Clip
     • Q2: Clip & splice 4 segments
     • Q3: Compose 4 videos into a grid
     • Q4: Apply a basic filter (blur)
     • Q5: Draw bounding boxes
     Q1–Q5 are 5-second tasks; Q6–Q10 are 1-minute tasks.

     Metrics:
     • Latency
  14. Results

     Optimized specs run 3–5× faster vs. unoptimized and Python baselines.

     [Charts: execution time (s) for Q1–Q10, unoptimized vs. optimized; and execution time (s) for TOS and KABR on Q5 and Q10, comparing Python + OpenCV, V2V unoptimized, and V2V optimized.]
  15. Future Work

     • On-Demand Streaming: stream the results as they are generated
     • Hardware-Accelerated Processing: hardware acceleration could significantly lower latency
     • Natural Language Querying: ideally, we should be able to answer any "show me…" style query
  16. Takeaways

     • VDBMSs can benefit from a video result synthesis engine
     • We use an intermediate DSL representation of edited result videos, then apply DBMS-style optimizations to run them
     • Our implementation, V2V, has optimizations which consistently deliver speedups over 3×