
V2V: Efficiently Synthesizing Video Results for Video Queries

Dominik Winecki

February 16, 2026

Transcript

  1. V2V: Efficiently Synthesizing Video Results for Video Queries

     Dominik Winecki (The Ohio State University, USA), Arnab Nandi (The Ohio State University, USA)
  2. Issues with ad hoc Visualization Scripts

     • User-Managed Data/State: the user must manage source videos and relational results, including across multiple iterations of a query.
     • Required Knowledge: the user must know how to write ad hoc visualization scripts and have a development environment in which to run them.
     • Slow Execution: imperative code cannot be transparently optimized, so performance is the user's responsibility.
  3. VDBMSs which can Return Videos

     • Provide ad hoc Scripts for Bounding Boxes
     • Only Return Clips of Videos

     [Slide shows screenshots of four VDBMS papers: EVA ("EVA: A Symbolic Approach to Accelerating Exploratory Video Analytics with Materialized Views," SIGMOD '22), VIVA ("VIVA: An End-to-End System for Interactive Video Analytics," CIDR '22), Spatialyze ("Spatialyze: A Geospatial Video Analytics System with Spatial-Aware Optimizations"), and VOCAL ("VOCAL: Video Organization and Interactive Compositional AnaLytics," CIDR '22).]
  4. Problem Statement

     Create a VDBMS module which allows synthesizing video results for video queries:
     • Support broad use cases, not domain-specific tasks
     • Answer every query that starts with "show me…"
     • Video edits should be able to use relational data

     [Diagram: a "show me" query flows into the VDBMS, which returns a result video.]
  5. Our Proposed Solution: V2V

     V2V uses a Data-Oriented Declarative Video Editor (DVE) for synthesizing video results for video queries.

     [Diagram: the query flows into the VDBMS, which produces a DVE spec and a relational result; the DVE editor inside V2V renders the final result video.]
  6. SQL queries are a declarative transformation over relational data.
     V2V specs are a declarative transformation over videos.
  7. Specification DSL for Editing Videos with Data

     Spec = <TimeDomain, Render,
             videos: {"vid1": "video1.mp4", ...},        -- source videos
             data_arrays: {"vid1_bb": "annot1.json"}>    -- data
     TimeDomain = Range(0, 600, 1/30)                    -- result video length & frame rate
     Render(t) = match t {                               -- instructions to render each frame
       t in Range(0, 300, 1/30) => BoundingBox(vid1[t], vid1_bb[t]),
       t in Range(300, 600, 1/30) => Grid(
           vid1[t + 13463/30],
           Overlay(vid2[t], "overlay.png"),
           Zoom(vid3[t], 10.0),
           vid4[t + 9952/30])}
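     To make the spec's structure concrete, here is a minimal sketch of one possible programmatic encoding. This is a hypothetical Python representation for illustration, not V2V's actual API; the `Spec` class, field names, and `frames` helper are all assumptions.

     ```python
     from dataclasses import dataclass
     from fractions import Fraction

     # Hypothetical encoding of the slide's spec: a time domain
     # (length and frame step) plus named source videos and data arrays.
     @dataclass
     class Spec:
         time_domain: tuple   # (start, end, frame_step) in seconds
         videos: dict         # logical name -> source file
         data_arrays: dict    # logical name -> data file

     def frames(spec):
         """Enumerate the rational timestamps the spec renders."""
         start, end, step = spec.time_domain
         t = Fraction(start)
         while t < Fraction(end):
             yield t
             t += Fraction(step)

     spec = Spec(
         time_domain=(0, 600, Fraction(1, 30)),
         videos={"vid1": "video1.mp4"},
         data_arrays={"vid1_bb": "annot1.json"},
     )
     # A 600 s domain at a 1/30 s frame step yields 18000 frames.
     print(sum(1 for _ in frames(spec)))  # 18000
     ```

     The rational frame step is what lets the Render function be defined pointwise over exact timestamps rather than approximate floating-point times.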
  8. Data Model

     • A video is an array of frames
     • A data array is an array of data values
     • Arrays are indexed by a rational number timestamp
     • Arrays can be backed by JSON, SQL, etc.

     Example array:
       Index: 0/24  1/24  2/24  3/24  4/24  …
       Value: 3     4     5     2     4     …
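     The indexing scheme above can be sketched as follows. This is an illustrative implementation under stated assumptions (a JSON-backed array sampled at a fixed rate), not V2V's actual data layer; the `DataArray` class is hypothetical.

     ```python
     from fractions import Fraction
     import io
     import json

     # Sketch of a data array indexed by rational timestamps,
     # backed here by a JSON list sampled at a fixed frame rate.
     class DataArray:
         def __init__(self, values, fps):
             self.values = values
             self.fps = fps  # samples per second

         def __getitem__(self, t):
             # Map a rational timestamp to an integer index: t * fps.
             idx = Fraction(t) * self.fps
             if idx.denominator != 1:
                 raise KeyError(f"no sample at t={t}")
             return self.values[int(idx)]

     # The slide's example array, sampled at 24 fps.
     arr = DataArray(json.load(io.StringIO("[3, 4, 5, 2, 4]")), fps=24)
     print(arr[Fraction(2, 24)])  # 5
     ```

     Because indices are exact rationals, a lookup either lands on a stored sample or fails loudly; there is no silent nearest-frame rounding in this sketch.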
  9. Spec Execution & Optimizations

     We build on existing data and video processing optimizations:
     • Temporal sharding
     • Operator merging
     • Stream copying & smart cuts
     All plans are reduced to FFmpeg commands for execution.
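     As a rough illustration of what "reducing a plan to FFmpeg commands" could look like, consider a clip plan. The command shapes below are ordinary FFmpeg usage (`-ss`/`-to` for trimming, `-c copy` for stream copying); the `clip_command` helper itself is a hypothetical sketch, not V2V's planner.

     ```python
     # Sketch: lower a simple clip plan to an FFmpeg invocation.
     # When no filter touches pixel data, the plan can stream-copy
     # (a "smart cut") instead of decoding and re-encoding.
     def clip_command(src, start, end, out, reencode=False):
         cmd = ["ffmpeg", "-ss", str(start), "-to", str(end), "-i", src]
         if reencode:
             cmd += ["-c:v", "libx264"]  # filters modified frames
         else:
             cmd += ["-c", "copy"]       # smart cut: no transcode
         return cmd + [out]

     print(" ".join(clip_command("vid1.mp4", 0, 300, "out.mp4")))
     ```

     The stream-copy path is where the data-dependent rewrites on the following slides pay off: any time range proven equivalent to the identity never needs re-encoding.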
  10. Data-Dependent Rewrites

     We use the specific values of data to further optimize editing:
     • Every filter has a data-dependent equivalence (DDE) function
     • Specs are run twice: first the DDEs run with data values and symbolic frames, then the actual filters run with data values and actual frames
  11. Example of Data-Dependent Rewrite

     The BoundingBox filter is equivalent to the identity filter iff there are no objects drawn on that frame:

       BoundingBox_dde(x: Frame, b: List<ObjectBound>) =
           x                   if |b| = 0
           BoundingBox(x, b)   if |b| > 0
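     In code, this DDE might look like the sketch below. The symbolic-frame representation (a plain string here) and the tuple encoding of the unreduced filter are illustrative assumptions.

     ```python
     # Sketch of the BoundingBox DDE: during the symbolic pass, the
     # filter reduces to the identity whenever the frame's object
     # list is empty, so that frame needs no pixel processing.
     def bounding_box_dde(x, bounds):
         """Return the expression the filter reduces to at this frame."""
         if len(bounds) == 0:
             return x                         # identity: nothing to draw
         return ("BoundingBox", x, bounds)    # filter must actually run

     frame = "vid1[t]"  # symbolic frame, never decoded in this pass
     print(bounding_box_dde(frame, []))                  # vid1[t]
     print(bounding_box_dde(frame, [(0, 0, 10, 10)])[0]) # BoundingBox
     ```

     Note that the DDE consumes real data values but only a symbolic frame, which is what makes the rewrite pass cheap: no video is decoded.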
  12. Example of Data-Dependent Rewrite

     Given a spec:

       TimeDomain = Range(0, 300)
       Render(t) = BoundingBox(vid1[t], vid1_bb[t])

     We can evaluate it with DDE functions to find an equivalent optimized spec on the specific data values:

       TimeDomain = Range(0, 300)
       Render(t) = match t {
         t in Range(0, 100)   => vid1[t],
         t in {101, 102, 103} => BoundingBox(vid1[t], vid1_bb[t]),
         t in Range(104, 300) => vid1[t],
       }

     The start and end of the video can now be stream-copied.
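     The rewrite pass above amounts to evaluating the DDE at every frame and grouping consecutive frames with the same outcome into ranges. A minimal sketch, assuming bounding-box data is available as a per-frame list of object lists (`partition_by_dde` is a hypothetical helper):

     ```python
     from itertools import groupby

     # Evaluate the BoundingBox DDE per frame and merge consecutive
     # frames with the same outcome into (needs_filter, lo, hi) ranges;
     # ranges with needs_filter=False can later be stream-copied.
     def partition_by_dde(bb_data):
         """bb_data[i] is the object list at frame i."""
         keyed = [(i, len(b) > 0) for i, b in enumerate(bb_data)]
         for needs_filter, grp in groupby(keyed, key=lambda p: p[1]):
             grp = list(grp)
             yield (needs_filter, grp[0][0], grp[-1][0])

     # Frames 0-2 have no objects, 3-4 have boxes, 5 is empty again.
     data = [[], [], [], [(1, 2)], [(3, 4)], []]
     print(list(partition_by_dde(data)))
     # [(False, 0, 2), (True, 3, 4), (False, 5, 5)]
     ```

     Each `False` range corresponds to an identity region of the spec, exactly the kind of span the slide says "can now be stream-copied."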
  13. Evaluation Datasets & Tasks

     Datasets:
     • Tears of Steel
     • KABR Drone Video

     Tasks:
     • Q1: Clip
     • Q2: Clip & splice 4 segments
     • Q3: Compose 4 videos into a grid
     • Q4: Apply a basic filter (blur)
     • Q5: Draw bounding boxes
     Q1–Q5 are 5-second tasks; Q6–Q10 are 1-minute tasks.

     Metrics:
     • Latency
  14. Results

     Optimized specs run 3–5× faster vs. unoptimized and Python baselines.

     [Charts: execution time (s) for Q1–Q10, unoptimized vs. optimized; and execution time (s) for TOS and KABR on Q5 and Q10, comparing Python + OpenCV, V2V unoptimized, and V2V optimized.]
  15. Future Work

     • On-Demand Streaming: stream the results as they are generated
     • Hardware-Accelerated Processing: hardware acceleration could significantly lower latency
     • Natural Language Querying: ideally, we should be able to answer any "show me…" style query
  16. Takeaways

     • VDBMSs can benefit from a video result synthesis engine
     • We use an intermediate DSL representation of edited result videos, then apply DBMS-style optimizations to run them
     • Our implementation, V2V, has optimizations which consistently deliver speedups over 3×