Slide 1

Slide 1 text

NoScope: Optimizing Neural Network Queries over Video at Scale Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, Matei Zaharia DAWN Project, Stanford InfoLab http://dawn.cs.stanford.edu/ 30 August 2017 @ VLDB 2017

Slide 2

Slide 2 text

Video is a rapidly growing source of data » London alone has 500K CCTVs » 300 hours of video are uploaded to YouTube every minute » High quality image sensors are incredibly cheap (<$0.70)

Slide 3

Slide 3 text

We can query video to understand the world e.g., traffic analysis, environmental monitoring, surveillance, customer behavior, urban dynamics, social science and media studies running example: when did buses pass by this intersection today? increasingly cheap to acquire this data… ...how to process it?

Slide 4

Slide 4 text

Computer vision lets us query video automatically Core capability: Object detection Input: visual data (e.g., images) Output: objects and boxes in scene

Slide 5

Slide 5 text

Neural networks dominate object detection » Idea: many parameters + nonlinear functions capture representations » Preferred for image analytics, often better than humans » High-quality models widely available (e.g., open source on GitHub) Enables new kinds of downstream analytics (e.g., use with DBMS)

Slide 6

Slide 6 text

Object detection neural networks evaluate video one frame at a time NN (YOLOv2) Problem: Analysis with NNs doesn’t scale NN Hardware Cost to Purchase Frames / Second K80 GPU $4000 50 P100 GPU $6000 80 Video Labels (e.g.,Y,N,Y,N) 500K video feeds? $1B+ of GPUs

Slide 7

Slide 7 text

This talk: NoScope a system for accelerating neural network video analysis using model specialization and database-inspired query optimization Our research: Can we make analytics on video scale?

Slide 8

Slide 8 text

Outline » Motivation: Exploding video data demands scalable processing » NoScope Architecture + Key Contributions » Specialized models to exploit query-specific locality » Difference detectors to exploit temporal locality » Cost-based optimizer for data-dependent model cascade » Experimental evaluation

Slide 9

Slide 9 text

NoScope Architecture, Interfaces Input: a) target object + b) target video (fixed angle only) + c) reference NN e.g., “find buses in this webcam feed using YOLOv2” ... NoScope Output: 0s 30s 60s 90s Binary labels over time e.g., buses appeared at 5-14s, 28-33s, … Objective: minimize runtime while mimicking reference NN within target accuracy (e.g., 1%)

Slide 10

Slide 10 text

Outline » Motivation: Exploding video data demands scalable processing » NoScope Architecture + Key Contributions » Specialized models to exploit query-specific locality » Difference detectors to exploit temporal locality » Cost-based optimizer for data-dependent model cascade » Experimental evaluation

Slide 11

Slide 11 text

Query: “When did buses pass by this intersection?” Target objects appear from similar perspectives in video Opportunity 1: Query-specific locality

Slide 12

Slide 12 text

NNs are typically trained to detect tens of object categories in arbitrary scenes and from arbitrary angles images from training set for YOLOv2 Query: “When did buses pass by this intersection?” Opportunity 1: Query-specific locality If we only want to detect buses in a given video, we’re overpaying

Slide 13

Slide 13 text

Key idea: specialize for query and video Idea: use big reference NN to train a smaller, specialized NN The specialized NN: Only works for a given video feed and object Is much, much smaller than the reference NN

Slide 14

Slide 14 text

Specialized models are much smaller 24 convolutional layers 64-1024 filters per layer 4096 neurons in FC layer 4 convolutional layers 32-128 filters per layer 32 neurons in FC layer 35 billion FLOPS 3 million FLOPS 10,000x fewer FLOPS NoScope specialized model YOLOv2

Slide 15

Slide 15 text

Specialized models are much faster YOLOv2: 80 fps Specialized NN: 25k+ fps 300x faster execution on GPU

Slide 16

Slide 16 text

Specialization != Model Compression Model compression/distillation [NIPS14, ICLR16]: lossless models Goal: smaller model for same task as reference model Result: typically 2-10x faster execution Specialization: perform “lossy” compression of reference model A specialized model does not generalize to other videos… …but is accurate on target video, up to 300x faster

Slide 17

Slide 17 text

NoScope’s Model Specialization Procedure 1. Run big NN for few hours to obtain video-specific training data 2. Train specialized NN over video-specific training data 3. Enable specialized NN, only call big NN when unsure In paper: NoScope automatically searches for the smallest NN

Slide 18

Slide 18 text

Outline » Motivation: Exploding video data demands scalable processing » NoScope Architecture + Key Contributions » Specialized models to exploit query-specific locality » Difference detectors to exploit temporal locality » Cost-based optimizer for data-dependent model cascade » Experimental evaluation

Slide 19

Slide 19 text

Opportunity 2: Temporal locality Both videos run at 30 frames per second, requiring 30 NN evaluations per second Query: “When did buses pass by this intersection?”

Slide 20

Slide 20 text

Opportunity 2: Temporal locality Query: “When did buses pass by this intersection?” Observation: frames close in time are often redundant NoScope: train a fast model to detect redundancy

Slide 21

Slide 21 text

Difference detection: detect redundant frames Many techniques in the literature for detecting scene changes NoScope: simple regression model over subtracted frames - = Frame 1 Frame 0 Difference Difference detection runs at 100k+ fps on CPU Surprising: detecting differences is faster than even specialized NNs

Slide 22

Slide 22 text

Outline » Motivation: Exploding video data demands scalable processing » NoScope Architecture + Key Contributions » Specialized models to exploit query-specific locality » Difference detectors to exploit temporal locality » Cost-based optimizer for data-dependent model cascade » Experimental evaluation

Slide 23

Slide 23 text

NoScope combines fast models in a cascade YOLOv2 80fps Labels Video Previously: [Viola, Jones CVPR 2001] YOLOv2 80fps Labels Specialized NN 25k fps Difference Detector 100k fps Idea: Use the cheapest model possible for each frame Video

Slide 24

Slide 24 text

Specialized NN Bus absent YOLOv2 Bus present Nothing confident? unsure Difference Detector unsure confident? Cascades avoid unnecessary computation on each frame Truck Bus

Slide 25

Slide 25 text

NoScope performs cost-based optimization for cascades Given an accuracy target, NoScope performs: Model search: e.g., how many layers in specialized NN? Cascade search: e.g., how to set the cascade thresholds? Data-dependent process: high-quality choices vary across queries and videos (see paper)

Slide 26

Slide 26 text

Typical NoScope Query Lifecycle 1. Run big NN over part of video for training data (~75 minutes) 2. Model search specialized NN, difference detector (15 minutes) 3. Perform cascade firing threshold search (2 minutes) 4. Activate cascade to process rest of video

Slide 27

Slide 27 text

Current Limitations (cf. Section 8) Targets binary detection tasks (e.g., bus/no bus) Ongoing research on also locating objects (e.g., bus location) Targets fixed-angle cameras (e.g., surveillance cameras) Ongoing research on moving cameras Does not automatically handle model drift Requires representative training set (e.g., morning, afternoon) Batch-oriented processing Poor on-GPU support for control flow in cascades

Slide 28

Slide 28 text

Outline » Motivation: Exploding video data demands scalable processing » NoScope Architecture + Key Contributions » Specialized models to exploit query-specific locality » Difference detectors to exploit temporal locality » Cost-based optimizer for data-dependent model cascade » Experimental evaluation

Slide 29

Slide 29 text

Experimental configuration and videos System setup: Difference detectors run on 32 CPU cores + specialized/target NNs runs on P100 GPU; omit MPEG decode time Seven video streams from real-world, fixed-angle surveillance cameras; 8-12 hours of video per stream (evaluation set) Taipei: bus Amsterdam: car Store: person Jackson Hole: car

Slide 30

Slide 30 text

NoScope enables accuracy-speed trade-offs Elevator (best result) Taipei (worst result) 40x faster @ 99.9% accuracy 5858x faster @ 96% accuracy 36.5x faster @ 99.9% accuracy 206x faster @ 96% accuracy

Slide 31

Slide 31 text

Factor Analysis: All components contribute to speedups Difference detection can filter 95% of frames Specialized models can filter all remaining frames 1 10 100 1000 10000 100000 1000000 YOLOv2 + Diff + Spec Frames per Second video: elevator false positives: 1% false negatives: 1% For this video: Similar trends for other videos, depending on content

Slide 32

Slide 32 text

Comparison w/ classic methods, non-specialized NNs NoScope delivers best trade-off Classic CV NN (no spec.) NoScope video: elevator

Slide 33

Slide 33 text

Additional content in paper » Lesion study evaluating contribution of each optimization » Demonstration of optimizer selection procedure » Efficient firing threshold search for optimizer » Additional details on limitations and extensions » Additional related work for computer vision, NNs, RDBMS

Slide 34

Slide 34 text

Conclusions Neural networks can automatically analyze rapidly growing video datasets, but are very slow to execute (50-80fps on GPU) NoScope accelerates NN-based video queries by: 1. Specializing networks to exploit query-specific locality 2. Training difference detectors to exploit temporal locality 3. Cost-based optimization for video-specific cascades Promising results (10-1000x speedups) for many queries https://github.com/stanford-futuredata/noscope