NoScope: Optimizing Neural Network Queries over Video at Scale

Transcript

  1. NoScope: Optimizing Neural Network Queries over Video at Scale
    Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, Matei Zaharia
    DAWN Project, Stanford InfoLab, http://dawn.cs.stanford.edu/
    30 August 2017 @ VLDB 2017
  2. Video is a rapidly growing source of data
    » London alone has 500K CCTVs
    » 300 hours of video are uploaded to YouTube every minute
    » High-quality image sensors are incredibly cheap (<$0.70)
  3. We can query video to understand the world: e.g., traffic analysis,
    environmental monitoring, surveillance, customer behavior, urban
    dynamics, social science and media studies.
    Running example: when did buses pass by this intersection today?
    It is increasingly cheap to acquire this data... but how do we process it?
  4. Computer vision lets us query video automatically.
    Core capability: object detection.
    Input: visual data (e.g., images). Output: objects and bounding boxes
    in the scene.
  5. Neural networks dominate object detection
    » Idea: many parameters + nonlinear functions capture representations
    » Preferred for image analytics, often better than humans
    » High-quality models widely available (e.g., open source on GitHub)
    Enables new kinds of downstream analytics (e.g., use with a DBMS)
  6. Object detection neural networks evaluate video one frame at a time
    (e.g., YOLOv2), emitting labels over time (e.g., Y, N, Y, N).
    Problem: analysis with NNs doesn't scale.
      Hardware   Cost to Purchase   Frames / Second
      K80 GPU    $4000              50
      P100 GPU   $6000              80
    500K video feeds? $1B+ of GPUs.
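The "$1B+" figure on this slide follows from quick arithmetic; a minimal sketch (assuming each feed runs at 30 fps and each GPU is dedicated to inference, both assumptions here, not stated on the slide):

```python
def gpus_needed(num_feeds, feed_fps, gpu_fps):
    """GPUs required to keep up with all feeds in real time."""
    total_fps = num_feeds * feed_fps
    # round up: a partially used GPU still has to be purchased
    return -(-total_fps // gpu_fps)

def fleet_cost(num_feeds, feed_fps, gpu_fps, gpu_price):
    return gpus_needed(num_feeds, feed_fps, gpu_fps) * gpu_price

# 500K feeds at 30 fps, on $4000 K80s that run YOLOv2 at 50 fps
print(gpus_needed(500_000, 30, 50))        # 300000 GPUs
print(fleet_cost(500_000, 30, 50, 4000))   # 1200000000 ($1.2B)
```

Even on the faster P100s, the same arithmetic still lands above a billion dollars, which is what motivates the rest of the talk.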
  7. This talk: NoScope, a system for accelerating neural network video
    analysis using model specialization and database-inspired query
    optimization. Our research question: can we make analytics on video
    scale?
  8. Outline
    » Motivation: Exploding video data demands scalable processing
    » NoScope Architecture + Key Contributions
    » Specialized models to exploit query-specific locality
    » Difference detectors to exploit temporal locality
    » Cost-based optimizer for data-dependent model cascade
    » Experimental evaluation
  9. NoScope Architecture, Interfaces.
    Input: (a) target object, (b) target video (fixed angle only),
    (c) reference NN; e.g., "find buses in this webcam feed using YOLOv2".
    Output: binary labels over time; e.g., buses appeared at 5-14s, 28-33s, ...
    Objective: minimize runtime while mimicking the reference NN within a
    target accuracy (e.g., 1%).
  10. Outline
    » Motivation: Exploding video data demands scalable processing
    » NoScope Architecture + Key Contributions
    » Specialized models to exploit query-specific locality
    » Difference detectors to exploit temporal locality
    » Cost-based optimizer for data-dependent model cascade
    » Experimental evaluation
  11. Query: "When did buses pass by this intersection?"
    Opportunity 1: query-specific locality. Target objects appear from
    similar perspectives in the video.
  12. NNs are typically trained to detect tens of object categories in
    arbitrary scenes and from arbitrary angles (images from the YOLOv2
    training set). Query: "When did buses pass by this intersection?"
    Opportunity 1: query-specific locality. If we only want to detect buses
    in a given video, we're overpaying.
  13. Key idea: specialize for the query and video.
    Idea: use the big reference NN to train a smaller, specialized NN.
    The specialized NN:
    » Only works for a given video feed and object
    » Is much, much smaller than the reference NN
  14. Specialized models are much smaller.
    YOLOv2: 24 convolutional layers, 64-1024 filters per layer,
    4096 neurons in the FC layer, 35 billion FLOPS.
    NoScope specialized model: 4 convolutional layers, 32-128 filters per
    layer, 32 neurons in the FC layer, 3 million FLOPS.
    10,000x fewer FLOPS.
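The FLOP totals above come from the slide; the per-layer arithmetic behind such counts can be sketched with a standard estimator (counting one multiply-add as 2 FLOPs, a common convention and an assumption here, as is the example layer shape):

```python
def conv2d_flops(out_h, out_w, in_ch, out_ch, kernel):
    """Approximate FLOPs for one conv layer: each output position of each
    output channel does kernel*kernel*in_ch multiply-adds (2 FLOPs each)."""
    return 2 * out_h * out_w * out_ch * (kernel * kernel * in_ch)

# e.g., a hypothetical 3x3 layer on a 56x56 feature map, 128 -> 128 channels
print(conv2d_flops(56, 56, 128, 128, 3))

# the slide's totals imply roughly the stated "10,000x fewer FLOPS"
yolo_flops, spec_flops = 35e9, 3e6
print(round(yolo_flops / spec_flops))  # 11667
```

A single mid-sized YOLOv2 layer already dwarfs the entire specialized network, which is why the speedup is so large.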
  15. Specialized models are much faster.
    YOLOv2: 80 fps. Specialized NN: 25k+ fps.
    300x faster execution on GPU.
  16. Specialization != model compression.
    Model compression/distillation [NIPS14, ICLR16] produces lossless models.
    Goal: a smaller model for the same task as the reference model.
    Result: typically 2-10x faster execution.
    Specialization instead performs "lossy" compression of the reference
    model. A specialized model does not generalize to other videos...
    ...but is accurate on the target video, and up to 300x faster.
  17. NoScope's model specialization procedure:
    1. Run the big NN for a few hours to obtain video-specific training data
    2. Train the specialized NN over the video-specific training data
    3. Enable the specialized NN; only call the big NN when unsure
    In paper: NoScope automatically searches for the smallest NN.
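The three steps above can be illustrated end to end with a deliberately tiny stand-in: a fake "reference NN" labels a prefix of the video, and a one-parameter threshold model (a toy substitute for NoScope's small CNN; every name and rule here is invented for illustration) is fit to mimic it:

```python
import random

# Step 1: a stand-in "reference NN" (the real system uses YOLOv2)
def reference_nn(frame):
    # pretend a bus is present when the frame's mean brightness > 0.5
    return sum(frame) / len(frame) > 0.5

random.seed(0)
video = [[random.random() for _ in range(16)] for _ in range(2000)]
train, rest = video[:500], video[500:]
labels = [reference_nn(f) for f in train]  # labels from the big model

# Step 2: fit a far smaller "specialized model" to those labels; here it
# is just a learned brightness threshold chosen by grid search
def fit_threshold(frames, labels):
    best_t, best_acc = 0.0, -1.0
    for i in range(101):
        t = i / 100
        acc = sum((sum(f) / len(f) > t) == y
                  for f, y in zip(frames, labels)) / len(frames)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

t = fit_threshold(train, labels)

# Step 3: the specialized model mimics the reference NN on the rest
agree = sum((sum(f) / len(f) > t) == reference_nn(f) for f in rest) / len(rest)
print(t, agree)
```

The point of the toy is the data flow, not the model class: expensive labels from the reference model become cheap supervision for a video-specific model.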
  18. Outline
    » Motivation: Exploding video data demands scalable processing
    » NoScope Architecture + Key Contributions
    » Specialized models to exploit query-specific locality
    » Difference detectors to exploit temporal locality
    » Cost-based optimizer for data-dependent model cascade
    » Experimental evaluation
  19. Opportunity 2: temporal locality.
    Query: "When did buses pass by this intersection?"
    Both videos run at 30 frames per second, requiring 30 NN evaluations
    per second.
  20. Opportunity 2: temporal locality.
    Query: "When did buses pass by this intersection?"
    Observation: frames close in time are often redundant.
    NoScope: train a fast model to detect redundancy.
  21. Difference detection: detect redundant frames.
    Many techniques in the literature detect scene changes; NoScope uses a
    simple regression model over subtracted frames (Frame 1 - Frame 0 =
    Difference). Difference detection runs at 100k+ fps on CPU.
    Surprising: detecting differences is faster than even specialized NNs.
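NoScope's actual detector is a learned regression model over subtracted frames; the core subtract-and-threshold idea can be sketched without the learning step (frames as flat pixel lists and the threshold value are illustrative assumptions):

```python
def mean_sq_diff(frame_a, frame_b):
    """Mean squared pixel difference between two frames (flat lists)."""
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)

def is_redundant(frame, reference_frame, threshold=0.01):
    # below the threshold, reuse the previous label instead of running any NN
    return mean_sq_diff(frame, reference_frame) < threshold

prev = [0.2, 0.2, 0.2, 0.2]
print(is_redundant([0.2, 0.2, 0.21, 0.2], prev))  # True: nearly identical
print(is_redundant([0.9, 0.1, 0.8, 0.0], prev))   # False: scene changed
```

Because this is a handful of arithmetic operations per pixel with no convolutions at all, it is plausible that it runs faster than even a tiny NN, which is the slide's "surprising" observation.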
  22. Outline
    » Motivation: Exploding video data demands scalable processing
    » NoScope Architecture + Key Contributions
    » Specialized models to exploit query-specific locality
    » Difference detectors to exploit temporal locality
    » Cost-based optimizer for data-dependent model cascade
    » Experimental evaluation
  23. NoScope combines fast models in a cascade.
    Before: Video -> YOLOv2 (80 fps) -> Labels.
    NoScope: Video -> Difference Detector (100k fps) -> Specialized NN
    (25k fps) -> YOLOv2 (80 fps) -> Labels.
    Idea: use the cheapest model possible for each frame.
    Cascades previously: [Viola, Jones CVPR 2001].
  24. Each frame flows through the cascade: the difference detector either
    answers confidently or, when unsure, passes the frame to the specialized
    NN; the specialized NN either answers confidently (bus present / bus
    absent) or, when unsure, passes the frame to YOLOv2.
    Cascades avoid unnecessary computation on each frame.
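The per-frame control flow just described can be sketched as a single function; the threshold values, argument names, and stub models are illustrative assumptions, not NoScope's code:

```python
def cascade(frame, diff_det, specialized, reference, prev_label,
            diff_thresh=0.01, low=0.2, high=0.8):
    """Run the cheapest confident model; fall through on 'unsure'."""
    if diff_det(frame) < diff_thresh:       # ~100k fps: frame unchanged,
        return prev_label, "diff"           # reuse the previous label
    score = specialized(frame)              # ~25k fps: tiny NN score in [0, 1]
    if score >= high:
        return True, "specialized"          # confidently present
    if score <= low:
        return False, "specialized"         # confidently absent
    return reference(frame), "reference"    # ~80 fps: big NN, rarely reached

# stub models standing in for the real detectors
label, stage = cascade(frame=[0.1],
                       diff_det=lambda f: 1.0,       # frame changed
                       specialized=lambda f: 0.95,   # confident "present"
                       reference=lambda f: True,
                       prev_label=False)
print(label, stage)  # True specialized
```

The expensive reference model only runs on frames where both cheap models abstain, which is what makes the average per-frame cost so low.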
  25. NoScope performs cost-based optimization for cascades.
    Given an accuracy target, NoScope performs:
    » Model search: e.g., how many layers in the specialized NN?
    » Cascade search: e.g., how to set the cascade thresholds?
    This is a data-dependent process: high-quality choices vary across
    queries and videos (see paper).
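One way to picture the cascade-search half of this optimization is a grid search over firing thresholds: keep error within the budget while sending as few frames as possible to the big NN. This is an illustrative sketch, not NoScope's exact procedure (the paper describes a more efficient search):

```python
def search_thresholds(scores, ref_labels, max_error=0.01, grid=21):
    """Pick (low, high) thresholds for the specialized model's scores that
    stay within the error budget while minimizing reference-NN calls."""
    best = None
    ts = [i / (grid - 1) for i in range(grid)]
    for low in ts:
        for high in ts:
            if low > high:
                continue
            errors = fallbacks = 0
            for s, y in zip(scores, ref_labels):
                if s >= high:                # fire "present"
                    errors += (y is False)
                elif s <= low:               # fire "absent"
                    errors += (y is True)
                else:
                    fallbacks += 1           # frame goes to the reference NN
            if errors / len(scores) <= max_error:
                cost = fallbacks / len(scores)
                if best is None or cost < best[0]:
                    best = (cost, low, high)
    return best  # (fraction sent to big NN, low, high), or None

# toy data: a well-separated specialized model needs no big-NN calls
scores = [0.05, 0.1, 0.45, 0.55, 0.9, 0.95]
labels = [False, False, False, True, True, True]
print(search_thresholds(scores, labels))
```

The data dependence is visible here: with overlapping score distributions the same search would return wide thresholds and a large fallback fraction, so good choices genuinely vary per query and video.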
  26. Typical NoScope query lifecycle:
    1. Run the big NN over part of the video for training data (~75 minutes)
    2. Model search for the specialized NN and difference detector (15 minutes)
    3. Perform the cascade firing threshold search (2 minutes)
    4. Activate the cascade to process the rest of the video
  27. Current limitations (cf. Section 8):
    » Targets binary detection tasks (e.g., bus / no bus); ongoing research
      on also locating objects (e.g., bus location)
    » Targets fixed-angle cameras (e.g., surveillance cameras); ongoing
      research on moving cameras
    » Does not automatically handle model drift; requires a representative
      training set (e.g., morning, afternoon)
    » Batch-oriented processing; poor on-GPU support for control flow in
      cascades
  28. Outline
    » Motivation: Exploding video data demands scalable processing
    » NoScope Architecture + Key Contributions
    » Specialized models to exploit query-specific locality
    » Difference detectors to exploit temporal locality
    » Cost-based optimizer for data-dependent model cascade
    » Experimental evaluation
  29. Experimental configuration and videos.
    System setup: difference detectors run on 32 CPU cores; specialized and
    target NNs run on a P100 GPU; MPEG decode time is omitted.
    Seven video streams from real-world, fixed-angle surveillance cameras;
    8-12 hours of video per stream (evaluation set).
    Taipei: bus. Amsterdam: car. Store: person. Jackson Hole: car.
  30. NoScope enables accuracy-speed trade-offs.
    Elevator (best result): 40x faster @ 99.9% accuracy; 5858x faster @ 96%
    accuracy.
    Taipei (worst result): 36.5x faster @ 99.9% accuracy; 206x faster @ 96%
    accuracy.
  31. Factor analysis: all components contribute to speedups.
    For this video (elevator; 1% false positives, 1% false negatives):
    difference detection can filter 95% of frames, and specialized models
    can filter all remaining frames.
    [Chart: frames per second, log scale, for YOLOv2, + Diff, + Spec.]
    Similar trends hold for other videos, depending on content.
  32. Comparison with classic methods and non-specialized NNs (video:
    elevator): NoScope delivers the best trade-off over classic CV and
    non-specialized NN baselines.
  33. Additional content in the paper:
    » Lesion study evaluating the contribution of each optimization
    » Demonstration of the optimizer selection procedure
    » Efficient firing threshold search for the optimizer
    » Additional details on limitations and extensions
    » Additional related work on computer vision, NNs, and RDBMSs
  34. Conclusions. Neural networks can automatically analyze rapidly growing
    video datasets, but are very slow to execute (50-80 fps on GPU).
    NoScope accelerates NN-based video queries by:
    1. Specializing networks to exploit query-specific locality
    2. Training difference detectors to exploit temporal locality
    3. Cost-based optimization for video-specific cascades
    Promising results (10-1000x speedups) for many queries.
    https://github.com/stanford-futuredata/noscope