NoScope: Optimizing Neural Network Queries over Video at Scale

Transcript

  1. NoScope:

    Optimizing Neural Network Queries
    over Video at Scale
    Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, Matei Zaharia
    DAWN Project, Stanford InfoLab
    http://dawn.cs.stanford.edu/
    30 August 2017 @ VLDB 2017

  2. Video is a rapidly growing source of data
    » London alone has 500K CCTVs
    » 300 hours of video are uploaded to YouTube every minute
    » High-quality image sensors are incredibly cheap (<$0.70)

  3. We can query video to understand the world
    e.g., traffic analysis, environmental monitoring, surveillance, customer
    behavior, urban dynamics, social science and media studies
    Running example: when did buses pass by this intersection today?
    Increasingly cheap to acquire this data… how to process it?

  4. Computer vision lets us query video automatically
    Core capability: Object detection
    Input: visual data (e.g., images)
    Output: objects and boxes in the scene

  5. Neural networks dominate object detection
    » Idea: many parameters + nonlinear functions capture representations
    » Preferred for image analytics, often better than humans
    » High-quality models widely available (e.g., open source on GitHub)
    Enables new kinds of downstream analytics (e.g., use with DBMS)

  6. Object detection neural networks evaluate video one frame at a time
    Video → NN (YOLOv2) → Labels (e.g., Y, N, Y, N)
    Problem: Analysis with NNs doesn’t scale
    Hardware | Cost to Purchase | Frames / Second
    K80 GPU  | $4,000           | 50
    P100 GPU | $6,000           | 80
    500K video feeds? $1B+ of GPUs
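
As a rough sanity check on the $1B+ figure, here is a back-of-envelope calculation using the P100 numbers from the table above; the 30 fps per-feed rate is an assumption not stated on the slide:

```python
# Back-of-envelope GPU cost for running YOLOv2 on 500K live feeds.
# Assumes 30 fps per feed (assumption); GPU throughput/cost from the table above.
FEEDS = 500_000
FEED_FPS = 30
GPU_FPS = 80        # P100 throughput
GPU_COST = 6_000    # USD per P100

feeds_per_gpu = GPU_FPS / FEED_FPS        # ~2.7 feeds per GPU
gpus_needed = FEEDS / feeds_per_gpu       # ~187,500 GPUs
total_cost = gpus_needed * GPU_COST       # ~$1.1B
print(f"{gpus_needed:,.0f} GPUs, ~${total_cost / 1e9:.1f}B")
```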

  7. This talk:
    NoScope
    a system for accelerating neural network video
    analysis using model specialization and
    database-inspired query optimization
    Our research:
    Can we make analytics on video scale?

  8. Outline
    » Motivation: Exploding video data demands scalable processing
    » NoScope Architecture + Key Contributions
    » Specialized models to exploit query-specific locality
    » Difference detectors to exploit temporal locality
    » Cost-based optimizer for data-dependent model cascade
    » Experimental evaluation

  9. NoScope Architecture, Interfaces
    Input:
    a) target object
    b) target video (fixed angle only)
    c) reference NN
    e.g., “find buses in this webcam feed using YOLOv2”
    Output: binary labels over time
    e.g., buses appeared at 5-14s, 28-33s, …
    Objective: minimize runtime while mimicking reference NN
    within target accuracy (e.g., 1%)
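
To make the output format concrete, here is a small illustrative helper (not part of NoScope's API) that turns per-frame binary labels into the appearance intervals shown above:

```python
def labels_to_intervals(labels, fps=30):
    """Convert per-frame binary labels (1 = object present) into
    (start_sec, end_sec) intervals. Illustrative only, not NoScope's API."""
    intervals, start = [], None
    for i, present in enumerate(labels):
        if present and start is None:
            start = i
        elif not present and start is not None:
            intervals.append((start / fps, i / fps))
            start = None
    if start is not None:
        intervals.append((start / fps, len(labels) / fps))
    return intervals

# e.g., labels_to_intervals([0, 1, 1, 1, 0], fps=1) -> [(1.0, 4.0)]
```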

  10. Outline
    » Motivation: Exploding video data demands scalable processing
    » NoScope Architecture + Key Contributions
    » Specialized models to exploit query-specific locality
    » Difference detectors to exploit temporal locality
    » Cost-based optimizer for data-dependent model cascade
    » Experimental evaluation

  11. Query: “When did buses pass by this intersection?”
    Target objects appear from similar perspectives in video
    Opportunity 1: Query-specific locality

  12. NNs are typically trained to detect tens of object categories,
    in arbitrary scenes, and from arbitrary angles
    [images from the training set for YOLOv2]
    Query: “When did buses pass by this intersection?”
    Opportunity 1: Query-specific locality
    If we only want to detect buses in a given video, we’re overpaying

  13. Key idea: specialize for query and video
    Idea: use big reference NN to train a smaller, specialized NN
    The specialized NN:
    Only works for a given video feed and object
    Is much, much smaller than the reference NN

  14. Specialized models are much smaller
    YOLOv2: 24 convolutional layers, 64-1024 filters per layer,
    4096 neurons in FC layer (35 billion FLOPS)
    NoScope specialized model: 4 convolutional layers, 32-128 filters per layer,
    32 neurons in FC layer (3 million FLOPS)
    10,000x fewer FLOPS
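
For a sense of scale, here is a minimal Keras sketch of a model in this size range; the layer count and FC width follow the slide, but the filter widths, input resolution, and training setup are assumptions, not NoScope's exact architecture:

```python
from tensorflow.keras import layers, models

def tiny_specialized_nn(input_shape=(50, 50, 3)):
    """Binary classifier in the spirit of a NoScope specialized model:
    4 small conv layers, a 32-neuron FC layer, one output probability."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(32, activation="relu"),    # small fully connected layer
        layers.Dense(1, activation="sigmoid"),  # P(target object in frame)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```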

  15. Specialized models are much faster
    YOLOv2: 80 fps
    Specialized NN: 25k+ fps
    300x faster execution on GPU

  16. Specialization != Model Compression
    Model compression/distillation [NIPS14, ICLR16]: lossless models
    Goal: smaller model for same task as reference model
    Result: typically 2-10x faster execution
    Specialization: perform “lossy” compression of reference model
    A specialized model does not generalize to other videos…
    …but is accurate on target video, up to 300x faster

  17. NoScope’s Model Specialization Procedure
    1. Run the big NN for a few hours to obtain video-specific training data
    2. Train the specialized NN over the video-specific training data
    3. Enable the specialized NN; only call the big NN when unsure
    In paper: NoScope automatically searches for the smallest NN
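
A minimal sketch of steps 1-2 of the procedure above (step 3, the fallback at inference time, appears with the cascade on slide 24); `reference_nn` is assumed to return a per-frame boolean, and the function name is illustrative rather than taken from the NoScope codebase:

```python
import numpy as np

def train_specialized_nn(train_frames, reference_nn, small_nn, epochs=5):
    """Steps 1-2: label a prefix of the video with the big reference NN,
    then fit the small specialized NN to mimic those labels."""
    # Step 1: video-specific training data from the reference NN.
    labels = np.array([float(reference_nn(f)) for f in train_frames],
                      dtype=np.float32)
    # Step 2: train the specialized NN to reproduce the reference labels.
    small_nn.fit(np.stack(train_frames), labels, epochs=epochs, batch_size=64)
    return small_nn
```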

  18. Outline
    » Motivation: Exploding video data demands scalable processing
    » NoScope Architecture + Key Contributions
    » Specialized models to exploit query-specific locality
    » Difference detectors to exploit temporal locality
    » Cost-based optimizer for data-dependent model cascade
    » Experimental evaluation

  19. Opportunity 2: Temporal locality
    Both videos run at 30 frames per second,
    requiring 30 NN evaluations per second
    Query: “When did buses pass by this intersection?”

  20. Opportunity 2: Temporal locality
    Query: “When did buses pass by this intersection?”
    Observation: frames close in time are often redundant
    NoScope: train a fast model to detect redundancy

  21. Difference detection: detect redundant frames
    Many techniques in the literature for detecting scene changes
    NoScope: simple regression model over subtracted frames
    Frame 1 - Frame 0 = Difference
    Difference detection runs at 100k+ fps on CPU
    Surprising: detecting differences is faster than even specialized NNs
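
One plausible realization of such a detector, a regression model over cheap features of the subtracted frames; the summary features, threshold, and use of logistic regression here are assumptions for illustration and may differ from the paper's exact detector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def diff_features(frame, ref_frame):
    """Cheap summary statistics of the pixel-wise difference between a frame
    and a reference frame (e.g., the previous frame or a background estimate)."""
    d = np.abs(frame.astype(np.float32) - ref_frame.astype(np.float32))
    return [d.mean(), d.max(), (d > 25).mean()]

# Toy training data: "redundant" frames are noisy copies of the reference,
# "changed" frames have a synthetic bright patch (an "object") pasted in.
rng = np.random.default_rng(0)
ref = rng.integers(0, 255, (50, 50)).astype(np.float32)
redundant = [ref + rng.normal(0, 2, ref.shape) for _ in range(50)]
changed = []
for _ in range(50):
    f = ref + rng.normal(0, 2, ref.shape)
    f[10:30, 10:30] += 80
    changed.append(f)

X = np.array([diff_features(f, ref) for f in redundant + changed])
y = np.array([0] * 50 + [1] * 50)      # 0 = redundant, 1 = scene changed
detector = LogisticRegression().fit(X, y)
# A "redundant" prediction lets the cascade reuse the previous frame's label.
```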

  22. Outline
    » Motivation: Exploding video data demands scalable processing
    » NoScope Architecture + Key Contributions
    » Specialized models to exploit query-specific locality
    » Difference detectors to exploit temporal locality
    » Cost-based optimizer for data-dependent model cascade
    » Experimental evaluation

  23. NoScope combines fast models in a cascade
    Previously: Video → YOLOv2 (80 fps) → Labels
    NoScope: Video → Difference Detector (100k fps) → Specialized NN (25k fps)
    → YOLOv2 (80 fps) → Labels
    [cascades: Viola, Jones CVPR 2001]
    Idea: Use the cheapest model possible for each frame

  24. Cascades avoid unnecessary computation on each frame
    Difference Detector: confident? → nothing changed; unsure → pass frame on
    Specialized NN: confident? → bus present / bus absent; unsure → pass frame on
    YOLOv2: full detection (e.g., truck, bus) → bus present / bus absent
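
The per-frame control flow as a minimal sketch; the three models are passed in as callables, and the thresholds are placeholders that NoScope's optimizer would tune (next slide):

```python
def cascade_label(frame, prev_frame, prev_label,
                  diff_score, specialized_nn, reference_nn,
                  diff_thresh=0.1, low=0.1, high=0.9):
    """Label one frame using the cheapest model that is confident.
    diff_score, specialized_nn, reference_nn are callables; thresholds are
    placeholders (NoScope tunes them per query and video)."""
    # Stage 1: difference detector. If the frame looks redundant, reuse the label.
    if diff_score(frame, prev_frame) < diff_thresh:
        return prev_label
    # Stage 2: specialized NN. Accept only confident predictions.
    p = specialized_nn(frame)           # estimated P(target object present)
    if p >= high:
        return 1
    if p <= low:
        return 0
    # Stage 3: fall back to the full reference NN (e.g., YOLOv2).
    return int(reference_nn(frame))
```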

  25. NoScope performs cost-based optimization for cascades
    Given an accuracy target, NoScope performs:
    Model search: e.g., how many layers in the specialized NN?
    Cascade search: e.g., how to set the cascade thresholds?
    Data-dependent process: high-quality choices vary across
    queries and videos (see paper)
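
To make the cascade-search step concrete, here is a simple grid-search sketch over the specialized NN's firing thresholds; it assumes held-out frames already scored by the specialized NN and labeled by the reference NN, and the relative per-frame costs are illustrative (the paper's optimizer is more refined than this):

```python
import itertools
import numpy as np

def search_thresholds(spec_probs, ref_labels, target_acc=0.99,
                      spec_cost=0.003, ref_cost=1.0):
    """Grid-search (low, high) firing thresholds for the specialized NN:
    minimize expected per-frame cost subject to an accuracy target,
    measured against the reference NN's labels on held-out frames."""
    best = None
    for low, high in itertools.product(np.linspace(0.0, 0.5, 11),
                                       np.linspace(0.5, 1.0, 11)):
        # Unsure frames (low < p < high) are sent to the reference NN,
        # so they match the reference labels by construction.
        preds = np.where(spec_probs >= high, 1,
                         np.where(spec_probs <= low, 0, ref_labels))
        frac_unsure = np.mean((spec_probs > low) & (spec_probs < high))
        acc = np.mean(preds == ref_labels)
        cost = spec_cost + frac_unsure * ref_cost
        if acc >= target_acc and (best is None or cost < best[0]):
            best = (cost, low, high)
    return best   # (expected cost, low, high), or None if the target is unreachable
```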

  26. Typical NoScope Query Lifecycle
    1. Run the big NN over part of the video for training data (~75 minutes)
    2. Model search for the specialized NN and difference detector (~15 minutes)
    3. Perform cascade firing-threshold search (~2 minutes)
    4. Activate the cascade to process the rest of the video

  27. Current Limitations (cf. Section 8)
    Targets binary detection tasks (e.g., bus/no bus)
    Ongoing research on also locating objects (e.g., bus location)
    Targets fixed-angle cameras (e.g., surveillance cameras)
    Ongoing research on moving cameras
    Does not automatically handle model drift
    Requires representative training set (e.g., morning, afternoon)
    Batch-oriented processing
    Poor on-GPU support for control flow in cascades

  28. Outline
    » Motivation: Exploding video data demands scalable processing
    » NoScope Architecture + Key Contributions
    » Specialized models to exploit query-specific locality
    » Difference detectors to exploit temporal locality
    » Cost-based optimizer for data-dependent model cascade
    » Experimental evaluation

  29. Experimental configuration and videos
    System setup: Difference detectors run on 32 CPU cores;
    specialized/target NNs run on a P100 GPU; MPEG decode time omitted
    Seven video streams from real-world, fixed-angle surveillance cameras;
    8-12 hours of video per stream (evaluation set)
    Taipei: bus / Amsterdam: car / Store: person / Jackson Hole: car

  30. NoScope enables accuracy-speed trade-offs
    Elevator (best result):
    40x faster @ 99.9% accuracy
    5858x faster @ 96% accuracy
    Taipei (worst result):
    36.5x faster @ 99.9% accuracy
    206x faster @ 96% accuracy

  31. Factor Analysis: All components contribute to speedups
    [Bar chart: frames per second (log scale) for YOLOv2, + Diff, + Spec;
    video: elevator, false positives: 1%, false negatives: 1%]
    For this video:
    Difference detection can filter 95% of frames
    Specialized models can filter all remaining frames
    Similar trends for other videos, depending on content

  32. Comparison w/ classic methods, non-specialized NNs
    [Plot: accuracy vs. throughput for Classic CV, NN (no spec.), and NoScope;
    video: elevator]
    NoScope delivers the best trade-off

  33. Additional content in paper
    » Lesion study evaluating contribution of each optimization
    » Demonstration of optimizer selection procedure
    » Efficient firing threshold search for optimizer
    » Additional details on limitations and extensions
    » Additional related work for computer vision, NNs, RDBMS

  34. Conclusions
    Neural networks can automatically analyze rapidly growing video
    datasets, but are very slow to execute (50-80fps on GPU)
    NoScope accelerates NN-based video queries by:
    1. Specializing networks to exploit query-specific locality
    2. Training difference detectors to exploit temporal locality
    3. Cost-based optimization for video-specific cascades
    Promising results (10-1000x speedups) for many queries
    https://github.com/stanford-futuredata/noscope
