
Scaling Verizon IPTV Recommendations with Scala and Spark

Diana Hu
September 16, 2016

This talk covers the architecture we built to scale recommendations for the new television service we are building at Verizon. Scala has helped our group scale up models: we have learned to combine functional programming patterns with Big Data patterns in Spark to build them. We’ll highlight some use cases for building large similarity matrices.


Transcript

  1. SCALING VERIZON IPTV
    RECOMMENDATIONS
    Diana Hu & Russell Horton
    Verizon Labs
    RecSys LSRS, September 2016


  2. HELLO!
    Diana Hu
    ○  Working in IPTV since 2013
    ○  Formerly Intel Labs
    ○  Large Scale Machine Learning & Computer Vision
    ○  Scala & Spark since 2014
    @sdianahu
    Russell Horton
    ○  Joined Verizon Labs in 2015
    ○  Formerly Wordnik, EcoHealth
    ○  NLP, machine learning, entity detection, recommendations
    ○  Scala a while, Spark this year
    @ngr_am


  3. MODEL LIFECYCLE
    Development – Data Scientists
    ○  Data exploration
    ○  Feature engineering
    ○  Model construction
    ○  Evaluation
    } R, Python
    Production – Data Engineers
    ○  Re-implementation in Java, Scala, C
    ○  Re-evaluation at scale
    ○  Pipeline deployment
    ○  Integration with services


  4. MODEL LIFECYCLE
    Costs
    ○  Extra implementation
    ○  Late scaling
    ○  Synchronization pain
    ○  Slow iteration
    Development – Data Scientists
    ○  Data exploration
    ○  Feature engineering
    ○  Model construction
    ○  Evaluation
    Production – Data Engineers
    ○  Re-implementation in Java, Scala, C
    ○  Re-evaluation at scale
    ○  Pipeline deployment
    ○  Integration with services


  5. IPTV SPEED & SCALE


  6. DATA @ IPTV VERIZON
    ○  Millions of subscribers
    ○  Live TV viewings
    ○  DVR recordings
    ○  DVR playbacks
    ○  Raw events
    Time Sensitive
    ○  Recs are time-sensitive (TV schedule)
    Heavy head, long tail
    ○  Hundreds of channels + VOD


  7. OUR STACK
    Data Exploration
    Data Visualization
    Production Models and Data Pipelines


  8. WHY SPARK?
    We Want:
    ○  Scalability
    ○  Fast Aggregations
    ○  Counts and sorts
    ○  Libraries
    ○  Community
    ○  Readability
    ○  Expressiveness
    ○  Fast
    ○  Testability
    ○  Fault Tolerant
    ○  Unified
    Distributed System


  9. WHY SPARK?
    ○  Readability
    ○  Expressiveness
    ○  Fast
    ○  Testability
    ○  Fault Tolerant
    ○  Unified
    Distributed System
    We Want:
    ○  Scalability
    ○  Fast Aggregations
    ○  Counts and sorts
    ○  Libraries
    ○  Community


  10. WHY SCALA?
    https://nicholassterling.wordpress.com/2012/11/16/scala-performance/


  11. WHY SCALA FOR SPARK?
    ○  Spark is Scala
    ○  Scala gets new features first
    ○  Static typing
    ○  Performance (still)
    ○  Functional orientation


  12. MODEL LIFECYCLE
    Development – Data Scientists
    ○  Data exploration
    ○  Feature engineering
    ○  Model construction
    ○  Evaluation
    Production – Data Scientists and friends
    ○  Re-implementation in Java, Scala, C
    ○  Re-evaluation at scale
    ○  Pipeline deployment
    ○  Integration with services
    } Scala + Spark


  13. DISTRIBUTED SYSTEMS
    =>
    PARALLELISM


  14. DISTRIBUTED SYSTEMS
    =>
    PARALLELISM
    =>
    ENABLED BY
    ASSOCIATIVITY


  15. TRICKS WITH
    ABSTRACT ALGEBRA


  16. ABSTRACT ALGEBRA?
    “Abstract algebra is the set of advanced topics of
    algebra that deal with abstract algebraic structures
    rather than the usual number systems.”
    www.mathworld.wolfram.com/AbstractAlgebra.html


  17. SUMMING INTEGERS
    1 + 2 + … + 28 + 29 + 30 =
    (1 + 2 + 3 + 4 + 5 + 6) + … + (25 + 26 + 27 + 28 + 29 + 30)
    *http://nbviewer.jupyter.org/github/spark-mooc/mooc-setup/blob/master/spark_tutorial_student.ipynb
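
    The grouping on this slide can be checked with a few lines of plain Scala. This is only a sketch: the object name `ChunkedSum` and the chunk size of six are illustrative assumptions, and ordinary collections stand in for distributed partitions.

```scala
object ChunkedSum {
  val numbers: Vector[Int] = (1 to 30).toVector

  // Split into five chunks of six, as a stand-in for data partitions.
  val chunks: Vector[Vector[Int]] = numbers.grouped(6).toVector

  // Each chunk can be summed independently (and so in parallel).
  val partialSums: Vector[Int] = chunks.map(_.sum)

  // Associativity guarantees that combining the partial sums
  // gives the same answer as summing left to right.
  val total: Int = partialSums.sum
}
```

    Because addition is associative, `ChunkedSum.total` equals `ChunkedSum.numbers.sum` regardless of how the chunks are formed.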


  18. ALGEBRAIC PROPERTIES
    ○  Associative
    a + (b + c) = (a + b) + c
    ○  Commutative
    a + b = b + a
    ○  Identity
    a + 0 = a
    ○  Closed
    Int + Int = Int


  19. ALGEBRAIC PROPERTIES
    FOR PARALLELISM
    ○  Associative: enables parallelism
    a + (b + c) = (a + b) + c
    ○  Commutative: ignore order
    a + b = b + a
    ○  Identity: ignore empty data
    a + 0 = a
    ○  Closed: type safety
    Int + Int = Int

  20. ALGEBRAIC PROPERTIES
    FOR PARALLELISM
    ○  Associative
    a + (b + c) = (a + b) + c
    ○  *Commutative
    a + b = b + a
    ○  Identity
    a + 0 = a
    ○  Closed
    Int + Int = Int
    Integers form a Monoid
    under addition
    Monoid computations
    run efficiently on Spark
    rdd
      .map(toMonoid(_))
      .reduce(_ + _)
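
    A minimal sketch of the map-then-reduce pattern, written against plain Scala collections so it runs without a cluster. The `Monoid` trait, `intAddition` and `combineAll` are hypothetical names for illustration, not Algebird's or Spark's API; nested sequences stand in for RDD partitions.

```scala
object MonoidSketch {
  // Hypothetical minimal type class, not Algebird's actual Monoid.
  trait Monoid[A] {
    def zero: A               // identity element
    def plus(x: A, y: A): A   // closed, associative operation
  }

  // Integers under addition, as on the slide.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    val zero: Int = 0
    def plus(x: Int, y: Int): Int = x + y
  }

  // Fold each partition locally, then combine the partial results.
  // Empty partitions contribute `zero`, so they need no special case.
  def combineAll[A](partitions: Seq[Seq[A]])(m: Monoid[A]): A =
    partitions.map(_.foldLeft(m.zero)(m.plus)).foldLeft(m.zero)(m.plus)
}
```

    For example, `MonoidSketch.combineAll(Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5)))(MonoidSketch.intAddition)` folds three partitions, one of them empty, into a single sum.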


  21. MONOIDIFY!
    ○  Numbers, Lists, Sets, Strings
    ○  Operations
    □  addition
    □  min, max
    □  moments
    ○  Approximate data structures
    □  Bloom Filters
    □  HyperLogLog
    □  CountMinSketch
    ○  Approximate histograms
    ○  SGD
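
    As one illustration of the "moments" item, the sketch below summarizes a data set by count, sum and sum of squares, which merge by component-wise addition. The `Stats` type and its helpers are hypothetical and written for this example; this is not Algebird's implementation.

```scala
object MomentsSketch {
  // Hypothetical summary type: enough state to recover mean and variance.
  final case class Stats(count: Long, sum: Double, sumSq: Double) {
    def mean: Double = if (count == 0) 0.0 else sum / count
    // Population variance, E[x^2] - (E[x])^2.
    def variance: Double =
      if (count == 0) 0.0 else sumSq / count - mean * mean
  }

  // Identity element: the summary of no data.
  val zero: Stats = Stats(0L, 0.0, 0.0)

  // Lift a single observation into the monoid.
  def lift(x: Double): Stats = Stats(1L, x, x * x)

  // Associative and commutative merge of two summaries.
  def plus(a: Stats, b: Stats): Stats =
    Stats(a.count + b.count, a.sum + b.sum, a.sumSq + b.sumSq)

  def summarize(xs: Seq[Double]): Stats = xs.map(lift).foldLeft(zero)(plus)
}
```

    Summaries computed on separate partitions can be merged with `plus` in any grouping. Note that the sum-of-squares formula can lose precision on large or poorly scaled data, so treat it as a teaching example only.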


  22. MANY MORE
    ○  Semigroup
    □  Closed
    □  Associative
    ○  Monoid
    □  Closed
    □  Associative
    □  Identity
    ○  Group
    □  Closed
    □  Associative
    □  Identity
    □  Inverse
    Many are implemented in Twitter’s Algebird
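
    The hierarchy can be sketched as three traits, each adding one requirement to the previous one. The trait and value names below are hypothetical and only mirror the slide; Algebird defines its own versions. The compiler checks the "closed" property through the types, while associativity, identity and inverse laws remain the implementer's responsibility.

```scala
object AlgebraHierarchy {
  // Closed + associative.
  trait Semigroup[A] { def plus(x: A, y: A): A }

  // Adds an identity element.
  trait Monoid[A] extends Semigroup[A] { def zero: A }

  // Adds an inverse, which makes subtraction possible.
  trait Group[A] extends Monoid[A] {
    def negate(x: A): A
    def minus(x: A, y: A): A = plus(x, negate(y))
  }

  // Max is modelled here as a plain semigroup: no identity is supplied.
  val intMax: Semigroup[Int] = new Semigroup[Int] {
    def plus(x: Int, y: Int): Int = math.max(x, y)
  }

  // String concatenation has an identity (the empty string) but no inverse.
  val stringConcat: Monoid[String] = new Monoid[String] {
    val zero: String = ""
    def plus(x: String, y: String): String = x + y
  }

  // Integer addition has all four properties.
  val intAddition: Group[Int] = new Group[Int] {
    val zero: Int = 0
    def plus(x: Int, y: Int): Int = x + y
    def negate(x: Int): Int = -x
  }
}
```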


  23. SCALING TOPK
    WITH ALGEBRA


  24. NAÏVE TOPK
    ○  Comparisons
    ○ Space complexity: O(n^2)
    ○ Time complexity: O(n^2)
    ○ IO cost: Very high - will need to shuffle!
    ○ With N ~ 1M => OOM


  25. TOPK WITH ABSTRACT
    ALGEBRA
    class PriorityQueueMonoid[T](max: Int)
      (implicit order: Ordering[T])
      extends Monoid[PriorityQueue[T]]
     
    From Algebird, Priority Queue:
    ○  Can be empty
    ○  Two Priority Queues can be “added” in any order
    ○  Associative + Commutative
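
    To make the three properties concrete, here is a self-contained stand-in for a bounded top-K monoid. It is an assumption-laden sketch, not Algebird's `PriorityQueueMonoid`: the class name `TopKMonoid` is hypothetical, it keeps the k largest elements (an assumed convention), and it uses an immutable sorted `Vector` instead of a heap for brevity.

```scala
object TopKSketch {
  // Hypothetical bounded "priority queue" monoid over sorted Vectors.
  final class TopKMonoid[T](k: Int)(implicit ord: Ordering[T]) {
    // Can be empty: the identity element.
    val zero: Vector[T] = Vector.empty

    // "Adding" two queues merges them and keeps only the k largest.
    // The result never holds more than k elements.
    def plus(a: Vector[T], b: Vector[T]): Vector[T] =
      (a ++ b).sorted(ord.reverse).take(k)

    // Lift a single element into a queue.
    def build(x: T): Vector[T] = plus(zero, Vector(x))
  }
}
```

    With distinct elements, merging in any grouping or order gives the same k survivors, which is what lets partial results from different partitions be combined freely.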
     
     
     


  26. TOPK WITH ABSTRACT
    ALGEBRA
    type Item = (Id, Feature)
    val data: RDD[Item]
    val bcData: Broadcast[Array[Item]]

    // For every item, fold over the broadcast copy of all items and
    // accumulate the most similar ones in a bounded priority queue.
    val topK = data.mapPartitions { iter =>
      val pq = new PriorityQueueMonoid
      iter.map { item =>
        bcData.value.foldLeft(pq.zero)(
          (topK: PriorityQueue, i: Item) =>
            pq.plus(topK, calcSimilar(item, i)))
      }
    }
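
    The slide code depends on Spark and Algebird and elides several details. The sketch below reproduces the same shape with plain Scala collections so it can be run as is. Everything in it is an assumption made for illustration: `partitions` stands in for the RDD's partitions, `all` for the broadcast array, the dot-product `calcSimilar` and the `Scored` type are hypothetical, and skipping an item's comparison with itself is a choice the slides do not state.

```scala
object TopKSimilarity {
  type Item = (Int, Vector[Double]) // (id, feature vector); assumed shape
  final case class Scored(id: Int, score: Double)

  // Hypothetical similarity: a plain dot product.
  def calcSimilar(a: Item, b: Item): Double =
    a._2.zip(b._2).map { case (x, y) => x * y }.sum

  // Keep the k highest-scoring neighbours; plays the role of the monoid's plus.
  def plus(k: Int)(a: Vector[Scored], b: Vector[Scored]): Vector[Scored] =
    (a ++ b).sortBy(s => -s.score).take(k)

  // Each "partition" is processed independently against the full item list,
  // so no data moves between partitions.
  def topK(partitions: Seq[Seq[Item]], all: Vector[Item], k: Int): Map[Int, Vector[Scored]] =
    partitions.flatMap { part =>
      part.map { item =>
        val neighbours = all
          .filter(_._1 != item._1) // assumption: skip self-similarity
          .foldLeft(Vector.empty[Scored]) { (acc, other) =>
            plus(k)(acc, Vector(Scored(other._1, calcSimilar(item, other))))
          }
        item._1 -> neighbours
      }
    }.toMap
}
```

    Every pair of items is still compared, but each item holds at most k candidates at a time instead of a full row of the similarity matrix.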
     
     


  27. SPEED WITH MONOIDS
    TopK computed in a single map operation
    for filtering and a single reduce
    *without shuffling* to join results


  28. SPEED WITH MONOIDS
    TopK computed in a single map operation
    for filtering and a single reduce
    *without shuffling* to join results
    ○ Space complexity: O(n log(n))
    ○ Time complexity: O(n^2)
    ○ IO cost: Minimal - No Shuffling!
    ○ With N ~ 1M => It works :)


  29. THANKS!
    Any questions?
    You can find us at:
    @sdianahu
    [email protected]
    [email protected]
    Also, we are hiring!


  30. REFERENCES
    ○ Lin, Jimmy. "Monoidify! Monoids as a design principle for efficient
    MapReduce algorithms." arXiv preprint arXiv:1304.7544 (2013).
    ○ Chiusano, Paul, and Rúnar Bjarnason. Functional programming in
    Scala. Manning Publications Co., 2014.
    ○ B. Bloom. Space/time trade-offs in hash coding with allowable
    errors. CACM, 13(7):422–426, July 1970
    ○ en.wikipedia.org/wiki/Abstract_algebra
    ○ github.com/twitter/algebird
