Scaling Verizon IPTV Recommendations with Scala and Spark

Diana Hu
September 16, 2016

This talk covers the architecture we built to scale recommendations for the new television service we are building at Verizon. Scala has helped our group scale up models as we learned to combine functional programming patterns with Big Data patterns in Spark. We’ll highlight some use cases for building large similarity matrices.


Transcript

  1. SCALING VERIZON IPTV RECOMMENDATIONS
    Diana Hu & Russell Horton, Verizon Labs
    RecSys LSRS, September 2016
  2. HELLO!
    Diana Hu (@sdianahu)
    ◦  Working in IPTV since 2013
    ◦  Formerly Intel Labs
    ◦  Large Scale Machine Learning & Computer Vision
    ◦  Scala & Spark since 2014
    Russell Horton (@ngr_am)
    ◦  Joined Verizon Labs in 2015
    ◦  Formerly Wordnik, EcoHealth
    ◦  NLP, machine learning, entity detection, recommendations
    ◦  Scala a while, Spark this year
  3. MODEL LIFECYCLE
    Development – Data Scientists (R, Python)
    ◦  Data exploration
    ◦  Feature engineering
    ◦  Model construction
    ◦  Evaluation
    Production – Data Engineers
    ◦  Re-implementation in Java, Scala, C
    ◦  Re-evaluation at scale
    ◦  Pipeline deployment
    ◦  Integration with services
  4. MODEL LIFECYCLE
    Development – Data Scientists
    ◦  Data exploration
    ◦  Feature engineering
    ◦  Model construction
    ◦  Evaluation
    Production – Data Engineers
    ◦  Re-implementation in Java, Scala, C
    ◦  Re-evaluation at scale
    ◦  Pipeline deployment
    ◦  Integration with services
    Costs
    ◦  Extra implementation
    ◦  Late scaling
    ◦  Synchronization pain
    ◦  Slow iteration
  5. IPTV SPEED & SCALE

  6. DATA @ IPTV VERIZON
    ◦  Millions of subscribers
    ◦  Live TV viewings
    ◦  DVR recordings
    ◦  DVR playbacks
    ◦  Raw events
    ◦  Recs are time-sensitive (TV schedule)
    ◦  Hundreds of channels + VOD
    Time Sensitive
    Heavy head, long tail
  7. OUR STACK
    ◦  Data Exploration
    ◦  Data Visualization
    ◦  Production Models and Data Pipelines
  8. WHY SPARK?
    We Want:
    ◦  Scalability
    ◦  Fast Aggregations
    ◦  Counts and sorts
    ◦  Libraries
    ◦  Community
    ◦  Readability
    ◦  Expressiveness
    ◦  Fast
    ◦  Testability
    ◦  Fault Tolerant
    ◦  Unified Distributed System
  9. WHY SPARK?
    ◦  Readability
    ◦  Expressiveness
    ◦  Fast
    ◦  Testability
    ◦  Fault Tolerant
    ◦  Unified Distributed System
    We Want:
    ◦  Scalability
    ◦  Fast Aggregations
    ◦  Counts and sorts
    ◦  Libraries
    ◦  Community
  10. WHY SCALA? https://nicholassterling.wordpress.com/2012/11/16/scala-performance/

  11. WHY SCALA FOR SPARK?
    ◦  Spark is Scala
    ◦  Scala gets new features first
    ◦  Static typing
    ◦  Performance (still)
    ◦  Functional orientation
  12. MODEL LIFECYCLE
    Development – Data Scientists
    ◦  Data exploration
    ◦  Feature engineering
    ◦  Model construction
    ◦  Evaluation
    Production – Data Scientists and friends
    ◦  Re-implementation in Java, Scala, C
    ◦  Re-evaluation at scale
    ◦  Pipeline deployment
    ◦  Integration with services
    Scala + Spark
  13. DISTRIBUTED SYSTEMS => PARALLELISM

  14. DISTRIBUTED SYSTEMS => PARALLELISM => ENABLED BY ASSOCIATIVITY

  15. TRICKS WITH ABSTRACT ALGEBRA

  16. ABSTRACT ALGEBRA?
    “Abstract algebra is the set of advanced topics of algebra that deal with abstract algebraic structures rather than the usual number systems.”
    www.mathworld.wolfram.com/AbstractAlgebra.html
  17. SUMMING INTEGERS
    1 + 2 + … + 28 + 29 + 30 = (1 + 2 + 3 + 4 + 5 + 6) + … + (25 + 26 + 27 + 28 + 29 + 30)
    *http://nbviewer.jupyter.org/github/spark-mooc/mooc-setup/blob/master/spark_tutorial_student.ipynb
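The regrouping on the slide above can be sketched in plain Scala, with no Spark dependency. The chunk size of 6 matches the slide; the object name `ChunkedSum` and the idea of treating each chunk as a stand-in for a partition on a separate worker are assumptions made for illustration only.

```scala
object ChunkedSum {
  // The integers 1..30 from the slide.
  val numbers: Vector[Int] = (1 to 30).toVector

  // Split into chunks of 6, standing in for partitions on different workers.
  val partitions: Vector[Vector[Int]] = numbers.grouped(6).toVector

  // Each "worker" sums its own chunk independently of the others.
  val partialSums: Vector[Int] = partitions.map(_.sum)

  // The partial results are then combined into the final answer.
  val total: Int = partialSums.sum
}
```

Because integer addition is associative, `ChunkedSum.total` equals the sum computed in a single left-to-right pass, whatever the chunk boundaries are.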
  18. ALGEBRAIC PROPERTIES
    ◦  Associative: a + (b + c) = (a + b) + c
    ◦  Commutative: a + b = b + a
    ◦  Identity: a + 0 = a
    ◦  Closed: Int + Int = Int
  19. ALGEBRAIC PROPERTIES FOR PARALLELISM
    ◦  Associative: a + (b + c) = (a + b) + c (enables parallelism)
    ◦  Commutative: a + b = b + a (ignore order)
    ◦  Identity: a + 0 = a (ignore empty data)
    ◦  Closed: Int + Int = Int (type safety)
  20. ALGEBRAIC PROPERTIES FOR PARALLELISM
    ◦  Associative: a + (b + c) = (a + b) + c
    ◦  *Commutative: a + b = b + a
    ◦  Identity: a + 0 = a
    ◦  Closed: Int + Int = Int
    Integers form a Monoid under addition
    Monoid computations run efficiently on Spark
        rdd
          .map(toMonoid(_))
          .reduce(_ + _)
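As a rough illustration of the map-then-reduce pattern on the slide above, here is a minimal sketch in plain Scala. The `Monoid` trait below is a hypothetical, stripped-down type class written for this example; it is not Algebird's actual `Monoid`, and nested `Seq`s stand in for the partitions of an RDD.

```scala
object MonoidSketch {
  // Hypothetical minimal type class: an identity element plus an associative combine.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  // Integers under addition, as stated on the slide.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    val zero: Int = 0
    def plus(x: Int, y: Int): Int = x + y
  }

  // Fold every partition locally, then combine the per-partition results.
  // An empty partition folds to `zero`, so it does not affect the answer.
  def sumPartitions[A](partitions: Seq[Seq[A]])(m: Monoid[A]): A =
    partitions.map(p => p.foldLeft(m.zero)(m.plus)).foldLeft(m.zero)(m.plus)
}
```

For example, `MonoidSketch.sumPartitions(Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5)))(MonoidSketch.intAddition)` evaluates to 15, the same as summing the five numbers in one pass.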
  21. MONOIDIFY!
    ◦  Numbers, Lists, Sets, Strings
    ◦  Operations
        □  addition
        □  min, max
        □  moments
    ◦  Approximate data structures
        □  Bloom Filters
        □  HyperLogLog
        □  CountMinSketch
    ◦  Approximate histograms
    ◦  SGD
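To make the "min, max, moments" entries above concrete, here is one possible sketch of a summary-statistics monoid in plain Scala. The `Stats` case class and the function names are hypothetical and chosen for this example; Algebird ships its own aggregators, which are not reproduced here.

```scala
object StatsMonoid {
  // Hypothetical summary: count, sum, min and max of the values seen so far.
  final case class Stats(count: Long, sum: Double, min: Double, max: Double) {
    def mean: Double = if (count == 0) Double.NaN else sum / count
  }

  // Identity element: the summary of no data at all.
  val zero: Stats = Stats(0L, 0.0, Double.PositiveInfinity, Double.NegativeInfinity)

  // Lift a single observation into a summary.
  def single(x: Double): Stats = Stats(1L, x, x, x)

  // Combine two summaries field by field.
  def plus(a: Stats, b: Stats): Stats =
    Stats(a.count + b.count, a.sum + b.sum, math.min(a.min, b.min), math.max(a.max, b.max))

  // Summarize a whole collection by mapping then folding.
  def summarize(xs: Seq[Double]): Stats = xs.map(single).foldLeft(zero)(plus)
}
```

Summaries of separate chunks can be merged with `plus` to obtain the summary of the whole data set. One caveat: floating-point addition is only approximately associative, so sums merged in different orders may differ in the last bits.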
  22. MANY MORE
    ◦  Semigroup
        □  Closed
        □  Associative
    ◦  Monoid
        □  Closed
        □  Associative
        □  Identity
    ◦  Group
        □  Closed
        □  Associative
        □  Identity
        □  Inverse
    Many are implemented in Twitter’s Algebird
  23. SCALING TOPK WITH ALGEBRA

  24. NAÏVE TOPK
    ◦  Comparisons
    ◦  Space complexity: O(n^2)
    ◦  Time complexity: O(n^2)
    ◦  IO cost: Very high - will need to shuffle!
    ◦  With N ~ 1M => OOM
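The slides do not show the naive implementation, so the following is only a guess at its general shape, written in plain Scala: score every ordered pair, hold all scores at once, group them by item, then sort each group. The `Item` layout, the dot-product `similarity` and the object name are assumptions made for this sketch.

```scala
object NaiveTopK {
  // Hypothetical item layout: (id, feature vector).
  type Item = (Int, Vector[Double])

  // Hypothetical similarity measure: a plain dot product.
  def similarity(a: Item, b: Item): Double =
    a._2.zip(b._2).map { case (x, y) => x * y }.sum

  def topK(items: Seq[Item], k: Int): Map[Int, Seq[(Int, Double)]] = {
    // Materialize every ordered pair: n * (n - 1) scores held in memory at once.
    val allPairs: Seq[(Int, (Int, Double))] =
      for { a <- items; b <- items if a._1 != b._1 }
        yield (a._1, (b._1, similarity(a, b)))
    // Group by the left-hand item (the step that becomes a shuffle on a cluster),
    // then sort each group by descending score and keep the first k.
    allPairs.groupBy(_._1).map { case (id, scored) =>
      id -> scored.map(_._2).sortWith(_._2 > _._2).take(k)
    }
  }
}
```

The intermediate `allPairs` collection is what grows quadratically with the number of items, which is consistent with the O(n^2) space cost listed on the slide.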
  25. TOPK WITH ABSTRACT ALGEBRA
        class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T])
          extends Monoid[PriorityQueue[T]]
    From Algebird, Priority Queue:
    ◦  Can be empty
    ◦  Two Priority Queues can be “added” in any order
    ◦  Associative + Commutative
  26. TOPK WITH ABSTRACT ALGEBRA
        type Item = (id, feature)
        val data: RDD[Item]
        val bcData: Broadcast[Array[Item]]

        val topK = data.mapPartitions { iter =>
          val pq = new PriorityQueueMonoid
          iter.map { item =>
            bcData.value.foldLeft(pq.zero)(
              (topK: PriorityQueue, i: Item) =>
                pq.plus(topK, calcSimilar(item, i)))
          }
        }
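The slide code above depends on Spark and Algebird and leaves several types abstract. As a self-contained approximation of the same idea, the sketch below uses a bounded, sorted `Vector` in place of Algebird's `PriorityQueueMonoid`, and a plain `Seq` in place of the broadcast array. All names (`TopK`, `Scored`, `neighbours`, `dot`) are hypothetical, and the dot product is only a placeholder for whatever `calcSimilar` computes.

```scala
object TopKMonoid {
  type Scored = (Int, Double) // (item id, similarity score)

  // Monoid over "the k best scored items seen so far".
  final class TopK(k: Int) {
    // Identity element: no candidates yet.
    val zero: Vector[Scored] = Vector.empty

    // Lift a single scored item into the monoid.
    def build(s: Scored): Vector[Scored] = plus(zero, Vector(s))

    // Merge two bounded lists and keep only the k highest scores.
    // Ties are broken by id so the result does not depend on argument order.
    def plus(a: Vector[Scored], b: Vector[Scored]): Vector[Scored] =
      (a ++ b)
        .sortWith((x, y) => x._2 > y._2 || (x._2 == y._2 && x._1 < y._1))
        .take(k)
  }

  // Placeholder similarity: a plain dot product.
  def dot(a: Vector[Double], b: Vector[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // For one item, fold over the catalogue while holding at most k candidates.
  def neighbours(item: (Int, Vector[Double]),
                 catalogue: Seq[(Int, Vector[Double])],
                 k: Int): Vector[Scored] = {
    val m = new TopK(k)
    catalogue.filter(_._1 != item._1).foldLeft(m.zero) { (acc, other) =>
      m.plus(acc, m.build((other._1, dot(item._2, other._2))))
    }
  }
}
```

Each fold keeps at most k candidates per item, so no quadratic intermediate collection is built. Since `plus` is associative and commutative, partial results computed on different partitions could be merged in any order.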
  27. SPEED WITH MONOIDS
    TopK computed in a single map operation for filtering and a single reduce *without shuffling* to join results
  28. SPEED WITH MONOIDS
    TopK computed in a single map operation for filtering and a single reduce *without shuffling* to join results
    ◦  Space complexity: O(n log(n))
    ◦  Time complexity: O(n^2)
    ◦  IO cost: Minimal - No Shuffling!
    ◦  With N ~ 1M => It works :)
  29. THANKS!
    Any questions?
    You can find us at: @sdianahu, diana.hu@verizon.com, russell.horton@verizon.com
    Also, we are hiring!
  30. REFERENCES
    ◦  Lin, Jimmy. "Monoidify! Monoids as a design principle for efficient MapReduce algorithms." arXiv preprint arXiv:1304.7544 (2013).
    ◦  Chiusano, Paul, and Rúnar Bjarnason. Functional Programming in Scala. Manning Publications Co., 2014.
    ◦  Bloom, B. "Space/time trade-offs in hash coding with allowable errors." CACM, 13(7):422–426, July 1970.
    ◦  en.wikipedia.org/wiki/Abstract_algebra
    ◦  github.com/twitter/algebird