Diana Hu
September 16, 2016

Scaling Verizon IPTV Recommendations with Scala and Spark

This talk covers the architecture we’ve built to scale recommendations for the new television service we are building at Verizon. Scala has helped our group scale up models as we learned to combine functional programming patterns with Big Data patterns in Spark. We’ll highlight some use cases for building large similarity matrices.

Transcript

  1. HELLO!
     Diana Hu (@sdianahu)
     ◦ Working in IPTV since 2013
     ◦ Formerly Intel Labs
     ◦ Large Scale Machine Learning & Computer Vision
     ◦ Scala & Spark since 2014
     Russell Horton (@ngr_am)
     ◦ Joined Verizon Labs in 2015
     ◦ Formerly Wordnik, EcoHealth
     ◦ NLP, machine learning, entity detection, recommendations
     ◦ Scala a while, Spark this year
  2. MODEL LIFECYCLE
     Development – Data Scientists (R, Python)
     ◦ Data exploration
     ◦ Feature engineering
     ◦ Model construction
     ◦ Evaluation
     Production – Data Engineers
     ◦ Re-implementation in Java, Scala, C
     ◦ Re-evaluation at scale
     ◦ Pipeline deployment
     ◦ Integration with services
  3. MODEL LIFECYCLE
     Costs
     ◦ Extra implementation
     ◦ Late scaling
     ◦ Synchronization pain
     ◦ Slow iteration
     Development – Data Scientists
     ◦ Data exploration
     ◦ Feature engineering
     ◦ Model construction
     ◦ Evaluation
     Production – Data Engineers
     ◦ Re-implementation in Java, Scala, C
     ◦ Re-evaluation at scale
     ◦ Pipeline deployment
     ◦ Integration with services
  4. DATA @ IPTV VERIZON
     ◦ Millions of subscribers
     ◦ Live TV viewings
     ◦ DVR recordings
     ◦ DVR playbacks
     ◦ Raw events
     ◦ Recs are time-sensitive (TV schedule)
     ◦ Hundreds of channels + VOD
     Time Sensitive
     Heavy head, long tail
  5. WHY SPARK?
     We Want:
     ◦ Scalability
     ◦ Fast Aggregations
     ◦ Counts and sorts
     ◦ Libraries
     ◦ Community
     ◦ Readability
     ◦ Expressiveness
     ◦ Fast
     ◦ Testability
     ◦ Fault Tolerant
     ◦ Unified Distributed System
  6. WHY SPARK?
     ◦ Readability
     ◦ Expressiveness
     ◦ Fast
     ◦ Testability
     ◦ Fault Tolerant
     ◦ Unified Distributed System
     We Want:
     ◦ Scalability
     ◦ Fast Aggregations
     ◦ Counts and sorts
     ◦ Libraries
     ◦ Community
  7. WHY SCALA FOR SPARK?
     ◦ Spark is Scala
     ◦ Scala gets new features first
     ◦ Static typing
     ◦ Performance (still)
     ◦ Functional orientation
  8. MODEL LIFECYCLE (Scala + Spark)
     Development – Data Scientists
     ◦ Data exploration
     ◦ Feature engineering
     ◦ Model construction
     ◦ Evaluation
     Production – Data Scientists and friends
     ◦ Re-implementation in Java, Scala, C
     ◦ Re-evaluation at scale
     ◦ Pipeline deployment
     ◦ Integration with services
  9. ABSTRACT ALGEBRA?
     “Abstract algebra is the set of advanced topics of algebra that deal with abstract algebraic structures rather than the usual number systems.”
     www.mathworld.wolfram.com/AbstractAlgebra.html
  10. SUMMING INTEGERS
      1 + 2 + … + 28 + 29 + 30
        = (1 + 2 + 3 + 4 + 5 + 6) + … + (25 + 26 + 27 + 28 + 29 + 30)
      *http://nbviewer.jupyter.org/github/spark-mooc/mooc-setup/blob/master/spark_tutorial_student.ipynb
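The grouping on slide 10 can be tried out directly. The following is a minimal sketch in plain Scala (no Spark), assuming chunks of six as on the slide; the object name `ChunkedSum` is made up for illustration. Each chunk is summed on its own, which is the part a cluster could run in parallel, and the partial sums are then added together.

```scala
object ChunkedSum {
  // The integers 1..30, split into chunks of six as on the slide.
  val numbers: Seq[Int] = 1 to 30
  val chunks: Seq[Seq[Int]] = numbers.grouped(6).toList

  // Sum every chunk independently, then combine the partial sums.
  val partialSums: Seq[Int] = chunks.map(_.sum)
  val total: Int = partialSums.sum
}
```

Because integer addition is associative, `ChunkedSum.total` equals `(1 to 30).sum` regardless of where the chunk boundaries fall.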
  11. ALGEBRAIC PROPERTIES
      ◦ Associative: a + (b + c) = (a + b) + c
      ◦ Commutative: a + b = b + a
      ◦ Identity: a + 0 = a
      ◦ Closed: Int + Int = Int
  12. ALGEBRAIC PROPERTIES FOR PARALLELISM
      ◦ Associative: a + (b + c) = (a + b) + c (enables parallelism)
      ◦ Commutative: a + b = b + a (ignore order)
      ◦ Identity: a + 0 = a (ignore empty data)
      ◦ Closed: Int + Int = Int (type safety)
  13. ALGEBRAIC PROPERTIES FOR PARALLELISM
      ◦ Associative: a + (b + c) = (a + b) + c
      ◦ *Commutative: a + b = b + a
      ◦ Identity: a + 0 = a
      ◦ Closed: Int + Int = Int
      Integers form a Monoid under addition.
      Monoid computations run efficiently on Spark:
          rdd
            .map(toMonoid(_))
            .reduce(_ + _)
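To make the map-then-reduce pattern on slide 13 concrete, here is a minimal sketch that runs without Spark. The `Monoid` trait below is a hypothetical stand-in written for this example, not Algebird's actual trait, and nested `Seq`s play the role of RDD partitions. Note that the monoid laws only require closure, associativity and an identity; commutativity, which is starred on the slide, is an extra property that integer addition happens to have.

```scala
// Hypothetical minimal Monoid type class (an assumption for this sketch, not Algebird's API).
trait Monoid[T] {
  def zero: T               // identity element: plus(zero, a) == a
  def plus(a: T, b: T): T   // closed, associative combine
}

object MonoidReduce {
  // Integers under addition, as stated on the slide.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def plus(a: Int, b: Int): Int = a + b
  }

  // Fold each "partition" locally starting from zero,
  // then combine the per-partition results the same way.
  def reducePartitions[T](partitions: Seq[Seq[T]])(m: Monoid[T]): T =
    partitions.map(_.foldLeft(m.zero)(m.plus)).foldLeft(m.zero)(m.plus)
}
```

For example, `MonoidReduce.reducePartitions(Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5)))(MonoidReduce.intAddition)` gives the same total however the elements are split, and the empty partition contributes only the identity.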
  14. MONOIDIFY!
      ◦ Numbers, Lists, Sets, Strings
      ◦ Operations
         □ addition
         □ min, max
         □ moments
      ◦ Approximate data structures
         □ Bloom Filters
         □ HyperLogLog
         □ CountMinSketch
      ◦ Approximate histograms
      ◦ SGD
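As an illustration of the "moments" entry on slide 14, the sketch below keeps a running count, sum and sum of squares, which combine by field-wise addition. The `Moments` class here is written for this example and is not Algebird's implementation; the variance formula used is the simple textbook one and can lose precision on large values.

```scala
// Hypothetical sketch: (count, sum, sum of squares) form a monoid under field-wise addition.
final case class Moments(count: Long, sum: Double, sumSq: Double) {
  // Associative, commutative combine of two partial summaries.
  def +(that: Moments): Moments =
    Moments(count + that.count, sum + that.sum, sumSq + that.sumSq)

  def mean: Double = sum / count

  // Population variance via E[x^2] - (E[x])^2.
  def variance: Double = sumSq / count - mean * mean
}

object Moments {
  // Identity element: the summary of no data.
  val zero: Moments = Moments(0L, 0.0, 0.0)

  // Lift a single observation into the monoid.
  def of(x: Double): Moments = Moments(1L, x, x * x)
}
```

Summaries built on separate partitions can be merged with `+`, so the mean and variance of the whole data set are available without revisiting the raw values.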
  15. MANY MORE
      ◦ Semigroup
         □ Closed
         □ Associative
      ◦ Monoid
         □ Closed
         □ Associative
         □ Identity
      ◦ Group
         □ Closed
         □ Associative
         □ Identity
         □ Inverse
      Many are implemented in Twitter’s Algebird
  16. NAÏVE TOPK
      ◦ Comparisons
         ◦ Space complexity: O(n^2)
         ◦ Time complexity: O(n^2)
         ◦ IO cost: Very high - will need to shuffle!
         ◦ With N ~ 1M => OOM
  17. TOPK WITH ABSTRACT ALGEBRA
          class PriorityQueueMonoid[T](max: Int)
              (implicit order: Ordering[T])
            extends Monoid[PriorityQueue[T]]
      From Algebird, Priority Queue:
      ◦ Can be empty
      ◦ Two Priority Queues can be “added” in any order
      ◦ Associative + Commutative
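  18. TOPK WITH ABSTRACT ALGEBRA
          type Item = (Id, Feature)
          val data: RDD[Item]
          val bcData: Broadcast[Array[Item]]

          val topK = data.mapPartitions { iter =>
            // Type parameter and max size are elided on the slide
            val pq = new PriorityQueueMonoid
            iter.map { item =>
              // Compare this item against every broadcast item,
              // merging each result into the running top-K queue
              bcData.value.foldLeft(pq.zero)(
                (topK: PriorityQueue, i: Item) =>
                  pq.plus(topK, calcSimilar(item, i)))
            }
          }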
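The slide code depends on Spark, Algebird and a `calcSimilar` function that is not shown. The sketch below reproduces the same idea in plain Scala so it can be run on its own, under several assumptions: `TopKMonoid` is a hypothetical stand-in for Algebird's `PriorityQueueMonoid` that uses sorted `Vector`s instead of a priority queue, items are `(id, feature vector)` pairs, and similarity is an integer dot product chosen purely for illustration.

```scala
// Hypothetical bounded top-K monoid (stand-in for Algebird's PriorityQueueMonoid).
final class TopKMonoid[T](k: Int)(implicit ord: Ordering[T]) {
  require(k > 0, "k must be positive")

  // Identity: the empty result.
  val zero: Vector[T] = Vector.empty

  // Lift a single element into the monoid.
  def build(x: T): Vector[T] = Vector(x)

  // Merge two top-K results and keep only the k largest elements.
  def plus(a: Vector[T], b: Vector[T]): Vector[T] =
    (a ++ b).sorted(ord.reverse).take(k)
}

object TopKExample {
  // Assumed item shape: (id, feature vector).
  type Item = (Int, Vector[Int])

  // Assumed similarity measure: dot product of the feature vectors.
  def similarity(a: Item, b: Item): Int =
    a._2.zip(b._2).map { case (x, y) => x * y }.sum

  // For every item, fold over all other items and keep the k best (score, id) pairs.
  // At most k candidates are held per item at any time, never the full n x n matrix.
  def topK(items: Seq[Item], k: Int): Map[Int, Vector[(Int, Int)]] = {
    val m = new TopKMonoid[(Int, Int)](k)
    items.map { item =>
      val best = items.filter(_._1 != item._1).foldLeft(m.zero) { (acc, other) =>
        m.plus(acc, m.build((similarity(item, other), other._1)))
      }
      item._1 -> best
    }.toMap
  }
}
```

In the Spark version on the slide, the outer `items.map` corresponds to `mapPartitions` over the RDD and the inner collection to the broadcast array, so each partition can compute its results locally.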
  19. SPEED WITH MONOIDS
      TopK computed in a single map operation for filtering and a single reduce *without shuffling* to join results
  20. SPEED WITH MONOIDS
      TopK computed in a single map operation for filtering and a single reduce *without shuffling* to join results
      ◦ Space complexity: O(n log(n))
      ◦ Time complexity: O(n^2)
      ◦ IO cost: Minimal - No Shuffling!
      ◦ With N ~ 1M => It works :)
  21. REFERENCES
      ◦ Lin, Jimmy. "Monoidify! Monoids as a design principle for efficient MapReduce algorithms." arXiv preprint arXiv:1304.7544 (2013).
      ◦ Chiusano, Paul, and Rúnar Bjarnason. Functional Programming in Scala. Manning Publications Co., 2014.
      ◦ Bloom, B. "Space/time trade-offs in hash coding with allowable errors." CACM, 13(7):422–426, July 1970.
      ◦ en.wikipedia.org/wiki/Abstract_algebra
      ◦ github.com/twitter/algebird