Scaling Verizon IPTV Recommendations with Scala and Spark

Slide 1

Slide 1 text

SCALING VERIZON IPTV RECOMMENDATIONS Diana Hu & Russell Horton Verizon Labs RecSys LSRS, September 2016

Slide 2

Slide 2 text

HELLO! Diana Hu (@sdianahu) ○  Working in IPTV since 2013 ○  Formerly Intel Labs ○  Large Scale Machine Learning & Computer Vision ○  Scala & Spark since 2014 Russell Horton (@ngr_am) ○  Joined Verizon Labs in 2015 ○  Formerly Wordnik, EcoHealth ○  NLP, machine learning, entity detection, recommendations ○  Scala a while, Spark this year

Slide 3

Slide 3 text

MODEL LIFECYCLE Development – Data Scientists (R, Python) ○  Data exploration ○  Feature engineering ○  Model construction ○  Evaluation Production – Data Engineers ○  Re-implementation in Java, Scala, C ○  Re-evaluation at scale ○  Pipeline deployment ○  Integration with services

Slide 4

Slide 4 text

MODEL LIFECYCLE Costs ○  Extra implementation ○  Late scaling ○  Synchronization pain ○  Slow iteration Development – Data Scientists ○  Data exploration ○  Feature engineering ○  Model construction ○  Evaluation Production – Data Engineers ○  Re-implementation in Java, Scala, C ○  Re-evaluation at scale ○  Pipeline deployment ○  Integration with services

Slide 5

Slide 5 text

IPTV SPEED & SCALE

Slide 6

Slide 6 text

DATA @ IPTV VERIZON ○  Millions of subscribers ○  Live TV viewings ○  DVR recordings ○  DVR playbacks ○  Raw events ○  Recs are time-sensitive (TV schedule) ○  Hundreds of channels + VOD Time Sensitive Heavy head, long tail

Slide 7

Slide 7 text

OUR STACK Data Exploration Data Visualization Production Models and Data Pipelines

Slide 8

Slide 8 text

We Want: ○  Scalability ○  Fast Aggregations ○  Counts and sorts ○  Libraries ○  Community ○  Readability ○  Expressiveness ○  Fast ○  Testability ○  Fault Tolerant ○  Unified Distributed System WHY SPARK?

Slide 9

Slide 9 text

WHY SPARK? ○  Readability ○  Expressiveness ○  Fast ○  Testability ○  Fault Tolerant ○  Unified Distributed System We Want: ○  Scalability ○  Fast Aggregations ○  Counts and sorts ○  Libraries ○  Community

Slide 10

Slide 10 text

WHY SCALA? https://nicholassterling.wordpress.com/2012/11/16/scala-performance/

Slide 11

Slide 11 text

WHY SCALA FOR SPARK? ○  Spark is Scala ○  Scala gets new features first ○  Static typing ○  Performance (still) ○  Functional orientation

Slide 12

Slide 12 text

MODEL LIFECYCLE Development – Data Scientists ○  Data exploration ○  Feature engineering ○  Model construction ○  Evaluation Production – Data Scientists and friends ○  Re-implementation in Java, Scala, C ○  Re-evaluation at scale ○  Pipeline deployment ○  Integration with services } Scala + Spark

Slide 13

Slide 13 text

DISTRIBUTED SYSTEMS => PARALLELISM

Slide 14

Slide 14 text

DISTRIBUTED SYSTEMS => PARALLELISM => ENABLED BY ASSOCIATIVITY

Slide 15

Slide 15 text

TRICKS WITH ABSTRACT ALGEBRA

Slide 16

Slide 16 text

ABSTRACT ALGEBRA? “Abstract algebra is the set of advanced topics of algebra that deal with abstract algebraic structures rather than the usual number systems.” www.mathworld.wolfram.com/AbstractAlgebra.html

Slide 17

Slide 17 text

SUMMING INTEGERS 1 + 2 + … + 28 + 29 + 30 = (1 + 2 + 3 + 4 + 5 + 6) + … + (25 + 26 + 27 + 28 + 29 +30) *http://nbviewer.jupyter.org/github/spark-mooc/mooc-setup/blob/master/spark_tutorial_student.ipynb
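The regrouping on this slide can be checked directly. The sketch below is a minimal illustration using plain Scala collections, not Spark; the object name `ChunkedSum` and the chunk size of 6 are assumptions chosen to match the grouping shown above. Each chunk stands in for a partition that could be summed on a different machine.

```scala
object ChunkedSum {
  // The integers 1..30 from the slide.
  val numbers: Vector[Int] = (1 to 30).toVector

  // One sequential pass over all the numbers.
  val sequentialTotal: Int = numbers.sum

  // Split into chunks of 6 (standing in for partitions), sum each chunk
  // independently, then sum the partial results.
  val partialSums: Vector[Int] = numbers.grouped(6).map(_.sum).toVector
  val chunkedTotal: Int = partialSums.sum
}
```

Because integer addition is associative, `chunkedTotal` equals `sequentialTotal` no matter how the chunks are drawn.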

Slide 18

Slide 18 text

○  Associative a + (b + c) = (a + b) + c ○  Commutative a + b = b + a ○  Identity a + 0 = a ○  Closed Int + Int = Int ALGEBRAIC PROPERTIES

Slide 19

Slide 19 text

○  Associative a + (b + c) = (a + b) + c ○  Commutative a + b = b + a ○  Identity a + 0 = a ○  Closed Int + Int = Int ALGEBRAIC PROPERTIES FOR PARALLELISM ○  Enables parallelism a + (b + c) = (a + b) + c ○  Ignore order a + b = b + a ○  Ignore empty data a + 0 = a ○  Type safety

Slide 20

Slide 20 text

ALGEBRAIC PROPERTIES FOR PARALLELISM ○  Associative a + (b + c) = (a + b) + c ○  *Commutative a + b = b + a ○  Identity a + 0 = a ○  Closed Int + Int = Int Integers form a Monoid under addition. Monoid computations run efficiently on Spark: rdd.map(toMonoid(_)).reduce(_ + _)
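A minimal sketch of the map-then-reduce pattern, assuming plain Scala collections in place of an RDD. The `Monoid` trait, `intAddition`, and `reducePartitions` are hypothetical names written for this illustration; they are not Spark or Algebird APIs. Each inner `Seq` plays the role of one partition.

```scala
object MonoidSketch {
  // A minimal monoid: a closed, associative `plus` with an identity `zero`.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  // Integers under addition, as on the slide.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    val zero: Int = 0
    def plus(x: Int, y: Int): Int = x + y
  }

  // Reduce each partition on its own, then combine the partial results.
  // An empty partition folds to `zero`, so it needs no special handling.
  def reducePartitions[A](partitions: Seq[Seq[A]], m: Monoid[A]): A =
    partitions
      .map(partition => partition.foldLeft(m.zero)(m.plus))
      .foldLeft(m.zero)(m.plus)
}
```

For example, `MonoidSketch.reducePartitions(Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5)), MonoidSketch.intAddition)` gives the same result as summing the five numbers in one pass.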

Slide 21

Slide 21 text

MONOIDIFY! ○  Numbers, Lists, Sets, Strings ○  Operations □  addition □  min, max □  moments ○  Approximate data structures □  Bloom Filters □  HyperLogLog □  CountMinSketch ○  Approximate histograms ○  SGD
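As one way to picture "monoidifying" several operations at once, the sketch below bundles count, sum, min, and max into a single summary that can be combined across partitions. It is an illustrative stand-in written in plain Scala, not Algebird's implementation; `Stats`, `single`, and `summarize` are hypothetical names, and the inputs are assumed to be whole numbers (for example, viewing minutes) so that addition stays exactly associative.

```scala
object StatsMonoid {
  // Running summary of a numeric column: count, sum, min, and max.
  final case class Stats(count: Long, sum: Long, min: Long, max: Long) {
    def mean: Double = if (count == 0) Double.NaN else sum.toDouble / count
  }

  // Identity element: combining it with any Stats leaves that Stats unchanged.
  val zero: Stats = Stats(0L, 0L, Long.MaxValue, Long.MinValue)

  // Lift a single value into a summary (the "toMonoid" step).
  def single(x: Long): Stats = Stats(1L, x, x, x)

  // Associative and commutative combination of two summaries.
  def plus(a: Stats, b: Stats): Stats =
    Stats(a.count + b.count, a.sum + b.sum, math.min(a.min, b.min), math.max(a.max, b.max))

  // Summarize a whole sequence by lifting each value and folding.
  def summarize(xs: Seq[Long]): Stats = xs.map(single).foldLeft(zero)(plus)
}
```

Summaries built on separate partitions can be merged with `plus` in any grouping or order, and the mean is derived only at the end from the merged count and sum.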

Slide 22

Slide 22 text

○  Semigroup □  Closed □  Associative ○  Monoid □  Closed □  Associative □  Identity ○  Group □  Closed □  Associative □  Identity □  Inverse Many are implemented in Twitter’s Algebird MANY MORE
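The hierarchy above can be sketched as three traits, each adding one requirement to the previous one. This is a simplified standalone illustration; the method names follow the slide's vocabulary and are assumptions here, not a copy of Algebird's definitions. The three example instances are likewise chosen only for illustration.

```scala
object AlgebraHierarchy {
  // Semigroup: a closed, associative operation.
  trait Semigroup[A] { def plus(x: A, y: A): A }

  // Monoid: a semigroup with an identity element.
  trait Monoid[A] extends Semigroup[A] { def zero: A }

  // Group: a monoid where every element has an inverse.
  trait Group[A] extends Monoid[A] {
    def negate(x: A): A
    def minus(x: A, y: A): A = plus(x, negate(y))
  }

  // Max over Ints, used here only as a semigroup.
  val intMax: Semigroup[Int] = new Semigroup[Int] {
    def plus(x: Int, y: Int): Int = math.max(x, y)
  }

  // String concatenation: has an identity ("") but no inverses.
  val stringConcat: Monoid[String] = new Monoid[String] {
    val zero: String = ""
    def plus(x: String, y: String): String = x + y
  }

  // Integer addition: identity 0, inverse -x.
  val intAddition: Group[Int] = new Group[Int] {
    val zero: Int = 0
    def plus(x: Int, y: Int): Int = x + y
    def negate(x: Int): Int = -x
  }

  // A semigroup can only combine non-empty data; a monoid also handles empty input.
  def combineNonEmpty[A](head: A, tail: Seq[A])(s: Semigroup[A]): A = tail.foldLeft(head)(s.plus)
  def combineAll[A](xs: Seq[A])(m: Monoid[A]): A = xs.foldLeft(m.zero)(m.plus)
}
```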

Slide 23

Slide 23 text

SCALING TOPK WITH ALGEBRA

Slide 24

Slide 24 text

NAÏVE TOPK ○  Comparisons ○  Space complexity: O(n^2) ○  Time complexity: O(n^2) ○  IO cost: Very high - will need to shuffle! ○  With N ~ 1M => OOM
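To make the quadratic cost concrete, here is a minimal sketch of one naive approach on plain Scala collections. It is a guess at the general shape of such a baseline, not the authors' code: the `Item` type, the dot-product `similarity`, and the `topK` signature are all assumptions for illustration.

```scala
object NaiveTopK {
  // Hypothetical item: an id plus a feature vector.
  type Item = (Int, Vector[Double])

  // Hypothetical similarity measure: a plain dot product.
  def similarity(a: Item, b: Item): Double =
    a._2.zip(b._2).map { case (x, y) => x * y }.sum

  def topK(items: Seq[Item], k: Int): Map[Int, Seq[(Int, Double)]] = {
    // Materialize every ordered pair with its score: n * (n - 1) entries.
    val allPairs = for {
      a <- items
      b <- items
      if a._1 != b._1
    } yield (a._1, b._1, similarity(a, b))

    // Group by the left-hand id and sort each group by descending score.
    // On a cluster, this grouping step is where the shuffle would happen.
    allPairs.groupBy(_._1).map { case (id, scored) =>
      id -> scored.map { case (_, other, s) => (other, s) }.sortWith(_._2 > _._2).take(k)
    }
  }
}
```

Every pair is held in memory before any of them is discarded, which is the source of the O(n^2) space cost listed above.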

Slide 25

Slide 25 text

TOPK WITH ABSTRACT ALGEBRA class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T]) extends Monoid[PriorityQueue[T]] From Algebird, Priority Queue: ○  Can be empty ○  Two Priority Queues can be “added” in any order ○  Associative + Commutative

Slide 26

Slide 26 text

TOPK WITH ABSTRACT ALGEBRA
type Item = (id, feature)
val data: RDD[Item]
val bcData: Broadcast[Array[Item]]
val topK = data.mapPartitions { iter =>
  val pq = new PriorityQueueMonoid
  iter.map { item =>
    bcData.value.foldLeft(pq.zero)(
      (topK: PriorityQueue, i: Item) => pq.plus(topK, calcSimilar(item, i)))
  }
}
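The idea behind the slide's code can be tried without Spark or Algebird. The sketch below is a simplified stand-in for a bounded priority-queue monoid, assuming a sorted `Vector` capped at `k` entries instead of a real priority queue; `Scored`, `TopK`, `single`, and `topKOf` are hypothetical names, and each inner `Seq` stands in for one partition's candidates.

```scala
object TopKMonoid {
  // Hypothetical scored candidate: (item id, similarity score).
  type Scored = (Int, Double)

  // Higher score wins; ties are broken by id so the order is total.
  private def better(x: Scored, y: Scored): Boolean =
    x._2 > y._2 || (x._2 == y._2 && x._1 < y._1)

  // A bounded "queue": a Vector sorted best-first and capped at k entries.
  final class TopK(k: Int) {
    val zero: Vector[Scored] = Vector.empty

    // Merging two queues keeps only the k best of their union, so the
    // result never grows beyond k and merges can happen in any grouping.
    def plus(a: Vector[Scored], b: Vector[Scored]): Vector[Scored] =
      (a ++ b).sortWith(better).take(k)

    def single(s: Scored): Vector[Scored] = plus(zero, Vector(s))
  }

  // Fold each partition into its own bounded queue, then merge the queues.
  def topKOf(partitions: Seq[Seq[Scored]], k: Int): Vector[Scored] = {
    val m = new TopK(k)
    partitions
      .map(p => p.foldLeft(m.zero)((queue, s) => m.plus(queue, m.single(s))))
      .foldLeft(m.zero)(m.plus)
  }
}
```

Since every intermediate queue holds at most `k` entries, only small summaries need to be combined at the end rather than the full set of scored pairs.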

Slide 27

Slide 27 text

SPEED WITH MONOIDS TopK computed in a single map operation for filtering and a single reduce *without shuffling* to join results

Slide 28

Slide 28 text

SPEED WITH MONOIDS TopK computed in a single map operation for filtering and a single reduce *without shuffling* to join results ○  Space complexity: O(n log(n)) ○  Time complexity: O(n^2) ○  IO cost: Minimal - No Shuffling! ○  With N ~ 1M => It works :)

Slide 29

Slide 29 text

THANKS! Any questions? You can find us at: @sdianahu [email protected] [email protected] Also, we are hiring!

Slide 30

Slide 30 text

REFERENCES ○  Lin, Jimmy. "Monoidify! Monoids as a design principle for efficient MapReduce algorithms." arXiv preprint arXiv:1304.7544 (2013). ○  Chiusano, Paul, and Rúnar Bjarnason. Functional Programming in Scala. Manning Publications Co., 2014. ○  Bloom, Burton H. "Space/time trade-offs in hash coding with allowable errors." Communications of the ACM 13(7):422–426, July 1970. ○  en.wikipedia.org/wiki/Abstract_algebra ○  github.com/twitter/algebird