Scaling Verizon IPTV Recommendations with Scala and Spark

SCALING VERIZON IPTV RECOMMENDATIONS
Diana Hu & Russell Horton
Verizon Labs
RecSys LSRS, September 2016

HELLO!

Diana Hu (@sdianahu)
◦ Working in IPTV since 2013
◦ Formerly Intel Labs
◦ Large Scale Machine Learning & Computer Vision
◦ Scala & Spark since 2014

Russell Horton (@ngr_am)
◦ Joined Verizon Labs in 2015
◦ Formerly Wordnik, EcoHealth
◦ NLP, machine learning, entity detection, recommendations
◦ Scala a while, Spark this year

MODEL LIFECYCLE

Development – Data Scientists (R, Python)
◦ Data exploration
◦ Feature engineering
◦ Model construction
◦ Evaluation

Production – Data Engineers
◦ Re-implementation in Java, Scala, C
◦ Re-evaluation at scale
◦ Pipeline deployment
◦ Integration with services

MODEL LIFECYCLE

Costs
◦ Extra implementation
◦ Late scaling
◦ Synchronization pain
◦ Slow iteration

Development – Data Scientists
◦ Data exploration
◦ Feature engineering
◦ Model construction
◦ Evaluation

Production – Data Engineers
◦ Re-implementation in Java, Scala, C
◦ Re-evaluation at scale
◦ Pipeline deployment
◦ Integration with services

IPTV SPEED & SCALE

DATA @ IPTV VERIZON
◦ Millions of subscribers
◦ Live TV viewings
◦ DVR recordings
◦ DVR playbacks
◦ Raw events
◦ Recs are time-sensitive (TV schedule)
◦ Hundreds of channels + VOD

Time sensitive
Heavy head, long tail

OUR STACK
◦ Data Exploration
◦ Data Visualization
◦ Production Models and Data Pipelines

WHY SPARK?

We want:
◦ Scalability
◦ Fast Aggregations
◦ Counts and sorts
◦ Libraries
◦ Community
◦ Readability
◦ Expressiveness
◦ Fast
◦ Testability
◦ Fault Tolerant
◦ Unified Distributed System

WHY SPARK?
◦ Readability
◦ Expressiveness
◦ Fast
◦ Testability
◦ Fault Tolerant
◦ Unified Distributed System

We want:
◦ Scalability
◦ Fast Aggregations
◦ Counts and sorts
◦ Libraries
◦ Community

WHY SCALA? https://nicholassterling.wordpress.com/2012/11/16/scala-performance/

WHY SCALA FOR SPARK?
◦ Spark is Scala
◦ Scala gets new features first
◦ Static typing
◦ Performance (still)
◦ Functional orientation

MODEL LIFECYCLE

Development – Data Scientists
◦ Data exploration
◦ Feature engineering
◦ Model construction
◦ Evaluation

Production – Data Scientists and friends
◦ Re-implementation in Java, Scala, C
◦ Re-evaluation at scale
◦ Pipeline deployment
◦ Integration with services

Scala + Spark for the whole lifecycle

DISTRIBUTED SYSTEMS => PARALLELISM

DISTRIBUTED SYSTEMS => PARALLELISM => ENABLED BY ASSOCIATIVITY

TRICKS WITH ABSTRACT ALGEBRA

ABSTRACT ALGEBRA?
“Abstract algebra is the set of advanced topics of algebra that deal with abstract algebraic structures rather than the usual number systems.”
www.mathworld.wolfram.com/AbstractAlgebra.html

SUMMING INTEGERS
1 + 2 + … + 28 + 29 + 30
= (1 + 2 + 3 + 4 + 5 + 6) + … + (25 + 26 + 27 + 28 + 29 + 30)
*http://nbviewer.jupyter.org/github/spark-mooc/mooc-setup/blob/master/spark_tutorial_student.ipynb
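
The regrouping on this slide can be sketched in plain Scala, with no Spark dependency. The object name PartialSums and the choice of five groups of six are assumptions made for illustration; each group stands in for a partition that could be summed on a separate worker.

```scala
object PartialSums {
  // The integers 1 to 30 from the slide.
  val numbers: Vector[Int] = (1 to 30).toVector

  // Five groups of six numbers, each standing in for one partition.
  val groups: Vector[Vector[Int]] = numbers.grouped(6).toVector

  // Each group is summed independently; associativity makes this regrouping safe.
  val partialSums: Vector[Int] = groups.map(_.sum)

  // Adding the partial sums gives the same total as summing left to right.
  val total: Int = partialSums.sum
}
```

Because addition is associative, the grouped total equals numbers.sum regardless of how the groups are cut.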

ALGEBRAIC PROPERTIES
◦ Associative: a + (b + c) = (a + b) + c
◦ Commutative: a + b = b + a
◦ Identity: a + 0 = a
◦ Closed: Int + Int = Int

ALGEBRAIC PROPERTIES FOR PARALLELISM
◦ Associative: a + (b + c) = (a + b) + c => enables parallelism
◦ Commutative: a + b = b + a => ignore order
◦ Identity: a + 0 = a => ignore empty data
◦ Closed: Int + Int = Int => type safety

ALGEBRAIC PROPERTIES FOR PARALLELISM
◦ Associative: a + (b + c) = (a + b) + c
◦ *Commutative: a + b = b + a
◦ Identity: a + 0 = a
◦ Closed: Int + Int = Int

Integers form a Monoid under addition.
Monoid computations run efficiently on Spark:

    rdd.map(toMonoid(_))
       .reduce(_ + _)
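
A minimal sketch of the map-then-reduce pattern above, written without Spark so it runs standalone. The Monoid trait here is a hypothetical two-method interface (not Algebird's actual trait), and nested Seqs stand in for RDD partitions; reducePartitions is likewise an illustrative name.

```scala
object MonoidSketch {
  // Hypothetical minimal interface: an identity element plus an associative combine.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  // Integers under addition: closed, associative, with identity 0.
  implicit val intAddition: Monoid[Int] = new Monoid[Int] {
    val zero: Int = 0
    def plus(x: Int, y: Int): Int = x + y
  }

  // Reduce every partition locally, then combine the partial results.
  def reducePartitions[A](partitions: Seq[Seq[A]])(implicit m: Monoid[A]): A = {
    val partials = partitions.map(_.foldLeft(m.zero)(m.plus))
    partials.foldLeft(m.zero)(m.plus)
  }

  // The empty partition contributes the identity, so it needs no special case.
  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq.empty, Seq(4, 5))
  val total: Int = reducePartitions(partitions)
}
```

The identity element is what lets the empty partition fall out of the computation without a guard.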

MONOIDIFY!
◦ Numbers, Lists, Sets, Strings
◦ Operations
  □ addition
  □ min, max
  □ moments
◦ Approximate data structures
  □ Bloom Filters
  □ HyperLogLog
  □ CountMinSketch
◦ Approximate histograms
◦ SGD

MANY MORE
◦ Semigroup
  □ Closed
  □ Associative
◦ Monoid
  □ Closed
  □ Associative
  □ Identity
◦ Group
  □ Closed
  □ Associative
  □ Identity
  □ Inverse

Many are implemented in Twitter’s Algebird
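
The three structures listed above can be sketched as a small trait hierarchy, where each level adds one requirement. These traits and instance names are hypothetical and much smaller than Algebird's own definitions; max is written as a semigroup here because it only needs an ordering.

```scala
object HierarchySketch {
  // Each level adds one requirement to the level before it.
  trait Semigroup[A] { def plus(x: A, y: A): A }          // closed + associative
  trait Monoid[A] extends Semigroup[A] { def zero: A }     // + identity
  trait Group[A] extends Monoid[A] { def negate(x: A): A } // + inverse

  // Semigroup: taking the maximum is closed and associative.
  val intMax: Semigroup[Int] = new Semigroup[Int] {
    def plus(x: Int, y: Int): Int = math.max(x, y)
  }

  // Monoid: string concatenation, with the empty string as identity.
  val stringConcat: Monoid[String] = new Monoid[String] {
    val zero: String = ""
    def plus(x: String, y: String): String = x + y
  }

  // Group: integer addition, where negation supplies the inverse.
  val intAddition: Group[Int] = new Group[Int] {
    val zero: Int = 0
    def plus(x: Int, y: Int): Int = x + y
    def negate(x: Int): Int = -x
  }

  val largest: Int = Seq(3, 9, 4).reduce(intMax.plus)
  val joined: String = Seq("a", "b", "c").foldLeft(stringConcat.zero)(stringConcat.plus)
  // With an inverse, an element can be removed from an existing aggregate.
  val withoutNine: Int = intAddition.plus(16, intAddition.negate(9))
}
```

The inverse is what separates a Group from a Monoid in practice: it allows an aggregate to be corrected by subtracting a value instead of recomputing from scratch.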

SCALING TOPK WITH ALGEBRA

NAÏVE TOPK
◦ Comparisons
◦ Space complexity: O(n²)
◦ Time complexity: O(n²)
◦ IO cost: Very high - will need to shuffle!
◦ With N ~ 1M => OOM

TOPK WITH ABSTRACT ALGEBRA

    class PriorityQueueMonoid[T](max: Int)(implicit order: Ordering[T])
      extends Monoid[PriorityQueue[T]]

From Algebird, Priority Queue:
◦ Can be empty
◦ Two Priority Queues can be “added” in any order
◦ Associative + Commutative

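TOPK WITH ABSTRACT ALGEBRA

    type Item = (Id, Feature)
    val data: RDD[Item]
    val bcData: Broadcast[Array[Item]]  // full item set, broadcast to every executor

    val topK = data.mapPartitions { iter =>
      val pq = new PriorityQueueMonoid  // type parameter and max size elided on the slide
      iter.map { item =>
        // Fold over all candidates, keeping only the best matches for this item.
        bcData.value.foldLeft(pq.zero)(
          (topK: PriorityQueue, i: Item) => pq.plus(topK, calcSimilar(item, i)))
      }
    }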
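
The same idea can be tried without Spark or Algebird. The sketch below is a simplified stand-in, not Algebird's PriorityQueueMonoid: the TopKMonoid class is hypothetical and keeps a sorted Vector instead of a heap, which is easier to read but does more work per merge. Groups of a Vector stand in for partitions.

```scala
object TopKSketch {
  // Hypothetical bounded top-K monoid: values are the k largest items, in descending order.
  final class TopKMonoid[T](k: Int)(implicit ord: Ordering[T]) {
    val zero: Vector[T] = Vector.empty
    def build(t: T): Vector[T] = Vector(t)
    // Merging two bounded results never keeps more than k items.
    def plus(x: Vector[T], y: Vector[T]): Vector[T] =
      (x ++ y).sorted(ord.reverse).take(k)
  }

  val scores: Vector[Int] = Vector(7, 42, 3, 19, 88, 5, 61, 23, 54, 10, 2, 97)
  val monoid = new TopKMonoid[Int](3)

  // Three stand-in partitions, each reduced locally to at most k items.
  val partitions: Vector[Vector[Int]] = scores.grouped(4).toVector
  val partials: Vector[Vector[Int]] =
    partitions.map(_.foldLeft(monoid.zero)((acc, s) => monoid.plus(acc, monoid.build(s))))

  // Only the small partial results are combined to get the global top k.
  val top3: Vector[Int] = partials.foldLeft(monoid.zero)(monoid.plus)
}
```

Each partition hands back at most k items, so the final merge touches k items per partition instead of the full data set.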

SPEED WITH MONOIDS
TopK computed in a single map operation for filtering and a single reduce *without shuffling* to join results
◦ Space complexity: O(n log(n))
◦ Time complexity: O(n²)
◦ IO cost: Minimal - No Shuffling!
◦ With N ~ 1M => It works :)

THANKS!
Any questions?
You can find us at: @sdianahu
Also, we are hiring!

REFERENCES
◦ Lin, Jimmy. "Monoidify! Monoids as a design principle for efficient MapReduce algorithms." arXiv preprint arXiv:1304.7544 (2013).
◦ Chiusano, Paul, and Rúnar Bjarnason. Functional Programming in Scala. Manning Publications Co., 2014.
◦ Bloom, B. "Space/time trade-offs in hash coding with allowable errors." CACM, 13(7):422–426, July 1970.
◦ en.wikipedia.org/wiki/Abstract_algebra
◦ github.com/twitter/algebird
