
Large-scale Recommender Systems on Just a PC (with GraphChi)


Talk by Aapo Kyrölä, Sr Engineer @Facebook, at Data Science London @ds-ldn meetup

Data Science London

February 05, 2015



Transcript

  1. Large-scale Recommender Systems on Just a PC (with GraphChi)
     Data Science London, Dec 10, 2014. Aapo Kyrölä, Ph.D., Carnegie Mellon University 2014 (now: Facebook). http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov
     Big Data – small machine
  2. Contents
     1. Why “Just a PC” 2. Introduction to GraphChi 3. Recsys on GraphChi – Examples: ALS, item-CF (triangle counting), random walks for link prediction 4. GraphChi-DB (very briefly)
  3. Why on a single machine?
     Can’t we just use the Cloud? Large-Scale Recommender Systems on Just a PC
  4. Why use a cluster? Two reasons:
     1. One computer cannot handle my problem in a reasonable time. 2. I need to solve the problem very fast.
  5. Why use a cluster? Two reasons:
     1. One computer cannot handle my problem in a reasonable time. 2. I need to solve the problem very fast.
     Our work expands the space of feasible (graph) problems on one machine:
     - Our experiments use the same graphs as, or bigger than, previous papers on distributed graph computation (and we can do the Twitter graph on a laptop).
     - Most data is not that “big” anyway.
     Our work raises the bar on the required performance of a “complicated” system.
  6. Benefits of single-machine systems
     Assuming it can handle your big problems…
     1. Programmer productivity – global state – can use “real data” for development
     2. Inexpensive to install and administer; uses less power.
     3. Scalability: 10x machines doing a full job each = 10x throughput
  7. Why graphs for recommender systems?
     • Graph = matrix: edge(u,v) = M[u,v] – Note: always sparse graphs • Intuitive, human-understandable representation – easy to visualize and explain. • Unifies collaborative filtering (typically matrix-based) with recommendation in social networks – random walk algorithms. • Local view → vertex-centric computation
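A minimal sketch of the graph-as-matrix duality described above; the names and ratings are hypothetical, and a nested dict stands in for a proper sparse-matrix structure:

```python
# Each rating is an edge (user, movie, value); together the edges form a sparse matrix M.
ratings = [("alice", "Matrix", 5), ("bob", "Matrix", 4), ("alice", "Heat", 3)]

M = {}  # edge list -> sparse matrix M[u][v], nested dicts standing in for real storage
for user, movie, value in ratings:
    M.setdefault(user, {})[movie] = value

print(M["alice"]["Matrix"])  # 5, i.e. edge(u, v) = M[u, v]
```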
  8. Vertex-Centric Computational Model
     • Graph G = (V, E) – directed edges: e = (source, destination) – each edge and vertex is associated with a value (user-defined type) – vertex and edge values can be modified • (structure modification also supported)
     (Figure: example graph with vertices A and B, and data values attached to every vertex and edge.)
  9. Vertex-centric Programming
     • “Think like a vertex” • Popularized by the Pregel and GraphLab projects
     MyFunc(vertex) { // modify neighborhood }
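A minimal Python sketch of the vertex-centric model above (this is not the actual GraphChi C++ API; the toy engine, graph, and PageRank-style update rule are illustrative assumptions):

```python
# Toy vertex-centric engine: run an update function on every vertex, each iteration.
graph = {1: [2, 3], 2: [3], 3: [1]}        # vertex -> out-neighbors (hypothetical)
value = {v: 1.0 for v in graph}            # user-defined vertex values

def update(v):
    # "Think like a vertex": read the neighborhood, modify this vertex's value.
    in_nbrs = [u for u in graph if v in graph[u]]
    value[v] = 0.15 + 0.85 * sum(value[u] / len(graph[u]) for u in in_nbrs)

for _ in range(10):                        # T iterations
    for v in graph:                        # the engine schedules every vertex
        update(v)

print(value)                               # converges toward PageRank-like scores
```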
  10. What is GraphChi
      (Figures: “Before” and “After”.) Both in OSDI’12!
  11. The Main Challenge of Disk-based Graph Computation: Random Access
      ~100K reads/sec (commodity); ~1M reads/sec (high-end arrays); hard disks: 100s of reads/writes per sec. All << the 5–10M random edge accesses/sec needed to achieve “reasonable performance”.
  12. GraphChi’s Data Storage
      • Vertices are numbered from 1 to n – P intervals, each associated with a shard on disk – sub-graph = interval of vertices
      (Figure: vertex range 1…n split into interval(1)…interval(P), each backed by shard(1)…shard(P) on disk.)
      Expensive graph partitioning not required.
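A sketch of the interval/shard layout above. Per the OSDI’12 paper, shard(p) holds the edges whose destination falls in interval(p), sorted by source vertex; the graph and sizes here are toy assumptions:

```python
# Split vertices 1..n into P equal intervals; bucket edges into shards by destination.
n, P = 12, 3
edges = [(1, 5), (2, 9), (7, 3), (11, 6), (4, 12)]   # hypothetical (src, dst) pairs

size = n // P
def interval(v):                  # interval index 0..P-1 for a 1-based vertex id
    return min((v - 1) // size, P - 1)

shards = {p: [] for p in range(P)}
for src, dst in edges:
    shards[interval(dst)].append((src, dst))          # shard(p): dst in interval(p)

for p in shards:
    shards[p].sort()                                  # sorted by source, as PSW needs
    print(p, shards[p])
```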
  13. Parallel Sliding Windows
      Only P large reads for each interval (sub-graph); P² reads on one full pass.
      Details: Kyrola, Blelloch, Guestrin: “Large-scale graph computation on just a PC” (OSDI 2012)
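A toy count of the read pattern behind the P and P² figures above (real PSW reads a contiguous byte window from each shard; this sketch only tallies the reads):

```python
# Parallel Sliding Windows: to execute interval p, load shard p in full plus a
# sliding window of each other shard -> P sequential reads per interval.
P = 4
reads_per_interval = []
for p in range(P):                  # execution interval
    reads = 0
    for s in range(P):              # one (windowed) read from every shard
        reads += 1                  # s == p: full "memory shard"; s != p: window only
    reads_per_interval.append(reads)

print(reads_per_interval)           # [4, 4, 4, 4]
print(sum(reads_per_interval))      # 16 == P**2 reads for one full pass
```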
  14. Performance
      GraphChi can compute on the full Twitter follow-graph with just a standard laptop (2012) – about as fast as a very large Hadoop cluster! (Size of the graph in Fall 2013: > 20B edges [Gupta et al. 2013].)
  15. Overview of Recommender Systems for GraphChi
      1. Collaborative Filtering toolkit – Example 1: ALS – Example 2: Item-based CF
      2. Link prediction in large networks – random-walk based approaches
  16. GraphChi’s Collaborative Filtering Toolkit
      • Developed by Danny Bickson (CMU / GraphLab Inc) • Includes: – Alternating Least Squares (ALS) – Sparse-ALS – SVD++ – LibFM (factorization machines) – GenSGD – Item-similarity based methods – PMF – CliMF (contributed by Mark Levy) – …
      Note: in the C++ version. See Danny’s blog for more information: http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html
  17. Example: Alternating Least Squares Matrix Factorization (ALS)
      Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: “Large-Scale Parallel Collaborative Filtering for the Netflix Prize” (2008)
      • Task: predict ratings for items (movies) by users. • Model: latent factor model (see next slide)
  18. ALS: User – Item bipartite graph
      (Figure: users on one side; movies – City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown – on the other; ratings such as 4, 3, 2, 5 on the edges; a latent factor vector attached to each vertex.)
      A user’s rating of a movie is modeled as a dot-product: <factor(user), factor(movie)>
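The dot-product model above as a two-line worked example (the factor values are illustrative; real ALS models typically use D of 20 or more):

```python
import numpy as np

factor_user  = np.array([0.4, 2.3, -1.8, 2.9])   # hypothetical D=4 latent factors
factor_movie = np.array([1.2, 0.9, 0.2, 1.1])

print(factor_user @ factor_movie)    # predicted rating = <factor(user), factor(movie)>
```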
  19. ALS: GraphChi implementation
      • Update function handles one vertex at a time (user or movie) • For each user: – estimate latent(user): minimize the least-squares error of the dot-product predicted ratings • GraphChi executes the update function for each vertex (in parallel) and loads edges (ratings) from disk – Latent factors in memory: need O(V) memory. – If factors don’t fit in memory, they can be replicated to the edges and thus stored on disk.
      Scales to very large problems!
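A minimal sketch of one such user update (fix the movie factors, solve the regularized least-squares problem for the user's factor); D, lambda, and all data are illustrative assumptions, not the toolkit's actual code:

```python
import numpy as np

D, lam = 4, 0.1
rng = np.random.default_rng(0)
Q = {m: rng.normal(size=D) for m in ["City of God", "La Dolce Vita"]}  # fixed movie factors
ratings = {"City of God": 4.0, "La Dolce Vita": 5.0}                   # this user's edges

# Normal equations for min_p sum_i (r_i - <p, q_i>)^2 + lam * ||p||^2:
#   (sum_i q_i q_i^T + lam * I) p = sum_i r_i q_i
A = lam * np.eye(D)
b = np.zeros(D)
for movie, r in ratings.items():
    q = Q[movie]
    A += np.outer(q, q)
    b += r * q

p_user = np.linalg.solve(A, b)   # new latent(user)
print(p_user)
```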
  20. ALS: Performance
      (Chart: minutes for matrix factorization (Alternating Least Squares) on Netflix (99M edges), D=20: GraphLab v1 (8 cores) vs. GraphChi (Mac Mini), on a 0–12 minute scale.)
      Remark: Netflix is not a big problem, but GraphChi will scale at most linearly with input size (ALS is CPU-bound, so should be sub-linear in #ratings).
  21. Example: Item-Based CF
      • Task: compute a similarity score [e.g. Jaccard] for each movie pair that has at least one viewer in common – Similarity(X, Y) ~ # common viewers • Problem: enumerating all pairs takes too much time.
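For concreteness, the Jaccard score named above, computed from hypothetical viewer sets:

```python
# Jaccard similarity between two movies, from their sets of viewers.
viewers = {
    "City of God":   {"alice", "bob", "carol"},
    "La Dolce Vita": {"bob", "carol", "dave"},
}

x, y = viewers["City of God"], viewers["La Dolce Vita"]
print(len(x & y) / len(x | y))   # common viewers / all viewers of either: 0.5
```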
  22. (Figure: the user–movie bipartite graph again, with movies City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown.)
      Solution: enumerate all triangles of the graph.
      New problem: how to enumerate triangles if the graph does not fit in RAM?
  23. Triangle Enumeration in GraphChi
      Algorithm:
      • Let the pivots be a subset of the vertices;
      • Load the neighbor lists of the pivots into RAM;
      • Use GraphChi to load all vertices from disk, one by one, and compare their neighbors to the neighboring pivots’ neighbor lists;
      • Repeat with a new set of pivots.
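A minimal Python sketch of the pivot algorithm above; the in-memory toy graph stands in for the on-disk adjacency data:

```python
# Keep only the pivots' neighbor lists in RAM and stream every other vertex past them.
graph = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}   # hypothetical undirected graph

def triangles_with_pivots(graph, pivots):
    pivot_nbrs = {p: graph[p] for p in pivots}          # held in RAM
    found = set()
    for v, nbrs in graph.items():                       # streamed one vertex at a time
        for p in pivot_nbrs:
            if p in nbrs:                               # v is adjacent to pivot p ...
                for w in nbrs & pivot_nbrs[p]:          # ... and w closes the triangle
                    found.add(tuple(sorted((p, v, w))))
    return found

# Repeat with new pivot sets until every vertex has served as a pivot once.
print(triangles_with_pivots(graph, pivots={1, 2}))      # {(1, 2, 3)}
```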
  24. Triangle Counting Performance
      (Chart: minutes to count triangles on twitter-2010 (1.5B edges): Hadoop (1636 machines) vs. GraphChi (Mac Mini), on a 0–450 minute scale.)
  25. Random Walk Engine
      • Simulating random walks to quickly rank the most important (non-friend) persons for a person: – Example: pick the top 10 nodes visited by a 10,000-step random walk (with restart). • Used by Twitter as the first step in their “Who to Follow” algorithm (Gupta et al., WWW’13)
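A sketch of the ranking recipe above (random walk with restart, rank by visit count); the graph, restart probability, and parameters are illustrative assumptions:

```python
import random
from collections import Counter

graph = {0: [1, 2], 1: [2, 3], 2: [0, 3], 3: [0]}      # hypothetical follow graph

def top_visited(graph, source, steps=10_000, restart=0.15, k=10):
    visits, v = Counter(), source
    for _ in range(steps):
        if random.random() < restart or not graph[v]:
            v = source                     # restart keeps the walk near the source
        else:
            v = random.choice(graph[v])
        visits[v] += 1
    return [u for u, _ in visits.most_common() if u != source][:k]

print(top_visited(graph, source=0))        # candidate "who to follow" ranking
```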
  26. Random walk in an in-memory graph
      • Compute one walk at a time (multiple in parallel, of course)
      DrunkardMob - RecSys ’13
  27. Problem: What if the graph does not fit in memory?
      (Figure: Twitter network visualization, by Akshay Java, 2009.)
      Distributed graph systems: each hop across a partition boundary is costly. Disk-based “single-machine” graph systems: “paging” from disk is costly.
      DrunkardMob - RecSys ’13
  28. Random walks in GraphChi
      • DrunkardMob algorithm (Kyrola, ACM RecSys ’13) – Reverse thinking: simulate millions/billions of short walks in parallel. – Handle one vertex at a time (instead of one walk at a time).
      Note: need to store only the current position of each walk in memory (4 bytes/walk)!
      DrunkardMob - RecSys ’13
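A toy sketch of the vertex-centric walk simulation above: keep one integer per walk (its current vertex) and advance all walks sitting on a vertex while that vertex is loaded. The in-memory dict stands in for GraphChi's disk-resident graph:

```python
import random
from collections import defaultdict

graph = {0: [1, 2], 1: [2], 2: [0]}           # hypothetical graph; on disk in GraphChi
walk_pos = [0] * 1000 + [1] * 500             # per-walk state: current vertex only

for _ in range(100):                          # each outer pass advances every walk one hop
    at_vertex = defaultdict(list)
    for w, v in enumerate(walk_pos):          # bucket walks by their current vertex
        at_vertex[v].append(w)
    for v in graph:                           # visit one vertex (and its edges) at a time
        for w in at_vertex[v]:
            walk_pos[w] = random.choice(graph[v])

print(walk_pos[:10])
```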
  29. Comparison to in-memory walks
      (Charts: (a) seconds vs. number of walks, DrunkardMob vs. in-memory walks (Cassovary); (b) running time, partially cut off.)
      Competitive with in-memory walks. However, if you can fit your graph in memory, there is no need for DrunkardMob.
      DrunkardMob - RecSys ’13
  30. GraphChi (OSDI ’12): batch computation on graphs with billions of edges on just a PC / laptop.
      GraphChi-DB: adds database functionality.
      Updates (online): insert edge/vertex; update edge/vertex value; delete edge/vertex. (No high-level transactions.)
      Associated data: edge type (label), edge properties, vertex properties, vardata columns.
      Queries (graph-style): in/out neighbor queries, two-hop queries, point queries, shortest paths, graph sampling → incremental computation on evolving graphs.
  31. Highlights
      • Fast edge ingest by using a Log-Structured Merge tree (similar to RocksDB, LevelDB) • Fast in- and out-edge queries using sparse and compressed indices – storage model optimized for large graphs. • Columnar data storage for fast analytical computation and schema changes.
      Read more in my thesis / on arXiv.
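A minimal log-structured merge sketch to make the ingest point concrete: writes land in an in-memory buffer and are flushed as sorted runs, so ingest never performs random disk writes. This illustrates the general LSM idea only, not GraphChi-DB's actual structures:

```python
import bisect

MEMTABLE_LIMIT = 4
memtable, runs = {}, []                    # runs: sorted lists of (key, value) pairs

def put(key, value):
    memtable[key] = value                  # ingest is an in-memory write ...
    if len(memtable) >= MEMTABLE_LIMIT:
        runs.append(sorted(memtable.items()))  # ... plus occasional sequential flushes
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for run in reversed(runs):             # newest run wins
        i = bisect.bisect_left(run, (key,))
        if i < len(run) and run[i][0] == key:
            return run[i][1]
    return None

for e in range(10):
    put(("v1", e), "edge-value-%d" % e)    # hypothetical (vertex, edge) keys
print(get(("v1", 3)))                      # edge-value-3
```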
  32. Comparison: Database Size
      Baseline: 4 + 4 bytes / edge.
      (Chart: database file size for the twitter-2010 graph (1.5B edges): MySQL (data + indices), Neo4j, GraphChi-DB, and the baseline.)
  33. Comparison: Ingest
      System               | Time to ingest 1.5B edges
      GraphChi-DB (ONLINE) | 1 hour 45 mins
      Neo4j (batch)        | 45 hours
      MySQL (batch)        | 3 hours 30 minutes (including index creation)
      If running Pagerank simultaneously, GraphChi-DB takes 3 hours 45 minutes.
  34. Comparison: Friends-of-Friends Query
      Latency percentiles over 100K random queries (graph: 1.5B edges). See thesis for the shortest-path comparison.
      50th percentile (ms): GraphChi-DB 22.4 | Neo4j 759.8 | MySQL 5.9
      99th percentile (ms): GraphChi-DB 1264 | GraphChi-DB + Pagerank 1631 | MySQL 4776
      GraphChi-DB is the most scalable DB with large power-law graphs.
  35. Summary
      • A single PC can handle very large datasets – easier to work with, better economics. • GraphChi and the Parallel Sliding Windows algorithm allow processing graphs in big chunks from disk. • GraphChi’s collaborative filtering toolkit offers matrix- and graph-oriented recommendation algorithms – scales to big problems, with high efficiency by storing critical data in memory. • GraphChi-DB adds online database features: a graph database that can do analytical computation.
  36. GraphChi in GitHub
      • http://github.com/graphchi-cpp – includes the collaborative filtering toolkit • http://github.com/graphchi-java • http://github.com/graphchiDB-scala
      Thank you! [email protected] Twitter: @kyrpov
      See also GraphLab Create by graphlab.com!
  37. Random Access Problem
      (Figure: a disk file of edge-values, with A’s in-edges/out-edges and B’s in-edges/out-edges interleaved; processing sequentially, updating an edge value x triggers a random read on one side and a random write on the other.)
      Moral: you can access either in-edges or out-edges sequentially, but not both!
  38. Efficient Scaling
      (Figure: task timelines for a distributed graph system vs. single-computer systems capable of big tasks, with 6 machines vs. 12 machines.)
      Distributed graph system: (significantly) less than 2x throughput with 2x machines. Single-computer systems, each doing a full job: exactly 2x throughput with 2x machines.
  39. GraphChi Program Execution
      For T iterations:
        For p = 1 to P:
          For v in interval(p):
            updateFunction(v)
      which is conceptually equivalent to:
      For T iterations:
        For v = 1 to V:
          updateFunction(v)
      “Asynchronous”: updates are immediately visible (vs. bulk-synchronous).
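A runnable illustration of the “asynchronous” point above: within one iteration, an update sees values that earlier updates in the same iteration already wrote (under bulk-synchronous execution it would read only the previous iteration's values). The graph and update rule are toy assumptions:

```python
graph = {1: [2], 2: [3], 3: []}            # hypothetical chain graph
value = {1: 1, 2: 0, 3: 0}

def update(v):
    in_nbrs = [u for u in graph if v in graph[u]]
    if in_nbrs:
        value[v] = max(value[u] for u in in_nbrs)

# Asynchronous: a single pass in vertex order already propagates 1 -> 2 -> 3,
# because update(3) sees the value that update(2) just wrote.
for v in sorted(graph):
    update(v)

print(value)                               # {1: 1, 2: 1, 3: 1} after one iteration
```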