Slide 1

Slide 1 text

Large-scale Recommender Systems on Just a PC (with GraphChi). Data Science London, Dec 10, 2014. Aapo Kyrölä, Ph.D., Carnegie Mellon University 2014 (now: Facebook). http://www.cs.cmu.edu/~akyrola Twitter: @kyrpov. Big Data - small machine

Slide 2

Slide 2 text

Contents 1. Why "Just a PC" 2. Introduction to GraphChi 3. Recsys on GraphChi - Examples: ALS, item-CF (triangle counting), random walks for link prediction 4. GraphChi-DB (very briefly)

Slide 3

Slide 3 text

Large-Scale Recommender Systems on Just a PC. Why on a single machine? Can't we just use the Cloud?

Slide 4

Slide 4 text

Why use a cluster? Two reasons: 1.  One computer cannot handle my problem in a reasonable time. 2.  I need to solve the problem very fast.

Slide 5

Slide 5 text

Why use a cluster? Two reasons: 1. One computer cannot handle my problem in a reasonable time. 2. I need to solve the problem very fast. Our work expands the space of feasible (graph) problems on one machine: - Our experiments use the same graphs, or bigger, than previous papers on distributed graph computation (+ we can do the Twitter graph on a laptop). - Most data is not that "big" anyway. Our work raises the bar on required performance for a "complicated" system.

Slide 6

Slide 6 text

Benefits of single-machine systems. Assuming one can handle your big problems... 1. Programmer productivity - Global state - Can use "real data" for development 2. Inexpensive to install and administer; uses less power. 3. Scalability: - 10x machines doing a full job each = 10x throughput

Slide 7

Slide 7 text

GRAPH COMPUTATION AND GRAPHCHI

Slide 8

Slide 8 text

Why graphs for recommender systems? • Graph = matrix: edge(u,v) = M[u,v] - Note: always sparse graphs • Intuitive, human-understandable representation - Easy to visualize and explain. • Unifies collaborative filtering (typically matrix based) with recommendation in social networks. - Random walk algorithms. • Local view → vertex-centric computation

Slide 9

Slide 9 text

Vertex-Centric Computational Model • Graph G = (V, E) - directed edges: e = (source, destination) - each edge and vertex is associated with a value (user-defined type) - vertex and edge values can be modified • (structure modification also supported) (Figure: a small graph with data attached to every vertex and edge.)

Slide 10

Slide 10 text

Vertex-centric Programming • "Think like a vertex" • Popularized by the Pregel and GraphLab projects. MyFunc(vertex) { // modify neighborhood }
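A minimal Python sketch of this model on a toy in-memory graph (the `Vertex` class and `update` signature are illustrative assumptions, not GraphChi's actual API); the update function computes a PageRank-style value from the in-neighborhood:

```python
# Illustrative sketch of vertex-centric computation (not GraphChi's API).
class Vertex:
    def __init__(self, vid, value=0.0):
        self.id = vid
        self.value = value
        self.in_edges = []   # list of (neighbor_id, edge_value)
        self.out_edges = []  # list of (neighbor_id, edge_value)

def update(vertex, graph):
    """One PageRank-style update: read in-neighbor values, write own value."""
    incoming = sum(graph[src].value / max(len(graph[src].out_edges), 1)
                   for src, _ in vertex.in_edges)
    vertex.value = 0.15 + 0.85 * incoming

# Tiny 3-vertex cycle: 0 -> 1 -> 2 -> 0
graph = {i: Vertex(i, 1.0) for i in range(3)}
for (s, d) in [(0, 1), (1, 2), (2, 0)]:
    graph[s].out_edges.append((d, None))
    graph[d].in_edges.append((s, None))

for _ in range(10):            # iterations
    for v in graph.values():   # "think like a vertex"
        update(v, graph)
```

Note the update only touches a vertex's own value and its neighborhood, which is what lets GraphChi schedule the calls in parallel and stream edges from disk.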

Slide 11

Slide 11 text

What is GraphChi? Both in OSDI'12!

Slide 12

Slide 12 text

The Main Challenge of Disk-based Graph Computation: Random Access. Disk: 100s of reads/writes per sec; commodity SSD: ~100K reads/sec; high-end arrays: ~1M reads/sec. All << the 5-10M random edge accesses/sec needed to achieve "reasonable performance".

Slide 13

Slide 13 text

GraphChi's Data Storage • Vertices are numbered from 1 to n - P intervals, each associated with a shard on disk. - sub-graph = interval of vertices. Expensive graph partitioning is not required.
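The interval/shard layout can be sketched as follows (the boundary choice and in-memory lists are illustrative assumptions, not GraphChi's file format); each edge is stored in the shard of its destination's interval, sorted by source:

```python
import bisect

# Sketch: vertices 1..n split into P intervals; edge (src, dst) goes to the
# shard of dst's interval. Illustrative layout only.
n, P = 100, 4
boundaries = [n * (p + 1) // P for p in range(P)]  # last vertex of each interval

def interval_of(v):
    """Index of the interval (and shard) that contains vertex v."""
    return bisect.bisect_left(boundaries, v)

shards = [[] for _ in range(P)]
edges = [(1, 2), (50, 99), (30, 60), (7, 100)]
for src, dst in edges:
    shards[interval_of(dst)].append((src, dst))
# Within each shard, edges are kept sorted by source vertex, so a sliding
# window over the shard yields the out-edges of consecutive intervals.
for shard in shards:
    shard.sort()
```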

Slide 14

Slide 14 text

Parallel Sliding Windows. Only P large reads for each interval (sub-graph); P² reads on one full pass. Details: Kyrola, Blelloch, Guestrin: "Large-scale graph computation on just a PC" (OSDI 2012)
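The I/O arithmetic behind that claim, as a rough sketch (the P = 32 figure is an assumed example, not a number from the paper):

```python
# Back-of-the-envelope I/O model for one full pass (assumption: each window
# read into a shard is a single large sequential I/O).
def random_access_ios(num_edges):
    return num_edges          # worst case: one random read per edge

def psw_ios(P):
    return P * P              # P sequential window reads per interval, P intervals

edge_ios = random_access_ios(1_500_000_000)   # twitter-2010 scale
block_ios = psw_ios(32)                       # e.g. P = 32 shards -> 1024 reads
```

The point: the number of large sequential reads depends only on P, not on the edge count, which is why PSW avoids the random-access bottleneck.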

Slide 15

Slide 15 text

Performance: GraphChi can compute on the full Twitter follow-graph with just a standard laptop (2012), ~ as fast as a very large Hadoop cluster! (Size of the graph in Fall 2013: > 20B edges [Gupta et al. 2013].)

Slide 16

Slide 16 text

RECSYS MODEL TRAINING WITH GRAPHCHI

Slide 17

Slide 17 text

Overview of Recommender Systems for GraphChi 1. Collaborative Filtering toolkit - Example 1: ALS - Example 2: Item-based CF 2. Link prediction in large networks - Random-walk based approaches

Slide 18

Slide 18 text

GraphChi's Collaborative Filtering Toolkit • Developed by Danny Bickson (CMU / GraphLab Inc) • Includes: - Alternating Least Squares (ALS) - Sparse-ALS - SVD++ - LibFM (factorization machines) - GenSGD - Item-similarity based methods - PMF - CliMF (contributed by Mark Levy) - ... Note: in the C++ version. See Danny's blog for more information: http://bickson.blogspot.com/2012/12/collaborative-filtering-with-graphchi.html

Slide 19

Slide 19 text

Example: Alternating Least Squares Matrix Factorization (ALS) • Task: predict ratings for items (movies) by users. • Model: - Latent factor model (see next slide). Reference: Y. Zhou, D. Wilkinson, R. Schreiber, R. Pan: "Large-Scale Parallel Collaborative Filtering for the Netflix Prize" (2008)

Slide 20

Slide 20 text

ALS: User - Item bipartite graph. Items (movies): City of God, Wild Strawberries, The Celebration, La Dolce Vita, Women on the Verge of a Nervous Breakdown; edges carry the users' ratings. (Figure: bipartite graph with a latent factor vector at each vertex.) A user's rating of a movie is modeled as a dot-product: r(u, m) ≈ ⟨latent(u), latent(m)⟩.

Slide 21

Slide 21 text

ALS: GraphChi implementation • Update function handles one vertex at a time (user or movie). • For each user: - Estimate latent(user): minimize the least-squares error of the dot-product predicted ratings. • GraphChi executes the update function for each vertex (in parallel), and loads edges (ratings) from disk. - Latent factors in memory: needs O(V) memory. - If the factors don't fit in memory, they can be replicated to edges and thus stored on disk. Scales to very large problems!
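One half of an ALS sweep can be sketched like this (a pure-Python illustration with assumed names such as `als_update_user` and a tiny D = 2; the toolkit's C++ implementation differs): with item factors held fixed, each user's latent vector is the solution of a small regularized least-squares problem built from that user's rating edges.

```python
# Sketch of the per-user ALS update (illustrative, not the toolkit's code):
# minimize sum_i (r_ui - x_u . y_i)^2 + LAM * |x_u|^2 with item factors fixed.
D, LAM = 2, 0.1

def solve(A, b):
    """Solve the small DxD system A x = b by Gaussian elimination."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def als_update_user(ratings, item_factors):
    """ratings: list of (item_id, rating) on this user-vertex's edges."""
    A = [[LAM * (i == j) for j in range(D)] for i in range(D)]  # normal eqs
    b = [0.0] * D
    for item, r in ratings:
        y = item_factors[item]
        for i in range(D):
            b[i] += r * y[i]
            for j in range(D):
                A[i][j] += y[i] * y[j]
    return solve(A, b)

items = {0: [1.0, 0.0], 1: [0.0, 1.0]}
x = als_update_user([(0, 4.0), (1, 2.0)], items)
```

The movie-side update is symmetric, which is why a single GraphChi update function covers both vertex types.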

Slide 22

Slide 22 text

ALS: Performance. Matrix Factorization (Alternating Least Squares) on Netflix (99M edges), D=20: GraphLab v1 (8 cores) vs. GraphChi (Mac Mini), runtimes in minutes (bar chart). Remark: Netflix is not a big problem, but GraphChi will scale at most linearly with input size (ALS is CPU bound, so should be sub-linear in #ratings).

Slide 23

Slide 23 text

Example: Item-Based CF • Task: compute a similarity score [e.g. Jaccard] for each movie pair that has at least one viewer in common. - Similarity(X, Y) ~ # common viewers • Problem: enumerating all pairs takes too much time.
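The score itself is cheap once the common viewers are known; a Jaccard sketch (the movie names and viewer sets are made-up toy data):

```python
# Item-item similarity via common viewers (Jaccard). Toy data.
viewers = {
    "City of God":       {1, 2, 3, 4},
    "Wild Strawberries": {2, 3, 5},
}

def jaccard(a, b):
    common = len(viewers[a] & viewers[b])   # viewers of both movies
    union = len(viewers[a] | viewers[b])    # viewers of either movie
    return common / union if union else 0.0

sim = jaccard("City of God", "Wild Strawberries")  # 2 common out of 5 total
```

The hard part, as the slide says, is not the formula but enumerating the pairs with at least one common viewer.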

Slide 24

Slide 24 text

(Figure: the user-movie graph from the previous slide, with the common-viewer count shown for a movie pair.) Solution: Enumerate all triangles of the graph. New problem: how to enumerate triangles if the graph does not fit in RAM?

Slide 25

Slide 25 text

Triangle Enumeration in GraphChi. Algorithm: • Let the pivots be a subset of the vertices; • Load the lists of neighbors of the pivots into RAM; • Use GraphChi to load all vertices from disk, one by one, and compare their neighbors to the neighboring pivots' neighbor lists; • Repeat with a new set of pivots.
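The steps above can be sketched in Python (a toy in-memory stand-in: the `adj` dict plays the role of the on-disk vertex stream, and the function names are illustrative):

```python
# Pivot-based triangle enumeration sketch. Only the pivots' adjacency lists
# are held "in RAM"; every other vertex is streamed one at a time.
adj = {  # undirected toy graph as neighbor sets
    1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2},
}

def triangles_with_pivots(adj, pivots):
    pivot_nbrs = {p: adj[p] for p in pivots}    # kept in RAM
    found = set()
    for v, nbrs in adj.items():                 # streamed from "disk"
        for p in pivots:
            if p in nbrs:                       # v is adjacent to pivot p
                for w in nbrs & pivot_nbrs[p]:  # common neighbor closes triangle
                    found.add(tuple(sorted((p, v, w))))
    return found

tris = triangles_with_pivots(adj, pivots={1})
# Repeat with a new pivot set until every vertex has served as a pivot.
```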

Slide 26

Slide 26 text

Triangle Counting Performance. Triangle counting on twitter-2010 (1.5B edges): Hadoop (1636 machines) vs. GraphChi (Mac Mini), runtime in minutes (bar chart).

Slide 27

Slide 27 text

RECOMMENDATIONS IN SOCIAL NETWORKS

Slide 28

Slide 28 text

Random Walk Engine • Simulating random walks to quickly rank the most important (non-friend) persons for a person: - Example: pick the top 10 nodes visited by a 10,000-step random walk (with restart). • Used by Twitter as the first step in their "Who to Follow" algorithm (Gupta et al., WWW'13).
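The ranking idea can be sketched as follows (a toy random walk with restart; the parameter values and names like `top_visited` are assumptions, and Twitter's actual pipeline is considerably more involved):

```python
import random
from collections import Counter

# Personalized ranking by a random walk with restart from a source node.
def top_visited(adj, source, steps=10_000, restart_p=0.15, k=3, seed=1):
    rng = random.Random(seed)
    counts, v = Counter(), source
    for _ in range(steps):
        if rng.random() < restart_p or not adj[v]:
            v = source                      # teleport back to the source
        else:
            v = rng.choice(adj[v])          # follow a random out-edge
        counts[v] += 1
    counts.pop(source, None)                # don't recommend yourself
    return [node for node, _ in counts.most_common(k)]

adj = {0: [1, 2], 1: [2], 2: [0], 3: []}
ranked = top_visited(adj, source=0)         # node 3 is unreachable from 0
```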

Slide 29

Slide 29 text

Random walk in an in-memory graph • Compute one walk at a time (multiple in parallel, of course). DrunkardMob - RecSys '13

Slide 30

Slide 30 text

Problem: What if the graph does not fit in memory? (Figure: Twitter network visualization, by Akshay Java, 2009.) Distributed graph systems: each hop across a partition boundary is costly. Disk-based "single-machine" graph systems: "paging" from disk is costly. DrunkardMob - RecSys '13

Slide 31

Slide 31 text

Random walks in GraphChi • DrunkardMob algorithm (Kyrola, ACM RecSys '13) - Reverse thinking: simulate millions/billions of short walks in parallel. - Handle one vertex at a time (instead of one walk at a time). Note: need to store only the current position of each walk in memory (4 bytes/walk)! DrunkardMob - RecSys '13
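A sketch of that reversal (illustrative only; GraphChi buckets walks by shard rather than using a Python dict): the only per-walk state is its current vertex id, and every walk sitting at a vertex advances together when that vertex is processed.

```python
import random

# DrunkardMob-style sketch: keep only each walk's current position in memory
# and advance all walks at a vertex together when that vertex is loaded.
def drunkardmob(adj, sources, hops, rng):
    position = list(sources)                  # one int per walk: ~4 bytes/walk
    for _ in range(hops):
        at = {}                               # group walks by current vertex
        for w, v in enumerate(position):
            at.setdefault(v, []).append(w)
        for v, walks in at.items():           # "vertex at a time"
            for w in walks:
                position[w] = rng.choice(adj[v]) if adj[v] else v
    return position

adj = {0: [1], 1: [2], 2: [0]}                # deterministic 3-cycle
final = drunkardmob(adj, sources=[0] * 1000, hops=3, rng=random.Random(7))
```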

Slide 32

Slide 32 text

Comparison to in-memory walks (plots: (a) number of walks vs. seconds, DrunkardMob vs. in-memory walks (Cassovary); (b) running time vs. graph size). Competitive with in-memory walks. However, if you can fit your graph in memory, there is no need for DrunkardMob. DrunkardMob - RecSys '13

Slide 33

Slide 33 text

GraphChi-DB

Slide 34

Slide 34 text

GraphChi (OSDI '12): batch computation on graphs with billions of edges on just a PC / laptop. GraphChi-DB adds database functionality: • Updates (online): insert edge/vertex, update edge/vertex value, delete edge/vertex (no high-level transactions) • Associated data: edge type (label), edge properties, vertex properties, vardata columns • Queries (graph-style): in/out neighbor queries, two-hop queries, point queries, shortest paths, graph sampling → incremental computation on evolving graphs

Slide 35

Slide 35 text

Highlights • Fast edge ingest by using a Log-Structured Merge tree (similar to RocksDB, LevelDB) • Fast in- and out-edge queries using sparse and compressed indices - Storage model optimized for large graphs. • Columnar data storage for fast analytical computation and schema changes. Read more in my thesis / arXiv.
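A toy sketch of the log-structured idea behind fast ingest (far simpler than GraphChi-DB's actual storage layer; the class and method names are invented): inserts append to a small in-memory buffer, which is sorted and merged into the on-disk run when full, so queries must consult both levels.

```python
import heapq

# Toy log-structured merge buffer for edge ingest (illustrative only).
class EdgeLSM:
    def __init__(self, buffer_cap=4):
        self.buffer_cap = buffer_cap
        self.buffer = []        # unsorted recent inserts (memtable)
        self.run = []           # sorted "on-disk" edges (src, dst)

    def insert(self, src, dst):
        self.buffer.append((src, dst))      # O(1) append: fast ingest
        if len(self.buffer) >= self.buffer_cap:
            self.flush()

    def flush(self):
        # sort the buffer and merge it into the sorted run sequentially
        self.run = list(heapq.merge(self.run, sorted(self.buffer)))
        self.buffer = []

    def out_edges(self, src):
        # a query must consult both the run and the unflushed buffer
        disk = [d for s, d in self.run if s == src]
        mem = [d for s, d in self.buffer if s == src]
        return sorted(disk + mem)

db = EdgeLSM()
for e in [(2, 3), (1, 2), (1, 5), (3, 1), (1, 9)]:
    db.insert(*e)
```

Real LSM stores keep multiple sorted runs and compact them in the background; the sketch collapses that to a single run to show the write/read trade-off.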

Slide 36

Slide 36 text

Comparison: Database Size. Database file size (twitter-2010 graph, 1.5B edges): MySQL (data + indices), Neo4j, GraphChi-DB, and baseline (bar chart). Baseline: 4 + 4 bytes / edge.

Slide 37

Slide 37 text

Comparison: Ingest. Time to ingest 1.5B edges: GraphChi-DB (online): 1 hour 45 minutes; Neo4j (batch): 45 hours; MySQL (batch): 3 hours 30 minutes (including index creation). If running PageRank simultaneously, GraphChi-DB takes 3 hours 45 minutes.

Slide 38

Slide 38 text

Comparison: Friends-of-Friends Query. Latency percentiles over 100K random queries (graph: 1.5B edges). 50th percentile (ms): GraphChi-DB 22.4, Neo4j 759.8, MySQL 5.9. 99th percentile (ms): GraphChi-DB 1264, GraphChi-DB + PageRank 1631, MySQL 4776. GraphChi-DB is the most scalable DB for large power-law graphs. See thesis for shortest-path comparison.

Slide 39

Slide 39 text

SUMMARY

Slide 40

Slide 40 text

Summary • A single PC can handle very large datasets: - Easier to work with, better economics. • GraphChi and the Parallel Sliding Windows algorithm allow processing graphs in big chunks from disk. • GraphChi's collaborative filtering toolkit covers matrix- and graph-oriented recommendation algorithms: - Scales to big problems; high efficiency by storing critical data in memory. • GraphChi-DB adds online database features: - A graph database that can also do analytical computation.

Slide 41

Slide 41 text

GraphChi in GitHub • http://github.com/graphchi-cpp - Includes the collaborative filtering toolkit • http://github.com/graphchi-java • http://github.com/graphchiDB-scala Thank you! [email protected] Twitter: @kyrpov. See also GraphLab Create by graphlab.com!

Slide 42

Slide 42 text

Random Access Problem (diagram: a file of edge values, with vertex A's and B's in- and out-edge lists; updating an edge's value sequentially for one endpoint means a random read and a random write for the other endpoint). Moral: You can either access in-edges or out-edges sequentially, but not both!

Slide 43

Slide 43 text

Efficient Scaling (diagram: task timelines for 6 vs. 12 machines). A distributed graph system finishes each task faster, but going from 6 to 12 machines yields (significantly) less than 2x throughput. Single-computer systems, each capable of a big task on its own, run independent tasks side by side: 2x machines give exactly 2x throughput.

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

GraphChi Program Execution. Conceptually: for T iterations: for v = 1 to V: updateFunction(v). Actual execution: for T iterations: for p = 1 to P: for v in interval(p): updateFunction(v). "Asynchronous": updates are immediately visible (vs. bulk-synchronous).