Slide 1

Slide 1 text

SociaLite: A Python-Integrated Query Language for Big Data Analysis
Jiwon Seo, *Jongsoo Park, Jaeho Shin, Stephen Guo, Monica Lam
Stanford University, MobiSocial Research Group
*Intel Parallel Research Lab

Slide 2

Slide 2 text

Why another Big Data Platform?
Existing platforms are …
- Not fast enough (not network bandwidth)
- Too difficult (low-level primitives)
- Too many (sub) frameworks
  - Graph analysis
  - Data mining
  - Machine learning

Slide 3

Slide 3 text

Introducing SociaLite
SociaLite is a high-level query language
- Compiled to parallel code
- 1,000x Hadoop
- Hadoop compatible
- Python integration
- Designed for graph analysis
- Good for data mining & machine learning

Slide 4

Slide 4 text

Outline
- Language
  - Tables
  - Queries
  - Python integration
  - Approximation
- Analysis algorithms
  - Shortest paths, PageRank
  - K-Means, Logistic regression
- Evaluation
- Demo

Slide 5

Slide 5 text

Distributed In-Memory Tables
- Primary data structure in SociaLite
- Column-oriented storage
- Table(cx, …, (cy, … (cz …))).
  - Primitive types
  - Object types

Slide 6

Slide 6 text

Distributed In-Memory Tables
Foo(int x, int y).
Foo(int x, (int y)).
Bar[int x](int y).
Bar[int x:0..10](int y).
[Figure: example rows for each declaration; the Bar tables are shown split across Machine 1 and Machine 2.]

Slide 7

Slide 7 text

Distributed In-Memory Tables
Table options
- indexby: Foo(int x, int y) indexby x.
- sortby: Foo(int x, int y) sortby x.
- multiset: Foo(int x, int y) multiset.
Column options
- range: Foo(int x:0..100, int y).
- (distributed) partition: Foo[int x](int y).

Slide 8

Slide 8 text

Rules (Queries)
Foo(a, c) :- Bar(a, b), Qux(b, c).
Rule head: Foo(a, c)
Rule body: Bar(a, b), Qux(b, c)

Slide 9

Slide 9 text

Rules
Foo(a, c) :- Bar(a, b), Qux(b, c).
Bar: (1, 2), (1, 3), (8, 4), (8, 7), (9, 11)
Qux: (2, 9), (2, 10), (5, 4), (10, 7), (11, 9)
Foo: (empty)

Slide 10

Slide 10 text

Rules
Foo(a, c) :- Bar(a, b), Qux(b, c).
Bar: (1, 2), (1, 3), (8, 4), (8, 7), (9, 11)
Qux: (2, 9), (2, 10), (5, 4), (10, 7), (11, 9)
Foo: (1, 9), (1, 10)

Slide 11

Slide 11 text

Rules
Foo(a, c) :- Bar(a, b), Qux(b, c).
Bar: (1, 2), (1, 3), (8, 4), (8, 7), (9, 11)
Qux: (2, 9), (2, 10), (5, 4), (10, 7), (11, 9)
Foo: (1, 9), (1, 10), (9, 9)
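The rule on this slide behaves like a join of Bar and Qux on the shared variable b. The following is a minimal plain-Python sketch of that behavior using the slide's example rows. It models tables as lists of tuples and is not how SociaLite evaluates the rule; the function name `join_rule` is made up for illustration.

```python
# Plain-Python sketch of the rule  Foo(a, c) :- Bar(a, b), Qux(b, c).
# Tables are modeled as lists of tuples; join_rule is a hypothetical helper.
from collections import defaultdict

bar = [(1, 2), (1, 3), (8, 4), (8, 7), (9, 11)]
qux = [(2, 9), (2, 10), (5, 4), (10, 7), (11, 9)]


def join_rule(bar_rows, qux_rows):
    """Return Foo rows (a, c) for every Bar(a, b) and Qux(b, c) sharing b."""
    # Index Qux by its first column so each Bar row is matched by lookup.
    qux_by_first = defaultdict(list)
    for b, c in qux_rows:
        qux_by_first[b].append(c)

    # Assumption: Foo is a plain set of tuples (no duplicates).
    foo = set()
    for a, b in bar_rows:
        for c in qux_by_first.get(b, []):
            foo.add((a, c))
    return sorted(foo)


foo = join_rule(bar, qux)
print(foo)  # [(1, 9), (1, 10), (9, 9)]
```

Bar(1, 2) matches Qux(2, 9) and Qux(2, 10), and Bar(9, 11) matches Qux(11, 9), which gives the three Foo rows shown on the slide.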

Slide 12

Slide 12 text

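Distributed Execution
Foo[int a](int b).
Bar[int a](int b).
Qux[int a](int b).
Foo(a, c) :- Bar(a, b), Qux(b, c).
[Figure: Machine 1 and Machine 2 each hold shards of Bar, Qux, and Foo; the example rows Bar(1, 2) and Qux(2, 9) produce Foo(1, 9), with steps labeled "join" and "transfer".]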
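One way to picture the join and transfer steps is a small simulation in plain Python. Everything below is an assumption made for illustration: two machines numbered 0 and 1 (the slide calls them Machine 1 and Machine 2), a row placed on machine `first column % 2`, and the helper names `machine_of` and `distributed_join`. It is not SociaLite's runtime.

```python
# Sketch of evaluating Foo(a, c) :- Bar(a, b), Qux(b, c) over partitioned tables.
# Assumptions: two machines, and a row lives on machine (first column % 2).
NUM_MACHINES = 2


def machine_of(key):
    """Hypothetical placement function for the partitioned first column."""
    return key % NUM_MACHINES


def distributed_join(bar_rows, qux_rows):
    # Each machine stores the rows whose first column maps to it.
    bar_shards = [[] for _ in range(NUM_MACHINES)]
    qux_shards = [[] for _ in range(NUM_MACHINES)]
    foo_shards = [[] for _ in range(NUM_MACHINES)]
    for a, b in bar_rows:
        bar_shards[machine_of(a)].append((a, b))
    for b, c in qux_rows:
        qux_shards[machine_of(b)].append((b, c))

    transfers = 0
    for m in range(NUM_MACHINES):
        for a, b in bar_shards[m]:
            target = machine_of(b)  # machine holding the matching Qux rows
            if target != m:
                transfers += 1  # ship Bar(a, b) to the Qux shard
            for b2, c in qux_shards[target]:
                if b2 == b:  # local join on machine `target`
                    home = machine_of(a)  # Foo is partitioned by a
                    if home != target:
                        transfers += 1  # ship Foo(a, c) to its home shard
                    foo_shards[home].append((a, c))
    return foo_shards, transfers


foo_shards, transfers = distributed_join([(1, 2)], [(2, 9)])
print(foo_shards, transfers)  # [[], [(1, 9)]] 2
```

With the slide's rows, Bar(1, 2) is sent to the machine that holds Qux(2, 9), joined there, and the result Foo(1, 9) is sent back to the machine that owns a = 1.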

Slide 13

Slide 13 text

Distributed Execution
Foo[int a](int b).
Bar[int a](int b).
Qux[int a](int b).
Foo(a, c) :- Qux(b, c), Bar(a, b).
[Figure: the same two-machine layout with rows Bar(1, 2), Qux(2, 9), and Foo(1, 9), for the rule body written in the opposite order.]

Slide 14

Slide 14 text

Aggregation
Foo(a, $min(c)) :- Bar(a, b), Qux(b, c).
The $min aggregate function is applied to tuples in Foo having the same first column value.
- Built-in aggregate functions
  - min, max, sum, avg, argmin
- User-defined functions
  - in Java or Python
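The grouping behavior of $min can be sketched in plain Python as a group-by on the first column that keeps the smallest second value. The input rows are the join results from the earlier Rules slides; the function name `min_aggregate` is made up for illustration and this is not SociaLite's implementation.

```python
# Sketch of  Foo(a, $min(c)) :- ...  as a group-by-minimum over (a, c) rows.
def min_aggregate(rows):
    """Keep, for each first-column value a, the minimum c seen so far."""
    best = {}
    for a, c in rows:
        if a not in best or c < best[a]:
            best[a] = c
    return sorted(best.items())


joined = [(1, 9), (1, 10), (9, 9)]  # (a, c) pairs before aggregation
aggregated = min_aggregate(joined)
print(aggregated)  # [(1, 9), (9, 9)]
```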

Slide 15

Slide 15 text

Recursive Rules
- Head table also appears in rule body
  Foo(a,c) :- Foo(a,b), Bar(b,c).
- Semantics: the rule is executed repeatedly until there are no changes to Foo

Slide 16

Slide 16 text

Recursive Rules
`Edge(int s, (int t, double len)) indexby s.
Path(int n, double dist) indexby n.`
`Path(t, $min(d)) :- t=$SRC, d=0;
                  :- Path(n, d1), Edge(n, t, d2), d=d1+d2.`
Shortest Path algorithm in recursion + aggregation
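The combination of recursion and $min can be read as a fixpoint computation: start with the source at distance 0, then keep relaxing edges until Path stops changing. The sketch below shows that reading in plain Python, assuming non-negative edge lengths. The function name and the sample graph are made up for illustration, and the loop says nothing about how SociaLite schedules the evaluation.

```python
# Fixpoint sketch of the recursive shortest-path rule (assumes lengths >= 0).
import math


def shortest_paths(edges, src):
    """edges: list of (s, t, length). Returns {node: minimum distance from src}."""
    path = {src: 0.0}  # Path(t, $min(d)) :- t=$SRC, d=0
    changed = True
    while changed:  # repeat until no change to Path
        changed = False
        for n, t, d2 in edges:  # Path(n, d1), Edge(n, t, d2), d=d1+d2
            if n in path:
                d = path[n] + d2
                if d < path.get(t, math.inf):  # $min keeps the smaller distance
                    path[t] = d
                    changed = True
    return path


# Hypothetical sample graph, not taken from the slides.
edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 5.0)]
dist = shortest_paths(edges, 0)
print(dist)  # {0: 0.0, 1: 3.0, 2: 1.0, 3: 8.0}
```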

Slide 17

Slide 17 text

Python (Jython) Integration
- SociaLite queries in Python code
  - `Queries are quoted in backticks`
    → Preprocessing with Python import-hook
- Python ↔ SociaLite
  - Python functions, variables are accessible in SociaLite queries
  - SociaLite tables are readable from Python

Slide 18

Slide 18 text

Python (Jython) Integration
print "This is Python code!"
# now we use SociaLite queries below
`Foo[int i](String s).
Foo(i, s) :- i=42, s="the answer".`
v = "Python variable"
`Foo(i, s) :- i=43, s=$v.`
@returns(str)
def func(): return "Python func"
`Foo(i, s) :- i=44, s=$func().`
for i, s in `Foo(i, s)`:
    print i, s

Slide 19

Slide 19 text

CPython Integration
- JyNI – Jython Native Interface
  - Stefan Richthofer
  - http://jyni.org
- To support CPython extensions in Jython
  - NumPy, SciPy, Pandas, etc.
- Tkinter works on Jython

Slide 20

Slide 20 text

Approximation
Approximate Computation
- Bloom Filter, FM Sketch
Bloom Filter
- Bitmap-based set
- Quickly check set membership
  → false positives, but no false negatives
- In SociaLite, useful to store large intermediate results approximately
[Figure: an element mapped to bitmap positions by hash1, hash2, and hash3.]
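A Bloom filter of the kind described here can be sketched in a few lines of plain Python. This is a generic illustration, not SociaLite's implementation: the class name, the bit-array size, the use of three SHA-256-derived hash positions, and the sample values are all assumptions.

```python
# Generic Bloom filter sketch: a bitmap plus several hash positions per item.
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for friend_id in [3, 17, 42]:
    bf.add(friend_id)
print(all(x in bf for x in [3, 17, 42]))  # True: no false negatives
```

Memory stays fixed at `num_bits` regardless of how many items are added, which is the trade-off the slide points to for large intermediate results.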

Slide 21

Slide 21 text

Approximation w/ Bloom Filter
Foaf(i, ff) :- Friend(i, f), Friend(f, ff).
LocalCount(i, $inc(1)) :- Foaf(i, ff), Attr(ff, "Some Attr").

Slide 22

Slide 22 text

Approximation w/ Bloom Filter
Foaf(i, ff) :- Friend(i, f), Friend(f, ff).
LocalCount(i, $inc(1)) :- Foaf(i, ff), Attr(ff, "Some Attr").
(2nd column of Foaf table is represented with a Bloom filter)

Slide 23

Slide 23 text

Approximation w/ Bloom Filter
Foaf(i, ff) :- Friend(i, f), Friend(f, ff).
LocalCount(i, $inc(1)) :- Foaf(i, ff), Attr(ff, "Some Attr").
(2nd column of Foaf table is represented with a Bloom filter)

                       | Exact  | Approximation | Comparison
Exec time (min)        | 28.9   | 19.4          | 32.8% faster
Memory usage (GB)      | 26.0   | 3.0           | 11.5% usage
Accuracy (<10% error)  | 100.0% | 92.5%         |
* LiveJournal (4.8M nodes, 68M edges)

Slide 24

Slide 24 text

Analysis Algorithms
- Graph algorithms
  - Shortest Paths
  - PageRank
- Data mining/machine learning algorithms
  - K-Means Clustering
  - Logistic regression

Slide 25

Slide 25 text

Graph Algorithm
- Shortest Path
`Edge(int s, (int t, double len)) indexby s.
Path(int n, double dist) indexby n.`
`Path(t, $min(d)) :- t=$SRC, d=0;
                  :- Path(n, d1), Edge(n, t, d2), d=d1+d2.`

Slide 26

Slide 26 text

Graph Algorithm
- PageRank
`Rank(n, 0, r) :- Node(n), r=1.0/$N.`
for t in range(30):
    `Rank(pi, $t+1, $sum(r)) :- Node(pi), r=0.15*1.0/$N;
                             :- Rank(pj, $t, r1), Edge(pj, pi), EdgeCnt(pj, cnt), r=0.85*r1/cnt.`
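For readers unfamiliar with the rule syntax, the same iteration can be written out in plain Python. The sketch mirrors the slide's constants (0.15 base share, 0.85 damping, 30 iterations) and, like the slide's rule, does nothing special for nodes without outgoing edges. The function name and the three-node cycle are made up for illustration.

```python
# Plain-Python sketch of the PageRank iteration expressed by the Rank rule.
def pagerank(nodes, edges, iterations=30, damping=0.85):
    n = len(nodes)
    out_degree = {v: 0 for v in nodes}  # plays the role of EdgeCnt(pj, cnt)
    for src, _dst in edges:
        out_degree[src] += 1

    rank = {v: 1.0 / n for v in nodes}  # Rank(n, 0, r) :- Node(n), r=1.0/$N
    for _ in range(iterations):
        # Base share for every node: r = 0.15 * 1.0 / $N
        nxt = {v: (1.0 - damping) / n for v in nodes}
        # Contribution along each edge: r = 0.85 * r1 / cnt, summed by $sum
        for src, dst in edges:
            nxt[dst] += damping * rank[src] / out_degree[src]
        rank = nxt
    return rank


# Hypothetical sample graph: a three-node cycle, so all ranks stay equal.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "a")]
ranks = pagerank(nodes, edges)
```

On a symmetric cycle every node keeps a rank of about 1/3, which makes a convenient sanity check.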

Slide 28

Slide 28 text

Data Mining Algorithm
- K-Means Clustering
for i in range(50):
    `Center(cid, $i+1, $avg(p)) :- Data(id, p), Cluster(id, $i, c), cid=c.value.`
    `Cluster(id, $i+1, $argmin(idx, d)) :- Data(id, p), Center(idx, $i+1, a), d=$getDiff(p, a).`
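The two rules alternate between averaging the points of each cluster and reassigning each point to its nearest center. A plain-Python sketch of that loop follows. It assumes 2-D points and squared Euclidean distance in place of $getDiff, whose definition is not given on the slide; the function name and sample data are made up.

```python
# Plain-Python sketch of the two K-Means rules (2-D points assumed).
def kmeans(points, assignment, iterations=50):
    """points: list of (x, y); assignment: initial cluster id per point."""
    centers = {}
    for _ in range(iterations):
        # Center(cid, i+1, $avg(p)): average the points of each cluster.
        sums = {}
        for (x, y), cid in zip(points, assignment):
            sx, sy, cnt = sums.get(cid, (0.0, 0.0, 0))
            sums[cid] = (sx + x, sy + y, cnt + 1)
        centers = {cid: (sx / cnt, sy / cnt) for cid, (sx, sy, cnt) in sums.items()}

        # Cluster(id, i+1, $argmin(idx, d)): pick the nearest center.
        # Squared Euclidean distance stands in for $getDiff (an assumption).
        assignment = [
            min(
                centers,
                key=lambda cid: (px - centers[cid][0]) ** 2 + (py - centers[cid][1]) ** 2,
            )
            for px, py in points
        ]
    return centers, assignment


# Hypothetical data: two well-separated pairs, deliberately mis-assigned at first.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, assignment = kmeans(points, [0, 1, 0, 1], iterations=5)
print(assignment)  # [0, 0, 1, 1]
```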

Slide 29

Slide 29 text

Data Mining Algorithm
- Logistic Regression
for i in range(0, 100):
    `Gradient($i, $sum(w)) :- Data(id, p), Weight($i, w1), dot=$dot(w1, p), y=$sigmoid(dot), w=$computeWeights(p, y).`
    `Weight($i+1, w) :- Weight($i, w1), Gradient($i, g), w=$vecSum(w1, g).`
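The structure of these two rules (sum a per-example term, then add it to the weights) can be sketched in plain Python as batch gradient ascent. The slide does not define $computeWeights, so the sketch substitutes a common choice, `learning_rate * (label - y) * x`, and passes labels as a separate list. Both are assumptions, as are the function names and the toy data.

```python
# Plain-Python sketch of the Gradient/Weight rules as batch gradient ascent.
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(data, labels, iterations=100, learning_rate=0.5):
    weights = [0.0] * len(data[0])
    for _ in range(iterations):
        gradient = [0.0] * len(weights)  # Gradient($i, $sum(w))
        for p, label in zip(data, labels):
            y = sigmoid(sum(w * x for w, x in zip(weights, p)))  # $dot, $sigmoid
            # Assumed stand-in for $computeWeights(p, y).
            for j, x in enumerate(p):
                gradient[j] += learning_rate * (label - y) * x
        # Weight($i+1, w) :- ..., w=$vecSum(w1, g)
        weights = [w + g for w, g in zip(weights, gradient)]
    return weights


def predict(weights, p):
    return sigmoid(sum(w * x for w, x in zip(weights, p))) > 0.5


# Hypothetical data: first feature is a constant bias term.
data = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]
weights = train_logistic(data, labels)
predictions = [predict(weights, p) for p in data]
print(predictions)  # [False, False, True, True]
```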

Slide 30

Slide 30 text

Evaluation
Benchmark algorithms (graph algorithms)
- Shortest-Paths
- PageRank
- Mutual Neighbors
- Connected Components
- Finding Triangles
- Clustering Coefficients
→ Evaluation on a multi-core & distributed cluster

Slide 31

Slide 31 text

Input Graph for Multi-Core
Source     | Size                   | Machine
Friendster | 120M nodes, 2.5B edges | Intel Xeon E5-2670, 16 cores (8+8), 2.60GHz, 20MB last-level cache, 256GB memory

Slide 32

Slide 32 text

Parallel Performance (Multi-Core)
[Figure: six panels (PageRank, Mutual Neighbors, Connected Components, Triangle, Clustering Coefficients, Shortest Paths), each plotting speedup over 1 core against number of cores (1 to 16), with measured speedup and ideal speedup curves.]

Slide 33

Slide 33 text

Input Graph for Distributed Evaluation
Source           | Size                                           | Machine
Synthetic Graph* | up to 268M nodes, 4.3B edges (weak scaling)    | 64 Amazon EC2 instances, Intel Xeon X5570, 8 cores, 23GB memory
*RMAT algorithm, Graph 500 Generator

Slide 34

Slide 34 text

Giraph (Pregel) vs SociaLite
[Figure: six panels (Shortest paths, PageRank, Mutual neighbors, Clustering Coefficients, Triangle, Connected Components), each plotting execution time (seconds or minutes, depending on the panel) against # of machines (2 to 64).]

Slide 35

Slide 35 text

Programmability Comparison
- Giraph vs SociaLite (lines of code)

                        | Giraph | SociaLite
Shortest Paths          | 232    | 4
PageRank                | 146    | 13
Mutual Neighbors        | 169    | 6
Connected Components    | 122    | 9
Triangles               | 181    | 6
Clustering Coefficients | 218    | 12
Total                   | 1,068  | 50
→ SociaLite is 20x simpler!

Slide 36

Slide 36 text

Comparing More Graph Frameworks
- Collaboration with Intel Parallel Research Lab*
- Compared frameworks
  - SociaLite
  - Giraph
  - GraphLab
  - Combinatorial BLAS
- Native Implementation in C, assembly – optimal
* Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets, Satish et al., SIGMOD '14

Slide 37

Slide 37 text

Comparing More Graph Frameworks
- Benchmark Algorithms
  - BFS (Breadth First Search)
  - PageRank
  - Collaborative Filtering
  - Triangle
- Evaluation on Intel cluster – Intel Xeon, 24 cores 2.7GHz, 64GB memory, InfiniBand network
- Input Graph – up to 512M nodes, 16G edges (weak scaling)

Slide 38

Slide 38 text

Programmability
- BFS (Breadth First Search)

                   | Lines of Code | Development Time
SociaLite          | 4             | 1~2 min
Giraph             | 200           | 1~2 hours
GraphLab           | 180           | 1~2 hours
Combinatorial BLAS | 450           | a few hours
Native             | > 1000        | > a few months

Slide 39

Slide 39 text

Distributed Execution – Comparison
[Figure: four log-scale panels (Breadth First Search, PageRank, Triangle, Collaborative Filtering), each plotting execution time (sec.) or time per iteration (sec.) against # of machines (1 to 64).]

Slide 41

Slide 41 text

Work In-Progress
- Collaboration with LinkedIn
  - Real-time pattern matching queries
  - Off-line analysis
- Discussing collaboration with other companies
  - Kakao
  - etc.

Slide 42

Slide 42 text

Summary
Big Data Analysis
- 20x easier than Giraph
- 10x faster than Giraph
- As fast as, or faster than
  - GraphLab, CombBLAS
- How?
  - High-level query interface
  - Compiler optimizations
  - Python integration

Slide 43

Slide 43 text

Demo
DBLP (CS bibliography)
- Co-authorship graph
  - vertices: authors (1 million)
  - edges: co-authorship (10 million)
- Guido van Rossum's academic network
  - How Guido is connected to
    Armin Rigo (PyPy)
    Jim Hugunin (Jython, IronPython)
  - Run shortest-paths from Guido & visualize

Slide 44

Slide 44 text

Questions?
- Visit http://socialite.stanford.edu for
  - Trying out
  - Participation (Apache v2)