Reynold Xin
January 07, 2013
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley, with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault-tolerance; elasticity; resource sharing
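A minimal sketch of why deterministic, idempotent tasks make fault tolerance simple: a failed task can just be run again and will produce the same answer. The `FlakyWorker` class, `run_with_retries` helper, and the word-count task are hypothetical names made up for illustration; they are not part of Hadoop or any real MapReduce API.

```python
from typing import Callable, List


def word_count_task(partition: List[str]) -> int:
    """Deterministic, idempotent task: the same input always yields the same output."""
    return sum(len(line.split()) for line in partition)


class FlakyWorker:
    """Hypothetical worker that fails its first `failures` attempts."""

    def __init__(self, failures: int) -> None:
        self.remaining_failures = failures
        self.attempts = 0

    def run(self, task: Callable[[List[str]], int], partition: List[str]) -> int:
        self.attempts += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated worker failure")
        return task(partition)


def run_with_retries(worker: FlakyWorker,
                     task: Callable[[List[str]], int],
                     partition: List[str],
                     max_attempts: int = 3) -> int:
    """Re-execute the task until it succeeds; safe only because the task is idempotent."""
    for _ in range(max_attempts):
        try:
            return worker.run(task, partition)
        except RuntimeError:
            continue
    raise RuntimeError("task failed after retries")


partition = ["the bar is low", "beat hadoop by 100x"]
worker = FlakyWorker(failures=2)
result = run_with_retries(worker, word_count_task, partition)
print(result, worker.attempts)
```

The retry loop needs no checkpointing or rollback because rerunning the task cannot change the result.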
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning, OLAP: strong temporal locality
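A small sketch of what temporal locality means for an iterative algorithm: every iteration scans the same dataset, so keeping it in memory replaces many reads with one. The `CountingSource` class and the toy linear-regression fit are assumptions chosen for brevity; they stand in for slow storage and for a real iterative learner, and are not Spark code.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # (feature x, label y)


class CountingSource:
    """Hypothetical stand-in for slow storage; counts how often it is read."""

    def __init__(self, points: List[Point]) -> None:
        self._points = list(points)
        self.reads = 0

    def load(self) -> List[Point]:
        self.reads += 1
        return list(self._points)


def fit_slope(get_points: Callable[[], List[Point]],
              iterations: int, lr: float = 0.1) -> float:
    """Gradient descent for y ~ w * x; each iteration scans the full dataset."""
    w = 0.0
    for _ in range(iterations):
        pts = get_points()
        grad = sum((w * x - y) * x for x, y in pts) / len(pts)
        w -= lr * grad
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Reload from storage on every iteration.
disk = CountingSource(data)
w_reload = fit_slope(disk.load, iterations=50)

# Load once, then reuse the in-memory copy on every iteration.
cached_source = CountingSource(data)
cached = cached_source.load()
w_cached = fit_slope(lambda: cached, iterations=50)

print(disk.reads, cached_source.reads)
```

Both runs compute the same model; only the number of reads from storage differs.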
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
Shark: How to do SQL query processing efficiently in “MapReduce” style? SQL on top of Spark; Hive compatible (UDF, Type, InputFormat, Metadata)
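As a rough illustration of SQL processing in “MapReduce” style, the sketch below evaluates a query of the form `SELECT page, SUM(bytes) FROM logs GROUP BY page` as a map phase, a shuffle, and a reduce phase. The `logs` rows, column names, and function names are all hypothetical, and this is not how Shark itself is implemented; it only shows the general shape of the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]

# Hypothetical input table.
logs: List[Row] = [
    {"page": "/home", "bytes": 120},
    {"page": "/about", "bytes": 50},
    {"page": "/home", "bytes": 80},
]


def map_phase(row: Row) -> Iterator[Tuple[str, int]]:
    """Emit (group key, value) for each input row."""
    yield (str(row["page"]), int(row["bytes"]))


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values that share a key, as the shuffle step would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values: List[int]) -> int:
    """Aggregate one group; here the aggregate is SUM."""
    return sum(values)


def run_query(rows: List[Row]) -> Dict[str, int]:
    pairs = (pair for row in rows for pair in map_phase(row))
    return {key: reduce_phase(vals) for key, vals in shuffle(pairs).items()}


result = run_query(logs)
print(result)
```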
“You need to beat Hadoop by at least 100X to publish a paper in 2013.”
i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store, dynamic query re-optimization, and a lot of engineering...
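To make the columnar-store idea concrete, here is a minimal sketch that turns row tuples into one compact array per column, so an aggregate over a single column touches only that column's values. The table, column names, and `to_columns` helper are hypothetical, and the layout is a simplification that does not reflect Shark's actual storage format.

```python
from array import array
from typing import Dict, List, Tuple

# Hypothetical rows: (id, country, revenue)
rows: List[Tuple[int, str, float]] = [
    (1, "us", 9.5),
    (2, "de", 3.0),
    (3, "us", 1.5),
]


def to_columns(rows: List[Tuple[int, str, float]]) -> Dict[str, object]:
    """Store each column contiguously; numeric columns use packed primitive arrays."""
    return {
        "id": array("q", (r[0] for r in rows)),       # 64-bit signed integers
        "country": [r[1] for r in rows],              # strings stay in a list
        "revenue": array("d", (r[2] for r in rows)),  # 64-bit floats
    }


columns = to_columns(rows)

# Row layout: every tuple is visited even though only one field is needed.
total_from_rows = sum(r[2] for r in rows)

# Column layout: the scan reads just the "revenue" array.
total_from_columns = sum(columns["revenue"])

print(total_from_rows, total_from_columns)
```

Packed numeric arrays also avoid per-value object overhead, which is one common motivation for columnar layouts in memory.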
[Figure: bar chart of runtime (seconds) on a 100-node EC2 cluster, comparing Shark/Spark with Hive/Hadoop on Query 1, Query 2, and Log Regress. Bar labels in extraction order: 110, 94, 64, 0.96, 1, 0.7.]
iterative machine learning, SQL query processing, graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work! But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation: “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org
SQL query processing: shark.cs.berkeley.edu
graph computation: www.wait-another-year.com