Reynold Xin
January 07, 2013
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley
with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault-tolerance; elasticity; resource sharing
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning; OLAP; strong temporal locality
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
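One well-known answer to the question on the Spark slide is lineage: rather than replicating in-memory data, each dataset remembers the deterministic transformation that produced it, so a lost result can be recomputed from its parent. The sketch below is a toy, single-process illustration of that idea in Python and is not Spark's API; `LineageDataset`, `lose_cache`, and the `computations` counter are hypothetical names made up for this example.

```python
from typing import Callable, List, Optional


class LineageDataset:
    """Toy dataset that caches results in memory and rebuilds them from lineage.

    Hypothetical illustration only; this is not how Spark is implemented.
    """

    def __init__(
        self,
        source: Optional[List[int]] = None,
        parent: Optional["LineageDataset"] = None,
        fn: Optional[Callable[[int], int]] = None,
    ) -> None:
        self.source = source    # stable input, standing in for durable storage
        self.parent = parent    # lineage: the dataset this one was derived from
        self.fn = fn            # deterministic per-element transformation
        self.cache: Optional[List[int]] = None
        self.computations = 0   # how many times this dataset was (re)computed

    def map(self, fn: Callable[[int], int]) -> "LineageDataset":
        # Record the transformation lazily; nothing is computed yet.
        return LineageDataset(parent=self, fn=fn)

    def collect(self) -> List[int]:
        # Serve from memory when the cached copy is still available.
        if self.cache is not None:
            return self.cache
        self.computations += 1
        if self.parent is None:
            data = list(self.source or [])
        else:
            data = [self.fn(x) for x in self.parent.collect()]
        self.cache = data
        return data

    def lose_cache(self) -> None:
        # Simulate a failure that wipes the in-memory copy.
        self.cache = None


base = LineageDataset(source=[1, 2, 3])
squared = base.map(lambda x: x * x)

first = squared.collect()    # computed once and cached in memory
squared.lose_cache()         # simulated failure
second = squared.collect()   # rebuilt from lineage, not from a replica
```

Because the transformation is deterministic, the recomputed result matches the lost one, and only the dataset that lost its cache is rebuilt; its parent is still served from memory.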
Shark: How to do SQL query processing efficiently in “MapReduce” style?
SQL on top of Spark; Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to publish a paper in 2013.”
i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store; dynamic query re-optimization; and a lot of engineering...
[Figure: bar chart of runtime (seconds) on a 100-node EC2 cluster for Query 1, Query 2, and Log Regress, comparing Shark/Spark with Hive/Hadoop. Bar labels as extracted: 110, 94, 64, 0.96, 1, 0.7.]
iterative machine learning; SQL query processing; graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work!
But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org
SQL query processing: shark.cs.berkeley.edu
graph computation: www.wait-another-year.com