Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
1
140
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
Tweet
Share
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
2k
Interface Design for Spark Community
rxin
12
1.4k
Spark Committer Night meetup @ NYC
rxin
1
120
Apache Spark: Unified Platform for Big Data
rxin
1
230
Advanced Spark @ Spark Summit 2014
rxin
4
330
Apache Spark: Easier and Faster Big Data
rxin
2
290
GraphX at Spark User Meetup
rxin
0
150
Shark SIGMOD research deck
rxin
2
520
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
700
Other Decks in Research
See All in Research
Stealing LUKS Keys via TPM and UUID Spoofing in 10 Minutes - BSides 2025
anykeyshik
0
160
学習型データ構造:機械学習を内包する新しいデータ構造の設計と解析
matsui_528
4
1.4k
IMC の細かすぎる話 2025
smly
2
740
EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observation and Wikipedia
satai
3
350
論文紹介:Not All Tokens Are What You Need for Pretraining
kosuken
1
210
超高速データサイエンス
matsui_528
1
210
世界の人気アプリ100個を分析して見えたペイウォール設計の心得
akihiro_kokubo
PRO
63
34k
その推薦システムの評価指標、ユーザーの感覚とズレてるかも
kuri8ive
1
240
大規模言語モデルにおけるData-Centric AIと合成データの活用 / Data-Centric AI and Synthetic Data in Large Language Models
tsurubee
1
300
CVPR2025論文紹介:Unboxed
murakawatakuya
0
210
snlp2025_prevent_llm_spikes
takase
0
400
一人称視点映像解析の最先端(MIRU2025 チュートリアル)
takumayagi
6
4.2k
Featured
See All Featured
GraphQLとの向き合い方2022年版
quramy
49
14k
How Fast Is Fast Enough? [PerfNow 2025]
tammyeverts
3
340
Reflections from 52 weeks, 52 projects
jeffersonlam
355
21k
Put a Button on it: Removing Barriers to Going Fast.
kastner
60
4.1k
Fashionably flexible responsive web design (full day workshop)
malarkey
407
66k
Mobile First: as difficult as doing things right
swwweet
225
10k
GitHub's CSS Performance
jonrohan
1032
470k
The Language of Interfaces
destraynor
162
25k
Build your cross-platform service in a week with App Engine
jlugia
234
18k
Save Time (by Creating Custom Rails Generators)
garrettdimon
PRO
32
1.8k
Statistics for Hackers
jakevdp
799
230k
A designer walks into a library…
pauljervisheath
210
24k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com