Reynold Xin
January 07, 2013
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley, with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault-tolerance; elasticity; resource sharing
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning, OLAP: strong temporal locality
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
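One way to read the question above, sketched in Python rather than with Spark's real API: keep partitions of a dataset in memory, and remember the deterministic function that derives each partition so that a lost one can be rebuilt instead of replicated. The class `CachedDataset` and the function `load_partition` are hypothetical names used only for this illustration.

```python
from typing import Callable, Dict, List


class CachedDataset:
    """Toy partitioned dataset (hypothetical, not Spark's API).

    Partitions are cached in memory after first use. A lost partition is
    rebuilt by re-running its deterministic compute function.
    """

    def __init__(self, num_partitions: int, compute: Callable[[int], List[int]]):
        self.num_partitions = num_partitions
        self.compute = compute              # how to derive partition i
        self.cache: Dict[int, List[int]] = {}
        self.computations = 0               # how many times compute() ran

    def partition(self, i: int) -> List[int]:
        if i not in self.cache:             # cache miss: (re)compute
            self.cache[i] = self.compute(i)
            self.computations += 1
        return self.cache[i]

    def lose(self, i: int) -> None:
        """Simulate a node failure that drops one cached partition."""
        self.cache.pop(i, None)

    def total(self) -> int:
        return sum(sum(self.partition(i)) for i in range(self.num_partitions))


def load_partition(i: int) -> List[int]:
    # Stand-in for reading one block from stable storage and transforming it.
    return [x * x for x in range(i * 4, (i + 1) * 4)]


data = CachedDataset(3, load_partition)
first = data.total()    # computes and caches all three partitions
second = data.total()   # served entirely from memory
data.lose(1)            # one partition disappears
third = data.total()    # only the lost partition is recomputed
```

Because the compute function is deterministic, all three totals agree, and the failure costs one extra partition computation rather than a full reload.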
Shark: How to do SQL query processing efficiently in “MapReduce” style? SQL on top of Spark. Hive compatible (UDF, Type, InputFormat, Metadata).
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store, dynamic query re-optimization, and a lot of engineering...
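As a rough illustration of the "in-memory columnar store" idea (not Shark's actual implementation), the sketch below stores each column of a small table in its own typed array, so a query that filters on one column and sums another touches only those two arrays. The schema (`user_id`, `page_id`, `revenue`) and both helper functions are assumptions made up for the example.

```python
from array import array
from typing import Dict, List, Tuple

# Hypothetical schema: (user_id, page_id, revenue)
Row = Tuple[int, int, float]


def to_columns(rows: List[Row]) -> Dict[str, array]:
    """Convert row tuples into one compact, typed array per column."""
    return {
        "user_id": array("q", (r[0] for r in rows)),   # 64-bit signed ints
        "page_id": array("q", (r[1] for r in rows)),
        "revenue": array("d", (r[2] for r in rows)),   # 64-bit floats
    }


def sum_revenue_for_page(cols: Dict[str, array], page_id: int) -> float:
    """Filter on page_id and sum revenue, reading only those two columns."""
    pages = cols["page_id"]
    revenue = cols["revenue"]
    return sum(revenue[i] for i in range(len(pages)) if pages[i] == page_id)


rows = [(1, 7, 2.5), (2, 7, 1.5), (3, 9, 4.0), (1, 9, 0.5)]
cols = to_columns(rows)
page7 = sum_revenue_for_page(cols, 7)
```

The `user_id` column is never read by this query, which is the main appeal of a columnar layout for analytic workloads.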
[Bar chart: runtime (seconds) on a 100-node EC2 cluster, Shark/Spark vs. Hive/Hadoop, for Query 1, Query 2, and Log Regress. Y-axis from 0 to 120; bar labels: 110, 94, 64, 0.96, 1, 0.7.]
iterative machine learning, SQL query processing, graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work! But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation: “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org
SQL query processing: shark.cs.berkeley.edu
graph computation: www.wait-another-year.com