Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
1
140
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
Tweet
Share
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
2k
Interface Design for Spark Community
rxin
12
1.4k
Spark Committer Night meetup @ NYC
rxin
1
130
Apache Spark: Unified Platform for Big Data
rxin
1
240
Advanced Spark @ Spark Summit 2014
rxin
4
340
Apache Spark: Easier and Faster Big Data
rxin
2
290
GraphX at Spark User Meetup
rxin
0
150
Shark SIGMOD research deck
rxin
2
540
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
710
Other Decks in Research
See All in Research
svc-hook: hooking system calls on ARM64 by binary rewriting
retrage
1
110
R&Dチームを起ち上げる
shibuiwilliam
1
170
[Devfest Incheon 2025] 모두를 위한 친절한 언어모델(LLM) 학습 가이드
beomi
2
1.4k
世界の人気アプリ100個を分析して見えたペイウォール設計の心得
akihiro_kokubo
PRO
66
37k
2026.01ウェビナー資料
elith
0
220
姫路市 -都市OSの「再実装」-
hopin
0
1.6k
HU Berlin: Industrial-Strength Natural Language Processing with spaCy and Prodigy
inesmontani
PRO
0
230
生成AI による論文執筆サポート・ワークショップ ─ サーベイ/リサーチクエスチョン編 / Workshop on AI-Assisted Paper Writing Support: Survey/Research Question Edition
ks91
PRO
0
140
視覚から身体性を持つAIへ: 巧緻な動作の3次元理解
tkhkaeio
0
190
生成的情報検索時代におけるAI利用と認知バイアス
trycycle
PRO
0
300
HoliTracer:Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery
satai
3
630
2026年1月の生成AI領域の重要リリース&トピック解説
kajikent
0
430
Featured
See All Featured
How to build an LLM SEO readiness audit: a practical framework
nmsamuel
1
650
Collaborative Software Design: How to facilitate domain modelling decisions
baasie
0
140
How to optimise 3,500 product descriptions for ecommerce in one day using ChatGPT
katarinadahlin
PRO
0
3.4k
How To Speak Unicorn (iThemes Webinar)
marktimemedia
1
380
Writing Fast Ruby
sferik
630
62k
Designing Powerful Visuals for Engaging Learning
tmiket
0
240
How People are Using Generative and Agentic AI to Supercharge Their Products, Projects, Services and Value Streams Today
helenjbeal
1
130
[Rails World 2023 - Day 1 Closing Keynote] - The Magic of Rails
eileencodes
38
2.7k
sira's awesome portfolio website redesign presentation
elsirapls
0
150
Navigating the Design Leadership Dip - Product Design Week Design Leaders+ Conference 2024
apolaine
0
190
SEO for Brand Visibility & Recognition
aleyda
0
4.2k
Future Trends and Review - Lecture 12 - Web Technologies (1019888BNR)
signer
PRO
0
3.2k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com