Beating State-of-the-art By -10000% @ CIDR Gong Show
Reynold Xin
January 07, 2013
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley
with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault tolerance; elasticity; resource sharing
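Why deterministic, idempotent tasks give fault tolerance can be shown with a toy example. This is a minimal sketch and not Hadoop's or Spark's API: `map_task`, `run_with_retry`, `word_count`, and the simulated failure schedule are all hypothetical names invented for illustration. It assumes a failed task can simply be re-run, because re-running it produces the same output.

```python
from collections import defaultdict


def map_task(line):
    """Deterministic map task: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def run_with_retry(task, arg, fail_on_attempts=(), max_attempts=3):
    """Run a task, re-running it after each simulated failure.

    Re-execution is safe only because the task is deterministic and has no
    side effects, so every attempt yields the same output.
    """
    for attempt in range(max_attempts):
        if attempt in fail_on_attempts:  # simulated worker crash
            continue
        return task(arg)
    raise RuntimeError("task failed on every attempt")


def word_count(lines, failures=None):
    """Toy word count: map each line (with retries), then reduce by key."""
    failures = failures or {}
    counts = defaultdict(int)
    for i, line in enumerate(lines):
        for word, n in run_with_retry(map_task, line, failures.get(i, ())):
            counts[word] += n
    return dict(counts)


lines = ["spark shark", "shark graphx", "spark spark"]
clean = word_count(lines)
# The task for line 1 crashes on its first two attempts, then succeeds.
flaky = word_count(lines, failures={1: (0, 1)})
```

Both runs produce the same counts, which is the property that lets a scheduler restart tasks on another machine without coordinating with the rest of the job.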
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning, OLAP: strong temporal locality
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
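The slide only poses the question. Spark's published answer is to keep datasets in memory and remember how each one was derived (its lineage), so that a lost copy can be recomputed rather than replicated. The following is a toy, single-process sketch of that idea and not Spark's API: `CachedDataset`, `collect`, and `lose_cache` are hypothetical names, and the sketch assumes deterministic transformations.

```python
class CachedDataset:
    """Toy dataset that remembers how it was derived (its lineage)."""

    def __init__(self, parent, transform):
        self.parent = parent        # a plain list or another CachedDataset
        self.transform = transform  # deterministic function applied per element
        self._cache = None          # in-memory copy; may be lost
        self.computations = 0       # how many times the data was (re)built

    def collect(self):
        """Return the data, recomputing from the parent only if not cached."""
        if self._cache is None:
            if isinstance(self.parent, CachedDataset):
                source = self.parent.collect()
            else:
                source = self.parent
            self._cache = [self.transform(x) for x in source]
            self.computations += 1
        return self._cache

    def lose_cache(self):
        """Simulate a node failure that wipes the in-memory copy."""
        self._cache = None


points = CachedDataset([1, 2, 3, 4], lambda x: x * 2)
first = sum(points.collect())   # computed, then cached
second = sum(points.collect())  # served from memory
points.lose_cache()
third = sum(points.collect())   # rebuilt from lineage after the "failure"
```

Repeated passes over `points` hit memory, which is what iterative workloads benefit from; the failure costs one recomputation instead of a standing replica.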
Shark: How to do SQL query processing efficiently in “MapReduce” style?
SQL on top of Spark
Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to publish a paper in 2013.”
i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store, dynamic query re-optimization, and a lot of engineering...
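What "in-memory columnar store" means can be sketched with a toy layout comparison. This is only an illustration of the layout idea, not Shark's implementation (which runs on the JVM); the schema, the `to_columns` helper, and the sample values are hypothetical.

```python
from array import array

# Row layout: one Python object per record.
rows = [
    {"user_id": 1, "latency_ms": 10.0},
    {"user_id": 2, "latency_ms": 20.0},
    {"user_id": 3, "latency_ms": 30.0},
]


def to_columns(rows):
    """Pivot records into one compact, typed array per column."""
    return {
        "user_id": array("q", (r["user_id"] for r in rows)),        # 64-bit ints
        "latency_ms": array("d", (r["latency_ms"] for r in rows)),  # doubles
    }


columns = to_columns(rows)

# An aggregate over one column scans a single contiguous array and never
# touches the other columns.
latency = columns["latency_ms"]
avg_latency = sum(latency) / len(latency)
```

Storing each column as a primitive array avoids per-record object overhead and lets a query that touches one column read only that column.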
[Bar chart: runtime (seconds) on a 100-node EC2 cluster, Shark/Spark vs. Hive/Hadoop, for Query 1, Query 2, and Log Regress. Bar labels: 110, 94, 64 and 0.96, 1, 0.7.]
iterative machine learning, SQL query processing, graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work!
But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation: “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org
SQL query processing: shark.cs.berkeley.edu
graph computation: www.wait-another-year.com