Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
1
140
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
Tweet
Share
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
2k
Interface Design for Spark Community
rxin
12
1.4k
Spark Committer Night meetup @ NYC
rxin
1
120
Apache Spark: Unified Platform for Big Data
rxin
1
220
Advanced Spark @ Spark Summit 2014
rxin
4
320
Apache Spark: Easier and Faster Big Data
rxin
2
280
GraphX at Spark User Meetup
rxin
0
140
Shark SIGMOD research deck
rxin
2
510
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
700
Other Decks in Research
See All in Research
「エージェントって何?」から「実際の開発現場で役立つ考え方やベストプラクティス」まで
mickey_kubo
0
130
問いを起点に、社会と共鳴する知を育む場へ
matsumoto_r
PRO
0
450
RHO-1: Not All Tokens Are What You Need
sansan_randd
1
140
Creation and environmental applications of 15-year daily inundation and vegetation maps for Siberia by integrating satellite and meteorological datasets
satai
3
140
EarthSynth: Generating Informative Earth Observation with Diffusion Models
satai
3
120
Delta Airlines® Customer Care in the U.S.: How to Reach Them Now
bookingcomcustomersupportusa
PRO
0
100
Looking for Escorts in Sydney?
lunsophia
1
120
Transparency to sustain open science infrastructure - Printemps Couperin
mlarrieu
1
200
SSII2025 [SS2] 横浜DeNAベイスターズの躍進を支えたAIプロダクト
ssii
PRO
7
3.7k
Combinatorial Search with Generators
kei18
0
390
Galileo: Learning Global & Local Features of Many Remote Sensing Modalities
satai
3
100
研究テーマのデザインと研究遂行の方法論
hisashiishihara
5
1.5k
Featured
See All Featured
YesSQL, Process and Tooling at Scale
rocio
173
14k
Producing Creativity
orderedlist
PRO
346
40k
Evolution of real-time – Irina Nazarova, EuRuKo, 2024
irinanazarova
8
840
GraphQLの誤解/rethinking-graphql
sonatard
71
11k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
34
3.1k
Building an army of robots
kneath
306
45k
Why Our Code Smells
bkeepers
PRO
337
57k
Art, The Web, and Tiny UX
lynnandtonic
299
21k
Side Projects
sachag
455
42k
Fantastic passwords and where to find them - at NoRuKo
philnash
51
3.3k
Building Flexible Design Systems
yeseniaperezcruz
328
39k
How to Create Impact in a Changing Tech Landscape [PerfNow 2023]
tammyeverts
53
2.9k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com