Reynold Xin
January 07, 2013
Research
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley, with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault-tolerance; elasticity; resource sharing
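A minimal sketch of why deterministic, idempotent tasks make fault tolerance simple: a failed task can just be re-run, and committing its output twice is harmless. The names `word_count_task` and `run_with_retry`, and the simulated failure, are hypothetical illustrations, not part of any MapReduce implementation.

```python
from collections import Counter


def word_count_task(partition):
    """Deterministic: the same input partition always yields the same counts."""
    return dict(Counter(word for line in partition for word in line.split()))


def run_with_retry(task, partition, task_id, output, fail_first_n=0, max_attempts=3):
    """Run a task, re-executing it if it fails (hypothetical scheduler)."""
    for attempt in range(max_attempts):
        try:
            if attempt < fail_first_n:
                raise RuntimeError("simulated worker failure")
            # Idempotent commit: output is keyed by task_id, so a repeated
            # write replaces the earlier one instead of duplicating it.
            output[task_id] = task(partition)
            return attempt + 1
        except RuntimeError:
            continue
    raise RuntimeError(f"task {task_id} failed after {max_attempts} attempts")


partition = ["spark shark", "shark graphx shark"]
clean, flaky = {}, {}
run_with_retry(word_count_task, partition, 0, clean)
attempts = run_with_retry(word_count_task, partition, 0, flaky, fail_first_n=2)
```

The run that fails twice ends with the same output as the run that never fails, which is the property the scheduler relies on.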
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning; OLAP; strong temporal locality
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
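One known answer, the one behind Spark's resilient distributed datasets, is to remember how each in-memory dataset was derived and recompute lost data instead of replicating it. The toy class below sketches that idea only; `Dataset`, `lose_memory`, and `recomputations` are hypothetical names and not Spark's API.

```python
class Dataset:
    """Toy in-memory dataset that remembers how it was derived (its lineage)."""

    def __init__(self, compute, parent=None):
        self._compute = compute   # function that rebuilds this dataset
        self._parent = parent     # upstream dataset, or None for a source
        self._cache = None        # in-memory copy; may be lost at any time
        self.recomputations = 0   # how many times this dataset was built

    def map(self, fn):
        """Derive a new dataset; nothing is computed until collect()."""
        return Dataset(lambda data: [fn(x) for x in data], parent=self)

    def collect(self):
        """Return the data, rebuilding it from the parent if it is missing."""
        if self._cache is None:
            upstream = self._parent.collect() if self._parent else None
            self._cache = self._compute(upstream)
            self.recomputations += 1
        return self._cache

    def lose_memory(self):
        """Simulate a node failure that drops the cached copy."""
        self._cache = None


source = Dataset(lambda _: [1, 2, 3, 4])
squared = source.map(lambda x: x * x)

first = squared.collect()      # computed once, then served from memory
squared.lose_memory()          # simulated failure
recovered = squared.collect()  # rebuilt from lineage, not from a replica
```

Only the lost dataset is recomputed; its parent is still in memory and is reused as is.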
Shark: How to do SQL query processing efficiently in “MapReduce” style? SQL on top of Spark; Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to publish a paper in 2013.”
i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store, dynamic query re-optimization, and a lot of engineering...
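A rough illustration of the columnar idea: each column is stored contiguously, so an aggregate over one column scans only that column's values. This is a toy layout built on Python's `array` module, not Shark's actual storage format; the table contents and the names `to_columns`, `page`, `visits`, and `latency` are hypothetical.

```python
from array import array

# Row-oriented layout: one tuple per record.
rows = [
    ("home", 120, 0.5),
    ("search", 300, 1.2),
    ("home", 80, 0.3),
]


def to_columns(rows):
    """Pivot rows into one compact container per column."""
    pages, visits, latency = zip(*rows)
    return {
        "page": list(pages),              # strings stay in a plain list
        "visits": array("q", visits),     # packed 64-bit integers
        "latency": array("d", latency),   # packed 64-bit floats
    }


columns = to_columns(rows)

# SELECT SUM(visits): scans a single packed array, never touching other columns.
total_visits = sum(columns["visits"])

# SELECT SUM(visits) WHERE page = 'home': reads only the two columns involved.
home_visits = sum(
    v for p, v in zip(columns["page"], columns["visits"]) if p == "home"
)
```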
[Figure: bar chart of runtime (seconds) on a 100-node EC2 cluster for Query 1, Query 2, and Log Regress, comparing Shark/Spark with Hive/Hadoop. Bar labels: 110, 94, 64 and 0.96, 1, 0.7.]
iterative machine learning; SQL query processing; graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work!
But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation: “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org
SQL query processing: shark.cs.berkeley.edu
graph computation: www.wait-another-year.com