Reynold Xin
January 07, 2013
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley, with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault-tolerance; elasticity; resource sharing
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning; OLAP; strong temporal locality
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
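One common answer to the question on this slide is to keep results in memory and remember how each piece was derived, so that a lost piece can be recomputed instead of restored from a replica. The sketch below is a toy illustration of that idea in plain Python, not Spark's API: the class `ToyDataset` and its methods (`map`, `partition`, `lose_partition`, `collect`) are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List


class ToyDataset:
    """Hypothetical partitioned dataset: caches partitions in memory and
    keeps a recipe (its lineage) for rebuilding any partition on demand."""

    def __init__(self, num_partitions: int, compute: Callable[[int], List[int]]):
        self.num_partitions = num_partitions
        self._compute = compute                  # lineage: how to derive partition i
        self._cache: Dict[int, List[int]] = {}   # in-memory copies of partitions
        self.computations = 0                    # how many times a partition was built

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # The child's lineage is "read the parent's partition, then apply fn".
        return ToyDataset(
            self.num_partitions,
            lambda i: [fn(x) for x in self.partition(i)],
        )

    def partition(self, i: int) -> List[int]:
        # Serve from memory when possible; otherwise rebuild from lineage.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
            self.computations += 1
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        # Simulate a failure that wipes one cached partition.
        self._cache.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


# Three partitions holding the integers 0..11, then squared.
source = ToyDataset(3, lambda i: list(range(i * 4, i * 4 + 4)))
squares = source.map(lambda x: x * x)

first = squares.collect()      # builds and caches all three partitions
squares.lose_partition(1)      # simulated failure
second = squares.collect()     # only partition 1 is rebuilt
```

In this toy, the failure costs one extra partition computation rather than a full rerun, and the parent's cached partition is reused. A real system also has to decide where partitions live and what to do when the parent data is gone too, which this sketch ignores.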
Shark: How to do SQL query processing efficiently in “MapReduce” style?
SQL on top of Spark; Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to publish a paper in 2013.”
i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store, dynamic query re-optimization, and a lot of engineering...
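To make "in-memory columnar store" concrete, here is a minimal sketch of the general idea: hold each column as one compact typed array instead of holding one object per row, so a scan over a single column touches only that column's data. This is an illustration only and says nothing about how Shark implements it. The schema (`user_id`, `revenue`) and the helpers `to_columns` and `sum_column` are assumptions made up for the example.

```python
from array import array
from typing import Dict, List, Tuple

# Hypothetical schema for illustration: (user_id, revenue)
Row = Tuple[int, float]


def to_columns(rows: List[Row]) -> Dict[str, array]:
    """Convert row tuples into one typed array per column."""
    user_ids = array("q")   # signed 64-bit integers
    revenue = array("d")    # 64-bit floats
    for uid, rev in rows:
        user_ids.append(uid)
        revenue.append(rev)
    return {"user_id": user_ids, "revenue": revenue}


def sum_column(table: Dict[str, array], name: str) -> float:
    """Aggregate one column without reading any of the others."""
    return sum(table[name])


rows = [(1, 10.0), (2, 2.5), (3, 7.5)]
table = to_columns(rows)
total = sum_column(table, "revenue")
```

Typed arrays store raw primitive values back to back, which avoids per-row object overhead. That is the usual motivation for a columnar layout in analytic workloads.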
[Figure: Runtime (seconds) on a 100-node EC2 cluster, Shark/Spark vs. Hive/Hadoop, for Query 1, Query 2, and Log Regress. Bar labels: 110, 94, 64 and 0.96, 1, 0.7.]
iterative machine learning; SQL query processing; graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work!
But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation: “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org
SQL query processing: shark.cs.berkeley.edu
graph computation: www.wait-another-year.com