Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Reynold Xin
January 07, 2013
Research
1
130
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
Tweet
Share
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
1.9k
Interface Design for Spark Community
rxin
12
1.3k
Spark Committer Night meetup @ NYC
rxin
1
110
Apache Spark: Unified Platform for Big Data
rxin
1
210
Advanced Spark @ Spark Summit 2014
rxin
4
310
Apache Spark: Easier and Faster Big Data
rxin
2
270
GraphX at Spark User Meetup
rxin
0
130
Shark SIGMOD research deck
rxin
2
480
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
690
Other Decks in Research
See All in Research
Optimal and Diffusion Transports in Machine Learning
gpeyre
0
1.2k
請求書仕分け自動化での物体検知モデル活用 / Utilization of Object Detection Models in Automated Invoice Sorting
sansan_randd
0
110
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding
sansan_randd
1
470
JSAI NeurIPS 2024 参加報告会(AI アライメント)
akifumi_wachi
5
860
Vision Language Modelと完全自動運転AIの最新動向
tsubasashi
0
240
Large Vision Language Model (LVLM) に関する最新知見まとめ (Part 1)
onely7
24
6k
Remote Sensing Vision-Language Foundation Models without Annotations via Ground Remote Alignment
satai
3
130
チュートリアル:Mamba, Vision Mamba (Vim)
hf149
6
2.2k
Poster: Feasibility of Runtime-Neutral Wasm Instrumentation for Edge-Cloud Workload Handover
chikuwait
0
360
Neural Fieldの紹介
nnchiba
2
720
「熊本県内バス・電車無料デー」の振り返りとその後の展開@土木計画学SS:成功失敗事例に学ぶ公共交通運賃設定
trafficbrain
0
230
o1 pro mode の調査レポート
smorce
0
130
Featured
See All Featured
Let's Do A Bunch of Simple Stuff to Make Websites Faster
chriscoyier
507
140k
Optimizing for Happiness
mojombo
376
70k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
280
13k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
27
1.6k
The World Runs on Bad Software
bkeepers
PRO
67
11k
What’s in a name? Adding method to the madness
productmarketing
PRO
22
3.3k
How to Think Like a Performance Engineer
csswizardry
22
1.4k
Building a Scalable Design System with Sketch
lauravandoore
461
33k
RailsConf 2023
tenderlove
29
1k
The Psychology of Web Performance [Beyond Tellerrand 2023]
tammyeverts
46
2.3k
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
27
1.9k
A better future with KSS
kneath
238
17k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com