Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Beating State-of-the-art By -10000% @ CIDR Gong...
Search
Sponsored
·
Your Podcast. Everywhere. Effortlessly.
Share. Educate. Inspire. Entertain. You do you. We'll handle the rest.
→
Reynold Xin
January 07, 2013
Research
150
1
Share
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Reynold Xin
January 07, 2013
More Decks by Reynold Xin
See All by Reynold Xin
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around Comes Around
rxin
12
2k
Interface Design for Spark Community
rxin
12
1.4k
Spark Committer Night meetup @ NYC
rxin
1
130
Apache Spark: Unified Platform for Big Data
rxin
1
250
Advanced Spark @ Spark Summit 2014
rxin
4
350
Apache Spark: Easier and Faster Big Data
rxin
2
300
GraphX at Spark User Meetup
rxin
0
170
Shark SIGMOD research deck
rxin
2
550
The Spark Ecosystem: Fast and Expressive Big Data Analytics in Scala @ Scala Days 2013
rxin
3
720
Other Decks in Research
See All in Research
Ankylosing Spondylitis
ankh2054
0
170
東京大学工学部計数工学科、計数工学特別講義の説明資料
kikuzo
0
390
「なんとなく」の顧客理解から脱却する ──顧客の解像度を武器にするインサイトマネジメント
tajima_kaho
10
7.5k
Aurora Serverless からAurora Serverless v2への課題と知見を論文から読み解く/Understanding the challenges and insights of moving from Aurora Serverless to Aurora Serverless v2 from a paper
bootjp
6
1.7k
衛星×エッジAI勉強会 衛星上におけるAI処理制約とそ取組について
satai
4
490
Model Discovery and Graph Simulation: A Lightweight Gateway to Chaos Engineering
anatolykr
0
160
Self-Hosted WebAssembly Runtime for Runtime-Neutral Checkpoint/Restore in Edge–Cloud Continuum
chikuwait
0
530
2026年3月1日(日)福島「除染土」の公共利用をかんがえる
atsukomasano2026
0
590
AIを叩き台として、 「検証」から「共創」へと進化するリサーチ
mela_dayo
0
260
世界モデルにおける分布外データ対応の方法論
koukyo1994
7
2.2k
LOSの検討(λ Kansai 2026 in Winter)
motopu
0
120
論文紹介 "ReSim: Reliable World Simulation for Autonomous Driving"
kogo
0
520
Featured
See All Featured
How to Ace a Technical Interview
jacobian
281
24k
The Limits of Empathy - UXLibs8
cassininazir
1
330
Lessons Learnt from Crawling 1000+ Websites
charlesmeaden
PRO
1
1.2k
Navigating the Design Leadership Dip - Product Design Week Design Leaders+ Conference 2024
apolaine
1
310
How to Grow Your eCommerce with AI & Automation
katarinadahlin
PRO
1
180
Technical Leadership for Architectural Decision Making
baasie
3
360
SEO for Brand Visibility & Recognition
aleyda
0
4.5k
Test your architecture with Archunit
thirion
1
2.2k
Hiding What from Whom? A Critical Review of the History of Programming languages for Music
tomoyanonymous
2
810
Tell your own story through comics
letsgokoyo
1
920
Why You Should Never Use an ORM
jnunemaker
PRO
61
9.8k
How to Build an AI Search Optimization Roadmap - Criteria and Steps to Take #SEOIRL
aleyda
1
2k
Transcript
Beating State-of-the-art By -10000% Reynold Xin, AMPLab, UC Berkeley with
help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
Beating State-of-the-art By -10000% NOT A TYPO Reynold Xin, AMPLab,
UC Berkeley with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce deterministic, idempotent tasks fault-tolerance elasticity resource sharing
“The bar for open source software is at historical low.”
“The bar for open source software is at historical low.”
i.e. “This is the right time to do grad school.”
iterative machine learning OLAP strong temporal locality
Does in-memory computation help in petabyte-scale warehouses?
Does in-memory computation help in petabyte-scale warehouses? YES
Spark How to do in-memory computation efficiently in a fault-tolerant
way?
Shark How to do SQL query processing efficiently in “MapReduce”
style SQL on top of Spark Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.”
“You need to beat Hadoop by at least 100X to
publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark in-memory columnar store dynamic query re-optimization and a lot
of engineering...
Query 1 Query 2 Log Regress 0 20 40 60
80 100 120 110 94 64 0.96 1 0.7 Runtime (seconds) on a 100-node EC2 cluster Shark/Spark Hive/Hadoop
iterative machine learning SQL query processing
iterative machine learning SQL query processing graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. Not bad for a day of work!
I spent a day pair-programming with Joey Gonzalez and improved
performance by 10X. but I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault- tolerant, distributed graph
computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning www.spark-project.org SQL query processing shark.cs.berkeley.edu graph computation
www.wait-another-year.com