Reynold Xin
January 07, 2013
Beating State-of-the-art By -10000% @ CIDR Gong Show
I gave a 5-min Gong Show talk at CIDR on my experience with Spark, Shark, and GraphX.
Transcript
Beating State-of-the-art By -10000% (NOT A TYPO)
Reynold Xin, AMPLab, UC Berkeley, with help from Joseph Gonzalez, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica
MapReduce: deterministic, idempotent tasks; fault-tolerance; elasticity; resource sharing
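To illustrate why deterministic, idempotent tasks make fault-tolerance simple, here is a minimal sketch, not taken from the talk. The scheduler loop `run_with_retries`, the task `word_count_task`, and the simulated failure rate are all hypothetical names and values chosen for illustration. The point is only that a lost task can be rerun from its input and will produce the same result.

```python
import random

def word_count_task(lines):
    """A deterministic, side-effect-free task: same input always gives same output."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def run_with_retries(task, partition, fail_prob, rng, max_attempts=50):
    """Hypothetical scheduler loop: rerun the task until one attempt survives."""
    for attempt in range(1, max_attempts + 1):
        if rng.random() < fail_prob:
            # Simulated worker crash: nothing from this attempt is kept.
            continue
        return task(partition), attempt
    raise RuntimeError("task failed on every attempt")

partition = ["spark shark spark", "hive shark"]
rng = random.Random(0)  # seeded so the simulated failures are repeatable
result, attempts = run_with_retries(word_count_task, partition, fail_prob=0.5, rng=rng)
```

Because the task has no side effects, the scheduler never needs to know how far a failed attempt got; it just runs the task again.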
“The bar for open source software is at historical low.” i.e. “This is the right time to do grad school.”
iterative machine learning; OLAP; strong temporal locality
Does in-memory computation help in petabyte-scale warehouses? YES
Spark: How to do in-memory computation efficiently in a fault-tolerant way?
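The slide only poses the question. Spark's published design (resilient distributed datasets) answers it by recording how each dataset was derived, so that lost in-memory data can be recomputed rather than replicated. The toy `Dataset` class below is a hypothetical single-machine sketch of that idea and is not Spark's API.

```python
class Dataset:
    """Toy dataset that remembers its lineage (parent plus transformation)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base data, assumed to live in reliable storage
        self.parent = parent  # dataset this one was derived from
        self.fn = fn          # transformation applied to each parent element
        self._cache = None    # in-memory copy; may be lost at any time

    def map(self, fn):
        return Dataset(parent=self, fn=fn)

    def collect(self):
        # Rebuild the in-memory copy from lineage if it is missing.
        if self._cache is None:
            if self.parent is None:
                self._cache = list(self.source)
            else:
                self._cache = [self.fn(x) for x in self.parent.collect()]
        return self._cache

    def lose_cache(self):
        """Simulate losing the in-memory copy, e.g. a node failure."""
        self._cache = None

base = Dataset(source=[1, 2, 3])
squared = base.map(lambda x: x * x)
first = squared.collect()

squared.lose_cache()
base.lose_cache()
recovered = squared.collect()  # recomputed from the source via lineage
```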
Shark: How to do SQL query processing efficiently in “MapReduce” style? SQL on top of Spark; Hive compatible (UDF, Type, InputFormat, Metadata)
“You need to beat Hadoop by at least 100X to publish a paper in 2013.” i.e. “You should’ve come to grad school 2 years earlier.”
Shark: in-memory columnar store, dynamic query re-optimization, and a lot of engineering...
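As a rough illustration of what an in-memory columnar store means, the sketch below turns a few rows into one compact array per column and answers a query by scanning only the columns it needs. The sample rows, the column names, and the helper `sum_revenue_for` are hypothetical; this is not Shark's implementation.

```python
from array import array

# Hypothetical sample rows: (id, url, revenue)
rows = [(1, "a.com", 12.5), (2, "b.com", 3.0), (3, "a.com", 7.5)]

# Row layout to column layout: one array per column.
ids = array("q", (r[0] for r in rows))      # 64-bit signed integers
urls = [r[1] for r in rows]                 # strings kept in a plain list
revenue = array("d", (r[2] for r in rows))  # 64-bit floats, stored unboxed

def sum_revenue_for(url):
    """Scan only the two columns the query touches; `ids` is never read."""
    return sum(v for u, v in zip(urls, revenue) if u == url)

total = sum_revenue_for("a.com")
```

Typed arrays store primitive values contiguously rather than as one object per field, which is the general reason a columnar layout tends to be more compact in memory than row objects.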
Chart: runtime (seconds) on a 100-node EC2 cluster, Shark/Spark vs. Hive/Hadoop, for Query 1, Query 2, and Log Regress (axis 0 to 120; bar labels 110, 94, 64, 0.96, 1, 0.7)
iterative machine learning; SQL query processing; graph computation
GraphLab on Spark
I spent a day pair-programming with Joey Gonzalez and improved performance by 10X. Not bad for a day of work! But I later found out that it is still 10X slower than the latest version of GraphLab :(
A lot of open questions for fault-tolerant, distributed graph computation. “MapReduce”? Data partitioning? Fault-tolerance? Asynchrony?
iterative machine learning: www.spark-project.org
SQL query processing: shark.cs.berkeley.edu
graph computation: www.wait-another-year.com