Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
0
320
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Tweet
Share
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
370
High Performance RPC with Finagle
samklr
1
220
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
820
Datageeks_27-05.pdf
samklr
0
72
Big data and Machine learning APIs
samklr
4
280
Scalable Machine Learning
samklr
2
260
mesos.devoxx.2014
samklr
2
290
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
3k
Algebra for analytics
samklr
1
300
Other Decks in Technology
See All in Technology
The essence of decision-making lies in primary data
kaminashi
0
120
データマネジメント戦略Night - 4社のリアルを語る会
ktatsuya
1
410
Laravelで学ぶOAuthとOpenID Connectの基礎と実装
kyoshidaxx
4
1.9k
AWS Systems Managerのハイブリッドアクティベーションを使用したガバメントクラウド環境の統合管理
toru_kubota
1
180
ハーネスエンジニアリング×AI適応開発
aictokamiya
1
350
Why we keep our community?
kawaguti
PRO
0
300
脳が溶けた話 / Melted Brain
keisuke69
1
1.1k
FastMCP OAuth Proxy with Cognito
hironobuiga
3
210
AIエージェント時代に必要な オペレーションマネージャーのロールとは
kentarofujii
0
160
昔話で振り返るAWSの歩み ~S3誕生から20年、クラウドはどう進化したのか~
nrinetcom
PRO
0
100
開発チームとQAエンジニアの新しい協業モデル -年末調整開発チームで実践する【QAリード施策】-
kaomi_wombat
0
250
夢の無限スパゲッティ製造機 #phperkaigi
o0h
PRO
0
380
Featured
See All Featured
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
12
1.1k
Claude Code のすすめ
schroneko
67
220k
Testing 201, or: Great Expectations
jmmastey
46
8.1k
Hiding What from Whom? A Critical Review of the History of Programming languages for Music
tomoyanonymous
2
600
The Invisible Side of Design
smashingmag
302
51k
個人開発の失敗を避けるイケてる考え方 / tips for indie hackers
panda_program
122
21k
From Legacy to Launchpad: Building Startup-Ready Communities
dugsong
0
180
Visualization
eitanlees
150
17k
Navigating Algorithm Shifts & AI Overviews - #SMXNext
aleyda
1
1.2k
VelocityConf: Rendering Performance Case Studies
addyosmani
333
24k
Large-scale JavaScript Application Architecture
addyosmani
515
110k
Applied NLP in the Age of Generative AI
inesmontani
PRO
4
2.2k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None