Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Formats
Big Data Data Format Zoo - Sequence Files
these formats provide
Binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection push down
- Awesome performance
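To make predicate and projection push down concrete, here is a minimal pure-Python sketch. It is a toy model, not Parquet's actual reader: the `make_row_group` and `scan` helpers, the column names, and the sample data are all hypothetical. It relies only on the idea that a columnar file can keep per-column min/max statistics for each chunk of rows, so a reader can skip chunks that cannot match a filter and read only the columns a query asks for.

```python
# Toy model of a columnar file: a list of row groups, each holding
# per-column value lists plus (min, max) statistics for every column.
# All names here are hypothetical and only illustrate the idea.
def make_row_group(rows, columns):
    data = {c: [r[c] for r in rows] for c in columns}
    stats = {c: (min(v), max(v)) for c, v in data.items()}
    return {"data": data, "stats": stats}


def scan(row_groups, projection, column, lower_bound):
    """Return projected rows where row[column] >= lower_bound."""
    out, groups_read = [], 0
    for rg in row_groups:
        _, col_max = rg["stats"][column]
        if col_max < lower_bound:
            continue  # predicate push down: skip the whole row group
        groups_read += 1
        keep = [i for i, v in enumerate(rg["data"][column]) if v >= lower_bound]
        # projection push down: touch only the requested columns
        out.extend(tuple(rg["data"][c][i] for c in projection) for i in keep)
    return out, groups_read


rows = [
    {"id": i, "country": "FR" if i % 2 else "US", "clicks": i * 10}
    for i in range(8)
]
columns = ["id", "country", "clicks"]
row_groups = [make_row_group(rows[i:i + 4], columns) for i in range(0, 8, 4)]

result, groups_read = scan(row_groups, ["id", "clicks"], "clicks", 50)
print(result, groups_read)
```

In this sketch the first row group is never decoded because its maximum `clicks` value is below the filter bound, and the `country` column is never touched because it is not in the projection.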
Columnar Storage 101
Columnar Storage 101: Advantages
- Limits I/O to only the data needed
- Big space savings: better compression, and faster, low-overhead encodings
- Enables vectorized execution engines
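The advantages above can be illustrated with a small pure-Python sketch. The sample rows and column names are made up for illustration, and run-length encoding is used here only as one simple example of an encoding that benefits from columnar layout; it is not a description of Parquet's on-disk format.

```python
from itertools import groupby

# Hypothetical sample data: (date, country, clicks)
rows = [
    ("2015-06-01", "FR", 3),
    ("2015-06-01", "FR", 5),
    ("2015-06-01", "US", 2),
    ("2015-06-02", "US", 7),
]

# Row layout: values of different columns are interleaved.
row_layout = [value for row in rows for value in row]

# Column layout: each column is stored contiguously.
names = ("date", "country", "clicks")
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]


encoded_date = run_length_encode(columns["date"])
encoded_country = run_length_encode(columns["country"])
encoded_rows = run_length_encode(row_layout)

# An aggregate over one column reads that column only.
total_clicks = sum(columns["clicks"])

print(encoded_date, encoded_country, len(encoded_rows), total_clicks)
```

Stored column by column, similar values sit next to each other and collapse into a few runs, while the interleaved row layout has no consecutive repeats to exploit. Summing `clicks` also never touches the `date` or `country` values.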
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: stores how deep a field's path is defined, i.e. the level at which the field is null
- Repetition level: stores the level at which a new list starts in the column values
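As a rough illustration of these two levels, the sketch below shreds a single column by hand. It assumes a hypothetical schema in which `contacts` is a repeated group and `phoneNumber` is an optional field inside it, so the maximum definition level is 2 and the maximum repetition level is 1. The function and the sample records are invented for this example and handle only this one column path; a real writer derives the levels generically from the schema.

```python
def shred_contact_phones(records):
    """Return (repetition, definition, value) triples for contacts.phoneNumber.

    Assumed schema: contacts is a repeated group and phoneNumber is optional
    inside it, so definition levels range over 0..2 and repetition over 0..1.
    """
    triples = []
    for record in records:
        contacts = record.get("contacts", [])
        if not contacts:
            # contacts is absent: nothing on the path is defined
            triples.append((0, 0, None))
            continue
        for index, contact in enumerate(contacts):
            # 0 marks the start of a new record, 1 a further list entry
            repetition = 0 if index == 0 else 1
            phone = contact.get("phoneNumber")
            # 2: phoneNumber present, 1: contact present but phone is null
            definition = 2 if phone is not None else 1
            triples.append((repetition, definition, phone))
    return triples


# Hypothetical records
records = [
    {"owner": "alice", "contacts": [
        {"name": "bob", "phoneNumber": "555 0100"},
        {"name": "carol"},
    ]},
    {"owner": "dave"},
]

triples = shred_contact_phones(records)
for triple in triples:
    print(triple)
```

Reading the triples back, a repetition level of 0 tells the reader that a new record begins, and a definition level below the maximum tells it at which depth the path stopped being defined, which is enough to rebuild the nested structure without storing the nulls as values.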
Numbers Example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk Space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES