Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sam Bessalah
April 06, 2016
Technology
330
0
Share
Embed
Copy iframe code
Copy JS code
Copy link
Start on current slide
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
380
High Performance RPC with Finagle
samklr
1
230
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
840
Datageeks_27-05.pdf
samklr
0
81
Big data and Machine learning APIs
samklr
4
300
Scalable Machine Learning
samklr
2
270
mesos.devoxx.2014
samklr
2
300
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
3k
Algebra for analytics
samklr
1
310
Other Decks in Technology
See All in Technology
OTel × Datadog で 「AI活用」を計測し、改善に繋げる
shihochan
2
1.1k
Deep Data Security 機能解説
oracle4engineer
PRO
2
230
AIに障害切り分けを全部やってもらった。 。 。 。
estie
0
260
10年間のブログ発信を振り返って見えたWebアプリケーションエンジニアとしての軌跡
stefafafan
0
190
AWS Summit 2026で見えたSIerにとっての Amazon Quickの位置づけ
maf_0521
0
120
秘密度ラベル初心者が第1歩でつまづかないための「設計・運用」ポイント
seafay
PRO
1
520
toB プロダクトから見たWAF
tokai235
0
250
トークン最適化のためのユーザーストーリー分析 / User Story Analysis for Token Optimization
oomatomo
0
130
感情と身体を置き去りにしない、エンジニアの生きのこり方 ──いまから、ここから「自分の状態」を扱うという選択
saorimurooka
0
360
Multi-Agent並列開発を 安全に回すための技術 / Technology for Safely Multi-Agent Parallel Development
tooppoo
0
220
AI 不只幫你寫 Code: 當專案從 300 暴增到 1500, 我們如何撐住 DevOps
appleboy
0
280
CVE-2026-20833_脆弱性対応とAES 化について
jukishiya
0
140
Featured
See All Featured
Speed Design
sergeychernyshev
33
1.9k
Making Projects Easy
brettharned
120
6.7k
Context Engineering - Making Every Token Count
addyosmani
9
990
Creating an realtime collaboration tool: Agile Flush - .NET Oxford
marcduiker
35
2.5k
[RailsConf 2023 Opening Keynote] The Magic of Rails
eileencodes
31
10k
HU Berlin: Industrial-Strength Natural Language Processing with spaCy and Prodigy
inesmontani
PRO
0
420
Un-Boring Meetings
codingconduct
0
320
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
128
56k
Lessons Learnt from Crawling 1000+ Websites
charlesmeaden
PRO
1
1.3k
The Limits of Empathy - UXLibs8
cassininazir
1
370
Large-scale JavaScript Application Architecture
addyosmani
515
110k
The Impact of AI in SEO - AI Overviews June 2024 Edition
aleyda
5
1.1k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None