Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Intro to Parquet (June 2015)
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
Sam Bessalah
April 06, 2016
Technology
320
0
Share
Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
More Decks by Sam Bessalah
See All by Sam Bessalah
Streaming Platforms
samklr
0
380
High Performance RPC with Finagle
samklr
1
220
Dotscale 2015 Lightning - Distributed Systems Research
samklr
1
830
Datageeks_27-05.pdf
samklr
0
76
Big data and Machine learning APIs
samklr
4
290
Scalable Machine Learning
samklr
2
260
mesos.devoxx.2014
samklr
2
290
Algebird : Abstract Algebra for Big Data Analytics.
samklr
9
3k
Algebra for analytics
samklr
1
310
Other Decks in Technology
See All in Technology
サンプリングは「作る」のか「使う」のか? 分散トレースのコストと運用を両立する実践的戦略 / Why you need the tail sampling and why you don't want it
ymotongpoo
3
120
EMから幅を広げるために最近挑戦していること / Recent challenges I'm undertaking to expand my horizons beyond EM
hiro_torii
1
180
知ってた?JavaScriptの"正しさ"を検証するテストが5万以上もあること(Test262)
riyaamemiya
1
150
鹿野さんに聞く!CSSの最新トレンド Ver.2026
tonkotsuboy_com
6
2.4k
エンタープライズの厳格な制約を開発者に意識させない:クラウドネイティブ開発基盤設計/cloudnative-kaigi-golden-path
mhrtech
0
350
Building a Study Buddy AI Agent from Scratch: From Passive Chatbots to Autonomous Systems
itchimonji
0
140
自動テストだけで リリース判断できるチームへ - 鍵はテストの量ではなくリリース判断基準の再設計にあった / Redesigning Release Criteria for Lightweight Releases
ewa
7
3.5k
Forget technical debt
ufried
0
170
20260513_生成AIを専属DSに_AI分析結果の検品テクニック_ハンズオン_交通事故データ
doradora09
PRO
0
210
AI駆動開発で生産性を追いかけたら、行き着いたのは品質とシフトレフトだった
littlehands
0
450
生成AIはソフトウェア開発の革命か、ソフトウェア工学の宿題再提出なのか -ソフトウェア品質特性の追加提案-
kyonmm
PRO
2
860
【技術書典20】OpenFOAM(自宅で深める流体解析)流れと熱移動(2)
kamakiri1225
0
380
Featured
See All Featured
Optimising Largest Contentful Paint
csswizardry
37
3.7k
Tips & Tricks on How to Get Your First Job In Tech
honzajavorek
1
500
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
141
35k
The Hidden Cost of Media on the Web [PixelPalooza 2025]
tammyeverts
2
290
<Decoding/> the Language of Devs - We Love SEO 2024
nikkihalliwell
1
210
Highjacked: Video Game Concept Design
rkendrick25
PRO
1
350
HU Berlin: Industrial-Strength Natural Language Processing with spaCy and Prodigy
inesmontani
PRO
0
370
Making Projects Easy
brettharned
120
6.6k
Public Speaking Without Barfing On Your Shoes - THAT 2023
reverentgeek
1
380
Product Roadmaps are Hard
iamctodd
PRO
55
12k
Groundhog Day: Seeking Process in Gaming for Health
codingconduct
0
170
Test your architecture with Archunit
thirion
1
2.2k
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Typical Data workflow
Typical Data workflow
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
None
Binary, columnar storage format for big data analytics workloads, inspired
by the Google Dremel Paper. - Language independent - Processing framework independent - Formally specified - More than a columnar storage : Dynamic partionning, automatic predicate and projections push down - Awesome performance
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101
Columnar Storage 101 Advantages : - Limits I/O to the
data only needed - Big Space savings, better compression, and faster and low overhead encodings - Enables vectorized engine
Columnar Storage 101
None
Parquet Model
Example Parquet Schema
None
None
Definition and Repetition Levels Definition Level : Stores the level
for which the field is null Repetition Level : Store levels when new lists are starting in column values.
None
None
None
None
None
None
Numbers Example: Appnexus 2 MM Logs of Ads impressions 270
TB of Log Data in Protobuf on HDFS http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
simple bench with HIVE
None
None
Disk Space usage on HDFS with 128 MB blocks
None
None
None
None
None
None
Slides shamelessly cloned from Julien Le Dem(@J_) , Lead of
the Apache Parquet Project
BACKUP SLIDES
None
None
None
None
None
None
None
None
None
None
None
None