Intro to Parquet (June 2015)
Sam Bessalah
April 06, 2016
Transcript
Sam BESSALAH @samklr http://parquet.apache.org
Typical Data workflow
Multiple Data Format
Big Data Data Format Zoo - Sequence Files
these formats provide
Parquet: a binary, columnar storage format for big data analytics workloads, inspired by the Google Dremel paper.
- Language independent
- Processing framework independent
- Formally specified
- More than columnar storage: dynamic partitioning, automatic predicate and projection pushdown
- Awesome performance
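To make predicate and projection pushdown concrete, here is a minimal sketch in plain Python. It is not the Parquet API: `RowGroup`, `min_max`, and `scan` are hypothetical names, and the column names and values are made up for illustration. It assumes each row group exposes min/max statistics per column, so a reader can skip a whole row group when the filter cannot match, and it reads only the columns the query asks for.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a row group: one list of values per column."""
    columns: Dict[str, List[int]]

    def min_max(self, name: str) -> Tuple[int, int]:
        # A real file keeps such statistics in its metadata; this sketch
        # recomputes them on demand to stay short.
        values = self.columns[name]
        return min(values), max(values)


def scan(groups: List[RowGroup], project: List[str],
         filter_col: str, lower_bound: int) -> Tuple[List[tuple], int]:
    """Return (rows, groups_read) for rows where filter_col >= lower_bound."""
    rows: List[tuple] = []
    groups_read = 0
    for group in groups:
        _, highest = group.min_max(filter_col)
        if highest < lower_bound:
            continue  # predicate pushdown: no row here can match, skip the group
        groups_read += 1
        matching = [i for i, v in enumerate(group.columns[filter_col])
                    if v >= lower_bound]
        for i in matching:
            # projection pushdown: only the requested columns are touched
            rows.append(tuple(group.columns[name][i] for name in project))
    return rows, groups_read


groups = [
    RowGroup({"ts": [1, 2, 3], "user": [10, 11, 12], "bytes": [5, 6, 7]}),
    RowGroup({"ts": [4, 5, 6], "user": [13, 14, 15], "bytes": [8, 9, 10]}),
]
rows, groups_read = scan(groups, project=["user"], filter_col="ts", lower_bound=5)
print(rows, groups_read)
```

In this toy run the first row group is never read, because its largest `ts` is below the bound, and the `bytes` column is never touched at all.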
Columnar Storage 101
Columnar Storage 101: Advantages
- Limits I/O to only the data needed
- Big space savings: better compression, plus faster, low-overhead encodings
- Enables vectorized execution engines
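The advantages listed above can be sketched with a few lines of plain Python. This is an illustration only: the records, the `to_columns` and `run_length_encode` helpers, and the field names are all hypothetical, and run-length encoding stands in here for the family of lightweight encodings a columnar format can apply.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Hypothetical ad-impression records in row-oriented form.
rows = [
    {"country": "FR", "clicks": 3},
    {"country": "FR", "clicks": 5},
    {"country": "US", "clicks": 1},
]


def to_columns(records: List[dict]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Collapse runs of equal adjacent values into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# Reading one column does not touch the others (less I/O).
total_clicks = sum(columns["clicks"])

# Values of one column sit next to each other, so repeats encode compactly.
encoded_country = run_length_encode(columns["country"])

print(total_clicks, encoded_country)
```

Because values of the same type and meaning are stored contiguously, an engine can also process them in tight loops over whole batches, which is what makes vectorized execution practical.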
Parquet Model
Example Parquet Schema
Definition and Repetition Levels
- Definition level: records how many optional or repeated fields on the column's path are defined, i.e. the level at which the field becomes null.
- Repetition level: records the level at which a new list starts in the column values.
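A small worked example may help with the two levels. The sketch below assumes a hypothetical schema, `message Person { repeated group contacts { repeated string phone; } }`, and shreds the single column `contacts.phone` into (value, repetition level, definition level) triples, following the scheme described in the Dremel paper. The function names are made up for this illustration and this is not code from any Parquet library.

```python
from typing import List, Optional, Tuple

# (value, repetition level, definition level)
Triple = Tuple[Optional[str], int, int]

# For the assumed column contacts.phone both maximum levels are 2:
# two repeated fields sit on the path (contacts, then phone).


def stripe_phones(records: List[List[List[str]]]) -> List[Triple]:
    """Shred records (a list of contacts, each a list of phones) into triples."""
    out: List[Triple] = []
    for contacts in records:
        if not contacts:
            out.append((None, 0, 0))  # nothing on the path is defined
            continue
        for ci, phones in enumerate(contacts):
            # The first contact starts a new record (0); later ones repeat at level 1.
            r_contact = 0 if ci == 0 else 1
            if not phones:
                out.append((None, r_contact, 1))  # contacts defined, phone missing
                continue
            for pi, phone in enumerate(phones):
                # Further phones within the same contact repeat at level 2.
                out.append((phone, r_contact if pi == 0 else 2, 2))
    return out


def assemble_phones(triples: List[Triple]) -> List[List[List[str]]]:
    """Rebuild the nested records from the triples alone."""
    records: List[List[List[str]]] = []
    for value, r, d in triples:
        if r == 0:
            records.append([])      # a new record starts
        if d >= 1 and r <= 1:
            records[-1].append([])  # a new contact starts
        if d == 2:
            records[-1][-1].append(value)
    return records


records = [[["p1", "p2"], [], ["p3"]], []]
triples = stripe_phones(records)
print(triples)
print(assemble_phones(triples) == records)
```

The round trip shows why the levels are stored: a flat column of values plus two small integers per entry is enough to recover the nesting, including empty lists, without storing the structure itself.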
Numbers example: AppNexus
- 2 MM logs of ad impressions
- 270 TB of log data in Protobuf on HDFS
http://techblog.appnexus.com/blog/2015/03/31/parquet-columnar-storage-for-hadoop-data/
Simple benchmark with Hive
Disk space usage on HDFS with 128 MB blocks
Slides shamelessly cloned from Julien Le Dem (@J_), lead of the Apache Parquet project
BACKUP SLIDES