Strongly typed datasets in a weakly typed world

Strongly typed Parquet datasets and Hive tables are often used to exchange and preserve data in Pandas-driven environments, where types are rather unstable. This mismatch causes several issues. The talk presents these issues and potential solutions, together with an RFC directed to the community.

Marco Neumann

October 26, 2018

Transcript

  1. 3

  2. 13 Dask Types

    dask_dataframe_1
    ├── _common_metadata
    ├── month=1
    │   └── 552da9ba018c48c1b8722643af3e081d.parquet
    └── month=2
        └── cc686be9ffb54f21a5aaca7e9502e3c6.parquet
  3. 14 Dask Types

    dask_dataframe_1
    ├── _common_metadata
    ├── month=1
    │   └── 552da9ba018c48c1b8722643af3e081d.parquet
    └── month=2
        └── cc686be9ffb54f21a5aaca7e9502e3c6.parquet
  4. 17 Turbodbc

    >>> cursor.execute("SELECT * FROM (VALUES(1), (2), (3))")
    >>> table = cursor.fetchallarrow()
    >>> table
    pyarrow.Table
    __COL0__: int64

    >>> cursor.execute("SELECT * FROM (VALUES(1), (2), (3))")
    >>> table = cursor.fetchallarrow(adaptive_integers=True)
    >>> table
    pyarrow.Table
    __COL0__: int8
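
The jump from int64 to int8 in the session above comes from choosing the narrowest integer type that can hold the fetched values. The sketch below illustrates that idea in plain Python. `narrowest_int_type` is a hypothetical helper written for this illustration; it is not part of turbodbc, and turbodbc's actual selection logic may differ.

```python
# Hypothetical helper: NOT turbodbc's API, only an illustration of
# choosing the smallest signed integer width that fits a batch of values.
SIGNED_WIDTHS = (8, 16, 32, 64)


def narrowest_int_type(values):
    """Return the name of the smallest signed integer type holding all values."""
    lo, hi = min(values), max(values)
    for width in SIGNED_WIDTHS:
        limit = 2 ** (width - 1)  # signed range is [-limit, limit - 1]
        if -limit <= lo and hi < limit:
            return f"int{width}"
    raise OverflowError("values do not fit into a 64-bit signed integer")


print(narrowest_int_type([1, 2, 3]))    # int8
print(narrowest_int_type([1, 2, 300]))  # int16
```

Under this scheme, two result sets of the same query can come back with different widths depending on the values they contain, so a consumer that expects one fixed type has to tolerate that variance.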
  5. 31 Dask Types

    dask_dataframe_1
    ├── _common_metadata
    ├── month=1
    │   └── 552da9ba018c48c1b8722643af3e081d.parquet
    └── month=2
        └── cc686be9ffb54f21a5aaca7e9502e3c6.parquet

    RECAP
  6. 32 Type Normalization

    data AType = ABinary
               | ABool
               | ADate Int
               | ADecimal128 Int Int
               | ADictionary AType AType Bool
               | AFloat Int
               | AInt Int
               | AList AType
               | ANull
               | AString
               | ATime Int String
               | ATimestamp String String
               | AUInt Int
               deriving (Show)

    convert :: AType -> AType
    convert (AInt w)            = AInt 64
    convert (AUInt w)           = AUInt 64
    convert (AFloat w)          = AFloat 64
    convert (ADate w)           = ADate 64
    convert (ATime w u)         = ATime 64 u
    convert (AList t)           = AList (convert t)
    convert (ADictionary t i s) = t
    convert x                   = x
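
For readers working on the Python side, the normalization rules on this slide can be approximated with dataclasses. This is a minimal sketch, not the author's implementation: only the constructors that `convert` touches (plus `AString` as a pass-through example) are modelled, and the field names `width`, `unit`, `item`, `value`, `index`, and `ordered` are assumptions based on one reading of the slide.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AType:
    """Base class for the illustrative type descriptors."""


@dataclass(frozen=True)
class AInt(AType):
    width: int


@dataclass(frozen=True)
class AUInt(AType):
    width: int


@dataclass(frozen=True)
class AFloat(AType):
    width: int


@dataclass(frozen=True)
class ADate(AType):
    width: int


@dataclass(frozen=True)
class ATime(AType):
    width: int
    unit: str


@dataclass(frozen=True)
class AString(AType):
    pass


@dataclass(frozen=True)
class AList(AType):
    item: AType


@dataclass(frozen=True)
class ADictionary(AType):
    value: AType   # assumed: the "uncompressed" value type
    index: AType   # assumed: the index type
    ordered: bool


def convert(t: AType) -> AType:
    """Normalize a type: widen to 64 bit, unwrap dictionaries."""
    if isinstance(t, (AInt, AUInt, AFloat, ADate, ATime)):
        return replace(t, width=64)  # keeps other fields, e.g. ATime.unit
    if isinstance(t, AList):
        return AList(convert(t.item))
    if isinstance(t, ADictionary):
        return t.value  # mirrors the slide: value type returned unchanged
    return t


print(convert(AList(AInt(8))))
```

Like the slide, this sketch returns the value type of a dictionary without normalizing it further; whether that should recurse is left open here.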
  7. 33 Takeaways

    1. Dynamically Typed Runtime, but Statically Typed Datasets
    2. Bit-Width:
       1. Allow Variance during Read/Write
       2. (if required) Store Maximum-Width Type
    3. Categoricals:
       1. Allow “Compressed” and “Uncompressed” Data
       2. (if required) Store “Uncompressed”/Internal Type
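
The bit-width takeaway can be sketched as a schema check that treats columns as compatible when they differ only in width, and that stores the maximum-width type when a fixed schema is required. This is an illustration under stated assumptions: schemas are represented as plain dicts from column name to a NumPy-style type name, and `normalize`, `compatible`, and `stored_schema` are hypothetical helpers, not functions of any library mentioned in the deck. The categorical rule is not covered.

```python
import re

# Assumed representation: type names such as "int8", "uint32", "float64".
_INT = re.compile(r"^(u?int)(?:8|16|32|64)$")
_FLOAT = re.compile(r"^(float)(?:16|32|64)$")


def normalize(type_name):
    """Map a width-specific numeric type name to its 64-bit variant."""
    for pattern in (_INT, _FLOAT):
        match = pattern.match(type_name)
        if match:
            return match.group(1) + "64"
    return type_name  # non-numeric types are left untouched


def compatible(schema_a, schema_b):
    """True if both schemas have the same columns and differ only in bit width."""
    if schema_a.keys() != schema_b.keys():
        return False
    return all(
        normalize(schema_a[col]) == normalize(schema_b[col]) for col in schema_a
    )


def stored_schema(schema):
    """Schema to persist when a single fixed type per column is required."""
    return {col: normalize(type_name) for col, type_name in schema.items()}


month_1 = {"x": "int8", "label": "string"}
month_2 = {"x": "int64", "label": "string"}
print(compatible(month_1, month_2))
print(stored_schema(month_1))
```

Signedness is deliberately not widened away: `int8` and `uint8` stay incompatible in this sketch, matching the separate `AInt` and `AUInt` cases in the type normalization slide.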
  8. 34 Thank You

    1. Bottles, portions, liquid and perfume HD by Alex Sajan
       https://unsplash.com/photos/BNuxGwj-a24
    2. Sunset boulevard by Simon Matzinger
       https://unsplash.com/photos/twukN12EN7c
    3. Stages of a Monarch Butterfly by Suzanne D. Williams
       https://unsplash.com/photos/VMKBFR6r_jg
    4. Lovers bike ride by Sabina Ciesielska
       https://unsplash.com/photos/-2PrzYfgotw
    5. Jar, jam, market food HD by Viktor Forgacs
       https://unsplash.com/photos/5mGGOWD-Ths

    Marco Neumann @crepererum