Slide 1

Slide 1 text

Extract Transform Load FALUDI, Bence Database Manager

Slide 2

Slide 2 text

Positions: -  Database Manager @ Mito -  Chief Technology Officer @ Ozmo Tasks: -  Data warehouse design -  Data Cleansing & Analytics -  ETL & Python Development -  IT Project Management Introduction FALUDI, Bence [email protected]

Slide 3

Slide 3 text

Extract, Transform and Load (ETL) refers to a process in database usage and especially in data warehousing that involves: •  Extracting data from outside sources •  Transforming it to fit operational needs, which can include quality levels •  Loading it into the end target @ Wikipedia ETL

Slide 4

Slide 4 text

1.  Build reference data 2.  Extract from XML, XLS, JSON, YML, CSV, TSV, Google Spreadsheet, Fixed Width Text, Relational Database, … 3.  Validate data 4.  Transform: clean, apply business rules, convert, key-value map, check for data integrity, … 5.  Manipulation: drop, merge, aggregate, … 6.  Loading it into any target (mostly Relational Database) ETL Short Cycle

Slide 5

Slide 5 text

Challenges of implementation: 1.  Pause & Continue load process 2.  Update data: which data is different 3.  Data consistency is horrible 4.  Large datasets 5.  Quick transforms 6.  Every source has different format ETL software problems: 1.  Hard to use & not user friendly GUI 2.  Expensive software 3.  Console tool is not available Most common problems

Slide 6

Slide 6 text

•  Oracle Warehouse Builder (OWB) •  Data Services •  IBM Information Server (Datastage) •  SAS Data Integration Studio •  PowerCenter Informatica •  Elixir Repertoire •  Data Migrator •  SQL Server Integration Services •  Talend Studio for Data Integration •  DataFlow Manager •  Pentaho Kettle •  … ETL Tools

Slide 7

Slide 7 text

PyF - http://pyfproject.org Hard to use, many bugs, … Tools in Python

Slide 8

Slide 8 text

Brewery - http://pythonhosted.org/brewery Without GUI & transforms. Programming skills are required Tools in Python

Slide 9

Slide 9 text

We have a (large) business software, and have to load new or modified data every day from different source. E.g.: Customer Relationship Management What could we use? Fixtures? J Tools in Python

Slide 10

Slide 10 text

Fixtures are the test context. Loading a database with a known (specific) set of data where the test results are repeatable. Some frameworks use fixtures to load default (or new) data into relational database. Not suitable for periodical updated data migration! Fixtures

Slide 11

Slide 11 text

We created a tool from scratch based on experiences of Brewery for CEU Economics. We call this tool: mETL It will be published soon with GPL license. (currently in test phase)

Slide 12

Slide 12 text

-  No GUI, just a console script -  Update & migration file management -  Check differences between migrations -  Configuration in YAMLformat -  Programming skills is not required -  Transformations are using TARR, with Conditions and Statements -  Quick transforms and manipulations -  Easy to extend (for special use) -  9 source types & 8 target types Features

Slide 13

Slide 13 text

Sources: CSV, Database*, Fixed Width Text, Google Spreadsheet, JSON, TSV, XLS, XML, Yaml Targets: CSV, Database*, Fixed Width Text, JSON, TSV, XLS, XML, Yaml * SQLAlchemy support: Postgresql, MySQL, Oracle, Microsoft SQL Server, SQLite, URL API, … Features

Slide 14

Slide 14 text

Architecture

Slide 15

Slide 15 text

Source Configuration example

Slide 16

Slide 16 text

Transforms Configuration example

Slide 17

Slide 17 text

Manipulations & Target Configuration example

Slide 18

Slide 18 text

Most common problems: 1.  Pause & Continue load process 2.  Update data: which data is different 3.  Data consistency is horrible 4.  Large datasets 5.  Quick transforms 6.  Every source has different format Unexpected challenges: 7.  Filtering data via different source(s) 8.  Type conversion 9.  Readable field map for every source type Solutions

Slide 19

Slide 19 text

Problems: 1.  Pause & Continue load process 2.  Update data: which data is different Every row gets a generated hash in the field value’s final stage and a unique id based on the marked* fields in the configuration file. Saving row’s primary key and the generated hash into a pickle migration file. If the unique id exists with the same hash, we skip the row, if the hash is different we update the row. * Not required Solutions

Slide 20

Slide 20 text

Problems: 3.  Data consistency is horrible 4.  Large datasets 5.  Quick transforms mETL has 8 transforms to produce consistent data sets. These transforms are supported by statements and conditions. Uses TARR* package which is a non-parallel data flow language created by CEU Economics with multiple thread support. * It will be open source as well. Solutions

Slide 21

Slide 21 text

Problems: 6.  Every source has different format Easy YAML configuration file with all fields and information. You don’t have to write program code, just change the field’s mapping setting and run the console script again with different source. These Yaml files are savable and changeable if the source format is modified. Easy to maintain. Solutions

Slide 22

Slide 22 text

Problems: 7.  Filtering data via different source(s) Solutions Source + Condition + Join key (same name required)

Slide 23

Slide 23 text

Problems: 8.  Type conversion Means: Field’s start type != Field’s final type But: All field’s have to be the same type. We have to manage type conversion in runtime with transforms: •  ConvertType: change type to the Field and add a default value if it necessary. •  Set: set any value to the Field. After the transform process we have to force type change automatically. Solutions

Slide 24

Slide 24 text

Problems: 9.  Readable field map for every source type Convert every source input into dicts and lists and give a route for the data. If the route’s item is not exists we set the field’s value to empty. E.g. in XML: pictures/picture/0/picture_thumb/url E.g. in CSV, XLS, …: 5 E.g. in Database: created Easy to learn because not using different keywords (e.g.: index, route, cell, …) for every type. Solutions

Slide 25

Slide 25 text

FALUDI, Bence [email protected]   Thank you for your attention!