
Extract Transform Load in Python


Our approach to building a new Python ETL tool from scratch.

Bence Faludi

March 27, 2013

Transcript

  1. Introduction
     FALUDI, Bence, [email protected]
     Positions:
     - Database Manager @ Mito
     - Chief Technology Officer @ Ozmo
     Tasks:
     - Data warehouse design
     - Data Cleansing & Analytics
     - ETL & Python Development
     - IT Project Management
  2. ETL
     Extract, Transform and Load (ETL) refers to a process in database usage, and especially in data warehousing, that involves:
     - Extracting data from outside sources
     - Transforming it to fit operational needs, which can include quality levels
     - Loading it into the end target
     @ Wikipedia
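The three stages can be sketched as a toy Python pipeline. This is a generic illustration of the ETL definition above, not mETL code; the CSV sample and field names are made up:

```python
import csv
import io

# A minimal, generic ETL pipeline illustrating the three stages above;
# this is a toy sketch, not mETL code.

def extract(source_text):
    """Extract: read rows from an outside source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(source_text)))

def transform(rows):
    """Transform: fit the data to operational needs (types, quality levels)."""
    clean = []
    for row in rows:
        if not row["name"].strip():    # quality level: drop nameless rows
            continue
        row["age"] = int(row["age"])   # type conversion
        clean.append(row)
    return clean

def load(rows, target):
    """Load: write the result into the end target (here, a plain list)."""
    target.extend(rows)
    return target

result = load(transform(extract("name,age\nAda,36\n,99\n")), [])
```

In a real pipeline the target would be a database or file rather than a list, but the shape of the three stages stays the same.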
  3. ETL Short Cycle
     1. Build reference data
     2. Extract from XML, XLS, JSON, YML, CSV, TSV, Google Spreadsheet, Fixed Width Text, Relational Database, …
     3. Validate data
     4. Transform: clean, apply business rules, convert, key-value map, check for data integrity, …
     5. Manipulate: drop, merge, aggregate, …
     6. Load it into any target (mostly a relational database)
  4. Most common problems
     Challenges of implementation:
     1. Pausing & continuing the load process
     2. Updating data: knowing which data has changed
     3. Data consistency is horrible
     4. Large datasets
     5. Quick transforms
     6. Every source has a different format
     ETL software problems:
     1. Hard to use, with unfriendly GUIs
     2. Expensive software
     3. No console tool available
  5. ETL Tools
     - Oracle Warehouse Builder (OWB)
     - Data Services
     - IBM Information Server (DataStage)
     - SAS Data Integration Studio
     - Informatica PowerCenter
     - Elixir Repertoire
     - Data Migrator
     - SQL Server Integration Services
     - Talend Studio for Data Integration
     - DataFlow Manager
     - Pentaho Kettle
     - …
  6. Tools in Python
     We have a (large) piece of business software and have to load new or modified data every day from different sources. E.g.: Customer Relationship Management.
     What could we use? Fixtures? :)
  7. Fixtures
     Fixtures are the test context: loading a database with a known (specific) set of data so that test results are repeatable. Some frameworks use fixtures to load default (or new) data into a relational database.
     Not suitable for periodically updated data migration!
  8. We created a tool from scratch, based on our experiences with Brewery, for CEU Economics. We call this tool mETL. It will be published soon under the GPL license (currently in the test phase).
  9. Features
     - No GUI, just a console script
     - Update & migration file management
     - Check differences between migrations
     - Configuration in YAML format
     - No programming skills required
     - Transformations use TARR, with Conditions and Statements
     - Quick transforms and manipulations
     - Easy to extend (for special uses)
     - 9 source types & 8 target types
  10. Features
      Sources: CSV, Database*, Fixed Width Text, Google Spreadsheet, JSON, TSV, XLS, XML, YAML
      Targets: CSV, Database*, Fixed Width Text, JSON, TSV, XLS, XML, YAML
      * SQLAlchemy support: PostgreSQL, MySQL, Oracle, Microsoft SQL Server, SQLite, URL API, …
  11. Solutions
      Most common problems:
      1. Pausing & continuing the load process
      2. Updating data: knowing which data has changed
      3. Data consistency is horrible
      4. Large datasets
      5. Quick transforms
      6. Every source has a different format
      Unexpected challenges:
      7. Filtering data via different source(s)
      8. Type conversion
      9. A readable field map for every source type
  12. Solutions
      Problems:
      1. Pausing & continuing the load process
      2. Updating data: knowing which data has changed
      Every row gets a generated hash, computed from the final stage of its field values, and a unique id based on the marked* fields in the configuration file. The row's primary key and the generated hash are saved into a pickle migration file. If the unique id already exists with the same hash, we skip the row; if the hash differs, we update the row.
      * Not required
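The skip/update decision described above can be sketched like this. The mechanism (hash per row, unique id from marked fields, pickle file between runs) follows the slide, but the names `row_hash`, `make_key`, and `Migration` are illustrative, not mETL's real API:

```python
import hashlib
import pickle
from pathlib import Path

def row_hash(row):
    """Hash the row's final field values to detect changed rows."""
    return hashlib.md5(repr(sorted(row.items())).encode("utf-8")).hexdigest()

def make_key(row, key_fields):
    """Build the unique id from the fields marked in the configuration."""
    return tuple(row[f] for f in key_fields)

class Migration:
    """Persists {unique id: hash} between runs in a pickle migration file."""

    def __init__(self, path):
        self.path = Path(path)
        self.seen = pickle.loads(self.path.read_bytes()) if self.path.exists() else {}

    def decide(self, row, key_fields):
        """Return 'skip', 'update', or 'insert' for this row."""
        key, digest = make_key(row, key_fields), row_hash(row)
        action = ("skip" if self.seen.get(key) == digest
                  else "update" if key in self.seen
                  else "insert")
        self.seen[key] = digest
        return action

    def save(self):
        """Write the state back so the next run can pause/continue."""
        self.path.write_bytes(pickle.dumps(self.seen))
```

Because the state file survives between runs, the same mechanism also covers pausing and continuing a load: rows already recorded come back as "skip".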
  13. Solutions
      Problems:
      3. Data consistency is horrible
      4. Large datasets
      5. Quick transforms
      mETL has 8 transforms to produce consistent data sets. These transforms are supported by statements and conditions. It uses the TARR* package, a non-parallel data flow language created by CEU Economics, with multi-thread support.
      * It will be open source as well.
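A condition-guarded transform, in the spirit of the statements and conditions mentioned above, can be sketched in plain Python. This is only an illustration of the idea, not the TARR language itself, and `when` is an invented helper:

```python
# Illustrative sketch: a transform guarded by a condition, as in
# "statements and conditions". `when` is a hypothetical helper, not TARR.

def when(condition, statement):
    """Apply `statement` to a row only when `condition` holds."""
    def transform(row):
        return statement(row) if condition(row) else row
    return transform

# Condition: the country field is empty; Statement: default it to "HU".
fill_country = when(
    lambda row: not row.get("country"),
    lambda row: {**row, "country": "HU"},
)
```

Chaining several such guarded transforms over every row is what produces a consistent data set from messy input.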
  14. Solutions
      Problems:
      6. Every source has a different format
      An easy YAML configuration file describes all fields and information. You don't have to write program code; just change the field mapping settings and run the console script again with a different source. These YAML files can be saved and edited whenever the source format changes. Easy to maintain.
  15. Solutions
      Problems:
      8. Type conversion
      Meaning: a field's start type != the field's final type, but all of a field's values have to share the same type. We have to manage type conversion at runtime with transforms:
      - ConvertType: change the Field's type and add a default value if necessary.
      - Set: set any value on the Field.
      After the transform process, we force the type change automatically.
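The two transforms and the final forced conversion could look roughly like this. The class names match the slide, but the signatures and the `enforce_type` helper are assumptions, not mETL's real API:

```python
# Hypothetical sketch of the ConvertType and Set transforms named on the
# slide; signatures and enforce_type are assumed, not mETL's real API.

class ConvertType:
    """Convert a field's value to a target type, falling back to a default."""

    def __init__(self, target_type, default=None):
        self.target_type = target_type
        self.default = default

    def __call__(self, value):
        try:
            return self.target_type(value)
        except (TypeError, ValueError):
            return self.default  # value did not fit the target type

class Set:
    """Unconditionally set a field to a fixed value."""

    def __init__(self, value):
        self.value = value

    def __call__(self, _old_value):
        return self.value

def enforce_type(row, field, target_type):
    """Final stage: force the declared type after all transforms have run."""
    row[field] = target_type(row[field]) if row[field] is not None else None
    return row
```

The final `enforce_type` pass is what guarantees that every value in a field ends up with the same type, regardless of which transforms touched it.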
  16. Solutions
      Problems:
      9. A readable field map for every source type
      Convert every source input into dicts and lists, and give a route to the data. If an item on the route does not exist, we set the field's value to empty.
      E.g. in XML: pictures/picture/0/picture_thumb/url
      E.g. in CSV, XLS, …: 5
      E.g. in a database: created
      Easy to learn, because we don't use different keywords (e.g.: index, route, cell, …) for every source type.
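Route lookup over the nested dicts and lists can be sketched as a single walk. The route syntax follows the slide's examples; the function name `resolve_route` and the empty-value fallback are assumptions, not mETL's real API:

```python
# Minimal sketch of route-based field lookup over nested dicts and lists,
# as described on the slide; resolve_route is an assumed name, not mETL's API.

def resolve_route(data, route, empty=""):
    """Walk an 'a/b/0/c'-style route; return `empty` if any step is missing."""
    current = data
    for step in str(route).split("/"):
        if isinstance(current, list):
            try:
                current = current[int(step)]   # numeric steps index lists
            except (ValueError, IndexError):
                return empty
        elif isinstance(current, dict):
            if step not in current:
                return empty
            current = current[step]            # name steps look up dict keys
        else:
            return empty                       # route runs past a scalar
    return current
```

The same function covers all three slide examples: a slash route for XML, a bare column index like 5 for CSV/XLS, and a column name like created for a database row.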