Extract, Transform, Load using mETL

This talk was presented at PyCon Sei in Florence. #pycon6

mETL is an ETL package written in Python. The program can be used to load practically any kind of data into any target. The code is open source and available to anyone who wants to use it. Its main advantage is that it is configurable via YAML files; you can write any transformation in Python, and you can use the tool natively from any framework as well.

We are using this tool in production for many of our clients, and it is really stable and reliable. The project currently has a few contributors from around the world, and I hope many more developers will join soon. I want to introduce this tool to you.

In this presentation I will show you the functionality and the common use cases. Furthermore, I will talk about other ETL tools in Python.

Bence Faludi

April 17, 2015

Transcript

  1. Mito is a Budapest-based agency and digital powerhouse. We believe
    in the power of expertise, dedication, honesty and, well, chemistry.
  2. We are Mito. Our competences: data mining, front-end & back-end
    development, concept creation, graphic design, content creation,
    SEO & SEM, CRM & data analysis, ATL production, strategic
    consulting, social media management, quality assurance, media
    planning & buying, UX research & design, website hosting &
    maintenance.
  3. Bence Faludi, Sr. Database Manager, @bfaludi. Data warehouse
    specialist, IT project manager, Python expert. Built databases for
    Nissan; led IT development for Union Insurance, Central European
    University, Procter & Gamble and Profession.hu. Organizer of the
    Budapest Database Meetup and co-organizer of the Budapest.py
    Meetup. Creator of the mETL business intelligence tool. Bence has
    more than 9 years of experience in development and project
    management. Team roles: data warehouse maintenance, data cleansing
    & deduplication, data reporting, data mining, BI software
    development in Python, data visualisation, data analysis, data
    enrichment, CRM development.
  4. Python for Data. A rich data ecosystem is quickly evolving in
    Python; a huge number of data-related libraries are currently
    available.
  5. Python for Data Ecosystem (overview diagram): Data Harvesting
    (Scrapy), Data Cleansing, Analyzing Data, Machine Learning (Scikit
    Learn, Scikit Image), Visualisation (Plotly, Matplotlib), Data
    Reporting & Publish (CKAN, Cubes), ETL (mETL, Bubbles, Luigi,
    rdc-etl, petl); other libraries shown: NumPy, SciPy, Pandas, NLTK,
    IPython, Numba, Bokeh, SymPy, mply, PyBrain, Vispy, NetworkX.
  7. Extract, Transform and Load: 1) extracting data from outside
    sources; 2) transforming it to fit operational needs, which can
    include quality levels; 3) loading it into the end target.
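    A minimal sketch of the three steps in plain Python (the file and
    field names here are illustrative assumptions, not from the deck):

        import csv

        # Extract: read rows from an outside source (a CSV file with
        # 'name' and 'age' columns is assumed).
        with open('people.csv') as f:
            rows = list(csv.DictReader(f))

        # Transform: fit the data to operational needs, e.g. clean up
        # and type-cast the values.
        for row in rows:
            row['name'] = row['name'].strip().title()
            row['age'] = int(row['age'])

        # Load: write the finished records into the end target.
        with open('people_clean.csv', 'w') as f:
            writer = csv.DictWriter(f, fieldnames=['name', 'age'])
            writer.writeheader()
            writer.writerows(rows)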
  8. Python for Data: ETL libraries.
        PETL (2011): general purpose Python package for extracting,
        transforming and loading tables of data.
        Brewery (2011): understandable data streaming, auditing and
        mining framework.
        mETL (2013): versatile loader with easy configuration and
        mini-programming.
        Bubbles (2013): Python framework for data processing and data
        quality measurement.
        rdc-etl (2013): ETL toolkit for Python 2.7+. It gives you all
        the tools needed to create complex data integration jobs.
        Luigi (2014): builds complex pipelines of batch jobs. It
        handles dependency resolution, workflow management,
        visualization, handling failures, command line integration and
        much more.
  9. Python for Data: PETL. "This package is designed primarily for
    convenience and ease of use, especially when working interactively
    with data that are unfamiliar, heterogeneous and/or of mixed
    quality." ETL pipelines; functional and OOP style; interactive
    use. Example:

        >>> import petl as etl
        >>> table = (
        ...     etl
        ...     .fromcsv('example.csv')
        ...     .convert('foo', 'upper')
        ...     .convert('bar', int)
        ...     .convert('baz', float)
        ...     .addfield('quux', lambda row: row.bar * row.baz)
        ... )
        >>> table.look()
        +-----+-----+-----+--------------------+
        | foo | bar | baz | quux               |
        +=====+=====+=====+====================+
        | 'A' |   1 | 3.4 |                3.4 |
        +-----+-----+-----+--------------------+
        | 'B' |   2 | 7.4 |               14.8 |
        +-----+-----+-----+--------------------+
        | 'C' |   6 | 2.2 | 13.200000000000001 |
        +-----+-----+-----+--------------------+
        | 'D' |   9 | 8.1 |  72.89999999999999 |
        +-----+-----+-----+--------------------+
  10. Python for Data: PETL (continued). ETL pipelines; functional and
    OOP style; interactive use. Read from / write to: delimited files,
    pickle, text files, XML, HTML, JSON, databases, Excel, NumPy
    arrays, Pandas DataFrames. Transformations: header manipulation,
    converting, selecting, regular expressions, unpacking compound
    values, sorting, joins, deduplicating rows, reducing rows,
    reshaping tables, filling missing values, etc.
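    A small sketch chaining a few of these transformations together
    (the file people.csv and its fields are assumptions, not from the
    deck):

        import petl as etl

        table = (
            etl
            .fromcsv('people.csv')                 # read a delimited file
            .select(lambda row: row['age'] != '')  # selecting: drop incomplete rows
            .convert('age', int)                   # converting: type-cast a field
            .sort('age', reverse=True)             # sorting
            .distinct()                            # deduplicating rows
        )
        etl.tocsv(table, 'people_clean.csv')       # write to a CSV target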
  11. Python for Data: LUIGI. Batch data processing with data flow
    support; Hadoop integration; data flow visualisation; command line
    integration; dependency definitions; task templating using OOP.
    Example, aggregating artist streams:

        class AggregateArtists(luigi.Task):
            date_interval = luigi.DateIntervalParameter()

            def output(self):
                return luigi.LocalTarget(
                    "data/artist_streams_%s.tsv" % self.date_interval
                )

            def requires(self):
                return [Streams(date) for date in self.date_interval]

            def run(self):
                artist_count = defaultdict(int)
                for input in self.input():
                    with input.open('r') as in_file:
                        for line in in_file:
                            timestamp, artist, track = line.strip().split()
                            artist_count[artist] += 1
                with self.output().open('w') as out_file:
                    for artist, count in artist_count.iteritems():
                        print >> out_file, artist, count
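    The deck does not show how the task is launched; a common entry
    point from Luigi's documentation of that era (treat the exact
    command line as an assumption) is luigi.run() at the bottom of the
    module:

        # Hypothetical module footer: allows the task above to be
        # started from the command line, e.g.
        #   python aggregate_artists.py AggregateArtists --date-interval 2014-W10
        if __name__ == '__main__':
            luigi.run()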
  13. mETL. We created a tool from scratch, based on experiences with
    Brewery. Funded by the European Union and written in Python.
  14. mETL features: works with 9 source types and 11 target types;
    over 35 built-in transformations; no GUI, configuration is in YAML
    format; checks differences between migrations; quick
    transformations and manipulations; easy to extend.
  15. mETL components. Every process starts with a source from which
    the data are retrieved; each source type has its own settings.
    After the data is read from the source and the transformations are
    completed, the finalized record reaches the target, which writes
    the final data to the output.
  16. mETL source. The source is mandatory and is responsible for the
    following: describing the source type and the format of the file
    containing the data; describing the processed fields; defining the
    interlocking/mapping between them.
  17. mETL source type. Describes the source type and the format of
    the file containing the data. Data retrieval from CSV, Database,
    Fixed Width Text, Google Spreadsheet, JSON, XLS, XML or YAML.
    Defines the selected type's own settings (e.g. delimiter and quote
    for CSV).
  18. mETL source fields. Describes the processed fields. Each field
    possesses a unique name and a type (e.g. Boolean, Date, Float,
    Integer, String, …). Each field can define transformations (e.g.
    Title, UpperCase, Homogenize, Map, …).
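    A small sketch of a field definition with transforms, reusing the
    construction that appears in the example on slide 25 (the wildcard
    import follows that slide; the exact module layout is an
    assumption):

        from metl.utils import *

        # A String field whose raw value is lower-cased and homogenized
        # before it reaches the target.
        field = Field('nfkd_name', StringFieldType(), transforms = [
            LowerCaseTransform,
            HomogenizeTransform
        ])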
  19. mETL source map. Defines the interlocking/mapping between data
    and fields. A map is a path to the "data"; the path can contain
    words, numbers (indices) and combinations of them, divided by a /.
    Example (CSV), where country_code maps to index 0, domain to 1 and
    popularity to 2:

        country_code,domain,popularity
        HU,freemail.hu,22.5168436096366
        HU,gmail.com,21.2331563903634
        HU,t-online.hu,8.00581870151082
        HU,citromail.hu,5.38485095957534
        HU,hotmail.com,2.39638628011433
        HU,chello.hu,2.31472029399755
        HU,yahoo.com,2.27133523887301
        HU,vipmail.hu,1.76347488770927
        HU,invitel.hu,1.41129032258065
        HU,t-email.hu,0.939158840342997
        HU,upcmail.hu,0.737545937117191
        HU,mailbox.hu,0.489995916700694
        HU,enternet.hu,0.46192323397305
        HU,vnet.hu,0.456819109840751
        HU,mail.datanet.hu,0.431298489179257
        HU,axelero.hu,0.421090240914659
        HU,fibermail.hu,0.352184565128624
        HU,index.hu,0.341976316864026
        HU,pr.hu,0.313903634136382
        HU,monornet.hu,0.293487137607187
        HU,freestart.hu,0.290935075541037
  20. mETL source map. Example (JSON), where the name of the venue
    maps to response/venue/name and the unread notification count maps
    to notifications/0/item/unreadCount:

        {
            "meta":{
                "code":200
            },
            "notifications":[
                {
                    "type":"notificationTray",
                    "item":{
                        "unreadCount":0
                    }
                }
            ],
            "response":{
                "venue":{
                    "id":"40a55d80f964a52020f31ee3",
                    "name":"Clinton St. Baking Co. & Restaurant",
                    "contact":{
                        "phone":"+16466026263",
                        "formattedPhone":"+1 646-602-6263"
                    },
                    "location":{
                        "address":"4 Clinton St",
                        ...
  21. mETL source map: more complex mapping. mETL uses the dm package
    for mapping.

        data = dm.Mapper([
            {
                'id': 1, 'active': True, 'name': 'Steve',
                'others': { 'age': 41, 'childs': { 'year': 1991 } },
            },
            {
                'id': 2, 'active': False, 'name': 'Peter',
                'others': { 'age': 31, 'childs': [{ 'year': 1999 }, { 'year': 1992 }] },
            },
            {
                'id': 3, 'active': True, 'name': 'Bruce',
                'others': { 'age': 45, 'childs': [{ 'year': 1987 }, { 'year': 1987 }] },
            },
        ])

    Example paths and their results:

        */name                     ['Steve', 'Peter', 'Bruce']
        */others/childs/!/*/year   [[1991], [1999, 1992], [1987, 1987]]
        id=2/name                  'Peter'
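    To make the path syntax concrete, here is the first path expressed
    in plain Python (an illustration of the semantics only, not the dm
    API):

        records = [
            {'id': 1, 'name': 'Steve'},
            {'id': 2, 'name': 'Peter'},
            {'id': 3, 'name': 'Bruce'},
        ]
        # '*' iterates over every record and 'name' selects that key,
        # so '*/name' yields one value per record:
        print [r['name'] for r in records]   # ['Steve', 'Peter', 'Bruce']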
  22. mETL source workflow: 1) the program reads the given source
    file; 2) line by line, it fills in the fields with values, with
    the help of the mapping; 3) the different transformations are
    carried out individually on each field.
  23. mETL source example. The YAML configuration:

        source:
          source: CSV
          resource: source/names.csv
          skipRows: 1
          headerRow: 0
          fields:
            - name: country_code
            - name: name
            - name: type
            - name: gender
            - name: population
              type: Integer

    The same in Python:

        from metl.utils import *

        s = CSVSource(FieldSet([
            Field('country_code', StringFieldType()),
            Field('name', StringFieldType()),
            Field('type', StringFieldType()),
            Field('gender', StringFieldType()),
            Field('population', IntegerFieldType())
        ]), skipRows = 1, headerRow = 0)
        s.setResource('source/names.csv')
        s.initialize()
        for fs in s.getRecords():
            print fs.getValues()
        s.finalize()
  24. mETL target. A target is required for every process, and only
    one instance of it can exist. The target is responsible for the
    following: writing or updating the data in the selected storage;
    defining the selected type's own settings (e.g. delimiter and
    quote for CSV).
  25. mETL example. The YAML configuration:

        source:
          source: CSV
          resource: source/names.csv
          skipRows: 1
          headerRow: 0
          fields:
            - name: country_code
            - name: name
            - name: nfkd_name
              map: name
              transforms:
                - transform: Homogenize
                - transform: LowerCase
            - name: type
            - name: gender
            - name: population
              type: Integer

        target:
          type: JSON
          compact: false
          rootIterator: records
          resource: output.json

    Execute:

        $ metl config.yml

    The same in Python:

        from metl.utils import *

        s = CSVSource(FieldSet([
            Field('country_code', StringFieldType()),
            Field('name', StringFieldType()),
            Field('nfkd_name', StringFieldType(), transforms = [
                LowerCaseTransform,
                HomogenizeTransform
            ]),
            Field('type', StringFieldType()),
            Field('gender', StringFieldType()),
            Field('population', FloatFieldType())
        ], FieldMap({
            'nfkd_name': 'name',
            'name': 'name',
            'country_code': 'country_code',
            'gender': 'gender',
            'population': 'population'
        })), skipRows = 1, headerRow = 0)
        s.setResource('source/names.csv')

        t = JSONTarget(s, rootIterator = 'records', compact = False)
        t.setResource('output.json')
        t.initialize()
        t.write()
  26. mETL manipulations. After the whole line has been processed, the
    values are in the fields and the field-level transforms are done;
    at that point it is possible to manipulate the entire cleaned
    record based on the correlations between its values.
  27. mETL manipulations: modifier. Modifiers are objects that are
    given a whole line (record) and always return a whole line. During
    processing they change values using the related values of other
    fields.
  28. mETL manipulations: filter. Its function is primarily filtering.
    It is used when we want to evaluate or get rid of records that are
    incomplete or faulty as a result of an earlier transformation.
  29. mETL manipulations: expand. Used for expansion, when we want to
    add additional values after the current source, or to break
    list-type elements up into separate lines.
  30. mETL manipulations: aggregator. Used to create groups and to
    calculate information from them. Aggregators often act as filters
    or modifiers as well, since in several cases they delete lines or
    columns, or modify and collect given values.
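    The four manipulation kinds, sketched as plain Python callables
    (an illustration of the concepts only, not the mETL class API; the
    field names reuse earlier examples):

        def modify(record):
            # Modifier: record in, record out; a value may be derived
            # from the related values of other fields.
            record['name'] = record['name'].strip().title()
            return record

        def keep(record):
            # Filter: decide whether an incomplete or faulty record
            # survives.
            return record['population'] is not None

        def expand(record):
            # Expand: break a list-type element up into separate lines.
            for child in record.pop('childs', []):
                yield dict(record, child_year=child['year'])

        def aggregate(records):
            # Aggregator: group records and calculate information from
            # each group.
            totals = {}
            for r in records:
                totals[r['country_code']] = \
                    totals.get(r['country_code'], 0) + r['population']
            return totals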
  31. mETL migrations & differences. While an mETL script runs, it is
    possible to supply a previous migration file and to generate a new
    one. The metl-differences script can compare migration files and
    write out the keys of the elements that are to be deleted,
    updated, added or left unchanged by the migration.
  32. Example I. Geolocate given addresses from an Excel file and
    write the result into a PostgreSQL database.
  33. Example II. Download from Foursquare the JSON files of all
    venues located near the conference, and put all contact
    information into a CSV file.
  34. Example III. Task: get a full product export every day and
    update your database on a daily basis.