Slide 1

Slide 1 text

Extract, Transform, Load using mETL. Created by: Bence Faludi

Slide 2

Slide 2 text

Mito is a Budapest-based agency and digital powerhouse. We believe in the power of expertise, dedication, honesty, and, well, chemistry.

Slide 3

Slide 3 text

FOUNDED IN 2008

Slide 4

Slide 4 text

WITH TWO SIMPLE OBJECTIVES: CREATE CLEVER THINGS

Slide 5

Slide 5 text

ASSEMBLE AN AMAZING TEAM

Slide 6

Slide 6 text

We are Mito. Our Competences: Data Mining, Development (Front-End & Back-End), Concept Creation, Graphic Design, Content Creation, SEO & SEM, CRM & Data Analysis, ATL Production, Strategic Consulting, Social Media Management, Quality Assurance, Media Planning & Buying, UX Research & Design, Website Hosting & Maintenance.

Slide 7

Slide 7 text

We are Mito. Clients we are proud of.

Slide 8

Slide 8 text

Bence Faludi, Sr. Database Manager, @bfaludi. Data warehouse specialist, IT project manager, Python expert. Built databases for Nissan; led IT development for Union Insurance, Central European University, Procter & Gamble and Profession.hu. Organizer of the Budapest Database Meetup and co-organizer of the Budapest.py Meetup. Creator of the mETL business intelligence tool! Bence has more than 9 years of experience in development and project management!

Team roles, Data: Data warehouse maintenance, Data cleansing & deduplication, Data reporting, Data mining, BI software development in Python, Data visualisation, Data analysis, Data enrichment, CRM development.

Slide 9

Slide 9 text

PYTHON for DATA. A rich data ecosystem is quickly evolving in Python; a huge number of data-related libraries are currently available.

Slide 10

Slide 10 text

Python for Data Ecosystem (diagram). Data Harvesting: Scrapy. Data Cleansing and Analyzing Data: NumPy, SciPy, Pandas, NLTK, IPython, SymPy, Numba, NetworkX. Visualisation: Plotly, Matplotlib, Bokeh, Vispy, mply. Machine Learning: Scikit-Learn, Scikit-Image, PyBrain. Data Reporting, Publish: CKAN, Cubes. ETL: mETL, Bubbles, Luigi, rdc-etl, petl.

Slide 11

Slide 11 text

(The Python for Data ecosystem diagram is repeated from Slide 10.)

Slide 12

Slide 12 text

Extract, Transform and Load.
1. Extracting data from outside sources.
2. Transforming it to fit operational needs, which can include quality levels.
3. Loading it into the end target.
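To make the three steps concrete, here is a minimal, library-agnostic sketch in plain Python; the products.csv file and its name and price columns are hypothetical:

import csv
import sqlite3

# 1. Extract: read raw rows from an outside source (a CSV file here).
with open('products.csv') as f:
    rows = list(csv.DictReader(f))

# 2. Transform: reshape the values to fit operational needs.
for row in rows:
    row['name'] = row['name'].strip().title()
    row['price'] = float(row['price'])

# 3. Load: write the cleaned rows into the end target.
conn = sqlite3.connect('warehouse.db')
conn.execute('CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)')
conn.executemany('INSERT INTO products VALUES (:name, :price)', rows)
conn.commit()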

Slide 13

Slide 13 text

Python for Data: ETL libraries.
2011, PETL: General purpose Python package for extracting, transforming and loading tables of data.
2011, BREWERY: Understandable data streaming, auditing and mining framework.
2013, METL: Versatile loader with easy configuration and mini-programming.
2013, BUBBLES: Python framework for data processing and data quality measurement.
2013, RDC-ETL: ETL toolkit for Python 2.7+. It gives you all the tools needed to create complex data integration jobs.
2014, LUIGI: Builds complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, failure handling, command line integration and much more.

Slide 14

Slide 14 text

Python for Data: PETL. "This package is designed primarily for convenience and ease of use, especially when working interactively with data that are unfamiliar, heterogeneous and/or of mixed quality." ETL pipelines. Functional and OOP style. Interactive use.

Example:

>>> import petl as etl
>>> table = (
...     etl
...     .fromcsv('example.csv')
...     .convert('foo', 'upper')
...     .convert('bar', int)
...     .convert('baz', float)
...     .addfield('quux', lambda row: row.bar * row.baz)
... )
>>> table.look()
+-----+-----+-----+--------------------+
| foo | bar | baz | quux               |
+=====+=====+=====+====================+
| 'A' |   1 | 3.4 |                3.4 |
+-----+-----+-----+--------------------+
| 'B' |   2 | 7.4 |               14.8 |
+-----+-----+-----+--------------------+
| 'C' |   6 | 2.2 | 13.200000000000001 |
+-----+-----+-----+--------------------+
| 'D' |   9 | 8.1 |  72.89999999999999 |
+-----+-----+-----+--------------------+

Slide 15

Slide 15 text

Python for Data: PETL. "This package is designed primarily for convenience and ease of use, especially when working interactively with data that are unfamiliar, heterogeneous and/or of mixed quality." ETL pipelines. Functional and OOP style. Interactive use.

Read from / Write to: Delimited files, Pickle, Text files, XML, HTML, JSON, Database, Excel, NumPy Arrays, Pandas DataFrame.

Transformations: Header, Converting, Selecting, Regular expressions, Unpacking compound values, Sorting, Joins, Deduplicating rows, Reducing rows, Reshaping tables, Filling missing values, etc.
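As a small sketch of a few of these transformations chained together (the file and field names are hypothetical): selecting rows, deduplicating, sorting, then writing to JSON.

>>> import petl as etl
>>> table = (
...     etl
...     .fromcsv('people.csv')
...     .select(lambda row: row.country_code == 'HU')  # selecting
...     .distinct()                                    # deduplicating rows
...     .sort('name')                                  # sorting
... )
>>> table.tojson('people.json')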

Slide 16

Slide 16 text

Python for Data: LUIGI. Batch data processing with data flow support. Hadoop integration. Data flow visualisation. Command line integration. Dependency definitions. Task templating using OOP.

Example // Aggregate artist streams:

import luigi
from collections import defaultdict

class AggregateArtists(luigi.Task):
    date_interval = luigi.DateIntervalParameter()

    def output(self):
        return luigi.LocalTarget(
            "data/artist_streams_%s.tsv" % self.date_interval
        )

    def requires(self):
        # Streams is a separate task defined elsewhere in the pipeline.
        return [Streams(date) for date in self.date_interval]

    def run(self):
        artist_count = defaultdict(int)
        for input in self.input():
            with input.open('r') as in_file:
                for line in in_file:
                    timestamp, artist, track = line.strip().split()
                    artist_count[artist] += 1
        with self.output().open('w') as out_file:
            for artist, count in artist_count.iteritems():
                print >> out_file, artist, count
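For reference, a task like this is usually launched from the command line with Luigi's local scheduler; the module name and date interval below are hypothetical, and the script is assumed to end with a luigi.run() entry point:

$ python aggregate_artists.py AggregateArtists --local-scheduler --date-interval 2014-06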

Slide 17

Slide 17 text

(The Luigi example is repeated from Slide 16.)

Slide 18

Slide 18 text

mETL. We created a tool from scratch based on our experience with Brewery. Funded by the European Union and written in Python.

Slide 19

Slide 19 text

mETL FEATURES.
Works with 9 source types and 11 target types.
Over 35 built-in transformations.
No GUI; configuration in YAML format.
Checks differences between migrations.
Quick transformations and manipulations.
Easy to extend.

Slide 20

Slide 20 text

mETL COMPONENTS. All processes start with a source file from which the data are retrieved. There are distinct source types, each with its own settings. After the data is read from the source and the transformations are completed, the finalized record is passed to the Target, which writes the final data to the output file.

Slide 21

Slide 21 text

mETL SOURCE. The source is mandatory and is responsible for the following:
Description of the source type and the format of the file containing the data.
Description of the processed fields.
Definition of the interlocking/mapping between them.

Slide 22

Slide 22 text

mETL Source TYPE. Description of the source type and the format of the file containing the data.
Data retrieval from CSV, Database, Fixed Width Text, Google Spreadsheet, JSON, XLS, XML, YAML.
Definition of the selected type's own settings (e.g. delimiter and quote for CSV), as sketched below.
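A sketch of such type-specific settings in the YAML configuration; the source and resource keys mirror the later example slides, while the delimiter and quote key names are assumptions:

source:
  source: CSV
  resource: source/data.csv   # hypothetical file
  delimiter: ';'              # assumed key name for the CSV delimiter
  quote: '"'                  # assumed key name for the CSV quote character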

Slide 23

Slide 23 text

mETL Source FIELDS. Description of the processed fields.
Each field has a unique name and a type (e.g. Boolean, Date, Float, Integer, String, …).
Each field can define transformations (e.g. Title, UpperCase, Homogenize, Map, …).

Slide 24

Slide 24 text

mETL Source MAP. Definition of the interlocking/mapping between data and fields. A map is a path to the "data"; the path can contain words, numbers (indices) and combinations of them, divided by a /.

Example // CSV:

country_code,domain,popularity
HU,freemail.hu,22.5168436096366
HU,gmail.com,21.2331563903634
HU,t-online.hu,8.00581870151082
HU,citromail.hu,5.38485095957534
HU,hotmail.com,2.39638628011433
HU,chello.hu,2.31472029399755
HU,yahoo.com,2.27133523887301
HU,vipmail.hu,1.76347488770927
HU,invitel.hu,1.41129032258065
HU,t-email.hu,0.939158840342997
HU,upcmail.hu,0.737545937117191
HU,mailbox.hu,0.489995916700694
HU,enternet.hu,0.46192323397305
HU,vnet.hu,0.456819109840751
HU,mail.datanet.hu,0.431298489179257
HU,axelero.hu,0.421090240914659
HU,fibermail.hu,0.352184565128624
HU,index.hu,0.341976316864026
HU,pr.hu,0.313903634136382
HU,monornet.hu,0.293487137607187
HU,freestart.hu,0.290935075541037

Mapping: Country Code: 0, Domain: 1, Popularity: 2.
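A sketch of this mapping in the YAML configuration, reusing the map key shown on the later example slides (the resource and field names are hypothetical):

source:
  source: CSV
  resource: source/popularity.csv
  skipRows: 1
  fields:
    - name: country_code
      map: 0
    - name: domain
      map: 1
    - name: popularity
      map: 2
      type: Float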

Slide 25

Slide 25 text

mETL Source MAP. Definition of the interlocking/mapping between data and fields. A map is a path to the "data"; the path can contain words, numbers (indices) and combinations of them, divided by a /.

Example // JSON:

{
  "meta": {
    "code": 200
  },
  "notifications": [
    {
      "type": "notificationTray",
      "item": {
        "unreadCount": 0
      }
    }
  ],
  "response": {
    "venue": {
      "id": "40a55d80f964a52020f31ee3",
      "name": "Clinton St. Baking Co. & Restaurant",
      "contact": {
        "phone": "+16466026263",
        "formattedPhone": "+1 646-602-6263"
      },
      "location": {
        "address": "4 Clinton St",
        …

Mapping: Name of the venue: response/venue/name. Unread notification count: notifications/0/item/unreadCount.
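The same two paths expressed as field maps, as a sketch (the field names are hypothetical; the map key mirrors the later example slides):

fields:
  - name: venue_name
    map: response/venue/name
  - name: unread_count
    map: notifications/0/item/unreadCount
    type: Integer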

Slide 26

Slide 26 text

mETL Source MAP. More complex mapping: mETL uses the dm package for mapping.

import dm

data = dm.Mapper([
    {
        'id': 1, 'active': True, 'name': 'Steve',
        'others': { 'age': 41, 'childs': { 'year': 1991 } },
    },
    {
        'id': 2, 'active': False, 'name': 'Peter',
        'others': { 'age': 31, 'childs': [{ 'year': 1999 }, { 'year': 1992 }] },
    },
    {
        'id': 3, 'active': True, 'name': 'Bruce',
        'others': { 'age': 45, 'childs': [{ 'year': 1987 }, { 'year': 1987 }] },
    }
])

*/name → ['Steve', 'Peter', 'Bruce']
*/others/childs/!/*/year → [[1991], [1999, 1992], [1987, 1987]]
id=2/name → 'Peter'

Slide 27

Slide 27 text

mETL Source WORKFLOW.
1. The program reads the given source file.
2. Line by line, it fills the fields with values, with the help of the mapping.
3. The transformations are carried out individually on each field.

Slide 28

Slide 28 text

mETL Source EXAMPLE.

mETL YAML configuration:

source:
  source: CSV
  resource: source/names.csv
  skipRows: 1
  headerRow: 0
  fields:
    - name: country_code
    - name: name
    - name: type
    - name: gender
    - name: population
      type: Integer

Same in Python:

from metl.utils import *

s = CSVSource(FieldSet([
    Field('country_code', StringFieldType()),
    Field('name', StringFieldType()),
    Field('type', StringFieldType()),
    Field('gender', StringFieldType()),
    Field('population', IntegerFieldType())
]), skipRows = 1, headerRow = 0)
s.setResource('source/names.csv')
s.initialize()
for fs in s.getRecords():
    print fs.getValues()
s.finalize()

Slide 29

Slide 29 text

mETL TARGET. The target is required for every process, and only one instance of it can exist. The target is responsible for the following:
Writing or updating the data in the selected storage.
Definition of the selected type's own settings (e.g. delimiter and quote for CSV), as sketched below.
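As a sketch, a CSV target in the YAML configuration might look like this; the type and resource keys mirror the later example slide, while delimiter is an assumed type-specific setting:

target:
  type: CSV
  resource: output.csv
  delimiter: ';'   # assumed key name for a CSV-specific setting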

Slide 30

Slide 30 text

mETL EXAMPLE.

mETL YAML configuration:

source:
  source: CSV
  resource: source/names.csv
  skipRows: 1
  headerRow: 0
  fields:
    - name: country_code
    - name: name
    - name: nfkd_name
      map: name
      transforms:
        - transform: Homogenize
        - transform: LowerCase
    - name: type
    - name: gender
    - name: population
      type: Integer

target:
  type: JSON
  compact: false
  rootIterator: records
  resource: output.json

Execute:

$ metl config.yml

Same in Python:

from metl.utils import *

s = CSVSource(FieldSet([
    Field('country_code', StringFieldType()),
    Field('name', StringFieldType()),
    Field('nfkd_name', StringFieldType(), transforms = [
        LowerCaseTransform,
        HomogenizeTransform
    ]),
    Field('type', StringFieldType()),
    Field('gender', StringFieldType()),
    Field('population', FloatFieldType())
], FieldMap({
    'nfkd_name': 'name',
    'name': 'name',
    'country_code': 'country_code',
    'gender': 'gender',
    'population': 'population'
})), skipRows = 1, headerRow = 0)
s.setResource('source/names.csv')

t = JSONTarget(s, rootIterator = 'records', compact = False)
t.setResource('output.json')
t.initialize()
t.write()

Slide 31

Slide 31 text

mETL MANIPULATIONS. After the whole line has been processed, the values sit in the fields and the field-level transforms are done; at that point it is possible to manipulate the entire cleaned record based on the correlations between its values.

Slide 32

Slide 32 text

mETL MANIPULATIONS: modifier. Modifiers are objects that are given a whole line (record) and always return a whole line. During processing, however, they change values using the related values of other fields.
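A library-agnostic sketch of the idea, with records as plain dicts and hypothetical field names (this is not mETL's own modifier API):

def full_name_modifier(record):
    # Receives a whole record and returns a whole record,
    # changing one value using the related values of other fields.
    record = dict(record)
    record['full_name'] = '%s %s' % (record['first_name'], record['last_name'])
    return record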

Slide 33

Slide 33 text

mETL MANIPULATIONS: filter. Its function is primarily filtering: it is used when we want to evaluate or drop records that are incomplete or faulty as a result of an earlier transformation.
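Again as a library-agnostic sketch, not mETL's filter API: dropping records that an earlier step left incomplete.

def drop_incomplete(records, required=('name', 'country_code')):
    # Keep only records where every required field has a value.
    return [r for r in records if all(r.get(f) for f in required)]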

Slide 34

Slide 34 text

mETL MANIPULATIONS: expand. It is used for expansion: when we want to append additional values after the current source, or to break list-type elements up into separate lines.
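A library-agnostic sketch of breaking a list-type element up into separate lines, with hypothetical field names (not mETL's expand API):

def expand_childs(records):
    # Emit one output record per element of the list-typed 'childs' field.
    for record in records:
        for child in record.get('childs', []):
            row = dict(record)
            row['child_year'] = child['year']
            yield row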

Slide 35

Slide 35 text

mETL MANIPULATIONS: aggregator. It is used to create groups and calculate information from them. Aggregators often act as filters or modifiers as well, since in several cases they delete lines or columns, and modify or collect given values.
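A library-agnostic sketch of grouping and calculating, reusing the columns of the earlier CSV example (not mETL's aggregator API):

from collections import defaultdict

def popularity_by_country(records):
    # Group records by country and sum their popularity values.
    totals = defaultdict(float)
    for record in records:
        totals[record['country_code']] += float(record['popularity'])
    for country_code, popularity in totals.items():
        yield {'country_code': country_code, 'popularity': popularity}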

Slide 36

Slide 36 text

mETL MIGRATIONS & DIFFERENCES. While a mETL script runs, it is possible to supply an existing migration file and to generate a new one. The metl-differences script can compare migration files and write out the keys of the elements that were deleted / updated / added / unchanged between the migrations.
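The comparison itself amounts to set logic over record keys. A sketch, assuming each migration file reduces to a mapping from record key to content hash:

def migration_diff(old, new):
    # old / new: {record key: content hash}
    added = set(new) - set(old)
    deleted = set(old) - set(new)
    common = set(old) & set(new)
    updated = {k for k in common if old[k] != new[k]}
    unchanged = common - updated
    return added, updated, deleted, unchanged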

Slide 37

Slide 37 text

EXAMPLE I. Geolocate given addresses from an Excel file and write the result into a PostgreSQL database.
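A rough plain-Python sketch of the same task; xlrd, geopy and psycopg2 stand in for an Excel source, a geocoding step and the PostgreSQL target, and every file, table and column name here is hypothetical:

import xlrd
import psycopg2
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent='etl-example')
sheet = xlrd.open_workbook('addresses.xls').sheet_by_index(0)

conn = psycopg2.connect('dbname=geo')
cur = conn.cursor()
for i in range(1, sheet.nrows):  # skip the header row
    address = sheet.cell_value(i, 0)
    location = geolocator.geocode(address)
    if location is not None:
        cur.execute(
            'INSERT INTO addresses (address, lat, lng) VALUES (%s, %s, %s)',
            (address, location.latitude, location.longitude),
        )
conn.commit()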

Slide 38

Slide 38 text

mETL EXAMPLE I.

Slide 39

Slide 39 text

mETL EXAMPLE I.

Slide 40

Slide 40 text

mETL EXAMPLE I.

Slide 41

Slide 41 text

mETL EXAMPLE I.

Slide 42

Slide 42 text

mETL EXAMPLE I.

Slide 43

Slide 43 text

EXAMPLE II. Download the JSON files of all Foursquare venues located near the conference, and put all contact information into a CSV file.
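A rough plain-Python sketch of the same task; the venues/search parameters and credentials are placeholders, and the contact fields follow the JSON shown on the earlier mapping slide:

import csv
import requests

resp = requests.get(
    'https://api.foursquare.com/v2/venues/search',
    params={
        'll': '47.4979,19.0402',          # hypothetical conference location
        'client_id': 'CLIENT_ID',         # placeholder credentials
        'client_secret': 'CLIENT_SECRET',
        'v': '20140601',
    },
)
venues = resp.json()['response']['venues']

with open('contacts.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'phone'])
    for venue in venues:
        contact = venue.get('contact', {})
        writer.writerow([venue['name'], contact.get('formattedPhone', '')])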

Slide 44

Slide 44 text

mETL EXAMPLE II.

Slide 45

Slide 45 text

mETL EXAMPLE II.

Slide 46

Slide 46 text

mETL EXAMPLE II.

Slide 47

Slide 47 text

mETL EXAMPLE II.

Slide 48

Slide 48 text

EXAMPLE III. Task: you get a full product export every day and want to update your database on a daily basis.

Slide 49

Slide 49 text

mETL EXAMPLE III.

Slide 50

Slide 50 text

mETL EXAMPLE III.

Slide 51

Slide 51 text

mETL EXAMPLE III.

Slide 52

Slide 52 text

mETL EXAMPLE III. New: Yes.

Slide 53

Slide 53 text

mETL EXAMPLE III. New: Yes. Updated: Yes.

Slide 54

Slide 54 text

mETL EXAMPLE III. New: Yes. Updated: Yes. Deleted: *No.

Slide 55

Slide 55 text

THANK YOU FOR YOUR ATTENTION! Bence Faludi, @bfaludi. We are hiring!