Extract, Transform, Load using mETL

This talk was presented at PyCon Sei in Florence. #pycon6

mETL is an ETL package written in Python. The program can be used to load practically any kind of data into any target. The code is open source and available to anyone who wants to use it. Its main advantage is that it is configurable via YAML files; you can write any transformation in Python, and you can use the tool natively from any framework as well.

We are using this tool in production for many of our clients, and it is really stable and reliable. The project currently has a few contributors from around the world, and I hope many more developers will join soon. I want to introduce this tool to you.

In this presentation I will show you the functionality and the common use cases. Furthermore, I will talk about other ETL tools in Python.

Bence Faludi

April 17, 2015

Transcript

  1. Mito is a Budapest-based agency and digital powerhouse. We believe
    in the power of expertise, dedication, honesty and, well, chemistry.
  2. We are Mito. Our competences: data mining, front-end & back-end
    development, concept creation, graphic design, content creation,
    SEO & SEM, CRM & data analysis, ATL production, strategic
    consulting, social media management, quality assurance, media
    planning & buying, UX research & design, website hosting &
    maintenance.
  3. Bence Faludi, Sr. Database Manager, @bfaludi. Data warehouse
    specialist, IT project manager, Python expert. Built databases for
    Nissan; led IT development for Union Insurance, Central European
    University, Procter & Gamble and Profession.hu. Organizer of the
    Budapest Database Meetup and co-organizer of the Budapest.py
    Meetup. Creator of the mETL business intelligence tool. Bence has
    more than 9 years of experience in development and project
    management. Team roles: data warehouse maintenance, data cleansing
    & deduplication, data reporting, data mining, BI software
    development in Python, data visualisation, data analysis, data
    enrichment, CRM development.
  4. Python for Data. A rich data ecosystem is quickly evolving in
    Python; a huge number of data-related libraries are currently
    available.
  5. Python for Data Ecosystem (overview diagram): Data Harvesting
    (Scrapy), Data Cleansing, Analyzing Data, Machine Learning (Scikit
    Learn, Scikit Image), Visualisation (Plotly, Matplotlib), Data
    Reporting & Publish (CKAN, Cubes), ETL (mETL, Bubbles, Luigi,
    rdc-etl, petl); other libraries shown: NumPy, SciPy, Pandas, NLTK,
    IPython, Numba, Bokeh, SymPy, mply, PyBrain, Vispy, NetworkX.
  7. Extract, Transform and Load: 1) extracting data from outside
    sources; 2) transforming it to fit operational needs, which can
    include quality levels; 3) loading it into the end target.
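    A minimal sketch of the three steps in plain Python (the file and
    field names here are illustrative assumptions, not from the deck):

        import csv

        # Extract: read rows from an outside source (a CSV file with
        # 'name' and 'age' columns is assumed).
        with open('people.csv') as f:
            rows = list(csv.DictReader(f))

        # Transform: fit the data to operational needs, e.g. clean up
        # and type-cast the values.
        for row in rows:
            row['name'] = row['name'].strip().title()
            row['age'] = int(row['age'])

        # Load: write the finished records into the end target.
        with open('people_clean.csv', 'w') as f:
            writer = csv.DictWriter(f, fieldnames=['name', 'age'])
            writer.writeheader()
            writer.writerows(rows)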
  8. Python for Data: ETL libraries.
        PETL (2011): general purpose Python package for extracting,
        transforming and loading tables of data.
        Brewery (2011): understandable data streaming, auditing and
        mining framework.
        mETL (2013): versatile loader with easy configuration and
        mini-programming.
        Bubbles (2013): Python framework for data processing and data
        quality measurement.
        rdc-etl (2013): ETL toolkit for Python 2.7+. It gives you all
        the tools needed to create complex data integration jobs.
        Luigi (2014): builds complex pipelines of batch jobs. It
        handles dependency resolution, workflow management,
        visualization, handling failures, command line integration and
        much more.
  9. Python for Data: PETL. "This package is designed primarily for
    convenience and ease of use, especially when working interactively
    with data that are unfamiliar, heterogeneous and/or of mixed
    quality." ETL pipelines; functional and OOP style; interactive
    use. Example:

        >>> import petl as etl
        >>> table = (
        ...     etl
        ...     .fromcsv('example.csv')
        ...     .convert('foo', 'upper')
        ...     .convert('bar', int)
        ...     .convert('baz', float)
        ...     .addfield('quux', lambda row: row.bar * row.baz)
        ... )
        >>> table.look()
        +-----+-----+-----+--------------------+
        | foo | bar | baz | quux               |
        +=====+=====+=====+====================+
        | 'A' |   1 | 3.4 |                3.4 |
        +-----+-----+-----+--------------------+
        | 'B' |   2 | 7.4 |               14.8 |
        +-----+-----+-----+--------------------+
        | 'C' |   6 | 2.2 | 13.200000000000001 |
        +-----+-----+-----+--------------------+
        | 'D' |   9 | 8.1 |  72.89999999999999 |
        +-----+-----+-----+--------------------+
  10. Python for Data: PETL (continued). ETL pipelines; functional and
    OOP style; interactive use. Read from / write to: delimited files,
    pickle, text files, XML, HTML, JSON, databases, Excel, NumPy
    arrays, Pandas DataFrames. Transformations: header manipulation,
    converting, selecting, regular expressions, unpacking compound
    values, sorting, joins, deduplicating rows, reducing rows,
    reshaping tables, filling missing values, etc.
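    A small sketch chaining a few of these transformations together
    (the file people.csv and its fields are assumptions, not from the
    deck):

        import petl as etl

        table = (
            etl
            .fromcsv('people.csv')                 # read a delimited file
            .select(lambda row: row['age'] != '')  # selecting: drop incomplete rows
            .convert('age', int)                   # converting: type-cast a field
            .sort('age', reverse=True)             # sorting
            .distinct()                            # deduplicating rows
        )
        etl.tocsv(table, 'people_clean.csv')       # write to a CSV target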
  11. Python for Data: LUIGI. Batch data processing with data flow
    support; Hadoop integration; data flow visualisation; command line
    integration; dependency definitions; task templating using OOP.
    Example, aggregating artist streams:

        class AggregateArtists(luigi.Task):
            date_interval = luigi.DateIntervalParameter()

            def output(self):
                return luigi.LocalTarget(
                    "data/artist_streams_%s.tsv" % self.date_interval
                )

            def requires(self):
                return [Streams(date) for date in self.date_interval]

            def run(self):
                artist_count = defaultdict(int)
                for input in self.input():
                    with input.open('r') as in_file:
                        for line in in_file:
                            timestamp, artist, track = line.strip().split()
                            artist_count[artist] += 1
                with self.output().open('w') as out_file:
                    for artist, count in artist_count.iteritems():
                        print >> out_file, artist, count
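    The deck does not show how the task is launched; a common entry
    point from Luigi's documentation of that era (treat the exact
    command line as an assumption) is luigi.run() at the bottom of the
    module:

        # Hypothetical module footer: allows the task above to be
        # started from the command line, e.g.
        #   python aggregate_artists.py AggregateArtists --date-interval 2014-W10
        if __name__ == '__main__':
            luigi.run()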
  13. mETL. We created a tool from scratch, based on experiences with
    Brewery. Funded by the European Union and written in Python.
  14. mETL features: works with 9 source types and 11 target types;
    over 35 built-in transformations; no GUI, configuration is in YAML
    format; checks differences between migrations; quick
    transformations and manipulations; easy to extend.
  15. mETL components. Every process starts with a source from which
    the data are retrieved; each source type has its own settings.
    After the data is read from the source and the transformations are
    completed, the finalized record reaches the target, which writes
    the final data to the output.
  16. mETL source. The source is mandatory and is responsible for the
    following: describing the source type and the format of the file
    containing the data; describing the processed fields; defining the
    interlocking/mapping between them.
  17. mETL source type. Describes the source type and the format of
    the file containing the data. Data retrieval from CSV, Database,
    Fixed Width Text, Google Spreadsheet, JSON, XLS, XML or YAML.
    Defines the selected type's own settings (e.g. delimiter and quote
    for CSV).
  18. mETL source fields. Describes the processed fields. Each field
    possesses a unique name and a type (e.g. Boolean, Date, Float,
    Integer, String, …). Each field can define transformations (e.g.
    Title, UpperCase, Homogenize, Map, …).
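    A small sketch of a field definition with transforms, reusing the
    construction that appears in the example on slide 25 (the wildcard
    import follows that slide; the exact module layout is an
    assumption):

        from metl.utils import *

        # A String field whose raw value is lower-cased and homogenized
        # before it reaches the target.
        field = Field('nfkd_name', StringFieldType(), transforms = [
            LowerCaseTransform,
            HomogenizeTransform
        ])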
  19. mETL source map. Defines the interlocking/mapping between data
    and fields. A map is a path to the "data"; the path can contain
    words, numbers (indices) and combinations of them, divided by a /.
    Example (CSV), where country_code maps to index 0, domain to 1 and
    popularity to 2:

        country_code,domain,popularity
        HU,freemail.hu,22.5168436096366
        HU,gmail.com,21.2331563903634
        HU,t-online.hu,8.00581870151082
        HU,citromail.hu,5.38485095957534
        HU,hotmail.com,2.39638628011433
        HU,chello.hu,2.31472029399755
        HU,yahoo.com,2.27133523887301
        HU,vipmail.hu,1.76347488770927
        HU,invitel.hu,1.41129032258065
        HU,t-email.hu,0.939158840342997
        HU,upcmail.hu,0.737545937117191
        HU,mailbox.hu,0.489995916700694
        HU,enternet.hu,0.46192323397305
        HU,vnet.hu,0.456819109840751
        HU,mail.datanet.hu,0.431298489179257
        HU,axelero.hu,0.421090240914659
        HU,fibermail.hu,0.352184565128624
        HU,index.hu,0.341976316864026
        HU,pr.hu,0.313903634136382
        HU,monornet.hu,0.293487137607187
        HU,freestart.hu,0.290935075541037
  20. mETL source map. Example (JSON), where the name of the venue
    maps to response/venue/name and the unread notification count maps
    to notifications/0/item/unreadCount:

        {
            "meta":{
                "code":200
            },
            "notifications":[
                {
                    "type":"notificationTray",
                    "item":{
                        "unreadCount":0
                    }
                }
            ],
            "response":{
                "venue":{
                    "id":"40a55d80f964a52020f31ee3",
                    "name":"Clinton St. Baking Co. & Restaurant",
                    "contact":{
                        "phone":"+16466026263",
                        "formattedPhone":"+1 646-602-6263"
                    },
                    "location":{
                        "address":"4 Clinton St",
                        ...
  21. mETL source map: more complex mapping. mETL uses the dm package
    for mapping.

        data = dm.Mapper([
            {
                'id': 1, 'active': True, 'name': 'Steve',
                'others': { 'age': 41, 'childs': { 'year': 1991 } },
            },
            {
                'id': 2, 'active': False, 'name': 'Peter',
                'others': { 'age': 31, 'childs': [{ 'year': 1999 }, { 'year': 1992 }] },
            },
            {
                'id': 3, 'active': True, 'name': 'Bruce',
                'others': { 'age': 45, 'childs': [{ 'year': 1987 }, { 'year': 1987 }] },
            },
        ])

    Example paths and their results:

        */name                     ['Steve', 'Peter', 'Bruce']
        */others/childs/!/*/year   [[1991], [1999, 1992], [1987, 1987]]
        id=2/name                  'Peter'
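    To make the path syntax concrete, here is the first path expressed
    in plain Python (an illustration of the semantics only, not the dm
    API):

        records = [
            {'id': 1, 'name': 'Steve'},
            {'id': 2, 'name': 'Peter'},
            {'id': 3, 'name': 'Bruce'},
        ]
        # '*' iterates over every record and 'name' selects that key,
        # so '*/name' yields one value per record:
        print [r['name'] for r in records]   # ['Steve', 'Peter', 'Bruce']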
  22. mETL source workflow: 1) the program reads the given source
    file; 2) line by line, it fills in the fields with values, with
    the help of the mapping; 3) the different transformations are
    carried out individually on each field.
  23. mETL source example. The YAML configuration:

        source:
          source: CSV
          resource: source/names.csv
          skipRows: 1
          headerRow: 0
          fields:
            - name: country_code
            - name: name
            - name: type
            - name: gender
            - name: population
              type: Integer

    The same in Python:

        from metl.utils import *

        s = CSVSource(FieldSet([
            Field('country_code', StringFieldType()),
            Field('name', StringFieldType()),
            Field('type', StringFieldType()),
            Field('gender', StringFieldType()),
            Field('population', IntegerFieldType())
        ]), skipRows = 1, headerRow = 0)
        s.setResource('source/names.csv')
        s.initialize()
        for fs in s.getRecords():
            print fs.getValues()
        s.finalize()
  24. mETL target. A target is required for every process, and only
    one instance of it can exist. The target is responsible for the
    following: writing or updating the data in the selected storage;
    defining the selected type's own settings (e.g. delimiter and
    quote for CSV).
  25. mETL example. The YAML configuration:

        source:
          source: CSV
          resource: source/names.csv
          skipRows: 1
          headerRow: 0
          fields:
            - name: country_code
            - name: name
            - name: nfkd_name
              map: name
              transforms:
                - transform: Homogenize
                - transform: LowerCase
            - name: type
            - name: gender
            - name: population
              type: Integer

        target:
          type: JSON
          compact: false
          rootIterator: records
          resource: output.json

    Execute:

        $ metl config.yml

    The same in Python:

        from metl.utils import *

        s = CSVSource(FieldSet([
            Field('country_code', StringFieldType()),
            Field('name', StringFieldType()),
            Field('nfkd_name', StringFieldType(), transforms = [
                LowerCaseTransform,
                HomogenizeTransform
            ]),
            Field('type', StringFieldType()),
            Field('gender', StringFieldType()),
            Field('population', FloatFieldType())
        ], FieldMap({
            'nfkd_name': 'name',
            'name': 'name',
            'country_code': 'country_code',
            'gender': 'gender',
            'population': 'population'
        })), skipRows = 1, headerRow = 0)
        s.setResource('source/names.csv')

        t = JSONTarget(s, rootIterator = 'records', compact = False)
        t.setResource('output.json')
        t.initialize()
        t.write()
  26. mETL manipulations. After the whole line has been processed, the
    values are in the fields and the field-level transforms are done;
    at that point it is possible to manipulate the entire cleaned
    record based on the correlations between its values.
  27. mETL manipulations: modifier. Modifiers are objects that are
    given a whole line (record) and always return a whole line. During
    processing they change values using the related values of other
    fields.
  28. mETL manipulations: filter. Its function is primarily filtering.
    It is used when we want to evaluate or get rid of records that are
    incomplete or faulty as a result of an earlier transformation.
  29. mETL manipulations: expand. Used for expansion, when we want to
    add additional values after the current source, or to break
    list-type elements up into separate lines.
  30. mETL manipulations: aggregator. Used to create groups and to
    calculate information from them. Aggregators often act as filters
    or modifiers as well, since in several cases they delete lines or
    columns, or modify and collect given values.
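    The four manipulation kinds, sketched as plain Python callables
    (an illustration of the concepts only, not the mETL class API; the
    field names reuse earlier examples):

        def modify(record):
            # Modifier: record in, record out; a value may be derived
            # from the related values of other fields.
            record['name'] = record['name'].strip().title()
            return record

        def keep(record):
            # Filter: decide whether an incomplete or faulty record
            # survives.
            return record['population'] is not None

        def expand(record):
            # Expand: break a list-type element up into separate lines.
            for child in record.pop('childs', []):
                yield dict(record, child_year=child['year'])

        def aggregate(records):
            # Aggregator: group records and calculate information from
            # each group.
            totals = {}
            for r in records:
                totals[r['country_code']] = \
                    totals.get(r['country_code'], 0) + r['population']
            return totals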
  31. mETL migrations & differences. While an mETL script runs, it is
    possible to supply a previous migration file and to generate a new
    one. The metl-differences script can compare migration files and
    write out the keys of the elements that are to be deleted,
    updated, added or left unchanged by the migration.
  32. Example I. Geolocate given addresses from an Excel file and
    write the result into a PostgreSQL database.
  33. Example II. Download from Foursquare the JSON files of all
    venues located near the conference, and put all contact
    information into a CSV file.
  34. Example III. Task: get a full product export every day and
    update your database on a daily basis.