
Extract, Transform, Load using mETL

This presentation was given at the PyData 2014 conference in Berlin. It is a novice-level training to help you learn and use mETL in your daily work.

Bence Faludi

July 25, 2014

Transcript

  1. Mito is a Budapest-based agency and digital powerhouse. We believe

    in the power of expertise, dedication, honesty, and well, chemistry. Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip
  2. MAIN COMPETENCIES Concept creation, graphic design, content creation, UX

    research & design, front-end & back-end development, mobile development, website hosting & operation, SEO & SEM, data mining, CRM & data analysis, strategic consulting, quality assurance, media planning & buying, ATL production, social media management.
  3. DATA DEPARTMENT Data mining, data cleansing & deduplication, data

    warehouse maintenance, data visualisation, data analysis, CRM data reporting, BI software development, data migration.
  4. HELLO WORLD Bence Faludi (@bfaludi), Senior Database Manager. Data

    warehouse specialist, IT project manager, Python expert. Built databases for Nissan; led IT development for Union Insurance, Central European University, Procter & Gamble and Profession.hu. Organizer of the Budapest Database Meetup and creator of the mETL business intelligence tool. Bence has more than 9 years of experience in development and project management. Email: [email protected], Twitter: @bfaludi. Positions: Senior Database Manager @ Mito Europe; Organizer @ Budapest Database Meetup. Responsibilities: data warehouse design, mathematical predictions, data cleansing & analytics, data consulting, ETL & Python/Go development, IT project management.
  5. WAT? ‣ Built an ETL tool for CEU Economics MicroData

    to load elective data. ‣ Inspiration comes from Brewery and Kettle. ‣ Funded by the European Union. ‣ ETL with mini-programming. ‣ Versatile loader with easy configuration. ‣ Written in Python.
  6. MILESTONES Timeline (2013-2014): May - v0.1.4 alpha, first public

    release; Jun - v0.1.5 alpha, minor fixes; Jun - v0.1.6 beta, fixes and new features; Sep - v0.1.7 beta, running time reduction; Jan - v0.1.8 beta; Jul - v1.0 stable, English documentation. Reached 1k downloads / month, later 1.5k downloads / month. Goals now: ‣ recruit contributors ‣ additional new features ‣ get new users.
  7. FEATURES ‣ Works with 9 source types and 11 target

    types. ‣ Over 35 built-in transformations. ‣ No GUI; configuration in YAML format. ‣ Checks differences between migrations. ‣ Quick transformations and manipulations. ‣ Easy to extend.
  8. MAIN COMPONENTS Every process starts with a Source file from

    which the data is retrieved. There are distinct source types, each with its own settings. After the data has been read from the Source and the transformations are completed, the finalized record reaches the Target, which writes the final data into the output file.
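A rough, purely illustrative picture of this Source -> transform -> Target flow in plain Python (this is not mETL's API; the file names, field names and functions here are made up):

    import csv

    # Hypothetical illustration of the Source -> transform -> Target flow.
    # This is NOT mETL's API; file and field names are made up.

    def read_source(path):
        """Source: yields one raw record (dict) per line of a CSV file."""
        with open(path, newline='') as handle:
            for row in csv.DictReader(handle):
                yield row

    def transform(record):
        """Field-level transformations, e.g. trimming and title-casing."""
        record['name'] = record.get('name', '').strip().title()
        record['country'] = record.get('country', '').strip().upper()
        return record

    def write_target(records, path):
        """Target: writes the finalized records into the output file."""
        with open(path, 'w', newline='') as handle:
            writer = csv.DictWriter(handle, fieldnames=['name', 'country'],
                                    extrasaction='ignore')
            writer.writeheader()
            writer.writerows(records)

    # write_target((transform(r) for r in read_source('input.csv')), 'output.csv')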
  9. SOURCE The source is mandatory and responsible for the following:

    1. Description of the source type and the format of the file containing the data. 2. Description of the processed fields. 3. Definition of the mapping between the data and the fields.
  10. SOURCE - TYPE 1. Description of source type and format

    of the file containing the data. ‣ Data retrieval from CSV, Database, Fixed Width Text, Google Spreadsheet, JSON, XLS, XML, YAML. ‣ Definition of the selected type’s own settings. 
 (e.g.: delimiter, quote for CSV, etc.)
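As a small, hedged illustration (not mETL configuration syntax; the file name and settings below are made up), a CSV source's delimiter and quote settings correspond directly to what you would hand to Python's csv module:

    import csv

    # Illustration only: reading a CSV source whose type-specific settings
    # are a ';' delimiter and a '"' quote character. The file name is made up.
    with open('sales.csv', newline='') as handle:
        for row in csv.reader(handle, delimiter=';', quotechar='"'):
            print(row)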
  11. SOURCE - FIELDS 2. Description of the processed fields. ‣ Each

    field possesses a unique name and a type.
 (e.g.: Boolean, Date, Float, Integer, String, …) ‣ Each field describes its transformations.
 (e.g.: Title, UpperCase, Homogenize, Map, …) Hint: if a field is not necessary for the process, it does not have to be included, unless we want it to appear in the output. Fields into which we want to write values must be listed, because there is no way to add new fields during the process.
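As a rough mental model (not mETL's actual classes; the names below are hypothetical), a field with a name, a type and a chain of transformations can be pictured like this:

    # Hypothetical illustration of "a field has a name, a type and a chain
    # of transformations". These are not mETL's actual classes.

    class Field:
        def __init__(self, name, field_type, transforms=()):
            self.name = name
            self.field_type = field_type        # e.g. str, int, float
            self.transforms = list(transforms)  # applied in order

        def set_value(self, raw):
            value = self.field_type(raw)
            for transform in self.transforms:
                value = transform(value)
            return value

    # A Title-like and an UpperCase-like transformation chain:
    name_field = Field('name', str, transforms=[str.strip, str.title])
    code_field = Field('country_code', str, transforms=[str.strip, str.upper])

    print(name_field.set_value('  bence faludi '))  # -> 'Bence Faludi'
    print(code_field.set_value('hu'))               # -> 'HU'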
  12. SOURCE - MAP 3. Definition of the mapping between the data

    and the fields. ‣ A map is a path to the “data”. The path can contain words, numbers (indices) and combinations of them, separated by /.
  13. SOURCE - MAP What is the path to the “data”?

    Country Code ‣ 0, because it’s the first column. Gender ‣ 3, because it’s the 4th column.
  14. SOURCE - MAP What is the path to the “data”?

    Name of the venue ‣ response/venue/name Phone number ‣ response/venue/contact/phone Unread Notification ‣ notifications/0/item/unreadCount
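A hedged sketch of how such a /-separated path can be resolved against nested, JSON-like data (this only illustrates the idea, not mETL's implementation; the example payload is made up):

    # Illustration of resolving a '/'-separated map path against nested
    # JSON-like data. Not mETL's implementation; the payload is made up.

    def resolve(data, path):
        """Walk dicts by key and lists by numeric index."""
        current = data
        for part in path.split('/'):
            if isinstance(current, (list, tuple)):
                current = current[int(part)]
            else:
                current = current[part]
        return current

    payload = {
        'response': {'venue': {'name': 'Some Venue',
                               'contact': {'phone': '+49 ...'}}},
        'notifications': [{'item': {'unreadCount': 3}}],
    }

    print(resolve(payload, 'response/venue/name'))               # 'Some Venue'
    print(resolve(payload, 'notifications/0/item/unreadCount'))  # 3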
  15. SOURCE - WORKFLOW ‣ The program reads the given source

    file. ‣ Line by line, the program fills in the fields with values using the mapping. ‣ The transformations are carried out individually in each field.
  16. TARGET A Target is required for every process, and only one

    instance of it can exist. The Target is responsible for the following: ‣ Write or update the data into the selected target. ‣ Define the selected type’s settings. 
 (e.g.: delimiter, quote for CSV, etc.)
  17. TARGET - WORKFLOW ‣ After the record has reached the

    target object, the final line is saved in the given file type.
  18. COMPONENTS After the whole line has been processed, the values are

    in the fields and the field-level transformations are done; at this point it is possible to manipulate the entire cleaned record based on the correlations between its values.
  19. MANIPULATIONS Manipulate the entire record. Manipulation steps can follow each

    other in any order, regardless of the type. ‣ Modifier ‣ Filter ‣ Expander ‣ Aggregator
  20. MODIFIER Modifiers are objects that are given a whole

    line (record) and always return a whole line. During processing they change values using the related values of other fields.
  21. FILTER Its function is primarily filtering. It is used when

    we want to evaluate or discard records that are incomplete or faulty as a result of an earlier transformation.
  22. EXPAND It is used for expansion when we want to

    add additional values after the current source, or to break list-type elements up into separate lines.
  23. AGGREGATOR It is used to create groups and calculate information

    from them. Aggregators often act as Filters or Modifiers as well, since in many cases they delete lines or columns, or modify and collect given values.
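A purely illustrative sketch of the four manipulation types as plain Python generator stages (this is not mETL's API; the record layout, field names and logic below are made up):

    from itertools import groupby
    from operator import itemgetter

    # Made-up generator stages illustrating the four manipulation types.

    def modifier(records):
        """Modifier: gets a whole record, returns a whole record,
        changing values based on related fields."""
        for record in records:
            record['full_name'] = f"{record['first_name']} {record['last_name']}"
            yield record

    def record_filter(records):
        """Filter: drops incomplete or faulty records."""
        for record in records:
            if record.get('country'):
                yield record

    def expander(records):
        """Expander: breaks a list-type element up into separate lines."""
        for record in records:
            for email in record.get('emails', []):
                yield {**record, 'email': email}

    def aggregator(records):
        """Aggregator: groups records and calculates information per group."""
        by_country = itemgetter('country')
        for country, group in groupby(sorted(records, key=by_country), key=by_country):
            yield {'country': country, 'count': sum(1 for _ in group)}

    # The stages can follow each other in any order, e.g. (read_records is hypothetical):
    # results = aggregator(record_filter(expander(modifier(read_records()))))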
  24. MIGRATIONS While the mETL script runs, it is possible to

    define a migration file and to generate a new one. The metl-differences script can compare migration files and write out the keys of the elements that are to be deleted / updated / added / unchanged during the migration.
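As a hedged illustration of the idea only (not how metl-differences is implemented; the snapshots below are made up), comparing two key/value snapshots into deleted / updated / added / unchanged sets can look like this:

    # Illustration of the deleted / updated / added / unchanged comparison.
    # Not the metl-differences implementation; the snapshots are made up.

    def differences(old, new):
        """Compare two {key: value} snapshots of a migration."""
        deleted = set(old) - set(new)
        added = set(new) - set(old)
        common = set(old) & set(new)
        updated = {key for key in common if old[key] != new[key]}
        unchanged = common - updated
        return deleted, updated, added, unchanged

    old_snapshot = {'A': 1, 'B': 2, 'C': 3}
    new_snapshot = {'B': 2, 'C': 4, 'D': 5}
    print(differences(old_snapshot, new_snapshot))
    # -> ({'A'}, {'C'}, {'D'}, {'B'})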
  25. Bence Faludi (@bfaludi) Senior Database Manager [email protected] Bence’s inbox is

    always open if you have any questions. Thanks for your attention! We are hiring! [email protected]