Slide 1

Slide 1 text

Extract Transform Load using mETL

Slide 2

Slide 2 text

HOW TO START Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip ! http://metl.mito.hu

Slide 3

Slide 3 text

Mito is a Budapest-based agency and digital powerhouse. We believe in the power of expertise, dedication, honesty, and well, chemistry. Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip

Slide 4

Slide 4 text

FOUNDED IN 2008 Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip

Slide 5

Slide 5 text

WITH TWO SIMPLE OBJECTIVES: CREATE CLEVER THINGS Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip

Slide 6

Slide 6 text

AND ASSEMBLE AN AMAZING TEAM Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip

Slide 7

Slide 7 text

MAIN COMPETENCIES GRAPHIC DESIGN CONTENT CREATION SEO & SEM DATA MINING CRM & DATA ANALYSIS STRATEGIC CONSULTING QUALITY ASSURANCE MEDIA PLANNING & BUYING UX RESEARCH & DESIGN WEBSITE HOSTING & OPERATION SOCIAL MEDIA MANAGEMENT CONCEPT CREATION CONCEPT CREATION DEVELOPMENT FRONT-END & BACK- END DEVELOPMENT FRONT-END &
 BACK-END GRAPHIC DESIGN CONTENT CREATION SEO & SEM DATA MINING CRM & DATA ANALYSIS STRATEGIC CONSULTANCY QUALITY ASSURANCE MEDIA PLANNING & BUYING UX RESEARCH & DESIGN ATL PRODUCTION ATL PRODUCTION WEBSITE HOSTING & MAINTENANCE SOCIAL MEDIA MANAGEMENT MITO DEVELOPMENT FRONT-END & BACK-END MOBILE DEVELOPMENT Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip

Slide 8

Slide 8 text

CLIENTS WE ARE PROUD OF Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip

Slide 9

Slide 9 text

DATA DEPARTMENT data mining data cleansing & deduplication data data warehouse maintenance data visualisation data analysis CRM data reporting BI software development data migration Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip

Slide 10

Slide 10 text

HELLO WORLD Bence Faludi (@bfaludi) Senior Database Manager ! Data warehouse specialist, IT project manager, Python expert. Built databases for Nissan, led IT development for Union Insurance, Central European University, Procter & Gamble and Profession.hu. Organizer of the Budapest Database Meetup and creator of the mETL business intelligence tool. ! Bence has more than 9 years of experience in development and project management. ! email: [email protected] twitter: @bfaludi Positions * Senior Database Manager @ Mito Europe * Organizer @ Budapest Database Meetup Responsibilities * Data warehouse design * Mathematical predictions * Data Cleansing & Analytics * Data Consulting * ETL & Python/Go Development * IT Project Management Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip

Slide 11

Slide 11 text

! http://metl.mito.hu

Slide 12

Slide 12 text

WAT? ‣ Built an ETL tool for CEU Economics MicroData to load elective data. ‣ Inspiration coming from Brewery and Kettle. ‣ Founded by the European Union. ‣ ETL with mini-programming. ‣ Versatile loader with easy configuration. ‣ Written in Python language.

Slide 13

Slide 13 text

MILESTONES now 2013 2014 Goals ‣ recruit contributors ‣ additional new features ‣ get new users May - v0.1.4 alpha, first public release Jun - v0.1.5 alpha, minor fixes Jun - v0.1.6 beta, fixes and new features Sep - v0.1.7 beta, running time reduction Jan - v0.1.8 beta, adding Jul - v1.0 stable, english documentation reached 1k downloads / month reached 1.5k downloads / month

Slide 14

Slide 14 text

FEATURES ‣ Works with 9 source types and 11 target types. ‣ Over 35 built-in transformations. ‣ No GUI, configuration in Yaml format. ‣ Checks differences between migrations. ‣ Quick transformations and manipulations. ‣ Easy to extend.

Slide 15

Slide 15 text

BASICS

Slide 16

Slide 16 text

INSTALLATION easy_install mETL (run the command if you have not installed)

Slide 17

Slide 17 text

MAIN COMPONENTS All processes start with a source file from which the data are retrieved. There are unique types, which all have their own settings. After the data is read from the source, and the transformations are completed, the finalized record gets to the Target which will write and create the file with the final data.

Slide 18

Slide 18 text

SOURCE The source is mandatory and responsible for the following: 1. Description of source type and format of the file containing the data. 2. Description of processed fields. 3. Definition of the interlocking/mapping between them.

Slide 19

Slide 19 text

SOURCE - TYPE 1. Description of source type and format of the file containing the data. ‣ Data retrieval from CSV, Database, Fixed Width Text, Google Spreadsheet, JSON, XLS, XML, Yaml. ‣ Definition of the selected type’s own settings. 
 (e.g.: delimiter, quote for CSV, etc.)

Slide 20

Slide 20 text

SOURCE - FIELDS 2. Description of processed fields. ‣ Each field possesses an unique name and a type.
 (e.g.: Boolean, Date, Float, Integer, String, …) ‣ Each field describes transformations.
 (e.g.: Title, UpperCase, Homogenize, Map, …) Hint: If any of the fields is not necessary for the process, it does not have to be included unless we want it to appear in the output. Those fields in which we would like to write values must be listed, as during the process there is no possibility to add new fields.

Slide 21

Slide 21 text

SOURCE - MAP 3. Definition of the interlocking/mapping between data and fields. ‣ Map means a path to the “data”. The path can contain words, numbers (indices) and the combinations of them divided by a /.

Slide 22

Slide 22 text

SOURCE - MAP What is the path to the “data”? Country Code ‣ 0, because it’s the first column. ! Gender ‣ 3, because it’s the 4th column.

Slide 23

Slide 23 text

SOURCE - MAP What is the path to the “data”? Name of the venue ‣ response/venue/name ! Phone number ‣ response/venue/contact/phone ! Unread Notification ‣ notifications/0/item/unreadCount

Slide 24

Slide 24 text

SOURCE - WORKFLOW ‣ The program reads the given source file. ‣ Line by line, the program fills in the fields with the values with the help of mapping. ‣ Different transformations are carried out individually in each field.

Slide 25

Slide 25 text

TARGET Target is required for every process, and only one instance of it could exist. Target is responsible for the following: ‣ Write or update the data into the selected target. ‣ Define the selected type’s settings. 
 (e.g.: delimiter, quote for CSV, etc.)

Slide 26

Slide 26 text

TARGET - WORKFLOW ‣ After the record had reached the target object, the final line is saved to the given file type.

Slide 27

Slide 27 text

DEMO (N1 FOLDER)

Slide 28

Slide 28 text

COMPONENTS After the whole line is processed, the values are in the fields and the transforms are done on the field level, there is a possibility to manipulate the entire, cleaned values based on their correlations.

Slide 29

Slide 29 text

MANIPULATIONS Manipulate the entire record. Manipulation steps can follow each other in any order, regardless of the type. ‣ Modifier ‣ Filter ‣ Expander ‣ Aggregator

Slide 30

Slide 30 text

MODIFIER Modifiers are those objects that are given a whole line (record) and always return with a whole line. However, during their processes they make changes to values with the usage of the related values of different fields.

Slide 31

Slide 31 text

FILTER Their function is primarily filtering. It is used when we would like to evaluate or get rid of incomplete or faulty records as a result of an earlier transformation.

Slide 32

Slide 32 text

EXPAND It is used for expansion if we want to add additional values after the current source or breaks up list type elements to separate lines.

Slide 33

Slide 33 text

AGGREGATOR It is used to create groups and calculate information from them. Aggregators act many times as Filters or Modifiers as well, since in several cases they delete lines or columns, modify and collect given values.

Slide 34

Slide 34 text

DEMO (N2 & N3 FOLDER)

Slide 35

Slide 35 text

MIGRATIONS During the running of the mETL script, there is a possibility to define a migration file and to generate a new migration file. ! The metl-differences script is able to compare migration files and write out the keys of those elements that are to be deleted / updated / added / unchanged during the migration. !

Slide 36

Slide 36 text

DEMO (N4 FOLDER)

Slide 37

Slide 37 text

Bence Faludi (@bfaludi) Senior Database Manager [email protected] Bence’s inbox is open 0-24 if you have any questions. Thanks for your attention! we are hiring! [email protected]