
Extract, Transform, Load using mETL

This presentation was given at the PyData 2014 conference in Berlin. It is a novice-level training to help you learn and use mETL in your daily work.

Bence Faludi

July 25, 2014

Transcript

  1. Mito is a Budapest-based agency and digital powerhouse. We believe

    in the power of expertise, dedication, honesty, and well, chemistry. Downloadable exercises: https://github.com/bfaludi/mETL-tutorials/archive/master.zip
  2. MAIN COMPETENCIES Concept creation, graphic design, content creation, UX

    research & design, front-end & back-end development, mobile development, website hosting & operation, SEO & SEM, data mining, CRM & data analysis, strategic consulting, quality assurance, media planning & buying, ATL production, social media management.
  3. DATA DEPARTMENT Data mining, data cleansing & deduplication, data

    warehouse maintenance, data visualisation, data analysis, CRM data reporting, BI software development, data migration.
  4. HELLO WORLD Bence Faludi (@bfaludi), Senior Database Manager. Data

    warehouse specialist, IT project manager, Python expert. Built databases for Nissan; led IT development for Union Insurance, Central European University, Procter & Gamble and Profession.hu. Organizer of the Budapest Database Meetup and creator of the mETL business intelligence tool. Bence has more than 9 years of experience in development and project management. Email: [email protected], Twitter: @bfaludi. Positions: Senior Database Manager @ Mito Europe; Organizer @ Budapest Database Meetup. Responsibilities: data warehouse design, mathematical predictions, data cleansing & analytics, data consulting, ETL & Python/Go development, IT project management.
  5. WAT? ‣ Built an ETL tool for CEU Economics MicroData

    to load elective data. ‣ Inspiration comes from Brewery and Kettle. ‣ Funded by the European Union. ‣ ETL with mini-programming. ‣ Versatile loader with easy configuration. ‣ Written in Python.
  6. MILESTONES Timeline (2013-2014): May - v0.1.4 alpha, first public

    release; Jun - v0.1.5 alpha, minor fixes; Jun - v0.1.6 beta, fixes and new features; Sep - v0.1.7 beta, running time reduction; Jan - v0.1.8 beta; Jul - v1.0 stable, English documentation. Reached 1k downloads / month, later 1.5k downloads / month. Goals now: ‣ recruit contributors ‣ additional new features ‣ get new users.
  7. FEATURES ‣ Works with 9 source types and 11 target

    types. ‣ Over 35 built-in transformations. ‣ No GUI; configuration in YAML format. ‣ Checks differences between migrations. ‣ Quick transformations and manipulations. ‣ Easy to extend.
  8. MAIN COMPONENTS Every process starts with a Source file from

    which the data is retrieved. There are distinct source types, each with its own settings. After the data has been read from the Source and the transformations are completed, the finalized record reaches the Target, which writes the final data into the output file.
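A rough, purely illustrative picture of this Source -> transform -> Target flow in plain Python (this is not mETL's API; the file names, field names and functions here are made up):

    import csv

    # Hypothetical illustration of the Source -> transform -> Target flow.
    # This is NOT mETL's API; file and field names are made up.

    def read_source(path):
        """Source: yields one raw record (dict) per line of a CSV file."""
        with open(path, newline='') as handle:
            for row in csv.DictReader(handle):
                yield row

    def transform(record):
        """Field-level transformations, e.g. trimming and title-casing."""
        record['name'] = record.get('name', '').strip().title()
        record['country'] = record.get('country', '').strip().upper()
        return record

    def write_target(records, path):
        """Target: writes the finalized records into the output file."""
        with open(path, 'w', newline='') as handle:
            writer = csv.DictWriter(handle, fieldnames=['name', 'country'],
                                    extrasaction='ignore')
            writer.writeheader()
            writer.writerows(records)

    # write_target((transform(r) for r in read_source('input.csv')), 'output.csv')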
  9. SOURCE The source is mandatory and responsible for the following:

    1. Description of the source type and the format of the file containing the data. 2. Description of the processed fields. 3. Definition of the mapping between the data and the fields.
  10. SOURCE - TYPE 1. Description of source type and format

    of the file containing the data. ‣ Data retrieval from CSV, Database, Fixed Width Text, Google Spreadsheet, JSON, XLS, XML, YAML. ‣ Definition of the selected type’s own settings. 
 (e.g.: delimiter, quote for CSV, etc.)
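As a small, hedged illustration (not mETL configuration syntax; the file name and settings below are made up), a CSV source's delimiter and quote settings correspond directly to what you would hand to Python's csv module:

    import csv

    # Illustration only: reading a CSV source whose type-specific settings
    # are a ';' delimiter and a '"' quote character. The file name is made up.
    with open('sales.csv', newline='') as handle:
        for row in csv.reader(handle, delimiter=';', quotechar='"'):
            print(row)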
  11. SOURCE - FIELDS 2. Description of the processed fields. ‣ Each

    field possesses a unique name and a type.
 (e.g.: Boolean, Date, Float, Integer, String, …) ‣ Each field describes its transformations.
 (e.g.: Title, UpperCase, Homogenize, Map, …) Hint: if a field is not necessary for the process, it does not have to be included, unless we want it to appear in the output. Fields into which we want to write values must be listed, because there is no way to add new fields during the process.
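As a rough mental model (not mETL's actual classes; the names below are hypothetical), a field with a name, a type and a chain of transformations can be pictured like this:

    # Hypothetical illustration of "a field has a name, a type and a chain
    # of transformations". These are not mETL's actual classes.

    class Field:
        def __init__(self, name, field_type, transforms=()):
            self.name = name
            self.field_type = field_type        # e.g. str, int, float
            self.transforms = list(transforms)  # applied in order

        def set_value(self, raw):
            value = self.field_type(raw)
            for transform in self.transforms:
                value = transform(value)
            return value

    # A Title-like and an UpperCase-like transformation chain:
    name_field = Field('name', str, transforms=[str.strip, str.title])
    code_field = Field('country_code', str, transforms=[str.strip, str.upper])

    print(name_field.set_value('  bence faludi '))  # -> 'Bence Faludi'
    print(code_field.set_value('hu'))               # -> 'HU'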
  12. SOURCE - MAP 3. Definition of the mapping between the data

    and the fields. ‣ A map is a path to the “data”. The path can contain words, numbers (indices) and combinations of them, separated by /.
  13. SOURCE - MAP What is the path to the “data”?

    Country Code ‣ 0, because it’s the first column. Gender ‣ 3, because it’s the 4th column.
  14. SOURCE - MAP What is the path to the “data”?

    Name of the venue ‣ response/venue/name Phone number ‣ response/venue/contact/phone Unread Notification ‣ notifications/0/item/unreadCount
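A hedged sketch of how such a /-separated path can be resolved against nested, JSON-like data (this only illustrates the idea, not mETL's implementation; the example payload is made up):

    # Illustration of resolving a '/'-separated map path against nested
    # JSON-like data. Not mETL's implementation; the payload is made up.

    def resolve(data, path):
        """Walk dicts by key and lists by numeric index."""
        current = data
        for part in path.split('/'):
            if isinstance(current, (list, tuple)):
                current = current[int(part)]
            else:
                current = current[part]
        return current

    payload = {
        'response': {'venue': {'name': 'Some Venue',
                               'contact': {'phone': '+49 ...'}}},
        'notifications': [{'item': {'unreadCount': 3}}],
    }

    print(resolve(payload, 'response/venue/name'))               # 'Some Venue'
    print(resolve(payload, 'notifications/0/item/unreadCount'))  # 3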
  15. SOURCE - WORKFLOW ‣ The program reads the given source

    file. ‣ Line by line, the program fills in the fields with values using the mapping. ‣ The transformations are carried out individually in each field.
  16. TARGET A Target is required for every process, and only one

    instance of it can exist. The Target is responsible for the following: ‣ Write or update the data into the selected target. ‣ Define the selected type’s settings. 
 (e.g.: delimiter, quote for CSV, etc.)
  17. TARGET - WORKFLOW ‣ After the record has reached the

    target object, the final line is saved in the given file type.
  18. COMPONENTS After the whole line has been processed, the values are

    in the fields and the field-level transformations are done; at this point it is possible to manipulate the entire cleaned record based on the correlations between its values.
  19. MANIPULATIONS Manipulate the entire record. Manipulation steps can follow each

    other in any order, regardless of the type. ‣ Modifier ‣ Filter ‣ Expander ‣ Aggregator
  20. MODIFIER Modifiers are objects that are given a whole

    line (record) and always return a whole line. During processing they change values using the related values of other fields.
  21. FILTER Its function is primarily filtering. It is used when

    we want to evaluate or discard records that are incomplete or faulty as a result of an earlier transformation.
  22. EXPAND It is used for expansion when we want to

    add additional values after the current source, or to break list-type elements up into separate lines.
  23. AGGREGATOR It is used to create groups and calculate information

    from them. Aggregators often act as Filters or Modifiers as well, since in many cases they delete lines or columns, or modify and collect given values.
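A purely illustrative sketch of the four manipulation types as plain Python generator stages (this is not mETL's API; the record layout, field names and logic below are made up):

    from itertools import groupby
    from operator import itemgetter

    # Made-up generator stages illustrating the four manipulation types.

    def modifier(records):
        """Modifier: gets a whole record, returns a whole record,
        changing values based on related fields."""
        for record in records:
            record['full_name'] = f"{record['first_name']} {record['last_name']}"
            yield record

    def record_filter(records):
        """Filter: drops incomplete or faulty records."""
        for record in records:
            if record.get('country'):
                yield record

    def expander(records):
        """Expander: breaks a list-type element up into separate lines."""
        for record in records:
            for email in record.get('emails', []):
                yield {**record, 'email': email}

    def aggregator(records):
        """Aggregator: groups records and calculates information per group."""
        by_country = itemgetter('country')
        for country, group in groupby(sorted(records, key=by_country), key=by_country):
            yield {'country': country, 'count': sum(1 for _ in group)}

    # The stages can follow each other in any order, e.g. (read_records is hypothetical):
    # results = aggregator(record_filter(expander(modifier(read_records()))))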
  24. MIGRATIONS While the mETL script runs, it is possible to

    define a migration file and to generate a new one. The metl-differences script can compare migration files and write out the keys of the elements that are to be deleted / updated / added / unchanged during the migration.
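As a hedged illustration of the idea only (not how metl-differences is implemented; the snapshots below are made up), comparing two key/value snapshots into deleted / updated / added / unchanged sets can look like this:

    # Illustration of the deleted / updated / added / unchanged comparison.
    # Not the metl-differences implementation; the snapshots are made up.

    def differences(old, new):
        """Compare two {key: value} snapshots of a migration."""
        deleted = set(old) - set(new)
        added = set(new) - set(old)
        common = set(old) & set(new)
        updated = {key for key in common if old[key] != new[key]}
        unchanged = common - updated
        return deleted, updated, added, unchanged

    old_snapshot = {'A': 1, 'B': 2, 'C': 3}
    new_snapshot = {'B': 2, 'C': 4, 'D': 5}
    print(differences(old_snapshot, new_snapshot))
    # -> ({'A'}, {'C'}, {'D'}, {'B'})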
  25. Bence Faludi (@bfaludi) Senior Database Manager [email protected] Bence’s inbox is

    always open if you have any questions. Thanks for your attention! We are hiring! [email protected]