Extract, Transform, Load using mETL

This presentation was given at the PyData '14 conference in Berlin. It is a novice-level training to help you learn and use mETL in your daily work.

Bence Faludi

July 25, 2014

Transcript

  1. Extract Transform Load
    using mETL

  2. HOW TO START
    Downloadable exercises:
    https://github.com/bfaludi/mETL-tutorials/archive/master.zip
    http://metl.mito.hu

  3. Mito is a Budapest-based agency and
    digital powerhouse. We believe in the
    power of expertise, dedication,
    honesty, and well, chemistry.

  4. FOUNDED IN
    2008

  5. WITH TWO SIMPLE OBJECTIVES:
    CREATE CLEVER THINGS

  6. AND ASSEMBLE AN
    AMAZING TEAM

  7. MAIN COMPETENCIES
    GRAPHIC DESIGN
    CONTENT CREATION
    SEO & SEM
    DATA MINING
    CRM & DATA ANALYSIS
    STRATEGIC CONSULTING
    QUALITY ASSURANCE
    MEDIA PLANNING & BUYING
    UX RESEARCH & DESIGN
    WEBSITE HOSTING & OPERATION
    SOCIAL MEDIA MANAGEMENT
    CONCEPT CREATION
    ATL PRODUCTION
    FRONT-END & BACK-END DEVELOPMENT
    MOBILE DEVELOPMENT

  8. CLIENTS WE ARE PROUD OF

  9. DATA DEPARTMENT
    data mining
    data cleansing & deduplication
    data warehouse maintenance
    data visualisation
    data analysis
    CRM
    data reporting
    BI software development
    data migration

  10. HELLO WORLD
    Bence Faludi (@bfaludi)
    Senior Database Manager
    Data warehouse specialist, IT project manager,
    Python expert. Built databases for Nissan, led
    IT development for Union Insurance, Central
    European University, Procter & Gamble and
    Profession.hu. Organizer of the Budapest
    Database Meetup and creator of the mETL
    business intelligence tool.
    Bence has more than 9 years of experience in
    development and project management.
    email: [email protected]
    twitter: @bfaludi
    Positions
    * Senior Database Manager @ Mito Europe
    * Organizer @ Budapest Database Meetup
    Responsibilities
    * Data warehouse design
    * Mathematical predictions
    * Data Cleansing & Analytics
    * Data Consulting
    * ETL & Python/Go Development
    * IT Project Management

  11. http://metl.mito.hu

  12. WAT?
    ‣ Built as an ETL tool for CEU Economics MicroData to
    load elective data.
    ‣ Inspired by Brewery and Kettle.
    ‣ Funded by the European Union.
    ‣ ETL with mini-programming.
    ‣ Versatile loader with easy configuration.
    ‣ Written in Python.

  13. MILESTONES
    2013
    ‣ May - v0.1.4 alpha, first public release
    ‣ Jun - v0.1.5 alpha, minor fixes
    ‣ Jun - v0.1.6 beta, fixes and new features
    ‣ Sep - v0.1.7 beta, running time reduction
    2014
    ‣ Jan - v0.1.8 beta, adding
    ‣ Jul - v1.0 stable, English documentation
    ‣ reached 1k downloads / month
    ‣ reached 1.5k downloads / month
    Goals (now)
    ‣ recruit contributors
    ‣ additional new features
    ‣ get new users

  14. FEATURES
    ‣ Works with 9 source types and 11 target types.
    ‣ Over 35 built-in transformations.
    ‣ No GUI, configuration in Yaml format.
    ‣ Checks differences between migrations.
    ‣ Quick transformations and manipulations.
    ‣ Easy to extend.

  15. BASICS

  16. INSTALLATION
    easy_install mETL
    (run this command if you have not installed mETL yet)
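    If easy_install is not available on your system, pip should work as
    well, assuming the package is published on PyPI under the same name:
    pip install mETL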

  17. MAIN COMPONENTS
    Every process starts with a Source from which
    the data are retrieved. There are several source
    types, and each has its own settings.
    After the data are read from the Source and the
    transformations are completed, the finalized
    record reaches the Target, which writes the
    final data into the output file.
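    A minimal configuration sketch of the Source, fields and Target flow
    described above. The key names below are illustrative approximations
    of the concepts on this slide, not the confirmed mETL YAML syntax;
    see http://metl.mito.hu and the downloadable exercises for working
    configurations.
    # Illustrative sketch only: key names may differ from real mETL syntax.
    source:
      source: CSV              # source type
      resource: people.csv     # hypothetical input file
      delimiter: ','           # type-specific setting
      fields:
        - name: name
          type: String
          map: 0               # first column
        - name: birthdate
          type: Date
          map: 1               # second column
    target:
      type: CSV                # target type
      resource: output.csv     # hypothetical output file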

  18. SOURCE
    The Source is mandatory and is responsible for
    the following:
    1. Description of the source type and the format
    of the file containing the data.
    2. Description of the processed fields.
    3. Definition of the mapping between the data
    and the fields.

  19. SOURCE - TYPE
    1. Description of source type and format of the
    file containing the data.
    ‣ Data retrieval from CSV, Database, Fixed Width
    Text, Google Spreadsheet, JSON, XLS, XML,
    Yaml.
    ‣ Definition of the selected type’s own settings. 

    (e.g.: delimiter, quote for CSV, etc.)
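    As a sketch, the type-specific settings for a CSV source could be
    expressed like this (key names are assumptions, not the confirmed
    mETL syntax):
    # Hypothetical CSV source settings
    source:
      source: CSV
      resource: input.csv    # hypothetical file name
      delimiter: ';'         # CSV-specific setting
      quote: '"'             # CSV-specific setting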

  20. SOURCE - FIELDS
    2. Description of the processed fields.
    ‣ Each field has a unique name and a type.
    (e.g.: Boolean, Date, Float, Integer, String, …)
    ‣ Each field describes its transformations.
    (e.g.: Title, UpperCase, Homogenize, Map, …)
    Hint: if a field is not needed for the process, it does not have to be
    included unless we want it to appear in the output. Fields we want to
    write values into must be listed, because there is no possibility to
    add new fields during the process.
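    A sketch of field descriptions with transformations, using the field
    types and transforms named above; the surrounding key names are
    assumptions rather than the confirmed syntax:
    # Each field has a unique name, a type and, optionally, transformations.
    fields:
      - name: country
        type: String
        transforms:
          - transform: UpperCase
      - name: city
        type: String
        transforms:
          - transform: Homogenize
          - transform: Title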

  21. SOURCE - MAP
    3. Definition of the interlocking/mapping
    between data and fields.
    ‣ A map is a path to the “data”. The path can contain
    words, numbers (indices), and combinations of them,
    separated by /.

  22. SOURCE - MAP
    What is the path to the “data”?
    Country Code
    ‣ 0, because it’s the first
    column.
    Gender
    ‣ 3, because it’s the 4th
    column.

  23. SOURCE - MAP
    What is the path to the “data”?
    Name of the venue
    ‣ response/venue/name
    Phone number
    ‣ response/venue/contact/phone
    Unread Notification
    ‣ notifications/0/item/unreadCount
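    Putting the two map examples together, the paths would appear in the
    field definitions roughly like this (the field names are made up for
    illustration):
    # Column indices for tabular sources, slash-separated paths for JSON
    fields:
      - name: country_code
        map: 0                                  # first column
      - name: gender
        map: 3                                  # fourth column
      - name: venue_name
        map: response/venue/name
      - name: venue_phone
        map: response/venue/contact/phone
      - name: unread_notifications
        map: notifications/0/item/unreadCount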

  24. SOURCE - WORKFLOW
    ‣ The program reads the given source file.
    ‣ Line by line, the program fills in the fields with values using
    the mapping.
    ‣ The transformations are then carried out individually on each
    field.

  25. TARGET
    The Target is required for every process, and only one
    instance of it can exist.
    Target is responsible for the following:
    ‣ Write or update the data into the selected
    target.
    ‣ Define the selected type’s settings. 

    (e.g.: delimiter, quote for CSV, etc.)
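    A sketch of a Target block with its type-specific settings (again,
    the key names are illustrative, not the confirmed syntax):
    # Hypothetical target: write the finalized records into a CSV file
    target:
      type: CSV
      resource: cleaned.csv    # hypothetical output file
      delimiter: ','
      quote: '"'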

  26. TARGET - WORKFLOW
    ‣ After the record has reached the target object, the final line is
    written out in the given file format.

  27. DEMO
    (N1 FOLDER)

  28. COMPONENTS
    Once the whole line has been processed (the values are in their
    fields and the field-level transforms are done), it becomes possible
    to manipulate the entire cleaned record based on the correlations
    between its values.

  29. MANIPULATIONS
    Manipulations work on the entire record. The manipulation steps can
    follow each other in any order, regardless of their type (see the
    sketch after this list).
    ‣ Modifier
    ‣ Filter
    ‣ Expander
    ‣ Aggregator
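    As a purely hypothetical sketch of the idea (not the real mETL keys;
    the N2 & N3 demo folders contain actual configurations), a
    manipulation section could chain steps of different types in order:
    # Hypothetical sketch: chained manipulation steps applied after the
    # field-level transforms; real key names are in the mETL documentation.
    manipulations:
      - type: Filter        # drop records with an empty email field
        field: email
        condition: empty
      - type: Modifier      # recompute a value from related fields
        field: full_name
      - type: Aggregator    # group records and summarize them
        keys: [ country ]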

  30. MODIFIER
    Modifiers are objects that receive a whole line (record) and always return a whole
    line. During processing, they change values by using the related values of other
    fields.

  31. FILTER
    Its function is primarily filtering. It is used when we want to evaluate or get rid of
    records that are incomplete or faulty as a result of an earlier transformation.

  32. EXPAND
    It is used for expansion when we want to add additional values after the current
    source, or to break list-type elements up into separate lines.

  33. AGGREGATOR
    It is used to create groups and calculate information from them. Aggregators often
    act as Filters or Modifiers as well, since in several cases they delete lines or
    columns, or modify and collect given values.

  34. DEMO
    (N2 & N3 FOLDER)

  35. MIGRATIONS
    While the mETL script is running, it is
    possible to provide a migration file and to
    generate a new migration file.
    The metl-differences script can compare
    migration files and write out the keys of the
    elements that are deleted / updated / added /
    unchanged during the migration.
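    As an assumed example of the workflow (the exact arguments may
    differ; see the documentation and the N4 demo folder), two migration
    files produced by separate runs could be compared like this:
    # Assumption: metl-differences takes the two migration files to compare
    metl-differences old_migration_file new_migration_file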

  36. DEMO
    (N4 FOLDER)

  37. Bence Faludi (@bfaludi)
    Senior Database Manager
    [email protected]
    Bence’s inbox is open around the clock
    if you have any questions.
    Thanks
    for your attention!
    we are hiring!
    [email protected]
