DevOpsPorto Meetup 38: Using cutting-edge open-source technologies to build one of the biggest industrial Data Lake of the World by Allan Sene

DevOpsPorto

August 06, 2020

Transcript

  1. www.datasprints.com
    Using cutting-edge
    open-source technologies
    to build a World-class
    Industrial Data Lake

  2. Hello!
    My name is Allan Sene
    Co-Founder & CTO @Data Sprints
    Co-Founder & Podcaster & Instructor @Data Hackers
    10+ years in Data & Software, 4+ years as a Data Engineer

  3. Agenda
    The Challenge
    What is a Data Lake?
    Awesome cutting-edge data tools
    Putting everything together to build an IDL
    Results and Next Steps
    Q&A

  4. THE CHALLENGE

  5. The Challenge
    ● Multinational steel company, with plants worldwide
    ● They need to give global managers the ability to track the production line
    ● Data columns contain BLOBs, arrays and complex data types
    ● Migrating from on-prem to cloud
    ● Data sets of 70 GB (compressed) and 8 million rows, more than 1,600 columns, very
    complex queries (200+ joins), 15-minute delay
    ● Maximum query response time of 30 s
    ● Very limited cloud budget: USD 15,000.00/year
    DSPT Webinar + DevOps Porto: Cutting-edge open-source to build Data Lakes
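
To put the constraints on this slide in perspective, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes "70 Gb" means 70 decimal gigabytes and uses 1,600 as the column count, although the slide says "more than 1600". The variable names are illustrative and do not come from the talk.

```python
# Back-of-the-envelope figures derived from the numbers on the slide.
# Assumption: 70 Gb is read as 70 gigabytes (10**9 bytes each).
COMPRESSED_BYTES = 70 * 10**9
ROWS = 8_000_000
COLUMNS = 1_600  # the slide says "more than 1600", so this is a lower bound
BUDGET_USD_PER_YEAR = 15_000

bytes_per_row = COMPRESSED_BYTES / ROWS
# With more than 1,600 columns the real per-cell figure can only be smaller.
bytes_per_cell = bytes_per_row / COLUMNS
budget_per_month = BUDGET_USD_PER_YEAR / 12

print(f"{bytes_per_row:,.0f} compressed bytes per row")
print(f"{bytes_per_cell:.2f} compressed bytes per cell (upper bound)")
print(f"USD {budget_per_month:,.2f} per month")
```

Under these assumptions each row is wide (several kilobytes even when compressed), while the cloud budget works out to little more than a thousand dollars a month.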

  6. Stakeholder's name,
    can you guess it?

  7. Milagres (Miracles)

  8. What is a Data Lake?

  9. “It is a data repository that holds a huge
    quantity of data in its raw state, structured or
    not. The schema is only defined when it is
    necessary for consumption.”
    (Anne Buff - Best Practices Leader at SAS)

  10. So, a Data Lake is...

  11. Not really...

  12. A Data Lake must have
    ● Security & Audit
    ● Catalog & Access
    ● Data Pipelines & Orchestration
    Basically, a platform that ensures safe and efficient data consumption

  13. Awesome cutting-edge
    data tools

  14. Dremio
    ● Data Lake Engine
    ● Team includes Apache Arrow PMC members
    ● User-friendly
    ● Open-source
    ● Also has an Enterprise version

  15. Dremio
    ● SQL & API
    ● Wiki
    ● Connections:
    ○ S3, HDFS, AzureFS
    ○ Mongo, Elastic
    ○ Postgres, MySQL
    ● Supports JSON, CSV, Parquet, Avro…
    ● Reflections
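
The "SQL & API" bullet can be illustrated with a minimal sketch of how a client might prepare a SQL submission over Dremio's REST API. This is not code from the talk. The endpoint path (`/api/v3/sql`), the `Bearer` authorization scheme, the port, and the table name are all assumptions that should be checked against the Dremio documentation for the version in use. The helper `build_sql_request` is hypothetical, and the request is only built, never sent.

```python
import json
import urllib.request


def build_sql_request(base_url: str, token: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP request that submits SQL to Dremio.

    Assumption: the server accepts POST {base_url}/api/v3/sql with a JSON
    body of the form {"sql": "..."} and a bearer token.
    """
    body = json.dumps({"sql": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v3/sql",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Hypothetical host, token placeholder and table name, for illustration only.
req = build_sql_request(
    "http://localhost:9047",
    "<token>",
    'SELECT * FROM lake."production_line" LIMIT 10',
)
print(req.get_method(), req.full_url)
```

Building the request separately from sending it keeps the sketch runnable without a Dremio server. In a real client the request would be passed to `urllib.request.urlopen` and the response handled according to the API documentation.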

  16. Dremio's UI

  17. dbt
    ● dbt = "Data Build Tool"
    ● Built by Fishtown Analytics
    ● Data Pipeline Orchestration
    ● "Airflow for Data Analysts"
    ● Open-source
    ● SaaS version

  18. dbt
    ● ETL over an MPP engine
    ● Code Versioning
    ● Data Lineage & Docs
    ● Data Validation
    ● Seed Data
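
The "Data Validation" bullet refers to dbt tests, which are written as SQL and pass when the test query returns no failing rows. The sketch below mimics that idea in plain Python for two common checks, not-null and unique. It is an illustration of the concept, not dbt code. The function names and the sample rows (`coil_id`, `plant`) are hypothetical.

```python
from collections import Counter


def not_null_failures(rows, column):
    """Return the rows whose `column` is missing or None."""
    return [r for r in rows if r.get(column) is None]


def unique_failures(rows, column):
    """Return the non-null values of `column` that occur more than once."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sorted(v for v, n in counts.items() if n > 1)


# Hypothetical sample rows, standing in for a small seed table.
rows = [
    {"coil_id": 1, "plant": "A"},
    {"coil_id": 2, "plant": None},
    {"coil_id": 2, "plant": "B"},
]

# As in dbt, a check "passes" when it returns nothing.
print(not_null_failures(rows, "plant"))
print(unique_failures(rows, "coil_id"))
```

Here both checks report failures: one row has no plant, and `coil_id` 2 appears twice.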

  19. dbt's UI

  20. Putting everything
    together to build a DL

  21. Bringing it all together
    ● Pipelines & Orchestration => dbt
    ● Catalog & Access => Dremio
    ● Processing => Spark
    ● Storage => Amazon S3

  22. Bringing it all together
    [Architecture diagram. Labels: JDBC; Micro-batch Data; Repartitioned Data; Repartition & Schema Evolution; Optimization; Coalesced Data; Data Port; Data Explorer]
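
The diagram labels suggest a flow in which data arrives in micro-batches, is repartitioned, and is later coalesced into larger units. The exact flow is not recoverable from the transcript, so the following is a pure-Python sketch of that general pattern under that assumption. It is not the Spark job from the talk, it leaves out schema evolution, and every function and field name (`micro_batches`, `repartition`, `coalesce`, `plant`, `reading`) is hypothetical.

```python
from collections import defaultdict
from itertools import islice


def micro_batches(records, batch_size):
    """Yield successive lists of at most `batch_size` records."""
    it = iter(records)
    while batch := list(islice(it, batch_size)):
        yield batch


def repartition(batch, key):
    """Group the records of one batch by a partition key."""
    parts = defaultdict(list)
    for rec in batch:
        parts[rec[key]].append(rec)
    return dict(parts)


def coalesce(partitioned_batches):
    """Merge the small per-batch partitions into one list per key."""
    merged = defaultdict(list)
    for parts in partitioned_batches:
        for k, recs in parts.items():
            merged[k].extend(recs)
    return dict(merged)


# Hypothetical readings from two plants, arriving as a stream.
records = [{"plant": "A" if i % 2 == 0 else "B", "reading": i} for i in range(6)]

small = [repartition(b, "plant") for b in micro_batches(records, 2)]
lake = coalesce(small)

print(sum(len(p) for p in small), "small partitions ->", len(lake), "coalesced partitions")
```

The point of the coalescing step is that many small per-batch partitions are cheap to write but slow to query, so merging them per key trades extra processing for fewer, larger units to scan.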

  23. View Slide

  24. Results and Next Steps

  25. Results, until now
    ● Platform built in 6 months by a team of 4 full-time engineers

  26. Next steps
    ● Integrating with widely used analytics tools (Power BI, Tableau…),
    deployed worldwide
    ● Data Science models consuming data through the Data Bus
    ● Data Quality Monitoring

  27. www.datahackers.com.br

  28. Questions?

  29. Thanks!
    www.datasprints.com
    [email protected]
    https://www.linkedin.com/in/allansene/