DevOpsPorto Meetup 38: Using cutting-edge open-source technologies to build one of the biggest industrial Data Lake of the World by Allan Sene

DevOpsPorto

August 06, 2020

Transcript

  1. Using cutting-edge open-source technologies to build a World-class Industrial Data Lake
    www.datasprints.com
  2. Hello! My name is Allan Sene
    • Co-Founder & CTO @ Data Sprints
    • Co-Founder, Podcaster & Instructor @ Data Hackers
    • 10+ years in Data & Software, 4+ years as Data Engineer
  3. Agenda
    • The Challenge
    • What is a Data Lake?
    • Awesome cutting-edge data tools
    • Putting everything together to build an IDL (Industrial Data Lake)
    • Results and Next Steps
    • Q&A
  4. THE CHALLENGE

  5. The Challenge
    • Multinational company in the steel industry, with plants worldwide
    • They need to give global managers the ability to track the production line
    • Data columns contain BLOBs, arrays and complex data types
    • Migrating from on-prem to cloud
    • Data sets of 70 GB (compressed) and 8 million rows, with more than 1,600 columns, very complex queries (200+ joins) and a 15-minute delay
    • Maximum query response time of 30 s
    • Very limited cloud budget: USD 15,000.00/year
    DSPT Webinar + DevOps Porto: Cutting-edge open-source to build Data Lakes www.datasprints.com
  6. Stakeholder's name, can you guess it?
  7. Milagres (Miracles)
  8. What is a Data Lake?

  9. “A data repository that holds a huge quantity of data in its raw state, structured or not. The schema is only defined when it is needed for consumption.” (Anne Buff, Best Practices Leader at SAS)
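The "schema only defined when necessary for consumption" part of this definition can be illustrated with a small sketch. It is not taken from the talk: the record fields (`plant`, `temp_c`, `ts`, `operator`) and the helper `read_with_schema` are hypothetical. Raw records are kept untouched as JSON lines, and a schema is applied only at the moment a consumer reads them.

```python
import io
import json

# Raw zone: heterogeneous records are stored exactly as they arrive.
RAW_EVENTS = io.StringIO(
    '{"plant": "A", "temp_c": "1510.5", "ts": "2020-08-06T10:00:00"}\n'
    '{"plant": "B", "temp_c": 1498, "operator": "x1"}\n'
    '{"plant": "C"}\n'
)

# Schema chosen by one consumer, at read time: column name -> caster.
SCHEMA = {"plant": str, "temp_c": float}


def read_with_schema(lines, schema):
    """Project each raw JSON record onto the schema, casting on read.

    Missing fields become None; extra fields are ignored by this reader
    but remain available in the raw data for other consumers.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


rows = list(read_with_schema(RAW_EVENTS, SCHEMA))
print(rows)
```

A different consumer could read the same raw lines with another `SCHEMA` (for example one that includes `operator`) without any change to what is stored.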
  10. So, a Data Lake is...
  11. Not really...
  12. A Data Lake must have
    • Security & Audit
    • Catalog & Access
    • Data Pipelines & Orchestration
    Basically, a platform that ensures safe and efficient data consumption
  13. Awesome cutting-edge data tools

  14. Dremio
    • Data Lake Engine
    • Team includes Apache Arrow PMC members
    • User-friendly
    • Open-source
    • Has an Enterprise version
  15. Dremio
    • SQL & API
    • Wiki
    • Connections:
      ◦ S3, HDFS, AzureFS
      ◦ Mongo, Elastic
      ◦ Postgres, MySQL
    • Supports JSON, CSV, Parquet, Avro…
    • Reflections
  16. Dremio's UI
  17. dbt
    • dbt = "Data Build Tool"
    • Built by Fishtown Analytics
    • Data Pipeline Orchestration
    • "Airflow for Data Analysts"
    • Open-source
    • SaaS version
  18. dbt
    • ETL over an MPP
    • Code Versioning
    • Data Lineage & Docs
    • Data Validation
    • Seed Data
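To make "Data Validation" and "Seed Data" concrete: a dbt-style data test is, in essence, a query that looks for offending rows, and the test passes when it finds none. The sketch below imitates that idea with Python's built-in `sqlite3` standing in for the MPP engine. It does not use dbt itself, and the table `plants`, the CSV content, and the helpers `failing_rows_not_null` and `failing_rows_unique` are hypothetical names chosen for illustration.

```python
import csv
import io
import sqlite3

# Seed data: a small, version-controlled CSV loaded into the warehouse.
SEED_CSV = "plant_id,country\n1,BR\n2,PT\n2,US\n4,\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plants (plant_id INTEGER, country TEXT)")
reader = csv.DictReader(io.StringIO(SEED_CSV))
conn.executemany(
    "INSERT INTO plants VALUES (?, ?)",
    # Empty CSV cells are loaded as NULL.
    [(int(r["plant_id"]), r["country"] or None) for r in reader],
)


def failing_rows_not_null(conn, table, column):
    """Return rows where `column` is NULL; an empty list means 'pass'."""
    return conn.execute(
        f"SELECT * FROM {table} WHERE {column} IS NULL"
    ).fetchall()


def failing_rows_unique(conn, table, column):
    """Return (value, count) for values of `column` seen more than once."""
    return conn.execute(
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"GROUP BY {column} HAVING COUNT(*) > 1"
    ).fetchall()


not_null_failures = failing_rows_not_null(conn, "plants", "country")
unique_failures = failing_rows_unique(conn, "plants", "plant_id")
print("not_null(country):", not_null_failures)
print("unique(plant_id):", unique_failures)
```

In dbt the same checks would be declared next to the model rather than hand-written; the point here is only the "query for failures" shape of a validation.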
  19. dbt's UI
  20. Putting everything together to build a DL

  21. Bringing it all together
    • Pipelines & Orchestration => dbt
    • Catalog & Access => Dremio
    • Processing => Spark
    • Storage => Amazon S3
  22. Bringing it all together
    [Architecture diagram; labels: JDBC, Micro-batch Data, Repartitioned Data, Repartition & Schema Evolution, Optimization, Coalesced Data, Data Port, Data Explorer]
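The stage labels on this slide (micro-batch data, repartition and schema evolution, coalesced data) suggest a pipeline along the following lines. This is a sketch under assumptions, not the talk's implementation: the talk names Spark for processing, whereas this uses plain Python lists as a stand-in, and the columns (`plant`, `day`, `temp`, `speed`) and function names are hypothetical.

```python
from collections import defaultdict

# Three micro-batches, as if pulled from the source over JDBC.
# The second batch introduces a new column (`speed`).
MICRO_BATCHES = [
    [{"plant": "A", "day": "2020-08-05", "temp": 1500}],
    [{"plant": "A", "day": "2020-08-06", "temp": 1510, "speed": 3.2}],
    [{"plant": "B", "day": "2020-08-06", "temp": 1490, "speed": 2.9}],
]


def evolve_schema(batches):
    """Union of all column names seen so far, in first-seen order."""
    columns = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return columns


def repartition(batches, columns, key):
    """Write one small 'file' per (batch, key value), padding missing columns."""
    files = defaultdict(list)
    for batch in batches:
        per_key = defaultdict(list)
        for row in batch:
            padded = {name: row.get(name) for name in columns}
            per_key[padded[key]].append(padded)
        for value, rows in per_key.items():
            files[value].append(rows)
    return dict(files)


def coalesce(files):
    """Merge the small files of each partition into a single larger one."""
    return {
        value: [row for small in smalls for row in small]
        for value, smalls in files.items()
    }


columns = evolve_schema(MICRO_BATCHES)
small_files = repartition(MICRO_BATCHES, columns, key="day")
lake = coalesce(small_files)
print(columns)
print({day: len(smalls) for day, smalls in small_files.items()})
print(lake)
```

The design point is that frequent micro-batches keep latency low but leave many small files per partition; a later compaction step merges them so that query engines scan fewer, larger files.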
  23. Results and Next Steps

  24. Results so far
    • Platform built in 6 months by a team of 4 full-time engineers
    • <add plots here>
  25. Next steps
    • Integration with widely used analytics tools (Power BI, Tableau…), deployed worldwide
    • Data Science models consuming data through the Data Bus
    • Data Quality Monitoring
  26. www.datahackers.com.br

  27. Questions? www.datasprints.com

  28. Thanks! www.datasprints.com allan@datasprints.com https://www.linkedin.com/in/allansene/