DevOpsPorto Meetup 38: Using cutting-edge open-source technologies to build one of the biggest industrial Data Lake of the World by Allan Sene

DevOpsPorto

August 06, 2020

Transcript

  1. Using cutting-edge open-source technologies to build a World-class Industrial Data Lake (www.datasprints.com)
  2. Hello! My name is Allan Sene
    • Co-Founder & CTO @Data Sprints
    • Co-Founder, Podcaster & Instructor @Data Hackers
    • +10 years in Data & Software, +4 years as Data Engineer
  3. Agenda
    • The Challenge
    • What is a Data Lake?
    • Awesome cutting-edge data tools
    • Putting everything together to build an IDL
    • Results and Next Steps
    • Q&A
  4. THE CHALLENGE

  5. The Challenge
    • Multinational steel company, with plants worldwide
    • They need to give global managers the ability to track the production line
    • Data columns hold BLOBs, arrays and complex data types
    • Migrating from on-prem to cloud
    • Data sets of 70 GB (compressed) and 8 million rows, with more than 1,600 columns, very complex queries (200+ joins) and a 15-minute delay
    • Maximum query response time of 30 s
    • Very limited cloud budget: USD 15,000/year
    DSPT Webinar + DevOps Porto: Cutting-edge open-source to build Data Lakes www.datasprints.com
  6. Stakeholder's name, can you guess it?
  7. Milagres (Miracles)
  8. What is a Data Lake?

  9. "A data repository that holds a huge quantity of data in its raw state, structured or not. The schema is only defined when it is necessary for consumption." (Anne Buff, Best Practices Leader at SAS)
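
The "schema only defined when necessary for consumption" idea in the quote above is often called schema-on-read. A minimal sketch in plain Python, assuming hypothetical field names (`plant`, `temp_c`) and an in-memory stand-in for the raw storage; it is an illustration of the idea, not the talk's implementation:

```python
import io
import json

# Raw zone: records are kept exactly as they arrived, with no schema enforced.
# The field names and values here are hypothetical.
RAW_EVENTS = io.StringIO(
    '{"plant": "A", "temp_c": "1530.5", "tags": ["furnace", "line-1"]}\n'
    '{"plant": "B", "temp_c": 1498, "operator": "shift-2"}\n'
    '{"plant": "A"}\n'
)

# The schema is declared by the consumer at read time: column name -> caster.
READ_SCHEMA = {"plant": str, "temp_c": float}


def read_with_schema(raw, schema):
    """Project each raw JSON line onto the schema; missing fields become None."""
    rows = []
    for line in raw:
        record = json.loads(line)
        row = {}
        for column, cast in schema.items():
            value = record.get(column)
            row[column] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, READ_SCHEMA)
print(rows)
```

Fields the consumer did not ask for (`tags`, `operator`) stay in the raw data untouched, so a different consumer can later read the same records with a different schema.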
  10. So, a Data Lake is...
  11. Not really...
  12. A Data Lake must have
    • Security & Audit
    • Catalog & Access
    • Data Pipelines & Orchestration
    Basically, a platform that ensures safe and efficient data consumption
  13. Awesome cutting-edge data tools

  14. Dremio
    • Data Lake Engine
    • Built by PMC members of Apache Arrow
    • User-friendly
    • Open-source
    • Has an Enterprise version
  15. Dremio
    • SQL & API
    • Wiki
    • Connections:
      ◦ S3, HDFS, AzureFS
      ◦ Mongo, Elastic
      ◦ Postgres, MySQL
    • Supports JSON, CSV, Parquet, Avro…
    • Reflections
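
Reflections are Dremio's precomputed, physically optimized copies of a dataset that its planner can use to answer queries faster. The sketch below is only an analogy built on Python's standard `sqlite3` module, not Dremio's API: the table names and data are hypothetical, and the choice between the raw and the precomputed table is made by hand here, whereas Dremio makes it inside the query planner.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw dataset: one row per production record.
conn.execute("CREATE TABLE production (plant TEXT, line TEXT, tons REAL)")
conn.executemany(
    "INSERT INTO production VALUES (?, ?, ?)",
    [("A", "L1", 10.0), ("A", "L2", 5.5), ("B", "L1", 7.0), ("B", "L1", 3.0)],
)

# Reflection-style materialization: compute the aggregate once, up front.
conn.execute(
    "CREATE TABLE production_by_plant AS "
    "SELECT plant, SUM(tons) AS total_tons FROM production GROUP BY plant"
)


def total_tons_raw(plant):
    """Answer from the raw table: aggregates the matching rows on every call."""
    return conn.execute(
        "SELECT SUM(tons) FROM production WHERE plant = ?", (plant,)
    ).fetchone()[0]


def total_tons_accelerated(plant):
    """Answer from the precomputed table: one small lookup per call."""
    row = conn.execute(
        "SELECT total_tons FROM production_by_plant WHERE plant = ?", (plant,)
    ).fetchone()
    return row[0] if row else None


# Both paths must agree; only the amount of work per query differs.
for plant in ("A", "B"):
    assert total_tons_raw(plant) == total_tons_accelerated(plant)
print(total_tons_accelerated("A"))
```

The trade-off is the usual one for materialized data: faster reads in exchange for extra storage and a refresh step whenever the raw table changes.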
  16. Dremio's UI
  17. dbt
    • dbt = "Data Build Tool"
    • Built by Fishtown Analytics
    • Data Pipeline Orchestration
    • "Airflow for Data Analysts"
    • Open-source
    • SaaS version
  18. dbt
    • ETL over an MPP
    • Code Versioning
    • Data Lineage & Docs
    • Data Validation
    • Seed Data
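
To make the ideas on the two dbt slides concrete (models as SELECT statements, lineage-ordered runs, seed data, data validation), here is a toy sketch using only Python's standard library (`graphlib` needs Python 3.9+). It is not dbt itself: dbt models live in SQL files that reference each other with `ref()`, and they run against the warehouse. The model names, the seed table and the `not_null_failures` helper are hypothetical.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical models: name -> (upstream tables, SELECT statement).
MODELS = {
    "production_by_plant": (
        {"stg_production"},
        "SELECT plant, SUM(tons) AS total_tons FROM stg_production GROUP BY plant",
    ),
    "stg_production": (
        {"seed_production"},
        "SELECT plant, CAST(tons AS REAL) AS tons FROM seed_production",
    ),
}

conn = sqlite3.connect(":memory:")

# Seed data: a small static table loaded before any model runs.
conn.execute("CREATE TABLE seed_production (plant TEXT, tons TEXT)")
conn.executemany(
    "INSERT INTO seed_production VALUES (?, ?)",
    [("A", "10"), ("A", "5.5"), ("B", "7")],
)

# Lineage: order the models so each one runs after everything it depends on.
graph = {name: deps for name, (deps, _sql) in MODELS.items()}
run_order = [n for n in TopologicalSorter(graph).static_order() if n in MODELS]

# Each model is materialized as a table from its SELECT statement.
for name in run_order:
    conn.execute(f"CREATE TABLE {name} AS {MODELS[name][1]}")


def not_null_failures(table, column):
    """Data validation: count rows that violate a not-null expectation."""
    query = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(query).fetchone()[0]


print(run_order)
print(not_null_failures("production_by_plant", "total_tons"))
```

Because the run order is derived from the declared dependencies, the models can be listed in any order, which is the property that makes lineage graphs and docs possible to generate.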
  19. dbt's UI
  20. Putting everything together to build a DL

  21. Bringing it all together
    • Pipelines & Orchestration => dbt
    • Catalog & Access => Dremio
    • Processing => Spark
    • Storage => Amazon S3
  22. Bringing it all together [architecture diagram; labels: JDBC, Micro-batch Data, Repartitioned Data, Repartition & Schema Evolution, Optimization, Coalesced Data, Data Port, Data Explorer]
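
The diagram labels mention micro-batch, repartitioned and coalesced data. The exact flow is not recoverable from the transcript, so the following is only a plain-Python sketch of what those two reshaping steps generally mean; it is not Spark code and not the talk's pipeline. The partition key (`plant`), the `seq` field and the target chunk size are assumptions.

```python
from collections import defaultdict


def repartition(micro_batches, key):
    """Regroup rows from many micro-batches into one partition per key value."""
    partitions = defaultdict(list)
    for batch in micro_batches:
        for row in batch:
            partitions[row[key]].append(row)
    return dict(partitions)


def coalesce(chunks, target_size):
    """Merge many small chunks into fewer chunks of up to target_size rows."""
    merged, current = [], []
    for chunk in chunks:
        for row in chunk:
            current.append(row)
            if len(current) == target_size:
                merged.append(current)
                current = []
    if current:  # keep the last, possibly smaller, chunk
        merged.append(current)
    return merged


# Hypothetical micro-batches, as they might arrive from a JDBC extraction.
micro_batches = [
    [{"plant": "A", "seq": 1}, {"plant": "B", "seq": 2}],
    [{"plant": "A", "seq": 3}],
    [{"plant": "B", "seq": 4}, {"plant": "A", "seq": 5}],
]

by_plant = repartition(micro_batches, "plant")
coalesced = coalesce(micro_batches, target_size=4)

print({plant: [r["seq"] for r in rows] for plant, rows in by_plant.items()})
print([len(chunk) for chunk in coalesced])
```

Repartitioning groups rows so that queries filtering on the key touch fewer files, while coalescing reduces the many small files that frequent micro-batches tend to produce.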
  23. Results and Next Steps

  24. Results, until now
    • Platform built in 6 months by a team of 4 full-time engineers
    • <add plots here>
  25. Next steps
    • Integrating with very common Analytics tools (Power BI, Tableau…), deployed worldwide
    • Data Science models consuming data through the Data Bus
    • Data Quality Monitoring
  26. www.datahackers.com.br

  27. Questions? www.datasprints.com

  28. Thanks! www.datasprints.com allan@datasprints.com https://www.linkedin.com/in/allansene/