Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017

Bol.com has been a Hadoop user since 2008, when its first cluster was built for a recommendation algorithm.

https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again

Big Data Spain 2017
16th - 17th Kinépolis Madrid

Big Data Spain

December 01, 2017

Transcript

  1. None
  2. Make the elephant fly, once again Luangsay Sourygna sluangsay@bol.com

  3. Agenda • Bol.com • Data Quality • Hive_compared_bq • DataPrep • PII data • Should we lift & shift?
  4. Bol.com • Number 1 webshop in Netherlands & Belgium • Yes, in Spain you’ve also heard about us: Winner of the Entrepreneurial Award • This year, the winner is Bol.com from The Netherlands (Barcelona, 2014)
  5. It all began in 2008…

  6. So … How big is bol.com? Really?

  7. > 16 million products for sale • > 50 million in catalog • Hadoop in production since 2010 • > 6 million active customers • > 40 million visits per month • > 5000 million pageviews/year
  8. Hadoop at bol.com • On Premise Production cluster = 35 nodes • 30+ IT teams • Several Business Teams
  9. But a lot of challenges… • Lack of flexibility: • HDP version • Christmas peak • Security issues • Who likes Kerberos? • No PII • We’re overloaded: • Sysops • YARN
  10. Let’s go to the Cloud

  11. Moving data seems easy

  12. Then: migrating ETL…

  13. Difficult in Big Data…

  14. Same challenge in 2015 • Hortonworks’ project (audit): • SQL Server -> Hive • Huge ETL process • To test: 500MB – 60GB databases • 1st tests: • #rows • Manually reading a few lines • Developed 3 validation tools • Ex: https://community.hortonworks.com/articles/1283/hive-script-to-validate-tables-compare-one-with-an.html
  15. hive_compared_bq • Improved validation tool: • No Sqoop • Totally scalable (moving only aggregated checksums) • Now: & (also …) • https://github.com/bolcom/hive_compared_bq
  16. Improving test coverage • Not only for ETL migration • Also for “integration” tests • Testing tools: • Unit tests: quick, automated • Hive_compared_bq: huge tests, capturing outliers
  17. Data Quality with DataPrep

  18. Yet another BigData processing tool… • input: • “Intelligent, visual & serverless data preparation” • Business People focused • “Excel”-like • See data + statistics • Just push “Run” button… • Presentation: https://www.youtube.com/watch?v=Q5GuTIgmt98
  19. None
  20. Why Data Quality? • For IT: better understanding of data • See & “feel” data • Help before developing • Stats: outliers, skew • For Business: • Before: 1st analysis → better Jiras • After: validate with stats
  21. Beware of the dog

  22. Security in our Hadoop • Kerberos issues: • Integration Java libraries • REST interfaces not secured • Not encrypted • No strong audit • No strict HDFS, HBase, Hive permissions
  23. PII in BigQuery! • Always encrypted: disk + network • Serverless: patching on time • No Kerberos • Central access control • Hide PII columns with views • Advanced logging
  24. Example of logs

  25. Migrating: which strategy?

  26. Initial idea: many Dataproc • “a Hadoop cluster in 90 seconds” • 1 cluster for each team (per job?) • Capacity issue • Configuration issue • Easy migration: no rewrite of MR, Pig, Hive, Flink…
  27. HBase → BigTable • HBase replication hooked to BigTable

  28. Finally, no lift & shift • Use of HBase is not optimal • Better to use native Google tools: • Cheaper • Faster
  29. Hive vs BigQuery • “Now with BigQuery, I start to think that a query is slow when it takes more than 30 seconds”
  30. Alternatives to Pig • BigQuery • Beam

  31. Flexibility of Beam • Easy for future migration: • In our Hadoop • In Google Cloud • Or maybe Beam in Cloud = + • Overhead of DataFlow = 2 min, similar to Dataproc • Flink = better metrics + debugging
  32. Netherlands… • Want to discover what it feels like living 4 meters below sea level? • Yes… We’re hiring!
  33. Thanks till next bol.com Sourygna Luangsay sluangsay@bol.com
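
Slide 15 describes validating a migrated table by "moving only aggregated checksums" instead of the rows themselves. The Python sketch below illustrates that general idea only; it is not the actual implementation of hive_compared_bq. The function names, the bucketing scheme and the sample rows are all hypothetical, and both tables are assumed to fit in memory as lists of tuples, whereas a real tool would compute the aggregates inside Hive and BigQuery.

```python
import hashlib
from collections import defaultdict


def row_hash(row):
    """Stable 64-bit hash of one row (a tuple of values)."""
    encoded = "\x1f".join(repr(v) for v in row).encode("utf-8")
    return int.from_bytes(hashlib.sha1(encoded).digest()[:8], "big")


def bucket_checksums(rows, key_index=0, num_buckets=16):
    """Aggregate rows into per-bucket (row_count, checksum) pairs.

    Only these small aggregates would need to leave each system.
    The bucket is derived from the key column, so the same key lands
    in the same bucket on both sides regardless of row order.
    """
    buckets = defaultdict(lambda: [0, 0])
    for row in rows:
        key_bytes = repr(row[key_index]).encode("utf-8")
        digest = hashlib.sha1(key_bytes).digest()
        bucket = int.from_bytes(digest[:4], "big") % num_buckets
        buckets[bucket][0] += 1
        # Summing modulo 2**64 makes the checksum order-independent.
        buckets[bucket][1] = (buckets[bucket][1] + row_hash(row)) % (1 << 64)
    return {b: tuple(v) for b, v in buckets.items()}


def differing_buckets(source_rows, target_rows, **kwargs):
    """Return the buckets whose aggregates differ between the two tables."""
    src = bucket_checksums(source_rows, **kwargs)
    dst = bucket_checksums(target_rows, **kwargs)
    return sorted(b for b in set(src) | set(dst) if src.get(b) != dst.get(b))


# Hypothetical sample data: same rows in a different order, one value changed.
hive_rows = [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)]
bq_rows = [(3, "carol", 30.0), (1, "alice", 10.0), (2, "bob", 21.0)]

print(differing_buckets(hive_rows, hive_rows))  # identical tables: []
print(differing_buckets(hive_rows, bq_rows))    # the bucket holding key 2
```

Only the buckets reported as different need a row-level look, which is what keeps the comparison cheap on large tables.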