Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017

Bol.com was an early Hadoop user: it has run Hadoop since 2008, when a cluster was first built for a recommendation algorithm.

https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again

Big Data Spain 2017
16th - 17th Kinépolis Madrid

Big Data Spain

December 01, 2017

Transcript

  1. Agenda • Bol.com • Data Quality • hive_compared_bq • DataPrep • PII data • Should we lift & shift?
  2. Bol.com • Number 1 webshop in the Netherlands & Belgium • Yes, in Spain you’ve also heard about us: winner of the Entrepreneurial Award: “This year, the winner is Bol.com from The Netherlands” (Barcelona, 2014)
  3. > 16 million products for sale • > 50 million in catalog • Hadoop in production since 2010 • > 6 million active customers • > 40 million visits per month • > 5000 million pageviews/year
  4. Hadoop at bol.com • On-premise production cluster = 35 nodes • 30+ IT teams • Several business teams
  5. But a lot of challenges… • Lack of flexibility: • HDP version • Christmas peak • Security issues • Who likes Kerberos? • No PII • We’re overloaded: • Sysops • YARN
  6. Same challenge in 2015 • Hortonworks’ project (audit): • SQL Server -> Hive • Huge ETL process • To test: 500MB – 60GB databases • 1st tests: • #rows • Manually reading a few lines • Developed 3 validation tools • Ex: https://community.hortonworks.com/articles/1283/hive-script-to-validate-tables-compare-one-with-an.html
  7. hive_compared_bq • Improved validation tool: • No Sqoop • Totally scalable (moving only aggregated checksums) • Now: & (also …) • https://github.com/bolcom/hive_compared_bq
  8. Improving test coverage • Not only for ETL migration • Also for “integration” tests • Testing tools: • Unit tests: quick, automated • hive_compared_bq: huge tests, capturing outliers
  9. Yet another Big Data processing tool… • input: • “Intelligent, visual & serverless data preparation” • Focused on business people • “Excel”-like • See data + statistics • Just push the “Run” button… • Presentation: https://www.youtube.com/watch?v=Q5GuTIgmt98
  10. Why Data Quality? • For IT: better understanding of data • See & “feel” data • Help before developing • Stats: outliers, skew • For Business: • Before: 1st analysis -> better Jiras • After: validate with stats
  11. Security in our Hadoop • Kerberos issues: • Integration with Java libraries • REST interfaces not secured • Not encrypted • No strong audit • No strict HDFS, HBase, Hive permissions
  12. PII in BigQuery! • Always encrypted: disk + network • Serverless: patching on time • No Kerberos • Central access control • Hide PII columns with views • Advanced logging
  13. Initial idea: many Dataproc clusters • “a Hadoop cluster in 90 seconds” • 1 cluster for each team (per job?) • Capacity issue • Configuration issue • Easy migration: no rewrite of MR, Pig, Hive, Flink…
  14. Finally, no lift & shift • Use of HBase is not optimal • Better to use native Google tools: • Cheaper • Faster
  15. Hive vs BigQuery • “Now with BigQuery, I start to think that a query is slow when it takes more than 30 seconds”
  16. Flexibility of Beam • Easy for future migration: • In our Hadoop • In Google Cloud • Or maybe Beam in Cloud = + • Overhead of Dataflow = 2 min, similar to Dataproc • Flink = better metrics + debugging
  17. Netherlands… • Want to discover what it feels like to live 4 meters below sea level? • Yes… We’re hiring!
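
Slide 7 describes hive_compared_bq as scalable because it moves "only aggregated checksums" rather than full tables. The snippet below is a minimal sketch of that general idea, not the tool's actual implementation: each side hashes its rows, groups them into buckets by a hash of the key column, and sums the row hashes per bucket. Only the small per-bucket sums would need to be exchanged and compared. The function names, the SHA-1 choice, the bucket count and the sample rows are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def _hash_int(text: str) -> int:
    """Return a stable 64-bit integer hash of a string (assumed scheme)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_checksums(rows, key_index=0, num_buckets=16):
    """Aggregate row hashes into per-bucket checksums.

    Rows are assigned to a bucket by hashing the key column, and the
    row hashes in each bucket are summed modulo 2**64. Summation is
    order-independent, so row order does not affect the result.
    """
    sums = defaultdict(int)
    for row in rows:
        bucket = _hash_int(str(row[key_index])) % num_buckets
        row_hash = _hash_int("\x1f".join(str(value) for value in row))
        sums[bucket] = (sums[bucket] + row_hash) % (1 << 64)
    return dict(sums)


def differing_buckets(left_rows, right_rows, key_index=0, num_buckets=16):
    """Return the sorted bucket ids whose checksums differ between two tables."""
    left = bucket_checksums(left_rows, key_index, num_buckets)
    right = bucket_checksums(right_rows, key_index, num_buckets)
    return sorted(b for b in set(left) | set(right) if left.get(b) != right.get(b))


# Hypothetical sample data standing in for a Hive table and its BigQuery copy.
hive_rows = [(1, "alice", 10.0), (2, "bob", 20.0), (3, "carol", 30.0)]
bq_rows = [(3, "carol", 30.0), (1, "alice", 10.0), (2, "bob", 20.0)]
bq_rows_bad = [(1, "alice", 10.0), (2, "bob", 21.0), (3, "carol", 30.0)]

print(differing_buckets(hive_rows, bq_rows))      # same rows, different order
print(differing_buckets(hive_rows, bq_rows_bad))  # one row differs
```

In a real migration the per-bucket sums would be computed by a query inside each engine, so only a few numbers per bucket leave the cluster; a mismatching bucket then tells you which subset of rows to inspect in detail.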