Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017

Bol.com was an early Hadoop user: it has run Hadoop since 2008, when its first cluster was built for a recommendation algorithm.

https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again

Big Data Spain 2017
16th - 17th Kinépolis Madrid

Big Data Spain

December 01, 2017

Transcript

  1. Agenda • Bol.com • Data Quality • hive_compared_bq • DataPrep • PII data • Should we lift & shift?
  2. Bol.com • Number 1 webshop in the Netherlands & Belgium • Yes, in Spain you’ve also heard about us: winner of the Entrepreneurial Award • “This year, the winner is Bol.com from The Netherlands” (Barcelona, 2014)
  3. > 16 million products for sale • > 50 million in catalog • Hadoop in production since 2010 • > 6 million active customers • > 40 million visits per month • > 5000 million pageviews/year
  4. Hadoop at bol.com • On-premise production cluster = 35 nodes • 30+ IT teams • Several business teams
  5. But a lot of challenges… • Lack of flexibility: • HDP version • Christmas peak • Security issues: • Who likes Kerberos? • No PII • We’re overloaded: • Sysops • YARN
  6. Same challenge in 2015 • Hortonworks’ project (audit): • SQL Server -> Hive • Huge ETL process • To test: 500MB – 60GB databases • 1st tests: • #rows • Manually reading a few lines • Developed 3 validation tools • Ex: https://community.hortonworks.com/articles/1283/hive-script-to-validate-tables-compare-one-with-an.html
  7. hive_compared_bq • Improved validation tool: • No Sqoop • Totally scalable (moving only aggregated checksums) • Now: & (also …) • https://github.com/bolcom/hive_compared_bq
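The idea of "moving only aggregated checksums" can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual hive_compared_bq implementation: the function names, the bucketing scheme and the in-memory lists standing in for a Hive table and a BigQuery table are all hypothetical. Each side hashes its rows locally and sums the hashes per bucket, so only one small (row count, checksum) pair per bucket has to be compared.

```python
import hashlib
from collections import defaultdict


def row_checksum(row):
    """Hash one row (a dict) into a 64-bit integer, independent of column order."""
    canonical = "|".join(f"{col}={row[col]!r}" for col in sorted(row))
    digest = hashlib.sha1(canonical.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def bucket_checksums(rows, key, buckets=16):
    """Aggregate row checksums per bucket of the key column (hypothetical scheme)."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [row count, checksum sum]
    for row in rows:
        key_hash = hashlib.sha1(repr(row[key]).encode("utf-8")).digest()
        bucket = int.from_bytes(key_hash[:4], "big") % buckets
        totals[bucket][0] += 1
        totals[bucket][1] = (totals[bucket][1] + row_checksum(row)) % 2**64
    return {bucket: tuple(pair) for bucket, pair in totals.items()}


def differing_buckets(source_rows, target_rows, key, buckets=16):
    """Return the buckets whose row count or aggregated checksum differ."""
    src = bucket_checksums(source_rows, key, buckets)
    tgt = bucket_checksums(target_rows, key, buckets)
    return sorted(b for b in set(src) | set(tgt) if src.get(b) != tgt.get(b))


# Toy data standing in for the two tables (assumption: same schema on both sides).
source = [{"id": 1, "price": 10}, {"id": 2, "price": 20}, {"id": 3, "price": 30}]
same_rows_other_order = list(reversed(source))
print(differing_buckets(source, same_rows_other_order, key="id"))  # prints []
```

Because the per-bucket sum does not depend on row order, two tables with the same rows always match, while a changed value shows up as a mismatch in the bucket of its key. Only the rows of a mismatching bucket then need a closer look.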
  8. Improving test coverage • Not only for ETL migration • Also for “integration” tests • Testing tools: • Unit tests: quick, automated • hive_compared_bq: huge tests, capturing outliers
  9. Yet another BigData processing tool… • input: • “Intelligent, visual & serverless data preparation” • Focused on business people • “Excel”-like • See data + statistics • Just push the “Run” button… • Presentation: https://www.youtube.com/watch?v=Q5GuTIgmt98
  10. Why Data Quality? • For IT: better understanding of data • See & “feel” data • Help before developing • Stats: outliers, skew • For Business: • Before: 1st analysis → better Jiras • After: validate with stats
  11. Security in our Hadoop • Kerberos issues: • Integration of Java libraries • REST interfaces not secured • Not encrypted • No strong audit • No strict HDFS, HBase, Hive permissions
  12. PII in BigQuery! • Always encrypted: disk + network • Serverless: patching on time • No Kerberos • Central access control • Hide PII columns with views • Advanced logging
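"Hide PII columns with views" could look roughly like the sketch below. It only builds the SQL text for a BigQuery view that leaves out the sensitive columns, using `SELECT * EXCEPT`. The helper name, the dataset and table names and the column list are hypothetical, and the slides do not show how bol.com actually defined its views.

```python
def pii_free_view_sql(view, table, pii_columns):
    """Build BigQuery standard SQL for a view that omits the given PII columns."""
    if not pii_columns:
        raise ValueError("pii_columns must not be empty")
    hidden = ", ".join(f"`{column}`" for column in pii_columns)
    return (
        f"CREATE OR REPLACE VIEW `{view}` AS\n"
        f"SELECT * EXCEPT ({hidden})\n"
        f"FROM `{table}`"
    )


# Hypothetical names: a raw table holding PII and a reporting view without it.
print(pii_free_view_sql("reporting.orders", "raw.orders", ["email", "address"]))
```

Analysts would then be granted access to the dataset holding the view rather than to the raw table, which is one way to keep access control central.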
  13. Initial idea: many Dataproc • “a Hadoop cluster in 90 seconds” • 1 cluster for each team (per job?) • Capacity issue • Configuration issue • Easy migration: no rewrite of MR, Pig, Hive, Flink…
  14. Finally, no lift & shift • Use of HBase is not optimal • Better using native Google Tools: • Cheaper • Faster
  15. Hive vs BigQuery • “Now with BigQuery, I start to think that a query is slow when it takes more than 30 seconds”
  16. Flexibility of Beam • Easy for future migration: • In our Hadoop • In Google Cloud • Or maybe Beam in Cloud = + • Overhead of Dataflow = 2 min, similar to Dataproc • Flink = better metrics + debugging
  17. Netherlands… • Want to discover what it feels like to live 4 meters below sea level? • Yes… We’re hiring!