Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017

Make the elephant fly, once again Luangsay Sourygna [email protected]

Agenda
• Bol.com
• Data Quality
• hive_compared_bq
• DataPrep
• PII data
• Should we lift & shift?

Bol.com
• Number 1 webshop in the Netherlands & Belgium
• Yes, in Spain you’ve also heard about us: winner of the Entrepreneurial Award (“This year, the winner is Bol.com from The Netherlands”, Barcelona, 2014)

It all began in 2008…

So… How big is bol.com? Really?

• > 16 million products for sale
• > 50 million in catalog
• Hadoop in production since 2010
• > 6 million active customers
• > 40 million visits per month
• > 5000 million pageviews/year

Hadoop at bol.com
• On-premise production cluster = 35 nodes
• 30+ IT teams
• Several business teams

But a lot of challenges…
• Lack of flexibility:
• HDP version
• Christmas peak
• Security issues
• Who likes Kerberos?
• No PII
• We’re overloaded:
• Sysops
• YARN

Let’s go to the Cloud

Moving data seems easy

Then: migrating ETL…

Difficult in Big Data…

Same challenge in 2015
• Hortonworks’ project (audit):
• SQL Server -> Hive
• Huge ETL process
• To test: 500 MB – 60 GB databases
• 1st tests:
• #rows
• Manually reading a few lines
• Developed 3 validation tools
• Ex: https://community.hortonworks.com/articles/1283/hive-script-to-validate-tables-compare-one-with-an.html

hive_compared_bq
• Improved validation tool:
• No Sqoop
• Totally scalable (moving only aggregated checksums)
• Now: & (also …)
• https://github.com/bolcom/hive_compared_bq
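
A minimal Python sketch of the "moving only aggregated checksums" idea, not the actual hive_compared_bq code: each side hashes its own rows, sums the hashes per bucket, and only the small per-bucket totals are compared. The function names, the bucket count and the sample rows are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def _digest(text: str) -> int:
    """Stable 64-bit integer hash of a string (first 8 bytes of SHA-1)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def bucket_checksums(rows, key, num_buckets=16):
    """Aggregate row hashes into per-bucket sums.

    Rows are assigned to a bucket by hashing their key, so the same row
    lands in the same bucket on both sides regardless of row order.
    """
    sums = defaultdict(int)
    for row in rows:
        bucket = _digest(str(row[key])) % num_buckets
        # Canonical text form of the row: columns in sorted order.
        canonical = "|".join(f"{col}={row[col]}" for col in sorted(row))
        sums[bucket] = (sums[bucket] + _digest(canonical)) % (1 << 64)
    return dict(sums)


def differing_buckets(left, right):
    """Buckets whose aggregated checksum differs between the two sides."""
    return sorted(b for b in set(left) | set(right) if left.get(b) != right.get(b))


# Hypothetical sample data: same rows in a different order, one changed price.
hive_rows = [{"id": 1, "price": 10}, {"id": 2, "price": 20}, {"id": 3, "price": 30}]
bq_rows = [{"id": 3, "price": 30}, {"id": 1, "price": 10}, {"id": 2, "price": 21}]

suspects = differing_buckets(
    bucket_checksums(hive_rows, "id"), bucket_checksums(bq_rows, "id")
)
print("buckets to inspect:", suspects)
```

Only the bucket totals need to leave each system, which is what keeps the comparison scalable; rows are fetched in detail only for the buckets that disagree.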

Improving test coverage
• Not only for ETL migration
• Also for “integration” tests
• Testing tools:
• Unit tests: quick, automated
• hive_compared_bq: huge tests, capturing outliers

Data Quality with DataPrep

Yet another Big Data processing tool…
• input:
• “Intelligent, visual & serverless data preparation”
• Focused on business people
• “Excel”-like
• See data + statistics
• Just push the “Run” button…
• Presentation: https://www.youtube.com/watch?v=Q5GuTIgmt98

Why Data Quality?
• For IT: better understanding of data
• See & “feel” data
• Help before developing
• Stats: outliers, skew
• For Business:
• Before: 1st analysis -> better Jiras
• After: validate with stats
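
To illustrate what a statistic such as "outliers" can look like in practice, here is a small sketch in plain Python using the common 1.5 × IQR rule. The slides do not say how DataPrep computes its statistics; the rule, the function name and the sample values are assumptions for illustration only.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [v for v in values if v < low or v > high]


# Hypothetical order amounts with one suspicious value.
order_amounts = [10, 12, 11, 13, 12, 11, 10, 500]
print(iqr_outliers(order_amounts))
```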

Beware of the dog

Security in our Hadoop
• Kerberos issues:
• Integration with Java libraries
• REST interfaces not secured
• Not encrypted
• No strong audit
• No strict HDFS, HBase, Hive permissions

PII in BigQuery!
• Always encrypted: disk + network
• Serverless: patching on time
• No Kerberos
• Central access control
• Hide PII columns with views
• Advanced logging
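
One possible way to "hide PII columns with views" is a view that selects everything except the sensitive columns, using BigQuery standard SQL's `SELECT * EXCEPT`. The sketch below only builds the DDL text in Python; the project, dataset, table and column names are hypothetical, and the slides do not show the statements actually used at bol.com.

```python
def pii_free_view_sql(source_table, view_name, pii_columns):
    """Build BigQuery DDL for a view exposing every column except the PII ones."""
    if not pii_columns:
        raise ValueError("list at least one PII column to hide")
    hidden = ", ".join(pii_columns)
    return (
        f"CREATE OR REPLACE VIEW `{view_name}` AS\n"
        f"SELECT * EXCEPT ({hidden})\n"
        f"FROM `{source_table}`"
    )


# Hypothetical names, for illustration only.
sql = pii_free_view_sql(
    "my_project.raw.customers",
    "my_project.reporting.customers_no_pii",
    ["email", "phone_number"],
)
print(sql)
```

In such a setup, readers would typically be granted access to the view rather than to the underlying table, so the hidden columns stay out of reach.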

Example of logs

Migrating: which strategy?

Initial idea: many Dataproc clusters
• “a Hadoop cluster in 90 seconds”
• 1 cluster for each team (per job?)
• Capacity issue
• Configuration issue
• Easy migration: no rewrite of MR, Pig, Hive, Flink…

HBase -> BigTable
• HBase replication hooked to BigTable

Finally, no lift & shift
• Use of HBase is not optimal
• Better using native Google tools:
• Cheaper
• Faster

Hive vs BigQuery
• “Now with BigQuery, I start to think that a query is slow when it takes more than 30 seconds”

Alternatives to Pig
• BigQuery
• Beam

Flexibility of Beam
• Easy for future migration:
• In our Hadoop
• In Google Cloud
• Or maybe Beam in Cloud = +
• Overhead of DataFlow = 2 min, similar to Dataproc
• Flink = better metrics + debugging

Netherlands…
• Want to discover what it feels like to live 4 meters below sea level?
• Yes… We’re hiring!

Thanks till next bol.com Sourygna Luangsay [email protected]
