
Devoxx 2017 - Criteo - Hadoop Cluster Under Pressure

Devoxx 2017 FR
Criteo
Hadoop Cluster Under Pressure
Operating one of the largest clusters in Europe

Rémy SAISSY

April 06, 2017

Transcript

1. Hadoop cluster under pressure. Rémy Saissy, Lead R&D - Lake. Operating one of the largest clusters in Europe.
2. Our Mission: TARGET THE RIGHT USER AT THE RIGHT TIME WITH THE RIGHT MESSAGE
3. [Diagram: the Criteo stack: Top Level Applications, Platforms, Infrastructure and SRE layers; Advertiser (Catalog, User Events, Campaigns, Reporting) and Publisher (RTB, Direct, Campaigns, Reporting) platforms; WebScale systems; Prediction; Dynamic Creative; Recommendation Engine]
4. Production Clusters
    Cloudera CDH4 cluster: 1,058 datanodes, 12,600 cores, 84TB memory, 37PB raw storage
    Cloudera CDH5 cluster: 1,353 datanodes, 32,472 cores, 338TB memory, 108PB raw storage
6. Preproduction Clusters
    Criteo has 3 preproduction Hadoop clusters:
    Amsterdam preprod: 54 datanodes
    Paris preprod: 100 datanodes
    Lake preprod: 53 datanodes
7. Outage timeline
    • Friday evening, 1st on-call shift of a new team member
    • Incident started Friday night
    • 5 people took turns over 36 hours
    • 20 people involved in total
    • Source of the outage identified Sunday at 1am
9. How did we fix it?
    • Stop and isolate the Namenodes
    • Mitigate heap consumption at startup
    • Adjust the heap, start the Namenodes one by one
    • Enable RPCs, service ones first, then client ones (a config sketch follows below)
    • Investigate and delete superfluous data
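Enabling service RPCs before client RPCs presupposes that the Namenode exposes a dedicated service RPC endpoint for datanodes, journalnodes and the ZKFC, separate from the client-facing one. A minimal hdfs-site.xml sketch of that split, assuming an illustrative nameservice id "lake", namenode id "nn1" and placeholder host and ports; none of these values come from the deck:

    <!-- Sketch: separate client and service RPC endpoints so the Namenode can
         serve datanodes/journalnodes/ZKFC before client traffic is re-opened.
         The "lake"/"nn1" ids, hostname and ports are placeholders. -->
    <property>
      <name>dfs.namenode.rpc-address.lake.nn1</name>
      <value>nn1.example.com:8020</value>   <!-- client RPC endpoint -->
    </property>
    <property>
      <name>dfs.namenode.servicerpc-address.lake.nn1</name>
      <value>nn1.example.com:8021</value>   <!-- service RPC endpoint (DN, JN, ZKFC) -->
    </property>
    <property>
      <name>dfs.namenode.service.handler.count</name>
      <value>20</value>   <!-- handler threads for the service endpoint -->
    </property>
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>100</value>   <!-- handler threads for client RPCs -->
    </property>

With this split, datanode heartbeats and journal traffic keep flowing on the service port even while client RPCs are saturated or still disabled, which is what makes the staged restart described in the slide possible.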
10. What happened?
    [Timeline: 218M blocks → 242M blocks (outage), 24 hours, 222M blocks (end of outage)]
11. Journalnode errors
    Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
    8485: Asked for firstTxId 71030619500 which is in the middle of file /var/cache/hadoop/jn/root/current/edits_0000000071028070681-0000000071034147476
        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager.getRemoteEditLogs(FileJournalManager.java:195)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:638)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:178)
12. Namenode is slowwwwww
    2016-02-20 22:10:51,743 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 47515ms#012GC pool 'PS MarkSweep' had collection(s): count=1 time=47729ms
    2016-02-20 22:11:53,331 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 47585ms#012GC pool 'PS MarkSweep' had collection(s): count=1 time=47845ms
    2016-02-20 22:12:53,750 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 47916ms#012GC pool 'PS MarkSweep' had collection(s): count=1 time=48286ms
    2016-02-20 22:13:57,209 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 49450ms#012GC pool 'PS MarkSweep' had collection(s): count=1 time=49701ms
    2016-02-20 22:14:57,535 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 49324ms#012GC pool ...
13. Next steps?
    • Post-mortem: communication, closer quota monitoring
    • Adjust hardware
    • Improve stability
    • Improve performance
14. Adjust Hardware - Problem
    • The Namenode has a major design issue
    • 2 very different roles: DB (metadata) and servicing (RPC)
    • The scalability issue comes from the in-memory DB
    • The Namenode later proved to have a lot of single-threaded code (in our version of CDH)
    • We shared the Namenode and the ResourceManager on the same machine
15. Adjust Hardware - Mitigation
    • Vertical scaling
    • Deal with the large FSImage: RAM 192GB → 512GB, JVM heap 150GB → 332GB
    • Faster writes of both edits and fsimage/checkpoint: SSD storage (see the config sketch below)
    • Mitigate the single-threaded code impact: 2.0GHz → 3.4GHz CPUs
    • Specialize the Namenode machines
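The "SSD storage for edits and fsimage" point maps onto the Namenode's metadata directories. A minimal hdfs-site.xml sketch; the /ssd1 and /ssd2 mount points are illustrative assumptions, not paths from the deck:

    <!-- Sketch: keep fsimage/checkpoints and edit logs on SSD-backed paths.
         The /ssd1 and /ssd2 mount points are illustrative assumptions. -->
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///ssd1/hadoop/nn,file:///ssd2/hadoop/nn</value>   <!-- fsimage (and edits, by default) -->
    </property>
    <property>
      <name>dfs.namenode.edits.dir</name>
      <value>file:///ssd1/hadoop/nn-edits</value>   <!-- optional: separate location for the edit log -->
    </property>

On a QJM-based HA setup like the one described in this deck, the journalnodes' dfs.journalnode.edits.dir benefits from the same SSD treatment, since edit-log fsync latency there also sits on the write path.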
16. Improve Stability - HDFS-10220
    • A job creates files
    • Along with each file creation, a lease (lock) is created
    • In case of failure, the lease is not released; it is cleaned up later by the Namenode
    • What happens when you have a very large number of unreleased leases?
      → The Namenode cleans up all leases at once and becomes unresponsive
      → A failover happens, but the newly active Namenode also becomes unresponsive…
    • HDFS-10220 gave us more control to fine-tune the duty cycle with which the Namenode recovers old leases (see the sketch below):
      dfs.namenode.lease-recheck-interval-ms
      dfs.namenode.max-lock-hold-to-release-lease-ms
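The two knobs from HDFS-10220 bound how long the Namenode holds its write lock while reclaiming expired leases. A minimal hdfs-site.xml sketch; the values shown are, to the best of my knowledge, the upstream defaults rather than Criteo's settings, so treat them only as a starting point:

    <!-- Sketch: throttle lease recovery so it cannot monopolize the NN lock.
         Values below are assumed upstream defaults, not Criteo's settings. -->
    <property>
      <name>dfs.namenode.lease-recheck-interval-ms</name>
      <value>2000</value>   <!-- how often the lease monitor wakes up -->
    </property>
    <property>
      <name>dfs.namenode.max-lock-hold-to-release-lease-ms</name>
      <value>25</value>     <!-- max time the NN write lock is held per recovery pass -->
    </property>

Raising the first value or lowering the second trades slower lease cleanup for shorter lock-hold times, which is exactly the duty cycle the slide refers to.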
17. Improve Stability - HDFS-8865/HDFS-9003
    • Quota computation takes too long
    • Single-threaded code, run after replaying the edits
    • With ~300 million files, it takes very long
    • What happens when you have ~300 million files?
      → It takes a very long time
      → It triggers a failover as the Namenode becomes unresponsive
    • HDFS-8865/HDFS-9003 use the Fork-Join framework to improve the performance of that operation (see the sketch below)
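HDFS-8865 parallelizes quota initialization with a Fork-Join pool whose size is configurable. A minimal hdfs-site.xml sketch; the property comes with HDFS-8865, and the thread count shown is an illustrative value, not the one Criteo used:

    <!-- Sketch: parallel quota initialization after edit-log replay (HDFS-8865).
         The thread count is illustrative, not Criteo's value. -->
    <property>
      <name>dfs.namenode.quota.init-threads</name>
      <value>8</value>   <!-- size of the Fork-Join pool used to recompute quotas -->
    </property>

Sizing this pool against the Namenode's spare cores shortens the post-replay phase during startup and failover, which is where the unresponsiveness described above appeared.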
18. Improve Stability - HDFS-9198/HDFS-6841
    • We had issues with our datanodes (bad RAID controllers)
    • We needed to change the RAID controllers of 700 machines
    • Decommission/recommission cycle, rack by rack (20 machines), several racks per week, over 4 months (see the sketch below)
    • Doubling the size of the cluster at the same time: 700 → 1,353 datanodes
    • This generates a lot of datanode incremental block reports (IBRs): one IBR per block movement
    • We estimated between 2.5 and 11 million movements per datanode
    • Plus the normal cluster life, plus the usual HDFS rebalance
    • What happened? The Namenode became unresponsive at some point, causing major outages
    • HDFS-9198/HDFS-6841: coalesce IBR processing in the Namenode and limit IBR treatment to 4ms
    • A complex and risky patch as it changed some protobuf interfaces, but fortunately only optional fields
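The rack-by-rack decommission/recommission cycle mentioned above is driven through the Namenode's exclude file: hosts added to it are decommissioned after running hdfs dfsadmin -refreshNodes, and recommissioned by removing them and refreshing again. A minimal hdfs-site.xml sketch; the file path is an illustrative assumption, not Criteo's:

    <!-- Sketch: rack-by-rack decommissioning through the exclude file.
         The path below is an illustrative assumption. -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/dfs.exclude</value>   <!-- one datanode hostname per line -->
    </property>

Every block that leaves or re-enters a decommissioned node produces an incremental block report, which is why a months-long cycle over 700 machines generated the IBR volume described in this slide.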
19. Improve performance – Load testing
    • M/R job, run against datanodes with 10GB heap, to give preprod the ability to deal with as many blocks as prod
    • Create 180 million files/blocks (1 block per file)
    • Multiple get/set operations on file metadata
    • Delete the millions of files/blocks created
20. Improve performance – JVM tuning
    • Parallel GC: 18min startup, 10s GC pauses with 20s std dev, 1 GC every 6min
    • CMS GC: 10min startup, ~2s GC pauses with ~3s std dev, 1 GC every 3min
    • G1 GC: ~11min startup, ~630ms GC pauses with 2s std dev, 1 GC every 30s, plus 1 long mixed GC (30s-1m) once per day
21. Scaling
    • ~700 machines with a defective RAID card: replace all of them
    • Scale HDFS from ~700 datanodes to ~3,000 datanodes
    • Scale the backup Hadoop cluster to at least ½ the capacity of the main one
    • Address the Namenode scalability issue
    • Scale the team ;-)
22. Better cluster usage
    • Improve our knowledge of actual cluster usage: Application Performance Management and metrology
    • Support our users in their transition to Spark and their experimentation with other frameworks
    • Increase the use of Mesos rather than bare-metal machines for cluster access, probes, …
    • Upgrade the CDH4 cluster to the same level as the main CDH5 one
    • Address the Active/Passive Hadoop cluster issue