
Devoxx 2017 - Criteo - Hadoop Cluster Under Pressure

Devoxx 2017 FR
Criteo
Hadoop Cluster Under Pressure
Operating one of the largest clusters in Europe

Rémy SAISSY

April 06, 2017

Transcript

1. Hadoop cluster under pressure. Rémy Saissy, Lead R&D - Lake. Operating one of the largest clusters in Europe.
2. Our Mission: TARGET THE RIGHT USER AT THE RIGHT TIME WITH THE RIGHT MESSAGE
3. [Diagram: the Criteo stack: Top Level Applications, Platforms, Infrastructure and SRE layers; Advertiser (Catalog, User Events, Campaigns, Reporting) and Publisher (RTB, Direct, Campaigns, Reporting) platforms; WebScale systems; Prediction; Dynamic Creative; Recommendation Engine]
4. Production Clusters
    Cloudera CDH4 cluster: 1,058 datanodes, 12,600 cores, 84TB memory, 37PB raw storage
    Cloudera CDH5 cluster: 1,353 datanodes, 32,472 cores, 338TB memory, 108PB raw storage
6. Preproduction Clusters
    Criteo has 3 preproduction Hadoop clusters:
    Amsterdam preprod: 54 datanodes
    Paris preprod: 100 datanodes
    Lake preprod: 53 datanodes
7. Outage timeline
    • Friday evening, 1st on-call shift of a new team member
    • Incident started Friday night
    • 5 people took turns over 36 hours
    • 20 people involved in total
    • Source of the outage identified Sunday at 1am
9. How did we fix it?
    • Stop and isolate the Namenodes
    • Mitigate heap consumption at startup
    • Adjust the heap, start the Namenodes one by one
    • Enable RPCs, service ones first, then client ones (a config sketch follows below)
    • Investigate and delete superfluous data
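Enabling service RPCs before client RPCs presupposes that the Namenode exposes a dedicated service RPC endpoint for datanodes, journalnodes and the ZKFC, separate from the client-facing one. A minimal hdfs-site.xml sketch of that split, assuming an illustrative nameservice id "lake", namenode id "nn1" and placeholder host and ports; none of these values come from the deck:

    <!-- Sketch: separate client and service RPC endpoints so the Namenode can
         serve datanodes/journalnodes/ZKFC before client traffic is re-opened.
         The "lake"/"nn1" ids, hostname and ports are placeholders. -->
    <property>
      <name>dfs.namenode.rpc-address.lake.nn1</name>
      <value>nn1.example.com:8020</value>   <!-- client RPC endpoint -->
    </property>
    <property>
      <name>dfs.namenode.servicerpc-address.lake.nn1</name>
      <value>nn1.example.com:8021</value>   <!-- service RPC endpoint (DN, JN, ZKFC) -->
    </property>
    <property>
      <name>dfs.namenode.service.handler.count</name>
      <value>20</value>   <!-- handler threads for the service endpoint -->
    </property>
    <property>
      <name>dfs.namenode.handler.count</name>
      <value>100</value>   <!-- handler threads for client RPCs -->
    </property>

With this split, datanode heartbeats and journal traffic keep flowing on the service port even while client RPCs are saturated or still disabled, which is what makes the staged restart described in the slide possible.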
10. What happened?
    [Timeline: 218M blocks → 242M blocks (outage), 24 hours, 222M blocks (end of outage)]
11. Journalnode errors
    Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
    8485: Asked for firstTxId 71030619500 which is in the middle of file /var/cache/hadoop/jn/root/current/edits_0000000071028070681-0000000071034147476
        at org.apache.hadoop.hdfs.server.namenode.FileJournalManager.getRemoteEditLogs(FileJournalManager.java:195)
        at org.apache.hadoop.hdfs.qjournal.server.Journal.getEditLogManifest(Journal.java:638)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:178)
12. Namenode is slowwwwww
    2016-02-20 22:10:51,743 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 47515ms#012GC pool 'PS MarkSweep' had collection(s): count=1 time=47729ms
    2016-02-20 22:11:53,331 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 47585ms#012GC pool 'PS MarkSweep' had collection(s): count=1 time=47845ms
    2016-02-20 22:12:53,750 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 47916ms#012GC pool 'PS MarkSweep' had collection(s): count=1 time=48286ms
    2016-02-20 22:13:57,209 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 49450ms#012GC pool 'PS MarkSweep' had collection(s): count=1 time=49701ms
    2016-02-20 22:14:57,535 WARN org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 49324ms#012GC pool ...
13. Next steps?
    • Post-mortem: communication, closer quota monitoring
    • Adjust hardware
    • Improve stability
    • Improve performance
14. Adjust Hardware - Problem
    • The Namenode has a major design issue
    • 2 very different roles: DB (metadata) and servicing (RPC)
    • The scalability issue comes from the in-memory DB
    • The Namenode later proved to have a lot of single-threaded code (in our version of CDH)
    • We shared the Namenode and the ResourceManager on the same machine
15. Adjust Hardware - Mitigation
    • Vertical scaling
    • Deal with the large FSImage: RAM 192GB → 512GB, JVM heap 150GB → 332GB
    • Faster writes of both edits and fsimage/checkpoint: SSD storage (see the config sketch below)
    • Mitigate the single-threaded code impact: 2.0GHz → 3.4GHz CPUs
    • Specialize the Namenode machines
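The "SSD storage for edits and fsimage" point maps onto the Namenode's metadata directories. A minimal hdfs-site.xml sketch; the /ssd1 and /ssd2 mount points are illustrative assumptions, not paths from the deck:

    <!-- Sketch: keep fsimage/checkpoints and edit logs on SSD-backed paths.
         The /ssd1 and /ssd2 mount points are illustrative assumptions. -->
    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///ssd1/hadoop/nn,file:///ssd2/hadoop/nn</value>   <!-- fsimage (and edits, by default) -->
    </property>
    <property>
      <name>dfs.namenode.edits.dir</name>
      <value>file:///ssd1/hadoop/nn-edits</value>   <!-- optional: separate location for the edit log -->
    </property>

On a QJM-based HA setup like the one described in this deck, the journalnodes' dfs.journalnode.edits.dir benefits from the same SSD treatment, since edit-log fsync latency there also sits on the write path.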
16. Improve Stability - HDFS-10220
    • A job creates files
    • Along with each file creation, a lease (lock) is created
    • In case of failure, the lease is not released; it is cleaned up later by the Namenode
    • What happens when you have a very large number of unreleased leases?
      → The Namenode cleans up all leases at once and becomes unresponsive
      → A failover happens, but the newly active Namenode also becomes unresponsive…
    • HDFS-10220 gave us more control to fine-tune the duty cycle with which the Namenode recovers old leases (see the sketch below):
      dfs.namenode.lease-recheck-interval-ms
      dfs.namenode.max-lock-hold-to-release-lease-ms
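The two knobs from HDFS-10220 bound how long the Namenode holds its write lock while reclaiming expired leases. A minimal hdfs-site.xml sketch; the values shown are, to the best of my knowledge, the upstream defaults rather than Criteo's settings, so treat them only as a starting point:

    <!-- Sketch: throttle lease recovery so it cannot monopolize the NN lock.
         Values below are assumed upstream defaults, not Criteo's settings. -->
    <property>
      <name>dfs.namenode.lease-recheck-interval-ms</name>
      <value>2000</value>   <!-- how often the lease monitor wakes up -->
    </property>
    <property>
      <name>dfs.namenode.max-lock-hold-to-release-lease-ms</name>
      <value>25</value>     <!-- max time the NN write lock is held per recovery pass -->
    </property>

Raising the first value or lowering the second trades slower lease cleanup for shorter lock-hold times, which is exactly the duty cycle the slide refers to.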
17. Improve Stability - HDFS-8865/HDFS-9003
    • Quota computation takes too long
    • Single-threaded code, run after replaying the edits
    • With ~300 million files, it takes very long
    • What happens when you have ~300 million files?
      → It takes a very long time
      → It triggers a failover as the Namenode becomes unresponsive
    • HDFS-8865/HDFS-9003 use the Fork-Join framework to improve the performance of that operation (see the sketch below)
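HDFS-8865 parallelizes quota initialization with a Fork-Join pool whose size is configurable. A minimal hdfs-site.xml sketch; the property comes with HDFS-8865, and the thread count shown is an illustrative value, not the one Criteo used:

    <!-- Sketch: parallel quota initialization after edit-log replay (HDFS-8865).
         The thread count is illustrative, not Criteo's value. -->
    <property>
      <name>dfs.namenode.quota.init-threads</name>
      <value>8</value>   <!-- size of the Fork-Join pool used to recompute quotas -->
    </property>

Sizing this pool against the Namenode's spare cores shortens the post-replay phase during startup and failover, which is where the unresponsiveness described above appeared.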
18. Improve Stability - HDFS-9198/HDFS-6841
    • We had issues with our datanodes (bad RAID controllers)
    • We needed to change the RAID controllers of 700 machines
    • Decommission/recommission cycle, rack by rack (20 machines), several racks per week, over 4 months (see the sketch below)
    • Doubling the size of the cluster at the same time: 700 → 1,353 datanodes
    • This generates a lot of datanode incremental block reports (IBRs): one IBR per block movement
    • We estimated between 2.5 and 11 million movements per datanode
    • Plus the normal cluster life, plus the usual HDFS rebalance
    • What happened? The Namenode became unresponsive at some point, causing major outages
    • HDFS-9198/HDFS-6841: coalesce IBR processing in the Namenode and limit IBR treatment to 4ms
    • A complex and risky patch as it changed some protobuf interfaces, but fortunately only optional fields
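The rack-by-rack decommission/recommission cycle mentioned above is driven through the Namenode's exclude file: hosts added to it are decommissioned after running hdfs dfsadmin -refreshNodes, and recommissioned by removing them and refreshing again. A minimal hdfs-site.xml sketch; the file path is an illustrative assumption, not Criteo's:

    <!-- Sketch: rack-by-rack decommissioning through the exclude file.
         The path below is an illustrative assumption. -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/dfs.exclude</value>   <!-- one datanode hostname per line -->
    </property>

Every block that leaves or re-enters a decommissioned node produces an incremental block report, which is why a months-long cycle over 700 machines generated the IBR volume described in this slide.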
19. Improve performance – Load testing
    • M/R job, run against datanodes with 10GB heap, to give preprod the ability to deal with as many blocks as prod
    • Create 180 million files/blocks (1 block per file)
    • Multiple get/set operations on file metadata
    • Delete the millions of files/blocks created
20. Improve performance – JVM tuning
    • Parallel GC: 18min startup, 10s GC pauses with 20s std dev, 1 GC every 6min
    • CMS GC: 10min startup, ~2s GC pauses with ~3s std dev, 1 GC every 3min
    • G1 GC: ~11min startup, ~630ms GC pauses with 2s std dev, 1 GC every 30s, plus 1 long mixed GC (30s-1m) once per day
21. Scaling
    • ~700 machines with a defective RAID card: replace all of them
    • Scale HDFS from ~700 datanodes to ~3,000 datanodes
    • Scale the backup Hadoop cluster to at least ½ the capacity of the main one
    • Address the Namenode scalability issue
    • Scale the team ;-)
22. Better cluster usage
    • Improve our knowledge of actual cluster usage: Application Performance Management and metrology
    • Support our users in their transition to Spark and their experimentation with other frameworks
    • Increase the use of Mesos rather than bare-metal machines for cluster access, probes, …
    • Upgrade the CDH4 cluster to the same level as the main CDH5 one
    • Address the Active/Passive Hadoop cluster issue