Distributed Deep Learning with Chainer and Hadoop

UENISHI Kota
December 04, 2019

Transcript

  1. 2019/12/4 Hadoop SCR #27
    Distributed Deep Learning with Chainer and Hadoop
    (How Fast Can Data be Delivered to DNNs?)
    Kota UENISHI
    Preferred Networks, Inc.

  2. Preferred Networks, Inc.
    Since 2014
    Mission
    - Make the Real World Computable
    - Rapid Realization of Cutting-Edge Technologies
    - Robot for Everyone

  3. Our Storage System History
    2017 MN-1
    NFS: 3-5 storage systems running, ~200TB
    2018 MN-1
    NFS: 2 storage servers, ~400TB
    HDFS: 5 DataNodes, ~500TB
    2019 MN-2
    NFS: 2 HDD-based, 2 NVMe-based storage servers, ~600TB
    HDFS: 20 DataNodes, ~3PB
    Disclaimer: numbers are very rough

  4. Why Hadoop?
    ❖ Ecosystem
    ➢ Troubleshooting by Googling
    ➢ Various operational tools
    ❖ Balance between newer architecture and historical reliability
    ❖ Operational ease for PFN
    ➢ User-land system
    ➢ Only part-time volunteers; no full-time team

  5. Deep Learning
    [Figure: training loop, Forward → Backward → Optimize → Forward …]
    Prediction: [fish, dog, …]
    Ground truth: [cat, dog, …]
    loss: [1.0, 0.0, …]
    Backward yields grad; Optimize applies the update to param
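
The Forward / Backward / Optimize cycle on the slide above can be sketched in a few lines. This is a minimal illustration and not Chainer code: it assumes a one-parameter model `y = w * x` with squared-error loss, and the names `train_step`, `w`, and `lr` are made up for the example.

```python
def train_step(w, x, t, lr=0.1):
    """One Forward/Backward/Optimize step for the toy model y = w * x."""
    y = w * x                   # Forward: compute the prediction
    loss = (y - t) ** 2         # compare the prediction with the ground truth t
    grad = 2.0 * (y - t) * x    # Backward: gradient of the loss w.r.t. w
    w = w - lr * grad           # Optimize: update the parameter
    return w, loss


w = 0.0
for _ in range(50):
    w, loss = train_step(w, x=1.0, t=2.0)
```

Each call needs one input sample (here `x` and `t`), so the rate at which samples arrive bounds how fast the loop can run.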

  6. SGD with Minibatch
    [Figure: Shuffle → Minibatch → Repeat]
    MNIST
    ● 60k images
    ● Typical batch size: 10~100
    CIFAR10
    ● 60k images
    ● Typical batch size: 64
    ImageNet-1k
    ● 1580k images
    ● Typical batch size: 32
    Open Images Dataset (OID)
    ● 8M images
    ● Typical batch size: ???
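
The Shuffle / Minibatch / Repeat pattern can be sketched as below. This is an illustrative generator and not Chainer's iterator implementation; `minibatches` and its parameters are hypothetical names.

```python
import random


def minibatches(num_samples, batch_size, seed=None):
    """Yield lists of sample indices for one epoch, in shuffled order."""
    rng = random.Random(seed)
    order = list(range(num_samples))
    rng.shuffle(order)                              # Shuffle
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]       # Minibatch


for epoch in range(2):                              # Repeat
    batches = list(minibatches(num_samples=10, batch_size=4, seed=epoch))
```

Every epoch touches each sample once, in a fresh random order, which is why the storage workload looks like random reads over the whole dataset.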

  7. Distributed Deep Learning
    [Figure: three workers each run Forward → Backward → Optimize; gradients are combined with All-Reduce w/NCCL between Backward and Optimize]
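
What the all-reduce step computes can be simulated in a single process. The sketch below is not NCCL; it assumes the gradients are averaged (a sum followed by division by the number of workers), and `allreduce_mean` is a hypothetical helper.

```python
def allreduce_mean(grads_per_worker):
    """Return, for every worker, the element-wise mean of all workers' gradients."""
    num_workers = len(grads_per_worker)
    length = len(grads_per_worker[0])
    # Every worker must contribute a buffer of the same size.
    assert all(len(g) == length for g in grads_per_worker)
    mean = [sum(g[i] for g in grads_per_worker) / num_workers
            for i in range(length)]
    return [list(mean) for _ in range(num_workers)]


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # one gradient vector per worker
reduced = allreduce_mean(grads)
```

After the call every worker holds the same averaged gradient, so all replicas apply an identical update in Optimize.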

  8. Our Clusters
                    MN-1a                          MN-1b                          MN-2
    GPU             NVIDIA P100 (PCIe, 12GB/16GB)  NVIDIA V100 (PCIe, 32GB)       NVIDIA V100 (SXM2, 32GB/16GB)
    GPUs            1024                           512                            1024
    CPUs            2048                           2304                           5760
    Interconnect    InfiniBand FDR x2 (128Gbps)    InfiniBand EDR x2 (200Gbps)    (RoCEv2)
    Network         1GbE                           10GbE                          100GbE x4
    Local Storage   ~800GB                         ~3.2TB                         ~3.2TB
    Catalog perf.   19.1PFLOPS                     57.3PFLOPS                     128PFLOPS
    Preferred Networks builds MN-2, a state-of-the-art supercomputer powered with NVIDIA GPUs.

  9. MN-1
    [Photos: MN-1. © NTT Communications]

  10. Why Do We Build On-Prem Clusters?
    A. A faster cycle of training & testing is crucial to the speed of our R&D.

  11. How Fast is ImageNet Delivered to Accelerators
    Affiliation  Duration          Total              Per GPU/TPU        Computer
    Facebook     1 hour [1]        39.5k images/sec   154.3 images/sec   (Internal Cluster)
    PFN          15 minutes [2]    158k images/sec    154.3 images/sec   MN-1a
    SONY         224 seconds [3]   635k images/sec    291.7 images/sec   ABCI
    Google       88 seconds [4]    1.63M images/sec   1578.0 images/sec  Internal TPU Pod
    Fujitsu      74.7 seconds [5]  1.90M images/sec   929.5 images/sec   ABCI
    ImageNet:
    ● (Originally) A WordNet-like annotated set of images
    ● (Narrow context) A classification subset and task for the 2012 and 2015 competitions
    ● (Narrower context) De facto DL performance benchmark
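
The slide does not say how the Total column was derived. One reading that matches most rows is images per epoch times number of epochs divided by training time, taking the "1580k images" figure from the SGD with Minibatch slide and the 90 epochs listed for ImageNet on the Typical Workload slide. Both inputs are assumptions about the author's method, as is the helper name `images_per_second`.

```python
IMAGES_PER_EPOCH = 1_580_000   # "1580k images" from the SGD with Minibatch slide
EPOCHS = 90                    # assumption: the 90 epochs listed for ImageNet jobs


def images_per_second(duration_seconds):
    """Average delivery rate needed to finish training in the given time."""
    return IMAGES_PER_EPOCH * EPOCHS / duration_seconds


durations = {"Facebook": 3600, "PFN": 900, "SONY": 224, "Google": 88, "Fujitsu": 74.7}
rates = {name: images_per_second(sec) for name, sec in durations.items()}
```

Under these assumptions the Facebook and PFN rows come out at exactly 39.5k and 158k images/sec, and SONY and Fujitsu match after rounding. The Google row comes out near 1.62M, slightly below the 1.63M shown on the slide.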

  12. How Fast Can You Deliver?
    [Figure: "Dataset in Disks" and "GPU Node Memory"]

  13. 1. Naively Read from NFS
    Pros
    ❖ Easy
    ❖ No porting
    ❖ Automatically cached
    Cons
    ❖ Not that fast
    ➢ >10^2ms on read
    ❖ NFS can easily be overloaded
    ❖ Not sustainable
    ➢ Capacity limited

  14. 2. Copy to Local in Advance and Read Locally
    Pros
    ❖ Fastest and Easy
    ➢ 10^1~10^-1 ms on read, depending on local media
    ❖ No porting
    ❖ Automatically cached
    ❖ I/O load distributed
    Cons
    ❖ Needs data distribution before computation
    ❖ Local disk space is limited
    ❖ Not sustainable

  15. 3. Super-scalable and Fast Storage
    Pros
    ❖ Easy and fast
    ❖ No porting
    ❖ Automatically cached
    Cons
    ❖ No such storage
    ❖ Not a poor-man’s solution

  16. 4. Local HDFS
    Inspired by MapReduce I/O style
    Pros
    ❖ Fast by direct local read
    Cons
    ❖ It’s very difficult.

  17. Why Is Local HDFS Not Practical?
    ❖ The amount of data scattered to each node is not equal in HDFS
    ❖ NCCL, the de facto key library for fast allreduce, assumes equal size among GPUs
    ❖ Deploying kubelet and an HDFS DataNode on the same host is not practical either
    ❖ Difficult to allocate a pod/GPU where the data is; requires tight coupling of scheduler and data distribution.

  18. 5. Remote HDFS
    Pros
    ❖ (Maybe) sustainable
    ❖ Almost infinite capacity
    ❖ Enough throughput
    ❖ Poor-man’s solution
    Cons
    ❖ Too many small files
    ❖ High latency
    ➢ 10^-1~10^3 seconds
    ❖ Needs application porting
    ❖ ...and more!

  19. Mitigating Application Porting: ChainerIO
    ❖ API abstraction library that lets the same code work with the local filesystem & HDFS
    ❖ Standing on the shoulders of PyArrow’s HDFS client
    $ sudo apt-get install \
    openjdk-8-jdk libhdfs0
    $ pip install --user chainerio
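
The idea of one API over several backends can be illustrated by dispatching on the URI scheme. The sketch below is not ChainerIO's API: `open_uri` is a hypothetical helper, only the local backend is implemented, the HDFS branch is a placeholder, and POSIX-style paths are assumed.

```python
import os
import tempfile
from urllib.parse import urlparse


def open_uri(uri, mode="rb"):
    """Open a file by URI, dispatching on the scheme (hypothetical helper)."""
    parsed = urlparse(uri)
    if parsed.scheme in ("", "file"):
        return open(parsed.path, mode)      # local filesystem
    if parsed.scheme == "hdfs":
        # A real implementation would delegate to an HDFS client here.
        raise NotImplementedError("HDFS backend is not part of this sketch")
    raise ValueError("unsupported scheme: " + parsed.scheme)


# The calling code looks the same regardless of where the data lives.
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(b"hello")
with open_uri(path) as f:
    data = f.read()
```

With this shape, moving a training script from local disk to HDFS means changing the path string and leaving the read logic alone.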

  20. Mitigating High Latency 0/3: Prefetching
    • Chainer’s MultithreadIterator and MultiprocessIterator enable prefetching
    – Iterators know SGD's random choice of a batch in advance
    – A thread/process pool prefetches batches in a multiplexed way
    [Figure: timelines of main, Pool, Storage and GPU with prefetching, and of main, Storage and GPU without prefetching]
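
Because the batch order is known in advance, reads for later batches can be issued while the current one is being consumed. The following is a minimal sketch using the standard library's thread pool, not the implementation of Chainer's iterators; `prefetching_batches`, `load_sample`, and `depth` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetching_batches(batch_indices, load_sample, num_threads=4, depth=2):
    """Yield loaded batches while later batches are fetched in the background.

    batch_indices: list of index lists, known in advance for the epoch.
    load_sample:   function mapping an index to a sample (e.g. a storage read).
    depth:         number of batches allowed to queue up before the oldest is yielded.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        pending = []
        for indices in batch_indices:
            # Submit the reads now; they run while earlier batches are consumed.
            pending.append([pool.submit(load_sample, i) for i in indices])
            if len(pending) > depth:
                yield [f.result() for f in pending.pop(0)]
        for futures in pending:
            yield [f.result() for f in futures]


batches = [[0, 1], [2, 3], [4, 5]]
loaded = list(prefetching_batches(batches, load_sample=lambda i: i * 10))
```

Prefetching hides latency only as long as the pool keeps up with the consumer, so high per-read latency still calls for the mitigations on the following slides.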

  21. Mitigating High Latency 1/3: ZIP
    • Put all images into one or a few large ZIP files
    – bonus: small-file problem resolved
    – cons: hard to make a ZIP
    • Without ZIP
    – Number of open calls ∝ number of images
    • With ZIP
    – Just a few open calls
    [Figure: message sequence among Client, NN, DN and krb5]
    ➢ To read the first byte of a file, 2 round trips in the Hadoop protocol are required,
    ➢ … which involve 2 TLS handshakes in a secure cluster
    ➢ … which involve 12 IP packet round trips in total
    ➢ +2x2 round trips of Kerberos ticket check
    ➢ But this may or may not be the culprit
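
The "few open calls" pattern can be shown with Python's standard `zipfile` module. This sketch builds a tiny archive in memory, where a real dataset would be one large file on remote storage; the member names and contents are made up, and storing members uncompressed is a choice made for the example.

```python
import io
import zipfile

# Build a small archive in memory as a stand-in for a large dataset file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
    for i in range(3):
        zf.writestr("images/%04d.jpg" % i, b"fake image bytes %d" % i)

# One open of the container, then many member reads through the same handle.
buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    images = {name: zf.read(name) for name in names}
```

The per-file open cost is paid once for the container, and each image then costs a seek and a read inside an already open file.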

  22. Mitigating High Latency 2/3: Preload/Cache
    Chainer’s scatter_dataset assigns data to each GPU process, and ChainerIO’s FileCache enables caching the assigned data on local disk.
    Pros
    ❖ Cache size is not capped by RAM size
    Cons
    ❖ First epoch still slow
    ➢ Preload
    ❖ Application porting still required
    [Figure: each GPU process with its own cache]
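
The combination of scattering and local caching can be sketched as follows. This is not the implementation of `scatter_dataset` or `FileCache`: `scatter_indices` and `LocalDiskCache` are hypothetical stand-ins, and the contiguous split is an assumption made for simplicity.

```python
import os
import tempfile


def scatter_indices(num_samples, num_workers, rank):
    """Return the contiguous, near-equal slice of sample indices for one worker."""
    begin = num_samples * rank // num_workers
    end = num_samples * (rank + 1) // num_workers
    return list(range(begin, end))


class LocalDiskCache:
    """Cache fetched samples as files under a local directory (hypothetical)."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch      # slow function: index -> bytes (e.g. a remote read)
        self.misses = 0

    def get(self, index):
        path = os.path.join(self.cache_dir, "%d.bin" % index)
        if os.path.exists(path):            # hit: served from local disk
            with open(path, "rb") as f:
                return f.read()
        self.misses += 1                    # miss: go to remote storage once
        data = self.fetch(index)
        with open(path, "wb") as f:
            f.write(data)
        return data


my_indices = scatter_indices(num_samples=10, num_workers=4, rank=1)
cache = LocalDiskCache(tempfile.mkdtemp(), fetch=lambda i: b"sample-%d" % i)
first_epoch = [cache.get(i) for i in my_indices]    # preload: all misses
second_epoch = [cache.get(i) for i in my_indices]   # all hits
```

Only the first pass pays the remote latency, which is why the slide lists the slow first epoch as the remaining cost and preloading as its mitigation.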

  23. Typical Workload in MN-2
                       Job A            Job B             Job C
    Dataset            ImageNet         OID (detection)   OID (full set)
    # of GPUs (nodes)  32~256           32~512            32~256
    Training time      1 day ~ 30 min   days ~ hours      days ~ hours
    # of epochs        90               10                50~100
    Preload            ~hours           ~hours            ~hours

  24. Mitigating High Latency 3/3: More Ideas Coming
    Problems
    • Preloading overhead
    • Application portability on non-POSIX implementation
    • Python-JVM data copying
    • multiprocessing fork vs JVM
    • Huge dataset that doesn’t fit local storage
    Ideas
    • Original File Format Faster than ZIP
    • Ozone & Flash-based cluster
    • Distributed Caching
    • And more

  25. Summary
    ❖ Speed is crucial in machine learning R&D
    ❖ DL workload is random reads with a low-latency requirement
    ❖ Sustainability and performance (esp. latency) are in a trade-off
    ❖ HDFS strikes a better balance in that trade-off
    ❖ Workarounds and mitigations improve the trade-off balance
    ➢ Abstraction Library, Caching and ZIP

  26. Questions?