
Deep Learning for Low-Layer People

Fadis
July 20, 2019


An explanation of how to implement a convolutional neural network in Vulkan, without relying on any framework.
These are the slides for a talk given at Kernel/VM Tankentai #15 on July 20, 2019.
Sample code: https://github.com/Fadis/kernelvm_20190720_samples

Transcript

  1. Deep Learning for Low-Layer People
     NAOMASA MATSUBAYASHI
     https://github.com/Fadis/kernelvm_20190720_samples
     (the sample code that appears in this talk)

  2. Deep Learning for Low-Layer People

  3. A formal neuron: each input x0 ... x4 is multiplied by its weight w02 ... w42,
     the products are summed, and the activation function φ is applied to give the output y2.

  4. yj = φ( Σi wij xi )
     A layer of such neurons: every input x0 ... x4 feeds every output y0 ... y4,
     each output with its own set of weights and its own application of φ.


  5. A neural network: layers of weights w0 ... w3 stacked between the inputs x and the outputs y.
     A stack of infinitely many layers can represent any function, depending on the weights;
     even a finite stack can approximate a wide variety of functions.

  6. Data that seem vaguely correlated with something,
     but whose exact relationship is unknown: inputs 2, 7, 4 → ?

  7. It becomes a problem of finding the weights:
     treat the inputs (2, 7, 4) as constants, and adjust w so that the outputs y0, y1, y2
     get as close as possible to the correct output.

  8. Training: feed inputs x0 x1 x2, compare the outputs y0 y1 y2 with the targets t0 t1 t2,
     and correct w. If a large number of inputs with corresponding outputs is available,
     repeating this operation makes the neural network an approximation of the function
     relating x to t: a function is obtained from data.

  9. Deep learning: a long (deep) neural network.
     A longer network can approximate more complex functions, but becomes harder to train,
     so deep networks were long considered impractical.
     Research in recent years made training long networks possible, hence the current boom.

  10. Training demands a huge amount of computation,
      so accelerators outside the CPU, such as GPUs and FPGAs, are frequently used.

  11. Frameworks appeared that abstract away which hardware (CPU, GPU, FPGA) does the computation:
      TensorFlow, Chainer, Caffe, PyTorch, Theano, ...

  12. Then frameworks appeared that abstract away which framework does the computation:
      Keras, sitting on top of TensorFlow, Theano, and the rest.
      The layers keep getting higher.

  13. And they are huge:
      $ du -h tensorflow-1.12.3/
      ...
      145M tensorflow-1.12.3/
      $ du -h pytorch-1.1.0/
      ...
      44M pytorch-1.1.0/
      $ du -h chainer-6.1.0/
      ...
      22M chainer-6.1.0/
      $ du -h keras-2.2.4/
      ...
      3.0M keras-2.2.4/

  14. Deep Learning for Low-Layer People

  15. Narrow the target hardware down to the GPU,
      and attempt deep learning without using any framework.

  16. To drive the GPU we use Vulkan, a new 3D graphics API for hitting the features
      of modern GPUs as directly as possible.
      (From the earlier talk "Getting started with the low-layer graphics API Vulkan"
      at Kernel/VM Tankentai #13.)

  17. CUDA, which like Vulkan touches the raw GPU, has cuDNN, a library packaging
      the computations used in neural networks
      (https://developer.nvidia.com/cudnn), but...

  18. ...here we use Vulkan to drive the GPU,
      and implement every computation we need ourselves.

  19. The network to build: the most basic fully connected network,
      with a hidden layer between the inputs x0 ... x4 and the outputs y0 ... y4.

  20. if( config.validation ) layers.emplace_back( "VK_LAYER_LUNARG_standard_validation" );
    const auto app_info = vk::ApplicationInfo( config.prog_name.c_str(), /* ...omitted... */ VK_API_VERSION_1_1 );
    instance_ptr_t instance(
    new vk::Instance(
    vk::createInstance(
    vk::InstanceCreateInfo()
    .setPApplicationInfo( &app_info )
    .setEnabledExtensionCount( ext.size() ).setPpEnabledExtensionNames( ext.data() )
    .setEnabledLayerCount( layers.size() ).setPpEnabledLayerNames( layers.data() )
    )
    )
    );
    auto devices = instance->enumeratePhysicalDevices();
    if( devices.empty() ) throw device_is_not_available();
    devices.erase( std::remove_if( devices.begin(), devices.end(), [&]( const auto &d ) -> bool {
    auto avail_dext = d.enumerateDeviceExtensionProperties();
    for( const char *w: dext )
    if( std::find_if( avail_dext.begin(), avail_dext.end(), [&]( const auto &v ) {
    return !strcmp( v.extensionName, w );
    } ) == avail_dext.end() ) return true;
    const auto avail_dlayers = d.enumerateDeviceLayerProperties();
    for( const char *w: dlayers )
    if( std::find_if( avail_dlayers.begin(), avail_dlayers.end(), [&]( const auto &v ) {
    return !strcmp( v.layerName, w );
    } ) == avail_dlayers.end() ) return true;
    return false;
    } ), devices.end() );
    if( devices.empty() ) throw required_extensions_or_layers_are_not_available();
    Create the Vulkan instance,
    then pick the GPU to use from the ones available.


  21. const auto queue_props = physical_device.getQueueFamilyProperties();
    uint32_t queue_index =std::distance(
    queue_props.begin(), std::find_if( queue_props.begin(), queue_props.end(), []( const auto &v ) {
    return bool( v.queueFlags & vk::QueueFlagBits::eCompute ) &&
    bool( v.queueFlags & vk::QueueFlagBits::eTransfer );
    } )
    );
    if( queue_index == queue_props.size() ) throw required_queue_is_not_available();
    const float priority = 0.0f;
    std::vector< vk::DeviceQueueCreateInfo > queues{};
    const auto queue_create_info = vk::DeviceQueueCreateInfo()
    .setQueueFamilyIndex( queue_index ).setQueueCount( 1 ).setPQueuePriorities( &priority );
    const auto features = physical_device.getFeatures();
    auto device = physical_device.createDevice( vk::DeviceCreateInfo()
    .setQueueCreateInfoCount( 1 ).setPQueueCreateInfos( &queue_create_info )
    .setEnabledExtensionCount( dext.size() ).setPpEnabledExtensionNames( dext.data() )
    .setEnabledLayerCount( dlayers.size() ).setPpEnabledLayerNames( dlayers.data() )
    .setPEnabledFeatures( &features ) );
    std::shared_ptr< vk::Device > d( new vk::Device( std::move( device ) ), []( const auto &p ) {
    if( p ) {
    p->destroy();
    delete p;
    }
    } );
    auto queue = device.getQueue( queue_index, 0 );
    auto command_pool = device.createCommandPool( vk::CommandPoolCreateInfo()
    .setQueueFamilyIndex( queue_index ).setFlags( vk::CommandPoolCreateFlagBits::eResetCommandBuffer ) );
    std::shared_ptr< vk::Queue > q( new vk::Queue( std::move( queue ) ), [d]( const auto& ) {} );
    std::shared_ptr< vk::CommandPool > p( new vk::CommandPool( std::move( command_pool ) ),
    [d]( const vk::CommandPool *p ) {
    if( p ) {
    d->destroyCommandPool( *p );
    Create the logical device, the queue, and the command pool.


  22. When the input vector has N elements and the output M elements,
      regard the weights as the matrix

          w = [ w00  w01  ⋯  w0M
                w10  w11  ⋯  w1M
                 ⋮    ⋮        ⋮
                wN0  wN1  ⋯  wNM ]

      Then one layer's computation, y = φ(wx), is:
      take the product of the weight matrix and the input vector,
      and apply φ to each element of the result.
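A single layer's forward computation, y = φ(wx), can be sketched on the CPU as a reference to check the GPU kernels against. This sketch is not part of the deck; all names are made up:

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// y = phi(w x): w is an M x N matrix stored row-major, x has N elements.
std::vector<float> dense_forward(
  const std::vector<float> &w, const std::vector<float> &x,
  std::size_t m, std::size_t n, const std::function<float(float)> &phi
) {
  assert(w.size() == m * n && x.size() == n);
  std::vector<float> y(m, 0.0f);
  for(std::size_t j = 0; j != m; ++j) {
    float sum = 0.0f;
    for(std::size_t i = 0; i != n; ++i)
      sum += w[j * n + i] * x[i];  // sum_i w_ij x_i
    y[j] = phi(sum);               // apply the activation to each element
  }
  return y;
}
```

With the identity as φ and an identity weight matrix, the layer just passes the input through, which makes the indexing easy to verify.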

  23. Mini-batches: bundle several input vectors x0, x1, ... into one matrix,
      compute the error per bundle, and only then correct the weights.
      Training is more stable than correcting w for every sample individually.


  24. With a batch, one layer's computation becomes:
      take the product of two matrices, and apply φ to each element of the result.
      x0 ... xb are the input vectors, y0 ... yb the output vectors, and b is the batch size.


  25. The whole forward pass: the input batch x0 ... xb goes through a matrix product
      with wh, then φ, a matrix product with wo, then φ, and finally a loss function
      that compares the result with the targets t0 ... tb to give L.


  26. (Diagram: the same pipeline annotated with every buffer it needs: the input batch,
      the hidden-layer weights and outputs, the output-layer weights and outputs,
      the teacher data, the loss function, and the error.)

  27. hidden_weight.reset( new liblnn::buffer< glm::vec4 >(
    allocator, buf_type,
    vk::BufferCreateInfo().setSize( input_width * hidden_width * sizeof( glm::vec4 ) ).setUsage( copyable )
    ) );
    output_weight.reset( new liblnn::buffer< glm::vec4 >(
    allocator, buf_type,
    vk::BufferCreateInfo().setSize( hidden_width * output_width * sizeof( glm::vec4 ) ).setUsage( copyable )
    ) );
    hidden_affine_output.reset( new liblnn::buffer< float >(
    allocator, buf_type,
    vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable )
    ) );
    hidden_relu_output.reset( new liblnn::buffer< float >(
    allocator, buf_type,
    vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable )
    ) );
    output_affine_output.reset( new liblnn::buffer< float >(
    allocator, buf_type,
    vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable )
    ) );
    output_relu_output.reset( new liblnn::buffer< float >(
    allocator, buf_type,
    vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable )
    ) );
    softmax_grad.reset( new liblnn::buffer< float >(
    allocator, buf_type,
    vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable )
    ) );
    Allocate the GPU memory for those buffers.








  28. First the matrix product: compute the product of the input (a matrix)
      and the weights (a matrix) on the GPU.

  29. On the GPU, each thread computes one piece of data
      (Vulkan term: Thread; NVIDIA term: also Thread),
      and threads are executed in SIMD units
      (Vulkan term: Subgroup; NVIDIA term: Warp).

  30. The GPU is an architecture that earns its performance from processor count,
      so the work must be spread across as many threads as possible.

  31. All threads share the VRAM (Vulkan term: Memory; NVIDIA term: Global Memory).
      Several threads reading from the same address is fine, but when several threads
      write to the same address, which thread's value survives is indeterminate.

  32. Multiplying the input matrix x by the weight matrix w:

          [ x00 x01 x02     [ w00 w01 w02     [ Σi x0i wi0   Σi x0i wi1   Σi x0i wi2
            x10 x11 x12   ×   w10 w11 w12   =   Σi x1i wi0   Σi x1i wi1   Σi x1i wi2
            x20 x21 x22 ]     w20 w21 w22 ]     Σi x2i wi0   Σi x2i wi1   Σi x2i wi2 ]

      Clearly the elements of the output matrix can all be computed in parallel.
      Since Σi x0i wi0 = x00 w00 + x01 w10 + x02 w20, parallelizing the inside of each Σ
      as well would gain even more threads, but then how to take the Σ over the values
      held by the individual threads becomes the problem.

  33. The GPU has SRAM on which multiple threads can synchronize: one thread writes
      values A and B to the SRAM, the threads synchronize, and another thread reads them,
      passing values between threads of the same WorkGroup. The synchronization stalls
      until every thread of the WorkGroup reaches it.
      This SRAM is called Shared Memory in Vulkan terms (NVIDIA term: also Shared Memory);
      the bundle of threads that can share it is a WorkGroup in Vulkan terms (NVIDIA term: Block).

  34. Σ on a classical GPU: via the SRAM, the Σ over n values is obtained with
      log2(n) rounds of additions and synchronizations, e.g. for x0 ... x3:
      store x1 into S0 and x3 into S1; synchronize;
      S0 := x0 + S0, S1 := x2 + S1; synchronize;
      S0 := S0 + S1, giving x0 + x1 + x2 + x3.
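The log2(n) reduction can be sketched sequentially on the CPU. This only models the access pattern; on the real GPU every level's additions run in parallel between barriers, and the function name is made up:

```cpp
#include <cassert>
#include <vector>

// Pairwise tree reduction: each iteration of the outer loop corresponds to one
// round of parallel additions followed by a barrier, so log2(n) rounds in total.
float tree_sum(std::vector<float> v) {
  for(std::size_t stride = 1; stride < v.size(); stride *= 2)    // one level per iteration
    for(std::size_t i = 0; i + stride < v.size(); i += stride * 2)
      v[i] += v[i + stride];                                     // done by parallel threads on a GPU
  return v.empty() ? 0.0f : v[0];
}
```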

  35. Newer GPUs can additionally do horizontal operations within a Subgroup:
      lanes A, B, C, D holding x0, x1, x2, x3 perform one horizontal add,
      and every lane ends up holding x0 + x1 + x2 + x3.
      Vulkan term: subgroup operations; NVIDIA term: warp shuffle.

  36. Σ on a modern GPU: with a subgroup size of 32, the Σ over n values
      (e.g. Σ over x0 ... x64) takes log32(n) rounds of additions and synchronizations:
      horizontally add within each subgroup, synchronize through the SRAM,
      then horizontally add the partial sums.

  37. Summary of the hierarchy:
      Subgroup: shares a program counter.
      Workgroup: shares SRAM and runs concurrently.
      Dispatch: shares VRAM and runs concurrently.
      For a GeForce GTX 1070: 32 threads per subgroup,
      4 subgroups physically (48 logically), 60 workgroups physically (2^64 logically).

  38. Σ in GLSL:

      shared float local_sum[ local_memory_size ];
      float large_sum( in float value ) {
        float sg_sum = subgroupAdd( value );   // horizontal add within the subgroup
        local_sum[ gl_SubgroupID ] = sg_sum;   // write the result to shared memory
        barrier();                             // synchronize
        uint len = gl_NumSubgroups;
        while( len > 1 ) {
          uint index = gl_SubgroupInvocationID + gl_SubgroupID * gl_SubgroupSize;
          // horizontal add of the partial sums held in shared memory
          float sum = subgroupAdd( index < len ? local_sum[ index ] : 0.0 );
          local_sum[ gl_SubgroupID ] = sum;    // write back to shared memory
          barrier();                           // synchronize
          len /= gl_SubgroupSize;
        }
        barrier();
        return local_sum[ 0 ];                 // one element left: return it
      }

  39. The matrix product in GLSL: arrange the WorkGroups so that the threads whose
      values must be summed with Σ end up in the same WorkGroup, then accumulate the
      products of input-matrix and weight-matrix values into the output matrix.

      void main() {
        const uint input_index = gl_GlobalInvocationID.x;
        const uint output_index = gl_GlobalInvocationID.y;
        const uint data_index = gl_GlobalInvocationID.z;
        const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
        const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
        output_data[ output_index + data_index * output_width ] = 0.0;
        for( uint offset = 0; offset < width; offset += input_width ) {
          float value = ( offset + input_index ) < width ?
            input_data[ offset + input_index + data_index * width ] *
            weight[ output_index + ( offset + input_index ) * output_width ].x :
            0.0;
          output_data[ output_index + data_index * output_width ] += large_sum( value );
        }
      }


  40. The activation function φ: if a layer were only a matrix product, linearity would
      be preserved no matter how many layers were stacked; put differently, only linear
      functions could be approximated. Sandwiching a nonlinear function between the
      matrix products breaks the linearity and makes approximating nonlinear functions
      possible.

  41. Hyperbolic
    Tangent


  42. Rectified
    Linear Unit
    (ReLU)


  43. The activation function is computed independently for each element of the
      input/output matrices, so it parallelizes up to
      (output-matrix elements = input-matrix elements):
      Hyperbolic Tangent:    yij = tanh(xij)
      Rectified Linear Unit: yij = xij  if xij >= 0
                             yij = 0    if xij < 0

  44. Hyperbolic Tangent in GLSL:

      void main() {
        const uint input_index = gl_GlobalInvocationID.x;
        const uint data_index = gl_GlobalInvocationID.z;
        const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
        for( uint offset = 0; offset < width; offset += input_width ) {
          if( ( offset + input_index ) < width )
            output_data[ offset + input_index + data_index * width ] =
              tanh( input_data[ offset + input_index + data_index * width ] );
        }
      }

      Rectified Linear Unit in GLSL:

      void main() {
        const uint input_index = gl_GlobalInvocationID.x;
        const uint data_index = gl_GlobalInvocationID.z;
        const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
        for( uint offset = 0; offset < width; offset += input_width ) {
          if( ( offset + input_index ) < width )
            output_data[ offset + input_index + data_index * width ] = max(
              0,
              input_data[ offset + input_index + data_index * width ]
            );
        }
      }


  45. The loss function: a function of the network's output and the desired output
      that gets smaller the more the two resemble each other.
      Attaching it at the very end turns the search for suitable weights
      into an optimization problem.

  46. The output format.
      Network output:  y = (0.8, 0.000007, 0.9, 0.036, 0.00005)
      Desired output:  t = (0, 0, 1, 0, 0)
      The network "feels like" it is class 0 or class 2; the correct answer is class 2.

  47. softmax:
          yi = exp(xi) / Σj exp(xj)
      softmax(y) = (0.288213, 0.129503, 0.318524, 0.134249, 0.129509)
      for y = (0.8, 0.000007, 0.9, 0.036, 0.00005)
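A CPU sketch of softmax reproduces the numbers on the slide. This is not part of the deck's code; the function name is made up:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// softmax: y_i = exp(x_i) / sum_j exp(x_j)
std::vector<double> softmax(const std::vector<double> &x) {
  double sum = 0.0;
  std::vector<double> y(x.size());
  for(std::size_t i = 0; i != x.size(); ++i)
    sum += (y[i] = std::exp(x[i]));  // exponentiate and accumulate the denominator
  for(auto &v : y) v /= sum;         // normalize so the outputs sum to 1
  return y;
}
```

Feeding in the slide's y = (0.8, 0.000007, 0.9, 0.036, 0.00005) gives back the values shown, and the outputs always sum to 1.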

  48. Cross-entropy loss:
          L = − Σi ti log(yi)     with yi = exp(xi) / Σj exp(xj)
      When t = 1 and y = 1:     l = 0
      When t = 1 and y = 0.01:  l = 4.605
      When t = 0:               l = 0
      For softmax's yi to approach 1, every other component of y must approach 0.
      In short: the more y resembles t, the smaller L.

  49. Softmax with cross-entropy loss in GLSL:

      void main() {
        const uint input_index = gl_GlobalInvocationID.x;
        const uint data_index = gl_GlobalInvocationID.z;
        float value1 = input_index < width ? exp( input_data[ input_index + data_index * width ] ) : 0.0;
        float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );  // if Σj exp(xj) reaches 0 it yields nan/inf
        float t = teacher_data[ input_index + data_index * width ];
        float y_ = max( y, 1.0e-10 );                                   // y too close to 0 makes log() yield inf
        float value2 = input_index < width ? t * log( y_ ) : 0.0;
        float l = -large_sum( float( value2 ) );
        if( input_index == 0 )
          output_data[ data_index ] = l;
      }


  50. Regarding x and t as constants, find the wh and wo that minimize L.

  51. How does L change when w changes: ∂L/∂w.
      Partially differentiate, with respect to w, the chain of functions running from
      the layer where the weight appears up to the cross-entropy loss.

  52. For the output-layer weights wo, the path to L can be regarded as a composition
      of three functions (c, d, then the loss), so by the chain rule for composite
      functions, df/dx = (df/dg)(dg/dx):
          ∂L/∂wo = (∂L/∂d) (∂d/∂c) (∂c/∂wo)


  53. For the hidden-layer weights wh, the chain is longer:
          ∂L/∂wh = (∂L/∂d) (∂d/∂c) (∂c/∂b) (∂b/∂a) (∂a/∂wh)


  54. Backpropagation: if the derivative of each layer's output with respect to its
      input is available, ∂L/∂w can be built up layer by layer from the back:
          ∂L/∂d
          (∂L/∂d)(∂d/∂c)
          (∂L/∂d)(∂d/∂c)(∂c/∂b)
          (∂L/∂d)(∂d/∂c)(∂c/∂b)(∂b/∂a)
          (∂L/∂d)(∂d/∂c)(∂c/∂b)(∂b/∂a)(∂a/∂wh)


  55. (Diagram: the full pipeline again, now with the gradient flowing backwards:
      the gradient of the loss function, of each activation function, and of each
      matrix product, down to the gradients of the hidden and output weights.)

  56. Backpropagating the loss function. From
          L = − Σi ti log(yi),   yi = exp(xi) / Σj exp(xj)
      we get
          ∂L/∂yi = − ti / yi
          ∂yi/∂xk = yi (1 − yi)   if i = k
          ∂yi/∂xk = − yi yk       if i ≠ k

  57. Combining the two:
          ∂L/∂xi = (∂L/∂yi)(∂yi/∂xi) + Σ_{k≠i} (∂L/∂yk)(∂yk/∂xi)
                 = − ti (1 − yi) + Σ_{k≠i} tk yi

  58. Since the desired outputs sum to 1 (Σk tk = 1):
          ∂L/∂xi = − ti (1 − yi) + Σ_{k≠i} tk yi
                 = − ti + yi Σk tk
                 = yi − ti
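The tidy result ∂L/∂xi = yi − ti is easy to sanity-check numerically by finite-differencing the composed softmax and cross-entropy. The following CPU sketch is not part of the deck; all names are made up:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// L(x) = -sum_i t_i log(softmax(x)_i), composed exactly as in the slides.
double ce_loss(const std::vector<double> &x, const std::vector<double> &t) {
  double sum = 0.0;
  for(double v : x) sum += std::exp(v);
  double l = 0.0;
  for(std::size_t i = 0; i != x.size(); ++i)
    l -= t[i] * std::log(std::exp(x[i]) / sum);
  return l;
}

// The analytic gradient from the derivation: dL/dx_i = y_i - t_i.
double analytic_grad(const std::vector<double> &x, const std::vector<double> &t, std::size_t i) {
  double sum = 0.0;
  for(double v : x) sum += std::exp(v);
  return std::exp(x[i]) / sum - t[i];
}

// Central finite difference of L with respect to x_i, for comparison.
double numeric_grad(std::vector<double> x, const std::vector<double> &t, std::size_t i) {
  const double h = 1e-6;
  x[i] += h;
  const double lp = ce_loss(x, t);
  x[i] -= 2 * h;
  const double lm = ce_loss(x, t);
  return (lp - lm) / (2 * h);
}
```

The two gradients agree to well within the accuracy of the finite difference.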

  59. So at the end of the softmax GLSL we simply output the gradient, y − t:

      void main() {
        const uint input_index = gl_GlobalInvocationID.x;
        const uint data_index = gl_GlobalInvocationID.z;
        float value1 = input_index < width ? exp( input_data[ input_index + data_index * width ] ) : 0.0;
        float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
        float t = teacher_data[ input_index + data_index * width ];
        float y_ = max( y, 1.0e-10 );
        float value2 = input_index < width ? t * log( y_ ) : 0.0;
        float l = -large_sum( float( value2 ) );
        if( input_index == 0 )
          output_data[ data_index ] = l;
        if( input_index < width )
          input_grad[ input_index + data_index * width ] = float( y - t );  // the gradient
      }

  60. The final layer's activation is tanh, whose outputs lie in −1 ≤ x ≤ 1, while the
      loss function expects inputs with 0 ≤ x, so the ranges are aligned first:
          si = xi / 2 + 1 / 2
          yi = exp(si) / Σj exp(sj)
          L = − Σi ti log(yi)

  61. The gradient then picks up a factor of 1/2:
          ∂L/∂si = yi − ti,   ∂si/∂xi = 1/2
          ∂L/∂xi = (∂L/∂si)(∂si/∂xi) = (yi − ti) / 2

      void main() {
        const uint input_index = gl_GlobalInvocationID.x;
        const uint data_index = gl_GlobalInvocationID.z;
        float value1 = input_index < width ?
          exp( input_data[ input_index + data_index * width ] * 0.5 + 0.5 ) :
          0.0;
        float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
        float t = teacher_data[ input_index + data_index * width ];
        float y_ = max( y, 1.0e-10 );
        float value2 = input_index < width ? t * log( y_ ) : 0.0;
        float l = -large_sum( float( value2 ) );
        if( input_index == 0 )
          output_data[ data_index ] = l;
        if( input_index < width )
          input_grad[ input_index + data_index * width ] = float( y - t ) * 0.5;
      }

  62. Backpropagating the Hyperbolic Tangent: from
          yi = tanh(xi),   ∂yi/∂xi = 1 − tanh²(xi)
      the gradient on the input side is the gradient on the output side, ∂L/∂yi,
      times that factor:
          ∂L/∂xi = (∂L/∂yi)(∂yi/∂xi) = (∂L/∂yi)(1 − tanh²(xi))

  63. The tanh backward pass in GLSL:

      void main() {
        const uint input_index = gl_GlobalInvocationID.x;
        const uint data_index = gl_GlobalInvocationID.z;
        const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
        for( uint offset = 0; offset < width; offset += input_width ) {
          if( ( offset + input_index ) < width )
            input_grad[ offset + input_index + data_index * width ] =
              ( 1 - pow( tanh( input_data[ offset + input_index + data_index * width ] ), 2 ) ) *
              output_grad[ offset + input_index + data_index * width ];
        }
      }

  64. Backpropagating the Rectified Linear Unit:
          yi = xi  if xi ≥ 0, else 0
          ∂yi/∂xi = 1  if xi ≥ 0, else 0
          ∂L/∂xi = (∂L/∂yi)(∂yi/∂xi) = ∂L/∂yi  if xi ≥ 0, else 0

  65. The ReLU backward pass in GLSL:

      void main() {
        const uint input_index = gl_GlobalInvocationID.x;
        const uint data_index = gl_GlobalInvocationID.z;
        const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
        for( uint offset = 0; offset < width; offset += input_width ) {
          if( ( offset + input_index ) < width )
            input_grad[ offset + input_index + data_index * width ] =
              input_data[ offset + input_index + data_index * width ] >= 0 ?
                output_grad[ offset + input_index + data_index * width ] :
                0.0;
        }
      }

  66. However you look at it, this is discontinuous: can a derivative even be defined?

  67. The ReLU paper [1] takes
          yi = log(1 + exp(xi))
      and approximates it by
          yi = xi  if xi ≥ 0, else 0.
      [1] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve
      restricted Boltzmann machines. In Proceedings of the 27th International Conference
      on Machine Learning (ICML'10), Johannes Fürnkranz and Thorsten Joachims (Eds.).
      Omnipress, USA, 807-814.

  68. Accordingly, its derivative
          ∂yi/∂xi = exp(xi) / (1 + exp(xi))
      is approximated by
          ∂yi/∂xi = 1  if xi ≥ 0, else 0.

  69. Backpropagating the matrix product. Two things are needed: the gradient of the
      weights w and the gradient of the input x. Starting with the weights (the red
      frame in the slide):
          yj = Σi wij xi,   ∂yj/∂wij = xi
          ∂L/∂wij = (∂L/∂yj) xi
      i.e. multiply the gradient of the output the weight contributed to
      by the input the weight was applied to.

  70. Next the gradient of the input x (the green frame in the slide):
          yj = Σi wij xi,   ∂yj/∂xi = wij
          ∂L/∂xi = Σj (∂L/∂yj) wij
      i.e. the sum, over the outputs the input influenced, of the gradient of that
      output times the weight connecting the two.
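The two rules, ∂L/∂wij = (∂L/∂yj) xi and ∂L/∂xi = Σj (∂L/∂yj) wij, can be checked on a tiny example where a constant vector c stands in for the gradient ∂L/∂y arriving from the next layer. This is a verification sketch, not part of the deck; all names are made up:

```cpp
#include <cassert>
#include <vector>

// Forward: y_j = sum_i w_ij x_i, with a toy loss L = sum_j c_j y_j so that
// dL/dy_j = c_j plays the role of the gradient from the next layer.
// w is stored as w[i*m + j] for an n-input, m-output layer.
double toy_loss(const std::vector<double> &w, const std::vector<double> &x,
                const std::vector<double> &c, std::size_t n, std::size_t m) {
  double l = 0.0;
  for(std::size_t j = 0; j != m; ++j) {
    double yj = 0.0;
    for(std::size_t i = 0; i != n; ++i) yj += w[i * m + j] * x[i];
    l += c[j] * yj;
  }
  return l;
}

// Backward rule for the input: dL/dx_i = sum_j (dL/dy_j) w_ij.
double grad_x(const std::vector<double> &w, const std::vector<double> &c,
              std::size_t m, std::size_t i) {
  double g = 0.0;
  for(std::size_t j = 0; j != m; ++j) g += c[j] * w[i * m + j];
  return g;
}

// Backward rule for the weight: dL/dw_ij = (dL/dy_j) x_i.
double grad_w(const std::vector<double> &x, const std::vector<double> &c,
              std::size_t i, std::size_t j) {
  return c[j] * x[i];
}
```

Since the toy loss is linear in both x and w, the analytic gradients are exact, which makes small hand computations easy to confirm.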

  71. The matrix-product backward pass in GLSL: launch as many threads as there are
      ∂L/∂wij, and assign the WorkGroups so that the threads contributing to one
      ∂L/∂xi can take their sum with the horizontal add:

      void main() {
        const uint input_index = gl_GlobalInvocationID.x;
        const uint output_index = gl_GlobalInvocationID.y;
        const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
        const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
        for( uint data_index = 0; data_index != batch_size; data_index++ ) {
          input_grad[ input_index + data_index * input_width ] = 0.0;
          for( uint offset = 0; offset < height; offset += output_width ) {
            float grad_x = ( offset + output_index ) < height ?
              weight[ offset + output_index + input_index * height ].x *
              output_grad[ offset + output_index + data_index * height ] :
              0.0;
            input_grad[ input_index + data_index * input_width ] += large_sum( grad_x );
          }
        }
        for( uint offset = 0; offset < height; offset += output_width ) {
          float grad_w_sum = 0.0;
          for( uint data_index = 0; data_index != batch_size; data_index++ ) {
            float grad_w = ( offset + output_index ) < height ?
              input_data[ input_index + data_index * input_width ] *
              output_grad[ offset + output_index + data_index * height ] :
              0.0;
            grad_w_sum += grad_w;
          }
          if( ( offset + output_index ) < height )
            adam( weight[ offset + output_index + input_index * height ], grad_w_sum );
        }
      }

  72. Stochastic gradient descent:
          w_{t+1} = w_t − μ ∂L/∂w_t
      We are "here" and want to reach the minimum of L. Basically, update w a little
      at a time, along ∂L/∂w_t, in the direction that makes L smaller.
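The update rule is easy to watch converge on a one-dimensional toy loss. This CPU sketch is not from the deck; the function name and the quadratic loss are made up for illustration:

```cpp
#include <cassert>
#include <cmath>

// Gradient descent w_{t+1} = w_t - mu * dL/dw on L(w) = (w - 3)^2,
// whose minimum is at w = 3 and whose gradient is dL/dw = 2 (w - 3).
double sgd_minimize(double w, double mu, int steps) {
  for(int t = 0; t != steps; ++t)
    w -= mu * 2.0 * (w - 3.0);  // step a little against the gradient
  return w;
}
```

With μ = 0.1 the distance to the minimum shrinks by a factor of 0.8 per step, so 100 steps from w = 0 land essentially on w = 3.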

  73. Its weakness: with
          w_{t+1} = w_t − μ ∂L/∂w_t
      it trips over any small bump in L and turns back,
      getting stuck short of where we want to arrive.

  74. The evolution of optimization algorithms:
      SGD (stochastic gradient descent), MomentumSGD, NAG, AdaGrad, RMSprop, AdaDelta,
      Adam, AdaMax, SMORMS3, RMSpropGraves, Eve, Nadam, Santa-E, Santa-SSS, AdaSecant,
      GD by GD

  75. Adam:
          g = ∂L/∂w_t
          m_t = β1 m_{t−1} + (1 − β1) g
          v_t = β2 v_{t−1} + (1 − β2) g²
          m̂_t = m_t / (1 − β1^t)
          v̂_t = v_t / (1 − β2^t)
          w_{t+1} = w_t − α m̂_t / (√v̂_t + ϵ)
      α = 0.001, β1 = 0.9, β2 = 0.999 are the recommended values, so they are used as-is.

      void adam( inout vec4 weight, in float grad ) {
        weight.w += 1;
        float gt = grad;
        weight.y = beta1 * weight.y + ( 1 - beta1 ) * gt;
        weight.z = beta2 * weight.z + ( 1 - beta2 ) * gt * gt;
        float mhat = weight.y / ( 1 - pow( beta1, weight.w ) );
        float vhat = weight.z / ( 1 - pow( beta2, weight.w ) );
        weight.x -= alpha * mhat / ( sqrt( vhat ) + eps );
      }

      Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization.
      2014. arXiv:1412.6980v9 https://arxiv.org/abs/1412.6980
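The GLSL `adam` above packs (w, m, v, t) into one vec4. A direct C++ port makes the update rule easy to check on the CPU (a sketch; the struct and function names are made up):

```cpp
#include <cassert>
#include <cmath>

struct adam_state { double w, m, v, t; };  // mirrors the vec4: x = w, y = m, z = v, w = t

// One Adam step, following the GLSL adam() with the recommended constants.
void adam_step(adam_state &s, double grad,
               double alpha = 0.001, double beta1 = 0.9,
               double beta2 = 0.999, double eps = 1e-8) {
  s.t += 1;
  s.m = beta1 * s.m + (1 - beta1) * grad;        // first-moment estimate
  s.v = beta2 * s.v + (1 - beta2) * grad * grad; // second-moment estimate
  double mhat = s.m / (1 - std::pow(beta1, s.t)); // bias correction
  double vhat = s.v / (1 - std::pow(beta2, s.t));
  s.w -= alpha * mhat / (std::sqrt(vhat) + eps);
}
```

On the very first step the bias corrections cancel the (1 − β) factors, so the update is almost exactly α in the direction opposing the gradient's sign, regardless of the gradient's magnitude.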

  76. Initial values: neurons with identical w behave identically.
      The gradients ∂L/∂w of neurons producing the same output coincide,
      so their weights receive identical updates and the neurons keep behaving
      identically forever. Therefore the initial weights of a neural network
      must not be uniform.

  77. Xavier initialization: when a layer has n inputs, initialize each w with
          w = (1 / √n) · randn()
      where randn() draws standard normal random numbers.
      Understanding the difficulty of training deep feedforward neural networks.
      Xavier Glorot and Yoshua Bengio. Proceedings of the Thirteenth International
      Conference on Artificial Intelligence and Statistics, PMLR, p. 249-256, 2010.
      http://proceedings.mlr.press/v9/glorot10a.html
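On the CPU, Xavier initialization can be sketched with the standard library's normal distribution; on the GPU the deck instead generates the normal random numbers itself, as the next slides show. This sketch and its helper names are not from the deck:

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

// Xavier initialization: w = randn() / sqrt(n) for a layer with n inputs.
std::vector<double> xavier_init(std::size_t count, std::size_t n, unsigned seed) {
  std::mt19937 gen(seed);
  std::normal_distribution<double> randn(0.0, 1.0 / std::sqrt(double(n)));
  std::vector<double> w(count);
  for(auto &v : w) v = randn(gen);
  return w;
}

// Sample statistics, for checking the distribution of the generated weights.
double mean(const std::vector<double> &v) {
  double s = 0.0;
  for(double x : v) s += x;
  return s / v.size();
}
double variance(const std::vector<double> &v) {
  const double m = mean(v);
  double s = 0.0;
  for(double x : v) s += (x - m) * (x - m);
  return s / v.size();
}
```

For n = 16 inputs the sample mean should be near 0 and the sample variance near 1/16.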

  78. Generating random numbers on the GPU: most pseudo-random algorithms carry state
      that is updated every time one number is emitted (G_t → G_{t+1} → G_{t+2} → ...).
      The next number cannot be generated until the previous one has been,
      so this does not scale.

  79. A mysterious uniform random number algorithm, handed down among video game
      developers for roughly the last decade:
          s = (12.9898, 78.233),  t = 43758.5453
          fract(x) = x − ⌊x⌋
          f(x) = fract(t · sin(x · s))
      e.g. f((0.1, 0.8)) = 0.7340,  f((0.3, 0.2)) = 0.1768.
      No generator state is carried from one element to the next, so it scales.
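The "mysterious" generator is trivially portable to C++ for inspection. A sketch, not from the deck; note that the exact sequence depends on floating-point precision, so a GPU's float version will produce different values than this double version:

```cpp
#include <cassert>
#include <cmath>

// f(x) = fract(t * sin(x . s)) with s = (12.9898, 78.233), t = 43758.5453.
// Stateless: the output depends only on the input coordinate, so every
// thread can evaluate it independently, which is why it scales.
double prand(double x, double y) {
  double v = std::sin(x * 12.9898 + y * 78.233) * 43758.5453;
  return v - std::floor(v);  // fract
}
```

The output always lies in [0, 1), and the same coordinate always yields the same value.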

  80. The distribution of the generator's output, and that distribution converted to
      a normal distribution with the Box-Muller method: it has some stray hairs
      sticking out, but does not look unusable.

  81. Xavier initialization in GLSL: generate uniform random numbers, turn them into
      normal random numbers with the Box-Muller method, and evaluate the Xavier
      initial value with (thread count = number of weight elements):

      float prand( vec2 i ) {
        return fract( sin( dot( i.xy, vec2( 12.9898, 78.233 ) ) ) * 43758.5453 );
      }
      const float PI = 3.1415926535897932384626433832795;
      float boxmuller( vec2 i, float mu, float sigma ) {
        float x = 1 - prand( i );
        float y = prand( vec2( i.y, x ) );
        float n = prand( vec2( x, y * PI ) );
        float v = sqrt( -2.0 * log( x ) ) * cos( 2 * PI * n );
        return mu + sigma * v;
      }
      float xavier_init_value( vec2 i, uint n ) {
        float value = boxmuller( i, 0.0, 1.0 / sqrt( n ) );
        return value;
      }
      void main() {
        const uint x = gl_GlobalInvocationID.x;
        const uint y = gl_GlobalInvocationID.y;
        const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
        const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
        const uint index = x + y * width;
        weight[ index ] = vec4(
          xavier_init_value( vec2( float( x )/width, float( y )/height ), input_size ),
          0, 0, 0
        );
      }

  82. He initialization: when a layer has n inputs, initialize each w with
          w = √(2 / n) · randn()
      where randn() draws standard normal random numbers.
      Said to propagate the initial error better than Xavier initialization
      when ReLU is used.
      Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet
      Classification. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. 2015.
      https://arxiv.org/abs/1502.01852

  83. He initialization in GLSL: only the standard deviation differs (σ = √2 / √n):

      float prand( vec2 i ) {
        return fract( sin( dot( i.xy, vec2( 12.9898, 78.233 ) ) ) * 43758.5453 );
      }
      const float PI = 3.1415926535897932384626433832795;
      float boxmuller( vec2 i, float mu, float sigma ) {
        float x = 1 - prand( i );
        float y = prand( vec2( i.y, x ) );
        float n = prand( vec2( x, y * PI ) );
        float v = sqrt( -2.0 * log( x ) ) * cos( 2 * PI * n );
        return mu + sigma * v;
      }
      float he_init_value( vec2 i, uint n ) {
        float value = boxmuller( i, 0.0, sqrt( 2 ) / sqrt( n ) );  // this line differs
        return value;
      }
      void main() {
        const uint x = gl_GlobalInvocationID.x;
        const uint y = gl_GlobalInvocationID.y;
        const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
        const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
        const uint index = x + y * width;
        weight[ index ] = vec4(
          he_init_value( vec2( float( x )/width, float( y )/height ), input_size ),
          0, 0, 0
        );
      }

  84. hidden_affine1.reset( new layer( create_affine_forward_pipeline(
    device, mods, descriptor_pool, pipeline_cache, props, batch_images[ 0 ], hidden_affine_output, hidden_weight,
    batch_size
    ) ) );
    hidden_affine2.reset( new layer( create_affine_forward_pipeline(
    device, mods, descriptor_pool, pipeline_cache, props, batch_images[ 1 ], hidden_affine_output, hidden_weight,
    batch_size
    ) ) );
    hidden_activation.reset( new layer( create_relu_forward_pipeline(
    device, mods, descriptor_pool, pipeline_cache, props, hidden_affine_output, hidden_activation_output
    ) ) );
    output_affine.reset( new layer( create_affine_forward_pipeline(
    device, mods, descriptor_pool, pipeline_cache, props, hidden_activation_output, output_affine_output,
    output_weight, batch_size
    ) ) );
    output_activation.reset( new layer( create_tanh_forward_pipeline(
    device, mods, descriptor_pool, pipeline_cache, props, output_affine_output, output_activation_output
    ) ) );
    error1.reset( new layer( create_softmax_combined_pipeline(
    device, mods, descriptor_pool, pipeline_cache, props, output_activation_output, error_out, softmax_grad,
    batch_labels[ 0 ]
    ) ) );
    error2.reset( new layer( create_softmax_combined_pipeline(
    device, mods, descriptor_pool, pipeline_cache, props, output_activation_output, error_out, softmax_grad,
    batch_labels[ 1 ]
    ) ) );
    output_activation_backward.reset( new layer( create_tanh_backward_pipeline(
    device, mods, descriptor_pool, pipeline_cache, props, output_affine_output, output_activation_output,
    output_activation_grad, softmax_grad
    ) ) );
    output_affine_backward.reset( new layer( create_affine_backward_pipeline(
    device, mods, descriptor_pool, pipeline_cache, props, hidden_activation_output, output_affine_output,
    Compile the GLSL, build a ComputePipeline for each layer,
    and bind the buffers to it.

  85. void network::exec() {
    ++swap_index;
    swap_index %= 2;
    queue->submit(
    vk::SubmitInfo()
    .setCommandBufferCount( 1 )
    .setPCommandBuffers( command_buffers->data() + swap_index ),
    vk::Fence()
    );
    fill( false, false );
    queue->waitIdle();
    if( debug ) {
    std::cout << "==============" << std::endl;
    check();
    print( *error_out, batch_size );
    print( *output_activation_output, batch_size );
    print_image( *batch_images[ swap_index ], train_input->get_image_width(), batch_size );
    print_label( *batch_labels[ swap_index ], batch_size );
    print_eval( *output_activation_output, batch_size );
    }
    }
    Execution: prepare two sets of buffers, each holding a pair of
    input and desired output, so that training on one set and
    transferring the next batch into the other happen at the same
    time. Submitting a command buffer throws its contents at the GPU.

  86. MNIST
    http://yann.lecun.com/exdb/mnist/
    70,000 labeled handwritten digit images.
    A simple classification task: even an SVM classifies it with
    over 90% accuracy.
    A correct implementation cannot fail to classify these.

  87. Batch size 64, hidden layer width 128
    Classification accuracy on the evaluation data: around 98%.
    It is functioning as a neural network.

  88. Fashion-MNIST
    70,000 labeled clothing images covering 10 classes:
    T-shirts, coats, shoes, and so on.
    Different garments can still have similar shapes,
    so it is considered harder than MNIST.
    https://github.com/zalandoresearch/fashion-mnist

  89. Batch size 64, hidden layer width 128
    Classification accuracy on the evaluation data: around 87%.
    Indeed somewhat lower.


  90. [Network diagram: stacked affine (matrix product) layers with
    activation φ and loss L]
    When the accuracy is not good enough, add more layers.
    But every plain matrix-product layer you add makes the number of
    weights w grow rapidly.

  91. Convolution
    Filter w, input x, output y:
    an image-processing filter that is learned.


  93. For filter size M × N, stride 1, no margin:
    y_ij = Σ_{k=0}^{M} Σ_{l=0}^{N} w_kl · x_(i+k)(j+l)
    [Diagram: the filter w slides over the input x to produce each
    element of the output y]
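The formula can be sketched on the CPU for reference; a minimal single-channel C++ version (illustrative only, not the GLSL kernel used in the samples):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[i][j] = sum_{k,l} w[k][l] * x[i+k][j+l]  (stride 1, no margin)
std::vector<std::vector<float>> convolve(
    const std::vector<std::vector<float>>& x,
    const std::vector<std::vector<float>>& w) {
  const std::size_t M = w.size(), N = w[0].size();
  const std::size_t H = x.size() - M + 1, W = x[0].size() - N + 1;
  std::vector<std::vector<float>> y(H, std::vector<float>(W, 0.f));
  for (std::size_t i = 0; i < H; ++i)
    for (std::size_t j = 0; j < W; ++j)
      for (std::size_t k = 0; k < M; ++k)
        for (std::size_t l = 0; l < N; ++l)
          y[i][j] += w[k][l] * x[i + k][j + l];
  return y;
}
```

A 3 × 3 input with a 2 × 2 filter produces a 2 × 2 output, since each output element needs the full filter window to fit inside the input.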

  94. Backpropagation through the convolution
    (filter size M × N, stride 1, no margin):
    forward:        y_ij = Σ_{k=0}^{M} Σ_{l=0}^{N} w_kl · x_(i+k)(j+l)
    gradient of w:  ∂L/∂w_kl = Σ_i Σ_j (∂L/∂y_ij) · x_(i+k)(j+l)
    For every pair of input and output that a given weight was
    involved in, take the sum of the products of the output-side
    gradient and the input.

  95. Backpropagation through the convolution
    (filter size M × N, stride 1, no margin):
    forward:        y_ij = Σ_{k=0}^{M} Σ_{l=0}^{N} w_kl · x_(i+k)(j+l)
    gradient of w:  ∂L/∂w_kl = Σ_i Σ_j (∂L/∂y_ij) · x_(i+k)(j+l)
    gradient of x:  ∂L/∂x_ij = Σ_{k}^{M} Σ_{l}^{N} w_kl · ∂L/∂y_(i−k)(j−l)
    That is, convolve with the output-side gradient rotated by
    180 degrees.
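The ∂L/∂w_kl formula as a minimal single-channel CPU sketch (illustrative names; in the samples this is done by a GLSL kernel):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// dW[k][l] = sum_{i,j} dY[i][j] * x[i+k][j+l]
// Note that the weight gradient is itself a convolution of the
// input with the output-side gradient.
std::vector<std::vector<float>> conv_weight_grad(
    const std::vector<std::vector<float>>& x,
    const std::vector<std::vector<float>>& dY,
    std::size_t M, std::size_t N) {
  std::vector<std::vector<float>> dW(M, std::vector<float>(N, 0.f));
  for (std::size_t k = 0; k < M; ++k)
    for (std::size_t l = 0; l < N; ++l)
      for (std::size_t i = 0; i < dY.size(); ++i)
        for (std::size_t j = 0; j < dY[0].size(); ++j)
          dW[k][l] += dY[i][j] * x[i + k][j + l];
  return dW;
}
```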

  96. void main() {
    const uint filter_index = gl_GlobalInvocationID.x;
    const uint filter_x = filter_index % filter_width;
    const uint filter_y = filter_index / filter_width % filter_height;
    const uint channel = filter_index / filter_width / filter_height % channels;
    const uint filter_size = filter_width * filter_height * channels;
    const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
    const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
    bool filter_oob = filter_index >= filter_size;
    float sum = 0.0;
    for( int data_index = 0; data_index != batch_size; ++data_index ) {
    for( int output_x = 0; output_x != output_width; ++output_x ) {
    for( int output_y = 0; output_y != output_height; ++output_y ) {
    const int output_index = int( output_x ) + int( output_y ) * int( output_width ) +
    int( channel ) * int( output_width * output_height ) +
    data_index * int( output_width * output_height * channels );
    const int input_x = output_x * int(filter_xstride) - int(xmargin) + int(filter_x);
    const int input_y = output_y * int(filter_ystride) - int(ymargin) + int(filter_y);
    const bool input_oob = filter_oob || input_x < 0 || input_x >= input_width ||
    input_y < 0 || input_y >= input_height;
    const int input_index = int( input_x ) + int( input_y ) * int( input_width ) +
    int( channel ) * int( input_width * input_height ) +
    data_index * int( input_width * input_height * channels );
    const float grad = filter_oob ? 0.0 : output_grad[ output_index ];
    const float x = input_oob ? 0.0 : input_data[ input_index ];
    sum += grad * x;
    }
    }
    }
    if( !filter_oob ) adam( weight[ filter_index ], sum );
    }
    GLSL that computes ∂L/∂w_kl

  97. void main() {
    const uint input_x = gl_GlobalInvocationID.x % output_width;
    const uint input_y = gl_GlobalInvocationID.x / output_width % output_height;
    const uint channel = gl_GlobalInvocationID.x / output_width / output_height;
    const uint data_index = gl_GlobalInvocationID.z;
    const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
    const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
    const uint input_size = input_width * input_height * channels;
    const uint relative_input_index = input_x + input_y * input_width + channel * input_width * input_height;
    const uint input_index = relative_input_index + data_index * input_width * input_height * channels;
    if( relative_input_index < input_size ) input_grad[ input_index ] = 0.0;
    for( int x = 0; x != filter_width; ++x ) {
    for( int y = 0; y != filter_height; ++y ) {
    const int output_x = int(input_x) * int(filter_xstride) - int(xmargin) - x;
    const int output_y = int(input_y) * int(filter_ystride) - int(ymargin) - y;
    const bool oob = output_x < 0 || output_x >= output_width || output_y < 0 || output_y >= output_height;
    const int relative_output_index = output_x + output_y * int(output_width) +
    int(channel) * int(output_width * output_height);
    const int output_index = relative_output_index +
    int(data_index) * int(output_width * output_height * channels );
    const uint filter_index = x + y * int(filter_width) +
    channel * int(filter_width * filter_height );
    if( relative_input_index < input_size ) {
    if( !oob ) {
    const float grad = output_grad[ output_index ] * weight[ filter_index ].x;
    input_grad[ input_index ] += grad;
    }
    }
    }
    }
    }
    GLSL that computes ∂L/∂x_ij

  98. MaxPooling
    Input x:
      3 6 1
      2 2 0
      5 9 6
    Only the one element whose value is the maximum within the
    window is kept in the output y.

  99. MaxPooling
    Input x:
      3 6 1
      2 2 0
      5 9 6
    Output y: 9
    Only the one element whose value is the maximum within the
    window is kept in the output.



  100. MaxPooling
    void main() {
    const uint relative_output_index = gl_GlobalInvocationID.x;
    const uint output_x = relative_output_index % output_width;
    const uint output_y = relative_output_index / output_width % output_height;
    const uint channel = relative_output_index / output_width / output_height;
    const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
    const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
    const uint data_index = gl_GlobalInvocationID.z;
    const uint output_size = output_width * output_height * channels;
    const uint input_size = input_width * input_height * channels;
    const uint output_index = relative_output_index + data_index * output_size;
    if( relative_output_index < output_size ) output_data[ output_index ] = 0.0;
    for( uint x = 0; x != filter_width; ++x ) {
    for( uint y = 0; y != filter_height; ++y ) {
    const uint input_x = x + output_x * filter_xstride;
    const uint input_y = y + output_y * filter_ystride;
    const uint input_index = input_x + input_y * input_width + channel * input_width * input_height +
    data_index * input_width * input_height * channels;
    if( relative_output_index < output_size )
    output_data[ output_index ] = max( output_data[ output_index ], input_data[ input_index ] );
    }
    }
    }
    For filter size M × N:
    y_ij = max( x_(Mi+k)(Nj+l) ),  k ∈ [0, M], l ∈ [0, N]
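A minimal CPU sketch of the same pooling (single channel, non-overlapping windows where the stride equals the filter size; illustrative only):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// y[i][j] = max over the M x N window starting at (M*i, N*j).
std::vector<std::vector<float>> max_pool(
    const std::vector<std::vector<float>>& x,
    std::size_t M, std::size_t N) {
  const std::size_t H = x.size() / M, W = x[0].size() / N;
  std::vector<std::vector<float>> y(H, std::vector<float>(W));
  for (std::size_t i = 0; i < H; ++i)
    for (std::size_t j = 0; j < W; ++j) {
      float m = x[M * i][N * j];
      for (std::size_t k = 0; k < M; ++k)
        for (std::size_t l = 0; l < N; ++l)
          m = std::max(m, x[M * i + k][N * j + l]);
      y[i][j] = m;
    }
  return y;
}
```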

  101. The backward pass of MaxPooling
    (filter size M × N):
    forward:  y_ij = max( x_(Mi+k)(Nj+l) ),  k ∈ [0, M], l ∈ [0, N]
    ∂L/∂x_(Mi+k)(Nj+l) = ∂L/∂y_ij   if x_(Mi+k)(Nj+l) = y_ij
                         0           if x_(Mi+k)(Nj+l) ≠ y_ij
    The gradient corresponding to the input that supplied the
    maximum is set to the output-side gradient; every other input
    gets 0.

  102. void main() {
    const uint relative_output_index = gl_GlobalInvocationID.x;
    const uint output_x = relative_output_index % output_width;
    const uint output_y = relative_output_index / output_width % output_height;
    const uint channel = relative_output_index / output_width / output_height;
    const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
    const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
    const uint data_index = gl_GlobalInvocationID.z;
    const uint output_size = output_width * output_height * channels;
    const uint input_size = input_width * input_height * channels;
    const uint output_index = relative_output_index + data_index * output_size;
    const uint initial_input_x = output_x * filter_xstride;
    const uint initial_input_y = output_y * filter_ystride;
    for( uint x = 0; x != filter_width; ++x ) {
    for( uint y = 0; y != filter_height; ++y ) {
    const uint input_x = x + output_x * filter_xstride;
    const uint input_y = y + output_y * filter_ystride;
    const uint input_index = input_x + input_y * input_width + channel * input_width * input_height +
    data_index * input_width * input_height * channels;
    if( relative_output_index < output_size )
    input_grad[ input_index ] =
    ( input_data[ input_index ] == output_data[ output_index ] ) ? output_grad[ output_index ] : 0.0;
    }
    }
    }
    The backward pass of MaxPooling:
    ∂L/∂x_(Mi+k)(Nj+l) = ∂L/∂y_ij   if x_(Mi+k)(Nj+l) = y_ij
                         0           if x_(Mi+k)(Nj+l) ≠ y_ij
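The backward rule as a single-channel CPU sketch (illustrative only; like the GLSL above, ties route the gradient to every input equal to the maximum):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// dX gets dY where the input equals the pooled maximum, 0 elsewhere.
std::vector<std::vector<float>> max_pool_backward(
    const std::vector<std::vector<float>>& x,   // forward input
    const std::vector<std::vector<float>>& y,   // forward output
    const std::vector<std::vector<float>>& dY,  // output-side gradient
    std::size_t M, std::size_t N) {
  std::vector<std::vector<float>> dX(
      x.size(), std::vector<float>(x[0].size(), 0.f));
  for (std::size_t i = 0; i < y.size(); ++i)
    for (std::size_t j = 0; j < y[0].size(); ++j)
      for (std::size_t k = 0; k < M; ++k)
        for (std::size_t l = 0; l < N; ++l)
          if (x[M * i + k][N * j + l] == y[i][j])
            dX[M * i + k][N * j + l] = dY[i][j];
  return dX;
}
```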


  103. [Network diagram: a convolution + MaxPooling stage added in
    front of the affine layers and the loss L]  NEW!

  104. Batch size 64, hidden layer width 128,
    convolution channel counts 14:14
    Classification accuracy on the evaluation data: around 90%.
    A slight improvement.

  105. [Network diagram: a second convolution + MaxPooling stage
    added]  NEW!

  106. Batch size 64, hidden layer width 128,
    convolution channel counts 32:32:64:64
    This is terrible.

  107. [Network diagram: two convolution + MaxPooling stages followed
    by the affine layers]
    φ is ReLU except in the final layer;
    only here is it the hyperbolic tangent.

  108. [Network diagram as above]
    ReLU can output arbitrarily large positive values.
    When training wobbles, a huge value such as 14326.7 gets rammed
    into tanh, which would like its input to stay around −1 ≤ x ≤ 1.

  109. [Network diagram as above]
    14326.7 is hopeless, and since it is still hopeless after a
    small change of the value, the gradient is 0:
    ∂L/∂x_i = (∂L/∂y_i) · (1 − tanh²(x_i))
    The gradient disappears, the network has no idea which way to
    adjust, and learning stops.
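The vanishing gradient is easy to reproduce numerically; a minimal sketch of the tanh backward formula above:

```cpp
#include <cassert>
#include <cmath>

// Gradient of tanh: dL/dx = dL/dy * (1 - tanh(x)^2).
// For saturated inputs the factor (1 - tanh(x)^2) collapses to 0.
double tanh_grad(double upstream, double x) {
  const double t = std::tanh(x);
  return upstream * (1.0 - t * t);
}
```

At x = 0 the factor is 1 and the gradient passes through unchanged; at x = 14326.7, tanh has long since saturated to 1.0 in double precision and the gradient is exactly 0.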

  110. Batch size 64, hidden layer width 128,
    convolution channel counts 32:32:64:64
    Reducing the learning rate of the convolution layers to 1/10
    improved the accuracy, though training became slow.

  111. Techniques that keep the distribution of a layer's outputs
    constant:
    Batch Normalization
    Sergey Ioffe and Christian Szegedy. Batch Normalization:
    Accelerating Deep Network Training by Reducing Internal
    Covariate Shift. 2015. https://arxiv.org/abs/1502.03167
    Layer Normalization
    Jimmy Lei Ba, Jamie Ryan Kiros and Geoffrey E. Hinton.
    Layer Normalization. 2016. https://arxiv.org/abs/1607.06450
    Group Normalization
    Yuxin Wu and Kaiming He. Group Normalization. 2018.
    https://arxiv.org/abs/1803.08494
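As a rough illustration of the shared idea, normalizing one feature across a batch to mean 0 and variance 1 (a minimal sketch; the actual techniques add learned scale and shift parameters and differ in which axis they normalize over):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Normalize the values of one feature across a batch.
std::vector<double> batch_normalize(const std::vector<double>& x,
                                    double eps = 1e-5) {
  double mean = 0.0;
  for (double v : x) mean += v;
  mean /= x.size();
  double var = 0.0;
  for (double v : x) var += (v - mean) * (v - mean);
  var /= x.size();
  std::vector<double> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    y[i] = (x[i] - mean) / std::sqrt(var + eps);  // eps avoids /0
  return y;
}
```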

  112. Bonus: TensorCore
    Multiplies a 16 × 16 element matrix A by a 16 × 8 element
    matrix B, adds a 16 × 8 element matrix C, and yields the result
    of AB + C in a single instruction using 32 threads.
    On NVIDIA GPUs that support the VK_NV_cooperative_matrix
    extension, this can be used from Vulkan as well.

  113. Why wasn't TensorCore used?
    The GeForce GTX 1070 that happened to be lying around at home
    had no TensorCores.

  114. Summary
    Deep learning is an algorithm,
    so implement it and it runs anywhere.