Deep Learning for Low-Level People (低レイヤーな人のためのディープラーニング)

Fadis
July 20, 2019


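This deck explains how to implement a convolutional neural network on Vulkan without relying on a framework.
These are the slides from a talk given at the 15th Kernel/VM Tankentai (カーネル/VM探検隊) on July 20, 2019.
Sample code: https://github.com/Fadis/kernelvm_20190720_samples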


Transcript

  1. 3.

    Formal neuron

    [Figure: the inputs $x_0 \dots x_4$ are multiplied by the weights $w_{02} \dots w_{42}$, summed ($\sum$), and passed through $\phi$ to produce $y_2$.]
  2. 4.

    Layer

    $$y_j = \phi\left(\sum_i w_{ij} x_i\right)$$

    [Figure: the inputs $x_0 \dots x_4$ feed five formal neurons in parallel, producing $y_0 \dots y_4$.]
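
The layer formula above can be sketched on the CPU as follows. This is only a minimal reference, not code from the sample repository: the names `layer_forward` and `relu` and the row-major weight layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one layer: y_j = phi(sum_i w_ij * x_i).
// Assumed layout: w is row-major with `outputs` columns, w[i * outputs + j] = w_ij.
std::vector<float> layer_forward(const std::vector<float>& w,
                                 const std::vector<float>& x,
                                 std::size_t outputs,
                                 float (*phi)(float)) {
  assert(w.size() == x.size() * outputs);
  std::vector<float> y(outputs, 0.0f);
  for (std::size_t j = 0; j < outputs; ++j) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) sum += w[i * outputs + j] * x[i];
    y[j] = phi(sum);  // apply the activation to the weighted sum
  }
  return y;
}

// Rectified linear unit, used here as an example activation.
float relu(float v) { return v > 0.0f ? v : 0.0f; }
```

With two inputs `{1, 2}` and weights `{1, -1, 2, 0.5}`, the two outputs are `relu(1*1 + 2*2)` and `relu(-1*1 + 0.5*2)`.
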
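  3. 5.

    Neural network

    [Figure: layers with weights $w_0 \dots w_3$ are chained; the inputs $x_{00} \dots x_{04}$ pass through the intermediate values $x_{1\cdot}$, $x_{2\cdot}$, $x_{3\cdot}$ to the outputs $y_0 \dots y_4$.]

    An infinite stack of layers can represent any function, depending on its weights.
    Even a finite stack can approximate a wide variety of functions.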
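  4. 8.

    Training

    [Figure: for each input $(x_0, x_1, x_2)$ the network with weights $w$ produces $(y_0, y_1, y_2)$, which is compared with the desired output $(t_0, t_1, t_2)$; $w$ is then corrected. This is repeated for every sample.]

    Given a large number of inputs with their corresponding outputs, repeating this operation turns the neural network into an approximation of the function that relates $x$ to $t$.
    In other words, a function is obtained from data.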
  5. 13.

    [Figure: frameworks (TensorFlow, Chainer, Caffe, PyTorch, Theano, Keras, ...) sitting on top of CPU, GPU and FPGA back ends.]

        $ du -h tensorflow-1.12.3/
        ...
        145M tensorflow-1.12.3/
        $ du -h pytorch-1.1.0/
        ...
        44M pytorch-1.1.0/
        $ du -h chainer-6.1.0/
        ...
        22M chainer-6.1.0/
        $ du -h keras-2.2.4/
        ...
        3.0M keras-2.2.4/

    They are huge.
  6. 20.

    Create a Vulkan instance and choose which of the available GPUs to use.

        if( config.validation )
          layers.emplace_back( "VK_LAYER_LUNARG_standard_validation" );
        const auto app_info = vk::ApplicationInfo(
          config.prog_name.c_str(), /* omitted on the slide */, VK_API_VERSION_1_1
        );
        instance_ptr_t instance( new vk::Instance( vk::createInstance(
          vk::InstanceCreateInfo()
            .setPApplicationInfo( &app_info )
            .setEnabledExtensionCount( ext.size() ).setPpEnabledExtensionNames( ext.data() )
            .setEnabledLayerCount( layers.size() ).setPpEnabledLayerNames( layers.data() )
        ) ) );
        auto devices = instance->enumeratePhysicalDevices();
        if( devices.empty() ) throw device_is_not_available();
        // Drop every physical device that lacks a required extension or layer.
        devices.erase( std::remove_if( devices.begin(), devices.end(),
          [&]( const auto &d ) -> bool {
            auto avail_dext = d.enumerateDeviceExtensionProperties();
            for( const char *w: dext )
              if( std::find_if( avail_dext.begin(), avail_dext.end(),
                    [&]( const auto &v ) { return !strcmp( v.extensionName, w ); }
                  ) == avail_dext.end() ) return true;
            const auto avail_dlayers = d.enumerateDeviceLayerProperties();
            for( const char *w: dlayers )
              if( std::find_if( avail_dlayers.begin(), avail_dlayers.end(),
                    [&]( const auto &v ) { return !strcmp( v.layerName, w ); }
                  ) == avail_dlayers.end() ) return true;
            return false;
          } ), devices.end() );
        if( devices.empty() ) throw required_extensions_or_layers_are_not_available();
  7. 21.

    Create the logical device, the queue and the command pool.

        const auto queue_props = physical_device.getQueueFamilyProperties();
        // Pick the first queue family that supports both compute and transfer.
        uint32_t queue_index = std::distance( queue_props.begin(), std::find_if(
          queue_props.begin(), queue_props.end(),
          []( const auto &v ) {
            return bool( v.queueFlags & vk::QueueFlagBits::eCompute ) &&
                   bool( v.queueFlags & vk::QueueFlagBits::eTransfer );
          } ) );
        if( queue_index == queue_props.size() ) throw required_queue_is_not_available();
        const float priority = 0.0f;
        std::vector< vk::DeviceQueueCreateInfo > queues{};
        const auto queue_create_info = vk::DeviceQueueCreateInfo()
          .setQueueFamilyIndex( queue_index ).setQueueCount( 1 ).setPQueuePriorities( &priority );
        const auto features = physical_device.getFeatures();
        auto device = physical_device.createDevice( vk::DeviceCreateInfo()
          .setQueueCreateInfoCount( 1 ).setPQueueCreateInfos( &queue_create_info )
          .setEnabledExtensionCount( dext.size() ).setPpEnabledExtensionNames( dext.data() )
          .setEnabledLayerCount( dlayers.size() ).setPpEnabledLayerNames( dlayers.data() )
          .setPEnabledFeatures( &features ) );
        std::shared_ptr< vk::Device > d( new vk::Device( std::move( device ) ),
          []( const auto &p ) { if( p ) { p->destroy(); delete p; } } );
        auto queue = device.getQueue( queue_index, 0 );
        auto command_pool = device.createCommandPool( vk::CommandPoolCreateInfo()
          .setQueueFamilyIndex( queue_index )
          .setFlags( vk::CommandPoolCreateFlagBits::eResetCommandBuffer ) );
        // The queue deleter only keeps the device alive; queues are not destroyed explicitly.
        std::shared_ptr< vk::Queue > q( new vk::Queue( std::move( queue ) ), [d]( const auto& ) {} );
        std::shared_ptr< vk::CommandPool > p( new vk::CommandPool( std::move( command_pool ) ),
          [d]( const vk::CommandPool *p ) { if( p ) { d->destroyCommandPool( *p );
        // (the slide is cut off here in the transcript)
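  8. 22.

    When the input is a vector of $N$ elements and the output is a vector of $M$ elements, the weights can be regarded as the following matrix:

    $$w = \begin{pmatrix} w_{00} & w_{01} & \cdots & w_{0M} \\ w_{10} & w_{11} & \cdots & w_{1M} \\ \vdots & \vdots & & \vdots \\ w_{N0} & w_{N1} & \cdots & w_{NM} \end{pmatrix}$$

    $$y = \phi(wx)$$

    Computing one layer then takes two steps:
    1. Compute the product of the vector and the matrix.
    2. Apply $\phi$ to each element of the result of step 1.

    [Figure: $x$ → product (with $w$) → $\phi$ → $y$.]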
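  9. 23.

    Mini-batch

    [Figure: the samples $(x_0, x_1, x_2)$ and $(x_3, x_4, x_5)$ are processed as bundles; each bundle yields outputs $y$ that are compared with $t$, after which $w$ is corrected once per bundle.]

    Bundle several input vectors into a matrix.
    Compute the error for each bundle, then correct the weights.
    Training is more stable than when $w$ is corrected for every individual sample.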
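  10. 24.

    [Figure: the matrix of rows $x_0, x_1, \dots, x_b$ → product (with $w$) → $\phi$ → the matrix of rows $y_0, y_1, \dots, y_b$.]

    1. Compute the product of the matrix and the matrix.
    2. Apply $\phi$ to each element of the result of step 1.

    Each $x_n$ is an input vector, each $y_n$ is an output vector, and $b$ is the batch size.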
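  11. 25.

    [Figure: the inputs $x_0, x_1, \dots, x_b$ → product ($w_h$) → $\phi$ → product ($w_o$) → $\phi$ → loss function, which also receives the desired outputs $t_0, t_1, \dots, t_b$ and produces $L$.]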
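  12. 26.

    [Figure: the same network annotated with the buffers it needs: input, hidden-layer weights, hidden-layer product, hidden-layer output, output-layer weights, output-layer product, output-layer output, desired output, and the error produced by the loss function.]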
  13. 27.

    Allocate GPU memory.

        // Weights are vec4: .x holds the weight, the other components hold optimizer state.
        hidden_weight.reset( new liblnn::buffer< glm::vec4 >( allocator, buf_type,
          vk::BufferCreateInfo().setSize( input_width * hidden_width * sizeof( glm::vec4 ) ).setUsage( copyable ) ) );
        output_weight.reset( new liblnn::buffer< glm::vec4 >( allocator, buf_type,
          vk::BufferCreateInfo().setSize( hidden_width * output_width * sizeof( glm::vec4 ) ).setUsage( copyable ) ) );
        hidden_affine_output.reset( new liblnn::buffer< float >( allocator, buf_type,
          vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
        hidden_relu_output.reset( new liblnn::buffer< float >( allocator, buf_type,
          vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
        output_affine_output.reset( new liblnn::buffer< float >( allocator, buf_type,
          vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
        output_relu_output.reset( new liblnn::buffer< float >( allocator, buf_type,
          vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
        softmax_grad.reset( new liblnn::buffer< float >( allocator, buf_type,
          vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
  14. 28.

    [Figure: input and weights → product → output.]

    Compute the product of the input (a matrix) and the weights (a matrix) on the GPU.
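  15. 32.

    $$\begin{pmatrix} x_{00} & x_{01} & x_{02} \\ x_{10} & x_{11} & x_{12} \\ x_{20} & x_{21} & x_{22} \end{pmatrix} \begin{pmatrix} w_{00} & w_{01} & w_{02} \\ w_{10} & w_{11} & w_{12} \\ w_{20} & w_{21} & w_{22} \end{pmatrix} = \begin{pmatrix} \sum_i x_{0i} w_{i0} & \sum_i x_{0i} w_{i1} & \sum_i x_{0i} w_{i2} \\ \sum_i x_{1i} w_{i0} & \sum_i x_{1i} w_{i1} & \sum_i x_{1i} w_{i2} \\ \sum_i x_{2i} w_{i0} & \sum_i x_{2i} w_{i1} & \sum_i x_{2i} w_{i2} \end{pmatrix}$$

    Clearly the computation can be parallelized up to the number of elements in the output matrix.

    $$\sum_i x_{0i} w_{i0} = x_{00} w_{00} + x_{01} w_{10} + x_{02} w_{20}$$

    Computing the terms inside each $\sum$ in parallel as well yields more threads, but then the question is how to take the $\sum$ of the values held by the individual threads.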
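  16. 33.

    A GPU has SRAM that multiple threads can use with synchronization.

    [Figure: thread A writes to SRAM, both threads synchronize, then thread B reads from SRAM.]

    Values can be handed to other threads in the same WorkGroup.
    Every thread stops at the synchronization point until all threads of the WorkGroup have reached it.

    This SRAM is called SharedMemory in Vulkan terms, and also SharedMemory in NVIDIA terms.
    The bundle of threads that can share the SRAM is called a WorkGroup in Vulkan terms and a Block in NVIDIA terms.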
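  17. 34.

    $\sum$ on a classical GPU

    $\sum$ is obtained with $\log_2(n)$ rounds of addition and synchronization.

    [Figure: four threads hold $x_0, x_1, x_2, x_3$. The partial sums $S_0$ and $S_1$ in SRAM are set from $x_0$ and $x_2$; after a synchronization the neighbouring values are added into $S_0$ and $S_1$; after another synchronization $S_0 := S_0 + S_1$ gives $x_0 + x_1 + x_2 + x_3$.]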
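  18. 35.

    [Figure: threads A, B, C, D hold $x_0, x_1, x_2, x_3$; after one horizontal add, every thread holds $x_0 + x_1 + x_2 + x_3$, without going through SRAM.]

    Reasonably recent GPUs can perform horizontal operations within a single Subgroup.
    These are called Subgroup operations in Vulkan terms and Warp Shuffle in NVIDIA terms.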
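  19. 36.

    $\sum$ on a modern GPU

    When the Subgroup size is 32, $\sum$ is obtained with $\log_{32}(n)$ rounds of addition and synchronization.

    [Figure: the values $x_0, x_1, \dots, x_{64}$ are split across Subgroups; each Subgroup performs a horizontal add and writes its partial sum to SRAM, a synchronization follows, and one more horizontal add yields $\sum_{i=0}^{64} x_i$.]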
  20. 38.

    $\sum$ in GLSL

        shared float local_sum[ local_memory_size ];
        float large_sum( in float value ) {
          // Horizontal add within the subgroup, then write the result to SharedMemory.
          float sg_sum = subgroupAdd( value );
          local_sum[ gl_SubgroupID ] = sg_sum;
          barrier(); // synchronize
          uint len = gl_NumSubgroups;
          while( len > 1 ) {
            // Horizontally add the values in SharedMemory, write the result back, synchronize.
            uint index = gl_SubgroupInvocationID + gl_SubgroupID * gl_SubgroupSize;
            float sum = subgroupAdd( index < len ? local_sum[ index ] : 0.0 );
            local_sum[ gl_SubgroupID ] = sum;
            barrier();
            len /= gl_SubgroupSize;
          }
          barrier();
          // Once SharedMemory is down to a single element, return that value.
          return local_sum[ 0 ];
        }
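
The round counts quoted for the two reduction schemes can be checked with a small CPU simulation. This is a sketch under assumptions: `grouped_sum` is a hypothetical helper, each pass of its outer loop stands for one "add, then synchronize" round, and `group_size` plays the role of either pairwise addition (2) or the Subgroup size (32).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU simulation of a grouped reduction.
// Every round replaces each run of `group_size` values by its sum, the way one
// horizontal add per subgroup (or one pairwise add) would on the GPU.
// Returns the total and stores the number of rounds in *steps.
float grouped_sum(std::vector<float> values, std::size_t group_size, std::size_t* steps) {
  assert(group_size >= 2 && !values.empty());
  *steps = 0;
  while (values.size() > 1) {
    std::vector<float> partial((values.size() + group_size - 1) / group_size, 0.0f);
    for (std::size_t i = 0; i < values.size(); ++i) partial[i / group_size] += values[i];
    values.swap(partial);  // the partial sums become the input of the next round
    ++*steps;
  }
  return values[0];
}
```

For 1024 values, a group size of 2 needs 10 rounds while a group size of 32 needs 2, which is the difference between $\log_2(n)$ and $\log_{32}(n)$.
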
  21. 39.

    Matrix product in GLSL

    The threads whose values must be summed are placed in the same WorkGroup, and the product of the input-matrix value and the weight-matrix value is accumulated into the output matrix.

        void main() {
          const uint input_index = gl_GlobalInvocationID.x;
          const uint output_index = gl_GlobalInvocationID.y;
          const uint data_index = gl_GlobalInvocationID.z;
          const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
          const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
          output_data[ output_index + data_index * output_width ] = 0.0;
          // Walk over the input in chunks of input_width threads.
          for( uint offset = 0; offset < width; offset += input_width ) {
            float value = ( offset + input_index ) < width ?
              input_data[ offset + input_index + data_index * width ] *
              weight[ output_index + ( offset + input_index ) * output_width ].x :
              0.0;
            output_data[ output_index + data_index * output_width ] += large_sum( value );
          }
        }
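  22. 40.

    Activation function

    [Figure: input → $\phi$ → output.]

    If the layers consisted of matrix products only, linearity would be preserved no matter how many layers were stacked.
    In other words, the network could approximate only linear functions.

    Inserting a nonlinear function between one matrix product and the next breaks the linearity and makes it possible to approximate nonlinear functions.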
  23. 44.

    Hyperbolic Tangent in GLSL

        void main() {
          const uint input_index = gl_GlobalInvocationID.x;
          const uint data_index = gl_GlobalInvocationID.z;
          const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
          for( uint offset = 0; offset < width; offset += input_width ) {
            if( ( offset + input_index ) < width )
              output_data[ offset + input_index + data_index * width ] =
                tanh( input_data[ offset + input_index + data_index * width ] );
          }
        }

    Rectified Linear Unit in GLSL

        void main() {
          const uint input_index = gl_GlobalInvocationID.x;
          const uint data_index = gl_GlobalInvocationID.z;
          const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
          for( uint offset = 0; offset < width; offset += input_width ) {
            if( ( offset + input_index ) < width )
              output_data[ offset + input_index + data_index * width ] =
                max( 0.0, input_data[ offset + input_index + data_index * width ] );
          }
        }
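  24. 45.

    Loss function

    [Figure: the obtained output and the desired output → loss function → error.]

    A function whose output becomes smaller the more closely the output of the neural network resembles the desired output.
    Attaching it at the end turns the search for suitable weights into an optimization problem.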
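  25. 46.

    Output format

    Output of the neural network:

    $$y = \begin{pmatrix} 0.8 & 0.000007 & 0.9 & 0.036 & 0.00005 \end{pmatrix}$$

    "It looks like either number 0 or number 2."

    Output we want to come out:

    $$t = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \end{pmatrix}$$

    "The correct answer is number 2."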
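  26. 47.

    softmax

    $$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

    $$y = \begin{pmatrix} 0.8 & 0.000007 & 0.9 & 0.036 & 0.00005 \end{pmatrix}$$

    $$\mathrm{softmax}(y) = \begin{pmatrix} 0.288213 & 0.129503 & 0.318524 & 0.134249 & 0.129509 \end{pmatrix}$$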
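
The softmax values on this slide can be reproduced with a few lines of C++. This is a minimal CPU sketch, not the shader from the deck: the function name `softmax` is an assumption, and subtracting the maximum before `exp()` is an extra safeguard against overflow that the slides do not use (it leaves the result unchanged).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for y_i = exp(x_i) / sum_j exp(x_j).
std::vector<double> softmax(const std::vector<double>& x) {
  assert(!x.empty());
  double max_x = x[0];
  for (double v : x) if (v > max_x) max_x = v;
  std::vector<double> y(x.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = std::exp(x[i] - max_x);  // shifted exponent; cancels in the ratio
    sum += y[i];
  }
  for (double& v : y) v /= sum;     // normalize so the outputs add up to 1
  return y;
}
```

Calling it on `{0.8, 0.000007, 0.9, 0.036, 0.00005}` gives values that agree with the slide to about five decimal places.
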
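  27. 48.

    Cross-entropy error

    $$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}} \qquad L = -\sum_i t_i \log(y_i)$$

    Writing $l$ for a single term $-t \log(y)$:
    When $t = 1$ and $y = 1$: $l = 0$
    When $t = 1$ and $y = 0.001$: $l \approx 6.908$
    When $t = 0$: $l = 0$

    For the softmax result $y_i$ to approach 1, the values of $y$ other than $y_i$ must be close to 0.
    In summary, the more $y$ resembles $t$, the smaller $L$ becomes.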
  28. 49.

    Softmax with Cross Entropy Loss in GLSL

    $$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}} \qquad L = -\sum_i t_i \log(y_i)$$

        void main() {
          const uint input_index = gl_GlobalInvocationID.x;
          const uint data_index = gl_GlobalInvocationID.z;
          float value1 = input_index < width ?
            exp( input_data[ input_index + data_index * width ] ) : 0.0;
          // If sum_j exp(x_j) became 0 the division would produce nan or inf, hence the 1.0e-10.
          float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
          float t = teacher_data[ input_index + data_index * width ];
          // If y gets too close to 0, log() would produce inf, hence the lower bound.
          float y_ = max( y, 1.0e-10 );
          float value2 = input_index < width ? t * log( y_ ) : 0.0;
          float l = -large_sum( float( value2 ) );
          if( input_index == 0 ) output_data[ data_index ] = l;
        }
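
The cross-entropy term with the same `1.0e-10` lower bound can be tried out on the CPU as follows. This is only an illustrative sketch; `cross_entropy` is a hypothetical name, and the inputs are assumed to be softmax outputs already.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for L = -sum_i t_i * log(y_i).
// y is clamped to at least 1e-10, as in the shader, so log() never returns -inf.
double cross_entropy(const std::vector<double>& y, const std::vector<double>& t) {
  assert(y.size() == t.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < y.size(); ++i)
    sum += t[i] * std::log(std::max(y[i], 1.0e-10));
  return -sum;
}
```

For a one-hot `t`, the loss is $-\ln$ of the probability assigned to the correct class: about 4.605 for $y = 0.01$, about 6.908 for $y = 0.001$, and capped at about 23.03 once the clamp takes over.
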
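  29. 50.

    [Figure: the inputs $x_0, x_1, \dots, x_b$ → product ($w_h$) → $\phi$ → product ($w_o$) → $\phi$ → loss function (with $t_0, t_1, \dots, t_b$) → $L$.]

    Regarding $x$ and $t$ as constants, search for the $w_h$ and $w_o$ that minimize $L$.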
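  30. 52.

    [Figure: product ($w_o$) → $c$ → $\phi$ → $d$ → loss function (with $t$) → $L$.]

    $L$ can be regarded as a composition of three functions. From the chain rule for the derivative of a composite function,

    $$\frac{df}{dx} = \frac{df}{dg} \frac{dg}{dx}$$

    we get

    $$\frac{\partial L}{\partial w_o} = \frac{\partial L}{\partial d} \frac{\partial d}{\partial c} \frac{\partial c}{\partial w_o}$$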
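  31. 53.

    [Figure: product ($w_h$) → $a$ → $\phi$ → $b$ → product → $c$ → $\phi$ → $d$ → loss function (with $t$) → $L$.]

    $$\frac{\partial L}{\partial w_h} = \frac{\partial L}{\partial d} \frac{\partial d}{\partial c} \frac{\partial c}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial w_h}$$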
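  32. 54.

    Backpropagation

    If the derivative of each layer's output with respect to its input can be computed, $\frac{\partial L}{\partial w}$ can be obtained layer by layer, starting from the last one:

    $$\frac{\partial L}{\partial d}, \quad \frac{\partial L}{\partial d} \frac{\partial d}{\partial c}, \quad \frac{\partial L}{\partial d} \frac{\partial d}{\partial c} \frac{\partial c}{\partial b}, \quad \frac{\partial L}{\partial d} \frac{\partial d}{\partial c} \frac{\partial c}{\partial b} \frac{\partial b}{\partial a}, \quad \frac{\partial L}{\partial d} \frac{\partial d}{\partial c} \frac{\partial c}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial w_h}$$

    [Figure: the network of the previous slide ($w_h$, $a$, $b$, $c$, $d$, $t$, $L$) with these partial products attached from back to front.]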
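  33. 55.

    [Figure: the forward network (input, hidden-layer weights, hidden-layer product, hidden-layer output, output-layer weights, output-layer product, output-layer output, desired output, loss function, error) extended with the backward pass: gradient of the loss function, gradient of the activation function, gradient of the matrix product, gradient of the activation function, gradient of the matrix product.]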
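  34. 56.

    Backpropagation through the loss function

    [Figure: desired output → loss function → error, with the gradient flowing back.]

    $$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}} \qquad L = -\sum_i t_i \log(y_i)$$

    $$\frac{\partial L}{\partial y_i} = -\frac{t_i}{y_i} \qquad \frac{\partial y_i}{\partial x_k} = \begin{cases} y_i (1 - y_i) & i = k \\ -y_i y_k & i \neq k \end{cases}$$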
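  35. 57.

    $$\frac{\partial L}{\partial y_i} = -\frac{t_i}{y_i} \qquad \frac{\partial y_i}{\partial x_k} = \begin{cases} y_i (1 - y_i) & i = k \\ -y_i y_k & i \neq k \end{cases}$$

    Since every $y_k$ depends on $x_i$, the contributions of all outputs are summed:

    $$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_i} + \sum_{k \neq i} \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial x_i} = -t_i (1 - y_i) + \sum_{k \neq i} t_k y_i$$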
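  36. 58.

    $$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_i} + \sum_{k \neq i} \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial x_i} = -t_i (1 - y_i) + \sum_{k \neq i} t_k y_i = -t_i + y_i \sum_k t_k = y_i - t_i$$

    The last step uses the fact that the desired outputs sum to 1.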
  37. 59.

    The gradient is written out at the end of the softmax GLSL.

    $$-t_i (1 - y_i) + \sum_{k \neq i} t_k y_i = -t_i + y_i \sum_k t_k = y_i - t_i$$

        void main() {
          const uint input_index = gl_GlobalInvocationID.x;
          const uint data_index = gl_GlobalInvocationID.z;
          float value1 = input_index < width ?
            exp( input_data[ input_index + data_index * width ] ) : 0.0;
          float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
          float t = teacher_data[ input_index + data_index * width ];
          float y_ = max( y, 1.0e-10 );
          float value2 = input_index < width ? t * log( y_ ) : 0.0;
          float l = -large_sum( float( value2 ) );
          if( input_index == 0 ) output_data[ data_index ] = l;
          // Gradient of the loss with respect to the softmax input: y - t.
          if( input_index < width )
            input_grad[ input_index + data_index * width ] = float( y - t );
        }
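
The closed form $y_i - t_i$ can be sanity-checked against a numerical derivative. The following is a sketch for illustration only; `loss`, `analytic_grad` and `numeric_grad` are hypothetical names, and the step size `h` is an arbitrary choice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// L(x) = -sum_i t_i * log(softmax(x)_i), written as -sum_i t_i * (x_i - log(sum_j exp(x_j))).
double loss(const std::vector<double>& x, const std::vector<double>& t) {
  assert(x.size() == t.size() && !x.empty());
  double sum = 0.0;
  for (double v : x) sum += std::exp(v);
  double l = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) l -= t[i] * (x[i] - std::log(sum));
  return l;
}

// Analytic gradient from the derivation: dL/dx_i = y_i - t_i (valid when sum_k t_k = 1).
std::vector<double> analytic_grad(const std::vector<double>& x, const std::vector<double>& t) {
  double sum = 0.0;
  for (double v : x) sum += std::exp(v);
  std::vector<double> g(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) g[i] = std::exp(x[i]) / sum - t[i];
  return g;
}

// Central finite difference: (L(x + h e_i) - L(x - h e_i)) / (2h).
std::vector<double> numeric_grad(std::vector<double> x, const std::vector<double>& t,
                                 double h = 1.0e-6) {
  std::vector<double> g(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double saved = x[i];
    x[i] = saved + h; const double lp = loss(x, t);
    x[i] = saved - h; const double lm = loss(x, t);
    x[i] = saved;
    g[i] = (lp - lm) / (2.0 * h);
  }
  return g;
}
```

The two gradients should agree to within the accuracy of the finite difference for any one-hot `t`.
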
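  38. 60.

    [Figure: $\phi$ → loss function (with $t_0, t_1, \dots, t_b$) → $L$. The activation says "values in $-1 \le x \le 1$ come out"; the loss function says "please give me $0 \le x$".]

    When tanh is used as the activation function of the last layer, its range does not match the range the loss function expects, so the two are aligned:

    $$s_i = \frac{x_i}{2} + \frac{1}{2} \qquad y_i = \frac{e^{s_i}}{\sum_j e^{s_j}} \qquad L = -\sum_i t_i \log(y_i)$$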
  39. 61.

    $$\frac{\partial L}{\partial s_i} = y_i - t_i \qquad \frac{\partial s_i}{\partial x_i} = \frac{1}{2} \qquad \frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial s_i} \frac{\partial s_i}{\partial x_i} = \frac{y_i - t_i}{2}$$

        void main() {
          const uint input_index = gl_GlobalInvocationID.x;
          const uint data_index = gl_GlobalInvocationID.z;
          // s = x / 2 + 1 / 2 maps the tanh range [-1, 1] onto [0, 1].
          float value1 = input_index < width ?
            exp( input_data[ input_index + data_index * width ] * 0.5 + 0.5 ) : 0.0;
          float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
          float t = teacher_data[ input_index + data_index * width ];
          float y_ = max( y, 1.0e-10 );
          float value2 = input_index < width ? t * log( y_ ) : 0.0;
          float l = -large_sum( float( value2 ) );
          if( input_index == 0 ) output_data[ data_index ] = l;
          // The factor 0.5 is ds/dx.
          if( input_index < width )
            input_grad[ input_index + data_index * width ] = float( y - t ) * 0.5;
        }
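  40. 62.

    Backpropagation through Hyperbolic Tangent

    [Figure: the output-side gradient $\frac{\partial L}{\partial y_i}$ enters $\phi$ from the back and leaves as $\frac{\partial L}{\partial x_i}$.]

    $$y_i = \tanh(x_i) \qquad \frac{\partial y_i}{\partial x_i} = 1 - \tanh^2(x_i)$$

    $$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_i} = \frac{\partial L}{\partial y_i} \left(1 - \tanh^2(x_i)\right)$$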
  41. 63.

    Backpropagation through Hyperbolic Tangent

    $$\frac{\partial y_i}{\partial x_i} = 1 - \tanh^2(x_i) \qquad \frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_i} = \frac{\partial L}{\partial y_i} \left(1 - \tanh^2(x_i)\right)$$

        void main() {
          const uint input_index = gl_GlobalInvocationID.x;
          const uint data_index = gl_GlobalInvocationID.z;
          const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
          // Step by the number of launched threads, as in the forward shader.
          for( uint offset = 0; offset < width; offset += input_width ) {
            if( ( offset + input_index ) < width )
              input_grad[ offset + input_index + data_index * width ] =
                ( 1 - pow( tanh( input_data[ offset + input_index + data_index * width ] ), 2 ) ) *
                output_grad[ offset + input_index + data_index * width ];
          }
        }
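
The same backward rule can be written as a one-line CPU function, which also makes its saturation behaviour easy to observe. This is a hedged sketch; `tanh_backward` is a hypothetical name and not part of the sample code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical CPU reference: dL/dx = dL/dy * (1 - tanh^2(x)).
float tanh_backward(float x, float dy) {
  assert(!std::isnan(x));
  const float y = std::tanh(x);
  return dy * (1.0f - y * y);
}
```

At `x = 0` the upstream gradient passes through unchanged, while for a very large input such as `14326.7f` the result is exactly zero in single precision, so nothing is propagated further back.
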
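  42. 64.

    Backpropagation through Rectified Linear Unit

    [Figure: the output-side gradient $\frac{\partial L}{\partial y_i}$ enters $\phi$ from the back and leaves as $\frac{\partial L}{\partial x_i}$.]

    $$y_i = \begin{cases} x_i & x_i \ge 0 \\ 0 & x_i < 0 \end{cases} \qquad \frac{\partial y_i}{\partial x_i} = \begin{cases} 1 & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$$

    $$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_i} = \begin{cases} \frac{\partial L}{\partial y_i} & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$$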
  43. 65.

    Backpropagation through Rectified Linear Unit

    $$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_i} = \begin{cases} \frac{\partial L}{\partial y_i} & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$$

        void main() {
          const uint input_index = gl_GlobalInvocationID.x;
          const uint data_index = gl_GlobalInvocationID.z;
          const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
          for( uint offset = 0; offset < width; offset += input_width ) {
            if( ( offset + input_index ) < width )
              // Pass the gradient through where the input was non-negative, block it elsewhere.
              input_grad[ offset + input_index + data_index * width ] =
                input_data[ offset + input_index + data_index * width ] >= 0 ?
                output_grad[ offset + input_index + data_index * width ] : 0.0;
          }
        }
  44. 67.

    The ReLU paper [1] approximates

    $$y_i = \log(1 + \exp(x_i))$$

    by

    $$y_i = \begin{cases} x_i & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$$

    [1] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML'10), Johannes Fürnkranz and Thorsten Joachims (Eds.). Omnipress, USA, 807-814.
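  45. 68.

    Accordingly, the derivative

    $$\frac{\partial y_i}{\partial x_i} = \frac{\exp(x_i)}{1 + \exp(x_i)}$$

    is approximated by

    $$\frac{\partial y_i}{\partial x_i} = \begin{cases} 1 & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$$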
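  46. 69.

    Backpropagation through the matrix product

    [Figure: the output-side gradient $\frac{\partial L}{\partial y_j}$ enters the product from the back; it yields the gradient of $x$, $\frac{\partial L}{\partial x_i}$, and the gradient of $w$, $\frac{\partial L}{\partial w_{ij}}$.]

    Two things have to be computed, so start with the part in the red frame (the gradient of $w$):

    $$y_j = \sum_i w_{ij} x_i \qquad \frac{\partial y_j}{\partial w_{ij}} = x_i \qquad \frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_j} x_i$$

    Multiply the gradient of the output the weight contributed to by the input the weight was applied to.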
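  47. 70.

    Backpropagation through the matrix product

    [Figure: the output-side gradient $\frac{\partial L}{\partial y_j}$ enters the product from the back; it yields the gradient of $x$, $\frac{\partial L}{\partial x_i}$, and the gradient of $w$, $\frac{\partial L}{\partial w_{ij}}$.]

    Next, consider the part in the green frame (the gradient of $x$):

    $$y_j = \sum_i w_{ij} x_i \qquad \frac{\partial y_j}{\partial x_i} = w_{ij} \qquad \frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j} w_{ij}$$

    Take the sum, over every output the input influenced, of that output's gradient multiplied by the weight connecting the two.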
  48. 71.

    Backpropagation through the matrix product

    Launch as many threads as there are $\frac{\partial L}{\partial w_{ij}}$, and assign WorkGroups so that the threads sharing the same $\frac{\partial L}{\partial x_i}$ can use a horizontal add.

        void main() {
          const uint input_index = gl_GlobalInvocationID.x;
          const uint output_index = gl_GlobalInvocationID.y;
          const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
          const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
          // dL/dx_i = sum_j dL/dy_j * w_ij, computed per sample of the batch.
          for( uint data_index = 0; data_index != batch_size; data_index++ ) {
            input_grad[ input_index + data_index * input_width ] = 0.0;
            for( uint offset = 0; offset < height; offset += output_width ) {
              float grad_x = ( offset + output_index ) < height ?
                weight[ offset + output_index + input_index * height ].x *
                output_grad[ offset + output_index + data_index * height ] :
                0.0;
              input_grad[ input_index + data_index * input_width ] += large_sum( grad_x );
            }
          }
          // dL/dw_ij = dL/dy_j * x_i, summed over the batch, then fed to the optimizer.
          for( uint offset = 0; offset < height; offset += output_width ) {
            float grad_w_sum = 0.0;
            for( uint data_index = 0; data_index != batch_size; data_index++ ) {
              float grad_w = ( offset + output_index ) < height ?
                input_data[ input_index + data_index * input_width ] *
                output_grad[ offset + output_index + data_index * height ] :
                0.0;
              grad_w_sum += grad_w;
            }
            if( ( offset + output_index ) < height )
              adam( weight[ offset + output_index + input_index * height ], grad_w_sum );
          }
        }
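
For a single sample, the two gradients of the matrix product can be written as a plain double loop. The sketch below is a CPU reference under assumptions: `AffineGrad` and `affine_backward` are hypothetical names, and the weights are stored row-major as `w[i * m + j] = w_ij`, which is not necessarily the layout the shader uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the two results of the backward pass.
struct AffineGrad {
  std::vector<float> dw;  // dL/dw_ij, same layout as w
  std::vector<float> dx;  // dL/dx_i
};

// For y_j = sum_i w_ij * x_i:
//   dL/dw_ij = dL/dy_j * x_i
//   dL/dx_i  = sum_j dL/dy_j * w_ij
AffineGrad affine_backward(const std::vector<float>& w, const std::vector<float>& x,
                           const std::vector<float>& dy) {
  const std::size_t n = x.size(), m = dy.size();
  assert(w.size() == n * m);
  AffineGrad g{std::vector<float>(n * m, 0.0f), std::vector<float>(n, 0.0f)};
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < m; ++j) {
      g.dw[i * m + j] = dy[j] * x[i];
      g.dx[i] += dy[j] * w[i * m + j];  // this sum is what large_sum() does on the GPU
    }
  }
  return g;
}
```

In a mini-batch setting, `dw` would additionally be summed over the samples before the optimizer step, as the shader above does with `grad_w_sum`.
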
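  49. 72.

    Stochastic gradient descent

    [Figure: a loss surface marked "we are here" and "we want to get here", with an arrow showing the update direction of $w$.]

    $$w_{t+1} = w_t - \mu \frac{\partial L}{\partial w_t}$$

    Basically, $w$ is updated a little along $\frac{\partial L}{\partial w_t}$, in the direction in which $L$ becomes smaller.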
  50. 74.

    Evolution of optimization algorithms

    SGD (stochastic gradient descent), MomentumSGD, NAG, AdaGrad, RMSprop, Adam, AdaDelta, AdaMax, SMORMS3, RMSpropGraves, Eve, Nadam, Santa-E, Santa-SSS, AdaSecant, GD by GD
  51. 75.

    Adam

    $$g = \frac{\partial L}{\partial w_t}$$
    $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g^2$$
    $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
    $$w_{t+1} = w_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

        // weight.x: the weight, weight.y: m, weight.z: v, weight.w: step count t
        void adam( inout vec4 weight, in float grad ) {
          weight.w += 1;
          float gt = grad;
          weight.y = beta1 * weight.y + ( 1 - beta1 ) * gt;
          weight.z = beta2 * weight.z + ( 1 - beta2 ) * gt * gt;
          float mhat = weight.y / ( 1 - pow( beta1, weight.w ) );
          float vhat = weight.z / ( 1 - pow( beta2, weight.w ) );
          weight.x -= alpha * mhat / ( sqrt( vhat ) + eps );
        }

    $\alpha = 0.001$, $\beta_1 = 0.9$ and $\beta_2 = 0.999$ are the recommended values, so they are used as they are.

    • Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980v9 https://arxiv.org/abs/1412.6980
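
The update rule is easy to step through on the CPU. The following sketch mirrors the `vec4` layout of the shader in a plain struct; `AdamWeight` is a hypothetical name, and the default `eps = 1.0e-8` is an assumption taken from the Adam paper, since the slide does not state the value used.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical CPU mirror of the shader's vec4: x = weight, y = m, z = v, w = step count.
struct AdamWeight {
  float value = 0.0f;  // the weight itself
  float m = 0.0f;      // first-moment estimate
  float v = 0.0f;      // second-moment estimate
  float t = 0.0f;      // number of updates applied so far
};

void adam(AdamWeight& s, float grad, float alpha = 0.001f, float beta1 = 0.9f,
          float beta2 = 0.999f, float eps = 1.0e-8f) {
  assert(beta1 < 1.0f && beta2 < 1.0f);
  s.t += 1.0f;
  s.m = beta1 * s.m + (1.0f - beta1) * grad;
  s.v = beta2 * s.v + (1.0f - beta2) * grad * grad;
  const float mhat = s.m / (1.0f - std::pow(beta1, s.t));  // bias-corrected m
  const float vhat = s.v / (1.0f - std::pow(beta2, s.t));  // bias-corrected v
  s.value -= alpha * mhat / (std::sqrt(vhat) + eps);
}
```

With a constant gradient the bias correction makes `mhat / sqrt(vhat)` equal to the sign of the gradient, so each step moves the weight by roughly `alpha` regardless of the gradient's magnitude.
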
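  52. 76.

    Initial values

    Neurons with the same $w$ behave identically.

    [Figure: two formal neurons with identical weights $w_{02} \dots w_{42}$.]

    The gradients with respect to the weights of neurons that produce the same output coincide.

    [Figure: the same two neurons, each receiving the same $\frac{\partial L}{\partial w}$.]

    Because their weights are updated by the same amount, they keep behaving identically forever.
    The initial weights of a neural network must not be uniform.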
  53. 77.

    Xavier initialization

    When a layer has $n$ input elements, initialize $w$ with the following value, where randn() is a normally distributed random number:

    $$w = \frac{1}{\sqrt{n}} \, \mathrm{randn}()$$

    Understanding the difficulty of training deep feedforward neural networks. Xavier Glorot and Yoshua Bengio. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR, p. 249--256. 2010. http://proceedings.mlr.press/v9/glorot10a.html
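  54. 78.

    Generating random numbers on the GPU

    [Figure: a chain of generator states $G_t \to G_{t+1} \to G_{t+2} \to G_{t+3}$ emitting 0.0801, 0.2926, 0.4342, 0.6978.]

    Many pseudo-random number algorithms keep a state that is updated every time one random number is emitted.
    The next number cannot be generated until the previous one has been, so this does not scale.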
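  55. 79.

    Generating random numbers on the GPU

    A mysterious uniform random number algorithm that has been passed down among video game developers for about ten years:

    $$s = \begin{pmatrix} 12.9898 & 78.233 \end{pmatrix} \qquad t = 43758.5453$$
    $$\mathrm{fract}(x) = x - \lfloor x \rfloor \qquad f(x) = \mathrm{fract}(t \sin(x \cdot s))$$

    [Figure: $(0.1, 0.8) \to f \to 0.7340$ and $(0.3, 0.2) \to f \to 0.1768$.]

    It scales because no generator state is carried over from one element to the next.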
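
The hash translates directly to C++. This is a sketch for experimentation, with `fract` and `prand` written as free functions. Because $t \sin(\cdot)$ amplifies rounding differences, a CPU `float` implementation is not guaranteed to reproduce the values shown on the slide, so none are asserted here.

```cpp
#include <cassert>
#include <cmath>

// fract(x) = x - floor(x), as in GLSL.
float fract(float v) {
  assert(std::isfinite(v));
  return v - std::floor(v);
}

// f(x) = fract(t * sin(x . s)) with s = (12.9898, 78.233) and t = 43758.5453.
// The output depends only on the coordinates, so every element can be computed independently.
float prand(float x, float y) {
  return fract(std::sin(x * 12.9898f + y * 78.233f) * 43758.5453f);
}
```

Since the function is pure, calling it twice with the same coordinates always yields the same value, which is what lets each GPU thread generate its own number without shared state.
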
  56. 81.

    Xavier initialization

    Generate uniform random numbers, turn them into normally distributed numbers with the Box-Muller method, and run the function that computes the Xavier initial value with as many threads as there are weight elements.

        float prand( vec2 i ) {
          return fract(sin(dot( i.xy ,vec2(12.9898,78.233))) * 43758.5453);
        }
        const float PI = 3.1415926535897932384626433832795;
        float boxmuller( vec2 i, float mu, float sigma ) {
          float x = 1 - prand( i ); // keeps x away from 0 so log( x ) stays finite
          float y = prand( vec2( i.y, x ) );
          float n = prand( vec2( x, y * PI ) );
          float v = sqrt( -2.0 * log( x ) ) * cos( 2 * PI * n );
          return mu + sigma * v;
        }
        float xavier_init_value( vec2 i, uint n ) {
          float value = boxmuller( i, 0.0, 1.0 / sqrt( n ) );
          return value;
        }
        void main() {
          const uint x = gl_GlobalInvocationID.x;
          const uint y = gl_GlobalInvocationID.y;
          const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
          const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
          const uint index = x + y * width;
          // .x is the weight; the remaining components (optimizer state) start at 0.
          weight[ index ] = vec4( xavier_init_value( vec2( float( x )/width, float( y )/height ), input_size ), 0, 0, 0 );
        }
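  57. 82.

    He initialization

    When a layer has $n$ input elements, initialize $w$ with the following value, where randn() is a normally distributed random number:

    $$w = \sqrt{\frac{2}{n}} \, \mathrm{randn}()$$

    When ReLU is used, this is said to propagate the error better in the early stage of training than Xavier initialization.

    Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Kaiming He and Xiangyu Zhang and Shaoqing Ren and Jian Sun. 2015. https://arxiv.org/abs/1502.01852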
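
The two initializers differ only in the standard deviation passed to the normal sampler. The sketch below shows this on the CPU under stated assumptions: `std::mt19937` stands in for the GPU hash as the uniform source, and `box_muller`, `xavier_sigma`, `he_sigma` and `init_weights` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

const double kPi = 3.14159265358979323846;

// Box-Muller: two uniform samples (u1 in (0, 1], u2 in [0, 1)) give one standard normal sample.
double box_muller(double u1, double u2) {
  assert(u1 > 0.0 && u1 <= 1.0);
  return std::sqrt(-2.0 * std::log(u1)) * std::cos(2.0 * kPi * u2);
}

// Standard deviations for a layer with n inputs.
double xavier_sigma(std::size_t n) { return 1.0 / std::sqrt(static_cast<double>(n)); }
double he_sigma(std::size_t n) { return std::sqrt(2.0 / static_cast<double>(n)); }

// Draw `count` weights from N(0, sigma^2).
std::vector<double> init_weights(std::size_t count, double sigma, unsigned seed) {
  std::mt19937 gen(seed);
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  std::vector<double> w(count);
  for (double& v : w) {
    const double u1 = 1.0 - uniform(gen);  // shift [0, 1) to (0, 1] so log() stays finite
    const double u2 = uniform(gen);
    v = sigma * box_muller(u1, u2);
  }
  return w;
}
```

For example, `init_weights(count, he_sigma(n), seed)` yields He-initialized weights, and swapping in `xavier_sigma(n)` yields Xavier-initialized ones.
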
  58. 83.

    He initialization

        float prand( vec2 i ) {
          return fract(sin(dot( i.xy ,vec2(12.9898,78.233))) * 43758.5453);
        }
        const float PI = 3.1415926535897932384626433832795;
        float boxmuller( vec2 i, float mu, float sigma ) {
          float x = 1 - prand( i );
          float y = prand( vec2( i.y, x ) );
          float n = prand( vec2( x, y * PI ) );
          float v = sqrt( -2.0 * log( x ) ) * cos( 2 * PI * n );
          return mu + sigma * v;
        }
        float he_init_value( vec2 i, uint n ) {
          // This is the only difference from the Xavier version: sigma = sqrt( 2 / n ).
          float value = boxmuller( i, 0.0, sqrt( 2 ) / sqrt( n ) );
          return value;
        }
        void main() {
          const uint x = gl_GlobalInvocationID.x;
          const uint y = gl_GlobalInvocationID.y;
          const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
          const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
          const uint index = x + y * width;
          weight[ index ] = vec4( he_init_value( vec2( float( x )/width, float( y )/height ), input_size ), 0, 0, 0 );
        }
  59. 84.

    Compile the GLSL, create the ComputePipelines and assign the buffers.

        // Two input and label buffers (index 0 and 1) are bound so they can be swapped between runs.
        hidden_affine1.reset( new layer( create_affine_forward_pipeline(
          device, mods, descriptor_pool, pipeline_cache, props,
          batch_images[ 0 ], hidden_affine_output, hidden_weight, batch_size ) ) );
        hidden_affine2.reset( new layer( create_affine_forward_pipeline(
          device, mods, descriptor_pool, pipeline_cache, props,
          batch_images[ 1 ], hidden_affine_output, hidden_weight, batch_size ) ) );
        hidden_activation.reset( new layer( create_relu_forward_pipeline(
          device, mods, descriptor_pool, pipeline_cache, props,
          hidden_affine_output, hidden_activation_output ) ) );
        output_affine.reset( new layer( create_affine_forward_pipeline(
          device, mods, descriptor_pool, pipeline_cache, props,
          hidden_activation_output, output_affine_output, output_weight, batch_size ) ) );
        output_activation.reset( new layer( create_tanh_forward_pipeline(
          device, mods, descriptor_pool, pipeline_cache, props,
          output_affine_output, output_activation_output ) ) );
        error1.reset( new layer( create_softmax_combined_pipeline(
          device, mods, descriptor_pool, pipeline_cache, props,
          output_activation_output, error_out, softmax_grad, batch_labels[ 0 ] ) ) );
        error2.reset( new layer( create_softmax_combined_pipeline(
          device, mods, descriptor_pool, pipeline_cache, props,
          output_activation_output, error_out, softmax_grad, batch_labels[ 1 ] ) ) );
        output_activation_backward.reset( new layer( create_tanh_backward_pipeline(
          device, mods, descriptor_pool, pipeline_cache, props,
          output_affine_output, output_activation_output, output_activation_grad, softmax_grad ) ) );
        output_affine_backward.reset( new layer( create_affine_backward_pipeline(
          device, mods, descriptor_pool, pipeline_cache, props,
          hidden_activation_output, output_affine_output,
        // (the slide is cut off here in the transcript)
  60. 85.

    Execution

    Two sets of buffers for the pairs of input and desired output are prepared, so that training and the transfer of the next data happen at the same time.

        void network::exec() {
          ++swap_index;
          swap_index %= 2;
          // Submit the contents of the command buffer to the GPU.
          queue->submit(
            vk::SubmitInfo()
              .setCommandBufferCount( 1 )
              .setPCommandBuffers( command_buffers->data() + swap_index ),
            vk::Fence()
          );
          // Fill the other buffer set while the GPU is busy, then wait for the GPU.
          fill( false, false );
          queue->waitIdle();
          if( debug ) {
            std::cout << "==============" << std::endl;
            check();
            print( *error_out, batch_size );
            print( *output_activation_output, batch_size );
            print_image( *batch_images[ swap_index ], train_input->get_image_width(), batch_size );
            print_label( *batch_labels[ swap_index ], batch_size );
            print_eval( *output_activation_output, batch_size );
          }
        }
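  61. 90.

    [Figure: product → $\phi$ → product → $\phi$ → product → $\phi$ → loss function → $L$.]

    When the accuracy is not good enough, add layers.
    However, adding plain matrix-product layers makes $w$ grow larger and larger.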
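  62. 93.

    Convolution

    [Figure: input → convolution (with the filter) → output.]

    For a filter of size $M \times N$ with no margin (the indices below correspond to a stride of 1):

    $$y_{ij} = \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} w_{kl} \, x_{(i+k)(j+l)}$$
  63. 94.

    Backpropagation through the convolution

    [Figure: the output-side gradient $\frac{\partial L}{\partial y_{ij}}$ enters the convolution from the back; it yields the gradient of $x$, $\frac{\partial L}{\partial x_{ij}}$, and the gradient of $w$, $\frac{\partial L}{\partial w_{kl}}$.]

    $$y_{ij} = \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} w_{kl} \, x_{(i+k)(j+l)} \qquad \frac{\partial L}{\partial w_{kl}} = \sum_i \sum_j \frac{\partial L}{\partial y_{ij}} \, x_{(i+k)(j+l)}$$

    For every pair of input and output that a given weight took part in, take the sum of the products of the output-side gradient and the input.
  64. 95.

    Backpropagation through the convolution

    $$\frac{\partial L}{\partial w_{kl}} = \sum_i \sum_j \frac{\partial L}{\partial y_{ij}} \, x_{(i+k)(j+l)} \qquad \frac{\partial L}{\partial x_{ij}} = \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} w_{kl} \, \frac{\partial L}{\partial y_{(i-k)(j-l)}}$$

    The gradient of $x$ is a convolution of the output-side gradient rotated by 180 degrees.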
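
The three convolution formulas can be exercised on a tiny single-channel example. The code below is a CPU sketch under assumptions (stride 1, no margin, one channel, one sample); `Grid`, `conv_forward` and `conv_backward` are hypothetical names. The input gradient is written in scatter form, which is equivalent to the sum over $(i-k)(j-l)$.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a single-channel 2D array stored row by row, zero-initialized.
struct Grid {
  std::size_t w, h;
  std::vector<float> v;
  Grid(std::size_t w_, std::size_t h_) : w(w_), h(h_), v(w_ * h_, 0.0f) {}
  float& at(std::size_t i, std::size_t j) { return v[i + j * w]; }
  float at(std::size_t i, std::size_t j) const { return v[i + j * w]; }
};

// Forward: y_ij = sum_k sum_l w_kl * x_(i+k)(j+l).
Grid conv_forward(const Grid& x, const Grid& f) {
  assert(f.w <= x.w && f.h <= x.h);
  Grid y(x.w - f.w + 1, x.h - f.h + 1);
  for (std::size_t j = 0; j < y.h; ++j)
    for (std::size_t i = 0; i < y.w; ++i)
      for (std::size_t l = 0; l < f.h; ++l)
        for (std::size_t k = 0; k < f.w; ++k)
          y.at(i, j) += f.at(k, l) * x.at(i + k, j + l);
  return y;
}

// Backward: accumulates dL/dw into dw (expected to start at zero) and returns dL/dx.
Grid conv_backward(const Grid& x, const Grid& f, const Grid& dy, Grid& dw) {
  assert(dw.w == f.w && dw.h == f.h);
  Grid dx(x.w, x.h);
  for (std::size_t j = 0; j < dy.h; ++j)
    for (std::size_t i = 0; i < dy.w; ++i)
      for (std::size_t l = 0; l < f.h; ++l)
        for (std::size_t k = 0; k < f.w; ++k) {
          dw.at(k, l) += dy.at(i, j) * x.at(i + k, j + l);  // dL/dw_kl
          dx.at(i + k, j + l) += f.at(k, l) * dy.at(i, j);  // dL/dx, scattered
        }
  return dx;
}
```

With a 3 × 3 input holding 1 to 9 and a 2 × 2 filter whose diagonal is 1, each output is the sum of an element and its lower-right neighbour, which makes the results easy to verify by hand.
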
  65. 96.

    GLSL that computes $\frac{\partial L}{\partial w_{kl}}$

        void main() {
          // One thread per filter element.
          const uint filter_index = gl_GlobalInvocationID.x;
          const uint filter_x = filter_index % filter_width;
          const uint filter_y = filter_index / filter_width % filter_height;
          const uint channel = filter_index / filter_width / filter_height % channels;
          const uint filter_size = filter_width * filter_height * channels;
          const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
          const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
          bool filter_oob = filter_index >= filter_size;
          float sum = 0.0;
          // Sum dL/dy * x over every sample and every output position this weight touched.
          for( int data_index = 0; data_index != batch_size; ++data_index ) {
            for( int output_x = 0; output_x != output_width; ++output_x ) {
              for( int output_y = 0; output_y != output_height; ++output_y ) {
                const int output_index = int( output_x ) + int( output_y ) * int( output_width ) +
                  int( channel ) * int( output_width * output_height ) +
                  data_index * int( output_width * output_height * channels );
                const int input_x = output_x * int(filter_xstride) - int(xmargin) + int(filter_x);
                const int input_y = output_y * int(filter_ystride) - int(ymargin) + int(filter_y);
                const bool input_oob = filter_oob ||
                  input_x < 0 || input_x >= input_width ||
                  input_y < 0 || input_y >= input_height;
                const int input_index = int( input_x ) + int( input_y ) * int( input_width ) +
                  int( channel ) * int( input_width * input_height ) +
                  data_index * int( input_width * input_height * channels );
                const float grad = filter_oob ? 0.0 : output_grad[ output_index ];
                const float x = input_oob ? 0.0 : input_data[ input_index ];
                sum += grad * x;
              }
            }
          }
          if( !filter_oob ) adam( weight[ filter_index ], sum );
        }
  66. 97.

    GLSL that computes $\frac{\partial L}{\partial x_{ij}}$

        void main() {
          // One thread per input element.
          const uint input_x = gl_GlobalInvocationID.x % output_width;
          const uint input_y = gl_GlobalInvocationID.x / output_width % output_height;
          const uint channel = gl_GlobalInvocationID.x / output_width / output_height;
          const uint data_index = gl_GlobalInvocationID.z;
          const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
          const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
          const uint input_size = input_width * input_height * channels;
          const uint relative_input_index = input_x + input_y * input_width + channel * input_width * input_height;
          const uint input_index = relative_input_index + data_index * input_width * input_height * channels;
          if( relative_input_index < input_size ) input_grad[ input_index ] = 0.0;
          // Accumulate w_kl * dL/dy over every filter tap that reached this input element.
          for( int x = 0; x != filter_width; ++x ) {
            for( int y = 0; y != filter_height; ++y ) {
              const int output_x = int(input_x) * int(filter_xstride) - int(xmargin) - x;
              const int output_y = int(input_y) * int(filter_ystride) - int(ymargin) - y;
              const bool oob =
                output_x < 0 || output_x >= output_width ||
                output_y < 0 || output_y >= output_height;
              const int relative_output_index = output_x + output_y * int(output_width) +
                int(channel) * int(output_width * output_height);
              const int output_index = relative_output_index +
                int(data_index) * int(output_width * output_height * channels );
              const uint filter_index = x + y * int(filter_width) + channel * int(filter_width * filter_height );
              if( relative_input_index < input_size ) {
                if( !oob ) {
                  const float grad = output_grad[ output_index ] * weight[ filter_index ].x;
                  input_grad[ input_index ] += grad;
                }
              }
            }
          }
        }
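  67. 98.

    MaxPooling

    Input $x$:

    $$\begin{pmatrix} 3 & 6 & 1 \\ 2 & 2 & 0 \\ 5 & 9 & 6 \end{pmatrix}$$

    Output $y$: only the single element with the largest value in the range is kept in the output.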
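  68. 99.

    MaxPooling

    Input $x$:

    $$\begin{pmatrix} 3 & 6 & 1 \\ 2 & 2 & 0 \\ 5 & 9 & 6 \end{pmatrix}$$

    Output $y$: 9. Only the single element with the largest value in the range is kept in the output.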
  69. 100.

    MaxPooling

    [Figure: input → MaxPooling → output.]

    For a filter of size $M \times N$:

    $$y_{ij} = \max_{0 \le k < M,\; 0 \le l < N} x_{(Mi+k)(Nj+l)}$$

        void main() {
          const uint relative_output_index = gl_GlobalInvocationID.x;
          const uint output_x = relative_output_index % output_width;
          const uint output_y = relative_output_index / output_width % output_height;
          const uint channel = relative_output_index / output_width / output_height;
          const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
          const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
          const uint data_index = gl_GlobalInvocationID.z;
          const uint output_size = output_width * output_height * channels;
          const uint input_size = input_width * input_height * channels;
          const uint output_index = relative_output_index + data_index * output_size;
          // Starting the maximum at 0.0 assumes non-negative inputs (for example after ReLU).
          if( relative_output_index < output_size ) output_data[ output_index ] = 0.0;
          for( uint x = 0; x != filter_width; ++x ) {
            for( uint y = 0; y != filter_height; ++y ) {
              const uint input_x = x + output_x * filter_xstride;
              const uint input_y = y + output_y * filter_ystride;
              const uint input_index = input_x + input_y * input_width +
                channel * input_width * input_height +
                data_index * input_width * input_height * channels;
              if( relative_output_index < output_size )
                output_data[ output_index ] = max( output_data[ output_index ], input_data[ input_index ] );
            }
          }
        }
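
For a single pooling window, the forward pass and the matching backward rule fit in a few lines. This is an illustrative CPU sketch; `max_pool` and `max_pool_backward` are hypothetical names, and the backward function uses the same equality test against the maximum as the backward shader in this deck.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Forward for one pooling window: keep only the largest element.
float max_pool(const std::vector<float>& window) {
  assert(!window.empty());
  return *std::max_element(window.begin(), window.end());
}

// Backward for the same window: the output-side gradient dy goes to every
// element equal to the maximum, and all other elements get 0.
std::vector<float> max_pool_backward(const std::vector<float>& window, float dy) {
  const float y = max_pool(window);
  std::vector<float> dx(window.size(), 0.0f);
  for (std::size_t i = 0; i < window.size(); ++i)
    if (window[i] == y) dx[i] = dy;
  return dx;
}
```

For the window `{3, 6, 1, 2, 2, 0, 5, 9, 6}` the output is 9, and only the element at index 7 receives a gradient.
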
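  70. 101.

    Backpropagation through MaxPooling

    [Figure: the output-side gradient $\frac{\partial L}{\partial y_{ij}}$ enters MaxPooling from the back and leaves as the input-side gradient $\frac{\partial L}{\partial x_{ij}}$.]

    For a filter of size $M \times N$:

    $$y_{ij} = \max_{0 \le k < M,\; 0 \le l < N} x_{(Mi+k)(Nj+l)}$$

    $$\frac{\partial L}{\partial x_{(Mi+k)(Nj+l)}} = \begin{cases} \frac{\partial L}{\partial y_{ij}} & x_{(Mi+k)(Nj+l)} = y_{ij} \\ 0 & x_{(Mi+k)(Nj+l)} \neq y_{ij} \end{cases}$$

    The gradient for the input that supplied the maximum is set to the value of the output-side gradient.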
  71. 102.

    Backpropagation through MaxPooling

    $$\frac{\partial L}{\partial x_{(Mi+k)(Nj+l)}} = \begin{cases} \frac{\partial L}{\partial y_{ij}} & x_{(Mi+k)(Nj+l)} = y_{ij} \\ 0 & x_{(Mi+k)(Nj+l)} \neq y_{ij} \end{cases}$$

        void main() {
          const uint relative_output_index = gl_GlobalInvocationID.x;
          const uint output_x = relative_output_index % output_width;
          const uint output_y = relative_output_index / output_width % output_height;
          const uint channel = relative_output_index / output_width / output_height;
          const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
          const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
          const uint data_index = gl_GlobalInvocationID.z;
          const uint output_size = output_width * output_height * channels;
          const uint input_size = input_width * input_height * channels;
          const uint output_index = relative_output_index + data_index * output_size;
          const uint initial_input_x = output_x * filter_xstride;
          const uint initial_input_y = output_y * filter_ystride;
          for( uint x = 0; x != filter_width; ++x ) {
            // The vertical loop runs over filter_height, matching the forward shader.
            for( uint y = 0; y != filter_height; ++y ) {
              const uint input_x = x + output_x * filter_xstride;
              const uint input_y = y + output_y * filter_ystride;
              const uint input_index = input_x + input_y * input_width +
                channel * input_width * input_height +
                data_index * input_width * input_height * channels;
              // Only the input that supplied the maximum receives the gradient.
              if( relative_output_index < output_size )
                input_grad[ input_index ] =
                  ( input_data[ input_index ] == output_data[ output_index ] ) ?
                  output_grad[ output_index ] : 0.0;
            }
          }
        }
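  72. 103.

    [Figure: convolution → $\phi$ → convolution → $\phi$ → MaxPooling (this block is marked NEW!) → product → $\phi$ → product → $\phi$ → loss function → $L$.]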
  73. 105.

    [Figure: two blocks of convolution → $\phi$ → convolution → $\phi$ → MaxPooling (the added block is marked NEW!) → product → $\phi$ → product → $\phi$ → loss function → $L$.]
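  74. 107.

    [Figure: two blocks of convolution → $\phi$ → convolution → $\phi$ → MaxPooling → product → $\phi$ → product → $\phi$ → loss function → $L$.]

    $\phi$ is ReLU everywhere except in the last layer; only the last one is Hyperbolic Tangent.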
  75. 108.

    [Figure: the same network; the value 14326.7 arrives at the final tanh, which asks for "$-1 \le x \le 1$, please".]

    ReLU can take arbitrarily large values in the positive direction.
    While training is still unsteady, huge values get stuck into the tanh.
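  76. 109.

    [Figure: the same network with 14326.7 arriving at the final tanh.]

    $$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \left(1 - \tanh^2(x_i)\right)$$

    tanh: "14326.7 is so far off that changing the value a little would still be no good, so the gradient is 0."
    The layers before it: "What are we supposed to do? No idea."

    The gradient vanishes and training becomes impossible.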
  77. 111.

    Techniques that keep the distribution of a layer's output constant: Batch Normalization, Layer Normalization, Group Normalization

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Sergey Ioffe and Christian Szegedy. 2015. https://arxiv.org/abs/1502.03167
    Layer Normalization. Lei Ba, Jimmy and Kiros, Jamie Ryan and Hinton, Geoffrey E. arXiv:1607.06450. 2016. https://arxiv.org/abs/1607.06450
    Group Normalization. Yuxin Wu and Kaiming He. 2018. https://arxiv.org/abs/1803.08494
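  78. 112.

    Bonus: TensorCore

    [Figure: horizontal matrix multiply-add, $AB + C$, carried out across the threads holding $A$, $B$ and $C$.]

    Multiplying a matrix $A$ of $16 \times 16$ elements by a matrix $B$ of $8 \times 16$ elements and adding a matrix $C$ of $16 \times 8$ elements is done in a single instruction using 32 threads.
    On NVIDIA GPUs that support the VK_NV_cooperative_matrix extension, this can also be used from Vulkan.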