Deep Learning for Low-Level People

Fadis
July 20, 2019


Explains how to implement a convolutional neural network in Vulkan without relying on any framework.
These are the slides for a talk given at the 15th Kernel/VM Explorers meetup on July 20, 2019.
サンプルコード: https://github.com/Fadis/kernelvm_20190720_samples


Transcript

  1. Deep Learning for Low-Level People — NAOMASA MATSUBAYASHI. Sample code appearing in this talk: https://github.com/Fadis/kernelvm_20190720_samples

  2. Deep Learning for Low-Level People

  3. A formal neuron: each input x0…x4 is multiplied by its weight (w02…w42), the products are summed (Σ), and the sum is passed through an activation function φ to produce the output y2.
  4. A layer: y_j = φ(Σ_i w_ij x_i). Each output y0, y1, y2, … is computed from all of the inputs x0…x4 by the same multiply, Σ, φ pipeline, each output with its own set of weights.
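The layer formula y_j = φ(Σ_i w_ij x_i) can be written down directly. A minimal C++ sketch (the names and the choice of tanh for φ are illustrative, not from the talk's code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One fully connected layer: y_j = phi( sum_i w[i][j] * x[i] ).
// phi is tanh here purely as an example activation.
std::vector<double> layer_forward(
    const std::vector<std::vector<double>>& w,  // w[i][j]: input i -> output j
    const std::vector<double>& x) {
  const std::size_t n_out = w.empty() ? 0 : w[0].size();
  std::vector<double> y(n_out, 0.0);
  for (std::size_t j = 0; j < n_out; ++j) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) sum += w[i][j] * x[i];
    y[j] = std::tanh(sum);
  }
  return y;
}
```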
  5. A neural network: a stack of layers. A stack of infinitely many layers can represent an arbitrary function depending on the weights; even a finite stack can approximate a wide variety of functions.
  6. Two pieces of data that seem somehow correlated, but whose exact relationship is unclear.

  7. This becomes a problem of searching for the weights: treating the inputs and the desired outputs as constants, adjust w so that the output y comes as close as possible to the correct output.
  8. Training: given a large number of inputs x with corresponding desired outputs t, repeat the cycle of comparing the network output y0…y2 with t0…t2 and correcting w. The neural network then becomes an approximation of the function relating x to t: a function is obtained from data.
  9. Deep learning: a long (deep) neural network. A deeper network can approximate more complex functions but is harder to train, so deep networks were once considered impractical. Research in recent years made training deep networks feasible, triggering the current boom.
  10. Training requires an enormous amount of computation, so accelerators other than the CPU, such as GPUs and FPGAs, are frequently used.

  11. Frameworks appeared that abstract away which hardware (CPU, GPU, FPGA) performs the computation: TensorFlow, Chainer, Caffe, PyTorch, …, Theano.
  12. Then frameworks appeared that abstract away which framework performs the computation: Keras, on top of TensorFlow, Theano, and the rest. The layers keep getting higher.
  13. And these frameworks are huge:
     $ du -h tensorflow-1.12.3/ ... 145M tensorflow-1.12.3/
     $ du -h pytorch-1.1.0/ ... 44M pytorch-1.1.0/
     $ du -h chainer-6.1.0/ ... 22M chainer-6.1.0/
     $ du -h keras-2.2.4/ ... 3.0M keras-2.2.4/
  14. Deep Learning for Low-Level People

  15. Narrow the target hardware down to the GPU, and attempt deep learning without using any framework.

  16. To drive the GPU we use Vulkan, the new low-level graphics API for hitting the features of modern GPUs as directly as possible (see "Let's Start the Low-Level Graphics API Vulkan" from the 13th Kernel/VM Explorers meetup).

  17. CUDA, which can touch the same generation of GPUs as Vulkan, has cuDNN (https://developer.nvidia.com/cudnn), a library of the computations used in neural networks — but:

  18. We use Vulkan to drive the GPU, and implement every computation we need ourselves.

  19. The most basic network, made of fully connected layers only: inputs x0…x4, a hidden layer, and an output layer producing y0…y4.
  20. Create a Vulkan instance, then choose which of the available GPUs to use:
     if( config.validation ) layers.emplace_back( "VK_LAYER_LUNARG_standard_validation" );
     const auto app_info = vk::ApplicationInfo( config.prog_name.c_str(), (snip), VK_API_VERSION_1_1 );
     instance_ptr_t instance( new vk::Instance( vk::createInstance(
       vk::InstanceCreateInfo()
         .setPApplicationInfo( &app_info )
         .setEnabledExtensionCount( ext.size() ).setPpEnabledExtensionNames( ext.data() )
         .setEnabledLayerCount( layers.size() ).setPpEnabledLayerNames( layers.data() )
     ) ) );
     auto devices = instance->enumeratePhysicalDevices();
     if( devices.empty() ) throw device_is_not_available();
     devices.erase( std::remove_if( devices.begin(), devices.end(),
       [&]( const auto &d ) -> bool {
         auto avail_dext = d.enumerateDeviceExtensionProperties();
         for( const char *w: dext )
           if( std::find_if( avail_dext.begin(), avail_dext.end(),
             [&]( const auto &v ) { return !strcmp( v.extensionName, w ); } ) == avail_dext.end() ) return true;
         const auto avail_dlayers = d.enumerateDeviceLayerProperties();
         for( const char *w: dlayers )
           if( std::find_if( avail_dlayers.begin(), avail_dlayers.end(),
             [&]( const auto &v ) { return !strcmp( v.layerName, w ); } ) == avail_dlayers.end() ) return true;
         return false;
       } ), devices.end() );
     if( devices.empty() ) throw required_extensions_or_layers_are_not_available();
  21. Create the logical device, the queue, and the command pool:
     const auto queue_props = physical_device.getQueueFamilyProperties();
     uint32_t queue_index = std::distance( queue_props.begin(), std::find_if(
       queue_props.begin(), queue_props.end(), []( const auto &v ) {
         return bool( v.queueFlags & vk::QueueFlagBits::eCompute ) && bool( v.queueFlags & vk::QueueFlagBits::eTransfer );
       } ) );
     if( queue_index == queue_props.size() ) throw required_queue_is_not_available();
     const float priority = 0.0f;
     std::vector< vk::DeviceQueueCreateInfo > queues{};
     const auto queue_create_info = vk::DeviceQueueCreateInfo()
       .setQueueFamilyIndex( queue_index ).setQueueCount( 1 ).setPQueuePriorities( &priority );
     const auto features = physical_device.getFeatures();
     auto device = physical_device.createDevice( vk::DeviceCreateInfo()
       .setQueueCreateInfoCount( 1 ).setPQueueCreateInfos( &queue_create_info )
       .setEnabledExtensionCount( dext.size() ).setPpEnabledExtensionNames( dext.data() )
       .setEnabledLayerCount( dlayers.size() ).setPpEnabledLayerNames( dlayers.data() )
       .setPEnabledFeatures( &features ) );
     std::shared_ptr< vk::Device > d( new vk::Device( std::move( device ) ),
       []( const auto &p ) { if( p ) { p->destroy(); delete p; } } );
     auto queue = device.getQueue( queue_index, 0 );
     auto command_pool = device.createCommandPool( vk::CommandPoolCreateInfo()
       .setQueueFamilyIndex( queue_index ).setFlags( vk::CommandPoolCreateFlagBits::eResetCommandBuffer ) );
     std::shared_ptr< vk::Queue > q( new vk::Queue( std::move( queue ) ), [d]( const auto& ) {} );
     std::shared_ptr< vk::CommandPool > p( new vk::CommandPool( std::move( command_pool ) ),
       [d]( const vk::CommandPool *p ) { if( p ) { d->destroyCommandPool( *p );
  22. When the input is an N-element vector and the output an M-element vector, view the weights as the matrix w = [ w00 w01 ⋯ w0M ; w10 w11 ⋯ w1M ; ⋮ ; wN0 wN1 ⋯ wNM ]. Then the computation of one layer is y = φ(wx): take the matrix-vector product, and apply φ to each element of the result.
  23. Mini-batch: bundle several input vectors into a matrix, compute the error for the whole bundle, and only then correct the weights w. Training is more stable than correcting w for every sample individually.
  24. One layer now computes a matrix-matrix product followed by an elementwise φ: x0…xb are the input vectors, y0…yb the output vectors, and b is the batch size.
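The batched layer (matrix product followed by elementwise φ) can be sketched on the CPU like this (a minimal sketch; the names and the choice of ReLU for φ are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Mini-batch forward pass: each row of x is one input vector.
// Computes y = phi(x * w), with phi = ReLU as an example choice.
Mat batch_forward(const Mat& x, const Mat& w) {
  const std::size_t b = x.size(), n = w.size(), m = w[0].size();
  Mat y(b, std::vector<double>(m, 0.0));
  for (std::size_t s = 0; s < b; ++s)       // one row per batch sample
    for (std::size_t j = 0; j < m; ++j) {
      double sum = 0.0;
      for (std::size_t i = 0; i < n; ++i) sum += x[s][i] * w[i][j];
      y[s][j] = std::max(0.0, sum);         // elementwise ReLU
    }
  return y;
}
```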
  25. The whole network: an input batch x0…xb flows through product + φ (weights wh), then product + φ (weights wo), and finally the loss function compares the result with the targets t0…tb to produce the loss L.
  26. Forward dataflow: input → hidden-layer product (hidden-layer weights) → hidden-layer output → output-layer product (output-layer weights) → output-layer output → loss function, which compares against the desired output to yield the error.
  27. Allocate the GPU memory:
     hidden_weight.reset( new liblnn::buffer< glm::vec4 >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( input_width * hidden_width * sizeof( glm::vec4 ) ).setUsage( copyable ) ) );
     output_weight.reset( new liblnn::buffer< glm::vec4 >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( hidden_width * output_width * sizeof( glm::vec4 ) ).setUsage( copyable ) ) );
     hidden_affine_output.reset( new liblnn::buffer< float >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
     hidden_relu_output.reset( new liblnn::buffer< float >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
     output_affine_output.reset( new liblnn::buffer< float >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
     output_relu_output.reset( new liblnn::buffer< float >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
     softmax_grad.reset( new liblnn::buffer< float >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
  28. Product: compute the product of the input matrix x and the weight matrix w on the GPU.
  29. GPU terminology: the unit that computes one piece of data is a Thread in Vulkan terms and also a Thread in NVIDIA terms; one SIMD unit is a Subgroup in Vulkan terms and a Warp in NVIDIA terms.

  30. The GPU is an architecture that earns its performance from sheer processor count, so the computation must be spread over as many threads as possible.

  31. VRAM: all threads share the VRAM. Multiple threads reading from the same address is fine, but when multiple threads write to the same address, which thread's value survives is undefined. This memory is called Memory in Vulkan terms and Global Memory in NVIDIA terms.

  32. For the matrix product, each output element is a sum: Σ_i x_{0i} w_{i0}, Σ_i x_{0i} w_{i1}, and so on, so the computation obviously parallelizes at least up to the number of elements of the output matrix. Parallelizing the inside of each Σ as well (e.g. Σ_i x_{0i} w_{i0} = x00 w00 + x01 w10 + x02 w20) gains even more threads, but then the question becomes how to take the Σ over the values held by different threads.
  33. SRAM: the GPU has SRAM that multiple threads can synchronize on. By writing to the SRAM, synchronizing, and reading from the SRAM, a value can be passed between threads of the same WorkGroup; the barrier blocks until every thread of the WorkGroup reaches the synchronization point. This SRAM is called SharedMemory in Vulkan terms and also SharedMemory in NVIDIA terms; the bundle of threads that can share it is a WorkGroup in Vulkan terms and a Block in NVIDIA terms.
  34. Σ on a classical GPU: the sum is obtained with log2(n) rounds of addition and synchronization. For x0…x3: compute S0 := x0 + x1 and S1 := x2 + x3 in SRAM, synchronize, then S0 := S0 + S1 yields x0 + x1 + x2 + x3.
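The pairwise reduction above can be sketched sequentially; each iteration of the outer loop corresponds to one "add + barrier" round on the GPU (a minimal sketch, names illustrative):

```cpp
#include <cassert>
#include <vector>

// Pairwise tree reduction: log2(n) rounds; in round `stride`,
// element i absorbs element i + stride, mimicking one GPU
// add-then-barrier step across threads.
double tree_sum(std::vector<double> v) {
  for (std::size_t stride = 1; stride < v.size(); stride *= 2)
    for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride)
      v[i] += v[i + stride];  // each "thread" adds its partner's value
  return v.empty() ? 0.0 : v[0];
}
```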
  35. Newer GPUs can do horizontal operations within a Subgroup: a horizontal add over x0…x3 leaves x0 + x1 + x2 + x3 in every lane (A, B, C, D) at once. These are called Subgroup operations in Vulkan terms and Warp Shuffle in NVIDIA terms.
  36. Σ on a current GPU: if the Subgroup size is 32, the sum is obtained with log32(n) rounds of horizontal addition and synchronization (horizontal add → synchronize → horizontal add).
  37. The hierarchy: a Subgroup shares a program counter; a Workgroup shares SRAM and runs concurrently; a Dispatch shares VRAM and runs concurrently. On a GeForce GTX 1070: 32 threads per Subgroup, physically 4 Subgroups (logically 48), physically 60 Workgroups (logically 2^64).
  38. Σ in GLSL: horizontally add and write the result to SharedMemory, synchronize; horizontally add the SharedMemory values and write the result back, synchronize; once SharedMemory is down to one element, return that value.
     shared float local_sum[ local_memory_size ];
     float large_sum( in float value ) {
       float sg_sum = subgroupAdd( value );
       local_sum[ gl_SubgroupID ] = sg_sum;
       barrier();
       uint len = gl_NumSubgroups;
       while( len > 1 ) {
         uint index = gl_SubgroupInvocationID + gl_SubgroupID * gl_SubgroupSize;
         float sum = subgroupAdd( index < len ? local_sum[ index ] : 0.0 );
         local_sum[ gl_SubgroupID ] = sum;
         barrier();
         len /= gl_SubgroupSize;
       }
       barrier();
       return local_sum[ 0 ];
     }
  39. Matrix product in GLSL: arrange the threads so that those that must take a Σ together land in the same WorkGroup, multiply the input-matrix values by the weight-matrix values, and write the sums into the output matrix.
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint output_index = gl_GlobalInvocationID.y;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
       output_data[ output_index + data_index * output_width ] = 0.0;
       for( uint offset = 0; offset < width; offset += input_width ) {
         float value = ( offset + input_index ) < width ?
           input_data[ offset + input_index + data_index * width ] *
           weight[ output_index + ( offset + input_index ) * output_width ].x : 0.0;
         output_data[ output_index + data_index * output_width ] += large_sum( value );
       }
     }
  40. Activation function: if a layer were only a matrix product, any stack of layers would remain linear; in other words, only linear functions could be approximated. Inserting a nonlinear function between the matrix products breaks the linearity and makes it possible to approximate nonlinear functions.
  41. Hyperbolic Tangent

  42. Rectified Linear Unit (ReLU)

  43. Hyperbolic Tangent: y_ij = tanh(x_ij). Rectified Linear Unit: y_ij = x_ij for x_ij ≥ 0, and 0 for x_ij < 0. Since an activation function is computed independently for each element of the input/output matrix, it parallelizes up to the number of elements of the output matrix (= the number of elements of the input matrix).
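Because the activations are elementwise, applying them is an independent per-element loop. A minimal C++ sketch (function names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Activation functions act elementwise, so every element of the
// (flattened) input matrix can be processed independently,
// which is exactly what makes them trivially parallel on a GPU.
std::vector<double> apply_relu(std::vector<double> x) {
  for (double& v : x) v = v >= 0.0 ? v : 0.0;
  return x;
}

std::vector<double> apply_tanh(std::vector<double> x) {
  for (double& v : x) v = std::tanh(v);
  return x;
}
```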
  44. Hyperbolic Tangent in GLSL:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       for( uint offset = 0; offset < width; offset += input_width ) {
         if( ( offset + input_index ) < width )
           output_data[ offset + input_index + data_index * width ] =
             tanh( input_data[ offset + input_index + data_index * width ] );
       }
     }
     Rectified Linear Unit in GLSL:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       for( uint offset = 0; offset < width; offset += input_width ) {
         if( ( offset + input_index ) < width )
           output_data[ offset + input_index + data_index * width ] =
             max( 0.0, input_data[ offset + input_index + data_index * width ] );
       }
     }
  45. Loss function: a function whose output gets smaller the more the network's obtained output resembles the desired output. Attaching it at the end turns the search for appropriate weights into an optimization problem.
  46. Output format: the desired output is t = (0 0 1 0 0); the neural network's output is y = (0.8 0.000007 0.9 0.036 0.00005). It looks like either class 0 or class 2; the correct answer is class 2.
  47. softmax: y_i = e^{x_i} / Σ_j e^{x_j}. For y = (0.8, 0.000007, 0.9, 0.036, 0.00005), softmax(y) = (0.288213, 0.129503, 0.318524, 0.134249, 0.129509).
  48. Cross-entropy loss: L = −Σ_i t_i log(y_i), with y_i = e^{x_i} / Σ_j e^{x_j}. When t = 1 and y = 1, l = 0; when t = 1 and y = 0.001, l = 4.605; when t = 0, l = 0. For the softmax output y_i to approach 1, the other elements of y must approach 0. In short: the more y resembles t, the smaller L becomes.
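Softmax followed by cross-entropy can be sketched directly from the two formulas (a minimal CPU sketch; no numerical clamping, unlike the GPU version that follows):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Softmax: y_i = exp(x_i) / sum_j exp(x_j).
std::vector<double> softmax(const std::vector<double>& x) {
  double denom = 0.0;
  for (double v : x) denom += std::exp(v);
  std::vector<double> y;
  for (double v : x) y.push_back(std::exp(v) / denom);
  return y;
}

// Cross-entropy: L = -sum_i t_i * log(y_i).
double cross_entropy(const std::vector<double>& t,
                     const std::vector<double>& y) {
  double l = 0.0;
  for (std::size_t i = 0; i < t.size(); ++i) l -= t[i] * std::log(y[i]);
  return l;
}
```

For two equal logits, softmax gives 0.5 each, and a one-hot target then yields a loss of log 2.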
  49. Softmax with cross-entropy loss in GLSL. If Σ_j e^{x_j} gets too close to 0 it produces inf, and if y_i reaches 0 the log produces nan or inf, hence the 1.0e-10 clamps:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       float value1 = input_index < width ? exp( input_data[ input_index + data_index * width ] ) : 0.0;
       float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
       float t = teacher_data[ input_index + data_index * width ];
       float y_ = max( y, 1.0e-10 );
       float value2 = input_index < width ? t * log( y_ ) : 0.0;
       float l = -large_sum( float( value2 ) );
       if( input_index == 0 ) output_data[ data_index ] = l;
     }
  50. Treat x and t as constants, and search for the w_h and w_o that minimize L.
  51. ∂L/∂w: how L changes when w changes. Partially differentiate, with respect to w, the function from the layer where the weight appears all the way up to the cross-entropy loss.
  52. The path from w_o to L (product + φ producing c, then d, then the loss) can be seen as a composition of three functions. By the chain rule df/dx = (df/dg)(dg/dx): ∂L/∂w_o = (∂L/∂d)(∂d/∂c)(∂c/∂w_o).
  53. Likewise for the hidden-layer weights: ∂L/∂w_h = (∂L/∂d)(∂d/∂c)(∂c/∂b)(∂b/∂a)(∂a/∂w_h).
  54. Backpropagation: if the derivative of each layer's output with respect to its input is known, ∂L/∂w can be built up layer by layer from the back: ∂L/∂d, then (∂L/∂d)(∂d/∂c), then (∂L/∂d)(∂d/∂c)(∂c/∂b), and so on down to ∂L/∂w_h.
  55. The full dataflow with gradients: the forward pass (input → hidden-layer product → hidden-layer output → output-layer product → output-layer output → loss function → error, using the hidden- and output-layer weights and the desired output), plus the backward pass (gradient of the loss function → gradient of the activation function → gradient of the matrix product → gradient of the activation function → gradient of the matrix product).
  56. Backpropagation through the loss function: with L = −Σ_i t_i log(y_i) and y_i = e^{x_i}/Σ_j e^{x_j}, we have ∂L/∂y_i = −t_i/y_i, and ∂y_i/∂x_k = y_i(1 − y_i) when i = k, −y_i y_k when i ≠ k.
  57. Combining them: ∂L/∂x_i = (∂L/∂y_i)(∂y_i/∂x_i) + Σ_{k≠i} (∂L/∂y_k)(∂y_k/∂x_i) = −t_i(1 − y_i) + Σ_{k≠i} t_k y_i.
  58. Since the desired outputs sum to 1: ∂L/∂x_i = −t_i(1 − y_i) + Σ_{k≠i} t_k y_i = −t_i + y_i Σ_k t_k = y_i − t_i.
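The neat y_i − t_i result can be checked numerically against a central finite-difference gradient of L(x) = −Σ_i t_i log(softmax(x)_i) (a verification sketch; all names illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// L(x) = -sum_i t_i log(softmax(x)_i).
double loss(const std::vector<double>& x, const std::vector<double>& t) {
  double denom = 0.0;
  for (double v : x) denom += std::exp(v);
  double l = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    l -= t[i] * std::log(std::exp(x[i]) / denom);
  return l;
}

// Analytic gradient from the derivation: dL/dx_i = y_i - t_i.
double analytic_grad(const std::vector<double>& x,
                     const std::vector<double>& t, std::size_t i) {
  double denom = 0.0;
  for (double v : x) denom += std::exp(v);
  return std::exp(x[i]) / denom - t[i];
}

// Central finite difference of L for comparison.
double numeric_grad(std::vector<double> x,
                    const std::vector<double>& t, std::size_t i) {
  const double h = 1e-6;
  x[i] += h;     const double lp = loss(x, t);
  x[i] -= 2 * h; const double lm = loss(x, t);
  return (lp - lm) / (2 * h);
}
```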
  59. At the end of the softmax GLSL, also output the gradient y − t:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       float value1 = input_index < width ? exp( input_data[ input_index + data_index * width ] ) : 0.0;
       float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
       float t = teacher_data[ input_index + data_index * width ];
       float y_ = max( y, 1.0e-10 );
       float value2 = input_index < width ? t * log( y_ ) : 0.0;
       float l = -large_sum( float( value2 ) );
       if( input_index == 0 ) output_data[ data_index ] = l;
       if( input_index < width ) input_grad[ input_index + data_index * width ] = float( y - t );
     }
  60. If tanh is used as the final layer's activation, its output range (values in −1 ≤ x ≤ 1) does not match the range the loss function expects (0 ≤ x, please), so rescale: s_i = x_i/2 + 1/2, y_i = e^{s_i}/Σ_j e^{s_j}, L = −Σ_i t_i log(y_i).
  61. Then ∂L/∂s_i = y_i − t_i and ∂s_i/∂x_i = 1/2, so ∂L/∂x_i = (∂L/∂s_i)(∂s_i/∂x_i) = (y_i − t_i)/2:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       float value1 = input_index < width ? exp( input_data[ input_index + data_index * width ] * 0.5 + 0.5 ) : 0.0;
       float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
       float t = teacher_data[ input_index + data_index * width ];
       float y_ = max( y, 1.0e-10 );
       float value2 = input_index < width ? t * log( y_ ) : 0.0;
       float l = -large_sum( float( value2 ) );
       if( input_index == 0 ) output_data[ data_index ] = l;
       if( input_index < width ) input_grad[ input_index + data_index * width ] = float( y - t ) * 0.5;
     }
  62. Backpropagation through Hyperbolic Tangent: y_i = tanh(x_i), ∂y_i/∂x_i = 1 − tanh²(x_i), so given the output-side gradient ∂L/∂y_i, we get ∂L/∂x_i = (∂L/∂y_i)(∂y_i/∂x_i) = (∂L/∂y_i)(1 − tanh²(x_i)).
  63. Backpropagation through Hyperbolic Tangent in GLSL:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       for( uint offset = 0; offset < width; offset += input_width ) {
         if( ( offset + input_index ) < width )
           input_grad[ offset + input_index + data_index * width ] =
             ( 1 - pow( tanh( input_data[ offset + input_index + data_index * width ] ), 2 ) ) *
             output_grad[ offset + input_index + data_index * width ];
       }
     }
  64. Backpropagation through Rectified Linear Unit: y_i = x_i for x_i ≥ 0 and 0 for x_i < 0, so ∂y_i/∂x_i = 1 for x_i ≥ 0 and 0 for x_i < 0, and therefore ∂L/∂x_i = ∂L/∂y_i for x_i ≥ 0, 0 for x_i < 0.
  65. Backpropagation through Rectified Linear Unit in GLSL:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       for( uint offset = 0; offset < width; offset += input_width ) {
         if( ( offset + input_index ) < width )
           input_grad[ offset + input_index + data_index * width ] =
             input_data[ offset + input_index + data_index * width ] >= 0 ?
             output_grad[ offset + input_index + data_index * width ] : 0.0;
       }
     }
  66. That looks discontinuous no matter how you slice it; can a derivative even be defined?

  67. The ReLU paper [1] approximates y_i = log(1 + exp(x_i)) by y_i = x_i for x_i ≥ 0, 0 for x_i < 0. [1] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML'10), Johannes Fürnkranz and Thorsten Joachims (Eds.). Omnipress, USA, 807-814.
  68. Accordingly, the derivative ∂y_i/∂x_i = exp(x_i)/(1 + exp(x_i)) is approximated by ∂y_i/∂x_i = 1 for x_i ≥ 0, 0 for x_i < 0.
  69. Backpropagation through the matrix product: two things are needed, so start with the weight gradient. From y_j = Σ_i w_ij x_i we get ∂y_j/∂w_ij = x_i, hence ∂L/∂w_ij = (∂L/∂y_j) x_i: multiply the gradient of the output the weight contributed to by the input the weight was applied to.
  70. Next the input gradient: ∂y_j/∂x_i = w_ij, hence ∂L/∂x_i = Σ_j (∂L/∂y_j) w_ij: sum, over every output the input influenced, the output-side gradient times the weight connecting that input and output.
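Both matrix-product gradients can be computed in one pass over (i, j) pairs. A minimal CPU sketch of ∂L/∂w_ij = (∂L/∂y_j) x_i and ∂L/∂x_i = Σ_j (∂L/∂y_j) w_ij (names illustrative):

```cpp
#include <cassert>
#include <vector>

// Backprop through y_j = sum_i w[i][j] * x[i]:
//   dL/dw[i][j] = dL/dy[j] * x[i]
//   dL/dx[i]    = sum_j dL/dy[j] * w[i][j]
struct AffineGrads {
  std::vector<std::vector<double>> dw;
  std::vector<double> dx;
};

AffineGrads affine_backward(const std::vector<std::vector<double>>& w,
                            const std::vector<double>& x,
                            const std::vector<double>& dy) {
  AffineGrads g;
  g.dw.assign(x.size(), std::vector<double>(dy.size(), 0.0));
  g.dx.assign(x.size(), 0.0);
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = 0; j < dy.size(); ++j) {
      g.dw[i][j] = dy[j] * x[i];   // weight gradient
      g.dx[i] += dy[j] * w[i][j];  // input gradient accumulates over outputs
    }
  return g;
}
```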
  71. Backpropagation through the matrix product in GLSL: launch as many threads as there are ∂L/∂w_ij, and assign WorkGroups so that the threads sharing a ∂L/∂x_i can take a horizontal add.
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint output_index = gl_GlobalInvocationID.y;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
       float grad_w_sum = 0.0;
       for( uint data_index = 0; data_index != batch_size; data_index++ ) {
         input_grad[ input_index + data_index * input_width ] = 0.0;
         for( uint offset = 0; offset < height; offset += output_width ) {
           float grad_x = ( offset + output_index ) < height ?
             weight[ offset + output_index + input_index * height ].x *
             output_grad[ offset + output_index + data_index * height ] : 0.0;
           input_grad[ input_index + data_index * input_width ] += large_sum( grad_x );
         }
       }
       for( uint offset = 0; offset < height; offset += output_width ) {
         float grad_w_sum = 0.0;
         for( uint data_index = 0; data_index != batch_size; data_index++ ) {
           float grad_w = ( offset + output_index ) < height ?
             input_data[ input_index + data_index * input_width ] *
             output_grad[ offset + output_index + data_index * height ] : 0.0;
           grad_w_sum += grad_w;
         }
         if( ( offset + output_index ) < height )
           adam( weight[ offset + output_index + input_index * height ], grad_w_sum );
       }
     }
  72. Stochastic gradient descent: w_{t+1} = w_t − μ ∂L/∂w_t. Basically, follow ∂L/∂w_t and update w a little in the direction that makes L smaller, stepping from "here" toward the minimum we want to reach.
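The update rule w ← w − μ ∂L/∂w can be demonstrated on a one-dimensional loss. A minimal sketch (the parabola L(w) = (w − 3)² is an illustrative example, not from the talk):

```cpp
#include <cassert>
#include <cmath>

// One SGD step: w <- w - mu * dL/dw.
double sgd_step(double w, double grad, double mu) { return w - mu * grad; }

// Example: minimize L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
// Repeated steps shrink the distance to the minimum geometrically.
double minimize_parabola(double w, double mu, int steps) {
  for (int i = 0; i < steps; ++i) w = sgd_step(w, 2.0 * (w - 3.0), mu);
  return w;
}
```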
  73. Stochastic gradient descent, however, can trip over a small bump in L and step back from it: w_{t+1} = w_t − μ ∂L/∂w_t.
  74. The evolution of optimization algorithms: SGD (stochastic gradient descent), MomentumSGD, NAG, AdaGrad, RMSprop, Adam, AdaDelta, AdaMax, SMORMS3, RMSpropGraves, Eve, Nadam, Santa-E, Santa-SSS, AdaSecant, GD by GD.
  75. Adam: g = ∂L/∂w_t; m_t = β1 m_{t−1} + (1 − β1) g; v_t = β2 v_{t−1} + (1 − β2) g²; m̂_t = m_t/(1 − β1^t); v̂_t = v_t/(1 − β2^t); w_{t+1} = w_t − α m̂_t/(√v̂_t + ε). α = 0.001, β1 = 0.9, β2 = 0.999 are the recommended values, so we use them as-is.
     void adam( inout vec4 weight, in float grad ) {
       weight.w += 1;
       float gt = grad;
       weight.y = beta1 * weight.y + ( 1 - beta1 ) * gt;
       weight.z = beta2 * weight.z + ( 1 - beta2 ) * gt * gt;
       float mhat = weight.y / ( 1 - pow( beta1, weight.w ) );
       float vhat = weight.z / ( 1 - pow( beta2, weight.w ) );
       weight.x -= alpha * mhat / ( sqrt( vhat ) + eps );
     }
     • Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980v9 https://arxiv.org/abs/1412.6980
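The GLSL above packs (w, m, v, t) into one vec4. A scalar C++ port of the same update, mirroring that packing as a struct (a sketch; struct and parameter names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Scalar port of the GLSL adam(): the vec4 packed (w, m, v, t).
struct AdamState { double w, m = 0.0, v = 0.0, t = 0.0; };

void adam(AdamState& s, double grad,
          double alpha = 0.001, double beta1 = 0.9, double beta2 = 0.999,
          double eps = 1e-8) {
  s.t += 1;
  s.m = beta1 * s.m + (1 - beta1) * grad;         // first-moment estimate
  s.v = beta2 * s.v + (1 - beta2) * grad * grad;  // second-moment estimate
  const double mhat = s.m / (1 - std::pow(beta1, s.t));  // bias correction
  const double vhat = s.v / (1 - std::pow(beta2, s.t));
  s.w -= alpha * mhat / (std::sqrt(vhat) + eps);
}
```

On the very first step with gradient 1, both bias-corrected moments are 1, so the weight moves by almost exactly α.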
  76. Initial values: neurons with identical w behave identically, and the gradients with respect to the weights of neurons producing identical outputs coincide, so the weights are updated by the same amount and the neurons keep behaving identically forever. Therefore the initial weights of a neural network must not be uniform.
  77. Xavier initialization: when a layer has n input elements, initialize w with w = (1/√n)·randn() (normally distributed random numbers). Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR, pp. 249-256, 2010. http://proceedings.mlr.press/v9/glorot10a.html
  78. Generating random numbers on the GPU: most pseudo-random algorithms carry a state that is updated every time one number is emitted (G_t → 0.0801, G_{t+1} → 0.2926, G_{t+2} → 0.4342, G_{t+3} → 0.6978). The next number cannot be generated until the previous one has been, so this does not scale.
  79. A mysterious uniform random number algorithm handed down among video game developers for roughly the last ten years: f(x) = fract(t·sin(x·s)), with s = [12.9898, 78.233], t = 43758.5453, and fract(x) = x − ⌊x⌋. E.g. f([0.1, 0.8]) = 0.7340, f([0.3, 0.2]) = 0.1768. Because no generator state is carried from one element to the next, it scales.
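The one-liner PRNG, plus the Box-Muller step used later, port directly to C++. A sketch (the chaining of prand calls inside boxmuller differs slightly from the GLSL that follows; all names illustrative):

```cpp
#include <cassert>
#include <cmath>

// C++ port of the "mysterious" GLSL one-liner uniform PRNG:
// f(x, y) = fract(sin(12.9898 x + 78.233 y) * 43758.5453), in [0, 1).
double fract(double x) { return x - std::floor(x); }
double prand(double x, double y) {
  return fract(std::sin(x * 12.9898 + y * 78.233) * 43758.5453);
}

// Box-Muller: turn uniform samples into one normal sample N(mu, sigma^2).
double boxmuller(double x, double y, double mu, double sigma) {
  const double pi = 3.141592653589793;
  double u = 1.0 - prand(x, y);          // shift into (0, 1] to avoid log(0)
  double n = prand(u, prand(x, u) * pi); // second, decorrelated uniform draw
  double v = std::sqrt(-2.0 * std::log(u)) * std::cos(2.0 * pi * n);
  return mu + sigma * v;
}
```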
  80. The distribution of this generator's output, and the same output pushed through the Box-Muller transform into a normal distribution: it has a few stray spikes, but does not look unusably bad.

  81. Xavier initialization in GLSL: generate a uniform random number, turn it into a normal random number with the Box-Muller transform, and run the function computing the Xavier initial value with one thread per weight element:
     float prand( vec2 i ) {
       return fract( sin( dot( i.xy, vec2( 12.9898, 78.233 ) ) ) * 43758.5453 );
     }
     const float PI = 3.1415926535897932384626433832795;
     float boxmuller( vec2 i, float mu, float sigma ) {
       float x = 1 - prand( i );
       float y = prand( vec2( i.y, x ) );
       float n = prand( vec2( x, y * PI ) );
       float v = sqrt( -2.0 * log( x ) ) * cos( 2 * PI * n );
       return mu + sigma * v;
     }
     float xavier_init_value( vec2 i, uint n ) {
       float value = boxmuller( i, 0.0, 1.0 / sqrt( n ) );
       return value;
     }
     void main() {
       const uint x = gl_GlobalInvocationID.x;
       const uint y = gl_GlobalInvocationID.y;
       const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
       const uint index = x + y * width;
       weight[ index ] = vec4( xavier_init_value( vec2( float( x )/width, float( y )/height ), input_size ), 0, 0, 0 );
     }
  82. He initialization: when a layer has n input elements, initialize w with w = (√2/√n)·randn() (normally distributed random numbers). Said to propagate the initial error better than Xavier initialization when ReLU is used. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. 2015. https://arxiv.org/abs/1502.01852
  83. He initialization in GLSL; only the scale factor differs from the Xavier version:
     float he_init_value( vec2 i, uint n ) {
       float value = boxmuller( i, 0.0, sqrt( 2 ) / sqrt( n ) );
       return value;
     }
     void main() {
       const uint x = gl_GlobalInvocationID.x;
       const uint y = gl_GlobalInvocationID.y;
       const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
       const uint index = x + y * width;
       weight[ index ] = vec4( he_init_value( vec2( float( x )/width, float( y )/height ), input_size ), 0, 0, 0 );
     }
  84. Compile the GLSL, create the ComputePipelines, and bind the buffers:
     hidden_affine1.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       batch_images[ 0 ], hidden_affine_output, hidden_weight, batch_size ) ) );
     hidden_affine2.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       batch_images[ 1 ], hidden_affine_output, hidden_weight, batch_size ) ) );
     hidden_activation.reset( new layer( create_relu_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       hidden_affine_output, hidden_activation_output ) ) );
     output_affine.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       hidden_activation_output, output_affine_output, output_weight, batch_size ) ) );
     output_activation.reset( new layer( create_tanh_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       output_affine_output, output_activation_output ) ) );
     error1.reset( new layer( create_softmax_combined_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       output_activation_output, error_out, softmax_grad, batch_labels[ 0 ] ) ) );
     error2.reset( new layer( create_softmax_combined_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       output_activation_output, error_out, softmax_grad, batch_labels[ 1 ] ) ) );
     output_activation_backward.reset( new layer( create_tanh_backward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       output_affine_output, output_activation_output, output_activation_grad, softmax_grad ) ) );
     output_affine_backward.reset( new layer( create_affine_backward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       hidden_activation_output, output_affine_output,
  85. Execution: prepare two sets of input/target buffers, so that training and the transfer of the next batch proceed simultaneously, and submit the contents of the command buffer to the GPU:
     void network::exec() {
       ++swap_index;
       swap_index %= 2;
       queue->submit( vk::SubmitInfo()
         .setCommandBufferCount( 1 )
         .setPCommandBuffers( command_buffers->data() + swap_index ), vk::Fence() );
       fill( false, false );
       queue->waitIdle();
       if( debug ) {
         std::cout << "==============" << std::endl;
         check();
         print( *error_out, batch_size );
         print( *output_activation_output, batch_size );
         print_image( *batch_images[ swap_index ], train_input->get_image_width(), batch_size );
         print_label( *batch_labels[ swap_index ], batch_size );
         print_eval( *output_activation_output, batch_size );
       }
     }
  86. MNIST (http://yann.lecun.com/exdb/mnist/): 70,000 labeled images of handwritten digits. A classification task simple enough that even an SVM reaches over 90% accuracy; if the implementation is correct, there is no way it fails to classify.

  87. Batch size 64, hidden-layer width 128: evaluation accuracy around 98%. It is functioning as a neural network.

  88. Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist): 70,000 labeled images of clothing across 10 classes (T-shirts, coats, shoes, etc.). Different kinds of clothing can have similar shapes, so it is considered harder than MNIST.

  89. Batch size 64, hidden-layer width 128: evaluation accuracy around 87%. Indeed somewhat lower.

  90. When accuracy falls short, add layers. But adding plain matrix-product layers makes the number of weights w grow rapidly.
  91. Convolution: an image-processing filter that is learned. Filter w, input x, output y.

  92. Convolution: an image-processing filter that is learned. Filter w, input x, output y.

  93. With filter size M × N, stride 1, and no margin: y_ij = Σ_{k=0}^{M} Σ_{l=0}^{N} w_kl x_{(i+k)(j+l)}.
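The convolution formula y_ij = Σ_k Σ_l w_kl x_{(i+k)(j+l)} translates directly into four nested loops. A minimal CPU sketch for stride 1 with no margin (names illustrative):

```cpp
#include <cassert>
#include <vector>

using Img = std::vector<std::vector<double>>;

// Valid convolution (stride 1, no margin):
// y[i][j] = sum_{k,l} w[k][l] * x[i+k][j+l]
Img convolve(const Img& x, const Img& w) {
  const std::size_t m = w.size(), n = w[0].size();
  const std::size_t oh = x.size() - m + 1, ow = x[0].size() - n + 1;
  Img y(oh, std::vector<double>(ow, 0.0));
  for (std::size_t i = 0; i < oh; ++i)
    for (std::size_t j = 0; j < ow; ++j)
      for (std::size_t k = 0; k < m; ++k)
        for (std::size_t l = 0; l < n; ++l)
          y[i][j] += w[k][l] * x[i + k][j + l];
  return y;
}
```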
  94. Backpropagation through the convolution, weight gradient: ∂L/∂w_kl = Σ_i Σ_j (∂L/∂y_ij) x_{(i+k)(j+l)}. For every input/output pair a given weight participated in, take the sum of the product of the output-side gradient and the input.
  95. Backpropagation through the convolution, input gradient: ∂L/∂x_ij = Σ_k^M Σ_l^N w_kl ∂L/∂y_{(i−k)(j−l)}: convolve the output-side gradient rotated by 180 degrees.
  96. GLSL computing ∂L/∂w_kl:
     void main() {
       const uint filter_index = gl_GlobalInvocationID.x;
       const uint filter_x = filter_index % filter_width;
       const uint filter_y = filter_index / filter_width % filter_height;
       const uint channel = filter_index / filter_width / filter_height % channels;
       const uint filter_size = filter_width * filter_height * channels;
       const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
       const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
       bool filter_oob = filter_index >= filter_size;
       float sum = 0.0;
       for( int data_index = 0; data_index != batch_size; ++data_index ) {
         for( int output_x = 0; output_x != output_width; ++output_x ) {
           for( int output_y = 0; output_y != output_height; ++output_y ) {
             const int output_index = int( output_x ) + int( output_y ) * int( output_width ) +
               int( channel ) * int( output_width * output_height ) +
               data_index * int( output_width * output_height * channels );
             const int input_x = output_x * int(filter_xstride) - int(xmargin) + int(filter_x);
             const int input_y = output_y * int(filter_ystride) - int(ymargin) + int(filter_y);
             const bool input_oob = filter_oob || input_x < 0 || input_x >= input_width || input_y < 0 || input_y >= input_height;
             const int input_index = int( input_x ) + int( input_y ) * int( input_width ) +
               int( channel ) * int( input_width * input_height ) +
               data_index * int( input_width * input_height * channels );
             const float grad = filter_oob ? 0.0 : output_grad[ output_index ];
             const float x = input_oob ? 0.0 : input_data[ input_index ];
             sum += grad * x;
           }
         }
       }
       if( !filter_oob ) adam( weight[ filter_index ], sum );
     }
  97. GLSL computing ∂L/∂x_ij:
     void main() {
       const uint input_x = gl_GlobalInvocationID.x % output_width;
       const uint input_y = gl_GlobalInvocationID.x / output_width % output_height;
       const uint channel = gl_GlobalInvocationID.x / output_width / output_height;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
       const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
       const uint input_size = input_width * input_height * channels;
       const uint relative_input_index = input_x + input_y * input_width + channel * input_width * input_height;
       const uint input_index = relative_input_index + data_index * input_width * input_height * channels;
       if( relative_input_index < input_size ) input_grad[ input_index ] = 0.0;
       for( int x = 0; x != filter_width; ++x ) {
         for( int y = 0; y != filter_height; ++y ) {
           const int output_x = int(input_x) * int(filter_xstride) - int(xmargin) - x;
           const int output_y = int(input_y) * int(filter_ystride) - int(ymargin) - y;
           const bool oob = output_x < 0 || output_x >= output_width || output_y < 0 || output_y >= output_height;
           const int relative_output_index = output_x + output_y * int(output_width) + int(channel) * int(output_width * output_height);
           const int output_index = relative_output_index + int(data_index) * int(output_width * output_height * channels );
           const uint filter_index = x + y * int(filter_width) + channel * int(filter_width * filter_height );
           if( relative_input_index < input_size ) {
             if( !oob ) {
               const float grad = output_grad[ output_index ] * weight[ filter_index ].x;
               input_grad[ input_index ] += grad;
             }
           }
         }
       }
     }
  98. MaxPooling: keep only the one element whose value was the largest within the window. Input x, output y.
  99. MaxPooling: keep only the one element whose value was the largest within the window. For the window (3 6 1; 2 2 0; 5 9 6), the output is 9.
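Non-overlapping M × N max pooling can be sketched as follows (a minimal CPU sketch; names illustrative, and the input size is assumed divisible by the window size):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Img = std::vector<std::vector<double>>;

// M x N max pooling: y[i][j] = max over the M x N window of x
// starting at (i * m, j * n).
Img max_pool(const Img& x, std::size_t m, std::size_t n) {
  const std::size_t oh = x.size() / m, ow = x[0].size() / n;
  Img y(oh, std::vector<double>(ow));
  for (std::size_t i = 0; i < oh; ++i)
    for (std::size_t j = 0; j < ow; ++j) {
      double best = x[i * m][j * n];
      for (std::size_t k = 0; k < m; ++k)
        for (std::size_t l = 0; l < n; ++l)
          best = std::max(best, x[i * m + k][j * n + l]);
      y[i][j] = best;  // only the window maximum survives
    }
  return y;
}
```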
  100. MaxPooling in GLSL. With filter size M × N: y_ij = max(x_{(Mi+k)(Nj+l)}) for k ∈ [0,M], l ∈ [0,N]:
     void main() {
       const uint relative_output_index = gl_GlobalInvocationID.x;
       const uint output_x = relative_output_index % output_width;
       const uint output_y = relative_output_index / output_width % output_height;
       const uint channel = relative_output_index / output_width / output_height;
       const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
       const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint output_size = output_width * output_height * channels;
       const uint input_size = input_width * input_height * channels;
       const uint output_index = relative_output_index + data_index * output_size;
       if( relative_output_index < output_size ) output_data[ output_index ] = 0.0;
       for( uint x = 0; x != filter_width; ++x ) {
         for( uint y = 0; y != filter_height; ++y ) {
           const uint input_x = x + output_x * filter_xstride;
           const uint input_y = y + output_y * filter_ystride;
           const uint input_index = input_x + input_y * input_width + channel * input_width * input_height +
             data_index * input_width * input_height * channels;
           if( relative_output_index < output_size )
             output_data[ output_index ] = max( output_data[ output_index ], input_data[ input_index ] );
         }
       }
     }
  101. Backpropagation through MaxPooling: ∂L/∂x_{(Mi+k)(Nj+l)} = ∂L/∂y_ij where x_{(Mi+k)(Nj+l)} = y_ij, and 0 where x_{(Mi+k)(Nj+l)} ≠ y_ij: the gradient corresponding to the input that supplied the maximum takes the value of the output-side gradient.
  102. Backpropagation through MaxPooling in GLSL:
     void main() {
       const uint relative_output_index = gl_GlobalInvocationID.x;
       const uint output_x = relative_output_index % output_width;
       const uint output_y = relative_output_index / output_width % output_height;
       const uint channel = relative_output_index / output_width / output_height;
       const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
       const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint output_size = output_width * output_height * channels;
       const uint input_size = input_width * input_height * channels;
       const uint output_index = relative_output_index + data_index * output_size;
       const uint initial_input_x = output_x * filter_xstride;
       const uint initial_input_y = output_y * filter_ystride;
       for( uint x = 0; x != filter_width; ++x ) {
         for( uint y = 0; y != filter_height; ++y ) {
           const uint input_x = x + output_x * filter_xstride;
           const uint input_y = y + output_y * filter_ystride;
           const uint input_index = input_x + input_y * input_width + channel * input_width * input_height +
             data_index * input_width * input_height * channels;
           if( relative_output_index < output_size )
             input_grad[ input_index ] = ( input_data[ input_index ] == output_data[ output_index ] ) ?
               output_grad[ output_index ] : 0.0;
         }
       }
     }
  103. NEW! Add convolution + φ, convolution + φ, and MaxPooling in front of the two matrix-product layers and the loss function.
  104. Evaluation accuracy around 90%, a small improvement. Batch size 64, hidden-layer width 128, convolution channels 14:14.

  105. NEW! Add another pair of convolution + φ layers and another MaxPooling in front of the previous ones.
  106. This is terrible. Batch size 64, hidden-layer width 128, convolution channels 32:32:64:64.

  107. In this network φ is ReLU everywhere except the final layer, which alone uses Hyperbolic Tangent.
  108. ReLU can take arbitrarily large positive values. While training is wobbling, a huge value like 14326.7 gets rammed into tanh, which wants −1 ≤ x ≤ 1.
  109. 14326.7 is so far off that changing the value a little is still hopeless, so the gradient ∂L/∂x_i = (∂L/∂y_i)(1 − tanh²(x_i)) is 0. The gradient vanishes and learning stops; the network has no idea what to do.
  110. Setting the learning rate of the convolution layers to 1/10 improved the accuracy, though training became slow. Batch size 64, hidden-layer width 128, convolution channels 32:32:64:64.

  111. Techniques that keep the distribution of a layer's outputs constant: Batch Normalization, Layer Normalization, Group Normalization. Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 2015. https://arxiv.org/abs/1502.03167 — Jimmy Lei Ba, Jamie Ryan Kiros and Geoffrey E. Hinton. Layer Normalization. arXiv:1607.06450. 2016. https://arxiv.org/abs/1607.06450 — Yuxin Wu and Kaiming He. Group Normalization. 2018. https://arxiv.org/abs/1803.08494
  112. Bonus: TensorCore, a horizontal matrix multiply-add. Using 32 threads, a single instruction computes AB + C for small fixed-size matrices (the slide lists shapes of 16 × 16, 8 × 16 and 16 × 8 elements for A, B and C). On NVIDIA GPUs supporting the VK_NV_cooperative_matrix extension, this can be used even from Vulkan.
  113. So why didn't you use TensorCores? The GeForce GTX 1070 that happened to be lying around at home has no TensorCores.

  114. Conclusion: deep learning is an algorithm, so if you implement it, it runs anywhere.