# Deep Learning for Low-Level Programmers

Explains how to implement a convolutional neural network in Vulkan without relying on frameworks.
These are the slides for a talk given at the 15th カーネル／VM探検隊 (Kernel/VM Explorers) meetup held on July 20, 2019.

July 20, 2019

## Transcript

3. ### Formal neuron — each input x0 x1 x2 x3 x4 is multiplied by its weight (× w02, × w12, × w22, × w32, × w42, …), the products are summed (∑), and the sum is passed through φ to produce the output y2.
4. ### y_j = φ (∑_i w_ij x_i) — a layer: every output y0 … y4 applies the same multiply-sum-φ pattern, each with its own weights, to the shared inputs x0 … x4.
5. ### Layers — stacking layers x0i → x1i → x2i → x3i → y, connected by weights w0 w1 w2 w3. A stack of infinitely many layers can represent an arbitrary function, depending on its weights; even a finite stack can approximate a wide variety of functions. This is a neural network.

7. ### It becomes a problem of finding the weights — treat the inputs (here 2, 7, 4) as constants, and adjust w so that the outputs y0 y1 y2 come as close as possible to the correct outputs.
8. ### Training — feed x0 x1 x2 in, compare the outputs y0 y1 y2 with the targets t0 t1 t2, and correct w. If a large number of input/target pairs is available, repeating this operation makes the neural network an approximation of the function relating x to t: a function obtained from data.
9. ### Deep learning — a long neural network. A longer network can approximate more complex functions but is harder to train, so deep networks were long considered impractical. Research over roughly the last decade made training long networks feasible, triggering the current boom.

12. ### Frameworks appeared that abstract away whether the computation runs on CPU, GPU, or FPGA: TensorFlow, Chainer, Caffe, PyTorch, Theano, … Keras sits at an even higher layer on top of them.
13. ### They are huge:

```
$ du -h tensorflow-1.12.3/ ... 145M tensorflow-1.12.3/
$ du -h pytorch-1.1.0/     ...  44M pytorch-1.1.0/
$ du -h chainer-6.1.0/     ...  22M chainer-6.1.0/
$ du -h keras-2.2.4/       ... 3.0M keras-2.2.4/
```

19. ### Hidden layer and output layer — the most basic network, built from fully connected layers mapping x0 … x4 to y0 … y4.
20. ### Create the Vulkan instance, then pick the GPU to use from the available ones:

```cpp
if( config.validation ) layers.emplace_back( "VK_LAYER_LUNARG_standard_validation" );
const auto app_info = vk::ApplicationInfo( config.prog_name.c_str(), /* omitted */, VK_API_VERSION_1_1 );
instance_ptr_t instance( new vk::Instance( vk::createInstance(
  vk::InstanceCreateInfo()
    .setPApplicationInfo( &app_info )
    .setEnabledExtensionCount( ext.size() ).setPpEnabledExtensionNames( ext.data() )
    .setEnabledLayerCount( layers.size() ).setPpEnabledLayerNames( layers.data() )
) ) );
auto devices = instance->enumeratePhysicalDevices();
if( devices.empty() ) throw device_is_not_available();
devices.erase( std::remove_if( devices.begin(), devices.end(), [&]( const auto &d ) -> bool {
  auto avail_dext = d.enumerateDeviceExtensionProperties();
  for( const char *w: dext )
    if( std::find_if( avail_dext.begin(), avail_dext.end(),
        [&]( const auto &v ) { return !strcmp( v.extensionName, w ); } ) == avail_dext.end() ) return true;
  const auto avail_dlayers = d.enumerateDeviceLayerProperties();
  for( const char *w: dlayers )
    if( std::find_if( avail_dlayers.begin(), avail_dlayers.end(),
        [&]( const auto &v ) { return !strcmp( v.layerName, w ); } ) == avail_dlayers.end() ) return true;
  return false;
} ), devices.end() );
if( devices.empty() ) throw required_extensions_or_layers_are_not_available();
```
21. ### Create the logical device, queue, and command pool:

```cpp
const auto queue_props = physical_device.getQueueFamilyProperties();
uint32_t queue_index = std::distance( queue_props.begin(), std::find_if(
  queue_props.begin(), queue_props.end(), []( const auto &v ) {
    return bool( v.queueFlags & vk::QueueFlagBits::eCompute ) &&
           bool( v.queueFlags & vk::QueueFlagBits::eTransfer );
  } ) );
if( queue_index == queue_props.size() ) throw required_queue_is_not_available();
const float priority = 0.0f;
std::vector< vk::DeviceQueueCreateInfo > queues{};
const auto queue_create_info = vk::DeviceQueueCreateInfo()
  .setQueueFamilyIndex( queue_index ).setQueueCount( 1 ).setPQueuePriorities( &priority );
const auto features = physical_device.getFeatures();
auto device = physical_device.createDevice( vk::DeviceCreateInfo()
  .setQueueCreateInfoCount( 1 ).setPQueueCreateInfos( &queue_create_info )
  .setEnabledExtensionCount( dext.size() ).setPpEnabledExtensionNames( dext.data() )
  .setEnabledLayerCount( dlayers.size() ).setPpEnabledLayerNames( dlayers.data() )
  .setPEnabledFeatures( &features ) );
std::shared_ptr< vk::Device > d( new vk::Device( std::move( device ) ),
  []( const auto &p ) { if( p ) { p->destroy(); delete p; } } );
auto queue = device.getQueue( queue_index, 0 );
auto command_pool = device.createCommandPool( vk::CommandPoolCreateInfo()
  .setQueueFamilyIndex( queue_index ).setFlags( vk::CommandPoolCreateFlagBits::eResetCommandBuffer ) );
std::shared_ptr< vk::Queue > q( new vk::Queue( std::move( queue ) ), [d]( const auto& ) {} );
std::shared_ptr< vk::CommandPool > p( new vk::CommandPool( std::move( command_pool ) ),
  [d]( const vk::CommandPool *p ) { if( p ) { d->destroyCommandPool( *p ); delete p; } } );
```
22. ### When the input is an N-element vector and the output an M-element vector, regard the weights as the matrix

w = ( w00 w01 ⋯ w0M / w10 w11 ⋯ w1M / ⋮ ⋮ ⋮ / wN0 wN1 ⋯ wNM )

so that y = φ (wx). One layer's computation is then: (1) take the product of the vector and the matrix, and (2) apply φ to each element of the result.
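The layer computation above can be sketched on the CPU for reference. This is an illustrative sketch only: the fixed sizes N and M and the use of tanh as φ are assumptions, not part of the deck's Vulkan implementation.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// One fully connected layer: y = phi( w x ), with an N-element input,
// an M-element output, and tanh standing in for phi.
template< std::size_t N, std::size_t M >
std::array< float, M > forward(
  const std::array< std::array< float, N >, M > &w,
  const std::array< float, N > &x
) {
  std::array< float, M > y{};
  for( std::size_t j = 0; j != M; ++j ) {
    float sum = 0.0f;
    for( std::size_t i = 0; i != N; ++i )
      sum += w[ j ][ i ] * x[ i ];  // sum_i w_ij x_i
    y[ j ] = std::tanh( sum );      // apply phi element-wise
  }
  return y;
}
```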
23. ### Mini-batch — bundle several input vectors (x0 x1 x2, x3 x4 x5, x6 …) into a matrix, compute the error per bundle, and only then correct w. Training is more stable than when w is corrected for every input individually.
24. ### With a mini-batch, one layer becomes: (1) take the product of a matrix and a matrix, and (2) apply φ to each element of the result. x0 x1 ⋮ xb are the input vectors, y0 y1 ⋮ yb the output vectors, and b is the batch size.
25. ### The whole network: inputs x0 x1 ⋮ xb → product (weights wh) → φ → product (weights wo) → φ → loss function (with targets t0 t1 ⋮ tb) → L.
26. ### Data flow: input → hidden-layer product (hidden-layer weights) → hidden-layer output → output-layer product (output-layer weights) → output-layer output → loss function (with the desired output) → error.
27. ### Allocate the GPU memory:

```cpp
hidden_weight.reset( new liblnn::buffer< glm::vec4 >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( input_width * hidden_width * sizeof( glm::vec4 ) ).setUsage( copyable ) ) );
output_weight.reset( new liblnn::buffer< glm::vec4 >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( hidden_width * output_width * sizeof( glm::vec4 ) ).setUsage( copyable ) ) );
hidden_affine_output.reset( new liblnn::buffer< float >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
hidden_relu_output.reset( new liblnn::buffer< float >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
output_affine_output.reset( new liblnn::buffer< float >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
output_relu_output.reset( new liblnn::buffer< float >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
softmax_grad.reset( new liblnn::buffer< float >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
```
28. ### Product — compute the product of the input matrix x and the weight matrix w on the GPU (input, weights → product → output).

32. ### Matrix product — (x00 x01 x02 / x10 x11 x12 / x20 x21 x22)(w00 w01 w02 / w10 w11 w12 / w20 w21 w22) = the matrix of ∑_i x0i wi0, ∑_i x0i wi1, …, ∑_i x2i wi2. Clearly the computation parallelizes at least up to the number of elements of the output matrix. Since ∑_i x0i wi0 = x00 w00 + x01 w10 + x02 w20, parallelizing the inside of the ∑ as well raises the thread count further, but then the question is how to take the ∑ of values spread across threads.
33. ### SRAM — a GPU has SRAM that can be synchronized across threads: thread A writes to SRAM, both synchronize, thread B reads from SRAM. Threads in the same WorkGroup can hand values to each other this way; the synchronization stalls until every thread of the WorkGroup reaches it. This SRAM is called SharedMemory in Vulkan terms, and also SharedMemory in NVIDIA terms. The bundle of threads that can share the SRAM is called a WorkGroup in Vulkan terms, a Block in NVIDIA terms.
34. ### ∑ on a classical GPU — log2 (n) additions and synchronizations yield the ∑. For x0 x1 x2 x3: round 1, S0 := x0 + x1 and S1 := x2 + x3 (in SRAM); synchronize; round 2, S0 := S0 + S1 = x0 + x1 + x2 + x3.
35. ### Horizontal add — newer GPUs can perform horizontal operations within a Subgroup: from lanes A B C D holding x0 x1 x2 x3, every lane obtains x0 + x1 + x2 + x3. Called Subgroup operations in Vulkan terms, Warp Shuffle in NVIDIA terms.
36. ### ∑ on a modern GPU — when the Subgroup size is 32, log32 (n) horizontal additions and synchronizations yield the ∑: horizontally add x0 … x31 and x32 … x63 and so on, synchronize, then horizontally add the partial sums to obtain e.g. ∑_{i=0}^{64} xi.
37. ### GPU hierarchy — Subgroup: shares a program counter. Workgroup: shares SRAM and can run concurrently. Dispatch: shares VRAM and can run concurrently. On a GeForce GTX 1070: 32 threads per Subgroup, 4 physical Subgroups, 48 logical Subgroups, 60 physical Workgroups, 2^64 logical Workgroups.
38. ### ∑ in GLSL — horizontally add and write the result to SharedMemory; synchronize; horizontally add the SharedMemory values, write them back, and synchronize again; once SharedMemory is down to one element, return that value:

```glsl
shared float local_sum[ local_memory_size ];
float large_sum( in float value ) {
  float sg_sum = subgroupAdd( value );
  local_sum[ gl_SubgroupID ] = sg_sum;
  barrier();
  uint len = gl_NumSubgroups;
  while( len > 1 ) {
    uint index = gl_SubgroupInvocationID + gl_SubgroupID * gl_SubgroupSize;
    float sum = subgroupAdd( index < len ? local_sum[ index ] : 0.0 );
    local_sum[ gl_SubgroupID ] = sum;
    barrier();
    len /= gl_SubgroupSize;
  }
  barrier();
  return local_sum[ 0 ];
}
```
39. ### Matrix product in GLSL — arrange the threads whose values must be summed into the same WorkGroup, then write the products of input-matrix values and weight-matrix values into the output matrix:

```glsl
void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint output_index = gl_GlobalInvocationID.y;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
  output_data[ output_index + data_index * output_width ] = 0.0;
  for( uint offset = 0; offset < width; offset += input_width ) {
    float value = ( offset + input_index ) < width ?
      input_data[ offset + input_index + data_index * width ] *
      weight[ output_index + ( offset + input_index ) * output_width ].x : 0.0;
    output_data[ output_index + data_index * output_width ] += large_sum( value );
  }
}
```
40. ### φ — the activation function. If a layer were only a matrix product, stacking any number of layers would preserve linearity; in other words, only linear functions could ever be approximated. Inserting a nonlinear function between the matrix products breaks the linearity and makes approximating nonlinear functions possible.

43. ### Activation functions can be computed independently for each element of the input/output matrices, so they parallelize up to the number of output-matrix elements (= number of input-matrix elements). Hyperbolic Tangent: y_ij = tanh (x_ij). Rectified Linear Unit: y_ij = x_ij if x_ij ≥ 0, 0 if x_ij < 0.
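Both activations are trivially element-wise, which is why every element can go to its own thread. A minimal CPU sketch for reference (the function names are made up here, not from the deck):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Element-wise activations: each output depends only on the matching
// input element, so all elements could run in parallel.
std::vector< float > tanh_all( const std::vector< float > &x ) {
  std::vector< float > y( x.size() );
  std::transform( x.begin(), x.end(), y.begin(),
                  []( float v ) { return std::tanh( v ); } );
  return y;
}
std::vector< float > relu_all( const std::vector< float > &x ) {
  std::vector< float > y( x.size() );
  std::transform( x.begin(), x.end(), y.begin(),
                  []( float v ) { return std::max( v, 0.0f ); } );
  return y;
}
```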
44. ### Hyperbolic Tangent and Rectified Linear Unit in GLSL:

```glsl
// Hyperbolic Tangent
void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  for( uint offset = 0; offset < width; offset += input_width ) {
    if( ( offset + input_index ) < width )
      output_data[ offset + input_index + data_index * width ] =
        tanh( input_data[ offset + input_index + data_index * width ] );
  }
}

// Rectified Linear Unit
void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  for( uint offset = 0; offset < width; offset += input_width ) {
    if( ( offset + input_index ) < width )
      output_data[ offset + input_index + data_index * width ] =
        max( 0, input_data[ offset + input_index + data_index * width ] );
  }
}
```
45. ### Loss function — a function whose output gets smaller the more the obtained output resembles the desired output. Attaching it at the end turns the search for suitable weights into an optimization problem.
46. ### Output format — desired output t = 0 0 1 0 0; network output y = 0.8 0.000007 0.9 0.036 0.00005. It looks like either class 0 or class 2; the correct answer is class 2.
47. ### softmax — y_i = e^{x_i} / ∑_j e^{x_j}. For y = 0.8 0.000007 0.9 0.036 0.00005, softmax (y) = 0.288213 0.129503 0.318524 0.134249 0.129509.
48. ### Cross-entropy loss — L = − ∑_i t_i log (y_i) with y_i = e^{x_i} / ∑_j e^{x_j}. When t = 1 and y = 1, l = 0; when t = 1 and y = 0.01, l = 4.605; when t = 0, l = 0. For the softmax result y_i to approach 1, the other elements of y must approach 0. In short: the more y resembles t, the smaller L.
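The two formulas can be checked on the CPU. A minimal sketch for reference, separate from the GLSL version that follows:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// y_i = exp(x_i) / sum_j exp(x_j)
std::vector< float > softmax( const std::vector< float > &x ) {
  float sum = 0.0f;
  for( float v: x ) sum += std::exp( v );
  std::vector< float > y( x.size() );
  for( std::size_t i = 0; i != x.size(); ++i ) y[ i ] = std::exp( x[ i ] ) / sum;
  return y;
}

// L = -sum_i t_i log(y_i)
float cross_entropy( const std::vector< float > &y, const std::vector< float > &t ) {
  float l = 0.0f;
  for( std::size_t i = 0; i != y.size(); ++i ) l -= t[ i ] * std::log( y[ i ] );
  return l;
}
```

With t = 1 and y = 0.01 this reproduces the −log(0.01) ≈ 4.605 from the slide.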
49. ### Softmax with Cross Entropy Loss in GLSL — a y too close to 0 produces inf, and ∑_j e^{x_j} = 0 produces nan or inf, hence the 1.0e-10 guards:

```glsl
void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  float value1 = input_index < width ? exp( input_data[ input_index + data_index * width ] ) : 0.0;
  float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
  float t = teacher_data[ input_index + data_index * width ];
  float y_ = max( y, 1.0e-10 );
  float value2 = input_index < width ? t * log( y_ ) : 0.0;
  float l = -large_sum( float( value2 ) );
  if( input_index == 0 ) output_data[ data_index ] = l;
}
```
50. ### In the network (product φ with wh, product φ with wo, loss function with t0 t1 ⋮ tb), regard x and t as constants and search for the wh and wo that minimize L — via partial derivatives with respect to w.
52. ### wo — L can be regarded as a composite of three chained functions, product → φ → loss, with intermediate values c and d. By the chain rule for composite functions, df/dx = (df/dg)(dg/dx), so ∂L/∂wo = (∂L/∂d)(∂d/∂c)(∂c/∂wo).
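The chain rule can be verified numerically on a toy composite. This sketch is illustrative only: f(g) = g² and g(x) = sin(x) are chosen here for the check, not taken from the deck.

```cpp
#include <cassert>
#include <cmath>

float g( float x ) { return std::sin( x ); }
float f( float x ) { return g( x ) * g( x ); }  // f(g(x)) = sin^2(x)

// df/dx via the chain rule: (df/dg)(dg/dx)
float analytic_df( float x ) {
  const float dfdg = 2.0f * g( x );  // df/dg = 2g
  const float dgdx = std::cos( x );  // dg/dx = cos(x)
  return dfdg * dgdx;
}

// df/dx via a central finite difference, for comparison
float numeric_df( float x ) {
  const float h = 1e-3f;
  return ( f( x + h ) - f( x - h ) ) / ( 2.0f * h );
}
```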
53. ### Likewise for wh, with intermediates a, b, c, d along product → φ → product → φ → loss: ∂L/∂wh = (∂L/∂d)(∂d/∂c)(∂c/∂b)(∂b/∂a)(∂a/∂wh).
54. ### Error backpropagation — if the derivative of each layer's output with respect to its input is available, ∂L/∂w can be obtained layer by layer from the back: first ∂L/∂d, then (∂L/∂d)(∂d/∂c), then (∂L/∂d)(∂d/∂c)(∂c/∂b), then (∂L/∂d)(∂d/∂c)(∂c/∂b)(∂b/∂a), and finally (∂L/∂d)(∂d/∂c)(∂c/∂b)(∂b/∂a)(∂a/∂wh).
55. ### Data flow with gradients — forward: input → hidden-layer product (hidden-layer weights) → hidden-layer output → output-layer product (output-layer weights) → output-layer output → loss function (desired output) → error. Backward: gradient of the loss function → gradient of the activation function → gradient of the matrix product → gradient of the activation function → gradient of the matrix product.
56. ### Backpropagation through the loss function — for L = − ∑_i t_i log (y_i) and y_i = e^{x_i} / ∑_j e^{x_j}: ∂L/∂y_i = − t_i / y_i, and ∂y_i/∂x_k = y_i (1 − y_i) if i = k, −y_i y_k if i ≠ k.
57. ### Combining them: ∂L/∂x_i = (∂L/∂y_i)(∂y_i/∂x_i) + ∑_{k≠i} (∂L/∂y_k)(∂y_k/∂x_i) = −t_i (1 − y_i) + ∑_{k≠i} t_k y_i, using ∂L/∂y_i = − t_i / y_i and ∂y_i/∂x_k = y_i (1 − y_i) if i = k, −y_i y_k if i ≠ k.
58. ### Since the desired outputs sum to 1: ∂L/∂x_i = −t_i (1 − y_i) + ∑_{k≠i} t_k y_i = −t_i + y_i ∑_k t_k = y_i − t_i.
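The clean result y_i − t_i can be checked against a finite difference of the combined softmax + cross-entropy loss. An illustrative CPU sketch, not part of the GPU code:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// L = -sum_i t_i log( softmax(x)_i )
float loss( const std::vector< float > &x, const std::vector< float > &t ) {
  float sum = 0.0f;
  for( float v: x ) sum += std::exp( v );
  float l = 0.0f;
  for( std::size_t i = 0; i != x.size(); ++i )
    l -= t[ i ] * std::log( std::exp( x[ i ] ) / sum );
  return l;
}

// The derived gradient: dL/dx_i = y_i - t_i
float grad_analytic( const std::vector< float > &x, const std::vector< float > &t, std::size_t i ) {
  float sum = 0.0f;
  for( float v: x ) sum += std::exp( v );
  return std::exp( x[ i ] ) / sum - t[ i ];
}

// Central finite difference of L with respect to x_i, for comparison
float grad_numeric( std::vector< float > x, const std::vector< float > &t, std::size_t i ) {
  const float h = 1e-3f;
  auto xp = x; xp[ i ] += h;
  auto xm = x; xm[ i ] -= h;
  return ( loss( xp, t ) - loss( xm, t ) ) / ( 2.0f * h );
}
```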
59. ### At the end of the softmax GLSL, output the gradient y_i − t_i as well:

```glsl
void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  float value1 = input_index < width ? exp( input_data[ input_index + data_index * width ] ) : 0.0;
  float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
  float t = teacher_data[ input_index + data_index * width ];
  float y_ = max( y, 1.0e-10 );
  float value2 = input_index < width ? t * log( y_ ) : 0.0;
  float l = -large_sum( float( value2 ) );
  if( input_index == 0 ) output_data[ data_index ] = l;
  if( input_index < width ) input_grad[ input_index + data_index * width ] = float( y - t );
}
```
60. ### If tanh is used as the final-layer activation, it emits values in −1 ≤ x ≤ 1, while the loss function expects 0 ≤ x, so rescale to match: s_i = x_i / 2 + 1/2, y_i = e^{s_i} / ∑_j e^{s_j}, L = − ∑_i t_i log (y_i).
61. ### Then ∂L/∂s_i = y_i − t_i and ∂s_i/∂x_i = 1/2, so ∂L/∂x_i = (∂L/∂s_i)(∂s_i/∂x_i) = (y_i − t_i) / 2:

```glsl
void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  float value1 = input_index < width ?
    exp( input_data[ input_index + data_index * width ] * 0.5 + 0.5 ) : 0.0;
  float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
  float t = teacher_data[ input_index + data_index * width ];
  float y_ = max( y, 1.0e-10 );
  float value2 = input_index < width ? t * log( y_ ) : 0.0;
  float l = -large_sum( float( value2 ) );
  if( input_index == 0 ) output_data[ data_index ] = l;
  if( input_index < width ) input_grad[ input_index + data_index * width ] = float( y - t ) * 0.5;
}
```
62. ### Backpropagation through Hyperbolic Tangent — y_i = tanh (x_i) gives ∂y_i/∂x_i = 1 − tanh² (x_i), so given the output-side gradient ∂L/∂y_i: ∂L/∂x_i = (∂L/∂y_i)(∂y_i/∂x_i) = (∂L/∂y_i)(1 − tanh² (x_i)).
63. ### Backpropagation through Hyperbolic Tangent in GLSL — ∂L/∂x_i = (∂L/∂y_i)(1 − tanh² (x_i)):

```glsl
void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  for( uint offset = 0; offset < width; offset += input_width ) {
    if( ( offset + input_index ) < width )
      input_grad[ offset + input_index + data_index * width ] =
        ( 1 - pow( tanh( input_data[ offset + input_index + data_index * width ] ), 2 ) ) *
        output_grad[ offset + input_index + data_index * width ];
  }
}
```
64. ### Backpropagation through Rectified Linear Unit — y_i = x_i if x_i ≥ 0, 0 if x_i < 0, so ∂y_i/∂x_i = 1 if x_i ≥ 0, 0 if x_i < 0, and therefore ∂L/∂x_i = (∂L/∂y_i)(∂y_i/∂x_i) = ∂L/∂y_i if x_i ≥ 0, 0 if x_i < 0.
65. ### Backpropagation through Rectified Linear Unit in GLSL:

```glsl
void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  for( uint offset = 0; offset < width; offset += input_width ) {
    if( ( offset + input_index ) < width )
      input_grad[ offset + input_index + data_index * width ] =
        input_data[ offset + input_index + data_index * width ] >= 0 ?
          output_grad[ offset + input_index + data_index * width ] : 0.0;
  }
}
```

67. ### In the ReLU paper, y_i = log (1 + exp (x_i)) is approximated by y_i = x_i if x_i ≥ 0, 0 if x_i < 0. — Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning (ICML'10), Johannes Fürnkranz and Thorsten Joachims (Eds.). Omnipress, USA, 807-814.
68. ### Accordingly, its derivative ∂y_i/∂x_i = exp (x_i) / (1 + exp (x_i)) is approximated by ∂y_i/∂x_i = 1 if x_i ≥ 0, 0 if x_i < 0.
69. ### Backpropagation through the matrix product — given the output-side gradient ∂L/∂y_j, two things must be computed: the gradient of w and the gradient of x. Start with the gradient of w (the red box): from y_j = ∑_i w_ij x_i we get ∂y_j/∂w_ij = x_i, so ∂L/∂w_ij = (∂L/∂y_j) x_i — multiply the gradient of the output the weight contributed to by the input the weight was applied to.
70. ### Next, the gradient of x (the green box): from y_j = ∑_i w_ij x_i we get ∂y_j/∂x_i = w_ij, so ∂L/∂x_i = ∑_j (∂L/∂y_j) w_ij — the sum, over all outputs the input influenced, of the output gradient times the weight for that input/output pair.
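Both gradients of the fully connected layer fit in a few lines on the CPU. A reference sketch, with a row-major layout w[ i * m + j ] chosen here for illustration:

```cpp
#include <cassert>
#include <vector>

// Backprop through y_j = sum_i w_ij x_i:
//   dL/dw_ij = (dL/dy_j) * x_i
//   dL/dx_i  = sum_j (dL/dy_j) * w_ij
void affine_backward(
  const std::vector< float > &x,  // n inputs
  const std::vector< float > &w,  // n*m weights, w[ i * m + j ]
  const std::vector< float > &gy, // m output-side gradients dL/dy_j
  std::vector< float > &gw,       // out: n*m weight gradients
  std::vector< float > &gx        // out: n input-side gradients
) {
  const std::size_t n = x.size(), m = gy.size();
  gw.assign( n * m, 0.0f );
  gx.assign( n, 0.0f );
  for( std::size_t i = 0; i != n; ++i )
    for( std::size_t j = 0; j != m; ++j ) {
      gw[ i * m + j ] = gy[ j ] * x[ i ];  // gradient of the weight
      gx[ i ] += gy[ j ] * w[ i * m + j ]; // gradient of the input
    }
}
```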

72. ### Stochastic gradient descent — w_{t+1} = w_t − μ ∂L/∂w_t. Basically: move w a little along ∂L/∂w_t, in the direction that makes L smaller ("we are here" → "want to end up there").
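The update rule above, run on a toy 1-D loss for illustration. L(w) = (w − 3)² and the learning rate μ are assumptions made for this sketch, not part of the deck:

```cpp
#include <cassert>
#include <cmath>

// Gradient descent on L(w) = (w - 3)^2:  w <- w - mu * dL/dw
float sgd_minimize( float w, float mu, int steps ) {
  for( int t = 0; t != steps; ++t ) {
    const float grad = 2.0f * ( w - 3.0f ); // dL/dw
    w -= mu * grad;                         // small step downhill
  }
  return w;
}
```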
73. ### Stochastic gradient descent — with w_{t+1} = w_t − μ ∂L/∂w_t alone, the search trips over even a small bump in L and turns back ("we are here", "want to end up there").

75. ### Adam — g = ∂L/∂w_t; m_t = β1 m_{t−1} + (1 − β1) g; v_t = β2 v_{t−1} + (1 − β2) g²; m̂_t = m_t / (1 − β1^t); v̂_t = v_t / (1 − β2^t); w_{t+1} = w_t − α m̂_t / (√v̂_t + ϵ). Since α = 0.001, β1 = 0.9, β2 = 0.999 are the recommended values, use them as-is.

```glsl
void adam( inout vec4 weight, in float grad ) {
  weight.w += 1;
  float gt = grad;
  weight.y = beta1 * weight.y + ( 1 - beta1 ) * gt;
  weight.z = beta2 * weight.z + ( 1 - beta2 ) * gt * gt;
  float mhat = weight.y / ( 1 - pow( beta1, weight.w ) );
  float vhat = weight.z / ( 1 - pow( beta2, weight.w ) );
  weight.x -= alpha * mhat / ( sqrt( vhat ) + eps );
}
```

• Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980v9 https://arxiv.org/abs/1412.6980
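A CPU port of the same update, one weight at a time, can be stepped and inspected directly. The struct layout and eps = 1e-8 are assumptions of this sketch (the GLSL packs w, m, v, t into a vec4):

```cpp
#include <cassert>
#include <cmath>

// Adam state for one weight: value w, 1st moment m, 2nd moment v, step t.
struct adam_state { float w; float m = 0.0f; float v = 0.0f; float t = 0.0f; };

void adam_step( adam_state &s, float grad ) {
  const float alpha = 0.001f, beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f;
  s.t += 1.0f;
  s.m = beta1 * s.m + ( 1.0f - beta1 ) * grad;        // m_t
  s.v = beta2 * s.v + ( 1.0f - beta2 ) * grad * grad; // v_t
  const float mhat = s.m / ( 1.0f - std::pow( beta1, s.t ) );
  const float vhat = s.v / ( 1.0f - std::pow( beta2, s.t ) );
  s.w -= alpha * mhat / ( std::sqrt( vhat ) + eps );
}
```

On the first step the bias correction makes m̂ = g and v̂ = g², so the step size is almost exactly α regardless of the gradient's magnitude.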
76. ### Initial values — neurons with the same w behave identically. The gradients ∂L/∂w for the weights of neurons producing the same output coincide, so the weights receive identical updates and the neurons keep behaving identically forever. Therefore the initial weights of a neural network must not be uniform.
77. ### Xavier initialization — when a layer has n input elements, initialize w as w = (1/√n) randn() (normal random numbers). — Understanding the difficulty of training deep feedforward neural networks. Xavier Glorot and Yoshua Bengio. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR, p. 249-256, 2010. http://proceedings.mlr.press/v9/glorot10a.html
78. ### Generating random numbers on the GPU — most pseudo-random algorithms carry a state that is updated every time a number is emitted (G_t → 0.0801, G_{t+1} → 0.2926, G_{t+2} → 0.4342, G_{t+3} → 0.6978). Since the next number cannot be generated until the previous one has been, this does not scale.
79. ### Generating random numbers on the GPU — the mysterious uniform random generator handed down among video game developers for about ten years: with s = [12.9898 78.233], t = 43758.5453, and fract (x) = x − ⌊x⌋, define f (x) = fract (t sin (x ⋅ s)). For example f ([0.1 0.8]) = 0.7340 and f ([0.3 0.2]) = 0.1768. Since no generator state is carried from element to element, it scales.
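The same stateless hash translated to C++ for reference; every output depends only on its 2-D input, so every weight can be generated independently in parallel. (Exact output values differ between sin implementations, so none are asserted here.)

```cpp
#include <cassert>
#include <cmath>

// f(x, y) = fract( 43758.5453 * sin( 12.9898 x + 78.233 y ) )
float prand( float x, float y ) {
  const float v = std::sin( x * 12.9898f + y * 78.233f ) * 43758.5453f;
  return v - std::floor( v ); // fract(): always in [0, 1)
}
```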

81. ### Xavier initialization in GLSL — generate uniform random numbers, turn them into normal random numbers with the Box-Muller transform, and run the Xavier-init function with thread count = number of weight elements:

```glsl
float prand( vec2 i ) {
  return fract( sin( dot( i.xy, vec2( 12.9898, 78.233 ) ) ) * 43758.5453 );
}
const float PI = 3.1415926535897932384626433832795;
float boxmuller( vec2 i, float mu, float sigma ) {
  float x = 1 - prand( i );
  float y = prand( vec2( i.y, x ) );
  float n = prand( vec2( x, y * PI ) );
  float v = sqrt( -2.0 * log( x ) ) * cos( 2 * PI * n );
  return mu + sigma * v;
}
float xavier_init_value( vec2 i, uint n ) {
  float value = boxmuller( i, 0.0, 1.0 / sqrt( n ) );
  return value;
}
void main() {
  const uint x = gl_GlobalInvocationID.x;
  const uint y = gl_GlobalInvocationID.y;
  const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
  const uint index = x + y * width;
  weight[ index ] = vec4( xavier_init_value( vec2( float( x )/width, float( y )/height ), input_size ), 0, 0, 0 );
}
```
82. ### He initialization — when a layer has n input elements, initialize w as w = √(2/n) randn() (normal random numbers). When using ReLU, this is said to propagate the initial error better than Xavier initialization. — Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. 2015. https://arxiv.org/abs/1502.01852
83. ### He initialization in GLSL — identical except for the scale factor:

```glsl
float prand( vec2 i ) {
  return fract( sin( dot( i.xy, vec2( 12.9898, 78.233 ) ) ) * 43758.5453 );
}
const float PI = 3.1415926535897932384626433832795;
float boxmuller( vec2 i, float mu, float sigma ) {
  float x = 1 - prand( i );
  float y = prand( vec2( i.y, x ) );
  float n = prand( vec2( x, y * PI ) );
  float v = sqrt( -2.0 * log( x ) ) * cos( 2 * PI * n );
  return mu + sigma * v;
}
float he_init_value( vec2 i, uint n ) {
  float value = boxmuller( i, 0.0, sqrt( 2 ) / sqrt( n ) ); // this part differs
  return value;
}
void main() {
  const uint x = gl_GlobalInvocationID.x;
  const uint y = gl_GlobalInvocationID.y;
  const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
  const uint index = x + y * width;
  weight[ index ] = vec4( he_init_value( vec2( float( x )/width, float( y )/height ), input_size ), 0, 0, 0 );
}
```
84. ### Compile the GLSL, create the ComputePipelines, and assign the buffers:

```cpp
hidden_affine1.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool,
  pipeline_cache, props, batch_images[ 0 ], hidden_affine_output, hidden_weight, batch_size ) ) );
hidden_affine2.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool,
  pipeline_cache, props, batch_images[ 1 ], hidden_affine_output, hidden_weight, batch_size ) ) );
hidden_activation.reset( new layer( create_relu_forward_pipeline( device, mods, descriptor_pool,
  pipeline_cache, props, hidden_affine_output, hidden_activation_output ) ) );
output_affine.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool,
  pipeline_cache, props, hidden_activation_output, output_affine_output, output_weight, batch_size ) ) );
output_activation.reset( new layer( create_tanh_forward_pipeline( device, mods, descriptor_pool,
  pipeline_cache, props, output_affine_output, output_activation_output ) ) );
error1.reset( new layer( create_softmax_combined_pipeline( device, mods, descriptor_pool,
  pipeline_cache, props, output_activation_output, error_out, softmax_grad, batch_labels[ 0 ] ) ) );
error2.reset( new layer( create_softmax_combined_pipeline( device, mods, descriptor_pool,
  pipeline_cache, props, output_activation_output, error_out, softmax_grad, batch_labels[ 1 ] ) ) );
output_activation_backward.reset( new layer( create_tanh_backward_pipeline( device, mods, descriptor_pool,
  pipeline_cache, props, output_affine_output, output_activation_output, output_activation_grad, softmax_grad ) ) );
output_affine_backward.reset( new layer( create_affine_backward_pipeline( device, mods, descriptor_pool,
  pipeline_cache, props, hidden_activation_output, output_affine_output, /* … */
```
85. ### Execution — prepare two sets of input/target buffers so that training and the transfer of the next batch happen simultaneously; throw the command buffer's contents at the GPU:

```cpp
void network::exec() {
  ++swap_index;
  swap_index %= 2;
  queue->submit( vk::SubmitInfo()
    .setCommandBufferCount( 1 )
    .setPCommandBuffers( command_buffers->data() + swap_index ), vk::Fence() );
  fill( false, false );
  queue->waitIdle();
  if( debug ) {
    std::cout << "==============" << std::endl;
    check();
    print( *error_out, batch_size );
    print( *output_activation_output, batch_size );
    print_image( *batch_images[ swap_index ], train_input->get_image_width(), batch_size );
    print_label( *batch_labels[ swap_index ], batch_size );
    print_eval( *output_activation_output, batch_size );
  }
}
```

90. ### When the accuracy is not good enough, add layers. But each plain matrix-product layer added (product → φ) makes the number of weights w grow rapidly.

93. ### Convolution — with filter size M × N, stride 1, and no margin (padding): y_ij = ∑_{k=0}^{M} ∑_{l=0}^{N} w_kl x_(i+k)(j+l) (input → convolution with the filter → output).
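The formula above, written out directly on the CPU for a single channel. An illustrative sketch with row-major storage assumed:

```cpp
#include <cassert>
#include <vector>

// Direct 2-D convolution, filter size m x n, stride 1, no padding:
//   y_ij = sum_k sum_l w_kl * x_(i+k)(j+l)
std::vector< float > convolve(
  const std::vector< float > &x, std::size_t xw, std::size_t xh, // input and its size
  const std::vector< float > &w, std::size_t m, std::size_t n    // filter and its size
) {
  const std::size_t yw = xw - n + 1, yh = xh - m + 1;
  std::vector< float > y( yw * yh, 0.0f );
  for( std::size_t i = 0; i != yh; ++i )
    for( std::size_t j = 0; j != yw; ++j )
      for( std::size_t k = 0; k != m; ++k )
        for( std::size_t l = 0; l != n; ++l )
          y[ i * yw + j ] += w[ k * n + l ] * x[ ( i + k ) * xw + ( j + l ) ];
  return y;
}
```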
94. ### Backpropagation through the convolution (filter size M × N, stride 1, no padding) — the gradient of w, given the output-side gradient ∂L/∂y_ij: ∂L/∂w_kl = ∑_i ∑_j (∂L/∂y_ij) x_(i+k)(j+l) — over every input/output pair a given weight touched, take the sum of the output-side gradient times the input.
95. ### The gradient of x: ∂L/∂x_ij = ∑_k^M ∑_l^N w_kl ∂L/∂y_(i−k)(j−l) — convolve the output-side gradient rotated by 180 degrees.
96. ### GLSL that computes ∂L/∂w_kl:

```glsl
void main() {
  const uint filter_index = gl_GlobalInvocationID.x;
  const uint filter_x = filter_index % filter_width;
  const uint filter_y = filter_index / filter_width % filter_height;
  const uint channel = filter_index / filter_width / filter_height % channels;
  const uint filter_size = filter_width * filter_height * channels;
  const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
  const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
  bool filter_oob = filter_index >= filter_size;
  float sum = 0.0;
  for( int data_index = 0; data_index != batch_size; ++data_index ) {
    for( int output_x = 0; output_x != output_width; ++output_x ) {
      for( int output_y = 0; output_y != output_height; ++output_y ) {
        const int output_index = int( output_x ) + int( output_y ) * int( output_width ) +
          int( channel ) * int( output_width * output_height ) +
          data_index * int( output_width * output_height * channels );
        const int input_x = output_x * int(filter_xstride) - int(xmargin) + int(filter_x);
        const int input_y = output_y * int(filter_ystride) - int(ymargin) + int(filter_y);
        const bool input_oob = filter_oob || input_x < 0 || input_x >= input_width ||
          input_y < 0 || input_y >= input_height;
        const int input_index = int( input_x ) + int( input_y ) * int( input_width ) +
          int( channel ) * int( input_width * input_height ) +
          data_index * int( input_width * input_height * channels );
        const float grad = filter_oob ? 0.0 : output_grad[ output_index ];
        const float x = input_oob ? 0.0 : input_data[ input_index ];
        sum += grad * x;
      }
    }
  }
  if( !filter_oob ) adam( weight[ filter_index ], sum );
}
```
97. ### GLSL that computes ∂L/∂x_ij:

```glsl
void main() {
  const uint input_x = gl_GlobalInvocationID.x % output_width;
  const uint input_y = gl_GlobalInvocationID.x / output_width % output_height;
  const uint channel = gl_GlobalInvocationID.x / output_width / output_height;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
  const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
  const uint input_size = input_width * input_height * channels;
  const uint relative_input_index = input_x + input_y * input_width + channel * input_width * input_height;
  const uint input_index = relative_input_index + data_index * input_width * input_height * channels;
  if( relative_input_index < input_size ) input_grad[ input_index ] = 0.0;
  for( int x = 0; x != filter_width; ++x ) {
    for( int y = 0; y != filter_height; ++y ) {
      const int output_x = int(input_x) * int(filter_xstride) - int(xmargin) - x;
      const int output_y = int(input_y) * int(filter_ystride) - int(ymargin) - y;
      const bool oob = output_x < 0 || output_x >= output_width ||
        output_y < 0 || output_y >= output_height;
      const int relative_output_index = output_x + output_y * int(output_width) +
        int(channel) * int(output_width * output_height);
      const int output_index = relative_output_index +
        int(data_index) * int(output_width * output_height * channels );
      const uint filter_index = x + y * int(filter_width) + channel * int(filter_width * filter_height );
      if( relative_input_index < input_size ) {
        if( !oob ) {
          const float grad = output_grad[ output_index ] * weight[ filter_index ].x;
          input_grad[ input_index ] += grad;
        }
      }
    }
  }
}
```
98. ### MaxPooling — input x: 3 6 1 / 2 2 0 / 5 9 6. Only the single element with the largest value within the window is kept in the output y.
99. ### MaxPooling — for input x: 3 6 1 / 2 2 0 / 5 9 6, the output y is 9: only the single element with the largest value within the window survives.
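The same operation on the CPU for reference, with a stride equal to the window size and row-major storage assumed:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Max pooling: keep only the largest element of each m x n window.
std::vector< float > max_pool(
  const std::vector< float > &x, std::size_t xw, std::size_t xh, // input and its size
  std::size_t m, std::size_t n                                   // window size
) {
  const std::size_t yw = xw / n, yh = xh / m;
  std::vector< float > y( yw * yh );
  for( std::size_t i = 0; i != yh; ++i )
    for( std::size_t j = 0; j != yw; ++j ) {
      float best = x[ i * m * xw + j * n ];
      for( std::size_t k = 0; k != m; ++k )
        for( std::size_t l = 0; l != n; ++l )
          best = std::max( best, x[ ( i * m + k ) * xw + ( j * n + l ) ] );
      y[ i * yw + j ] = best;
    }
  return y;
}
```

Fed the slide's example, a 3×3 window over 3 6 1 / 2 2 0 / 5 9 6 yields the single value 9.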
100. ### MaxPooling in GLSL — with filter size M × N: y_ij = max (x_(Mi+k)(Nj+l)), k ∈ [0,M], l ∈ [0,N]:

```glsl
void main() {
  const uint relative_output_index = gl_GlobalInvocationID.x;
  const uint output_x = relative_output_index % output_width;
  const uint output_y = relative_output_index / output_width % output_height;
  const uint channel = relative_output_index / output_width / output_height;
  const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
  const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint output_size = output_width * output_height * channels;
  const uint input_size = input_width * input_height * channels;
  const uint output_index = relative_output_index + data_index * output_size;
  if( relative_output_index < output_size ) output_data[ output_index ] = 0.0;
  for( uint x = 0; x != filter_width; ++x ) {
    for( uint y = 0; y != filter_height; ++y ) {
      const uint input_x = x + output_x * filter_xstride;
      const uint input_y = y + output_y * filter_ystride;
      const uint input_index = input_x + input_y * input_width +
        channel * input_width * input_height +
        data_index * input_width * input_height * channels;
      if( relative_output_index < output_size )
        output_data[ output_index ] = max( output_data[ output_index ], input_data[ input_index ] );
    }
  }
}
```
101. ### Backpropagation through MaxPooling (filter size M × N) — for y_ij = max (x_(Mi+k)(Nj+l)), k ∈ [0,M], l ∈ [0,N]: ∂L/∂x_(Mi+k)(Nj+l) = ∂L/∂y_ij if x_(Mi+k)(Nj+l) = y_ij, 0 if x_(Mi+k)(Nj+l) ≠ y_ij — the gradient of the input that produced the maximum takes the value of the output-side gradient.
102. ### Backpropagation through MaxPooling in GLSL:

```glsl
void main() {
  const uint relative_output_index = gl_GlobalInvocationID.x;
  const uint output_x = relative_output_index % output_width;
  const uint output_y = relative_output_index / output_width % output_height;
  const uint channel = relative_output_index / output_width / output_height;
  const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
  const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint output_size = output_width * output_height * channels;
  const uint input_size = input_width * input_height * channels;
  const uint output_index = relative_output_index + data_index * output_size;
  const uint initial_input_x = output_x * filter_xstride;
  const uint initial_input_y = output_y * filter_ystride;
  for( uint x = 0; x != filter_width; ++x ) {
    for( uint y = 0; y != filter_height; ++y ) {
      const uint input_x = x + output_x * filter_xstride;
      const uint input_y = y + output_y * filter_ystride;
      const uint input_index = input_x + input_y * input_width +
        channel * input_width * input_height +
        data_index * input_width * input_height * channels;
      if( relative_output_index < output_size )
        input_grad[ input_index ] =
          ( input_data[ input_index ] == output_data[ output_index ] ) ?
            output_grad[ output_index ] : 0.0;
    }
  }
}
```
103. ### NEW! Put a convolution → φ → convolution → φ → MaxPooling block in front of the fully connected part (product → φ → product → φ → loss function → L).

105. ### NEW! And another convolution → φ → convolution → φ → MaxPooling block in front of that.

107. ### φ is ReLU everywhere except the final layer; only the very last φ is Hyperbolic Tangent.
108. ### ReLU can take arbitrarily large positive values. While training is still flailing, a huge value like 14326.7 gets rammed into tanh — which wants −1 ≤ x ≤ 1, please.
109. ### 14326.7 is far too big, and it stays bad even if the value changes a little, so the gradient ∂L/∂x_i = (∂L/∂y_i)(1 − tanh² (x_i)) is 0. The gradient is gone, learning stops, and there is no clue what to do about it.

111. ### Techniques that keep the distribution of a layer's outputs constant: Batch Normalization, Layer Normalization, Group Normalization.
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Sergey Ioffe and Christian Szegedy. 2015. https://arxiv.org/abs/1502.03167
- Layer Normalization. Jimmy Lei Ba, Jamie Ryan Kiros and Geoffrey E. Hinton. 2016. https://arxiv.org/abs/1607.06450
- Group Normalization. Yuxin Wu and Kaiming He. 2018. https://arxiv.org/abs/1803.08494
112. ### Bonus: TensorCore — horizontal matrix multiply-add: A·B + C. Multiply a 16 × 8 matrix A by an 8 × 16 matrix B, add a 16 × 16 matrix C, and get the result in a single instruction using 32 threads. On NVIDIA GPUs that support the VK_NV_cooperative_matrix extension, this is usable from Vulkan as well.