Deep Learning for Low-Level People

Fadis
July 20, 2019


Explains how to implement a convolutional neural network in Vulkan without relying on any framework.
These are the slides for a talk given at the 15th Kernel/VM Explorers meetup on July 20, 2019.
サンプルコード: https://github.com/Fadis/kernelvm_20190720_samples


Transcript

  1. Deep Learning for Low-Level People — NAOMASA MATSUBAYASHI. Sample code appearing in this talk: https://github.com/Fadis/kernelvm_20190720_samples

  2. Deep Learning for Low-Level People

  3. A formal neuron: each input x0…x4 is multiplied by its weight (w02…w42), the products are summed (Σ), and the sum is passed through an activation function φ to produce the output y2.
  4. A layer: y_j = φ(Σ_i w_ij x_i). Each output y0, y1, y2, … is computed from all of the inputs x0…x4 by the same multiply, Σ, φ pipeline, each output with its own set of weights.
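The layer formula y_j = φ(Σ_i w_ij x_i) can be written down directly. A minimal C++ sketch (the names and the choice of tanh for φ are illustrative, not from the talk's code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One fully connected layer: y_j = phi( sum_i w[i][j] * x[i] ).
// phi is tanh here purely as an example activation.
std::vector<double> layer_forward(
    const std::vector<std::vector<double>>& w,  // w[i][j]: input i -> output j
    const std::vector<double>& x) {
  const std::size_t n_out = w.empty() ? 0 : w[0].size();
  std::vector<double> y(n_out, 0.0);
  for (std::size_t j = 0; j < n_out; ++j) {
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) sum += w[i][j] * x[i];
    y[j] = std::tanh(sum);
  }
  return y;
}
```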
  5. A neural network: a stack of layers. A stack of infinitely many layers can represent an arbitrary function depending on the weights; even a finite stack can approximate a wide variety of functions.
  6. Two pieces of data that seem somehow correlated, but whose exact relationship is unclear.

  7. This becomes a problem of searching for the weights: treating the inputs and the desired outputs as constants, adjust w so that the output y comes as close as possible to the correct output.
  8. Training: given a large number of inputs x with corresponding desired outputs t, repeat the cycle of comparing the network output y0…y2 with t0…t2 and correcting w. The neural network then becomes an approximation of the function relating x to t: a function is obtained from data.
  9. Deep learning: a long (deep) neural network. A deeper network can approximate more complex functions but is harder to train, so deep networks were once considered impractical. Research in recent years made training deep networks feasible, triggering the current boom.
  10. Training requires an enormous amount of computation, so accelerators other than the CPU, such as GPUs and FPGAs, are frequently used.

  11. Frameworks appeared that abstract away which hardware (CPU, GPU, FPGA) performs the computation: TensorFlow, Chainer, Caffe, PyTorch, …, Theano.
  12. Then frameworks appeared that abstract away which framework performs the computation: Keras, on top of TensorFlow, Theano, and the rest. The layers keep getting higher.
  13. And these frameworks are huge:
     $ du -h tensorflow-1.12.3/ ... 145M tensorflow-1.12.3/
     $ du -h pytorch-1.1.0/ ... 44M pytorch-1.1.0/
     $ du -h chainer-6.1.0/ ... 22M chainer-6.1.0/
     $ du -h keras-2.2.4/ ... 3.0M keras-2.2.4/
  14. Deep Learning for Low-Level People

  15. Narrow the target hardware down to the GPU, and attempt deep learning without using any framework.

  16. To drive the GPU we use Vulkan, the new low-level graphics API for hitting the features of modern GPUs as directly as possible (see "Let's Start the Low-Level Graphics API Vulkan" from the 13th Kernel/VM Explorers meetup).

  17. CUDA, which can touch the same generation of GPUs as Vulkan, has cuDNN (https://developer.nvidia.com/cudnn), a library of the computations used in neural networks — but:

  18. We use Vulkan to drive the GPU, and implement every computation we need ourselves.

  19. The most basic network, made of fully connected layers only: inputs x0…x4, a hidden layer, and an output layer producing y0…y4.
  20. Create a Vulkan instance, then choose which of the available GPUs to use:
     if( config.validation ) layers.emplace_back( "VK_LAYER_LUNARG_standard_validation" );
     const auto app_info = vk::ApplicationInfo( config.prog_name.c_str(), (snip), VK_API_VERSION_1_1 );
     instance_ptr_t instance( new vk::Instance( vk::createInstance(
       vk::InstanceCreateInfo()
         .setPApplicationInfo( &app_info )
         .setEnabledExtensionCount( ext.size() ).setPpEnabledExtensionNames( ext.data() )
         .setEnabledLayerCount( layers.size() ).setPpEnabledLayerNames( layers.data() )
     ) ) );
     auto devices = instance->enumeratePhysicalDevices();
     if( devices.empty() ) throw device_is_not_available();
     devices.erase( std::remove_if( devices.begin(), devices.end(),
       [&]( const auto &d ) -> bool {
         auto avail_dext = d.enumerateDeviceExtensionProperties();
         for( const char *w: dext )
           if( std::find_if( avail_dext.begin(), avail_dext.end(),
             [&]( const auto &v ) { return !strcmp( v.extensionName, w ); } ) == avail_dext.end() ) return true;
         const auto avail_dlayers = d.enumerateDeviceLayerProperties();
         for( const char *w: dlayers )
           if( std::find_if( avail_dlayers.begin(), avail_dlayers.end(),
             [&]( const auto &v ) { return !strcmp( v.layerName, w ); } ) == avail_dlayers.end() ) return true;
         return false;
       } ), devices.end() );
     if( devices.empty() ) throw required_extensions_or_layers_are_not_available();
  21. Create the logical device, the queue, and the command pool:
     const auto queue_props = physical_device.getQueueFamilyProperties();
     uint32_t queue_index = std::distance( queue_props.begin(), std::find_if(
       queue_props.begin(), queue_props.end(), []( const auto &v ) {
         return bool( v.queueFlags & vk::QueueFlagBits::eCompute ) && bool( v.queueFlags & vk::QueueFlagBits::eTransfer );
       } ) );
     if( queue_index == queue_props.size() ) throw required_queue_is_not_available();
     const float priority = 0.0f;
     std::vector< vk::DeviceQueueCreateInfo > queues{};
     const auto queue_create_info = vk::DeviceQueueCreateInfo()
       .setQueueFamilyIndex( queue_index ).setQueueCount( 1 ).setPQueuePriorities( &priority );
     const auto features = physical_device.getFeatures();
     auto device = physical_device.createDevice( vk::DeviceCreateInfo()
       .setQueueCreateInfoCount( 1 ).setPQueueCreateInfos( &queue_create_info )
       .setEnabledExtensionCount( dext.size() ).setPpEnabledExtensionNames( dext.data() )
       .setEnabledLayerCount( dlayers.size() ).setPpEnabledLayerNames( dlayers.data() )
       .setPEnabledFeatures( &features ) );
     std::shared_ptr< vk::Device > d( new vk::Device( std::move( device ) ),
       []( const auto &p ) { if( p ) { p->destroy(); delete p; } } );
     auto queue = device.getQueue( queue_index, 0 );
     auto command_pool = device.createCommandPool( vk::CommandPoolCreateInfo()
       .setQueueFamilyIndex( queue_index ).setFlags( vk::CommandPoolCreateFlagBits::eResetCommandBuffer ) );
     std::shared_ptr< vk::Queue > q( new vk::Queue( std::move( queue ) ), [d]( const auto& ) {} );
     std::shared_ptr< vk::CommandPool > p( new vk::CommandPool( std::move( command_pool ) ),
       [d]( const vk::CommandPool *p ) { if( p ) { d->destroyCommandPool( *p );
  22. When the input is an N-element vector and the output an M-element vector, view the weights as the matrix w = [ w00 w01 ⋯ w0M ; w10 w11 ⋯ w1M ; ⋮ ; wN0 wN1 ⋯ wNM ]. Then the computation of one layer is y = φ(wx): take the matrix-vector product, and apply φ to each element of the result.
  23. Mini-batch: bundle several input vectors into a matrix, compute the error for the whole bundle, and only then correct the weights w. Training is more stable than correcting w for every sample individually.
  24. One layer now computes a matrix-matrix product followed by an elementwise φ: x0…xb are the input vectors, y0…yb the output vectors, and b is the batch size.
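The batched layer (matrix product followed by elementwise φ) can be sketched on the CPU like this (a minimal sketch; the names and the choice of ReLU for φ are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Mini-batch forward pass: each row of x is one input vector.
// Computes y = phi(x * w), with phi = ReLU as an example choice.
Mat batch_forward(const Mat& x, const Mat& w) {
  const std::size_t b = x.size(), n = w.size(), m = w[0].size();
  Mat y(b, std::vector<double>(m, 0.0));
  for (std::size_t s = 0; s < b; ++s)       // one row per batch sample
    for (std::size_t j = 0; j < m; ++j) {
      double sum = 0.0;
      for (std::size_t i = 0; i < n; ++i) sum += x[s][i] * w[i][j];
      y[s][j] = std::max(0.0, sum);         // elementwise ReLU
    }
  return y;
}
```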
  25. The whole network: an input batch x0…xb flows through product + φ (weights wh), then product + φ (weights wo), and finally the loss function compares the result with the targets t0…tb to produce the loss L.
  26. Forward dataflow: input → hidden-layer product (hidden-layer weights) → hidden-layer output → output-layer product (output-layer weights) → output-layer output → loss function, which compares against the desired output to yield the error.
  27. Allocate the GPU memory:
     hidden_weight.reset( new liblnn::buffer< glm::vec4 >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( input_width * hidden_width * sizeof( glm::vec4 ) ).setUsage( copyable ) ) );
     output_weight.reset( new liblnn::buffer< glm::vec4 >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( hidden_width * output_width * sizeof( glm::vec4 ) ).setUsage( copyable ) ) );
     hidden_affine_output.reset( new liblnn::buffer< float >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
     hidden_relu_output.reset( new liblnn::buffer< float >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
     output_affine_output.reset( new liblnn::buffer< float >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
     output_relu_output.reset( new liblnn::buffer< float >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
     softmax_grad.reset( new liblnn::buffer< float >( allocator, buf_type,
       vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
  28. Product: compute the product of the input matrix x and the weight matrix w on the GPU.
  29. GPU terminology: the unit that computes one piece of data is a Thread in Vulkan terms and also a Thread in NVIDIA terms; one SIMD unit is a Subgroup in Vulkan terms and a Warp in NVIDIA terms.

  30. The GPU is an architecture that earns its performance from sheer processor count, so the computation must be spread over as many threads as possible.

  31. VRAM: all threads share the VRAM. Multiple threads reading from the same address is fine, but when multiple threads write to the same address, which thread's value survives is undefined. This memory is called Memory in Vulkan terms and Global Memory in NVIDIA terms.

  32. For the matrix product, each output element is a sum: Σ_i x_{0i} w_{i0}, Σ_i x_{0i} w_{i1}, and so on, so the computation obviously parallelizes at least up to the number of elements of the output matrix. Parallelizing the inside of each Σ as well (e.g. Σ_i x_{0i} w_{i0} = x00 w00 + x01 w10 + x02 w20) gains even more threads, but then the question becomes how to take the Σ over the values held by different threads.
  33. SRAM: the GPU has SRAM that multiple threads can synchronize on. By writing to the SRAM, synchronizing, and reading from the SRAM, a value can be passed between threads of the same WorkGroup; the barrier blocks until every thread of the WorkGroup reaches the synchronization point. This SRAM is called SharedMemory in Vulkan terms and also SharedMemory in NVIDIA terms; the bundle of threads that can share it is a WorkGroup in Vulkan terms and a Block in NVIDIA terms.
  34. Σ on a classical GPU: the sum is obtained with log2(n) rounds of addition and synchronization. For x0…x3: compute S0 := x0 + x1 and S1 := x2 + x3 in SRAM, synchronize, then S0 := S0 + S1 yields x0 + x1 + x2 + x3.
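The pairwise reduction above can be sketched sequentially; each iteration of the outer loop corresponds to one "add + barrier" round on the GPU (a minimal sketch, names illustrative):

```cpp
#include <cassert>
#include <vector>

// Pairwise tree reduction: log2(n) rounds; in round `stride`,
// element i absorbs element i + stride, mimicking one GPU
// add-then-barrier step across threads.
double tree_sum(std::vector<double> v) {
  for (std::size_t stride = 1; stride < v.size(); stride *= 2)
    for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride)
      v[i] += v[i + stride];  // each "thread" adds its partner's value
  return v.empty() ? 0.0 : v[0];
}
```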
  35. Newer GPUs can do horizontal operations within a Subgroup: a horizontal add over x0…x3 leaves x0 + x1 + x2 + x3 in every lane (A, B, C, D) at once. These are called Subgroup operations in Vulkan terms and Warp Shuffle in NVIDIA terms.
  36. Σ on a current GPU: if the Subgroup size is 32, the sum is obtained with log32(n) rounds of horizontal addition and synchronization (horizontal add → synchronize → horizontal add).
  37. The hierarchy: a Subgroup shares a program counter; a Workgroup shares SRAM and runs concurrently; a Dispatch shares VRAM and runs concurrently. On a GeForce GTX 1070: 32 threads per Subgroup, physically 4 Subgroups (logically 48), physically 60 Workgroups (logically 2^64).
  38. Σ in GLSL: horizontally add and write the result to SharedMemory, synchronize; horizontally add the SharedMemory values and write the result back, synchronize; once SharedMemory is down to one element, return that value.
     shared float local_sum[ local_memory_size ];
     float large_sum( in float value ) {
       float sg_sum = subgroupAdd( value );
       local_sum[ gl_SubgroupID ] = sg_sum;
       barrier();
       uint len = gl_NumSubgroups;
       while( len > 1 ) {
         uint index = gl_SubgroupInvocationID + gl_SubgroupID * gl_SubgroupSize;
         float sum = subgroupAdd( index < len ? local_sum[ index ] : 0.0 );
         local_sum[ gl_SubgroupID ] = sum;
         barrier();
         len /= gl_SubgroupSize;
       }
       barrier();
       return local_sum[ 0 ];
     }
  39. Matrix product in GLSL: arrange the threads so that those that must take a Σ together land in the same WorkGroup, multiply the input-matrix values by the weight-matrix values, and write the sums into the output matrix.
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint output_index = gl_GlobalInvocationID.y;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
       output_data[ output_index + data_index * output_width ] = 0.0;
       for( uint offset = 0; offset < width; offset += input_width ) {
         float value = ( offset + input_index ) < width ?
           input_data[ offset + input_index + data_index * width ] *
           weight[ output_index + ( offset + input_index ) * output_width ].x : 0.0;
         output_data[ output_index + data_index * output_width ] += large_sum( value );
       }
     }
  40. Activation function: if a layer were only a matrix product, any stack of layers would remain linear; in other words, only linear functions could be approximated. Inserting a nonlinear function between the matrix products breaks the linearity and makes it possible to approximate nonlinear functions.
  41. Hyperbolic Tangent

  42. Rectified Linear Unit (ReLU)

  43. Hyperbolic Tangent: y_ij = tanh(x_ij). Rectified Linear Unit: y_ij = x_ij for x_ij ≥ 0, and 0 for x_ij < 0. Since an activation function is computed independently for each element of the input/output matrix, it parallelizes up to the number of elements of the output matrix (= the number of elements of the input matrix).
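Because the activations are elementwise, applying them is an independent per-element loop. A minimal C++ sketch (function names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Activation functions act elementwise, so every element of the
// (flattened) input matrix can be processed independently,
// which is exactly what makes them trivially parallel on a GPU.
std::vector<double> apply_relu(std::vector<double> x) {
  for (double& v : x) v = v >= 0.0 ? v : 0.0;
  return x;
}

std::vector<double> apply_tanh(std::vector<double> x) {
  for (double& v : x) v = std::tanh(v);
  return x;
}
```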
  44. Hyperbolic Tangent in GLSL:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       for( uint offset = 0; offset < width; offset += input_width ) {
         if( ( offset + input_index ) < width )
           output_data[ offset + input_index + data_index * width ] =
             tanh( input_data[ offset + input_index + data_index * width ] );
       }
     }
     Rectified Linear Unit in GLSL:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       for( uint offset = 0; offset < width; offset += input_width ) {
         if( ( offset + input_index ) < width )
           output_data[ offset + input_index + data_index * width ] =
             max( 0.0, input_data[ offset + input_index + data_index * width ] );
       }
     }
  45. Loss function: a function whose output gets smaller the more the network's obtained output resembles the desired output. Attaching it at the end turns the search for appropriate weights into an optimization problem.
  46. Output format: the desired output is t = (0 0 1 0 0); the neural network's output is y = (0.8 0.000007 0.9 0.036 0.00005). It looks like either class 0 or class 2; the correct answer is class 2.
  47. softmax: y_i = e^{x_i} / Σ_j e^{x_j}. For y = (0.8, 0.000007, 0.9, 0.036, 0.00005), softmax(y) = (0.288213, 0.129503, 0.318524, 0.134249, 0.129509).
  48. Cross-entropy loss: L = −Σ_i t_i log(y_i), with y_i = e^{x_i} / Σ_j e^{x_j}. When t = 1 and y = 1, l = 0; when t = 1 and y = 0.001, l = 4.605; when t = 0, l = 0. For the softmax output y_i to approach 1, the other elements of y must approach 0. In short: the more y resembles t, the smaller L becomes.
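Softmax followed by cross-entropy can be sketched directly from the two formulas (a minimal CPU sketch; no numerical clamping, unlike the GPU version that follows):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Softmax: y_i = exp(x_i) / sum_j exp(x_j).
std::vector<double> softmax(const std::vector<double>& x) {
  double denom = 0.0;
  for (double v : x) denom += std::exp(v);
  std::vector<double> y;
  for (double v : x) y.push_back(std::exp(v) / denom);
  return y;
}

// Cross-entropy: L = -sum_i t_i * log(y_i).
double cross_entropy(const std::vector<double>& t,
                     const std::vector<double>& y) {
  double l = 0.0;
  for (std::size_t i = 0; i < t.size(); ++i) l -= t[i] * std::log(y[i]);
  return l;
}
```

For two equal logits, softmax gives 0.5 each, and a one-hot target then yields a loss of log 2.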
  49. Softmax with cross-entropy loss in GLSL. If Σ_j e^{x_j} gets too close to 0 it produces inf, and if y_i reaches 0 the log produces nan or inf, hence the 1.0e-10 clamps:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       float value1 = input_index < width ? exp( input_data[ input_index + data_index * width ] ) : 0.0;
       float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
       float t = teacher_data[ input_index + data_index * width ];
       float y_ = max( y, 1.0e-10 );
       float value2 = input_index < width ? t * log( y_ ) : 0.0;
       float l = -large_sum( float( value2 ) );
       if( input_index == 0 ) output_data[ data_index ] = l;
     }
  50. Treat x and t as constants, and search for the w_h and w_o that minimize L.
  51. ∂L/∂w: how L changes when w changes. Partially differentiate, with respect to w, the function from the layer where the weight appears all the way up to the cross-entropy loss.
  52. The path from w_o to L (product + φ producing c, then d, then the loss) can be seen as a composition of three functions. By the chain rule df/dx = (df/dg)(dg/dx): ∂L/∂w_o = (∂L/∂d)(∂d/∂c)(∂c/∂w_o).
  53. Likewise for the hidden-layer weights: ∂L/∂w_h = (∂L/∂d)(∂d/∂c)(∂c/∂b)(∂b/∂a)(∂a/∂w_h).
  54. Backpropagation: if the derivative of each layer's output with respect to its input is known, ∂L/∂w can be built up layer by layer from the back: ∂L/∂d, then (∂L/∂d)(∂d/∂c), then (∂L/∂d)(∂d/∂c)(∂c/∂b), and so on down to ∂L/∂w_h.
  55. The full dataflow with gradients: the forward pass (input → hidden-layer product → hidden-layer output → output-layer product → output-layer output → loss function → error, using the hidden- and output-layer weights and the desired output), plus the backward pass (gradient of the loss function → gradient of the activation function → gradient of the matrix product → gradient of the activation function → gradient of the matrix product).
  56. Backpropagation through the loss function: with L = −Σ_i t_i log(y_i) and y_i = e^{x_i}/Σ_j e^{x_j}, we have ∂L/∂y_i = −t_i/y_i, and ∂y_i/∂x_k = y_i(1 − y_i) when i = k, −y_i y_k when i ≠ k.
  57. Combining them: ∂L/∂x_i = (∂L/∂y_i)(∂y_i/∂x_i) + Σ_{k≠i} (∂L/∂y_k)(∂y_k/∂x_i) = −t_i(1 − y_i) + Σ_{k≠i} t_k y_i.
  58. Since the desired outputs sum to 1: ∂L/∂x_i = −t_i(1 − y_i) + Σ_{k≠i} t_k y_i = −t_i + y_i Σ_k t_k = y_i − t_i.
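The neat y_i − t_i result can be checked numerically against a central finite-difference gradient of L(x) = −Σ_i t_i log(softmax(x)_i) (a verification sketch; all names illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// L(x) = -sum_i t_i log(softmax(x)_i).
double loss(const std::vector<double>& x, const std::vector<double>& t) {
  double denom = 0.0;
  for (double v : x) denom += std::exp(v);
  double l = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    l -= t[i] * std::log(std::exp(x[i]) / denom);
  return l;
}

// Analytic gradient from the derivation: dL/dx_i = y_i - t_i.
double analytic_grad(const std::vector<double>& x,
                     const std::vector<double>& t, std::size_t i) {
  double denom = 0.0;
  for (double v : x) denom += std::exp(v);
  return std::exp(x[i]) / denom - t[i];
}

// Central finite difference of L for comparison.
double numeric_grad(std::vector<double> x,
                    const std::vector<double>& t, std::size_t i) {
  const double h = 1e-6;
  x[i] += h;     const double lp = loss(x, t);
  x[i] -= 2 * h; const double lm = loss(x, t);
  return (lp - lm) / (2 * h);
}
```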
  59. At the end of the softmax GLSL, also output the gradient y − t:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       float value1 = input_index < width ? exp( input_data[ input_index + data_index * width ] ) : 0.0;
       float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
       float t = teacher_data[ input_index + data_index * width ];
       float y_ = max( y, 1.0e-10 );
       float value2 = input_index < width ? t * log( y_ ) : 0.0;
       float l = -large_sum( float( value2 ) );
       if( input_index == 0 ) output_data[ data_index ] = l;
       if( input_index < width ) input_grad[ input_index + data_index * width ] = float( y - t );
     }
  60. If tanh is used as the final layer's activation, its output range (values in −1 ≤ x ≤ 1) does not match the range the loss function expects (0 ≤ x, please), so rescale: s_i = x_i/2 + 1/2, y_i = e^{s_i}/Σ_j e^{s_j}, L = −Σ_i t_i log(y_i).
  61. Then ∂L/∂s_i = y_i − t_i and ∂s_i/∂x_i = 1/2, so ∂L/∂x_i = (∂L/∂s_i)(∂s_i/∂x_i) = (y_i − t_i)/2:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       float value1 = input_index < width ? exp( input_data[ input_index + data_index * width ] * 0.5 + 0.5 ) : 0.0;
       float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
       float t = teacher_data[ input_index + data_index * width ];
       float y_ = max( y, 1.0e-10 );
       float value2 = input_index < width ? t * log( y_ ) : 0.0;
       float l = -large_sum( float( value2 ) );
       if( input_index == 0 ) output_data[ data_index ] = l;
       if( input_index < width ) input_grad[ input_index + data_index * width ] = float( y - t ) * 0.5;
     }
  62. Backpropagation through Hyperbolic Tangent: y_i = tanh(x_i), ∂y_i/∂x_i = 1 − tanh²(x_i), so given the output-side gradient ∂L/∂y_i, we get ∂L/∂x_i = (∂L/∂y_i)(∂y_i/∂x_i) = (∂L/∂y_i)(1 − tanh²(x_i)).
  63. Backpropagation through Hyperbolic Tangent in GLSL:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       for( uint offset = 0; offset < width; offset += input_width ) {
         if( ( offset + input_index ) < width )
           input_grad[ offset + input_index + data_index * width ] =
             ( 1 - pow( tanh( input_data[ offset + input_index + data_index * width ] ), 2 ) ) *
             output_grad[ offset + input_index + data_index * width ];
       }
     }
  64. Backpropagation through Rectified Linear Unit: y_i = x_i for x_i ≥ 0 and 0 for x_i < 0, so ∂y_i/∂x_i = 1 for x_i ≥ 0 and 0 for x_i < 0, and therefore ∂L/∂x_i = ∂L/∂y_i for x_i ≥ 0, 0 for x_i < 0.
  65. Backpropagation through Rectified Linear Unit in GLSL:
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       for( uint offset = 0; offset < width; offset += input_width ) {
         if( ( offset + input_index ) < width )
           input_grad[ offset + input_index + data_index * width ] =
             input_data[ offset + input_index + data_index * width ] >= 0 ?
             output_grad[ offset + input_index + data_index * width ] : 0.0;
       }
     }
  66. That looks discontinuous no matter how you slice it; can a derivative even be defined?

  67. The ReLU paper [1] approximates y_i = log(1 + exp(x_i)) by y_i = x_i for x_i ≥ 0, 0 for x_i < 0. [1] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML'10), Johannes Fürnkranz and Thorsten Joachims (Eds.). Omnipress, USA, 807-814.
  68. Accordingly, the derivative ∂y_i/∂x_i = exp(x_i)/(1 + exp(x_i)) is approximated by ∂y_i/∂x_i = 1 for x_i ≥ 0, 0 for x_i < 0.
  69. Backpropagation through the matrix product: two things are needed, so start with the weight gradient. From y_j = Σ_i w_ij x_i we get ∂y_j/∂w_ij = x_i, hence ∂L/∂w_ij = (∂L/∂y_j) x_i: multiply the gradient of the output the weight contributed to by the input the weight was applied to.
  70. Next the input gradient: ∂y_j/∂x_i = w_ij, hence ∂L/∂x_i = Σ_j (∂L/∂y_j) w_ij: sum, over every output the input influenced, the output-side gradient times the weight connecting that input and output.
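Both matrix-product gradients can be computed in one pass over (i, j) pairs. A minimal CPU sketch of ∂L/∂w_ij = (∂L/∂y_j) x_i and ∂L/∂x_i = Σ_j (∂L/∂y_j) w_ij (names illustrative):

```cpp
#include <cassert>
#include <vector>

// Backprop through y_j = sum_i w[i][j] * x[i]:
//   dL/dw[i][j] = dL/dy[j] * x[i]
//   dL/dx[i]    = sum_j dL/dy[j] * w[i][j]
struct AffineGrads {
  std::vector<std::vector<double>> dw;
  std::vector<double> dx;
};

AffineGrads affine_backward(const std::vector<std::vector<double>>& w,
                            const std::vector<double>& x,
                            const std::vector<double>& dy) {
  AffineGrads g;
  g.dw.assign(x.size(), std::vector<double>(dy.size(), 0.0));
  g.dx.assign(x.size(), 0.0);
  for (std::size_t i = 0; i < x.size(); ++i)
    for (std::size_t j = 0; j < dy.size(); ++j) {
      g.dw[i][j] = dy[j] * x[i];   // weight gradient
      g.dx[i] += dy[j] * w[i][j];  // input gradient accumulates over outputs
    }
  return g;
}
```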
  71. Backpropagation through the matrix product in GLSL: launch as many threads as there are ∂L/∂w_ij, and assign WorkGroups so that the threads sharing a ∂L/∂x_i can take a horizontal add.
     void main() {
       const uint input_index = gl_GlobalInvocationID.x;
       const uint output_index = gl_GlobalInvocationID.y;
       const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
       float grad_w_sum = 0.0;
       for( uint data_index = 0; data_index != batch_size; data_index++ ) {
         input_grad[ input_index + data_index * input_width ] = 0.0;
         for( uint offset = 0; offset < height; offset += output_width ) {
           float grad_x = ( offset + output_index ) < height ?
             weight[ offset + output_index + input_index * height ].x *
             output_grad[ offset + output_index + data_index * height ] : 0.0;
           input_grad[ input_index + data_index * input_width ] += large_sum( grad_x );
         }
       }
       for( uint offset = 0; offset < height; offset += output_width ) {
         float grad_w_sum = 0.0;
         for( uint data_index = 0; data_index != batch_size; data_index++ ) {
           float grad_w = ( offset + output_index ) < height ?
             input_data[ input_index + data_index * input_width ] *
             output_grad[ offset + output_index + data_index * height ] : 0.0;
           grad_w_sum += grad_w;
         }
         if( ( offset + output_index ) < height )
           adam( weight[ offset + output_index + input_index * height ], grad_w_sum );
       }
     }
  72. Stochastic gradient descent: w_{t+1} = w_t − μ ∂L/∂w_t. Basically, follow ∂L/∂w_t and update w a little in the direction that makes L smaller, stepping from "here" toward the minimum we want to reach.
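The update rule w ← w − μ ∂L/∂w can be demonstrated on a one-dimensional loss. A minimal sketch (the parabola L(w) = (w − 3)² is an illustrative example, not from the talk):

```cpp
#include <cassert>
#include <cmath>

// One SGD step: w <- w - mu * dL/dw.
double sgd_step(double w, double grad, double mu) { return w - mu * grad; }

// Example: minimize L(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
// Repeated steps shrink the distance to the minimum geometrically.
double minimize_parabola(double w, double mu, int steps) {
  for (int i = 0; i < steps; ++i) w = sgd_step(w, 2.0 * (w - 3.0), mu);
  return w;
}
```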
  73. Stochastic gradient descent, however, can trip over a small bump in L and step back from it: w_{t+1} = w_t − μ ∂L/∂w_t.
  74. The evolution of optimization algorithms: SGD (stochastic gradient descent), MomentumSGD, NAG, AdaGrad, RMSprop, Adam, AdaDelta, AdaMax, SMORMS3, RMSpropGraves, Eve, Nadam, Santa-E, Santa-SSS, AdaSecant, GD by GD.
  75. Adam: g = ∂L/∂w_t; m_t = β1 m_{t−1} + (1 − β1) g; v_t = β2 v_{t−1} + (1 − β2) g²; m̂_t = m_t/(1 − β1^t); v̂_t = v_t/(1 − β2^t); w_{t+1} = w_t − α m̂_t/(√v̂_t + ε). α = 0.001, β1 = 0.9, β2 = 0.999 are the recommended values, so we use them as-is.
     void adam( inout vec4 weight, in float grad ) {
       weight.w += 1;
       float gt = grad;
       weight.y = beta1 * weight.y + ( 1 - beta1 ) * gt;
       weight.z = beta2 * weight.z + ( 1 - beta2 ) * gt * gt;
       float mhat = weight.y / ( 1 - pow( beta1, weight.w ) );
       float vhat = weight.z / ( 1 - pow( beta2, weight.w ) );
       weight.x -= alpha * mhat / ( sqrt( vhat ) + eps );
     }
     • Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980v9 https://arxiv.org/abs/1412.6980
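The GLSL above packs (w, m, v, t) into one vec4. A scalar C++ port of the same update, mirroring that packing as a struct (a sketch; struct and parameter names are illustrative):

```cpp
#include <cassert>
#include <cmath>

// Scalar port of the GLSL adam(): the vec4 packed (w, m, v, t).
struct AdamState { double w, m = 0.0, v = 0.0, t = 0.0; };

void adam(AdamState& s, double grad,
          double alpha = 0.001, double beta1 = 0.9, double beta2 = 0.999,
          double eps = 1e-8) {
  s.t += 1;
  s.m = beta1 * s.m + (1 - beta1) * grad;         // first-moment estimate
  s.v = beta2 * s.v + (1 - beta2) * grad * grad;  // second-moment estimate
  const double mhat = s.m / (1 - std::pow(beta1, s.t));  // bias correction
  const double vhat = s.v / (1 - std::pow(beta2, s.t));
  s.w -= alpha * mhat / (std::sqrt(vhat) + eps);
}
```

On the very first step with gradient 1, both bias-corrected moments are 1, so the weight moves by almost exactly α.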
  76. Initial values: neurons with identical w behave identically, and the gradients with respect to the weights of neurons producing identical outputs coincide, so the weights are updated by the same amount and the neurons keep behaving identically forever. Therefore the initial weights of a neural network must not be uniform.
  77. Xavier initialization: when a layer has n input elements, initialize w with w = (1/√n)·randn() (normally distributed random numbers). Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR, pp. 249-256, 2010. http://proceedings.mlr.press/v9/glorot10a.html
  78. Generating random numbers on the GPU: most pseudo-random algorithms carry a state that is updated every time one number is emitted (G_t → 0.0801, G_{t+1} → 0.2926, G_{t+2} → 0.4342, G_{t+3} → 0.6978). The next number cannot be generated until the previous one has been, so this does not scale.
  79. A mysterious uniform random number algorithm handed down among video game developers for roughly the last ten years: f(x) = fract(t·sin(x·s)), with s = [12.9898, 78.233], t = 43758.5453, and fract(x) = x − ⌊x⌋. E.g. f([0.1, 0.8]) = 0.7340, f([0.3, 0.2]) = 0.1768. Because no generator state is carried from one element to the next, it scales.
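The one-liner PRNG, plus the Box-Muller step used later, port directly to C++. A sketch (the chaining of prand calls inside boxmuller differs slightly from the GLSL that follows; all names illustrative):

```cpp
#include <cassert>
#include <cmath>

// C++ port of the "mysterious" GLSL one-liner uniform PRNG:
// f(x, y) = fract(sin(12.9898 x + 78.233 y) * 43758.5453), in [0, 1).
double fract(double x) { return x - std::floor(x); }
double prand(double x, double y) {
  return fract(std::sin(x * 12.9898 + y * 78.233) * 43758.5453);
}

// Box-Muller: turn uniform samples into one normal sample N(mu, sigma^2).
double boxmuller(double x, double y, double mu, double sigma) {
  const double pi = 3.141592653589793;
  double u = 1.0 - prand(x, y);          // shift into (0, 1] to avoid log(0)
  double n = prand(u, prand(x, u) * pi); // second, decorrelated uniform draw
  double v = std::sqrt(-2.0 * std::log(u)) * std::cos(2.0 * pi * n);
  return mu + sigma * v;
}
```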
  80. The distribution of this generator's output, and the same output pushed through the Box-Muller transform into a normal distribution: it has a few stray spikes, but does not look unusably bad.

  81. Xavier initialization in GLSL: generate a uniform random number, turn it into a normal random number with the Box-Muller transform, and run the function computing the Xavier initial value with one thread per weight element:
     float prand( vec2 i ) {
       return fract( sin( dot( i.xy, vec2( 12.9898, 78.233 ) ) ) * 43758.5453 );
     }
     const float PI = 3.1415926535897932384626433832795;
     float boxmuller( vec2 i, float mu, float sigma ) {
       float x = 1 - prand( i );
       float y = prand( vec2( i.y, x ) );
       float n = prand( vec2( x, y * PI ) );
       float v = sqrt( -2.0 * log( x ) ) * cos( 2 * PI * n );
       return mu + sigma * v;
     }
     float xavier_init_value( vec2 i, uint n ) {
       float value = boxmuller( i, 0.0, 1.0 / sqrt( n ) );
       return value;
     }
     void main() {
       const uint x = gl_GlobalInvocationID.x;
       const uint y = gl_GlobalInvocationID.y;
       const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
       const uint index = x + y * width;
       weight[ index ] = vec4( xavier_init_value( vec2( float( x )/width, float( y )/height ), input_size ), 0, 0, 0 );
     }
  82. He initialization: when a layer has n input elements, initialize w with w = (√2/√n)·randn() (normally distributed random numbers). Said to propagate the initial error better than Xavier initialization when ReLU is used. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. 2015. https://arxiv.org/abs/1502.01852
  83. He initialization in GLSL; only the scale factor differs from the Xavier version:
     float he_init_value( vec2 i, uint n ) {
       float value = boxmuller( i, 0.0, sqrt( 2 ) / sqrt( n ) );
       return value;
     }
     void main() {
       const uint x = gl_GlobalInvocationID.x;
       const uint y = gl_GlobalInvocationID.y;
       const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
       const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
       const uint index = x + y * width;
       weight[ index ] = vec4( he_init_value( vec2( float( x )/width, float( y )/height ), input_size ), 0, 0, 0 );
     }
  84. Compile the GLSL, create the ComputePipelines, and bind the buffers:
     hidden_affine1.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       batch_images[ 0 ], hidden_affine_output, hidden_weight, batch_size ) ) );
     hidden_affine2.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       batch_images[ 1 ], hidden_affine_output, hidden_weight, batch_size ) ) );
     hidden_activation.reset( new layer( create_relu_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       hidden_affine_output, hidden_activation_output ) ) );
     output_affine.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       hidden_activation_output, output_affine_output, output_weight, batch_size ) ) );
     output_activation.reset( new layer( create_tanh_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       output_affine_output, output_activation_output ) ) );
     error1.reset( new layer( create_softmax_combined_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       output_activation_output, error_out, softmax_grad, batch_labels[ 0 ] ) ) );
     error2.reset( new layer( create_softmax_combined_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       output_activation_output, error_out, softmax_grad, batch_labels[ 1 ] ) ) );
     output_activation_backward.reset( new layer( create_tanh_backward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       output_affine_output, output_activation_output, output_activation_grad, softmax_grad ) ) );
     output_affine_backward.reset( new layer( create_affine_backward_pipeline( device, mods, descriptor_pool, pipeline_cache, props,
       hidden_activation_output, output_affine_output,
  85. Execution: prepare two sets of input/target buffers, so that training and the transfer of the next batch proceed simultaneously, and submit the contents of the command buffer to the GPU:
     void network::exec() {
       ++swap_index;
       swap_index %= 2;
       queue->submit( vk::SubmitInfo()
         .setCommandBufferCount( 1 )
         .setPCommandBuffers( command_buffers->data() + swap_index ), vk::Fence() );
       fill( false, false );
       queue->waitIdle();
       if( debug ) {
         std::cout << "==============" << std::endl;
         check();
         print( *error_out, batch_size );
         print( *output_activation_output, batch_size );
         print_image( *batch_images[ swap_index ], train_input->get_image_width(), batch_size );
         print_label( *batch_labels[ swap_index ], batch_size );
         print_eval( *output_activation_output, batch_size );
       }
     }
  86. MNIST (http://yann.lecun.com/exdb/mnist/): 70,000 labeled images of handwritten digits. A classification task simple enough that even an SVM reaches over 90% accuracy; if the implementation is correct, there is no way it fails to classify.

  87. Batch size 64, hidden-layer width 128: evaluation accuracy around 98%. It is functioning as a neural network.

  88. Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist): 70,000 labeled images of clothing across 10 classes (T-shirts, coats, shoes, etc.). Different kinds of clothing can have similar shapes, so it is considered harder than MNIST.

  89. Batch size 64, hidden-layer width 128: evaluation accuracy around 87%. Indeed somewhat lower.

  90. When accuracy falls short, add layers. But adding plain matrix-product layers makes the number of weights w grow rapidly.
  91. Convolution: an image-processing filter that is learned. Filter w, input x, output y.

  92. Convolution: an image-processing filter that is learned. Filter w, input x, output y.

  93. With filter size M × N, stride 1, and no margin: y_ij = Σ_{k=0}^{M} Σ_{l=0}^{N} w_kl x_{(i+k)(j+l)}.
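The convolution formula y_ij = Σ_k Σ_l w_kl x_{(i+k)(j+l)} translates directly into four nested loops. A minimal CPU sketch for stride 1 with no margin (names illustrative):

```cpp
#include <cassert>
#include <vector>

using Img = std::vector<std::vector<double>>;

// Valid convolution (stride 1, no margin):
// y[i][j] = sum_{k,l} w[k][l] * x[i+k][j+l]
Img convolve(const Img& x, const Img& w) {
  const std::size_t m = w.size(), n = w[0].size();
  const std::size_t oh = x.size() - m + 1, ow = x[0].size() - n + 1;
  Img y(oh, std::vector<double>(ow, 0.0));
  for (std::size_t i = 0; i < oh; ++i)
    for (std::size_t j = 0; j < ow; ++j)
      for (std::size_t k = 0; k < m; ++k)
        for (std::size_t l = 0; l < n; ++l)
          y[i][j] += w[k][l] * x[i + k][j + l];
  return y;
}
```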
  94. Backpropagation through the convolution, weight gradient: ∂L/∂w_kl = Σ_i Σ_j (∂L/∂y_ij) x_{(i+k)(j+l)}. For every input/output pair a given weight participated in, take the sum of the product of the output-side gradient and the input.
  95. Backpropagation through the convolution, input gradient: ∂L/∂x_ij = Σ_k^M Σ_l^N w_kl ∂L/∂y_{(i−k)(j−l)}: convolve the output-side gradient rotated by 180 degrees.
  96. GLSL computing ∂L/∂w_kl:
     void main() {
       const uint filter_index = gl_GlobalInvocationID.x;
       const uint filter_x = filter_index % filter_width;
       const uint filter_y = filter_index / filter_width % filter_height;
       const uint channel = filter_index / filter_width / filter_height % channels;
       const uint filter_size = filter_width * filter_height * channels;
       const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
       const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
       bool filter_oob = filter_index >= filter_size;
       float sum = 0.0;
       for( int data_index = 0; data_index != batch_size; ++data_index ) {
         for( int output_x = 0; output_x != output_width; ++output_x ) {
           for( int output_y = 0; output_y != output_height; ++output_y ) {
             const int output_index = int( output_x ) + int( output_y ) * int( output_width ) +
               int( channel ) * int( output_width * output_height ) +
               data_index * int( output_width * output_height * channels );
             const int input_x = output_x * int(filter_xstride) - int(xmargin) + int(filter_x);
             const int input_y = output_y * int(filter_ystride) - int(ymargin) + int(filter_y);
             const bool input_oob = filter_oob || input_x < 0 || input_x >= input_width || input_y < 0 || input_y >= input_height;
             const int input_index = int( input_x ) + int( input_y ) * int( input_width ) +
               int( channel ) * int( input_width * input_height ) +
               data_index * int( input_width * input_height * channels );
             const float grad = filter_oob ? 0.0 : output_grad[ output_index ];
             const float x = input_oob ? 0.0 : input_data[ input_index ];
             sum += grad * x;
           }
         }
       }
       if( !filter_oob ) adam( weight[ filter_index ], sum );
     }
  97. GLSL computing ∂L/∂x_ij:
     void main() {
       const uint input_x = gl_GlobalInvocationID.x % output_width;
       const uint input_y = gl_GlobalInvocationID.x / output_width % output_height;
       const uint channel = gl_GlobalInvocationID.x / output_width / output_height;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
       const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
       const uint input_size = input_width * input_height * channels;
       const uint relative_input_index = input_x + input_y * input_width + channel * input_width * input_height;
       const uint input_index = relative_input_index + data_index * input_width * input_height * channels;
       if( relative_input_index < input_size ) input_grad[ input_index ] = 0.0;
       for( int x = 0; x != filter_width; ++x ) {
         for( int y = 0; y != filter_height; ++y ) {
           const int output_x = int(input_x) * int(filter_xstride) - int(xmargin) - x;
           const int output_y = int(input_y) * int(filter_ystride) - int(ymargin) - y;
           const bool oob = output_x < 0 || output_x >= output_width || output_y < 0 || output_y >= output_height;
           const int relative_output_index = output_x + output_y * int(output_width) + int(channel) * int(output_width * output_height);
           const int output_index = relative_output_index + int(data_index) * int(output_width * output_height * channels );
           const uint filter_index = x + y * int(filter_width) + channel * int(filter_width * filter_height );
           if( relative_input_index < input_size ) {
             if( !oob ) {
               const float grad = output_grad[ output_index ] * weight[ filter_index ].x;
               input_grad[ input_index ] += grad;
             }
           }
         }
       }
     }
  98. MaxPooling: keep only the one element whose value was the largest within the window. Input x, output y.
  99. MaxPooling: keep only the one element whose value was the largest within the window. For the window (3 6 1; 2 2 0; 5 9 6), the output is 9.
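Non-overlapping M × N max pooling can be sketched as follows (a minimal CPU sketch; names illustrative, and the input size is assumed divisible by the window size):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using Img = std::vector<std::vector<double>>;

// M x N max pooling: y[i][j] = max over the M x N window of x
// starting at (i * m, j * n).
Img max_pool(const Img& x, std::size_t m, std::size_t n) {
  const std::size_t oh = x.size() / m, ow = x[0].size() / n;
  Img y(oh, std::vector<double>(ow));
  for (std::size_t i = 0; i < oh; ++i)
    for (std::size_t j = 0; j < ow; ++j) {
      double best = x[i * m][j * n];
      for (std::size_t k = 0; k < m; ++k)
        for (std::size_t l = 0; l < n; ++l)
          best = std::max(best, x[i * m + k][j * n + l]);
      y[i][j] = best;  // only the window maximum survives
    }
  return y;
}
```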
  100. MaxPooling in GLSL. With filter size M × N: y_ij = max(x_{(Mi+k)(Nj+l)}) for k ∈ [0,M], l ∈ [0,N]:
     void main() {
       const uint relative_output_index = gl_GlobalInvocationID.x;
       const uint output_x = relative_output_index % output_width;
       const uint output_y = relative_output_index / output_width % output_height;
       const uint channel = relative_output_index / output_width / output_height;
       const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
       const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint output_size = output_width * output_height * channels;
       const uint input_size = input_width * input_height * channels;
       const uint output_index = relative_output_index + data_index * output_size;
       if( relative_output_index < output_size ) output_data[ output_index ] = 0.0;
       for( uint x = 0; x != filter_width; ++x ) {
         for( uint y = 0; y != filter_height; ++y ) {
           const uint input_x = x + output_x * filter_xstride;
           const uint input_y = y + output_y * filter_ystride;
           const uint input_index = input_x + input_y * input_width + channel * input_width * input_height +
             data_index * input_width * input_height * channels;
           if( relative_output_index < output_size )
             output_data[ output_index ] = max( output_data[ output_index ], input_data[ input_index ] );
         }
       }
     }
  101. Backpropagation through MaxPooling: ∂L/∂x_{(Mi+k)(Nj+l)} = ∂L/∂y_ij where x_{(Mi+k)(Nj+l)} = y_ij, and 0 where x_{(Mi+k)(Nj+l)} ≠ y_ij: the gradient corresponding to the input that supplied the maximum takes the value of the output-side gradient.
  102. Backpropagation through MaxPooling in GLSL:
     void main() {
       const uint relative_output_index = gl_GlobalInvocationID.x;
       const uint output_x = relative_output_index % output_width;
       const uint output_y = relative_output_index / output_width % output_height;
       const uint channel = relative_output_index / output_width / output_height;
       const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
       const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
       const uint data_index = gl_GlobalInvocationID.z;
       const uint output_size = output_width * output_height * channels;
       const uint input_size = input_width * input_height * channels;
       const uint output_index = relative_output_index + data_index * output_size;
       const uint initial_input_x = output_x * filter_xstride;
       const uint initial_input_y = output_y * filter_ystride;
       for( uint x = 0; x != filter_width; ++x ) {
         for( uint y = 0; y != filter_height; ++y ) {
           const uint input_x = x + output_x * filter_xstride;
           const uint input_y = y + output_y * filter_ystride;
           const uint input_index = input_x + input_y * input_width + channel * input_width * input_height +
             data_index * input_width * input_height * channels;
           if( relative_output_index < output_size )
             input_grad[ input_index ] = ( input_data[ input_index ] == output_data[ output_index ] ) ?
               output_grad[ output_index ] : 0.0;
         }
       }
     }
  103. NEW! Add convolution + φ, convolution + φ, and MaxPooling in front of the two matrix-product layers and the loss function.
  104. Evaluation accuracy around 90%, a small improvement. Batch size 64, hidden-layer width 128, convolution channels 14:14.

  105. NEW! Add another pair of convolution + φ layers and another MaxPooling in front of the previous ones.
  106. This is terrible. Batch size 64, hidden-layer width 128, convolution channels 32:32:64:64.

  107. In this network φ is ReLU everywhere except the final layer, which alone uses Hyperbolic Tangent.
  108. ReLU can take arbitrarily large positive values. While training is wobbling, a huge value like 14326.7 gets rammed into tanh, which wants −1 ≤ x ≤ 1.
  109. 14326.7 is so far off that changing the value a little is still hopeless, so the gradient ∂L/∂x_i = (∂L/∂y_i)(1 − tanh²(x_i)) is 0. The gradient vanishes and learning stops; the network has no idea what to do.
  110. Setting the learning rate of the convolution layers to 1/10 improved the accuracy, though training became slow. Batch size 64, hidden-layer width 128, convolution channels 32:32:64:64.

  111. Techniques that keep the distribution of a layer's outputs constant: Batch Normalization, Layer Normalization, Group Normalization. Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 2015. https://arxiv.org/abs/1502.03167 — Jimmy Lei Ba, Jamie Ryan Kiros and Geoffrey E. Hinton. Layer Normalization. arXiv:1607.06450. 2016. https://arxiv.org/abs/1607.06450 — Yuxin Wu and Kaiming He. Group Normalization. 2018. https://arxiv.org/abs/1803.08494
  112. Bonus: TensorCore, a horizontal matrix multiply-add. Using 32 threads, a single instruction computes AB + C for small fixed-size matrices (the slide lists shapes of 16 × 16, 8 × 16 and 16 × 8 elements for A, B and C). On NVIDIA GPUs supporting the VK_NV_cooperative_matrix extension, this can be used even from Vulkan.
  113. So why didn't you use TensorCores? The GeForce GTX 1070 that happened to be lying around at home has no TensorCores.

  114. Conclusion: deep learning is an algorithm, so if you implement it, it runs anywhere.