Slide 1

Slide 1 text

Deep Learning for Low-Layer People
NAOMASA MATSUBAYASHI
Sample code used in this talk: https://github.com/Fadis/kernelvm_20190720_samples

Slide 2

Slide 2 text

Deep Learning for Low-Layer People

Slide 3

Slide 3 text

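Formal neuron
Each input x0 … x4 is multiplied by its weight (w02, w12, w22, w32, w42), the products are summed (∑), and the sum is passed through ϕ to give the output y2:
$y_2 = \phi\left(\sum_i w_{i2} x_i\right)$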

Slide 4

Slide 4 text

Layer
A row of formal neurons that all read the same inputs x0 … x4 and produce the outputs y0 … y4, one neuron per output:
$y_j = \phi\left(\sum_i w_{ij} x_i\right)$

Slide 5

Slide 5 text

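Neural network
Layers chained together: x0 → layer (w0) → x1 → layer (w1) → x2 → layer (w2) → x3 → layer (w3) → y, where each xk is a vector xk0, xk1, xk2, xk3, xk4, …
A stack of infinitely many layers can represent any function, depending on the weights.
Even a finite stack can approximate a wide variety of functions.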

Slide 6

Slide 6 text

Data that seem to be correlated somehow, although it is unclear what the relationship actually is.
(Figure: inputs → ? → 2, 7, 4)

Slide 7

Slide 7 text

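This becomes a problem of searching for weights.
Treat the input and the correct output (2, 7, 4) as constants, and adjust w so that the network output y (y0, y1, y2) gets as close as possible to the correct output.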

Slide 8

Slide 8 text

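Training
Feed x0 x1 x2 through the network with weights w to get y0 y1 y2, compare them with t0 t1 t2, and correct w. Repeat with the next sample.
Given a large number of inputs and corresponding outputs, repeating this operation turns the neural network into an approximation of the function that expresses the relationship between x and t.
A function is obtained from data.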

Slide 9

Slide 9 text

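Deep learning
A long neural network (x0 x1 x2 → … → y0 y1 y2).
A longer network can approximate more complex functions, but it is also harder to train, so it used to be considered impractical.
Research in recent years has made it possible to train long networks, which set off a huge boom.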

Slide 10

Slide 10 text

CPU, GPU, FPGA
Training requires a large amount of computation, so accelerators other than the CPU are often used.

Slide 11

Slide 11 text

Hardware: CPU, GPU, FPGA
Frameworks appeared that abstract away which hardware performs the computation: TensorFlow, Chainer, Caffe, PyTorch, Theano, …

Slide 12

Slide 12 text

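Hardware: CPU, GPU, FPGA
Frameworks: TensorFlow, Chainer, Caffe, PyTorch, Theano, …
Then a framework appeared that abstracts away which framework performs the computation: Keras.
That is a lot of layers stacked up.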

Slide 13

Slide 13 text

Hardware: CPU, GPU, FPGA
Frameworks: TensorFlow, Chainer, Caffe, PyTorch, Theano, …, Keras
$ du -h tensorflow-1.12.3/
...
145M tensorflow-1.12.3/
$ du -h pytorch-1.1.0/
...
44M pytorch-1.1.0/
$ du -h chainer-6.1.0/
...
22M chainer-6.1.0/
$ du -h keras-2.2.4/
...
3.0M keras-2.2.4/
They are huge.

Slide 14

Slide 14 text

Deep Learning for Low-Layer People

Slide 15

Slide 15 text

GPU
Narrow the target hardware down to the GPU and attempt deep learning without using a framework.

Slide 16

Slide 16 text

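To drive the GPU, use Vulkan:
"A new 3D graphics API for hitting the features of modern GPUs as directly as possible"
(from "Getting Started with Vulkan, a Low-Layer Graphics API", 13th Kernel/VM Expedition)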

Slide 17

Slide 17 text

CUDA, which like Vulkan gives access to the raw GPU, has cuDNN, a library of the computations used in neural networks, but ...
https://developer.nvidia.com/cudnn

Slide 18

Slide 18 text

To drive the GPU, use Vulkan, and implement every computation that is needed ourselves.

Slide 19

Slide 19 text

The most basic network, built from fully connected layers:
x0 x1 x2 x3 x4 … → hidden layer → output layer → y0 y1 y2 y3 y4 …

Slide 20

Slide 20 text

Create a Vulkan instance:

if( config.validation ) layers.emplace_back( "VK_LAYER_LUNARG_standard_validation" );
const auto app_info = vk::ApplicationInfo(
  config.prog_name.c_str(), (omitted), VK_API_VERSION_1_1
);
instance_ptr_t instance( new vk::Instance( vk::createInstance(
  vk::InstanceCreateInfo()
    .setPApplicationInfo( &app_info )
    .setEnabledExtensionCount( ext.size() ).setPpEnabledExtensionNames( ext.data() )
    .setEnabledLayerCount( layers.size() ).setPpEnabledLayerNames( layers.data() )
) ) );

Choose which of the available GPUs to use:

auto devices = instance->enumeratePhysicalDevices();
if( devices.empty() ) throw device_is_not_available();
// Drop every device that lacks a required extension or layer.
devices.erase( std::remove_if( devices.begin(), devices.end(),
  [&]( const auto &d ) -> bool {
    auto avail_dext = d.enumerateDeviceExtensionProperties();
    for( const char *w: dext )
      if( std::find_if( avail_dext.begin(), avail_dext.end(),
        [&]( const auto &v ) { return !strcmp( v.extensionName, w ); }
      ) == avail_dext.end() ) return true;
    const auto avail_dlayers = d.enumerateDeviceLayerProperties();
    for( const char *w: dlayers )
      if( std::find_if( avail_dlayers.begin(), avail_dlayers.end(),
        [&]( const auto &v ) { return !strcmp( v.layerName, w ); }
      ) == avail_dlayers.end() ) return true;
    return false;
  }
), devices.end() );
if( devices.empty() ) throw required_extensions_or_layers_are_not_available();

Slide 21

Slide 21 text

Create the logical device, the queue, and the command pool:

const auto queue_props = physical_device.getQueueFamilyProperties();
// Pick the first queue family that supports both compute and transfer.
uint32_t queue_index = std::distance( queue_props.begin(), std::find_if(
  queue_props.begin(), queue_props.end(),
  []( const auto &v ) {
    return bool( v.queueFlags & vk::QueueFlagBits::eCompute ) &&
           bool( v.queueFlags & vk::QueueFlagBits::eTransfer );
  }
) );
if( queue_index == queue_props.size() ) throw required_queue_is_not_available();
const float priority = 0.0f;
std::vector< vk::DeviceQueueCreateInfo > queues{};
const auto queue_create_info = vk::DeviceQueueCreateInfo()
  .setQueueFamilyIndex( queue_index ).setQueueCount( 1 ).setPQueuePriorities( &priority );
const auto features = physical_device.getFeatures();
auto device = physical_device.createDevice(
  vk::DeviceCreateInfo()
    .setQueueCreateInfoCount( 1 ).setPQueueCreateInfos( &queue_create_info )
    .setEnabledExtensionCount( dext.size() ).setPpEnabledExtensionNames( dext.data() )
    .setEnabledLayerCount( dlayers.size() ).setPpEnabledLayerNames( dlayers.data() )
    .setPEnabledFeatures( &features )
);
std::shared_ptr< vk::Device > d(
  new vk::Device( std::move( device ) ),
  []( const auto &p ) { if( p ) { p->destroy(); delete p; } }
);
auto queue = device.getQueue( queue_index, 0 );
auto command_pool = device.createCommandPool(
  vk::CommandPoolCreateInfo()
    .setQueueFamilyIndex( queue_index )
    .setFlags( vk::CommandPoolCreateFlagBits::eResetCommandBuffer )
);
// The deleters capture d so that the device outlives the queue and the pool.
std::shared_ptr< vk::Queue > q( new vk::Queue( std::move( queue ) ), [d]( const auto& ) {} );
std::shared_ptr< vk::CommandPool > p(
  new vk::CommandPool( std::move( command_pool ) ),
  [d]( const vk::CommandPool *p ) { if( p ) { d->destroyCommandPool( *p );
(the listing is cut off here on the slide)

Slide 22

Slide 22 text

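When the input is a vector of N elements and the output is a vector of M elements, regard the weights as the following matrix:
$w = \begin{pmatrix} w_{00} & w_{01} & \cdots & w_{0M} \\ w_{10} & w_{11} & \cdots & w_{1M} \\ \vdots & \vdots & & \vdots \\ w_{N0} & w_{N1} & \cdots & w_{NM} \end{pmatrix}$
The computation of one layer is then $y = \phi(wx)$:
product: take the product of the vector and the matrix
ϕ: apply ϕ to each element of the result of the product
(Diagram: x and w → product → ϕ → y)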
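As a point of comparison for the GPU code later in the talk, the layer computation above can be sketched on the CPU. This is only an illustrative reference, not code from the sample repository: the names `dense_forward` and `relu` are hypothetical, and the row-major layout `w[i * out_width + j]` is an assumption chosen to match the indexing `weight[ output_index + input_index * output_width ]` used by the matrix-product shader in this deck.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one fully connected layer:
//   y_j = phi( sum_i w_ij * x_i )
// w is stored row-major as w[i * out_width + j] = w_ij (an assumption).
std::vector<float> dense_forward(const std::vector<float>& w,
                                 const std::vector<float>& x,
                                 std::size_t out_width,
                                 float (*phi)(float)) {
  assert(out_width > 0 && w.size() == x.size() * out_width);
  std::vector<float> y(out_width, 0.0f);
  for (std::size_t j = 0; j < out_width; ++j) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i)
      sum += w[i * out_width + j] * x[i];  // the "product" step
    y[j] = phi(sum);                       // the activation step
  }
  return y;
}

// ReLU, used here only as an example activation.
float relu(float v) { return v > 0.0f ? v : 0.0f; }
```

Every output element `y[j]` is independent of the others, which is the property the GPU version exploits.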

Slide 23

Slide 23 text

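Mini-batch
(Diagram: samples x0 x1 x2 / y0 y1 y2 / t0 t1 t2 form one bundle, x3 x4 x5 / y3 y4 y5 / t3 t4 t5 the next, then x6 / y6 / t6 …; w is corrected once per bundle.)
Bundle several input vectors together into a matrix.
Compute the error for each bundle, then correct the weights.
Training is more stable than when w is corrected for every sample individually.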

Slide 24

Slide 24 text

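product: take the product of a matrix and a matrix
ϕ: apply ϕ to each element of the result of the product
(Diagram: rows x0, x1, …, xb and w → product → ϕ → rows y0, y1, …, yb)
Each xn is an input vector, each yn is an output vector, and b is the batch size.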

Slide 25

Slide 25 text

(Diagram: x0, x1, …, xb → product (wh) → ϕ → product (wo) → ϕ → loss function, which also takes t0, t1, …, tb → L)

Slide 26

Slide 26 text

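(Diagram of the data in the network: input → product with the hidden layer weights → hidden layer product → ϕ → hidden layer output → product with the output layer weights → output layer product → ϕ → output layer output → loss function, which also takes the desired output → error)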

Slide 27

Slide 27 text

Allocate GPU memory:

hidden_weight.reset( new liblnn::buffer< glm::vec4 >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( input_width * hidden_width * sizeof( glm::vec4 ) ).setUsage( copyable ) ) );
output_weight.reset( new liblnn::buffer< glm::vec4 >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( hidden_width * output_width * sizeof( glm::vec4 ) ).setUsage( copyable ) ) );
hidden_affine_output.reset( new liblnn::buffer< float >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
hidden_relu_output.reset( new liblnn::buffer< float >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( hidden_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
output_affine_output.reset( new liblnn::buffer< float >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
output_relu_output.reset( new liblnn::buffer< float >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );
softmax_grad.reset( new liblnn::buffer< float >( allocator, buf_type,
  vk::BufferCreateInfo().setSize( output_width * batch_size * sizeof( float ) ).setUsage( non_copyable ) ) );

Slide 28

Slide 28 text

Compute the product of the input (a matrix) and the weights (a matrix) on the GPU.
(Diagram: input and weights → product → output)

Slide 29

Slide 29 text

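GPU
One small cell computes one piece of data: a Thread in Vulkan terms, and also a Thread in NVIDIA terms (the Vulkan and GLSL specifications call it an invocation).
One row of cells is one SIMD unit: a Subgroup in Vulkan terms, a Warp in NVIDIA terms.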

Slide 30

Slide 30 text

GPU
The architecture gets its performance from the sheer number of processors, so the computation has to be spread over as many threads as possible.

Slide 31

Slide 31 text

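VRAM
All threads share the VRAM. In Vulkan terms it is Memory; in NVIDIA terms, Global Memory.
Several threads reading a value from the same address is not a problem, but when several threads write to the same address, which thread's value survives is undefined.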

Slide 32

Slide 32 text

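$\begin{pmatrix} x_{00} & x_{01} & x_{02} \\ x_{10} & x_{11} & x_{12} \\ x_{20} & x_{21} & x_{22} \end{pmatrix} \begin{pmatrix} w_{00} & w_{01} & w_{02} \\ w_{10} & w_{11} & w_{12} \\ w_{20} & w_{21} & w_{22} \end{pmatrix} = \begin{pmatrix} \sum_i x_{0i} w_{i0} & \sum_i x_{0i} w_{i1} & \sum_i x_{0i} w_{i2} \\ \sum_i x_{1i} w_{i0} & \sum_i x_{1i} w_{i1} & \sum_i x_{1i} w_{i2} \\ \sum_i x_{2i} w_{i0} & \sum_i x_{2i} w_{i1} & \sum_i x_{2i} w_{i2} \end{pmatrix}$
Clearly the computation can run in parallel up to the number of elements of the output matrix.
$\sum_i x_{0i} w_{i0} = x_{00} w_{00} + x_{01} w_{10} + x_{02} w_{20}$
Computing the terms inside each ∑ in parallel as well yields even more threads, but then the problem is how to take the ∑ of the values held by the individual threads.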

Slide 33

Slide 33 text

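SRAM
The GPU has SRAM through which multiple threads can synchronize.
(Diagram: thread A writes to SRAM → synchronize → thread B reads from SRAM)
A value can be handed over to another thread in the same WorkGroup.
Execution stops until every thread of the WorkGroup has reached the synchronization point.
This SRAM is called SharedMemory in Vulkan terms, and also SharedMemory in NVIDIA terms.
A bundle of threads that can share SRAM is called a WorkGroup in Vulkan terms and a Block in NVIDIA terms.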

Slide 34

Slide 34 text

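∑ on a classic GPU
∑ is obtained with $\log_2(n)$ rounds of addition and synchronization.
Example with x0 x1 x2 x3 and the SRAM slots S0, S1:
S0 := x1, S1 := x3
synchronize
S0 := x0 + S0, S1 := x2 + S1
synchronize
S0 := S0 + S1 = x0 + x1 + x2 + x3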
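The pairwise scheme above can be simulated sequentially to see where the log2(n) comes from. The following is a hedged CPU sketch, not GPU code: `tree_sum` is a hypothetical name, the vector `s` stands in for the shared SRAM, and the end of each loop round stands in for the synchronization point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the values the way a workgroup would with shared memory.
// In each round, "thread" i adds its partner element i + half;
// all additions within one round are independent of each other.
// The number of rounds is reported through `rounds`.
float tree_sum(std::vector<float> s, int& rounds) {
  assert(!s.empty());
  rounds = 0;
  std::size_t n = s.size();
  while (n > 1) {
    const std::size_t half = (n + 1) / 2;  // active "threads" this round
    for (std::size_t i = 0; i + half < n; ++i)
      s[i] += s[i + half];
    n = half;  // a real shader would place barrier() here
    ++rounds;
  }
  return s[0];
}
```

With four inputs the loop runs two rounds, and with eight inputs three rounds, in line with the log2(n) count on the slide.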

Slide 35

Slide 35 text

Newer GPUs can perform horizontal operations within a single Subgroup.
Horizontal add: threads A, B, C, D hold x0, x1, x2, x3; afterwards every one of them holds x0 + x1 + x2 + x3.
In Vulkan terms these are Subgroup operations; in NVIDIA terms, Warp Shuffle.

Slide 36

Slide 36 text

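∑ on a modern GPU
When the Subgroup size is 32, ∑ is obtained with $\log_{32}(n)$ rounds of addition and synchronization.
Example, $\sum_{i=0}^{64} x_i$: horizontal add over x0, x1, …; horizontal add over x32, x33, x34, …; horizontal add over x64; write the partial sums to SRAM and synchronize; horizontal add over the partial sums.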

Slide 37

Slide 37 text

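GPU execution hierarchy, with the figures for the GeForce GTX 1070:
Subgroup: threads that share a program counter. 32 threads.
Workgroup: shares SRAM and can run at the same time. 4 subgroups physically, 48 subgroups logically.
Dispatch: shares VRAM and can run at the same time. 60 workgroups physically, 2^64 workgroups logically.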

Slide 38

Slide 38 text

∑ in GLSL:

shared float local_sum[ local_memory_size ];
float large_sum( in float value ) {
  // Horizontal add within the subgroup, write the result to SharedMemory, then synchronize.
  float sg_sum = subgroupAdd( value );
  local_sum[ gl_SubgroupID ] = sg_sum;
  barrier();
  uint len = gl_NumSubgroups;
  while( len > 1 ) {
    // Horizontal add over the values in SharedMemory, write the result back, then synchronize.
    uint index = gl_SubgroupInvocationID + gl_SubgroupID * gl_SubgroupSize;
    float sum = subgroupAdd( index < len ? local_sum[ index ] : 0.0 );
    local_sum[ gl_SubgroupID ] = sum;
    barrier();
    len /= gl_SubgroupSize;
  }
  barrier();
  // Once SharedMemory holds a single element, return that value.
  return local_sum[ 0 ];
}

Slide 39

Slide 39 text

Matrix product in GLSL
Arrange things so that the threads whose values must be summed (∑) land in the same WorkGroup, and write the product of the input-matrix values and the weight-matrix values to the output matrix.

void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint output_index = gl_GlobalInvocationID.y;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
  output_data[ output_index + data_index * output_width ] = 0.0;
  // Walk over the input in chunks of input_width elements.
  for( uint offset = 0; offset < width; offset += input_width ) {
    float value = ( offset + input_index ) < width ?
      input_data[ offset + input_index + data_index * width ] *
      weight[ output_index + ( offset + input_index ) * output_width ].x :
      0.0;
    output_data[ output_index + data_index * output_width ] += large_sum( value );
  }
}

Slide 40

Slide 40 text

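Activation function (input → ϕ → output)
If the layers consisted only of matrix products, linearity would be preserved no matter how many layers were stacked.
In other words, only linear functions could be approximated.
Placing a nonlinear function that breaks the linearity between one matrix product and the next makes it possible to approximate nonlinear functions.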

Slide 41

Slide 41 text

Hyperbolic Tangent

Slide 42

Slide 42 text

Rectified Linear Unit (ReLU)

Slide 43

Slide 43 text

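For Hyperbolic Tangent: $y_{ij} = \tanh(x_{ij})$
For Rectified Linear Unit: $y_{ij} = \begin{cases} x_{ij} & x_{ij} \ge 0 \\ 0 & x_{ij} < 0 \end{cases}$
An activation function can be computed independently for every element of the input and output matrices, so the computation can run in parallel up to the number of elements of the output matrix (equal to the number of elements of the input matrix).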

Slide 44

Slide 44 text

Hyperbolic Tangent in GLSL:

void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  for( uint offset = 0; offset < width; offset += input_width ) {
    if( ( offset + input_index ) < width )
      output_data[ offset + input_index + data_index * width ] =
        tanh( input_data[ offset + input_index + data_index * width ] );
  }
}

Rectified Linear Unit in GLSL:

void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  for( uint offset = 0; offset < width; offset += input_width ) {
    if( ( offset + input_index ) < width )
      output_data[ offset + input_index + data_index * width ] =
        max( 0, input_data[ offset + input_index + data_index * width ] );
  }
}

Slide 45

Slide 45 text

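Loss function (obtained output and desired output → loss function → error)
A function whose output gets smaller the more the neural network's output resembles the desired output.
Attaching it at the end turns the search for suitable weights into an optimization problem.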

Slide 46

Slide 46 text

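Format of the output
Neural network output: $y = (0.8,\ 0.000007,\ 0.9,\ 0.036,\ 0.00005)$, "I think it is either number 0 or number 2."
Desired output: $t = (0,\ 0,\ 1,\ 0,\ 0)$, "The correct answer is number 2."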

Slide 47

Slide 47 text

softmax
$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$
$y = (0.8,\ 0.000007,\ 0.9,\ 0.036,\ 0.00005)$
$\mathrm{softmax}(y) = (0.288213,\ 0.129503,\ 0.318524,\ 0.134249,\ 0.129509)$
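The numbers on this slide can be reproduced with a few lines of CPU code. This is a minimal sketch for checking the formula, not the shader from the talk; the function name `softmax` is illustrative, and subtracting the maximum before exponentiating is a common stability trick added here as an assumption (it does not change the result).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over one vector: y_i = exp(x_i) / sum_j exp(x_j).
std::vector<double> softmax(const std::vector<double>& x) {
  assert(!x.empty());
  // Shifting by the maximum keeps exp() from overflowing.
  const double m = *std::max_element(x.begin(), x.end());
  std::vector<double> y(x.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    y[i] = std::exp(x[i] - m);
    sum += y[i];
  }
  for (double& v : y) v /= sum;  // normalize so the outputs sum to 1
  return y;
}
```

Calling `softmax({0.8, 0.000007, 0.9, 0.036, 0.00005})` gives values that agree with the slide to about five decimal places.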

Slide 48

Slide 48 text

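Cross-entropy error
$L = -\sum_i t_i \log(y_i)$, with $y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$
When t = 1 and y = 1: l = 0
When t = 1 and y = 0.001: l = 6.908
When t = 0: l = 0
For the softmax result $y_i$ to approach 1, the values of y at positions other than i must be close to 0.
In summary, the more y and t resemble each other, the smaller L becomes.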

Slide 49

Slide 49 text

Softmax with Cross Entropy Loss in GLSL
$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$, $L = -\sum_i t_i \log(y_i)$

void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  float value1 = input_index < width ?
    exp( input_data[ input_index + data_index * width ] ) : 0.0;
  // If the sum of exp(x_j) becomes 0, the division produces nan or inf, hence the 1.0e-10.
  float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
  float t = teacher_data[ input_index + data_index * width ];
  // If y gets too close to 0, log produces inf, hence the clamp.
  float y_ = max( y, 1.0e-10 );
  float value2 = input_index < width ? t * log( y_ ) : 0.0;
  float l = -large_sum( float( value2 ) );
  if( input_index == 0 )
    output_data[ data_index ] = l;
}

Slide 50

Slide 50 text

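Regarding x and t as constants, search for the $w_h$ and $w_o$ that minimize L.
(Diagram: x0, x1, …, xb → product ($w_h$) → ϕ → product ($w_o$) → ϕ → loss function, which also takes t0, t1, …, tb → L)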

Slide 51

Slide 51 text

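How does L change when w changes?
$\frac{\partial L}{\partial w}$
Take the partial derivative with respect to w of the function that runs from the layer in which the weight appears through to the cross-entropy error.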

Slide 52

Slide 52 text

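(Diagram: $w_o$ → product → c → ϕ → d → loss function, which also takes t → L)
This can be regarded as a composite of three chained functions.
By the chain rule for differentiating composite functions, $\frac{df}{dx} = \frac{df}{dg}\frac{dg}{dx}$, so
$\frac{\partial L}{\partial w_o} = \frac{\partial L}{\partial d}\frac{\partial d}{\partial c}\frac{\partial c}{\partial w_o}$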

Slide 53

Slide 53 text

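(Diagram: $w_h$ → product → a → ϕ → b → product → c → ϕ → d → loss function, which also takes t → L)
$\frac{\partial L}{\partial w_h} = \frac{\partial L}{\partial d}\frac{\partial d}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial a}\frac{\partial a}{\partial w_h}$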

Slide 54

Slide 54 text

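Backpropagation
(Diagram: $w_h$ → product → a → ϕ → b → product → c → ϕ → d → loss function, which also takes t → L)
If the derivative of each layer's output with respect to its input can be found, $\frac{\partial L}{\partial w}$ can be obtained in order, starting from the last layer:
$\frac{\partial L}{\partial d}$
$\frac{\partial L}{\partial d}\frac{\partial d}{\partial c}$
$\frac{\partial L}{\partial d}\frac{\partial d}{\partial c}\frac{\partial c}{\partial b}$
$\frac{\partial L}{\partial d}\frac{\partial d}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial a}$
$\frac{\partial L}{\partial d}\frac{\partial d}{\partial c}\frac{\partial c}{\partial b}\frac{\partial b}{\partial a}\frac{\partial a}{\partial w_h}$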

Slide 55

Slide 55 text

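(Diagram of the data in the network, now including the backward pass. Forward: input, hidden layer weights, hidden layer product, hidden layer output, output layer weights, output layer product, output layer output, desired output, loss function, error. Backward: gradient of the loss function, gradient of the activation function, gradient of the matrix product, gradient of the activation function, gradient of the matrix product.)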

Slide 56

Slide 56 text

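Backpropagation through the loss function (error and desired output → loss function → gradient)
$L = -\sum_i t_i \log(y_i)$, $y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$
$\frac{\partial L}{\partial y_i} = -\frac{t_i}{y_i}$
$\frac{\partial y_i}{\partial x_k} = \begin{cases} y_i (1 - y_i) & i = k \\ -y_i y_k & i \ne k \end{cases}$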

Slide 57

Slide 57 text

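Backpropagation through the loss function (error and desired output → loss function → gradient)
Using $\frac{\partial L}{\partial y_i} = -\frac{t_i}{y_i}$ and $\frac{\partial y_i}{\partial x_k} = \begin{cases} y_i (1 - y_i) & i = k \\ -y_i y_k & i \ne k \end{cases}$:
$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial x_i} + \sum_{k \ne i} \frac{\partial L}{\partial y_k}\frac{\partial y_k}{\partial x_i} = -t_i (1 - y_i) + \sum_{k \ne i} t_k y_i$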

Slide 58

Slide 58 text

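Backpropagation through the loss function (error and desired output → loss function → gradient)
$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial x_i} + \sum_{k \ne i} \frac{\partial L}{\partial y_k}\frac{\partial y_k}{\partial x_i} = -t_i (1 - y_i) + \sum_{k \ne i} t_k y_i = -t_i + y_i \sum_k t_k = y_i - t_i$
The last step uses the fact that the desired outputs sum to 1, $\sum_k t_k = 1$.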
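A result like y_i − t_i is easy to sanity-check numerically. The sketch below compares the closed form with a central finite difference of the loss; all function names are hypothetical, the step size h = 1e-5 is an arbitrary choice, and the check assumes the desired outputs sum to 1 as on the slide.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// L = -sum_i t_i log(y_i) with y = softmax(x), written as
// -sum_i t_i (x_i - log(sum_j exp(x_j))) so y is never formed explicitly.
double softmax_cross_entropy(const std::vector<double>& x,
                             const std::vector<double>& t) {
  assert(!x.empty() && x.size() == t.size());
  double sum = 0.0;
  for (double v : x) sum += std::exp(v);
  const double log_sum = std::log(sum);
  double loss = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) loss -= t[i] * (x[i] - log_sum);
  return loss;
}

// Closed form from the derivation: dL/dx_i = y_i - t_i (needs sum_k t_k = 1).
std::vector<double> analytic_grad(const std::vector<double>& x,
                                  const std::vector<double>& t) {
  double sum = 0.0;
  for (double v : x) sum += std::exp(v);
  std::vector<double> g(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) g[i] = std::exp(x[i]) / sum - t[i];
  return g;
}

// Central finite difference of L with respect to x_i.
double numeric_grad(std::vector<double> x, const std::vector<double>& t,
                    std::size_t i, double h = 1e-5) {
  x[i] += h;
  const double up = softmax_cross_entropy(x, t);
  x[i] -= 2.0 * h;
  const double down = softmax_cross_entropy(x, t);
  return (up - down) / (2.0 * h);
}
```

If the two gradients disagree by more than rounding noise, either the derivation or the implementation has a mistake, which makes this a handy test when writing backward shaders by hand.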

Slide 59

Slide 59 text

Output the gradient at the end of the softmax GLSL:
$\frac{\partial L}{\partial x_i} = -t_i (1 - y_i) + \sum_{k \ne i} t_k y_i = -t_i + y_i \sum_k t_k = y_i - t_i$

void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  float value1 = input_index < width ?
    exp( input_data[ input_index + data_index * width ] ) : 0.0;
  float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
  float t = teacher_data[ input_index + data_index * width ];
  float y_ = max( y, 1.0e-10 );
  float value2 = input_index < width ? t * log( y_ ) : 0.0;
  float l = -large_sum( float( value2 ) );
  if( input_index == 0 )
    output_data[ data_index ] = l;
  // Gradient with respect to the softmax input: y - t.
  if( input_index < width )
    input_grad[ input_index + data_index * width ] = float( y - t );
}

Slide 60

Slide 60 text

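If tanh is used as the activation function of the final layer, its range does not match the range the loss function expects, so the two are aligned.
ϕ: "I output values in $-1 \le x \le 1$."
Loss function (with t0, t1, …, tb → L): "Please give me $0 \le x$."
$s_i = \frac{x_i}{2} + \frac{1}{2}$
$y_i = \frac{e^{s_i}}{\sum_j e^{s_j}}$
$L = -\sum_i t_i \log(y_i)$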

Slide 61

Slide 61 text

$\frac{\partial L}{\partial s_i} = y_i - t_i$, $\frac{\partial s_i}{\partial x_i} = \frac{1}{2}$
$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial s_i}\frac{\partial s_i}{\partial x_i} = \frac{y_i - t_i}{2}$

void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  // s = x / 2 + 1 / 2 maps the tanh range [-1, 1] to [0, 1].
  float value1 = input_index < width ?
    exp( input_data[ input_index + data_index * width ] * 0.5 + 0.5 ) : 0.0;
  float y = value1 / ( large_sum( float( value1 ) ) + 1.0e-10 );
  float t = teacher_data[ input_index + data_index * width ];
  float y_ = max( y, 1.0e-10 );
  float value2 = input_index < width ? t * log( y_ ) : 0.0;
  float l = -large_sum( float( value2 ) );
  if( input_index == 0 )
    output_data[ data_index ] = l;
  // The factor 0.5 is ds/dx.
  if( input_index < width )
    input_grad[ input_index + data_index * width ] = float( y - t ) * 0.5;
}

Slide 62

Slide 62 text

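Backpropagation through Hyperbolic Tangent (output-side gradient $\frac{\partial L}{\partial y_i}$ → ϕ → gradient $\frac{\partial L}{\partial x_i}$)
$y_i = \tanh(x_i)$
$\frac{\partial y_i}{\partial x_i} = 1 - \tanh^2(x_i)$
$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial x_i} = \frac{\partial L}{\partial y_i}\left(1 - \tanh^2(x_i)\right)$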

Slide 63

Slide 63 text

Backpropagation through Hyperbolic Tangent
$\frac{\partial y_i}{\partial x_i} = 1 - \tanh^2(x_i)$
$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial x_i} = \frac{\partial L}{\partial y_i}\left(1 - \tanh^2(x_i)\right)$

void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  // Step by input_width, as in the forward shader, so that every element is visited.
  for( uint offset = 0; offset < width; offset += input_width ) {
    if( ( offset + input_index ) < width )
      input_grad[ offset + input_index + data_index * width ] =
        ( 1 - pow( tanh( input_data[ offset + input_index + data_index * width ] ), 2 ) ) *
        output_grad[ offset + input_index + data_index * width ];
  }
}

Slide 64

Slide 64 text

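Backpropagation through Rectified Linear Unit (output-side gradient $\frac{\partial L}{\partial y_i}$ → ϕ → gradient $\frac{\partial L}{\partial x_i}$)
$y_i = \begin{cases} x_i & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$
$\frac{\partial y_i}{\partial x_i} = \begin{cases} 1 & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$
$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial x_i} = \begin{cases} \frac{\partial L}{\partial y_i} & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$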

Slide 65

Slide 65 text

Backpropagation through Rectified Linear Unit
$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i}\frac{\partial y_i}{\partial x_i} = \begin{cases} \frac{\partial L}{\partial y_i} & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$

void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  for( uint offset = 0; offset < width; offset += input_width ) {
    if( ( offset + input_index ) < width )
      input_grad[ offset + input_index + data_index * width ] =
        input_data[ offset + input_index + data_index * width ] >= 0 ?
        output_grad[ offset + input_index + data_index * width ] :
        0.0;
  }
}

Slide 66

Slide 66 text

However you look at it, the slope jumps abruptly at x = 0 (from 0 to 1). Can a derivative really be defined here?

Slide 67

Slide 67 text

The ReLU paper [1] approximates $y_i = \log(1 + \exp(x_i))$ by
$y_i = \begin{cases} x_i & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$
[1] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML'10), Johannes Fürnkranz and Thorsten Joachims (Eds.). Omnipress, USA, 807-814.

Slide 68

Slide 68 text

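Accordingly, $\frac{\partial y_i}{\partial x_i} = \frac{\exp(x_i)}{1 + \exp(x_i)}$ is approximated by
$\frac{\partial y_i}{\partial x_i} = \begin{cases} 1 & x_i \ge 0 \\ 0 & x_i < 0 \end{cases}$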

Slide 69

Slide 69 text

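Backpropagation through the matrix product
(Diagram: output-side gradient $\frac{\partial L}{\partial y_j}$ → product → gradient of x, $\frac{\partial L}{\partial x_i}$, and gradient of w, $\frac{\partial L}{\partial w_{ij}}$)
Two things have to be found, so start with the part in the red frame, the gradient of w:
$y_j = \sum_i w_{ij} x_i$
$\frac{\partial y_j}{\partial w_{ij}} = x_i$
$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y_j} x_i$
Multiply the gradient of the output that the weight contributed to by the input that the weight was applied to.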

Slide 70

Slide 70 text

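Backpropagation through the matrix product
(Diagram: output-side gradient $\frac{\partial L}{\partial y_j}$ → product → gradient of x, $\frac{\partial L}{\partial x_i}$, and gradient of w, $\frac{\partial L}{\partial w_{ij}}$)
Next, consider the part in the green frame, the gradient of x:
$y_j = \sum_i w_{ij} x_i$
$\frac{\partial y_j}{\partial x_i} = w_{ij}$
$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j} w_{ij}$
Take the gradient of each output that the input influenced, multiply it by the weight connecting that input and output, and sum the results.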
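Both gradients of the matrix product can be written as one short CPU routine, which is useful as a reference when checking a GPU implementation. This is a sketch under assumptions: `DenseGrads` and `dense_backward` are hypothetical names, a single sample is processed (no batch dimension), and the weights use the row-major layout `w[i * m + j] = w_ij`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Gradients produced by the backward pass of y_j = sum_i w_ij x_i.
struct DenseGrads {
  std::vector<float> dx;  // dL/dx_i, one per input element
  std::vector<float> dw;  // dL/dw_ij, same layout as w
};

DenseGrads dense_backward(const std::vector<float>& w,
                          const std::vector<float>& x,
                          const std::vector<float>& dy) {
  const std::size_t n = x.size();   // number of inputs
  const std::size_t m = dy.size();  // number of outputs
  assert(w.size() == n * m);
  DenseGrads g{std::vector<float>(n, 0.0f), std::vector<float>(n * m, 0.0f)};
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j < m; ++j) {
      g.dw[i * m + j] = dy[j] * x[i];    // dL/dw_ij = dL/dy_j * x_i
      g.dx[i] += dy[j] * w[i * m + j];   // dL/dx_i = sum_j dL/dy_j * w_ij
    }
  }
  return g;
}
```

The sum over j in `dx` is the reduction that the GLSL version performs with a horizontal add, while every `dw` entry is independent.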

Slide 71

Slide 71 text

Backpropagation through the matrix product
Launch as many threads as there are $\frac{\partial L}{\partial w_{ij}}$, and assign WorkGroups so that the threads sharing the same $\frac{\partial L}{\partial x_i}$ can be summed with a horizontal add.

void main() {
  const uint input_index = gl_GlobalInvocationID.x;
  const uint output_index = gl_GlobalInvocationID.y;
  const uint input_width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  const uint output_width = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
  // Gradient of x: sum over the outputs of weight * output-side gradient.
  for( uint data_index = 0; data_index != batch_size; data_index++ ) {
    input_grad[ input_index + data_index * input_width ] = 0.0;
    for( uint offset = 0; offset < height; offset += output_width ) {
      float grad_x = ( offset + output_index ) < height ?
        weight[ offset + output_index + input_index * height ].x *
        output_grad[ offset + output_index + data_index * height ] :
        0.0;
      input_grad[ input_index + data_index * input_width ] += large_sum( grad_x );
    }
  }
  // Gradient of w: input * output-side gradient, summed over the batch, then fed to Adam.
  for( uint offset = 0; offset < height; offset += output_width ) {
    float grad_w_sum = 0.0;
    for( uint data_index = 0; data_index != batch_size; data_index++ ) {
      float grad_w = ( offset + output_index ) < height ?
        input_data[ input_index + data_index * input_width ] *
        output_grad[ offset + output_index + data_index * height ] :
        0.0;
      grad_w_sum += grad_w;
    }
    if( ( offset + output_index ) < height )
      adam( weight[ offset + output_index + input_index * height ], grad_w_sum );
  }
}

Slide 72

Slide 72 text

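Stochastic gradient descent
$w_{t+1} = w_t - \mu \frac{\partial L}{\partial w_t}$
Basically, follow $\frac{\partial L}{\partial w_t}$ and update w a little in the direction that makes L smaller.
(Plot of L against w, marked "we are here", "we want to get here", and the update direction of w.)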

Slide 73

Slide 73 text

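Stochastic gradient descent
$w_{t+1} = w_t - \mu \frac{\partial L}{\partial w_t}$
It trips over a small bump and turns back.
(Plot of L against w, marked "we are here", "we want to get here", and the update direction of w.)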

Slide 74

Slide 74 text

The evolution of optimization algorithms:
SGD (stochastic gradient descent), MomentumSGD, NAG, AdaGrad, RMSprop, Adam, AdaDelta, AdaMax, SMORMS3, RMSpropGraves, Eve, Nadam, Santa-E, Santa-SSS, AdaSecant, GD by GD

Slide 75

Slide 75 text

Adam
$g = \frac{\partial L}{\partial w_t}$
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g^2$
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$
$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$
$w_{t+1} = w_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$
$\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ are the recommended values, so they are used as-is.

// weight.x: the weight itself, weight.y: m, weight.z: v, weight.w: step count t
void adam( inout vec4 weight, in float grad ) {
  weight.w += 1;
  float gt = grad;
  weight.y = beta1 * weight.y + ( 1 - beta1 ) * gt;
  weight.z = beta2 * weight.z + ( 1 - beta2 ) * gt * gt;
  float mhat = weight.y / ( 1 - pow( beta1, weight.w ) );
  float vhat = weight.z / ( 1 - pow( beta2, weight.w ) );
  weight.x -= alpha * mhat / ( sqrt( vhat ) + eps );
}

Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980v9 https://arxiv.org/abs/1412.6980
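To see what a single Adam update does, the shader routine can be mirrored on the CPU. The sketch below is illustrative only: `AdamState` and `adam_step` are hypothetical names, the four fields correspond to the four components of the `vec4` in the GLSL, and `eps = 1e-8` is an assumption taken from the default in the Adam paper, since the slide does not state it.

```cpp
#include <cassert>
#include <cmath>

// One weight plus its optimizer state (x, y, z, w of the vec4 in the shader).
struct AdamState {
  float w = 0.0f;  // the weight itself
  float m = 0.0f;  // first-moment estimate
  float v = 0.0f;  // second-moment estimate
  float t = 0.0f;  // number of steps taken so far
};

void adam_step(AdamState& s, float grad,
               float alpha = 0.001f, float beta1 = 0.9f,
               float beta2 = 0.999f, float eps = 1e-8f) {
  s.t += 1.0f;
  s.m = beta1 * s.m + (1.0f - beta1) * grad;
  s.v = beta2 * s.v + (1.0f - beta2) * grad * grad;
  // Bias correction compensates for m and v starting at zero.
  const float mhat = s.m / (1.0f - std::pow(beta1, s.t));
  const float vhat = s.v / (1.0f - std::pow(beta2, s.t));
  s.w -= alpha * mhat / (std::sqrt(vhat) + eps);
}
```

On the very first step the bias-corrected ratio is close to ±1, so the weight moves by about `alpha` regardless of how large the gradient is; only the sign matters.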

Slide 76

Slide 76 text

Initial values
Neurons with the same w behave identically.
The gradients $\frac{\partial L}{\partial w}$ with respect to the weights of neurons that produce the same output are identical.
Their weights are therefore updated by the same amount, so they keep behaving identically forever.
The initial values of a neural network's weights must not be uniform.

Slide 77

Slide 77 text

Xavier initialization
When a layer has n input elements, initialize w with the following value, where randn() is a normally distributed random number:
$w = \frac{1}{\sqrt{n}}\,\mathrm{randn}()$
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, PMLR, p. 249-256, 2010. http://proceedings.mlr.press/v9/glorot10a.html

Slide 78

Slide 78 text

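Generating random numbers on the GPU
$G_t$ → 0.0801, $G_{t+1}$ → 0.2926, $G_{t+2}$ → 0.4342, $G_{t+3}$ → 0.6978
Many pseudorandom number algorithms keep state that is updated every time one random number is emitted.
With that design the next random number cannot be generated until the previous one has been, so it does not scale.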

Slide 79

Slide 79 text

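Generating random numbers on the GPU
A mysterious uniform random number algorithm that has been handed down among video game developers for about ten years:
$s = (12.9898,\ 78.233)$
$t = 43758.5453$
$\mathrm{fract}(x) = x - \lfloor x \rfloor$
$f(x) = \mathrm{fract}(t \sin(x \cdot s))$
Example: [0.1 0.8] → f → 0.7340, [0.3 0.2] → f → 0.1768
No generator state is carried over from one element to the next, so it scales.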
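The hash is small enough to try out on the CPU. The following sketch ports it to double precision; `fract` and `prand` follow the names used in the formula and in the shader, but the two-scalar signature is an assumption made for the example, and results can differ in the low digits from a GPU evaluating `sin` in single precision.

```cpp
#include <cassert>
#include <cmath>

// Fractional part, as defined above: fract(x) = x - floor(x).
double fract(double x) { return x - std::floor(x); }

// Stateless pseudorandom value derived purely from the coordinates (u, v):
// fract(sin(dot((u, v), (12.9898, 78.233))) * 43758.5453).
double prand(double u, double v) {
  return fract(std::sin(u * 12.9898 + v * 78.233) * 43758.5453);
}
```

Because the value depends only on its coordinates, each weight can be initialized by a separate thread with no shared generator state.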

Slide 80

Slide 80 text

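Generating random numbers on the GPU
A mysterious uniform random number algorithm that has been handed down among video game developers for about ten years.
(Plots: the distribution of the generator's output, and the same output converted to a normal distribution with the Box-Muller method.)
A stray tuft sticks out of the distribution, but it does not look bad enough to be unusable.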

Slide 81

Slide 81 text

Xavier initialization
Generate a uniform random number, turn it into a normally distributed random number with the Box-Muller method, and run the function that computes the Xavier initial value with number of threads = number of weight elements.

float prand( vec2 i ) {
  return fract(sin(dot( i.xy ,vec2(12.9898,78.233))) * 43758.5453);
}
const float PI = 3.1415926535897932384626433832795;
float boxmuller( vec2 i, float mu, float sigma ) {
  float x = 1 - prand( i );  // in (0, 1], so log( x ) stays finite
  float y = prand( vec2( i.y, x ) );
  float n = prand( vec2( x, y * PI ) );
  float v = sqrt( -2.0 * log( x ) ) * cos( 2 * PI * n );
  return mu + sigma * v;
}
float xavier_init_value( vec2 i, uint n ) {
  float value = boxmuller( i, 0.0, 1.0 / sqrt( n ) );
  return value;
}
void main() {
  const uint x = gl_GlobalInvocationID.x;
  const uint y = gl_GlobalInvocationID.y;
  const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
  const uint index = x + y * width;
  weight[ index ] = vec4( xavier_init_value( vec2( float( x )/width, float( y )/height ), input_size ), 0, 0, 0 );
}

Slide 82

Slide 82 text

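He initialization
When a layer has n input elements, initialize w with the following value, where randn() is a normally distributed random number:
$w = \sqrt{\frac{2}{n}}\,\mathrm{randn}()$
When ReLU is used, this is said to propagate the error better in the early phase of training than Xavier initialization.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. 2015. https://arxiv.org/abs/1502.01852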

Slide 83

Slide 83 text

He initialization

float prand( vec2 i ) {
  return fract(sin(dot( i.xy ,vec2(12.9898,78.233))) * 43758.5453);
}
const float PI = 3.1415926535897932384626433832795;
float boxmuller( vec2 i, float mu, float sigma ) {
  float x = 1 - prand( i );
  float y = prand( vec2( i.y, x ) );
  float n = prand( vec2( x, y * PI ) );
  float v = sqrt( -2.0 * log( x ) ) * cos( 2 * PI * n );
  return mu + sigma * v;
}
float he_init_value( vec2 i, uint n ) {
  // This is the part that differs from Xavier: sigma is sqrt( 2 ) / sqrt( n ).
  float value = boxmuller( i, 0.0, sqrt( 2 ) / sqrt( n ) );
  return value;
}
void main() {
  const uint x = gl_GlobalInvocationID.x;
  const uint y = gl_GlobalInvocationID.y;
  const uint width = gl_WorkGroupSize.x * gl_NumWorkGroups.x;
  const uint height = gl_WorkGroupSize.y * gl_NumWorkGroups.y;
  const uint index = x + y * width;
  weight[ index ] = vec4( he_init_value( vec2( float( x )/width, float( y )/height ), input_size ), 0, 0, 0 );
}

Slide 84

Slide 84 text

Compile the GLSL, create the ComputePipelines, and bind the buffers:

hidden_affine1.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props, batch_images[ 0 ], hidden_affine_output, hidden_weight, batch_size ) ) );
hidden_affine2.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props, batch_images[ 1 ], hidden_affine_output, hidden_weight, batch_size ) ) );
hidden_activation.reset( new layer( create_relu_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props, hidden_affine_output, hidden_activation_output ) ) );
output_affine.reset( new layer( create_affine_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props, hidden_activation_output, output_affine_output, output_weight, batch_size ) ) );
output_activation.reset( new layer( create_tanh_forward_pipeline( device, mods, descriptor_pool, pipeline_cache, props, output_affine_output, output_activation_output ) ) );
error1.reset( new layer( create_softmax_combined_pipeline( device, mods, descriptor_pool, pipeline_cache, props, output_activation_output, error_out, softmax_grad, batch_labels[ 0 ] ) ) );
error2.reset( new layer( create_softmax_combined_pipeline( device, mods, descriptor_pool, pipeline_cache, props, output_activation_output, error_out, softmax_grad, batch_labels[ 1 ] ) ) );
output_activation_backward.reset( new layer( create_tanh_backward_pipeline( device, mods, descriptor_pool, pipeline_cache, props, output_affine_output, output_activation_output, output_activation_grad, softmax_grad ) ) );
output_affine_backward.reset( new layer( create_affine_backward_pipeline( device, mods, descriptor_pool, pipeline_cache, props, hidden_activation_output, output_affine_output,
(the listing is cut off here on the slide)

Slide 85

Slide 85 text

Execution
Prepare two sets of buffers for the pairs of input and desired output, so that training and the transfer of the next data happen at the same time.

void network::exec() {
  ++swap_index;
  swap_index %= 2;
  // Submit the contents of the command buffer to the GPU.
  queue->submit(
    vk::SubmitInfo()
      .setCommandBufferCount( 1 )
      .setPCommandBuffers( command_buffers->data() + swap_index ),
    vk::Fence()
  );
  fill( false, false );
  queue->waitIdle();
  if( debug ) {
    std::cout << "==============" << std::endl;
    check();
    print( *error_out, batch_size );
    print( *output_activation_output, batch_size );
    print_image( *batch_images[ swap_index ], train_input->get_image_width(), batch_size );
    print_label( *batch_labels[ swap_index ], batch_size );
    print_eval( *output_activation_output, batch_size );
  }
}

Slide 86

Slide 86 text

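MNIST http://yann.lecun.com/exdb/mnist/
70,000 labeled images of handwritten digits.
An easy classification task: even an SVM classifies it with more than 90% accuracy.
If the implementation is correct, there is no way it can fail to classify these.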

Slide 87

Slide 87 text

Batch size 64, hidden layer width 128
Classification accuracy on the evaluation data: around 98%
It works as a neural network.

Slide 88

Slide 88 text

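Fashion-MNIST https://github.com/zalandoresearch/fashion-mnist
70,000 labeled images of clothing.
It contains 10 kinds of clothing, such as T-shirts, coats, and shoes.
Different kinds of clothing can still have similar shapes, so it is considered harder than MNIST.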

Slide 89

Slide 89 text

Batch size 64, hidden layer width 128
Classification accuracy on the evaluation data: around 87%
It does indeed drop somewhat.

Slide 90

Slide 90 text

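When the accuracy is not good enough, add layers.
However, adding plain matrix-product layers makes w grow bigger and bigger.
(Diagram: product → ϕ → product → ϕ → product → ϕ → loss function → L)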

Slide 91

Slide 91 text

Convolution
An image-processing filter that is learned.
(Diagram: input x, filter w, output y)

Slide 92

Slide 92 text

Convolution
An image-processing filter that is learned.
(Diagram: input x, filter w, output y)

Slide 93

Slide 93 text

Convolution (input → convolution with the filter → output)
For filter size M × N, stride 1, and no margin:
$y_{ij} = \sum_{k=0}^{M} \sum_{l=0}^{N} w_{kl}\, x_{(i+k)(j+l)}$
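The convolution formula translates directly into four nested loops. The following CPU sketch is for illustration and is not the shader from the talk: `conv2d_valid` is a hypothetical name, it handles a single channel with stride 1 and no margin, images are stored row-major, and the filter is taken to have exactly `f_h × f_w` taps (indices 0 to f_h−1 and 0 to f_w−1), which is an assumption about how the summation bounds are meant.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y_ij = sum_k sum_l w_kl * x_(i+k)(j+l), single channel, stride 1, no margin.
// x is in_h x in_w, w is f_h x f_w, both row-major.
std::vector<float> conv2d_valid(const std::vector<float>& x,
                                std::size_t in_h, std::size_t in_w,
                                const std::vector<float>& w,
                                std::size_t f_h, std::size_t f_w) {
  assert(x.size() == in_h * in_w && w.size() == f_h * f_w);
  assert(f_h >= 1 && f_w >= 1 && f_h <= in_h && f_w <= in_w);
  const std::size_t out_h = in_h - f_h + 1;
  const std::size_t out_w = in_w - f_w + 1;
  std::vector<float> y(out_h * out_w, 0.0f);
  for (std::size_t i = 0; i < out_h; ++i)
    for (std::size_t j = 0; j < out_w; ++j)
      for (std::size_t k = 0; k < f_h; ++k)
        for (std::size_t l = 0; l < f_w; ++l)
          y[i * out_w + j] += w[k * f_w + l] * x[(i + k) * in_w + (j + l)];
  return y;
}
```

The same small filter is reused at every output position, which is why a convolution layer needs far fewer weights than a fully connected layer of the same input size.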

Slide 94

Slide 94 text

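Backpropagation through convolution
(Diagram: output-side gradient $\frac{\partial L}{\partial y_{ij}}$ → convolution → gradient of x, $\frac{\partial L}{\partial x_{ij}}$, and gradient of w, $\frac{\partial L}{\partial w_{kl}}$)
For filter size M × N, stride 1, and no margin:
$y_{ij} = \sum_{k=0}^{M} \sum_{l=0}^{N} w_{kl}\, x_{(i+k)(j+l)}$
$\frac{\partial L}{\partial w_{kl}} = \sum_i \sum_j \frac{\partial L}{\partial y_{ij}}\, x_{(i+k)(j+l)}$
For every input/output pair that a given weight took part in, take the product of the output-side gradient and the input, and sum them all.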

Slide 95

Slide 95 text

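Backpropagation through convolution
(Diagram: output-side gradient $\frac{\partial L}{\partial y_{ij}}$ → convolution → gradient of x, $\frac{\partial L}{\partial x_{ij}}$, and gradient of w, $\frac{\partial L}{\partial w_{kl}}$)
For filter size M × N, stride 1, and no margin:
$y_{ij} = \sum_{k=0}^{M} \sum_{l=0}^{N} w_{kl}\, x_{(i+k)(j+l)}$
$\frac{\partial L}{\partial w_{kl}} = \sum_i \sum_j \frac{\partial L}{\partial y_{ij}}\, x_{(i+k)(j+l)}$
$\frac{\partial L}{\partial x_{ij}} = \sum_k^M \sum_l^N w_{kl}\, \frac{\partial L}{\partial y_{(i-k)(j-l)}}$
Convolve the output-side gradient, rotated by 180 degrees.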

Slide 96

Slide 96 text

GLSL that computes $\frac{\partial L}{\partial w_{kl}}$:

void main() {
  const uint filter_index = gl_GlobalInvocationID.x;
  const uint filter_x = filter_index % filter_width;
  const uint filter_y = filter_index / filter_width % filter_height;
  const uint channel = filter_index / filter_width / filter_height % channels;
  const uint filter_size = filter_width * filter_height * channels;
  const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
  const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
  bool filter_oob = filter_index >= filter_size;
  float sum = 0.0;
  // One thread per filter element: accumulate over the batch and over every output position.
  for( int data_index = 0; data_index != batch_size; ++data_index ) {
    for( int output_x = 0; output_x != output_width; ++output_x ) {
      for( int output_y = 0; output_y != output_height; ++output_y ) {
        const int output_index = int( output_x ) + int( output_y ) * int( output_width ) + int( channel ) * int( output_width * output_height ) + data_index * int( output_width * output_height * channels );
        const int input_x = output_x * int(filter_xstride) - int(xmargin) + int(filter_x);
        const int input_y = output_y * int(filter_ystride) - int(ymargin) + int(filter_y);
        const bool input_oob = filter_oob || input_x < 0 || input_x >= input_width || input_y < 0 || input_y >= input_height;
        const int input_index = int( input_x ) + int( input_y ) * int( input_width ) + int( channel ) * int( input_width * input_height ) + data_index * int( input_width * input_height * channels );
        const float grad = filter_oob ? 0.0 : output_grad[ output_index ];
        const float x = input_oob ? 0.0 : input_data[ input_index ];
        sum += grad * x;
      }
    }
  }
  if( !filter_oob ) adam( weight[ filter_index ], sum );
}

Slide 97

Slide 97 text

GLSL that computes $\frac{\partial L}{\partial x_{ij}}$:

void main() {
  const uint input_x = gl_GlobalInvocationID.x % output_width;
  const uint input_y = gl_GlobalInvocationID.x / output_width % output_height;
  const uint channel = gl_GlobalInvocationID.x / output_width / output_height;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width - xmargin * 2;
  const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height - ymargin * 2;
  const uint input_size = input_width * input_height * channels;
  const uint relative_input_index = input_x + input_y * input_width + channel * input_width * input_height;
  const uint input_index = relative_input_index + data_index * input_width * input_height * channels;
  if( relative_input_index < input_size )
    input_grad[ input_index ] = 0.0;
  // One thread per input element: accumulate over every filter tap.
  for( int x = 0; x != filter_width; ++x ) {
    for( int y = 0; y != filter_height; ++y ) {
      const int output_x = int(input_x) * int(filter_xstride) - int(xmargin) - x;
      const int output_y = int(input_y) * int(filter_ystride) - int(ymargin) - y;
      const bool oob = output_x < 0 || output_x >= output_width || output_y < 0 || output_y >= output_height;
      const int relative_output_index = output_x + output_y * int(output_width) + int(channel) * int(output_width * output_height);
      const int output_index = relative_output_index + int(data_index) * int(output_width * output_height * channels );
      const uint filter_index = x + y * int(filter_width) + channel * int(filter_width * filter_height );
      if( relative_input_index < input_size ) {
        if( !oob ) {
          const float grad = output_grad[ output_index ] * weight[ filter_index ].x;
          input_grad[ input_index ] += grad;
        }
      }
    }
  }
}

Slide 98

Slide 98 text

MaxPooling
Only the single element with the largest value within the window is kept in the output.
Input x:
3 6 1
2 2 0
5 9 6
Output y: (not yet filled in)

Slide 99

Slide 99 text

MaxPooling
Only the single element with the largest value within the window is kept in the output.
Input x:
3 6 1
2 2 0
5 9 6
Output y: 9

Slide 100

Slide 100 text

MaxPooling (input → MaxPooling → output)
For filter size M × N:
$y_{ij} = \max\left(x_{(Mi+k)(Nj+l)}\right),\quad k \in [0, M],\ l \in [0, N]$

void main() {
  const uint relative_output_index = gl_GlobalInvocationID.x;
  const uint output_x = relative_output_index % output_width;
  const uint output_y = relative_output_index / output_width % output_height;
  const uint channel = relative_output_index / output_width / output_height;
  const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
  const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint output_size = output_width * output_height * channels;
  const uint input_size = input_width * input_height * channels;
  const uint output_index = relative_output_index + data_index * output_size;
  if( relative_output_index < output_size )
    output_data[ output_index ] = 0.0;
  // Keep the running maximum over the window.
  for( uint x = 0; x != filter_width; ++x ) {
    for( uint y = 0; y != filter_height; ++y ) {
      const uint input_x = x + output_x * filter_xstride;
      const uint input_y = y + output_y * filter_ystride;
      const uint input_index = input_x + input_y * input_width + channel * input_width * input_height + data_index * input_width * input_height * channels;
      if( relative_output_index < output_size )
        output_data[ output_index ] = max( output_data[ output_index ], input_data[ input_index ] );
    }
  }
}

Slide 101

Slide 101 text

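Backpropagation through MaxPooling (input-side gradient ← MaxPooling ← output-side gradient)

For a filter of size $M \times N$:

$y_{ij} = \max_{k \in [0, M),\ l \in [0, N)} x_{(Mi+k)(Nj+l)}$

$\frac{\partial L}{\partial x_{(Mi+k)(Nj+l)}} = \begin{cases} \frac{\partial L}{\partial y_{ij}} & x_{(Mi+k)(Nj+l)} = y_{ij} \\ 0 & x_{(Mi+k)(Nj+l)} \neq y_{ij} \end{cases}$

The gradient for the input that supplied the maximum is set to the output-side gradient $\partial L / \partial y_{ij}$; every other input in the window gets 0. This yields the input-side gradient $\partial L / \partial x_{ij}$.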

Slide 102

Slide 102 text

Backpropagation through MaxPooling

$\frac{\partial L}{\partial x_{(Mi+k)(Nj+l)}} = \begin{cases} \frac{\partial L}{\partial y_{ij}} & x_{(Mi+k)(Nj+l)} = y_{ij} \\ 0 & x_{(Mi+k)(Nj+l)} \neq y_{ij} \end{cases}$

void main() {
  const uint relative_output_index = gl_GlobalInvocationID.x;
  const uint output_x = relative_output_index % output_width;
  const uint output_y = relative_output_index / output_width % output_height;
  const uint channel = relative_output_index / output_width / output_height;
  const uint input_width = ( output_width - 1 ) * filter_xstride + filter_width;
  const uint input_height = ( output_height - 1 ) * filter_ystride + filter_height;
  const uint data_index = gl_GlobalInvocationID.z;
  const uint output_size = output_width * output_height * channels;
  const uint input_size = input_width * input_height * channels;
  const uint output_index = relative_output_index + data_index * output_size;
  const uint initial_input_x = output_x * filter_xstride;
  const uint initial_input_y = output_y * filter_ystride;
  for( uint x = 0; x != filter_width; ++x ) {
    // The vertical loop must run over filter_height (the slide had filter_width here).
    for( uint y = 0; y != filter_height; ++y ) {
      const uint input_x = x + output_x * filter_xstride;
      const uint input_y = y + output_y * filter_ystride;
      const uint input_index = input_x + input_y * input_width + channel * input_width * input_height + data_index * input_width * input_height * channels;
      // Pass the gradient only to the input that equals the pooled maximum.
      if( relative_output_index < output_size )
        input_grad[ input_index ] = ( input_data[ input_index ] == output_data[ output_index ] ) ? output_grad[ output_index ] : 0.0;
    }
  }
}
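The routing rule of the backward pass can be sketched on the CPU as follows. This is a simplified single-channel reference with a hypothetical name (`max_pool_backward`), and it assumes the pooling windows do not overlap (stride equal to the filter size), since each input gradient is written rather than accumulated, as in the shader. If several inputs tie for the maximum, all of them receive the gradient, which also matches the equality test in the shader.

```python
def max_pool_backward(x, y, output_grad, filter_h, filter_w, stride_y, stride_x):
    """Route each output-side gradient to the input(s) that equal the pooled maximum."""
    input_grad = [[0.0] * len(x[0]) for _ in x]
    for oy, row in enumerate(y):
        for ox, y_val in enumerate(row):
            for fy in range(filter_h):
                for fx in range(filter_w):
                    iy = oy * stride_y + fy
                    ix = ox * stride_x + fx
                    if x[iy][ix] == y_val:
                        input_grad[iy][ix] = output_grad[oy][ox]
    return input_grad


# The 3x3 example from the MaxPooling slides: the maximum 9 sits at row 2, column 1.
x = [[3, 6, 1],
     [2, 2, 0],
     [5, 9, 6]]
y = [[9]]
dx = max_pool_backward(x, y, [[0.5]], 3, 3, 3, 3)
print(dx)  # [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.5, 0.0]]
```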

Slide 103

Slide 103 text

[Network diagram: convolution, ϕ, convolution, ϕ, MaxPooling (NEW!), product, ϕ, product, ϕ, loss function L]

Slide 104

Slide 104 text

Classification accuracy on the evaluation data: around 90% (a slight improvement)
Batch size: 64
Hidden layer width: 128
Convolution layer channels: 14:14

Slide 105

Slide 105 text

[Network diagram: two blocks of convolution, ϕ, convolution, ϕ, MaxPooling (one block is NEW!), followed by product, ϕ, product, ϕ, loss function L]

Slide 106

Slide 106 text

This is terrible
Batch size: 64
Hidden layer width: 128
Convolution layer channels: 32:32:64:64

Slide 107

Slide 107 text

[Network diagram: two blocks of convolution, ϕ, convolution, ϕ, MaxPooling, followed by product, ϕ, product, ϕ, loss function L]

ϕ is ReLU everywhere except the final layer. Only the final ϕ is the hyperbolic tangent.

Slide 108

Slide 108 text

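[Network diagram: two blocks of convolution, ϕ, convolution, ϕ, MaxPooling, followed by product, ϕ, product, ϕ, loss function L]

ReLU can take arbitrarily large values in the positive direction.
When training is unstable, a huge value gets fed into tanh.

Value arriving at tanh: 14326.7
tanh: "Please keep it within $-1 \le x \le 1$"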

Slide 109

Slide 109 text

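[Network diagram: two blocks of convolution, ϕ, convolution, ϕ, MaxPooling, followed by product, ϕ, product, ϕ, loss function L]

Value arriving at tanh: 14326.7
tanh: "14326.7 is so far off that changing it a little is still no good, so the gradient is 0."
Earlier layers: "What are we supposed to do? We have no idea."

$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial y_i} \left( 1 - \tanh^2 ( x_i ) \right)$

The gradient vanishes and the network can no longer learn.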
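The vanishing gradient can be reproduced with a few lines of arithmetic. The sketch below evaluates the tanh backward formula from the slide, $\partial L / \partial x_i = \partial L / \partial y_i \, (1 - \tanh^2(x_i))$, at a moderate input and at 14326.7; the function name `tanh_backward` is hypothetical.

```python
import math


def tanh_backward(x, output_grad):
    """Input-side gradient of y = tanh(x), given the output-side gradient."""
    return output_grad * (1.0 - math.tanh(x) ** 2)


small = tanh_backward(0.5, 1.0)      # a healthy, non-zero gradient
huge = tanh_backward(14326.7, 1.0)   # tanh saturates at 1.0
print(small, huge)
```

For the huge input, `math.tanh` returns exactly 1.0 in double precision, so the factor $1 - \tanh^2(x)$ is exactly 0 and nothing is propagated to the earlier layers.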

Slide 110

Slide 110 text

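Reducing the learning rate of the convolution layers to $\frac{1}{10}$ improved the accuracy.
However, training is slow.
Batch size: 64
Hidden layer width: 128
Convolution layer channels: 32:32:64:64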
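A per-layer learning rate can be sketched as below. This is only an illustration: it assumes a plain gradient-descent update, and the layer names, the base rate of 0.1, and the function `sgd_step` are all hypothetical. The slide states only that the convolution layers used one tenth of the learning rate.

```python
def sgd_step(params, grads, learning_rates):
    """One gradient-descent update with a separate learning rate per layer."""
    return {
        name: [w - learning_rates[name] * g
               for w, g in zip(params[name], grads[name])]
        for name in params
    }


base_rate = 0.1  # hypothetical base learning rate
learning_rates = {"conv": base_rate / 10, "dense": base_rate}
params = {"conv": [1.0], "dense": [1.0]}
grads = {"conv": [2.0], "dense": [2.0]}
updated = sgd_step(params, grads, learning_rates)
```

With the same gradient, the convolution weight moves ten times less than the dense weight. This damps the unstable updates, and it is also why training takes longer.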

Slide 111

Slide 111 text

Techniques that keep the distribution of each layer's output constant

Batch Normalization
Layer Normalization
Group Normalization

Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 2015. https://arxiv.org/abs/1502.03167
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv:1607.06450. 2016. https://arxiv.org/abs/1607.06450
Yuxin Wu and Kaiming He. Group Normalization. 2018. https://arxiv.org/abs/1803.08494
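As an illustration of the first technique, the training-time forward pass of Batch Normalization can be sketched as follows. This is a minimal sketch assuming a batch stored as a list of feature rows; the name `batch_norm_forward` is hypothetical, and the running statistics used at inference time are omitted. Each feature is shifted and scaled to zero mean and unit variance over the batch, then rescaled by the learned parameters gamma and beta.

```python
import math


def batch_norm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta."""
    n, features = len(batch), len(batch[0])
    out = [[0.0] * features for _ in range(n)]
    for j in range(features):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        inv_std = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][j] = gamma[j] * (column[i] - mean) * inv_std + beta[j]
    return out


# Two features on very different scales end up in the same range.
batch = [[1.0, 100.0],
         [3.0, 300.0]]
normalized = batch_norm_forward(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

After normalization a following tanh no longer sees inputs in the thousands, however large the preceding ReLU outputs were.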

Slide 112

Slide 112 text

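Bonus: TensorCore

Horizontal matrix multiply-add: $AB + C$

Using 32 threads, a single instruction multiplies the $16 \times 16$-element matrix $A$ by the $8 \times 16$-element matrix $B$ and adds the $16 \times 8$-element matrix $C$.

On NVIDIA GPUs that support the VK_NV_cooperative_matrix extension, this is also available from Vulkan.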
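What the instruction computes can be written out as an ordinary matrix multiply-add. The sketch below is a plain CPU reference, not a TensorCore or Vulkan API example, and the name `matrix_multiply_add` is hypothetical. It assumes that $B$ has 16 rows and 8 columns so that $AB$ is defined and matches the shape of $C$; the slide lists $B$ as $8 \times 16$, which may be a width-by-height reading.

```python
def matrix_multiply_add(a, b, c):
    """Return A B + C for matrices stored as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "columns of A must match rows of B"
    assert len(c) == rows and all(len(row) == cols for row in c), "C must match the shape of AB"
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) + c[i][j]
             for j in range(cols)]
            for i in range(rows)]


# A: 16x16 identity, B: 16 rows x 8 columns, C: 16 rows x 8 columns of ones.
a = [[1.0 if i == k else 0.0 for k in range(16)] for i in range(16)]
b = [[float(k * 8 + j) for j in range(8)] for k in range(16)]
c = [[1.0] * 8 for _ in range(16)]
d = matrix_multiply_add(a, b, c)
```

Because $A$ is the identity here, every element of the result is the matching element of $B$ plus 1, which makes the example easy to check by hand.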

Slide 113

Slide 113 text

Why didn't you use TensorCore?
The GeForce GTX 1070 that happened to be lying around at home had no TensorCore.

Slide 114

Slide 114 text

Summary
Deep learning is an algorithm, so once you implement it, it runs anywhere.