Deep Learning in Go, or Shennanigans With Matrices

@chewxy on Twi-er Deep Learning in Go Or,
shennanigans By Xuanyi Chew, GolangSyd Oct2015

@chewxy on Twi-er Why Go? •  Knee jerk
reacDon. •  No good reason. •  You probably should use Theano.

@chewxy on Twi-er Deep Neural Network Picture thanks
to Michael Nielsen

@chewxy on Twi-er Modern Deep Nets (LSTM)

@chewxy on Twi-er A Neurone Picture thanks to
Michael Nielsen f(x, w) w1 w2 w3 output = f(Σwi xi )

@chewxy on Twi-er Many Neurones Picture thanks to
Michael Nielsen

@chewxy on Twi-er Representing Neural Networks w1
w2 w3 Input Output (number of neurones)

@chewxy on Twi-er Deep Neural Network Picture thanks
to Michael Nielsen 1 x n n x m m x o o x 4

@chewxy on Twi-er Implementing a NN in Go
type NeuralNetwork struct { DeepConfig hiddenLayers []*NeuroneLayer logitLayer *LogitLayer // or SoftmaxLayer }

type NeuroneLayer struct { *Neurone LayerConfig activationFunction ActivationFunction } type ActivationFunction func(float64) float64

type Neurone struct { // weights[i][j] -‐-‐ i is the output index, j is the input index Weights [][]float64 Biases []float64 }

func (nl *NeuroneLayer) Activate(x []float64) []float64 { retVal := make([]float64, nl.Outputs) for i, row := range nl.Weights { var output float64 for j, weight := range row { output += weight * x[j] } output += nl.Biases[i] retVal[i] = nl.activationFunction(output) } return retVal }

•  Other bits and bobs: •  Back propagate errors •  Learning method (SGD, AdaGrad, RMSprop, L-‐BFGS…) •  Errors (Cross Entropy, meansquare errors…) •  Unique layers (ConvoluDon, RBM, etc …) •  Mix and Match bits •  All the above is just simple manipulaDon of matrix code

@chewxy on Twi-er The Problem •  Not fast
enough •  Use too much memory

@chewxy on Twi-er Too Much Memory •  Some
matrices are sparse •  Too much space wasted storing 0.0f

@chewxy on Twi-er Real Life Logs 2015/10/21 09:59:14
Epoch 58. Correct/Total: 214577/229672 = 0.93 2015/10/21 09:59:14 MatSize: 6370341/151970075 2015/10/21 09:59:14 Mat load factor: 0.04 2015/10/21 09:59:14 Allocated : 1905.51 MB 2015/10/21 09:59:14 Total Allocated : 244659.96 MB 2015/10/21 09:59:14 Heap Allocted : 1905.51 MB 2015/10/21 09:59:14 Sys Total Allocated: 2335.47 MB 2015/10/21 09:59:14 -‐-‐-‐-‐-‐-‐-‐-‐-‐-‐

@chewxy on Twi-er Solution(s) •  Implement Sparse Matrix
•  map[coordinate]float64 •  type sparseMat64 struct { d []float64 r []uint32 c []uint32 } •  type sparseMat64 []*cell •  type sparseMat64 [][MAXCOL]float64 type coordinate struct { r, c int } type cell struct { col int, val float64 }

@chewxy on Twi-er Solution(s) •  Implement Sparse Matrix
•  map[coordinate]float64 •  type sparseMat64 struct { d []float64 r []uint32 c []uint32 } •  type sparseMat64 []*cell •  type sparseMat64 [][MAXCOL]float64 type coordinate struct { r, c int } type cell struct { col int, val float64 } Most flexible Most useful for Hadamard Fastest, good for small MAXCOL Fastest to build

@chewxy on Twi-er Dense Matrix •  "import github.com/gonum/matrix"
•  [][]float64 •  [][MAXCOL]float64 •  [MAXCOL][]float64 •  type mat64 struct { d []float64 stride int }

@chewxy on Twi-er Other Memory Stuff •  Transpose
at will or store both Column-‐Major and Row-‐Major matrix? •  Layer by layer training (save each layer to disk)

@chewxy on Twi-er Not Fast Enough •  Let's
look at a real life example

@chewxy on Twi-er Embarrassingly Parallel threads := runtime.GOMAXPROCS(-‐1)
chunkSize := len(mat) / threads costChan := make(chan *cost) for chunk := 0; chunk < threads; chunk++ { var start int = chunk * chunkSize var end int if start+chunkSize > len(mat) { end = len(mat) } else { end = start + chunkSize } go nn.costFn(mat[start:end], chunkSize, costChan) }

@chewxy on Twi-er Problem •  I only have
8 CPUs

@chewxy on Twi-er ProJiling… Showing top 10 nodes
out of 61 (cum >= 44.95s) flat flat% sum% cum cum% 254.90s 67.81% 67.81% 337.61s 89.81% github.com/chewxy/goddamnproject/dp. (*neuralnetwork).costFn 11.39s 3.03% 70.84% 11.39s 3.03% sync/atomic.CompareAndSwapUint32 8s 2.13% 72.97% 29.19s 7.77% sync.(*Mutex).Lock 7.51s 2.00% 74.96% 7.51s 2.00% runtime.procyield 7.48s 1.99% 76.95% 7.69s 2.05% github.com/chewxy/goddamnproject/dp. (*neuralnetwork).computeScores 6.92s 1.84% 78.80% 8.04s 2.14% runtime.mapaccess2_fast64 6.25s 1.66% 80.46% 13.54s 3.60% math.Pow 5.17s 1.38% 81.83% 5.17s 1.38% sync/atomic.AddUint32 4.08s 1.09% 82.92% 4.08s 1.09% runtime.heapBitsForObject 3.40s 0.9% 83.82% 44.95s 11.96% math/rand.(*lockedSource).Int63 (pprof) weblist costFn

@chewxy on Twi-er weblist cosFn 225 .
. // This loop took a helluva long time 226 4.93s 4.93s for _, a := range actives { 227 29.47s 29.47s layer2Grad[i][a] += delta * hiddenAct[a] 228 1.02mins 1.02mins hiddenActGrad[a] += delta * nn.weights2[i][a] 229 . . }

@chewxy on Twi-er AVX to the Rescue! • 
Intel® brings extra extra parallelism to SIMD instrucDons with AVX©®™

@chewxy on Twi-er AVX/SIMD in a nutshell 256
bit register float64 float64 256 bit register float64 float64 +

@chewxy on Twi-er AVX/SIMD in a nutshell 256
bit register float64 float64 256 bit register float64 float64 + Single Instruction Multiple Data VADDPD

@chewxy on Twi-er AVX in Go /*
#cgo CFLAGS: -‐mavx -‐std=c99 #include <stdio.h> #include <stdlib.h> #include <immintrin.h>//AVX: -‐mavx void avx_add64(const size_t n, double *x, double *y, double *z) { static const size_t elements = 4; // with double precision, the processor can only handle 4 at a time. Maybe switch to float32? const size_t end = n / elements; // convert to AVX specific types __m256d *vz = (__m256d *)z; __m256d *vx = (__m256d *)x; __m256d *vy = (__m256d *)y; for (size_t i=0; i < end; ++i ){ vz[i] = _mm256_add_pd(vx[i], vy[i]); } }

@chewxy on Twi-er AVX in Go // multiplies
a scalar by a vector void avx_scale64(const double *s, const size_t n, double *x, double *z) { static const size_t elements = 4; // with double precision, the processor can only handle 4 at a time. Maybe switch to float32? const size_t end = n / elements; const __m256d vy = _mm256_broadcast_sd(s); // broadcast a scalar into vy // load x and z into AVX dedicated registers __m256d *vz = (__m256d *)z; __m256d *vx = (__m256d *)x; for (size_t i=0; i < end; ++i) { vz[i] = _mm256_mul_pd(vx[i], vy); } } */

@chewxy on Twi-er AVX in Go func malloc64(length
int) []float64 { // pad it so that at most there will be 64 items extra -‐ worse comes to worst, there are a bunch of 0s to add or mul size := length * 64 ptr := C._mm_malloc((C.size_t)(size), 64) header := reflect.SliceHeader{ Data: uintptr(unsafe.Pointer(ptr)), Len: length, Cap: length, } goSlice := *(*[]float64)(unsafe.Pointer(&header)) return goSlice } // somewhere later defer C._mm_free(unsafe.Pointer(&vec[0]))

@chewxy on Twi-er The Bits To Be AVX'd
for i := range rows { // calculate delta omitted // This loop took a helluva long time for _, a := range actives { layer2Grad[i][a] += delta * hiddenAct[a] hiddenActGrad[a] += delta * nn.weights2[i][a] } }

@chewxy on Twi-er After AVX for i :=
range rows { // calculate delta omitted scale64(delta, hiddenAct) ith2LWeights := make([]float64, len(nn.weights2[i])) copy(nn.weights2[i][0:], ith2LWeights[0:]) scale64(delta, ith2LWeights) dropout(ith2LWeights, inactives) add64(layer2Grad[i], hiddenAct) add64(hiddenActGrad, ith2LWeights) }

@chewxy on Twi-er Uh Oh Showing top 10
nodes out of 92 (cum >= 13.03mins) flat flat% sum% cum cum% 4.93mins 11.45% 11.45% 4.93mins 11.45% runtime._ExternalCode 4.58mins 10.62% 22.07% 10.03mins 23.28% runtime.cgocall 3.26mins 7.56% 29.62% 3.26mins 7.56% runtime.heapBitsForObject 3.24mins 7.53% 37.15% 26.96mins 62.59% github.com/chewxy/ goddamnproject/dp.(*neuralnetwork).costFn 2.75mins 6.38% 43.53% 3.45mins 8.01% runtime.greyobject 2.20mins 5.11% 48.64% 2.20mins 5.11% runtime.memmove 1.74mins 4.03% 52.67% 8.25mins 19.15% runtime.scanobject 1.71mins 3.96% 56.63% 1.71mins 3.96% runtime.memclr 1.45mins 3.38% 60.00% 3.13mins 7.26% runtime.deferreturn 1.34mins 3.12% 63.12% 13.03mins 30.24% runtime.systemstack (pprof) weblist costFn

@chewxy on Twi-er Overhead for CGo •  Is
way too high 217 1.99s 4.92mins scale64(delta, hiddenAct) ... 221 3.74s 5.05mins scale64(delta, ith2LWeights) ... 228 3.84s 6.48mins add64(layer2Grad[i], hiddenAct) 229 1.52s 6.07mins add64(hiddenActGrad, ith2LWeights)

@chewxy on Twi-er Solution(s) •  Batch Cgo calls
•  Handwrite ASM, match APIs

@chewxy on Twi-er Solution: Don't Use Cgo
•  Batch Cgo calls •  Handwrite ASM! Too much rearchitecting

@chewxy on Twi-er Problem •  Go's Assembler doesn't
support AVX instrucDons. •  Stuck with SSE2 •  golang.org/src/cmd/internal/obj/x86/a.out.go lists all the instrucDons and registers supported

@chewxy on Twi-er // func scale64(s float64, a []float64)
TEXT ·∙scale64(SB), 7, $0 MOVSD s+0(FP), X0 MOVQ x_data+8(FP), SI // move the first element to the indexing register MOVQ x_len+24(FP), BP // move the length into BP // check that there is more than 4 values in the slice SUBQ $4, BP JL remainders // jump to remainder // manually broadcast s into X0's lower bits, which is used for SIMD MOVLHPS X0, X0 loop: // load the first two elements, and then multiply by s (which is in X0) MOVUPD (SI), X2 MULPD X0, X2 // load second two elements, then multiply by s MOVUPD 16(SI), X4 MULPD X0, X4 // move the values back to the original positions MOVUPD X2, (SI) MOVUPD X4, 16(SI) // now that we are done with the first 4 elements, update the pointers to the top of the next 4 elements ADDQ $32, SI SUBQ $4, BP // the len is now less 4 JGE loop remainders: ADDQ $4, BP // we subtracted earlier, remember??!!! JEQ done // if there are nothing left to process, we'll go to done remainingloop: // load element from x, and then multiply by s (which in X0) // TODO: check if there is a performance penalty on using SSE registers MOVSD (SI), X2 MULSD X0, X2 // save it back to the array MOVSD X2, (SI) // update the pointer ADDQ $8, SI DECQ BP JNE remainingloop done: RET

@chewxy on Twi-er Pprof Says… 217 1.64s
12.82s scale64(delta, hiddenAct) ... 221 7.88s 7.88s scale64(delta, ith2LWeights) ... 228 . . add64(layer2Grad[i], hiddenAct) 229 . . add64(hiddenActGrad, ith2LWeights)

@chewxy on Twi-er Interesting Note go test -‐bench
. testing: warning: no tests to run PASS BenchmarkGoScale64-‐8 1000000 1339 ns/op BenchmarkScale64-‐8 5000000 329 ns/op BenchmarkGoAdd64-‐8 1000000 1711 ns/op BenchmarkAdd64-‐8 1000000 1473 ns/op ok github.com/chewxy/testideas3 6.567s func GoAdd64(a, b []float64) { for i, v := range a{ a[i] = v + b[i] } } Not much difference

@chewxy on Twi-er Other More Realistic Solutions • 
Be-er algorithms •  PrecalculaDng acDvaDons and caching§ •  Binary weights (no mulDplicaDon, only addiDon)* *very very very very very new. I'm sDll playing with it

@chewxy on Twi-er GoNum •  Failed build (a
few months ago when I started)… haven't checked since.

@chewxy on Twi-er Conclusion •  Building a neural
network in Go is fairly simple. •  Building a performant neural network in Go? Only embark with quesDonable sanity •  Irony: deploying Theano is hard. But all my opDmizaDons make deploying this equally hard. •  A default built in, well opDmized matrix type would be really really really sweet

Deep Learning in Go, or Shennanigans With Matrices

Deep Learning in Go, or Shennanigans With Matrices

More Decks by Xuanyi

Other Decks in Programming

Featured

Transcript