Slide 1

Slide 1 text

How I made a pure-Ruby word2vec program more than 3x faster RubyConf Taiwan 2016 @remore

Slide 2

Slide 2 text

Who Am I: Kei Sawada (@remore). A Rubyist from Tokyo. A weekend contrabassist. Engineering Manager at Recruit Holdings Co., Ltd. and VP of Engineering at NIJIBOX Co., Ltd.

Slide 3

Slide 3 text

Me And Taiwan: A Taiwanese coworker working at NIJIBOX. Many #rubyfriends in Taiwan: Eddie, Ryudo, Chao, Yu-Cheng, lulalala, Lin Yu Hsiang and many others. Super glad to be here today!

Slide 4

Slide 4 text

This Talk Is Mainly For The Rubyist who is interested in Ruby's performance, micro-benchmarking results, YARV, ISeq and profiling tools⏱, and who may be interested in RPC (IPC) with Python and Julia from Ruby.

Slide 5

Slide 5 text

Table Of Contents A Reality 3x Challenge Why Slow

Slide 6

Slide 6 text

Chapter 1 A Reality A reality of Ruby’s performance for large-scale computation

Slide 7

Slide 7 text

"x=2.5; 1.upto(N){|i| x=x+i}; p x"

Slide 8

Slide 8 text

> echo "x=2.5; 1.upto(10){|i| x=x+i}; p x" | time ruby 57.5 0.13 real 0.06 user 0.05 sys 0sec 0.05sec 0.1sec 0.15sec 0.2sec 10 Ruby

Slide 9

Slide 9 text

> echo "x=2.5; 1.upto(100){|i| x=x+i}; p x" | time ruby 5052.5 0.11 real 0.06 user 0.04 sys 0.1sec 0.108sec 0.115sec 0.123sec 0.13sec 10 100 Ruby

Slide 10

Slide 10 text

> echo "x=2.5; 1.upto(1000){|i| x=x+i}; p x" | time ruby 500502.5 0.15 real 0.07 user 0.05 sys 0sec 0.038sec 0.075sec 0.113sec 0.15sec 10 100 1000 Ruby

Slide 11

Slide 11 text

> echo "x=2.5; 1.upto(10000){|i| x=x+i}; p x" | time ruby 50005002.5 0.11 real 0.06 user 0.04 sys 0sec 0.038sec 0.075sec 0.113sec 0.15sec 10 100 1000 10000 Ruby

Slide 12

Slide 12 text

> echo "x=2.5; 1.upto(1e5){|i| x=x+i}; p x" | time ruby 5000050002.5 0.14 real 0.08 user 0.05 sys 0sec 0.038sec 0.075sec 0.113sec 0.15sec 10 100 1000 10000 1e5 Ruby

Slide 13

Slide 13 text

Looks good enough for a smaller number of loops!

Slide 14

Slide 14 text

> echo "x=2.5; 1.upto(1e6){|i| x=x+i}; p x" | time ruby 500000500002.5 0.25 real 0.20 user 0.04 sys 0sec 0.065sec 0.13sec 0.195sec 0.26sec 10 100 1000 10000 1e5 1e6 Ruby

Slide 15

Slide 15 text

> echo "x=2.5; 1.upto(1e7){|i| x=x+i}; p x" | time ruby 50000005000002.5 1.58 real 1.52 user 0.05 sys 0sec 0.4sec 0.8sec 1.2sec 1.6sec 10 100 1000 10000 1e5 1e6 1e7 Ruby

Slide 16

Slide 16 text

> echo "x=2.5; 1.upto(1e8){|i| x=x+i}; p x" | time ruby 5.000000050000003e+15 14.56 real 14.37 user 0.09 sys 0sec 4sec 8sec 12sec 16sec 10 100 1000 10000 1e5 1e6 1e7 1e8 Ruby

Slide 17

Slide 17 text

> echo "x=2.5; 1.upto(1e9){|i| x=x+i}; p x" | time ruby 5.00000000067109e+17 157.27 real 150.16 user 1.30 sys 0sec 40sec 80sec 120sec 160sec 10 100 1000 10000 1e5 1e6 1e7 1e8 1e9 Ruby

Slide 18

Slide 18 text

Damn, apparently Ruby is Slow ⌛ for a huge number of loops

Slide 19

Slide 19 text

How About Python?
> PY=$(cat << EOS
"n=2.5
for i in range(1,int(\$N)+1): n=i+n
print(n)"
EOS
)
> N=1e3 && eval echo "$PY" | time python
500502.5
0.10 real 0.01 user 0.01 sys

Slide 20

Slide 20 text

> N=1e5 && eval echo "$PY" | time python
5000050002.5
0.13 real 0.03 user 0.01 sys
[Chart: elapsed time (0-0.15sec), N up to 1e5: Ruby, Python]

Slide 21

Slide 21 text

Both Ruby and Python are good enough for a smaller number of loops!

Slide 22

Slide 22 text

> N=1e6 && eval echo "$PY" | time python
5.00000500002e+11
0.38 real 0.23 user 0.02 sys
[Chart: elapsed time (0-0.4sec), N up to 1e6: Ruby, Python]

Slide 23

Slide 23 text

> N=1e7 && eval echo "$PY" | time python
5.0000005e+13
2.66 real 2.35 user 0.17 sys
[Chart: elapsed time (0-3sec), N up to 1e7: Ruby, Python]

Slide 24

Slide 24 text

> N=1e8 && eval echo "$PY" | time python
5.00000005e+15
48.27 real 25.87 user 10.67 sys
[Chart: elapsed time (0-50sec), N up to 1e8: Ruby, Python]

Slide 25

Slide 25 text

> N=1e9 && eval echo "$PY" | time python
5.00000005e+15
48.27 real 25.87 user 10.67 sys
[Chart: elapsed time (0-600sec), N up to 1e9: Ruby, Python]

Slide 26

Slide 26 text

> N=1e9 && eval echo "$PY" | time python
5.00000005e+15
48.27 real 25.87 user 10.67 sys
[Chart: elapsed time (0-600sec), N up to 1e9: Ruby, Python]
Attention Please: BTW, note that this micro-benchmark was run on my MacBook Pro (2015) with Ruby 2.3.0 and Python 2.7. In my environment Python looks pretty slow, but this is by no means a fair judgement. Please do not take these measurements seriously; just use them to get a feel for the order of magnitude of each programming environment's speed!

Slide 27

Slide 27 text

Sadly BOTH Ruby and Python are Slow⌛ for a huge number of loops

Slide 28

Slide 28 text

What About C?
> SRC=$(cat << EOS
"#include \"stdio.h\"
int main(){
  double n=2.5;
  for(int i=1;i<=\$N;i++){
    n=i+n;
  }
  printf(\"%lf\", n);
}"
EOS
)

Slide 29

Slide 29 text

> N=1e5 && eval echo "$SRC" > main.c; gcc main.c; time ./a.out
5000050002.500000
real 0m0.006s user 0m0.001s sys 0m0.002s
[Chart: elapsed time (0-0.16sec), N up to 1e5: Ruby, Python, C]

Slide 30

Slide 30 text

C is …… Fast!

Slide 31

Slide 31 text

> N=1e6 && eval echo "$SRC" > main.c; gcc main.c; time ./a.out
500000500002.500000
real 0m0.009s user 0m0.004s sys 0m0.002s
[Chart: elapsed time (0-0.4sec), N up to 1e6: Ruby, Python, C]

Slide 32

Slide 32 text

> N=1e7 && eval echo "$SRC" > main.c; gcc main.c; time ./a.out
50000005000002.500000
real 0m0.033s user 0m0.029s sys 0m0.002s
[Chart: elapsed time (0-3sec), N up to 1e7: Ruby, Python, C]

Slide 33

Slide 33 text

> N=1e8 && eval echo "$SRC" > main.c; gcc main.c; time ./a.out
5000000050000003.000000
real 0m0.287s user 0m0.281s sys 0m0.003s
[Chart: elapsed time (0-50sec), N up to 1e8: Ruby, Python, C]

Slide 34

Slide 34 text

> N=1e9 && eval echo "$SRC" > main.c; gcc main.c; time ./a.out
500000000067108992.000000
real 0m2.815s user 0m2.799s sys 0m0.008s
[Chart: elapsed time (0-600sec), N up to 1e9: Ruby, Python, C]

Slide 35

Slide 35 text

C is ridiculously Fast

Slide 36

Slide 36 text

Introducing Julia. Julia is: a dynamic programming language, 4 years old (open-sourced in 2012), designed for scientific computing, and Fast.

Slide 37

Slide 37 text

How About Julia?
> JL=$(cat << EOS
"function sample_loop(n)
  for i in 1:\$N
    n = i+n
  end
  n
end
println(sample_loop(2.5))"
EOS
)

Slide 38

Slide 38 text

> N=1e5 && eval echo "$JL" | time julia
sample_loop (generic function with 1 method)
5.0000500025e9
0.91 real 0.48 user 0.14 sys
[Chart: elapsed time (0-0.5sec), N up to 1e5: Ruby, Python, C, Julia]

Slide 39

Slide 39 text

Julia is the slowest(⁉) for a smaller number of loops

Slide 40

Slide 40 text

However

Slide 41

Slide 41 text

> N=1e6 && eval echo "$JL" | time julia
sample_loop (generic function with 1 method)
5.000005000025e11
0.45 real 0.44 user 0.08 sys
[Chart: elapsed time (0-0.5sec), N up to 1e6: Ruby, Python, C, Julia]

Slide 42

Slide 42 text

> N=1e7 && eval echo "$JL" | time julia
sample_loop (generic function with 1 method)
5.00000050000025e13
0.50 real 0.47 user 0.09 sys
[Chart: elapsed time (0-3sec), N up to 1e7: Ruby, Python, C, Julia]

Slide 43

Slide 43 text

> N=1e8 && eval echo "$JL" | time julia
sample_loop (generic function with 1 method)
5.000000050000003e15
1.82 real 0.76 user 0.09 sys
[Chart: elapsed time (0-50sec), N up to 1e8: Ruby, Python, C, Julia]

Slide 44

Slide 44 text

> N=1e9 && eval echo "$JL" | time julia
sample_loop (generic function with 1 method)
5.00000000067109e17
1.71 real 1.70 user 0.08 sys
[Chart: elapsed time (0-600sec), N up to 1e9: Ruby, Python, C, Julia]

Slide 45

Slide 45 text

Julia is as fast as C for a bigger number of loops

Slide 46

Slide 46 text

Findings: Ruby works reasonably fast for a smaller number of loops, but for a huge number of loops it is advisable to consider switching languages. The primary option would be using C (sketched below). Julia is also a dynamic language, but it can be FAST.
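
To make the "using C" option concrete: one lightweight way to push only the hot loop into C, without writing a full extension gem, is the standard library's Fiddle (a rough sketch, not from the talk; the file paths are hypothetical and a C compiler is assumed to be available):

require 'fiddle'

# The hot loop implemented in C, compiled to a shared library on the fly.
File.write('/tmp/sum_loop.c', <<~C)
  double sum_loop(double x, long n) {
    for (long i = 1; i <= n; i++) x += i;
    return x;
  }
C
system('cc -O2 -shared -fPIC -o /tmp/sum_loop.so /tmp/sum_loop.c') or abort 'compile failed'

lib = Fiddle.dlopen('/tmp/sum_loop.so')
sum_loop = Fiddle::Function.new(lib['sum_loop'],
                                [Fiddle::TYPE_DOUBLE, Fiddle::TYPE_LONG],
                                Fiddle::TYPE_DOUBLE)
p sum_loop.call(2.5, 100_000_000)   # the N=1e8 case that took ~14s in pure Ruby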

Slide 47

Slide 47 text

Chapter 2 3x Challenge An experiment to work around this performance issue using the Julia programming language along with a ruby2julia transpiler

Slide 48

Slide 48 text

Given That Performance Issue, Which Option Is The Best Workaround? Give up and use another language anyway? Make Ruby itself faster? Make a gem to boost my Ruby program?

Slide 49

Slide 49 text

Idea: Transpiler What if we could run arbitrary Ruby code on a Julia process? It might look something like `some_ruby_code.to_other_lang`
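
To make the idea concrete, here is a toy, hypothetical illustration (not Julializer): a String#to_other_lang that handles nothing but the range literal shown on the next slide:

class String
  def to_other_lang
    # Ruby's 1..N range literal becomes Julia's 1:N range syntax
    gsub(/(\w+)\.\.(\w+)/, '\1:\2')
  end
end

p "for i in 1..N".to_other_lang   # => "for i in 1:N"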

Slide 50

Slide 50 text

Is it really possible to convert Ruby code to Julia code?

Slide 51

Slide 51 text

Sometimes it's hopeful
# ruby (the range operator creates a Range object)
for i in 1..N
  n = i+n
end
# julia
for i in 1:N
  n = i+n
end

Slide 52

Slide 52 text

But Sometimes it's NOT
# ruby
class Sample
  def context
    return self
  end
end
p Sample.new.instance_eval{context}
# julia
# too many gaps to be filled, such as OOP things, methods for reflection, etc...

Slide 53

Slide 53 text

I tried it anyway, although I wasn't sure whether it was promising

Slide 54

Slide 54 text

A ruby2julia Transpiler Implementation: Julializer github.com/remore/julializer Very limited syntax is supported as of v0.1.2: TrueClass, FalseClass, Fixnum, Float, Integer, Numeric and Random; Array, Range and Hash are also partially supported (only very few methods as of now). TBH it still needs huge improvements, including developing an error-checking tool and writing documentation.
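
The same conversion can also be driven from Ruby code (a minimal sketch; Julializer.ruby2julia is the same call used on the word2vec slide below):

require 'julializer'
puts Julializer.ruby2julia("-1.6.to_i")
# => trunc(Int64,parse(string((-1.6))));   -- the same output as the CLI example on the next slide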

Slide 55

Slide 55 text

Examples
$ echo "-1.6.to_i" | julializer
trunc(Int64,parse(string((-1.6))));
$ cat sample.rb
for i in 0..list.size-1
  list[i] = (i-list.size/2).abs
end
$ julializer sample.rb
for i::Int64 = 0:size(list)[1]-1;list[i+1]=abs((i-size(list)[1]/2));;end;;

Slide 56

Slide 56 text

$ ruby -r julializer -e "p Julializer.ruby2julia(File.read('calc.rb'))" "const max_exp=6;;const exp_table_size=1000;;const max_sentence_length=1000;;function init_unigram_table(table_size, vocab);train_words_pow=0.0;;power=0.75;;table=fill(0, table_size);;for a::Int64 = 0:size(vocab)[1]-1;train_words_pow+=vocab[a+1] [0+1]^power;;end;;i=0;;d1=(vocab[i+1][0+1]^power)/train_words_pow;;for a::Int64 = 0:table_size-1;table[a+1]=i;if a/float(table_size)>d1;i+=1;;d1+=(vocab[i+1] [0+1]^power)/train_words_pow;;;end;if i>=size(vocab)[1];i=size(vocab)[1]-1;;end;;end;;return table;;;end;;function addop(size, list, base, target);for i::Int64 = 0:size-1;list[i+base+1]+=target[i+1];;end;;list;;end;;function addop2(size, list, base, coefficient, target, base2);for i::Int64 = 0:size-1;list[i+base +1]+=coefficient*target[i+base2+1];;end;;list;;end;;function addop3(size, f, coefficient, target, base);for i::Int64 = 0:size-1;f+=coefficient[i+1]*target[i+base +1];;end;;f;;end;;function addop4(size, list, target, base);for i::Int64 = 0:size-1;list[i+1]+=target[i+base+1];;end;;list;;end;;myrandom=0;;function next_random();global myrandom;myrandom=abs((myrandom*25214903917+11));;return myrandom;;;end;;function exptable(num);num=exp((num/ float(exp_table_size)*2-1)*max_exp);;num/(num+1);;end;;function bsearch_index(list, target);a=0;;z=size(list)[1]-1;;while (true);current_entry=list[a+1:z+1] [floor(Int64,((z-a)/2))+1];if current_entry=target)||z-a<=1;return round(Int64,(a+(z- a)/2+1));;;else;a=round(Int64,(a+(z-a)/2));;;end;;;;else;if a>=target||z-a<=1;return a;;end;;z=round(Int64,(z-(z-a)/2));;;end;;;end;;;end;;function calc_vec(iter, original_text, sample, train_words, debug_mode, __vocab_index_hash, vocab, syn0, syn1neg, negative, alpha, __cum_table, table_size, layer1_size, window);sentence_position=0;;sentence_length=0;;word_count=0;;word_count_actual=0;;last_word_count=0;;sen=[];;local_iter=iter;;neu1=[];;neu1e=[];;backup=copy( original_text);;__denominator=trunc(Int64,parse(string((exp_table_size/max_exp/ 2))));;__sample_train_words=sample*train_words;;table_size=trunc(Int64,parse(string(1e8)));;table=init_unigram_table(table_size,vocab);;starting_alpha=alpha;; while true;if sentence_position%500==0&&debug_mode>1;print(@sprintf(\"%d %d / \",word_count,last_word_count));;end;if word_count- last_word_count>10000;word_count_actual+=word_count-last_word_count;;last_word_count=word_count;;if debug_mode>1;print(string(\"\\r Alpha: \",@sprintf(\"%f\",alpha),\" Progress: \",@sprintf(\"%.2f\",(word_count_actual/float((iter*train_words+1))*100)),\"%\"));;end;;alpha=starting_alpha*(1- word_count_actual/float((iter*train_words+1)));;if alpha0;ran=(sqrt(vocab[word+1][0+1]/__sample_train_words)+1)*__sample_train_words/vocab[word+1][0+1];;if ran<(next_random()&(0xFFFF+0))/ 65536.0;continue;;end;;;end;;push!(sen, word);sentence_length+=1;;if sentence_length>=max_sentence_length;break;;end;;;end;;if max_sentence_length +skipped<=length(original_text)-1;splice!(original_text, 0+1:0+0+max_sentence_length+skipped+1);;else;original_text=[];;;end;;;sentence_position=0;;;end;if size(original_text)[1]==0||word_count>train_words;word_count_actual+=word_count-last_word_count;;local_iter-=1;;if debug_mode>1;print(local_iter);;end;;if local_iter==0;break;;end;;word_count=0;;last_word_count=0;;sentence_length=0;;original_text=copy(backup);;sen=[];;continue;;;end;if sentence_position>=size(sen) [1];continue;;end;word=sen[sentence_position+1];neu1=fill(0.0, layer1_size);neu1e=fill(0.0, layer1_size);b=next_random()%window;cw=0;for 
j::Int64 = b:window*2- b;if j!=window;k=sentence_position-window+j;;if k<0||k>=sentence_length;continue;;end;;if k>=size(sen)[1];continue;;end;;last_word=sen[k +1];;neu1=addop4(layer1_size,neu1,syn0,last_word*layer1_size);;cw+=1;;;end;;end;if cw!=0;for j::Int64 = 0:layer1_size-1;neu1[j+1]/=cw;;end;;if negative>0;for j::Int64 = 0:negative;if j==0;target=word;;label=1;;;else;nr=next_random();;target=table[(nr>>16)%table_size+1];;if target==0;target=nr%(size(vocab) [1]-1)+1;;end;;if target==word;continue;;end;;label=0;;;end;;l2=target*layer1_size;f=0.0;f=addop3(layer1_size,f,neu1,syn1neg,l2);if f>max_exp;g=(label-1)*alpha;;;elseif f<(-max_exp);g=label*alpha;;;else;g=(label-exptable(trunc(Int64,parse(string(((f +max_exp)*__denominator))))))*alpha;;;end;;;neu1e=addop2(layer1_size,neu1e,0,g,syn1neg,l2);syn1neg=addop2(layer1_size,syn1neg,l2,g,neu1,0);;end;;;end;;for j::Int64 = b:window*2-b;if j!=window;c=sentence_position-window+j;;if c<0||c>=sentence_length;continue;;end;;if c>=size(sen)[1];continue;;end;;last_word=sen[c +1];;syn0=addop(layer1_size,syn0,last_word*layer1_size,neu1e);;;end;;end;;;end;sentence_position+=1;if sentence_position>=sentence_length;sentence_length=0;;end;;end;;[syn0,syn1neg];;end;;" You can convert word2vec.rb

Slide 57

Slide 57 text

Next Problem: How To Run a Julia Program from Ruby. For example, run the external program like this? Process.spawn("echo 'p 123' | julializer | julia", :out=>"STDOUT") Obviously not a good solution (you need to marshal data manually, and the Julia language VM must be booted up at every single function call)
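
To see why the per-call VM boot hurts, here is a rough measurement sketch (it assumes julia is on your PATH; the absolute number will vary by machine):

require 'benchmark'
# A trivial computation through a freshly spawned Julia process:
# most of the elapsed time is interpreter start-up, not the work itself.
t = Benchmark.realtime { system("julia -e 'println(1+1)'") }
puts format("one call via a fresh Julia VM: %.2fs", t)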

Slide 58

Slide 58 text

Idea: IPC With Julia What if we could pass arbitrary Ruby values to a running Julia background process, through a Module, via IPC?

Slide 59

Slide 59 text

Introducing virtual_module github.com/remore/virtual_module An IPC module generator. Julia and Python are supported as background processes. Marshaling is done with msgpack.
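
The core mechanism, reduced to a toy (a minimal sketch of the idea only, not virtual_module's actual wire protocol; it assumes the msgpack gem is installed and uses a plain Ruby worker in place of Julia):

require 'msgpack'
require 'socket'

path = '/tmp/vm_demo.sock'   # hypothetical socket path
File.delete(path) if File.exist?(path)
server = UNIXServer.new(path)

# "Background process": accepts one request, unpacks the arguments,
# does the work, and sends the msgpack-encoded result back.
worker = fork do
  conn = server.accept
  args = MessagePack.unpack(conn.read)
  result = args.reduce(2.5) { |x, i| x + i }   # stand-in for the heavy computation
  conn.write(result.to_msgpack)
  conn.close
end

# Caller: marshal the Ruby values, ship them over the socket, unpack the reply.
client = UNIXSocket.new(path)
client.write((1..10).to_a.to_msgpack)
client.close_write
p MessagePack.unpack(client.read)   # => 57.5
Process.wait(worker)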

Slide 60

Slide 60 text

Sample Usage(1): Calling Julia from Ruby jl = VirtualModule.new(:julia=>["Clustering"]) include jl r = Clustering.kmeans(jl.rand(5, 1000), 20, maxiter:200, display: :iter) p Clustering.assignments(r) # [3, 13, 2, 7, 15, 12, 10, ... 13, 1, 8]

Slide 61

Slide 61 text

jl = VirtualModule.new(:julia=>["Clustering"]) include jl r = Clustering.kmeans(jl.rand(5, 1000), 20, maxiter:200, display: :iter) p Clustering.assignments(r) # [3, 13, 2, 7, 15, 12, 10, ... 13, 1, 8] Sample Usage(1): Calling Julia from Ruby By calling the VirtualModule#new method, a Julia background process is booted up and starts idling

Slide 62

Slide 62 text

jl = VirtualModule.new(:julia=>["Clustering"]) include jl r = Clustering.kmeans(jl.rand(5, 1000), 20, maxiter:200, display: :iter) p Clustering.assignments(r) # [3, 13, 2, 7, 15, 12, 10, ... 13, 1, 8] Sample Usage(1): Calling Julia from Ruby VirtualModule#new method will give you back an instance of Module class

Slide 63

Slide 63 text

jl = VirtualModule.new(:julia=>["Clustering"]) include jl r = Clustering.kmeans(jl.rand(5, 1000), 20, maxiter:200, display: :iter) p Clustering.assignments(r) # [3, 13, 2, 7, 15, 12, 10, ... 13, 1, 8] Sample Usage(1): Calling Julia from Ruby Since it’s an instance of Module class, you can #include it

Slide 64

Slide 64 text

jl = VirtualModule.new(:julia=>["Clustering"]) include jl r = Clustering.kmeans(jl.rand(5, 1000), 20, maxiter:200, display: :iter) p Clustering.assignments(r) # [3, 13, 2, 7, 15, 12, 10, ... 13, 1, 8] Sample Usage(1): Calling Julia from Ruby Now is the time to call an arbitrary function in Julia. Every parameter passed from the Ruby world is converted to a Julia value by msgpack

Slide 65

Slide 65 text

jl = VirtualModule.new(:julia=>["Clustering"]) include jl r = Clustering.kmeans(jl.rand(5, 1000), 20, maxiter:200, display: :iter) p Clustering.assignments(r) # [3, 13, 2, 7, 15, 12, 10, ... 13, 1, 8] Sample Usage(1): Calling Julia from Ruby msgpack converts only basic data types (such as Integer, String, Array, etc.). In this case, since the kmeans function returns a value of the `Clustering.KmeansResult{Float64}` type, r is still an instance of the Module class which keeps a pointer to the `Clustering.KmeansResult{Float64}` value held in the background process.

Slide 66

Slide 66 text

jl = VirtualModule.new(:julia=>["Clustering"]) include jl r = Clustering.kmeans(jl.rand(5, 1000), 20, maxiter:200, display: :iter) p Clustering.assignments(r) # [3, 13, 2, 7, 15, 12, 10, ... 13, 1, 8] Sample Usage(1): Calling Julia from Ruby Since the Clustering.assignments function returns a basic data type which can be converted to a Ruby Array, we finally get the clustering result!

Slide 67

Slide 67 text

Sample Usage(2): Calling Python(sklearn) from Ruby
skl = VirtualModule.new(
  :lang=>:python,
  :pkgs=>["sklearn"=>["datasets", "svm", "grid_search", "cross_validation"]]
)
include skl
iris = datasets.load_iris(:_)
clf = grid_search.GridSearchCV(
  svm.LinearSVC(:_),
  {'C':[1, 3, 5],'loss':['hinge', 'squared_hinge']},
  verbose:0
)
clf.fit(iris.data, iris.target)
p "Best Params: #{best_params = clf.best_params_}"
# "Best Params: {\"loss\"=>\"squared_hinge\", \"C\"=>1}"
score = cross_validation.cross_val_score(
  svm.LinearSVC(loss:'squared_hinge', C:1),
  iris.data, iris.target, cv:5
)
p "Scores: #{[:mean,:min,:max,:std].map{|e| e.to_s + '=' + score.send(e, :_).to_s }.join(',')}"
# "Scores: mean=0.9666666666666668,min=0.9,max=1.0,std=0.04216370213557838"

Slide 68

Slide 68 text

Sample Usage(3): Defining Custom Methods With Julializer
vm = VirtualModule.new(methods: <<EOS, ->(s) {Julializer.ruby2julia(s)})
def init_table(list)
  for i in 0..list.size-1
    list[i]+=Random.rand
  end
  list
end
EOS
p vm.init_table([1,20]) # [1.3066601775641218, 20.17001189249985]

Slide 69

Slide 69 text

More Details Can Be Found At
Sample Snippets:
remore/virtual_module/blob/master/example/calc.rb
remore/virtual_module/blob/master/example/scipy.rb
remore/virtual_module/blob/master/example/word2vec.rb
Corresponding Blog:
rimuru.lunanet.gr.jp/blog/calling-python-and-julia-libraries-from-ruby

Slide 70

Slide 70 text

$ SRC=$(cat << EOS "p VirtualModule.new(methods:<

Slide 71

Slide 71 text

> N=1e5 && eval echo "$SRC" | time ruby -r virtual_module
5000050002.5
2.68 real 1.58 user 0.36 sys
[Chart: elapsed time (0-1.8sec), N up to 1e5: Ruby, Python, C, Julia, VirtualModule]

Slide 72

Slide 72 text

Is it …… slow?

Slide 73

Slide 73 text

> N=1e6 && eval echo "$SRC" | time ruby -r virtual_module
500000500002.5
2.20 real 1.46 user 0.25 sys
[Chart: elapsed time (0-3sec), N up to 1e6: Ruby, Python, C, Julia, VirtualModule]

Slide 74

Slide 74 text

> N=1e7 && eval echo "$SRC" | time ruby -r virtual_module
50000005000002.5
1.68 real 1.51 user 0.21 sys
[Chart: elapsed time (0-3sec), N up to 1e7: Ruby, Python, C, Julia, VirtualModule]

Slide 75

Slide 75 text

> N=1e8 && eval echo "$SRC" | time ruby -r virtual_module
5.000000050000003e+15
1.95 real 1.75 user 0.21 sys
[Chart: elapsed time (0-50sec), N up to 1e8: Ruby, Python, C, Julia, VirtualModule]

Slide 76

Slide 76 text

> N=1e9 && eval echo "$SRC" | time ruby -r virtual_module
5.00000000067109e+17
4.50 real 4.29 user 0.21 sys
[Chart: elapsed time (0-600sec), N up to 1e9: Ruby, Python, C, Julia, VirtualModule]

Slide 77

Slide 77 text

Yay! virtual_module works as fast as C and Julia!

Slide 78

Slide 78 text

Benchmarking Using the Pure-Ruby word2vec Implementation
$ cd example
$ ruby word2vec.rb --output /tmp/vectors.bin --train ../doc/benchmark_word2vec/training_data/10mb.txt --size 20 --window 10 --negative 5 --sample 1e-4 --binary 1 --iter 3 --debug 0 > /dev/null 2>&1
$ python
Python 2.7.12 (default, Jul 1 2016, 15:12:24) [GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import gensim
>>> model = gensim.models.Word2Vec.load_word2vec_format('/tmp/vectors.bin', binary=True)
>>> model.most_similar("japan")
[(u'netherlands', 0.9741939902305603), (u'china', 0.9712631702423096), (u'county', 0.9686408042907715), (u'spaniards', 0.9669440388679504), (u'vienna', 0.9614173769950867), (u'abu', 0.9587018489837646), (u'korea', 0.9565504789352417), (u'canberra', 0.954473614692688), (u'erupts', 0.9540712833404541), (u'prefecture', 0.9534248113632202)]

Slide 79

Slide 79 text

Benchmarking Program Can Be Found At: github.com/remore/virtual_module/blob/master/doc/benchmark_word2vec/compare_three_types_of_word2vec_implementation.sh

Slide 80

Slide 80 text

Benchmarking Result

Slide 81

Slide 81 text

[Chart: benchmark result bars labelled 2.4x, 3x, 5.6x and 297x] Looks Almost 3x Faster! Done!?

Slide 82

Slide 82 text

Yes, done, in the sense of making the pure-Ruby word2vec program more than 3x faster. But…

Slide 83

Slide 83 text

But Still Problematic: if the size of the input text gets seriously huge, there is still a big gap compared to C….

Slide 84

Slide 84 text

Chapter 3 Why Slow An ongoing profiling attempt to find out which part of the program causes this performance issue

Slide 85

Slide 85 text

Which part of the source code runs slow?

Slide 86

Slide 86 text

ruby-prof
$ cat profile.rb
RubyProf.start
x=2.5
1.upto(1e4){|i| x=x+i}; p x
result = RubyProf.stop
RubyProf::FlatPrinter.new(result).print(STDOUT)
$ ruby -r ruby-prof profile.rb

Slide 87

Slide 87 text

$ ruby -r ruby-prof profile.rb
50005002.5
Measure Mode: wall_time
Thread ID: 70204763724260
Fiber ID: 70204768000900
Total: 0.009597
Sort by: self_time

 %self     total     self     wait    child    calls  name
 52.19     0.010    0.005    0.000    0.005        1  Integer#upto
 16.92     0.002    0.002    0.000    0.000    10001  Fixnum#>
 15.79     0.002    0.002    0.000    0.000    10000  Fixnum#+
 14.60     0.001    0.001    0.000    0.000    10000  Float#+
  0.22     0.010    0.000    0.000    0.010        1  Global#[No method]
  0.19     0.000    0.000    0.000    0.000        1  Kernel#p
  0.09     0.000    0.000    0.000    0.000        1  Float#inspect

#upto is the slowest. #> is also slow.

Slide 88

Slide 88 text

What is happening under the hood?

Slide 89

Slide 89 text

InstructionSequence
$ ruby -e "printf RubyVM::InstructionSequence.compile('x=2.5; 1.upto(1e4){ |i| x = x+i }').disasm"

Slide 90

Slide 90 text

$ ruby -e "printf RubyVM::InstructionSequence.compile('x=2.5; 1.upto(1e4){ |i| x = x+i }').disasm" == disasm: #@>================================ == catch table | catch type: break st: 0006 ed: 0013 sp: 0000 cont: 0013 |------------------------------------------------------------------------ local table (size: 2, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1]) [ 2] x 0000 trace 1 ( 1) 0002 putobject 2.5 0004 setlocal_OP__WC__0 2 0006 putobject_OP_INT2FIX_O_1_C_ 0007 putobject 10000.0 0009 send , , block in 0013 leave == disasm: #@>======================= == catch table | catch type: redo st: 0002 ed: 0014 sp: 0000 cont: 0002 | catch type: next st: 0002 ed: 0014 sp: 0000 cont: 0014 |------------------------------------------------------------------------ local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1]) [ 2] i 0000 trace 256 ( 1) 0002 trace 1 0004 getlocal_OP__WC__1 2 0006 getlocal_OP__WC__0 2 0008 opt_plus , 0011 dup 0012 setlocal_OP__WC__1 2 0014 trace 512 0016 leave will help us understand what’s happening internally

Slide 91

Slide 91 text

$ ruby -e "printf RubyVM::InstructionSequence.compile('x=2.5; 1.upto(1e4){ |i| x = x+i }').disasm" == disasm: #@>================================ == catch table | catch type: break st: 0006 ed: 0013 sp: 0000 cont: 0013 |------------------------------------------------------------------------ local table (size: 2, argc: 0 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1]) [ 2] x 0000 trace 1 ( 1) 0002 putobject 2.5 0004 setlocal_OP__WC__0 2 0006 putobject_OP_INT2FIX_O_1_C_ 0007 putobject 10000.0 0009 send , , block in 0013 leave == disasm: #@>======================= == catch table | catch type: redo st: 0002 ed: 0014 sp: 0000 cont: 0002 | catch type: next st: 0002 ed: 0014 sp: 0000 cont: 0014 |------------------------------------------------------------------------ local table (size: 2, argc: 1 [opts: 0, rest: -1, post: 0, block: -1, kw: -1@-1, kwrest: -1]) [ 2] i 0000 trace 256 ( 1) 0002 trace 1 0004 getlocal_OP__WC__1 2 0006 getlocal_OP__WC__0 2 0008 opt_plus , 0011 dup 0012 setlocal_OP__WC__1 2 0014 trace 512 0016 leave But which instruction on earth is really slow?

Slide 92

Slide 92 text

What EXACTLY is happening under the hood?

Slide 93

Slide 93 text

Introducing yarv-prof github.com/remore/yarv-prof A tiny DTrace-based YARV profiler. Instrumented profiling with walltime or cputime. Only a basic dataset is provided so far. Still under development.

Slide 94

Slide 94 text

yarv-prof
require 'yarv-prof'
YarvProf.start(clock: :cpu, out:'~/log/')
x=2.5
1.upto(N){|i| x=x+i }
p x
YarvProf.end

Slide 95

Slide 95 text

opt_plus is more than 50% slower than the other insns

Slide 96

Slide 96 text

(More metrics and features are still under development; to be continued)

Slide 97

Slide 97 text

Chapter 4 Your Turn! Why not attempt it yourself, towards Ruby 3x3? Or even "5xRuby"?

Slide 98

Slide 98 text

Thanks! @remore