
W2vUtils.jl


A talk about my first Julia package, presented at JuliaTokyo #3 on 2015-04-25.

Kenta Murata

April 25, 2015


Transcript

  1. Kenta Murata (@mrkn, 村田 賢太)
    ✓ Cookpad Inc.
    ✓ Ruby committer and bigdecimal maintainer
    ✓ A Julia beginner
  2. Julia and me
    ✓ First encountered Julia in 2012(?), on Wikipedia
    ✓ First used it in 2013, only in the REPL
  3. Julia and me
    ✓ First encountered Julia in 2012(?), on Wikipedia
    ✓ First used it in 2013, only in the REPL
    ✓ Wrote my first script yesterday!!
  4. Why W2vUtils.jl?
    ✓ At first I tried to write a SOM (self-organizing map) learner for this presentation
    ✓ What is the most interesting application of SOM?
  5. Why W2vUtils.jl?
    ✓ At first I tried to write a SOM (self-organizing map) learner for this presentation
    ✓ What is the most interesting application of SOM?
    ✓ I think it would be interesting to map distributed representations of words onto a 2D lattice.
  6. But…
    ✓ I couldn't finish writing the SOM learner
    ✓ Writing both a data loader and a SOM learner was too much to get done in one night
  7. But…
    ✓ I couldn't finish writing the SOM learner
    ✓ Writing both a data loader and a SOM learner was too much to get done in one night
    ✓ So I focused entirely on making my first package
  8. Load word2vec data

        using W2vUtils

        wv = load("vectors.bin", W2vData)
        nwords(wv)              # => The number of words
        vocabulary(wv)          # => The array of words
        projdim(wv)             # => Dimensions of vector (== 200)
        projection(wv)          # => The projection matrix
        wordindex(wv, word)     # => Lookup index of the word
        wordindices(wv, words)  # => Lookup indices of the words
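For readers curious what a loader like this has to do, here is a minimal sketch of reading the binary layout written by the original word2vec C tool. It is an assumption that `vectors.bin` uses that layout, and `read_w2v` is a hypothetical name, not part of W2vUtils. The layout assumed is an ASCII header `<nwords> <dim>\n`, then for each word the word itself, a space, `dim` little-endian Float32 values, and a trailing newline.

```julia
# Sketch only: assumes the word2vec C tool's binary layout, not W2vUtils' internals.
function read_w2v(io::IO)
    header = split(readline(io))
    nwords = parse(Int, header[1])
    dim = parse(Int, header[2])
    words = Vector{String}(undef, nwords)
    vectors = Matrix{Float32}(undef, dim, nwords)   # one column per word
    for i in 1:nwords
        # readuntil drops the space; lstrip drops the newline left by the previous record
        words[i] = String(lstrip(readuntil(io, ' '), '\n'))
        for j in 1:dim
            vectors[j, i] = reinterpret(Float32, ltoh(read(io, UInt32)))
        end
    end
    return words, vectors
end

# Build a tiny in-memory file with two made-up words and read it back.
buf = IOBuffer()
write(buf, "2 3\n")
for (w, v) in (("apple", Float32[1, 2, 3]), ("pear", Float32[4, 5, 6]))
    write(buf, w, " ")
    foreach(x -> write(buf, htol(reinterpret(UInt32, x))), v)
    write(buf, "\n")
end
seekstart(buf)
words, vectors = read_w2v(buf)
```

Storing one word per column keeps each vector contiguous in memory, since Julia arrays are column-major.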
  9. N-best nearest words

        using W2vUtils

        wv = load("recipe_steps.bin", W2vData)
        (words, dists) = distance(wv, "チョコ"; n=5)   # チョコ: "choco" (chocolate)
        collect(zip(words, dists))
        # => 5-element Array{(UTF8String,Float64),1}:
        #     ("チョコレート",0.9378173408020709)   chocolate
        #     ("ガナッシュ",0.7568368811212932)     ganache
        #     ("マシュマロ",0.7461657278585042)     marshmallow
        #     ("ビターチョコ",0.7439865689272069)   bitter chocolate
        #     ("クランチ",0.7296649975102198)       crunch
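The scores returned by `distance` look like cosine similarities, as in the original word2vec `distance` tool, though the slides do not say so explicitly. Under that assumption, an n-best lookup can be sketched as follows. `nbest` is a hypothetical helper, and the words and vectors are made-up toy data.

```julia
using LinearAlgebra

# Hypothetical helper: the `n` words most similar to `query` by cosine similarity.
# `vectors` holds one word vector per column; `words[i]` names column i.
function nbest(words, vectors, query; n=5)
    q = normalize(query)
    sims = [dot(normalize(vectors[:, i]), q) for i in 1:size(vectors, 2)]
    order = sortperm(sims; rev=true)[1:min(n, length(sims))]
    return words[order], sims[order]
end

# Toy data: three 2-dimensional "word vectors".
words = ["chocolate", "ganache", "onion"]
vectors = [1.0 0.8 0.0;
           0.0 0.6 1.0]
top, sims = nbest(words, vectors, vectors[:, 1]; n=2)
```

Unlike the output on the slide, this sketch does not filter the query word out of its own result list.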
  10. Word analogy

        using W2vUtils

        wv = load("recipe_steps-phrase.bin", W2vData)
        # 名古屋 (Nagoya) : 赤味噌 (red miso) :: 長野 (Nagano) : ?
        (words, dists) = analogy(wv, ["名古屋", "赤味噌", "長野"]; n=5)
        collect(zip(words, dists))
        # => 5-element Array{(UTF8String,Float64),1}:
        #     ("信州",0.5091896142078718)            Shinshu
        #     ("味噌",0.5033710820183608)            miso
        #     ("麦_味噌",0.5033459461705277)         barley miso
        #     ("信州_味噌",0.49885121244487496)      Shinshu miso
        #     ("赤味噌_白味噌",0.4816681271563857)   red miso, white miso
  11. Nearest words for a vector

        using W2vUtils

        wv = load("recipe_steps-phrase.bin", W2vData)
        赤味噌 = projection(wv, "赤味噌")   # red miso
        名古屋 = projection(wv, "名古屋")   # Nagoya
        長野 = projection(wv, "長野")       # Nagano
        (words, dists) = nearest_words(wv, 赤味噌 - 名古屋 + 長野; n=5)
        collect(zip(words, dists))
        # => 5-element Array{(UTF8String,Float64),1}:
        #     ("長野",0.6527766969046891)     Nagano
        #     ("赤味噌",0.5241494934036854)   red miso
        #     ("信州",0.5091896142078718)     Shinshu
        #     ("味噌",0.5033710820183608)     miso
        #     ("麦_味噌",0.5033459461705277)  barley miso
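The two slides above suggest that `analogy` ranks words against the offset vector b - a + c and leaves out the three input words, while `nearest_words` on the same vector keeps them. That reading is inferred from the matching scores, not from the package source. A minimal sketch of the idea, with a hypothetical helper `analogy_sketch` and made-up 2-dimensional toy vectors:

```julia
using LinearAlgebra

cosine(u, v) = dot(u, v) / (norm(u) * norm(v))

# Hypothetical helper: rank words by cosine similarity to b - a + c,
# optionally skipping the three input words. Errors if a word is unknown.
function analogy_sketch(words, vectors, a, b, c; n=5, skip_inputs=true)
    col(w) = vectors[:, findfirst(==(w), words)]
    target = col(b) - col(a) + col(c)
    scored = [(w, cosine(vectors[:, i], target)) for (i, w) in enumerate(words)
              if !(skip_inputs && w in (a, b, c))]
    sort!(scored; by=last, rev=true)
    return scored[1:min(n, length(scored))]
end

# Toy vectors (invented for illustration), one column per word.
words = ["Nagoya", "red miso", "Nagano", "Shinshu miso"]
vectors = [1.0 1.0 -1.0 -1.0;
           0.0 1.0  0.0  1.0]
best = analogy_sketch(words, vectors, "Nagoya", "red miso", "Nagano")
ranked = analogy_sketch(words, vectors, "Nagoya", "red miso", "Nagano"; skip_inputs=false)
```

With `skip_inputs=false` the input words compete with everything else, which is why they can appear at the top of a `nearest_words`-style result.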
  12. examples/w2v_pca.jl

        using W2vUtils
        using MultivariateStats
        using Gadfly

        # ARGS: word2vec binary file, query word, output PDF path
        wv = load(ARGS[1], W2vData)
        (words, dists) = distance(wv, ARGS[2]; n=15)
        vecs = W2vUtils.projection(wv, words)

        # Reduce the 15 nearest-word vectors to two dimensions
        model = fit(PCA, vecs'; maxoutdim=2)
        transvecs = transform(model, vecs')

        pca_plot = plot(x=transvecs[1, :], y=transvecs[2, :], label=words,
                        Geom.point, Geom.label)
        draw(PDF(ARGS[3], 4inch, 3inch), pca_plot)
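The script delegates the dimensionality reduction to MultivariateStats. To show roughly what a 2-D PCA projection computes, here is a sketch using only the standard library. `pca_project` is a hypothetical helper and not the MultivariateStats implementation. Like `fit(PCA, ...)`, it assumes one observation per column.

```julia
using LinearAlgebra, Statistics

# Project d-dimensional column vectors onto their first k principal components.
function pca_project(X::AbstractMatrix, k::Integer)
    centered = X .- mean(X; dims=2)   # subtract the mean vector from every column
    U = svd(centered).U               # columns of U are the principal directions
    return U[:, 1:k]' * centered      # k × n matrix of coordinates
end

# Toy data: four 3-dimensional points that all lie on one line.
X = [1.0 2.0 3.0 4.0;
     1.0 2.0 3.0 4.0;
     0.0 0.0 0.0 0.0]
Y = pca_project(X, 2)
```

Because the toy points are collinear, all of the variance lands in the first row of `Y` and the second row is numerically zero. The sign of each row is arbitrary, as with any SVD-based PCA.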
  13. $ julia w2v_pca.jl recipe_steps.bin チョコ pca1.pdf

    [Figure: PCA scatter plot written to pca1.pdf; point labels not recoverable from the transcript]
  14. $ julia w2v_pca.jl recipe_steps-phrase.bin 赤味噌 pca2.pdf

    [Figure: PCA scatter plot written to pca2.pdf; point labels not recoverable from the transcript]
  15. Future work
    ✓ Conform to the standard Julia coding style
    ✓ Submit to METADATA.jl
    ✓ Implement a self-organizing map
  16. Future work
    ✓ Conform to the standard Julia coding style
    ✓ Submit to METADATA.jl
    ✓ Implement a self-organizing map
    ✓ Visualize a 2D lattice map of distributed word representations