Text-similarity analysis for Shiganai Radio

* Transcribe several episodes of Shiganai Radio.
* Convert the transcripts into vectors with TF-IDF.
* Calculate the cosine similarity between each pair of episodes.

Takaaki Tanaka

December 17, 2019

Transcript

  1. I tried analyzing Shiganai Radio (still a work in progress)

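  2. Self-introduction: @in_the_rye. Worked at an SIer for nearly 10 years (currently coordination work only) → after a lot of thought, planning to change jobs. This is the first LT of my life!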

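  3. Motivation for the analysis: I started listening to Shiganai Radio from ep. 1 and only recently caught up with the latest episode → it took about a year. Listening to everything is fun, but it is a bit of work. Wouldn't a map make it a little easier? Let's make a map that shows a bit more of what is inside each Shiganai Radio episode!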

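  4. Analysis? What did you actually do? Transcription from audio; morphological analysis; vectorization with the TF-IDF transform; cosine similarity calculation.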

  5. What on earth are you talking about?

  6. Let me explain! (With knowledge I picked up by googling...)

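  7. Transcription: Used the Google Cloud Speech-to-Text API. Transcription is free up to 60 minutes per month; beyond that it is pay-as-you-go. Doing every episode would be tough, so I transcribed 19 randomly selected episodes.

    ¥5,406 was charged against the GCP free allowance...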
  8. Morphological analysis: Used a Python library to split the text into words.

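  9. Vectorizing text with the TF-IDF transform: TF (Term Frequency) is how often a word appears. IDF is the Inverse Document Frequency. The values computed under these two criteria, TF and IDF, are multiplied to give each word a weight. → The words and their weights themselves form a vector, so a document can be vectorized!

    3D vector: (x, y, z) = (1, 2, 3), coordinates and their values. Document vector: (Shiganai, Radio, SIer) = (1, 2, 3), words and their values.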
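  10. In short, the TF-IDF transform means: the more frequently a word appears in a document, the higher its score (TF). The rarer a word is across documents, the higher its score (IDF). The product of these two scores decides the word's score! Once every word has a score, that is the vector!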

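  11. Cosine similarity: cos θ, where θ is the angle between two vectors. If they point in the same direction, cos θ = 1. At 90 degrees, cos θ = 0. Strictly speaking, exactly opposite vectors (180 degrees) give -1, but here all values are non-negative, so the range is only 0 to 1.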
  12. Analysis results (19 episodes only): map of cosine similarity

  13. SP8A-TBPGR

  14. SP38A-HIGUYUME

  15. SP16A-INFRAGIRL755

  16. Whose guest episode do you think this is?

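  17. Summary: Analyzed Shiganai Radio using OSS libraries and similar tools. Made a map showing which episodes resemble each other. The results looked more plausible than expected. On closer inspection, though, some odd words are mixed in, probably because the transcription accuracy is not that good. Only 19 episodes are covered so far, so I plan to analyze a bit more within the GCP free credit.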

  18. Thank you for your attention.
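
The TF-IDF and cosine-similarity steps described in slides 9 to 11 can be sketched in plain Python as follows. This is only a minimal illustration, not the code used in the talk: the slides do not name the libraries or the exact TF-IDF formula, so the smoothed IDF variant, the function names `tfidf_vectors` and `cosine_similarity`, and the tiny pre-tokenized `episodes` list are all assumptions made for this example. Morphological analysis is assumed to have already split each transcript into words.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    # Smoothed IDF (an assumed variant): rarer words get a larger weight.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for an empty document
        # TF (relative frequency in this document) multiplied by IDF.
        vectors.append([counts[w] / total * idf[w] for w in vocab])
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already-tokenized "episodes" (not real transcript data).
episodes = [
    ["radio", "sier", "career", "sier"],
    ["radio", "sier", "career", "change"],
    ["radio", "infra", "cloud", "cloud"],
]

vocab, vectors = tfidf_vectors(episodes)
# Pairwise similarity matrix: the "map" of which episodes resemble each other.
similarity = [[cosine_similarity(u, v) for v in vectors] for u in vectors]
for row in similarity:
    print(" ".join(f"{value:.2f}" for value in row))
```

Because every TF-IDF weight is non-negative, each entry of `similarity` falls between 0 and 1, matching the remark on the cosine-similarity slide. With real data, the token lists would come from the transcribed and morphologically analyzed episodes.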