
Hybrid Keyword Extraction Automated by Python With Cloud Speech To Text API and Video API

tetsuya0617
September 15, 2019

This is my poster from PyCon JP 2019 🇯🇵 (https://pycon.jp/2019/sessions?category=poster)


Transcript

  1. Hybrid Keyword Extraction Automated by Python
    With Cloud Speech To Text API and Video API
    Tetsuya Hirata
    Background
    The search index contained only the title of each video.
    Zero-hit keywords: 240 (29.2%)
    Keywords with hits: 583 (70.8%)
    Keywords were extracted automatically from the speech and text in the videos with Cloud Speech To Text API and Video API.
    Reading comprehension
    (Classical Japanese/Chinese Classics/Contemporary Japanese)
    Names of characters, works, and grammar terms become searchable, e.g. 「枕草子」 (The Pillow Book), 「史記」 (Records of the Grand Historian), 「大納言」 (Dainagon), 「逆説」 (paradox)
    Mathematics
    Mathematical symbols become searchable, e.g. 「アルファ」 (alpha), 「パイ」 (pi), 「サインコサイン」 (sine, cosine), 「累乗」 (exponentiation)
    English
    Grammar terms and English vocabulary become searchable, e.g. 「意志未来」 (volitional future), 「in」, 「also」
    The code is automated with Python and dockerized.
    Extra code was needed to access Google Cloud Storage (GCS) in order to load the movie and audio files and to upload the results to GCS as a backup log.
    → Abstracted the BigQuery and Google Cloud Storage client functionality for easy access
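As an illustration of the kind of GCS access code described above, here is a minimal sketch of a wrapper around the google-cloud-storage client. The class name `GcsStore` and its method names are hypothetical and are not the API of the poster's own package; only `Client.bucket`, `Bucket.blob`, `Blob.download_to_filename`, and `Blob.upload_from_filename` come from the google-cloud-storage library.

```python
from pathlib import Path


class GcsStore:
    """Hypothetical thin wrapper around a google-cloud-storage client for one bucket."""

    def __init__(self, bucket_name, client=None):
        if client is None:
            # Imported lazily so the class can also be exercised with a stub client.
            from google.cloud import storage
            client = storage.Client()
        self._bucket = client.bucket(bucket_name)

    def download(self, blob_path, local_dir):
        """Download one object into local_dir and return the local path."""
        target = Path(local_dir) / Path(blob_path).name
        self._bucket.blob(blob_path).download_to_filename(str(target))
        return target

    def upload(self, local_path, blob_path):
        """Upload a local file (for example a results log) to the bucket."""
        self._bucket.blob(blob_path).upload_from_filename(str(local_path))
```

Passing the client in as an argument keeps credentials handling outside the wrapper and makes the class easy to test without network access.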
    Solution to improve search indexes
    Evaluation
    Data set:
    Movie files (mp4) on mathematics, reading comprehension (Classical Japanese, Chinese Classics, and Contemporary Japanese), and English
    Target keywords:
    Zero-hit keywords searched by more than five unique users between 2018/04/01 and 2019/06/14.
    How the keywords were evaluated:
    Compared the number of hits before and after adding the new keywords extracted from speech and text in the movies, and checked whether the keywords are related to each subject.
    Results
    Future Works
    Procedures for extracting keywords from movie contents
    tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect
    how important a word is to a document in a collection or corpus. (Rajaraman, A.; Ullman, J.D. (2011). "Data Mining". Mining of Massive Datasets. pp. 1–17.)
    \[ \mathrm{tf}(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d} \]
    \[ \mathrm{idf}(t) = \log \frac{\text{total number of documents}}{\text{number of documents in which term } t \text{ occurs}} \]
    Term frequency measures how often a term occurs in a document. Inverse document frequency is an inverse function of the number of documents in which the term occurs.
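The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming documents that are already tokenized into lists of terms and a natural-logarithm idf; the function name `tf_idf` is hypothetical, and the poster's actual implementation (and any smoothing it may apply) is not shown in the source.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # tf = count / total terms in the document, idf = log(N / df)
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores
```

A term that occurs in every document gets a score of zero under this definition, which is why very common words drop out of the keyword candidates.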
    Architecture with Cloud Speech To Text API and Video API
    Number of keywords that hit movie contents: 247
    Number of keywords with improved hits: 162 / 247
    Of these keywords, 130/162 (80%) return related movie contents.
    Outliers: 32 keywords
    Number of former zero-hit keywords with improved hits: 66 / 162
    Of these keywords, 52/66 (79%) return related movie contents.
    Outliers: 14 keywords
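The percentages and outlier counts above follow directly from the reported counts. A short check, assuming that the outliers are the keywords in each group that did not return related contents (the helper name `percent` is hypothetical):

```python
def percent(part, whole):
    """Share of part in whole, rounded to the nearest whole percent."""
    return round(100 * part / whole)


# Counts as reported in the results.
improved_hits = {"total": 162, "related": 130}
improved_zero_hits = {"total": 66, "related": 52}

for name, group in (("improved hits", improved_hits),
                    ("improved zero hits", improved_zero_hits)):
    outliers = group["total"] - group["related"]
    print(name, percent(group["related"], group["total"]), "% related,",
          outliers, "outliers")
```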
    Qualitative results Quantitative results
    Improve code related to input/output
    Improve stop-words and dictionary
    - Cut noise words and automate the cycle of adding stop-words
    - Explore search keywords that are not related to movie contents and add them to the dictionary
    - Improve the keyword list registered in the database by interviewing the creators of the movies or experts in the pedagogical domain
    (t: term, d: document, f: frequency)
    [Architecture diagram. Components: Cloud Speech To Text API, Video API, MeCab, TF-IDF, .csv.gzip, GCS, Excel, Elasticsearch]
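    "… Classi learning videos recommend videos and exercises matched to each student's weak points. Students can go back to the areas where they are stuck and study or level up according to their own learning progress. Based on the results of Benesse tests, the learning content that students should work on now in Japanese, mathematics, and English is recommended. In addition, the learning map shows an achievement zone for each unit of each subject, based on the tests each student has taken in the past. The units are sorted from weakest, so a student's weak areas can be grasped at a glance …" (translated from Japanese; https://www.youtube.com/watch?v=XjrBkhaV5Aw)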
    1.6016 だことになるだろう」<経験>を表す未来
    1.6016 am reading 0.906650007
    1.6016 「もしもう一度読んだらこの本を3回読ん 1
    1.6016 完了で 「(その時までに)…したことにな
    1.6016 動詞の時制<レベル6> 0.95717603
    1.7017 will have read 0.993844032
    (https://www.youtube.com/watch?v=XjrBkhaV5Aw)
    Steps to extract and register keywords
    1. Input audio files (.flac) and movie files into Cloud Speech To Text API and Video API
    2. Pre-process the output data with MeCab and calculate TF-IDF
    3. Extract keywords and upload them to GCS
    4. Add the keyword list to the Excel file that holds the master information about the movies
    5. Store the Excel file in S3 and register the keywords in Elasticsearch
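Step 1 of these steps can be sketched with the google-cloud-speech Python client. This is only an illustration under stated assumptions: the function names `transcribe_flac` and `join_transcripts`, the `ja-JP` language code, and the timeout are hypothetical choices, and the poster does not show how its own code calls the API. Running `transcribe_flac` requires the google-cloud-speech package and valid credentials.

```python
def join_transcripts(results):
    """Concatenate the top alternative of each recognized segment."""
    return "".join(r.alternatives[0].transcript for r in results if r.alternatives)


def transcribe_flac(gcs_uri, language_code="ja-JP", timeout=3600):
    """Transcribe a FLAC file stored on GCS and return the full transcript."""
    # Imported lazily: needs the google-cloud-speech package and credentials.
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        language_code=language_code,
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    # Long-running recognition is the variant intended for longer audio files.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=timeout)
    return join_transcripts(response.results)
```

The segments are joined without a separator because Japanese text is written without spaces; for English audio a space separator would be the more natural choice.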
    Tokenizers (MeCab)
    """
    All subjects
    """
    CONTENT_WORD_POS = ('名詞',)  # noun
    EXCLUDE_WORD_POS = ('非自立', '代名詞')  # dependent word, pronoun
    EXCLUDE_WORD = ('ぁ', 'ぃ', 'ぅ', 'ぇ', 'ぉ', 'ゃ', 'ゅ', 'ょ', 'ゎ')  # small kana
    """
    Contemporary Japanese
    """
    INCLUDE_WORD_POS = ('固有名詞',)  # proper noun
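One way such part-of-speech settings could be applied to MeCab output is sketched below. It assumes MeCab's default output format with an IPAdic-style dictionary (surface form, a tab, then comma-separated features whose first two fields are the part of speech and its first subcategory). The function names `content_words` and `tokenize` and the exact filtering logic are hypothetical; the poster does not show how the constants are used.

```python
CONTENT_WORD_POS = ("名詞",)            # noun
EXCLUDE_WORD_POS = ("非自立", "代名詞")  # dependent word, pronoun
EXCLUDE_WORD = ("ぁ", "ぃ", "ぅ", "ぇ", "ぉ", "ゃ", "ゅ", "ょ", "ゎ")  # small kana


def content_words(parsed):
    """Keep content words from MeCab's default (IPAdic-style) output text."""
    words = []
    for line in parsed.splitlines():
        if line == "EOS" or "\t" not in line:
            continue
        surface, feature = line.split("\t", 1)
        fields = feature.split(",")
        pos = fields[0]
        sub_pos = fields[1] if len(fields) > 1 else ""
        if pos not in CONTENT_WORD_POS:
            continue
        if sub_pos in EXCLUDE_WORD_POS or surface in EXCLUDE_WORD:
            continue
        words.append(surface)
    return words


def tokenize(text):
    """Parse text with MeCab and return the filtered content words."""
    import MeCab  # imported lazily; needs mecab-python3 and a dictionary
    return content_words(MeCab.Tagger().parse(text))
```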
    Remove noise with regular expressions
    All subjects: ‘[Ớʀ”ʨʩʔʊˏɾʁʛʹʆˇʻʼʮʯɺ]'
    English: '[0-9０-９_!-/:-@\[-`{-~]'
    Classical Japanese/Chinese Classics: '[Ν-ϰa-zA-Z0-9̌-̕Α-ωА-яᴷ-ᵹⅠ-Ⅹᶃ-ᶖ_!-/:-@¥[-`{-~]'

    Others : ’[a-zA-Z0-9̌-̕Α-ωА-яᴷ-ᵹⅠ-Ⅹᶃ-ᶖ_!-/:-@¥[-`{-~]'
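Character classes like these are typically applied with `re.sub`. The sketch below uses only the ASCII portion of the English pattern (digits plus ASCII punctuation and symbols); the names `ASCII_NOISE` and `remove_noise` and the whitespace clean-up step are hypothetical additions, not the poster's code.

```python
import re

# ASCII-only portion of the pattern: digits plus ASCII punctuation and symbols.
ASCII_NOISE = re.compile(r"[0-9_!-/:-@\[-`{-~]")


def remove_noise(text):
    """Delete noise characters, then collapse the leftover whitespace."""
    return " ".join(ASCII_NOISE.sub("", text).split())
```

For example, `remove_noise("will have read (level 6)!")` keeps only the words `will have read level`.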
    PyPI: https://pypi.org/project/gcp-accessor/0.0.1/
