Hybrid Keyword Extraction Automated by Python With Cloud Speech To Text API and Video API

September 15, 2019

This is my poster at PyCon JP 2019 🇯🇵 (https://pycon.jp/2019/sessions?category=poster)



  1. Hybrid Keyword Extraction Automated by Python
    With Cloud Speech To Text API and Video API
    Tetsuya Hirata
    Search indexes included only the title of each video.
    Zero-hit keywords: 240 (29.2%)
    Hit keywords: 583 (70.8%)
    Keywords were extracted automatically from speech and text in the videos with the Cloud
    Speech To Text API and Video API.
    Reading comprehension
    (Classical Japanese/Chinese Classics/Contemporary Japanese)
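    Names of characters and works and grammar terms are searchable, e.g. 「枕草子」 (The Pillow Book), 「史記」 (Records of the Grand Historian), 「大納言」 (Dainagon), 「逆説」 (paradox)
    Mathematical symbols are searchable, e.g. 「アルファ」 (alpha), 「パイ」 (pi), 「サインコサイン」 (sine cosine), 「累乗」 (power)
    Grammar terms and English vocabulary are searchable, e.g. 「意志未来」 (volitional future), 「in」, 「also」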
    The code is automated with Python and Dockerized.
    Extra code was needed to access
    Google Cloud Storage (GCS) in order to load
    movie and audio files and to upload the results as
    a backup log on GCS.
    → Abstracted BigQuery and Google Cloud Storage
    client functionality for easy access
    Solution to improve search indexes
    Data Set:
    Movie Files(mp4) about mathematics, reading comprehension (Classical Japanese, Chinese Classics, and Contemporary Japanese), and English
    Targeted Keywords:
    Zero hit keywords searched by more than five unique users from 2018/04/01 to 2019/06/14.
    How the keywords were evaluated:
    Checked whether the number of hits improved after registering the new keywords extracted from speech and text in the movies, and whether the keywords are related to each
    subject.
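
The before/after comparison described above could be sketched as follows. This is only an illustration: the function name `compare_hits` and the sample hit counts are hypothetical and do not come from the poster.

```python
def compare_hits(before, after):
    """Return (improved, rescued) keyword lists.

    improved: keywords whose hit count rose after the new keywords were registered.
    rescued:  the subset of improved keywords that previously had zero hits.
    """
    improved = [k for k in before if after.get(k, 0) > before[k]]
    rescued = [k for k in improved if before[k] == 0]
    return improved, rescued


# Hypothetical hit counts per search keyword (not the poster's data).
before = {"will have read": 0, "alpha": 2, "also": 0}
after = {"will have read": 3, "alpha": 5, "also": 0}

improved, rescued = compare_hits(before, after)
print(improved, rescued)
```

Whether each improved keyword actually returns related content still needs a manual check per subject, as the poster describes.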
    Future Works
    Procedures for extracting keywords from movie contents
    tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect
    how important a word is to a document in a collection or corpus. (Rajaraman, A.; Ullman, J.D. (2011). "Data
    Mining". Mining of Massive Datasets. pp. 1–17.)
    tf(t, d) = (number of occurrences of term t in document d) / (total number of terms in document d)
    idf(t) = log((total number of documents) / (number of documents in which term t occurs))
    Term frequency measures how often a term occurs in a document. Inverse document frequency is an inverse function of the
    number of documents in which the term occurs.
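
A minimal sketch of the textbook TF-IDF calculation, assuming the plain form tf × log(N / df) with no smoothing. The poster does not state which variant or library was used, so the function name `tf_idf` and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents in which each term occurs.
    df = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical tokenized transcripts (not the poster's data).
docs = [
    ["will", "have", "read", "book"],
    ["am", "reading", "book"],
    ["verb", "tense", "book"],
]
scores = tf_idf(docs)
print(scores[0])
```

A term that occurs in every document, such as "book" here, scores zero, which is why TF-IDF favors words that are distinctive for a single video.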
    Architecture with Cloud Speech To Text API and Video API
    Number of keywords that hit movie contents: 247
    Number of keywords with improved hits: 162 / 247
    Of these, 130/162 (80%) returned related movie contents.
    Outliers: 32 keywords
    Number of zero-hit keywords with improved hits: 66 / 162
    Of these, 52/66 (79%) returned related movie contents.
    Outliers: 14 keywords
    Qualitative results
    Quantitative results
    Improve code related to input/output
    Improve stop-words and dictionary
    - Cut noise words and automate the cycle of adding stop-words
    - Explore search keywords not related to movie
    contents and add them to the dictionary
    - Improve the keyword list registered in the database by
    interviewing creators of the movies or experts in the
    pedagogical domain
    (t: term, f: frequency, d: document)
    [Architecture diagram: Video API, Cloud Speech To Text API, GCS, Excel, Elasticsearch]
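    “… 一人一人の弱点に合わせて動画と問題をお勧めするのがクラッシー学習動画生徒は自身の学習
    できます…” (roughly: "… Classi learning videos recommend videos and exercises matched to each person's weak points, students can … their own learning …") (https://www.youtube.com/watch?v=XjrBkhaV5Aw)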
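    1.6016 だことになるだろう」<経験>を表す未来 [“… will have (done)”, a future form expressing <experience>]
    1.6016 am reading 0.906650007
    1.6016 「もしもう一度読んだらこの本を3回読ん 1 [“If I read it once more, … read this book three times”, truncated]
    1.6016 完了で 「(その時までに)…したことにな [perfect: “(by that time) … will have done”, truncated]
    1.6016 動詞の時制<レベル6> 0.95717603 [verb tenses <Level 6>]
    1.7017 will have read 0.993844032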
    Steps to extract and register keywords
    1. Input audio files (.flac) and movie files into the Cloud Speech To Text API and Video API
    2. Pre-process the output data with MeCab and calculate TF-IDF
    3. Extract keywords and upload them to GCS
    4. Add the keyword list to the Excel file that holds master information about the movies
    5. Store the Excel file in S3 and register the keywords in Elasticsearch
    Tokenizers (MeCab)
    All subjects
    CONTENT_WORD_POS = ('名詞',)  # noun
    EXCLUDE_WORD_POS = ('非自立', '代名詞')  # non-independent, pronoun
    EXCLUDE_WORD = ('ぁ', 'ぃ', 'ぅ', 'ぇ', 'ぉ', 'ゃ', 'ゅ', 'ょ', 'ゎ')
    Contemporary Japanese
    INCLUDE_WORD_POS = ('固有名詞',)  # proper noun
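
How such settings might be applied is sketched below, assuming the part-of-speech labels are those of MeCab's IPA dictionary (名詞 noun, 非自立 non-independent, 代名詞 pronoun) and that each MeCab output line has the form `surface<TAB>POS,sub-POS,...`. The helper `extract_content_words` and the hand-written sample output are hypothetical. The sketch parses MeCab-style text rather than calling MeCab, so it runs without the tokenizer installed.

```python
CONTENT_WORD_POS = ("名詞",)             # noun
EXCLUDE_WORD_POS = ("非自立", "代名詞")    # non-independent, pronoun
EXCLUDE_WORD = ("ぁ", "ぃ", "ぅ", "ぇ", "ぉ", "ゃ", "ゅ", "ょ", "ゎ")


def extract_content_words(mecab_output):
    """Keep nouns, dropping excluded sub-categories and excluded surface forms."""
    words = []
    for line in mecab_output.splitlines():
        # Skip the end-of-sentence marker and anything without a feature column.
        if line == "EOS" or "\t" not in line:
            continue
        surface, feature = line.split("\t", 1)
        fields = feature.split(",")
        pos = fields[0]
        pos_detail = fields[1] if len(fields) > 1 else ""
        if pos not in CONTENT_WORD_POS:
            continue
        if pos_detail in EXCLUDE_WORD_POS or surface in EXCLUDE_WORD:
            continue
        words.append(surface)
    return words


# Hand-written sample in IPA-dictionary style (an assumption, not real tokenizer output).
sample = (
    "動詞\t名詞,一般,*,*,*,*,動詞,ドウシ,ドーシ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "時制\t名詞,一般,*,*,*,*,時制,ジセイ,ジセイ\n"
    "これ\t名詞,代名詞,一般,*,*,*,これ,コレ,コレ\n"
    "EOS\n"
)
print(extract_content_words(sample))
```

Writing the constants as tuples with a trailing comma keeps `pos in CONTENT_WORD_POS` an exact membership test instead of a substring match.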
    Remove noise with regular expression
    All subjects: ‘[Ớʀ”ʨʩʔʊˏɾʁʛʹʆˇʻʼʮʯɺ]'
    English: ‘[0-9̌-̕_!-/:-@¥[-`{-~]'
    Classical Japanese/Chinese Classics: '[Ν-ϰa-zA-Z0-9̌-̕Α-ωА-яᴷ-ᵹⅠ-Ⅹᶃ-ᶖ_!-/:-@¥[-`{-~]'

    Others : ’[a-zA-Z0-9̌-̕Α-ωА-яᴷ-ᵹⅠ-Ⅹᶃ-ᶖ_!-/:-@¥[-`{-~]'
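
Noise removal with such a character class could look like the sketch below. The pattern `NOISE_ASCII` is a simplified, ASCII-only stand-in (digits, underscore, and ASCII punctuation) and is not the poster's exact expression. The helper name `remove_noise` is also hypothetical.

```python
import re

# Simplified ASCII-only pattern for illustration (an assumption, not the poster's regex):
# digits, underscore, and the ASCII punctuation ranges !-/  :-@  [-`  {-~
NOISE_ASCII = re.compile(r"[0-9_!-/:-@\[-`{-~]")


def remove_noise(text, pattern=NOISE_ASCII):
    """Delete every character matched by the noise pattern."""
    return pattern.sub("", text)


print(remove_noise("will_have (read) 3 times!"))
```

Spaces are outside the character class, so word boundaries survive and the cleaned text can still be tokenized.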
    PyPI: https://pypi.org/project/gcp-accessor/0.0.1/
