Hybrid Keyword Extraction Automated by Python
With Cloud Speech To Text API and Video API
Tetsuya Hirata
Background
Search indexes included only title names of each video content
Zero hit keywords
are 240 (29.2%)
Hit keywords are
583 (70.8%)
Extracted automatically keywords from speech and texts in the videos with Cloud
Speech To Text API and Video API
Reading comprehension
(Classical Japanese/Chinese Classics/Contemporary Japanese)
Being able to search the name of character, works, grammar ex)ʮປࢠʯʮ࢙هʯʮେೲݴʯʮٯઆʯ
Mathematics
Being able to search mathematical symbols ex) ʮΞϧϑΝʯʮύΠʯʮαΠϯίαΠϯʯ ʮྦྷʯ
English
Being able to search the name of grammar and english vocabularies ex) ʮҙࢤະདྷʯʮinʯʮalsoʯ
The code is automized by python and dockerized
Needed to write the extra code to access to
Google Cloud Storage(GCS) in order to load
movie and audio files, and upload the results for
backup log on GCS.
→ Abstracts Bigquery and Google Cloud Storage
client functionality for easy access
Solution to improve search indexes
Evaluation
Data Set:
Movie Files(mp4) about mathematics, reading comprehension (Classical Japanese, Chinese Classics, and Contemporary Japanese), and English
Targeted Keywords:
Zero hit keywords searched by more than five unique users from 2018/04/01 to 2019/06/14.
How to evaluate the keywordsɿ
Looked at whether the number of hits improved before and after the new keywords extracted from speech and text in the movies, and checked if the keywords are related to each
subject or not.
Results
Future Works
Procedures for extracting keywords from movie contents
tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect
how important a word is to a document in a collection or corpus. (Rajaraman, A.; Ullman, J.D. (2011). "Data
Mining”. Mining of Massive Datasets. pp. 1–17.)
The number of a term / a document
A total sum of number of terms / a document
A total number of documents
The number of document frequencies which a term occurs
Term frequency is a term that occurs in a document. Inverse document frequency is an inverse function of the
number of documents in which it occurs.
Architecture with Cloud Speech Text Video API
The N of keys which hit movie contents: 247 keywords
The N of keywords which improved hits: 162 / 247 keywords
Of the keywords, 130/162 (80%) results in related movie contents .
Outlier value: 32 keywords
The N of keywords which improved zero hits: 66 / 162 keywords
Of the keywords, 52/66 (79%) results in related movie contents.
Outlier value: 14 keywords
Qualitative results Quantitive results
Improve code related to Input / Output Improve stop-words and dictionary
- Cut noise words and automize the cycles to add stop-
words
- Exploratory search keywords not related to movie
contents and add them to dictionary
- Improve keyword list registered in database by
hearing from creators of the movies or experts in
pedagogical domain
(t: term d: document f: frequency d: document)
MeCab
TF-IDF
Video API
Cloud Speech To Text API
GCS Excel
Elastic Search
.csv.gzip
“… ҰਓҰਓͷऑʹ߹ΘͤͯಈըͱΛ͓קΊ͢Δͷ͕Ϋϥογʔֶशಈըੜెࣗͷֶश
ঢ়گʹ߹Θͤͯͭ·͍͍ͣͯΔ·ͰḪͬͯͷֶशϨϕϧΞοϓΛ͢Δ͜ͱ͕Ͱ͖·͢ϕωο
ηͷςετͷडݧ݁Ռ͔Βࠃޠֶӳޠʹ͓͚Δੜె͕ͨͪࠓऔΓΉֶ͖शίϯςϯπΛ͓
͢͢Ί͠·͢·ֶͨशϚοϓͰੜెҰਓͻͱΓͷաڈʹड͚ͨςετ݁ՌΛ౿·֤͑ڭՊͷ୯
ݩ͝ͱʹֶश౸ୡκʔϯΛදࣔۤखॱʹฒΜͰ͍·͢ͷͰੜెͷۤखΛҰͰѲ͢Δ͜ͱ͕
Ͱ͖·͢…” (https://www.youtube.com/watch?v=XjrBkhaV5Aw)
1.6016 ͩ͜ͱʹͳΔͩΖ͏ʯ<ܦݧ>Λද͢ະདྷ
1.6016 am reading 0.906650007
1.6016 ʮ͠͏ҰಡΜͩΒ͜ͷຊΛ3ճಡΜ 1
1.6016 ྃͰ ʮ(ͦͷ࣌·Ͱʹ)…ͨ͜͠ͱʹͳ
1.6016 ಈࢺͷ੍࣌<Ϩϕϧ6> 0.95717603
1.7017 will have read 0.993844032
(https://www.youtube.com/watch?v=XjrBkhaV5Aw)
1. Input audio files (.flac) and movie files into Cloud Speech To Text API and Video API
2. Pre-process output data with MeCab and calculate TF-IDF
3. Extract keywords and upload them on GCS
4. Add keywords list to excel file which have master info about movies
5. The excel file is stored in S3 and the keywords are registered in elastic search.
Steps to extract and register keywords
Tokenizers (MeCab)
"""
All subjects
"""
CONTENT_WORD_POS = ('໊ࢺ')
EXCLUDE_WORD_POS = ('ඇཱࣗ', '໊ࢺ')
EXCLUDE_WORD = ('͊', '͌', '͎', '͐', '͒', 'Ό', 'Ύ', 'ΐ', ‘Η’)
"""
Contemporary Japanese
"""
INCLUDE_WORD_POS = ('ݻ༗໊ࢺ')
Remove noise with regular expression
All subjects: ‘[Ớʀ”ʨʩʔʊˏɾʁʛʹʆˇʻʼʮʯɺ]'
English: ‘[0-9̌-̕_!-/:-@¥[-`{-~]'
Classical Japanese/Chinese Classics: '[Ν-ϰa-zA-Z0-9̌-̕Α-ωА-яᴷ-ᵹⅠ-Ⅹᶃ-ᶖ_!-/:-@¥[-`{-~]'
Others : ’[a-zA-Z0-9̌-̕Α-ωА-яᴷ-ᵹⅠ-Ⅹᶃ-ᶖ_!-/:-@¥[-`{-~]'
PyPI: https://pypi.org/project/gcp-accessor/0.0.1/