IALP2023 Utilizing Word Embedding Representations in Word Sense Analysis Focusing on Character Types

Utilizing Word Embedding Representations in Word Sense Analysis Focusing on Character Types
Tomoki Okugawa, Takashi Inui
University of Tsukuba

Introduction (1/2)
- Spelling variants = words with the same meaning but different spellings
- They negatively impact NLP tasks [1]
- Examples:
  - Abbreviation (NLP, natural language processing)
  - Slang (bucks, dollar)
  - Okurigana (moving: 引っ越し, 引越し, 引越)
  - Foreign word (user: ユーザー, ユーザ)
  - Character types (apple: りんご, リンゴ, 林檎, ringo)

[1] K. Yamamoto, “Nihongo no hyokiyure mondai ni kansuru kosatsu to taisho” [On Orthographical Variants Problem and Our Solution], Japio Year Book, pp. 202–205, 2015 (in Japanese).

Introduction (2/2)
- CCT words (= spelling variants Caused by Character Types)
  - Occur frequently in Japanese
  - (1) Hiragana, (2) Katakana, (3) Kanji, and (4) the Roman alphabet are used together in the same context.
    - e.g. 「コーヒーを飲む。」 (I drink coffee.): コーヒー is (2) Katakana, を and む are (1) Hiragana, 飲 is (3) Kanji
  - Often represent different word meanings
- An in-depth semantic analysis has not been conducted yet
  - Some previous studies focus on Japanese character types, but they are based only on surface aspects of a text.

Purpose
- Investigate the differences in word meanings among CCT words
  - For convenience of investigation, we focused on two Japanese character types: Hiragana and Kanji, e.g. おいしい (Hiragana) and 美味しい (Kanji).
  - To enable semantic-level analysis, we used word embeddings.

Related Work (1/3)
- Motivation for changing character type
  - Standard:
    - Japanese-origin words: Hiragana
    - Chinese-origin words: Kanji
    - Loan words: Katakana and Roman alphabet
  - Cases where the notation changes:
    - Simple Japanese: character type conversion from Kanji to Hiragana is actively performed for vocabulary simplification [2]

[2] Agency for Cultural Affairs, “Zairyushien no tame no yasashii nihongo gaidorain” [Simple Japanese Guidelines for Residential Support], 2020 (in Japanese).

Related Work (2/3)
- Survey on surface aspects of character type usage
  - A character type frequency survey using the “Balanced Corpus of Contemporary Written Japanese” investigated the use of character types in each corpus genre [3]

[3] W. Kashino and M. Okumura, “Wago ya kango no katakana hyoki: ‘Gendai nihongo kakikotoba kinko kopasu’ no shoseki ni okeru shiyojittai” [Analysis of Katakana Representation for Japanese Native Words and the Words Imported from Classical Chinese: Using BCCWJ Japanese Corpus], Mathematical Linguistics, Vol. 28, No. 4, pp. 153–161, 2012.

Related Work (3/3)
- Word embeddings = techniques that express the meaning of a single word with a low-dimensional real-valued vector
  - Static word embeddings (e.g. word2vec [4])
  - Dynamic word embeddings (e.g. BERT [5])
- Word2vec
  - Is based on the assumption that words used in the same context have similar meanings
  - Supports vector addition and subtraction
  - Supports similarity calculation
  - Removes the effects of dynamic contexts

[4] T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” in Proceedings of Workshop at ICLR, 2013.
[5] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, pp. 4171–4186, 2019.

Settings
- Focus on two character types: Hiragana and Kanji
- Examine how the differences in character types appear in word embedding representations
- Types of vectors used:
  - CCT embedding pair (CCT = spelling variants Caused by Character Types)
  - Subtracted vector (the difference of a CCT embedding pair)

[Figure: e.g. “美味しい”: vectors 美味しいH, 美味しいK, and 美味しいK−H, with origin O]

Word Embeddings Used
- Wikipedia Entity Vectors [6]
  - Trained on Japanese Wikipedia articles with the word2vec algorithm
  - Contains 751,361 word vectors in a 200-dimensional space

[Figure: e.g. “美味しい”: vectors 美味しいH and 美味しいK, with origin O]

[6] M. Suzuki, K. Matsuda, S. Sekine, N. Okazaki, and K. Inui, “A Joint Neural Model for Fine-Grained Named Entity Classification of Wikipedia Articles,” IEICE Transactions on Information and Systems, Special Section on Semantic Web and Linked Data, Vol. E101-D, No. 1, pp. 73–81, 2018.

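Subtracted Vector
- Given a word $w$:
  - Kanji variant: $w_K$, Hiragana variant: $w_H$
  - CCT embedding pair: $(\mathbf{w}_K, \mathbf{w}_H)$
  - Subtracted vector: $\mathbf{w}_{K-H} = \mathbf{w}_K - \mathbf{w}_H$

[Figure: e.g. “美味しい”: vectors 美味しいH, 美味しいK, and 美味しいK−H, with origin O]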
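As a minimal sketch of the definition above, the subtraction is element-wise over the two embeddings of one word. The 4-dimensional vectors are made-up toy values, not entries from the Wikipedia Entity Vectors, and `subtract` is a hypothetical helper name.

```python
from typing import Sequence


def subtract(w_k: Sequence[float], w_h: Sequence[float]) -> list[float]:
    """Return the subtracted vector w_{K-H} = w_K - w_H, element by element."""
    if len(w_k) != len(w_h):
        raise ValueError("both variants must live in the same embedding space")
    return [k - h for k, h in zip(w_k, w_h)]


# Toy stand-ins for the Kanji and Hiragana embeddings of one word (assumed values).
w_k = [1.0, 0.5, -2.0, 3.0]
w_h = [0.5, 0.5, 1.0, 1.0]
w_kh = subtract(w_k, w_h)
print(w_kh)  # [0.5, 0.0, -3.0, 2.0]
```

With real embeddings the same operation would run over 200 dimensions per pair.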

Target CCT Embedding Pairs
- Select the target CCT embedding pairs in 3 steps
  - Step 1: Collect pairs from Japanese textbooks
    - Basic words in both Kanji and Hiragana notations
    - Transform verbs and adjectives into their standard form
  - Step 2: Exclude pairs not included in Wikipedia Entity Vectors
  - Step 3: Exclude pairs with ambiguity caused primarily by the character types
    - Occasionally, two words with different meanings share the same Hiragana notation in Japanese
- Result: 293 CCT embedding pairs
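Steps 2 and 3 can be sketched as a simple filter over the pairs collected in Step 1. This is only an illustration under assumptions: `select_pairs`, the toy vocabulary, and the hand-made set of ambiguous Hiragana forms are hypothetical, and the slides do not say how ambiguity was actually detected.

```python
def select_pairs(candidates, vocabulary, ambiguous_hiragana):
    """Filter (kanji, hiragana) candidates following Steps 2 and 3.

    candidates: pairs already collected and normalized in Step 1.
    vocabulary: set of surface forms that have an embedding.
    ambiguous_hiragana: Hiragana forms shared by words with different meanings.
    """
    selected = []
    for kanji, hiragana in candidates:
        # Step 2: both notations need a vector in the embedding vocabulary.
        if kanji not in vocabulary or hiragana not in vocabulary:
            continue
        # Step 3: drop pairs whose Hiragana form is ambiguous.
        if hiragana in ambiguous_hiragana:
            continue
        selected.append((kanji, hiragana))
    return selected


# Toy inputs (assumed, not the data used in the study).
candidates = [("美味しい", "おいしい"), ("橋", "はし"), ("林檎", "りんご")]
vocabulary = {"美味しい", "おいしい", "橋", "はし", "りんご"}
ambiguous_hiragana = {"はし"}  # 橋 (bridge) and 箸 (chopsticks) are both はし
pairs = select_pairs(candidates, vocabulary, ambiguous_hiragana)
print(pairs)  # [('美味しい', 'おいしい')]
```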

Three Analyses
- Analysis 1: Similarity of CCT embedding pairs
- Analysis 2: Correlations with Character-based Indexes
- Analysis 3: Clustering of the Subtracted Vectors

Analysis 1: Similarity of CCT embedding pairs (1/2)
- Cosine similarity varies depending on the CCT embedding pair

Analysis 1: Similarity of CCT embedding pairs (2/2)
- Similarity
  - TOP 10: mainly verbs, adjectives, adverbs
  - BOTTOM 10: mainly nouns
- People might intentionally select a spelling to represent a specific meaning.
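For reference, the cosine similarity of a CCT embedding pair can be computed as below. This is a minimal standard-library sketch; the 2-dimensional vectors are toy values chosen so the result is easy to check by hand, not real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v: (u . v) / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy Kanji / Hiragana vectors (assumed values): dot = 24, both norms = 5.
w_k = [3.0, 4.0]
w_h = [4.0, 3.0]
sim = cosine_similarity(w_k, w_h)
print(round(sim, 2))  # 0.96
```

Sorting all pairs by this value would give the TOP 10 and BOTTOM 10 lists.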

Analysis 2: Correlations with Character-based Indexes
- Not much correlation
  - len_H: the number of characters of $w_H$
  - len_K: the number of characters of $w_K$
  - level_K: value defined based on the standard learning year of $w_K$
    - 1-6: taught in the $n$th grade of elementary school
    - 7: taught after elementary school
    - 8: non-common-use Kanji characters
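A correlation check of this kind could look like the sketch below. It rests on two assumptions the slides do not confirm: that each index is correlated against the pair's cosine similarity, and that Pearson's r is the coefficient used. All numbers are invented toy values.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's r between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Toy values (assumptions): one cosine similarity and one index value per CCT pair.
similarity = [0.82, 0.35, 0.67, 0.51, 0.74]
indexes = {
    "len_H": [4, 3, 2, 3, 4],
    "len_K": [4, 2, 1, 2, 3],
    "level_K": [3, 8, 1, 7, 2],
}
results = {name: pearson(similarity, values) for name, values in indexes.items()}
for name, r in results.items():
    print(f"{name}: r = {r:+.3f}")
```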

Analysis 3: Clustering of the Subtracted Vectors (1/6)
- We performed hierarchical clustering
  - To explore words with comparable differences in word meaning between the CCT variants
  - Words with similar subtracted vectors can be assumed to be used similarly across character types, so clustering makes it possible to classify the influence of character types.
  - Ward algorithm
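Hierarchical clustering with Ward's method can be sketched with SciPy as follows. The choice of SciPy is an assumption (the slides do not name a library), and the six 2-dimensional "subtracted vectors" are toy points arranged in three obvious groups.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy subtracted vectors (assumed values), one row per CCT pair.
subtracted = np.array([
    [0.0, 0.1],
    [0.1, 0.0],
    [5.0, 5.1],
    [5.1, 5.0],
    [10.0, 0.0],
    [10.1, 0.1],
])

# Ward linkage merges, at each step, the two clusters whose union
# gives the smallest increase in within-cluster variance.
Z = linkage(subtracted, method="ward")

# Cut the tree so that at most three flat clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```

Each row of `Z` records one merge, so the first rows show which subtracted vectors are joined earliest.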

Analysis 3: Clustering of the Subtracted Vectors (2/6)
- [n=2 cluster] Two subtracted vectors with some semantic relation tend to be merged first.
  - This suggests that words with some semantic relation have a similar usage of CCT variants in a text.
  - Synonyms
    - 速い (fast) and 早い (quick)
    - 曲がる (bend) and 折れる (break)
  - Words with causal or order relationships
    - 合格 (pass) and 卒業 (graduation)
    - 危ない (dangerous) and 壊れる (broken)
  - Antonyms
    - 大きな (large) and 小さな (small)
    - 登る (climb) and 降りる (climb down)

Analysis 3: Clustering of the Subtracted Vectors (3/6)
- Pick out three clusters: Cⅰ, Cⅱ, Cⅲ

[Figure: clusters Cⅰ, Cⅱ, Cⅲ; legend from small cosine to large cosine; annotations: “the choice of character type can be freely left to the writer” and “unfamiliar with Hiragana notation in a text”]

Analysis 3: Clustering of the Subtracted Vectors (4/6)
- Cⅰ, Cⅱ: the choice of character type can be freely left to the writer

Analysis 3: Clustering of the Subtracted Vectors (5/6)
- Cⅲ: unfamiliar with Hiragana notation in a text

Analysis 3: Clustering of the Subtracted Vectors (6/6)
- Vector visualization in 3D space by PCA
  - Cⅱ is near Cⅰ; Cⅲ is far from Cⅰ
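A PCA projection to three dimensions can be sketched with NumPy's SVD, as below. The input here is random data standing in for 200-dimensional vectors, and `pca_project` is a hypothetical helper; the slides do not state which implementation or which vectors were plotted.

```python
import numpy as np


def pca_project(X, n_components=3):
    """Project the rows of X onto their top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random stand-ins for 20 vectors in a 200-dimensional space (assumed data).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 200))
points = pca_project(vectors)
print(points.shape)  # (20, 3)
```

The three columns of `points` can then be passed to any 3D scatter plot, colored by cluster label.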

Conclusion
- We focused on spelling variants caused by character types and used word embeddings to analyze the differences in word meanings.
- We analyzed 293 words and found that CCT words can have different meanings. CCT nouns in particular tend to have low cosine similarity, which suggests that the variants hold different meanings.
- In the clustering of subtracted vectors, CCT word pairs of two semantically related words tend to form clusters.
- Further analysis is in progress.

Future Work
- Similar investigations under different conditions
  - other character types (e.g. Katakana)
  - other target words (e.g. onomatopoeia)
  - other vocabulary-based indices (e.g. word familiarity)
  - dynamic word embeddings (e.g. BERT)
