A Big Data Analysis of Yumentingzheng

https://aaron.kr/content/portfolio/word-cloud-big-data-analysis-of-manchu-script-in-weiwenqiju/

Poster presentation at The 50th Fall Comprehensive Conference of the Korea Information and Communication Society (KIICE) (제50회 한국정보통신학회 추계종합학술대회).

The Qing dynasty's Yumentingzheng (御門聽政) is an important document comparable to the Annals of Joseon in Korea. This is a record of conversations between the Qing dynasty emperor and his ministers.

Manchu is one of the few scripts left in the world with no easy way to perform OCR (optical character recognition), so documents cannot simply be scanned and read into a digital format. There is also no complete, readily available Manchu dictionary for word lookup. Therefore, pages containing Manchu script must first be transliterated into a form that computers can analyze.

In this study, text that had already been transcribed using the Möllendorff method was converted to the simpler Abkai method (both are transliterations of Manchu) in order to perform a big data analysis and extract the most important keywords from one portion of the text, the Weiwenqiju (慰問起居).
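The conversion step can be illustrated with a minimal sketch. The study itself used R; the Python below is not the authors' code. It assumes only the three letter substitutions listed in the deck's conversion table (š to x, c to q, ū to v), so a complete Möllendorff-to-Abkai conversion may need further rules. The function name `to_abkai` and the sample words are hypothetical.

```python
import unicodedata

# Assumption: only the three substitutions shown in the deck's table.
MOLLENDORFF_TO_ABKAI = str.maketrans({"š": "x", "c": "q", "ū": "v"})

def to_abkai(text: str) -> str:
    """Replace Möllendorff letters with their Abkai equivalents."""
    # Normalize first so that "s" + combining caron is treated as "š".
    normalized = unicodedata.normalize("NFC", text)
    return normalized.translate(MOLLENDORFF_TO_ABKAI)

# Illustrative words only, not taken from the Weiwenqiju text.
sample = "šun cooha ūlen"
print(to_abkai(sample))
```

Because the Abkai output uses only plain ASCII letters, later steps such as splitting on whitespace and counting words need no special handling for diacritics.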

Aaron Snowberger

October 28, 2021

Transcript

  1. A Big Data Analysis of Yumentingzheng (御門聽政): Weiwenqiju (慰問起居) as an Example. Aaron Daniel Snowberger, Choong Ho Lee
  2. Introduction. Yumentingzheng (御門聽政), which records the Qing dynasty emperor's discussions with his subjects, is an important document comparable to the Annals of Joseon in Korea. This paper describes the methods and steps for big data analysis of Yumentingzheng, which is written in the Manchu alphabet. Big data analysis of documents written in Manchu characters raises many problems that must be solved in advance, and research on these should come first. This paper proposes a method of big data analysis using the R language for the stage after the Manchu text has been transliterated into Latin characters; the preliminary work needed to reach that stage is left to future study. In the proposed method, the Abkai method was adopted for the transliteration of Yumentingzheng, and the results of big data analysis are presented using the text of Weiwenqiju (慰問起居).
  3. Möllendorff to Abkai. Text [1] that was already transcribed by the Möllendorff method was converted to the Abkai method, and then big data analysis was performed with R on the frequency of words appearing in the text.

    No.  Möllendorff  Abkai  Manchu
    01   š            x      ᡧ
    02   c            q      ᠴ
    03   ū            v      ᡡ
  4. Manchu Text Latin Transliteration. The original Manchu text was transcribed using the Möllendorff method. The transcription was then edited according to the Abkai method in the previous table, and the resulting transcription was used in the big data analysis. [Figure panels: Manchu, Möllendorff, Abkai]
  5. 1st Wordcloud Analysis. Meaningless words among those displayed in large letters must be found and removed so that the word cloud shows only meaningful and important words. The words ‘be’ and ‘de’ appear most often because nouns cannot be extracted individually.

    No.  Word    Meaning
    01   be      particle (object marker; "by means of"; "causing to")
    02   de      at, in, to; regarding
    03   amban   Minister (大臣)
    04   kemuni  always
    05   aniya   year
    06   yasa    eye (目)
    07   udu     even though
  6. 2nd Wordcloud Analysis. With the particles ‘be’ and ‘de’ removed, along with pronouns and other function words such as ‘bi’, ‘si’, ‘udu’, and ‘mini’, the most important words stand out.

    No.  Word    Meaning
    01   yasa    eye (目)
    02   amban   Minister (大臣)
    03   kemuni  always
    04   aniya   year
    05   udu     even though
  7. Wordcloud2 Analysis. The R package wordcloud2 produces a word cloud in which hovering the mouse over any word shows its frequency. (This requires sorting the table in decreasing order first.)

    Word    Count
    yasa    10
    qi      9
    ye      9
    se      9
    ere     8
    ba      7
    amban   7
    aniya   7
    the     7
    kemuni  6
  8. Conclusion. This paper presented a method for big data analysis of literature written in Manchu characters. Because no Manchu dictionary package is yet available for the R language, the word cloud was created from the raw frequency of words in the text as it stands. The romanization of the Manchu text was converted to the Abkai method, which uses no special symbols. The effectiveness of this method was demonstrated by an experiment on the Weiwenqiju portion of the Yumentingzheng.

    CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, and infographics & images by Freepik.
  9. References

    [1] Zhuang Jifa, Yumentingzheng, Wenshizhe Press, 2000.
    [2] Manchu alphabet. [Internet] Available: https://en.wikipedia.org/wiki/Manchu_alphabet
    [3] Diandian Zhang, Yan Liu, Zhuowei Wang, and Depei Wang, "OCR with the Deep CNN Model for Ligature Script-Based Languages like Manchu," Hindawi Scientific Programming, vol. 2021, Article ID 5520338, https://doi.org/10.1155/2021/5520338
    [4] Jang Yongsik, Kang Higu, Learning to code in R language, Saengneung Press, 2018.
    [5] Manchu Language. [Internet] Available: https://namu.wiki/w/%EB%A7%8C%EC%A3%BC%EC%96%B4
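The word cloud steps described in slides 5 to 7 (count words, drop particles and pronouns, sort in decreasing order) can be sketched as follows. The study used R and the wordcloud2 package; this Python version is an illustration, not the authors' code. It assumes the transliterated text is whitespace-separated, and the function name `word_frequencies` and the sample string are hypothetical. The stop-word list contains only the words the deck names as removed.

```python
from collections import Counter

# Words the deck reports removing before the second word cloud.
STOP_WORDS = {"be", "de", "bi", "si", "udu", "mini"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count words, skip stop words, and sort by decreasing frequency."""
    # Assumption: tokens are separated by whitespace; trim common punctuation.
    words = (w.strip(".,;:!?") for w in text.split())
    counts = Counter(w for w in words if w and w not in stop_words)
    # Decreasing count first, then alphabetical order to break ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Made-up token sequence for illustration, not the Weiwenqiju text.
sample = "amban be yasa de amban kemuni yasa bi yasa"
for word, count in word_frequencies(sample):
    print(word, count)
```

Without a Manchu dictionary there is no part-of-speech information, so a hand-maintained stop-word set like this one is the simplest way to keep particles from dominating the cloud.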