Task-Oriented Word Segmentation (Presentation for Doctoral Dissertation)

Slide 1

Slide 1 text

Task-Oriented Word Segmentation
Tatsuya Hiraoka, Okazaki-lab
Doctoral Dissertation
2022/1/5 Doctoral Dissertation Presentation (Tatsuya Hiraoka)

Slide 106

Slide 106 text

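Publications
• Papers submitted for the dissertation
  • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki. Joint Optimization of Tokenization and Downstream Model. Findings of ACL-IJCNLP 2021, pages 244-255 (double-column), August 2021.
  • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki. Optimizing Word Segmentation for a Task Using Text Vector Weighting (in Japanese). Journal of Natural Language Processing, Vol. 28, No. 2, pages 479-507 (single-column), June 2021.
• Other first-author papers
  • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki. Joint Optimization of Word Segmentation and the Downstream Model Using Loss Values (in Japanese). Journal of Natural Language Processing, 29(1): to appear, 33 pages (single-column), March 2022.
  • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki. Recurrent Neural Hidden Markov Model for High-Order Transition. ACM TALLIP, 21(2): pages 1-15 (double-column), March 2022.
  • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki. Optimizing Word Segmentation for Downstream Task. Findings of EMNLP, pages 1341-1351 (double-column), Association for Computational Linguistics, November 2020.
  • Tatsuya Hiraoka, Hiroyuki Shindo, Yuji Matsumoto. Stochastic Tokenization with a Language Model for Neural Text Classification. ACL, pages 1620-1629 (double-column), July 2019.
  • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki. Optimizing Word Segmentation for a Task Using the Loss of the Downstream Model (in Japanese). 27th Annual Meeting of the Association for Natural Language Processing (NLP2021), pages 486-491 (double-column), March 2021. (Young Researcher Encouragement Award)
  • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki. A Neural Hidden Markov Model That Captures High-Order Dependencies with an RNN (in Japanese). 26th Annual Meeting of the Association for Natural Language Processing (NLP2020), pp. A4-2 (4 pages, double-column), Ibaraki University (Ibaraki Prefecture), March 2020.
  • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki. A Hidden Markov Model with RNN-Computed Transition Probabilities (in Japanese). 242nd Meeting of the Special Interest Group on Natural Language Processing, 2019-NL-242(2), pp. 1-6 (double-column), Nara Institute of Science and Technology (Nara Prefecture), October 2019. (Young Researcher Encouragement Award)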

Slide 107

Slide 107 text

References 1
• Xu, Jia, et al. "Bayesian semi-supervised Chinese word segmentation for statistical machine translation." Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). 2008.
• Chang, Pi-Chuan, Michel Galley, and Christopher D. Manning. "Optimizing Chinese word segmentation for machine translation performance." Proceedings of the Third Workshop on Statistical Machine Translation. 2008.
• Nguyen, ThuyLinh, Stephan Vogel, and Noah A. Smith. "Nonparametric word segmentation for machine translation." Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). 2010.
• Domingo, Miguel, et al. "How Much Does Tokenization Affect Neural Machine Translation?" arXiv preprint arXiv:1812.08621 (2018).
• Thamme Gowda and Jonathan May. 2020. Finding the optimal vocabulary size for neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3955-3964, Online. Association for Computational Linguistics.
• Taku Kudo. 2006. MeCab: Yet another part-of-speech and morphological analyzer. http://taku910.github.io/mecab/.
• Morita, Hajime, Daisuke Kawahara, and Sadao Kurohashi. "Morphological analysis for unsegmented languages using recurrent neural network language model." Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015.
• Kazuma Takaoka, Sorami Hisamoto, Noriko Kawahara, Miho Sakamoto, Yoshitaka Uchida, and Yuji Matsumoto. 2018. Sudachi: a Japanese tokenizer for business. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).
• Yang, Jie, Yue Zhang, and Fei Dong. "Neural Word Segmentation with Rich Pretraining." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017.

Slide 108

Slide 108 text

References 2
• Deng Cai, Hai Zhao, Zhisong Zhang, Yuan Xin, Yongjian Wu, and Feiyue Huang. 2017. Fast and accurate neural word segmentation for Chinese. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 608-615.
• Yang, Jie, Yue Zhang, and Shuailong Liang. "Subword Encoding in Lattice LSTM for Chinese Word Segmentation." Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019.
• Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1715-1725.
• Kudo, Taku, and John Richardson. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing." arXiv preprint arXiv:1808.06226 (2018).
• Xinchi Chen, Zhan Shi, Xipeng Qiu, and Xuanjing Huang. 2017. DAG-based long short-term memory for neural word segmentation. arXiv preprint arXiv:1707.00248.
• Yue Zhang and Jie Yang. 2018. Chinese NER using lattice LSTM. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1554-1564.
• Jie Yang, Yue Zhang, and Shuailong Liang. 2018. Subword encoding in lattice LSTM for Chinese word segmentation. arXiv preprint arXiv:1810.12594.
• Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66-75.
• Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2019. BPE-dropout: Simple and effective subword regularization. arXiv preprint arXiv:1910.13267.
