• Named Entity Recognition (NER): the task of extracting entities (such as names, locations, and times) from text.
• Machine Learning methods are commonly used, and the task is often formalized as a Sequence Labeling problem.
• [Sang 2002], [Sang & De Meulder 2003] “Introduction to the CoNLL-{2002,2003} Shared Task: Language-Independent Named Entity Recognition”
• [Nadeau & Sekine 2007] “A survey of named entity recognition and classification”
• Sequence Labeling: assigning a label to each token.
• One approach is to construct a graph representing the possible label sequences and their scores; the best path can then be found efficiently with the Viterbi algorithm.
• A common method for building such a graph: Conditional Random Fields (CRF).
• BIO/IOBES tagging schemes, etc. let us formalize NER as a Sequence Labeling problem (a small decoding sketch follows below).
• [McCallum & Li 2003] “Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons”
• [Sutton & McCallum 2010] “An Introduction to Conditional Random Fields”
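As an illustration of the ideas above, here is a minimal, self-contained sketch (my own example in plain NumPy, not from the slides) of BIO tags and Viterbi decoding over per-token label scores; the emission and transition scores are made up for the example.

```python
import numpy as np

# BIO tags: B-PER/I-PER mark a person span, O marks "outside any entity".
tags = ["O", "B-PER", "I-PER"]
tokens = ["John", "Smith", "lives", "here"]

# Toy scores (rows = tokens, cols = tags); in a CRF these would come from features.
emissions = np.array([
    [0.1, 2.0, 0.2],   # John
    [0.1, 0.5, 2.0],   # Smith
    [2.0, 0.1, 0.1],   # lives
    [2.0, 0.1, 0.1],   # here
])
# Transition scores between consecutive tags, e.g. O -> I-PER is discouraged.
transitions = np.array([
    [0.5, 0.5, -2.0],  # from O
    [0.2, -1.0, 1.0],  # from B-PER
    [0.2, -1.0, 0.5],  # from I-PER
])

def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence (as indices) via dynamic programming."""
    n, k = emissions.shape
    score = emissions[0].copy()            # best score ending in each tag at step 0
    backptr = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # candidate[i, j] = score of being in tag i at t-1 and moving to tag j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow back-pointers from the best final tag.
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

path = viterbi(emissions, transitions)
print([(tok, tags[i]) for tok, i in zip(tokens, path)])
# -> [('John', 'B-PER'), ('Smith', 'I-PER'), ('lives', 'O'), ('here', 'O')]
```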
• Hyperplane
  • Perceptron, SVM, and PA // Speaker Deck
• Separating the “Linearly Non-separable”
  • Kernel Methods (a small sketch follows below)
  • SVM with polynomial kernel visualization - YouTube
• Deep Learning (Neural Network)
  • a.k.a. Representation Learning
  • can be seen as “learning the kernel function”
  • Deep: more expressive
• Why Now?
  • More Data
  • More Computational Power
  • Effective Learning Methods for Deep Models
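To make the “linearly non-separable” point concrete, here is a small scikit-learn sketch (my illustration, not from the slides): concentric circles cannot be separated by a hyperplane in the original space, but a polynomial-kernel SVM handles them.

```python
# A linear SVM fails on concentric circles; a polynomial-kernel SVM separates them.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0).fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))  # near chance level
print("poly kernel accuracy:  ", poly_svm.score(X_test, y_test))    # close to 1.0
```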
• CNN: Convolutional Neural Network
  • Aggregates local features
  • How do Convolutional Neural Networks work? (Japanese translation)
• RNN: Recurrent Neural Network
  • Feeds its output back into itself
  • LSTM: Long Short-Term Memory
    • Understanding LSTM Networks – colah’s blog (Japanese translation)
  • Attention and Augmented Recurrent Neural Networks
• And more recently ...
  • [Bradbury et al. 2016] “Quasi-Recurrent Neural Networks”
  • [Vaswani et al. 2017] “Attention is All You Need”
  • [Sabour et al. 2017] “Dynamic Routing Between Capsules”
  • ...
• How to treat Natural Language Data? −→ Embedding (see the BiLSTM sketch below)
• [Goldberg 2017] “Neural Network Methods for Natural Language Processing” (textbook)
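As a rough sketch of how these pieces fit together for text (my illustration in PyTorch, with made-up sizes), token ids are mapped to embeddings and fed through a bidirectional LSTM to get one contextual vector per token:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10000, 100, 128   # made-up sizes

embedding = nn.Embedding(vocab_size, embed_dim)
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 6))      # one sentence of 6 token ids
vectors = embedding(token_ids)                         # (1, 6, 100)
outputs, _ = bilstm(vectors)                           # (1, 6, 256): forward + backward states
print(outputs.shape)
```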
• We want real-valued, dense vectors to represent words (a small training sketch follows below).
• [Mikolov et al. 2013] word2vec: “Efficient Estimation of Word Representations in Vector Space”
• [Pennington et al. 2014] GloVe: “GloVe: Global Vectors for Word Representation”
• [Bojanowski & Grave et al. 2016] fastText: “Enriching Word Vectors with Subword Information”
• Embeddings are often learned from a large-scale dataset, e.g. all of Wikipedia −→ semi-supervised learning
• Word embeddings in 2017: Trends and future directions
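A minimal sketch of training skip-gram word vectors with gensim (my example, not from the slides; parameter names assume gensim 4, where older versions use `size` instead of `vector_size`):

```python
from gensim.models import Word2Vec

# Toy corpus; a real setup would use something large, e.g. a Wikipedia dump.
sentences = [
    ["tokyo", "is", "the", "capital", "of", "japan"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["named", "entity", "recognition", "extracts", "entities", "from", "text"],
]

# sg=1 selects the skip-gram objective; vector_size is the embedding dimension.
model = Word2Vec(sentences=sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

vector = model.wv["tokyo"]                    # a dense 50-dimensional vector
print(model.wv.most_similar("tokyo", topn=3)) # neighbors will be noisy on a toy corpus
```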
• The models here are not yet Neural Networks; embeddings are used as features (see the feature sketch below).
1 [Turian et al. 2010]
  • “Word representations: A simple and general method for semi-supervised learning”
  • Embeddings and Brown Clusters as features
2 [Passos et al. 2014]
  • “Lexicon Infused Phrase Embeddings for Named Entity Resolution”
  • Phrase Embeddings for NER
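An illustrative sketch of how signals learned from unlabeled text (Brown clusters or discretized embeddings) can enter a classic feature-based CRF; this is my example using sklearn-crfsuite, and the cluster lookup table is made up:

```python
import sklearn_crfsuite

# Hypothetical lookup table; in practice this would come from Brown clustering
# or from pre-trained word embeddings learned on unlabeled text.
brown_cluster = {"john": "0110", "smith": "0111", "lives": "1010", "in": "1100", "tokyo": "0010"}

def token_features(sent, i):
    word = sent[i].lower()
    feats = {
        "word.lower": word,
        "word.istitle": sent[i].istitle(),
        "cluster": brown_cluster.get(word, "UNK"),         # semi-supervised signal
        "cluster.prefix2": brown_cluster.get(word, "UNK")[:2],
    }
    if i > 0:
        feats["-1:word.lower"] = sent[i - 1].lower()
    return feats

sentences = [["John", "Smith", "lives", "in", "Tokyo"]]
X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
y = [["B-PER", "I-PER", "O", "O", "B-LOC"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```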
• [Collobert et al. 2011] Embedding + Feed-Forward Network
  • “Natural Language Processing (almost) from Scratch”
  • (almost) from Scratch: minimal hand-crafted features
  • Trains a language model to get word vectors from raw data
  • (A window-based tagger sketch follows below.)
∗ Figures from [Collobert et al. 2011]
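A rough sketch of the window-based idea (my illustration in PyTorch with made-up sizes, not the authors' code): the embeddings of a token and its neighbors are concatenated and passed through a small feed-forward network that scores each tag.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, window, hidden_dim, num_tags = 10000, 50, 2, 300, 9  # made-up sizes

class WindowTagger(nn.Module):
    """Score tags for the center token of a (2*window + 1)-token window."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.ff = nn.Sequential(
            nn.Linear((2 * window + 1) * embed_dim, hidden_dim),
            nn.Tanh(),                        # the paper uses hard-tanh; plain tanh here for brevity
            nn.Linear(hidden_dim, num_tags),
        )

    def forward(self, window_ids):            # window_ids: (batch, 2*window + 1)
        vectors = self.embed(window_ids)       # (batch, 2*window + 1, embed_dim)
        flat = vectors.flatten(start_dim=1)    # concatenate the window embeddings
        return self.ff(flat)                   # (batch, num_tags) unnormalized tag scores

tagger = WindowTagger()
scores = tagger(torch.randint(0, vocab_size, (4, 2 * window + 1)))
print(scores.shape)  # torch.Size([4, 9])
```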
• [Chiu & Nichols 2015] “Named Entity Recognition with Bidirectional LSTM-CNNs”
  • Sequence Labeling with BiLSTM [Graves et al. 2013]
  • CNN for character-level features [Santos et al. 2016], [Labeau et al. 2015] (see the char-CNN sketch below)
  • Tag probabilities are computed at the output layer
∗ Figures from [Chiu & Nichols 2015]
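A minimal sketch of CNN-based character features (my illustration, made-up sizes): each word's characters are embedded, convolved, and max-pooled into a fixed-size vector that can be concatenated with the word embedding before the BiLSTM.

```python
import torch
import torch.nn as nn

num_chars, char_dim, num_filters, kernel_size = 128, 25, 30, 3   # made-up sizes

class CharCNN(nn.Module):
    """Turn a word's character ids into one fixed-size feature vector."""
    def __init__(self):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size, padding=1)

    def forward(self, char_ids):               # char_ids: (num_words, max_word_len)
        x = self.char_embed(char_ids)            # (num_words, max_word_len, char_dim)
        x = self.conv(x.transpose(1, 2))         # (num_words, num_filters, max_word_len)
        return x.max(dim=2).values               # max-pool over character positions

char_cnn = CharCNN()
char_ids = torch.randint(1, num_chars, (6, 12))  # 6 words, padded to 12 characters each
char_features = char_cnn(char_ids)
print(char_features.shape)                       # torch.Size([6, 30]); concatenate with word embeddings
```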
• [Reimers & Gurevych 2017] “Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks”
  • “Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging” (summary of the above paper)
• Evaluation with over 50,000 (!) setups
  • Covers [Huang et al. 2015], [Lample et al. 2016], and [Ma & Hovy 2016]
• NNs are non-deterministic → results depend on initialization
  • The difference between “state-of-the-art” and “mediocre” may be insignificant
  • Reporting the score distribution is more trustworthy (see the sketch below)
• Hyperparameters studied:
  • Word Embeddings, Character Representation, Optimizer, Gradient Clipping & Normalization, Tagging Schemes, Classifier, Dropout, Number of LSTM Layers / Recurrent Units, Mini-batch Size, Backend
• Lots of empirical knowledge and best-practice advice
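A minimal sketch (my illustration, with a hypothetical `train_and_evaluate` stand-in) of the practice the paper argues for: run the same configuration with several random seeds and report the score distribution rather than a single number.

```python
import random
import statistics

def train_and_evaluate(seed):
    """Hypothetical stand-in for training a tagger with a given seed and returning its test F1."""
    random.seed(seed)
    return 90.0 + random.gauss(0, 0.4)   # simulated run-to-run variance

scores = [train_and_evaluate(seed) for seed in range(10)]

# Report the distribution, not just the single best run.
print(f"F1 over {len(scores)} seeds: "
      f"mean={statistics.mean(scores):.2f}, "
      f"std={statistics.stdev(scores):.2f}, "
      f"min={min(scores):.2f}, max={max(scores):.2f}")
```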
• [Rei et al. 2016]
  • “Attending to Characters in Neural Sequence Labeling Models”
  • Attention dynamically decides how much word-level vs. character-level information to use
• [Kuru et al. 2016]
  • “CharNER: Character-Level Named Entity Recognition”
  • Treats the sentence as a sequence of characters; stacked BiLSTM
  • Predicts a tag for each character and forces the tags within a word to be the same (see the sketch below)
• [Misawa et al. 2017]
  • “Character-based Bidirectional LSTM-CRF with words and characters for Japanese Named Entity Recognition”
  • CNN is not well suited to Japanese subwords: words are shorter and there is no capitalization
  • Average word length: CoNLL 2003 6.43 chars, Mainichi corpus 1.95 chars
  • “entity/word boundary conflict” in Japanese
  • Char-BiLSTM-CRF: predicts a tag for every character independently
• [Sato et al. 2017]
  • “Segment-Level Neural Conditional Random Fields for Named Entity Recognition”
  • Builds a segment lattice from a word-level tagging model
  • Easier to incorporate dictionary features, e.g. binary features, Wikipedia Embeddings [Yamada et al. 2016]
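A small sketch (my illustration) of mapping character-level predictions back to word-level tags in the CharNER style, where all characters of a word should share a tag; here a majority vote resolves disagreements:

```python
from collections import Counter

def char_tags_to_word_tags(words, char_tags):
    """Collapse per-character tags into one tag per word (majority vote within each word)."""
    word_tags, i = [], 0
    for word in words:
        span = char_tags[i:i + len(word)]           # tags for this word's characters
        word_tags.append(Counter(span).most_common(1)[0][0])
        i += len(word)                               # assumes char_tags covers words only (no spaces)
    return word_tags

words = ["John", "Smith", "lives"]
# 4 + 5 + 5 = 14 per-character predictions, with one noisy disagreement inside "Smith".
char_tags = ["PER"] * 4 + ["PER", "PER", "O", "PER", "PER"] + ["O"] * 5
print(char_tags_to_word_tags(words, char_tags))      # ['PER', 'PER', 'O']
```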
• [Lample et al. 2016]: glample/tagger, clab/stack-lstm-ner
• NeuroNER
  • [Dernoncourt et al. 2016] “De-identification of Patient Notes with Recurrent Neural Networks”
  • [Dernoncourt et al. 2017] “NeuroNER: an easy-to-use program for named-entity recognition based on neural networks”
  • TensorFlow
• anaGo
  • BiLSTM-CRF based on [Lample et al. 2016]
  • Keras
  • Hironsan/ss-78366210 - SlideShare (slides, in Japanese)
• deep-crf
  • By the author of [Sato et al. 2017]
  • Chainer
• [Ma & Hovy 2016]
  • TensorFlow
• zseder/hunvec
  • [Pajkossy & Zséder 2016] “The Hunvec Framework For NN-CRF-based Sequential Tagging”
  • Theano and Pylearn2
• monikkinom/ner-lstm
  • [Athavale et al. 2016] “Towards Deep Learning in Hindi NER: An approach to tackle the Labelled Data Scarcity”
  • BiLSTM
  • TensorFlow
• Data
• Network Extension
  • For documents, incorporate the document structure
• Features
  • More suitable hand-crafted features for the domain / language
• Outside Information (see the gazetteer sketch below)
  • Dictionary
  • Pattern
  • ...
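As a small illustration of the “Dictionary” idea (my sketch, with a made-up gazetteer): a binary feature can mark whether a token lies inside a span that matches an external dictionary entry, and such flags can be fed to a feature-based model or concatenated to a token's vector representation.

```python
# Hypothetical gazetteer; in practice this could come from Wikipedia titles or a domain lexicon.
location_gazetteer = {"tokyo", "new york", "osaka"}

def gazetteer_features(tokens, max_span_len=2):
    """Return a 0/1 flag per token: is it inside any span found in the location gazetteer?"""
    flags = [0] * len(tokens)
    for start in range(len(tokens)):
        for length in range(1, max_span_len + 1):
            span = tokens[start:start + length]
            if " ".join(span).lower() in location_gazetteer:
                for k in range(start, start + len(span)):
                    flags[k] = 1
    return flags

tokens = ["She", "moved", "from", "New", "York", "to", "Tokyo"]
print(gazetteer_features(tokens))   # [0, 0, 0, 1, 1, 0, 1]
```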