Slide 13
V. Comparison & Tuning
[Reimers & Gurevych 2017]
• “Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks”
• “Reporting Score Distributions Makes a Difference: Performance Study of
LSTM-networks for Sequence Tagging” (summary of above paper)
• Evaluation with over 50,000 (!) setups
• [Huang et al. 2015], [Lample et al. 2016], and [Ma & Hovy 2016]
• NNs are non-deterministic → results depend on initialization
• Diff. between “state-of-the-art” & “mediocre” may be insignificant
• Score distributions would be more trustworthy
• Hyperparameters
• Word Embeddings, Character Representation, Optimizer, Gradient
Clipping & Normalization, Tagging Schemes, Classifier, Dropout,
Number of LSTM Layers / Recurrent Units, Mini-batch Size, Backend
• Lots of empirical knowledge and best-practice advice
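The point about score distributions can be sketched in a few lines: instead of reporting one test score from a single random initialization, repeat the run over several seeds and report the spread. The snippet below is a toy illustration, not the paper's setup; `train_and_evaluate` is a hypothetical stand-in for a full LSTM training run, with seed-dependent noise mimicking the effect of random initialization.

```python
import random
import statistics

def train_and_evaluate(seed):
    # Hypothetical stand-in for a full LSTM training run: the random
    # initialization (controlled by the seed) perturbs the final score.
    rng = random.Random(seed)
    base_f1 = 0.90                        # assumed "true" model quality
    return base_f1 + rng.gauss(0, 0.005)  # seed-dependent variation

# Repeat the run with several seeds instead of reporting a single score.
scores = [train_and_evaluate(seed) for seed in range(10)]
mean = statistics.mean(scores)
stdev = statistics.stdev(scores)
print(f"F1 over 10 seeds: {mean:.3f} +/- {stdev:.3f} "
      f"(min {min(scores):.3f}, max {max(scores):.3f})")
```

Reporting the mean, standard deviation, and range makes it visible when the gap between two systems is smaller than the seed-induced variation, which is exactly the case where a single-score comparison is misleading.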
13 / 17