DEEP NER

Sorami Hisamoto
December 26, 2017

Survey on Neural Network Methods for Named Entity Recognition.

Presented at WAP Tokushima Laboratory of AI and NLP ( http://nlp.worksap.co.jp/ ).

Transcript

  1. DEEP NER: Neural Network Methods for Named Entity Recognition

    Sorami Hisamoto, WAP Tokushima Laboratory of AI and NLP, 2017.12.26
    Caution: slides based on shallow understanding.
  2. Named Entity Recognition

    Named Entity Recognition (NER) is the task of extracting entities (such as names, locations and times) from text. It is commonly approached with Machine Learning methods and is often formalized as a Sequence Labeling problem.
    • [Sang 2002], [Sang & De Meulder 2003] “Introduction to the CoNLL-{2002,2003} Shared Task: Language-Independent Named Entity Recognition”
    • [Nadeau & Sekine 2007] “A survey of named entity recognition and classification”
  3. NER as Sequence Labeling

    Given a sequence of tokens, assign a label to each token. One way is to construct a graph representing the possible label sequences and their scores; the best path can then be found efficiently with the Viterbi algorithm. A common method to create the graph: Conditional Random Fields (CRF). Label schemes such as BIO or IOBES formalize NER as a Sequence Labeling problem.
    • [McCallum & Li 2003] “Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons”
    • [Sutton & McCallum 2010] “An Introduction to Conditional Random Fields”
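A minimal sketch of the two ideas on this slide: encoding entity spans as BIO labels, and finding the best label path with the Viterbi algorithm. The function names, the toy scores and the hand-set transition penalty are all assumptions for illustration; in a CRF the emission and transition scores would be learned.

```python
from typing import Dict, List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, type) entity spans into BIO labels."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float],
            labels: List[str]) -> List[str]:
    """Return the highest-scoring label path.

    emissions[t][y]         : score of label y at position t
    transitions[(prev, y)]  : score of moving from prev to y (0.0 if missing)
    """
    # best[t][y] = best score of any path that ends with label y at position t
    best = [{y: emissions[0][y] for y in labels}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


LABELS = ["O", "B-PER", "I-PER"]
# Toy scores: taken position by position, "O" then "I-PER" looks best,
# but that sequence is invalid under BIO, so it gets a large penalty.
EMISSIONS = [
    {"O": 1.0, "B-PER": 0.9, "I-PER": 0.0},
    {"O": 0.0, "B-PER": 0.0, "I-PER": 1.0},
]
TRANSITIONS = {("O", "I-PER"): -10.0}

print(spans_to_bio(4, [(0, 2, "PER"), (3, 4, "LOC")]))
print(viterbi(EMISSIONS, TRANSITIONS, LABELS))
```

The transition score is what lets the decoder prefer a globally consistent path over the best label at each position taken separately.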
  4. Learning the Representation

    • Example: Binary Classification
      • Learning the Hyperplane
        • Perceptron, SVM, and PA (Speaker Deck)
      • Separate the “Linearly Non-separable”
        • Kernel Methods
          • SVM with polynomial kernel visualization (YouTube)
        • Deep Learning (Neural Network)
          • a.k.a. Representation Learning
          • can be considered as “learning the Kernel function”
          • Deep: more expressive
    • Why Now?
      • More Data
      • More Computational Power
      • Effective Learning Methods for Deep Models
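A small sketch of why the representation matters, assuming the classic XOR toy problem (not from the slides): a perceptron cannot separate XOR in the raw inputs, but it can after a hand-written feature map adds the product term. Kernel methods and neural networks can be read as ways of obtaining such a mapping without writing it by hand. All names here are illustrative.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, label in {-1, +1})


def train_perceptron(data: List[Example], epochs: int = 100):
    """Classic perceptron: update the hyperplane on every mistake."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # converged: every example is on the right side
            break
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


def feature_map(x: Sequence[float]) -> Tuple[float, float, float]:
    """Hand-written representation: append the product x1 * x2."""
    return (x[0], x[1], x[0] * x[1])


XOR: List[Example] = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]


def accuracy(data: List[Example]) -> float:
    w, b = train_perceptron(data)
    return sum(predict(w, b, x) == y for x, y in data) / len(data)


raw_acc = accuracy(XOR)
mapped_acc = accuracy([(feature_map(x), y) for x, y in XOR])
print(raw_acc, mapped_acc)
```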
  5. Components for Neural Network

    • CNN: Convolutional Neural Network
      • Aggregate things
      • How do Convolutional Neural Networks work? (ja trans)
    • RNN: Recurrent Neural Network
      • Feed itself
    • LSTM: Long Short-Term Memory
      • Understanding LSTM Networks – colah’s blog (ja trans)
    • Attention and Augmented Recurrent Neural Networks
    • and more recently ...
      • [Bradbury et al. 2016] “Quasi-Recurrent Neural Networks”
      • [Vaswani et al. 2017] “Attention is All You Need”
      • [Sabour et al. 2017] “Dynamic Routing Between Capsules”
      • ...
    How to treat Natural Language Data? → Embedding
    [Goldberg 2017] “Neural Network Methods for Natural Language Processing” (textbook)
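The one-line summaries above (“aggregate things”, “feed itself”) can be sketched with scalars. This is a deliberately simplified illustration, assuming one-dimensional inputs and a single hidden unit; real layers use vectors, matrices and learned weights, and an LSTM adds gates on top of the recurrent step.

```python
import math
from typing import List


def conv1d(xs: List[float], kernel: List[float]) -> List[float]:
    """CNN idea: slide a kernel over the sequence and aggregate each window."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, xs[i:i + k]))
            for i in range(len(xs) - k + 1)]


def max_pool(xs: List[float]) -> float:
    """Aggregate the window outputs into a single feature."""
    return max(xs)


def rnn(xs: List[float], w_x: float, w_h: float, b: float = 0.0) -> List[float]:
    """RNN idea: the hidden state is fed back in at every step."""
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


windows = conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0])
print(windows, max_pool(windows))

# The second input is 0.0, yet the second state is non-zero:
# it still carries information from the first step.
print(rnn([1.0, 0.0], w_x=1.0, w_h=1.0))
```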
  6. Embedding: Representing Text as Vectors

    For neural network input, we want real-valued dense vectors.
    • [Mikolov et al. 2013] word2vec: “Efficient Estimation of Word Representations in Vector Space”
    • [Pennington et al. 2014] GloVe: “GloVe: Global Vectors for Word Representation”
    • [Bojanowski & Grave et al. 2016] fastText: “Enriching Word Vectors with Subword Information”
    Embeddings are often learned from a large-scale dataset, e.g. the entire Wikipedia → Semi-supervised
    • Word embeddings in 2017: Trends and future directions
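A toy illustration of what “text as dense vectors” buys us: similar words end up close together. The three-dimensional vectors below are made up for the example and do not come from word2vec, GloVe or fastText; real embeddings typically have tens to hundreds of dimensions.

```python
import math
from typing import Dict, List

# Hypothetical embeddings, hand-written for illustration only.
EMBEDDINGS: Dict[str, List[float]] = {
    "tokyo": [0.9, 0.1, 0.0],
    "osaka": [0.8, 0.2, 0.1],
    "monday": [0.0, 0.1, 0.9],
}


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word: str, k: int = 1) -> List[str]:
    """Return the k nearest words to `word` by cosine similarity."""
    query = EMBEDDINGS[word]
    others = [w for w in EMBEDDINGS if w != word]
    return sorted(others, key=lambda w: cosine(query, EMBEDDINGS[w]), reverse=True)[:k]


print(most_similar("tokyo"))
```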
  7. NER with Embedding

    Embeddings are used as unsupervised features learned from unlabeled data. The models are not yet neural networks.
    1 [Turian et al. 2010]
      • “Word representations: A simple and general method for semi-supervised learning”
      • Embedding and Brown Cluster as features
    2 [Passos et al. 2014]
      • “Lexicon Infused Phrase Embeddings for Named Entity Resolution”
      • Phrase Embeddings for NER
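A rough sketch of how embeddings and Brown clusters can be fed to a non-neural tagger as extra features. The feature names, the toy lookup tables and the prefix lengths are assumptions for illustration, not the exact feature templates of the cited papers.

```python
from typing import Dict, List, Sequence

# Hypothetical lookup tables; in practice both are induced from unlabeled text.
EMBEDDINGS: Dict[str, List[float]] = {"tokushima": [0.7, -0.2]}
BROWN_PATHS: Dict[str, str] = {"tokushima": "110100"}  # bit-string path in the cluster tree


def token_features(word: str,
                   embeddings: Dict[str, List[float]],
                   brown_paths: Dict[str, str],
                   prefix_lengths: Sequence[int] = (2, 4)) -> Dict[str, float]:
    """Build a sparse feature dict for one token, as a linear model or CRF would consume."""
    key = word.lower()
    feats: Dict[str, float] = {"word=" + key: 1.0}
    vec = embeddings.get(key)
    if vec is not None:
        # Each embedding dimension becomes one real-valued feature.
        for i, value in enumerate(vec):
            feats[f"emb_{i}"] = value
    path = brown_paths.get(key)
    if path is not None:
        # Prefixes of the cluster path give clusters at several granularities.
        for n in prefix_lengths:
            feats[f"brown_{n}={path[:n]}"] = 1.0
    return feats


print(token_features("Tokushima", EMBEDDINGS, BROWN_PATHS))
print(token_features("unknownword", EMBEDDINGS, BROWN_PATHS))
```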
  8. NER with Neural Networks

    1 [Collobert et al. 2011]
      • “Natural Language Processing (almost) from Scratch”
    2 [Chiu & Nichols 2015]
      • “Named Entity Recognition with Bidirectional LSTM-CNNs”
    3 [Huang et al. 2015]
      • “Bidirectional LSTM-CRF Models for Sequence Tagging”
    4 [Lample et al. 2016]
      • “Neural Architectures for Named Entity Recognition”
    5 [Ma & Hovy 2016]
      • “End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF”
    6 [Reimers & Gurevych 2017]
      • “Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks”
      • “Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging”
  9. I. Unified Neural Network Architecture for NLP

    [Collobert et al. 2011] Embedding + Feed Forward Network: “Natural Language Processing (almost) from Scratch”
    • (almost) from Scratch: minimal hand-crafted features
    • Train a language model to get word vectors from raw data
    (Figures from [Collobert et al. 2011])
  10. II. BiLSTM-CNNs

    [Chiu & Nichols 2015] BiLSTM-CNNs: “Named Entity Recognition with Bidirectional LSTM-CNNs”
    • Sequence Labeling with BiLSTM [Graves et al. 2013]
    • CNN for character-level features [Santos et al. 2016], [Labeau et al. 2015]
    • Output layer yields tag probabilities
    (Figures from [Chiu & Nichols 2015])
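One common way for an output layer to turn per-tag scores into tag probabilities is a softmax. The sketch below assumes that choice and uses made-up scores; it is not taken from the paper's implementation.

```python
import math
from typing import Dict


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw tag scores into probabilities that sum to 1."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {tag: math.exp(s - m) for tag, s in scores.items()}
    total = sum(exps.values())
    return {tag: e / total for tag, e in exps.items()}


# Hypothetical scores for one token, e.g. produced from its BiLSTM state.
scores = {"O": 0.5, "B-PER": 2.0, "I-PER": -1.0}
probs = softmax(scores)
print(max(probs, key=probs.get), probs)
```

Picking the most probable tag at each position ignores dependencies between neighboring labels, which is what adding a CRF layer addresses.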
  11. III. BiLSTM-CRF

    [Huang et al. 2015] BiLSTM-CRF: “Bidirectional LSTM-CRF Models for Sequence Tagging”
    • Sequence Labeling with BiLSTM [Graves et al. 2013]
    • CRF to model label dependencies
    (Figures from [Huang et al. 2015])
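A minimal sketch of how a CRF layer on top of BiLSTM outputs models label dependencies. It assumes the simplest linear-chain form: a path score is the sum of per-position emission scores (which the BiLSTM would provide) and label-to-label transition scores, and training minimizes the negative log-likelihood of the gold path. Start/end transitions are omitted and all numbers are made up.

```python
import math
from itertools import product
from typing import List, Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emissions: List[List[float]],
               transitions: List[List[float]],
               path: Sequence[int]) -> float:
    """Score of one label path: emissions plus transitions along the path."""
    score = emissions[0][path[0]]
    for t in range(1, len(path)):
        score += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return score


def log_partition(emissions: List[List[float]],
                  transitions: List[List[float]]) -> float:
    """Forward algorithm: log of the summed exp-scores of all label paths."""
    n_labels = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            log_sum_exp([alpha[p] + transitions[p][y] for p in range(n_labels)])
            + emissions[t][y]
            for y in range(n_labels)
        ]
    return log_sum_exp(alpha)


def neg_log_likelihood(emissions, transitions, gold: Sequence[int]) -> float:
    return log_partition(emissions, transitions) - path_score(emissions, transitions, gold)


# Toy example: 3 tokens, 2 labels.
EMISSIONS = [[1.0, 0.5], [0.2, 1.5], [0.3, 0.1]]
TRANSITIONS = [[0.5, -1.0], [0.0, 0.2]]  # TRANSITIONS[prev][next]

print(neg_log_likelihood(EMISSIONS, TRANSITIONS, gold=[0, 1, 1]))
```

The forward recursion gives the same normalizer as enumerating every path, but in time linear in the sentence length.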
  12. IV. BiLSTM-CRF with Character-Level Features

    [Lample et al. 2016] BiLSTM-LSTM-CRF, S-LSTM: “Neural Architectures for Named Entity Recognition”
    • BiLSTM for character embedding [Ling et al. 2015], [Ballesteros et al. 2015]
    • (also proposes S-LSTM, a transition-based approach inspired by shift-reduce parsers)
    [Ma & Hovy 2016] BiLSTM-CNN-CRF: “End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF”
    • CNN for character embedding [Santos & Zadrozny 2014], [Chiu & Nichols 2015]
    • No feature engineering or pre-processing, hence “truly end-to-end”
    (Figures from [Lample et al. 2016], [Ma & Hovy 2016])
  13. V. Comparison & Tuning

    [Reimers & Gurevych 2017]
    • “Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks”
    • “Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging” (summary of the above paper)
    • Evaluation with over 50,000 (!) setups
      • [Huang et al. 2015], [Lample et al. 2016], and [Ma & Hovy 2016]
    • NNs are non-deterministic → results depend on initialization
      • The difference between “state-of-the-art” and “mediocre” may be insignificant
      • A score distribution is more trustworthy than a single score
    • Hyperparameters
      • Word Embeddings, Character Representation, Optimizer, Gradient Clipping & Normalization, Tagging Schemes, Classifier, Dropout, Number of LSTM Layers / Recurrent Units, Mini-batch Size, Backend
    • Lots of empirical knowledge and best-practice advice
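A small sketch of what “reporting a score distribution” can look like in practice: train the same setup with several random seeds and summarize the scores instead of quoting the single best run. The F1 values below are placeholders invented for the example, not results from any paper.

```python
import statistics
from typing import Dict, List


def summarize(scores: List[float]) -> Dict[str, float]:
    """Summarize the scores of several runs that differ only in random seed."""
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Placeholder F1 scores for two hypothetical systems, three seeds each.
runs_a = [90.0, 91.0, 92.0]
runs_b = [90.5, 91.0, 91.5]

summary_a = summarize(runs_a)
summary_b = summarize(runs_b)
print(summary_a)
print(summary_b)
```

Comparing only the best run would favor system A here, although both systems have the same mean and A simply varies more between seeds.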
  14. NER with LSTM-CRF: Other Topics

    • [Rei et al. 2016] “Attending to Characters in Neural Sequence Labeling Models”
      • Attention to dynamically decide how much word-/char-level info. to use
    • [Kuru et al. 2016] “CharNER: Character-Level Named Entity Recognition”
      • Sentence as a sequence of chars; stacked BiLSTM
      • Predicts a tag for each char and forces the tags within a word to be the same
    • [Misawa et al. 2017] “Character-based Bidirectional LSTM-CRF with words and characters for Japanese Named Entity Recognition”
      • CNN not suitable for Japanese subword: shorter words, no capitalization
      • Average word length: CoNLL 2003 6.43 chars, Mainichi corpus 1.95 chars
      • “entity/word boundary conflict” in Japanese
      • Char-BiLSTM-CRF: predict a tag for every char independently
    • [Sato et al. 2017] “Segment-Level Neural Conditional Random Fields for Named Entity Recognition”
      • Segment Lattice from word-level tagging model
      • Easier to incorporate Dictionary Features, e.g. Binary Feature, Wikipedia Embedding [Yamada et al. 2016]
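A minimal sketch of the “dynamically decide how much word-/char-level info. to use” idea, assuming an element-wise sigmoid gate that interpolates between the two vectors. In a real model the gate values would be computed by a small learned network from both representations; here they are passed in directly, and all names and numbers are illustrative.

```python
import math
from typing import List


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def gated_mix(word_vec: List[float],
              char_vec: List[float],
              gate_logits: List[float]) -> List[float]:
    """Per dimension: z * word + (1 - z) * char, with z = sigmoid(gate logit)."""
    mixed = []
    for w, c, g in zip(word_vec, char_vec, gate_logits):
        z = sigmoid(g)
        mixed.append(z * w + (1.0 - z) * c)
    return mixed


word_vec = [1.0, 1.0]  # word-level embedding (hypothetical)
char_vec = [3.0, 3.0]  # character-derived representation (hypothetical)

# Dimension 0: gate at 0.5, an even mix. Dimension 1: gate near 1, trust the word vector.
print(gated_mix(word_vec, char_vec, gate_logits=[0.0, 50.0]))
```

For a rare or unseen word, a learned gate could shift toward the character-derived vector, which is the motivation for mixing rather than simply concatenating.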
  15. Existing Implementations (I)

    • [Reimers & Gurevych 2017] UKPLab/emnlp2017-bilstm-cnn-crf
    • [Lample et al. 2016] glample/tagger, clab/stack-lstm-ner
    • NeuroNER
      • [Dernoncourt et al. 2016] “De-identification of Patient Notes with Recurrent Neural Networks”
      • [Dernoncourt et al. 2017] “NeuroNER: an easy-to-use program for named-entity recognition based on neural networks”
      • TensorFlow
    • anaGo
      • BiLSTM-CRF based on [Lample et al. 2016]
      • Keras
      • Hironsan/ss-78366210 - SlideShare (slide, ja)
    • deep-crf
      • By the author of [Sato et al. 2017]
      • Chainer
  16. Existing Implementations (II)

    • LopezGG/NN_NER_tensorFlow
      • [Ma & Hovy 2016]
      • TensorFlow
    • zseder/hunvec
      • [Pajkossy & Zséder 2016] “The Hunvec Framework For NN-CRF-based Sequential Tagging”
      • Theano and Pylearn2
    • monikkinom/ner-lstm
      • [Athavale et al. 2016] “Towards Deep Learning in Hindi NER: An approach to tackle the Labelled Data Scarcity”
      • BiLSTM
      • TensorFlow
  17. Future directions for us?

    • Embedding
    • Model
    • Domain-specific Data
    • Network Extension
      • For documents, incorporate the doc structure
    • Features
      • More suitable hand-crafted features for domain / language
    • Outside Information
      • Dictionary
      • Pattern
      • ...