DEEP NER

Sorami Hisamoto
December 26, 2017

Survey on Neural Network Methods for Named Entity Recognition.

Presented at WAP Tokushima Laboratory of AI and NLP (http://nlp.worksap.co.jp/).

Transcript

  1. DEEP NER
    Neural Network Methods for Named Entity Recognition
    Sorami Hisamoto
    WAP Tokushima Laboratory of AI and NLP
    2017.12.26
    caution: slides based on shallow understanding

  2. Named Entity Recognition
    Named Entity Recognition (NER) is the task of extracting entities (such as
    names, locations and times) from text.
    It is common to use Machine Learning methods for this.
    It is often formalized as a Sequence Labeling problem.
    [Sang 2002], [Sang & De Meulder 2003]
    “Introduction to the CoNLL-{2002,2003} Shared Task: Language-Independent
    Named Entity Recognition”
    [Nadeau & Sekine 2007]
    “A survey of named entity recognition and classification”

  3. NER as Sequence Labeling
    Given a sequence of tokens, assign a label to each token.
    One way is to construct a graph representing the possible label sequences and
    their scores. The best path can then be found efficiently with the Viterbi algorithm.
    A common method to define such a graph: Conditional Random Fields (CRF).
    We can use BIO/IOBES tagging schemes, etc. to formalize NER as a Sequence
    Labeling problem.
    [McCallum & Li 2003]
    “Early Results for Named Entity Recognition with Conditional Random Fields,
    Feature Induction and Web-Enhanced Lexicons”
    [Sutton & McCallum 2010]
    “An Introduction to Conditional Random Fields”
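    As a concrete illustration of the above (my own sketch, not from the slides): a short
    sentence with BIO tags, and a minimal Viterbi decoder that combines per-token label
    scores with label-transition scores. The scores are random placeholders; a CRF would
    learn them.

    import numpy as np

    # NER as sequence labeling with BIO tags:
    tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
    gold   = ["B-ORG", "O",       "B-PER", "O",     "O",   "B-LOC",   "O"]
    labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

    def viterbi(emissions, transitions):
        """Best label path given per-token label scores (T x L)
        and label-to-label transition scores (L x L)."""
        T, L = emissions.shape
        score = emissions[0].copy()          # best score ending in each label
        back = np.zeros((T, L), dtype=int)   # backpointers
        for t in range(1, T):
            # cand[i, j]: best path with label i at t-1 followed by label j at t
            cand = score[:, None] + transitions + emissions[t][None, :]
            back[t] = cand.argmax(axis=0)
            score = cand.max(axis=0)
        path = [int(score.argmax())]         # best final label, then follow backpointers
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return [labels[i] for i in reversed(path)]

    # Random scores just to exercise the decoder.
    rng = np.random.default_rng(0)
    emissions = rng.normal(size=(len(tokens), len(labels)))
    transitions = rng.normal(size=(len(labels), len(labels)))
    print(list(zip(tokens, viterbi(emissions, transitions))))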

  4. Learning The Representation
    • Example: Binary Classification
    • Learning the Hyperplane
    • Perceptron, SVM, and PA // Speaker Deck
    • Separate the “Linearly Non-separable”
    • Kernel Methods
    • SVM with polynomial kernel visualization - YouTube
    • Deep Learning (Neural Network)
    • a.k.a. Representation Learning
    • can consider it as “learning the Kernel function”
    • Deep: more expressive
    • Why Now?
    • More Data
    • More Computational Power
    • Effective Learning Methods for Deep Models

  5. Components for Neural Network
    • CNN: Convolutional Neural Network
    • Aggregate things
    • How do Convolutional Neural Networks work? (ja trans)
    • RNN: Recurrent Neural Network
    • Feed itself
    • LSTM: Long Short-Term Memory
    • Understanding LSTM Networks – colah’s blog (ja trans)
    • Attention and Augmented Recurrent Neural Networks
    • and more recently ...
    • [Bradbury et al. 2016] “Quasi-Recurrent Neural Networks”
    • [Vaswani et al. 2017] “Attention is All You Need”
    • [Sabour et al. 2017] “Dynamic Routing Between Capsules”
    • ...
    How to treat Natural Language Data? → Embedding
    [Goldberg 2017]
    “Neural Network Methods for Natural Language Processing” (textbook)
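    To make the “feed itself” idea concrete, a minimal NumPy sketch (mine, not from the
    slides) of one LSTM step with the standard gate equations; sizes and weights are
    arbitrary.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One LSTM step: W (4H x D) input weights, U (4H x H) recurrent weights, b (4H,) bias."""
        H = h_prev.shape[0]
        z = W @ x + U @ h_prev + b
        i = sigmoid(z[0*H:1*H])        # input gate
        f = sigmoid(z[1*H:2*H])        # forget gate
        o = sigmoid(z[2*H:3*H])        # output gate
        g = np.tanh(z[3*H:4*H])        # candidate cell update
        c = f * c_prev + i * g         # cell state: the "long short-term memory"
        h = o * np.tanh(c)             # hidden state, fed back in at the next step
        return h, c

    # Run over a sequence by feeding the hidden state back in ("recurrent").
    D, H, T = 8, 16, 5
    rng = np.random.default_rng(0)
    W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
    h, c = np.zeros(H), np.zeros(H)
    for x in rng.normal(size=(T, D)):
        h, c = lstm_step(x, h, c, W, U, b)
    print(h.shape)                     # (16,)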

  6. Embedding: Representing Text as Vectors
    For neural network input, we want real-valued dense vectors.
    [Mikolov et al. 2013] word2vec
    “Efficient Estimation of Word Representations in Vector Space”
    [Pennington et al. 2014] GloVe
    “GloVe: Global Vectors for Word Representation”
    [Bojanowski & Grave et al. 2016] fastText
    “Enriching Word Vectors with Subword Information”
    Embeddings are often learned from a large-scale dataset, e.g. the entire Wikipedia
    → Semi-supervised
    Word embeddings in 2017: Trends and future directions
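    A minimal sketch of learning and querying word embeddings, assuming the gensim
    library (version 4 or later); the toy corpus and hyperparameter values are
    placeholders.

    from gensim.models import Word2Vec

    # Toy corpus; in practice this would be e.g. a full Wikipedia dump.
    sentences = [
        ["the", "president", "visited", "tokyo", "yesterday"],
        ["the", "company", "opened", "an", "office", "in", "osaka"],
        ["tokyo", "and", "osaka", "are", "cities", "in", "japan"],
    ]

    # Skip-gram word2vec; vector_size / window / min_count are illustrative values.
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

    vec = model.wv["tokyo"]                        # a dense, real-valued vector
    print(vec.shape)                               # (50,)
    print(model.wv.most_similar("tokyo", topn=3))  # nearest words in the embedding space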

  7. NER with Embedding
    Embeddings as unsupervised features learned from unlabeled data.
    The models themselves are not yet neural networks.
    1 [Turian et al. 2010]
    • “Word representations: A simple and general method for semi-supervised
    learning”
    • Embedding and Brown Cluster as features
    2 [Passos et al. 2014]
    • “Lexicon Infused Phrase Embeddings for Named Entity Resolution”
    • Phrase Embeddings for NER
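    A sketch of the “embeddings as extra features in a non-neural tagger” idea, using the
    sklearn-crfsuite package as an assumed stand-in for the CRF implementations used in
    these papers; embedding components become real-valued features alongside ordinary
    hand-crafted ones.

    import sklearn_crfsuite  # linear-chain CRF; feature dicts may hold float values

    def token_features(tokens, i, embeddings, dim=50):
        """Hand-crafted features plus the word's embedding components."""
        w = tokens[i]
        feats = {"bias": 1.0, "word.lower": w.lower(), "word.istitle": w.istitle()}
        for d, v in enumerate(embeddings.get(w.lower(), [0.0] * dim)):
            feats[f"emb_{d}"] = float(v)       # unsupervised features from unlabeled data
        return feats

    # Tiny illustrative training set; real data would be CoNLL-style labeled sentences.
    sentences = [["John", "lives", "in", "Paris"]]
    tags = [["B-PER", "O", "O", "B-LOC"]]
    embeddings = {}                            # e.g. a loaded word2vec lookup table

    X = [[token_features(s, i, embeddings) for i in range(len(s))] for s in sentences]
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
    crf.fit(X, tags)
    print(crf.predict(X))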

  8. NER with Neural Networks
    1 [Collobert et al. 2011]
    • “Natural Language Processing (almost) from Scratch”
    2 [Chiu & Nichols 2015]
    • “Named Entity Recognition with Bidirectional LSTM-CNNs”
    3 [Huang et al. 2015]
    • “Bidirectional LSTM-CRF Models for Sequence Tagging”
    4 [Lample et al. 2016]
    • “Neural Architectures for Named Entity Recognition”
    5 [Ma & Hovy 2016]
    • “End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF”
    6 [Reimers & Gurevych 2017]
    • “Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks”
    • “Reporting Score Distributions Makes a Difference: Performance Study of
    LSTM-networks for Sequence Tagging”

  9. I. Unified Neural Network Architecture for NLP
    [Collobert et al. 2011] Embedding + Feed Forward Network
    “Natural Language Processing (almost) from Scratch”
    • (almost) from Scratch: minimal hand-crafted features
    • Train language model to get word vectors from raw data
    ∗Figures from [Collobert et al. 2011]
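    A NumPy sketch (mine, loosely following the window approach the figures describe) of
    how a feed-forward network scores the tags of the middle word in a window of word
    embeddings; all sizes and weights are placeholders.

    import numpy as np

    V, D, H, n_tags, win = 10_000, 50, 300, 9, 5   # vocab, emb dim, hidden units, tags, window
    rng = np.random.default_rng(0)

    E  = rng.normal(scale=0.1, size=(V, D))        # word embedding table (pre-trained from raw text)
    W1 = rng.normal(scale=0.1, size=(H, win * D)); b1 = np.zeros(H)
    W2 = rng.normal(scale=0.1, size=(n_tags, H));  b2 = np.zeros(n_tags)

    def score_window(word_ids):
        """Tag scores for the middle word of a window of token ids."""
        x = E[word_ids].reshape(-1)                # concatenate the window's embeddings
        h = np.tanh(W1 @ x + b1)                   # hidden layer
        return W2 @ h + b2                         # one score per tag

    window = np.array([3, 17, 42, 7, 99])          # ids of [w-2, w-1, w, w+1, w+2]
    print(score_window(window).shape)              # (9,)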

  10. II. BiLSTM-CNNs
    [Chiu & Nichols 2015] BiLSTM-CNNs
    “Named Entity Recognition with Bidirectional LSTM-CNNs”
    • Sequence Labeling with BiLSTM [Graves et al. 2013]
    • CNN for character-level features [Santos et al. 2016], [Labeau et al. 2015]
    • Output layer to get tag probabilities
    ∗Figures from [Chiu & Nichols 2015]
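    A NumPy sketch (not the paper's code) of the character-level CNN idea: convolve the
    character embeddings of a word and max-pool over positions to get a fixed-size
    character feature vector, which is then concatenated with the word embedding.

    import numpy as np

    char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
    D_char, n_filters, width = 16, 30, 3                        # char emb dim, filters, window
    rng = np.random.default_rng(0)
    C = rng.normal(scale=0.1, size=(len(char2id), D_char))      # char embedding table
    F = rng.normal(scale=0.1, size=(n_filters, width * D_char)) # convolution filters

    def char_cnn(word):
        """Fixed-size character-level feature vector for one word."""
        X = C[[char2id[c] for c in word.lower() if c in char2id]]
        X = np.pad(X, ((width - 1, width - 1), (0, 0)))         # zero-pad so short words work
        windows = np.stack([X[i:i + width].reshape(-1)          # every width-sized char window
                            for i in range(X.shape[0] - width + 1)])
        conv = windows @ F.T                                    # (n_windows, n_filters)
        return conv.max(axis=0)                                 # max-over-time pooling

    print(char_cnn("Baghdad").shape)                            # (30,)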


  11. III. BiLSTM-CRF
    [Huang et al. 2015] BiLSTM-CRF
    “Bidirectional LSTM-CRF Models for Sequence Tagging”
    • Sequence Labeling with BiLSTM [Graves et al. 2013]
    • CRF to model label dependencies
    ∗Figures from [Huang et al. 2015]
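    A small sketch (not from the paper) of what the CRF layer adds on top of the BiLSTM:
    a whole tag sequence is scored with per-token emission scores plus tag-to-tag
    transition scores, and training compares the gold path against all paths via the
    forward algorithm. Values are random here.

    import numpy as np

    T, L = 6, 9                                    # sentence length, number of tags
    rng = np.random.default_rng(0)
    emissions = rng.normal(size=(T, L))            # per-token tag scores from the BiLSTM
    transitions = rng.normal(size=(L, L))          # CRF parameters: score of tag i -> tag j

    def sequence_score(tags):
        """CRF score of one tag sequence: emissions plus transitions."""
        return emissions[np.arange(T), tags].sum() + transitions[tags[:-1], tags[1:]].sum()

    def log_partition():
        """log-sum over all tag sequences (forward algorithm), used in the CRF loss."""
        alpha = emissions[0]
        for t in range(1, T):
            alpha = emissions[t] + np.logaddexp.reduce(alpha[:, None] + transitions, axis=0)
        return np.logaddexp.reduce(alpha)

    gold = rng.integers(0, L, size=T)
    # Training minimizes -(sequence_score(gold) - log_partition()); decoding uses Viterbi.
    print(sequence_score(gold) - log_partition())  # log-probability of the gold path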

  12. IV. BiLSTM-CRF with Character-Level Features
    [Lample et al. 2016] BiLSTM-LSTM-CRF, S-LSTM
    “Neural Architectures for Named Entity Recognition”
    • BiLSTM for character embedding
    [Ling et al. 2015], [Ballesteros et al. 2015]
    • (also propose S-LSTM, a transition-based
    approach inspired by shift-reduce parsers)
    [Ma & Hovy 2016] BiLSTM-CNN-CRF
    “End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF”
    • CNN for character embedding
    [Santos & Zadrozny 2014], [Chiu & Nichols 2015]
    • No feature engineering or pre-processing,
    hence “truly end-to-end”
    ∗Figures from [Lample et al. 2016], [Ma & Hovy 2016]
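    A rough NumPy sketch (not from either paper) of the character-BiLSTM alternative:
    read a word's character embeddings left-to-right and right-to-left with two LSTMs and
    concatenate their final states as the word's character-level representation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h, c, W, U, b):
        """One LSTM step over a character embedding (same gate equations as usual)."""
        H = h.shape[0]
        z = W @ x + U @ h + b
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
        g = np.tanh(z[3*H:])
        c = f * c + i * g
        return o * np.tanh(c), c

    D, H = 16, 25                                   # char embedding dim, char-LSTM size
    rng = np.random.default_rng(0)
    char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
    C = rng.normal(scale=0.1, size=(len(char2id), D))
    Wf, Uf, bf = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
    Wb, Ub, bb = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)

    def char_bilstm(word):
        """Concatenate the final forward and backward LSTM states over the characters."""
        xs = [C[char2id[c]] for c in word.lower() if c in char2id]
        h_f, c_f = np.zeros(H), np.zeros(H)
        for x in xs:                                # left to right
            h_f, c_f = lstm_step(x, h_f, c_f, Wf, Uf, bf)
        h_b, c_b = np.zeros(H), np.zeros(H)
        for x in reversed(xs):                      # right to left
            h_b, c_b = lstm_step(x, h_b, c_b, Wb, Ub, bb)
        return np.concatenate([h_f, h_b])           # (2H,) char-level word representation

    print(char_bilstm("Osaka").shape)               # (50,)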


  13. V. Comparison & Tuning
    [Reimers & Gurevych 2017]
    • “Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks”
    • “Reporting Score Distributions Makes a Difference: Performance Study of
    LSTM-networks for Sequence Tagging” (summary of above paper)
    • Evaluation with over 50,000 (!) setups
    • [Huang et al. 2015], [Lample et al. 2016], and [Ma & Hovy 2016]
    • NNs are non-deterministic → results depend on initialization
    • Diff. between “state-of-the-art” & “mediocre” may be insignificant
    • Score distributions would be more trustworthy
    • Hyperparameters
    • Word Embeddings, Character Representation, Optimizer, Gradient
    Clipping & Normalization, Tagging Schemes, Classifier, Dropout,
    Number of LSTM Layers / Recurrent Units, Mini-batch Size, Backend
    • Lots of empirical knowledge and best-practice advice

  14. NER with LSTM-CRF: Other Topics
    • [Rei et al. 2016]
    • “Attending to Characters in Neural Sequence Labeling Models”
    • Attention to dynamically decide how much word-/char-level info. to use
    • [Kuru et al. 2016]
    • “CharNER: Character-Level Named Entity Recognition”
    • Sentence as sequence of chars. Stacked BiLSTM
    • Predicts tag for each char & forces that tags in a word are the same
    • [Misawa et al. 2017]
    • “Character-based Bidirectional LSTM-CRF with words and characters for Japanese
    Named Entity Recognition”
    • CNN-based subword features less suitable for Japanese: shorter words, no capitalization
    • Average word length: CoNLL 2003 6.43 chars, Mainichi corpus 1.95 chars
    • ”entity/word boundary conflict” in Japanese
    • Char-BiLSTM-CRF: predict a tag for every char independently
    • [Sato et al. 2017]
    • “Segment-Level Neural Conditional Random Fields for Named Entity Recognition”
    • Segment Lattice from word-level tagging model
    • Easier to incorporate Dictionary Features
    e.g. Binary Feature, Wikipedia Embedding [Yamada et al. 2016]

  15. Existing Implementations (I)
    • [Reimers & Gurevych 2017] UKPLab/emnlp2017-bilstm-cnn-crf
    • [Lample et al. 2016] glample/tagger, clab/stack-lstm-ner
    • NeuroNER
    • [Dernoncourt et al. 2016] “De-identification of Patient Notes with Recurrent Neural
    Networks”
    • [Dernoncourt et al. 2017] “NeuroNER: an easy-to-use program for named-entity
    recognition based on neural networks”
    • TensorFlow
    • anaGo
    • BiLSTM-CRF based on [Lample et al. 2016]
    • Keras
    • Hironsan/ss-78366210 - SlideShare (slide, ja)
    • deep-crf
    • By the author of [Sato et al. 2017]
    • Chainer

  16. Existing Implementations (II)
    • LopezGG/NN NER tensorFlow
    • [Ma & Hovy 2016]
    • TensorFlow
    • zseder/hunvec
    • [Pajkossy & Zséder 2016] “The Hunvec Framework For NN-CRF-based Sequential
    Tagging”
    • Theano and Pylearn2
    • monikkinom/ner-lstm
    • [Athavale et al. 2016] “Towards Deep Learning in Hindi NER: An approach to tackle
    the Labelled Data Scarcity”
    • BiLSTM
    • TensorFlow

  17. Future directions for us?
    • Embedding
    • Model
    • Domain-specific Data
    • Network Extension
    • For documents, incorporate the doc structure
    • Features
    • More suitable hand-crafted features for domain / language
    • Outside Information
    • Dictionary
    • Pattern
    • ...