Multi-lingual natural language understanding with spaCy

spaCy is a popular open-source Natural Language Processing library designed for practical usage. In this talk, I'll outline the new parsing model we've been developing to improve spaCy's support for more languages and text types. Like other transition-based parsers, the model predicts a sequence of actions that push tokens to and from a stack and build arcs between them. However, we extend the arc-eager system with actions that can also repair previous parse errors, introduce sentence boundaries, and split or merge the pre-segmented tokens. The joint approach improves parse accuracy on many types of text, especially for non-whitespace writing systems. We have also found significant practical advantages to short pipelines: they are easier to reason about, and they increase runtime flexibility by reducing the risk of train/test skew.
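
The basic arc-eager transition system that the abstract builds on can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not spaCy's implementation: the function name `parse_with_actions`, the action names, and the three-word example are hypothetical, and the extended actions from the talk (repair, sentence boundaries, split and merge) are left out.

```python
from typing import List, Optional

# The four standard arc-eager actions (names chosen for this sketch).
SHIFT, LEFT_ARC, RIGHT_ARC, REDUCE = "SHIFT", "LEFT-ARC", "RIGHT-ARC", "REDUCE"


def parse_with_actions(n_tokens: int, actions: List[str]) -> List[Optional[int]]:
    """Apply arc-eager actions and return each token's head index.

    A head of None means the token was never attached (for example, the root).
    """
    stack: List[int] = []
    buffer: List[int] = list(range(n_tokens))
    heads: List[Optional[int]] = [None] * n_tokens

    for action in actions:
        if action == SHIFT:
            # Move the first buffer token onto the stack.
            if not buffer:
                raise ValueError("SHIFT needs a non-empty buffer")
            stack.append(buffer.pop(0))
        elif action == LEFT_ARC:
            # The stack top becomes a dependent of the first buffer token.
            if not stack or not buffer or heads[stack[-1]] is not None:
                raise ValueError("LEFT-ARC needs a headless stack top and a buffer token")
            child = stack.pop()
            heads[child] = buffer[0]
        elif action == RIGHT_ARC:
            # The first buffer token becomes a dependent of the stack top,
            # then moves onto the stack so it can take its own dependents.
            if not stack or not buffer:
                raise ValueError("RIGHT-ARC needs a stack top and a buffer token")
            heads[buffer[0]] = stack[-1]
            stack.append(buffer.pop(0))
        elif action == REDUCE:
            # Pop a token that already has a head.
            if not stack or heads[stack[-1]] is None:
                raise ValueError("REDUCE needs a stack top that already has a head")
            stack.pop()
        else:
            raise ValueError(f"unknown action: {action}")
    return heads


words = ["She", "eats", "fish"]
heads = parse_with_actions(len(words), [SHIFT, LEFT_ARC, SHIFT, RIGHT_ARC])
for word, head in zip(words, heads):
    print(word, "<-", "ROOT" if head is None else words[head])
```

In a trained parser the action sequence is predicted by a model rather than written by hand; the sketch only shows how the stack, buffer, and arcs change under each action.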

Matthew Honnibal

April 15, 2018

  1. Explosion AI is a digital studio specialising in Artificial Intelligence and Natural Language Processing.
     - Open-source library for industrial-strength Natural Language Processing
     - spaCy’s next-generation Machine Learning library for deep learning with text
     - Coming soon: pre-trained, customisable models for a variety of languages and domains
     - A radically efficient data collection and annotation tool, powered by active learning
  2. Co-founders
     - Matthew Honnibal, CO-FOUNDER: PhD in Computer Science in 2009. 10 years publishing research on state-of-the-art natural language understanding systems. Left academia in 2014 to develop spaCy.
     - Ines Montani, CO-FOUNDER: Programmer and front-end developer with degree in media science and linguistics. Has been working on spaCy since its first release. Lead developer of Prodigy.
  3. “I don’t get it. Can you explain like I’m five?” Think of us as a boutique kitchen.
     - free recipes published online (open-source software)
     - catering for select events (consulting)
     - a line of kitchen gadgets (downloadable tools)
     - soon: a line of fancy sauces and spice mixes you can use at home (pre-trained models)
  4. doc = nlp(u"Apple is looking at buying U.K. startup")
     for token in doc:
         print(token.text, token.pos_, token.tag_, token.dep_,
               token.head.text, token.lefts, token.rights)
  5. Trees are the truth
     - sentences are tree-structured
     - dependencies can be arbitrarily long in string space
     - syntax is application-independent
  6. Trees are the truth
     - sentences are tree-structured ... but they’re read and written in order
     - dependencies can be arbitrarily long in string space ... but they’re usually short
     - syntax is application-independent. Learn the language once, apply it many times.
  7. Model design goals
     - must be updateable (including learning new vocab)
     - must transfer well (including on fragments)
     - must run fast on CPU
     - pipelines are bad
  8. Hash Embeddings allow flexible vocabulary

         features = doc2array([NORM, PREFIX, SUFFIX, SHAPE])
         norm = get_col(0) >> HashEmbed(128, 7000)
         prefix = get_col(1) >> HashEmbed(128, 7000)
         suffix = get_col(2) >> HashEmbed(128, 7000)
         shape = get_col(3) >> HashEmbed(128, 7000)
         embed_word = (
             (norm | prefix | suffix | shape)
             >> LayerNorm(Maxout(128, pieces=3))
         )

     Notation: `|` is function concatenation, `>>` is function composition.
  9. CNN makes model faster and less brittle

         trigram_cnn = (
             ExtractWindow(nW=1)
             >> LayerNorm(Maxout(128))
         )
         encode_context = (
             embed_word
             >> Residual(trigram_cnn)
             >> Residual(trigram_cnn)
             >> Residual(trigram_cnn)
             >> Residual(trigram_cnn)
         )

     Notation: `|` is function concatenation, `>>` is function composition.
  10. Transition-based approach makes joint modelling easy

          tensor = trigram_cnn(embed_word(doc))
          state_weights = state2vec(tensor)
          state = initialize_state(doc)
          while not state.is_finished:
              features = get_features(state, state_weights)
              probs = mlp(features)
              action = (probs * valid_actions(state)).argmax()
              state = do_action(action, state)
  11. What’s the current progress?
      - implemented learning to merge
      - working on learning to split
      - ranking ~2nd place on the CoNLL 2017 benchmark
      - great results for Chinese, Vietnamese, Japanese
      - joint model consistently better than pipeline
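  12. Results on CoNLL 2017 (1)

      | English | LAS  | UAS  | Sent | Word |
      |---------|------|------|------|------|
      | Pipe    | 79.0 | 81.8 | 73.2 | 98.7 |
      | Joint   | 80.2 | 83.1 | 77.3 | 98.7 |

      | Chinese | LAS  | UAS  | Sent | Word |
      |---------|------|------|------|------|
      | Pipe    | 57.1 | 61.8 | 98.2 | 88.9 |
      | Joint   | 63.7 | 68.3 | 99.1 | 92.5 |
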
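  13. Results on CoNLL 2017 (2)

      | Japanese | LAS  | UAS  | Sent | Word |
      |----------|------|------|------|------|
      | Pipe     | 73.1 | 74.4 | 94.9 | 89.7 |
      | Joint    | 78.5 | 80.0 | 95.5 | 92.9 |

      | Vietnamese | LAS  | UAS  | Sent | Word |
      |------------|------|------|------|------|
      | Pipe       | 39.1 | 44.2 | 92.6 | 82.5 |
      | Joint      | 44.7 | 50.8 | 91.8 | 86.7 |
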
  14. Workflow of the future
      - start with pre-trained models
      - same representation across languages
      - parse tree enables powerful rule-based matching
      - updateable models for accuracy on your domain
      - rapid iteration and data annotation
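
The idea behind the hash embeddings on slide 8 can be illustrated with a small pure-Python sketch of the general hashing trick. Everything here is an assumption made for illustration rather than Thinc's `HashEmbed` implementation: the class name `HashEmbedSketch`, the use of SHA-256, the choice of four hashes per key, and the random initialisation. Only the width of 128 and the 7000 rows come from the slide.

```python
import hashlib
import random
from typing import List


class HashEmbedSketch:
    """Map any string to a vector by summing a few rows of a fixed-size table."""

    def __init__(self, width: int, n_rows: int, n_hashes: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.width = width
        self.n_rows = n_rows
        self.n_hashes = n_hashes
        # Fixed-size table: its size does not depend on the vocabulary.
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(n_rows)
        ]

    def rows_for(self, key: str) -> List[int]:
        """Hash the key several times, each with a different prefix as a seed."""
        rows = []
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf8")).digest()
            rows.append(int.from_bytes(digest[:8], "little") % self.n_rows)
        return rows

    def __call__(self, key: str) -> List[float]:
        """Sum the selected rows; two keys rarely collide on every hash."""
        vector = [0.0] * self.width
        for row in self.rows_for(key):
            for j, value in enumerate(self.table[row]):
                vector[j] += value
        return vector


embed = HashEmbedSketch(width=128, n_rows=7000)
known = embed("apple")
unseen = embed("blockchainify")  # no vocabulary lookup, so new words still get a vector
print(len(known), len(unseen))
```

Because the table has a fixed number of rows, a word that was never seen during training still maps to a vector, which is one way to read the slide's claim that hash embeddings allow a flexible vocabulary.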