Multi-lingual natural language understanding with spaCy

Matthew Honnibal, Explosion AI

Explosion AI is a digital studio specialising in Artificial Intelligence and Natural Language Processing.
- Open-source library for industrial-strength Natural Language Processing
- spaCy’s next-generation Machine Learning library for deep learning with text
- Coming soon: pre-trained, customisable models for a variety of languages and domains
- A radically efficient data collection and annotation tool, powered by active learning

Matthew Honnibal, CO-FOUNDER
PhD in Computer Science in 2009. 10 years publishing research on state-of-the-art natural language understanding systems. Left academia in 2014 to develop spaCy.

Ines Montani, CO-FOUNDER
Programmer and front-end developer with a degree in media science and linguistics. Has been working on spaCy since its first release. Lead developer of Prodigy.

“I don’t get it. Can you explain like I’m five?”
Think of us as a boutique kitchen.
- free recipes published online = open-source software
- catering for select events = consulting
- a line of kitchen gadgets = downloadable tools
- soon: a line of fancy sauces and spice mixes you can use at home = pre-trained models

Joint transition-based segmentation and parsing

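import spacy

# Assumes an English pipeline is installed; the package name here is an assumption.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup")
for token in doc:
    # token.lefts and token.rights are generators, so convert them to lists for printing
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text,
          list(token.lefts), list(token.rights))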

What’s parsing good for?


Trees are the truth
- sentences are tree-structured ... but they’re read and written in order
- dependencies can be arbitrarily long in string space ... but they’re usually short
- syntax is application-independent

Learn the language once, apply it many times.

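Whitespace != Word
- Hebrew: זהו משפט. ("This is a sentence.")
- English: income tax return
- German: Einkommensteuererklärung ("income tax return")
- French: C’est une phrase. ("This is a sentence.")
- Japanese: これは文章です。 ("This is a sentence.")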
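To make the point concrete, here is a minimal sketch of what naive whitespace splitting does to the example strings on this slide. The helper name `whitespace_tokens` is made up for illustration and is not part of spaCy.

```python
def whitespace_tokens(text):
    """Split on whitespace only, as a naive tokenizer would."""
    return text.split()


examples = {
    "English": "income tax return",
    "German": "Einkommensteuererklärung",
    "French": "C'est une phrase.",
    "Japanese": "これは文章です。",
}

for language, text in examples.items():
    # The same concept is three chunks in English and one in German;
    # the Japanese sentence has no spaces at all, so it stays one chunk.
    print(language, whitespace_tokens(text))
```

The French case shows the opposite problem: "C'est" and "phrase." each come back as one chunk even though they contain more than one word-level unit.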

How the parser works

Model design goals
- must be updateable (including learning new vocab)
- must transfer well (including on fragments)
- must run fast on CPU
- pipelines are bad

features = doc2array([NORM, PREFIX, SUFFIX, SHAPE])
norm = get_col(0) >> HashEmbed(128, 7000)
prefix = get_col(1) >> HashEmbed(128, 7000)
suffix = get_col(2) >> HashEmbed(128, 7000)
shape = get_col(3) >> HashEmbed(128, 7000)
embed_word = (
    (norm | prefix | suffix | shape)
    >> LayerNorm(Maxout(128, pieces=3))
)

Hash Embeddings allow flexible vocabulary

Notation
    |   Function concatenation
    >>  Function composition
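As a rough illustration of why hash embeddings allow a flexible vocabulary, the sketch below maps any string to a few rows of a fixed-size table and sums them. It is a toy under stated assumptions, not the library's implementation: `ToyHashEmbed`, `hash_rows`, the use of four MD5-based hashes and the random initialisation are all choices made here for illustration.

```python
import hashlib
import random


def hash_rows(key, n_rows, n_hashes=4):
    """Map a string to n_hashes row indices using differently seeded hashes."""
    rows = []
    for seed in range(n_hashes):
        digest = hashlib.md5(f"{seed}:{key}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % n_rows)
    return rows


class ToyHashEmbed:
    """Fixed-size embedding table indexed by hashes instead of a vocabulary."""

    def __init__(self, width, n_rows, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.n_rows = n_rows
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(n_rows)
        ]

    def __call__(self, key):
        # Sum the rows selected by each hash; two strings rarely share all rows.
        vector = [0.0] * self.width
        for row in hash_rows(key, self.n_rows):
            for i, value in enumerate(self.table[row]):
                vector[i] += value
        return vector


# Small sizes for the demo; the slide uses width 128 and 7000 rows.
embed = ToyHashEmbed(width=8, n_rows=50)
known = embed("apple")
unseen = embed("Einkommensteuererklärung")  # no vocabulary lookup needed
```

Because no string-to-index dictionary is involved, a word never seen in training still gets a vector, and the table size stays fixed as new vocabulary arrives.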

trigram_cnn = (
    ExtractWindow(nW=1) >> LayerNorm(Maxout(128))
)
encode_context = (
    embed_word
    >> Residual(trigram_cnn)
    >> Residual(trigram_cnn)
    >> Residual(trigram_cnn)
    >> Residual(trigram_cnn)
)

CNN makes model faster and less brittle
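The `|` and `>>` notation and the window and residual steps can be mimicked in plain Python. The sketch below is a toy under stated assumptions: the `Layer` class, `extract_window` and `residual` are hypothetical stand-ins written for this example, operating on lists of per-token vectors rather than on real tensors.

```python
class Layer:
    """Wrap a function over a list of token vectors and overload >> and |."""

    def __init__(self, func):
        self.func = func

    def __call__(self, vectors):
        return self.func(vectors)

    def __rshift__(self, other):
        # a >> b: function composition, feed a's output into b
        return Layer(lambda vectors: other(self(vectors)))

    def __or__(self, other):
        # a | b: concatenate the two outputs token by token
        return Layer(
            lambda vectors: [x + y for x, y in zip(self(vectors), other(vectors))]
        )


def extract_window(vectors, n_w=1):
    """Concatenate each token vector with n_w neighbours on each side (zero-padded)."""
    pad = [0.0] * len(vectors[0])
    out = []
    for i in range(len(vectors)):
        row = []
        for j in range(i - n_w, i + n_w + 1):
            row.extend(vectors[j] if 0 <= j < len(vectors) else pad)
        out.append(row)
    return out


def residual(layer):
    """Add the layer's output to its input, element by element."""
    return Layer(
        lambda vectors: [
            [x + y for x, y in zip(v, o)] for v, o in zip(vectors, layer(vectors))
        ]
    )


identity = Layer(lambda vectors: vectors)
double = Layer(lambda vectors: [[2 * x for x in v] for v in vectors])
window = Layer(extract_window)

concatenated = (identity | double)([[1.0]])
composed = (double >> double)([[1.0]])
trigrams = window([[1.0], [2.0], [3.0]])
skipped = residual(double)([[1.0, 2.0]])
```

With a window of one token on each side, every stacked layer widens the context by one more token in each direction, so four such layers let each output position depend on up to four neighbours on either side.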

tensor = trigram_cnn(embed_word(doc))
state_weights = state2vec(tensor)
state = initialize_state(doc)
while not state.is_finished:
    features = get_features(state, state_weights)
    probs = mlp(features)
    action = (probs * valid_actions(state)).argmax()
    state = do_action(action, state)

Transition-based approach makes joint modelling easy

[Figure: parser state shown step by step over the words "Google Reader was cancelled in October 2011"]
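The greedy loop (score the actions, mask the invalid ones, take the argmax, apply it) can be sketched on the example words as follows. Several things here are assumptions made for illustration: the sketch uses a plain arc-standard transition system with no segmentation actions, `toy_scores` is a stand-in for the neural scorer that simply consults a table of reference heads, and the head assignments themselves are one plausible analysis, not taken from the slides.

```python
SHIFT, LEFT, RIGHT = 0, 1, 2


def valid_actions(stack, buffer):
    """Mask: 1.0 for actions allowed in this state, 0.0 otherwise."""
    can_shift = 1.0 if buffer else 0.0
    can_arc = 1.0 if len(stack) >= 2 else 0.0
    return [can_shift, can_arc, can_arc]


def toy_scores(stack, buffer, reference):
    """Stand-in for the neural scorer: favours the action the reference heads imply."""
    if len(stack) >= 2:
        top, below = stack[-1], stack[-2]
        if reference[below] == top:
            return [0.1, 0.8, 0.1]
        # Only attach `top` once all of its own dependents have been consumed.
        pending = any(reference[i] == top for i in buffer)
        if reference[top] == below and not pending:
            return [0.1, 0.1, 0.8]
    return [0.8, 0.1, 0.1]


def parse(n_tokens, reference):
    stack, buffer = [], list(range(n_tokens))
    heads = [None] * n_tokens  # the root keeps None
    while buffer or len(stack) > 1:
        probs = toy_scores(stack, buffer, reference)
        masked = [p * m for p, m in zip(probs, valid_actions(stack, buffer))]
        action = masked.index(max(masked))
        if action == SHIFT:
            stack.append(buffer.pop(0))
        elif action == LEFT:
            dependent = stack.pop(-2)  # second-from-top attaches to top
            heads[dependent] = stack[-1]
        else:
            dependent = stack.pop()  # top attaches to second-from-top
            heads[dependent] = stack[-1]
    return heads


words = ["Google", "Reader", "was", "cancelled", "in", "October", "2011"]
# Assumed heads (token indices): "cancelled" is the root.
reference_heads = [1, 3, 3, None, 3, 4, 5]
predicted = parse(len(words), reference_heads)
```

Multiplying the scores by the validity mask is what keeps the parser from choosing an impossible action, and adding further action types to the same loop is what makes joint modelling straightforward in a transition-based design.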

spaCy for other languages
- learning to merge tokens
- learning to split tokens
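One way to picture "learning to merge" is to start from single characters and let the model decide, for each one, whether it joins the character that follows. The sketch below only applies such decisions; in a real system they would be predicted by the parser. The function `apply_merges` and the hand-written decisions are hypothetical, chosen for illustration.

```python
def apply_merges(chars, merge_with_next):
    """Build words from characters given one merge decision per character."""
    words, current = [], ""
    for char, merge in zip(chars, merge_with_next):
        current += char
        if not merge:
            words.append(current)
            current = ""
    if current:  # a trailing merge with nothing after it still yields a word
        words.append(current)
    return words


chars = list("これは文章です。")
# Hand-written decisions standing in for model predictions.
decisions = [True, False, False, True, False, True, False, False]
words = apply_merges(chars, decisions)
```

Splitting is the reverse case: a single whitespace-delimited chunk has to be broken into several syntactic tokens.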

What’s the current progress?
- implemented learning to merge
- working on learning to split
- ranking ~2nd place on the CoNLL 2017 benchmark
- great results for Chinese, Vietnamese, Japanese
- joint model consistently better than pipeline

Results on CoNLL 2017 (1)

English     LAS   UAS   Sent  Word
Pipe        79.0  81.8  73.2  98.7
Joint       80.2  83.1  77.3  98.7

Chinese     LAS   UAS   Sent  Word
Pipe        57.1  61.8  98.2  88.9
Joint       63.7  68.3  99.1  92.5
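For readers unfamiliar with the column names: UAS counts tokens whose predicted head is correct, and LAS additionally requires the dependency label to match. The sketch below computes both under a simplifying assumption, namely that the predicted and reference analyses share the same tokens. The shared task itself has to align tokens first, because systems may segment differently. The function name and the sample data are invented for illustration.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages for two lists of (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("this sketch assumes identical tokenization")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * correct_heads / n, 100.0 * correct_both / n


# Invented four-token example: one wrong label, one wrong head.
gold = [(1, "nsubj"), (-1, "root"), (1, "obj"), (1, "punct")]
predicted = [(1, "nsubj"), (-1, "root"), (1, "nmod"), (0, "punct")]
uas, las = attachment_scores(gold, predicted)
```

LAS can never exceed UAS under this definition, which matches the ordering of the two columns in the tables.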

Results on CoNLL 2017 (2)

Japanese    LAS   UAS   Sent  Word
Pipe        73.1  74.4  94.9  89.7
Joint       78.5  80.0  95.5  92.9

Vietnamese  LAS   UAS   Sent  Word
Pipe        39.1  44.2  92.6  82.5
Joint       44.7  50.8  91.8  86.7
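The size of the joint model's advantage is easier to see as a difference. This short sketch takes the LAS figures reported in the two results tables and subtracts pipeline from joint. Only the variable names are new.

```python
# (pipeline LAS, joint LAS) as reported in the results tables
las = {
    "English": (79.0, 80.2),
    "Chinese": (57.1, 63.7),
    "Japanese": (73.1, 78.5),
    "Vietnamese": (39.1, 44.7),
}

# Round to one decimal to avoid floating-point noise in the differences.
las_gain = {
    language: round(joint - pipe, 1) for language, (pipe, joint) in las.items()
}

for language, gain in las_gain.items():
    print(f"{language}: +{gain} LAS")
```

The gain is smallest for English, where word segmentation is already at 98.7 for both systems, and larger for the three languages where the joint model also improves the Word score.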

Workflow of the future
- start with pre-trained models
- same representation across languages
- parse tree enables powerful rule-based matching
- updateable models for accuracy on your domain
- rapid iteration and data annotation

Thanks!
Explosion AI: explosion.ai
Follow us on Twitter: @honnibal, @_inesmontani, @explosion_ai
