Multi-lingual natural language understanding with spaCy

Slide 1

Slide 1 text

Multi-lingual natural language understanding with spaCy
Matthew Honnibal
Explosion AI

Slide 2

Slide 2 text

Explosion AI is a digital studio specialising in Artificial Intelligence and Natural Language Processing.
Open-source library for industrial-strength Natural Language Processing
spaCy’s next-generation Machine Learning library for deep learning with text
Coming soon: pre-trained, customisable models for a variety of languages and domains
A radically efficient data collection and annotation tool, powered by active learning

Slide 3

Slide 3 text

Matthew Honnibal, co-founder
PhD in Computer Science in 2009. 10 years publishing research on state-of-the-art natural language understanding systems. Left academia in 2014 to develop spaCy.
Ines Montani, co-founder
Programmer and front-end developer with a degree in media science and linguistics. Has been working on spaCy since its first release. Lead developer of Prodigy.

Slide 4

Slide 4 text

“I don’t get it. Can you explain like I’m five?”
Think of us as a boutique kitchen.
free recipes published online = open-source software
catering for select events = consulting
a line of kitchen gadgets = downloadable tools
soon: a line of fancy sauces and spice mixes you can use at home = pre-trained models

Slide 5

Slide 5 text

Joint transition-based segmentation and parsing

Slide 6

Slide 6 text

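import spacy
nlp = spacy.load("en_core_web_sm")  # assumed model name; the slide does not show one
doc = nlp(u"Apple is looking at buying U.K. startup")
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_, token.head.text,
          list(token.lefts), list(token.rights))  # lefts/rights are generators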

Slide 7

Slide 7 text

What’s parsing good for?

Slide 8

Slide 8 text

Trees are the truth
sentences are tree-structured
dependencies can be arbitrarily long in string space
syntax is application-independent

Slide 9

Slide 9 text

Trees are the truth
sentences are tree-structured ... but they’re read and written in order
dependencies can be arbitrarily long in string space ... but they’re usually short
syntax is application-independent
Learn the language once, apply it many times.
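The claim that dependencies are usually short can be checked by measuring the distance between each word and its head. The sketch below does this for the sentence from Slide 6. The head indices are a hand-written annotation made up for illustration, not output from spaCy, and `arc_lengths` is a hypothetical helper name.

```python
# Hypothetical hand-annotated parse (not spaCy output): each token's head index.
words = ["Apple", "is", "looking", "at", "buying", "U.K.", "startup"]
heads = [2, 2, 2, 2, 3, 6, 4]  # the root ("looking") points at itself


def arc_lengths(heads):
    """Distance in tokens between each non-root word and its head."""
    return [abs(i - h) for i, h in enumerate(heads) if i != h]


lengths = arc_lengths(heads)
print(lengths)       # [2, 1, 1, 1, 1, 2]
print(max(lengths))  # 2
```

In this toy parse no arc spans more than two tokens, even though nothing in the tree formalism prevents a much longer one.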

Slide 10

Slide 10 text

Whitespace != Word
זהו משפט.
income tax return
Einkommensteuererklärung
C’est une phrase.
これは文章です。
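A quick way to see the mismatch is to split the slide's examples on whitespace and count the pieces. This is only an illustration of the problem using Python's built-in `str.split`; it says nothing about how spaCy tokenizes these strings.

```python
examples = {
    "English": "income tax return",
    "German": "Einkommensteuererklärung",
    "French": "C'est une phrase.",
    "Japanese": "これは文章です。",
}


def whitespace_tokens(text):
    """Naive tokenizer: split on runs of whitespace."""
    return text.split()


counts = {lang: len(whitespace_tokens(text)) for lang, text in examples.items()}
for lang, text in examples.items():
    print(lang, whitespace_tokens(text))
```

The same concept comes out as three pieces in English and one in German, the French clitic and the final full stop stay glued to their neighbours, and the whole Japanese sentence is returned as a single piece.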

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

How the parser works

Slide 14

Slide 14 text

Model design goals
must be updateable (including learning new vocab)
must transfer well (including on fragments)
must run fast on CPU
pipelines are bad

Slide 15

Slide 15 text

features = doc2array([NORM, PREFIX, SUFFIX, SHAPE])
norm = get_col(0) >> HashEmbed(128, 7000)
prefix = get_col(1) >> HashEmbed(128, 7000)
suffix = get_col(2) >> HashEmbed(128, 7000)
shape = get_col(3) >> HashEmbed(128, 7000)
embed_word = (
    (norm | prefix | suffix | shape)
    >> LayerNorm(Maxout(128, pieces=3))
)

Hash Embeddings allow flexible vocabulary

Notation: | is function concatenation, >> is function composition
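The idea behind a hash embedding can be sketched without any deep learning library: instead of one table row per known word, every string is hashed to a few rows of a fixed-size table and those rows are summed, so an unseen word still gets a vector. The code below is a generic illustration of that hashing trick, not the library's implementation. The helper names, the use of four salted MD5 hashes, and the reading of the slide's `128` and `7000` as vector width and row count are all assumptions.

```python
import hashlib
import random


def make_table(n_rows, width, seed=0):
    """Fixed-size table of small random vectors (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(n_rows)]


def hash_rows(key, n_rows, n_hashes=4):
    """Map a string to n_hashes row indices using differently salted hashes."""
    rows = []
    for salt in range(n_hashes):
        digest = hashlib.md5(f"{salt}:{key}".encode("utf8")).digest()
        rows.append(int.from_bytes(digest[:8], "little") % n_rows)
    return rows


def hash_embed(key, table, n_hashes=4):
    """Sum the table rows selected by the hashes of `key`."""
    width = len(table[0])
    vector = [0.0] * width
    for row in hash_rows(key, len(table), n_hashes):
        for j in range(width):
            vector[j] += table[row][j]
    return vector


table = make_table(n_rows=7000, width=128)
seen = hash_embed("startup", table)
unseen = hash_embed("Einkommensteuererklärung", table)  # no vocabulary lookup needed
print(len(seen), len(unseen))
```

Because two words rarely collide on all of their rows at once, summing several hashed rows keeps most words distinguishable while the table stays a fixed size.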

Slide 16

Slide 16 text

trigram_cnn = (
    ExtractWindow(nW=1)
    >> LayerNorm(Maxout(128))
)
encode_context = (
    embed_word
    >> Residual(trigram_cnn)
    >> Residual(trigram_cnn)
    >> Residual(trigram_cnn)
    >> Residual(trigram_cnn)
)

CNN makes model faster and less brittle

Notation: | is function concatenation, >> is function composition
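To make the window step concrete, here is a plain-Python sketch of what a window of one word on each side does to a sequence of vectors, and of how far context reaches after stacking four such layers. The function names, the zero padding at the edges, and the left-centre-right concatenation order are assumptions for illustration; the slide does not specify them.

```python
def extract_window(vectors, n_w=1):
    """Concatenate each vector with its n_w neighbours on each side (zero-padded)."""
    width = len(vectors[0])
    pad = [0.0] * width
    padded = [pad] * n_w + list(vectors) + [pad] * n_w
    windows = []
    for i in range(len(vectors)):
        window = []
        for neighbour in padded[i : i + 2 * n_w + 1]:
            window.extend(neighbour)
        windows.append(window)
    return windows


def receptive_field(n_layers, n_w=1):
    """Number of input words that can influence one output after n_layers windows."""
    return 2 * n_w * n_layers + 1


vectors = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
windows = extract_window(vectors)
print(windows[1])          # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
print(receptive_field(4))  # 9
```

With four stacked layers each output vector can draw on up to four words either side of its own position, which is one way to read the encoder on this slide.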

Slide 17

Slide 17 text

tensor = trigram_cnn(embed_word(doc))
state_weights = state2vec(tensor)
state = initialize_state(doc)
while not state.is_finished:
    features = get_features(state, state_weights)
    probs = mlp(features)
    action = (probs * valid_actions(state)).argmax()
    state = do_action(action, state)

Transition-based approach makes joint modelling easy
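The loop above can be run end to end with a toy transition system. The sketch below uses a simple arc-standard action set (SHIFT, LEFT, RIGHT), which is an assumption: the slides do not say which transition system spaCy uses. The neural network is replaced by `toy_model`, a hypothetical stand-in that reads the right action off a hand-written gold tree, so that the masking and argmax steps can be seen in isolation.

```python
SHIFT, LEFT, RIGHT = 0, 1, 2


class State:
    def __init__(self, n_words):
        self.stack = []
        self.buffer = list(range(n_words))
        self.heads = [None] * n_words  # None = not attached yet

    @property
    def is_finished(self):
        return not self.buffer and len(self.stack) == 1


def valid_actions(state):
    """1 for actions that are legal in this state, 0 otherwise."""
    can_shift = 1 if state.buffer else 0
    can_arc = 1 if len(state.stack) >= 2 else 0
    return [can_shift, can_arc, can_arc]


def do_action(action, state):
    if action == SHIFT:
        state.stack.append(state.buffer.pop(0))
    elif action == LEFT:  # second-from-top becomes a child of the top
        child = state.stack.pop(-2)
        state.heads[child] = state.stack[-1]
    else:  # RIGHT: top becomes a child of second-from-top
        child = state.stack.pop()
        state.heads[child] = state.stack[-1]
    return state


def toy_model(state, gold):
    """Stand-in for the network: favours the action implied by a gold tree."""
    probs = [0.1, 0.1, 0.1]
    best = SHIFT
    if len(state.stack) >= 2:
        s1, s0 = state.stack[-2], state.stack[-1]
        children_done = all(
            state.heads[d] is not None
            for d, h in enumerate(gold)
            if h == s0 and d != s0
        )
        if gold[s1] == s0:
            best = LEFT
        elif gold[s0] == s1 and children_done:
            best = RIGHT
    probs[best] = 0.8
    return probs


def parse(gold):
    state = State(len(gold))
    while not state.is_finished:
        probs = toy_model(state, gold)
        mask = valid_actions(state)
        scores = [p * m for p, m in zip(probs, mask)]  # zero out illegal actions
        action = scores.index(max(scores))             # greedy argmax
        state = do_action(action, state)
    return state.heads


# Hand-written tree for "Apple is looking at buying U.K. startup"; root points at itself.
gold_heads = [2, 2, 2, 2, 3, 6, 4]
print(parse(gold_heads))
```

The result is the gold tree with the root left unattached. Joint modelling fits this design because a new decision, such as a sentence boundary or a token merge, is just one more action in the same loop.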

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

Google

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Reader

Slide 23

Slide 23 text

was  Reader

Slide 24

Slide 24 text

Reader 

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

cancelled 

Slide 27

Slide 27 text

in  cancelled

Slide 28

Slide 28 text

October in  cancelled

Slide 29

Slide 29 text

2011 October in  cancelled

Slide 30

Slide 30 text

No content

Slide 31

Slide 31 text

spaCy for other languages
learning to merge tokens
learning to split tokens
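One way to picture "learning to merge" is to start from over-segmented units, such as single characters, and decide for each unit whether it should be joined to the next one. The sketch below applies such decisions to the Japanese sentence from Slide 10. The merge flags are written by hand here, and `merge_tokens` is a hypothetical helper; how the model in the talk predicts these decisions is not shown in the slides.

```python
def merge_tokens(pieces, merge_with_next):
    """Join each piece to the following one wherever the flag is True."""
    words, current = [], ""
    for piece, merge in zip(pieces, merge_with_next):
        current += piece
        if not merge:
            words.append(current)
            current = ""
    if current:  # a trailing True flag leaves an unfinished word
        words.append(current)
    return words


pieces = list("これは文章です。")  # one piece per character
# Hand-written decisions (assumption): これ / は / 文章 / です / 。
merge_with_next = [True, False, False, True, False, True, False, False]
words = merge_tokens(pieces, merge_with_next)
print(words)  # ['これ', 'は', '文章', 'です', '。']
```

Splitting is the reverse problem: the input unit is too coarse and has to be broken up, which the next slide lists as still in progress.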

Slide 32

Slide 32 text

What’s the current progress?
implemented learning to merge
working on learning to split
ranking ~2nd place on the CoNLL 2017 benchmark
great results for Chinese, Vietnamese, Japanese
joint model consistently better than pipeline

Slide 33

Slide 33 text

Results on CoNLL 2017 (1)

English   LAS   UAS   Sent  Word
Pipe      79.0  81.8  73.2  98.7
Joint     80.2  83.1  77.3  98.7

Chinese   LAS   UAS   Sent  Word
Pipe      57.1  61.8  98.2  88.9
Joint     63.7  68.3  99.1  92.5
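For readers unfamiliar with the column names: UAS (unlabelled attachment score) counts tokens whose predicted head is correct, and LAS (labelled attachment score) additionally requires the dependency label to match. The sketch below computes both on a made-up four-token example. It assumes the predicted and gold tokens line up one to one, which is a simplification: when segmentation is predicted too, as in these results, scoring has to align system words with gold words first.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) as percentages for aligned (head, label) pairs."""
    assert len(gold) == len(pred)
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return 100 * uas, 100 * las


# Made-up example: (head index, dependency label) per token.
gold = [(2, "nsubj"), (2, "aux"), (2, "ROOT"), (2, "prep")]
pred = [(1, "nsubj"), (2, "aux"), (2, "ROOT"), (2, "advmod")]  # one wrong head, one wrong label

uas, las = attachment_scores(gold, pred)
print(uas, las)  # 75.0 50.0
```

LAS can never exceed UAS, since every token counted for LAS also has the correct head.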

Slide 34

Slide 34 text

Results on CoNLL 2017 (2)

Japanese    LAS   UAS   Sent  Word
Pipe        73.1  74.4  94.9  89.7
Joint       78.5  80.0  95.5  92.9

Vietnamese  LAS   UAS   Sent  Word
Pipe        39.1  44.2  92.6  82.5
Joint       44.7  50.8  91.8  86.7
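The size of the joint model's advantage is easier to read as differences. The snippet below copies the numbers from the two results slides and subtracts the pipeline score from the joint score for each language and column. Only the arithmetic is added; the variable names are arbitrary.

```python
columns = ["LAS", "UAS", "Sent", "Word"]
results = {
    "English":    {"Pipe": [79.0, 81.8, 73.2, 98.7], "Joint": [80.2, 83.1, 77.3, 98.7]},
    "Chinese":    {"Pipe": [57.1, 61.8, 98.2, 88.9], "Joint": [63.7, 68.3, 99.1, 92.5]},
    "Japanese":   {"Pipe": [73.1, 74.4, 94.9, 89.7], "Joint": [78.5, 80.0, 95.5, 92.9]},
    "Vietnamese": {"Pipe": [39.1, 44.2, 92.6, 82.5], "Joint": [44.7, 50.8, 91.8, 86.7]},
}

# Joint minus pipeline, rounded to one decimal to hide floating-point noise.
gains = {
    lang: {col: round(j - p, 1) for col, p, j in zip(columns, rows["Pipe"], rows["Joint"])}
    for lang, rows in results.items()
}
regressions = [(lang, col) for lang, g in gains.items() for col, d in g.items() if d < 0]

for lang, g in gains.items():
    print(lang, g)
print(regressions)  # [('Vietnamese', 'Sent')]
```

On LAS and UAS the joint model is ahead for all four languages, by 1.2 to 6.6 points. The only negative difference in these tables is Vietnamese sentence segmentation (-0.8).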

Slide 35

Slide 35 text

Workflow of the future
start with pre-trained models
same representation across languages
parse tree enables powerful rule-based matching
updateable models for accuracy on your domain
rapid iteration and data annotation

Slide 36

Slide 36 text

Thanks!
Explosion AI: explosion.ai
Follow us on Twitter: @honnibal, @_inesmontani, @explosion_ai