A Natural Language Pipeline
ddqz
July 06, 2019
Presentation from the spaCy IRL 2019 conference.
Transcript
A Natural Language Pipeline
More Input
“A compendium of human... Knowledge”
Library
Physical archives became digital records, encoded with metadata
The internet promised rich dynamic experiences
The internet promised rich dynamic experiences but served us banner ads
Advertising has fueled, and continues to fuel, a substantial portion of the innovation on the internet
What would The Economist look like if it were founded in 2012?
User
First
Experience
“There’s a reason that tech companies are topping the lists of most valuable companies and brands. Every company is a tech company.”
Maggie Chan Jones
Every story, at its core, is a business story
Language
Proto-Pipeline
Stage -> Stenographer -> Editors -> spaCy -> Data Store <-> Backend <- Slack <- Users
Over eight hours we created data from the content of the event, building the model in real time
The model evolved over time
This was the experiment that would evolve into SiO2
Silicon, a key element in everything from glass to microchips, is at the core of global business
Oxygen, the journalistic voice Quartz breathes into the global business news cycle
Entities are linguistic anchors: they are defined by context, and context can be inferred around them
Standard Entities: PERSON, FACILITY, ORG, PRODUCT, GPE, EVENT...
Additional Entities: TECHNOLOGY, PROCESS, NATURE, MEDIA, CONSTRUCT
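The two label sets above can be written down as plain data. The following is a minimal sketch, not the model from the talk: the gazetteer, its entries, the labels assigned to them, and the `tag_entities` helper are all hypothetical stand-ins for a trained spaCy model, and the standard list is left incomplete because the slide elides it.

```python
# Label sets copied from the slide. The slide ends the standard list with
# "...", so STANDARD_LABELS is intentionally incomplete.
STANDARD_LABELS = ["PERSON", "FACILITY", "ORG", "PRODUCT", "GPE", "EVENT"]
ADDITIONAL_LABELS = ["TECHNOLOGY", "PROCESS", "NATURE", "MEDIA", "CONSTRUCT"]
ALL_LABELS = STANDARD_LABELS + ADDITIONAL_LABELS

# Hypothetical phrase list standing in for a trained model. The label chosen
# for each phrase is an assumption made for illustration only.
GAZETTEER = {
    "Quartz": "ORG",
    "silicon": "NATURE",
    "microchips": "TECHNOLOGY",
}


def tag_entities(text):
    """Return sorted (start, end, label) character spans for known phrases."""
    spans = []
    lowered = text.lower()
    for phrase, label in GAZETTEER.items():
        if label not in ALL_LABELS:
            raise ValueError(f"unknown label: {label}")
        needle = phrase.lower()
        start = lowered.find(needle)
        while start != -1:
            spans.append((start, start + len(needle), label))
            start = lowered.find(needle, start + 1)
    return sorted(spans)
```

A statistical model would infer these spans from context rather than look them up; the sketch only shows the shape of the label inventory and of character-offset annotations.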
70K articles, 1.4M blocks of text, 85K labeled sentences
Entities
This spaCy model made rich, on-the-fly analysis of any given text easy
Stored analysis of a large corpus is a vital resource
The language graph...
Graph
The language graph is a mutable map of the language model
Any new content is analyzed and then mapped onto the language graph
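One way to picture mapping analyzed content onto a graph is an entity co-occurrence structure. This is a sketch under assumptions: the talk does not describe the graph's schema, so the `LanguageGraph` class, its node and edge layout, and the `add_document` method are hypothetical, and the entity list is taken as already produced by an upstream analysis step.

```python
from collections import defaultdict
from itertools import combinations


class LanguageGraph:
    """Hypothetical graph: entities are nodes, co-occurrence counts are edges."""

    def __init__(self):
        self.nodes = {}                # entity text -> entity label
        self.edges = defaultdict(int)  # (entity_a, entity_b) -> count

    def add_document(self, entities):
        """Map one analyzed document onto the graph.

        entities: list of (text, label) pairs found in the document.
        """
        for text, label in entities:
            self.nodes[text] = label
        # Count each unordered pair once per document, with a stable key order.
        names = sorted({text for text, _ in entities})
        for a, b in combinations(names, 2):
            self.edges[(a, b)] += 1


graph = LanguageGraph()
graph.add_document([("Elon Musk", "PERSON"), ("Jeff Bezos", "PERSON")])
graph.add_document([
    ("Elon Musk", "PERSON"),
    ("Jeff Bezos", "PERSON"),
    ("Mark Zuckerberg", "PERSON"),
])
```

Because the graph is mutable, corrections to `nodes` or `edges` can be made in place before the structure is used to produce the next round of training data.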
Changes made to the graph can then be incorporated into the next model iteration
The language graph becomes a primary resource for extracting training data
Snapshots of time can be extracted from the language graph
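A snapshot could be as simple as filtering dated observations to a time window. This is a sketch that assumes each relationship is stored with the date it was observed; the talk does not say how time is recorded, and the `OBSERVATIONS` data, entity names, and `snapshot` function are invented for illustration.

```python
from datetime import date

# Hypothetical dated observations: (date seen, entity, entity).
OBSERVATIONS = [
    (date(2018, 5, 1), "entity_a", "entity_b"),
    (date(2018, 9, 12), "entity_b", "entity_a"),
    (date(2019, 2, 3), "entity_a", "entity_c"),
]


def snapshot(observations, start, end):
    """Count co-occurrences observed between start and end, inclusive."""
    counts = {}
    for seen_on, a, b in observations:
        if start <= seen_on <= end:
            key = tuple(sorted((a, b)))  # treat edges as undirected
            counts[key] = counts.get(key, 0) + 1
    return counts


graph_2018 = snapshot(OBSERVATIONS, date(2018, 1, 1), date(2018, 12, 31))
```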
Context can be derived by looking at the relationships in the language graph
Elon Musk
Jeff Bezos
Mark Zuckerberg
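Deriving context from relationships can be sketched as a weighted neighbor lookup. The edge weights and the `topic_*` nodes below are invented placeholders, not data from the talk, and `context_for` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical co-occurrence counts; the topic names and numbers are
# placeholders and carry no real-world meaning.
EDGES = {
    ("Elon Musk", "topic_a"): 3,
    ("Elon Musk", "topic_b"): 1,
    ("Jeff Bezos", "topic_b"): 4,
    ("Mark Zuckerberg", "topic_c"): 2,
}


def context_for(entity, edges, top_n=5):
    """Return the entity's strongest neighbors as (neighbor, weight) pairs."""
    neighbors = defaultdict(int)
    for (a, b), weight in edges.items():
        if a == entity:
            neighbors[b] += weight
        elif b == entity:
            neighbors[a] += weight
    # Strongest first; ties are broken alphabetically for stable output.
    ranked = sorted(neighbors.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Calling `context_for("Elon Musk", EDGES)` returns that node's neighbors ordered by weight, which is one simple reading of "context" for an entity.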
Context
SiO2 is a living Natural Language Pipeline of networked algorithms trained on the corpus of Quartz to understand the linguistic patterns of global business news
The Pipeline(s)
Quartz Corpus -> Training Sentences -> spaCy
Content -> spaCy -> Language Graph
Language Graph -> Training Data -> Statistical Models / Classifiers
Language Graph -> Training Sentences -> spaCy
Unseen Content -> spaCy -> Pre-Processed Text / Vectors -> Statistical Models / Classifiers
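The last flow (Unseen Content -> spaCy -> Pre-Processed Text / Vectors -> Statistical Models / Classifiers) can be read as a chain of stages, each consuming the previous stage's output. This is a rough sketch with toy stand-ins: `analyze` is not spaCy, `vectorize` is a plain word count, `classify` is a single keyword rule, and all function names and category names are hypothetical.

```python
from functools import reduce


def run_pipeline(stages, data):
    """Feed data through each stage in order and return the final result."""
    return reduce(lambda value, stage: stage(value), stages, data)


def analyze(text):
    """Stand-in for the spaCy step: lowercase whitespace tokenization."""
    return text.lower().split()


def vectorize(tokens):
    """Stand-in for vectors: a bag-of-words count dictionary."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


def classify(vector):
    """Stand-in classifier: one hand-written keyword rule."""
    return "business" if vector.get("market", 0) > 0 else "other"


UNSEEN_CONTENT_PIPELINE = [analyze, vectorize, classify]
label = run_pipeline(UNSEEN_CONTENT_PIPELINE, "The market rallied")
```

Keeping each stage a plain function makes it easy to swap one out, for example replacing the toy classifier with a trained model, without touching the rest of the chain.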
Thank you