A Natural Language Pipeline
ddqz
July 06, 2019
Presentation from the spaCy IRL 2019 conference.
Transcript
A Natural Language Pipeline
More Input
“A compendium of human... Knowledge”
Library
Physical archives became digital records, encoded with metadata
The internet promised rich dynamic experiences
The internet promised rich dynamic experiences but served us banner ads
Advertising has fueled, and continues to fuel, a substantial portion of the innovation on the internet
What would The Economist look like if it were founded in 2012?
User
First
Experience
“There’s a reason that tech companies are topping the lists of most valuable companies and brands. Every company is a tech company.”
Maggie Chan Jones
Every story, at its core, is a business story
Language
Proto-Pipeline
Stage -> Stenographer -> Editors -> spaCy -> Data Store <-> Backend <- Slack <- Users
Over eight hours we created data from the content of the event, building the model in real time
The model evolved over time
This was the experiment that would evolve into SiO2
Silicon, a key element in everything from glass to microchips, is at the core of global business
Oxygen, the journalistic voice Quartz breathes into the global business news cycle
Entities are linguistic anchors, defined by context and around which context can be inferred
Standard Entities: PERSON, FACILITY, ORG, PRODUCT, GPE, EVENT...
Additional Entities: TECHNOLOGY, PROCESS, NATURE, MEDIA, CONSTRUCT
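To make the two label groups concrete, here is a minimal sketch of how a labeled sentence could be checked against the combined label set. It uses only the Python standard library, not spaCy itself. The helper names (`span_of`, `validate_example`) and the sample sentence are assumptions for illustration, and the standard list contains only the labels named on the slide, which trails off with an ellipsis.

```python
# Label names are taken from the slide; the standard list there is elided
# ("EVENT..."), so this set holds only the labels that were actually shown.
STANDARD_LABELS = {"PERSON", "FACILITY", "ORG", "PRODUCT", "GPE", "EVENT"}
ADDITIONAL_LABELS = {"TECHNOLOGY", "PROCESS", "NATURE", "MEDIA", "CONSTRUCT"}
ALL_LABELS = STANDARD_LABELS | ADDITIONAL_LABELS


def span_of(text, phrase, label):
    """Hypothetical helper: character span of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def validate_example(text, spans):
    """Check (start, end, label) spans against the text and the label set."""
    cleaned = []
    last_end = 0
    for start, end, label in sorted(spans):
        if label not in ALL_LABELS:
            raise ValueError(f"unknown label: {label}")
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end)}")
        if start < last_end:
            raise ValueError(f"overlapping span: {(start, end)}")
        cleaned.append((start, end, label))
        last_end = end
    return text, {"entities": cleaned}


# Illustrative sentence and labels, not taken from the Quartz corpus.
sentence = "Silicon is used in glass and microchips."
example = validate_example(
    sentence,
    [
        span_of(sentence, "microchips", "TECHNOLOGY"),
        span_of(sentence, "Silicon", "NATURE"),
    ],
)
```

The `(text, {"entities": [...]})` shape is one common way to represent character-offset entity annotations; whether the talk's labeled sentences were stored this way is not stated.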
70K articles
1.4M blocks of text
85K labeled sentences
Entities
This spaCy model made rich, on-the-fly analysis of any given text easy
Stored analysis of a large corpus is a vital resource
The language graph...
Graph
The language graph is a mutable map of the language model
Any new content is analyzed and then mapped onto the language graph
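The slides do not say how content is mapped onto the graph. As one possible reading, the sketch below treats each recognized entity as a node and counts sentence-level co-occurrence as edges. The `LanguageGraph` class, its methods, and the sample entities are all assumptions for illustration, and the entity lists stand in for the output of an NER step.

```python
from collections import defaultdict
from itertools import combinations


class LanguageGraph:
    """Toy graph: nodes are (text, label) entities, edges count co-occurrence."""

    def __init__(self):
        self.node_counts = defaultdict(int)
        self.edge_counts = defaultdict(int)

    def add_sentence(self, entities):
        """Map one analyzed sentence, given as (text, label) pairs, onto the graph."""
        unique = sorted(set(entities))
        for node in unique:
            self.node_counts[node] += 1
        # Every pair of entities in the same sentence strengthens one edge.
        for a, b in combinations(unique, 2):
            self.edge_counts[(a, b)] += 1


# Hypothetical analyzed sentences; in practice these would come from the model.
graph = LanguageGraph()
graph.add_sentence([("Tesla", "ORG"), ("Elon Musk", "PERSON")])
graph.add_sentence([("Elon Musk", "PERSON"), ("SpaceX", "ORG")])
```

Sorting the entities before pairing keeps each undirected edge under a single key, so repeated mentions accumulate on the same counter.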
Changes made to the graph can then be incorporated into the next model iteration
The language graph becomes a primary resource for extracting training data
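One way to picture this feedback loop: if the graph keeps each entity's current label alongside the sentences that mention it, a correction made on the graph shows up in the next export of training sentences. The sketch below assumes that design; the class, method names, and sample sentence are hypothetical and not taken from the talk.

```python
class LanguageGraph:
    """Toy store of entity labels plus the sentences that mention them."""

    def __init__(self):
        self.labels = {}    # entity text -> current label
        self.mentions = []  # (sentence, entity text, start, end)

    def add_mention(self, sentence, entity_text, label):
        start = sentence.index(entity_text)
        self.labels.setdefault(entity_text, label)
        self.mentions.append(
            (sentence, entity_text, start, start + len(entity_text))
        )

    def relabel(self, entity_text, label):
        """A manual correction made on the graph."""
        self.labels[entity_text] = label

    def training_data(self):
        """Export sentences annotated with the labels currently in the graph."""
        by_sentence = {}
        for sentence, entity_text, start, end in self.mentions:
            span = (start, end, self.labels[entity_text])
            by_sentence.setdefault(sentence, []).append(span)
        return [(s, {"entities": sorted(spans)}) for s, spans in by_sentence.items()]


graph = LanguageGraph()
graph.add_mention("Quartz covers global business news.", "Quartz", "ORG")
graph.relabel("Quartz", "MEDIA")  # the edit flows into the next export
data = graph.training_data()
```

Because labels are looked up at export time rather than stored per mention, a single edit updates every sentence that mentions the entity.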
Snapshots of time can be extracted from the language graph
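If each relationship in the graph carries the date it was observed, a snapshot can be as simple as a date-range filter. The following sketch assumes that representation; the observations and dates are invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical dated co-occurrence observations (date, entity, entity).
OBSERVATIONS = [
    (date(2018, 3, 1), "Elon Musk", "Tesla"),
    (date(2019, 2, 10), "Elon Musk", "Tesla"),
    (date(2019, 5, 20), "Elon Musk", "SpaceX"),
]


def snapshot(observations, start, end):
    """Count relationships observed in the half-open range [start, end)."""
    counts = Counter()
    for day, a, b in observations:
        if start <= day < end:
            counts[(a, b)] += 1
    return counts


graph_2019 = snapshot(OBSERVATIONS, date(2019, 1, 1), date(2020, 1, 1))
```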
Context can be derived by looking at the relationships in the language graph
Elon Musk
Jeff Bezos
Mark Zuckerberg
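As a rough illustration of deriving context from relationships, the sketch below ranks an entity's neighbors by edge weight. The edge list and its weights are made up for the example, and `context_of` is a hypothetical helper, not something shown in the talk.

```python
from collections import Counter

# Hypothetical co-occurrence weights; the numbers are invented.
EDGES = {
    ("Elon Musk", "Tesla"): 5,
    ("Elon Musk", "SpaceX"): 3,
    ("Jeff Bezos", "Amazon"): 6,
    ("Jeff Bezos", "Blue Origin"): 2,
    ("Mark Zuckerberg", "Facebook"): 7,
}


def context_of(entity, edges, top_n=3):
    """Return the entity's strongest neighbors as (neighbor, weight) pairs."""
    neighbors = Counter()
    for (a, b), weight in edges.items():
        if a == entity:
            neighbors[b] += weight
        elif b == entity:
            neighbors[a] += weight
    return neighbors.most_common(top_n)


musk_context = context_of("Elon Musk", EDGES)
```

With these sample weights, `musk_context` is `[("Tesla", 5), ("SpaceX", 3)]`: the neighbors act as the context for the name.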
Context
SiO2 is a living Natural Language Pipeline of networked algorithms trained on the corpus of Quartz to understand the linguistic patterns of global business news
The Pipeline(s)
Quartz Corpus -> Training Sentences -> spaCy
Content -> spaCy -> Language Graph
Language Graph -> Training Data -> Statistical Models / Classifiers
Language Graph -> Training Sentences -> spaCy
Unseen Content -> spaCy -> Pre-Processed Text / Vectors -> Statistical Models / Classifiers
Thank you