Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A Natural Language Pipeline
Search
ddqz
July 06, 2019
Technology
0
500
A Natural Language Pipeline
Presentation from the spaCy IRL 2019 conference.
ddqz
July 06, 2019
Tweet
Share
Other Decks in Technology
See All in Technology
United™️ Airlines®️ Customer®️ USA Contact Numbers: Complete 2025 Support Guide
flyunitedguide
0
800
AI Ready API ─ AI時代に求められるAPI設計とは?/ AI-Ready API - Designing MCP and APIs in the AI Era
yokawasa
8
2.1k
cdk initで生成されるあのファイル達は何なのか/cdk-init-generated-files
tomoki10
1
670
QuickSight SPICE の効果的な運用戦略~S3 + Athena 構成での実践ノウハウ~/quicksight-spice-s3-athena-best-practices
emiki
0
290
SREの次のキャリアの道しるべ 〜SREがマネジメントレイヤーに挑戦して、 気づいたこととTips〜
coconala_engineer
1
4.4k
Snowflake Intelligenceという名のAI Agentが切り開くデータ活用の未来とその実現に必要なこと@SnowVillage『Data Management #1 Summit 2025 Recap!!』
ryo_suzuki
1
160
セキュアな社内Dify運用と外部連携の両立 ~AIによるAPIリスク評価~
zozotech
PRO
0
120
AWS CDK 入門ガイド これだけは知っておきたいヒント集
anank
5
750
三視点LLMによる複数観点レビュー
mhlyc
0
230
助けて! XからWaylandに移行しないと新しいGNOMEが使えなくなっちゃう 2025-07-12
nobutomurata
2
200
united airlines ™®️ USA Contact Numbers: Complete 2025 Support Guide
flyunitedhelp
1
470
ソフトウェアテストのAI活用_ver1.25
fumisuke
1
610
Featured
See All Featured
Scaling GitHub
holman
460
140k
The MySQL Ecosystem @ GitHub 2015
samlambert
251
13k
Testing 201, or: Great Expectations
jmmastey
43
7.6k
Understanding Cognitive Biases in Performance Measurement
bluesmoon
29
1.8k
Visualizing Your Data: Incorporating Mongo into Loggly Infrastructure
mongodb
47
9.6k
RailsConf & Balkan Ruby 2019: The Past, Present, and Future of Rails at GitHub
eileencodes
138
34k
Navigating Team Friction
lara
187
15k
Optimizing for Happiness
mojombo
379
70k
CoffeeScript is Beautiful & I Never Want to Write Plain JavaScript Again
sstephenson
161
15k
Designing Experiences People Love
moore
142
24k
Code Review Best Practice
trishagee
69
19k
ピンチをチャンスに:未来をつくるプロダクトロードマップ #pmconf2020
aki_iinuma
126
53k
Transcript
A Natural Language Pipeline
More Input
Knowledge” “A compendium of human...
Library
Physical archives became digital records, encoded with metadata
The internet promised rich dynamic experiences
The internet promised rich dynamic experiences but served us banner
ads
Advertising has and continues to fuel a substantial portion of
the innovation on the internet
What would The Economist look like if it were founded
in 2012?
User
First
Experience
“There’s a reason that tech companies are topping the lists
of most valuable companies and brands. Every company is a tech company.” Maggie Chan Jones
Every story, at its core, is a business story
Language
None
None
Stage -> Stenographer -> Editors -> spaCy -> Data Store
<-> Backend <- Slack <- Users Proto-Pipeline
Over eight hours we created data from the content of
the event, building the model in real-time
The model evolved over time
This was the experiment that would evolve into SiO 2
Silicon, a key element in everything from glass to microchips,
is at the core of global business
Oxygen, the journalistic voice Quartz breathes into the global business
news cycle
Entities are linguistic anchors, defined by context and around which
context can be inferred
Standard Entities PERSON FACILITY ORG PRODUCT GPE EVENT... Additional Entities
TECHNOLOGY PROCESS NATURE MEDIA CONSTRUCT
70K articles 1.4M blocks of text 85K labeled sentences
Entities
This spaCy model made rich analysis for any given text
easy to do on the fly
Stored analysis of a large corpus is a vital resource
The language graph...
Graph
The language graph is a mutable map of the language
model
Any new content is analyzed and then mapped onto the
language graph
Changes made to the graph can then be incorporated into
the next model iteration
The language graph becomes a primary resource for extracting training
data
Snapshots of time can be extracted from the language graph
Context can be derived by looking at the relationships in
the language graph
Elon Musk
Jeff Bezos
Mark Zuckerberg
Context
SiO 2 is a living Natural Language Pipeline of networked
algorithms trained on the corpus of Quartz to understand the linguistic patterns of global business news
The Pipeline(s) Quartz Corpus -> Training Sentences -> spaCy Content
-> spaCy -> Language Graph Language Graph -> Training Data -> Statistical Models / Classifiers Language Graph -> Training Sentences -> spaCy Unseen Content -> spaCy -> Pre-Processed Text / Vectors -> Statistical Models / Classifiers
Thank you