Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A Natural Language Pipeline
Search
ddqz
July 06, 2019
Technology
540
0
Share
A Natural Language Pipeline
Presentation from the spaCy IRL 2019 conference.
ddqz
July 06, 2019
Other Decks in Technology
See All in Technology
イベントストーミングとKiroの仕様駆動開発で実現する要件の認識合わせプロセス
syobochim
5
480
AIAgentと取り組むKaggle
508shuto
2
560
ラズパイ & Picoで入門:Zephyr(RTOS)の環境構築からビルドまでの紹介
iotengineer22
0
240
生成AIに振り回されない 〜確率論と決定論の使い分け〜
shukob
0
110
AI とサービス・デザイン / AI and Service Design
ks91
PRO
0
170
AI時代から振り返るTerraform drift運用の歴史 / AI Age Reflections on the History of Terraform Drift Operations
aeonpeople
0
390
Generative UI × A2UI で AI エージェントを作った話 AI-DLC も使ってみた!
kmiya84377
1
170
脅威をエンジニアリングの糧にして:恐怖を乗り越えた先にあったもの / Turn threats into fuel for engineering: what lay beyond overcoming fear
nrslib
1
290
なぜハノーバーメッセに行くべきなのか 〜初参加だから語れること〜
tanakaseiya
0
110
Gradle×GitHub_ActionsでCI時間を約50%短縮 ジョブ分割の設計と落とし穴 / Cutting CI Time by ~50% with Gradle and GitHub Actions: Job-Splitting Design and Pitfalls
takatty
0
140
AI駆動開発でなんでもハンズオン環境をつくってみた
yoshimi0227
0
140
データ基盤構築・運用の現場から 〜 Snowflake Intelligence 導入で変わった、データ活用の未来 〜
wonohe
0
180
Featured
See All Featured
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
287
14k
Raft: Consensus for Rubyists
vanstee
141
7.4k
Git: the NoSQL Database
bkeepers
PRO
432
67k
How to Think Like a Performance Engineer
csswizardry
28
2.6k
Have SEOs Ruined the Internet? - User Awareness of SEO in 2025
akashhashmi
0
350
Rebuilding a faster, lazier Slack
samanthasiow
85
9.5k
Public Speaking Without Barfing On Your Shoes - THAT 2023
reverentgeek
1
400
XXLCSS - How to scale CSS and keep your sanity
sugarenia
250
1.3M
Large-scale JavaScript Application Architecture
addyosmani
515
110k
CSS Pre-Processors: Stylus, Less & Sass
bermonpainter
360
30k
Building the Perfect Custom Keyboard
takai
2
770
Marketing to machines
jonoalderson
1
5.3k
Transcript
A Natural Language Pipeline
More Input
Knowledge” “A compendium of human...
Library
Physical archives became digital records, encoded with metadata
The internet promised rich dynamic experiences
The internet promised rich dynamic experiences but served us banner
ads
Advertising has and continues to fuel a substantial portion of
the innovation on the internet
What would The Economist look like if it were founded
in 2012?
User
First
Experience
“There’s a reason that tech companies are topping the lists
of most valuable companies and brands. Every company is a tech company.” Maggie Chan Jones
Every story, at its core, is a business story
Language
None
None
Stage -> Stenographer -> Editors -> spaCy -> Data Store
<-> Backend <- Slack <- Users Proto-Pipeline
Over eight hours we created data from the content of
the event, building the model in real-time
The model evolved over time
This was the experiment that would evolve into SiO 2
Silicon, a key element in everything from glass to microchips,
is at the core of global business
Oxygen, the journalistic voice Quartz breathes into the global business
news cycle
Entities are linguistic anchors, defined by context and around which
context can be inferred
Standard Entities PERSON FACILITY ORG PRODUCT GPE EVENT... Additional Entities
TECHNOLOGY PROCESS NATURE MEDIA CONSTRUCT
70K articles 1.4M blocks of text 85K labeled sentences
Entities
This spaCy model made rich analysis for any given text
easy to do on the fly
Stored analysis of a large corpus is a vital resource
The language graph...
Graph
The language graph is a mutable map of the language
model
Any new content is analyzed and then mapped onto the
language graph
Changes made to the graph can then be incorporated into
the next model iteration
The language graph becomes a primary resource for extracting training
data
Snapshots of time can be extracted from the language graph
Context can be derived by looking at the relationships in
the language graph
Elon Musk
Jeff Bezos
Mark Zuckerberg
Context
SiO 2 is a living Natural Language Pipeline of networked
algorithms trained on the corpus of Quartz to understand the linguistic patterns of global business news
The Pipeline(s) Quartz Corpus -> Training Sentences -> spaCy Content
-> spaCy -> Language Graph Language Graph -> Training Data -> Statistical Models / Classifiers Language Graph -> Training Sentences -> spaCy Unseen Content -> spaCy -> Pre-Processed Text / Vectors -> Statistical Models / Classifiers
Thank you