Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A Natural Language Pipeline
Search
ddqz
July 06, 2019
Technology
0
520
A Natural Language Pipeline
Presentation from the spaCy IRL 2019 conference.
ddqz
July 06, 2019
Tweet
Share
Other Decks in Technology
See All in Technology
制約が導く迷わない設計 〜 信頼性と運用性を両立するマイナンバー管理システムの実践 〜
bwkw
3
1k
SREチームをどう作り、どう育てるか ― Findy横断SREのマネジメント
rvirus0817
0
350
~Everything as Codeを諦めない~ 後からCDK
mu7889yoon
3
500
こんなところでも(地味に)活躍するImage Modeさんを知ってるかい?- Image Mode for OpenShift -
tsukaman
1
170
SRE Enabling戦記 - 急成長する組織にSREを浸透させる戦いの歴史
markie1009
0
170
Frontier Agents (Kiro autonomous agent / AWS Security Agent / AWS DevOps Agent) の紹介
msysh
3
190
SchooでVue.js/Nuxtを技術選定している理由
yamanoku
3
200
OWASP Top 10:2025 リリースと 少しの日本語化にまつわる裏話
okdt
PRO
3
840
顧客との商談議事録をみんなで読んで顧客解像度を上げよう
shibayu36
0
310
生成AIを活用した音声文字起こしシステムの2つの構築パターンについて
miu_crescent
PRO
3
220
顧客の言葉を、そのまま信じない勇気
yamatai1212
1
360
モダンUIでフルサーバーレスなAIエージェントをAmplifyとCDKでサクッとデプロイしよう
minorun365
4
220
Featured
See All Featured
Between Models and Reality
mayunak
1
190
Fireside Chat
paigeccino
41
3.8k
Breaking role norms: Why Content Design is so much more than writing copy - Taylor Woolridge
uxyall
0
170
Imperfection Machines: The Place of Print at Facebook
scottboms
269
14k
From π to Pie charts
rasagy
0
130
AI in Enterprises - Java and Open Source to the Rescue
ivargrimstad
0
1.1k
Claude Code どこまでも/ Claude Code Everywhere
nwiizo
61
52k
WENDY [Excerpt]
tessaabrams
9
36k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
12
1k
Building Flexible Design Systems
yeseniaperezcruz
330
40k
Refactoring Trust on Your Teams (GOTO; Chicago 2020)
rmw
35
3.4k
SERP Conf. Vienna - Web Accessibility: Optimizing for Inclusivity and SEO
sarafernandez
1
1.3k
Transcript
A Natural Language Pipeline
More Input
Knowledge” “A compendium of human...
Library
Physical archives became digital records, encoded with metadata
The internet promised rich dynamic experiences
The internet promised rich dynamic experiences but served us banner
ads
Advertising has and continues to fuel a substantial portion of
the innovation on the internet
What would The Economist look like if it were founded
in 2012?
User
First
Experience
“There’s a reason that tech companies are topping the lists
of most valuable companies and brands. Every company is a tech company.” Maggie Chan Jones
Every story, at its core, is a business story
Language
None
None
Stage -> Stenographer -> Editors -> spaCy -> Data Store
<-> Backend <- Slack <- Users Proto-Pipeline
Over eight hours we created data from the content of
the event, building the model in real-time
The model evolved over time
This was the experiment that would evolve into SiO 2
Silicon, a key element in everything from glass to microchips,
is at the core of global business
Oxygen, the journalistic voice Quartz breathes into the global business
news cycle
Entities are linguistic anchors, defined by context and around which
context can be inferred
Standard Entities PERSON FACILITY ORG PRODUCT GPE EVENT... Additional Entities
TECHNOLOGY PROCESS NATURE MEDIA CONSTRUCT
70K articles 1.4M blocks of text 85K labeled sentences
Entities
This spaCy model made rich analysis for any given text
easy to do on the fly
Stored analysis of a large corpus is a vital resource
The language graph...
Graph
The language graph is a mutable map of the language
model
Any new content is analyzed and then mapped onto the
language graph
Changes made to the graph can then be incorporated into
the next model iteration
The language graph becomes a primary resource for extracting training
data
Snapshots of time can be extracted from the language graph
Context can be derived by looking at the relationships in
the language graph
Elon Musk
Jeff Bezos
Mark Zuckerberg
Context
SiO 2 is a living Natural Language Pipeline of networked
algorithms trained on the corpus of Quartz to understand the linguistic patterns of global business news
The Pipeline(s) Quartz Corpus -> Training Sentences -> spaCy Content
-> spaCy -> Language Graph Language Graph -> Training Data -> Statistical Models / Classifiers Language Graph -> Training Sentences -> spaCy Unseen Content -> spaCy -> Pre-Processed Text / Vectors -> Statistical Models / Classifiers
Thank you