Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
A Natural Language Pipeline
Search
Sponsored
·
SiteGround - Reliable hosting with speed, security, and support you can count on.
→
ddqz
July 06, 2019
Technology
0
530
A Natural Language Pipeline
Presentation from the spaCy IRL 2019 conference.
ddqz
July 06, 2019
Tweet
Share
Other Decks in Technology
See All in Technology
チームメンバー迷わないIaC設計
hayama17
5
3.6k
Snowflakeデータ基盤で挑むAI活用 〜4年間のDataOpsの基礎をもとに〜
kaz3284
1
330
【PyCon mini Shizuoka 2026】生成AI時代に画像処理やオーディオ処理のノードエディターを作る理由
kazuhitotakahashi
0
260
【SLO】"多様な期待値" と向き合ってみた
z63d
2
290
Master Dataグループ紹介資料
sansan33
PRO
1
4.4k
三菱UFJ銀行におけるエンタープライズAI駆動開発のリアル / Enterprise AI_Driven Development at MUFG Bank: The Real Story
muit
10
20k
研究開発部メンバーの働き⽅ / Sansan R&D Profile
sansan33
PRO
4
22k
Oracle Cloud Infrastructure:2026年2月度サービス・アップデート
oracle4engineer
PRO
0
190
クラウド時代における一時権限取得
krrrr38
1
150
入門DBSC
ynojima
0
120
技術キャッチアップ効率化を実現する記事推薦システムの構築
yudai00
2
170
技術的負債の泥沼から組織を救う3つの転換点
nwiizo
4
800
Featured
See All Featured
Have SEOs Ruined the Internet? - User Awareness of SEO in 2025
akashhashmi
0
280
Distributed Sagas: A Protocol for Coordinating Microservices
caitiem20
333
22k
Principles of Awesome APIs and How to Build Them.
keavy
128
17k
DBのスキルで生き残る技術 - AI時代におけるテーブル設計の勘所
soudai
PRO
62
50k
No one is an island. Learnings from fostering a developers community.
thoeni
21
3.6k
Avoiding the “Bad Training, Faster” Trap in the Age of AI
tmiket
0
96
Dealing with People You Can't Stand - Big Design 2015
cassininazir
367
27k
Navigating Weather and Climate Data
rabernat
0
130
The Power of CSS Pseudo Elements
geoffreycrofte
82
6.2k
Designing for Performance
lara
611
70k
Test your architecture with Archunit
thirion
1
2.2k
Intergalactic Javascript Robots from Outer Space
tanoku
273
27k
Transcript
A Natural Language Pipeline
More Input
Knowledge” “A compendium of human...
Library
Physical archives became digital records, encoded with metadata
The internet promised rich dynamic experiences
The internet promised rich dynamic experiences but served us banner
ads
Advertising has and continues to fuel a substantial portion of
the innovation on the internet
What would The Economist look like if it were founded
in 2012?
User
First
Experience
“There’s a reason that tech companies are topping the lists
of most valuable companies and brands. Every company is a tech company.” Maggie Chan Jones
Every story, at its core, is a business story
Language
None
None
Stage -> Stenographer -> Editors -> spaCy -> Data Store
<-> Backend <- Slack <- Users Proto-Pipeline
Over eight hours we created data from the content of
the event, building the model in real-time
The model evolved over time
This was the experiment that would evolve into SiO 2
Silicon, a key element in everything from glass to microchips,
is at the core of global business
Oxygen, the journalistic voice Quartz breathes into the global business
news cycle
Entities are linguistic anchors, defined by context and around which
context can be inferred
Standard Entities PERSON FACILITY ORG PRODUCT GPE EVENT... Additional Entities
TECHNOLOGY PROCESS NATURE MEDIA CONSTRUCT
70K articles 1.4M blocks of text 85K labeled sentences
Entities
This spaCy model made rich analysis for any given text
easy to do on the fly
Stored analysis of a large corpus is a vital resource
The language graph...
Graph
The language graph is a mutable map of the language
model
Any new content is analyzed and then mapped onto the
language graph
Changes made to the graph can then be incorporated into
the next model iteration
The language graph becomes a primary resource for extracting training
data
Snapshots of time can be extracted from the language graph
Context can be derived by looking at the relationships in
the language graph
Elon Musk
Jeff Bezos
Mark Zuckerberg
Context
SiO 2 is a living Natural Language Pipeline of networked
algorithms trained on the corpus of Quartz to understand the linguistic patterns of global business news
The Pipeline(s) Quartz Corpus -> Training Sentences -> spaCy Content
-> spaCy -> Language Graph Language Graph -> Training Data -> Statistical Models / Classifiers Language Graph -> Training Sentences -> spaCy Unseen Content -> spaCy -> Pre-Processed Text / Vectors -> Statistical Models / Classifiers
Thank you