Slide 1


Building a Gigaword Corpus: Data Ingestion, Management, and Processing for NLP
Rebecca Bilbro, PyCon 2017

Slide 2


● Me and my motivation
● Why make a custom corpus?
● Things likely to go wrong
  ○ Ingestion
  ○ Management
  ○ Loading
  ○ Preprocessing
  ○ Analysis
● Lessons we learned
● Open source tools we made

Slide 3


Rebecca Bilbro, Data Scientist

Slide 4


Yellowbrick

Slide 5


No content

Slide 6


Natural language processing

Slide 7


Everyone’s doing it: NLTK, and so many other great tools.

Slide 8


The Natural Language Toolkit

import nltk

# These Text methods print their results directly (and return None),
# so they do not need to be wrapped in print().
moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
moby.similar("ahab")
moby.common_contexts(["ahab", "starbuck"])
moby.concordance("monstrous", 55, lines=10)

Slide 9


Gensim + Wikipedia

import bz2
import gensim

# Load the id-to-word dictionary
id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')

# Instantiate an iterator over the corpus (which is ~24.14 GB on disk after compression!)
mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2'))

# Do Latent Semantic Analysis and print 10 prominent topics
lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
lsa.print_topics(10)

Slide 10


A custom corpus

Slide 11


RSS

import requests
import feedparser

feed = "http://feeds.washingtonpost.com/rss/national"

for entry in feedparser.parse(feed)['entries']:
    r = requests.get(entry['link'])
    path = entry['title'].lower().replace(" ", "-") + ".html"
    with open(path, 'wb') as f:
        f.write(r.content)

Slide 12


Ingestion
● Scheduling
● Adding new feeds
● Synchronizing feeds, finding duplicates
● Parsing different feeds/entries into a standard form
● Monitoring

Storage
● Database choice
● Data representation, indexing, fetching
● Connection and configuration
● Error tracking and handling
● Exporting
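
Two of the concerns above, parsing different feeds into a standard form and finding duplicates, show up as soon as more than one feed is ingested. A minimal sketch of one way to handle them; the field names and the signature() helper are illustrative, not Baleen’s actual schema:

import hashlib
import feedparser

def signature(entry):
    # Hash the link and title so the same story fetched from two feeds collapses to one key.
    text = (entry.get("link", "") + entry.get("title", "")).encode("utf-8")
    return hashlib.sha256(text).hexdigest()

def normalize(entry, feed_url):
    # Map feedparser's per-feed quirks into one standard form for storage.
    return {
        "feed": feed_url,
        "title": entry.get("title"),
        "link": entry.get("link"),
        "published": entry.get("published", entry.get("updated")),
        "content": entry.get("summary", ""),
    }

def ingest(feed_urls, seen=None):
    seen = set() if seen is None else seen
    for url in feed_urls:
        for entry in feedparser.parse(url)["entries"]:
            key = signature(entry)
            if key in seen:
                continue  # duplicate across feeds or across runs
            seen.add(key)
            yield normalize(entry, url)

Persisting the seen signatures (for example, in the database) rather than keeping them in memory is what would make repeated, scheduled runs idempotent.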

Slide 13


And as the corpus began to grow… new questions arose about costs (storage, time) and surprising results (videos?).

Slide 14


Production-grade ingestion: Baleen

● Feed: title, link, active
● Post: title, url, content, hash(), htmlize()
● OPML Reader: categories(), counts(), __iter__(), __len__(), ingest()
● Configuration: logging, database, flags
● Logging: logger, mongolog
● Ingest: feeds(), started(), finished(), process(), ingest()
● Feed Sync: parse(), sync(), entries()
● Post Wrangler: wrangle(), fetch(), connect()
● Exporter: export(), readme()
● Admin: ingest_feeds(), ingest_opml(), summary(), run(), export()
● Utilities: timez

Slide 15


Raw corpus != Usable data

Slide 16


From each doc, extract the HTML, identify paras/sents/words, and tag with part-of-speech.

Raw Corpus → HTML → Paras → Sents → Tokens → Tags

corpus = [('How', 'WRB'), ('long', 'RB'), ('will', 'MD'), ('this', 'DT'), ('go', 'VB'), ('on', 'IN'), ('?', '.'), ... ]
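
A minimal sketch of that per-document step, assuming BeautifulSoup and NLTK (the deck does not prescribe these libraries):

from bs4 import BeautifulSoup  # requires beautifulsoup4
import nltk                    # requires the punkt and averaged_perceptron_tagger data

def preprocess(html):
    # Return a list of paragraphs, each a list of sentences,
    # each a list of (token, tag) pairs: Paras -> Sents -> Tokens -> Tags.
    soup = BeautifulSoup(html, "html.parser")
    return [
        [
            nltk.pos_tag(nltk.wordpunct_tokenize(sent))
            for sent in nltk.sent_tokenize(para.get_text())
        ]
        for para in soup.find_all("p")
    ]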

Slide 17


Streaming corpus preprocessing: a CorpusReader for streaming access, preprocessing, and saving the tokenized version.

Raw Corpus → HTML → Paras → Sents → Tokens → Tags → Tokenized Corpus
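
A sketch of the streaming pass, assuming the preprocess() function from the previous sketch and a directory of raw HTML files; the paths and layout are illustrative. Each document is read once, tokenized, and pickled so later passes never have to re-parse HTML:

import os
import pickle

def documents(raw_root):
    # Stream raw HTML paths one at a time; never hold the whole corpus in memory.
    for dirpath, _, filenames in os.walk(raw_root):
        for name in filenames:
            if name.endswith(".html"):
                yield os.path.join(dirpath, name)

def transform(raw_root, target_root):
    # Save the tokenized version alongside the raw corpus for cheap re-reading.
    for path in documents(raw_root):
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            tokenized = preprocess(f.read())  # paras -> sents -> (token, tag)
        target = os.path.join(target_root, os.path.relpath(path, raw_root))
        target = os.path.splitext(target)[0] + ".pickle"
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "wb") as f:
            pickle.dump(tokenized, f)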

Slide 18


Vectorization ...so many features
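
One way to keep vectorization from re-doing the tokenization work is to hand scikit-learn a callable analyzer. A minimal sketch, with a tiny stand-in for the tokenized corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

def tokens(doc):
    # Flatten paras -> sents -> (token, tag) into a plain token stream.
    return [token.lower() for para in doc for sent in para for token, tag in sent]

# Tiny stand-in for the tokenized corpus produced during preprocessing.
tokenized_docs = [
    [[("How", "WRB"), ("long", "RB"), ("will", "MD"), ("this", "DT"), ("go", "VB"), ("on", "IN"), ("?", ".")]],
    [[("Data", "NNS"), ("products", "NNS"), ("need", "VBP"), ("corpora", "NNS"), (".", ".")]],
]

# TfidfVectorizer normally re-tokenizes raw strings; passing a callable analyzer
# makes it count our existing tokens instead.
vectorizer = TfidfVectorizer(analyzer=tokens)
X = vectorizer.fit_transform(tokenized_docs)  # documents x features, sparse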

Slide 19


Yellowbrick: visualize top tokens, document distribution, and part-of-speech tagging.
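
For the top-tokens view, a sketch using Yellowbrick’s text visualizers on a tiny stand-in corpus (show() is the current API; 2017-era Yellowbrick used poof()):

from sklearn.feature_extraction.text import CountVectorizer
from yellowbrick.text import FreqDistVisualizer

# A tiny stand-in corpus of raw document strings.
corpus = [
    "the corpus began to grow and grow",
    "new questions arose about storage and time",
]

vectorizer = CountVectorizer()
docs = vectorizer.fit_transform(corpus)
features = vectorizer.get_feature_names_out()

# Plot the most frequent tokens across the corpus.
visualizer = FreqDistVisualizer(features=features)
visualizer.fit(docs)
visualizer.show()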

Slide 20


Minke: composing pipelines.

● Simple pipeline: Data Loader → Text Normalization → Text Vectorization → Feature Transformation → Estimator
● Feature union pipeline: Data Loader → Feature Union (Text Normalization, Document Features, Text Extraction, Summary Vectorization, Article Vectorization, Concept Features, Metadata Features, Dict Vectorizer) → Estimator
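
That diagram is essentially scikit-learn’s Pipeline and FeatureUnion. A generic sketch of the same shape, with placeholder selectors rather than Minke’s actual components:

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier

def extract_text(docs):
    # Pull the article body out of each loaded document (field names are illustrative).
    return [doc["content"] for doc in docs]

def extract_metadata(docs):
    # Side features such as the feed category (again, illustrative).
    return [{"category": doc.get("category", "unknown")} for doc in docs]

# Branch 1 vectorizes the text; branch 2 vectorizes side metadata dicts.
text_branch = Pipeline([
    ("select", FunctionTransformer(extract_text)),
    ("tfidf", TfidfVectorizer()),
])
meta_branch = Pipeline([
    ("select", FunctionTransformer(extract_metadata)),
    ("dict", DictVectorizer()),
])

model = Pipeline([
    ("features", FeatureUnion([("text", text_branch), ("meta", meta_branch)])),
    ("estimator", SGDClassifier()),
])
# model.fit(documents, labels) then behaves like any other scikit-learn estimator,
# so feature extraction is refit alongside the estimator during cross-validation.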

Slide 21


Dynamic graph analysis

Slide 22


1.5 M documents, 7,500 jobs, 524 GB (uncompressed)

Keyphrase graph:
● 2.7 M nodes
● 47 M edges
● Average degree of 35
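
A hedged sketch of how such a keyphrase graph can be assembled with NetworkX, linking keyphrases that co-occur within a document; the input format is an assumption:

from itertools import combinations
import networkx as nx

def keyphrase_graph(documents):
    # documents: an iterable of keyphrase lists, one list per document.
    G = nx.Graph()
    for phrases in documents:
        for a, b in combinations(sorted(set(phrases)), 2):
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1  # co-occurrence count
            else:
                G.add_edge(a, b, weight=1)
    return G

# Tiny stand-in for per-document keyphrase lists.
G = keyphrase_graph([
    ["gigaword corpus", "rss feeds", "data ingestion"],
    ["data ingestion", "keyphrase graph", "gigaword corpus"],
])
print(G.number_of_nodes(), G.number_of_edges())
print(2 * G.number_of_edges() / max(G.number_of_nodes(), 1))  # average degree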

Slide 23


Lessons learned

Slide 24


Meaningfully literate data products rely on… a custom, domain-specific corpus.

Slide 25


Meaningfully literate data products rely on… a data management layer for flexibility and iteration during modeling.

Feature Analysis → Algorithm Selection → Hyperparameter Tuning

Slide 26


Meaningfully literate data products rely on… a custom CorpusReader for streaming, and also intermediate storage.

Preprocessing: Raw CorpusReader → Transformer → Tokenized Corpus → Post-processed CorpusReader

corpus
├── citation.bib
├── feeds.json
├── LICENSE.md
├── manifest.json
├── README.md
├── books
│   ├── 56d629e7c1808113ffb87eaf.html
│   ├── 56d629e7c1808113ffb87eb3.html
│   └── 56d629ebc1808113ffb87ed0.html
├── business
│   ├── 56d625d5c1808113ffb87730.html
│   ├── 56d625d6c1808113ffb87736.html
│   └── 56d625ddc1808113ffb87752.html
├── cinema
│   ├── 56d629b5c1808113ffb87d8f.html
│   ├── 56d629b5c1808113ffb87d93.html
│   └── 56d629b6c1808113ffb87d9a.html
└── cooking
    ├── 56d62af2c1808113ffb880ec.html
    ├── 56d62af2c1808113ffb880ee.html
    └── 56d62af2c1808113ffb880fa.html
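
A minimal, pure-Python illustration of the reader pattern over a layout like the one above, streaming documents lazily by category (an illustration of the idea, not Baleen’s or Minke’s actual reader):

import os

class HTMLCorpusReader(object):
    """Streams raw HTML documents from a category-per-directory corpus."""

    def __init__(self, root):
        self.root = root

    def categories(self):
        return sorted(
            name for name in os.listdir(self.root)
            if os.path.isdir(os.path.join(self.root, name))
        )

    def fileids(self, categories=None):
        categories = categories or self.categories()
        for category in categories:
            for name in sorted(os.listdir(os.path.join(self.root, category))):
                if name.endswith(".html"):
                    yield os.path.join(category, name)

    def docs(self, categories=None):
        # Yield one document at a time so the corpus never has to fit in memory.
        for fileid in self.fileids(categories):
            path = os.path.join(self.root, fileid)
            with open(path, encoding="utf-8", errors="ignore") as f:
                yield f.read()

# Usage sketch:
# reader = HTMLCorpusReader("corpus")
# for html in reader.docs(categories=["books", "cinema"]):
#     ...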

Slide 27


Meaningfully literate data products rely on… visual steering and graph analysis for interpretation.

Slide 28


Corpus ingestion: routine document collection every hour.
Corpus processing: extract noun keyphrases weighted by TF-IDF.
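
A sketch of that keyphrase step, chunking noun phrases from tagged sentences with an NLTK grammar and weighting them with Gensim’s TF-IDF; the grammar and the tiny stand-in documents are illustrative:

import nltk
from gensim import corpora, models

# Noun-phrase chunk grammar; "KT" groups adjectives and nouns into candidate keyterms.
GRAMMAR = r"KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}"
chunker = nltk.RegexpParser(GRAMMAR)

def keyphrases(tagged_sents):
    # tagged_sents: a list of [(token, tag), ...] sentences, as produced during preprocessing.
    for sent in tagged_sents:
        for subtree in chunker.parse(sent).subtrees(lambda t: t.label() == "KT"):
            yield " ".join(word.lower() for word, tag in subtree.leaves())

# Tiny stand-in for two tagged documents (each a list of tagged sentences).
docs_tagged = [
    [[("custom", "JJ"), ("corpus", "NN"), ("ingestion", "NN")]],
    [[("keyphrase", "NN"), ("graph", "NN"), ("analysis", "NN")]],
]

phrase_docs = [list(keyphrases(doc)) for doc in docs_tagged]

# Weight each document's keyphrases by TF-IDF.
lexicon = corpora.Dictionary(phrase_docs)
tfidf = models.TfidfModel([lexicon.doc2bow(doc) for doc in phrase_docs], id2word=lexicon)
print(tfidf[lexicon.doc2bow(phrase_docs[0])])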

Slide 29


Baleen & Minke

Slide 30


Yellowbrick

Slide 31


Getting data
● (Tutorial) “Fantastic Data and Where to Find Them” by Nicole Donnelly
● (Poster) “On the Hour Data Ingestion” by Benjamin Bengfort and Will Voorhees

Speed to insight
● (Talk) “Human-Machine Collaboration” by Tony Ojeda
● (Poster) “A Framework for Exploratory Data Analysis” by Tony Ojeda and Sasan Bahadaran

Machine learning
● (Poster) “Model Management Systems” by Benjamin Bengfort and Laura Lorenz
● (Poster) “Yellowbrick” by Benjamin Bengfort and Rebecca Bilbro

Also, sprints! pycon.districtdatalabs.com

Slide 32


Thank you!

Rebecca Bilbro
Twitter: twitter.com/rebeccabilbro
Github: github.com/rebeccabilbro
Email: [email protected]