Rebecca Bilbro - Building A Gigaword Corpus: Lessons on Data Ingestion, Management, and Processing for NLP

As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. This talk walks through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.

While applications like Siri, Cortana, and Alexa may still seem like novelties, language-aware applications are rapidly becoming the new norm. Under the hood, these applications take in text data as input, parse it into composite parts, compute upon those composites, and then recombine them to deliver a meaningful and tailored end result. The best applications use language models trained on _domain-specific corpora_ (collections of related documents containing natural language) that reduce ambiguity and prediction space to make results more intelligible. Here's the catch: these corpora are huge, generally consisting of thousands of documents and at least hundreds of gigabytes of data, and often more!

In this talk, we'll see how working with text data is substantially different from working with numeric data, and show that ingesting a raw text corpus in a form that will support the construction of a data product is no trivial task. For instance, when dealing with a text corpus, you have to consider not only how the data comes in (e.g. respecting rate limits, terms of use, etc.), but also where to store the data and how to keep it organized. Because the data comes from the web, it's often unpredictable, containing not only text but audio files, ads, videos, and other kinds of web detritus. Since the datasets are large, you need to anticipate potential performance problems and ensure memory safety through streaming data loading and multiprocessing. Finally, in anticipation of the machine learning components, you have to establish a standardized method of transforming your raw ingested text into a corpus that's ready for computation and modeling.
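The memory-safety point above can be illustrated with a generator that reads one document at a time instead of loading the whole corpus into a list. This is a minimal sketch, not code from the talk: the function name `stream_documents`, the one-directory-per-category layout, and the `.html` extension are assumptions made for illustration.

```python
import os
import tempfile


def stream_documents(root, extension=".html"):
    """Yield (category, path, text) for one document at a time.

    Assumes a hypothetical layout of root/<category>/<file>. Only one
    document's text is held in memory at any moment.
    """
    for category in sorted(os.listdir(root)):
        cat_dir = os.path.join(root, category)
        if not os.path.isdir(cat_dir):
            continue
        for name in sorted(os.listdir(cat_dir)):
            # Skip anything that is not a document (audio, video, etc.).
            if not name.endswith(extension):
                continue
            path = os.path.join(cat_dir, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                yield category, path, f.read()


if __name__ == "__main__":
    # Build a tiny throwaway corpus to demonstrate the generator.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "books"))
        with open(os.path.join(root, "books", "a.html"), "w", encoding="utf-8") as f:
            f.write("<p>Call me Ishmael.</p>")
        for category, path, text in stream_documents(root):
            print(category, os.path.basename(path), len(text))
```

Because the function is a generator, downstream steps can consume documents in a `for` loop without the corpus ever being fully resident in memory.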

In this talk, we'll explore many of the challenges we experienced along the way and introduce two Python packages that make this work a bit easier: Baleen and Minke. Baleen is a package for ingesting formal natural language data from the discourse of professional and amateur writers, like bloggers and news outlets, in a categorized fashion. Minke extends Baleen with a library that performs parallel data loading, preprocessing, normalization, and keyphrase extraction to support machine learning on a large-scale custom corpus.
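To make the preprocessing and normalization steps concrete, here is a rough sketch that turns an HTML document into paragraphs, sentences, and lowercased tokens. It does not use Minke's API, which is not shown in this abstract. The function name `preprocess` and the regex-based segmentation are assumptions, standing in for a real tokenizer such as NLTK's.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")                      # any HTML tag
PARA_RE = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)
SENT_RE = re.compile(r"(?<=[.!?])\s+")               # naive sentence boundary
TOKEN_RE = re.compile(r"\w+|[^\w\s]")                # words or single punctuation marks


def preprocess(html):
    """Return paragraphs as lists of sentences, each a list of lowercased tokens."""
    paragraphs = []
    for raw in PARA_RE.findall(html):
        # Strip nested tags, then decode entities such as &amp;
        text = unescape(TAG_RE.sub("", raw)).strip()
        if not text:
            continue
        sentences = [
            [tok.lower() for tok in TOKEN_RE.findall(sent)]
            for sent in SENT_RE.split(text)
            if sent
        ]
        paragraphs.append(sentences)
    return paragraphs


if __name__ == "__main__":
    doc = "<p>How long will this go on? Not long.</p>"
    for paragraph in preprocess(doc):
        for sentence in paragraph:
            print(sentence)
```

The nested list structure keeps paragraph and sentence boundaries available to later steps such as part-of-speech tagging or keyphrase extraction.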


PyCon 2017

May 21, 2017


  1. Building a Gigaword Corpus: Data Ingestion, Management, and Processing for NLP. Rebecca Bilbro, PyCon 2017
  2. • Me and my motivation
     • Why make a custom corpus?
     • Things likely to go wrong
       ◦ Ingestion
       ◦ Management
       ◦ Loading
       ◦ Preprocessing
       ◦ Analysis
     • Lessons we learned
     • Open source tools we made
  3. Rebecca Bilbro Data Scientist

  4. Yellowbrick

  5. None
  6. Natural language processing

  7. Everyone’s doing it NLTK So many great tools

  8. The Natural Language Toolkit

          import nltk

          moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

          # These Text methods print their results and return None,
          # so they are called directly rather than wrapped in print().
          moby.similar("ahab")
          moby.common_contexts(["ahab", "starbuck"])
          moby.concordance("monstrous", 55, lines=10)

  9. Gensim + Wikipedia

          import bz2
          import gensim

          # Load id to word dictionary
          id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')

          # Instantiate iterator for corpus (which is ~24.14 GB on disk after compression!)
          mm = gensim.corpora.MmCorpus(bz2.BZ2File(''))

          # Do latent semantic analysis and find 10 prominent topics
          lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
          lsa.print_topics(10)

  10. A custom corpus

  11. RSS

          import os
          import requests
          import feedparser

          feed = ""

          for entry in feedparser.parse(feed)['entries']:
              r = requests.get(entry['link'])
              path = entry['title'].lower().replace(" ", "-") + ".html"
              with open(path, 'wb') as f:
                  f.write(r.content)

  12. Ingestion
      • Scheduling
      • Adding new feeds
      • Synchronizing feeds, finding duplicates
      • Parsing different feeds/entries into a standard form
      • Monitoring
      Storage
      • Database choice
      • Data representation, indexing, fetching
      • Connection and configuration
      • Error tracking and handling
      • Exporting
  13. And as the corpus began to grow … new questions arose about costs (storage, time) and surprising results (videos?).
  14. Production-grade ingestion: Baleen (architecture diagram)
      • Post: title, url, content, hash(), htmlize()
      • Feed: title, link, active
      • OPML Reader: categories(), counts(), __iter__(), __len__()
      • Configuration: logging, database, flags
      • Exporter: export(), readme()
      • Admin: ingest_feeds(), ingest_opml(), summary(), run(), export()
      • Utilities: timez
      • Logging: logger, mongolog
      • Ingest: feeds(), started(), finished(), process(), ingest()
      • Feed Sync: parse(), sync(), entries()
      • Post Wrangler: wrangle(), fetch()
      • Other diagram labels: ingest(), connect(), <OPML />
  15. Raw corpus != Usable data

  16. From each doc, extract html, identify paras/sents/words, tag with part-of-speech. Diagram: Raw Corpus → HTML → Paras → Sents → Tokens → Tags. Example: corpus = [('How', 'WRB'), ('long', 'RB'), ('will', 'MD'), ('this', 'DT'), ('go', 'VB'), ('on', 'IN'), ('?', '.'), ... ]
  17. Streaming Corpus Preprocessing: CorpusReader for streaming access, preprocessing, and saving the tokenized version. Diagram: Raw Corpus → HTML → Paras → Sents → Tokens → Tags → Tokenized Corpus
  18. Vectorization many features

  19. Visualize top tokens, document distribution & part-of-speech tagging Yellowbrick

  20. Minke (pipeline diagrams)
      • Data Loader → Text Normalization → Text Vectorization → Feature Transformation → Estimator
      • Data Loader → Feature Union Pipeline → Estimator
      • Labels in the feature union diagram: Text Normalization, Document Features, Text Extraction, Summary Vectorization, Article Vectorization, Concept Features, Metadata Features, Dict Vectorizer
  21. Dynamic graph analysis

  22. 1.5 M documents, 7,500 jobs, 524 GB (uncompressed). Keyphrase graph: 2.7 M nodes, 47 M edges, average degree of 35
  23. Lessons learned

  24. Meaningfully literate data products rely on… ...a custom, domain-specific corpus.

  25. Meaningfully literate data products rely on… ...a data management layer for flexibility and iteration during modeling. Diagram labels: Feature Analysis, Algorithm Selection, Hyperparameter Tuning
  26. Meaningfully literate data products rely on… ...a custom CorpusReader for streaming, and also intermediate storage. Diagram: Raw CorpusReader → Preprocessing Transformer → Tokenized Corpus → Post-processed CorpusReader. Corpus layout:

          corpus
          ├── citation.bib
          ├── feeds.json
          ├──
          ├── manifest.json
          ├──
          └── books
              ├── 56d629e7c1808113ffb87eaf.html
              ├── 56d629e7c1808113ffb87eb3.html
              └── 56d629ebc1808113ffb87ed0.html
          └── business
              ├── 56d625d5c1808113ffb87730.html
              ├── 56d625d6c1808113ffb87736.html
              └── 56d625ddc1808113ffb87752.html
          └── cinema
              ├── 56d629b5c1808113ffb87d8f.html
              ├── 56d629b5c1808113ffb87d93.html
              └── 56d629b6c1808113ffb87d9a.html
          └── cooking
              ├── 56d62af2c1808113ffb880ec.html
              ├── 56d62af2c1808113ffb880ee.html
              └── 56d62af2c1808113ffb880fa.html

  27. Meaningfully literate data products rely on… ...visual steering and graph analysis for interpretation.
  28. Corpus Processing: extract noun keyphrases weighted by TF-IDF. Corpus Ingestion: routine document collection every hour.
  29. Baleen & Minke

  30. Yellowbrick

  31. Getting data
      • (Tutorial) “Fantastic Data and Where to Find Them” by Nicole Donnelly
      • (Poster) “On the Hour Data Ingestion” by Benjamin Bengfort and Will Voorhees
      Speed to insight
      • (Talk) “Human-Machine Collaboration” by Tony Ojeda
      • (Poster) “A Framework for Exploratory Data Analysis” by Tony Ojeda and Sasan Bahadaran
      Machine learning
      • (Poster) “Model Management Systems” by Benjamin Bengfort and Laura Lorenz
      • (Poster) “Yellowbrick” by Benjamin Bengfort and Rebecca Bilbro
      Also, sprints!
  32. Thank you! Rebecca Bilbro
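
As a quick check on the keyphrase graph figures from slide 22, the average degree of an undirected graph is twice the edge count divided by the node count. The sketch below assumes the graph is undirected, which the slides do not state, and uses the rounded figures shown there. The helper name `average_degree` is illustrative.

```python
def average_degree(num_nodes, num_edges):
    """Average degree of an undirected graph: each edge touches two nodes."""
    return 2 * num_edges / num_nodes


nodes = 2.7e6  # 2.7 M nodes (slide 22)
edges = 47e6   # 47 M edges (slide 22)

# Prints 34.8, which rounds to the average degree of 35 given on the slide.
print(round(average_degree(nodes, edges), 1))
```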