Building a Gigaword Corpus: Data Ingestion, Management, and Processing for NLP

Data Intelligence
June 28, 2017

Rebecca Bilbro, ByteCubed & District Data Labs
Audience level: Intermediate
Topic area: Modeling

Description

As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. In this talk we walk through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.

Transcript

  1. Building a Gigaword Corpus
    Data Ingestion, Management, and Processing for NLP
    Rebecca Bilbro
    Data Intelligence 2017

  2. ● Me and my motivation
    ● Why make a custom corpus?
    ● Things likely to go wrong
    ○ Ingestion
    ○ Management
    ○ Loading
    ○ Preprocessing
    ○ Analysis
    ● Lessons we learned
    ● Open source tools we made

  3. Rebecca Bilbro
    Data Scientist

  4. Yellowbrick

  5. (Image-only slide.)

  6. Natural language
    processing

  7. Everyone’s doing it
    NLTK
    So many great tools

  8. The Natural Language Toolkit
    import nltk
    moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
    print(moby.similar("ahab"))
    print(moby.common_contexts(["ahab", "starbuck"]))
    print(moby.concordance("monstrous", 55, lines=10))

  9. Gensim + Wikipedia
    import bz2
    import gensim
    # Load id to word dictionary
    id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')
    # Instantiate iterator for corpus (which is ~24.14 GB on disk after compression!)
    mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2'))
    # Do latent Semantic Analysis and find 10 prominent topics
    lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
    lsa.print_topics(10)

  10. Case Study:
    Predicting Political Orientation

  11. Partisan Discourse: Architecture
    (Architecture diagram. Components: Debate Transcripts, Initial Model, Submit URL,
    Preprocessing, Feature Extraction, Classification, Feedback, Fit Model, Evaluate Model,
    Model Selection, Model Storage, Model Monitoring, Corpus Storage, Corpus Monitoring.)

  12. Partisan Discourse: New Documents

  13. Partisan Discourse: User Model

  14. A custom corpus

  15. RSS
    import os
    import requests
    import feedparser
    feed = "http://feeds.washingtonpost.com/rss/national"
    for entry in feedparser.parse(feed)['entries']:
        r = requests.get(entry['link'])
        path = entry['title'].lower().replace(" ", "-") + ".html"
        with open(path, 'wb') as f:
            f.write(r.content)

  16. Ingestion
    ● Scheduling
    ● Adding new feeds
    ● Synchronizing feeds, finding duplicates
    ● Parsing different feeds/entries into a standard form
    ● Monitoring
    Storage
    ● Database choice
    ● Data representation, indexing, fetching
    ● Connection and configuration
    ● Error tracking and handling
    ● Exporting
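    For the "finding duplicates" and storage bullets above, a minimal sketch of one approach: hash
    the fetched content and skip anything already seen. The in-memory set stands in for a real
    database; none of this is Baleen's actual API.

    import hashlib
    import feedparser
    import requests

    def ingest(feed_url, seen):
        """Fetch each entry in a feed, skipping content we have already stored."""
        posts = []
        for entry in feedparser.parse(feed_url)['entries']:
            html = requests.get(entry['link']).content
            signature = hashlib.sha256(html).hexdigest()
            if signature in seen:
                continue  # duplicate content, already ingested
            seen.add(signature)
            posts.append({'title': entry['title'], 'link': entry['link'], 'content': html})
        return posts

    seen = set()
    posts = ingest("http://feeds.washingtonpost.com/rss/national", seen)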

  17. And as the corpus began to grow…
    …new questions arose about costs (storage, time) and surprising results (videos?).

  18. Production-grade ingestion: Baleen
    (Class diagram of the Baleen ingestion library:)
    Post: title, url, content, hash(), htmlize()
    Feed: title, link, active
    OPML Reader: categories(), counts(), __iter__(), __len__(); ingest()
    Configuration: logging, database, flags
    Exporter: export(), readme()
    Admin: ingest_feeds(), ingest_opml(), summary(), run(), export()
    Utilities: timez
    Logging: logger, mongolog
    Ingest: feeds(), started(), finished(), process(), ingest()
    Feed Sync: parse(), sync(), entries()
    Post Wrangler: wrangle(), fetch(); connect()
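    A rough sketch of how the Post and Feed boxes in that diagram might look as MongoDB document
    models. This is hypothetical: mongoengine and these exact fields are assumptions, not Baleen's
    actual code.

    import hashlib
    import mongoengine as me

    class Feed(me.Document):
        """An RSS feed to poll for new posts."""
        title = me.StringField(required=True)
        link = me.URLField(required=True, unique=True)
        active = me.BooleanField(default=True)

    class Post(me.Document):
        """A single article fetched from a feed."""
        title = me.StringField()
        url = me.URLField(required=True)
        content = me.StringField()

        def hash(self):
            # Content signature, handy for spotting duplicate posts across feeds.
            return hashlib.sha256((self.content or "").encode("utf-8")).hexdigest()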

  19. Raw corpus != Usable data

  20. From each doc, extract the HTML, identify paras/sents/words, and tag with part of speech.
    Raw Corpus → HTML → Paras → Sents → Tokens → Tags
    corpus = [('How', 'WRB'),
              ('long', 'RB'),
              ('will', 'MD'),
              ('this', 'DT'),
              ('go', 'VB'),
              ('on', 'IN'),
              ('?', '.'),
              ...
              ]
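    A minimal sketch of that HTML → paras → sents → tokens → tags step, using BeautifulSoup and
    NLTK. The library choices here are assumptions; the talk's actual preprocessing lives in Minke.

    import nltk
    from bs4 import BeautifulSoup

    def preprocess(html):
        """Return a doc as a list of paragraphs, each a list of POS-tagged sentences."""
        soup = BeautifulSoup(html, "html.parser")
        paras = [p.get_text() for p in soup.find_all("p")]
        return [
            [nltk.pos_tag(nltk.wordpunct_tokenize(sent))
             for sent in nltk.sent_tokenize(para)]
            for para in paras
        ]

    preprocess("<p>How long will this go on?</p>")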

  21. Streaming Corpus Preprocessing
    A CorpusReader for streaming access, preprocessing, and saving the tokenized version.
    Raw Corpus → HTML → Paras → Sents → Tokens → Tags → Tokenized Corpus
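    A minimal sketch of streaming preprocessing, assuming one HTML file per document: read one
    document at a time, tokenize it, and pickle the result so the expensive work happens only once.
    File layout and helper names are illustrative, not Minke's API, and HTML stripping is omitted.

    import os
    import pickle
    import nltk

    def stream_documents(root):
        """Yield one raw document at a time so the full corpus never sits in memory."""
        for name in sorted(os.listdir(root)):
            if name.endswith(".html"):
                with open(os.path.join(root, name), encoding="utf-8") as f:
                    yield name, f.read()

    def tokenize(text):
        """Segment into sentences of POS-tagged tokens."""
        return [nltk.pos_tag(nltk.wordpunct_tokenize(sent))
                for sent in nltk.sent_tokenize(text)]

    def transform(raw_root, target_root):
        """Save a pickled, tokenized copy of every document in the raw corpus."""
        os.makedirs(target_root, exist_ok=True)
        for name, text in stream_documents(raw_root):
            target = os.path.join(target_root, name.replace(".html", ".pickle"))
            with open(target, "wb") as f:
                pickle.dump(tokenize(text), f)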

  22. Vectorization
    ...so many features
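    "So many features" because every distinct token becomes a column. A minimal scikit-learn
    sketch of bag-of-words vectorization, not the talk's exact pipeline:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "How long will this go on?",
        "The corpus keeps growing, hour after hour.",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)  # one row per document, one column per vocabulary term
    print(X.shape)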

  23. Visualize top tokens, document distribution & part-of-speech tagging.
    Yellowbrick
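    A sketch of the top-tokens view with Yellowbrick's FreqDistVisualizer. Method names follow
    recent Yellowbrick releases; this is illustrative, not the talk's exact plotting code.

    from sklearn.feature_extraction.text import CountVectorizer
    from yellowbrick.text import FreqDistVisualizer

    docs = [
        "the corpus keeps growing",
        "the corpus is full of national news",
    ]

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)

    # Plot the most frequent tokens in the vectorized corpus.
    visualizer = FreqDistVisualizer(features=vectorizer.get_feature_names_out())
    visualizer.fit(counts)
    visualizer.show()  # .poof() on older Yellowbrick releases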

  24. Minke
    (Pipeline diagrams. Simple pipeline: Data Loader → Text Normalization → Text Vectorization →
    Feature Transformation → Estimator. Feature Union Pipeline: Data Loader → Text Normalization →
    feature union of Document Features (Text Extraction, Summary Vectorization, Article
    Vectorization, Concept Features) and Metadata Features (Dict Vectorizer) → Estimator.)
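    The diagram above is the scikit-learn Pipeline + FeatureUnion pattern. A generic sketch of that
    pattern with illustrative feature branches, not Minke's actual transformers:

    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Concatenate two different views of the same documents into one feature matrix.
    model = Pipeline([
        ("features", FeatureUnion([
            ("tfidf", TfidfVectorizer()),
            ("bigram_counts", CountVectorizer(ngram_range=(2, 2))),
        ])),
        ("estimator", LogisticRegression()),
    ])

    docs = ["a liberal take on the debate", "a conservative take on the debate"]
    labels = ["D", "R"]
    model.fit(docs, labels)
    print(model.predict(["another take on the debate"]))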

  25. Dynamic graph
    analysis

  26. 1.5 M documents, 7,500 jobs, 524 GB (uncompressed)
    Keyphrase Graph:
    - 2.7 M nodes
    - 47 M edges
    - Average degree of 35
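    A toy sketch of how such a keyphrase graph can be built with NetworkX. The construction rule
    here (an edge between keyphrases that co-occur in a document) is an assumption for illustration.

    import itertools
    import networkx as nx

    # Keyphrases extracted per document (toy data).
    documents = [
        ["debate", "health care", "tax policy"],
        ["tax policy", "federal budget"],
        ["debate", "federal budget"],
    ]

    G = nx.Graph()
    for keyphrases in documents:
        for a, b in itertools.combinations(set(keyphrases), 2):
            weight = G[a][b]["weight"] + 1 if G.has_edge(a, b) else 1
            G.add_edge(a, b, weight=weight)

    print(G.number_of_nodes(), G.number_of_edges())
    print(2 * G.number_of_edges() / G.number_of_nodes())  # average degree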

  27. Lessons learned

  28. Meaningfully literate data products rely on…
    ...a custom, domain-specific corpus.

  29. Meaningfully literate data products rely on…
    ...a data management layer for flexibility and iteration during modeling.
    Feature Analysis → Algorithm Selection → Hyperparameter Tuning

  30. corpus
    ├── citation.bib
    ├── feeds.json
    ├── LICENSE.md
    ├── manifest.json
    ├── README.md
    ├── books
    │   ├── 56d629e7c1808113ffb87eaf.html
    │   ├── 56d629e7c1808113ffb87eb3.html
    │   └── 56d629ebc1808113ffb87ed0.html
    ├── business
    │   ├── 56d625d5c1808113ffb87730.html
    │   ├── 56d625d6c1808113ffb87736.html
    │   └── 56d625ddc1808113ffb87752.html
    ├── cinema
    │   ├── 56d629b5c1808113ffb87d8f.html
    │   ├── 56d629b5c1808113ffb87d93.html
    │   └── 56d629b6c1808113ffb87d9a.html
    └── cooking
        ├── 56d62af2c1808113ffb880ec.html
        ├── 56d62af2c1808113ffb880ee.html
        └── 56d62af2c1808113ffb880fa.html
    Raw CorpusReader → Preprocessing Transformer → Tokenized Corpus → Post-processed CorpusReader
    Meaningfully literate data products rely on…
    ...a custom CorpusReader for streaming, and also intermediate storage.
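    A tiny sketch of how that one-directory-per-category layout maps onto categorized file ids.
    Illustrative only; the reader described in the talk likely builds on NLTK's CorpusReader classes.

    import os

    def fileids(root, categories=None):
        """Walk the category/docid.html layout and yield categorized file ids."""
        for category in sorted(os.listdir(root)):
            path = os.path.join(root, category)
            if not os.path.isdir(path):
                continue  # skip README.md, manifest.json, citation.bib, etc.
            if categories and category not in categories:
                continue
            for name in sorted(os.listdir(path)):
                if name.endswith(".html"):
                    yield os.path.join(category, name)

    # e.g. list(fileids("corpus", categories=["books", "cinema"]))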

  31. Meaningfully literate data products rely on…
    ...visual steering and graph analysis for interpretation.

  32. Corpus Processing: extract noun keyphrases weighted by TF-IDF.
    Corpus Ingestion: routine document collection, every hour.
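    A sketch of noun keyphrase extraction with an NLTK chunk grammar. The grammar and tokenizers
    here are assumptions, and the TF-IDF weighting over the resulting phrases is omitted.

    import nltk

    GRAMMAR = r"KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}"
    chunker = nltk.RegexpParser(GRAMMAR)

    def keyphrases(text):
        """Yield candidate noun keyphrases from one document."""
        for sent in nltk.sent_tokenize(text):
            tagged = nltk.pos_tag(nltk.word_tokenize(sent))
            for subtree in chunker.parse(tagged).subtrees(lambda t: t.label() == "KT"):
                yield " ".join(word for word, tag in subtree.leaves())

    print(list(keyphrases("The national news corpus grows by thousands of documents every hour.")))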

  33. Baleen & Minke

  34. Yellowbrick

  35. Thank you!
    Rebecca Bilbro
    Twitter: twitter.com/rebeccabilbro
    Github: github.com/rebeccabilbro
    Email: [email protected]
