Building a Gigaword Corpus: Data Ingestion, Management, and Processing for NLP

Data Intelligence, June 28, 2017

Rebecca Bilbro, Bytecubed & District Data Labs
Audience level: Intermediate
Topic area: Modeling

Description

As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. In this talk we walk through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.

Transcript

  1. Building a Gigaword Corpus
    Data Ingestion, Management, and Processing for NLP
    Rebecca Bilbro
    Data Intelligence 2017

  2. ● Me and my motivation
    ● Why make a custom corpus?
    ● Things likely to go wrong
    ○ Ingestion
    ○ Management
    ○ Loading
    ○ Preprocessing
    ○ Analysis
    ● Lessons we learned
    ● Open source tools we made

  3. Rebecca Bilbro
    Data Scientist

  4. Natural language processing

  5. Everyone’s doing it
    NLTK
    So many great tools

  6. The Natural Language Toolkit
    import nltk

    # Build an NLTK Text object over the Gutenberg copy of Moby Dick.
    moby = nltk.text.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

    # These methods print their results directly rather than returning them.
    moby.similar("ahab")
    moby.common_contexts(["ahab", "starbuck"])
    moby.concordance("monstrous", 55, lines=10)

  7. Gensim + Wikipedia
    import bz2
    import gensim
    # Load the id-to-word dictionary
    id2word = gensim.corpora.Dictionary.load_from_text('wikipedia_wordids.txt')
    # Instantiate an iterator over the corpus (~24.14 GB on disk after compression!)
    mm = gensim.corpora.MmCorpus(bz2.BZ2File('wikipedia_tfidf.mm.bz2'))
    # Run Latent Semantic Analysis with 400 topics, then print the 10 most prominent
    lsa = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)
    lsa.print_topics(10)

  8. Case Study:
    Predicting Political Orientation

  9. Partisan Discourse: Architecture
    (Architecture diagram: Initial Model, Debate Transcripts, Submit URL, Preprocessing,
    Feature Extraction, Fit Model, Evaluate Model, Model Selection, Model Storage,
    Model Monitoring, Corpus Storage, Corpus Monitoring, Classification, Feedback)
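    A hedged sketch, not the application's actual code, of the fit / evaluate / store
    steps named in the diagram, using scikit-learn and joblib; the documents and labels
    below are placeholders.

    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    import joblib

    # Placeholder documents and party labels standing in for debate transcripts
    # and user-submitted articles.
    docs = ["debate transcript one", "submitted article two",
            "debate transcript three", "submitted article four"]
    labels = [0, 1, 0, 1]

    model = Pipeline([
        ("tfidf", TfidfVectorizer()),      # preprocessing + feature extraction
        ("clf", LogisticRegression()),     # fit model
    ])

    print(cross_val_score(model, docs, labels, cv=2))    # evaluate model
    model.fit(docs, labels)
    joblib.dump(model, "partisan-discourse.pkl")         # model storage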

  10. Partisan Discourse: New Documents

  11. Partisan Discourse: User Model

  12. A custom corpus

  13. RSS
    import os
    import requests
    import feedparser

    feed = "http://feeds.washingtonpost.com/rss/national"

    for entry in feedparser.parse(feed)['entries']:
        # Fetch each article and save its HTML, named with a slugified title.
        r = requests.get(entry['link'])
        path = entry['title'].lower().replace(" ", "-") + ".html"
        with open(path, 'wb') as f:
            f.write(r.content)

  14. Ingestion
    ● Scheduling
    ● Adding new feeds
    ● Synchronizing feeds, finding duplicates
    ● Parsing different feeds/entries into a standard form
    ● Monitoring

    Storage
    ● Database choice
    ● Data representation, indexing, fetching
    ● Connection and configuration
    ● Error tracking and handling
    ● Exporting
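    A hedged sketch of two of these concerns, duplicate detection and database storage,
    using a content signature and MongoDB; the database, collection, and field names
    here are illustrative assumptions, not Baleen's actual schema.

    import hashlib

    import feedparser
    from pymongo import MongoClient

    client = MongoClient()                         # default localhost connection
    posts = client.corpus.posts                    # hypothetical database/collection
    posts.create_index("signature", unique=True)   # one stored document per signature

    def ingest(feed_url):
        for entry in feedparser.parse(feed_url)['entries']:
            # Hash the link so the same post found in synchronized feeds is stored once.
            signature = hashlib.sha256(entry['link'].encode("utf-8")).hexdigest()
            if posts.find_one({"signature": signature}):
                continue
            posts.insert_one({
                "signature": signature,
                "title": entry.get("title"),
                "link": entry.get("link"),
            })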

  15. And as the corpus began to grow…
    …new questions arose about costs (storage, time) and surprising results (videos?).

  16. Production-grade ingestion: Baleen
    (Class diagram)
    Post: + title, + url, + content, + hash(), + htmlize()
    Feed: + title, + link, + active
    OPML Reader: + categories(), + counts(), + __iter__(), + __len__(), ingest()
    Configuration: + logging, + database, + flags
    Exporter: + export(), + readme()
    Admin: + ingest_feeds(), + ingest_opml(), + summary(), + run(), + export()
    Utilities: + timez
    Logging: + logger, + mongolog
    Ingest: + feeds(), + started(), + finished(), + process(), + ingest()
    Feed Sync: + parse(), + sync(), + entries()
    Post Wrangler: + wrangle(), + fetch(), connect()

  17. Raw corpus != Usable data

  18. From each doc, extract HTML, identify paras/sents/words, tag with part-of-speech.
    Raw Corpus → HTML → Paras → Sents → Tokens → Tags

    corpus = [('How', 'WRB'),
              ('long', 'RB'),
              ('will', 'MD'),
              ('this', 'DT'),
              ('go', 'VB'),
              ('on', 'IN'),
              ('?', '.'),
              ...
              ]
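    A minimal sketch of that per-document step, assuming BeautifulSoup and NLTK (with
    the usual tokenizer and tagger data downloaded); illustrative, not the talk's
    actual code.

    import nltk
    from bs4 import BeautifulSoup

    def preprocess(html):
        """Yield one paragraph at a time as a list of sentences of (token, tag) pairs."""
        soup = BeautifulSoup(html, "html.parser")
        for para in soup.find_all("p"):
            yield [
                nltk.pos_tag(nltk.wordpunct_tokenize(sent))
                for sent in nltk.sent_tokenize(para.get_text())
            ]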

  19. Streaming Corpus Preprocessing
    A CorpusReader for streaming access, preprocessing, and saving the tokenized version.
    Raw Corpus (HTML) → Paras → Sents → Tokens → Tags → Tokenized Corpus
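    A hedged sketch of the streaming idea: load one preprocessed (pickled) document at
    a time instead of holding the whole corpus in memory. The class and method names
    below are illustrative, not the actual library's API.

    import pickle

    class PickledCorpusReader(object):
        """Streams documents that were tokenized, tagged, and saved as pickles."""

        def __init__(self, fileids):
            self.fileids = fileids

        def docs(self):
            # Yield one document at a time so the corpus never has to fit in memory.
            for fileid in self.fileids:
                with open(fileid, "rb") as f:
                    yield pickle.load(f)

        def tagged_tokens(self):
            # Flatten documents -> paragraphs -> sentences -> (token, tag) pairs.
            for doc in self.docs():
                for para in doc:
                    for sent in para:
                        for token, tag in sent:
                            yield token, tag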

  20. Vectorization
    ...so many features
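    A small illustration, not from the talk, of why vectorization produces so many
    features: every distinct token in the corpus becomes a column.

    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "How long will this go on?",
        "It will go on as long as it goes on.",
    ]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)
    print(X.shape)   # (n_documents, n_distinct_tokens); the vocabulary grows with the corpus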

  21. Visualize top tokens, document distribution, & part-of-speech tagging
    Yellowbrick
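    A hedged example of one such plot, assuming a recent Yellowbrick and scikit-learn:
    FreqDistVisualizer draws the most frequent tokens of a vectorized corpus (the corpus
    below is a placeholder).

    from sklearn.feature_extraction.text import CountVectorizer
    from yellowbrick.text import FreqDistVisualizer

    corpus = ["a placeholder document", "another placeholder document"]
    vectorizer = CountVectorizer()
    docs = vectorizer.fit_transform(corpus)

    visualizer = FreqDistVisualizer(features=vectorizer.get_feature_names_out())
    visualizer.fit(docs)
    visualizer.show()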

  22. Minke
    (Pipeline diagram: Data Loader → Text Normalization → Text Vectorization →
    Feature Transformation → Estimator)
    (Feature Union Pipeline diagram: Data Loader, Text Normalization, Text Extraction,
    Summary Vectorization, Article Vectorization, Concept Features, Metadata Features,
    Dict Vectorizer, Document Features, Estimator)
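    A hedged sketch, not Minke's actual code, of the feature-union idea in the second
    diagram: several vectorizations run in parallel and their outputs are concatenated
    before the estimator. The second branch is a stand-in for the other feature sets.

    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier

    model = Pipeline([
        ("features", FeatureUnion([
            ("article_tfidf", TfidfVectorizer()),
            ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
        ])),
        ("clf", SGDClassifier()),
    ])
    # model.fit(train_docs, train_labels)   # train_docs / train_labels are placeholders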

  23. Dynamic graph analysis

  24. 1.5 M documents, 7,500 jobs, 524 GB (uncompressed)
    Keyphrase Graph: 2.7 M nodes, 47 M edges, average degree of 35
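    A toy sketch, not the talk's code, of how such a keyphrase graph can be built:
    keyphrases become nodes and co-occurrence within a document becomes an edge (the
    keyphrase lists below are placeholders).

    import itertools
    import networkx as nx

    doc_keyphrases = [["tax policy", "debate", "senate floor"],
                      ["debate", "election"]]
    G = nx.Graph()
    for phrases in doc_keyphrases:
        for a, b in itertools.combinations(set(phrases), 2):
            G.add_edge(a, b)

    avg_degree = sum(dict(G.degree()).values()) / G.number_of_nodes()
    print(G.number_of_nodes(), G.number_of_edges(), avg_degree)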

  25. Lessons learned

  26. Meaningfully literate data products rely on… a custom, domain-specific corpus.

  27. Meaningfully literate data products rely on… a data management layer for
    flexibility and iteration during modeling.
    (Diagram: Feature Analysis, Algorithm Selection, Hyperparameter Tuning)

  28. corpus
    ├── citation.bib
    ├── feeds.json
    ├── LICENSE.md
    ├── manifest.json
    ├── README.md
    ├── books
    │   ├── 56d629e7c1808113ffb87eaf.html
    │   ├── 56d629e7c1808113ffb87eb3.html
    │   └── 56d629ebc1808113ffb87ed0.html
    ├── business
    │   ├── 56d625d5c1808113ffb87730.html
    │   ├── 56d625d6c1808113ffb87736.html
    │   └── 56d625ddc1808113ffb87752.html
    ├── cinema
    │   ├── 56d629b5c1808113ffb87d8f.html
    │   ├── 56d629b5c1808113ffb87d93.html
    │   └── 56d629b6c1808113ffb87d9a.html
    └── cooking
        ├── 56d62af2c1808113ffb880ec.html
        ├── 56d62af2c1808113ffb880ee.html
        └── 56d62af2c1808113ffb880fa.html

    (Pipeline: Raw CorpusReader → Preprocessing Transformer → Tokenized Corpus →
    Post-processed CorpusReader)

    Meaningfully literate data products rely on… a custom CorpusReader for streaming,
    and also intermediate storage.

  29. Meaningfully literate data products rely on… visual steering and graph analysis
    for interpretation.

  30. Corpus Processing: extract noun keyphrases weighted by TF-IDF.
    Corpus Ingestion: routine document collection every hour.
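    A hedged sketch, not Minke's implementation, of extracting noun keyphrases and
    weighting them by TF-IDF; the chunk grammar is an illustrative choice and the
    documents are placeholders (assumes the usual NLTK tokenizer/tagger data).

    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer

    GRAMMAR = "KT: {(<JJ>* <NN.*>+ <IN>)? <JJ>* <NN.*>+}"
    chunker = nltk.RegexpParser(GRAMMAR)

    def keyphrases(text):
        """Yield candidate noun keyphrases from one document."""
        for sent in nltk.sent_tokenize(text):
            tagged = nltk.pos_tag(nltk.word_tokenize(sent))
            for subtree in chunker.parse(tagged).subtrees(lambda t: t.label() == "KT"):
                yield " ".join(word for word, tag in subtree.leaves())

    docs = ["the senate debated the new tax policy",
            "a long debate about election policy"]
    tfidf = TfidfVectorizer(tokenizer=lambda doc: list(keyphrases(doc)), lowercase=False)
    weights = tfidf.fit_transform(docs)      # keyphrase-by-document TF-IDF weights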

  31. Baleen & Minke

  32. Thank you!
    Rebecca Bilbro
    Twitter: twitter.com/rebeccabilbro
    Github: github.com/rebeccabilbro
    Email: [email protected]
