Building a Gigaword Corpus: Data Ingestion, Management, and Processing for NLP

by Data Intelligence

Published June 28, 2017

Rebecca Bilbro, Bytecubed & District Data Labs
Audience level: Intermediate
Topic area: Modeling


As the applications we build are increasingly driven by text, handling data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. In this talk, we walk through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.