Building a Gigaword Corpus: Data Ingestion, Management, and Processing for NLP

Data Intelligence
June 28, 2017

Rebecca Bilbro (Bytecubed & District Data Labs)
Audience level: Intermediate
Topic area: Modeling

Description

As the applications we build are increasingly driven by text, doing data ingestion, management, loading, and preprocessing in a robust, organized, parallel, and memory-safe way can get tricky. In this talk we walk through the highs (a custom billion-word corpus!), the lows (segfaults, 400 errors, pesky mp3s), and the new Python libraries we built to ingest and preprocess text for machine learning.
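To make "memory-safe" concrete, here is a minimal sketch of streaming a corpus from disk with generators, so that only one document is held in memory at a time. It is not the library from the talk: the function names (`iter_documents`, `iter_tokens`, `count_words`), the one-file-per-document `.txt` layout, and the regex tokenizer are all assumptions made for illustration.

```python
import os
import re
from collections import Counter
from typing import Iterator, List

# Hypothetical tokenizer: runs of word characters, lowercased by the caller.
TOKEN = re.compile(r"\w+")


def iter_documents(root: str) -> Iterator[str]:
    """Lazily yield the text of each .txt file under root, one at a time."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            if not name.endswith(".txt"):
                continue  # skip non-text files (for example, stray mp3s)
            path = os.path.join(dirpath, name)
            # errors="replace" keeps one badly encoded file from halting the run
            with open(path, encoding="utf-8", errors="replace") as f:
                yield f.read()


def iter_tokens(root: str) -> Iterator[List[str]]:
    """Yield one token list per document without materializing the corpus."""
    for text in iter_documents(root):
        yield TOKEN.findall(text.lower())


def count_words(root: str) -> Counter:
    """Accumulate word frequencies across the corpus in a single pass."""
    counts: Counter = Counter()
    for tokens in iter_tokens(root):
        counts.update(tokens)
    return counts
```

Calling `count_words("corpus/")` on a hypothetical `corpus/` directory walks the files once. Because each stage is a generator, peak memory depends on the largest single document and the vocabulary size rather than on the total corpus size.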
