Finding Unavailable Data

This talk is about data: where to get it and how to create it if it doesn’t exist. I’ll take the audience through the process of creating the dataset for my most recent project and show how to view unavailable data as an opportunity rather than an obstacle to answering questions. I’ll cover how to get and read data, popular Python libraries for data analysis and processing (NLTK, the Natural Language Toolkit; pandas; and Gensim), and techniques like regular expressions.

Omayeli Arenyeka

August 11, 2018

Transcript

  1. 1. You can find all the gendered words. 2. You can find the equivalent of a gendered word.
  2. → lady / gentleman → prince / princess → king / queen → father / mother → seamstress / seamster → ministress / minister → iron man → cougar
  3. ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife'] ['man', 'male', 'boy', 'men', 'son', 'father', 'husband'] A gendered word is a word with one of these terms (above) in its definition.
  4. → lady / gentleman → prince / princess → king / queen → father / mother → seamstress → ministress → iron man
  5. text-processing.com/demo/tokenize Tokenization -> chopping up a string into pieces (called tokens) -> throwing away certain characters, such as punctuation
  6. → lady / gentleman → prince / princess → king / queen → father / mother → actor / actress
  7. My meal wasn’t very tasty so I put some maggi on it. My meal wasn’t very tasty so I put some salt on it. My meal wasn’t very tasty so I put some seasoning on it. I sat on the chair to eat my meal.
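
The slides define a gendered word as one whose definition contains a term from the two lists, and describe tokenization as chopping a string into pieces while throwing away punctuation. A minimal sketch of how those two ideas might fit together follows. It is not the code from the talk: the function names `tokenize` and `gender_of_definition` are hypothetical, the sample definitions are made up for illustration, and a plain regular expression stands in for an NLTK tokenizer so the example runs with the standard library only.

```python
import re

# Term lists copied from the transcript.
FEMALE_TERMS = {"woman", "female", "girl", "lady", "women", "mother", "daughter", "wife"}
MALE_TERMS = {"man", "male", "boy", "men", "son", "father", "husband"}


def tokenize(text):
    """Chop a string into lowercase word tokens, discarding punctuation.

    Assumption: a simple regex is enough here; the talk may have used NLTK.
    """
    return re.findall(r"[a-z']+", text.lower())


def gender_of_definition(definition):
    """Return 'female', 'male', 'both', or None for a dictionary definition."""
    tokens = set(tokenize(definition))
    has_female = bool(tokens & FEMALE_TERMS)
    has_male = bool(tokens & MALE_TERMS)
    if has_female and has_male:
        return "both"
    if has_female:
        return "female"
    if has_male:
        return "male"
    return None


# Hypothetical definitions written for illustration, not taken from a real dictionary.
SAMPLE_DEFINITIONS = {
    "seamstress": "A woman who sews, especially one who earns her living by sewing.",
    "prince": "The son of a monarch.",
    "chair": "A seat for one person, usually with a back and four legs.",
}

for word, definition in SAMPLE_DEFINITIONS.items():
    print(word, gender_of_definition(definition))
```

Matching on whole tokens rather than substrings matters in this sketch: a substring search would find "son" inside "person" and "man" inside "woman", and so would mislabel both "chair" and "seamstress".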