
Finding Unavailable Data

This talk is about data: where to get it and how to create it if it doesn’t exist. I’ll take the audience through the process of creating the dataset for my most recent project and show how to view unavailable data as an opportunity rather than an obstacle to answering questions. I’ll cover how to get and read data, as well as popular Python libraries for data analysis and processing, including NLTK (Natural Language Toolkit), pandas, and Gensim, and techniques like regular expressions.

Omayeli Arenyeka

August 11, 2018

Transcript

  1. Yeli @YellzHeard omayeli.com

  2. Finding Unavailable Data

  3. None
  4. What is the male equivalent of a nun?

  5. google.com

  6. google.com

  7. quora.com

  8. quora.com

  9. english.stackexchange.com

  10. 1. You can find all the gendered words. 2. You can find the equivalent of a gendered word.
  11. → lady / gentleman → prince / princess → king / queen → father / mother → seamstress / seamster → ministress / minister → iron man → cougar
  12. None
  13. None
  14. Where and how to get data

  15. APIs Static Data Web Scraping

  16. ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife'] ['man', 'male', 'boy', 'men', 'son', 'father', 'husband'] A gendered word is a word with one of these terms (above) in its definition.
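
The definition on this slide can be sketched as a small check. This is a minimal illustration, not the talk's actual code: the function name `gender_of` and the sample definitions are hypothetical, and matching on whole words is an assumption made so that 'man' is not found inside 'woman'.

```python
import re

# Term lists copied from the slide.
FEMALE_TERMS = ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife']
MALE_TERMS = ['man', 'male', 'boy', 'men', 'son', 'father', 'husband']


def gender_of(definition):
    """Return 'female', 'male', 'both', or None for a definition string."""
    # Compare whole lowercase words so 'man' does not match inside 'woman' or 'many'.
    words = set(re.findall(r"[a-z]+", definition.lower()))
    has_female = any(term in words for term in FEMALE_TERMS)
    has_male = any(term in words for term in MALE_TERMS)
    if has_female and has_male:
        return 'both'
    if has_female:
        return 'female'
    if has_male:
        return 'male'
    return None


# Hypothetical sample definitions, written for this example only.
print(gender_of("A woman who sews for a living."))        # female
print(gender_of("A male member of a religious order."))   # male
print(gender_of("A piece of furniture for sitting on."))  # None
```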
  17. None
  18. None
  19. APIs: Application Programming Interface

  20. programmableweb.com

  21. wordnik.com

  22. wordnik.com

  23. ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife'] ['man', 'male', 'boy', 'men', 'son', 'father', 'husband']
  24. None
  25. 400 words

  26. Static Data .json .txt .csv ...
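
Reading static files of this kind usually needs only the standard library. The sketch below is an illustration under assumptions: the file contents are hypothetical and held in strings so that nothing has to exist on disk, and the word/definition layout is invented for the example.

```python
import csv
import io
import json

# Hypothetical file contents, held in strings so the example needs no files on disk.
json_text = '{"seamstress": "A woman who sews for a living."}'
csv_text = "word,definition\nprince,The son of a monarch.\n"

# .json: parse into a dict mapping word -> definition.
definitions = json.loads(json_text)

# .csv: DictReader yields one dict per row, keyed by the header line.
for row in csv.DictReader(io.StringIO(csv_text)):
    definitions[row["word"]] = row["definition"]

print(sorted(definitions))  # ['prince', 'seamstress']
```

With real files, `open(path)` would replace the `io.StringIO` wrapper and `json.load(file)` would replace `json.loads(text)`.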

  27. None
  28. None
  29. → lady / gentleman → prince / princess → king / queen → father / mother → seamstress → ministress → iron man
  30. ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife'] ['man', 'male', 'boy', 'men', 'son', 'father', 'husband']
  31. None
  32. Regular Expressions -> a sequence of characters that defines a search pattern
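
As a rough illustration of a search pattern, the sketch below builds one regular expression from the male term list shown on the slides. The sample sentence is made up, and the use of `\b` word boundaries is an assumption about how such a search might be written, not a pattern taken from the talk.

```python
import re

terms = ['man', 'male', 'boy', 'men', 'son', 'father', 'husband']

# \b marks a word boundary, so 'man' matches as a word but not inside 'woman'.
pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)

matches = pattern.findall("The son of a king; a woman or a man.")
print(matches)  # ['son', 'man']
```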
  33. regextester.com

  34. None
  35. ['woman', 'female', 'girl', 'lady', 'women', 'mother', 'daughter', 'wife'] ['man', 'male', 'boy', 'men', 'son', 'father', 'husband']
  36. None
  37. regextester.com

  38. None
  39. regextester.com

  40. None
  41. None
  42. None
  43. None
  44. None
  45. None
  46. None
  47. None
  48. None
  49. None
  50. None
  51. ~ 8000 words

  52. None
  53. Patterns

  54. Patterns -> object of a preposition

  55. None
  56. None
  57. None
  58. None
  59. NLTK -> Natural Language Toolkit -> for processing the English language
  60. text-processing.com/demo/tokenize Tokenization -> chopping up a string into pieces (called tokens) -> throwing away certain characters, such as punctuation
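
Tokenization as described here can be sketched with a single regular expression. This is a simplified stand-in that runs without extra installs; NLTK's `word_tokenize` handles many more cases. The `tokenize` helper and the sample sentence are hypothetical.

```python
import re


def tokenize(text):
    # Keep runs of letters, digits, and apostrophes; everything else is dropped.
    return re.findall(r"[A-Za-z0-9']+", text.lower())


tokens = tokenize("A woman's cloak, worn over a dress.")
print(tokens)  # ['a', "woman's", 'cloak', 'worn', 'over', 'a', 'dress']
```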
  61. None
  62. None
  63. None
  64. Patterns -> object of a preposition -> clothing items

  65. None
  66. collinsdictionary.com/us/word-list

  67. None
  68. Web Scraping Icons made by Smashicons from www.flaticon.com/authors/smashicons

  69. urllib.request -> opening URLs BeautifulSoup -> parsing HTML documents
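
The parsing half of scraping can be sketched as follows. To stay runnable without third-party packages, this example swaps in the standard library's `html.parser` in place of BeautifulSoup, and it parses a hypothetical HTML fragment instead of fetching a live page. The `ListItemParser` class and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Collect the text of every <li> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Only text that appears inside an open <li> is recorded.
        if self.in_item:
            self.items[-1] += data


# Hypothetical page fragment; a real run would fetch HTML with
# urllib.request.urlopen(url).read().decode() instead.
html = "<ul><li>apron</li><li>ball gown</li><li>waistcoat</li></ul>"

parser = ListItemParser()
parser.feed(html)
words = [item.strip() for item in parser.items]
print(words)  # ['apron', 'ball gown', 'waistcoat']
```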

  70. None
  71. None
  72. None
  73. None
  74. ~ 4000 words

  75. None
  76. → lady / gentleman → prince / princess → king / queen → father / mother → actor / actress
  77. bionlp-www.utu.fi/wv_demo/

  78. Word2Vec -> words to vectors
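
Once words are vectors, one common way to look for an equivalent word is vector arithmetic followed by a nearest-neighbour search. The sketch below uses made-up 2-D vectors purely for illustration (a trained model learns hundreds of dimensions), and whether the talk used exactly this arithmetic is an assumption. The `equivalent` helper is hypothetical.

```python
import math

# Made-up 2-D vectors for illustration only; not output from any trained model.
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "chair": [0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def equivalent(word, source, target):
    """Find the word closest to vector(word) - vector(source) + vector(target)."""
    query = [w - s + t for w, s, t in
             zip(vectors[word], vectors[source], vectors[target])]
    # The three input words are excluded from the candidates.
    candidates = [c for c in vectors if c not in (word, source, target)]
    return max(candidates, key=lambda c: cosine(query, vectors[c]))


print(equivalent("king", "man", "woman"))  # queen
```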

  79. suriyadeepan.github.io

  80. My meal wasn’t very tasty so I put some maggi on it. My meal wasn’t very tasty so I put some salt on it. My meal wasn’t very tasty so I put some seasoning on it. I sat on the chair to eat my meal.
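
The intuition in these sentences is that words appearing in the same surroundings tend to mean similar things. The sketch below counts shared neighbouring words as a rough stand-in for that idea; it is not how word2vec is trained, and the `contexts` and `overlap` helpers and the window size of 2 are assumptions.

```python
import re

# Sentences from the slide.
sentences = [
    "My meal wasn't very tasty so I put some maggi on it.",
    "My meal wasn't very tasty so I put some salt on it.",
    "My meal wasn't very tasty so I put some seasoning on it.",
    "I sat on the chair to eat my meal.",
]


def contexts(target, window=2):
    """Collect the words seen within `window` positions of `target`."""
    seen = set()
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        for i, word in enumerate(words):
            if word == target:
                seen.update(words[max(0, i - window):i])
                seen.update(words[i + 1:i + 1 + window])
    return seen


def overlap(a, b):
    """Share of context words that two targets have in common (Jaccard index)."""
    ctx_a, ctx_b = contexts(a), contexts(b)
    return len(ctx_a & ctx_b) / len(ctx_a | ctx_b)


print(overlap("maggi", "salt"))   # 1.0
print(overlap("maggi", "chair") < overlap("maggi", "salt"))  # True
```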
  81. None
  82. None
  83. Gensim -> Google-trained word2vec model

  84. None
  85. Yeli @YellzHeard omayeli.com