Natural Language Processing with Swift

Talk given at Swift Language User Group in SF on 5 March 2015

Apple has offered an API for natural language processing since iOS 5, which allowed us to tokenize text, detect the language, and determine parts of speech. With Swift and the introduction of Playgrounds, it’s faster and more delightful than ever to experiment with linguistics. We’ll go over how to build a spam detector in Swift, starting with the basic theory and ending with a fully functional Naive Bayes classifier. Feel free to bring your laptop to code along!

Ayaka Nonaka

March 05, 2015

Transcript

  1. Natural Language Processing with /swɪft/ Ayaka Nonaka @ayanonagon

  2. SPAM spam sp@M $PAM spam sp@m SP4M $p@m sp@M SPAM

    spam sp@M $PAM spam sp@m SP4M $p@m sp@M
  3. Spam or Ham?

  4. URGENT - HELP ME DISTRIBUTE MY $15 MILLION TO CHARITY

    IN SUMMARY:- I have 15,000,000.00 (fifteen million) U.S. Dollars and I want you to assist me in distributing the money to charity organizations. I agree to reward you with part of the money for your assistance, kindness and participation in this Godly project. This mail might come to you as a surprise and the temptation to ignore it as unserious could come into your mind but please consider it a divine wish and accept it with a deep sense of humility.
  5. See you at Natural Language Processing in Swift with Ayaka Nonaka of Venmo

    Swift Language User Group (San Francisco + Silicon Valley) Invite 1 friend Simply forward this email to a friend and have them join the Meetup.
  6. Forming a new startup and need an iOS developer to partner with and join me on this new, exciting venture.

    This startup will be the next “big thing” in social media, creating a new way for users to connect with one another, essentially creating its own niche among facebook, twitter and foursquare. If interested please contact the information below. XXXX XXXX XXXXX@XXXXX.XXX XXX XXX XXXX
  7. Naive Bayes Classifier

  8. Bayes’ theorem
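
For reference, the standard statement of Bayes' theorem is given below. The event names A and B are placeholders chosen here, not symbols taken from the slides, and P(B) is assumed to be nonzero. It follows from writing the joint probability two ways, P(A and B) = P(A) P(B | A) = P(B) P(A | B).

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```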

  14. Probability of [suit] & [rank]? = Probability of [suit] × Probability of [rank] given [suit] = 1/4 × 1/13 = 1/52

  19. • 30 emails of a total of 50 are spam

    • 20 out of the total 50 contain the word SODIUM
    • 15 of the emails that contain the word SODIUM are spam
  20. What’s the probability that an email is spam given that it contains the word SODIUM?

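
One way to answer the SODIUM question from the counts above is to plug them into Bayes' theorem. This is a worked sketch, not a reproduction of the original slide. It uses P(SODIUM | spam) = 15/30, because 15 of the 30 spam emails contain the word.

```latex
\begin{aligned}
P(\text{spam} \mid \text{SODIUM})
  &= \frac{P(\text{SODIUM} \mid \text{spam})\, P(\text{spam})}{P(\text{SODIUM})} \\
  &= \frac{\tfrac{15}{30} \cdot \tfrac{30}{50}}{\tfrac{20}{50}}
   = \frac{15}{20} = 0.75
\end{aligned}
```

The result agrees with counting directly: 15 of the 20 emails that contain SODIUM are spam.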
  23. • 30 emails of a total of 50 are spam

    • 20 out of the total 50 contain the word SODIUM
    • 15 of the emails that contain the word SODIUM are spam
    • 15 out of the total 50 contain the word CHOLESTEROL
    • 10 of the emails that contain the word CHOLESTEROL are spam
  32. Throw in another word, [symbol not captured in transcript].

  33. Naive Bayes Classifier

  34. Naive?

  35. Assume conditional independence!

  36. Assume conditional independence. [symbol] and [symbol] are conditionally independent.
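
The symbols on this slide did not survive in the transcript. As a worked sketch, assume the two conditionally independent events are "contains SODIUM" (S) and "contains CHOLESTEROL" (C), given the class, and call the 20 non-spam emails "ham". With the counts from the earlier slides, 5 ham emails contain SODIUM (20 − 15) and 5 contain CHOLESTEROL (15 − 10). The naive Bayes scores for an email containing both words would then be:

```latex
\begin{aligned}
P(\text{spam} \mid S, C) &\propto P(\text{spam})\, P(S \mid \text{spam})\, P(C \mid \text{spam})
  = \tfrac{30}{50} \cdot \tfrac{15}{30} \cdot \tfrac{10}{30} = \tfrac{1}{10} \\
P(\text{ham} \mid S, C) &\propto P(\text{ham})\, P(S \mid \text{ham})\, P(C \mid \text{ham})
  = \tfrac{20}{50} \cdot \tfrac{5}{20} \cdot \tfrac{5}{20} = \tfrac{1}{40} \\
P(\text{spam} \mid S, C) &= \frac{1/10}{1/10 + 1/40} = \frac{4}{5} = 0.8
\end{aligned}
```

Under that assumption, the classifier would label such an email as spam, since 1/10 exceeds 1/40.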

  45. vs.

  51. </MATH>

  52. NSLinguisticTagger

  53. NSLinguisticTagger

    • Lemmatization
    • Part of speech tagging
    • Language detection
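
A minimal sketch of two of the capabilities listed above, written against the current Foundation API rather than the Swift 1.2 API used at the time of the talk. It assumes an Apple platform (macOS 10.13 or iOS 11 and later); newer SDKs mark `NSLinguisticTagger` as deprecated in favor of the NaturalLanguage framework. The helper names `lemmas(in:)` and `dominantLanguage(of:)` are hypothetical, not from the talk.

```swift
import Foundation

/// Returns one entry per word in `text`: the lemma if the tagger provides one,
/// otherwise the lowercased token. (Hypothetical helper for illustration.)
func lemmas(in text: String) -> [String] {
    let tagger = NSLinguisticTagger(tagSchemes: [.lemma], options: 0)
    tagger.string = text

    // NSLinguisticTagger works in NSRange (UTF-16) coordinates.
    let range = NSRange(location: 0, length: text.utf16.count)
    let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]

    var result: [String] = []
    tagger.enumerateTags(in: range, unit: .word, scheme: .lemma, options: options) { tag, tokenRange, _ in
        let token = (text as NSString).substring(with: tokenRange)
        result.append(tag?.rawValue ?? token.lowercased())
    }
    return result
}

/// Best guess at the language of `text` as a language code, or nil if undetermined.
/// (Hypothetical wrapper around the class method of the same name.)
func dominantLanguage(of text: String) -> String? {
    return NSLinguisticTagger.dominantLanguage(for: text)
}
```

Part of speech tagging follows the same pattern with the `.lexicalClass` scheme in place of `.lemma`. The exact lemmas returned depend on the language models installed on the device, so none are shown here.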
  54. Swift Playgrounds

  55. bit.ly/swiftnbc (Swift 1.2)
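
The linked playground is not reproduced in this transcript. As a rough illustration of the idea, here is a minimal Naive Bayes text classifier sketched in current Swift. The type and method names are hypothetical, and the use of log probabilities, add-one (Laplace) smoothing, and a simple non-alphanumeric tokenizer are assumptions made for this sketch; they are not confirmed to match the talk's code.

```swift
import Foundation

/// Illustrative multinomial Naive Bayes classifier (hypothetical names).
struct NaiveBayesClassifier {
    private var documentCounts: [String: Int] = [:]        // category -> training documents
    private var wordCounts: [String: [String: Int]] = [:]  // category -> word -> occurrences
    private var totalWords: [String: Int] = [:]            // category -> total word occurrences
    private var vocabulary: Set<String> = []

    /// Lowercases and splits on anything that is not a letter or digit.
    /// A fuller pipeline might lemmatize tokens instead.
    private func tokenize(_ text: String) -> [String] {
        return text.lowercased()
            .components(separatedBy: CharacterSet.alphanumerics.inverted)
            .filter { !$0.isEmpty }
    }

    mutating func train(text: String, category: String) {
        documentCounts[category, default: 0] += 1
        for word in tokenize(text) {
            wordCounts[category, default: [:]][word, default: 0] += 1
            totalWords[category, default: 0] += 1
            vocabulary.insert(word)
        }
    }

    /// Returns the category with the highest log score, or nil if untrained.
    func classify(_ text: String) -> String? {
        let totalDocuments = documentCounts.values.reduce(0, +)
        guard totalDocuments > 0, !vocabulary.isEmpty else { return nil }

        let words = tokenize(text)
        var best: (category: String, score: Double)? = nil

        for (category, count) in documentCounts {
            // Log prior: fraction of training documents in this category.
            var score = log(Double(count) / Double(totalDocuments))
            // Add-one smoothing keeps unseen words from zeroing out the product.
            let denominator = Double(totalWords[category, default: 0] + vocabulary.count)
            for word in words {
                let occurrences = wordCounts[category]?[word] ?? 0
                score += log(Double(occurrences + 1) / denominator)
            }
            if let current = best, current.score >= score { continue }
            best = (category, score)
        }
        return best?.category
    }
}

// Usage with a few made-up training sentences.
var classifier = NaiveBayesClassifier()
classifier.train(text: "help me distribute my million dollars to charity", category: "spam")
classifier.train(text: "urgent reward money for your assistance", category: "spam")
classifier.train(text: "see you at the swift meetup tonight", category: "ham")
classifier.train(text: "bring your laptop to code along", category: "ham")

let label = classifier.classify("urgent money reward")  // "spam"
```

Summing logarithms instead of multiplying raw probabilities avoids floating-point underflow on longer messages, which is why the sketch scores in log space.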

  56. NLTK nltk.org

  57. Parsimmon github.com/ayanonagon/Parsimmon

  58. ??? @ayanonagon
