Assigning Stress to Out-Of-Vocabulary Words: Three Approaches

Manex Agirrezabal
July 23, 2014

Transcript

  1. AthenaRhythm: Assigning Stress to Out-of-vocabulary Words

    Manex Agirrezabal (1, *), Jeffrey Heinz (2), Mans Hulden (3), Bertol Arrieta (1). (1) University of the Basque Country (UPV/EHU); (2) University of Delaware; (3) University of Colorado Boulder. July 23rd, 2014. The 2014 International Conference on Artificial Intelligence, Las Vegas, Nevada, USA. https://athenarhythm.googlecode.com https://zeuscansion.googlecode.com (*) At this moment, the first author is doing an internship at the University of Delaware.
  4. AthenaRhythm ... before we start ... Scansion: determine the rhythmic nature of lines of verse.

    Example: and the mighty Mudjekeewis → ánd the | míghty | Múdje|kéewis. Trochee: a stressed syllable followed by an unstressed one. Four trochees: trochaic tetrameter.
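
As a rough illustration of the scansion step above, the sketch below names the meter of a line from its syllable-level stress pattern. The `/` and `x` marks, the `FEET` and `LENGTHS` tables and the `name_meter` helper are illustrative assumptions, not part of ZeuScansion or AthenaRhythm.

```python
# Hypothetical helper: name a meter from a stress string,
# where "/" marks a stressed syllable and "x" an unstressed one.
FEET = {"/x": "trochaic", "x/": "iambic", "/xx": "dactylic", "xx/": "anapestic"}
LENGTHS = {1: "monometer", 2: "dimeter", 3: "trimeter", 4: "tetrameter",
           5: "pentameter", 6: "hexameter"}


def name_meter(stresses):
    """Return e.g. 'trochaic tetrameter', or None if no single foot tiles the line."""
    for foot, adjective in FEET.items():
        count, remainder = divmod(len(stresses), len(foot))
        if count and remainder == 0 and foot * count == stresses and count in LENGTHS:
            return f"{adjective} {LENGTHS[count]}"
    return None


# "and the mighty Mudjekeewis" scanned as four stressed-unstressed pairs
print(name_meter("/x/x/x/x"))  # trochaic tetrameter
```

Real verse rarely tiles this cleanly, so a practical system would need to tolerate substitutions; the sketch only covers the regular case shown on the slide.
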
  8. AthenaRhythm ... what you’ll see in this presentation ...

    A prerequisite problem: how to determine the location of stress in words that are not in the dictionary (out-of-vocabulary words). Three approaches to predict the primary stress location: Word Similarity, Linguistic Generalizations, Machine Learning. The best results come from Linguistic Generalizations, but Machine Learning is not far behind.
  9. AthenaRhythm Outline

    1 Background; 2 Technical details: Corpus & Software; 3 Task; 4 Approaches (Similarity approach, Linguistic approach, Machine Learning approach); 5 Results; 6 Discussion & Future Work
  11. AthenaRhythm Background

    Facts: The location of primary stress in English words is a key element of successful scansion of English poetry. Scansion systems fail when they cannot get the lexical stress of the words. ZeuScansion, a tool for scansion of English poetry (Agirrezabal et al., 2013): Similarity approach for unknown words. We have developed two other approaches.
  13. AthenaRhythm Technical details: Corpus & Software

    Corpus: NETtalk pronunciation dictionary. Software: Foma (finite-state compiler and C library); Phonetisaurus (Grapheme-to-Phoneme converter); LIBLINEAR (Library for Large Linear Classification); LIBSVM (Library for Support Vector Machines).
  21. AthenaRhythm Task

    Goal: given an out-of-vocabulary word, infer the location of the primary stress. Example: Mudjekeewis. Where is the primary stress? Candidates: Múdjekeewis, Mudjékeewis, Mudjekéewis, Mudjekeewís. The system has to learn how to locate the stress.
  28. AthenaRhythm Approaches: Similarity approach

    Our assumption: similarly spelled words have the same lexical stress. Example: "I prophesy there’ll be a row". The unknown word prophesy is split as prop|hesy and matched against the similarly spelled prop|hecy, which gives prophecy. We can get the stress pattern of prophecy from the dictionary.
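
The lookup idea above can be sketched in a few lines. The results slide mentions transducers for this approach; the sketch below swaps in Python's `difflib.get_close_matches` purely to illustrate the idea. The toy `STRESS_DICT`, its stress strings and the `stress_by_similarity` helper are hypothetical.

```python
import difflib

# Toy dictionary (hypothetical): word -> stress pattern,
# "1" = primary stress, "0" = no primary stress, one digit per syllable.
STRESS_DICT = {"prophecy": "100", "reference": "100", "dialectal": "0010"}


def stress_by_similarity(word, dictionary=STRESS_DICT, cutoff=0.6):
    """Borrow the stress pattern of the most similarly spelled known word."""
    if word in dictionary:
        return dictionary[word]
    matches = difflib.get_close_matches(word, list(dictionary), n=1, cutoff=cutoff)
    return dictionary[matches[0]] if matches else None


# "prophesy" is not in the toy dictionary, so the pattern of "prophecy" is borrowed.
print(stress_by_similarity("prophesy"))  # 100
```

The `cutoff` value controls how dissimilar a neighbour may be before the function gives up and returns `None`; 0.6 is simply the `difflib` default.
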
  29. AthenaRhythm Approaches: Linguistic approach

    Linguistic chain: (1) Grapheme-to-phoneme (G2P) conversion (Novak et al., 2012); (2) Syllabification procedure (Hulden, 2006); (3) Stress-assignment rules (Hayes, 1995). Generalization: heavy syllables tend to attract stress.
  30. AthenaRhythm Approaches: Linguistic approach

    Stress assignment rules, example: dialectal → dYxlEktL (G2P) → dY.x.lEk.tL (syllabification) → dY.x.'lEk.tL (stress assignment)
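
To make the last step of the chain concrete, here is a minimal sketch of one weight-sensitive rule. It is a toy stand-in, not the Hayes (1995) rule set used in the talk: the `NUCLEI` inventory, the "ends in a consonant" definition of heavy, and the "rightmost non-final heavy syllable" rule are all assumptions chosen for illustration.

```python
# Assumption: these symbols act as syllable nuclei in the toy transcription.
NUCLEI = set("YxEL")


def is_heavy(syllable):
    """Simplified weight test: a syllable is heavy if it ends in a consonant."""
    return syllable[-1] not in NUCLEI


def assign_primary_stress(syllabified):
    """Mark with ' the rightmost non-final heavy syllable, else the first one."""
    syllables = syllabified.split(".")
    target = 0
    for i in range(len(syllables) - 2, -1, -1):  # skip the final syllable
        if is_heavy(syllables[i]):
            target = i
            break
    syllables[target] = "'" + syllables[target]
    return ".".join(syllables)


print(assign_primary_stress("dY.x.lEk.tL"))  # dY.x.'lEk.tL
```

On the slide's example the closed syllable `lEk` is the only non-final heavy syllable under this definition, so it receives the stress mark.
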
  38. AthenaRhythm Approaches: Machine Learning approach

    Example: reference. Feature extraction: {#r}:1, {ef}:1, {fe}:1, {er}:1, {en}:1, {nc}:1, {ce}:1, {e#}:1, {re}:2; class: < − >. Learning: linear SVMs, non-linear SVMs, different kernels. Hyperparameter optimization: grid search for the C and γ parameters. The best: C-Support Vector Classifier with RBF kernel, C = 1024 and γ = 0.0078125.
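
The feature values listed for "reference" are consistent with counting character bigrams after padding the word with a `#` boundary marker on each side. A minimal sketch of that extraction follows; the function name `bigram_features` is an assumption, and the talk's actual feature code may differ.

```python
from collections import Counter


def bigram_features(word, boundary="#"):
    """Count character bigrams of the word padded with boundary markers."""
    padded = f"{boundary}{word}{boundary}"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


features = bigram_features("reference")
print(features["re"], features["#r"], features["e#"])  # 2 1 1
```

Each distinct bigram becomes one dimension of a sparse feature vector, which is the kind of input that libraries such as LIBSVM and LIBLINEAR accept.
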
  40. AthenaRhythm Results

    Evaluation method: cross-validation for the Similarity approach and the Machine Learning approach; whole corpus for the Linguistic approach (expert system).
  41. AthenaRhythm Results

    Accuracy by approach: SIM 0.5843; SIM-OO 0.6777; LING 0.7362; SVM 0.7098. SIM: Similarity approach. SIM-OO: Similarity approach with optimal ordering of transducers. LING: Linguistic approach. SVM: Support Vector Machines (with the best parameters, C = 1024 and γ = 0.0078125).
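
The reported best parameters are both powers of two (1024 = 2^10 and 0.0078125 = 2^-7), which fits a grid search over exponents. The sketch below shows the shape of such a search. The exponent ranges and the `toy_score` function are assumptions: the slides do not state which ranges were searched, and in practice the scorer would be the cross-validated accuracy of an RBF-kernel SVM.

```python
import itertools


def grid_search(score, c_exponents, gamma_exponents):
    """Return (best_score, C, gamma) over a grid of powers of two."""
    best = None
    for c_exp, g_exp in itertools.product(c_exponents, gamma_exponents):
        C, gamma = 2.0 ** c_exp, 2.0 ** g_exp
        result = (score(C, gamma), C, gamma)
        if best is None or result[0] > best[0]:
            best = result
    return best


# Stand-in scorer (hypothetical): peaks at the parameters reported on the slide.
def toy_score(C, gamma):
    return -abs(C - 1024) - abs(gamma - 2.0 ** -7)


best_score, best_C, best_gamma = grid_search(toy_score, range(-5, 16), range(-15, 4))
print(best_C, best_gamma)  # 1024.0 0.0078125
```

Searching exponents rather than raw values keeps the grid small while still covering several orders of magnitude for both parameters.
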
  48. AthenaRhythm Discussion & Future Work

    We have presented three approaches for predicting primary stress in out-of-vocabulary words. This is an important part of ZeuScansion (the automatic scansion system). Best results: Linguistic rules and Machine Learning. These implementations are released under the GNU GPL license at athenarhythm.googlecode.com.
  49. AthenaRhythm Discussion & Future Work

    Future work: include part-of-speech information; add more feature sets (Machine Learning approach); improve the linguistic rules, especially for disyllables; apply these approaches to other languages.
  52. AthenaRhythm Acknowledgments

    Thanks to: Department of Linguistics & Cognitive Sciences (University of Delaware); University of the Basque Country (UPV/EHU); NSF grant #1123692 to the second author.
  54. AthenaRhythm References

    Hayes, B. (1995). Metrical Stress Theory: Principles and Case Studies. University of Chicago Press. Hulden, M. (2006). Finite-state syllabification. Finite-State Methods and Natural Language Processing, pages 86–96. Novak, J., Minematsu, N., and Hirose, K. (2012). WFST-based grapheme-to-phoneme conversion: Open source tools for alignment, model-building and decoding. Finite-State Methods and Natural Language Processing.