Hindi Morphological Analyzer

Developed at Centre for Indian Language Technology, IIT Bombay and integrated with Hindi Wordnet.

Ankit Bahuguna

March 21, 2013

Transcript

  1. Outline
    • Introduction
    • Role of morphology in developing shallow parsing tools
    • Morphology Analyzer: Motivation and Morphology Approaches
    • DM based Morphology Analyzer
    • Hindi Nominal and Verbal Inflection in DM
    • Development of Lexicon: Lexicon Entry Tool
    • DM Based MA Implementation and its Demo
    • Output and Result
    • Future Work
    • References
  2. Introduction
    This work aims to:
    • Provide a comprehensive and coherent account of the various aspects of Hindi inflectional morphology.
    • Apply it concretely in the development of Natural Language Processing tools (such as a Morphology Analyzer) for Hindi.
    • Our implementation is rule-based, i.e., it requires us to analyze and describe the various inflectional forms in Hindi.
    • This analysis is presented within the theoretical framework of Distributed Morphology.
  3. Role of Morphology in Developing Shallow Parsing Tools
    • Morphology is used to develop language processing systems for tasks such as spell-checking, stemming, morphological analysis, Part-of-Speech tagging, chunking, and Machine Translation.
    • It accounts for the morphological properties of languages in a systematic manner. This lets us understand how words are formed, what their constituents are, how they may be arranged into larger units, what semantic and grammatical constraints are involved, and how morphological processes interact with syntactic and phonological ones.
  4. Example (I) - Word decomposed in two or more different ways to produce roots of different POS categories
    • मेरे कई खाते हैं (mere kai khate hain) {I have many bank accounts}
      Analysis: khata (root) + suffix /-e/
    • वे रोज चावल खाते हैं (ve roz chaval khate hain) {They eat rice every day}
      Analysis: kha (root) + suffixes {/-t-/ and /-e/}
  5. Example (II) – Multiple Roots and Multiple Analyses within the same POS Category
    Input word form: नालों (nalõ)
    • POS Category: Noun; Root 1: nal 'horseshoe'; Suffix: -õ; Analysis: Plural, Oblique
    • POS Category: Noun; Root 2: nala 'water channel/trough'; Suffix: -õ; Analysis: Plural, Oblique
  6. Motivation - Morphology Analyzer
    • To develop a morphology-based MA (using a paradigm-based approach), since statistical methods often fail to correctly learn and represent morphological patterns and linguistic generalizations.
    • Emphasis on: Efficiency, High Accuracy and High Coverage.
  7. Morphology - Various Approaches
    • Lexicalist Approach [Lieber, 1992]
      – Assumes a sharp division between syntax and morphology.
      – Words are formed in the lexicon before they enter the terminal node of the syntactic tree.
      – Syntax has no access to alter word-internal structure. It can only rearrange these words to form phrases.
      – Problem: Dealing with freely formed phrasal compounds in Afrikaans.
      – Ex: [[saas laat in die bed] kinders ] (children who go to bed late)
  8. Morphology - Various Approaches (cont.)
    • Affix-less Approach [Anderson, 1992]
      – Holds that there is no independent component like Morphology.
      – All words are formed syntactically using WFRs (Word Formation Rules). All kinds of derivation are treated as syntax.
      – No affixes in the model.
      – Problematic stand: The model cannot treat stem modification and affixation as two different operations.
      – Ex: The following verb pairs cannot be treated as the same kind of operation: sing-sang (Partial Stem Modification); go-went (Total Stem Modification); cut-cut (No Modification); play-played (Affixation).
  9. DM based Morphology Analyzer
    • According to Distributed Morphology theory [Halle and Marantz 1993], the morphological structure of a word or a word form is generated using syntactic operations.
    • It is syntax that provides the features and the structures upon which morphology operates.
    • The various components of morphology are distributed among various levels in the process of word computation.
    • It combines the features of both the Lexicalist and the Affix-less approaches.
  10. Distributed Morphology – Basic Architecture
    In DM, the various components of morphology are distributed among various levels in the process of word formation.
  11. Characteristics of DM based MA
    • Fewer inflectional classes and fewer rules (ordered and well-constrained).
    • Rules that can both analyze and generate the possible morphological forms.
    • Increase in efficiency and accuracy over the existing Hindi Morphological Analyzers.
    • Ease of computational implementation; can be used in other NLP tools such as POS taggers and chunkers (word groupers).
    • Ability to mimic, as far as possible, a native speaker's use of morphological knowledge, so the representation stays intuitive.
    • Use of a simple lexicon with fewer inflectional classes makes it easier for lexicographers to classify words.
  12. Nominal Inflection in DM
    • Class A+: Non-inflecting, masculine or feminine. Example: क्रोध
    • Class B+: Feminine; take याँ/यां in the 'plural, direct' case and ओं (which becomes यों after glide insertion) in the 'plural, oblique' case. Examples: लड़की, उपलब्धि, गुड़िया
    • Class C+: Feminine; take the suffix एँ/एं in the 'plural, direct' case and /-õ/ in the 'plural, oblique' case. Examples: रात, माला, ऋतु, बहू, लौ
    • Class D+: Masculine, ending in आ or या. Examples: लड़का, धागा, कुआँ, साया
    • Class E+: Masculine nouns that inflect only in the 'plural, oblique' form, taking the suffix -ओं (which becomes -यों for ई and इ ending roots due to glide insertion) and 'null' for all other case-number values. Examples: आलू, साधु, माली, कवि, खेत, घर, राजा, पिता, भैया
  13. Example – Nominal Inflection
    Token : घरों
    Suffix Extraction Rule: ों / X,X,NC / [+pl, +oblique]
    Stem Obtained: घर
    Root Formation [Looked up in the lexicon]
    Only one Lexicon Entry Found: <घर> <E+,+masc,NC> <noun>
    Apply Readjustment Rules: Where readjustment rules can produce new stems, we apply them and look up the lexicon again to obtain new roots.
    Readjustment Rule: None [In this case]
    Output Morphological Analysis
    ------------------Set of Roots and Features are----------------------
    Token : घरों, Total Output : 1
    [ Root : घर, Class : E, Category : noun, Suffix : ों ]
    [ Gender : +masc, Number : +pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
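    The lookup steps in this example can be sketched roughly as follows. This is only an illustration, not the analyzer's actual code: the names LEXICON, SUFFIX_RULES and analyze are hypothetical, and the lexicon holds just the one entry quoted on the slide. The input is the plural oblique form घरों (घर + ों).

```python
# Hypothetical mini-lexicon: root -> (inflection class, gender, category).
LEXICON = {"घर": ("E", "+masc", "noun")}

# Hypothetical suffix table: suffix -> features it contributes.
SUFFIX_RULES = {"ों": {"Number": "+pl", "Case": "+oblique"}}


def analyze(token):
    """Strip each known suffix and keep the stems that are lexicon roots."""
    analyses = []
    for suffix, features in SUFFIX_RULES.items():
        if token.endswith(suffix) and len(token) > len(suffix):
            stem = token[: -len(suffix)]
            if stem in LEXICON:
                cls, gender, category = LEXICON[stem]
                analyses.append({
                    "Root": stem, "Class": cls, "Category": category,
                    "Suffix": suffix, "Gender": gender, **features,
                })
    return analyses


result = analyze("घरों")
print(result)
```

    A token with no matching suffix or no lexicon entry simply yields an empty list in this sketch.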
  14. Example - Involving Readjustment Rules
    Token : नालों
    Suffix Extraction Rules: ों / X,X,NC / [+pl, +oblique] and ों / D,X,Naa / [+pl, +oblique]
    Stem Obtained: नाल
    Root Formation [Looked up in the lexicon]
    Lexicon Entry Found: <नाल> <C+,-masc,NC> <noun>
    Root Obtained: नाल
    Readjustment Rule: ा / Null / D,X,Naa : े or ों
    Stem Formation [After application of RR rules, look-up in the lexicon] : नाला
    Lexicon Entry Found: <नाला> <D+,+masc,Naa> <noun>
    New Root Obtained: नाला
    Output Morphological Analysis (for both)
    ------------------Set of Roots and Features are----------------------
    Token : नालों, Total Output : 2
    [ Root : नाल, Class : C, Category : noun, Suffix : ों ]
    [ Gender : -masc, Number : +pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
    [ Root : नाला, Class : D, Category : noun, Suffix : ों ]
    [ Gender : +masc, Number : +pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
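    A minimal sketch of the second look-up, assuming the readjustment rule amounts to restoring a final ा that was deleted before the suffix ों. The names LEXICON, SUFFIX and candidate_roots are hypothetical, and the two entries are the ones quoted on the slide.

```python
# Hypothetical lexicon with the two entries from the slide: root -> (class, gender).
LEXICON = {
    "नाल": ("C", "-masc"),
    "नाला": ("D", "+masc"),
}

SUFFIX = "ों"  # plural, oblique


def candidate_roots(token):
    """Return lexicon roots reachable directly or after one readjustment."""
    if not token.endswith(SUFFIX):
        return []
    stem = token[: -len(SUFFIX)]
    # Candidate 1: the bare stem.
    # Candidate 2: the stem with the deleted final ा restored (assumed rule).
    candidates = [stem, stem + "ा"]
    return [c for c in candidates if c in LEXICON]


roots = candidate_roots("नालों")
print(roots)
```

    Both candidates survive the lexicon check here, which is why the analyzer reports two outputs for this token.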
  15. Verbal Inflection in DM
    Grammatical Category : Exponents (Finite Forms)
    a) Tense
      Present : ho
      Past : th-
      Future : -g-
    b) Aspect
      Habitual : -t-
      Progressive : rəh
      Perfective : -(y)ā, -(y)ī, -(y)ĩ, -(y)e
      Completive : cuk
    c) Mood
      Imperative : null, -o, -iye, -jiye, -nā
      Subjunctive (root)/Optative : -ũ, -o, -(y)e, -(y)ẽ
      Subjunctive with auxiliary : ho
      Presumptive : ho-g-
      Root Conditional : -t-
      Conditional with the auxiliary ho 'be' : ho-t-
    d) Gender-Number
      Masculine-singular : -(y)ā
      Masculine-plural : -(y)e
      Feminine-singular : -(y)ī
      Feminine-plural : -(y)ĩ
    e) Person-Number
      1st p-singular : -ũ
      1st p-plural : -(y)ẽ
  16. Verbal Inflection (cont.)
    Template for Verbal Inflection
    • Main Verb > infinitive > passive marker > person-number > Modal > Aspect > Tense/Mood > gender-number
    Morpheme arrangement rules enable us to identify valid word forms and rule out invalid ones.
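    One way to picture such an arrangement check is to give each template slot a rank and require the ranks to increase from left to right. The slot labels and the function is_valid_order below are hypothetical names chosen for this sketch; the slot order itself is the template above.

```python
# Slot order taken from the template; the label strings are assumptions.
TEMPLATE = [
    "main_verb", "infinitive", "passive", "person_number",
    "modal", "aspect", "tense_mood", "gender_number",
]
RANK = {slot: i for i, slot in enumerate(TEMPLATE)}


def is_valid_order(slots):
    """True if the slots start with the main verb and follow template order."""
    if not slots or slots[0] != "main_verb":
        return False
    if any(slot not in RANK for slot in slots):
        return False
    ranks = [RANK[slot] for slot in slots]
    # Every slot must come strictly later in the template than the previous one.
    return all(a < b for a, b in zip(ranks, ranks[1:]))


# Root + person-number + tense + gender-number, as in a future form.
future_ok = is_valid_order(
    ["main_verb", "person_number", "tense_mood", "gender_number"])
# Gender-number placed before aspect violates the template.
bad_order = is_valid_order(["main_verb", "gender_number", "aspect"])
print(future_ok, bad_order)
```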
  17. Various Verb Forms (all Aspects, Moods, Tenses, etc.) for the 3rd Person, Singular for different kinds of Verbal Roots
  18. Example – Verbal Inflection
    Token : गया
    Suffix Extraction Rule: या / [+masc, -pl, +perfect] / Vx
    (Read as: Suffix / [Gender, Number, Aspect] / Phonological Ending)
    Stem: ग
    Suffix: या
    Verb Readjustment Rule: जा / ग / [+perfect]
    (Read as: Replace ग with जा when Aspect is +perfect)
    Lexicon Look-Up: <जा> <Vaa> <verb>
    Root Formation: जा
    Output Morphological Analysis
    [ Root : जा, Class : , Category : verb, Suffix : या ]
    [ Gender : +masc, Number : -pl, Person : x, Case : x, Tense : x, Aspect : +perfect, Mood : x ]
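    The irregular case can be sketched as a table look-up keyed on the surface stem and the aspect carried by the suffix. All names here (VERB_LEXICON, SUFFIX_RULES, READJUSTMENT, analyze_verb) are hypothetical, and the tables hold only the single rule and entry shown on the slide.

```python
# Hypothetical verb lexicon: root -> phonological class tag from the slide.
VERB_LEXICON = {"जा": "Vaa"}

# Hypothetical suffix table: suffix -> features.
SUFFIX_RULES = {
    "या": {"Gender": "+masc", "Number": "-pl", "Aspect": "+perfect"},
}

# Readjustment: (surface stem, aspect) -> root, i.e. replace ग with जा in the perfect.
READJUSTMENT = {("ग", "+perfect"): "जा"}


def analyze_verb(token):
    """Strip a suffix, readjust the stem if a rule matches, then check the lexicon."""
    for suffix, feats in SUFFIX_RULES.items():
        if token.endswith(suffix) and len(token) > len(suffix):
            stem = token[: -len(suffix)]
            root = READJUSTMENT.get((stem, feats.get("Aspect")), stem)
            if root in VERB_LEXICON:
                return {"Root": root, "Category": "verb",
                        "Suffix": suffix, **feats}
    return None


analysis = analyze_verb("गया")
print(analysis)
```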
  19. Inflection in Adjectives in DM
    • Class A - Adjectives that always remain unchanged: consonant-ending (बहादुर, शान्त/शांत) or vowel-ending, such as भारी.
    • Class B - Adjectives that do not inflect in the masculine gender but take आ to mark the feminine gender, for example अबल-अबला, महोदय-महोदया.
    • Class C - Adjectives that inflect for feminine, plural and oblique, as in अच्छा अच्छे अच्छी (लंबा, छोटा, काला).
  20. Output and its Format
    • The analyzer gives a detailed analysis for each word: the root, the grammatical category, the inflectional class and the feature values associated with the word.
    • It also gives a detailed morphological analysis for each morpheme that constitutes the word form.
    • The analysis of each suffix is produced as a seven-field record with values for the features: gender, number, person, case, tense, aspect, and mood.
    Input Token: TOKEN_IN
    • Possible Root 1: class: category: suffix: morphemes (morpheme 1, morpheme 2, ...): Morpheme analysis (morpheme 1, morpheme 2, ...)
    • Possible Root 2: category: suffix: morphemes (morpheme 1, morpheme 2, ...): Morpheme analysis (morpheme 1, morpheme 2, ...)
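    As an illustration of the seven-field record, the sketch below stores one analysis and prints it in the bracketed layout used on the result slides. The class name Analysis and its attribute names are hypothetical; 'x' is used for a feature that is not applicable, following the sample outputs.

```python
from dataclasses import dataclass, field

# The seven feature fields, in the order used in the output.
FEATURES = ("Gender", "Number", "Person", "Case", "Tense", "Aspect", "Mood")


@dataclass
class Analysis:
    root: str
    cls: str
    category: str
    suffix: str
    features: dict = field(default_factory=dict)

    def render(self):
        """Format the record as '[ Root : ... ] [ Gender : ..., ... ]'."""
        head = (f"[ Root : {self.root}, Class : {self.cls}, "
                f"Category : {self.category}, Suffix : {self.suffix} ]")
        feats = ", ".join(
            f"{name} : {self.features.get(name, 'x')}" for name in FEATURES)
        return f"{head} [ {feats} ]"


record = Analysis("घर", "E", "noun", "ों",
                  {"Gender": "+masc", "Number": "+pl", "Case": "+oblique"})
rendered = record.render()
print(rendered)
```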
  21. Result – Multiple Roots within same Category
    Queried Word: नालों
    ------------------Set of Roots and Features are----------------------
    Token : नालों, Total Output : 2
    [ Root : नाल, Class : C, Category : noun, Suffix : ों ]
    [ Gender : -masc, Number : +pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
    [ Root : नाला, Class : D, Category : noun, Suffix : ों ]
    [ Gender : +masc, Number : +pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
  22. Result - Multiple Roots across POS Categories
    ------------------Set of Roots and Features are----------------------
    Token : खाते, Total Output : 2
    [ Root : खाता, Class : D, Category : noun, Suffix : े ]
    [ Gender : +masc, Number : -pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
    [ Gender : +masc, Number : +pl, Person : x, Case : -oblique, Tense : x, Aspect : x, Mood : x ]
    [ Root : खा, Class : , Category : verb, Suffix : ते ]
    [ Gender : +masc, Number : +-pl, Person : x, Case : x, Tense : , Aspect : +conditional, Mood : x ]
      े -> [ Gender : +masc, Number : +-pl, Person : x, Case : x, Tense : , Aspect : x, Mood : x ]
      त -> [ Gender : x, Number : x, Person : x, Case : x, Tense : x, Aspect : +conditional, Mood : x ]
           [ Gender : x, Number : x, Person : x, Case : x, Tense : x, Aspect : (-perfect : +habitual), Mood : x ]
    [ Gender : +masc, Number : +-pl, Person : x, Case : x, Tense : , Aspect : (-perfect : +habitual), Mood : x ]
      े -> [ Gender : +masc, Number : +-pl, Person : x, Case : x, Tense : , Aspect : x, Mood : x ]
      त -> [ Gender : x, Number : x, Person : x, Case : x, Tense : x, Aspect : +conditional, Mood : x ]
           [ Gender : x, Number : x, Person : x, Case : x, Tense : x, Aspect : (-perfect : +habitual), Mood : x ]
  23. Result - Multiple morphological analyses for a word form
    Queried Word: साए
    ------------------Set of Roots and Features are----------------------
    Token : साए, Total Output : 2
    [ Root : सा, Class : , Category : particle, Suffix : ए ]
    [ Gender : , Number : , Person : , Case : , Tense : , Aspect : , Mood : ]
    [ Root : साया, Class : D, Category : noun, Suffix : ए ]
    [ Gender : +masc, Number : -pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
    [ Gender : +masc, Number : -pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
  24. Result – Irregular Forms
    Queried Word: गए
    ------------------Set of Roots and Features are----------------------
    Token : गए, Total Output : 1
    [ Root : जा, Class : , Category : verb, Suffix : ए ]
    [ Gender : +masc, Number : +pl, Person : x, Case : x, Tense : x, Aspect : +perfect, Mood : x ]
  25. Result - Stem Modifications
    Queried Word: ताइयाँ
    ------------------Set of Roots and Features are----------------------
    Token : ताइयाँ, Total Output : 1
    [ Root : ताई, Class : B, Category : noun, Suffix : याँ ]
    [ Gender : -masc, Number : +pl, Person : x, Case : -oblique, Tense : x, Aspect : x, Mood : x ]
  26. Result - Compound
    ------------------Set of Roots and Features are----------------------
    Token : टेढ़ी-मेढ़ी, Total Output : 1
    [ Root : टेढ़ा-मेढ़ा, Class : C, Category : compound, Suffix : ी ]
    [ Gender : -masc, Number : +-pl, Person : x, Case : +-obl, Tense : x, Aspect : x, Mood : x ]
    ------------------Set of Roots and Features are----------------------
    Token : टेढ़े-मेढ़े, Total Output : 2
    [ Root : टेढ़ा-मेढ़ा, Class : C, Category : compound, Suffix : े ]
    [ Gender : +masc, Number : -pl, Person : x, Case : +obl, Tense : x, Aspect : x, Mood : x ]
    [ Root : टेढ़ा-मेढ़ा, Class : C, Category : compound, Suffix : े ]
    [ Gender : +masc, Number : +pl, Person : x, Case : -obl, Tense : x, Aspect : x, Mood : x ]
  27. Results (Verbs and Nouns)
    Queried Word: जाऊंगा
    ------------------Set of Roots and Features are----------------------
    Token : जाऊंगा, Total Output : 1
    [ Root : जा, Class : , Category : verb, Suffix : ऊंगा ]
    [ Gender : +masc, Number : -pl, Person : 1p, Case : x, Tense : +future, Aspect : x, Mood : x ]
      ा -> [ Gender : +masc, Number : +pl, Person : x, Case : x, Tense : x, Aspect : x, Mood : x ]
      ग -> [ Gender : x, Number : x, Person : x, Case : x, Tense : +future, Aspect : x, Mood : x ]
      ऊं -> [ Gender : x, Number : -pl, Person : 1p, Case : x, Tense : x, Aspect : x, Mood : x ]
    Queried Word: लड़ना
    ------------------Set of Roots and Features are----------------------
    Token : लड़ना, Total Output : 3
    [ Root : लड़ना, Class : D, Category : noun, Suffix : Null ]
    [ Gender : +masc, Number : , Person : , Case : , Tense : , Aspect : , Mood : ]
    [ Root : लड़, Class : , Category : verb, Suffix : ना ]
    [ Gender : x, Number : x, Person : x, Case : x, Tense : x, Aspect : x, Mood : x ]
    [ Root : लड़, Class : , Category : verb, Suffix : ना ]
    [ Gender : +masc, Number : +pl, Person : x, Case : x, Tense : +infinitive, Aspect : x, Mood : x ]
      ा -> [ Gender : +masc, Number : +pl, Person : x, Case : x, Tense : x, Aspect : x, Mood : x ]
      न -> [ Gender : x, Number : x, Person : x, Case : x, Tense : +infinitive, Aspect : x, Mood : x ]
  28. Results (Adjective and Adverb)
    Queried Word: अच्छे
    ------------------Set of Roots and Features are----------------------
    Token : अच्छे, Total Output : 2
    [ Root : अच्छा, Class : C, Category : adjective, Suffix : े ]
    [ Gender : +masc, Number : -pl, Person : x, Case : +obl, Tense : x, Aspect : x, Mood : x ]
    [ Root : अच्छा, Class : C, Category : adjective, Suffix : े ]
    [ Gender : +masc, Number : +pl, Person : x, Case : -obl, Tense : x, Aspect : x, Mood : x ]
    Queried Word: अंदाजन
    ------------------Set of Roots and Features are----------------------
    Token : अंदाजन, Total Output : 1
    [ Root : अंदाजन, Class : , Category : adverb, Suffix : Null ]
    [ Gender : , Number : , Person : , Case : , Tense : , Aspect : , Mood : ]
  29. Problems Faced
    • Issue with fonts: ढ़ and ड़ can each be typed with a single or a double keystroke.
    • Compound category separated by a space: काला पीला
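    If the single versus double keystroke issue shows up as two different code point sequences for the same letter (an assumption; the slide does not spell this out), Unicode normalization is one common way to make lexicon look-ups insensitive to it. The sketch below uses Python's standard unicodedata module; the helper name normalize is hypothetical.

```python
import unicodedata

# ड़ as a single precomposed code point (U+095C).
precomposed = "\u095C"
# ड़ as base letter ड (U+0921) followed by the nukta sign (U+093C).
decomposed = "\u0921\u093C"


def normalize(text):
    """Bring both spellings to one canonical form before lexicon look-up."""
    return unicodedata.normalize("NFC", text)


raw_equal = precomposed == decomposed
normalized_equal = normalize(precomposed) == normalize(decomposed)
print(raw_equal, normalized_equal)
```

    The same treatment applies to ढ़ (U+095D versus U+0922 followed by U+093C); normalizing both the lexicon and the input tokens once keeps the two in step.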
  30. Issues – Hindi Derivational Morphology
    ROOT = Derivational Forms (Level of Prefixes/Suffixes)
      1. मान = 'सम' + 'मान' + 'इत' = सम्मानित (P: 1 and S: 1)
      2. शुद्ध = 'अ' + 'शुद्ध' + 'ता' = अशुद्धता (P: 1 and S: 1)
      3. अपना = 'अपना' + 'पन' = अपनापन (P: 0 and S: 1)
      4. राष्ट्र = 'राष्ट्र' + 'ईय' + 'ता' + 'वादी' = राष्ट्रीयतावादी (P: 0 and S: 3)
      5. स्वाधीन = 'स्वाधीन' + 'ता' + 'वादी' = स्वाधीनतावादी (P: 0 and S: 2)
  31. Issues – Middle Inflection
    • अकेलापन : अकेलेपन
    • गर्मी_का_मौसम : गर्मी_के_मौसम
    • नया_वर्ष : नये_वर्ष
    • काला_चोर : काले_चोर
    • कालाधन : कालेधन
  32. Future Work
    Developing a Word Generator for Hindi.
    • The idea is to see whether the framework is able to capture all kinds of word forms in Hindi, both regular and irregular.
    • The implementation will not be very different from the MA's implementation.
    • The linguistic resources used in the DM-based MA, namely the vocabulary items (suffixal entries) and the readjustment rules, need to be applied in the reverse direction: root entries from the root list are combined with the affixal entries to generate fully inflected surface forms.
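    A rough sketch of what running the rules in reverse could look like for the plural oblique of the two nouns analyzed earlier. The names ROOTS and generate_plural_oblique are hypothetical, and the reverse readjustment (dropping a final ा of an aa-ending Class D root before ों) is an assumption read off the analysis example, not a rule stated for generation.

```python
# Hypothetical root list: root -> (inflection class, ending type).
ROOTS = {"घर": ("E", "NC"), "नाला": ("D", "Naa")}

PL_OBLIQUE = "ों"  # suffixal entry for [+pl, +oblique]


def generate_plural_oblique(root):
    """Combine a root entry with the plural oblique suffix."""
    cls, ending = ROOTS[root]
    stem = root
    # Assumed reverse readjustment: aa-ending Class D roots lose the final ा.
    if cls == "D" and ending == "Naa" and stem.endswith("ा"):
        stem = stem[:-1]
    return stem + PL_OBLIQUE


forms = [generate_plural_oblique(r) for r in ("घर", "नाला")]
print(forms)
```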
  33. Conclusion
    • We presented our study of Hindi inflectional morphology, described within the framework of Distributed Morphology.
    • The study analyses the formation of inflectional forms of Hindi through the application of suffix insertion rules and phonological readjustment rules.
    • We implemented a DM-based Hindi Morphological Analyzer that uses a set of ordered contextual rules to extract suffixes from a word form and to provide a detailed morpheme analysis.
    • We showed that the DM-based Hindi Morphological Analyzer is quite accurate and reliable, and capable of both analysis and generation.
    • We showed that, using Distributed Morphology, the representation of Hindi morphology is minimal, affix-driven, efficient and accurate for the tasks of stemming and morphological analysis.
  34. References
    • Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall.
    • Smriti Singh. 2012. Hindi Inflectional Morphology and its Implementation in Language Processing Tools: A Distributed Morphology Approach. PhD Thesis.
    • Morris Halle and Alec Marantz. 1993. Distributed Morphology and the Pieces of Inflection. In The View from Building 20: Essays in Linguistics in Honor of Sylvain Bromberger, eds. K. Hale and S. J. Keyser, 111–176. Cambridge, Mass.: MIT Press.
  35. THANK YOU!
    Also thanks to our team of linguists: Mrs. Rajita Shukla, Mrs. Jaya Jha, Mrs. Laxmi Kashyap, Mrs. Nootan Verma, Mr. Dhirendra Singh