
Hindi Morphological Analyzer

Developed at Centre for Indian Language Technology, IIT Bombay and integrated with Hindi Wordnet.

Ankit Bahuguna

March 21, 2013

Transcript

  1. By: Ankit Bahuguna
    Guided By: Dr. Pushpak Bhattacharyya
    Hindi Morphological Analyzer


  2. Outline

    Introduction

    Role of morphology in developing shallow parsing tools.

    Morphology Analyzer – Motivation and Morphology Approaches

    DM based Morphology Analyzer

    Hindi Nominal and Verbal Inflection in DM

    Development of Lexicon – Lexicon Entry Tool

    DM Based MA Implementation and its Demo

    Output and Result

    Future Work

    References


  3. Introduction
    The work is aimed to:

    Provide a comprehensive and coherent account of the
    various aspects of Hindi inflectional morphology.

    Apply it concretely in the development of Natural Language
    Processing tools (like Morphology Analyzer) for Hindi.

    Our implementation is rule-based, i.e. it requires us to analyze
    and describe the various inflectional forms in Hindi.

    This analysis is presented here in the theoretical framework of
    Distributed Morphology.


  4. Role of Morphology in Developing Shallow Parsing Tools.

    Used to develop a number of language processing systems for
    tasks such as spell-checking, stemming, morphological analysis,
    Part-of-Speech Tagging, Chunking, Machine Translation.

    It accounts for the morphological properties of languages in a
    systematic manner, enabling us to understand how words are
    formed, what their constituents are, how they may be
    arranged to make larger units, what the semantic and
    grammatical constraints involved are, and how
    morphological processes interact with syntactic and
    phonological ones.


  5. Example (I) - A word form decomposed in two or more different
    ways to produce roots of different POS categories.

    मेरे कई खाते हैं
    mere kai khate hain
    {I have many bank accounts}
    Analysis: khata (root) + suffix /-e/

    वे रोज चावल खाते हैं
    ve roz chaval khate hain
    {They eat rice every day}
    Analysis: kha (root) + suffixes {/-t-/ and /-e/}


  6. Example (II) – Multiple Roots and Multiple Analyses
    within the same POS Category
    Input word form: नालों (nalõ)

    POS Category: Noun
    Root 1: nal 'horseshoe'
    Suffix: -õ
    Analysis: Plural, Oblique

    POS Category: Noun
    Root 2: nala 'water channel/trough'
    Suffix: -õ
    Analysis: Plural, Oblique


  7. Motivation - Morphology Analyzer

    To work towards a morphology-based MA (using a
    paradigm-based approach), since statistical methods
    often fail to learn and represent morphological
    patterns and linguistic generalizations correctly.

    Emphasis on: Efficiency, High Accuracy and High
    Coverage.


  8. Morphology - Various Approaches

    Lexicalist Approach [Lieber, 1992]
    – Considers a sharp division between syntax and morphology.
    – Words are formed in lexicon before they enter the terminal node
    of the syntactic tree.
    – Syntax has no access to alter the word internal structure. It can
    only rearrange these words to form phrases.
    – Problem: Dealing with freely formed Phrasal Compounds in
    Afrikaans.
    – Ex: [[saas laat in die bed] kinders ] (children who go to bed
    late)


  9. Morphology - Various Approaches (cont.)

    Affix-less Approach [Anderson, 1992]
    – Holds that there is no independent component such as
    Morphology.
    – All words are formed in the syntax using WFRs (Word
    Formation Rules). All kinds of derivation are treated as syntax.
    – No affixes in the model.
    – Problem: the model cannot treat stem modification and
    affixation as two different operations.
    – Ex: The following verb pairs should not be treated as the same kind of
    operation: sing-sang (Partial Stem Modification); go-went
    (Total Stem Modification); cut-cut (No Modification);
    play-played (Affixation).


  10. DM based Morphology Analyzer

    According to Distributed Morphology theory [Halle and
    Marantz 1993], the morphological structure of a word or a
    word form is generated using syntactic operations.

    It is syntax that provides features and the structures upon
    which morphology operates.

    Various components of morphology are distributed among
    various levels in the process of word computation.

    It combines features of both the Lexicalist and the Affix-less
    approaches.


  11. Distributed Morphology – Basic Architecture
    In DM, the various components of morphology are distributed
    among various levels in the process of word formation.


  12. Characteristics of DM based MA

    Fewer inflectional classes, Fewer rules (Ordered + well-constrained.)

    Rules that can both analyze and generate the possible morphological
    forms.

    Increase in efficiency and accuracy over the existing Hindi
    Morphological Analyzers

    Ease of computational implementation; can be used in other NLP
    tools such as POS taggers and Chunkers (word groupers).

    Ability to mimic, as far as possible, a native speaker's use of
    morphological knowledge – the representation is intuitive.

    Use of a simple lexicon with fewer inflectional classes makes it easier
    for lexicographers to classify words.


  13. Nominal Inflection in DM

    Class A+ : Non-inflecting, Masculine, Feminine – क्रोध

    Class B+ : Feminine; take याँ / यां in the ‘plural, direct’ and ओं (which
    becomes यों after glide insertion) in the ‘plural, oblique’. लड़की, उपलब्धि,
    गुड़िया

    Class C+ : Feminine; take the suffix एँ / एं in the ‘plural, direct’ and /-õ/ in
    the ‘plural, oblique’ case. रात, माला, ऋतु, बहू, लौ.

    Class D+ : Masculine ending in आ or या. लड़का, धागा, कुआँ, साया

    Class E+ : Masculine nouns that inflect only in the ‘plural, oblique’ form,
    taking the suffix -ओं (becomes -यों for ई and इ ending roots due to
    glide insertion) and null for all other case-number values. आलू, साधु,
    माली, कवि, खेत, घर, राजा, पिता, भैया.
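The Class E pattern can be sketched as a small generation function. This is a hypothetical illustration, not the authors' implementation: the function name is invented, and the shortening of a final ई to इ before the glide is an assumption modelled on the stem-modification result (ताई, ताइयाँ) shown later in the deck. Only consonant-final and इ/ई-final roots are covered.

```python
def class_e_oblique_plural(root: str) -> str:
    """Return the 'plural, oblique' form of a Class E masculine noun.

    Hypothetical sketch: handles consonant-final roots and roots
    ending in the vowel signs for i or ii only.
    """
    if root.endswith("ी"):
        # Assumed: long ii shortens to i, then the glide is inserted.
        return root[:-1] + "ियों"
    if root.endswith("ि"):
        # Glide insertion: the suffix surfaces with an initial glide.
        return root + "यों"
    if root[-1] in "ाुूृेैोौ":
        # Other vowel-final roots (e.g. राजा, आलू) need rules not sketched here.
        raise NotImplementedError("vowel-final roots other than i/ii are not covered")
    # Consonant-final roots take the vowel-sign form of the suffix directly.
    return root + "ों"


for noun in ["घर", "खेत", "कवि", "माली"]:
    print(noun, "->", class_e_oblique_plural(noun))
```

All other case-number cells of a Class E noun are simply the root itself, which is why the class needs only this one rule.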


  14. Example – Nominal Inflection
    Token : घरों
    Suffix Extraction Rule: ों / X,X,NC / [+pl, +oblique]
    Stem Obtained: घर
    Root Formation [Looked up in the lexicon]
    Only one Lexicon Entry Found: <घर>
    Apply Readjustment Rules: Where a Readjustment Rule applies, it produces a new
    stem, which is looked up in the lexicon again to obtain further roots.
    Readjustment Rule: None [In this case]
    Output Morphological Analysis
    ------------------Set of Roots and Features are----------------------
    Token : घरों, Total Output : 1
    [ Root : घर, Class : E, Category : noun, Suffix : ों ]
    [ Gender : +masc, Number : +pl, Person : x, Case : +oblique, Tense : x, Aspect : x,
    Mood : x ]
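The steps in this example (strip a suffix, look the stem up in the lexicon, attach the features) can be sketched as follows. This is a minimal sketch under stated assumptions, not the analyzer's actual code: `LEXICON`, `SUFFIX_RULES` and `analyze` are hypothetical names, the lexicon holds a single toy entry, and the token is assumed to be written with the nasalised ending (घरों).

```python
# Hypothetical toy lexicon: root -> (inflectional class, category, gender).
LEXICON = {"घर": ("E", "noun", "+masc")}

# Hypothetical suffix rules: suffix -> features it contributes.
SUFFIX_RULES = {"ों": {"Number": "+pl", "Case": "+oblique"}}


def analyze(token: str) -> list[dict]:
    """Strip each known suffix and keep stems that are lexicon roots."""
    analyses = []
    for suffix, features in SUFFIX_RULES.items():
        if not token.endswith(suffix):
            continue
        stem = token[: -len(suffix)]
        if stem in LEXICON:
            cls, category, gender = LEXICON[stem]
            analyses.append(
                {"Root": stem, "Class": cls, "Category": category,
                 "Suffix": suffix, "Gender": gender, **features}
            )
    return analyses


print(analyze("घरों"))
```

A token whose stripped stem is not in the lexicon yields an empty list, which is where readjustment rules would get a second chance to find a root.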


  15. Example - Involving Readjustment Rules
    Token : नालों
    Suffix Extraction Rule: ों / X,X,NC / [+pl, +oblique] and ों / D,X,Naa / [+pl, +oblique]
    Stem Obtained: नाल
    Root Formation [Looked up in the lexicon]
    Lexicon Entry Found: <नाल>
    Root Obtained: नाल
    Output Morphological Analysis
    Readjustment Rule: ा / Null / D,X,Naa : े or ों
    Stem Formation: [After application of Readjustment Rules, look-up in the lexicon] : नाला
    Lexicon Entry Found: <नाला>
    New Root Obtained: नाला
    Output Morphological Analysis (For both)
    ------------------Set of Roots and Features are----------------------
    Token : नालों, Total Output : 2
    [ Root : नाल, Class : C, Category : noun, Suffix : ों ]
    [ Gender : -masc, Number : +pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
    [ Root : नाला, Class : D, Category : noun, Suffix : ों ]
    [ Gender : +masc, Number : +pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
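The second look-up in this example can be sketched as undoing the readjustment rule: a Class D root loses its final ा before the suffix, so the analyzer restores it and consults the lexicon again. The names below are hypothetical, the lexicon is a two-entry toy, and the token is assumed to carry the nasalised ending (नालों).

```python
# Hypothetical toy lexicon: root -> (inflectional class, gender).
LEXICON = {"नाल": ("C", "-masc"), "नाला": ("D", "+masc")}
OBLIQUE_PLURAL = "ों"


def roots_with_readjustment(token: str) -> list[str]:
    """Return every lexicon root compatible with an oblique-plural token."""
    if not token.endswith(OBLIQUE_PLURAL):
        return []
    stem = token[: -len(OBLIQUE_PLURAL)]
    roots = []
    # First pass: the bare stem is itself a root.
    if stem in LEXICON:
        roots.append(stem)
    # Second pass: undo the deletion of the final aa, which the
    # readjustment rule licenses only for Class D roots.
    restored = stem + "ा"
    if restored in LEXICON and LEXICON[restored][0] == "D":
        roots.append(restored)
    return roots


print(roots_with_readjustment("नालों"))
```

Checking the class of the restored root keeps the rule well-constrained: a restored stem that happens to be a non-Class-D word is not accepted.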


  16. Verbal Inflection in DM
    Grammatical Category                       Exponents
    Finite Forms
    a) Tense
       Present                                 ho
       Past                                    th-
       Future                                  -g-
    b) Aspect
       Habitual                                -t-
       Progressive                             rəh
       Perfective                              -(y)ā, -(y)ī, -(y)ĩ, -(y)e
       Completive                              cuk
    c) Mood
       Imperative                              null, -o, -iye, -jiye, -nā
       Subjunctive (root)/Optative             -ũ, -o, -(y)e, -(y)ẽ
       Subjunctive with auxiliary              ho
       Presumptive                             ho-g-
       Root Conditional                        -t-
       Conditional with the auxiliary ho ‘be’  ho-t-
    d) Gender-Number
       Masculine-singular                      -(y)ā
       Masculine-plural                        -(y)e
       Feminine-singular                       -(y)ī
       Feminine-plural                         -(y)ĩ
    e) Person-Number
       1st p-singular                          -ũ
       1st p-plural                            -(y)ẽ


  17. Verbal Inflection (cont.)
    Template for Verbal Inflection

    Main Verb > infinitive > passive marker > person-number >
    Modal > Aspect > Tense/Mood > gender-number
    Morpheme arrangement rules enable us to identify valid
    word forms and rule out invalid ones.
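One way to read the template is as a left-to-right ordering constraint on morpheme slots. The sketch below checks that constraint; the snake_case slot labels and the function name are hypothetical, and treating the template as a strict ordering with optional slots is an assumption about how the rules are applied.

```python
# Slot order taken from the template; the labels are hypothetical names.
TEMPLATE = [
    "main_verb", "infinitive", "passive_marker", "person_number",
    "modal", "aspect", "tense_mood", "gender_number",
]
RANK = {slot: i for i, slot in enumerate(TEMPLATE)}


def is_valid_order(slots: list[str]) -> bool:
    """True if the form starts with the main verb and every later
    morpheme fills a slot strictly to the right of the previous one."""
    if not slots or slots[0] != "main_verb":
        return False
    if any(slot not in RANK for slot in slots):
        return False
    ranks = [RANK[slot] for slot in slots]
    return all(a < b for a, b in zip(ranks, ranks[1:]))


# Root + person-number + future marker + gender-number (as in a future form).
print(is_valid_order(["main_verb", "person_number", "tense_mood", "gender_number"]))
# The same morphemes with the last two swapped.
print(is_valid_order(["main_verb", "person_number", "gender_number", "tense_mood"]))
```

The first call prints `True` and the second `False`, which is the sense in which the arrangement rules rule out invalid forms.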


  18. Various Verb Forms (all Aspects, Moods, Tenses,
    etc.) for the 3rd Person, Singular for different kinds of Verbal Roots


  19. Example – Verbal Inflection
    Token : गया
    Suffix Extraction Rule: या / [+masc, -pl, +perfect] / Vx
    (Read as: Suffix / [Gender, Number, Aspect] / Phonological Ending)
    Stem: ग Suffix: या
    Verb Readjustment Rule: जा / ग / [+perfect]
    (Read as: Replace ग with जा when Aspect is +perfect)
    Lexicon Look-Up: <जा>
    Root Formation: जा
    Output Morphological Analysis
    [ Root : जा, Class : , Category : verb, Suffix : या ]
    [ Gender : +masc, Number : -pl, Person : x, Case : x, Tense : x, Aspect :
    +perfect, Mood : x ]
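The irregular-stem handling in this example can be sketched as a lookup keyed on the stripped stem and the aspect feature. All names here are hypothetical; the two suffix entries mirror the analyses the deck reports for गया and गए, and the one-entry rule table and root list are toy data.

```python
# Hypothetical toy data: verb roots, suffix features, readjustment rules.
VERB_ROOTS = {"जा"}
SUFFIX_RULES = {
    "या": {"Gender": "+masc", "Number": "-pl", "Aspect": "+perfect"},
    "ए": {"Gender": "+masc", "Number": "+pl", "Aspect": "+perfect"},
}
# (stem as it appears in the token, aspect) -> lexicon root
READJUSTMENT = {("ग", "+perfect"): "जा"}


def analyze_verb(token: str) -> list[dict]:
    """Strip a suffix, readjust the stem if a rule matches, then look it up."""
    analyses = []
    for suffix, features in SUFFIX_RULES.items():
        if not token.endswith(suffix):
            continue
        stem = token[: -len(suffix)]
        # Fall back to the stem itself when no readjustment rule applies.
        root = READJUSTMENT.get((stem, features["Aspect"]), stem)
        if root in VERB_ROOTS:
            analyses.append({"Root": root, "Suffix": suffix, **features})
    return analyses


print(analyze_verb("गया"))
print(analyze_verb("गए"))
```

Keying the rule on the aspect feature keeps the suppletive stem from being applied to non-perfective forms.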


  20. Inflection in Adjectives in DM

    Class A - Adjectives that always remain unchanged:
    consonant-ending (बहादुर, शान्त/शांत) or vowel-ending
    such as भारी.

    Class B - Adjectives that do not inflect in the masculine
    gender but take आ to mark the feminine
    gender, for example, अबल-अबला, महोदय-महोदया.

    Class C - Adjectives that inflect for feminine, plural and
    oblique, such as अच्छा, अच्छे, अच्छी (लंबा, छोटा, काला).
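The three adjective classes can be sketched as a function that lists the surface forms of a root. The function name is hypothetical, and the sketch assumes that Class C roots end in the vowel sign ा, as all the examples given for that class do.

```python
def adjective_forms(root: str, cls: str) -> set[str]:
    """Return the set of surface forms for an adjective root.

    Hypothetical sketch of the three classes described above.
    """
    if cls == "A":
        # Never inflects.
        return {root}
    if cls == "B":
        # Masculine is unmarked; feminine adds the vowel sign aa.
        return {root, root + "ा"}
    if cls == "C":
        if not root.endswith("ा"):
            raise ValueError("Class C roots are assumed to end in aa")
        stem = root[:-1]
        # Masculine direct singular, masculine oblique/plural, feminine.
        return {root, stem + "े", stem + "ी"}
    raise ValueError(f"unknown adjective class: {cls}")


print(sorted(adjective_forms("अच्छा", "C")))
print(sorted(adjective_forms("अबल", "B")))
```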


  21. Development of Lexicon – Lexicon Entry Tool


  22. DM Based MA Implementation


  23. Output and its Format

    The analyzer gives a detailed analysis for each word: the root, the grammatical
    category, the inflectional class and the feature values associated with the word.

    It also gives a detailed morphological analysis for each morpheme that constitutes the word form.

    The morpheme analysis of each suffix is produced as a seven-field record with values for the
    features: gender, number, person, case, tense, aspect, and mood.
    Input Token: TOKEN_IN

    Possible Root 1: class: category: suffix: morphemes (morpheme 1, morpheme 2, ...):
    Morpheme analysis (morpheme 1, morpheme 2, ...)

    Possible Root 2: class: category: suffix: morphemes (morpheme 1, morpheme 2, ...):
    Morpheme analysis (morpheme 1, morpheme 2, ...)
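The output record can be sketched as a small data class with a renderer that reproduces the bracketed layout used in the result slides. The class and field names are hypothetical, and defaulting an unset feature to `x` is an assumption read off the sample outputs.

```python
from dataclasses import dataclass, field

# The seven feature fields, in the order used in the sample outputs.
FEATURES = ("Gender", "Number", "Person", "Case", "Tense", "Aspect", "Mood")


@dataclass
class Analysis:
    root: str
    cls: str
    category: str
    suffix: str
    features: dict = field(default_factory=dict)

    def render(self) -> str:
        """Format one analysis as a root line plus a seven-field feature line."""
        head = (f"[ Root : {self.root}, Class : {self.cls}, "
                f"Category : {self.category}, Suffix : {self.suffix} ]")
        # Features not set for this analysis are shown as 'x'.
        feats = ", ".join(f"{name} : {self.features.get(name, 'x')}"
                          for name in FEATURES)
        return f"{head}\n[ {feats} ]"


example = Analysis("घर", "E", "noun", "ों",
                   {"Gender": "+masc", "Number": "+pl", "Case": "+oblique"})
print(example.render())
```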


  24. Result – Multiple Roots within same Category
    Queried Word: नालों
    ------------------Set of Roots and Features are----------------------
    Token : नालों, Total Output : 2
    [ Root : नाल, Class : C, Category : noun, Suffix : ों ]
    [ Gender : -masc, Number : +pl, Person : x, Case : +oblique, Tense : x,
    Aspect : x, Mood : x ]
    [ Root : नाला, Class : D, Category : noun, Suffix : ों ]
    [ Gender : +masc, Number : +pl, Person : x, Case : +oblique, Tense : x,
    Aspect : x, Mood : x ]


  25. Result - Multiple Roots across POS Categories
    ------------------Set of Roots and Features are----------------------
    Token : खाते, Total Output : 2
    [ Root : खाता, Class : D, Category : noun, Suffix : े ]
    [ Gender : +masc, Number : -pl, Person : x, Case : +oblique, Tense : x, Aspect : x, Mood : x ]
    [ Gender : +masc, Number : +pl, Person : x, Case : -oblique, Tense : x, Aspect : x, Mood : x ]
    [ Root : खा, Class : , Category : verb, Suffix : ते ]
    [ Gender : +masc, Number : +-pl, Person : x, Case : x, Tense : , Aspect : +conditional, Mood : x ]
    े -> [ Gender : +masc, Number : +-pl, Person : x, Case : x, Tense : , Aspect : x, Mood : x ]
    त -> [ Gender : x, Number : x, Person : x, Case : x, Tense : x, Aspect : +conditional, Mood : x ]
    [ Gender : x, Number : x, Person : x, Case : x, Tense : x, Aspect : (-perfect : +habitual), Mood : x ]
    [ Gender : +masc, Number : +-pl, Person : x, Case : x, Tense : , Aspect : (-perfect : +habitual), Mood : x ]
    े -> [ Gender : +masc, Number : +-pl, Person : x, Case : x, Tense : , Aspect : x, Mood : x ]
    त -> [ Gender : x, Number : x, Person : x, Case : x, Tense : x, Aspect : +conditional, Mood : x ]
    [ Gender : x, Number : x, Person : x, Case : x, Tense : x, Aspect : (-perfect : +habitual), Mood : x ]


  26. Result - Multiple morphological analyses
    for a word form
    Queried Word: साए
    ------------------Set of Roots and Features are----------------------
    Token : साए, Total Output : 2
    [ Root : सा, Class : , Category : particle, Suffix : ए ]
    [ Gender : , Number : , Person : , Case : , Tense : , Aspect : , Mood : ]
    [ Root : साया, Class : D, Category : noun, Suffix : ए ]
    [ Gender : +masc, Number : -pl, Person : x, Case : +oblique, Tense : x, Aspect :
    x, Mood : x ]
    [ Gender : +masc, Number : -pl, Person : x, Case : +oblique, Tense : x, Aspect :
    x, Mood : x ]


  27. Result – Irregular Forms
    Queried Word: गए
    ------------------Set of Roots and Features are----------------------
    Token : गए, Total Output : 1
    [ Root : जा, Class : , Category : verb, Suffix : ए ]
    [ Gender : +masc, Number : +pl, Person : x, Case : x, Tense : x, Aspect :
    +perfect, Mood : x ]


  28. Result - Stem Modifications
    Queried Word: ताइयाँ
    ------------------Set of Roots and Features are----------------------
    Token : ताइयाँ, Total Output : 1
    [ Root : ताई, Class : B, Category : noun, Suffix : याँ ]
    [ Gender : -masc, Number : +pl, Person : x, Case : -oblique, Tense : x, Aspect : x,
    Mood : x ]


  29. Result - Compound
    ------------------Set of Roots and Features are----------------------
    Token : टेढ़ी-मेढ़ी, Total Output : 1
    [ Root : टेढ़ा-मेढ़ा, Class : C, Category : compound, Suffix : ी ]
    [ Gender : -masc, Number : +-pl, Person : x, Case : +-obl, Tense : x, Aspect : x, Mood : x ]
    ------------------Set of Roots and Features are----------------------
    Token : टेढ़े-मेढ़े, Total Output : 2
    [ Root : टेढ़ा-मेढ़ा, Class : C, Category : compound, Suffix : े ]
    [ Gender : +masc, Number : -pl, Person : x, Case : +obl, Tense : x, Aspect : x, Mood : x ]
    [ Root : टेढ़ा-मेढ़ा, Class : C, Category : compound, Suffix : े ]
    [ Gender : +masc, Number : +pl, Person : x, Case : -obl, Tense : x, Aspect : x, Mood : x ]


  30. Results (Verbs and Nouns)
    Queried Word: जाऊंगा
    ------------------Set of Roots and Features are----------------------
    Token : जाऊंगा, Total Output : 1
    [ Root : जा, Class : , Category : verb, Suffix : ऊंगा ]
    [ Gender : +masc, Number : -pl, Person : 1p, Case : x, Tense : +future, Aspect : x, Mood : x ]
    ा -> [ Gender : +masc, Number : +pl, Person : x, Case : x, Tense : x, Aspect : x, Mood : x ]
    ग -> [ Gender : x, Number : x, Person : x, Case : x, Tense : +future, Aspect : x, Mood : x ]
    ऊं -> [ Gender : x, Number : -pl, Person : 1p, Case : x, Tense : x, Aspect : x, Mood : x ]
    Queried Word: लड़ना
    ------------------Set of Roots and Features are----------------------
    Token : लड़ना, Total Output : 3
    [ Root : लड़ना, Class : D, Category : noun, Suffix : Null ]
    [ Gender : +masc, Number : , Person : , Case : , Tense : , Aspect : , Mood : ]
    [ Root : लड़, Class : , Category : verb, Suffix : ना ]
    [ Gender : x, Number : x, Person : x, Case : x, Tense : x, Aspect : x, Mood : x ]
    [ Root : लड़, Class : , Category : verb, Suffix : ना ]
    [ Gender : +masc, Number : +pl, Person : x, Case : x, Tense : +infinitive, Aspect : x, Mood : x ]
    ा -> [ Gender : +masc, Number : +pl, Person : x, Case : x, Tense : x, Aspect : x, Mood : x ]
    न -> [ Gender : x, Number : x, Person : x, Case : x, Tense : +infinitive, Aspect : x, Mood : x ]


  31. Results (Adjective and Adverb)
    Queried Word: अच्छे
    ------------------Set of Roots and Features are----------------------
    Token : अच्छे, Total Output : 2
    [ Root : अच्छा, Class : C, Category : adjective, Suffix : े ]
    [ Gender : +masc, Number : -pl, Person : x, Case : +obl, Tense : x, Aspect : x, Mood : x ]
    [ Root : अच्छा, Class : C, Category : adjective, Suffix : े ]
    [ Gender : +masc, Number : +pl, Person : x, Case : -obl, Tense : x, Aspect : x, Mood : x ]
    Queried Word: अंदाजन
    ------------------Set of Roots and Features are----------------------
    Token : अंदाजन, Total Output : 1
    [ Root : अंदाजन, Class : , Category : adverb, Suffix : Null ]
    [ Gender : , Number : , Person : , Case : , Tense : , Aspect : , Mood : ]


  32. Problems Faced

    Issue with fonts: ढ़ ; ड़ – Single or Double Key Stroke.

    Compound Category separated by Space: काला पीला
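If the single- versus double-keystroke issue corresponds to the two Unicode encodings of these letters (an assumption; the slide does not say so), normalizing tokens before lexicon look-up is one possible mitigation. The sketch below uses Python's standard `unicodedata` module; the function name is hypothetical.

```python
import unicodedata


def normalize_token(token: str) -> str:
    """Map both encodings of a nukta letter to one canonical sequence."""
    return unicodedata.normalize("NFC", token)


# The same letter typed as one precomposed code point (U+095C) ...
single = "\u095c"
# ... and as a base letter followed by a combining nukta (U+0921 U+093C).
double = "\u0921\u093c"

print(single == double)
print(normalize_token(single) == normalize_token(double))
```

The raw strings compare unequal, while the normalized ones compare equal, so a lexicon stored in normalized form would match either input.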


  33. Issues – Hindi Derivational Morphology
    ROOT = Derivational Forms (Level of Prefixes/Suffixes)
    1. मान = ‘सम’ + ‘मान’ + ‘इत’ = सम्मानित (P: 1 and S: 1)
    2. शुद्ध = ‘अ’ + ‘शुद्ध’ + ‘ता’ = अशुद्धता (P: 1 and S: 1)
    3. अपना = ‘अपना’ + ‘पन’ = अपनापन (P: 0 and S: 1)
    4. राष्ट्र = ‘राष्ट्र’ + ‘ईय’ + ‘ता’ + ‘वादी’ = राष्ट्रीयतावादी (P: 0 and S: 3)
    5. स्वाधीन = ‘स्वाधीन’ + ‘ता’ + ‘वादी’ = स्वाधीनतावादी (P: 0 and S: 2)


  34. Issues – Middle Inflection

    अकेलापन : अकेलेपन

    गर्मी_का_मौसम : गर्मी_के_मौसम

    नया_वर्ष : नये_वर्ष

    काला_चोर : काले_चोर

    कालाधन : कालेधन


  35. Future Work
    Developing a Word Generator for Hindi.

    The idea is to see whether the framework is able to capture all kinds
    of word forms in Hindi – both regular and irregular.

    The implementation will not be very different from that of the MA.

    The linguistic resources used in the DM-based MA namely, the
    vocabulary items (suffixal entries) and the re-adjustment rules need
    to be applied in the reverse direction to produce fully inflected
    words using the root entries from the root-list and combining them
    with the affixal entries to generate surface forms.
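Applying the rules in the reverse direction can be sketched for the one readjustment rule the deck spells out: a Class D root drops its final ा before the suffixes े and ों. The function name is hypothetical and the sketch covers only that rule, not a full generator.

```python
def generate(root: str, cls: str, suffix: str) -> str:
    """Combine a root with a suffix, applying the Class D readjustment.

    Hypothetical sketch: only the deletion of final aa for Class D
    roots is modelled; all other roots are concatenated unchanged.
    """
    if cls == "D" and suffix in ("े", "ों") and root.endswith("ा"):
        # Readjustment rule run in the generation direction.
        root = root[:-1]
    return root + suffix


print(generate("नाला", "D", "ों"))
print(generate("खाता", "D", "े"))
print(generate("घर", "E", "ों"))
```

The same rule table thus serves both directions: analysis restores the deleted vowel, generation deletes it.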


  36. Conclusion

    We presented our study of Hindi inflectional morphology, described within the
    framework of Distributed Morphology.

    The study analyses the formation of inflectional forms of Hindi through the
    application of suffix insertion rules and phonological readjustment rules.

    We implemented a DM-based Hindi Morphological Analyzer that uses a set
    of ordered contextual rules to extract suffixes from a word form and to
    provide a detailed morpheme analysis.

    We showed that the DM-based Hindi Morphological Analyzer is quite accurate
    and reliable, capable of both analysis and generation.

    We showed that, using Distributed Morphology, the representation of Hindi
    morphology is minimal, affix-driven, efficient and accurate for the tasks of
    stemming and morphological analysis.


  37. References

    Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing:
    An Introduction to Natural Language Processing, Speech Recognition, and
    Computational Linguistics. 2nd edition. Prentice-Hall.

    Singh, Smriti. 2012. Hindi Inflectional Morphology and its
    Implementation in Language Processing Tools: A Distributed Morphology
    Approach. PhD thesis.

    Halle, Morris, and Alec Marantz. 1993. Distributed Morphology and the Pieces of
    Inflection. In The View from Building 20: Essays in Linguistics in Honor of
    Sylvain Bromberger, eds. K. Hale and S. J. Keyser, 111–176. Cambridge,
    Mass.: MIT Press.


  38. THANK YOU!
    Also thanks to our team of linguists:
    Mrs. Rajita Shukla
    Mrs. Jaya Jha
    Mrs. Laxmi Kashyap
    Mrs. Nootan Verma
    Mr. Dhirendra Singh
