
Modelling sound change with the help of multi-tiered sequence representations

Tiago Tresoldi
September 15, 2018



Transcript

  1. Modelling sound change with the help of multi-tiered sequence representations
    Tiago Tresoldi, Cormac Anderson, and Johann-Mattis List
    Max-Planck-Institut für Menschheitsgeschichte (MPI-SHH, Jena)
    Poznań, September 15th 2018


  2. Issues in computational historical phonology
    ● Computational historical linguistics has been transferring and adapting models
    and methods from evolutionary biology
    ○ Increasing availability of large digital corpora of cross-linguistic data
    ○ Phylogenetic turn
    ● Despite the advances, we have been dealing mostly with lexical and cognacy
    characters: phonetic and phonological tasks are still generally performed in the
    traditional way, without the assistance computers could give
    ○ One of the main reasons for this is that, while it might seem straightforward to compare sound
    sequences with genetic ones, there are striking differences
    ○ The analogies between linguistic and biological basic sequences (i.e., sequences of sounds and
    sequences of genetic bases) break down when we consider the underlying alphabets and the
    assumptions involved
    ○ This is unfortunate, as we possess lots of language data transcribed in this manner


  3. Properties of alphabetic transcriptions
    ● Unlike genetic bases, phonological character sets ("alphabets") are
    language-specific and vary in number and detail
    ○ As we render in a discrete way what is continuous, there is always some level of information loss
    ● Phonetic and phonological transcriptions are idealised representations of various
    levels of abstraction of multidimensional and continuous information
    ○ Not necessarily captured by a single vector of information
    ○ Many phonological domains are not commensurate to a segment
    ● None of the proposed solutions for dealing with the difficulty of modelling such
    sound sequences has become standard, and none is suitable for the
    computational treatment of three of the main tasks of historical phonology:
    ○ Word generation/evaluation
    ○ Rule inference
    ○ Output prediction


  4. Segmental sequences: phonological issues
    Besides the uncertainty in terms of how many discrete units to consider for a given
    system (the problem of non-uniqueness), phonologies have a number of
    non-segmental properties relevant for sound change, e.g.:
    ● sound changes frequently act on natural classes, not individual segments
    ○ while class-defining features such as place of articulation, manner, voice, etc. are included in IPA
    graphemes, they are conflated (e.g. /b/ is a feature bundle of “bilabial”, “stop”, and “voiced”,
    possibly with the implicit negative information “cannot be a syllable nucleus”)
    ● stress, tone, etc. are non-segmental and operate over a domain much larger than
    that of a segment, frequently determining what segments can occur, e.g. in
    unstressed position
    ● melodic features also frequently operate over domains larger than a
    segment, e.g. vowel harmony, distance effects, etc.
    ● this is explicitly recognised in, among others, Firthian phonology (prosodies)
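
    The point about natural classes can be made concrete with a minimal sketch. The feature
    inventory below is a hypothetical toy (five segments, three features), not a feature system
    proposed in the talk, and `natural_class` is an illustrative helper name.

```python
# Hypothetical feature bundles for a handful of IPA segments (illustrative only).
FEATURES = {
    "p": {"place": "bilabial", "manner": "stop", "voiced": False},
    "b": {"place": "bilabial", "manner": "stop", "voiced": True},
    "t": {"place": "alveolar", "manner": "stop", "voiced": False},
    "d": {"place": "alveolar", "manner": "stop", "voiced": True},
    "m": {"place": "bilabial", "manner": "nasal", "voiced": True},
}


def natural_class(**conditions):
    """Return the sorted segments whose features match all given conditions."""
    return sorted(
        seg
        for seg, feats in FEATURES.items()
        if all(feats.get(name) == value for name, value in conditions.items())
    )


# A rule stated over a class ("voiced stops") instead of over single graphemes.
voiced_stops = natural_class(manner="stop", voiced=True)
bilabials = natural_class(place="bilabial")
```

    Once the bundle behind each grapheme is unpacked, a change such as "voiced stops devoice" can
    be stated once for the class instead of once per segment.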


  5. Segmental sequences: other issues
    The idealisation of alphabetic transcription is also insufficient for the machine
    representation of further kinds of information:
    ● word frequency
    ● register
    ● part of speech
    ● donor language and period of borrowing in the case of loanwords
    ● contrasting information, such as cases where authors diverge in terms of a related
    proto-form
    ● combined information from different word forms, perhaps also from cognates,
    that might aid us in the identification of changes


  6. Our proposal: tiers
    ● We propose to use extensive annotation to deal with these issues: “tiers”
    ○ these must be parallel, multilayered, and conceptually linked
    ○ linguistically, this can begin by involving annotation of alphabetic transcriptions with (the many)
    distinctive feature systems that researchers have been using for decades, thus also recreating
    natural classes, but need not be limited to this
    ○ computationally, the layers are analogous to solutions used in stochastic methods such as Layered
    Markov Models
    ● In our proposal, a potentially large number of "tiers" can be expressed in their
    relationship to a given sequence (i.e., word)
    ○ while the most obvious tiers are distinctive features, suprasegmental information, and extra-lexical
    information, tiers can accommodate all kinds of information, including the relationship between two or
    more words
    ○ there is no need to discuss "context" in terms of subsequences, as each aligned position can hold all
    the necessary information (and algorithms can be used to identify which tiers are informative and
    which are redundant)
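
    One very simple way to think about redundant tiers is sketched below: a tier is treated as
    redundant with respect to another if its values are fully determined by that other tier. This
    criterion and the helper name `is_determined_by` are assumptions made for illustration, not the
    algorithm used by the authors; the tier values are those of the "cat" example in this deck.

```python
def is_determined_by(target, source):
    """True if each value in `source` always co-occurs with a single value in `target`."""
    seen = {}
    for source_value, target_value in zip(source, target):
        # Remember the first target value seen for this source value; any
        # later mismatch means `target` is not a function of `source`.
        if seen.setdefault(source_value, target_value) != target_value:
            return False
    return True


sound_class = ["K", "A", "T"]
cv = ["C", "V", "C"]

# The CV tier follows from the sound-class tier, so it adds no new information here.
cv_is_redundant = is_determined_by(cv, sound_class)
# The reverse does not hold: "C" corresponds to both "K" and "T".
sound_class_is_redundant = is_determined_by(sound_class, cv)
```

    On real data the check would run over all aligned positions of all words, and a statistical
    measure would probably be preferable to this all-or-nothing test.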


  7. Tiers as annotation
    ● Our proposal is inspired by linguistic annotation in general. Like the linguistic
    annotation of corpora, which provides an “added value” (Milà Garcia 2018: 271),
    our annotation framework, which represents one sound sequence as a
    supra-sequence consisting of multiple annotation layers, adds value to pure
    alphabetic transcriptions in order to overcome their well-known disadvantages.
    While these disadvantages can be easily handled in manual approaches, for
    computational approaches it is indispensable that the annotations are explicit.
    This is what our framework makes possible.
    ● Our proposal has predecessors in historical linguistics. Hoenigswald (1990), in
    particular, annotates accented versus unaccented initial stops in Germanic in a
    way that is very similar to our idea of using complex annotations to increase the
    expressiveness of classical transcription.


  8. An initial example: "cat"
    Tier name
    Grapheme                         c    a    t
    Phoneme                          k    æ    t
    Position                         1    2    3
    CV                               C    V    C
    Voiceness                        0    1    0
    Sound class                      K    A    T
    Preceding sound class (SC -1)    ∅    K    A
    Following sound class (SC +1)    A    T    ∅
    ...                              ...  ...  ...
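
    The table can be held in memory as a set of aligned lists, one per tier. The sketch below is
    an illustration only: the function name `build_tiers`, the dictionary layout, and the tier
    keys are assumptions and do not reflect the API of the authors' library.

```python
def build_tiers(graphemes, phonemes, sound_classes, voiced):
    """Build aligned tiers (one value per position) for a single word."""
    n = len(phonemes)
    if not (len(graphemes) == len(sound_classes) == len(voiced) == n):
        raise ValueError("all tiers must have the same length")
    vowels = {"A"}  # assumption: sound class "A" marks vowels, as in the table
    return {
        "grapheme": list(graphemes),
        "phoneme": list(phonemes),
        "position": list(range(1, n + 1)),
        "cv": ["V" if sc in vowels else "C" for sc in sound_classes],
        "voiceness": list(voiced),
        "sound_class": list(sound_classes),
        # Context tiers are shifted copies of the sound-class tier, padded with "∅".
        "sc_prev": ["∅"] + list(sound_classes[:-1]),
        "sc_next": list(sound_classes[1:]) + ["∅"],
    }


cat = build_tiers("cat", ["k", "æ", "t"], ["K", "A", "T"], [0, 1, 0])

# Everything known about the second position, read off as one column.
column_2 = {name: tier[1] for name, tier in cat.items()}
```

    Reading a single column gives the segment together with its context, which is why context does
    not have to be expressed as a subsequence.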


  9. Correspondences
    The tiers are not limited to phonological information such as distinctive features. They
    can be used to encode other types of information, such as:
    ● Grammatical properties (for example, when modeling processes that only apply to
    a given part-of-speech)
    ● Statistical properties (such as word frequency, when modeling processes that
    might only apply under a certain threshold)
    ● Historical and social properties (such as the value of a given tier in a cognate in
    another language or in a dialectal variety)
    ● Linguistic disagreements (for dealing with and evaluating instances in which
    different authors give different accounts)


  10. Task 1: word generation/evaluation
    ● The generation of random words (e.g., in psycholinguistic experiments) or the
    evaluation of the naturalness of a random word (i.e., its statistical likelihood given
    a set of other observed languages) is usually carried out either by generative
    patterns or by Markov models
    ○ Generative patterns tend to be repetitive, favoring the most frequent value of each aspect (syllable
    structure, sound distribution, etc.), with difficulties in modelling even basic phonotactics
    ○ Markov models have a short attention span: they cannot use too large an n-gram window without
    overfitting, and they may fail on more complex patterns such as vowel harmony
    ● Multitiers can potentially be used as a human-interpretable alternative to RNNs (Recurrent
    Neural Networks)
    ○ computer-assisted, not computer-performed, studies
    ● While word generation/evaluation primarily regards the synchronic level, it is also
    necessary to evaluate the plausibility of language states, not just language
    processes
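
    For reference, the Markov baseline mentioned above can be sketched as a bigram model with
    add-one smoothing. The toy lexicon is made up for illustration and the function names are
    assumptions; this is the baseline being criticised, not the multi-tier method itself.

```python
import math
from collections import Counter


def train_bigrams(words):
    """Count bigrams over words padded with the boundary symbol "#"."""
    bigrams, contexts = Counter(), Counter()
    for word in words:
        padded = ["#"] + list(word) + ["#"]
        for prev, curr in zip(padded, padded[1:]):
            bigrams[prev, curr] += 1
            contexts[prev] += 1
    return bigrams, contexts


def log_likelihood(word, bigrams, contexts, alphabet_size):
    """Add-one smoothed bigram log-probability of a word."""
    padded = ["#"] + list(word) + ["#"]
    return sum(
        math.log((bigrams[prev, curr] + 1) / (contexts[prev] + alphabet_size))
        for prev, curr in zip(padded, padded[1:])
    )


toy_lexicon = ["kata", "taka", "kaka", "tata"]  # made-up CV words, illustrative only
bigrams, contexts = train_bigrams(toy_lexicon)
alphabet_size = len({ch for word in toy_lexicon for ch in word}) + 1  # symbols plus "#"

plausible = log_likelihood("taka", bigrams, contexts, alphabet_size)
implausible = log_likelihood("ktaa", bigrams, contexts, alphabet_size)
```

    Each probability is conditioned on a single preceding symbol, so a dependency between two
    vowels separated by a consonant is invisible to the model; this is the short attention span
    noted above.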


  11. Task 2: Rule inference
    We do not know of any attempts to automatise or formalise the inference of sound
    change rules (sound laws) that account for the development of ancestral words to their
    descendant words when given only a set of ancestral forms (usually reconstructed)
    aligned with their descendant forms (usually in an attested language). This task has so far
    been carried out almost exclusively by hand, by experts.
    The particular problems of rule inference are manifold, and we do not need to list them
    all. We emphasise, however, that it would be highly desirable for historical linguistics to
    provide a formalised approach, since this would not only allow us to test different
    approaches against each other, but also to evaluate potential approaches.
    We ran experiments with Germanic and Chinese data, which are here briefly presented.
    They are intended to illustrate what multitiers are and what they can potentially do.


  12. From Middle Chinese to Mandarin
    We ran similar experiments to investigate the tone development from Middle Chinese
    to Mandarin, using MC data from Newman and Raman (1999), kindly provided by the
    authors. The development from Middle Chinese to Mandarin shows a peculiar change in
    the voiced plosives (notably MC *b and *d), whose reflexes in Mandarin Chinese are either
    devoiced counterparts (p and t) or devoiced and aspirated counterparts (pʰ and tʰ).
    We know from previous research that the reason for this lies in the Middle Chinese
    tones (tone 1 in MC triggers aspiration, while the other tones 2-4 show only devoicing).
    With our multi-tier approach, we can easily test this on Newman's dataset. In order to
    do so, we add a tone-tier to the Middle Chinese words in the data starting with *b and
    *d and check the reflexes in Mandarin.


  13. From Middle Chinese to Mandarin
    MCH    MCH-TONE    Mandarin    Frequency
    d      1           tʰ          45
    d      4           tʰ          2
    d      2           t           11
    d      3           t           30
    d      4           t           11
    b      1           pʰ          31
    b      3           pʰ          1
    b      4           pʰ          1
    b      2           p           15
    b      3           p           16
    b      4           p           4
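
    The correlation between tone and aspiration can be read off the table with a few lines of
    code. The sketch below copies the frequencies from the table; the function name and the
    tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

# (MC initial, MC tone, Mandarin reflex, frequency), copied from the table above.
ROWS = [
    ("d", 1, "tʰ", 45), ("d", 4, "tʰ", 2), ("d", 2, "t", 11),
    ("d", 3, "t", 30), ("d", 4, "t", 11),
    ("b", 1, "pʰ", 31), ("b", 3, "pʰ", 1), ("b", 4, "pʰ", 1),
    ("b", 2, "p", 15), ("b", 3, "p", 16), ("b", 4, "p", 4),
]


def aspiration_rate_by_tone(rows):
    """Share of aspirated reflexes per Middle Chinese tone."""
    aspirated, total = defaultdict(int), defaultdict(int)
    for _initial, tone, reflex, freq in rows:
        total[tone] += freq
        if reflex.endswith("ʰ"):
            aspirated[tone] += freq
    return {tone: aspirated[tone] / total[tone] for tone in sorted(total)}


rates = aspiration_rate_by_tone(ROWS)
```

    In these counts every tone 1 item is aspirated and no tone 2 item is, while tones 3 and 4
    contain a handful of aspirated forms (1 of 47 and 3 of 18 respectively), which are the
    candidates for exceptions.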


  14. From Middle Chinese to Mandarin
    The pattern we describe here is by no means new or unknown to historical linguists,
    although it is rarely mentioned in the literature.
    We can find the pattern quickly with multi-tiered sequence representations, provided that
    we test for a correlation between tone and the devoicing patterns from Middle Chinese to
    Mandarin.
    Although we only show in this example what is already known, our approach to the
    problem with multi-tiered sequence representations illustrates that we can in fact use
    multi-tiers for quick tests on data that has not yet been analysed in this way (e.g.,
    testing devoicing and tone development in other South-East Asian language families). Or we
    could search for exceptions in datasets, as we have done in this example.


  15. Task 3: Word prediction
    Multi-tiers can also be used for word prediction, as the process is essentially the inverse
    of the rule inference of task 2:
    ● given a set of rules which manipulate a sequence, in cases where the reflex of a
    proto-form is missing we could automatically generate the expected reflex from
    the rules inferred from other words, checking for cases where the reflex was
    subject to a major semantic shift
    ○ Historical linguists have been doing this for centuries, but multitiers would allow us to partially
    automate the task, providing computer assistance to researchers
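
    A minimal sketch of this idea, restricted to the word-initial plosives of the Middle Chinese
    example in this deck: a "rule" is taken to be the most frequent reflex observed for each
    (initial, tone) context, and prediction is a lookup. Both function names and the
    majority-vote criterion are assumptions for illustration, not the authors' procedure.

```python
from collections import Counter, defaultdict

# (MC initial, MC tone, Mandarin reflex, frequency), from the Middle Chinese example.
OBSERVATIONS = [
    ("d", 1, "tʰ", 45), ("d", 4, "tʰ", 2), ("d", 2, "t", 11),
    ("d", 3, "t", 30), ("d", 4, "t", 11),
    ("b", 1, "pʰ", 31), ("b", 3, "pʰ", 1), ("b", 4, "pʰ", 1),
    ("b", 2, "p", 15), ("b", 3, "p", 16), ("b", 4, "p", 4),
]


def infer_rules(observations):
    """Map each (initial, tone) context to its most frequent reflex."""
    counts = defaultdict(Counter)
    for initial, tone, reflex, freq in observations:
        counts[initial, tone][reflex] += freq
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}


def predict_reflex(initial, tone, rules):
    """Return the expected reflex, or None when no rule covers the context."""
    return rules.get((initial, tone))


rules = infer_rules(OBSERVATIONS)
expected = predict_reflex("d", 1, rules)
```

    The sketch predicts only the initial consonant; a full prediction would apply such rules at
    every aligned position, and the researcher would still need to check the outcome.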


  16. Future work
    We are developing the multitiers as an independent library for the Python programming
    language, with the intention of merging it with LingPy in the future.
    Possibility of annotating and testing different theories of phonological representation


  17. References
    Blevins, J. 2004. Evolutionary Phonology: The Emergence of Sound Patterns. Cambridge University Press.
    Bouchard-Côté, Alexandre, David Hall, Thomas L. Griffiths, and Dan Klein. 2013. “Automated Reconstruction of Ancient Languages Using
    Probabilistic Models of Sound Change.” Proceedings of the National Academy of Sciences of the United States of America 110 (11): 4224–9.
    Chomsky, Noam, and Morris Halle. 1968. The Sound Pattern of English. New York; Evanston; London: Harper & Row.
    Durbin, Richard, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. (1998) 2002. Biological Sequence Analysis. Probabilistic Models of
    Proteins and Nucleic Acids. 7th ed. Cambridge: Cambridge University Press.
    Firth, John Rupert. 1948. “Sounds and Prosodies.” Transactions of the Philological Society 47 (1). Wiley Online Library: 127–52.
    Harris, Zellig Sabbettai. 1963. “Structural Linguistics.” Chicago University Press.
    Hartman, Lee. 2003. “Phono (Version 4.0): Software for Modeling Regular Historical Sound Change.” Santiago de Cuba.
    Jakobson, Roman, C Gunnar Fant, and Morris Halle. 1951. “Preliminaries to Speech Analysis: The Distinctive Features and Their Correlates.”
    Kluge and Seebold. 2002. Etymologisches Wörterbuch der deutschen Sprache. De Gruyter.
    Ladefoged, P., and I. Maddieson. 1996. The Sounds of the World’s Languages. Phonological Theory. Wiley.
    List, Johann-Mattis. 2014. Sequence Comparison in Historical Linguistics. Düsseldorf: Düsseldorf University Press.
    List, Johann-Mattis, and Thiago Chacon. 2015. “Towards a Cross-Linguistic Database for Historical Phonology? A Proposal for a Machine
    Readable Modeling of Phonetic Context.” Leiden.
    Mielke, Jeff. 2008. The Emergence of Distinctive Features. OUP Oxford.
    Newman, J., and Anand V. Raman. 1999. Historical Chinese Phonology: A Compendium of Beijing and Cantonese Pronunciations of Characters
    and their Derivations from Middle Chinese. Newcastle and München: Lincom.
    Wheeler, W. C., and Peter M. Whiteley. 2015. “Historical Linguistics as a Sequence Optimization Problem: The Evolution and Biogeography of
    Uto-Aztecan Languages.” Cladistics 31 (2): 113–25.
