Improving phonetic alignment by handling secondary sequence structures

Talk held at the workshop Computational approaches to the study of dialectal and typological variation, organized as part of the ESSLLI 2012. August 6-10, Opole, Poland.

Johann-Mattis List

August 08, 2012

Transcript

  1. Improving Phonetic Alignment by Handling
    Secondary Sequence Structures
    Johann-Mattis List
    Institute for Romance Languages and Literature
    Heinrich Heine University Düsseldorf
    2012/08/10
    1 / 40

  2. Structure of the Talk
    1 Historical Linguistics
        Keys to the Past
        Comparative Method
        Sound Correspondences
    2 Sequence Comparison
        Sequences
        Alignment Analyses
        Alignment Modes
    3 Secondary Alignment
        Secondary Sequence Structures
        Secondary Alignment Problem
        Secondary Alignment Algorithm
    4 Phonetic Alignment
        SCA
        Paradigmatic Aspects
        Syntagmatic Aspects
    5 Evaluation
        Evaluation Measures
        Gold Standard
        Results
    2 / 40

  3. Historical Linguistics
    Historical Linguistics
    3 / 40

  5. Historical Linguistics Keys to the Past
    Charles Lyell on Languages
    The Geological Evidences
    of
    The Antiquity of Man
    with Remarks on Theories of
    The Origin of Species by Variation
    By Sir Charles Lyell
    London
    John Murray, Albemarle Street
    1863
    4 / 40

  6. Historical Linguistics Keys to the Past
    Charles Lyell on Languages
    If we knew nothing of the existence of Latin, - if all
    historical documents previous to the fifteenth century had
    been lost, - if tradition even was silent as to the former
    existence of a Roman empire, a mere comparison of the
    Italian, Spanish, Portuguese, French, Wallachian, and
    Rhaetian dialects would enable us to say that at some
    time there must have been a language, from which these
    six modern dialects derive their origin in common.
    4 / 40

  7. Historical Linguistics Keys to the Past
    Historical Scenarios
    German ʦ aː n -
    * Proto-Germanic t a n d
    English t ʊː θ -
    ** Proto-Indo-European d o n t
    Italian d ɛ n t e
    * Proto-Romance d e n t
    French d ɑ̃ - -
    5 / 40

  9. Historical Linguistics Keys to the Past
    Historical Scenarios
    German ʦ aː n - -
    * Proto-Germanic t a n d
    English t ʊː - θ -
    ** Proto-Indo-European d o n t
    Italian d ɛ n t e
    * Proto-Romance d e n t
    French d ɑ̃ - - -
    5 / 40

  10. Historical Linguistics Keys to the Past
    Historical Scenarios
    German ʦ aː n - -
    Proto-Germanic t a n θ -
    English t ʊː - θ -
    ** Proto-Indo-European d o n t
    Italian d ɛ n t e
    Proto-Romance d e n t e
    French d ɑ̃ - - -
    5 / 40

  11. Historical Linguistics Keys to the Past
    Historical Scenarios
    German ʦ aː n -
    Proto-Germanic t a n θ -
    English t ʊː - θ
    ** Proto-Indo-European d o n t
    Italian d ɛ n t e
    Proto-Romance d e n t e
    French d ɑ̃ - -
    5 / 40

  12. Historical Linguistics Keys to the Past
    Historical Scenarios
    German ʦ aː n -
    Proto-Germanic t a n θ -
    English t ʊː - θ
    Proto-Indo-European d e n t -
    Italian d ɛ n t ə
    Proto-Romance d e n t e
    French d ɑ̃ - -
    5 / 40

  13. Historical Linguistics Keys to the Past
    Historical Scenarios
    German ʦ aː n -
    * Proto-Germanic t a n d
    English t ʊː - θ
    Proto-Indo-European d e n t
    Italian d ɛ n t ə
    * Proto-Romance d e n t
    French d ɑ̃ - -
    5 / 40

  14. Historical Linguistics Keys to the Past
    Historical Scenarios
    German ʦ aː n
    Proto-Germanic t a n θ
    English t ʊː θ
    Proto-Indo-European d e n t
    Italian d ɛ n t e
    Proto-Romance d e n t e
    French d ɑ̃
    5 / 40

  15. Historical Linguistics Comparative Method
    The Comparative Method
    (1) Compile an initial list of putative cognate sets.
    (2) Extract an initial list of putative sets of sound correspondences
        from the initial cognate list.
    (3) Refine the cognate list and the correspondence list by
        (a) adding and deleting cognate sets from the cognate list,
            depending on whether they are consistent with the
            correspondence list or not, and
        (b) adding and deleting correspondence sets from the
            correspondence list, depending on whether they are
            consistent with the cognate list or not.
    (4) Finish when the results are satisfactory.
    6 / 40

  16. Historical Linguistics Sound Correspondences
    Sound Correspondences
    Sequence similarity is determined on the basis of systematic
    sound correspondences as opposed to similarity based on surface
    resemblances of phonetic segments.
    Lass (1997) calls this notion of similarity genotypic, as opposed to
    a phenotypic notion of similarity based on surface resemblance.
    The most crucial aspect of correspondence-based similarity is that
    it is language-specific: Genotypic similarity is never defined in
    general terms but always with respect to the language systems
    which are being compared.
    Meaning     German     Dutch          English
    “tooth”     [ʦaːn]     tand [tɑnt]    [tʊːθ]
    “ten”       [ʦeːn]     tien [tiːn]    [tɛn]
    “tongue”    [ʦʊŋə]     tong [tɔŋ]     [tʌŋ]
    7 / 40

  17. Historical Linguistics Sound Correspondences
    Sound Correspondences
    Sequence similarity is determined on the basis of systematic
    sound correspondences as opposed to similarity based on surface
    resemblances of phonetic segments.
    Lass (1997) calls this notion of similarity genotypic, as opposed to
    a phenotypic notion of similarity based on surface resemblance.
    The most crucial aspect of correspondence-based similarity is that
    it is language-specific: Genotypic similarity is never defined in
    general terms but always with respect to the language systems
    which are being compared.
    Meaning German Dutch English
    “tooth” Zahn [ ʦ aːn] tand [ t ɑnt] tooth [ t ʊːθ]
    “ten” zehn [ ʦ eːn] tien [ t iːn] ten [ t ɛn]
    “tongue” Zunge [ ʦ ʊŋə] tong [ t ɔŋ] tongue [ t ʌŋ]
    7 / 40

  18. Historical Linguistics Sound Correspondences
    Sound Correspondences
    Sequence similarity is determined on the basis of systematic
    sound correspondences as opposed to similarity based on surface
    resemblances of phonetic segments.
    Lass (1997) calls this notion of similarity genotypic, as opposed to
    a phenotypic notion of similarity based on surface resemblance.
    The most crucial aspect of correspondence-based similarity is that
    it is language-specific: Genotypic similarity is never defined in
    general terms but always with respect to the language systems
    which are being compared.
    Meaning Shanghai Beijing Guangzhou
    “nine” [ ʨ iɤ³⁵] [ ʨ iou²¹⁴] [ k ɐu³⁵]
    “today” [ ʨ iŋ⁵⁵ʦɔ²¹] [ ʨ iɚ⁵⁵] [ k ɐm⁵³jɐt²]
    “rooster” [koŋ⁵⁵ ʨ i²¹] [kuŋ⁵⁵ ʨ i⁵⁵] [ k ɐi⁵⁵koŋ⁵⁵]
    7 / 40

  19. Sequence Comparison
    Sequence Comparison
    8 / 40

  20. Sequence Comparison Sequences
    Sequences
    Definition 1
    Given an alphabet (a non-empty finite set, whose elements
    are called characters), a sequence is an ordered list of
    characters drawn from the alphabet. The elements of sequences
    are called segments. (cf. Böckenhauer & Bongartz 2003: 30f)
    9 / 40

  21. Sequence Comparison Sequences
    Sequences
    10 / 40

  23. Sequence Comparison Sequences
    Sequences
    11 / 40

  24. Sequence Comparison Sequences
    Sequences
    11 / 40

  25. Sequence Comparison Sequences
    Sequences
    Baked Rabbit
    1 rabbit
    1 1/2 tsp. salt
    1/8 tsp. pepper
    1 1/2 c. onion slices
    • Rub salt and pepper on rabbit pieces.
    • Place on large sheet of aluminium foil.
    • Place onion slices on rabbit.
    • Bake at 350 degrees.
    • Eat when done and tender.
    11 / 40

  26. Sequence Comparison Alignment Analyses
    Alignment Analyses
    Definition 2
    An alignment of two sequences s and t is a two-row matrix
    in which both sequences are arranged in such a way that
    all matching and mismatching segments occur in the same
    column, while empty cells, resulting from empty matches,
    are filled with gap symbols. (cf. Kruskal 1983)
    12 / 40

  27. Sequence Comparison Alignment Analyses
    Alignment Analyses
    0 H H H H H 0
    0 H H H H 0
    13 / 40

  29. Sequence Comparison Alignment Analyses
    Alignment Analyses
    0 H H H H H 0
    0 H H H H H 0
    13 / 40

  31. Sequence Comparison Alignment Modes
    Global Alignment
    Global alignment analyses are the most basic way to compare
    sequences. The traditional Needleman-Wunsch algorithm
    (Needleman and Wunsch 1970) conducts global alignment
    analyses, and the Levenshtein distance (edit distance,
    Levenshtein 1965) is defined for global alignments.
    Mode Alignment
    global
    G R E E N C A T F I S H H U N T E R
    A F A T C A T - - - - H U N T E R
    14 / 40
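
    The global mode can be sketched as a small dynamic program. This
    is a minimal illustration, not the implementation used in the
    talk: the function name global_align and the scores (match +1,
    mismatch -1, gap -1) are assumptions chosen for readability.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch sketch: return (score, row_s, row_t).

    The scoring values are illustrative assumptions, not the talk's.
    """
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap          # s[:i] aligned against gaps only
    for j in range(1, n + 1):
        M[0][j] = j * gap          # t[:j] aligned against gaps only

    def sub(a, b):
        return match if a == b else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),  # (mis)match
                M[i - 1][j] + gap,                          # gap in t
                M[i][j - 1] + gap,                          # gap in s
            )

    # Trace back one optimal path to recover the two-row matrix.
    row_s, row_t, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return M[m][n], row_s[::-1], row_t[::-1]
```

    Calling global_align("ABC", "AC") yields a score of 1 and two rows
    of equal length; removing the gap symbols from each row gives back
    the input sequences.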

  33. Sequence Comparison Alignment Modes
    Semi-Global Alignment
    Semi-global alignment analyses do not necessarily compare
    two sequences as a whole but allow prefixes and suffixes to
    be ignored in an alignment analysis, if these would otherwise
    increase the cost of the optimal alignment. Computationally,
    this is done by setting the costs for gaps inserted at the
    beginning and at the end of an alignment to zero.
    Mode Alignment
    global
    G R E E N C A T F I S H H U N T E R
    A F A T C A T - - - - H U N T E R
    semi-global
    G R E E N - C A T F I S H H U N T E R
    - - - - - A F A T C A T H U N T E R
    15 / 40
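
    A hedged sketch of the semi-global mode, written with similarity
    scores rather than costs: leading gaps are free because the first
    row and column stay at zero, and trailing gaps are free because
    the result is read off the last row or last column. The function
    name and the scores (+1, -1, -1) are assumptions.

```python
def semi_global_score(s, t, match=1, mismatch=-1, gap=-1):
    """Score of the best alignment in which end gaps are not penalized."""
    m, n = len(s), len(t)
    # First row and first column remain 0: leading gaps cost nothing.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + sub,
                M[i - 1][j] + gap,
                M[i][j - 1] + gap,
            )
    # Trailing gaps cost nothing: take the best cell of the last row
    # or the last column instead of the bottom-right corner.
    return max(max(M[m]), max(row[n] for row in M))
```

    With these scores, semi_global_score("CAT", "XXCATYY") is 3,
    because the unmatched prefix and suffix of the longer sequence are
    ignored.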

  35. Sequence Comparison Alignment Modes
    Local Alignment
    While semi-global alignment analyses allow prefixes and
    suffixes to be ignored only if one sequence contains a
    prefix or suffix while the other does not, local alignment
    analyses (Smith-Waterman algorithm, Smith and Waterman
    1981) only align the best-scoring subsequences of two
    sequences, while leaving the rest of the sequences completely
    unaligned. Computationally, this is done by preventing the
    score of an alignment from dropping below zero.
    Mode Alignment
    global
    G R E E N C A T F I S H H U N T E R
    A F A T C A T - - - - H U N T E R
    semi-global
    G R E E N - C A T F I S H H U N T E R
    - - - - - A F A T C A T H U N T E R
    local
    GREEN CATFISH H U N T E R
    A FAT CAT H U N T E R
    16 / 40
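
    The local mode differs from the global recursion only in the
    floor at zero and in where the result is read. The following is a
    minimal sketch under assumed scores (+1, -1, -1); the function
    name is hypothetical.

```python
def local_score(s, t, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman sketch: score of the best-scoring pair of subsequences."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(
                0,                       # restart: never drop below zero
                M[i - 1][j - 1] + sub,
                M[i - 1][j] + gap,
                M[i][j - 1] + gap,
            )
            best = max(best, M[i][j])    # the optimum may end anywhere
    return best
```

    For example, local_score("XXCATYY", "ZZCATWW") is 3: only the
    shared stretch CAT is aligned, and the surrounding material is
    left out.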

  37. Sequence Comparison Alignment Modes
    Diagonal Alignment
    While local alignment analyses leave unalignable parts of
    sequences unaligned, diagonal alignment analyses (DIALIGN
    algorithm, Morgenstern 1996) align sequences globally, but
    search for local similarities at the same time. Local
    similarities are defined as “diagonals”, i.e. ungapped
    alignments. Diagonal alignment analyses maximize the score of
    diagonals in an alignment.
    Mode Alignment
    global
    G R E E N C A T F I S H H U N T E R
    A F A T C A T - - - - H U N T E R
    semi-global
    G R E E N - C A T F I S H H U N T E R
    - - - - - A F A T C A T H U N T E R
    local
    GREEN CATFISH H U N T E R
    A FAT CAT H U N T E R
    diagonal
    - - - - - G R E E N C A T F I S H H U N T E R
    A F A T - - - - - C A T - - - - H U N T E R
    17 / 40

  38. Secondary Alignment
    secondarysequencestructures
    secondary sequence structures
    se co nda ry se que nce stru ctu re s
    se con da ry se quence struc tures
    s e c o n d a r y s e q u e n c e s t r u c t u r e s
    S E C O N D A R Y S E Q U E N C E S T R U C T U R E
    sec ond ary seq uen ces tru ctu res
    seco ndar yseq uenc estr ctur es
    Secondary Alignment
    18 / 40

  42. Secondary Alignment Secondary Sequence Structures
    Secondary Sequence Structures
    Apart from a primary structure, sequences can also have a
    secondary structure. Primary structure refers to the order of
    segments. Secondary structure refers to the order of secondary
    segments, i.e. segments that result from the grouping of
    primary segments into higher units.
    "ABCEFGIJK" → "ABC.EFG.IJK"
    "THECATFISHHUNTER" → "THE.CATFISH.HUNTER"
    "KARAOKE" → "KA.RA.O.KE"
    19 / 40
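
    The two levels of structure can be illustrated with the dotted
    notation used above. This is a small sketch with hypothetical
    helper names; the boundary marker "." is taken from the examples
    on the slide.

```python
def primary_segments(seq, boundary="."):
    """Primary structure: the plain order of segments, boundaries dropped."""
    return [seg for seg in seq if seg != boundary]


def secondary_segments(seq, boundary="."):
    """Secondary structure: the primary segments grouped into higher units."""
    return seq.split(boundary)
```

    For "KA.RA.O.KE" the primary structure is the seven segments of
    KARAOKE, while the secondary structure consists of the four units
    KA, RA, O and KE.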

  43. Secondary Alignment Secondary Alignment Problem
    The Secondary Alignment Problem
    Secondary Alignment Problem
    Given two sequences $s$ and $t$ of length $m$ and $n$ which have
    the primary structures $s_1, \ldots, s_m$ and $t_1, \ldots, t_n$,
    and the secondary structures $s_{0 \to i}, \ldots, s_{j \to m}$
    and $t_{0 \to k}, \ldots, t_{l \to n}$, find an alignment of
    maximal score in which segments belonging to the same secondary
    segment in $s$ only correspond to segments belonging to the same
    secondary segment in $t$, and vice versa.
    20 / 40

  44. Secondary Alignment Secondary Alignment Problem
    The Secondary Alignment Problem
    Mode Alignment
    global
    T H E - C A T - F I S H - H U N T S
    T H E - C A T - F I S H - E - - - S
    semiglobal
    T H E - C A T - F I S H - - - H U N T S
    T H E - C A T - F I S H E S - - - - - -
    local
    T H E - C A T - F I S H HUNTS
    T H E - C A T - F I S H ES
    diagonal
    T H E - C A T - F I S H - - H U N T S
    T H E - C A T - F I S H E - - - - - S
    secondary
    T H E - C A T F I S H - H U N T - S
    T H E - C A T - - - - - F I S H E S
    21 / 40

  45. Secondary Alignment Secondary Alignment Algorithm
    A Secondary Alignment Algorithm
    Algorithm 1: Secondary(x, y, g, r, score)
    comment: matrix construction and initialization
    . . .
    comment: main loop
    for i ← 1 to length(x) do
        for j ← 1 to length(y) do
            M[i][j] ← max of
                M[i − 1][j − 1] + score(x_{i−1}, y_{j−1})
                comment: check for restriction 2
                if x_{i−1} = r and y_{j−1} ≠ r and j ≠ length(y)
                    then −∞
                    else M[i − 1][j] + g
                if y_{j−1} = r and x_{i−1} ≠ r and i ≠ length(x)
                    then −∞
                    else M[i][j − 1] + g
    22 / 40
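
    The main loop of Algorithm 1 can be sketched in Python as follows.
    Several points are assumptions rather than part of the slide: the
    function name, the scores (match +1, mismatch -1, gap -1), the
    initialization (which the slide elides), and the reading of the
    second and third comparison in each condition as "≠", which is
    what the score matrices shown for "AABCDE.E" and "A.BC.DE"
    imply. The slide also does not show restriction 1; here it is
    assumed to be part of the scoring function, so that a boundary
    marker never matches an ordinary segment.

```python
NEG_INF = float("-inf")


def secondary_score(x, y, boundary=None, match=1, mismatch=-1, gap=-1):
    """Global alignment score; with `boundary` set, secondary structure is respected."""

    def score(a, b):
        # Assumed restriction 1: boundary vs. ordinary segment is forbidden.
        if boundary is not None and (a == boundary) != (b == boundary):
            return NEG_INF
        return match if a == b else mismatch

    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap
    for j in range(1, n + 1):
        M[0][j] = j * gap

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            xi, yj = x[i - 1], y[j - 1]
            diag = M[i - 1][j - 1] + score(xi, yj)
            # Restriction 2: a boundary in x may face a gap only next to
            # a boundary in y or at the very end of y.
            if boundary is not None and xi == boundary and yj != boundary and j != n:
                up = NEG_INF
            else:
                up = M[i - 1][j] + gap
            # The same restriction, mirrored for a boundary in y.
            if boundary is not None and yj == boundary and xi != boundary and i != m:
                left = NEG_INF
            else:
                left = M[i][j - 1] + gap
            M[i][j] = max(diag, up, left)
    return M[m][n]
```

    Under these assumptions the sketch reproduces the final cells of
    the two matrices in the talk: 1 for the unrestricted alignment of
    "AABCDE.E" and "A.BC.DE", and -3 once "." is passed as the
    boundary marker.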

  46. Secondary Alignment Secondary Alignment Algorithm
    A Secondary Alignment Algorithm
    Score matrices for the sequences "AABCDE.E" (rows) and "A.BC.DE"
    (columns); the traceback pointers of the original figure are
    omitted.

    Matrix 1 (no restriction on the boundary marker ".")
          0    A    .    B    C    .    D    E
     0    0   -1   -2   -3   -4   -5   -6   -7
     A   -1    1    0   -1   -2   -3   -4   -5
     A   -2    0    0   -1   -2   -3   -4   -5
     B   -3   -1   -1    1    0   -1   -2   -3
     C   -4   -2   -2    0    2    1    0   -1
     D   -5   -3   -3   -1    1    1    2    1
     E   -6   -4   -4   -2    0    0    1    3
     .   -7   -5   -3   -3   -1    1    0    2
     E   -8   -6   -4   -4   -2    0    0    1

    Matrix 2 (with the restrictions for secondary alignment)
          0    A    .    B    C    .    D    E
     0    0   -1   -2   -3   -4   -5   -6   -7
     A   -1    1   -3   -3   -4   -6   -6   -7
     A   -2    0   -4   -4   -4   -7   -7   -7
     B   -3   -1   -5   -3   -4   -8   -8   -8
     C   -4   -2   -6   -4   -2   -9   -9   -9
     D   -5   -3   -7   -5   -3  -10   -8   -9
     E   -6   -4   -8   -6   -4  -11   -9   -7
     .   -7   -8   -3   -4   -5   -3   -4   -5
     E   -8   -8   -4   -4   -5   -4   -4   -3
    23 / 40

  47. Secondary Alignment Secondary Alignment Algorithm
    A Secondary Alignment Algorithm
    The extension for secondary alignment is independent of the
    underlying alignment mode.
    Global, semi-global, local, and diagonal alignment analyses that
    are sensitive to secondary sequence structures can be carried out.
    The only requirement of the algorithm in contrast to the traditional
    alignment algorithms is the boundary character which has to be
    specified by the user.
    24 / 40

  48. Phonetic Alignment
    h j - ä r t a -
    h - e - r z - -
    h - e a r t - -
    c - - o r d i s
    hjärta
    herz
    heart
    cordis
    Phonetic Alignment
    25 / 40

  49. Phonetic Alignment SCA
    Sound-Class-Based Phonetic Alignment (SCA)
    SCA (List 2012) is a new method for pairwise and multiple
    phonetic alignment, implemented as part of LingPy
    (http://lingulist.de/lingpy), a Python library for
    quantitative tasks in historical linguistics.
    SCA is based on a novel framework for phonetic alignment that
    combines the most recent developments in computational
    biology with new approaches to sequence modelling in historical
    linguistics and dialectology.
    According to the new framework for sequence modelling, sound
    sequences are internally represented in different layers which
    relate to both important paradigmatic and syntagmatic aspects of
    linguistic sequences.
    26 / 40

  51. Phonetic Alignment Paradigmatic Aspects
    Sound Classes
    Sounds which often occur in correspondence relations in
    genetically related languages can be clustered into classes
    (types). It is assumed “that phonetic correspondences inside a
    ‘type’ are more regular than those between different ‘types’”
    (Dolgopolsky 1986: 35).
    k g p b
    ʧ ʤ f v
    t d ʃ ʒ
    θ ð s z
    27 / 40

  54. Phonetic Alignment Paradigmatic Aspects
    Sound Classes
    Sounds which often occur in correspondence relations in
    genetically related languages can be clustered into classes
    (types). It is assumed “that phonetic correspondences inside a
    ‘type’ are more regular than those between different ‘types’”
    (Dolgopolsky 1986: 35).
    K
    T
    P
    S
    27 / 40

  55. Phonetic Alignment Paradigmatic Aspects
    Scoring Functions for Sound Classes
    LingPy offers default scoring functions for three standard
    sound-class models (ASJP, SCA, DOLGO).
    The standard models differ in how coarsely the continuum of
    sounds is split into discrete classes.
    The scoring functions are based on empirical data on sound
    correspondence frequencies (ASJP model, Brown et al. 2011), and
    on general theoretical models of the directionality and probability of
    sound change processes that are converted into non-directional
    similarity matrices (SCA, DOLGO, see List 2012 for details).
    28 / 40
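
    Converting a sound sequence into sound classes amounts to a table
    lookup. The sketch below is not LingPy code: the dictionary is a
    toy excerpt that covers only the segments of the example [jabəlka]
    and its DOLGO representation given later in the talk, and the
    names DOLGO_TOY and to_sound_classes are hypothetical.

```python
# Toy excerpt of a Dolgopolsky-style model (assumption: only the six
# sounds of the talk's example are covered).
DOLGO_TOY = {
    "j": "J",
    "a": "V",
    "ə": "V",
    "b": "P",
    "l": "R",
    "k": "K",
}


def to_sound_classes(segments, model=DOLGO_TOY, unknown="?"):
    """Map each phonetic segment to its sound class; unknown sounds get '?'."""
    return [model.get(segment, unknown) for segment in segments]
```

    Applied to the segments of [jabəlka], the lookup returns
    J V P V R K V. An alignment algorithm can then compare the class
    symbols instead of the original sounds.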

  56. Phonetic Alignment Syntagmatic Aspects
    Prosodic Strings
    Sound change occurs more frequently in prosodically weak
    positions of phonetic sequences (Geisler 1992).
    Given the sonority profile of a phonetic sequence, one can
    distinguish positions that differ regarding their prosodic context.
    Prosodic context can be modelled by representing a sequence by
    a prosodic string, indicating the different prosodic contexts of
    each segment.
    Based on the relative strength of all sites in a phonetic sequence,
    substitution scores and gap penalties can be modified when
    carrying out alignment analyses.
    Prosodic strings are an alternative to n-gram approaches, since
    they also handle context, their specific advantage being that they
    are more abstract and less data-dependent.
    29 / 40

  58. Phonetic Alignment Syntagmatic Aspects
    Prosodic Strings
    j a b ə l k a
    sonority
    increases
    30 / 40

  59. Phonetic Alignment Syntagmatic Aspects
    Prosodic Strings
    j a b ə l k a
    ↑ △ ↑ △ ↓ ↑ △
    ↑ ascending
    △ maximum
    ↓ descending
    30 / 40

  60. Phonetic Alignment Syntagmatic Aspects
    Prosodic Strings
    j a b ə l k a
    ↑ △ ↑ △ ↓ ↑ △
    o strong
    weak
    30 / 40

  61. Phonetic Alignment Syntagmatic Aspects
    Prosodic Strings
    phonetic sequence j a b ə l k a
    SCA model J A P E L K A
    ASJP model y a b I l k a
    DOLGO model J V P V R K V
    sonority profile 6 7 1 7 5 1 7
    prosodic string # v C v c C >
    Relative Weight 2.0 1.5 1.5 1.3 1.1 1.5 0.7
    30 / 40
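
    The step from the sonority profile to the prosodic string in the
    table above can be approximated with a few rules. This is a
    simplified sketch, not the procedure implemented in LingPy: it
    assumes that sonority 7 marks vowels, that the first segment is
    always "#", and that a word-final vowel is ">". Word-final
    consonants do not occur in the example and are simply treated as
    descending here.

```python
def prosodic_string(sonority, peak=7):
    """Derive a prosodic string from a sonority profile (simplified sketch)."""
    last = len(sonority) - 1
    symbols = []
    for i, value in enumerate(sonority):
        if i == 0:
            symbols.append("#")      # word-initial position
        elif i == last and value == peak:
            symbols.append(">")      # word-final vowel
        elif value == peak:
            symbols.append("v")      # sonority maximum
        elif i < last and sonority[i + 1] > value:
            symbols.append("C")      # ascending sonority
        else:
            symbols.append("c")      # descending sonority
    return "".join(symbols)
```

    For the profile 6 7 1 7 5 1 7 of [jabəlka] these rules give the
    string #vCvcC>, the same as in the table.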

  62. Phonetic Alignment Syntagmatic Aspects
    Secondary Alignment
    While secondary alignment was never an issue in computational
    biology, it is a desideratum in historical linguistics and dialectology.
    Secondary structures are especially important when
    (1) aligning whole sentences, where the alignment of one word from
    one sentence with two words from another sentence should be avoided,
    (2) aligning language data for which morphological information is also
    available, or
    (3) aligning words from Southeast Asian tone languages, which
    generally show a structure in which one syllable corresponds to one
    morpheme.
    31 / 40

  63. Phonetic Alignment Syntagmatic Aspects
    Secondary Alignment
    Primary Alignment
    Haikou z i - t - ³
    Beijing ʐ ʅ ⁵¹ tʰ ou ¹
    Secondary Alignment
    Haikou z i t ³ - - -
    Beijing ʐ ʅ - ⁵¹ tʰ ou ¹
    32 / 40

  64. Evaluation
    v o l - d e m o r t
    v - l a d i m i r -
    v a l - d e m a r -
    Evaluation
    33 / 40

  65. Evaluation Evaluation Measures
    Evaluation Measures
    PAS: Perfect Alignment Score
    CS: Column Score
    SPS: Sum-of-Pairs Score
    34 / 40

  66. Evaluation Evaluation Measures
    Evaluation Measures
    Column-Score (CS)
    $\mathrm{CS} = 100 \cdot \frac{2 \cdot |C_t \cap C_r|}{|C_r| + |C_t|}$,
    where $C_t$ is the set of columns in the test alignment and $C_r$
    is the set of columns in the reference alignment (Rosenberg
    and Ogden 2009).
    Sum-of-Pairs Score (SPS)
    $\mathrm{SPS} = 100 \cdot \frac{2 \cdot |P_t \cap P_r|}{|P_r| + |P_t|}$,
    where $P_t$ is the set of all aligned residue pairs in the test
    alignment and $P_r$ is the set of all aligned residue pairs in the
    reference alignment (ibid.).
    35 / 40
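
    Both measures can be computed from sets once each column is given
    an identity. The sketch below is an illustration for pairwise
    alignments, not the evaluation code used for the talk: it
    assumes that a column is identified by the positions of its two
    residues in their sequences (None for a gap) and that alignments
    are passed as two equally long rows with "-" as the gap symbol.

```python
def columns(row_s, row_t, gap="-"):
    """Return the set of columns as (position in s, position in t); None = gap."""
    cols, i, j = set(), 0, 0
    for a, b in zip(row_s, row_t):
        cols.add((None if a == gap else i, None if b == gap else j))
        i += a != gap   # advance only past real residues
        j += b != gap
    return cols


def overlap(test, ref):
    """Shared formula of CS and SPS: 100 * 2|T ∩ R| / (|R| + |T|)."""
    return 100 * 2 * len(test & ref) / (len(ref) + len(test))


def column_score(test, ref):
    return overlap(columns(*test), columns(*ref))


def sum_of_pairs_score(test, ref):
    def pairs(alignment):
        # Aligned residue pairs are the columns without a gap.
        return {col for col in columns(*alignment) if None not in col}
    return overlap(pairs(test), pairs(ref))
```

    With the reference alignment ("abc", "abc") and the test alignment
    ("abc-", "a-bc"), only the first column agrees, so CS is 200/7
    (about 28.6) and SPS is 40.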

  67. Evaluation Gold Standard
    Gold Standard
    1,089 manually aligned sequence pairs.
    Words taken from the Bai dialects (Wang 2006, Allen 2007) and
    Chinese dialects (Hou 2004).
    Both Bai and Chinese are tone languages.
    All data is available at
    http://lingulist.de/supp/secondary.zip
    36 / 40

  68. Evaluation Results
    Results
    Score Primary Secondary
    PAS 83.47 88.89
    CS 88.54 92.70
    SPS 92.78 95.52
    37 / 40

  69. Concluding Remarks
    As can be seen from the results, the modified algorithm which is
    sensitive to secondary sequence structures shows a great
    improvement compared to the traditional algorithm which aligns
    sequences only with respect to their primary structure.
    The improvement is significant with p < 0.01 using the Wilcoxon
    signed rank test as suggested by Notredame (2000).
    The algorithm for secondary alignment proves very useful for the
    alignment of tonal languages, yet it may also be employed for the
    analysis of other kinds of sequential data and, e.g., help to carry
    out phonetic alignment analyses of whole sentences.
    38 / 40

  70. *deh3
    -
    ?
    What’s next?
    39 / 40

  71. Special thanks to:
    • The German Federal Ministry of Education and Research (BMBF)
      for funding our research project.
    • Hans Geisler for his helpful, critical, and inspiring support.
    40 / 40

  72. THANK YOU
    FOR LISTENING!
    40 / 40
