Slide 1

Slide 1 text

Improving Phonetic Alignment by Handling Secondary Sequence Structures
Johann-Mattis List
Institute for Romance Languages and Literature, Heinrich Heine University Düsseldorf
2012/08/10

Slide 2

Slide 2 text

Structure of the Talk
1 Historical Linguistics: Keys to the Past, Comparative Method, Sound Correspondences
2 Sequence Comparison: Sequences, Alignment Analyses, Alignment Modes
3 Secondary Alignment: Secondary Sequence Structures, Secondary Alignment Problem, Secondary Alignment Algorithm
4 Phonetic Alignment: SCA, Paradigmatic Aspects, Syntagmatic Aspects
5 Evaluation: Evaluation Measures, Gold Standard, Results

Slide 3

Slide 3 text

Historical Linguistics

Slide 4

Slide 4 text

Keys to the Past: Charles Lyell on Languages
Source: Sir Charles Lyell, The Geological Evidences of the Antiquity of Man, with Remarks on Theories of the Origin of Species by Variation. London: John Murray, Albemarle Street, 1863.
“If we knew nothing of the existence of Latin, - if all historical documents previous to the fifteenth century had been lost, - if tradition even was silent as to the former existence of a Roman empire, a mere comparison of the Italian, Spanish, Portuguese, French, Wallachian, and Rhaetian dialects would enable us to say that at some time there must have been a language, from which these six modern dialects derive their origin in common.”

Slide 7

Slide 7 text

Keys to the Past: Historical Scenarios
[Figure built up over several slides; the first and the final build step are given.]
First step:
German                  ʦ aː n -
* Proto-Germanic        t a n d
English                 t ʊː θ -
** Proto-Indo-European  d o n t
Italian                 d ɛ n t e
* Proto-Romance         d e n t
French                  d ɑ̃ - -
Final step:
German                  ʦ aː n
Proto-Germanic          t a n θ
English                 t ʊː θ
Proto-Indo-European     d e n t
Italian                 d ɛ n t e
Proto-Romance           d e n t e
French                  d ɑ̃

Slide 15

Slide 15 text

The Comparative Method
1. Compile an initial list of putative cognate sets.
2. Extract an initial list of putative sets of sound correspondences from the initial cognate list.
3. Refine the cognate list and the correspondence list by adding and deleting cognate sets from the cognate list, depending on whether they are consistent with the correspondence list or not, and by adding and deleting correspondence sets from the correspondence list, depending on whether they are consistent with the cognate list or not.
4. Finish when the results are satisfying enough.

Slide 16

Slide 16 text

Sound Correspondences
Sequence similarity is determined on the basis of systematic sound correspondences, as opposed to similarity based on surface resemblances of phonetic segments. Lass (1997) calls this notion of similarity genotypic, as opposed to a phenotypic notion of similarity. The most crucial aspect of correspondence-based similarity is that it is language-specific: genotypic similarity is never defined in general terms but always with respect to the language systems which are being compared.

Meaning    German          Dutch          English
“tooth”    Zahn [ʦaːn]     tand [tɑnt]    tooth [tʊːθ]
“ten”      zehn [ʦeːn]     tien [tiːn]    ten [tɛn]
“tongue”   Zunge [ʦʊŋə]    tong [tɔŋ]     tongue [tʌŋ]

Meaning     Shanghai        Beijing         Guangzhou
“nine”      [ʨiɤ³⁵]         [ʨiou²¹⁴]       [kɐu³⁵]
“today”     [ʨiŋ⁵⁵ʦɔ²¹]     [ʨiɚ⁵⁵]         [kɐm⁵³jɐt²]
“rooster”   [koŋ⁵⁵ʨi²¹]     [kuŋ⁵⁵ʨi⁵⁵]     [kɐi⁵⁵koŋ⁵⁵]

Slide 19

Slide 19 text

Sequence Comparison

Slide 20

Slide 20 text

Sequences
Definition 1: Given an alphabet (a non-empty finite set, whose elements are called characters), a sequence is an ordered list of characters drawn from the alphabet. The elements of sequences are called segments. (cf. Böckenhauer & Bongartz 2003: 30f)

Slide 21

Slide 21 text

Sequences
Baked Rabbit
1 rabbit
1 1/2 tsp. salt
1/8 tsp. pepper
1 1/2 c. onion slices
• Rub salt and pepper on rabbit pieces.
• Place on large sheet of aluminium foil.
• Place onion slices on rabbit.
• Bake at 350 degrees.
• Eat when done and tender.

Slide 26

Slide 26 text

Alignment Analyses
Definition 2: An alignment of two sequences s and t is a two-row matrix in which both sequences are arranged in such a way that all matching and mismatching segments occur in the same column, while empty cells, resulting from empty matches, are filled with gap symbols. (cf. Kruskal 1983)

Slide 30

Slide 30 text

Alignment Modes: Global Alignment
Global alignment analyses are the most basic way to compare sequences. The traditional Needleman-Wunsch algorithm (Needleman and Wunsch 1970) conducts global alignment analyses, and the Levenshtein distance (edit distance, Levenshtein 1965) is defined for global alignments.
Mode: global
G R E E N C A T F I S H H U N T E R
A F A T C A T - - - - H U N T E R
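The global mode can be illustrated with a minimal Needleman-Wunsch sketch. The function name `global_align` and the scores (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; they are not taken from the talk or from LingPy.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: return (score, aligned_s, aligned_t)."""
    m, n = len(s), len(t)
    # M[i][j] holds the best score for aligning s[:i] with t[:j].
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = i * gap  # leading gaps are penalized in global mode
    for j in range(1, n + 1):
        M[0][j] = j * gap

    def sub(a, b):
        return match if a == b else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),  # (mis)match
                M[i - 1][j] + gap,                          # gap in t
                M[i][j - 1] + gap,                          # gap in s
            )

    # Traceback from the bottom-right corner to the origin.
    row_s, row_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return M[m][n], "".join(reversed(row_s)), "".join(reversed(row_t))


print(global_align("abc", "ac"))  # (1, 'abc', 'a-c')
```

With match 0 and mismatch and gap costs of -1, the negated score corresponds to the Levenshtein distance.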

Slide 32

Slide 32 text

Alignment Modes: Semi-Global Alignment
Semi-global alignment analyses do not necessarily compare two sequences as a whole but allow prefixes and suffixes to be ignored if these would otherwise increase the cost of the optimal alignment. Computationally, this is done by setting the cost of gaps inserted at the beginning and at the end of an alignment to zero.
Mode: semi-global
G R E E N - C A T F I S H H U N T E R
- - - - - A F A T C A T H U N T E R
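A minimal sketch of the semi-global idea, under the same assumed scores as a plain global alignment (match +1, mismatch -1, gap -1): leading gaps are made free by leaving the first row and column at zero, and trailing gaps are made free by taking the best score in the last row or last column. The function name `semi_global_score` is hypothetical.

```python
def semi_global_score(s, t, match=1, mismatch=-1, gap=-1):
    """Score of the best alignment in which end gaps cost nothing."""
    m, n = len(s), len(t)
    # First row and first column stay 0: gaps at the beginning are free.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = max(
                M[i - 1][j - 1] + sub,
                M[i - 1][j] + gap,
                M[i][j - 1] + gap,
            )
    # Gaps at the end are free: the alignment may stop anywhere
    # in the last row or in the last column.
    return max(max(M[m]), max(row[n] for row in M))


print(semi_global_score("xxabc", "abc"))  # 3: the prefix "xx" is ignored
```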

Slide 34

Slide 34 text

Alignment Modes: Local Alignment
While semi-global alignment analyses allow prefixes and suffixes to be ignored only if one sequence contains a prefix or suffix while the other does not, local alignment analyses (Smith-Waterman algorithm, Smith and Waterman 1981) align only the best-scoring subsequences of two sequences and leave the rest of the sequences completely unaligned. Computationally, this is done by never letting the score of a partial alignment drop below zero.
Mode: local
GREEN CATFISH   H U N T E R
A FAT CAT       H U N T E R
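The local mode can be sketched as a small Smith-Waterman variant. Scores (match +1, mismatch -1, gap -1) and the name `local_align` are illustrative assumptions; the only change to the global recurrence is the additional 0 in the maximum and the traceback that stops at the first zero cell.

```python
def local_align(s, t, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: return (score, aligned_part_of_s, aligned_part_of_t)."""
    m, n = len(s), len(t)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_i, best_j = 0, 0, 0

    def sub(a, b):
        return match if a == b else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                0,  # scores are never allowed to drop below zero
                M[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),
                M[i - 1][j] + gap,
                M[i][j - 1] + gap,
            )
            if M[i][j] > best:
                best, best_i, best_j = M[i][j], i, j

    # Trace back from the best cell until a zero cell is reached.
    row_s, row_t = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + sub(s[i - 1], t[j - 1]):
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t))


print(local_align("xxabcxx", "yyabcyy"))  # (3, 'abc', 'abc')
```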

Slide 36

Slide 36 text

Alignment Modes: Diagonal Alignment
While local alignment analyses leave unalignable parts of sequences unaligned, diagonal alignment analyses (DIALIGN algorithm, Morgenstern 1996) align sequences globally, but search for local similarities at the same time. Local similarities are defined as “diagonals”, i.e. ungapped alignments. Diagonal alignment analyses maximize the score of diagonals in an alignment.
Mode: diagonal
- - - - - G R E E N C A T F I S H H U N T E R
A F A T - - - - - C A T - - - - H U N T E R

Slide 38

Slide 38 text

Secondary Alignment

Slide 39

Slide 39 text

Secondary Sequence Structures
Apart from a primary structure, sequences can also have a secondary structure. Primary structure refers to the order of segments. Secondary structure refers to the order of secondary segments, i.e. segments that result from the grouping of primary segments into higher units.
"ABCEFGIJK" → "ABC.EFG.IJK"
"THECATFISHHUNTER" → "THE.CATFISH.HUNTER"
"KARAOKE" → "KA.RA.O.KE"
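The relation between primary and secondary structure can be made concrete with a short sketch. It assumes that the boundary is written as "." as in the examples; the helper names `secondary_segments` and `segment_index` are hypothetical.

```python
def secondary_segments(sequence, boundary="."):
    """Split a boundary-marked sequence into its secondary segments."""
    return [segment for segment in sequence.split(boundary) if segment]


def segment_index(sequence, boundary="."):
    """For each primary segment, the index of its secondary segment."""
    index, current = [], 0
    for char in sequence:
        if char == boundary:
            current += 1  # a boundary starts the next secondary segment
        else:
            index.append(current)
    return index


print(secondary_segments("KA.RA.O.KE"))  # ['KA', 'RA', 'O', 'KE']
print(segment_index("KA.RA.O.KE"))       # [0, 0, 1, 1, 2, 3, 3]
```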

Slide 43

Slide 43 text

The Secondary Alignment Problem
Given two sequences $s$ and $t$ of length $m$ and $n$ which have the primary structures $s_1, \ldots, s_m$ and $t_1, \ldots, t_n$, and the secondary structures $s_{0 \to i}, \ldots, s_{j \to m}$ and $t_{0 \to k}, \ldots, t_{l \to n}$, find an alignment of maximal score in which segments belonging to the same secondary segment in $s$ only correspond to segments belonging to the same secondary segment in $t$, and vice versa.

Slide 44

Slide 44 text

The Secondary Alignment Problem
Mode          Alignment
global        T H E - C A T - F I S H - H U N T S
              T H E - C A T - F I S H - E - - - S
semi-global   T H E - C A T - F I S H - - - H U N T S
              T H E - C A T - F I S H E S - - - - - -
local         T H E - C A T - F I S H   HUNTS
              T H E - C A T - F I S H   ES
diagonal      T H E - C A T - F I S H - - H U N T S
              T H E - C A T - F I S H E - - - - - S
secondary     T H E - C A T F I S H - H U N T - S
              T H E - C A T - - - - - F I S H E S

Slide 45

Slide 45 text

A Secondary Alignment Algorithm
Algorithm 1: Secondary(x, y, g, r, score)
  comment: matrix construction and initialization
  . . .
  comment: main loop
  for i ← 1 to length(x) do
    for j ← 1 to length(y) do
      M[i][j] ← max of:
        M[i − 1][j − 1] + score(x_{i−1}, y_{j−1})
        comment: check for restriction 2
        if x_{i−1} = r and y_{j−1} ≠ r and j ≠ length(y) then −∞ else M[i − 1][j] + g
        if y_{j−1} = r and x_{i−1} ≠ r and i ≠ length(x) then −∞ else M[i][j − 1] + g
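The main loop can be sketched in runnable form as follows. This is a reading of the pseudocode, not the LingPy implementation. The initialization (cumulative gap penalties), the default scores (match +1, mismatch -1, gap -1) and the default scoring function, which forbids matching the boundary marker r with any other segment, are assumptions; the function name `secondary_score` is hypothetical. The example sequences are those of the matrices on the slide that follows the algorithm.

```python
import math


def secondary_score(x, y, g=-1, r=".", score=None):
    """Global alignment score that respects secondary segments.

    r is the boundary marker; g is the gap penalty.
    """
    if score is None:
        # Assumed scoring: a boundary may only be matched with a boundary.
        def score(a, b):
            if (a == r) != (b == r):
                return -math.inf
            return 1 if a == b else -1

    m, n = len(x), len(y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + g
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + g

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = x[i - 1], y[j - 1]
            match = M[i - 1][j - 1] + score(a, b)
            # Restriction 2: a boundary in x may be matched with a gap only
            # if y is at a boundary as well, or if y is already exhausted.
            if a == r and b != r and j != n:
                gap_in_y = -math.inf
            else:
                gap_in_y = M[i - 1][j] + g
            # The same restriction with the roles of x and y swapped.
            if b == r and a != r and i != m:
                gap_in_x = -math.inf
            else:
                gap_in_x = M[i][j - 1] + g
            M[i][j] = max(match, gap_in_y, gap_in_x)
    return M[m][n]


print(secondary_score("AABCDE.E", "A.BC.DE"))  # -3
```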

Slide 46

Slide 46 text

A Secondary Alignment Algorithm
Scoring matrices for the sequences "AABCDE.E" (rows) and "A.BC.DE" (columns); traceback symbols omitted.

Matrix 1
       0    A    .    B    C    .    D    E
0      0   -1   -2   -3   -4   -5   -6   -7
A     -1    1    0   -1   -2   -3   -4   -5
A     -2    0    0   -1   -2   -3   -4   -5
B     -3   -1   -1    1    0   -1   -2   -3
C     -4   -2   -2    0    2    1    0   -1
D     -5   -3   -3   -1    1    1    2    1
E     -6   -4   -4   -2    0    0    1    3
.     -7   -5   -3   -3   -1    1    0    2
E     -8   -6   -4   -4   -2    0    0    1

Matrix 2
       0    A    .    B    C    .    D    E
0      0   -1   -2   -3   -4   -5   -6   -7
A     -1    1   -3   -3   -4   -6   -6   -7
A     -2    0   -4   -4   -4   -7   -7   -7
B     -3   -1   -5   -3   -4   -8   -8   -8
C     -4   -2   -6   -4   -2   -9   -9   -9
D     -5   -3   -7   -5   -3  -10   -8   -9
E     -6   -4   -8   -6   -4  -11   -9   -7
.     -7   -8   -3   -4   -5   -3   -4   -5
E     -8   -8   -4   -4   -5   -4   -4   -3

Slide 47

Slide 47 text

A Secondary Alignment Algorithm
• The extension for secondary alignment is independent of the underlying alignment mode.
• Global, semi-global, local, and diagonal alignment analyses that are sensitive to secondary sequence structures can be carried out.
• The only additional requirement compared with the traditional alignment algorithms is the boundary character, which has to be specified by the user.

Slide 48

Slide 48 text

Phonetic Alignment
hjärta   h j - ä r t a -
herz     h - e - r z - -
heart    h - e a r t - -
cordis   c - - o r d i s

Slide 49

Slide 49 text

Sound-Class-Based Phonetic Alignment (SCA)
• SCA (List 2012) is a new method for pairwise and multiple phonetic alignment, implemented as part of LingPy (http://lingulist.de/lingpy), a Python library for quantitative tasks in historical linguistics.
• SCA is based on a novel framework for phonetic alignment that combines the most recent developments in computational biology with new approaches to sequence modelling in historical linguistics and dialectology.
• According to the new framework for sequence modelling, sound sequences are internally represented in different layers which relate to important paradigmatic and syntagmatic aspects of linguistic sequences.

Slide 50

Slide 50 text

Paradigmatic Aspects: Sound Classes
Sounds which often occur in correspondence relations in genetically related languages can be clustered into classes (types). It is assumed “that phonetic correspondences inside a ‘type’ are more regular than those between different ‘types’” (Dolgopolsky 1986: 35).
Sounds:  k g p b ʧ ʤ f v t d ʃ ʒ θ ð s z
Classes: K T P S
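A sound-class conversion can be sketched as a simple lookup. The assignment of the sixteen sounds to the four classes K, P, T, and S below is an assumption made for illustration (velars and affricates, labials, dentals, sibilants); the slide itself only lists the sounds and the class labels. The names `SOUND_CLASSES` and `to_sound_classes` are hypothetical and not part of LingPy.

```python
# Assumed grouping of the sounds shown on the slide into four classes.
SOUND_CLASSES = {
    "K": ["k", "g", "ʧ", "ʤ"],
    "P": ["p", "b", "f", "v"],
    "T": ["t", "d", "θ", "ð"],
    "S": ["s", "z", "ʃ", "ʒ"],
}

# Invert the grouping: one class label per sound.
SOUND_TO_CLASS = {
    sound: label for label, sounds in SOUND_CLASSES.items() for sound in sounds
}


def to_sound_classes(segments, unknown="?"):
    """Replace every segment by its class label (or `unknown`)."""
    return [SOUND_TO_CLASS.get(segment, unknown) for segment in segments]


print(to_sound_classes(["t", "ʃ", "k", "a"]))  # ['T', 'S', 'K', '?']
```

Aligning class strings instead of raw sounds means that two different sounds from the same class count as a match.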

Slide 55

Slide 55 text

Paradigmatic Aspects: Scoring Functions for Sound Classes
• LingPy offers default scoring functions for three standard sound-class models (ASJP, SCA, DOLGO).
• The standard models differ in how coarsely the continuum of sounds is split into discrete classes.
• The scoring functions are based on empirical data on sound correspondence frequencies (ASJP model, Brown et al. 2011), and on general theoretical models of the directionality and probability of sound change processes that are converted into non-directional similarity matrices (SCA, DOLGO, see List 2012 for details).

Slide 56

Slide 56 text

Syntagmatic Aspects: Prosodic Strings
• Sound change occurs more frequently in prosodically weak positions of phonetic sequences (Geisler 1992).
• Given the sonority profile of a phonetic sequence, one can distinguish positions that differ regarding their prosodic context.
• Prosodic context can be modelled by representing a sequence by a prosodic string, indicating the different prosodic contexts of each segment.
• Based on the relative strength of all sites in a phonetic sequence, substitution scores and gap penalties can be modified when carrying out alignment analyses.
• Prosodic strings are an alternative to n-gram approaches, since they also handle context, their specific advantage being that they are more abstract and less data-dependent.

Slide 57

Slide 57 text

Syntagmatic Aspects: Prosodic Strings
phonetic sequence   j    a    b    ə    l    k    a
sonority contour    ↑    △    ↑    △    ↓    ↑    △
SCA model           J    A    P    E    L    K    A
ASJP model          y    a    b    I    l    k    a
DOLGO model         J    V    P    V    R    K    V
sonority profile    6    7    1    7    5    1    7
prosodic string     #    v    C    v    c    C    >
relative weight     2.0  1.5  1.5  1.3  1.1  1.5  0.7
Legend: ↑ ascending, △ maximum, ↓ descending (positions range from strong to weak)
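The step from the sonority profile to the contour of ascending, maximum, and descending positions can be sketched as follows. The decision rule is a simplification assumed for illustration and the name `sonority_contour` is hypothetical; the derivation of the prosodic string and of the relative weights in SCA is not reproduced here.

```python
def sonority_contour(profile):
    """Label each position as ascending, maximum, or descending.

    Assumed rule: a position is ascending if the next value is higher,
    a maximum if it is not followed by a higher value and is higher than
    the preceding value (or is the first position), and descending otherwise.
    """
    contour = []
    for i, current in enumerate(profile):
        previous = profile[i - 1] if i > 0 else None
        following = profile[i + 1] if i + 1 < len(profile) else None
        if following is not None and following > current:
            contour.append("↑")
        elif previous is None or previous < current:
            contour.append("△")
        else:
            contour.append("↓")
    return contour


# Sonority profile of "jabəlka" as given in the table.
print(" ".join(sonority_contour([6, 7, 1, 7, 5, 1, 7])))  # ↑ △ ↑ △ ↓ ↑ △
```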

Slide 62

Slide 62 text

Syntagmatic Aspects: Secondary Alignment
• While secondary alignment was never an issue in computational biology, it is a desideratum in historical linguistics and dialectology.
• Secondary structures are especially important when (1) aligning whole sentences, where the alignment of one word in one sentence with two words in the other sentence should be avoided, (2) aligning language data for which morphological information is also available, or (3) aligning words from Southeast Asian tone languages, which generally show a structure in which one syllable corresponds to one morpheme.

Slide 63

Slide 63 text

Syntagmatic Aspects: Secondary Alignment
Primary Alignment
Haikou    z   i   -    t    -    ³
Beijing   ʐ   ʅ   ⁵¹   tʰ   ou   ¹
Secondary Alignment
Haikou    z   i   t   ³    -    -    -
Beijing   ʐ   ʅ   -   ⁵¹   tʰ   ou   ¹

Slide 64

Slide 64 text

Evaluation
v o l - d e m o r t
v - l a d i m i r -
v a l - d e m a r -

Slide 65

Slide 65 text

Evaluation Measures
• PAS: Perfect Alignment Score
• CS: Column Score
• SPS: Sum-of-Pairs Score

Slide 66

Slide 66 text

Evaluation Measures
Column Score (CS): $CS = 100 \cdot \frac{2 \cdot |C_t \cap C_r|}{|C_r| + |C_t|}$, where $C_t$ is the set of columns in the test alignment and $C_r$ is the set of columns in the reference alignment (Rosenberg and Ogden 2009).
Sum-of-Pairs Score (SPS): $SPS = 100 \cdot \frac{2 \cdot |P_t \cap P_r|}{|P_r| + |P_t|}$, where $P_t$ is the set of all aligned residue pairs in the test alignment and $P_r$ is the set of all aligned residue pairs in the reference alignment (ibid.).
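Both measures can be sketched for pairwise alignments as follows. The representation is an assumption made for illustration: an alignment is a pair of equally long gapped rows, a column is identified by the indices of the residues it contains (None for a gap), and a residue pair is a column without gaps. The function names are hypothetical.

```python
def columns(row_s, row_t, gap="-"):
    """Describe each column by the residue indices it aligns (None = gap)."""
    i = j = 0
    result = []
    for a, b in zip(row_s, row_t):
        index_s = index_t = None
        if a != gap:
            index_s, i = i, i + 1
        if b != gap:
            index_t, j = j, j + 1
        result.append((index_s, index_t))
    return result


def column_score(test, reference):
    """CS = 100 * 2 * |Ct & Cr| / (|Cr| + |Ct|)."""
    c_t, c_r = set(columns(*test)), set(columns(*reference))
    return 100 * 2 * len(c_t & c_r) / (len(c_r) + len(c_t))


def sum_of_pairs_score(test, reference):
    """SPS = 100 * 2 * |Pt & Pr| / (|Pr| + |Pt|); needs at least one pair."""
    p_t = {c for c in columns(*test) if None not in c}
    p_r = {c for c in columns(*reference) if None not in c}
    return 100 * 2 * len(p_t & p_r) / (len(p_r) + len(p_t))


reference = ("abc", "a-c")
test = ("abc", "ac-")
print(round(column_score(test, reference), 2))  # 33.33
print(sum_of_pairs_score(test, reference))      # 50.0
```

In the example, only the first column is shared, so one of three columns and one of two residue pairs agree with the reference.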

Slide 67

Slide 67 text

Gold Standard
• 1 089 manually aligned sequence pairs.
• Words taken from the Bai dialects (Wang 2006, Allen 2007) and Chinese dialects (Hou 2004).
• Both Bai and Chinese are tone languages.
• All data is available under http://lingulist.de/supp/secondary.zip

Slide 68

Slide 68 text

Results
Score   Primary   Secondary
PAS     83.47     88.89
CS      88.54     92.70
SPS     92.78     95.52

Slide 69

Slide 69 text

Concluding Remarks
• As the results show, the modified algorithm, which is sensitive to secondary sequence structures, clearly improves on the traditional algorithm, which aligns sequences only with respect to their primary structure: all three scores are higher (PAS 88.89 vs. 83.47, CS 92.70 vs. 88.54, SPS 95.52 vs. 92.78).
• The improvement is significant with p < 0.01 using the Wilcoxon signed-rank test as suggested by Notredame (2000).
• The algorithm for secondary alignment proves very useful for the alignment of tonal languages, yet it may also be employed for the analysis of other kinds of sequential data and, e.g., help to carry out phonetic alignment analyses of whole sentences.

Slide 70

Slide 70 text

*deh3 - ?
What’s next?

Slide 71

Slide 71 text

Special thanks to:
• The German Federal Ministry of Education and Research (BMBF) for funding our research project.
• Hans Geisler for his helpful, critical, and inspiring support.

Slide 72

Slide 72 text

THANK YOU FOR LISTENING!