Improving phonetic alignment by handling secondary sequence structures

. . . . . . . Improving Phonetic Alignment
by Handling Secondary Sequence Structures Johann-Mattis List∗ ∗Institute for Romance Languages and Literature Heinrich Heine University Düsseldorf 2012/08/10 1 / 40

Structure of the Talk . . . 1 Historical Linguistics
Keys to the Past Comparative Method Sound Correspondences . . . 2 Sequence Comparison Sequences Alignment Analyses Alignment Modes . . . 3 Secondary Alignment Secondary Sequence Structures Secondary Alignment Problem Secondary Alignment Algorithm . . . 4 Phonetic Alignment SCA Paradigmatic Aspects Syntagmatic Aspects . . . 5 Evaluation Evaluation Measures Gold Standard Results 2 / 40

Historical Linguistics Historical Linguistics 3 / 40

Historical Linguistics Keys to the Past Charles Lyell on Languages
4 / 40

The Geological Evidences of The Antiquity of Man with Remarks on Theories of The Origin of Species by Variation By Sir Charles Lyell London John Murray, Albemarle Street 1863 4 / 40

If we new not- hing of the existence of Latin, - if all historical documents previous to the fin- teenth century had been lost, - if tra- dition even was si- lent as to the former existance of a Ro- man empire, a me- re comparison of the Italian, Spanish, Portuguese, French, Wallachian, and Rhaetian dialects would enable us to say that at some time there must have been a language, from which these six modern dialects derive their origin in common. 4 / 40

Historical Linguistics Keys to the Past Historical Scenarios German ʦ
aː n - * Proto-Germanic t a n d English t ʊː θ - ** Proto-Indo-European d o n t Italian d ɛ n t e * Proto-Romance d e n t French d ɑ̃ - - 5 / 40

aː n - - * Proto-Germanic t a n d English t ʊː - θ - ** Proto-Indo-European d o n t Italian d ɛ n t e * Proto-Romance d e n t French d ɑ̃ - - - 5 / 40

aː n - - Proto-Germanic t a n θ - English t ʊː - θ - ** Proto-Indo-European d o n t Italian d ɛ n t e Proto-Romance d e n t e French d ɑ̃ - - - 5 / 40

aː n - Proto-Germanic t a n θ - English t ʊː - θ ** Proto-Indo-European d o n t Italian d ɛ n t e Proto-Romance d e n t e French d ɑ̃ - - 5 / 40

aː n - Proto-Germanic t a n θ - English t ʊː - θ Proto-Indo-European d e n t - Italian d ɛ n t ə Proto-Romance d e n t e French d ɑ̃ - - 5 / 40

aː n - * Proto-Germanic t a n d English t ʊː - θ Proto-Indo-European d e n t Italian d ɛ n t ə * Proto-Romance d e n t French d ɑ̃ - - 5 / 40

aː n Proto-Germanic t a n θ English t ʊː θ Proto-Indo-European d e n t Italian d ɛ n t e Proto-Romance d e n t e French d ɑ̃ German ʦ aː n Proto-Germanic t a n θ English t ʊː θ Proto-Indo-European d e n t Italian d ɛ n t e Proto-Romance d e n t e French d ɑ̃ 15 / 40

Historical Linguistics Comparative Method The Comparative Method Compile an initial
list of putative cognate sets. Extract an initial list of putative sets of sound correspondences from the initial cognate list. Refine the cognate list and the correspondence list by adding and deleting cognate sets from the cognate list, depending on whether they are consistent with the correspondence list or not, and adding and deleting correspondence sets from the correspondence list, depending on whether they are consistent with the cognate list or not. Finish when the results are satisfying enough. 6 / 40

Historical Linguistics Sound Correspondences Sound Correspondences Sequence similarity is determined
on the basis of systematic sound correspondences as opposed to similarity based on surface resemblances of phonetic segments. Lass (1997) calls this notion of similarity phenotypic as opposed to a genotypic notion of similarity. The most crucial aspect of correspondence-based similarity is that it is language-specific: Genotypic similarity is never defined in general terms but always with respect to the language systems which are being compared. bla German [ʦaːn] “tooth” Dutch tand [tɑnt] English [tʊːθ] “tooth” German [ʦeːn] “ten” Dutch tien [tiːn] English [tɛn] “ten” German [ʦʊŋə] “tongue” Dutch tong [tɔŋ] English [tʌŋ] “tongue” 7 / 40

on the basis of systematic sound correspondences as opposed to similarity based on surface resemblances of phonetic segments. Lass (1997) calls this notion of similarity phenotypic as opposed to a genotypic notion of similarity. The most crucial aspect of correspondence-based similarity is that it is language-specific: Genotypic similarity is never defined in general terms but always with respect to the language systems which are being compared. Meaning German Dutch English “tooth” Zahn [ ʦ aːn] tand [ t ɑnt] tooth [ t ʊːθ] “ten” zehn [ ʦ eːn] tien [ t iːn] ten [ t ɛn] “tongue” Zunge [ ʦ ʊŋə] tong [ t ɔŋ] tongue [ t ʌŋ] 7 / 40

on the basis of systematic sound correspondences as opposed to similarity based on surface resemblances of phonetic segments. Lass (1997) calls this notion of similarity phenotypic as opposed to a genotypic notion of similarity. The most crucial aspect of correspondence-based similarity is that it is language-specific: Genotypic similarity is never defined in general terms but always with respect to the language systems which are being compared. Meaning Shanghai Beijing Guangzhou “nine” [ ʨ iɤ³⁵] Beijing [ ʨ iou²¹⁴] [ k ɐu³⁵] “today” [ ʨ iŋ⁵⁵ʦɔ²¹] Beijing [ ʨ iɚ⁵⁵] [ k ɐm⁵³jɐt²] “rooster” [koŋ⁵⁵ ʨ i²¹] Beijing[kuŋ⁵⁵ ʨ i⁵⁵] [ k ɐi⁵⁵koŋ⁵⁵] 7 / 40

Sequence Comparison Sequence C om parison 8 / 40

Sequence Comparison Sequences Sequences Definition 1 Given an alphabet (a
non-empty finite set, whose elements are called characters), a sequence is an ordered list of characters drawn from the alphabet. The elements of sequences are called segments. (cf. Böckenbauer & Bongartz 2003: 30f) 9 / 40

Sequence Comparison Sequences Sequences 10 / 40

Sequence Comparison Sequences Sequences 4 3 11 / 40

Sequence Comparison Sequences Sequences 1 1 1 1 11 /
40

Sequence Comparison Sequences Sequences 1 Baked Rabbit 1 rabbit 1
1/2 tsp. salt 1 1/8 1/8 tsp. pepper 1 1/2 c. onion slices • Rub salt and pepper on rabbit pieces. • Place on large sheet of aluminium foil. • Place onion slices on rabbit. • Bake at 350 degrees. • Eat when done and tender. 11 / 40

Sequence Comparison Alignment Analyses Alignment Analyses Definition 2 An alignment
of two sequences s and t is a two-row matrix in which both sequences are aranged in such a way that all matching and mismatching segments occur in the same column, while empty cells, resulting from empty matches, are filled with gap symbols. (cf. Kruskal 1983) 12 / 40

Sequence Comparison Alignment Analyses Alignment Analyses 0 H H H
H H 0 0 H H H H 0 13 / 40

Sequence Comparison Alignment Analyses Alignment Analyses 0 H H H
H H 0 0 H H H H H 0 13 / 40

Sequence Comparison Alignment Modes Global Alignment Global alignment analyses are
the most basic way to compare sequences. The traditional Needleman-Wunsch algorithm (Needleman and Wunsch 1971) conducts global alignment analyses, and the Levenshtein distance (edit distance, Levenshtein 1965) is defined for global alignments. 14 / 40

Sequence Comparison Alignment Modes Global Alignment Global alignment analyses are
the most basic way to compare sequences. The traditional Needleman-Wunsch algorithm (Needleman and Wunsch 1971) conducts global alignment analyses, and the Levenshtein distance (edit distance, Levenshtein 1965) is defined for global alignments. Mode Alignment global G R E E N C A T F I S H H U N T E R A F A T C A T - - - - H U N T E R 14 / 40

Sequence Comparison Alignment Modes Semi-Global Alignment Semi-global alignment analyses do
not necessarily compare two sequences as a whole but allow prefixes and suffixes to be ignored in an alignment analysis, if these would otherwise increase the cost of the optimal alignment. Computationally, this is done by setting the costs for gaps inserted in the begin and at the end of an alignment to zero. 15 / 40

Sequence Comparison Alignment Modes Semi-Global Alignment Semi-global alignment analyses do
not necessarily compare two sequences as a whole but allow prefixes and suffixes to be ignored in an alignment analysis, if these would otherwise increase the cost of the optimal alignment. Computationally, this is done by setting the costs for gaps inserted in the begin and at the end of an alignment to zero. Mode Alignment global G R E E N C A T F I S H H U N T E R A F A T C A T - - - - H U N T E R semi-global G R E E N - C A T F I S H H U N T E R - - - - - A F A T C A T H U N T E R 15 / 40

Sequence Comparison Alignment Modes Local Alignment While semi-global alignment analyses
allow prefixes and suffixes to be ignored only if one sequence contains a prefix or suffix while the other does not, local alignment analyses (Smith-Waterman algorithm, Smith and Waterman 1981) only align the best scoring subsequences of two sequences, while leaving the rest of the sequences completely unaligned. Computationally, this is done by prohibiting that the cost of an alignment analysis goes beyond zero. 16 / 40

Sequence Comparison Alignment Modes Local Alignment While semi-global alignment analyses
allow prefixes and suffixes to be ignored only if one sequence contains a prefix or suffix while the other does not, local alignment analyses (Smith-Waterman algorithm, Smith and Waterman 1981) only align the best scoring subsequences of two sequences, while leaving the rest of the sequences completely unaligned. Computationally, this is done by prohibiting that the cost of an alignment analysis goes beyond zero. Mode Alignment global G R E E N C A T F I S H H U N T E R A F A T C A T - - - - H U N T E R semi-global G R E E N - C A T F I S H H U N T E R - - - - - A F A T C A T H U N T E R local GREEN CATFISH H U N T E R A FAT CAT H U N T E R 16 / 40

Sequence Comparison Alignment Modes Diagonal Alignment While local alignment analyses
leave unalignable parts of sequences unaligned, diagonal alignment analyses (DI- ALIGN algorith, Morgenstern 1996) align sequences glob- ally, but search for local similarities at the same time. Local similarities are defined as “diagonals”, i.e. ungapped alignments. Diagonal alignment analyses maximize the score of diagonals in an alignment. 17 / 40

Sequence Comparison Alignment Modes Diagonal Alignment While local alignment analyses
leave unalignable parts of sequences unaligned, diagonal alignment analyses (DI- ALIGN algorith, Morgenstern 1996) align sequences glob- ally, but search for local similarities at the same time. Local similarities are defined as “diagonals”, i.e. ungapped alignments. Diagonal alignment analyses maximize the score of diagonals in an alignment. Mode Alignment global G R E E N C A T F I S H H U N T E R A F A T C A T - - - - H U N T E R semi-global G R E E N - C A T F I S H H U N T E R - - - - - A F A T C A T H U N T E R local GREEN CATFISH H U N T E R A FAT CAT H U N T E R diagonal - - - - - G R E E N C A T F I S H H U N T E R A F A T - - - - - C A T - - - - H U N T E R 17 / 40

Secondary Alignment secondarysequencestructures secondary sequence structures se co nda ry
se que nce stru ctu re s se con da ry se quence struc tures s e c o n d a r y s e q u e n c e s t r u c t u r e s S E C O N D A R Y S E Q U E N C E S T R U C T U R E sec ond ary seq uen ces tru ctu res seco ndar yseq uenc estr ctur es Secondary Alignm ent 18 / 40

Secondary Alignment Secondary Sequence Structures Secondary Sequence Structures Apart from
a primary structure, sequences can also have a secondary structure. Primary structure refers to the order of segments. Secondary structure refers to the order of secondary segments, i.e. segments that result from the group- ing of primary segments into higher units. 19 / 40

a primary structure, sequences can also have a secondary structure. Primary structure refers to the order of segments. Secondary structure refers to the order of secondary segments, i.e. segments that result from the group- ing of primary segments into higher units. "ABCEFGIJK" → "ABC.EFG.IJK" 19 / 40

a primary structure, sequences can also have a secondary structure. Primary structure refers to the order of segments. Secondary structure refers to the order of secondary segments, i.e. segments that result from the group- ing of primary segments into higher units. "ABCEFGIJK" → "ABC.EFG.IJK" "THECATFISHHUNTER" → "THE.CATFISH.HUNTER" 19 / 40

a primary structure, sequences can also have a secondary structure. Primary structure refers to the order of segments. Secondary structure refers to the order of secondary segments, i.e. segments that result from the group- ing of primary segments into higher units. "ABCEFGIJK" → "ABC.EFG.IJK" "THECATFISHHUNTER" → "THE.CATFISH.HUNTER" "KARAOKE" → "KA.RA.O.KE" 19 / 40

Secondary Alignment Secondary Alignment Problem The Secondary Alignment Problem Secondary
Alignment Problem Given two sequences s and t of length m and n which have the primary structures s1 , ..., sm and t1 , ..., tn , and the secondary structures s0→i , ..., sj→m and t0→k , ..., tl→n , find an alignment of maximal score in which segments belonging to the same secondary segment in s only correspond to segments belonging to the same secondary segment in t, and vice versa. 20 / 40

Secondary Alignment Secondary Alignment Problem The Secondary Alignment Problem Mode
Alignment global T H E - C A T - F I S H - H U N T S T H E - C A T - F I S H - E - - - S semiglobal T H E - C A T - F I S H - - - H U N T S T H E - C A T - F I S H E S - - - - - - local T H E - C A T - F I S H HUNTS T H E - C A T - F I S H ES diagonal T H E - C A T - F I S H - - H U N T S T H E - C A T - F I S H E - - - - - S secondary T H E - C A T F I S H - H U N T - S T H E - C A T - - - - - F I S H E S 21 / 40

Secondary Alignment Secondary Alignment Algorithm A Secondary Alignment Algorithm Algorithm
1: Secondary(x, y, g, r, score) comment: matrix construction and initialization . . . comment: main loop for i ← 1 to length(x) do                                    for j ← 1 to length(y) do                              M[i][j] ← max                              M[i − 1][j − 1] + score(xi−1 , yj−1 ) comment: check for restriction 2 if xi−1 = r and yj−1 = r and j = length(y) then − ∞) else M[i − 1][j] + g if yj−1 = r and xi−1 = r and i = length(x) then − ∞) else M[i][j − 1] + g 22 / 40

Secondary Alignment Secondary Alignment Algorithm A Secondary Alignment Algorithm 1
0 0 0 0 A . B C . D E 0 0 0 0 0 0 0 0 0 -1 - A 0 -2 - . 0 -3 - B 0 -4 - C 0 -5 - . 0 -6 - D 0 -7 - E A A -1 - 0 A 1 0 A A 0 - . A -1 - B A -2 - C A -3 - . A -4 - D A -5 - E A A -2 - 0 A 0 - A A 0 0 . A -1 - B A -2 - C A -3 - . A -4 - D A -5 - E B B -3 - 0 B -1 - A B -1 - . B 1 0 B B 0 - C B -1 - . B -2 - D B -3 - E C C -4 - 0 C -2 - A C -2 - . C 0 - B C 2 0 C C 1 - . C 0 - D C -1 - E D D -5 - 0 D -3 - A D -3 - . D -1 - B D 1 - C D 1 0 . D 2 0 D D 1 - E E E -6 - 0 E -4 - A E -4 - . E -2 - B E 0 - C E 0 - . E 1 - D E 3 0 E . . -7 - 0 . -5 - A . -3 0 . . -3 - B . -1 - C . 1 0 . . 0 - D . 2 - E E E -8 - 0 E -6 - A E -4 - . E -4 - B E -2 - C E 0 - . E 0 0 D E 1 - E 2 0 0 0 0 A . B C . D E 0 0 0 0 0 0 0 0 0 -1 - A 0 -2 - . 0 -3 - B 0 -4 - C 0 -5 - . 0 -6 - D 0 -7 - E A A -1 - 0 A 1 0 A A -3 - . A -3 0 B A -4 - C A -6 - . A -6 0 D A -7 - E A A -2 - 0 A 0 - A A -4 - . A -4 - B A -4 0 C A -7 - . A -7 - D A -7 0 E B B -3 - 0 B -1 - A B -5 - . B -3 0 B B -4 - C B -8 - . B -8 - D B -8 - E C C -4 - 0 C -2 - A C -6 - . C -4 - B C -2 0 C C -9 - . C -9 - D C -9 - E D D -5 - 0 D -3 - A D -7 - . D -5 - B D -3 - C D -10 - . D -8 0 D D -9 - E E E -6 - 0 E -4 - A E -8 - . E -6 - B E -4 - C E -11 - . E -9 - D E -7 0 E . . -7 - 0 . -8 - A . -3 0 . . -4 - B . -5 - C . -3 0 . . -4 - D . -5 - E E E -8 - 0 E -8 0 A E -4 - . E -4 0 B E -5 - C E -4 - . E -4 0 D E -3 0 E 23 / 40

Secondary Alignment Secondary Alignment Algorithm A Secondary Alignment Algorithm The
extension for secondary alignment is independent of the underlying alignment mode. Global, semi-global, local, and diagonal alignment analyses that are sensitive for secondary sequence structures can be carried out. The only requirement of the algorithm in contrast to the traditional alignment algorithms is the boundary character which has to be specified by the user. 24 / 40

Phonetic Alignment h j - ä r t a -
h - e - r z - - h - e a r t - - c - - o r d i s hjärta herz heart cordis Phonetic Alignment 25 / 40

Phonetic Alignment SCA Sound-Class-Based Phonetic Alignment (SCA) SCA (List 2012)
is a new method for pairwise and multiple phonetic alignment, implemented as part of LingPy (http://lingulist.de/lingpy), a Python library for quantitative tasks in historical linguistics. SCA is based on a novel framework for phonetic alignment that combines both the most recent developments in computational biology with new approaches to sequence modelling in historical linguistics and dialectology. According to the new framework for sequence modelling, sound sequences are internally represented in different layers which relate to both important paradigmatic and syntagmatic aspects of linguistic sequences. 26 / 40

Phonetic Alignment Paradigmatic Aspects Sound Classes . Sound Classes .
. . . . . . . Sounds which often occur in correspondence relations in genetically related languages can be clustered into classes (types). It is assumed “that phonetic correspondences inside a‘type’ are more regular than those between different‘types’” (Dolgopolsky 1986: 35). 27 / 40

. . . . . . . Sounds which often occur in correspondence relations in genetically related languages can be clustered into classes (types). It is assumed “that phonetic correspondences inside a‘type’ are more regular than those between different‘types’” (Dolgopolsky 1986: 35). k g p b ʧ ʤ f v t d ʃ ʒ θ ð s z 1 27 / 40

. . . . . . . Sounds which often occur in correspondence relations in genetically related languages can be clustered into classes (types). It is assumed “that phonetic correspondences inside a‘type’ are more regular than those between different‘types’” (Dolgopolsky 1986: 35). K T P S 1 27 / 40

Phonetic Alignment Paradigmatic Aspects Scoring Functions for Sound Classes LingPy
offers default scoring functions for three standard sound-class models (ASJP, SCA, DOLGO). The standard models vary regarding the roughness by which the continuum of sounds is split into discrete classes. The scoring functions are based on empirical data on sound correspondence frequencies (ASJP model, Brown et al. 2011), and on general theoretical models of the directionality and probability of sound change processes that are converted into non-directional similarity matrices (SCA, DOLGO, see List 2012 for details). 28 / 40

Phonetic Alignment Syntagmatic Aspects Prosodic Strings Sound change occurs more
frequently in prosodically weak positions of phonetic sequences (Geisler 1992). Given the sonority profile of a phonetic sequence, one can distinguish positions that differ regarding their prosodic context. Prosodic context can be modelled by representing a sequence by a prosodic string, indicating the different prosodic contexts of each segment. Based on the relative strength of all sites in a phonetic sequence, substitution scores and gap penalties can be modified when carrying out alignment analyses. Prosodic strings are an alternative to n-gram approaches, since they also handle context, their specific advantage being that they are more abstract and less data-dependent. 29 / 40

Phonetic Alignment Syntagmatic Aspects Prosodic Strings j a b ə
l k a 30 / 40

l k a sonority increases 30 / 40

l k a ↑ △ ↑ △ ↓ ↑ △ ↑ ascending △ maximum ↓ descending 30 / 40

l k a ↑ △ ↑ △ ↓ ↑ △ o strong weak 30 / 40

Phonetic Alignment Syntagmatic Aspects Prosodic Strings phonetic sequence j a
b ə l k a SCA model J A P E L K A ASJP model y a b I l k a DOLGO model J V P V R K V sonority profile 6 7 1 7 5 1 7 prosodic string # v C v c C > Relative Weight 2.0 1.5 1.5 1.3 1.1 1.5 0.7 30 / 40

Phonetic Alignment Syntagmatic Aspects Secondary Alignment While secondary alignment was
never an issue in computational biology, it is a desideratum in historical linguistics and dialectology. Secondary structures are especially important when (1) aligning whole sentences, where the alignment of one word from one with two words from another sentence should be avoided, (2) aligning language data for which morphological information is also available, or (3) when aligning words from South-East-Asian tone languages which generally show a structure in which one syllable corresponds to one morpheme. 31 / 40

Phonetic Alignment Syntagmatic Aspects Secondary Alignment Primary Alignment Haikou z
i - t - ³ Beijing ʐ ʅ ⁵¹ tʰ ou ¹ Secondary Alignment Haikou z i t ³ - - - Beijing ʐ ʅ - ⁵¹ tʰ ou ¹ 32 / 40

Evaluation * * * * * * * * *
* * * * v o l - d e m o r t v - l a d i m i r - v a l - d e m a r - Evaluation 33 / 40

Evaluation Evaluation Measures Evaluation Measures PAS: Perfect Alignment Score CS:
Column Score SPS: Sum-of-Pairs Score 34 / 40

Evaluation Evaluation Measures Evaluation Measures Column-Score (CS) CS = 100
· 2 · |Ct∩Cr| |Cr|+|Ct| , where Ct is the set of columns in the test alignment and Cr is the set of columns in the reference alignment (Rosenberg and Ogden 2009). Sum-of-Pairs Score (SPS) SPS = 100 · 2 · |Pt∩Pr| |Pr|+|Pt| , where Pt is the set of all aligned residue pairs in the test alignment and Pr is the set of all aligned residue pairs in the reference alignment (ibd.). 35 / 40

Evaluation Gold Standard Gold Standard 1 089 manually aligned sequence
pairs. Words taken from the Bai dialects (Wang 2006, Allen 2007) and Chinese dialects (Hou 2004). Both Bai and Chinese are tone languages. All data is available under http://lingulist.de/supp/secondary.zip 36 / 40

Evaluation Results Results Score Primary Secondary PAS 83.47 88.89 CS
88.54 92.70 SPS 92.78 95.52 37 / 40

Concluding Remarks As can be seen from the results, the
modified algorithm which is sensitive to secondary sequence structures shows a great improvement compared to the traditional algorithm which aligns sequences only with respect to their primary structure. The improvement is significant with p < 0.01 using the Wilcoxon signed rank test as suggested by Notredame (2000). The algorithm for secondary alignment proves very useful for the alignment of tonal languages, yet it may also be employed for the analysis of other kinds of sequential data and, e.g., help to carry out phonetic alignment analyses of whole sentences. 38 / 40

*deh3 - ? What’s next? 39 / 40

Special thanks to: • The German Federal Mi- nistry of
Education and Research (BMBF) for funding our research project. • Hans Geisler for his hel- pful, critical, and ins- piring support. 40 / 40

THANK YOU 1 FOR LISTENING! 40 / 40

Improving phonetic alignment by handling secon...

Improving phonetic alignment by handling secondary sequence structures

More Decks by Johann-Mattis List

Other Decks in Science

Featured

Transcript