Handling word formation in historical-comparative linguistics

Annotation and Analysis
Johann-Mattis List, DFG Research Fellow
Centre des recherches linguistiques sur l’Asie Orientale, Team Adaptation, Integration, Reticulation, Evolution
EHESS and UPMC, Paris, 2016/12/01

Preliminaries

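The Word Formation Problem

Frucht, ferner fruchten, befruchten, Befruchtung, fruchtbar, fruchtig. Frucht f. ‘der Fortpflanzung der eigenen Art dienendes Produkt einer Pflanze’, auch ‘ungeborenes Lebewesen’, übertragen ‘Ertrag’, ahd. fruht (9. Jh.), mhd. vruht, asächs. fruht, mnd. mnl. nl. vrucht beruhen auf einer frühen Entlehnung von gleichbed. lat. frūctus, abgeleitet vom Verb lat. fruī (frūctus sum) ‘genießen, Nutzen ziehen’ (verwandt mit brauchen, s. d.). Das Deminutiv Früchtchen hat die spezielle Bedeutung [...]
(German "Frucht" in Pfeifer 1993, also at http://dwds.de)

English rendering: Frucht, further fruchten, befruchten, Befruchtung, fruchtbar, fruchtig. Frucht (fem.) ‘product of a plant that serves the reproduction of its own kind’, also ‘unborn living being’, figuratively ‘yield’; Old High German fruht (9th c.), Middle High German vruht, Old Saxon fruht, Middle Low German, Middle Dutch, Dutch vrucht go back to an early borrowing of Latin frūctus with the same meaning, derived from the Latin verb fruī (frūctus sum) ‘to enjoy, to draw benefit’ (related to brauchen, q.v.). The diminutive Früchtchen has the special meaning [...]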


[Figure: network of the words related to German Frucht, with three edge types: inherited from, borrowed from, derived from. Nodes: PIE *bhreu̯Hg̑- “to use”; PIE *bhruHg̑-ié- “to use” (present tense); PGM *ƀrūkan- “to use”; OHG brūhhan “to use”; G brauchen “to use”; G Brauch “custom”; OHG fruht “profit, fruit”; G frugal “modest (food)”; Fr fruit “profit, fruit”; Fr frugal “modest (food)”; Lt fruor, fruī “I enjoy”; Lt frūctus “profit”; Lt frux “fruit, grain”; Lt frugalis “bring profit”. Adapted from an illustration by Hans Geisler (University Düsseldorf).]

While etymological dictionaries provide very detailed scenarios of processes in language change, including inheritance, contact, and word formation, they present this knowledge in a prose form that resists quantification and is very difficult to follow, especially for those who are not experts in the given language family.

[Figure: words for ‘sun’ arranged as a tree, built up over several slides. Indo-European: ˈsoh₂-wl̩-, sh₂uˈen- SUN. Germanic: soːwel-, sunːoː- SUN. Romance: soːl- SUN, soːlikul- SMALL SUN. French: solej SUN. Spanish: sol SUN. German: zɔnə SUN. Swedish: suːl SUN. Edge labels in the final version: MORPHOLOGICAL CHANGE (four edges), SEMANTIC SHIFT (one edge).]

Computational phylogenetic approaches, on the other hand, usually ignore word formation, which makes it difficult to study language change in all its complexity.

Reconciling Etymology with Computers

Can we find a way to increase the consistency of classical etymological accounts and the complexity of computational approaches, so that word formation processes in comparative-historical linguistics are handled more reliably and more transparently?

Annotation and Analysis

Annotation: The first step in linguistic reconstruction, during which we assemble our evidence and identify related words and morphemes within and across languages.
Analysis: The second, crucial step of linguistic reconstruction, during which we present our hypotheses in the form of historical scenarios that explain how the relations we annotated evolved into their current shape.

[Figure: the words for ‘sun’ to be annotated: French solej, Spanish sol, German zɔnə, Swedish suːl.]

Annotation

Descriptive Annotation: The first step of annotation, in which we only state which relations hold between the entities we analyse and do not make specific assumptions about the historical processes from which they result.
Etymological Annotation: An interpretive step of annotation, in which certain parts of the reconstruction are already carried out and certain basic processes are postulated.

Analysis

Linguistic Reconstruction: Postulating a classical proto-form which is supposed to have fulfilled a certain function in the proto-language.
Etymological Analysis: A complete account of a given proto-form, providing the best explanations available for the development of the form in the descendant languages (including idiosyncratic processes).

Data Representation

Multilingual Word List: A list of words from various languages, in which each row corresponds to one word form in one language, marked by a unique identifier in the first column, with additional information regarding the word (meaning, pronunciation, cognacy, etc.) placed in additional columns.
Word List Format: Given the very abstract and lightweight character of this word list structure, it can be stored in all computer formats which allow for this kind of data, such as CSV (comma-separated values, pure text file) or more specific spreadsheet formats (Excel, LibreOffice, GoogleSheets, etc.).

ID  CONCEPT    ORTHOGRAPHY  VALUE      DOCULECT   COGNACY
1   hand       Hand         hant       German     1
2   hand       hand         hænd       English    1
3   hand       рука         ruka       Russian    2
4   hand       рука         ruka       Ukrainian  2
5   leg        Bein         bain       German     3
6   leg        leg          lɛg        English    4
7   leg        нога         noga       Russian    5
8   leg        нога         noha       Ukrainian  5
9   Woldemort  Waldemar     valdemar   German     6
10  Woldemort  Woldemort    wɔldemɔrt  English    6
11  Woldemort  Владимир     vladimir   Russian    6
12  Woldemort  Володимир    volodimir  Ukrainian  6
13  Harry      Harald       haralt     German     7
14  Harry      Harry        hæri       English    7
15  Harry      Гарри        gari       Russian    7
16  Harry      Гаррі        hari       Ukrainian  7
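As a minimal sketch of how such a word list can be processed, the following Python snippet reads a few of the rows shown above from tab-separated text and groups the word identifiers by cognate set. The tab delimiter and the function names `read_wordlist` and `cognate_sets` are assumptions made for illustration; they are not part of EDICTOR or LingPy.

```python
import csv
import io
from collections import defaultdict

# A few rows from the word list above, stored as tab-separated text.
# (Assumption: tabs as delimiter; any CSV dialect would work the same way.)
WORDLIST_TSV = """ID\tCONCEPT\tORTHOGRAPHY\tVALUE\tDOCULECT\tCOGNACY
1\thand\tHand\thant\tGerman\t1
2\thand\thand\thænd\tEnglish\t1
3\thand\tрука\truka\tRussian\t2
4\thand\tрука\truka\tUkrainian\t2
5\tleg\tBein\tbain\tGerman\t3
6\tleg\tleg\tlɛg\tEnglish\t4
"""


def read_wordlist(text):
    """Parse tab-separated text into a dict keyed by the integer ID column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {int(row["ID"]): row for row in reader}


def cognate_sets(wordlist):
    """Group word IDs by the value of their COGNACY column."""
    sets = defaultdict(list)
    for idx, row in wordlist.items():
        sets[row["COGNACY"]].append(idx)
    return dict(sets)


wordlist = read_wordlist(WORDLIST_TSV)
sets = cognate_sets(wordlist)
```

Each row stays a plain dictionary, so further columns (alignments, morpheme glosses) can be added without changing the reader.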


EDICTOR, Version 0.1 (List 2016): http://edictor.digling.org

Examples for Basic Fields

DOCULECT: The name of the language variety (EDICTOR prefers a simple name, without special characters and whitespace).
CONCEPT: The gloss for the concept expressed by the word (content arbitrary, as long as different concepts are given different glosses).
TOKENS: Machine-readable representation of the phonetic transcription. Space-segmented, that is, phonetic/phonological units (“sounds”) are separated by spaces.

Relations

COGNACY: The identifier for cognate sets; words which have the same identifier are grouped into the same cognate set.
ALIGNMENT: The aligned representation of the word, as aligned with its cognate words.
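The relation between the two fields can be illustrated with a small consistency check: all words in one cognate set should have alignments with the same number of columns. This is only a sketch; the alignment strings are illustrative and not taken from the slides, and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (DOCULECT, COGNACY, ALIGNMENT).
# Alignments are space-segmented, with "-" marking a gap.
rows = [
    ("German", 1, "h a n t"),
    ("English", 1, "h æ n d"),
    ("Russian", 2, "r u k a"),
    ("Ukrainian", 2, "r u k a"),
]


def alignment_widths(rows):
    """Collect, per cognate set, the set of alignment lengths found."""
    widths = defaultdict(set)
    for _doculect, cogid, alignment in rows:
        widths[cogid].add(len(alignment.split()))
    return dict(widths)


def is_consistent(rows):
    """True if every cognate set has exactly one alignment length."""
    return all(len(w) == 1 for w in alignment_widths(rows).values())
```

A word with a shorter alignment in the same set, for example a three-column string added to set 1, would make `is_consistent` return `False`.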


Compounding

Handling Compounding in SEA Languages

The following is joint work with Nathan Hill (SOAS, London). We developed the ideas for handling and analysing word formation in a project on the history of the Burmish language family.

Compounding in SEA Languages

German     m  oː  n  t  -
English    m  uː  n  -  -
Danish     m  ɔː  n  -  ə
Swedish    m  oː  n  -  e

Fúzhōu     ŋ  u  o  ʔ  ⁵   -  -  -  -  -   -  -  -  -  -
Měixiàn    ŋ  i  a  t  ⁵   -  -  -  -  -   k  u  o  ŋ  ⁴⁴
Guǎngzhōu  j  -  y  t  ²   l  -  œ  ŋ  ²²  -  -  -  -  -
Běijīng    -  y  ɛ  -  ⁵¹  l  i  ɑ  ŋ  -   -  -  -  -  -

[Figure: histogram of the number of morphemes per word (1 to 4) against relative frequency, for all words and for nouns, with the values 30% and 50% highlighted. Caption: Compounds in the basic vocabulary (Swadesh 1952) across 23 Chinese dialects (data by Hamed and Wang 2006).]

Partial Cognacy

‘Cognacy is not a binary relation which is either present or not. Instead, we can distinguish different subtypes of cognacy, just as biologists can identify specific types of homology between genes.’ (List 2016: 133)

Language-External Annotation

TOKENS (SEGMENTS): Our phonetic transcriptions are expanded by a layer of morphological segmentation, in which the morphemes of a word form are separated by a morpheme separation character (usually a +).
PARTIAL_COGNATES: Partial cognate relations are indicated by partial cognate identifiers, which are separated by spaces in the partial-cognate column.
PARTIAL_ALIGNMENTS: When aligning the words, we still write all alignments in a single column, but we align each morpheme of a word only within the partial cognate set to which it belongs.
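The two columns can be read together as in the following sketch, which splits a morpheme-segmented token string at the + separator and pairs each morpheme with its partial cognate identifier. The function names are hypothetical, and the identifiers used in the usage note are illustrative rather than taken from an actual dataset.

```python
def split_morphemes(tokens, separator="+"):
    """Split a space-segmented token string into one list of sounds per morpheme."""
    morphemes, current = [], []
    for token in tokens.split():
        if token == separator:
            morphemes.append(current)
            current = []
        else:
            current.append(token)
    morphemes.append(current)
    return morphemes


def pair_morphemes(tokens, partial_cognates):
    """Pair each morpheme with its partial cognate ID.

    Raises ValueError if the number of morphemes and IDs differ, which
    usually points to an annotation error.
    """
    morphemes = split_morphemes(tokens)
    cogids = [int(c) for c in partial_cognates.split()]
    if len(morphemes) != len(cogids):
        raise ValueError("number of morphemes and partial cognate IDs differ")
    return list(zip(cogids, morphemes))
```

For the Měixiàn form, `pair_morphemes("ŋ i a t ⁵ + k u o ŋ ⁴⁴", "1 2")` pairs the first morpheme with set 1 and the second with set 2.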

Partial cognates annotated in this way can be directly converted into the “normal” cognate sets familiar from quantitative phylogenetic analyses. Partial cognate annotation would be extremely tedious in spreadsheets or other formats. The EDICTOR tool, however, supports partial cognate annotation, and LingPy offers tools for partial cognate identification and alignment (List et al. 2016).
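The slides do not spell out how the conversion is done. The sketch below shows two plausible conventions, labelled "strict" and "loose" here as an assumption: under the strict one, two words fall into the same cognate set only if all their partial cognate IDs are identical; under the loose one, words are grouped whenever they are linked by a chain of shared morphemes. The data and the entry named "Hypothetical" are illustrative.

```python
def strict_cognates(partial):
    """Assign the same cognate ID only to words with identical partial cognate IDs."""
    seen, out = {}, {}
    for word, cogids in partial.items():
        key = tuple(cogids)
        seen.setdefault(key, len(seen) + 1)
        out[word] = seen[key]
    return out


def loose_cognates(partial):
    """Assign the same cognate ID to words linked by a chain of shared morphemes."""
    out, next_id = {}, 1
    for start in partial:
        if start in out:
            continue
        stack = [start]
        while stack:
            word = stack.pop()
            if word in out:
                continue
            out[word] = next_id
            # Visit every unassigned word sharing at least one partial cognate ID.
            stack.extend(
                other for other in partial
                if other not in out and set(partial[other]) & set(partial[word])
            )
        next_id += 1
    return out


# Illustrative partial cognate IDs for words meaning 'moon'.
partial = {
    "Fúzhōu": [1],
    "Měixiàn": [1, 2],
    "Guǎngzhōu": [1, 3],
    "Běijīng": [1, 3],
    "Hypothetical": [4],  # invented entry sharing no morpheme with the others
}
strict = strict_cognates(partial)
loose = loose_cognates(partial)
```

Which convention is appropriate depends on the analysis: the strict coding keeps compounds with different structure apart, while the loose coding treats any shared material as evidence of cognacy.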


Language-Internal Annotation

MORPHEMES: We can annotate the morpheme-separated word forms further by using a very straightforward, space-segmented format of “word-form glossing”, in which we use the same identifier (a brief gloss for a concept) to gloss the structure of word forms. In this way, we can annotate the semantic structure of multi-morphemic compounds (compare Chinese shùpí 树皮 ‘bark’, glossed as "tree skin"), and at the same time annotate language-internal cognates (word families or “allofams”) in a transparent way.
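A small sketch of how word-form glossing makes language-internal cognates visible: every gloss that recurs across entries of the same language points to a word family. The entry for ‘bark’ follows the example above; the entries for ‘tree’ and ‘skin’ and the function name `word_families` are added for illustration only.

```python
from collections import defaultdict

# (CONCEPT, morpheme-segmented form, MORPHEMES glosses) for one language.
entries = [
    ("bark", "shù + pí", "tree skin"),
    ("tree", "shù", "tree"),
    ("skin", "pí", "skin"),
]


def word_families(entries):
    """Map each gloss used in more than one entry to the concepts it occurs in."""
    families = defaultdict(list)
    for concept, form, glosses in entries:
        morphemes = [m.strip() for m in form.split("+")]
        gloss_list = glosses.split()
        if len(morphemes) != len(gloss_list):
            raise ValueError(f"{concept}: morphemes and glosses do not match")
        for gloss in gloss_list:
            families[gloss].append(concept)
    # Keep only glosses that link at least two entries.
    return {g: c for g, c in families.items() if len(c) > 1}


families = word_families(entries)
```

Here the gloss "tree" links ‘bark’ and ‘tree’, and the gloss "skin" links ‘bark’ and ‘skin’.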

Morpheme annotation using word-form glossing is supported by the EDICTOR tool and will be supported, in the form of partial colexification analyses, in future versions of LingPy.


Ancestral State Reconstruction

[Figure: ancestral state reconstruction for the morphemes 月, 光 and 亮 in Běijīng (B), Guǎngzhōu (G), Měixiàn (M) and Fúzhōu (F). Forms shown: ŋiat⁵, kuoŋ⁴⁴, jyt², ŋuoʔ⁵, liɑŋ¹, lœŋ²², yɛ⁵¹. Event labels in the second version: LOSS, INNOVATION (twice), BORROWING.]

Once we have annotated the compounds in lexical data for partial cognacy, we can use various methods to infer how the data can best be explained by assuming different processes of change in compounding. Currently, only prototypes are available (see, e.g., List 2016), but evolutionary biology, with its framework of tree reconciliation methods, already offers a rich arsenal of methods which we can adapt to our linguistic needs in the future. Given that compounding is treated carelessly in most etymological dictionaries of SEA languages, the development of new quantitative analyses inspired by biological techniques may turn out to be very fruitful for our discipline.

[Figure: compound structures for ‘moon’ mapped onto a tree of Chinese dialects. Dialects: Fúzhōu, Táiběi, Xiàmén, Zhāngpíng, Guǎngzhōu, Měixiàn, Liánchéng, Wēnzhōu, Níngbō, Sūzhōu, Shànghǎi, Shànghǎi_B, Nánchāng, Ānyì, Chángshā, Shuāngfēng, Yàngshān, Wǔhàn, Níngxià, Chéngdū, Běijīng, Tàiyuán, Yúcì. Group labels: Mǐn, Hakka, Wú, Gàn, Xiāng, Guānhuà. Forms: 月, 月娘, 月光佛, 月光, 月亮, 月明. Glosses: ‘MOON’, ‘MOON-MOTHER’, ‘MOON-LIGHT’, ‘MOON-LIGHT-SUFFIX’, ‘MOON-SHINE’, ‘MOON-BRIGHT’. Source: List (2016, Journal of Language Evolution).]

Pattern Analysis

Freq.  Note
82     Mostly a-prefix followed by non-cognate items, e.g., Bola [a³¹ + thaʔ⁵⁵] vs. Achang [a³¹ + lum³¹] ‘above’.
58     Mostly one plain noun vs. one with an additional part, e.g., Bola [tʃa̱m⁵⁵] vs. Lashi [tsɔ̱m⁵⁵ + mou⁵⁵] {cloud + sky} ‘cloud’.
53     Mostly one plain noun vs. one with an additional part in front, often simply the a-prefix, e.g., Atsi [siŋ²¹] vs. Achang [a³¹ + ʂəŋ³¹] ‘liver’.
36     Common main noun, but different preceding noun, e.g., Bola [mjaŋ⁵⁵ + kʰui³⁵] {thunder + dog} vs. Lashi [wɔm³¹ + kʰui⁵⁵] {bear + dog} ‘wolf’.
34     The common part is prefixed in one word and suffixed in the other, e.g., Atsi [u²¹ + tsham⁵¹] {head + hair} vs. Rangoon [shɑ̃²² + pĩ²²] {hair + ?} ‘hair (of head)’.
77     Remaining: 30 unique patterns, 15 of them occurring only once.

Partial cognate patterns in Burmish languages (Hill & List, work in progress). [The Pattern column of the original table was graphical.]

To our knowledge, partial cognate pattern analysis has not been carried out in linguistics so far. It may, however, provide important evidence on the directionality of certain patterns. Regardless of directionality, it may also be interesting to see to what degree language families in which compounding is frequent differ in their major patterns.
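One way to make such a pattern analysis concrete is sketched below. This is an assumed encoding, not the procedure used by Hill and List: for a pair of words, each morpheme is marked "S" if its partial cognate ID also occurs in the other word and "U" otherwise, and the resulting pattern pairs are counted. The partial cognate IDs are hypothetical.

```python
from collections import Counter


def pair_pattern(cogids_a, cogids_b):
    """Encode, morpheme by morpheme, whether each word shares material with the other."""
    shared = set(cogids_a) & set(cogids_b)

    def encode(cogids):
        return "".join("S" if c in shared else "U" for c in cogids)

    return encode(cogids_a), encode(cogids_b)


# Hypothetical partial cognate IDs for four word pairs from the table above.
pairs = {
    "above": ([1, 2], [1, 3]),   # shared first part, different second part
    "cloud": ([4], [4, 5]),      # plain noun vs. noun plus following part
    "liver": ([6], [7, 6]),      # plain noun vs. noun with part in front
    "wolf": ([8, 9], [10, 9]),   # common main noun, different preceding noun
}
patterns = Counter(pair_pattern(a, b) for a, b in pairs.values())
```

Counting these pattern pairs over all word pairs of a dataset yields a frequency table comparable in spirit to the one shown above.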

Derivation

Handling Derivation in Sino-Tibetan Languages

The following is joint work with Guillaume Jacques (CRLAO, Paris), who provided a first explicit annotation of verbal derivation in Kiranti languages (an annotated version of Jacques forthcoming, A reconstruction of Proto-Kiranti verb roots, Folia Linguistica Historica). We developed the ideas for handling derivation, and preliminary ideas for analysing it, in several discussions over the last months.

Derivation in SEA Languages

[Figure: derivation tree with the nodes *rib, rab.rib, srib, sa.srib, grib, grib.ma, sgrib, ɴgrib.]

Exemplary derivations of word forms from the Tibetan root *rib 'to be dark', as proposed by Jacques (2016, Une famille de mot en Tibétain, URL: https://panchr.hypotheses.org/1273), visualized as a derivation tree. The root form itself is not attested, but it can be posited on the basis of the semantic relations between the members of the word family.

Levels of Annotation

Hierarchical Aspect: The hierarchical aspect plays a much more crucial role in derivations than in plain transparent compounds. Instead of a simple morpheme segmentation, an annotation of derivations needs to account for this aspect.
Paradigmatic Aspect: While compounds are syntagmatic structures, derivations may further show paradigmatic variation. If derivation is expressed by a voicing contrast, as, for example, the anticausative in some Kiranti languages (Jacques forthcoming), this is a paradigmatic change which cannot be modeled with the help of alignments. A transparent annotation of derivations also needs to account for this.

Language-Internal Annotation

Basic Principle of Annotation: As a general idea, we can model syntagmatic derivation in much the same way as we propose to model compounding. As a rule, the machine-readable transcription of words is provided in morpheme-segmented form.
Hierarchical Annotation: In most cases, the hierarchical aspect can be modeled by annotating that one element attaches to another element. Bantawa sakt ‘to weed’, an applicative derivation of an intransitive verb, can thus be written as "s a k ← t", indicating that the applicative marker "t" attaches as a suffix to the main morpheme (Jacques forthcoming).
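The attachment notation can be parsed with a few lines of Python. The sketch below covers only the suffix case described above ("s a k ← t"); how prefixes or multiple attachments would be written is not specified in the slides, so they are deliberately left out, and the function name `parse_attachment` is hypothetical.

```python
def parse_attachment(segments, marker="←"):
    """Split a space-segmented form into its base and the suffix attached to it.

    Only a single suffix attachment (base ← suffix) is handled; a form
    without the marker is returned as a base with an empty suffix.
    """
    tokens = segments.split()
    if marker not in tokens:
        return {"base": tokens, "suffix": []}
    pos = tokens.index(marker)
    return {"base": tokens[:pos], "suffix": tokens[pos + 1:]}


sakt = parse_attachment("s a k ← t")
```

For Bantawa sakt, the base is the sound sequence s a k and the suffix is the applicative marker t.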

Paradigmatic Annotation: Paradigmatic annotation requires two additional layers to allow for sufficiently abstract handling. Each word form is linked to a root in a separate ROOT column and to a stem in a separate STEM column. Syntagmatic derivations can be placed in a DERIVATION column, in which derivation is annotated in a form similar to the one we used for word-form glossing. The hierarchical relations underlying the derivations need to be handled independently of the word list and can be passed to the word list in the form of directed networks.

Cognate Annotation: Cognate annotation follows the partial cognate annotation principle, but word forms with paradigmatic variation in their stems are assigned different identifiers, reflecting the fact that they cannot be aligned. An additional cognate identifier is needed to make sure that different stems can be assigned to the same root.

Language-External Annotation

We can also use the language-internal annotation for language-external annotation, with the difference that the root and stem cells now hold the proto-forms of the proposed proto-language. We cannot yet adequately handle cases in which the actual derivation cannot be traced back to the proto-language, since these would require introducing intermediate ancestral forms. More examples are needed to handle these cases.

DOCULECT  SEGMENTS  ROOT      STEM         DERIVATION
French    sol←ej    *soh₂wl-  *soh₂wl + ?  REKTUS DIM
Spanish   sol       *soh₂wl-  *soh₂wl      REKTUS
German    zɔnɛ      *soh₂wl-  *sh₂en       OBLIQUUS
Swedish   suːl      *soh₂wl-  *soh₂wl      REKTUS

Annotating words for ‘sun’ in four Indo-European languages.
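As a rough sketch of what this annotation makes possible, the rows for ‘sun’ can be stored as dictionaries and grouped first by root and then by stem, so that words sharing a root but differing in their stems stay apart. The function name `stems_by_root` is hypothetical.

```python
from collections import defaultdict

# The four rows for 'sun', with the columns used in the annotation.
rows = [
    {"DOCULECT": "French", "SEGMENTS": "sol←ej", "ROOT": "*soh₂wl-",
     "STEM": "*soh₂wl + ?", "DERIVATION": "REKTUS DIM"},
    {"DOCULECT": "Spanish", "SEGMENTS": "sol", "ROOT": "*soh₂wl-",
     "STEM": "*soh₂wl", "DERIVATION": "REKTUS"},
    {"DOCULECT": "German", "SEGMENTS": "zɔnɛ", "ROOT": "*soh₂wl-",
     "STEM": "*sh₂en", "DERIVATION": "OBLIQUUS"},
    {"DOCULECT": "Swedish", "SEGMENTS": "suːl", "ROOT": "*soh₂wl-",
     "STEM": "*soh₂wl", "DERIVATION": "REKTUS"},
]


def stems_by_root(rows):
    """Group doculects by stem within each root."""
    tree = defaultdict(lambda: defaultdict(list))
    for row in rows:
        tree[row["ROOT"]][row["STEM"]].append(row["DOCULECT"])
    return {root: dict(stems) for root, stems in tree.items()}


grouped = stems_by_root(rows)
```

All four words end up under the same root, while Spanish and Swedish share a stem that German and French do not.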


Given the derivation patterns provided by G. Jacques, the data structure in the demo application could be generated automatically. The same applies to the stems, which were derived by applying the conversion rules proposed for each derivation process. Even in this preliminary form, the data offer a much higher degree of transparency than classical etymological dictionaries, and further analyses can easily be carried out.

Phylogenetic Reconstruction (?)

Since the direction of compounding processes is very hard to estimate (and may well be highly language-specific or even show no preference at all), we need a reference phylogeny to get initial insights into the major processes. In derivation, however, we usually have a clear idea of the direction of the processes. Since directional processes are very useful for phylogenetic estimation, we suppose that a larger dataset annotated in this form may be well suited to phylogenetic reconstruction methods. For this, however, more languages need to be added to G. Jacques’ Kiranti database.

Pattern Analysis

[Figure: network of derivational transitions between the categories causative, reflexive, deponent, applicative, anticausative, intransitive and transitive. Edge frequencies shown: 18, 4, 8, 4, 14, 2, 2, 70, 3, 9, 10, 9. Caption: Frequency of derivational transitions in G. Jacques’ Kiranti data.]

Outlook

Challenges for the Future

Analogy: So far, we cannot handle analogy, but we will need to handle it at some point, both in annotation and in analysis.
Software and Tools: We need better heuristics to produce proper data, and tools to correct computational preprocessing within the annotation frameworks which we define.
Complexity and Comparability: We surely do not yet cover everything that is possible in languages, be it in synchrony or diachrony, so we need to develop more and more examples to enhance our specification.

Linguists, myself included, often complain about the insufficiency of the new biological methods applied to linguistic data. But if we, as linguists, fail to annotate our data in such a way that it can be easily compared across languages and applications, if we do not work harder to make our data transparent enough to be read by humans and machines, we should not complain that computational linguists use lousy data and lousy algorithms. Better and sufficient algorithms will come, if only we manage to increase the comparability of our linguistic data.

Thank you for your attention!
