Slide 1

From sounds to histories in 8 steps using (mostly) off-the-shelf tools
Gereon Kaiping (with Marian Klamer)
NWO VICI Project: Languages of the Lesser Sunda Islands
Leiden University, NL

Slide 2

8 “Easy” Steps
1. Fieldwork & Transcription
2. Database
3. “Cognate” Autocoding
4. Manual Checking and Editing of Classes
5. BEAST Configuration
6. Ancestral State Inference
7. Plotting Trees
8. Interpretation & Iteration
Links to our components

Slide 3

1. Fieldwork and Transcription
• Fieldwork in East Indonesia: audio and video recordings, notes, transcriptions
• Different transcription conventions (y/j; ng/ŋ; aa/a:)
• Goal: CLPA

Slide 4

2. LexiRumah Database: CLLD
• Cross-Linguistic Linked Database: connected to Glottolog and Concepticon
• Codebase: Lexibank (from the Glottobank project)
• Cross-Linguistic Data Format (CLDF) for word lists, with automatic aggregation for LingPy
http://concepticon.clld.org/
https://github.com/clld/lexibank
https://github.com/glottobank/cldf/tree/master/wordlists
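To give a feel for the format: a CLDF wordlist is essentially a table of forms keyed by lect and concept. A minimal sketch of loading one with only the standard library (in practice pycldf handles this; the column names Language_ID, Parameter_ID, and Form follow the CLDF wordlist module, the function name is ours):

```python
import csv
from collections import defaultdict

def read_wordlist(tsv_file):
    """Group forms by concept across lects from a CLDF-style wordlist.

    `tsv_file` is any file-like object with tab-separated rows using
    the standard CLDF wordlist columns Language_ID, Parameter_ID
    (the concept) and Form.
    """
    by_concept = defaultdict(dict)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        by_concept[row["Parameter_ID"]][row["Language_ID"]] = row["Form"]
    return by_concept
```

With forms grouped per concept like this, a comparison table across all lects (as used for cognate coding later) is one loop away.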

Slide 5

LexiRumah: Content (i)
Word lists from the Lesser Sunda Islands in Eastern Indonesia
(Map of the region, showing Indonesia, Papua New Guinea, and Australia)

Slide 6

LexiRumah: Content (ii)
• 124 lects from 2 families: 72 Austronesian, 52 Timor-Alor-Pantar
• 200–600 concepts
• 40 000 lexical items
• More to be added
http://lessersunda.ullet.net/lexirumah/

Slide 7

LexiRumah: Challenges
• Most data comes from fieldwork (2002–2017); only some from published sources (different data model)
• Field data originally collected in Excel spreadsheets: inconsistencies in transcription conventions and metadata
• Some lects are not in Glottolog yet; some concepts are not in Concepticon
• Minor changes (map zoom etc.)
• To do: web frontend improvements

Slide 8

3. “Cognate” Auto-coding
• LingPy LexStat for automatic coding to bootstrap “cognates”
• Yields classes of sufficiently similar forms (loans & cognates, but possibly not distant cognates)

    from lingpy.compare.lexstat import LexStat

    lex = LexStat("wordlist.tsv")  # word list in LingPy format; filename illustrative
    lex.get_scorer(
        preprocessing=False,
        runs=10000,
        ratio=(2, 1),
        vscale=1.0)
    lex.cluster(
        cluster_method='upgma',
        method='lexstat',
        ref='auto_cogid',
        threshold=0.8)

http://lingpy.org/docu/compare/lexstat.html

Slide 9

LingPy Assembly
• LingPy version 2 tries to be clever and provide full working steps – small variants can be hard, e.g. merging automatic and manual coding
• Similar but not identical format to CLDF
• Runs out of memory when auto-aligning our data
• Hacked together with a Python script: LingPy interspersed with Pandas for minor modifications
• Goal: LingPy 3 and iterative methods (CogDetect)
http://github.com/PhyloStar/CogDetect
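One of the small variants mentioned above – merging automatic and manual coding – can be sketched in a few lines of Pandas. The column `auto_cogid` is the LexStat output from the previous step; `cogid` (manual judgements, missing where none exist yet) and `final_cogid` are hypothetical names for illustration:

```python
import pandas as pd

def merge_codings(forms: pd.DataFrame) -> pd.DataFrame:
    """Prefer the manual cognate class where one exists,
    fall back to the automatic LexStat class otherwise."""
    merged = forms.copy()
    merged["final_cogid"] = merged["cogid"].combine_first(merged["auto_cogid"])
    return merged
```

`Series.combine_first` does exactly this overlay: values from the first series, holes filled from the second.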

Slide 10

4. Manual Checking & Editing of Cognate Classes
Edit LingPy-suggested classes in Edictor.
http://edictor.digling.org/

Slide 11

Working with Edictor (i)
• Crashes when working with a selection of ‘Doculects’
• Checking & coding one concept in 124 lects at the same time is challenging

Slide 12

5. BEAST configuration
• BEAST for inference – but writing BEAST config files for linguistic inference is hard
• BEASTling: a translator from CLDF to BEAST – scriptable, with reasonable defaults, Glottolog support, …
https://github.com/lmaurits/BEASTling

Slide 13

BEASTling model description

    # -*- coding: utf-8 -*-
    [admin]
    basename = sunda_lexicon_ratevar
    [MCMC]
    chainlength = 10000000
    [languages]
    exclusions = p-aust1307-acd,...
    [model lexicon]
    model = covarion
    rate_variation = True
    data = cognates.tsv
    file_format = cldf
    reconstruct = meanings.txt

• data: the Edictor file (with CLDF column headers)
• reconstruct: a list of 50 interesting meanings
• exclusions: LexiRumah contains some proto-languages

Slide 14

6. Ancestral State Inference
• MCMC running in BEAST, with various packages: beast-classic, BEASTlabs, morph-models
• Output: a 4.9 GiB Nexus file

    #NEXUS
    Begin taxa;
        Dimensions ntax=93;
        Taxlabels
            abui1241-fuime
            abui1241-petle
            ...
            wers1238-marit
            ;
    End;
    Begin trees;
        Translate
            1 abui1241-fuime,
            2 abui1241-petle,
            ...
            93 wers1238-marit
            ;
        tree STATE_0 = (((((((((((1[&recon_lexicon:1pl incl="00aa11a",recon_lexicon:above="00a001000000...
        tree STATE_1000 = (((((((((((((((((((1[&recon_lexicon:1pl incl="0000100",recon_lexicon:above="0...
        ... (one sampled tree per 1000 MCMC states) ...

http://www.beast2.org/
https://github.com/CompEvol/CBAN

Slide 15

7. Plotting trees: Apparent Solutions
• TreeAnnotator + FigTree
• “ape” R package
• DensiTree
• DendroPy
• phyltr + ete3
Except none of these works for our case:
• Can’t read the whole file into memory and compute there
• BEAST’s Nexus format puts [&pull="01ab00"] annotations inside the Newick strings
• Big lists: needs to be automatic and scriptable
http://www.beast2.org/treeannotator/
http://tree.bio.ed.ac.uk/software/figtree/
http://ape-package.ird.fr/
https://www.cs.auckland.ac.nz/~remco/DensiTree/
https://github.com/lmaurits/phyltr
http://etetoolkit.org/

Slide 16

Aggregating trees to plot
• Tree stream editor based on phyltr: BEAST Nexus → New Hampshire eXtended ([&&NHX:]-annotated) Newick tree stream
• (Absolute) burn-in and subsampling
• Pruning languages
• Aggregating a consensus tree, aware of the binary coding
• This process is slow and memory- or disk-space-hungry, but works
• Better: rework the rest of phyltr into treesed; try DendroPy?
https://github.com/Anaphory/treesed
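The burn-in and subsampling steps above can be sketched as a generator over a stream of Newick strings, so the 4.9 GiB posterior sample never has to fit in memory (a minimal sketch; the function name is illustrative, not treesed’s actual interface):

```python
from typing import Iterable, Iterator

def thin_tree_stream(trees: Iterable[str], burnin: int, step: int) -> Iterator[str]:
    """Drop the first `burnin` trees, then keep every `step`-th tree.

    `trees` is any iterable of Newick strings (e.g. lines of a file),
    so arbitrarily large tree samples stream through in O(1) memory.
    """
    for i, tree in enumerate(trees):
        if i >= burnin and (i - burnin) % step == 0:
            yield tree
```

Fed a file handle, this keeps only the thinned trees, which downstream pruning and consensus aggregation then see as an ordinary (much smaller) tree stream.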

Slide 17

Plotting trees
• A small Python script using ete3 for plotting the tree
https://gist.github.com/Anaphory/cecaa9b51280bc52be1a80b145bf7af0
http://etetoolkit.org/

Slide 18

8. Interpretation & Iteration
Iterative improvement:
• Find bad codings
• Find data gaps
• Improve coding automatically: guide trees (and explicit reconstructions?)
Data visualizations for hypothesis generation:
• Hypotheses about proto-forms and borrowings
• Hypotheses about shared lexical innovations

Slide 19

Proto-forms & Lexical Innovations
‘to stand’: *tas, *diri, *tide, *hamari(k)
• Proto-TAP *tas; innovation in the Kula & Sawila group?
• Proto-MP (AN): Proto-Flores *diri, Proto-Timor *hamari(k)
• Innovation in Alor?

Slide 20

Borrowings
‘fishing hook’ *kavila: Austronesian > TAP, or vice versa?

Slide 21

Find bad codings
(Map of forms: kis, kisi vs. *kavila)

Slide 22

Find data gaps: ‘island’

Slide 23

Conclusions
• Most components are easily available
• Some “duct tape” required: variations in data formats
• Modifications and programming needed for actual Big Data
• Easy to add dates and phylogeography
Still open:
• Manual work is still required – how can we make doing and checking it easier and faster?
• How to work with lots of language contact?
• And how do we make results explicitly usable by historical linguists (word forms, contact situations, etc.)?

Slide 24

“Thank You”s
● Marian Klamer
● George Saad, Francesca Moro
● Hannah Fricke, Jiang Wu, Emily Maas
● Mattis List, Luke Maurits, Johannes Dellert, Robert Forkel
● Funding: NWO
● and YOU!
