Some Assembly Required: From sounds to histories in 8 steps using mostly off-the-shelf tools.

From sounds to histories in 8 steps using (mostly) off-the-shelf tools
Gereon Kaiping (with Marian Klamer)
NWO VICI Project: Languages of the Lesser Sunda Islands
Leiden University, NL

8 “Easy” Steps
1. Fieldwork & Transcription
2. Database
3. “Cognate” Autocoding
4. Manual Checking and Editing of classes
5. BEAST Configuration
6. Ancestral State Inference
7. Plotting Trees
8. Interpretation & Iteration
Links to our components

1. Fieldwork and Transcription
• Fieldwork in East Indonesia: Audio and video recordings, notes, transcriptions
• Different transcription conventions (y/j; ng/ŋ; aa/a:)
• Goal: CLPA
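The clash of transcription conventions can be sketched as a small normalization pass. This is a minimal illustration, not the project's actual pipeline: the direction of each mapping (y → j, ng → ŋ, aa or a: → aː) and the function name `normalize` are assumptions, and a real rule set would have to be decided per source, since e.g. "ng" may also spell a genuine n+g sequence.

```python
import re
import unicodedata

# Hypothetical replacement rules; the target convention is an assumption
# made for illustration, not the convention actually used in LexiRumah.
REPLACEMENTS = [
    (re.compile(r"ng"), "ŋ"),               # digraph -> velar nasal
    (re.compile(r"([aeiou])\1"), r"\1ː"),   # doubled vowel -> long vowel
    (re.compile(r"([aeiou]):"), r"\1ː"),    # ASCII colon -> IPA length mark
    (re.compile(r"y"), "j"),                # y -> palatal glide j
]


def normalize(form: str) -> str:
    """Map one transcribed form onto a single (assumed) convention."""
    form = unicodedata.normalize("NFC", form.strip().lower())
    for pattern, replacement in REPLACEMENTS:
        form = pattern.sub(replacement, form)
    return form
```

Rules are applied in list order, so the digraph is handled before single letters are touched.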

2. LexiRumah Database: CLLD
• Cross-Linguistic Linked Data: Connected to Glottolog and Concepticon
• Codebase: Lexibank (from the Glottobank project)
• Cross-Linguistic Data Format (CLDF) for word lists, with automatic aggregation for LingPy
http://concepticon.clld.org/
https://github.com/clld/lexibank
https://github.com/glottobank/cldf/tree/master/wordlists
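The aggregation of a CLDF word list into LingPy's tabular input might look roughly like the sketch below. The CLDF column names (`Language_ID`, `Parameter_ID`, `Form`) are assumptions and may differ from the columns LexiRumah used; the output header follows LingPy's word list convention (ID, DOCULECT, CONCEPT, IPA). The sample rows are invented placeholders, not real data.

```python
import csv
import io


def cldf_to_lingpy_rows(cldf_file):
    """Turn a CLDF-style forms table into LingPy-style rows.

    The input column names are assumed, not taken from the slides.
    """
    rows = [("ID", "DOCULECT", "CONCEPT", "IPA")]
    for i, record in enumerate(csv.DictReader(cldf_file), start=1):
        rows.append(
            (i, record["Language_ID"], record["Parameter_ID"], record["Form"])
        )
    return rows


# Placeholder data for illustration only.
example = io.StringIO(
    "ID,Language_ID,Parameter_ID,Form\n"
    "f1,lect-a,concept-1,form-x\n"
    "f2,lect-b,concept-1,form-y\n"
)
lingpy_rows = cldf_to_lingpy_rows(example)
```

LingPy expects integer IDs, which is why the rows are renumbered instead of reusing the CLDF form identifiers.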

LexiRumah: Content (i)
Word lists from the Lesser Sunda islands in Eastern Indonesia
[Map labels: Indonesia, Papua New Guinea, Australia]

LexiRumah: Content (ii)
• 124 lects from 2 families: 72 Austronesian, 52 Timor-Alor-Pantar
• 200–600 concepts
• 40 000 lexical items
• More to be added
http://lessersunda.ullet.net/lexirumah/

LexiRumah: Challenges
• Most data comes from fieldwork (2002–2017), only some from published sources (different data model)
• Field data originally collected in Excel spreadsheets: Inconsistencies in transcription conventions and metadata
• Some lects are not in Glottolog yet, some concepts not in Concepticon
• Minor changes (map zoom etc.)
• Todo: Web frontend improvements

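3. “Cognate” Auto-coding
• LingPy LexStat for automatic coding to bootstrap “cognates”
• Classes of sufficiently similar forms (loans & cognates, but possibly not distant cognates)

    from lingpy import LexStat

    # Hypothetical file name; the slide does not show how lex was created
    lex = LexStat('wordlist.tsv')
    # Compute language-pair-specific scoring functions
    lex.get_scorer(
        preprocessing=False,
        runs=10000,
        ratio=(2, 1),
        vscale=1.0)
    # Cluster forms into classes, stored in the column 'auto_cogid'
    lex.cluster(
        cluster_method='upgma',
        method='lexstat',
        ref='auto_cogid',
        threshold=0.8)

http://lingpy.org/docu/compare/lexstat.html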
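To show what "UPGMA clustering with a threshold" does to a set of forms, here is a toy stand-in, not LexStat itself: it replaces LexStat's language-specific scores with plain normalized edit distance, and the function names are made up for this sketch. Clusters are merged by average linkage until the closest pair is at or above the threshold.

```python
from itertools import combinations


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    return edit_distance(a, b) / max(len(a), len(b), 1)


def flat_upgma(forms, threshold):
    """Average-linkage clustering, cut at `threshold`; returns class ids."""
    clusters = [[i] for i in range(len(forms))]

    def linkage(c1, c2):
        total = sum(normalized_distance(forms[i], forms[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[x], clusters[y]) >= threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe: combinations() guarantees x < y

    labels = {}
    for class_id, members in enumerate(clusters, 1):
        for i in members:
            labels[i] = class_id
    return [labels[i] for i in range(len(forms))]


# Strings taken from the "Find bad codings" slide, used here only as test input.
classes = flat_upgma(["kis", "kisi", "kavila"], threshold=0.5)
```

The threshold of 0.5 is chosen for this toy distance only; it is not comparable to the 0.8 used with LexStat scores.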

LingPy Assembly
• LingPy version 2: Tries to be clever and provide full working steps, so small variants can be hard, e.g. merging automatic and manual coding
• Similar but not identical format to CLDF
• Runs out of memory when auto-aligning our data
• Hacked together with a Python script: LingPy interspersed with Pandas for minor modifications
• Goal: LingPy 3 and iterative methods (CogDetect)
http://github.com/PhyloStar/CogDetect
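Merging automatic and manual coding, the step named as hard above, can be sketched in plain Python. This is a stand-in for the Pandas glue code, which the slides do not show; the function name and the rule "manual overrides automatic" are assumptions.

```python
def merge_codings(auto, manual):
    """Combine class ids per form; a manual code wins over an automatic one.

    Both arguments map form ids to class ids. Merged ids are tagged with
    their origin so that automatic class 3 and manual class 3 stay distinct.
    """
    merged = {}
    for form_id, auto_class in auto.items():
        if form_id in manual:
            merged[form_id] = ("manual", manual[form_id])
        else:
            merged[form_id] = ("auto", auto_class)
    return merged


merged = merge_codings(auto={1: 10, 2: 10, 3: 11}, manual={2: 5})
```

Tagging the origin also makes it easy to list which forms still carry unchecked automatic codes.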

4. Manual Checking & Editing of Cognate Classes
• Edit LingPy-suggested classes in Edictor
http://edictor.digling.org/

Working with Edictor (i)
• Crashes when working with a selection of ‘Doculects’
• Checking & coding 1 concept in 124 lects at the same time is challenging.

5. BEAST configuration
• BEAST for inference, but writing BEAST config files for linguistic inference is hard
• BEASTling: a translator from CLDF to BEAST, scriptable, with reasonable defaults and Glottolog support, …
https://github.com/lmaurits/BEASTling

BEASTling model description

    # -*- coding: utf-8 -*-
    [admin]
    basename = sunda_lexicon_ratevar
    [MCMC]
    chainlength = 10000000
    [languages]
    exclusions = p-aust1307-acd,...
    [model lexicon]
    model = covarion
    rate_variation = True
    data = cognates.tsv
    file_format = cldf
    reconstruct = meanings.txt

• Data file: Edictor (with CLDF column headers)
• List of 50 interesting meanings
• LexiRumah contains some protolanguages
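Because the model description is an INI-style file, it can be generated from a script. The sketch below uses Python's standard `configparser`; section and option names are copied from the slide, while the function name `build_config` is hypothetical. That BEASTling reads the written file unchanged is an assumption. The exclusion list on the slide is truncated, so only its first entry appears here.

```python
import configparser
import io


def build_config(basename, data_file, reconstruct_file, exclusions,
                 chainlength=10000000):
    """Return the text of an INI-style model description."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names exactly as written
    config["admin"] = {"basename": basename}
    config["MCMC"] = {"chainlength": str(chainlength)}
    config["languages"] = {"exclusions": ",".join(exclusions)}
    config["model lexicon"] = {
        "model": "covarion",
        "rate_variation": "True",
        "data": data_file,
        "file_format": "cldf",
        "reconstruct": reconstruct_file,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


config_text = build_config(
    basename="sunda_lexicon_ratevar",
    data_file="cognates.tsv",
    reconstruct_file="meanings.txt",
    exclusions=["p-aust1307-acd"],  # the slide's list continues beyond this
)
```

Generating the file this way makes it easy to produce variants (other models, other exclusion lists) for comparison runs.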

6. Ancestral State Inference
• MCMC running in BEAST, with various packages: beast-classic, BEASTlabs, morph-models
• Output: a 4.9 GiB Nexus file

    #NEXUS
    Begin taxa;
    Dimensions ntax=93;
    Taxlabels
    abui1241-fuime
    abui1241-petle
    ...
    wers1238-marit
    ;
    End;
    Begin trees;
    Translate
    1 abui1241-fuime,
    2 abui1241-petle,
    ...
    93 wers1238-marit
    ;
    tree STATE_0 = (((((((((((1[&recon_lexicon:1pl incl="00aa11a",recon_lexicon:above="00a001000000
    tree STATE_1000 = (((((((((((((((((((1[&recon_lexicon:1pl incl="0000100",recon_lexicon:above="0
    tree STATE_2000 = ((((((((((((((((((((1[&recon_lexicon:1pl incl="00000b0",recon_lexicon:above="
    tree STATE_3000 = ((((((((((((((((((((1[&recon_lexicon:1pl incl="0000000",recon_lexicon:above="
    tree STATE_4000 = ((((((((((((((((((((((1[&recon_lexicon:1pl incl="0000001",recon_lexicon:above
    tree STATE_5000 = (((((((((((((((((((((1[&recon_lexicon:1pl incl="0000011",recon_lexicon:above=
    tree STATE_6000 = (((((((((((((((((((((1[&recon_lexicon:1pl incl="0000000",recon_lexicon:above=
    tree STATE_7000 = (((((((((((((((((1[&recon_lexicon:1pl incl="0000001",recon_lexicon:above="000
    tree STATE_8000 = (((((((((((((1[&recon_lexicon:1pl incl="0000010",recon_lexicon:above="0000000
    tree STATE_9000 = (((((((((((((1[&recon_lexicon:1pl incl="0000000",recon_lexicon:above="0000000
    tree STATE_10000 = ((((((((((1[&recon_lexicon:1pl incl="0000001",recon_lexicon:above="000000000
    tree STATE_11000 = (((((((((1[&recon_lexicon:1pl incl="0000001",recon_lexicon:above="0000000000

http://www.beast2.org/
https://github.com/CompEvol/CBAN

7. Plotting trees: Apparent Solutions
• TreeAnnotator + FigTree
• “ape” R package
• DensiTree
• DendroPy
• phyltr + ete3
Except none of this works for our case:
• Can’t read whole file into memory and compute there
• BEAST’s Nexus format with [&pull="01ab00"] annotations inside Newick strings
• Big lists: Automatic, scriptable
http://www.beast2.org/treeannotator/
http://tree.bio.ed.ac.uk/software/figtree/
http://ape-package.ird.fr/
https://www.cs.auckland.ac.nz/~remco/DensiTree/
https://github.com/lmaurits/phyltr
http://etetoolkit.org/

Aggregating trees to plot
• Tree stream editor based on phyltr:
• BEAST Nexus → New Hampshire eXtension ([&&NHX:]) annotated Newick tree stream
• (Absolute) burnin and subsampling
• Pruning languages
• Aggregating consensus tree, aware of binary coding
• This process is slow and memory or disk space hungry, but works
• Better: Rework the rest of phyltr into treesed; try DendroPy?
https://github.com/Anaphory/treesed
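The burn-in and subsampling step can be done as a stream, so the multi-gigabyte file never has to fit into memory. The sketch below is not treesed's implementation; the function name and parameters are made up for illustration, and it assumes one tree per line starting with `tree `, as in the Nexus excerpt shown for step 6.

```python
def thin_trees(lines, burnin=0, every=1):
    """Yield Nexus lines, dropping unwanted trees on the fly.

    The first `burnin` trees (an absolute count) are discarded; after
    that, only every `every`-th tree is kept. Non-tree lines pass through.
    """
    seen = 0
    for line in lines:
        if line.lstrip().lower().startswith("tree "):
            index = seen
            seen += 1
            if index < burnin or (index - burnin) % every != 0:
                continue
        yield line


sample = [
    "#NEXUS\n",
    "Begin trees;\n",
    "tree STATE_0 = (1,2);\n",
    "tree STATE_1000 = (1,2);\n",
    "tree STATE_2000 = (1,2);\n",
    "tree STATE_3000 = (1,2);\n",
    "tree STATE_4000 = (1,2);\n",
    "End;\n",
]
kept = list(thin_trees(sample, burnin=1, every=2))
```

On a real run, `lines` would be an open file object, and the generator's output would be written straight to another file or piped into the next tool.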

Plotting trees
• A small Python script using ete3 for plotting the tree
https://gist.github.com/Anaphory/cecaa9b51280bc52be1a80b145bf7af0
http://etetoolkit.org/

8. Interpretation & Iteration
Iterative Improvement
• Find bad codings
• Find data gaps
• Improve coding automatically: Guide trees (and explicit reconstructions?)
Data visualizations for Hypothesis generation
• Hypotheses about proto-forms and borrowings
• Hypotheses about shared lexical innovations
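Finding data gaps can also be done directly on the word list, before any tree is plotted. The sketch below is a simple illustration, not the visual method used in the talk: it lists, per concept, the lects with no attested form. The function name and the input shape (pairs of lect and concept) are assumptions, and the sample values are placeholders.

```python
from collections import defaultdict


def data_gaps(forms, lects, concepts):
    """Map each concept to the sorted list of lects lacking a form for it."""
    attested = defaultdict(set)
    for lect, concept in forms:
        attested[concept].add(lect)
    gaps = {}
    for concept in concepts:
        missing = set(lects) - attested[concept]
        if missing:
            gaps[concept] = sorted(missing)
    return gaps


# Placeholder lect names; only the concept labels echo the slides.
gaps = data_gaps(
    forms=[("lect-a", "island"), ("lect-a", "to stand"), ("lect-b", "to stand")],
    lects=["lect-a", "lect-b"],
    concepts=["island", "to stand"],
)
```

Concepts with complete coverage are left out of the result, so an empty dictionary means no gaps.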

Proto-forms & Lexical Innovations
‘to stand’: *tas, *diri, *tide, *hamari(k)
• Proto TAP *tas; Innovation in Kula & Sawila group?
• Proto MP (AN): Proto Flores *diri, Proto Timor *hamari(k)
• Innovation in Alor?

Borrowings
‘fishing hook’: Austronesian > TAP or vice versa?
*kavila

Find bad codings
kis, kisi, *kavila

Find data gaps
‘island’

Conclusions
• Most components easily available
• Some “Duct Tape” required: Variations in data formats
• Modifications and programming needed for actual Big Data
• Easy to add dates and phylogeography
Still open:
• Manual work still required: How can we make doing and checking it easier and faster?
• How to work with lots of language contact?
• And how to make results explicitly usable by historical linguists? (word forms, contact situations, etc.)

“Thank You”s
• Marian Klamer
• George Saad, Francesca Moro
• Hannah Fricke, Jiang Wu, Emily Maas
• Mattis List, Luke Maurits, Johannes Dellert, Robert Forkel
• Funding: NWO
• and YOU!
