April 25, 2017

Some Assembly Required: From sounds to histories in 8 steps using mostly off-the-shelf tools.

Speaker: Gereon Kaiping, Leiden University

Abstract: Phylogenetic methods are gaining traction in linguistics, but have so far been quite inaccessible to linguists. The core tools doing the tree construction, whether heuristic or Bayesian, often come from bioinformatics, and their inputs (e.g. Nexus files) and outputs (e.g. Newick trees without explicit reconstruction) conform to biological, not linguistic, standards; otherwise they are written ad hoc for a specific dataset. However, this situation is changing. In this talk, I will present a collection of tools, most of which are published elsewhere, that together go the full way from linguistic fieldwork via public cross-linguistic linked databases and Bayesian inference tools to plots of phylogenetic trees with ancestral state reconstruction. I will describe both emerging standards in quantitative historical linguistics that make this process easier and specific challenges that arose in the construction of this tool chain. The talk will conclude with a discussion of some results from the reconstructed word-meaning correspondences in the Lesser Sunda region of Indonesia, and how they feed back into improving our data and our understanding of the local language history.

Transcript

  1. From sounds to histories in 8 steps using (mostly) off-the-shelf tools
    Gereon Kaiping (with Marian Klamer)
    NWO VICI Project: Languages of the Lesser Sunda Islands
    Leiden University, NL
  2. 8 “Easy” Steps
    1. Fieldwork & Transcription
    2. Database
    3. “Cognate” Autocoding
    4. Manual Checking and Editing of classes
    5. BEAST Configuration
    6. Ancestral State Inference
    7. Plotting Trees
    8. Interpretation & Iteration
    Links to our components
  3. 1. Fieldwork and Transcription
    • Fieldwork in East Indonesia: audio and video recordings, notes, transcriptions
    • Different transcription conventions (y/j; ng/ŋ; aa/a:)
    • Goal: CLPA
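The convention pairs on this slide (y/j; ng/ŋ; aa/a:) suggest a rule-based normalization step. The following is a minimal sketch, not the project's actual script: the rule list `RULES`, the function `normalize`, and the direction of each rule (which variant counts as the target) are assumptions made for illustration.

```python
import re

# Hypothetical replacement rules, one per convention pair named on the slide.
# Which variant is the target of each rule is an assumption.
RULES = [
    (re.compile(r"ng"), "ŋ"),               # digraph ng -> velar nasal
    (re.compile(r"y"), "j"),                # y -> palatal glide j
    (re.compile(r"([aeiou])\1"), r"\1ː"),   # doubled vowel -> vowel + length mark
    (re.compile(r"([aeiou]):"), r"\1ː"),    # ASCII colon -> IPA length mark
]


def normalize(form: str) -> str:
    """Apply every rule in order to a trimmed, lowercased transcription."""
    result = form.strip().lower()
    for pattern, replacement in RULES:
        result = pattern.sub(replacement, result)
    return result
```

With these rules, `normalize("yaang")` and `normalize("ja:ŋ")` both give `jaːŋ`. A blanket rule such as ng → ŋ would also merge a genuine n + g sequence, so real data would need per-source rule sets.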
  4. 2. LexiRumah Database: CLLD
    • Cross-Linguistic Linked Database: connected to Glottolog and Concepticon
    • Codebase: Lexibank (from the Glottobank project)
    • Cross-Linguistic Data Format (CLDF) for word lists, with automatic aggregation for LingPy
    http://concepticon.clld.org/
    https://github.com/clld/lexibank
    https://github.com/glottobank/cldf/tree/master/wordlists
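The aggregation of CLDF word lists for LingPy can be pictured as a column-renaming pass over a table of forms. The sketch below assumes CLDF-style column names (`Language_ID`, `Parameter_ID`, `Form`), which vary between CLDF versions, and builds the dictionary layout that LingPy's `Wordlist` accepts (key 0 holds the header). The sample rows and the function name are invented for illustration.

```python
import csv
import io

# Assumed CLDF-style column names; the real ones depend on the CLDF version.
COLUMN_MAP = {"Language_ID": "doculect", "Parameter_ID": "concept", "Form": "ipa"}


def cldf_to_wordlist_dict(handle, delimiter=","):
    """Turn CLDF-style form rows into a {row_id: [values]} dictionary.

    Key 0 holds the header row; every other key is a running integer ID.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    wordlist = {0: list(COLUMN_MAP.values())}
    for row_id, row in enumerate(reader, start=1):
        wordlist[row_id] = [row[source] for source in COLUMN_MAP]
    return wordlist


# Invented two-row sample, for illustration only.
SAMPLE = io.StringIO(
    "ID,Language_ID,Parameter_ID,Form\n"
    "1,lect_a,stand,tas\n"
    "2,lect_b,stand,diri\n"
)
WORDLIST = cldf_to_wordlist_dict(SAMPLE)
```

The resulting dictionary could then be handed to LingPy, which is left out here so that the sketch runs with the standard library alone.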
  5. LexiRumah: Content (i)
    Word lists from the Lesser Sunda islands in Eastern Indonesia
    [Map showing Indonesia, Papua New Guinea and Australia]
  6. LexiRumah: Content (ii)
    • 124 lects from 2 families: 72 Austronesian, 52 Timor-Alor-Pantar
    • 200–600 concepts
    • 40 000 lexical items
    • More to be added
    http://lessersunda.ullet.net/lexirumah/
  7. LexiRumah: Challenges
    • Most data comes from fieldwork (2002–2017), only some from published sources (different data model)
    • Field data originally collected in Excel spreadsheets: inconsistencies in transcription conventions and metadata
    • Some lects are not in Glottolog yet, some concepts not in Concepticon
    • Minor changes (map zoom etc.)
    • Todo: Web frontend improvements
  8. 3. “Cognate” Auto-coding
    • LingPy LexStat for automatic coding to bootstrap “cognates”
    • Classes of sufficiently similar forms (loans & cognates, but possibly not distant cognates)
        lex.get_scorer(
            preprocessing=False,
            runs=10000,
            ratio=(2, 1),
            vscale=1.0)
        lex.cluster(
            cluster_method='upgma',
            method='lexstat',
            ref='auto_cogid',
            threshold=0.8)
    http://lingpy.org/docu/compare/lexstat.html
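To illustrate what "classes of sufficiently similar forms" means for a UPGMA clustering with a threshold, here is a standard-library sketch. It is not LexStat: LexStat derives language-pair-specific scores, whereas this stand-in uses plain normalized edit distance, so the threshold values are not comparable with the 0.8 on the slide. All function names are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, so values lie in [0, 1]."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def upgma_flat(forms, threshold):
    """Average-linkage clustering, stopped once no pair is below threshold."""
    clusters = [[form] for form in forms]

    def linkage(left, right):
        # UPGMA distance: mean over all cross-cluster pairs of original items.
        total = sum(normalized_distance(a, b) for a in left for b in right)
        return total / (len(left) * len(right))

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best >= threshold:
            break
        # j is always larger than i, so deleting j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For example, `upgma_flat(["tas", "tasi", "diri", "dir"], 0.5)` groups the four invented forms into two classes, `["tas", "tasi"]` and `["diri", "dir"]`.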
  9. LingPy Assembly
    • LingPy version 2 tries to be clever and provide full working steps, so small variants can be hard, e.g. merging automatic and manual coding
    • Format similar but not identical to CLDF
    • Runs out of memory when auto-aligning our data
    • Hacked together with a Python script: LingPy interspersed with Pandas for minor modifications
    • Goal: LingPy 3 and iterative methods (CogDetect)
    http://github.com/PhyloStar/CogDetect
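One of the "small variants" named here, merging automatic and manual coding, could look roughly like the following. This is a sketch with plain dictionaries rather than the Pandas-based script the slide mentions, and the merge policy (a manual class always wins, and labels are prefixed by their source) is an assumption, not the project's documented behaviour.

```python
def merge_codings(automatic, manual):
    """Prefer a manual cognate class; fall back to the automatic one.

    Both arguments map a form ID to a class label. Labels get a source
    prefix so that manual class 3 and automatic class 3 never collide.
    """
    merged = {form_id: f"auto-{label}" for form_id, label in automatic.items()}
    merged.update(
        {
            form_id: f"manual-{label}"
            for form_id, label in manual.items()
            if label is not None  # None marks a form nobody has checked yet
        }
    )
    return merged
```

Under this policy a manually coded form is separated from unchecked forms of its former automatic class, which is only appropriate when concepts are reviewed as a whole.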
  10. 4. Manual Checking & Editing of Cognate Classes
    Edit LingPy-suggested classes in Edictor
    http://edictor.digling.org/
  11. Working with Edictor (i)
    • Crashes when working with a selection of ‘Doculects’
    • Checking & coding 1 concept in 124 lects at the same time is challenging.
  12. 5. BEAST configuration
    • BEAST for inference, but writing BEAST config files for linguistic inference is hard
    • BEASTling: a translator from CLDF to BEAST, scriptable, with reasonable defaults and Glottolog support, …
    https://github.com/lmaurits/BEASTling
  13. BEASTling model description
        # -*- coding: utf-8 -*-
        [admin]
        basename = sunda_lexicon_ratevar
        [MCMC]
        chainlength = 10000000
        [languages]
        exclusions = p-aust1307-acd,...
        [model lexicon]
        model = covarion
        rate_variation = True
        data = cognates.tsv
        file_format = cldf
        reconstruct = meanings.txt
    Slide annotations:
    • exclusions: LexiRumah contains some protolanguages
    • data: Data file from Edictor (with CLDF column headers)
    • reconstruct: List of 50 interesting meanings
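Because the model description is an INI-style file, it can be generated from a script, which is one way to read "scriptable". The sketch below uses only Python's `configparser` and the section and option names visible on the slide; the helper names `build_config` and `to_text` are hypothetical, and the exclusion list is left empty because the slide elides it.

```python
import configparser
import io


def build_config(basename, chainlength, data, reconstruct, exclusions=()):
    """Assemble an INI-style configuration from the keys shown on the slide."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names exactly as written
    config["admin"] = {"basename": basename}
    config["MCMC"] = {"chainlength": str(chainlength)}
    if exclusions:
        config["languages"] = {"exclusions": ",".join(exclusions)}
    config["model lexicon"] = {
        "model": "covarion",
        "rate_variation": "True",
        "data": data,
        "file_format": "cldf",
        "reconstruct": reconstruct,
    }
    return config


def to_text(config):
    """Serialize the configuration to INI text."""
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


CONFIG_TEXT = to_text(
    build_config("sunda_lexicon_ratevar", 10_000_000, "cognates.tsv", "meanings.txt")
)
```

Generating the file this way makes it easy to produce variants, for instance one configuration per model, without editing text by hand.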
  14. 6. Ancestral State Inference
    • MCMC running in BEAST, with various packages: beast-classic, BEASTlabs, morph-models
    • Output: a 4.9 GiB Nexus file
        #NEXUS
        Begin taxa;
            Dimensions ntax=93;
            Taxlabels
                abui1241-fuime
                abui1241-petle
                ...
                wers1238-marit
                ;
        End;
        Begin trees;
            Translate
                1 abui1241-fuime,
                2 abui1241-petle,
                ...
                93 wers1238-marit
                ;
        tree STATE_0 = (((((((((((1[&recon_lexicon:1pl incl="00aa11a",recon_lexicon:above="00a001000000
        tree STATE_1000 = (((((((((((((((((((1[&recon_lexicon:1pl incl="0000100",recon_lexicon:above="0
        tree STATE_2000 = ((((((((((((((((((((1[&recon_lexicon:1pl incl="00000b0",recon_lexicon:above="
        tree STATE_3000 = ((((((((((((((((((((1[&recon_lexicon:1pl incl="0000000",recon_lexicon:above="
        tree STATE_4000 = ((((((((((((((((((((((1[&recon_lexicon:1pl incl="0000001",recon_lexicon:above
        tree STATE_5000 = (((((((((((((((((((((1[&recon_lexicon:1pl incl="0000011",recon_lexicon:above=
        tree STATE_6000 = (((((((((((((((((((((1[&recon_lexicon:1pl incl="0000000",recon_lexicon:above=
        tree STATE_7000 = (((((((((((((((((1[&recon_lexicon:1pl incl="0000001",recon_lexicon:above="000
        tree STATE_8000 = (((((((((((((1[&recon_lexicon:1pl incl="0000010",recon_lexicon:above="0000000
        tree STATE_9000 = (((((((((((((1[&recon_lexicon:1pl incl="0000000",recon_lexicon:above="0000000
        tree STATE_10000 = ((((((((((1[&recon_lexicon:1pl incl="0000001",recon_lexicon:above="000000000
        tree STATE_11000 = (((((((((1[&recon_lexicon:1pl incl="0000001",recon_lexicon:above="0000000000
    http://www.beast2.org/
    https://github.com/CompEvol/CBAN
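A file of this size has to be read as a stream. Since the excerpt shows one sampled tree per line, a generator over lines is enough to pick out each sample without loading the file. This is a minimal sketch under that one-tree-per-line assumption; `TREE_LINE`, `iter_trees` and the tiny sample are invented for illustration.

```python
import io
import re

# Matches lines such as "tree STATE_1000 = (...);" and captures state and Newick.
TREE_LINE = re.compile(r"^\s*tree\s+STATE_(\d+)\s*=\s*(.*)$", re.IGNORECASE)


def iter_trees(handle):
    """Yield (state, newick) pairs one line at a time, never loading the file."""
    for line in handle:
        match = TREE_LINE.match(line)
        if match:
            yield int(match.group(1)), match.group(2).strip()


# Tiny invented stand-in for the multi-gigabyte BEAST output.
SAMPLE = io.StringIO(
    "#NEXUS\n"
    "Begin trees;\n"
    "tree STATE_0 = (1,2);\n"
    "tree STATE_1000 = (2,1);\n"
    "End;\n"
)
TREES = list(iter_trees(SAMPLE))
```

In real use the handle would be an open file, and the pairs would be consumed one by one instead of being collected into a list.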
  15. 7. Plotting trees: Apparent Solutions
    • TreeAnnotator + FigTree
    • “ape” R package
    • DensiTree
    • DendroPy
    • phyltr + ete3
    Except none of this works for our case:
    • Can’t read whole file into memory and compute there
    • BEAST’s Nexus format with [&pull="01ab00"] annotations inside Newick strings
    • Big lists: Automatic, scriptable
    http://www.beast2.org/treeannotator/
    http://tree.bio.ed.ac.uk/software/figtree/
    http://ape-package.ird.fr/
    https://www.cs.auckland.ac.nz/~remco/DensiTree/
    https://github.com/lmaurits/phyltr
    http://etetoolkit.org/
  16. Aggregating trees to plot
    Tree stream editor based on phyltr:
    • BEAST Nexus → New Hampshire eXtension [&&NHX:] annotated Newick tree stream
    • (Absolute) burnin and subsampling
    • Pruning languages
    • Aggregating consensus tree, aware of binary coding
    • This process is slow and memory or disk space hungry, but works
    • Better: Rework the rest of phyltr into treesed; try DendroPy?
    https://github.com/Anaphory/treesed
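The burn-in and subsampling step fits the stream model well, because it needs no look-ahead. The following sketch is not the treesed implementation; it only shows the idea with `itertools.islice`, taking burn-in as an absolute number of trees. The function name and parameters are hypothetical.

```python
from itertools import islice


def burnin_and_subsample(trees, burnin=0, every=1):
    """Drop the first `burnin` trees, then keep every `every`-th one.

    Works on any iterator, so the full sample never has to sit in memory.
    """
    if burnin < 0 or every < 1:
        raise ValueError("burnin must be >= 0 and every must be >= 1")
    return islice(trees, burnin, None, every)


# Stand-in stream: the state numbers 0, 1000, ..., 11000 instead of real trees.
KEPT = list(burnin_and_subsample(iter(range(0, 12000, 1000)), burnin=2, every=3))
```

Here the first two samples are discarded and every third of the remaining ones is kept, leaving the states 2000, 5000, 8000 and 11000.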
  17. 8. Interpretation & Iteration
    Iterative Improvement
    • Find bad codings
    • Find data gaps
    • Improve coding automatically: Guide trees (and explicit reconstructions?)
    Data visualizations for Hypothesis generation
    • Hypotheses about proto-forms and borrowings
    • Hypotheses about shared lexical innovations
  18. Proto-forms & Lexical Innovations
    ‘to stand’: *tas, *diri, *tide, *hamari(k)
    • Proto TAP *tas; Innovation in Kula & Sawila group?
    • Proto MP (AN): Proto Flores *diri, Proto Timor *hamari(k)
    • Innovation in Alor?
  19. Conclusions
    • Most components easily available
    • Some “Duct Tape” required: Variations in data formats
    • Modifications and programming needed for actual Big Data
    • Easy to add dates and phylogeography
    Still open:
    • Manual work still required: How can we make doing and checking it easier and faster?
    • How to work with lots of language contact?
    • How can we produce explicit results that are usable by historical linguists (word forms, contact situations, etc.)?
  20. “Thank You”s
    • Marian Klamer
    • George Saad, Francesca Moro
    • Hannah Fricke, Jiang Wu, Emily Maas
    • Mattis List, Luke Maurits, Johannes Dellert, Robert Forkel
    • Funding: NWO
    • and YOU!