Tools to get you started and beyond

Presentation given at Memornet Summer School on Data and Tools for Digital Humanities (Tampere, 2015).

Matti Lassila

May 25, 2015

Transcript

  1. Tools to get you started and beyond
    Matti Lassila, Information Specialist, Jyväskylä University Library
  2. Amateur explorations in the Diplomatarium Fennicum

  3. None
  4. None
  5. None
  6. Year on the letter and the place of writing

  7. Year on the letter and the number of words

  8. Mentions of ‘Reval’ and its variations in segments of the corpus, ordered chronologically
  9. Mentions of ‘Tavast’ and its variations in context

  10. + = —

  11. + = — Your 
    favourite tool

  12. JSON CSV XML Plain text Import data Process further in... (Broken) HTML
  13. Sophisticated full text search engine
    XQuery 3.1 + built-in extensions
    Powerful but easy to get started
    Variety of visualization tools
  14. Simple full text search using XQuery

    in:
        for $article in /articles/article
        where $article/letter[text() contains text 'Reval' using fuzzy]
        return $article/place

    out:
        <place>Raseborg</place>
        <place>Reval</place>
        <place>Wenden</place>
        <place>Viborg</place>
        <place>Novgorod</place>
        <place>Viborg</place>
        <place>Viborg</place>
        <place>Raseborg</place>
        <place>Viborg</place>
        <place>Reval</place>
        <place>Raseborg</place>
  15. Slightly more advanced full text search using XQuery

    in:
        for $article in /articles/article
        where $article/letter[text() contains text 'Reval' using fuzzy]
        let $place_of_writing := $article/place
        group by $place_of_writing
        order by count($article) descending
        return <place name="{$place_of_writing}" count="{count($article)}"/>

    out:
        <place name="Viborg" count="55"/>
        <place name="" count="30"/>
        <place name="Raseborg" count="18"/>
        <place name="Åbo" count="14"/>
        <place name="Riga" count="7"/>
        <place name="Wenden" count="6"/>
        <place name="Reval" count="4"/>
        <place name="Wolmar" count="3"/>
        <place name="Narva" count="3"/>
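
Note: for readers without BaseX at hand, the same idea (find letters that mention a term approximately, then count them per place of writing) can be sketched in Python. This is only a rough analogue, not the query engine used on the slide: the fuzzy matching here relies on `difflib` with an arbitrarily chosen cutoff of 0.75, and the `SAMPLE` corpus, `mentions` and `places_by_mentions` are hypothetical names invented for the illustration.

```python
import difflib
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical miniature corpus in the <articles>/<article> layout of the slides.
SAMPLE = """<articles>
  <article><place>Viborg</place><letter>Datum in Revall anno Domini</letter></article>
  <article><place>Viborg</place><letter>scriptum Rewal feria quarta</letter></article>
  <article><place>Raseborg</place><letter>in castro Reval datum</letter></article>
  <article><place>Åbo</place><letter>Datum Abo in crastino</letter></article>
  <article><letter>versus Reval per mare</letter></article>
</articles>"""


def mentions(text, term, cutoff=0.75):
    """Return True if any word of `text` is approximately equal to `term`."""
    words = re.findall(r"\w+", text.lower())
    return bool(difflib.get_close_matches(term.lower(), words, n=1, cutoff=cutoff))


def places_by_mentions(xml_text, term):
    """Count places of writing for letters mentioning `term`, most frequent first."""
    root = ET.fromstring(xml_text)
    counts = Counter(
        (article.findtext("place") or "")  # a missing place is counted as ""
        for article in root.iter("article")
        if mentions(article.findtext("letter") or "", term)
    )
    return counts.most_common()


if __name__ == "__main__":
    for place, count in places_by_mentions(SAMPLE, "Reval"):
        print(f'<place name="{place}" count="{count}"/>')
```

As on the slide, letters without a recorded place end up under an empty name.
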
  16. in:
        ft:extract(
          /articles/article/letter[text() contains text {'?la.*','Fl.*'} all words using wildcards window 4 words],
          'hit',
          30
        )

    out:
        <letter>...err <hit>Claes</hit> <hit>Fleming</hit> med dem i Ko...</letter>
        <letter>...h <hit>Claus</hit> <hit>Fleming</hit>, riddere, lagm...</letter>
        <letter>... <hit>Clavus</hit> <hit>Fläming</hit>, riddare och ...</letter>
        <letter>... <hit>Claues</hit> <hit>Flämingh</hit>, riddare och...</letter>
        <letter>... <hit>Clauos</hit> <hit>Flämingh</hit>, riddare och...</letter>
        <letter>...er fasten.Her <hit>Clawes</hit> <hit>Flamingh</hit>,...</letter>
        <letter>... <hit>Clauus</hit> <hit>Fläming</hit>, riddare oc l...</letter>
        <letter>...chels dach.Her <hit>Clawes</hit> <hit>Flemingh</hit>...</letter>
        <letter>...stelavende.Her <hit>Clawes</hit> <hit>Flemingh</hit>...</letter>
        <letter>... <hit>Clauus</hit> <hit>Flemingh</hit> lagmandz thin...</letter>
        <letter>... <hit>Clawus</hit> <hit>Flæming</hit>, riddare oc l...</letter>
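
Note: the "all words within a window of 4 words" idea can be illustrated outside BaseX as well. The Python sketch below is an approximation under stated assumptions: it uses ordinary regular expressions (`Cla\w*`, `Fl\w*`) instead of the full-text wildcard syntax, it returns the matched words rather than highlighted snippets, and `window_hits` is a hypothetical helper name.

```python
import re


def window_hits(text, patterns, window=4):
    """Find groups of words, one per pattern, inside `window` consecutive words."""
    words = re.findall(r"\w+", text)
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    seen = set()
    hits = []
    # Slide a window over the word list; short texts get a single window.
    for start in range(max(1, len(words) - window + 1)):
        span = range(start, min(start + window, len(words)))
        # Index of the first word in the window matching each pattern, or None.
        found = tuple(
            next((i for i in span if rx.fullmatch(words[i])), None)
            for rx in compiled
        )
        if None not in found and found not in seen:
            seen.add(found)  # report each combination of positions only once
            hits.append(tuple(words[i] for i in found))
    return hits


if __name__ == "__main__":
    sample = "Her Clawes Flemingh, riddare och lagman, samt her Claus Fleming"
    print(window_hits(sample, [r"Cla\w*", r"Fl\w*"]))
```

Widening `window` trades precision for recall: a larger window catches names separated by titles, but also pairs words that merely occur near each other.
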
  17. None
  18. Interactive visualizations in BaseX
    Year on the letter and the number of words. Highlighted in red if the place of writing is Åbo.
  19. Exploring the Diplomatarium Fennicum
    (1. Created a local copy of DF database)
    2. Converted HTML to XML in BaseX
    3. Converted XML to plain text in BaseX
    4. Imported plain text to Voyant Tools
  20. Raw HTML

    <div id="Layer1"> <table width="542" height="264" border="0"> <tr> <td>
    <TABLE><tr><td><b>479</b><br></td><td>Reval 21.5.1343<br/><br/></td></tr>
    <tr><td>Dan Niklisson, hövding över östra landsdelarna, Johannes Götesson, fogde i Viborg, Johannes Benediktsson m. fl. tillkännage att de på kung Magnus’ vägnar med kung Valdemars av Danmark rådsherrar och män i Estland samt med staden Reval och alla andra danska undersåter i Estland ingått en fred, såväl till lands som sjöss, att gälla till den 14 mars 1344, vilken fred dock skulle underställas kung Magni stadfästelse.</td></tr>
    <tr><td><br></td><tr>
    <tr><td>Urkunden intagen i ett vidisse, av 19 juli s. å., på pergament i Revals stadsarkiv. Sv. Dipl. V n:o 3698. Sv. Trakt. II n:o 257. LEC. Urkdb. II n:o 815.</td></tr>
    <tr><td>* Ändrat från ”obseruandas”.</td></tr>
    <tr><td>Reval 21 maj 1343.</td></tr>
    <tr><td><p><i>Omnibus presentes litteras visuris vel audituris Dan Niclisson, parcium orientalium prefectus, Johannes Gøtæson, aduocatus castri Wiborgh, Johannes Bændiczson, Hartekinus, Nicholaus Magnusson, Marquardus Fleegh, Nicholaus Guttæson salutem in Domino sempiternam. Tenore presencium notum facimus vniuersis, quod nos ex parte domini nostri, jllustris principis domini Magni, regis Swecie, Norwegie et Skanie, et omnium regna sua eadem inhabitancium necnon et omnium amore sui facere uel omittere quicquam volencium et suorum cum honorabilibus viris consiliariis incliti principis domini Waldemari, regis Dacie, militibus et militaribus ac vniuersis suis hominibus in Estonia.</ i><br></td><tr>
    </font></TABLE></td> </tr> <tr> <td valign="bottom"><A href="javascript:history.go(-1)">Takaisin/
  21. Scripts for converting HTML to XML

    import module namespace functx = 'http://www.functx.com';
    let $articles := <articles>{
      for $article in /html/body/div/table/tr/td/table
      let $number := $article/tr[1]/td[1]/b/text()
      let $placeTime := $article/tr[1]/td[2]/text()
      let $provenance := $article/tr[5]/td/text()
      let $regesta := $article/tr/td[@class="regesta"]/text()
      let $letter := $article/tr/td/p/*/text()
      let $originalDate :=
        if (string-length($article/tr[7]/td/text()) > 5)
        then $article/tr[7]/td/text()
        else ()
      return
        <article>
          <number>{$number}</number>
          <placetime>{$placeTime}</placetime>
          <originaldate>{$originalDate}</originaldate>
          <provenance>{$provenance}</provenance>
          <regesta>{$regesta}</regesta>
          <letter>{$letter}</letter>
        </article>
    }</articles>
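
Note: the principle behind this step (walk the table cells of tolerant-parsed HTML and wrap their text in named XML elements) can also be tried with the Python standard library. The sketch below is not the BaseX workflow from the slides. `CellCollector` and `html_to_article` are hypothetical names, and mapping the first three non-empty cells to `number`, `placetime` and `regesta` is an assumption made for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every non-empty <td> cell, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buffer = None  # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())  # normalise whitespace
            if text:
                self.cells.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def html_to_article(html, names=("number", "placetime", "regesta")):
    """Build an <article> element from the first cells of one record (assumed order)."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    article = ET.Element("article")
    for name, text in zip(names, parser.cells):
        ET.SubElement(article, name).text = text
    return article


if __name__ == "__main__":
    raw = ("<TABLE><tr><td><b>479</b><br></td><td>Reval 21.5.1343<br/><br/></td></tr>"
           "<tr><td>Dan Niklisson m. fl. tillkännage en fred.</td></tr>"
           "<tr><td><br></td><tr></TABLE>")
    print(ET.tostring(html_to_article(raw), encoding="unicode"))
```

Addressing cells by position is fragile in the same way as the positional paths in the script above: a record with an extra or missing row shifts every field after it.
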
  22. let $placesDates := <articles>{
      for $article in $articles/article
      where not(functx:all-whitespace($article/placetime))
      let $splittedDate := tokenize($article/placetime,'\.')
      let $date := <date>{functx:get-matches($splittedDate[1],'\d+')}</date>
      let $place := <place>{
        functx:get-matches($splittedDate[1],'[A-Za-zÖÄÅäöå]+')
      }</place>
      let $finalDate :=
        if (string-length($date) >= 1 and $date < 31) then $date else ()
      let $finalYear :=
        if (string-length($date) >= 1 and $date > 100) then $date else ($splittedDate[3])
      return
        <article>
          {$article/originaldate}
          {$article/number}
          {$article/provenance}
          {$article/regesta}
          {$article/letter}
          <place>{normalize-space(data($place))}</place>
          <placetime>{data($article/placetime)}</placetime>
          <date>{normalize-space(data($finalDate))}</date>
          <month>{$splittedDate[2]}</month>
          <year>{$finalYear}</year>
        </article>
    }</articles>
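
Note: the splitting logic for values such as "Reval 21.5.1343" is easy to test in isolation. The Python sketch below follows the same idea (split on full stops, take letters and digits from the first part, treat a large number as a year), with two labelled differences: it accepts day 31, whereas the script above uses a strict `< 31` comparison, and it returns a dictionary instead of XML. `parse_placetime` and the input "Åbo 1344" are hypothetical.

```python
import re


def parse_placetime(placetime):
    """Split a 'Place D.M.YYYY' heading into place, day, month and year."""
    parts = [part.strip() for part in placetime.split(".")]
    head = parts[0]
    # Same character class as on the slide: ASCII letters plus Å, Ä, Ö.
    place = " ".join(re.findall(r"[A-Za-zÖÄÅäöå]+", head))
    digits = re.search(r"\d+", head)
    number = int(digits.group()) if digits else None

    # A small number in the first part is a day; a large one is a bare year.
    day = number if number is not None and 1 <= number <= 31 else None
    if number is not None and number > 100:
        year = number
    elif len(parts) >= 3 and parts[2].isdecimal():
        year = int(parts[2])
    else:
        year = None
    month = int(parts[1]) if len(parts) >= 2 and parts[1].isdecimal() else None
    return {"place": place, "day": day, "month": month, "year": year}


if __name__ == "__main__":
    print(parse_placetime("Reval 21.5.1343"))
    print(parse_placetime("Åbo 1344"))  # hypothetical heading with a year only
```
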
  23. Cleaned-up XML

    <article>
      <originaldate>Reval 21 maj 1343.</originaldate>
      <number>479</number>
      <place>Reval</place>
      <placetime>Reval 21.5.1343</placetime>
      <date>21</date>
      <month>5</month>
      <year>1343</year>
      <provenance>Urkunden intagen i ett vidisse, av 19 juli s. å., på pergament i Revals stadsarkiv. Sv. Dipl. V n:o 3698. Sv. Trakt. II n:o 257. LEC. Urkdb. II n:o 815.</provenance>
      <regesta>Dan Niklisson, hövding över östra landsdelarna, Johannes Götesson, fogde i Viborg, Johannes Benediktsson m. fl. tillkännage att de på kung Magnus’ vägnar med kung Valdemars av Danmark rådsherrar och män i Estland samt med staden Reval och alla andra danska undersåter i Estland ingått en fred, såväl till lands som sjöss, att gälla till den 14 mars 1344, vilken fred dock skulle underställas kung Magni stadfästelse.</regesta>
      <letter>Ego vero Dan Niclisson prenotatus interim vel in persona propria vel per certum et ydoneum nuncium dominum meum, dominum regem Swecie, visitabo, notificaturus sibi de treugis memoratis; quas si ratas habere noluerit, extunc dictis consiliariis regis Dacie in Estonia, ciuitati Reualiensi et terre ad mensem integrum, antequam eis per aliquod regnorum dicti domini mei regis dampna vel violencia aliquatenus inferantur, hoc intimare debeo cum fide mea et compromiss(i)ariorum meorum predictorum. Jn quorum omnium testimonium sigilla nostra presentibus sunt appensa. Datum et actum Reualie anno Domini m°ccc°xl° tercio, vigilia ascensionis Domini.</letter>
    </article>
  24. Script for converting XML to plain text

    declare option output:method "text";
    declare option output:item-separator "\n";
    for $record in /articles/article
    let $number := concat($record/year,'-',$record/number)
    let $path := concat('/home/majulass/dh-tools/fennicum-data/',$number,'.txt')
    where (string-length($number) eq 4)
    return file:write-text-lines(data($path), (
      $record/number, ' ', $record/originaldate, $record/regesta, $record/letter, '---'
    ))
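
Note: the export step (one text file per record, named after year and number, fields on separate lines, closed by a `---` line) can be sketched in Python as follows. This is an illustration under assumptions rather than a port of the script above: it omits the `where` filter and the single-space line, and `export_records` and the output directory are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Fields written to each file, in order; element names follow the cleaned-up XML.
FIELDS = ("number", "originaldate", "regesta", "letter")


def export_records(xml_text, out_dir):
    """Write each <article> to '<year>-<number>.txt' and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for article in ET.fromstring(xml_text).iter("article"):
        year = (article.findtext("year") or "").strip()
        number = (article.findtext("number") or "").strip()
        lines = [(article.findtext(name) or "").strip() for name in FIELDS]
        lines.append("---")  # record separator, as in the plain text corpus
        path = out_dir / f"{year}-{number}.txt"
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    sample = ("<articles><article><year>1343</year><number>479</number>"
              "<originaldate>Reval 21 maj 1343.</originaldate>"
              "<regesta>Dan Niklisson m. fl. tillkännage en fred.</regesta>"
              "<letter>Ego vero Dan Niclisson prenotatus.</letter>"
              "</article></articles>")
    for path in export_records(sample, "fennicum-data-demo"):
        print(path)
```

Writing UTF-8 explicitly matters for this corpus, since both the Swedish summaries and the Latin letters contain non-ASCII characters.
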
  25. Corpus in plain text

    ---
    484
    Kuustö 24 april 1344.
    Biskop Hemming i Åbo intygar att hans mor på sitt yttersta gett 5 ören land uti Agersta i Jumkils socken till Uppsala domkyrkobyggnad.
    Vniuersis presentes litteras inspecturis Hemmingus, permissione diuina episcopus Aboensis, salutem in Domino sempiternam. Tenore presencium protestamur, quod dilecta nobis in Christo mater nostra, Katerina, bone memorie, legauit pro anima sua in extremis quinque oras terre cum domibus et ceteris pertinenciis in villa, que dicitur Aghursta, parrrochie Jumakyl situatas, fabrice ecclesie Vpsalensis, ex consensu nostro et Hemmingi Olafsson sororisque sue Elene, magis propinquorum necnon et Johannis Pætærsson, generi nostri, mariti Elene memorate. In cuius legacionis et consensus euidenciam firmiorem sigillum nostrum vna cum sigillis predictorum virorum presentibus sunt appensa (!). Datum et actum apud curiam nostram Kustu anno Domini m°ccc°xl° quarto, in profesto beati Marci evangeliste.
    ---
    485
    Pargas 3 maj 1344.
    Kyrkoherden Vinand i Kimito upplåter sina gods uti Trollshovda i Tenala till Åbo domkyrka. Dat. die inventionis sanctæ crucis.
    Vniuersis presentes litteras inspecturis Wynandus, curatus ecclesie Kymitto, salutem in Domino sempiternam. Noueritis, quod bona mea in Trvlzhuwt, parrochia Thenalum, cum omnibus adiacenciis dimitto et assigno ecclesie Aboensi per presentes perpetuo possidenda. Jn cuius rei euidenciam sigillum venerandj viri dominj Elawj, prepositj Aboensis, vna cum sigillo meo presentibus est appensum. Datum et actum in Pargasa, anno Dominj m°ccc°xl quarto, die inuencionis sancte crvcis.
    ---
    486
    Varberg 29 maj 1344.
    Kung Magnus förnyar de privilegier, han den 12 augusti 1336 beviljat staden Lübeck. Dat. sabbato Trinitatis.
  26. Upload plain text corpus to Voyant http://www.voyant-tools.org/

  27. None
  28. 20 analysis tools http://docs.voyant-tools.org/tools/

  29. Export and present results

  30. Online embedding

  31. First steps to take
    1) Try Voyant Tools for basic text analysis
    2) Explore XQuery using BaseX
  32. Vanderbilt University XQuery Summer Institute: https://github.com/XQueryInstitute
    Voyant Tools Documentation: http://docs.voyant-tools.org/
    BaseX XML Database: http://basex.org/
    XQuery Wikibook: http://en.wikibooks.org/wiki/XQuery
    Korkiakangas, T. & Lassila, M. (2013). Abbreviations, fragmentary words, formulaic language: treebanking medieval charter material. http://www.bultreebank.org/ACRH-3/ACRH-3Proceeding.pdf