Tools to get you started and beyond

Presentation given at Memornet Summer School on Data and Tools for Digital Humanities (Tampere, 2015).

Matti Lassila

May 25, 2015

Transcript

  1. Tools to get you started and beyond

    Matti Lassila, Information Specialist, Jyväskylä University Library
  2. Sophisticated full text search engine

    XQuery 3.1 + built-in extensions
    Powerful but easy to get started
    Variety of visualization tools
  3. Simple full text search using XQuery

    In:

        for $article in /articles/article
        where $article/letter[text() contains text 'Reval' using fuzzy]
        return $article/place

    Out:

        <place>Raseborg</place>
        <place>Reval</place>
        <place>Wenden</place>
        <place>Viborg</place>
        <place>Novgorod</place>
        <place>Viborg</place>
        <place>Viborg</place>
        <place>Raseborg</place>
        <place>Viborg</place>
        <place>Reval</place>
        <place>Raseborg</place>
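For readers without BaseX at hand, the idea behind the fuzzy search above can be approximated in Python. This is a minimal sketch, not how BaseX implements `using fuzzy`: it compares each word of a letter with the search term by Levenshtein edit distance and accepts words within one edit. The one-edit threshold, the function names, and the three sample articles are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_contains(text: str, term: str, max_errors: int = 1) -> bool:
    """True if any word in text is within max_errors edits of term."""
    tokens = re.findall(r"\w+", text.lower())
    return any(levenshtein(token, term.lower()) <= max_errors
               for token in tokens)


# Hypothetical sample data in the same shape as the cleaned-up XML.
SAMPLE = """<articles>
  <article><place>Viborg</place><letter>in ciuitate Revall</letter></article>
  <article><place>Raseborg</place><letter>staden Reval</letter></article>
  <article><place>Åbo</place><letter>ecclesie Aboensis</letter></article>
</articles>"""


def places_mentioning(xml_text: str, term: str) -> list:
    """Place of writing for every article whose letter fuzzily mentions term."""
    root = ET.fromstring(xml_text)
    return [article.findtext("place")
            for article in root.iter("article")
            if fuzzy_contains(article.findtext("letter", default=""), term)]


print(places_mentioning(SAMPLE, "Reval"))
```

The spelling variant "Revall" is accepted because it is one edit away from "Reval", which is the kind of orthographic variation the fuzzy option is useful for in medieval sources.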
  4. Slightly more advanced full text search using XQuery

    In:

        for $article in /articles/article
        where $article/letter[text() contains text 'Reval' using fuzzy]
        let $place_of_writing := $article/place
        group by $place_of_writing
        order by count($article) descending
        return <place name="{$place_of_writing}" count="{count($article)}"/>

    Out:

        <place name="Viborg" count="55"/>
        <place name="" count="30"/>
        <place name="Raseborg" count="18"/>
        <place name="Åbo" count="14"/>
        <place name="Riga" count="7"/>
        <place name="Wenden" count="6"/>
        <place name="Reval" count="4"/>
        <place name="Wolmar" count="3"/>
        <place name="Narva" count="3"/>
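The `group by` and `order by count(...) descending` steps above correspond to a frequency count. A rough Python equivalent is sketched below, assuming the articles have already been filtered to those that match the search term. The function name and sample data are hypothetical. Articles with an empty or missing place are counted under an empty name, which mirrors the `name=""` row in the slide's output.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical sample: articles already selected by the full text search.
SAMPLE = """<articles>
  <article><place>Viborg</place></article>
  <article><place>Raseborg</place></article>
  <article><place>Viborg</place></article>
  <article><place></place></article>
  <article><place>Viborg</place></article>
  <article><place>Raseborg</place></article>
</articles>"""


def count_by_place(xml_text: str) -> list:
    """Return (place, count) pairs, most frequent place first."""
    root = ET.fromstring(xml_text)
    counts = Counter((article.findtext("place") or "").strip()
                     for article in root.iter("article"))
    return counts.most_common()


for place, count in count_by_place(SAMPLE):
    print(f'<place name="{place}" count="{count}"/>')
```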
  5. In:

        ft:extract(
          /articles/article/letter[
            text() contains text {'?la.*','Fl.*'} all words using wildcards window 4 words
          ],
          'hit',
          30
        )

    Out:

        <letter>...err <hit>Claes</hit> <hit>Fleming</hit> med dem i Ko...</letter>
        <letter>...h <hit>Claus</hit> <hit>Fleming</hit>, riddere, lagm...</letter>
        <letter>... <hit>Clavus</hit> <hit>Fläming</hit>, riddare och ...</letter>
        <letter>... <hit>Claues</hit> <hit>Flämingh</hit>, riddare och...</letter>
        <letter>... <hit>Clauos</hit> <hit>Flämingh</hit>, riddare och...</letter>
        <letter>...er fasten.Her <hit>Clawes</hit> <hit>Flamingh</hit>,...</letter>
        <letter>... <hit>Clauus</hit> <hit>Fläming</hit>, riddare oc l...</letter>
        <letter>...chels dach.Her <hit>Clawes</hit> <hit>Flemingh</hit>...</letter>
        <letter>...stelavende.Her <hit>Clawes</hit> <hit>Flemingh</hit>...</letter>
        <letter>... <hit>Clauus</hit> <hit>Flemingh</hit> lagmandz thin...</letter>
        <letter>... <hit>Clawus</hit> <hit>Flæming</hit>, riddare oc l...</letter>
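The combination of wildcards, `all words`, and `window 4 words` above finds spelling variants of a name whose parts occur close together. The Python sketch below illustrates the same idea under simplifying assumptions: it slides a window of four consecutive words over the text and reports the first window in which every pattern matches some word. The patterns are ordinary regular expressions chosen for illustration, not a translation of the wildcard syntax on the slide, and the function name is hypothetical.

```python
import re


def find_window(text: str, patterns: list, size: int = 4):
    """Return the matching words from the first window of `size`
    consecutive words in which every pattern matches at least one word,
    or None if there is no such window."""
    tokens = re.findall(r"\w+", text)
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    for start in range(len(tokens)):
        window = tokens[start:start + size]
        if all(any(rx.fullmatch(tok) for tok in window) for rx in compiled):
            return [tok for tok in window
                    if any(rx.fullmatch(tok) for rx in compiled)]
    return None


# Illustrative patterns: words starting with "cla" and words starting with "fl".
NAME_PATTERNS = [r"cla\w*", r"fl\w*"]

print(find_window("Her Clawes Flemingh, riddare", NAME_PATTERNS))
print(find_window("Claus rode far away to meet old Fleming", NAME_PATTERNS))
```

The second call finds nothing because the two name parts are more than four words apart, which is what the window restriction is for.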
  6. Interactive visualizations in BaseX

    Year of the letter and the number of words. Highlighted in red if the place of writing is Åbo.
  7. Exploring the Diplomatarium Fennicum

    1. Created a local copy of the DF database
    2. Converted HTML to XML in BaseX
    3. Converted XML to plain text in BaseX
    4. Imported plain text to Voyant Tools
  8. Raw HTML

        <div id="Layer1"> <table width="542" height="264" border="0"> <tr> <td>
        <TABLE><tr><td><b>479</b><br></td><td>Reval 21.5.1343<br/><br/></td></tr>
        <tr><td>Dan Niklisson, hövding över östra landsdelarna, Johannes Götesson, fogde i Viborg, Johannes Benediktsson m. fl. tillkännage att de på kung Magnus’ vägnar med kung Valdemars av Danmark rådsherrar och män i Estland samt med staden Reval och alla andra danska undersåter i Estland ingått en fred, såväl till lands som sjöss, att gälla till den 14 mars 1344, vilken fred dock skulle underställas kung Magni stadfästelse.</td></tr>
        <tr><td><br></td><tr>
        <tr><td>Urkunden intagen i ett vidisse, av 19 juli s. å., på pergament i Revals stadsarkiv. Sv. Dipl. V n:o 3698. Sv. Trakt. II n:o 257. LEC. Urkdb. II n:o 815.</td></tr>
        <tr><td>* Ändrat från ”obseruandas”.</td></tr>
        <tr><td>Reval 21 maj 1343.</td></tr>
        <tr><td><p><i>Omnibus presentes litteras visuris vel audituris Dan Niclisson, parcium orientalium prefectus, Johannes Gøtæson, aduocatus castri Wiborgh, Johannes Bændiczson, Hartekinus, Nicholaus Magnusson, Marquardus Fleegh, Nicholaus Guttæson salutem in Domino sempiternam. Tenore presencium notum facimus vniuersis, quod nos ex parte domini nostri, jllustris principis domini Magni, regis Swecie, Norwegie et Skanie, et omnium regna sua eadem inhabitancium necnon et omnium amore sui facere uel omittere quicquam volencium et suorum cum honorabilibus viris consiliariis incliti principis domini Waldemari, regis Dacie, militibus et militaribus ac vniuersis suis hominibus in Estonia.</ i><br></td><tr>
        </font></TABLE></td> </tr>
        <tr> <td valign="bottom"><A href="javascript:history.go(-1)">Takaisin/
  9. Scripts for converting HTML to XML

        import module namespace functx = 'http://www.functx.com';

        let $articles :=
          <articles>{
            for $article in /html/body/div/table/tr/td/table
            let $number := $article/tr[1]/td[1]/b/text()
            let $placeTime := $article/tr[1]/td[2]/text()
            let $provenance := $article/tr[5]/td/text()
            let $regesta := $article/tr/td[@class="regesta"]/text()
            let $letter := $article/tr/td/p/*/text()
            let $originalDate :=
              if (string-length($article/tr[7]/td/text()) > 5)
              then $article/tr[7]/td/text()
              else ()
            return
              <article>
                <number>{$number}</number>
                <placetime>{$placeTime}</placetime>
                <originaldate>{$originalDate}</originaldate>
                <provenance>{$provenance}</provenance>
                <regesta>{$regesta}</regesta>
                <letter>{$letter}</letter>
              </article>
          }</articles>
  10. let $placesDates :=
          <articles>{
            for $article in $articles/article
            where not(functx:all-whitespace($article/placetime))
            let $splittedDate := tokenize($article/placetime, '\.')
            let $date := <date>{functx:get-matches($splittedDate[1], '\d+')}</date>
            let $place := <place>{functx:get-matches($splittedDate[1], '[A-Za-zÖÄÅäöå]+')}</place>
            let $finalDate :=
              if (string-length($date) >= 1 and $date < 31)
              then $date
              else ()
            let $finalYear :=
              if (string-length($date) >= 1 and $date > 100)
              then $date
              else ($splittedDate[3])
            return
              <article>
                {$article/originaldate}
                {$article/number}
                {$article/provenance}
                {$article/regesta}
                {$article/letter}
                <place>{normalize-space(data($place))}</place>
                <placetime>{data($article/placetime)}</placetime>
                <date>{normalize-space(data($finalDate))}</date>
                <month>{$splittedDate[2]}</month>
                <year>{$finalYear}</year>
              </article>
          }</articles>
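The script above splits a heading such as `Reval 21.5.1343` into place, day, month, and year, and treats a lone large number after the place as a year. The following Python sketch restates that logic so it can be tried on individual strings. It is an approximation rather than a line-by-line port: the function name is hypothetical, and it accepts day 31, whereas the slide tests `$date<31`.

```python
import re


def parse_placetime(placetime: str) -> dict:
    """Split a 'Place D.M.YYYY' heading into its parts.

    Missing parts are returned as None. A number above 100 directly
    after the place is taken to be a year (e.g. 'Viborg 1343').
    """
    parts = [part.strip() for part in placetime.split(".")]
    head = parts[0]
    numbers = re.findall(r"\d+", head)
    # [^\W\d_]+ matches runs of letters, including Å, Ä, Ö.
    place = " ".join(re.findall(r"[^\W\d_]+", head)) or None
    number = int(numbers[0]) if numbers else None

    day = month = year = None
    if number is not None and number > 100:
        year = str(number)
    else:
        if number is not None and 1 <= number <= 31:
            day = str(number)
        if len(parts) > 1 and parts[1]:
            month = parts[1]
        if len(parts) > 2 and parts[2]:
            year = parts[2]
    return {"place": place, "day": day, "month": month, "year": year}


print(parse_placetime("Reval 21.5.1343"))
print(parse_placetime("Viborg 1343"))
```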
  11. Cleaned-up XML

        <article>
          <originaldate>Reval 21 maj 1343.</originaldate>
          <number>479</number>
          <place>Reval</place>
          <placetime>Reval 21.5.1343</placetime>
          <date>21</date>
          <month>5</month>
          <year>1343</year>
          <provenance>Urkunden intagen i ett vidisse, av 19 juli s. å., på pergament i Revals stadsarkiv. Sv. Dipl. V n:o 3698. Sv. Trakt. II n:o 257. LEC. Urkdb. II n:o 815.</provenance>
          <regesta>Dan Niklisson, hövding över östra landsdelarna, Johannes Götesson, fogde i Viborg, Johannes Benediktsson m. fl. tillkännage att de på kung Magnus’ vägnar med kung Valdemars av Danmark rådsherrar och män i Estland samt med staden Reval och alla andra danska undersåter i Estland ingått en fred, såväl till lands som sjöss, att gälla till den 14 mars 1344, vilken fred dock skulle underställas kung Magni stadfästelse.</regesta>
          <letter>Ego vero Dan Niclisson prenotatus interim vel in persona propria vel per certum et ydoneum nuncium dominum meum, dominum regem Swecie, visitabo, notificaturus sibi de treugis memoratis; quas si ratas habere noluerit, extunc dictis consiliariis regis Dacie in Estonia, ciuitati Reualiensi et terre ad mensem integrum, antequam eis per aliquod regnorum dicti domini mei regis dampna vel violencia aliquatenus inferantur, hoc intimare debeo cum fide mea et compromiss(i)ariorum meorum predictorum. Jn quorum omnium testimonium sigilla nostra presentibus sunt appensa. Datum et actum Reualie anno Domini m°ccc°xl° tercio, vigilia ascensionis Domini.</letter>
        </article>
  12. Script for converting XML to plain text

        declare option output:method "text";
        declare option output:item-separator "\n";

        for $record in /articles/article
        let $number := concat($record/year, '-', $record/number)
        let $path := concat('/home/majulass/dh-tools/fennicum-data/', $number, '.txt')
        (: The slide tests string-length($number) here. Because $number is
           "year-number", the test is presumably meant for the year, so that
           only records with a four-digit year are written. :)
        where (string-length($record/year) eq 4)
        return file:write-text-lines(
          data($path),
          ($record/number, ' ', $record/originaldate, $record/regesta, $record/letter, '---')
        )
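The export step above writes one text file per record, named by year and record number. A comparable Python sketch follows, for readers who prefer to do this step outside BaseX. The function name, the sample records, and the output directory are assumptions. It keeps only records whose year has four digits, which is also an assumption about what the slide's `where` clause is meant to do.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Hypothetical sample records in the shape of the cleaned-up XML.
SAMPLE = """<articles>
  <article><number>484</number><year>1344</year>
    <originaldate>Kuustö 24 april 1344.</originaldate>
    <regesta>Biskop Hemming i Åbo intygar ...</regesta>
    <letter>Vniuersis presentes litteras inspecturis ...</letter></article>
  <article><number>999</number><year></year>
    <originaldate></originaldate><regesta>Undated record</regesta>
    <letter>...</letter></article>
</articles>"""


def export_records(xml_text: str, out_dir) -> list:
    """Write each record with a four-digit year to <year>-<number>.txt.

    Returns the list of paths written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for record in ET.fromstring(xml_text).iter("article"):
        year = (record.findtext("year") or "").strip()
        number = (record.findtext("number") or "").strip()
        if len(year) != 4 or not number:
            continue  # skip records without a usable year or number
        lines = [number,
                 record.findtext("originaldate") or "",
                 record.findtext("regesta") or "",
                 record.findtext("letter") or "",
                 "---"]
        path = out / f"{year}-{number}.txt"
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        for path in export_records(SAMPLE, tmp):
            print(path.name)
```

The resulting directory of plain text files can then be uploaded to Voyant Tools as a corpus.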
  13. Corpus in plain text

        ---
        484 Kuustö 24 april 1344. Biskop Hemming i Åbo intygar att hans mor på sitt yttersta gett 5 ören land uti Agersta i Jumkils socken till Uppsala domkyrkobyggnad. Vniuersis presentes litteras inspecturis Hemmingus, permissione diuina episcopus Aboensis, salutem in Domino sempiternam. Tenore presencium protestamur, quod dilecta nobis in Christo mater nostra, Katerina, bone memorie, legauit pro anima sua in extremis quinque oras terre cum domibus et ceteris pertinenciis in villa, que dicitur Aghursta, parrrochie Jumakyl situatas, fabrice ecclesie Vpsalensis, ex consensu nostro et Hemmingi Olafsson sororisque sue Elene, magis propinquorum necnon et Johannis Pætærsson, generi nostri, mariti Elene memorate. In cuius legacionis et consensus euidenciam firmiorem sigillum nostrum vna cum sigillis predictorum virorum presentibus sunt appensa (!). Datum et actum apud curiam nostram Kustu anno Domini m°ccc°xl° quarto, in profesto beati Marci evangeliste.
        ---
        485 Pargas 3 maj 1344. Kyrkoherden Vinand i Kimito upplåter sina gods uti Trollshovda i Tenala till Åbo domkyrka. Dat. die inventionis sanctæ crucis. Vniuersis presentes litteras inspecturis Wynandus, curatus ecclesie Kymitto, salutem in Domino sempiternam. Noueritis, quod bona mea in Trvlzhuwt, parrochia Thenalum, cum omnibus adiacenciis dimitto et assigno ecclesie Aboensi per presentes perpetuo possidenda. Jn cuius rei euidenciam sigillum venerandj viri dominj Elawj, prepositj Aboensis, vna cum sigillo meo presentibus est appensum. Datum et actum in Pargasa, anno Dominj m°ccc°xl quarto, die inuencionis sancte crvcis.
        ---
        486 Varberg 29 maj 1344. Kung Magnus förnyar de privilegier, han den 12 augusti 1336 beviljat staden Lübeck. Dat. sabbato Trinitatis.
  14. First steps to take

    1) Try Voyant Tools for basic text analysis
    2) Explore XQuery using BaseX
  15. Vanderbilt University XQuery Summer Institute: https://github.com/XQueryInstitute

    Voyant Tools Documentation: http://docs.voyant-tools.org/
    BaseX XML Database: http://basex.org/
    XQuery Wikibook: http://en.wikibooks.org/wiki/XQuery
    Korkiakangas, T. & Lassila, M. (2013). Abbreviations, fragmentary words, formulaic language: treebanking medieval charter material. http://www.bultreebank.org/ACRH-3/ACRH-3Proceeding.pdf