“Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
“…still apt…” (2012)
David Rosenthal: “when challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.” (2010)
That network effects inhibit obsolescence
Where is the evidence?
Format version level identification
Minor problem concerning file handles
Only binary signature part (DROID-B) could be embedded

Apache Tika
  Widely used identification and data extraction tool
  Identifies many formats at the MIME type level
  Easy to embed and extend
  Added ability to extract e.g. software identifiers
  Minor bug concerning identification buffer size
requires a common model
Map PRONOM IDs to extended MIME Types
  fmt/18 becomes application/pdf; version=1.4
Allows easy comparison at sub-type level
Can easily extend to cover other properties:
  text/plain; charset=UTF-8
  application/pdf; software="Adobe Acrobat 6.0"
Also extended Tika to output details from PDFs
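The mapping and comparison idea above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the lookup table `PRONOM_TO_MIME` and the helper names `parse_extended_mime` and `same_subtype` are hypothetical, and only the fmt/18 entry is taken from the slide.

```python
# Hypothetical lookup table; only the fmt/18 entry comes from the slide.
PRONOM_TO_MIME = {
    "fmt/18": "application/pdf; version=1.4",
}


def parse_extended_mime(value):
    """Split 'type/subtype; key=value; ...' into (base_type, params).

    Simplification: assumes parameter values contain no ';' characters.
    """
    parts = [p.strip() for p in value.split(";")]
    base = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue
        key, _, val = part.partition("=")
        # Drop surrounding straight or curly quotes from the value.
        params[key.strip().lower()] = val.strip().strip('"“”')
    return base, params


def same_subtype(a, b):
    """Compare two identifications at the type/subtype level only."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]


if __name__ == "__main__":
    print(parse_extended_mime(PRONOM_TO_MIME["fmt/18"]))
    print(same_subtype("application/pdf; version=1.4",
                       'application/pdf; software="Adobe Acrobat 6.0"'))
```

Keeping extra properties as parameters means two tools can disagree about version or software while still agreeing on the base type, which is what makes the sub-type comparison cheap.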
over time:
image/png	image/png	image/png; version=1.0	2004	102
application/pdf	application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0"	application/pdf; version=1.2	2004	1

CC0 – free to download and reuse
  http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
  Please cite us and/or let us know if you use it
Source code of all tools and modifications also available
  https://github.com/openplanets/nanite
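The sample rows above appear to carry several format identifications followed by a year and a count. A rough sketch of how such rows might be tallied is given below. It assumes the fields are tab-separated and that the last two fields are year and count; both are assumptions about the layout, not something the slide states, and `parse_row` and `tally` are hypothetical helper names.

```python
from collections import Counter

# The two sample rows from the slide, assuming tab-separated fields.
SAMPLE = (
    "image/png\timage/png\timage/png; version=1.0\t2004\t102\n"
    "application/pdf\t"
    'application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; '
    'source="Adobe PageMaker 6.0"\t'
    "application/pdf; version=1.2\t2004\t1\n"
)


def parse_row(line):
    """Split one row into (identifications, year, count).

    Assumption: the last two fields are year and count; everything
    before them is a format identification string.
    """
    *identifications, year, count = line.rstrip("\n").split("\t")
    return identifications, int(year), int(count)


def tally(lines, column=-1):
    """Sum counts per (identification, year) for the chosen column."""
    totals = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ids, year, count = parse_row(line)
        totals[(ids[column], year)] += count
    return totals


if __name__ == "__main__":
    for key, total in tally(SAMPLE.splitlines()).items():
        print(key, total)
```

Choosing a different `column` tallies by a different identification, which is one way to look at how each tool's view of a format changes from year to year.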
not Tika
  Notably includes earlier Office formats
129 formats spotted by Tika but not DROID-B
  But at least 20 are due to not using the full DROID
Conflicts
  Failed MIME type mapping, e.g. PDF 1.7 (since fixed)
  ‘Soft’ signatures – e.g. PICT matching 3M JPG (gone)
  DROID strictness – 9M GIF, 4M JPG, 1.3M PDF…
Both tools bad at non-HTML/XML text formats
  CSS, scripting languages like JS, CSV, TSV, etc.
appear to stabilize formats
But once popular formats are fading nevertheless
More sophisticated approach required
Please re-use our data, or ask for more
Firmer conclusions need:
  Richer, more detailed results
  From a wider range of corpora
This approach only gives creator information
  A different approach will be needed to understand resource consumption (e.g. PPT 4, RealAudio 1)