Slide 1

Slide 1 text

Formats over Time: Exploring UK Web History   Andrew Jackson   UK Web Archive, The British Library   iPres 2012 | 04-10-2012 | Toronto

Slide 2

Slide 2 text

DEBATING OBSOLESCENCE Formats over Time

Slide 3

Slide 3 text

Rothenberg & Rosenthal On Format Obsolescence   Jeff Rothenberg:   “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)   “…still apt…” (2012)   David Rosenthal:   “when challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.” (2010)   Argues that network effects inhibit obsolescence   Where is the evidence?

Slide 4

Slide 4 text

AN EXPERIMENT Formats over Time

Slide 5

Slide 5 text

UK Web Domain Dataset (1994-2010)   From the Internet Archive   Millions of websites   > 2.5 billion resources   > 400,000 ARC/WARC files   > 35 TB   Execution at Scale   Stored on HDFS   Map-Reduce
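As a rough illustration of the map-reduce pattern named on this slide, the sketch below counts resources per (format, year) pair in plain Python. The record layout, the names `map_record` and `reduce_counts`, and the sample records are assumptions made for illustration; the actual jobs ran on Hadoop over ARC/WARC files and are not reproduced here.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit one ((format, year), 1) pair per archived resource.

    `record` is a hypothetical dict with 'mime' and 'timestamp' keys; the
    timestamp is assumed to start with a four-digit year.
    """
    year = int(record["timestamp"][:4])
    yield (record["mime"], year), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Illustrative records only; not taken from the dataset.
records = [
    {"mime": "image/png", "timestamp": "20040312101500"},
    {"mime": "image/png", "timestamp": "20040701000000"},
    {"mime": "application/pdf", "timestamp": "20040101120000"},
]
pairs = (pair for record in records for pair in map_record(record))
profile = reduce_counts(pairs)
```

Because the reduce step only sums values per key, partial counts from separate input files can be combined in any order.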

Slide 6

Slide 6 text

Identification Tools   DROID   Well-known in digital preservation community   Format version level identification   Minor problem concerning file handles   Only binary signature part (DROID-B) could be embedded   Apache Tika   Widely used identification and data extraction tool   Identifies many formats at the MIME type level   Easy to embed and extend   Added ability to extract e.g. software identifiers   Minor bug concerning identification buffer size

Slide 7

Slide 7 text

A Common Language For Format Identifiers   Comparison and combination require a common model   Map PRONOM IDs to extended MIME Types   fmt/18 becomes application/pdf; version=1.4   Allows easy comparison at sub-type level   Can easily extend to cover other properties:   text/plain; charset=UTF-8   application/pdf; software="Adobe Acrobat 6.0"   Also extended Tika to output details from PDFs
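A minimal sketch of how such extended MIME types might be handled, assuming a simple `type/subtype; key=value` syntax with optional double-quoted values. Only the `fmt/18` mapping comes from the slide; the function names, the regular expression and the one-entry lookup table are hypothetical and are not the code used in the study.

```python
import re

# The one mapping stated on the slide; a real table would hold many more.
PRONOM_TO_MIME = {"fmt/18": "application/pdf; version=1.4"}

# Matches '; key=value' where value is either "quoted" or a bare token.
_PARAM = re.compile(r';\s*([A-Za-z0-9_.-]+)\s*=\s*(?:"([^"]*)"|([^;]*))')


def parse_extended_mime(value):
    """Split an extended MIME type into (base type, parameter dict)."""
    base, _, rest = value.partition(";")
    params = {}
    for match in _PARAM.finditer(";" + rest):
        quoted, bare = match.group(2), match.group(3)
        params[match.group(1).lower()] = quoted if quoted is not None else bare.strip()
    return base.strip().lower(), params


def same_subtype(a, b):
    """Compare two identifiers at the type/subtype level, ignoring parameters."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]


droid = PRONOM_TO_MIME["fmt/18"]
tika = 'application/pdf; software="Adobe Acrobat 6.0"'
```

Here `same_subtype(droid, tika)` holds even though one identifier carries a version and the other a software name, which is the kind of sub-type comparison the slide describes.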

Slide 8

Slide 8 text

Format Profile Dataset   Server, Tika & DROID-B format profiles, over time:
 Server: image/png | Tika: image/png | DROID-B: image/png; version=1.0 | Year: 2004 | Count: 102
 Server: application/pdf | Tika: application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0" | DROID-B: application/pdf; version=1.2 | Year: 2004 | Count: 1
CC0 – free to download and reuse   http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/   Please cite us and/or let us know if you use it   Source code of all tools and modifications also available   https://github.com/openplanets/nanite
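One way such profile rows could be turned into per-year percentages is sketched below. It assumes each row holds the server, Tika and DROID-B identifiers followed by a year and a resource count, which is the reading suggested by the two example rows on the slide; the on-disk layout of the published files is not assumed, and `yearly_share` is a hypothetical helper.

```python
from collections import defaultdict

# Assumed column order: (server type, Tika type, DROID-B type, year, count).
ROWS = [
    ("image/png", "image/png", "image/png; version=1.0", 2004, 102),
    (
        "application/pdf",
        'application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; '
        'source="Adobe PageMaker 6.0"',
        "application/pdf; version=1.2",
        2004,
        1,
    ),
]


def yearly_share(rows, column=1):
    """Return {year: {base_type: percentage}} for one identifier column.

    column 0 is the server type, 1 the Tika type, 2 the DROID-B type.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        base_type = row[column].split(";", 1)[0].strip()
        counts[row[3]][base_type] += row[4]
    shares = {}
    for year, by_type in counts.items():
        total = sum(by_type.values())
        shares[year] = {t: 100.0 * n / total for t, n in by_type.items()}
    return shares


shares = yearly_share(ROWS)
```

Parameters such as `version` and `software` are dropped before counting, so rows that differ only in those details fall into the same base type.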

Slide 9

Slide 9 text

COMPARING TOOLS Results

Slide 10

Slide 10 text

Coverage & Depth   [Chart: percentage of resources unidentified (axis marked 0%, 1%, 10%, 100%) by year, 1996-2010; series: DROID-B v.59, Apache Tika 1.1]   No format-version-level information from Apache Tika.

Slide 11

Slide 11 text

Inconsistencies   Gaps   37 formats spotted by DROID-B but not Tika   Notably includes earlier Office formats   129 formats spotted by Tika but not DROID-B   But at least 20 are due to not using the full DROID   Conflicts   Failed MIME type mapping, e.g. PDF 1.7 (since fixed)   ‘Soft’ signatures – e.g. PICT matching 3M JPG (gone)   DROID strictness – 9M GIF, 4M JPG, 1.3M PDF…   Both tools bad at non-HTML/XML text formats   CSS, scripting languages like JS, CSV, TSV, etc.
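The gaps and conflicts listed above can be thought of as set differences and pairwise mismatches between the two tools' outputs. The sketch below shows that idea under simple assumptions: each resource yields a (DROID-B identifier, Tika identifier) pair, with `None` when a tool reports nothing. The function names and the sample pairs are made up for illustration and are not results from the study.

```python
def base_type(identifier):
    """Strip parameters such as version= so tools are compared by type/subtype."""
    return identifier.split(";", 1)[0].strip().lower()


def compare_tools(results):
    """Summarise gaps and conflicts for a list of (droid_id, tika_id) pairs."""
    droid_seen = {base_type(d) for d, _ in results if d}
    tika_seen = {base_type(t) for _, t in results if t}
    conflicts = [
        (d, t) for d, t in results if d and t and base_type(d) != base_type(t)
    ]
    return {
        "droid_only": droid_seen - tika_seen,  # formats only DROID-B reported
        "tika_only": tika_seen - droid_seen,   # formats only Tika reported
        "conflicts": conflicts,                # same resource, different answers
    }


# Illustrative pairs only.
results = [
    ("image/png; version=1.0", "image/png"),
    ("application/msword", None),
    (None, "application/rss+xml"),
    ("image/x-pict", "image/jpeg"),
]
report = compare_tools(results)
```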

Slide 12

Slide 12 text

FORMATS OVER TIME Results

Slide 13

Slide 13 text

Image Formats Over Time   [Chart: percentage of crawl (log scale, 0.00001% to 100%) by year, 1996-2010; series: JPEG, GIF, PNG, ICON, XBM, TIFF]

Slide 14

Slide 14 text

HTML Versions Over Time   [Chart: percentage of HTML resources (0% to 100%) by year, 1996-2010; series: HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01, XHTML 1.0]

Slide 15

Slide 15 text

PDF Versions Over Time   [Chart: percentage of PDF resources (0% to 100%) by year, 1996-2010; series: PDF versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]

Slide 16

Slide 16 text

Format Usage Versus Time   [Chart: number of resources in archive (log scale, 1 to 10,000,000,000) against timespan in years (0 to 18)]
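A sketch of how the two quantities on this chart might be derived from per-year counts. It assumes that "timespan" means the gap between the first and last year in which a format is seen, which the slide does not define; the function name and the sample rows are hypothetical.

```python
def usage_versus_timespan(rows):
    """Map each format to (timespan in years, total resources).

    `rows` is an iterable of (format, year, count) tuples. Timespan is taken
    as last year seen minus first year seen (an assumed definition).
    """
    first, last, total = {}, {}, {}
    for fmt, year, count in rows:
        first[fmt] = min(year, first.get(fmt, year))
        last[fmt] = max(year, last.get(fmt, year))
        total[fmt] = total.get(fmt, 0) + count
    return {fmt: (last[fmt] - first[fmt], total[fmt]) for fmt in total}


# Illustrative rows only; not taken from the dataset.
rows = [
    ("image/gif", 1996, 10),
    ("image/gif", 2010, 30),
    ("image/x-xbitmap", 1997, 2),
]
points = usage_versus_timespan(rows)
```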

Slide 17

Slide 17 text

IMPLEMENTATIONS Results

Slide 18

Slide 18 text

PDF Software Over Time   [Chart: percentage of PDF resources (0% to 100%) by year, 1996-2010; series: Acrobat Distiller, Acrobat PDFWriter, Acrobat]   Over 2100 Distinct PDF Software IDs

Slide 19

Slide 19 text

JPEG Hardware Over Time   [Chart: percentage of hardware IDs (0% to 100%) by year, 1994-2010; series: DS5, CYBERSHOT, E990, MX1700, NIKON D40]   Over 2100 Distinct JPEG Hardware IDs

Slide 20

Slide 20 text

CONCLUSIONS Formats over Time

Slide 21

Slide 21 text

Summary   Format obsolescence is complex   Network effects do appear to stabilize formats   But once-popular formats are fading nevertheless   More sophisticated approach required   Please re-use our data, or ask for more   Firmer conclusions need:   Richer, more detailed results   From a wider range of corpora   This approach only gives creator information   A different approach will be needed to understand resource consumption (e.g. PPT 4, RealAudio 1)

Slide 22

Slide 22 text

webarchive.org.uk Questions?