Formats Over Time: Exploring UK Web History

Andy Jackson
October 04, 2012


Transcript

  1. Formats over Time: Exploring UK Web History
      Andrew Jackson, UK Web Archive, The British Library
      iPres 2012 | 04-10-2012 | Toronto
  2. DEBATING OBSOLESCENCE Formats over Time

  3. Rothenberg & Rosenthal On Format Obsolescence
      - Jeff Rothenberg:
        - “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
        - “…still apt…” (2012)
      - David Rosenthal:
        - “when challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.” (2010)
        - That network effects inhibit obsolescence
      - Where is the evidence?
  4. AN EXPERIMENT Formats over Time

  5. UK Web Domain Dataset (1994-2010)
      - From the Internet Archive
        - Millions of websites
        - > 2.5 billion resources
        - > 400,000 ARC/WARC files
        - > 35TB
      - Execution at Scale
        - Stored on HDFS
        - Map-Reduce
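
The "Execution at Scale" step can be pictured as a map-reduce job that emits one (format, year) key per archived resource and sums the counts per key. Below is a minimal in-memory Python sketch of that idea. It is not the Hadoop code used for the talk: the `identify` function is a hypothetical stand-in for a real identification tool, and the sample records are made up.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Key = Tuple[str, int]  # (format, crawl year)


def identify(payload: bytes) -> str:
    """Hypothetical stand-in for a real identifier such as Tika or DROID."""
    if payload.startswith(b"%PDF-"):
        return "application/pdf"
    if payload.startswith(b"\x89PNG"):
        return "image/png"
    return "application/octet-stream"


def map_phase(records: Iterable[Tuple[int, bytes]]) -> Iterator[Tuple[Key, int]]:
    """Emit ((format, year), 1) for every (crawl_year, payload) record."""
    for year, payload in records:
        yield (identify(payload), year), 1


def reduce_phase(pairs: Iterable[Tuple[Key, int]]) -> Counter:
    """Sum the emitted values for each (format, year) key."""
    totals: Counter = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


# Made-up sample records: (crawl year, first bytes of the resource).
sample = [
    (2004, b"%PDF-1.2 ..."),
    (2004, b"\x89PNG\r\n"),
    (2004, b"%PDF-1.3 ..."),
    (2005, b"%PDF-1.4 ..."),
]
profile = reduce_phase(map_phase(sample))
```

In a real cluster job the map and reduce phases would run on separate nodes over the ARC/WARC files; the per-key summation is the part that carries over unchanged.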
  6. Identification Tools
      - DROID
        - Well-known in digital preservation community
        - Format version level identification
        - Minor problem concerning file handles
        - Only binary signature part (DROID-B) could be embedded
      - Apache Tika
        - Widely used identification and data extraction tool
        - Identifies many formats at the MIME type level
        - Easy to embed and extend
        - Added ability to extract e.g. software identifiers
        - Minor bug concerning identification buffer size
  7. A Common Language For Format Identifiers
      - Comparison and combination requires a common model
      - Map PRONOM IDs to extended MIME Types
        - fmt/18 becomes application/pdf; version=1.4
        - Allows easy comparison at sub-type level
      - Can easily extend to cover other properties:
        - text/plain; charset=UTF-8
        - application/pdf; software="Adobe Acrobat 6.0"
      - Also extended Tika to output details from PDFs
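
The mapping from PRONOM IDs to extended MIME types, and the sub-type comparison it enables, might look roughly like the Python sketch below. Only the `fmt/18` entry comes from the slide; the function names are hypothetical, and the parser is deliberately naive (it would mis-split a quoted parameter value that itself contains a semicolon).

```python
from typing import Dict, Tuple

# Illustrative lookup table: only the fmt/18 entry is taken from the slide.
PRONOM_TO_MIME: Dict[str, str] = {
    "fmt/18": "application/pdf; version=1.4",
}


def parse_extended_mime(value: str) -> Tuple[str, Dict[str, str]]:
    """Split 'type/subtype; key=value; ...' into (base type, parameters)."""
    base, *params = [part.strip() for part in value.split(";")]
    parsed: Dict[str, str] = {}
    for param in params:
        if not param:
            continue  # tolerate a trailing semicolon
        key, _, val = param.partition("=")
        parsed[key.strip().lower()] = val.strip().strip('"')
    return base.lower(), parsed


def same_subtype(a: str, b: str) -> bool:
    """Compare two identifiers at the type/subtype level, ignoring parameters."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]


base, params = parse_extended_mime(PRONOM_TO_MIME["fmt/18"])
```

Keeping version, charset and software as parameters means results from different tools can still be compared on the base type even when only one of them reports the extra detail.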
  8. Format Profile Dataset
      - Server, Tika & DROID-B format profiles, over time
      - Example records (server type | Tika type | DROID-B type | year | number of resources):
        image/png | image/png | image/png; version=1.0 | 2004 | 102
        application/pdf | application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0" | application/pdf; version=1.2 | 2004 | 1
      - CC0: free to download and reuse
        - http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
        - Please cite us and/or let us know if you use it
      - Source code of all tools and modifications also available
        - https://github.com/openplanets/nanite
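
For anyone reusing the data, a profile of this kind could be read with a few lines of Python. The sketch below assumes tab-separated lines with five fields (server type, Tika type, DROID-B type, year, count), which is inferred from the slide's example records rather than from the published files, so the layout should be checked against the actual download. The `ProfileRow` and `read_profile` names are hypothetical.

```python
import csv
import io
from typing import List, NamedTuple


class ProfileRow(NamedTuple):
    server_type: str
    tika_type: str
    droid_type: str
    year: int
    count: int


def read_profile(text: str) -> List[ProfileRow]:
    """Parse tab-separated profile lines (assumed five-field layout)."""
    # QUOTE_NONE keeps the double quotes inside MIME parameters untouched.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows: List[ProfileRow] = []
    for fields in reader:
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        server, tika, droid, year, count = fields
        rows.append(ProfileRow(server, tika, droid, int(year), int(count)))
    return rows


# The two example records from the slide, re-typed with tab separators.
SAMPLE = (
    "image/png\timage/png\timage/png; version=1.0\t2004\t102\n"
    "application/pdf\t"
    'application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; '
    'source="Adobe PageMaker 6.0"\t'
    "application/pdf; version=1.2\t2004\t1\n"
)
rows = read_profile(SAMPLE)
```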
  9. COMPARING TOOLS Results

  10. Coverage & Depth
      [Figure: percentage of resources unidentified (axis ticks at 0%, 1%, 10%, 100%) against year, 1996 to 2010; series: DROID-B v.59, Apache Tika 1.1.]
      - No format-version-level information from Apache Tika.
  11. Inconsistencies
      - Gaps
        - 37 formats spotted by DROID-B but not Tika
        - Notably includes earlier Office formats
        - 129 formats spotted by Tika but not DROID-B
        - But at least 20 are due to not using the full DROID
      - Conflicts
        - Failed MIME type mapping, e.g. PDF 1.7 (since fixed)
        - ‘Soft’ signatures, e.g. PICT matching 3M JPG (gone)
        - DROID strictness: 9M GIF, 4M JPG, 1.3M PDF…
      - Both tools bad at non-HTML/XML text formats
        - CSS, scripting languages like JS, CSV, TSV, etc.
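
Gaps and conflicts of this kind can be computed from per-resource pairs of identifications with set differences and a counter. The following Python sketch shows one way to do so; the `compare_tools` function, the use of `application/octet-stream` as the "unidentified" marker, and the sample pairs are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

UNKNOWN = "application/octet-stream"  # assumed marker for "not identified"


def base_type(ext_mime: str) -> str:
    """Drop parameters such as version= and keep only type/subtype."""
    return ext_mime.split(";", 1)[0].strip().lower()


def compare_tools(
    pairs: Iterable[Tuple[str, str]],
) -> Tuple[Set[str], Set[str], Counter]:
    """Take (tika_type, droid_type) pairs, one per resource.

    Returns (formats only DROID-B reported, formats only Tika reported,
    counts of resources where both tools answered but disagreed).
    """
    tika_seen: Set[str] = set()
    droid_seen: Set[str] = set()
    conflicts: Counter = Counter()
    for tika_type, droid_type in pairs:
        t, d = base_type(tika_type), base_type(droid_type)
        if t != UNKNOWN:
            tika_seen.add(t)
        if d != UNKNOWN:
            droid_seen.add(d)
        if UNKNOWN not in (t, d) and t != d:
            conflicts[(t, d)] += 1
    return droid_seen - tika_seen, tika_seen - droid_seen, conflicts


# Made-up sample; application/x-foo is a hypothetical type.
sample = [
    ("image/jpeg", "image/jpeg"),
    ("image/jpeg", "image/x-pict"),
    ("application/pdf; version=1.4", "application/pdf; version=1.4"),
    ("application/x-foo", UNKNOWN),
    (UNKNOWN, "application/msword"),
]
droid_only, tika_only, conflicts = compare_tools(sample)
```

Comparing on the base type only means a version mismatch is not counted as a conflict; a stricter comparison would need the parameters as well.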
  12. FORMATS OVER TIME Results

  13. Image Formats Over Time
      [Figure: percentage of crawl (logarithmic axis, 0.00001% to 100%) against year, 1996 to 2010; series: JPEG, GIF, PNG, ICON, XBM, TIFF.]
  14. HTML Versions Over Time
      [Figure: percentage of HTML resources (0% to 100%) against year, 1996 to 2010; series: HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01, XHTML 1.0.]
  15. PDF Versions Over Time
      [Figure: percentage of PDF resources (0% to 100%) against year, 1996 to 2010; series: PDF 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6.]
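
Charts like this one plot each version's share of that year's resources, which is the version's count divided by the year's total. A minimal Python sketch of that normalisation follows; the `yearly_shares` function and the sample counts are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def yearly_shares(
    counts: Iterable[Tuple[int, str, int]],
) -> Dict[int, Dict[str, float]]:
    """Turn (year, version, count) triples into per-year percentage shares."""
    by_year: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for year, version, count in counts:
        by_year[year][version] += count

    shares: Dict[int, Dict[str, float]] = {}
    for year, versions in by_year.items():
        total = sum(versions.values())
        if total == 0:
            continue  # nothing recorded for this year, so no shares to report
        shares[year] = {v: 100.0 * n / total for v, n in versions.items()}
    return shares


# Made-up counts, not taken from the dataset.
sample = [
    (2004, "1.2", 10),
    (2004, "1.3", 30),
    (2004, "1.4", 60),
    (2005, "1.4", 50),
    (2005, "1.5", 50),
]
shares = yearly_shares(sample)
```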
  16. Format Usage Versus Time
      [Figure: number of resources in archive (logarithmic axis, 1 to 10,000,000,000) against timespan in years (0 to 18).]
  17. IMPLEMENTATIONS Results

  18. PDF Software Over Time
      [Figure: percentage of PDF resources (0% to 100%) against year, 1996 to 2010; series: Acrobat Distiller, Acrobat PDFWriter, Acrobat.]
      - Over 2100 Distinct PDF Software IDs
  19. JPEG Hardware Over Time
      [Figure: percentage of hardware IDs (0% to 100%) against year, 1994 to 2010; series: DS5, CYBERSHOT, E990, MX1700, NIKON D40.]
      - Over 2100 Distinct JPEG Hardware IDs
  20. CONCLUSIONS Formats over Time

  21. Summary
      - Format obsolescence is complex
        - Network effects do appear to stabilize formats
        - But once-popular formats are fading nevertheless
        - More sophisticated approach required
      - Please re-use our data, or ask for more
      - Firmer conclusions need:
        - Richer, more detailed results
        - From a wider range of corpora
      - This approach only gives creator information
        - A different approach will be needed to understand resource consumption (e.g. PPT 4, RealAudio 1)
  22. webarchive.org.uk Questions?