Formats Over Time: Exploring UK Web History

Andy Jackson
October 04, 2012

Transcript

  1. Formats over Time
    Exploring UK Web History
    Andrew Jackson
    UK Web Archive, The British Library
    iPres 2012 | 04-10-2012 | Toronto

  2. DEBATING OBSOLESCENCE
    Formats over Time

  3. Rothenberg & Rosenthal On Format Obsolescence
      Jeff Rothenberg:
        “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
        “…still apt…” (2012)
      David Rosenthal:
        “when challenged, proponents of [format migration strategies] have failed to identify even one format in wide use when Rothenberg [made that assertion] that has gone obsolete in the intervening decade and a half.” (2010)
        That network effects inhibit obsolescence
      Where is the evidence?

  4. AN EXPERIMENT
    Formats over Time

  5. UK Web Domain Dataset (1994-2010)
      UK Web Domain Dataset (1994-2010)
      From the Internet Archive
      Millions of websites
      > 2.5 billion resources
      > 400,000 ARC/WARC files
      > 35TB
      Execution at Scale
      Stored on HDFS
      Map-Reduce
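
The counting pattern behind a Map-Reduce format survey can be sketched in a few lines. This is a minimal in-memory illustration, not the code that ran on the cluster: the record layout (a MIME type paired with a crawl year) and the function names `map_record` and `reduce_counts` are assumptions made for the example.

```python
from collections import Counter
from itertools import chain


def map_record(record):
    """Map step: emit one ((mime_type, year), 1) pair per archived resource."""
    # `record` is a hypothetical (mime_type, crawl_year) tuple.
    mime_type, year = record
    yield (mime_type, year), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each (mime_type, year) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Made-up records standing in for identified resources.
records = [
    ("image/png", 2004),
    ("image/png", 2004),
    ("application/pdf", 2004),
    ("image/gif", 1998),
]

counts = reduce_counts(chain.from_iterable(map_record(r) for r in records))
print(counts)
```

On a real cluster the map and reduce steps would run on separate nodes over the ARC/WARC files; only the shape of the computation is shown here.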

  6. Identification Tools
      DROID
        Well-known in digital preservation community
        Format version level identification
        Minor problem concerning file handles
        Only binary signature part (DROID-B) could be embedded
      Apache Tika
        Widely used identification and data extraction tool
        Identifies many formats at the MIME type level
        Easy to embed and extend
        Added ability to extract e.g. software identifiers
        Minor bug concerning identification buffer size

  7. A Common Language For Format Identifiers
      Comparison and combination requires a common model
      Map PRONOM IDs to extended MIME Types
        fmt/18 becomes application/pdf; version=1.4
      Allows easy comparison at sub-type level
      Can easily extend to cover other properties:
        text/plain; charset=UTF-8
        application/pdf; software="Adobe Acrobat 6.0"
      Also extended Tika to output details from PDFs
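
A minimal sketch of the mapping idea, assuming a simple lookup table from PRONOM IDs to extended MIME types. Only the `fmt/18` entry comes from the slide; the function names are hypothetical, and the naive split on `;` would mishandle a quoted value that itself contains a semicolon.

```python
# Illustrative lookup table; only the fmt/18 entry is taken from the slide.
PRONOM_TO_MIME = {
    "fmt/18": "application/pdf; version=1.4",
}


def to_extended_mime(puid):
    """Return the extended MIME type for a PRONOM ID, or None if unmapped."""
    return PRONOM_TO_MIME.get(puid)


def parse_extended_mime(value):
    """Split 'type/subtype; key=value; ...' into (base_type, params)."""
    parts = [p.strip() for p in value.split(";")]
    base = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing semicolon
        key, _, val = part.partition("=")
        params[key.strip().lower()] = val.strip().strip('"')
    return base, params


def same_base_type(a, b):
    """Compare two identifiers at the type/subtype level, ignoring parameters."""
    return parse_extended_mime(a)[0] == parse_extended_mime(b)[0]


print(to_extended_mime("fmt/18"))
print(parse_extended_mime('application/pdf; software="Adobe Acrobat 6.0"'))
print(same_base_type("application/pdf", "application/pdf; version=1.4"))
```

Keeping the version, charset and software as parameters means results from different tools can be compared on the base type first and then refined where both sides supply the same parameter.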

  8. Format Profile Dataset
      Server, Tika & DROID-B format profiles, over time
      (columns: server type | Tika type | DROID-B type | year | count):
        image/png | image/png | image/png; version=1.0 | 2004 | 102
        application/pdf | application/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0" | application/pdf; version=1.2 | 2004 | 1
      CC0 – free to download and reuse
      http://data.webarchive.org.uk/opendata/ukwa.ds.2/fmt/
      Please cite us and/or let us know if you use it
      Source code of all tools and modifications also available
      https://github.com/openplanets/nanite
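
As a rough illustration of how such a profile could be turned into per-year shares like those plotted in the results, the sketch below parses rows and normalises counts within each year. It assumes tab-separated rows in the order server type, Tika type, DROID-B type, year, count; that layout and the function names are assumptions, so check the published files before relying on them.

```python
from collections import defaultdict


def parse_row(line):
    """Parse one profile row; the tab-separated column order is an assumption."""
    server, tika, droid, year, count = line.rstrip("\n").split("\t")
    return server, tika, droid, int(year), int(count)


def yearly_share(rows, column=2):
    """Return {year: {base_mime_type: share}} for the chosen identifier column.

    column 0 = server type, 1 = Tika type, 2 = DROID-B type.
    """
    totals = defaultdict(int)
    by_type = defaultdict(lambda: defaultdict(int))
    for row in rows:
        year, count = row[3], row[4]
        base = row[column].split(";")[0].strip()  # drop version/software params
        totals[year] += count
        by_type[year][base] += count
    return {
        year: {mime: n / totals[year] for mime, n in types.items()}
        for year, types in by_type.items()
    }


# The two example rows shown on the slide, re-joined with tabs.
sample = [
    "image/png\timage/png\timage/png; version=1.0\t2004\t102",
    'application/pdf\tapplication/pdf; version=1.2; software="Acrobat Distiller 4.0 for Windows"; source="Adobe PageMaker 6.0"\tapplication/pdf; version=1.2\t2004\t1',
]

shares = yearly_share(map(parse_row, sample))
print(shares)
```

Passing `column=1` would profile the Tika identifications instead, which makes it straightforward to compare the tools on the same rows.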

  9. COMPARING TOOLS
    Results

  10. Coverage & Depth
    [Chart: percentage of resources unidentified vs. year, 1996-2010; series: DROID-B v.59, Apache Tika 1.1]
    No format-version-level information from Apache Tika.

  11. Inconsistencies
      Gaps
        37 formats spotted by DROID-B but not Tika
          Notably includes earlier Office formats
        129 formats spotted by Tika but not DROID-B
          But at least 20 are due to not using the full DROID
      Conflicts
        Failed MIME type mapping, e.g. PDF 1.7 (since fixed)
        ‘Soft’ signatures – e.g. PICT matching 3M JPG (gone)
        DROID strictness – 9M GIF, 4M JPG, 1.3M PDF…
      Both tools bad at non-HTML/XML text formats
        CSS, scripting languages like JS, CSV, TSV, etc.

  12. FORMATS OVER TIME
    Results

  13. Image Formats Over Time
    [Chart: percentage of crawl (log scale, 0.00001% to 100%) vs. year, 1996-2010; series: JPEG, GIF, PNG, ICON, XBM, TIFF]

  14. HTML Versions Over Time
    [Chart: percentage of HTML resources (0% to 100%) vs. year, 1996-2010; series: HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01, XHTML 1.0]

  15. PDF Versions Over Time
    [Chart: percentage of PDF resources (0% to 100%) vs. year, 1996-2010; series: PDF versions 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6]

  16. Format Usage Versus Time
    [Chart: number of resources in archive (log scale, 1 to 10,000,000,000) vs. timespan in years (0 to 18)]

  17. IMPLEMENTATIONS
    Results

  18. PDF Software Over Time
    [Chart: percentage of PDF resources (0% to 100%) vs. year, 1996-2010; recovered labels: Acrobat Distiller, Acrobat, PDFWriter]
    Over 2100 Distinct PDF Software IDs

  19. JPEG Hardware Over Time
    [Chart: percentage of hardware IDs (0% to 100%) vs. year, 1994-2010; recovered labels: DS5, CYBERSHOT, E990, MX1700, NIKON D40]
    Over 2100 Distinct JPEG Hardware IDs

  20. CONCLUSIONS
    Formats over Time

  21. Summary
      Format obsolescence is complex
      Network effects do appear to stabilize formats
      But once-popular formats are fading nevertheless
      More sophisticated approach required
      Please re-use our data, or ask for more
      Firmer conclusions need:
      Richer, more detailed results
      From a wider range of corpora
      This approach only gives creator information
      A different approach will be needed to understand resource consumption (e.g. PPT 4, RealAudio 1)

  22. webarchive.org.uk
    Questions?
