Searching for obsolescence in the UK Web Archive

Presented at "Re:Format - What is file format obsolescence and does it really exist?"

Andy Jackson

June 23, 2016

Transcript

  1. Searching for obsolescence in
    the UK Web Archive
    Andy Jackson
    Web Archive Technical Lead

  2. www.bl.uk 2
    Format ?
    Y/N

  3. What is a Format?
    “A file format is a standard way that information is encoded
    for storage in a computer file.”
    DPC Handbook Glossary
    OAIS

  4. OAIS Representation Information

  5. www.bl.uk 5

  6. National Archives of Australia’s Approach
    • Better approximates how digital media behaves
    – Acknowledges the critical role software plays
    • All models:
    – assume digital objects have a singular interpretation
    – over-simplify the relationship between source & performance
    – ignore how digital objects are created
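
The "singular interpretation" assumption is easy to break in practice. As a minimal sketch (not taken from the talk), the Python below builds a tiny polyglot in memory and runs two independent, deliberately simplified format checks over the same bytes. The helper names `candidate_formats` and `build_polyglot` are hypothetical, and the GIF part is only a signature, not a renderable image.

```python
import io
import zipfile


def candidate_formats(data: bytes) -> list:
    """Return every format whose (simplified) check accepts these bytes."""
    matches = []
    # GIF files start with the ASCII signature GIF87a or GIF89a.
    if data[:6] in (b"GIF87a", b"GIF89a"):
        matches.append("image/gif")
    # ZIP readers locate the end-of-central-directory record near the end
    # of the file, so leading bytes do not stop is_zipfile() from matching.
    if zipfile.is_zipfile(io.BytesIO(data)):
        matches.append("application/zip")
    return matches


def build_polyglot() -> bytes:
    """Concatenate a GIF signature with a complete ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("hello.txt", "hello")
    return b"GIF89a" + buffer.getvalue()


print(candidate_formats(build_polyglot()))
```

Both checks accept the same bytestream, so a model that maps each object to exactly one format has to pick one and discard the other.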

  7. Polyglots

  8. LOADING…
    [ o o ]

  9. Save As…

  10. Theory of Communication

  11. Digital Preservation as Communication

  12. Formats are Communication Protocols
    • Designed to enable the state of a software process to:
    – persist over time
    – be distributed to other people via software systems
    • Success manifests as usage and standardisation
    • ‘Semantics’ means a common interpretation process
    – Formats are structure and interpretation
    • Singular, unambiguous interpretation is not ‘natural’:
    – standardised formats reflect social consensus:
    • captures what aspects of a performance are important to
    a community of users in a given context
    – validation only meaningful when standardisation succeeds

  13. Summary
    • Computer software & hardware make DP DP
    – Software and hardware are what go obsolete
    – Formats define structure and interpretation
    – Format identifiers link bitstreams to software
    • Encoders, decoders & consensus
    – Robustness principle (a.k.a. Postel’s Law)
    – Validation, normalisation not always meaningful
    • Trade-offs when modelling format
    – Our models don’t necessarily have to be this complex
    • But we’ll need to accept exceptions

  14. Reboot ?
    Y/N

  15. Obsolescence in the UK Web Archive
    • Three collections:
    – Selective Archive (since 2004)
    – Legal Deposit Archive (since 2013)
    – JISC Historical Archive (1996-2013)
    • With 20 years of format history, we can:
    – Study how formats change over time
    – Test theories about formats
    – Search for obsolescence:
    • Find examples of difficult or inaccessible resources
    • Document what we find

  16. Analysing the JISC Historical Archive
    • All .uk content from the Internet Archive collection:
    – From 1996 to April 2013
    – 58 TB of compressed, archived data
    – 3,454,906,082 web resources
    • Indexed for full-text search & analytics:
    – See https://www.webarchive.org.uk/shine
    • Scanned for formats during indexing:
    – Combining DROID and Apache Tika
    – Outputs Extended MIME Types:
    • e.g. text/html; version="5.0"; charset="UTF-8"
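
An extended MIME type is an ordinary MIME type with extra parameters, so standard header parsing can split it back into a base type and its parameters for analysis. The sketch below uses Python's standard `email.message.Message` for this; the function name `parse_extended_mime` is hypothetical, and nothing here is claimed to be how the indexer itself combines DROID and Tika output.

```python
from email.message import Message


def parse_extended_mime(value: str):
    """Split an extended MIME type into its base type and parameters."""
    msg = Message()
    msg["Content-Type"] = value
    base_type = msg.get_content_type()
    # get_params() returns (name, value) pairs; the first pair is the
    # base type itself, so skip it.
    params = dict(msg.get_params()[1:])
    return base_type, params


base, params = parse_extended_mime('text/html; version="5.0"; charset="UTF-8"')
print(base, params)
```

Grouping on the base type gives the coarse format counts, while the `version` parameter supports finer breakdowns such as HTML versions over time.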

  17. Rothenberg’s Hypothesis
    • From Jeff Rothenberg:
    – “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
    – “…still apt…” (2012)
    • Does the evidence support this?
    – How long do formats persist in the web archive?
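
One way to pose this question against an index is per format: record the first and last year each format is seen and take the difference as its timespan. The sketch below assumes a hypothetical stream of `(mime_type, year)` observations; the sample records are invented for illustration and are not results from the archive.

```python
from collections import defaultdict


def format_timespans(observations):
    """Map each format to (first_year, last_year, span_in_years, count)."""
    years = defaultdict(list)
    for mime_type, year in observations:
        years[mime_type].append(year)
    return {
        mime_type: (min(seen), max(seen), max(seen) - min(seen), len(seen))
        for mime_type, seen in years.items()
    }


# Invented sample data, for illustration only.
sample = [
    ("text/html", 1996), ("text/html", 2004), ("text/html", 2013),
    ("model/vrml", 1997), ("model/vrml", 2001),
]
for mime_type, summary in format_timespans(sample).items():
    print(mime_type, summary)
```

Plotting the count against the span for every format gives the kind of resources-versus-timespan view the next slide uses.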

  18. False: Network Effects Stabilise Formats
    [Chart: number of resources in archive (log scale, 1 to 10,000,000,000) against timespan in years (0 to 18)]

  19. 623 Recognised Formats

  20. 598,836 Unrecognised Resources

  21. The Top Formats
    • 82.09% HTML
    • 99.82% in just 20 formats
    • The ‘unknown format’ is 19th
    – Those 5,015 extensions
    • Flash stands out
    MIME Type                          # Resources
    text/html                      1,995,507,185
    application/xhtml+xml            840,627,471
    image/jpeg                       242,764,847
    text/plain                       150,670,343
    image/gif                        133,610,487
    application/pdf                   23,965,322
    image/png                         19,865,229
    application/rss+xml               11,528,135
    application/x-empty                8,551,502
    application/xml                    6,387,147
    application/x-shockwave-flash      3,835,244
    application/msword                 3,720,032
    application/atom+xml               1,381,932
    image/vnd.microsoft.icon           1,305,558
    application/vnd.ms-excel           1,201,725
    application/gzip                     988,491
    application/zip                      948,112
    audio/mpeg                           751,160
    application/octet-stream             598,836
    application/rdf+xml                  511,301
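
The headline percentages can be rechecked from the table and the archive total of 3,454,906,082 resources given earlier. The sketch below does that arithmetic; it assumes the 82.09% figure combines `text/html` with `application/xhtml+xml`, which is an inference from the numbers rather than something the slide states.

```python
TOTAL_RESOURCES = 3_454_906_082  # total stated for the JISC Historical Archive

TOP_20 = {
    "text/html": 1_995_507_185,
    "application/xhtml+xml": 840_627_471,
    "image/jpeg": 242_764_847,
    "text/plain": 150_670_343,
    "image/gif": 133_610_487,
    "application/pdf": 23_965_322,
    "image/png": 19_865_229,
    "application/rss+xml": 11_528_135,
    "application/x-empty": 8_551_502,
    "application/xml": 6_387_147,
    "application/x-shockwave-flash": 3_835_244,
    "application/msword": 3_720_032,
    "application/atom+xml": 1_381_932,
    "image/vnd.microsoft.icon": 1_305_558,
    "application/vnd.ms-excel": 1_201_725,
    "application/gzip": 988_491,
    "application/zip": 948_112,
    "audio/mpeg": 751_160,
    "application/octet-stream": 598_836,
    "application/rdf+xml": 511_301,
}

# Assumption: "HTML" on the slide means text/html plus application/xhtml+xml.
html_count = TOP_20["text/html"] + TOP_20["application/xhtml+xml"]
html_share = 100 * html_count / TOTAL_RESOURCES
top_20_share = 100 * sum(TOP_20.values()) / TOTAL_RESOURCES

print(f"HTML + XHTML: {html_share:.2f}%")
print(f"Top 20 formats: {top_20_share:.2f}%")
```

Under that assumption the two shares come out at 82.09% and 99.82%, matching the bullets above.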

  22. Flash (3,835,244)
    • Deprecated by Apple & HTML5:
    – Requires Adobe Flash browser plugin, considered insecure
    • Currently still supported, but for how long?
    • See also Director (68,879) & Shockwave plugin
    • Potential Preservation Actions:
    – Amenable to emulation
    – Convert to HTML5 via Swiffy (Google), Shumway (Mozilla)
    – But Flash files often download further resources:
    • The crawler didn’t understand Flash animations, so these
    resources were rarely captured. e.g. YouTube

  23. Formats Found – Audio
    • MIDI (244,718)
    – Playable in open source tools, e.g. UADE.
    – Missing/wrong voices? Hard to know.
    • RealMedia (105,638) & RealAudio (22,611)
    – Official RealPlayer/browser plugin no longer supported
    – Playable in VLC, convertible via ffmpeg
    • RAM (Real Audio Metadata) (402,146)
    – Just ‘pointers’ to actual audio streams
    • Not understood, so not captured by crawlers
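
Because a RAM file is typically a short plain-text list of stream locations, a crawler that understood it could queue the targets for capture. The sketch below assumes one URL per line with `#` comment lines; the function name `extract_stream_urls`, the scheme list, and the example URLs are all placeholders for illustration.

```python
from urllib.parse import urlparse

STREAM_SCHEMES = {"rtsp", "pnm", "http", "https"}  # assumed set of interest


def extract_stream_urls(ram_text: str) -> list:
    """Return the stream URLs referenced by the text of a .ram file."""
    urls = []
    for line in ram_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if urlparse(line).scheme.lower() in STREAM_SCHEMES:
            urls.append(line)
    return urls


example = (
    "# placeholder playlist\n"
    "rtsp://media.example.org/talk.rm\n"
    "\n"
    "http://www.example.org/audio/clip.ra\n"
)
print(extract_stream_urls(example))
```

Extracting the link is the easy part: `rtsp` and `pnm` targets need a streaming client to capture, which is one reason the audio itself is often missing.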

  24. Formats Found – 3D
    • VRML (54,526)
    – VRML1 support is rare, but conversion to VRML97 possible
    – VRML97 quite well supported, can convert to X3D
    • e.g. penguin1.wrl
    • 3D Virtual Tour formats
    – IPIX (31,288)
    • Proprietary browser plugin or old Java plugin
    – QTVR (18!?)
    • Hard to ID, works in QuickTime 7 but not later versions
    – SVH (6,801)
    • Proprietary browser plugin, inaccessible (e.g.)

  25. Formats Found – Old Platforms
    • Acorn Draw (4,125)
    – Requires RISC OS
    • Available for Raspberry Pi and via two emulators
    – No conversion tools found
    • Spectrum (3,826)
    – Various formats (.z80 1,287 .tzx 1,219 .tap 825 .sna 495)
    – All widely supported by a number of emulators
    • e.g. jetpac.sna
    – Format conversion almost meaningless

  26. HTML Versions
    [Chart: percentage of HTML resources (0% to 100%) by year (1996 to 2010), broken down by version: HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01, XHTML 1.0]
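
A breakdown like this depends on assigning a version to each HTML resource. One common approach, sketched below, is to read the DOCTYPE declaration and match fragments of the published public identifiers. This is an illustrative assumption, not a description of how DROID or Tika identify versions, and the function name `html_version` is hypothetical.

```python
import re

# Fragments of the DTD public identifiers, most specific first so that
# "HTML 4.01" is tested before "HTML 4.0".
DOCTYPE_PATTERNS = [
    (re.compile(r"//DTD XHTML 1\.0", re.I), "XHTML 1.0"),
    (re.compile(r"//DTD HTML 4\.01", re.I), "HTML 4.01"),
    (re.compile(r"//DTD HTML 4\.0", re.I), "HTML 4.0"),
    (re.compile(r"//DTD HTML 3\.2", re.I), "HTML 3.2"),
    (re.compile(r"//DTD HTML 2\.0", re.I), "HTML 2.0"),
]


def html_version(markup: str):
    """Guess an HTML version from the DOCTYPE, or None if not recognised."""
    # Only the start of the document is searched for a DOCTYPE declaration.
    doctype = re.search(r"<!DOCTYPE[^>]*>", markup[:1024], re.I)
    if doctype is None:
        return None
    for pattern, label in DOCTYPE_PATTERNS:
        if pattern.search(doctype.group(0)):
            return label
    return None


page = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html></html>'
print(html_version(page))
```

Pages without a DOCTYPE return `None` here, so a real analysis would need a policy for undeclared or mislabelled markup.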

  27. HTML Features



    <br/>

  28. Conclusions
    • Vast majority of web archive content in stable formats:
    – But it’s a complex picture, driven by network effects
    • Most uncommon formats still accessible, but:
    – Takes technical skill and expert community support
    – Hard to know if you’ve got it right
    • Failure to collect resources is a bigger risk:
    – Embedded references:
    • JavaScript, Flash, VRML, RAM
    – Contextual dependencies:
    • Fonts, plugins, software

  29. Thank you
    • Questions?
