Searching for obsolescence in the UK Web Archive

Presented at "Re:Format - What is file format obsolescence and does it really exist?"

Andy Jackson

June 23, 2016

Transcript

  1. www.bl.uk 3 What is a Format?

    “A file format is a standard way that information is encoded for storage in a computer file.” DPC Handbook Glossary OAIS
  2. National Archives of Australia’s Approach

    • Better approximates how digital media behaves
      – Acknowledges the critical role software plays
    • All models:
      – assume digital objects have a singular interpretation
      – over-simplify the relationship between source & performance
      – ignore how digital objects are created
  3. Formats are Communication Protocols

    • Designed to enable the state of a software process to:
      – persist over time
      – be distributed to other people via software systems
    • Success manifests as usage and standardisation
    • ‘Semantics’ means a common interpretation process
      – Formats are structure and interpretation
    • Singular, unambiguous interpretation is not ‘natural’:
      – standardised formats reflect social consensus:
        • captures which aspects of a performance are important to a community of users in a given context
      – validation is only meaningful when standardisation succeeds
  4. Summary

    • Computer software & hardware make DP DP
      – Software and hardware are what go obsolete
      – Formats define structure and interpretation
      – Format identifiers link bitstreams to software
    • Encoders, decoders & consensus
      – Robustness principle (a.k.a. Postel’s Law)
      – Validation, normalisation not always meaningful
    • Trade-offs when modelling format
      – Our models don’t necessarily have to be this complex
        • But we’ll need to accept exceptions
  5. Obsolescence in the UK Web Archive

    • Three collections:
      – Selective Archive (since 2004)
      – Legal Deposit Archive (since 2013)
      – JISC Historical Archive (1996-2013)
    • With 20 years of format history, we can:
      – Study how formats change over time
      – Test theories about formats
      – Search for obsolescence:
        • Find examples of difficult or inaccessible resources
        • Document what we find
  6. Analysing the JISC Historical Archive

    • All .uk content from the Internet Archive collection:
      – From 1996 to April 2013
      – 58 TB of archived, compressed content
      – 3,454,906,082 web resources
    • Indexed for full-text search & analytics:
      – See https://www.webarchive.org.uk/shine
    • Scanned for formats during indexing:
      – Combining DROID and Apache Tika
      – Outputs Extended MIME Types:
        • e.g. text/html; version="5.0", charset="UTF-8"
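
An extended MIME type like the example on this slide can be split into a base type and its parameters. The following is a minimal sketch, not the indexer's actual code: the function name `parse_extended_mime` is hypothetical, and it assumes parameters may be separated by either `;` or `,` (as in the slide's example) and may be wrapped in straight or curly quotes.

```python
import re


def parse_extended_mime(value):
    """Split an extended MIME type into (base_type, parameters).

    Hypothetical helper for illustration. It does not handle separators
    that appear inside quoted parameter values.
    """
    parts = [p.strip() for p in re.split(r"[;,]", value) if p.strip()]
    base_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        key, sep, raw = part.partition("=")
        if not sep:
            continue  # skip tokens that are not key=value pairs
        # Drop surrounding straight or curly double quotes from the value.
        params[key.strip().lower()] = raw.strip().strip('"“”')
    return base_type, params


base, params = parse_extended_mime('text/html; version="5.0", charset="UTF-8"')
print(base, params)
```

Keeping the version as a parameter means resources can be grouped either by base type alone or by type and version together.
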
  7. Rothenberg’s Hypothesis

    • From Jeff Rothenberg:
      – “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
      – “…still apt…” (2012)
    • Does the evidence support this?
      – How long do formats persist in the web archive?
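
One way to put that question to the data is to compute, for each format, the span between the first and last year in which it appears. This is a sketch under assumptions: the `(format, year)` records below are invented for illustration and are not figures from the archive, and `format_timespans` is a hypothetical helper.

```python
from collections import defaultdict


def format_timespans(records):
    """Return {format: (first_year, last_year, span_in_years, count)}."""
    years = defaultdict(list)
    for fmt, year in records:
        years[fmt].append(year)
    return {
        fmt: (min(ys), max(ys), max(ys) - min(ys), len(ys))
        for fmt, ys in years.items()
    }


# Invented sample records (format, crawl year), for illustration only.
sample = [
    ("text/html", 1996), ("text/html", 2004), ("text/html", 2013),
    ("model/vrml", 1997), ("model/vrml", 2003),
    ("application/x-example", 2001),
]

spans = format_timespans(sample)
for fmt, (first, last, span, count) in sorted(spans.items()):
    print(f"{fmt}: {first}-{last} ({span} years, {count} resources)")
```

Plotting the resource count against the span for every format gives one way of comparing the archive with the five-year claim.
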
  8. False: Network Effects Stabilise Formats

    [Chart: number of resources in archive (log scale, 1 to 10,000,000,000) against timespan in years (0 to 18)]
  9. The Top Formats

    • 82.09% HTML
    • 99.82% in just 20 formats
    • The ‘unknown format’ is 19th
      – Those 5,015 extensions
    • Flash stands out

    MIME Type                        # Resources
    text/html                      1,995,507,185
    application/xhtml+xml            840,627,471
    image/jpeg                       242,764,847
    text/plain                       150,670,343
    image/gif                        133,610,487
    application/pdf                   23,965,322
    image/png                         19,865,229
    application/rss+xml               11,528,135
    application/x-empty                8,551,502
    application/xml                    6,387,147
    application/x-shockwave-flash      3,835,244
    application/msword                 3,720,032
    application/atom+xml               1,381,932
    image/vnd.microsoft.icon           1,305,558
    application/vnd.ms-excel           1,201,725
    application/gzip                     988,491
    application/zip                      948,112
    audio/mpeg                           751,160
    application/octet-stream             598,836
    application/rdf+xml                  511,301
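
The headline percentages can be reproduced from the table. The sketch below assumes the denominator is the 3,454,906,082 resources quoted for the JISC Historical Archive, that "HTML" counts `text/html` and `application/xhtml+xml` together, and that the ‘unknown format’ corresponds to `application/octet-stream`; none of these is stated outright on the slide.

```python
# Resource counts copied from the table on this slide.
counts = {
    "text/html": 1_995_507_185,
    "application/xhtml+xml": 840_627_471,
    "image/jpeg": 242_764_847,
    "text/plain": 150_670_343,
    "image/gif": 133_610_487,
    "application/pdf": 23_965_322,
    "image/png": 19_865_229,
    "application/rss+xml": 11_528_135,
    "application/x-empty": 8_551_502,
    "application/xml": 6_387_147,
    "application/x-shockwave-flash": 3_835_244,
    "application/msword": 3_720_032,
    "application/atom+xml": 1_381_932,
    "image/vnd.microsoft.icon": 1_305_558,
    "application/vnd.ms-excel": 1_201_725,
    "application/gzip": 988_491,
    "application/zip": 948_112,
    "audio/mpeg": 751_160,
    "application/octet-stream": 598_836,
    "application/rdf+xml": 511_301,
}

# Assumed denominator: total web resources in the JISC Historical Archive.
TOTAL_RESOURCES = 3_454_906_082

html = counts["text/html"] + counts["application/xhtml+xml"]
html_share = 100 * html / TOTAL_RESOURCES
top20_share = 100 * sum(counts.values()) / TOTAL_RESOURCES

# Rank formats by count to locate the 'unknown' bucket.
ranked = sorted(counts, key=counts.get, reverse=True)
unknown_rank = ranked.index("application/octet-stream") + 1

print(f"HTML + XHTML: {html_share:.2f}%")
print(f"Top 20 formats: {top20_share:.2f}%")
print(f"application/octet-stream rank: {unknown_rank}")
```

Under those assumptions the computed shares agree with the 82.09% and 99.82% figures on the slide.
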
  10. Flash (3,835,244)

    • Deprecated by Apple & HTML5:
      – Requires Adobe Flash browser plugin, considered insecure
    • Currently still supported, but for how long?
    • See also Director (68,879) & Shockwave plugin
    • Potential Preservation Actions:
      – Amenable to emulation
      – Convert to HTML5 via Swiffy (Google), Shumway (Mozilla)
      – But Flash files often download further resources:
        • The crawler didn’t understand Flash animations, so these resources were rarely captured, e.g. YouTube
  11. Formats Found – Audio

    • MIDI (244,718)
      – Playable in open source tools, e.g. UADE.
      – Missing/wrong voices? Hard to know.
    • RealMedia (105,638) & RealAudio (22,611)
      – Official RealPlayer/browser plugin no longer supported
      – Playable in VLC, convertible via ffmpeg
    • RAM (Real Audio Metadata, 402,146)
      – Just ‘pointers’ to actual audio streams
        • Not understood, so not captured by crawlers
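
Because a RAM file only points at the real audio, a crawler would have to read it and queue the targets it names. A minimal sketch, assuming a RAM file is plain text with one stream URL per line and `#` marking comment lines; `extract_ram_targets`, the scheme list and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse

# Schemes treated as stream pointers in this sketch (an assumption).
STREAM_SCHEMES = {"http", "https", "rtsp", "pnm"}


def extract_ram_targets(text):
    """Return the stream URLs listed in the text of a RAM file."""
    targets = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if urlparse(line).scheme.lower() in STREAM_SCHEMES:
            targets.append(line)
    return targets


# Hypothetical RAM file content, for illustration only.
example = """# hypothetical playlist
rtsp://media.example.org/audio/interview.ra
http://www.example.org/audio/jingle.rm

not a url
"""

targets = extract_ram_targets(example)
print(targets)
```

Only the `http` and `https` targets could be fetched by an ordinary web crawler; streaming schemes such as `rtsp` would need a dedicated client.
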
  12. Formats Found – 3D

    • VRML (54,526)
      – VRML1 support is rare, but conversion to VRML97 possible
      – VRML97 quite well supported, can convert to X3D
        • e.g. penguin1.wrl
    • 3D Virtual Tour formats
      – IPIX (31,288)
        • Proprietary browser plugin or old Java plugin
      – QTVR (18!?)
        • Hard to ID, works in QuickTime 7 but not later versions
      – SVH (6,801)
        • Proprietary browser plugin, inaccessible (e.g.)
  13. Formats Found – Old Platforms

    • Acorn Draw (4,125)
      – Requires RISC OS
        • Available for Raspberry Pi and via two emulators
      – No conversion tools found
    • Spectrum (3,826)
      – Various formats (.z80 1,287; .tzx 1,219; .tap 825; .sna 495)
      – All widely supported by a number of emulators
        • e.g. jetpac.sna
      – Format conversion almost meaningless
  14. HTML Versions

    [Chart: percentage of HTML resources (0% to 100%) by year (1996 to 2010), with series for HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01 and XHTML 1.0]
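
The archive's version figures came from the DROID and Apache Tika scan. As a much simpler stand-in, an HTML version can often be guessed from the DOCTYPE public identifier. The sketch below shows that substitute technique, not the archive's method; `guess_html_version` is a hypothetical name, and pages without a recognised, versioned DOCTYPE return `None`.

```python
import re

# Matches public identifiers such as "-//W3C//DTD HTML 4.01 Transitional//EN".
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+html\s+PUBLIC\s+"-//(?:W3C|IETF)//DTD\s+(X?HTML)\s+(\d+(?:\.\d+)?)',
    re.IGNORECASE,
)


def guess_html_version(markup):
    """Guess e.g. 'HTML 4.01' or 'XHTML 1.0' from the start of a page."""
    # Only the first 1024 characters are inspected (an arbitrary cut-off).
    match = DOCTYPE_RE.search(markup[:1024])
    if match is None:
        return None
    return f"{match.group(1).upper()} {match.group(2)}"


pages = [
    '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"><html></html>',
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN"><html></html>',
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html></html>',
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html></html>',
    "<html><body>No doctype</body></html>",
]

versions = [guess_html_version(page) for page in pages]
print(versions)
```

Counting these guesses per crawl year and dividing by the yearly total of HTML resources would give percentages of the kind the chart plots.
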
  15. Conclusions

    • Vast majority of web archive content in stable formats:
      – But it’s a complex picture, driven by network effects
    • Most uncommon formats still accessible, but:
      – Takes technical skill and expert community support
      – Hard to know if you’ve got it right
    • Failure to collect resources is a bigger risk:
      – Embedded references:
        • JavaScript, Flash, VRML, RAM
      – Contextual dependencies:
        • Fonts, plugins, software