Searching for obsolescence in the UK Web Archive

Presented at "Re:Format - What is file format obsolescence and does it really exist?"

Andy Jackson

June 23, 2016

Transcript

  1. Searching for obsolescence in the UK Web Archive Andy Jackson

    Web Archive Technical Lead
  2. www.bl.uk 2 Format ? Y/N

  3. What is a Format?

    “A file format is a standard way that information is encoded for storage in a computer file.” DPC Handbook Glossary OAIS
  4. OAIS Representation Information

  5. www.bl.uk 5

  6. National Archives of Australia’s Approach

    • Better approximation of how digital media behaves
      – Acknowledges the critical role software plays
    • All models:
      – assume digital objects have a singular interpretation
      – over-simplify the relationship between source & performance
      – ignore how digital objects are created
  7. Polyglots

  8. LOADING… [ o o ]

  9. Save As…

  10. Theory of Communication

  11. Digital Preservation as Communication

  12. Formats are Communication Protocols

    • Designed to enable the state of a software process to:
      – persist over time
      – be distributed to other people via software systems
    • Success manifests as usage and standardisation
    • ‘Semantics’ means a common interpretation process
      – Formats are structure and interpretation
    • Singular, unambiguous interpretation is not ‘natural’:
      – standardised formats reflect social consensus:
        • captures what aspects of a performance are important to a community of users in a given context
      – validation is only meaningful when standardisation succeeds
  13. Summary

    • Computer software & hardware make DP DP
      – Software and hardware are what go obsolete
      – Formats define structure and interpretation
      – Format identifiers link bitstreams to software
    • Encoders, decoders & consensus
      – Robustness principle (a.k.a. Postel’s Law)
      – Validation, normalisation not always meaningful
    • Trade-offs when modelling format
      – Our models don’t necessarily have to be this complex
        • But we’ll need to accept exceptions
  14. Reboot ? Y/N

  15. Obsolescence in the UK Web Archive

    • Three collections:
      – Selective Archive (since 2004)
      – Legal Deposit Archive (since 2013)
      – JISC Historical Archive (1996-2013)
    • With 20 years of format history, we can:
      – Study how formats change over time
      – Test theories about formats
      – Search for obsolescence:
        • Find examples of difficult or inaccessible resources
        • Document what we find
  16. Analysing the JISC Historical Archive

    • All .uk content from the Internet Archive collection:
      – From 1996 to April 2013
      – 58 TB (archived, compressed)
      – 3,454,906,082 web resources
    • Indexed for full-text search & analytics:
      – See https://www.webarchive.org.uk/shine
    • Scanned for formats during indexing:
      – Combining DROID and Apache Tika
      – Outputs Extended MIME Types:
        • e.g. text/html; version=“5.0”, charset=“UTF-8”
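The Extended MIME Types mentioned on this slide pair a base type with parameters such as version and charset. A minimal sketch of how such a string could be split apart is given below. The function name `parse_extended_mime` is hypothetical and is not part of DROID, Apache Tika, or the archive's indexer. It assumes parameters are separated by `;` or `,` (the slide's example uses a comma) and that quoted values contain neither character.

```python
import re


def parse_extended_mime(value):
    """Split an extended MIME type into (base_type, parameters).

    Hypothetical helper for illustration only. Assumes parameters are
    separated by ';' or ',' and that values contain neither character.
    """
    parts = [p.strip() for p in re.split(r"[;,]", value) if p.strip()]
    base_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        key, _, raw = part.partition("=")
        # Drop straight or typographic double quotes around the value.
        params[key.strip().lower()] = raw.strip().strip('"“”')
    return base_type, params


base, params = parse_extended_mime('text/html; version="5.0", charset="UTF-8"')
print(base, params)
```

Grouping resources by the base type alone gives format-level counts, while keeping the `version` parameter allows per-version breakdowns such as the HTML versions chart later in the deck.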
  17. Rothenberg’s Hypothesis

    • From Jeff Rothenberg:
      – “Digital Information Lasts Forever – Or Five Years, Whichever Comes First.” (1997)
      – “…still apt…” (2012)
    • Does the evidence support this?
      – How long do formats persist in the web archive?
  18. False: Network Effects Stabilise Formats

    [Chart: Number of Resources in Archive (logarithmic axis, 1 to 10,000,000,000) against Timespan [Years] (0 to 18)]
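The question of how long formats persist can be framed as a per-format timespan: the gap between the first and last year in which a format is observed. The sketch below shows one way to compute it. The function name and the sample observations are invented for illustration and are not drawn from the archive or from the analysis behind the chart.

```python
from collections import defaultdict


def format_timespans(observations):
    """Map each MIME type to (first_year, last_year, span_in_years)."""
    years = defaultdict(set)
    for mime_type, year in observations:
        years[mime_type].add(year)
    return {
        mime: (min(seen), max(seen), max(seen) - min(seen))
        for mime, seen in years.items()
    }


# Hypothetical (mime_type, crawl_year) pairs, invented for illustration only.
sample = [
    ("image/gif", 1996), ("image/gif", 2004), ("image/gif", 2013),
    ("application/x-example", 1999), ("application/x-example", 2001),
]

spans = format_timespans(sample)
# Formats observed over more than the five years in Rothenberg's quip.
outlived_five_years = sorted(m for m, (_, _, span) in spans.items() if span > 5)
print(spans)
print(outlived_five_years)
```

Plotting each format's span against its resource count would give a chart with the same axes as the one on this slide.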
  19. 623 Recognised Formats

  20. 598,836 Unrecognised Resources

  21. The Top Formats

    • 82.09% HTML
    • 99.82% in just 20 formats
    • The ‘unknown format’ is 19th
      – Those 5,015 extensions
    • Flash stands out

    MIME Type                        # Resources
    text/html                        1,995,507,185
    application/xhtml+xml              840,627,471
    image/jpeg                         242,764,847
    text/plain                         150,670,343
    image/gif                          133,610,487
    application/pdf                     23,965,322
    image/png                           19,865,229
    application/rss+xml                 11,528,135
    application/x-empty                  8,551,502
    application/xml                      6,387,147
    application/x-shockwave-flash        3,835,244
    application/msword                   3,720,032
    application/atom+xml                 1,381,932
    image/vnd.microsoft.icon             1,305,558
    application/vnd.ms-excel             1,201,725
    application/gzip                       988,491
    application/zip                        948,112
    audio/mpeg                             751,160
    application/octet-stream               598,836
    application/rdf+xml                    511,301
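The two headline percentages can be reproduced from the counts on this slide together with the archive total of 3,454,906,082 resources given on slide 16. The sketch below assumes that "HTML" means text/html plus application/xhtml+xml. That reading is an inference from the arithmetic, not something the slide states.

```python
# Resource counts as listed on the slide, in rank order.
TOP_FORMATS = {
    "text/html": 1_995_507_185,
    "application/xhtml+xml": 840_627_471,
    "image/jpeg": 242_764_847,
    "text/plain": 150_670_343,
    "image/gif": 133_610_487,
    "application/pdf": 23_965_322,
    "image/png": 19_865_229,
    "application/rss+xml": 11_528_135,
    "application/x-empty": 8_551_502,
    "application/xml": 6_387_147,
    "application/x-shockwave-flash": 3_835_244,
    "application/msword": 3_720_032,
    "application/atom+xml": 1_381_932,
    "image/vnd.microsoft.icon": 1_305_558,
    "application/vnd.ms-excel": 1_201_725,
    "application/gzip": 988_491,
    "application/zip": 948_112,
    "audio/mpeg": 751_160,
    "application/octet-stream": 598_836,
    "application/rdf+xml": 511_301,
}

# Total number of web resources in the archive, from slide 16.
TOTAL_RESOURCES = 3_454_906_082

# Assumption: "HTML" on the slide covers both HTML and XHTML.
html_count = TOP_FORMATS["text/html"] + TOP_FORMATS["application/xhtml+xml"]
html_share = 100 * html_count / TOTAL_RESOURCES
top20_share = 100 * sum(TOP_FORMATS.values()) / TOTAL_RESOURCES

print(f"HTML share:   {html_share:.2f}%")
print(f"Top 20 share: {top20_share:.2f}%")
```

Under that assumption both figures come out at the values quoted on the slide, 82.09% and 99.82%.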
  22. Flash (3,835,244)

    • Deprecated by Apple & HTML5:
      – Requires Adobe Flash browser plugin, considered insecure
    • Currently still supported, but for how long?
    • See also Director (68,879) & Shockwave plugin
    • Potential Preservation Actions:
      – Amenable to emulation
      – Convert to HTML5 via Swiffy (Google), Shumway (Mozilla)
      – But Flash files often download further resources:
        • The crawler didn’t understand Flash animations, so these resources were rarely captured, e.g. YouTube
  23. Formats Found – Audio

    • MIDI (244,718)
      – Playable in open source tools, e.g. UADE.
      – Missing/wrong voices? Hard to know.
    • RealMedia (105,638) & RealAudio (22,611)
      – Official RealPlayer/browser plugin no longer supported
      – Playable in VLC, convertible via ffmpeg
    • RAM (Real Audio Metadata, 402,146)
      – Just ‘pointers’ to actual audio streams
        • Not understood, so not captured by crawlers
  24. Formats Found – 3D

    • VRML (54,526)
      – VRML1 support is rare, but conversion to VRML97 possible
      – VRML97 quite well supported, can convert to X3D
        • e.g. penguin1.wrl
    • 3D Virtual Tour formats
      – IPIX (31,288)
        • Proprietary browser plugin or old Java plugin
      – QTVR (18!?)
        • Hard to ID, works in QuickTime 7 but not later versions
      – SVH (6,801)
        • Proprietary browser plugin, inaccessible (e.g.)
  25. Formats Found – Old Platforms

    • Acorn Draw (4,125)
      – Requires RISC OS
        • Available for Raspberry Pi and via two emulators
      – No conversion tools found
    • Spectrum (3,826)
      – Various formats (.z80 1,287; .tzx 1,219; .tap 825; .sna 495)
      – All widely supported by a number of emulators
        • e.g. jetpac.sna
      – Format conversion almost meaningless
  26. HTML Versions

    [Chart: Percentage of HTML Resources (0% to 100%) by Year (1996 to 2010), with series for HTML 2.0, HTML 3.2, HTML 4.0, HTML 4.01 and XHTML 1.0]
  27. HTML Features <applet> <blink> <font> <script>

  28. Conclusions

    • Vast majority of web archive content in stable formats:
      – But it’s a complex picture, driven by network effects
    • Most uncommon formats still accessible, but:
      – Takes technical skill and expert community support
      – Hard to know if you’ve got it right
    • Failure to collect resources is a bigger risk:
      – Embedded references:
        • JavaScript, Flash, VRML, RAM
      – Contextual dependencies:
        • Fonts, plugins, software
  29. Thank you • Questions?