Unified characterisation, please

Slides given at the SPRUCEdp Unified Characterisation event.


Andy Jackson

March 11, 2013


  1. Unified Characterisation • Andrew Jackson • Web Archiving Technical Lead • British Library

  2. The Practitioners Have Spoken…
     • Quality Assurance (of broken or potentially broken data):
       - Quality assurance, Bit rot, and Integrity
     • Appraisal and Assessment:
       - Appraisal and assessment, Conformance, Unknown characteristics, and Unknown file formats.
       - Identify/Locate Preservation Worthy Data
     • Identify Preservation Risks:
       - Obsolescence, preservation risk and business constraint
     • Long tail of many other issues:
       - Contextual and Data capture issues through to Embedded objects, and broader issues around Value and cost.
     • Plus: Sustainable Tools
  3. Appraisal and Assessment
     Conformance, Unknown characteristics, and Unknown file formats. Identify/Locate Preservation Worthy Data
     • Identification
       - Always used to ‘route’ data to software that can understand it.
       - Use minimum information to identify:
       - e.g. header only if possible. “Truncated PDF”, not “UNKNOWN”. GIS shapefiles: .shp, .shx, but with a missing .dbf should be reported as such.
     • Validation
       - Two modes needed: “Fast fail”, “Log and continue”/Quirks
       - Stop baseless distinction between “Well formed” and “Valid”
       - Validation is irrelevant to digital preservation assessment:
       - e.g. Effective “PDF/A”, without the 1.4 and XMP chunk.
       - We’re on the wrong side of Postel’s Law.
       - Unknown completeness and failure to future-proof:
       - e.g. JHOVE tries to validate versions of PDF it cannot know.
       - e.g. Tools sometimes interpret/migrate data opaquely.
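The identification points on this slide (use the header where possible, report "Truncated PDF" rather than "UNKNOWN", flag a shapefile whose .dbf is missing) can be sketched roughly as follows. This is a minimal illustration, not how DROID, Tika or any other tool actually works: the function name `identify`, the `probe` parameter and the returned label strings are all assumptions made for the example. The only format facts relied on are the `%PDF-` header, the `%%EOF` end marker, and the shapefile file code 9994 stored big-endian in the first four bytes.

```python
from pathlib import Path

PDF_MAGIC = b"%PDF-"
SHP_MAGIC = b"\x00\x00\x27\x0a"  # shapefile file code 9994, big-endian


def identify(path, probe=1024):
    """Identify a file from its first and last `probe` bytes only.

    Hypothetical helper: the labels returned are illustrative, not a
    registry vocabulary such as PRONOM.
    """
    path = Path(path)
    size = path.stat().st_size
    with path.open("rb") as fh:
        head = fh.read(probe)
        fh.seek(max(0, size - probe))
        tail = fh.read(probe)

    if head.startswith(PDF_MAGIC):
        # The header is enough to route the file; a missing end marker
        # is reported as a condition of a known format, not as UNKNOWN.
        return "PDF" if b"%%EOF" in tail else "Truncated PDF"

    if head.startswith(SHP_MAGIC) and path.suffix.lower() == ".shp":
        # A shapefile is a multi-file format: check for its siblings.
        missing = [ext for ext in (".shx", ".dbf")
                   if not path.with_suffix(ext).exists()]
        if missing:
            return "ESRI Shapefile (missing " + ", ".join(missing) + ")"
        return "ESRI Shapefile"

    return "UNKNOWN"
```

The design choice being illustrated is that a partial match still yields a routable answer with a stated defect, so downstream software is told what it is probably looking at and what is wrong with it.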
  4. Identify Preservation Risks
     Obsolescence, preservation risk and business constraint
     • Significant Properties are irrelevant here.
       - It’s not really about the content, but about the context.
     • Dependency Analysis:
       - What software does this need?
       - Does this file use format features that are not well supported across implementations?
       - What other resources are transcluded?
       - Fonts? c.f. OfficeDDT.
       - Remote embeds?
       - Embedded scripts that might mask dependencies?
       - Do some operations require a password?
       - e.g. JHOVE cannot spot ‘harmless’ PDF encryption.
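As a rough illustration of the dependency questions above, the sketch below scans raw PDF bytes for a handful of standard PDF name tokens (`/Encrypt`, `/JavaScript`, `/JS`, `/URI`, `/EmbeddedFile`, `/Font`, `/FontFile`). It is only a crude first pass under stated assumptions: the function name `scan_pdf_dependencies` and the report keys are invented for this example, and tokens hidden inside compressed object streams will not be seen, so a `False` here does not prove a dependency is absent.

```python
import re

# Report key -> PDF name token to look for (keys are illustrative).
MARKERS = {
    "encryption": rb"/Encrypt",
    "scripts": rb"/(?:JavaScript|JS)",
    "remote_links": rb"/URI",
    "embedded_files": rb"/EmbeddedFile",
    "fonts_declared": rb"/Font",
    "fonts_embedded": rb"/FontFile[23]?",
}

# A name must not continue with further letters or digits, so that
# /Encrypt does not match /EncryptMetadata and /Font does not match
# /FontDescriptor or /FontFile.
NAME_END = rb"(?![A-Za-z0-9])"


def scan_pdf_dependencies(data):
    """Return {report key: bool} for markers found in uncompressed PDF bytes."""
    return {
        label: re.search(pattern + NAME_END, data) is not None
        for label, pattern in MARKERS.items()
    }
```

A file that declares fonts but embeds none, or that carries an `/Encrypt` entry, is a candidate for closer inspection with a real PDF parser; this scan only decides where to look.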
  5. Sustainable Tools: Our Tools
     • Pure-Java Characterisation:
       - JHOVE (‘clean room’ implementation)
       - New Zealand Metadata Extractor (NZME)
       - Apache Tika
     • Java-based aggregation of various CLI tools:
       - JHOVE2
       - FITS
     • Other Characterisation:
       - XCL – C++/XML ‘clean room’ extended with ImageMagick
       - Many more, inc. forensics, BitCurator, OfficeDDT, jpylyzer...
     • Identification:
       - DROID, FIDO, Apache Tika, File
     • Visualisation:
       - C3PO, and many non-specialised tools.
  6. Sustainable Tools: Up to date? Working together?
     • Software Dependency Management:
       - FITS/JHOVE2 embed old DROID versions, hard to upgrade.
       - Dead dependencies: FITS and FFIdent, NZME and Jflac.
       - Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
       - Embed shared modules instead?
     • Software Project Management and Communication:
       - JHOVE, JHOVE2? FITS?
       - JHOVE2 only compiles on Sheila’s branch?
       - Roadmaps, issue management, testing, C.I., etc.
       - Cross-project coordination and bug-fixing?
     • Complexity: JHOVE2, XCL, extremely complex
       - JHOVE2 Berkeley DB causes checksum failures in tests
       - Tika solves same problem using SAX
  7. Sustainable Tools: Shared tests?
     • Separate projects arise from separate workflows
       - Start by understanding commonality and finding gaps?
     • Share test cases and compare results?
       - The OPF Format Corpus contains various valid and invalid files.
       - Built by practitioners to test real use cases.
       - e.g. JP2 features, PDF Cabinet of Horrors.
       - Do the tools give consistent and complementary results?
       - Let’s find out!
       - c.f. Dave Tarrant’s REF for Identification:
       - http://data.openplanetsfoundation.org/ref/
       - http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
  8. Bit-mashing as Tool QA
     • Bitwise exploration of data sensitivity.
     • One way to compare tools.
     • Helps understand formats.
     • c.f. Jay Gattuso’s recent OPF blog.
  9. Quality Assurance (of broken or potentially broken data)
     Quality assurance, Bit rot, and Integrity
     • JHOVE let failed TIFF-JP2 through…
       - Jpylyzer does better.
       - Both fall far short of actual rendering.
  10. Where's the unification? Where should we work together?
     • Shared test corpora and test framework:
       - Start with the OPF Format Corpus?
       - Pull other corpora in by reference:
       - http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
       - Sustainable version of Dave Tarrant’s REF?
       - Extend with bit-mashing to compare tools?
     • Aim to coordinate more:
       - Make it clear where to go? (More about OfficeDDT).
       - Consider merging projects?
       - Consider sharing underlying libraries?
       - Consider building Tika modules?
       - Please consider Apache Preflight as base for PDF validation.