Unified characterisation, please

Slides given at the SPRUCEdp Unified Characterisation event.

Andy Jackson

March 11, 2013

Transcript

  1. The Practitioners Have Spoken…

    - Quality Assurance (of broken or potentially broken data):
      - Quality assurance, Bit rot, and Integrity
    - Appraisal and Assessment:
      - Appraisal and assessment, Conformance, Unknown characteristics, and Unknown file formats.
      - Identify/Locate Preservation Worthy Data
    - Identify Preservation Risks:
      - Obsolescence, preservation risk and business constraint
    - Long tail of many other issues:
      - Contextual and Data capture issues through to Embedded objects, and broader issues around Value and cost.
    - Plus: Sustainable Tools
  2. Appraisal and Assessment

    Conformance, Unknown characteristics, and Unknown file formats. Identify/Locate Preservation Worthy Data

    - Identification
      - Always used to ‘route’ data to software that can understand it.
      - Use minimum information to identify:
        - e.g. header only if possible. “Truncated PDF”, not “UNKNOWN”.
        - GIS shapefiles: a .shp and .shx with a missing .dbf should be reported as such.
    - Validation
      - Two modes needed: “Fast fail”, “Log and continue”/Quirks
      - Stop baseless distinction between “Well formed” and “Valid”
      - Validation is irrelevant to digital preservation assessment:
        - e.g. Effective “PDF/A”, without the 1.4 and XMP chunk.
      - We’re on the wrong side of Postel’s Law.
      - Unknown completeness and failure to future-proof:
        - e.g. JHOVE tries to validate versions of PDF it cannot know.
        - e.g. Tools sometimes interpret/migrate data opaquely.
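
The "minimum information" idea on this slide can be sketched as follows. This is only an illustration, not how DROID, Tika or any other tool works: the function names and result labels are hypothetical, and looking for `%%EOF` in the last 1024 bytes is an assumed heuristic for spotting truncation.

```python
import os

PDF_MAGIC = b"%PDF-"   # PDF files start with this header
PDF_EOF = b"%%EOF"     # end-of-file marker expected near the end

def identify_pdf(data: bytes) -> str:
    """Classify bytes using only the header and the end-of-file marker."""
    if not data.startswith(PDF_MAGIC):
        return "UNKNOWN"
    # The header alone is enough to route the data to PDF-aware software.
    # A missing end marker is reported as a finding, not as UNKNOWN.
    # Assumption: searching only the last 1024 bytes is good enough here.
    if PDF_EOF not in data[-1024:]:
        return "Truncated PDF"
    return "PDF"

def describe_shapefile(filenames) -> str:
    """Report which mandatory shapefile components are present, by extension."""
    required = (".shp", ".shx", ".dbf")
    present = {os.path.splitext(name)[1].lower() for name in filenames}
    if ".shp" not in present:
        return "UNKNOWN"
    missing = [ext for ext in required if ext not in present]
    if missing:
        return "Shapefile (missing " + ", ".join(missing) + ")"
    return "Shapefile"
```

For example, `identify_pdf(b"%PDF-1.4\n1 0 obj")` reports a truncated PDF rather than giving up, and `describe_shapefile(["roads.shp", "roads.shx"])` names the missing `.dbf`.
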
  3. Identify Preservation Risks

    Obsolescence, preservation risk and business constraint

    - Significant Properties are irrelevant here.
      - It’s not really about the content, but about the context.
    - Dependency Analysis:
      - What software does this need?
      - Does this file use format features that are not well supported across implementations?
      - What other resources are transcluded?
        - Fonts? cf. OfficeDDT.
        - Remote embeds?
        - Embedded scripts that might mask dependencies?
      - Do some operations require a password?
        - e.g. JHOVE cannot spot ‘harmless’ PDF encryption.
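
A very rough sketch of what a dependency scan for PDF might look like is given below. It is a crude byte search for a few PDF name tokens, not a parser: it will miss anything stored inside compressed object streams and can produce false positives. The category labels and the function name are hypothetical.

```python
# PDF name tokens that hint at dependencies outside the page content.
DEPENDENCY_MARKERS = {
    "encryption": (b"/Encrypt",),
    "embedded scripts": (b"/JavaScript", b"/JS"),
    "external references": (b"/URI", b"/GoToR", b"/Launch"),
    "embedded font programs": (b"/FontFile",),  # also matches /FontFile2, /FontFile3
}

def scan_pdf_dependencies(data: bytes) -> dict:
    """Return, per category, whether at least one marker occurs in the raw bytes."""
    return {
        category: any(marker in data for marker in markers)
        for category, markers in DEPENDENCY_MARKERS.items()
    }
```

A file that references fonts but shows no embedded font programs, or that carries an `/Encrypt` entry, would then be flagged for closer inspection with a real PDF parser.
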
  4. Sustainable Tools: Our Tools

    - Pure-Java Characterisation:
      - JHOVE (‘clean room’ implementation)
      - New Zealand Metadata Extractor (NZME)
      - Apache Tika
    - Java-based aggregation of various CLI tools:
      - JHOVE2
      - FITS
    - Other Characterisation:
      - XCL – C++/XML ‘clean room’ extended with ImageMagick
      - Many more, inc. forensics, BitCurator, OfficeDDT, jpylyzer...
    - Identification:
      - DROID, FIDO, Apache Tika, File
    - Visualisation:
      - C3PO, and many non-specialised tools.
  5. Sustainable Tools: Up to date? Working together?

    - Software Dependency Management:
      - FITS/JHOVE2 embed old DROID versions, hard to upgrade.
      - Dead dependencies: FITS and FFIdent, NZME and Jflac.
      - Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
      - Embed shared modules instead?
    - Software Project Management and Communication:
      - JHOVE, JHOVE2? FITS?
      - JHOVE2 only compiles on Sheila’s branch?
      - Roadmaps, issue management, testing, C.I., etc.
      - Cross-project coordination and bug-fixing?
    - Complexity: JHOVE2, XCL, extremely complex
      - JHOVE2 Berkeley DB causes checksum failures in tests
      - Tika solves same problem using SAX
  6. Sustainable Tools: Shared tests?

    - Separate projects arise from separate workflows
      - Start by understanding commonality and finding gaps?
    - Share test cases and compare results?
      - The OPF Format Corpus contains various valid and invalid files.
      - Built by practitioners to test real use cases.
      - e.g. JP2 features, PDF Cabinet of Horrors.
      - Do the tools give consistent and complementary results?
      - Let’s find out!
      - cf. Dave Tarrant’s REF for Identification:
        - http://data.openplanetsfoundation.org/ref/
        - http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
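
One way to ask "do the tools give consistent results?" over a shared corpus is sketched below. The two identifier functions are hypothetical stand-ins; in practice each callable would wrap a real tool such as DROID or Tika, and how to invoke those is deliberately left out here.

```python
def compare_tools(corpus, tools):
    """Run every tool on every file and split the files by agreement.

    corpus: mapping of file name -> file content (bytes)
    tools:  mapping of tool name -> callable(bytes) -> str
    """
    report = {"consistent": {}, "conflicting": {}}
    for name, data in corpus.items():
        results = {tool: run(data) for tool, run in tools.items()}
        agreed = len(set(results.values())) == 1
        report["consistent" if agreed else "conflicting"][name] = results
    return report

# Hypothetical stand-in identifiers, for illustration only.
def by_header(data: bytes) -> str:
    return "PDF" if data.startswith(b"%PDF-") else "UNKNOWN"

def by_trailer(data: bytes) -> str:
    return "PDF" if b"%%EOF" in data[-1024:] else "UNKNOWN"

corpus = {
    "good.pdf": b"%PDF-1.7\n...\n%%EOF\n",
    "cut.pdf": b"%PDF-1.7\n...",
}
report = compare_tools(corpus, {"header": by_header, "trailer": by_trailer})
```

Here `report["conflicting"]` contains `cut.pdf`, which is exactly the kind of file worth adding to a shared test corpus.
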
  7. Bit-mashing as Tool QA

    - Bitwise exploration of data sensitivity.
    - One way to compare tools.
    - Helps understand formats.
    - cf. Jay Gattuso’s recent OPF blog.
  8. Quality Assurance (of broken or potentially broken data)

    Quality assurance, Bit rot, and Integrity

    - JHOVE let failed TIFF-JP2 through…
      - Jpylyzer does better.
      - Both fall far short of actual rendering.
  9. Where's the unification? Where should we work together?

    - Shared test corpora and test framework:
      - Start with the OPF Format Corpus?
      - Pull other corpora in by reference:
        - http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
      - Sustainable version of Dave Tarrant’s REF?
      - Extend with bit-mashing to compare tools?
    - Aim to coordinate more:
      - Make it clear where to go? (More about OfficeDDT).
      - Consider merging projects?
      - Consider sharing underlying libraries?
      - Consider building Tika modules?
      - Please consider Apache Preflight as base for PDF validation.