Slides given at the SPRUCEdp Unified Characterisation event.
• Andrew Jackson
• Web Archiving Technical Lead
• British Library
Unified Characterisation, Please
The Practitioners Have Spoken…
n Quality Assurance (of broken or potentially broken data):
§ Quality assurance, Bit rot, and Integrity
n Appraisal and Assessment:
§ Appraisal and assessment, Conformance, Unknown
characteristics, and Unknown file formats.
§ Identify/Locate Preservation Worthy Data
n Identify Preservation Risks:
§ Obsolescence, preservation risk and business constraint
n Long tail of many other issues:
§ Contextual and Data capture issues through to Embedded
objects, and broader issues around Value and cost.
n Plus: Sustainable Tools
Appraisal and Assessment
Conformance, Unknown characteristics, and Unknown file
formats. Identify/Locate Preservation Worthy Data
§ Always used to ‘route’ data to software that can understand it.
§ Use minimum information to identify:
§ e.g. header only if possible. “Truncated PDF”, not
“UNKNOWN”. GIS shapefiles: .shp, .shx, but with a
missing .dbf should be reported as such.
§ Two modes needed: “Fast fail”, “Log and continue” /Quirks
§ Stop baseless distinction between “Well formed” and “Valid”
§ Validation is irrelevant to digital preservation assessment:
§ e.g. Effective “PDF/A”, without the 1.4 and XMP chunk.
§ We’re on the wrong side of Postel’s Law.
§ Unknown completeness and failure to future-proof:
§ e.g. JHOVE tries to validate versions of PDF it cannot know.
§ e.g. Tools sometimes interpret/migrate data opaquely.
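As a rough illustration of identifying with minimum information, the sketch below classifies a PDF from its header and tail only, and reports an incomplete shapefile set as such instead of falling back to "UNKNOWN". The function names, the 1024-byte tail window, and the report strings are assumptions made for this example; they are not taken from any of the tools discussed here.

```python
def identify_pdf(data: bytes) -> str:
    """Classify a byte string using only its header and its tail."""
    if not data.startswith(b"%PDF-"):
        return "UNKNOWN"
    # A complete PDF normally ends with an %%EOF marker near the end of the file.
    # The 1024-byte window is an assumption chosen for this sketch.
    if b"%%EOF" not in data[-1024:]:
        return "Truncated PDF"
    return "PDF"


def check_shapefile(present_extensions: set) -> str:
    """Report on a shapefile set, given the file extensions that are present."""
    required = {".shp", ".shx", ".dbf"}
    if ".shp" not in present_extensions:
        return "UNKNOWN"
    missing = sorted(required - present_extensions)
    if missing:
        # Say what is wrong rather than giving up on the whole set.
        return "Shapefile (missing " + ", ".join(missing) + ")"
    return "Shapefile"
```

For example, `identify_pdf(b"%PDF-1.4\n1 0 obj")` reports "Truncated PDF", and `check_shapefile({".shp", ".shx"})` reports "Shapefile (missing .dbf)".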
Identify Preservation Risks
Obsolescence, preservation risk and business constraint
n Significant Properties are irrelevant here.
§ It’s not really about the content, but about the context.
n Dependency Analysis:
§ What software does this need?
§ Does this file use format features that are not well supported?
§ What other resources are transcluded?
§ Fonts? cf. OfficeDDT.
§ Remote embeds?
§ Embedded scripts that might mask dependencies?
§ Do some operations require a password?
§ e.g. JHOVE cannot spot ‘harmless’ PDF encryption.
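A deliberately naive sketch of the dependency questions above, applied to PDF: scan the raw bytes for name tokens that hint at encryption, embedded scripts, remote references, or embedded files. The marker table and the labels are assumptions made for this example. A byte scan like this misses anything stored inside compressed object streams and can report false positives, so it only shows the kind of report that would be useful and is not a working analyser.

```python
# Hypothetical marker table: PDF name tokens that hint at a dependency.
DEPENDENCY_MARKERS = {
    b"/Encrypt": "encryption (some operations may need a password)",
    b"/JavaScript": "embedded script",
    b"/URI": "remote reference",
    b"/EmbeddedFile": "embedded file",
}


def scan_dependencies(data: bytes) -> list:
    """Return the sorted hints whose marker occurs anywhere in the raw bytes."""
    return sorted(
        label for marker, label in DEPENDENCY_MARKERS.items() if marker in data
    )
```

The point of the design is that the output is a list of risks to follow up, not a pass/fail verdict on validity.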
n Pure-Java Characterisation:
§ JHOVE (‘clean room’ implementation)
§ New Zealand Metadata Extractor (NZME)
§ Apache Tika
§ Java-based aggregation of various CLI tools:
§ Other Characterisation:
§ XCL – C++/XML ‘clean room’ extended with ImageMagick
§ Many more, inc. forensics, BitCurator, OfficeDDT, jpylyzer...
§ DROID, FIDO, Apache Tika, File
§ C3PO, and many non-specialised tools.
Up to date? Working together?
n Software Dependency Management:
§ FITS/JHOVE2 embed old DROID versions, hard to upgrade.
§ Dead dependencies: FITS and FFIdent, NZME and Jflac.
§ Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
§ Embed shared modules instead?
§ Software Project Management and Communication:
§ JHOVE, JHOVE2? FITS?
§ JHOVE2 only compiles on Sheila’s branch?
§ Roadmaps, issue management, testing, C.I., etc.
§ Cross-project coordination and bug-fixing?
§ Complexity: JHOVE2 and XCL are extremely complex.
§ JHOVE2 Berkeley DB causes checksum failures in tests
§ Tika solves same problem using SAX
n Separate projects arise from separate workflows
§ Start by understanding commonality and finding gaps?
n Share test cases and compare results?
§ The OPF Format Corpus contains various valid and invalid files.
§ Built by practitioners to test real use cases.
§ e.g. JP2 features, PDF Cabinet of Horrors.
§ Do the tools give consistent and complementary results?
§ Let’s find out!
§ cf. Dave Tarrant’s REF for Identification:
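One possible shape for a "compare results" step, sketched under assumptions: each tool's output has already been reduced to a mapping from file name to reported format, and a file that a tool did not report on counts as a disagreement. The function name and the data layout are hypothetical and do not reflect how REF or any existing framework works.

```python
def compare_tools(results: dict) -> dict:
    """Compare per-file answers across tools.

    `results` maps tool name -> {file name -> reported format}.
    Returns {file name -> "agree" or "disagree"}.
    """
    files = set()
    for per_file in results.values():
        files.update(per_file)

    report = {}
    for name in sorted(files):
        # A tool with no answer for this file is recorded as "NO RESULT".
        answers = {per_file.get(name, "NO RESULT") for per_file in results.values()}
        report[name] = "agree" if len(answers) == 1 else "disagree"
    return report
```

Run over a shared corpus, the "disagree" rows are the files worth a closer look.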
Bit-mashing as Tool QA
n Bitwise exploration of data sensitivity.
n One way to compare tools.
n Helps understand formats.
n cf. Jay Gattuso’s recent OPF blog.
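A minimal sketch of the bit-mashing idea, assuming the tool under test can be wrapped as a Python callable that takes bytes and returns a comparable result. It flips one bit at a time and records where the tool's answer changes. The names `bit_mash` and `check` are made up for this example, and wrapping a real tool (for instance through a subprocess call) is left out.

```python
def bit_mash(data: bytes, check) -> list:
    """Flip each bit in turn and record where check() changes its answer.

    Returns a list of (byte offset, bit index) pairs to which `check`
    is sensitive, relative to its result on the undamaged data.
    """
    baseline = check(data)
    sensitive = []
    for offset in range(len(data)):
        for bit in range(8):
            mutated = bytearray(data)
            mutated[offset] ^= 1 << bit  # flip exactly one bit
            if check(bytes(mutated)) != baseline:
                sensitive.append((offset, bit))
    return sensitive
```

With a header-only check such as `lambda d: d.startswith(b"%PDF-")`, only flips in the first five bytes are reported, which shows how little of the file such a check actually examines. Running the same damaged files through two tools and comparing their sensitivity maps is one way to compare them. The cost is eight tool runs per byte, so sampling offsets would be needed for large files.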
Quality Assurance (of broken or potentially broken data)
Quality assurance, Bit rot, and Integrity
n JHOVE let failed TIFF-JP2 through…
§ Jpylyzer does better.
§ Both fall far short of actual rendering.
Where's the unification?
Where should we work together?
n Shared test corpora and test framework:
§ Start with the OPF Format Corpus?
§ Pull other corpora in by reference:
§ http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
§ Sustainable version of Dave Tarrant’s REF?
§ Extend with bit-mashing to compare tools?
n Aim to coordinate more:
§ Make it clear where to go? (More about OfficeDDT).
§ Consider merging projects?
§ Consider sharing underlying libraries?
§ Consider building Tika modules?
§ Please consider Apache Preflight as base for PDF validation.