potentially broken data):
  - Quality assurance, Bit rot, and Integrity
- Appraisal and Assessment:
  - Appraisal and assessment, Conformance, Unknown characteristics, and Unknown file formats.
  - Identify/Locate Preservation Worthy Data
- Identify Preservation Risks:
  - Obsolescence, preservation risk and business constraint
- Long tail of many other issues:
  - Contextual and Data capture issues through to Embedded objects, and broader issues around Value and cost.
- Plus: Sustainable Tools
Identify/Locate Preservation Worthy Data
- Identification
  - Always used to ‘route’ data to software that can understand it.
  - Use minimum information to identify:
    - e.g. header only if possible.
    - Report “Truncated PDF”, not “UNKNOWN”.
    - A GIS shapefile with .shp and .shx but a missing .dbf should be reported as such.
- Validation
  - Two modes needed: “Fast fail”, “Log and continue” /Quirks
  - Stop the baseless distinction between “Well formed” and “Valid”
  - Validation is irrelevant to digital preservation assessment:
    - e.g. Effective “PDF/A”, without the 1.4 and XMP chunk.
  - We’re on the wrong side of Postel’s Law.
  - Unknown completeness and failure to future-proof:
    - e.g. JHOVE tries to validate versions of PDF it cannot know.
    - e.g. Tools sometimes interpret/migrate data opaquely.
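The "minimum information" idea above can be sketched in a few lines of Python. This is only an illustration, not how DROID, JHOVE or any other tool works: the function names `identify_bytes` and `identify_shapefile` are hypothetical, and the sketch assumes the PDF header sits at byte 0 and that an intact PDF carries `%%EOF` within its last 1024 bytes.

```python
PDF_MAGIC = b"%PDF-"
PDF_EOF = b"%%EOF"


def identify_bytes(data: bytes) -> str:
    """Identify from the header first, then refine with a cheap trailer check."""
    if not data.startswith(PDF_MAGIC):
        return "UNKNOWN"
    # The header alone is enough to say "PDF". A missing end-of-file marker
    # refines the answer to "Truncated PDF" instead of throwing it away.
    if PDF_EOF not in data[-1024:]:
        return "Truncated PDF"
    return "PDF"


def identify_shapefile(names: set[str], stem: str) -> str:
    """Report which parts of a shapefile set (.shp, .shx, .dbf) are missing."""
    required = (".shp", ".shx", ".dbf")
    missing = [ext for ext in required if f"{stem}{ext}" not in names]
    if len(missing) == len(required):
        return "UNKNOWN"
    if missing:
        return f"Shapefile (missing {', '.join(missing)})"
    return "Shapefile"
```

For example, `identify_shapefile({"roads.shp", "roads.shx"}, "roads")` reports the missing `.dbf` rather than failing to identify the set at all.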
Significant Properties are irrelevant here.
  - It’s not really about the content, but about the context.
- Dependency Analysis:
  - What software does this need?
  - Does this file use format features that are not well supported across implementations?
  - What other resources are transcluded?
    - Fonts? cf. OfficeDDT.
    - Remote embeds?
    - Embedded scripts that might mask dependencies?
  - Do some operations require a password?
    - e.g. JHOVE cannot spot ‘harmless’ PDF encryption.
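A very rough sketch of such a dependency scan for PDF, in Python. It is an assumption-laden illustration, not a description of any existing tool: the `scan_dependencies` function and the marker table are hypothetical, and a raw byte search will miss names hidden inside compressed object streams, so a real analysis would need a proper PDF parser.

```python
# PDF name tokens that hint at a dependency or an access restriction.
# This table is illustrative, not exhaustive.
DEPENDENCY_MARKERS = {
    "encryption": (b"/Encrypt",),
    "embedded scripts": (b"/JavaScript", b"/JS"),
    "embedded files": (b"/EmbeddedFile",),
    "remote references": (b"/URI",),
    # Also matches /FontFile2 and /FontFile3. If fonts are used but none
    # are embedded, rendering may depend on fonts installed on the system.
    "embedded fonts": (b"/FontFile",),
}


def scan_dependencies(data: bytes) -> dict[str, bool]:
    """Return, for each marker group, whether any token occurs in the raw bytes."""
    return {
        label: any(marker in data for marker in markers)
        for label, markers in DEPENDENCY_MARKERS.items()
    }
```

The point of the design is that the scan reports context (what the file needs) without rendering or validating the content.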
Management:
  - FITS/JHOVE2 embed old DROID versions, hard to upgrade.
  - Dead dependencies: FITS and FFIdent, NZME and Jflac.
  - Is FITS embedding JHOVE2, or is JHOVE2 embedding FITS?
  - Embed shared modules instead?
- Software Project Management and Communication:
  - JHOVE, JHOVE2? FITS?
  - JHOVE2 only compiles on Sheila’s branch?
  - Roadmaps, issue management, testing, C.I., etc.
  - Cross-project coordination and bug-fixing?
- Complexity: JHOVE2 and XCL are extremely complex
  - JHOVE2's Berkeley DB causes checksum failures in tests
  - Tika solves the same problem using SAX
workflows
  - Start by understanding commonality and finding gaps?
- Share test cases and compare results?
  - The OPF Format Corpus contains various valid and invalid files.
  - Built by practitioners to test real use cases.
    - e.g. JP2 features, PDF Cabinet of Horrors.
  - Do the tools give consistent and complementary results?
    - Let’s find out!
  - cf. Dave Tarrant’s REF for Identification:
    - http://data.openplanetsfoundation.org/ref/
    - http://data.openplanetsfoundation.org/ref/pdf/pdf_1.7/
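Comparing results across tools could start as simply as the Python sketch below. It assumes each tool's output has already been reduced to one verdict string per file; the `compare_results` function, the `"no result"` placeholder and the tool names in the usage note are hypothetical and do not reflect real tool output.

```python
def compare_results(results: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return the files on which tools disagree, or which some tool did not cover.

    `results` maps tool name -> {file name -> verdict}.
    """
    files = sorted({name for verdicts in results.values() for name in verdicts})
    report = {}
    for name in files:
        # A tool with no verdict for a file is recorded as a gap, not skipped.
        row = {tool: verdicts.get(name, "no result") for tool, verdicts in results.items()}
        if len(set(row.values())) > 1:
            report[name] = row
    return report
```

Called with `{"tool_a": {"x.pdf": "PDF 1.4"}, "tool_b": {"x.pdf": "PDF 1.7"}}`, it returns the row for `x.pdf`, since the two verdicts differ. Files on which every tool agrees are left out of the report.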
test corpora and test framework:
  - Start with the OPF Format Corpus?
  - Pull other corpora in by reference:
    - http://www.pdfa.org/2011/08/isartor-test-suite/ for PDF/A
  - Sustainable version of Dave Tarrant’s REF?
  - Extend with bit-mashing to compare tools?
- Aim to coordinate more:
  - Make it clear where to go? (More about OfficeDDT).
  - Consider merging projects?
  - Consider sharing underlying libraries?
  - Consider building Tika modules?
  - Please consider Apache Preflight as base for PDF validation.
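One way to read "bit-mashing" is as flipping single bits in known-good corpus files and feeding the damaged copies to each tool. The Python sketch below takes that interpretation as an assumption; `flip_bit` and `bit_mashed_variants` are hypothetical helper names, and the seeded generator is there only so that runs are repeatable across tools.

```python
import random
from typing import Iterator


def flip_bit(data: bytes, position: int) -> bytes:
    """Return a copy of `data` with the bit at `position` (0 = LSB of byte 0) inverted."""
    if not 0 <= position < len(data) * 8:
        raise ValueError("bit position out of range")
    mutated = bytearray(data)
    mutated[position // 8] ^= 1 << (position % 8)
    return bytes(mutated)


def bit_mashed_variants(data: bytes, count: int, seed: int = 0) -> Iterator[tuple[int, bytes]]:
    """Yield `count` (position, variant) pairs, each variant differing by one bit."""
    if not data:
        raise ValueError("cannot mutate empty data")
    rng = random.Random(seed)  # fixed seed: every tool sees the same damage
    for _ in range(count):
        position = rng.randrange(len(data) * 8)
        yield position, flip_bit(data, position)
```

Recording the flipped position alongside each variant makes it possible to ask later which regions of a format each tool actually checks.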