A PDF Test-Set for Well-Formedness Validation in JHOVE
Slides for the talk given at the 14th International Conference on Digital Preservation, iPRES 2017, in Kyoto on September 27th, 2017.
The full paper is available at https://ipres2017.jp/wp-content/uploads/35.pdf
format abc for archival … we’re safe!“
„Tool xyz has been around the digital preservation community for n years. All bugs must have been caught.“
Do we always know if something fails? How can we detect if something has failed?
https://www.flickr.com/prwheatley1
TIB example: Same document rendered with Adobe Acrobat (left) and Ghostview (right)
Helps determine file format of digital object
„Educated format guessing“
• by extension / MIME type: very unreliable
• by pattern: better (but not perfect)
Pattern development requires a significant corpus of objects (+ specification, if available)
File Format Validation
• Checks if digital object is of the format it claims to be
• Based on expected structure / behavior as per standard
• Requires file format specification / standard to define „valid“ as ground truth
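To illustrate why identification by pattern beats identification by extension, here is a minimal Python sketch. The signature table and the function names are assumptions made up for this example; real identification tools rely on much larger pattern registries.

```python
import os

# Illustrative signatures only (assumption): a real registry holds far more.
MAGIC_PATTERNS = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff": "JPEG",
    b"II*\x00": "TIFF",   # little-endian TIFF
    b"MM\x00*": "TIFF",   # big-endian TIFF
}

EXTENSIONS = {
    ".pdf": "PDF", ".png": "PNG", ".jpg": "JPEG",
    ".jpeg": "JPEG", ".tif": "TIFF", ".tiff": "TIFF",
}


def guess_by_extension(filename):
    """Trust the file name; returns None for unknown extensions."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSIONS.get(ext)


def guess_by_pattern(data):
    """Compare the leading bytes against known signatures."""
    for magic, fmt in MAGIC_PATTERNS.items():
        if data.startswith(magic):
            return fmt
    return None


# A PNG that was renamed to "report.pdf": the two guesses disagree.
png_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
print(guess_by_extension("report.pdf"), guess_by_pattern(png_bytes))  # PDF PNG
```

Neither guess says anything about validity: a file that starts with `%PDF-` can still be broken, which is where validation comes in.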
1.16.
• 14 modules; scope of this paper: PDF module
• Is integrated in all major off-the-shelf digital preservation solutions
• Several active contributors from all around the world
• 2 “JHOVE hack day events”
• Release 1.18 currently anticipated for early 2018
• JHOVE as the go-to validator of the digital preservation community
• Remains the only validator for „regular“ PDF
  • PDF/A, PDF/X, etc. are profiles built as subset restrictions
  • Validators such as veraPDF, PDFTron, etc. only check profile requirements (for the most part)
• No public „ground truth“ test set available to check against ISO 32000-1:2008
• How can we validate the validation?
73% of respondents (n=132) use JHOVE in production
[Figure labels: PDF/A, PDF]
to …
(1) to establish a ground truth for what is not well-formed
(2) to test the JHOVE software against that ground truth
(3) to improve automated regression testing
Two possible approaches:
1. Benchmark approach
   • Requires several validation tools (related work on TIFF, JPEG)
   • Currently no alternatives for PDF available
2. Test corpus approach
   • Labor-intensive task (ISO 32000-1:2008 = 756 pages, JHOVE PDF module = 10,000 lines of code, 152 possible validation errors)
Scope for our work: basic structural errors (e.g., excluding font validation)
“a file is well-formed if
- it has a header: %PDF-m.n,
- a body consisting of well-formed objects;
- a cross-reference table;
- and a trailer defining the
  - cross-reference table size,
  - and an indirect reference to the document catalog dictionary,
- and ending with: %%EOF”
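The criteria above can be sketched as a rough structural check. This is not how JHOVE implements it: the function name `structural_problems`, the regular expressions and the sample file are assumptions for illustration, and the sketch only looks for the presence of each part (classic `xref` table, line feeds as line endings), not for the well-formedness of every object.

```python
import re


def structural_problems(data: bytes) -> list[str]:
    """Return a list of missing structural parts (empty list = none found)."""
    problems = []
    if not re.match(rb"%PDF-\d\.\d", data):
        problems.append("no %PDF-m.n header")
    if not re.search(rb"\d+ \d+ obj\b.*?\bendobj\b", data, re.DOTALL):
        problems.append("no complete object in body")
    # "^xref" does not match "startxref" because of the line-start anchor.
    if not re.search(rb"^xref\b", data, re.MULTILINE):
        problems.append("no cross-reference table")
    trailer_at = data.rfind(b"trailer")
    if trailer_at < 0:
        problems.append("no trailer")
    else:
        trailer = data[trailer_at:]
        if not re.search(rb"/Size\s+\d+", trailer):
            problems.append("trailer lacks /Size")
        if not re.search(rb"/Root\s+\d+\s+\d+\s+R\b", trailer):
            problems.append("trailer lacks indirect /Root reference")
    if not data.rstrip().endswith(b"%%EOF"):
        problems.append("does not end with %%EOF")
    return problems


# Hand-written minimal file for illustration (no pages, nothing to render).
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"xref\n0 3\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"0000000058 00000 n \n"
    b"trailer\n<< /Size 3 /Root 1 0 R >>\n"
    b"startxref\n110\n%%EOF\n"
)

print(structural_problems(SAMPLE))      # []
print(structural_problems(SAMPLE[9:]))  # ['no %PDF-m.n header']
```

A check like this only confirms that the parts exist; whether they agree with each other is a separate question.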
Majority of test cases (72 files / 80%) were validated correctly
Bad news:
• 18 files were not validated correctly: 17 of those (19% of all 90 test files) were considered well-formed and valid, and 1 well-formed but not valid
7.5.5 - The last line of the file shall contain only the end-of-file marker, %%EOF.
Deviation in test files: Extra data before %%EOF on last line; junk data after %%EOF.
Impact: None with tested rendering software (Adobe Acrobat, Evidence). However, this is relevant if incompletely transferred files go undetected due to wrong behavior.
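A strict reading of this clause can be sketched in a few lines of Python. The function name and messages are made up for this example, and tolerating a single end-of-line after the marker is an assumption of the sketch, not something taken from the slides.

```python
import re


def eof_marker_problem(data: bytes):
    """Return a description of the deviation, or None if the last line is exactly %%EOF."""
    body = data
    # Assumption: one trailing end-of-line after the marker is tolerated.
    for eol in (b"\r\n", b"\n", b"\r"):
        if body.endswith(eol):
            body = body[: -len(eol)]
            break
    last_line = re.split(rb"\r\n|\r|\n", body)[-1]
    if last_line == b"%%EOF":
        return None
    if b"%%EOF" not in data:
        return "no %%EOF marker (file may be truncated)"
    if last_line.endswith(b"%%EOF"):
        return "extra data before %%EOF on last line"
    return "junk data after %%EOF"


print(eof_marker_problem(b"startxref\n110\n%%EOF\n"))      # None
print(eof_marker_problem(b"startxref\n110\njunk%%EOF\n"))  # extra data before %%EOF on last line
print(eof_marker_problem(b"startxref\n110\n%%EOF\njunk"))  # junk data after %%EOF
```

The truncated-file case is the one that matters for transfer checks: a file cut off before the marker is reported instead of passing silently.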
ISO clause: 7.5.5 / 7.7.2 - The trailer has to include an indirect reference to the catalog dictionary for the PDF.
Deviation: Object ID of document dictionary was changed.
Impact: Object cannot be opened by tested rendering software, but was flagged as „Well-formed and valid“ by JHOVE!
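A check for this deviation could look roughly like the following Python sketch: take the /Root reference from the trailer and confirm that an object with that number exists and is a catalog. The function name and the abbreviated sample file (no cross-reference table) are assumptions for illustration; the sketch scans the raw bytes instead of following cross-reference offsets and ignores object streams.

```python
import re


def root_reference_resolves(data: bytes) -> bool:
    """True if the trailer's /Root points at an object that declares /Catalog."""
    trailer_at = data.rfind(b"trailer")
    if trailer_at < 0:
        return False
    root = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R\b", data[trailer_at:])
    if root is None:
        return False
    # Look for the matching "<num> <gen> obj" header; the lookbehind stops
    # "1 0 obj" from matching inside "11 0 obj".
    header = re.compile(
        rb"(?<!\d)" + root.group(1) + rb"\s+" + root.group(2) + rb"\s+obj\b"
    )
    target = header.search(data)
    if target is None:
        return False
    end = data.find(b"endobj", target.end())
    if end < 0:
        end = len(data)
    return b"/Catalog" in data[target.end():end]


# Abbreviated hand-written sample (assumption: xref table omitted for brevity).
GOOD = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    b"trailer\n<< /Size 3 /Root 1 0 R >>\n%%EOF\n"
)
# Mimic the deviation: the catalog's object ID no longer matches the trailer.
BROKEN = GOOD.replace(b"1 0 obj", b"9 0 obj")

print(root_reference_resolves(GOOD), root_reference_resolves(BROKEN))  # True False
```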
ISO clause: 7.3.4.2 - Literal strings must be enclosed in parentheses.
Deviations: Deleting opening / closing / both parentheses; substituting parentheses with brackets.
Impact: Missing content on page
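The first two deviations can be caught by counting parentheses, as in this Python sketch. The function name is made up for this example, and the sketch assumes an uncompressed content-stream fragment without comments or binary data. Balanced unescaped parentheses and backslash escapes inside a string are allowed, so both are handled.

```python
def paren_balance_problem(fragment: bytes):
    """Return a description of the first imbalance found, or None."""
    depth = 0
    i = 0
    while i < len(fragment):
        byte = fragment[i:i + 1]
        if byte == b"\\" and depth > 0:
            i += 2  # skip the escaped character inside a literal string
            continue
        if byte == b"(":
            depth += 1
        elif byte == b")":
            if depth == 0:
                return f"closing parenthesis without opening one at offset {i}"
            depth -= 1
        i += 1
    if depth:
        return "literal string not closed"
    return None


print(paren_balance_problem(b"BT /F1 12 Tf (Hello \\(world\\)) Tj ET"))  # None
print(paren_balance_problem(b"BT /F1 12 Tf (Hello Tj ET"))               # literal string not closed
print(paren_balance_problem(b"BT /F1 12 Tf Hello) Tj ET"))
```

Deleting both parentheses or substituting them with brackets leaves the counts balanced, so this check alone does not flag those deviations; they need a check of what operand the text-showing operator actually receives.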
GitHub issues to date
• Test set has been integrated in JHOVE regression testing as ground truth data
• Test data is used as easy-to-understand examples for JHOVE validation errors as documented by the OPF Document Interest Group
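The idea of using the test set as ground truth in regression testing can be sketched as a simple comparison of expected against reported statuses. This is not the JHOVE regression suite: the function name, the file names and the way the statuses are collected are hypothetical, and the status strings are modeled on the wording quoted in these slides.

```python
def regression_report(expected: dict, reported: dict) -> dict:
    """Compare validator output with the ground truth, file by file."""
    mismatches = {
        name: (want, reported.get(name))
        for name, want in expected.items()
        if reported.get(name) != want
    }
    total = len(expected)
    return {
        "total": total,
        "correct": total - len(mismatches),
        "mismatches": mismatches,
    }


# Hypothetical file names and results, for illustration only.
ground_truth = {
    "missing-header.pdf": "Not well-formed",
    "wrong-catalog-id.pdf": "Not well-formed",
}
validator_output = {
    "missing-header.pdf": "Not well-formed",
    "wrong-catalog-id.pdf": "Well-formed and valid",
}

report = regression_report(ground_truth, validator_output)
print(report["correct"], "of", report["total"], "as expected")
for name, (want, got) in report["mismatches"].items():
    print(f"{name}: expected {want!r}, got {got!r}")
```

Run after every change to the validator, a report like this shows at once whether a fix for one test file has changed the verdict on another.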
a community need to take responsibility for the (community-owned) processes and tools we use
• The PDF test set is extensible and there’s plenty of clauses left to check! (… without even thinking about PDF 2.0)
• Ways to get involved:
  • JHOVE Use Case Survey http://jhove.openpreservation.org/
  • Test tools / processes & talk about it
  • Contribute via GitHub, OPF Document Interest Group, JHOVE hack days
  • Donate for JHOVE, become a software supporter or an OPF member
  • Write more test files …
validation? A closer look behind the curtain of JHOVE“ IDCC 2017 paper
Yvonne Tunnat: „TIFF format validation: easy-peasy?“ OPF blog http://openpreservation.org/blog/2017/01/17/tiff-format-validation-easy-peasy/
Yvonne Tunnat: „Error detection of JPEG files with JHOVE and Bad Peggy – so who’s the real Sherlock Holmes here?“ OPF blog http://openpreservation.org/blog/2016/11/29/jpegvalidation/