A PDF Test-Set for Well-Formedness Validation in JHOVE

Michelle Lindlar
September 28, 2017

Slides for a talk given at the 14th International Conference on Digital Preservation, iPRES 2017, in Kyoto on September 27th, 2017.
The full paper is available at https://ipres2017.jp/wp-content/uploads/35.pdf

Transcript

  1. Michelle Lindlar (@mickylindlar), Yvonne Tunnat, Carl Wilson

    Kyoto, September 27th 2017, iPRES 2017
    A PDF Test-Set for Well-Formedness Validation in JHOVE – The Good, the Bad and the Ugly
  2. Agenda

    Motivation
    Background Information
      - on JHOVE
      - on validation options for validators
    Synthetic Test File Approach
      - building the test set
      - test set results
      - examples for undetected errors
    Conclusion and Outlook
  3. Let’s talk about failure

    “Institutions xyz recommend file format abc for archival … we’re safe!”
    “Tool xyz has been around the digital preservation community for n years. All bugs must have been caught.”
    Do we always know if something fails? How can we detect if something has failed?
    https://www.flickr.com/prwheatley1
    TIB example: Same document rendered with Adobe Acrobat (left) and Ghostview (right)
  4. Motivation – Validation vs. Identification

    File Format Identification
    • Helps determine file format of digital object
    • “educated format guessing”
      - by extension / mime-type – very unreliable
      - by pattern – better (but not perfect)
    • Pattern development requires significant corpus of objects (+ specification, if available)
    File Format Validation
    • Checks if digital object is of format it claims to be
    • Based on expected structure / behavior as per standard
    • Requires file format specification / standard to define “valid” as base truth
  5. Motivation – Examples for Validation Tools

    Schema validation
    • XML: basic XML structure + XSD schema files
    • XML: tools such as Schematron, Xerces, Altova, …
    Interchange format validation (text-based formats)
    • JSON – JSONLint
    • STEP (ISO 10303-21) EXPRESS files: tools such as Express Engine
    Validators for binary formats / format families
    • jpylyzer – feature extractor and validator for JPEG2000
    • veraPDF, DPF Manager, MediaConch
    Validation frameworks for multiple formats (in modules)
    • JHOVE
    • JHOVE2 (no longer maintained)
    • KOST-val
  6. Brief History of JHOVE: 2003 - today

    JHOVE 1.16
    • 14 modules
      - scope of this paper: PDF-module
    • is integrated in all major off-the-shelf digital preservation solutions
    • Several active contributors from all around the world
    • 2 “JHOVE hack day events”
    • release 1.18 currently anticipated for early 2018
  7. JHOVE: a digital preservation tool success story?

    • JHOVE as the go-to validator of the digital preservation community
    • Remains the only validator for “regular” PDF:
      - PDF/A, PDF/X, etc. are profiles built as subset restrictions
      - validators such as veraPDF, PDFTron, etc. only check profile requirements (for the most part)
    • No public “ground truth” test set available to check against ISO 32000-1:2008
    • How can we validate the validation?
    73% of respondents (n=132) use JHOVE in production
    [Diagram labels: PDF/A, PDF]
  8. How to validate validators?

    Our goal was to …
    (1) establish a ground truth for what is not well-formed
    (2) test the JHOVE software against that ground truth
    (3) improve automated regression testing
    Two possible approaches:
    1. Benchmark approach
      - Requires several validation tools (related work on TIFF, JPEG)
      - Currently no alternatives for PDF available
    2. Test corpus approach
      - Labor-intensive task (ISO 32000-1:2008 = 756 pages, JHOVE PDF module 10,000 lines of code, 152 possible validation errors)
      - Scope for our work: basic structural errors (e.g., excluding font validation)
  9. Validating validation via synthetic test files

    “In general, a file is well-formed if
    - it has a header: %PDF-m.n,
    - a body consisting of well-formed objects;
    - a cross-reference table;
    - and a trailer defining the
      - cross-reference table size,
      - and an indirect reference to the document catalog dictionary,
    - and ending with: %%EOF”
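
The conditions quoted on this slide can be read as a checklist. The following is a minimal sketch of such a checklist, not JHOVE's actual logic: it only looks for the presence of each marker in the raw bytes and does not parse objects, so the "body consisting of well-formed objects" condition is not covered. The function name `coarse_structure_report` and the abbreviated `SAMPLE` bytes are assumptions made for illustration.

```python
import re

def coarse_structure_report(data: bytes) -> dict:
    """Report which coarse structural markers are present (presence only, no parsing)."""
    return {
        "header": re.match(rb"%PDF-\d\.\d", data) is not None,
        # ^xref in multiline mode does not match the "startxref" line
        "xref": re.search(rb"^xref\b", data, re.MULTILINE) is not None,
        "trailer": re.search(rb"^trailer\b", data, re.MULTILINE) is not None,
        "size": re.search(rb"/Size\s+\d+", data) is not None,
        "root": re.search(rb"/Root\s+\d+\s+\d+\s+R", data) is not None,
        "eof": data.rstrip(b"\r\n").endswith(b"%%EOF"),
    }

# Abbreviated, hypothetical sample: it carries the markers but is not a complete PDF.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj << /Type /Catalog /Pages 2 0 R >> endobj\n"
    b"xref\n"
    b"0 2\n"
    b"trailer\n"
    b"<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n"
    b"60\n"
    b"%%EOF\n"
)

report = coarse_structure_report(SAMPLE)
truncated = coarse_structure_report(SAMPLE.replace(b"%%EOF", b""))
```

Here `report` marks every entry as present, while `truncated` flags the missing end-of-file marker. A file passing this checklist can still be malformed, which is exactly the gap a test set is meant to expose.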
  10. From JHOVE condition to test file

    “a body consisting of well-formed objects”
    Original object: 5 0 obj << /Pages 1 0 R /Type /Catalog >>
    T02-01_005_document-catalog-type-key-missing.pdf: 5 0 obj << /Pages 1 0 R /Catalog >>
    T02-01_006_document-catalog-wrong-type-key.pdf: 5 0 obj << /Pages 1 0 R /Type /Font >>
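
The step from one valid object to several deliberately broken ones can be sketched as a small byte substitution. This is an illustrative assumption about how such variants could be derived, not a description of how the authors built their files. The helper `make_catalog_variants` is hypothetical, and it returns only the altered object fragment, not a complete PDF.

```python
# The valid document catalog object shown on the slide.
VALID_CATALOG = b"5 0 obj << /Pages 1 0 R /Type /Catalog >>"

def make_catalog_variants(obj: bytes) -> dict:
    """Map each test-file name from the slide to a broken variant of the catalog object."""
    return {
        # /Type key removed, value left behind
        "T02-01_005_document-catalog-type-key-missing.pdf":
            obj.replace(b"/Type /Catalog", b"/Catalog"),
        # /Type key present, but with the wrong value
        "T02-01_006_document-catalog-wrong-type-key.pdf":
            obj.replace(b"/Type /Catalog", b"/Type /Font"),
    }

variants = make_catalog_variants(VALID_CATALOG)
for name, fragment in variants.items():
    print(name, fragment.decode("ascii"))
```

Keeping one deviation per file is what lets each test case be tied to a single expected validation error.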
  11. Test set results by validation

    Good news:
    • Majority of test cases (72 files / 80%) were validated correctly
    Bad news:
    • 18 files were not validated correctly; 17 of those (= 19% of the test set) were considered well-formed and valid (1 well-formed, but not valid)
  12. Undetected errors – Header

    Testing against ISO clause: 7.5.2 – conforming reader shall accept files with header %PDF-1.x where 0 <= x <= 7
    Deviation in test file: %PDF-1.9
    Impact: None with tested rendering software (Adobe Acrobat, Evince, Ghostview, Foxit).
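
The header condition on this slide is simple enough to express directly. The sketch below assumes the header sits at the very start of the file and only covers the %PDF-1.x range named in the clause; `header_version_ok` is a hypothetical name, and this is not how JHOVE implements the check.

```python
import re

# Header as described on the slide: %PDF-1.x with x between 0 and 7.
HEADER = re.compile(rb"%PDF-1\.(\d+)")

def header_version_ok(data: bytes) -> bool:
    """True if the file starts with %PDF-1.x and 0 <= x <= 7."""
    match = HEADER.match(data)
    return match is not None and 0 <= int(match.group(1)) <= 7

for header in (b"%PDF-1.4\n", b"%PDF-1.9\n"):
    print(header.strip().decode("ascii"), header_version_ok(header))
```

With this check, the deviating header %PDF-1.9 from the test file is rejected, while %PDF-1.4 is accepted.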
  13. Undetected errors – Trailer

    Testing against ISO clause: 7.5.5 – The last line of the file shall contain only the end-of-file marker, %%EOF.
    Deviation in test files: Extra data before %%EOF on last line; junk data after %%EOF.
    Impact: None with tested rendering software (Adobe Acrobat, Evince). However, this is of relevance if incompletely transferred files are not detected due to wrong behavior.
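
A strict reading of this clause can be sketched in a few lines. The function `last_line_is_eof` is a hypothetical helper, and the strictness is an assumption: even trailing whitespace after the marker counts as a deviation here, which real-world tools may choose to tolerate.

```python
def last_line_is_eof(data: bytes) -> bool:
    """True only if the final line of the file is exactly the %%EOF marker."""
    lines = data.splitlines()  # handles \n, \r and \r\n line endings
    return bool(lines) and lines[-1] == b"%%EOF"

cases = {
    "clean": b"startxref\n60\n%%EOF\n",
    "data before marker": b"startxref\n60\njunk %%EOF\n",
    "junk after marker": b"startxref\n60\n%%EOF\njunk",
    "truncated transfer": b"startxref\n6",
}
results = {name: last_line_is_eof(data) for name, data in cases.items()}
```

Only the "clean" case passes. The "truncated transfer" case shows why the slide calls the check relevant: a file cut off mid-transfer has no end-of-file marker on its last line.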
  14. Undetected errors – Body, Document Catalog

    Testing against ISO clause: 7.5.5 / 7.7.2 – The trailer has to include an indirect reference to the catalog dictionary for the PDF.
    Deviation: Object ID of document catalog dictionary was changed.
    Impact: Object cannot be opened by tested rendering software, but was flagged as “Well-formed and valid” by JHOVE!
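
The deviation on this slide breaks the link between the trailer and the catalog object. A rough way to detect that is to check whether the object number named by /Root is defined anywhere in the file. The sketch below is an assumption-laden simplification (`root_reference_resolves` is a hypothetical name): it scans raw bytes, ignores object streams, cross-reference streams and incremental updates, and does not verify that the target is actually a catalog.

```python
import re

def root_reference_resolves(data: bytes) -> bool:
    """True if the object referenced by /Root in the trailer is defined in the file."""
    ref = re.search(rb"/Root\s+(\d+)\s+(\d+)\s+R", data)
    if ref is None:
        return False
    number, generation = ref.group(1), ref.group(2)
    # (?<!\d) keeps "5 0 obj" from matching inside "15 0 obj"
    definition = rb"(?<!\d)" + number + rb"\s+" + generation + rb"\s+obj\b"
    return re.search(definition, data) is not None

# Hypothetical fragments: a catalog object plus the trailer dictionary pointing at it.
GOOD = (
    b"5 0 obj << /Pages 1 0 R /Type /Catalog >> endobj\n"
    b"trailer << /Size 6 /Root 5 0 R >>\n"
)
# Same fragment with the catalog's object ID changed, as in the test file.
BAD = GOOD.replace(b"5 0 obj", b"7 0 obj")

print(root_reference_resolves(GOOD), root_reference_resolves(BAD))
```

A dangling /Root reference is the kind of error that leaves a file unopenable in a viewer, so a check along these lines addresses the most severe impact reported on the slide.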
  15. Undetected errors – Body, Stream Object

    Testing against ISO clause: 7.3.4.2 – Literal strings must be enclosed in parentheses.
    Deviations: Deleting opening / closing / both parentheses; substituting parentheses with brackets.
    Impact: Missing content on page
  16. Conclusion & Outlook

    • Tests resulted in 9 GitHub issues to date
    • Test set has been integrated in JHOVE regression testing as ground truth data
    • Test data is used as easy-to-understand examples for JHOVE validation errors as documented by the OPF Document Interest Group
  17. Question tool output! Get involved!

    • We, as a community, need to take responsibility for the (community-owned) processes and tools we use
    • The PDF test-set is extensible and there are plenty of clauses left to check! (… without even thinking about PDF 2.0)
    • Ways to get involved:
      - JHOVE Use Case Survey http://jhove.openpreservation.org/
      - Test tools / processes & talk about it
      - Contribute via GitHub, OPF Document Interest Group, JHOVE hack days
      - Donate for JHOVE, become a software supporter or an OPF member
      - Write more test files
      - …
  18. Further information

    Lindlar, Tunnat: “How valid is your validation? A closer look behind the curtain of JHOVE”, IDCC 2017 paper
    Yvonne Tunnat: “TIFF format validation: easy-peasy?”, OPF blog, http://openpreservation.org/blog/2017/01/17/tiff-format-validation-easy-peasy/
    Yvonne Tunnat: “Error detection of JPEG files with JHOVE and Bad Peggy – so who’s the real Sherlock Holmes here?”, OPF blog, http://openpreservation.org/blog/2016/11/29/jpegvalidation/