PDF/A considered harmful for digital preservation

Marco Klindt
September 27, 2017

This talk was presented at iPRES 2017 in Kyoto and aims to raise awareness of possible risks associated with the PDF/A file format in digital preservation. Paper: https://ipres2017.jp/wp-content/uploads/15Marco-Klindt.pdf

Transcript

  1. Disclaimer This talk is not intended as a critique of

    the PDF/A file format but of existing PDF/A workflows and the general attitude towards PDF/A‘s role in digital preservation.
  2. This talk is about Information retention within the cultural heritage

    and scientific publication domain, where information means structured text or data.
  3. The way to enlightenment (may be changed without further notice) •  Requirements for digital preservation • 

    PDF/A‘s purpose in life •  Suitability of PDF/A •  Considerations for successful use •  Alternative approaches •  Conclusion
  4. What shall be preserved? Textual document (with or without tabular or

    structured data) Appearance retention and/or Information retention? „Access vs. Re-Use“
  5. Information retention Information content can be rendered for intellectual consumption

    and is also (fully) accessible for machines. Structure through explicit markup or data structures.
  6. Access vs. Re-use Repositories provide access to data. Digital preservation

    and re-use of data require ability to process data.
  7. What is PDF/A? Based on Portable Document Format Page description/

    Page layout Faithful rendering of creator‘s design
  8. What is PDF/A? PDF/A: Constrained version of PDF for better

    interchange: Self-contained. All resources needed for rendering (including fonts) must be embedded.
  9. What‘s the idea behind PDF? •  Collect a sequence of

    instructions to arrange and paint glyphs and other graphical elements at fixed positions on a rectangular 2D page. •  Glyphs are graphical representations of characters. •  Fonts are sets of drawing instructions for the glyphs of particular typefaces.
  10. Let‘s create a PDF document

    %PDF-1.7                                              ← Header
    100 0 obj                                             ← Body: Objects
    <</Type /Pages /Kids [101 1 R 102 0 R ] /Count 2 >>
    endobj
    ...
    xref                                                  ← Cross-reference table
    0 15
    0000000000 65535 f
    ...
    trailer                                               ← Trailer
    << /Root 1 0 R /Size 15 >>
    startxref
    34246
    %%EOF
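
A reader typically starts at the end of such a file: it reads the offset after `startxref` and jumps to the cross-reference table. The following is a minimal Python sketch of that first step, not a PDF parser. The helper names `read_startxref` and `build_toy_pdf` are made up for illustration, and the toy file contains only a header, one object, a cross-reference table and a trailer (no pages), so it is not a complete document.

```python
import re


def read_startxref(pdf_bytes: bytes) -> int:
    """Return the byte offset stored after the final 'startxref' keyword."""
    tail = pdf_bytes[-1024:]  # the trailer sits at the very end of the file
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", tail)
    if match is None:
        raise ValueError("no startxref/%%EOF found at end of file")
    return int(match.group(1))


def build_toy_pdf() -> bytes:
    """Assemble a tiny illustrative file: header, one object, xref, trailer."""
    body = b"%PDF-1.7\n1 0 obj\n<</Type /Catalog >>\nendobj\n"
    xref_offset = len(body)  # the xref table starts right after the body
    xref = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    trailer = (
        b"trailer\n<< /Root 1 0 R /Size 2 >>\nstartxref\n%d\n%%%%EOF\n"
        % xref_offset
    )
    return body + xref + trailer


toy = build_toy_pdf()
offset = read_startxref(toy)
# The offset points at the keyword that opens the cross-reference table.
assert toy[offset:offset + 4] == b"xref"
```

The point of the sketch is only that the four parts named on the slide are located by byte offsets, which is why editing a PDF by hand in a text editor tends to break it.
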
  11. ... with two pages.

    1 0 obj                                               ← Document catalog
    <</Type /Catalog /Pages 100 0 R >>                      (n R is a reference to object n)
    endobj

    100 0 obj                                             ← Page tree with two page objects
    <</Type /Pages /Kids [101 1 R 102 0 R ] /Count 2 >>
    endobj
  12. Page one

    101 1 obj                                             ← Page one
    <</Type /Page /Parent 100 0 R
      /Resources << /Font <</F1 6 0 R /F12 7 0 R >> >>
      /MediaBox [0 0 612 792] /Contents 201 0 R >>
    endobj

    100 0 obj
    <</Type /Pages /Kids [101 1 R 102 0 R ] /Count 2 >>
    endobj
  13. Page one so far

    101 1 obj
    <</Type /Page /Parent 100 0 R
      /Resources << /Font <</F1 6 0 R /F12 7 0 R >> >>
      /MediaBox [0 0 612 792] /Contents 201 0 R >>
    endobj

    ↑ draws output ↓

    An empty page!
  14. Contents of page one

    101 1 obj
    <</Type /Page [...] /Contents 201 0 R >>
    endobj

    201 0 obj
    <</Length ...>>
    stream
    1 1 1 rg 0 0 612 792 re f
    BT 0 0 0 rg /F1 1 Tf                    ⇠ begin text, select first font: F1
    30 0 0 30 18 732 Tm (Heading: ) Tj      ⇠ go to position & paint string
    1.1333 TL T* (Hello PDF!) Tj            ⇠ new line position, paint string
    /F12 1 Tf                               ⇠ select font F12
    (Spanning two pages. To boldly ) Tj     ⇠ paint beginning of paragraph
    ET                                      ⇠ end text
    endstream
    endobj
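
Where the second line lands follows from two operators in this stream: `Tm` sets the text matrix (here a scale of 30 and a translation to 18, 732) and `T*` moves down by the leading set with `TL`, measured in the scaled text space. A small Python sketch of that arithmetic, assuming the six-number matrix layout `a b c d e f` used by PDF; the function name `next_line` is made up for illustration.

```python
def next_line(line_matrix, leading):
    """Apply T*: translate by (0, -leading) in text space.

    line_matrix is (a, b, c, d, e, f) as written before the Tm operator.
    """
    a, b, c, d, e, f = line_matrix
    tx, ty = 0.0, -leading
    # [1 0 0 1 tx ty] multiplied onto the current text line matrix
    return (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)


first_line = (30, 0, 0, 30, 18, 732)       # from "30 0 0 30 18 732 Tm"
second_line = next_line(first_line, 1.1333)  # from "1.1333 TL T*"
# x stays at 18; y drops by 1.1333 * 30, i.e. to roughly 698
```

The leading of 1.1333 is multiplied by the scale factor 30, so the second line sits about 34 units below the first.
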
  15. Page one with contents

    201 0 obj
    <</Length ...>>
    stream
    1 1 1 rg 0 0 612 792 re f
    BT 0 0 0 rg /F1 1 Tf
    30 0 0 30 18 732 Tm (Heading: ) Tj
    1.1333 TL T* (Hello PDF!) Tj
    /F12 1 Tf
    (Spanning two pages. To boldly ) Tj
    ET
    endstream
    endobj

    ↑ draws output ↓

    Heading: Hello PDF!
    Spanning two pages. To boldly
  16. Heading: Hello PDF! ↑ actually means ↓

    (Hello PDF!) Tj

    1.  Select H
    2.  Interpret H as the integer index n into the instruction set in the font (through the CMap)
    3.  Look up the nth glyph in font /F1
    4.  Execute the drawing instructions for glyph n at the current position
    5.  Advance the current position horizontally according to the width of glyph n in the font and the current character spacing
    6.  Select next character e
    7.  Interpret e as the integer index m
    8.  Look up the mth glyph in font /F1
    9.  and so on...

    Repeat until done... (computers are good at this!)
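
The steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the font `TOY_FONT` and its widths are made up, `ord()` stands in for the encoding/CMap lookup, and "drawing" a glyph is reduced to recording where it would be painted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    name: str   # stands in for the glyph's drawing instructions
    width: int  # advance width in 1/1000 of a text-space unit


# Hypothetical toy font: character code -> glyph. Widths are invented.
TOY_FONT = {
    ord("H"): Glyph("H", 750),
    ord("e"): Glyph("e", 500),
    ord("l"): Glyph("l", 250),
    ord("o"): Glyph("o", 500),
}


def show_string(text, font, font_size, x, y, char_spacing=0.0):
    """Mimic Tj: paint each glyph at the current position, then advance."""
    placed = []
    for ch in text:
        glyph = font[ord(ch)]            # steps 1-3: code -> glyph
        placed.append((glyph.name, x, y))  # step 4: "draw" at current position
        x += glyph.width / 1000 * font_size + char_spacing  # step 5: advance
    return placed, x


placed, end_x = show_string("Hello", TOY_FONT, 30, 18.0, 698.0)
```

Note that nothing in `placed` says that these five glyphs form a word. The result is a list of shapes and coordinates, which is the core of the problem discussed later in the talk.
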
  17. Page two

    102 0 obj                                             ← Page two
    <</Type /Page /Parent 100 0 R
      /Resources << /Font <</F1 6 0 R /F12 7 0 R >> >>
      /MediaBox [0 0 612 792] /Contents 202 0 R >>
    endobj

    100 0 obj
    <</Type /Pages /Kids [101 1 R 102 0 R ] /Count 2 >>
    endobj
  18. Page two so far

    102 0 obj
    <</Type /Page /Parent 100 0 R
      /Resources << /Font <</F1 6 0 R /F12 7 0 R >> >>
      /MediaBox [0 0 612 792] /Contents 202 0 R >>
    endobj

    ↑ draws output ↓

    Heading: Hello PDF!
    Spanning two pages. To boldly
  19. Contents of page two

    102 0 obj
    <</Type /Page [...] /Contents 202 0 R >>
    endobj

    202 0 obj
    <</Length ...>>
    stream
    0 0 0 rg 0 0 612 792 re f
    BT /F1 1 Tf                                              ⇠ begin text, select first font: F1
    14 0 0 14 18 732 Tm (go where no one has gone before. ) Tj  ⇠ paint string
    1.1429 TL T*                                             ⇠ new line position
    (And back... ) Tj                                        ⇠ paint second paragraph
    ET                                                       ⇠ end text
    endstream
    endobj
  20. Page two with contents

    202 0 obj
    <</Length ...>>
    stream
    0 0 0 rg 0 0 612 792 re f
    BT /F1 1 Tf
    14 0 0 14 18 732 Tm (go where no one has gone before. ) Tj
    1.1429 TL T*
    (And back... ) Tj
    ET
    endstream
    endobj

    ↑ draws output ↓

    Heading: Hello PDF!
    Spanning two pages. To boldly go where no one has gone before.
    And back...

    Pfew! Finished
  21. PDF‘s advantages •  Possibility to render document layouts as intended

    by creators. •  Ubiquity: Renderers are available for almost all platforms. •  Adheres to traditional linear (and flat) page-based information exchange.
  22. PDF for textual information •  But... has no (inherent) concept

    of text, words, text lines, sequential order or other structural information. •  PDF readers are renderers that recreate glyph and graphical element placements; drawing order irrelevant.
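
A small Python sketch of why this matters for extraction. It models a page as a list of hypothetical paint operations `(x, y, string)`. Two streams that paint the same strings in a different order produce the same page, yet reading the strings in stream order gives different text. Sorting by position is only a guess at the reading order, and it assumes a single column with non-overlapping content.

```python
def rendered_page(ops):
    """What the viewer shows: the set of placements, independent of order."""
    return frozenset(ops)


def text_in_stream_order(ops):
    """Naive extraction: concatenate strings as they appear in the stream."""
    return " ".join(text for _x, _y, text in ops)


def text_in_guessed_reading_order(ops):
    """Heuristic: top to bottom (y descending), then left to right."""
    ordered = sorted(ops, key=lambda op: (-op[1], op[0]))
    return " ".join(text for _x, _y, text in ordered)


stream_a = [(18, 732, "Heading:"), (18, 698, "Hello PDF!")]
stream_b = list(reversed(stream_a))  # same page, painted bottom line first

assert rendered_page(stream_a) == rendered_page(stream_b)
assert text_in_stream_order(stream_a) != text_in_stream_order(stream_b)
```

Extraction tools have to rely on heuristics of this kind whenever a PDF carries no structure information, and multi-column layouts, tables or footnotes quickly defeat a simple positional sort.
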
  23. Enter: PDF/A Why used in digital preservation? Familiarity might have

    led to the perception that the constrained version of PDF is a solution for many problems concerning long-term preservation.
  24. PDF/A „flavors“ (aka versions and conformance levels)

    PDF/A-1 (based on PDF 1.4)
        1b (basic): All used fonts/glyphs must be embedded.
        1a (accessible): mandatory unicode mapping, language specified, document structure hierarchical, tagged text spans, descriptive text for images.
    PDF/A-2
        2b: 1b, but based on PDF 1.7 (ISO 32000-1). Transparency.
        2u (unicode): 2b with unicode mapping but w/o accessibility features.
        2a: 2u with accessibility features (1a).
    PDF/A-3
        3b/3u/3a: See 2b/u/a respectively. Allows for embedded files with a stated relationship of being either Source, Data, Alternative, Supplement, or Unspecified with respect to parts of or the whole PDF content.
    PDF/UA (Universal Accessibility, not PDF/A)
        ISO 14289. Specifies tag requirements for document structure and content accessibility by Assistive Technology (i.e. software, e.g. screen readers).
  25. PDF/A conformance level a Constrained version of PDF/A for better

    interchange and information extraction: Self-described (unicode mapping) Embedded logical document structure
  26. Let‘s make our document more PDF/A-1a

    1 0 obj                                          ← Add structure tree root
    <</Type /Catalog /Pages 100 0 R
      /StructTreeRoot 300 0 R >>
    endobj

    300 0 obj                                        ← Oops, a lot more objects...
    <</Type /StructTreeRoot
      /K [301 0 R 304 0 R ]
      /RoleMap <</Chap /Sect /Head1 /H /Para /P >>
      /ClassMap <</Normal 305 0 R>>
      /ParentTree 400 0 R
      /ParentTreeNextKey 2
      /IDTree 403 0 R >>
    endobj
  27. The structure tree root and its kids

    300 0 obj                    ← kids are alright: a chapter and a paragraph (but not on the same page)
    <</Type /StructTreeRoot /K [301 0 R 304 0 R ] ... >>
    endobj

    301 0 obj                    ← the chapter consists of the heading and a paragraph
    <</Type /StructElem
      /S /Chap                   ⇠ a chapter
      /ID (Chap1)                ⇠ machine readable identifier
      /T (First Chapter)         ⇠ human readable title
      /P 300 0 R                 ⇠ reference back to parent ↑
      /K [302 0 R 303 0 R ]      ⇠ two children: a section head and a paragraph
    >>
    endobj
  28. ← 1st: the heading

    301 0 obj
    <</Type /StructElem ... /T (First Chapter) ... /K [302 0 R 303 0 R ] >>
    endobj

    302 0 obj
    <</Type /StructElem
      /S /Head1                  ⇠ a heading
      /ID (Sec1.1)               ⇠ machine readable identifier
      /T (Section 1.1)           ⇠ human readable title
      /P 301 0 R                 ⇠ reference back to parent ↑
      /Pg 101 1 R                ⇠ refer to page 101 ← yup, actual display page (object 101)
      /A <</O /Layout /SpaceAfter 25 /SpaceBefore 0 /TextIndent 12.5 >>
      /K 0                       ⇠ marked-content seq # 0
    >>
    endobj
  29. Recall page 1:

    101 1 obj
    <</Type /Page [...] /Contents 201 0 R >>
    endobj

    201 0 obj                    ← Now we have to tag the heading in object 201!
    <</Length ...>>
    stream
    1 1 1 rg 0 0 612 792 re f
    BT 0 0 0 rg /F1 1 Tf
    30 0 0 30 18 732 Tm (Heading: ) Tj
    1.1333 TL T* (Hello PDF!) Tj
    /F12 1 Tf
    (Spanning two pages. To boldly ) Tj
    ET
    endstream
    endobj
  30. Object 201 with marked content

    201 0 obj
    <</Length ...>>
    stream
    1 1 1 rg 0 0 612 792 re f
    BT
    /Span <</MCID 0>> BDC                   ⇠ start marked-content seq # 0
    0 0 0 rg /F1 1 Tf
    30 0 0 30 18 732 Tm (Heading: ) Tj
    1.1333 TL T* (Hello PDF!) Tj
    EMC                                     ⇠ end seq # 0
    /Span <</MCID 1>> BDC                   ⇠ start marked-content seq # 1
    /F12 1 Tf
    (Spanning two pages. To boldly ) Tj
    EMC                                     ⇠ end seq # 1
    ET
    endstream
    endobj
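
With the `BDC`/`EMC` brackets in place, a program can finally tell which painted strings belong together. A rough Python sketch that groups the strings of this stream by their `MCID`. It is a regular-expression toy, not a content-stream parser: it assumes uncompressed text, literal strings without nested or escaped parentheses, and no nested marked-content sequences. The name `text_by_mcid` is made up for illustration.

```python
import re

TOKEN = re.compile(
    r"<</MCID\s+(?P<mcid>\d+)\s*>>\s*BDC"  # start of a marked-content sequence
    r"|(?P<emc>\bEMC\b)"                   # end of the current sequence
    r"|\((?P<text>[^)]*)\)\s*Tj"           # a painted literal string
)


def text_by_mcid(stream: str) -> dict:
    """Collect painted strings per marked-content identifier."""
    current = None
    result = {}
    for m in TOKEN.finditer(stream):
        if m.group("mcid") is not None:
            current = int(m.group("mcid"))
            result.setdefault(current, "")
        elif m.group("emc") is not None:
            current = None
        elif current is not None:
            result[current] += m.group("text")
    return result


PAGE_ONE = (
    "1 1 1 rg 0 0 612 792 re f BT "
    "/Span <</MCID 0>> BDC 0 0 0 rg /F1 1 Tf 30 0 0 30 18 732 Tm "
    "(Heading: ) Tj 1.1333 TL T* (Hello PDF!) Tj EMC "
    "/Span <</MCID 1>> BDC /F12 1 Tf (Spanning two pages. To boldly ) Tj EMC ET"
)

chunks = text_by_mcid(PAGE_ONE)
# chunks[0] holds the heading text, chunks[1] the start of the paragraph
```

The identifiers only say which strings form a unit. What kind of unit it is (heading, paragraph) still has to come from the structure tree.
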
  31. ← 2nd: the paragraph

    301 0 obj
    <</Type /StructElem ... /K [302 0 R 303 0 R ] >>
    endobj

    303 0 obj                    ← paragraph spans contents on two pages
    <</Type /StructElem
      /S /Para                   ⇠ a paragraph
      /ID (para1)
      /P 301 0 R
      /Pg 101 1 R                ⇠ still refers to page 101
      /C /Normal
      /K [1                      ⇠ marked-content seq # 1
          <</Type /MCR /Pg 102 0 R /MCID 0 >> ]    ⇠ but contin‘d on page 2 as marked-content seq # 0
    >>
    endobj
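
How a reader would reassemble this paragraph can be sketched in Python. The model is simplified and its names are assumptions: the structure element is a plain dictionary with the keys `Pg` and `K`, pages are identified by their object number, and `marked_content` is a made-up lookup from (page, MCID) to the text painted inside that sequence.

```python
def element_text(struct_elem: dict, marked_content: dict) -> str:
    """Join the text of all marked-content kids of one structure element."""
    default_page = struct_elem["Pg"]
    parts = []
    for kid in struct_elem["K"]:
        if isinstance(kid, int):
            # A bare integer is an MCID on the element's own page.
            parts.append(marked_content[(default_page, kid)])
        else:
            # A marked-content reference names its page explicitly.
            parts.append(marked_content[(kid["Pg"], kid["MCID"])])
    return "".join(parts)


# Text per (page object number, MCID), as tagged in the two content streams.
marked_content = {
    (101, 0): "Heading: Hello PDF!",
    (101, 1): "Spanning two pages. To boldly ",
    (102, 0): "go where no one has gone before. ",
}

paragraph = {
    "S": "Para",
    "Pg": 101,
    "K": [1, {"Type": "MCR", "Pg": 102, "MCID": 0}],
}

full_text = element_text(paragraph, marked_content)
```

Only through this indirection does the sentence that was split by the page break become one paragraph again.
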
  32. Heading: Hello PDF! Spanning two pages. To boldly go where

    no one has gone before. And back... Heading First paragraph Second paragraph Adding the same for the second paragraph... But not done yet!
  33. Structure at last...

    300 0 obj
    <</Type /StructTreeRoot
      /K [301 0 R 304 0 R ]                          ← Actual structure tree
      /RoleMap <</Chap /Sect /Head1 /H /Para /P >>
      /ClassMap <</Normal 305 0 R>>
      /ParentTree 400 0 R
      /ParentTreeNextKey 2
      /IDTree 403 0 R >>
    endobj

    400 0 obj
    <</Nums [0 401 0 R 1 402 0 R ] >>
    endobj

    401 0 obj                                        ← 101 actual page 1
    [302 0 R 303 0 R ]
    endobj

    402 0 obj                                        ← 102 actual page 2
    [303 0 R 304 0 R ]
    endobj

    403 0 obj                                        ← Identifier lookup table
    << /Kids [404 0 R] >>
    endobj

    404 0 obj
    <</Limits [(Chap1) (Sec1.2)]
      /Names [(Chap1) 301 0 R (Sec1.1) 302 0 R (Sec1.2) 304 0 R ] >>
    endobj
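
The parent tree is the reverse index: given a page and a marked-content identifier, it names the structure element that owns that piece of content. A minimal Python sketch using the numbers from objects 400 to 402. Object numbers stand in for the referenced objects, and it is an assumption of the sketch that key 0 belongs to page one and key 1 to page two.

```python
# /Nums [0 401 0 R 1 402 0 R], with the referenced arrays resolved:
# key -> list of structure elements, indexed by MCID on that page.
PARENT_TREE = {
    0: [302, 303],  # page one: MCID 0 -> heading, MCID 1 -> paragraph
    1: [303, 304],  # page two: MCID 0 -> same paragraph, MCID 1 -> next element
}


def owning_element(page_key: int, mcid: int) -> int:
    """Return the object number of the structure element owning the content."""
    return PARENT_TREE[page_key][mcid]


# The paragraph (object 303) owns content on both pages.
assert owning_element(0, 1) == owning_element(1, 0) == 303
```

Both directions (structure element to content, content to structure element) have to be kept consistent by whatever software writes the file.
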
  34. Wow! •  Created a second (structure) tree parallel to the

    tree for drawing the pages. •  Added 11 new objects (and altered 3) to our PDF document just to add the logical structure.... •  Accessibility is an afterthought in PDF. •  Obviously not easily done after the creation process... ❓
  35. Creation vs. Conversion •  PDF/A A-level conformance depends on (structural)

    information available in the creation context. •  If not present, this information cannot be generated/recovered from existing PDFs through conversion (at least not easily; maybe laboriously and manually, or through future machine learning...).
  36. PDF 2.0 (not PDF/A-2) •  Remove ambiguities in spec • 

    Tagging support aligned with PDF/UA for better accessibility •  Support for MathML (mathematical formulas) •  Support tags from other namespaces (i.e. XML schemas) •  Will be PDF/A-4...
  37. NextGeneration PDF •  Idea is to keep the intended page layout

    but allow for easier extraction and reflow •  Strategy is to embed structured text as HTML/CSS or ePub with media queries and link to PDF objects. •  Isn‘t this an even more complicated afterthought?
  38. •  Is PDF/A a solution? •  As always in digital

    preservation the answer is: „It depends...“ Now what?
  39. Format criteria for viable digital preservation •  Ubiquity •  Stability •  Complexity •  Support

    •  Ease of identification and validation •  Interoperability •  Disclosure •  Intellectual property rights •  Viability •  Documentation quality •  Metadata Support •  Re-Usability
  40. Format criteria evaluation for PDF(/A) •  Ubiquity ✓ •  Stability ✓ •  Complexity ✘

    •  Support ✓ •  Ease of identification and validation ✘ •  Interoperability ✓/✘ •  Disclosure ✓ •  Intellectual property rights ✓ •  Viability ✘ •  Documentation quality ✓ (although not perfect, it is at least better than most other specs and getting better...) •  Metadata Support ✓ •  Re-Usability ✘
  41. Recommended Formats Statement (2017)

    •  Textual Works – Digital
    Preferred formats, in order of preference:
    1.  XML-based markup formats, with included or accessible DTD/schema, XSD/XSL presentation stylesheet(s), and explicitly stated character encoding
        •  BITS-compliant (NLM Book DTD)
        •  EPUB-compliant
        •  Other widely-used book DTD/schemas (e.g., TEI, DocBook, etc.)
        •  NISO JATS: Journal Article Tag Suite (ANSI/NISO Z39.96-2015) for electronic serials
    2.  Page-layout formats
        •  PDF/UA (ISO 14289-1-compliant)
        •  PDF/A (ISO 19005-compliant)
        •  PDF (highest quality available, with features such as searchable text, embedded fonts, lossless compression, high resolution images, device-independent specifications of colorspace, content tagging; includes document formats such as PDF/X)
  42. Choose PDF/A and be done with it? Bandwagon effect? Everyone

    is using it – They cannot all be wrong. There does not seem to be an alternative...
  43. Advantages for preservation •  Self-contained file includes all necessary information

    to achieve faithful rendering of original intent. •  Active development and improvement of tools (also for preservation). •  Support for digital signatures helps prove authenticity. •  De-facto page-based document exchange standard.
  44. Disadvantages for preservation •  Technically complex format (violates KISS principle)

    •  Spec describes behaviour of conformant readers, not the file format per se (PDF 2.0 addresses this). •  Reader renderings may vary wildly. •  Migration to other formats difficult or impossible.
  45. Disadvantages for preservation •  (Technical) Validation both complex and hard.

    •  Assessing validation results is dependent on local policies and requires technical expertise. •  Digital signatures depend on cryptography. •  Checking accessibility of information next to impossible without human interaction. •  Recovery of information or extraction/migration, too.
  46. Disadvantages for preservation •  Embedded fonts predominantly are faulty and

    fail to validate. •  Automatically checking the completeness of structural information and/or the author‘s intent (semantic tagging) is next to impossible without access to the information available in the creation process.
  47. Strategies for mitigation: PDF/A •  Workflow – Raise awareness for problems.

    – Define and establish reasonable and practical PDF/A A-level creation workflows for producers. •  Technical – Devise better tools or workflows to treat information accessibility as a priority. – Research OCR-like extraction of structure.
  48. Strategies for mitigation: Alternative formats •  Choose PDF/A only as

    a dissemination format. •  Demand (normalized) formats that include the structural information necessary to access/process the information to be preserved: – XML/JATS etc., TEI, Markup flavors, TeX sources, word processor files, etc. (see paper). ❗️
  49. Sidenote Scientific papers are approaching 2 million publications annually. Open

    Access is not only a legal problem but also the problem of human and machine accessibility (think Google for research/cultural heritage information...). Be FAIR!
  50. Dilemma Should I stay or should I go now? Should

    I stay or should I go now? If I go, there will be trouble And if I stay it will be double So come on and let me know... The Clash, 1982
  51. An inconvenient truth (Conclusion) •  PDF/A‘s role within institutional policy

    has to be risk-assessed based on preservation goals. •  PDF/A‘s creation workflows have to be meticulously QA‘ed for information accessibility. •  (My) Weather forecast: Currently overcast with slight chance of sunshine on the weekend.