How to debug Writer file format issues?

Miklos V
February 03, 2013


  1. 1.

    How to debug Writer file format issues?
    We won't give you fish, but we'll teach you how to fish.
    Miklos Vajna, 3 February 2013
  2. 2.

    Good news and bad news for you
    • Good news: file filters are one area that can be easily unit-tested in most cases
      ‒ Compared to layout or UI
      ‒ So fixing a particular problem is not an endless effort
    • Bad news: in most cases it's about modifying the source code
      ‒ In rare cases you can work around the problem by modifying the input file
  3. 3.

    What is a file format issue?
    • It is the situation when your problem is caused by an import or export filter
    • Good examples:
      ‒ Stroke weight of the line inside a group shape is too large when importing from DOCX
      ‒ This document is supposed to be a single page, not two
    • Bad examples:
      ‒ The imported document causes a layout loop
      ‒ Writer doesn't support a particular feature that is supported by the given format
  4. 4.

    More terminology
    • Technically, we only do import and export
    • “Open” on the UI: import to an empty document, then reset the undo stack
    • “Copy and paste” on the UI: partial export, followed by an import to an existing document
    • “Save” on the UI: export to an already existing path
    • This explains:
      ‒ Why a single character modification totally rewrites the file
      ‒ Why it's not possible to extract the “conversion machine” from LibreOffice (but: we have a headless mode)
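Since the “conversion machine” cannot be extracted, converting a file means running LibreOffice itself in headless mode, which performs an import followed by an export. The following is a minimal sketch of such a run, assuming a `soffice` binary is on the PATH; the function names are hypothetical, while `--headless`, `--convert-to` and `--outdir` are soffice command-line options.

```python
import subprocess
from pathlib import Path


def build_convert_command(input_path, target_format, out_dir, soffice="soffice"):
    """Return the argument list for a headless import + export run."""
    return [
        soffice,
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(out_dir),
        str(input_path),
    ]


def convert(input_path, target_format, out_dir, soffice="soffice"):
    """Run the conversion and return the path where the output is expected.

    Assumes target_format is a plain extension such as "odt" or "pdf".
    """
    cmd = build_convert_command(input_path, target_format, out_dir, soffice)
    subprocess.run(cmd, check=True)
    return Path(out_dir) / (Path(input_path).stem + "." + target_format)
```

Calling `convert("bug.docx", "odt", "out")` would import the DOCX file and export it as ODT, which is a quick way to see whether a problem is on the import or the export side.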
  5. 5.

    How to check the document model?
    • Basic building block: paragraphs
    • XML dump:
      ‒ SW_DEBUG=1 ./soffice.bin --writer
      ‒ Shift-F12 creates nodes.xml
    • GDB:
      ‒ print pDoc->GetNodes()
    • UNO:
      ‒ Iterating over ThisComponent → paragraphs
      ‒ Iterating over a paragraph → runs
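The UNO route can be sketched in Python as well as in Basic. The function below is an illustration, not code from the talk: `dump_paragraphs` is a hypothetical name, and `doc` is assumed to be a Writer document object (in a Python macro, `XSCRIPTCONTEXT.getDocument()` plays the role of Basic's `ThisComponent`). It relies only on the enumeration interfaces of the text, its paragraphs and their text portions.

```python
def dump_paragraphs(doc):
    """Return a list of (paragraph_text, [run_text, ...]) tuples for a Writer document."""
    result = []
    paragraphs = doc.getText().createEnumeration()
    while paragraphs.hasMoreElements():
        element = paragraphs.nextElement()
        # Tables are enumerated alongside paragraphs; skip anything that is not a paragraph.
        if not element.supportsService("com.sun.star.text.Paragraph"):
            continue
        runs = []
        portions = element.createEnumeration()
        while portions.hasMoreElements():
            # Each text portion is a run of text sharing the same formatting.
            runs.append(portions.nextElement().getString())
        result.append((element.getString(), runs))
    return result
```

Comparing this output for a problematic document and for a hand-made good one often shows where the imported model differs.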
  6. 7.

    ODF filter
    • The only “own” filter, both import and export are supposed to be lossless
    • ODF semantics are very close to the Writer document model:
      ‒ Example for paragraphs: UNO properties ↔ XML attributes
    • Most of the implementation is a UNO filter
      ‒ Can serve as a good example for other filters
    • Code under xmloff/ and sw/source/filter/xml/
    • ODF validator: ‒
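To make the “UNO properties ↔ XML attributes” pairing concrete, take a paragraph's left margin: in ODF it is the `fo:margin-left` attribute, a length with a unit, while the UNO property `ParaLeftMargin` is an integer in 1/100 mm. The helper below is a hypothetical sketch of that one conversion, not the actual filter code (which is C++ in xmloff/), and it only handles the absolute units listed in the table.

```python
import re

# Hundredths of a millimetre per ODF length unit.
_UNIT_TO_MM100 = {
    "mm": 100.0,
    "cm": 1000.0,
    "in": 2540.0,
    "pt": 2540.0 / 72,
    "pc": 2540.0 / 6,
}

_LENGTH_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)(mm|cm|in|pt|pc)\s*$")


def odf_length_to_mm100(value):
    """Convert an ODF length such as "0.5in" to 1/100 mm, rounded to an integer."""
    match = _LENGTH_RE.match(value)
    if match is None:
        # Percentages and other units are out of scope for this sketch.
        raise ValueError("unsupported ODF length: %r" % value)
    number, unit = match.groups()
    return round(float(number) * _UNIT_TO_MM100[unit])
```

When an ODF round trip changes a margin slightly, a unit conversion of this kind is one of the first places to look.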
  7. 8.

    RTF filter
    • About the format
      ‒ Motivation: easily readable like HTML, but supports all word processing features (page size, columns, etc.)
      ‒ Can be hard to read
    • Export
      ‒ New in LibreOffice 3.3
      ‒ Internal filter, core shared with DOC/DOCX
    • Import
      ‒ New in LibreOffice 3.5
      ‒ UNO filter, domain mapper shared with DOCX
    • Mostly my fault
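As an illustration of what the RTF syntax looks like and of the tokenizing step that precedes the domain mapper, here is a deliberately small tokenizer. It is a sketch under stated assumptions, not the LibreOffice implementation: `tokenize_rtf` is a hypothetical name, and it recognises only groups, control words with an optional numeric parameter, control symbols, and plain text (no `\'hh` hex escapes, no `\bin` data).

```python
import re

_TOKEN_RE = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"  # control word, optional parameter, optional delimiter space
    r"|\\([^a-zA-Z])"           # control symbol such as \{ or \~
    r"|([{}])"                  # group start / end
    r"|([^\\{}\r\n]+)"          # plain text run
)


def tokenize_rtf(data):
    """Split an RTF string into (kind, value, parameter) tuples."""
    tokens = []
    for match in _TOKEN_RE.finditer(data):
        word, param, symbol, brace, text = match.groups()
        if word is not None:
            tokens.append(("control", word, int(param) if param is not None else None))
        elif symbol is not None:
            tokens.append(("symbol", symbol, None))
        elif brace is not None:
            tokens.append(("group_start" if brace == "{" else "group_end", None, None))
        else:
            tokens.append(("text", text, None))
    return tokens
```

For the input `{\rtf1\ansi Hello\par}` this yields a group start, the control words `rtf` (parameter 1) and `ansi`, the text `Hello`, the control word `par`, and a group end. Deciding what those tokens mean for the document is the domain mapper's job.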
  8. 9.

    DOC filter
    • Probably the oldest filter in Writer
      ‒ Not counting binfilter
    • Both import and export are internal filters
    • Specification is available as [MS-DOC]
    • Tokenizer and domain mapper are not separated
    • For tokenizer problems, mso-dumper can help: ‒
    • Import/export somewhat shared
  9. 10.

    DOCX filter
    • Two variants: ECMA and Microsoft
      ‒ Example: left/right or start/end for paragraph margins
    • Import is older
      ‒ Over-engineered in writerfilter/
      ‒ XSLT generates the tokenizer code, challenging to debug
      ‒ Inherited from, UNO-based
      ‒ Domain mapper shared with RTF
    • Export is LibreOffice-only
      ‒ Internal
      ‒ Shared with DOC/RTF
  10. 12.

    Unit tests
    • We only care if the filter can handle the file
      ‒ CVE tests
      ‒ If the filter provides the expected return value, we're good
    • Internal tests
      ‒ Provide access to private Writer symbols
      ‒ Handy to test methods used by the UI
      ‒ In most cases not needed by filter tests
  11. 13.

    Document model tests
    • The most commonly used one
    • Import
      ‒ Load the file, then assert on the document model as seen through UNO
    • Export
      ‒ Import → export → import
      ‒ This way the same API can be used for tests, and export is tested as well
      ‒ Alternative: build the document from code, then somehow check the result (XPath for XML-based formats, but what about the rest?)
      ‒ Drawback: the import has to work correctly too
      ‒ Not a bad thing anyway
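The XPath alternative for XML-based formats can be sketched outside the LibreOffice test framework. The example below reads paragraph indents straight from an exported DOCX file, accepting either the `left` or the `start` attribute mentioned on the DOCX filter slide. It is an illustration only: `paragraph_indents` is a hypothetical helper, and it assumes the main document part is stored at `word/document.xml`, which is the usual location but is strictly determined by the package relationships.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_indents(docx_file):
    """Return the left/start indent of every top-level body paragraph (None if unset)."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    indents = []
    for paragraph in root.iterfind(".//w:body/w:p", NS):
        ind = paragraph.find("w:pPr/w:ind", NS)
        value = None
        if ind is not None:
            # Accept both attribute spellings for the same margin.
            value = ind.get("{%s}left" % W_NS) or ind.get("{%s}start" % W_NS)
        indents.append(value)
    return indents
```

A check like this tests the export alone, without depending on the import, but it only works for formats that can be parsed as XML.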
  12. 14.

    Layout tests
    • Testing the layout
      ‒ Not: contents of the header in page style “Default”
      ‒ But: text in the header on page 3
    • Sometimes handy, but be careful
      ‒ Writer layout is partly calculated during idle time, and tests won't wait for that
      ‒ It may be OK for the layout to differ
      ‒ E.g. missing fonts
  13. 15.

    Thanks for listening!
    Slides:
    All text and image content in this document, unless otherwise specified, is licensed under the Creative Commons Attribution-Share Alike 3.0 License. This does not include the LibreOffice name, logo, or icon. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States.