How to debug Writer file format issues?

Miklos V
February 03, 2013

Transcript

  1. How to debug Writer file format issues?
    We won't give you fish, but we'll teach you how to fish.
    Miklos Vajna
    3 February 2013

  2. Good news and bad news
    • Good news: file filters are an area that can be easily unit-tested in most cases
    ‒ Compared to layout or UI
    ‒ So fixing a particular problem is not an endless effort
    • Bad news: in most cases a fix means modifying the source code
    ‒ In rare cases you can work around it by modifying the input file

  3. What is a file format issue?
    • It is the situation when your problem is caused by an import or export filter
    • Good examples:
    ‒ Stroke weight of the line inside a group shape is too large when importing from DOCX
    ‒ This document is supposed to have a single page, not two
    • Bad examples:
    ‒ The imported document causes a layout loop
    ‒ Writer doesn't support a particular feature that the given format supports

  4. More terminology
    • Technically, we only do import and export
    • “Open” in the UI: import into an empty document, then reset the undo stack
    • “Copy and paste” in the UI: partial export, followed by an import into an existing document
    • “Save” in the UI: export to an already existing path
    • This explains:
    ‒ Why a single-character modification totally rewrites the file
    ‒ Why it's not possible to extract the “conversion machine” from LibreOffice (but: we have a headless mode)
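
The headless mode mentioned above can be driven from a script, which is handy when you want to reproduce an import followed by an export without touching the UI. The following is only a sketch: the helper names `build_convert_command` and `convert` are made up for illustration, and it assumes a `soffice` binary on the PATH that accepts the `--headless`, `--convert-to` and `--outdir` options.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source, target_format, outdir, soffice="soffice"):
    """Return the argv for a headless conversion (an import followed by an export)."""
    return [
        soffice,
        "--headless",
        "--convert-to", target_format,
        "--outdir", str(outdir),
        str(source),
    ]


def convert(source, target_format, outdir, soffice="soffice"):
    """Run the conversion and return the path the result is expected at."""
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice} not found on PATH")
    subprocess.run(
        build_convert_command(source, target_format, outdir, soffice),
        check=True,
    )
    # target_format may carry an explicit filter name after a colon
    # (extension:filter); only the extension ends up in the file name.
    extension = target_format.split(":", 1)[0]
    return Path(outdir) / f"{Path(source).stem}.{extension}"
```

Calling `convert("bug.docx", "odt", "out")` would then import the DOCX and export it as ODF, so a difference between the two files points at one of the two filters involved.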

  5. How to check the document model?
    • Basic building block: paragraphs
    • XML dump:
    ‒ SW_DEBUG=1 ./soffice.bin --writer
    ‒ Shift-F12 creates nodes.xml
    • GDB:
    ‒ print pDoc->GetNodes()
    • UNO:
    ‒ Iterating over ThisComponent → paragraphs
    ‒ Iterating over a paragraph → runs
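
The UNO route can be sketched as a small Python macro. This is an illustration rather than code from the talk: `iter_runs` and `dump_runs` are hypothetical names, and the sketch assumes it runs inside LibreOffice's Python scripting framework, where `XSCRIPTCONTEXT.getDocument()` plays the role of `ThisComponent` from Basic.

```python
def iter_runs(doc):
    """Yield (paragraph_index, run_text) for every run in the document body.

    `doc` is expected to be a Writer document as seen through UNO.
    """
    paragraphs = doc.getText().createEnumeration()
    index = 0
    while paragraphs.hasMoreElements():
        para = paragraphs.nextElement()
        # The body enumeration also returns tables; only paragraphs have runs.
        if not para.supportsService("com.sun.star.text.Paragraph"):
            continue
        runs = para.createEnumeration()
        while runs.hasMoreElements():
            yield index, runs.nextElement().getString()
        index += 1


def dump_runs(*args):
    """Macro entry point: print every run of the current document."""
    # XSCRIPTCONTEXT only exists when LibreOffice runs this as a macro.
    doc = XSCRIPTCONTEXT.getDocument()
    for para_index, text in iter_runs(doc):
        print(para_index, repr(text))
```

Keeping the iteration separate from the macro entry point means the same generator can be pointed at any object that offers the same enumeration methods.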

  6. Writer file formats:
    ODF, Word formats and the rest

  7. ODF filter
    • The only “own” filter; both import and export are supposed to be lossless
    • ODF semantics are very close to the Writer document model:
    ‒ Example for paragraphs: UNO properties ↔ XML attributes
    • Most of the implementation is a UNO filter
    ‒ Can serve as a good example for other filters
    • Code under xmloff/ and sw/source/filter/xml/
    • ODF validator:
    ‒ http://odf-validator2.rhcloud.com/odf-validator2/

  8. RTF filter
    • About the format
    ‒ Motivation: easily readable like HTML, but supports all word processing features (page size, columns, etc.)
    ‒ In practice it can be hard to read
    • Export
    ‒ New in LibreOffice 3.3
    ‒ Internal filter, core shared with DOC/DOCX
    • Import
    ‒ New in LibreOffice 3.5
    ‒ UNO filter, domain mapper shared with DOCX
    • Mostly my fault

  9. DOC filter
    • Probably the oldest filter in Writer
    ‒ Not counting binfilter
    • Both import and export are internal filters
    • Specification is available as [MS-DOC]
    • Tokenizer and domain mapper are not separated
    • For tokenizer problems, mso-dumper can help:
    ‒ http://cgit.freedesktop.org/libreoffice/contrib/mso-dumper/
    • Import and export share some code

  10. DOCX filter
    • Two variants: ECMA and Microsoft
    ‒ Example: left/right or start/end for paragraph margins
    • Import is older
    ‒ Over-engineered in writerfilter/
    ‒ XSLT generates the tokenizer code, challenging to debug
    ‒ Inherited from OpenOffice.org, UNO-based
    ‒ Domain mapper shared with RTF
    • Export is LibreOffice-only
    ‒ Internal
    ‒ Shared with DOC/RTF
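
To make the left/right versus start/end example concrete, here is a small sketch of a reader that tolerates both spellings on a paragraph's `w:ind` element. It is not the writerfilter code: `paragraph_indent` is a hypothetical helper, it assumes the transitional WordprocessingML namespace, and it ignores the fact that start/end depend on the text direction.

```python
import xml.etree.ElementTree as ET

# Transitional WordprocessingML namespace (an assumption of this sketch).
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_indent(ppr_xml):
    """Return the (before, after) indents in twips from a <w:pPr> fragment.

    Accepts both attribute spellings: start/end and left/right.
    """
    ppr = ET.fromstring(ppr_xml)
    ind = ppr.find(f"{{{W}}}ind")
    if ind is None:
        return (0, 0)

    def read(*names):
        # Return the first attribute that is present, trying names in order.
        for name in names:
            value = ind.get(f"{{{W}}}{name}")
            if value is not None:
                return int(value)
        return 0

    return (read("start", "left"), read("end", "right"))
```

A document model test can then assert the same margin for two input files that differ only in which spelling they use.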

  11. Testcases:
    core, document model and layout tests

  12. Unit tests
    • We only care if the filter can handle the file
    ‒ CVE tests
    ‒ If the filter provides the expected return value, we're good
    • Internal tests
    ‒ Provide access to private Writer symbols
    ‒ Handy to test methods used by the UI
    ‒ In most cases not needed by filter tests

  13. Document model tests
    • The most commonly used kind of test
    • Import
    ‒ Load the file, then assert on the document model as accessed through UNO
    • Export
    ‒ Import → export → import
    ‒ This way the same API can be used for tests, and export is tested as well
    ‒ Alternative: build the document from code, then somehow check the result (XPath for XML-based formats, but what about the rest?)
    ‒ Drawback: the import has to work correctly as well
    ‒ Not a bad thing anyway
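
The XPath alternative for XML-based formats can be sketched with nothing but the Python standard library. This is an illustration, not how the LibreOffice test suite does it: `docx_xpath` and `make_sample_docx` are hypothetical helpers, the sample package is a stand-in for a real export (it is not a complete DOCX), and ElementTree only supports a subset of XPath.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}


def docx_xpath(docx_bytes, path, part="word/document.xml"):
    """Return the elements matching `path` in one XML part of a DOCX package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read(part))
    return root.findall(path, NS)


def make_sample_docx(text):
    """Build a tiny stand-in for an exported file, holding one paragraph."""
    document = (
        f'<w:document xmlns:w="{NS["w"]}"><w:body>'
        f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document)
    return buffer.getvalue()


exported = make_sample_docx("Hello")
runs = docx_xpath(exported, ".//w:p/w:r/w:t")
print([run.text for run in runs])
```

In a real test the bytes would come from the export filter instead of `make_sample_docx`. The approach only works for formats that are XML inside, which is the limitation the slide points out.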

  14. Layout tests
    • Testing the layout
    ‒ Not: contents of the header in page style “Default”
    ‒ But: text in the header on page 3
    • Sometimes handy, but be careful
    ‒ Writer layout is partly calculated in idle time; tests won't wait for that
    ‒ Layout may legitimately differ
    ‒ E.g. missing fonts

  15. Thanks for listening!
    Slides:
    http://vmiklos.hu/odp/
    All text and image content in this document, unless otherwise specified, is licensed under
    the Creative Commons Attribution-Share Alike 3.0 License. This does not include the
    LibreOffice name, logo, or icon. All SUSE marks referenced in this presentation are trademarks
    or registered trademarks of Novell, Inc. in the United States.
