Document Liberation Project

News from the reverse and straight engineering world

Fridrich Strba

April 02, 2014

Transcript

  1. Document Liberation Project
     News from the reverse and straight engineering world
     David Tardon, Fridrich Štrba, Валёк Филиппов
  2. Agenda
     - Admitted agenda
       - About the project
       - Document Liberation
       - News
       - Boring technical details
     - Hidden agenda
       - You'll have to wait until the end to know :)
  3. History
     - Launched officially today (2014-04-02)
     - Group working on file-formats within LibreOffice since the beginning
       - GSoC 2011 - libvisio
     - Clear feeling that this is bigger than LibreOffice itself
       - A service of the LibreOffice community to the wider FOSS world
       - Reuse by other projects
     - Scalability issues
       - One person can produce at most 24 man-hours per day
       - Need to attract more people
       - Not everybody wants to become a LibreOffice developer
  4. We believe...
     - Ownership of documents
       - ... that documents and their content belong to their creators, not software vendors
     - Unhindered access of an owner
       - ... that access to content you own should not be hindered by the fact that the application that created it is not maintained any more or that the application does not work on the particular operating system that you use
     - Importance of truly open standards as a long-term solution
       - ... that use of truly open and free standards for encoding digital content is the only long-term guarantee that a user's digital content will never be beholden to a single vendor
     - Importance of FOSS implementation
       - ... that implementation of Free and Open Source Software that can read proprietary file-formats is the best solution to escape vendor lock-in during the transition period to truly open and free standards
  5. Our mission is...
     - File-format understanding
       - ... to try to understand the structure and details of proprietary, undocumented file-formats
     - Parser library implementations
       - ... to use the understanding of the file-formats to implement FOSS libraries that are able to parse such documents and extract as much information as possible from them
     - Being good citizens of the ODF ecosystem
       - ... to use our existing framework to encode this data in a truly free and open standard file-format: the Open Document Format
  6. Goodness in OLEToy
     - New file-formats understood
       - Adobe PageMaker
         - Versions 3 to 7
         - New contributor to OLEToy! David Tardon
       - Software602 602Text
       - Zoner Callisto (Draw)
       - Zoner Zebra (predecessor of Callisto)
       - Apple Keynote 6 / Pages 5 / Numbers 3
  7. New libraries
     - libetonyek
       - Support first for Keynote documents
       - Extending support to Numbers and Pages
     - libe-book
       - Supports a host of e-book file-formats
     - libfreehand
       - Started implementing a FreeHand import filter
     - libabw
       - Now we can load documents of our “cousin”
     - … and more still to come
  8. New document types
     - Previously only text documents and graphics
       - Text documents based on libwpd API
         - libwpd, libwps, libmwaw
       - Graphics based on libwpg API
         - libwpg, libvisio, libcdr, libmspub, libfreehand
     - New presentation support
       - Presentations based on libetonyek API
       - libwpg's API was too limited for presentations
     - Need to extend to spreadsheets too
       - libmwaw
       - libwps
  9. libodfgen
     - ODF generation was duplicated in several places
       - LibreOffice writerperfect module
       - Standalone writerperfect
       - Calligra sources
     - It makes sense to collect all bugs in the same place
     - OdtGenerator class
       - Implementation of WPXDocumentInterface
     - OdgGenerator class
       - Implementation of WPGPaintInterface
     - OdpGenerator class added later
       - Implementation of KEYPresentationInterface
     - OdfDocumentHandler interface
       - SAX-like interface to output XML in a generic way
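To make the "SAX-like interface to output XML in a generic way" more concrete, here is a minimal Python sketch of the idea. It is not the libodfgen API (which is C++); the class `StringDocumentHandler`, its method names, and the helper `write_paragraph` are hypothetical and chosen only for illustration. The point it tries to show is that a generator emits start/end/characters events and never needs to know where the XML ends up.

```python
from xml.sax.saxutils import escape, quoteattr


class StringDocumentHandler:
    """Hypothetical SAX-like sink that collects XML events into a string."""

    def __init__(self):
        self._parts = []

    def start_document(self):
        self._parts.append('<?xml version="1.0" encoding="UTF-8"?>')

    def end_document(self):
        # Nothing to flush for an in-memory sink.
        pass

    def start_element(self, name, attributes=None):
        attrs = "".join(
            f" {key}={quoteattr(str(value))}"
            for key, value in (attributes or {}).items()
        )
        self._parts.append(f"<{name}{attrs}>")

    def end_element(self, name):
        self._parts.append(f"</{name}>")

    def characters(self, text):
        # Escape &, < and > so the output stays well-formed.
        self._parts.append(escape(text))

    def result(self):
        return "".join(self._parts)


def write_paragraph(handler, text):
    """Generator-side code: talks only to the handler, never to a file."""
    handler.start_element("text:p", {"text:style-name": "Standard"})
    handler.characters(text)
    handler.end_element("text:p")


handler = StringDocumentHandler()
handler.start_document()
write_paragraph(handler, "Tom & Jerry")
handler.end_document()
xml_output = handler.result()
```

Swapping `StringDocumentHandler` for a sink that writes into a ZIP package or hands events to an application would leave `write_paragraph` untouched, which is presumably why the handler is kept separate from the generators.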
  10. librevenge
      - Interface of each document type in a different library
        - libwpd, libwpg, libetonyek
      - The common types in libwpd
        - libwpd is a text-related library
        - All others had to link to it
      - Consolidating the types and interfaces
        - Interfaces
          - RVNGTextInterface, RVNGDrawingInterface, RVNGPresentationInterface, RVNGSpreadsheetInterface
        - Types
          - RVNGProperty, RVNGPropertyList, RVNGPropertyListVector
      - Extended the capabilities
        - RVNGBinaryData, RVNGString, RVNGStringVector
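The split between callback interfaces and shared property types can be sketched in a few lines of Python. This is an illustration under assumptions, not the librevenge API: `TextInterface`, `PlainTextConsumer`, and the toy input format are invented here, a plain `dict` stands in for a property list, and the `fo:text-align` key is an ODF-style name picked for the example.

```python
from abc import ABC, abstractmethod


class TextInterface(ABC):
    """Hypothetical stand-in for a text callback interface."""

    @abstractmethod
    def open_paragraph(self, props): ...

    @abstractmethod
    def insert_text(self, text): ...

    @abstractmethod
    def close_paragraph(self): ...


class PlainTextConsumer(TextInterface):
    """Consumer that keeps only the text and honours a centring hint."""

    def __init__(self):
        self.lines = []
        self._buffer = []
        self._centered = False

    def open_paragraph(self, props):
        self._buffer = []
        self._centered = props.get("fo:text-align") == "center"

    def insert_text(self, text):
        self._buffer.append(text)

    def close_paragraph(self):
        line = "".join(self._buffer)
        self.lines.append(line.center(20) if self._centered else line)


def parse_toy_document(raw, sink):
    """Toy parser: a leading '^' marks a centred paragraph (invented format)."""
    for row in raw.splitlines():
        centered = row.startswith("^")
        props = {"fo:text-align": "center"} if centered else {}
        sink.open_paragraph(props)
        sink.insert_text(row[1:] if centered else row)
        sink.close_paragraph()


consumer = PlainTextConsumer()
parse_toy_document("^Title\nBody text", consumer)
```

The parser depends only on the interface and the property container, so neither side has to link against the other's library; that is the consolidation the slide describes.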
  11. librevenge-stream
      - RVNGInputStream interface
        - Extended to handle structured documents a bit more efficiently
      - Several implementations:
        - RVNGFileStream
          - Implementation using file name
        - RVNGStringStream
          - Implementation using a buffer of data
        - RVNGDirectoryStream
          - Accesses a directory structure as if it were a structured document
      - OLE2 and ZIP documents handled transparently
        - No need to know what the container type is
        - Gives the responsibility to the implementers!
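The idea that a parser need not know the container type can be illustrated with a small Python sketch. The classes `ZipStream` and `DirectoryStream` and their method names are hypothetical analogues written for this example, not the librevenge-stream classes; OLE2 is left out because the Python standard library has no reader for it.

```python
import io
import os
import tempfile
import zipfile


class ZipStream:
    """Hypothetical structured stream backed by ZIP data held in memory."""

    def __init__(self, data):
        self._zip = zipfile.ZipFile(io.BytesIO(data))

    def exists_sub_stream(self, name):
        return name in self._zip.namelist()

    def get_sub_stream_by_name(self, name):
        if not self.exists_sub_stream(name):
            return None
        return io.BytesIO(self._zip.read(name))


class DirectoryStream:
    """Hypothetical stream exposing files in a directory as sub-streams."""

    def __init__(self, path):
        self._path = path

    def exists_sub_stream(self, name):
        return os.path.isfile(os.path.join(self._path, name))

    def get_sub_stream_by_name(self, name):
        if not self.exists_sub_stream(name):
            return None
        with open(os.path.join(self._path, name), "rb") as handle:
            return io.BytesIO(handle.read())


def read_main_part(stream, name="content.xml"):
    """Parser-side code: identical for every container type."""
    sub = stream.get_sub_stream_by_name(name)
    return None if sub is None else sub.read()


# Store the same payload in two different containers.
payload = b"<doc/>"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", payload)

with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "content.xml"), "wb") as handle:
        handle.write(payload)
    from_directory = read_main_part(DirectoryStream(folder))

from_zip = read_main_part(ZipStream(buffer.getvalue()))
missing = read_main_part(ZipStream(buffer.getvalue()), "missing.bin")
```

`read_main_part` asks for a named sub-stream and gets bytes back either way; deciding how the container is actually opened is the stream implementer's job.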
  12. librevenge-generators
      - Useful implementations of the different interfaces
      - Raw generators
        - Implementations of the different RVNG interfaces
        - Print the callbacks called and the properties passed
        - Used for regression testing
      - CSV generator for spreadsheets, HTML, Text generators
      - SVG generators
      - Exception: SVG generator for drawings
        - Included in librevenge core library
        - Historical reasons
      - ODF generators in libodfgen
        - More complicated
        - Historical reasons
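A raw generator used for regression testing might look roughly like the following Python sketch. It is an assumption-laden illustration: `RawTextGenerator`, `toy_parser`, the trace format, and the property keys are all invented here and do not reproduce the output of the real raw generators. The sketch records each callback with its properties and compares the trace against a stored expectation.

```python
class RawTextGenerator:
    """Hypothetical raw generator: records callbacks instead of rendering."""

    def __init__(self):
        self.trace = []

    def _record(self, callback, props=None):
        # Sort keys so the trace is stable across runs.
        rendered = ", ".join(
            f"{key}: {value}" for key, value in sorted((props or {}).items())
        )
        self.trace.append(f"{callback}({rendered})")

    def open_paragraph(self, props=None):
        self._record("openParagraph", props)

    def insert_text(self, text):
        self._record("insertText", {"text": text})

    def close_paragraph(self):
        self._record("closeParagraph")

    def dump(self):
        return "\n".join(self.trace)


def toy_parser(sink):
    """Stand-in for a real import filter driving the interface."""
    sink.open_paragraph({"fo:text-align": "center", "fo:margin-left": "0.5in"})
    sink.insert_text("Hello")
    sink.close_paragraph()


# Reference output, as it would be stored next to a sample document.
EXPECTED = "\n".join([
    "openParagraph(fo:margin-left: 0.5in, fo:text-align: center)",
    "insertText(text: Hello)",
    "closeParagraph()",
])

generator = RawTextGenerator()
toy_parser(generator)
regression_ok = generator.dump() == EXPECTED
```

Because the trace is plain text, a change in parser behaviour shows up as an ordinary diff against the stored reference.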
  13. Advantage of the design
      - Parser libraries independent and self-contained
      - Much easier life for filter writers
        - Enough to focus on the structure of the document to parse
        - Call the interface callbacks that one needs
      - Avoid sucking in unrelated libraries
        - librevenge itself and libodfgen have boost as a build-time dependency
        - No need to link text-related libraries in a drawing application
      - Considerable reduction of code duplication
        - Less risk of having bugs fixed in one place and hanging around in another
        - Faster to start a library skeleton
  14. Ways to contribute
      - Code development
        - Contribute to one of our existing libraries, or
        - Start a new one
      - Understanding and documenting file-formats
        - OLEToy
          - Preferred way to visualize documents
          - Need a bit of knowledge of Python
      - Preparation of sample documents
        - Need to access a generating application
        - Important for regression testing
  15. Future file-formats to import?
      - Google Summer of Code
        - The possibility for a student to work with outstanding mentors
          - David Tardon
          - Fridrich Štrba
          - Валёк Филиппов
      - Several formats ready for straight engineering
        - Apple Numbers, Pages
        - Adobe PageMaker
        - Zoner Draw