OCRFeeder: OCR Made Easy on GNOME

A presentation of what OCRFeeder is and what it does.

Joaquim Rocha

July 27, 2012

Transcript

  1. Joaquim Rocha (jrocha@igalia.com) · OCRFeeder: OCR Made Easy on GNOME · July 27, 2012
  2. Joaquim Rocha (Igalia) · OCRFeeder · GUADEC 2012 · What is it? Document Analysis and Optical Character Recognition for GNOME.
  3. Why? Paper has a number of problems. No applications for GNU/Linux do a fair job.
  4. Paper problems: Security. CC photo by: http://www.flickr.com/photos/badwsky/
  5. Paper problems: Preservation. CC photo by: http://www.flickr.com/photos/98469445@N00/
  6. Paper problems: Data processing. CC photo by: http://www.flickr.com/photos/hugovk/
  7. Paper problems: Ecology. CC photo by: http://www.flickr.com/photos/pranavsingh/
  8. Paper problems: Accessibility. CC photo by: http://www.flickr.com/photos/illustrator/
  9. No fair conversion apps for GNU/Linux apart from OCR engines, but...
  10. OCR != Document Conversion: OCR only deals with characters, does not consider the layout, and does not distinguish contents.
  11. What's needed is Document Analysis and Recognition: the conversion of documents to an electronic format (first projects in the 80s).
  14. How it works
  15. So many layouts... CC photo by: http://www.flickr.com/photos/uber-tuber/
  16. Layouts vary with the type of document. What works for detecting one won't work on others.
  17. OCRFeeder focuses on contents, not on layouts!
  18. Key concept: if a document image can be divided into windows of 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents.
  20. Recognition: system-wide OCR engines are used. Engines are configured from the GUI or XML files.
  22. The best-known free OCR engines are detected and configured automatically:
      * Tesseract
      * GOCR
      * OCRAD
      * Cuneiform
  23. Export formats: ODT, HTML, plain text, PDF.
  24. User interaction: users can edit everything and review the algorithm's results, so the UI can work in both attended and unattended ways. The CLI only works in unattended mode.
  26. Demo time!
  27. Other features:
      * PDF import
      * Unpaper preprocessor
      * Font style editing
      * Image deskewing
      * OCR results cleaning
      * Project saving/loading
  28. Future:
      * More export formats: hOCR, etc.
      * Make OCR engine management easier
  29. Links:
      * Webpage: http://live.gnome.org/OCRFeeder
      * git: http://git.gnome.org/ocrfeeder
      * Bugzilla: http://bugzilla.gnome.org (product: OCRFeeder)
  30. Thank you!
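
The key concept on slide 18 (mark each window as content or not, then group the 1s and outline them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OCRFeeder's actual code: the image is a plain list of grayscale rows, a window counts as content when any of its pixels differs from a white background, and the 1-windows are grouped by 4-connectivity. The names `window_grid` and `content_boxes` and the `background` and `tolerance` parameters are hypothetical.

```python
from collections import deque


def window_grid(image, window, background=255, tolerance=0):
    """Divide a grayscale image (list of rows of ints) into square windows.

    Returns a grid holding 1 where a window has at least one pixel that
    differs from `background` by more than `tolerance`, and 0 otherwise.
    (Assumed content test; the talk does not say how windows are classified.)
    """
    height, width = len(image), len(image[0])
    grid = []
    for top in range(0, height, window):
        row = []
        for left in range(0, width, window):
            has_content = any(
                abs(image[y][x] - background) > tolerance
                for y in range(top, min(top + window, height))
                for x in range(left, min(left + window, width))
            )
            row.append(1 if has_content else 0)
        grid.append(row)
    return grid


def content_boxes(grid, window):
    """Group 4-connected 1-windows and return one bounding box per group.

    Boxes are (left, top, right, bottom) in window-aligned pixel
    coordinates, so they may extend past the image edge when the image
    size is not a multiple of `window`.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over neighbouring content windows.
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r = max_r = r
            min_c = max_c = c
            while queue:
                cr, cc = queue.popleft()
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            boxes.append((min_c * window, min_r * window,
                          (max_c + 1) * window, (max_r + 1) * window))
    return boxes
```

In this sketch the window size is the main trade-off: smaller windows outline the contents more tightly, while larger ones merge neighbouring blocks sooner and take fewer steps to scan.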
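
Slide 22 says that the best-known free engines are detected automatically. One plausible way to do that, shown here as a sketch and not as what OCRFeeder itself does, is to look for each engine's executable on the `PATH` with Python's standard `shutil.which`. The executable names in `KNOWN_ENGINES` are assumptions based on the usual command names of those engines, and `detect_engines` is a hypothetical helper.

```python
import shutil

# Assumed executable names for the engines listed in the talk.
KNOWN_ENGINES = {
    "Tesseract": "tesseract",
    "GOCR": "gocr",
    "OCRAD": "ocrad",
    "Cuneiform": "cuneiform",
}


def detect_engines(candidates=KNOWN_ENGINES, which=shutil.which):
    """Return {engine name: executable path} for each candidate found.

    `which` is injectable so the lookup can be replaced in tests; by
    default it searches the directories listed in PATH.
    """
    found = {}
    for name, executable in candidates.items():
        path = which(executable)
        if path is not None:
            found[name] = path
    return found


if __name__ == "__main__":
    for name, path in detect_engines().items():
        print(f"{name}: {path}")
```

Running the module prints one line per engine installed on the system and nothing when none is found.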