OCRFeeder: OCR Made Easy on GNOME

A presentation of what OCRFeeder is and what it does.

Joaquim Rocha

July 27, 2012

Transcript

  1. Joaquim Rocha · OCRFeeder: OCR Made Easy on GNOME · July 27, 2012

    static void
    _f_do_barnacle_install_properties (GObjectClass *gobject_class)
    {
      GParamSpec *pspec;

      /* Party code attribute */
      pspec = g_param_spec_uint64 (F_DO_BARNACLE_CODE,
                                   "Barnacle code.",
                                   "Barnacle code",
                                   0,
                                   G_MAXUINT64,
                                   G_MAXUINT64 /* default value */,
                                   G_PARAM_READABLE | G_PARAM_WRITABLE | G_PARAM_PRIVATE);
      g_object_class_install_property (gobject_class,
                                       F_DO_BARNACLE_PROP_CODE,
  2. Joaquim Rocha (Igalia) · OCRFeeder · GUADEC 2012
    What is it? Document Analysis and Optical Character Recognition for GNOME
  3. Why? Paper has a number of problems. There are no applications for GNU/Linux that do a fair job.
  4. Paper problems: Security (CC photo by: http://www.flickr.com/photos/badwsky/)
  5. Paper problems: Preservation (CC photo by: http://www.flickr.com/photos/98469445@N00/)
  6. Paper problems: Data processing (CC photo by: http://www.flickr.com/photos/hugovk/)
  7. Paper problems: Ecology (CC photo by: http://www.flickr.com/photos/pranavsingh/)
  8. Paper problems: Accessibility (CC photo by: http://www.flickr.com/photos/illustrator/)
  9. No fair conversion apps for GNU/Linux apart from OCR engines, but...
  10. OCR != Document Conversion (it only deals with chars) (does not consider the layout) (does not distinguish contents)
  11. What's needed is Document Analysis and Recognition (conversion of documents to an electronic format) (first projects in the 80s)
  12. So many layouts... (CC photo by: http://www.flickr.com/photos/uber-tuber/)
  13. Layouts vary with the type of document. What works for detecting one won't work on others.
  14. Key concept: If a document image can be divided into windows of 1 (content) or 0 (not content), then it is possible to group all the 1s and outline the contents.
  15. Recognition: System-wide OCR engines are used. Engines are configured from the GUI or XML files.
  16. The best-known free OCR engines are detected and configured automatically:
    * Tesseract
    * GOCR
    * OCRAD
    * Cuneiform
  17. User interaction: Users can edit everything and review the algorithm's results. So, the UI can work in attended and unattended ways. The CLI only works in an unattended mode.
  18. Other features:
    * PDF import
    * Unpaper preprocessor
    * Font style editing
    * Image deskewing
    * OCR results cleaning
    * Project saving/loading
  19. Future:
    * More export formats: hOCR, etc.
    * Make OCR engines' management easier
  20. Webpage: http://live.gnome.org/OCRFeeder
    git: http://git.gnome.org/ocrfeeder
    Bugzilla: http://bugzilla.gnome.org (product: OCRFeeder)
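
The key concept on slide 14 (mark each window of the page as content or not content, then group the content windows and outline them) can be illustrated with a small sketch. This is not OCRFeeder's actual code: the function names, the darkness threshold, and the use of 4-connected grouping are all assumptions made for illustration, and the image is a plain list of grayscale rows rather than a real image object.

```python
from collections import deque


def classify_windows(image, window_size, threshold=0.02, background=255):
    """Split a grayscale image (list of rows of ints) into square windows.

    Each window becomes 1 (content) when the share of dark pixels exceeds
    `threshold`, else 0. The threshold value is an illustrative assumption.
    """
    height, width = len(image), len(image[0])
    grid = []
    for top in range(0, height, window_size):
        row = []
        for left in range(0, width, window_size):
            pixels = [
                image[y][x]
                for y in range(top, min(top + window_size, height))
                for x in range(left, min(left + window_size, width))
            ]
            dark = sum(1 for p in pixels if p < background // 2)
            row.append(1 if dark / len(pixels) > threshold else 0)
        grid.append(row)
    return grid


def group_content_boxes(grid, window_size):
    """Group 4-connected content windows and return their bounding boxes.

    Boxes are (left, top, right, bottom) in pixels, in row-major scan order.
    """
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over neighbouring content windows.
            queue = deque([(r, c)])
            seen[r][c] = True
            min_r = max_r = r
            min_c = max_c = c
            while queue:
                cr, cc = queue.popleft()
                min_r, max_r = min(min_r, cr), max(max_r, cr)
                min_c, max_c = min(min_c, cc), max(max_c, cc)
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 1 and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            boxes.append((min_c * window_size, min_r * window_size,
                          (max_c + 1) * window_size, (max_r + 1) * window_size))
    return boxes
```

Each returned box outlines one candidate content area, which could then be handed to an OCR engine or treated as an image. In practice the window size would need tuning per document, which matches the point on slide 13 that layouts vary with the type of document.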
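
Slide 16 says the best-known free OCR engines are detected automatically. One plausible way to do this, shown here only as a sketch and not as OCRFeeder's implementation, is to look for each engine's executable on the `PATH`. The executable names in the table and the `detect_engines` helper are assumptions for illustration.

```python
import shutil

# Display name -> executable name (assumed; adjust to the local installation).
CANDIDATE_ENGINES = {
    "Tesseract": "tesseract",
    "GOCR": "gocr",
    "OCRAD": "ocrad",
    "Cuneiform": "cuneiform",
}


def detect_engines(candidates=CANDIDATE_ENGINES, which=shutil.which):
    """Return {engine name: executable path} for engines found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    found = {}
    for name, executable in candidates.items():
        path = which(executable)
        if path is not None:
            found[name] = path
    return found


if __name__ == "__main__":
    for name, path in detect_engines().items():
        print(f"{name}: {path}")
```

Engines that are not installed are simply left out of the result, so a front end could offer only the engines that were found and leave the rest to manual configuration from the GUI or XML files.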