Finding images in book page images

Eric Larson
February 07, 2012

Code4Lib 2012 Lightning Talk Presentation

Transcript

  1. Finding images in book page images. Eric Larson, University of Wisconsin-Madison Libraries
  2. Warning: hobbyist code here. I’m certain there are better ways to do this.
  3. None
  4. None
  5. None
  6. None
  7. None
  8. curl

  9. None
  10. imagemagick

  11. Processing steps
      1. Desaturate the image
      2. Boost contrast
      3. Convert image to 1 pixel wide x image height
      4. Sharpen the image
      5. Super-duper grayscale conversion
      6. Produce the text color list
      7. Look for continuous “black” blocks
  12. None
  13. convert

  14. convert -colorspace Gray

  15. None
  16. convert \
        -contrast -contrast \
        -contrast -contrast \
        -contrast -contrast \
        -contrast -contrast \
  17. None
  18. Convert image to 1px x height

  19. None
  20. Sharpen the image

  21. Heavy-handed grayscale conversion => make most grays black => whites are white
  22. convert to txt

  23. None
  24. Look for long, continuous blocks of “black”

  25. None
  26. None
  27. None
  28. github.com/ewlarson/picturepages

  29. Don Quixote # (168/169) 99% Accurate http://openlibrary.org/books/OL24150024M/The_history_of_Don_Quixote

  30. None
  31. None
  32. Paradise Lost # (54/54) 100% Accurate http://openlibrary.org/books/OL14022842M/Paradise_Lost

  33. None
  34. None
  35. Around the World in Eighty Days # (60/62) 97% Accurate http://openlibrary.org/books/OL7050533M/Around_the_world_in_eighty_days
  36. None
  37. None
  38. Wanna help do this better? Contact me. elarson@library.wisc.edu
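
The last processing step on slide 11, looking for long, continuous blocks of “black”, can be illustrated with a small Python sketch. The transcript does not include code for this step, so everything here is an assumption made for illustration: the function name `find_black_runs`, the `threshold` of 128, and the `min_run` of 50 rows are hypothetical and do not come from the talk or from the picturepages repository. The input is assumed to be the list of gray values (0 to 255) read top to bottom from the 1-pixel-wide image produced by the earlier steps.

```python
from typing import List, Tuple


def find_black_runs(
    rows: List[int],
    threshold: int = 128,  # assumed cutoff: values below this count as "black"
    min_run: int = 50,     # assumed minimum run length (in rows) for an image block
) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, where the column
    stays "black" for at least min_run consecutive rows."""
    runs: List[Tuple[int, int]] = []
    start = None  # index where the current black run began, if any
    for i, value in enumerate(rows):
        is_black = value < threshold
        if is_black and start is None:
            start = i  # a new black run begins here
        elif not is_black and start is not None:
            if i - start >= min_run:
                runs.append((start, i))  # run was long enough to keep
            start = None
    # Handle a black run that reaches the bottom of the page.
    if start is not None and len(rows) - start >= min_run:
        runs.append((start, len(rows)))
    return runs


# Synthetic column: white margin, a short dark line of text (3 rows),
# more white, a tall dark block (60 rows), then white again.
column = [255] * 20 + [0] * 3 + [255] * 10 + [30] * 60 + [255] * 20
print(find_black_runs(column))  # [(33, 93)]
```

Short dark runs, such as a single line of text, fall under `min_run` and are ignored, while the tall dark block is reported as a candidate illustration. On real scans both parameters would likely need tuning per book.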