
NYPL Labs Building Inspector: Extracting Data from Historic Maps

Slides for the talk given by Mauricio Giraldo Arteaga at OpenVisConf 2014 http://openvisconf.com/

Video here: https://www.youtube.com/watch?v=Oph1o3IZEFU

Errata (Apr 30, 2014): There's some (kind of cool) dispute on the bronze map. More info: http://exhibitions.nypl.org/treasures/items/show/163 and http://news.nationalgeographic.com/news/2013/08/130821-ostrich-globe-map-discovery-science-nation/

Mauricio Giraldo

April 25, 2014

Transcript

  1. mauricio giraldo arteaga @mgiraldo NYPL Labs

  2. None
  3. hola

  4. my name is mauricio

  5. None
  6. research and circulating library system spanning 3 boroughs in NYC

  7. None
  8. NYPL Labs

  9. None
  10. i’m going to talk about maps

  11. The Great Map Data Extraction

  12. an adventure in three acts

  13. prologue

  14. The Lionel Pincus and Princess Firyal Map Division

  15. photo by christopher cannon

  16. None
  17. None
  18. None
  19. None
  20. None
  21. 500,000+ maps 20,000+ books & atlases

  22. None
  23. None
  24. None
  25. None
  26. None
  27. year

  28. street names year

  29. use type street names year

  30. use type street names name year

  31. material use type street names name year

  32. material use type street names name class year

  33. material use type street names address name class year

  34. material use type street names address floors name class year

  35. material use type street names address floors name class year

    skylights
  36. material use type street names address floors name class year

    skylights backyards
  37. material use type street names address floors name class geo

    location year skylights backyards
  38. footprint material use type street names address floors name class

    geo location year skylights backyards
  39. footprint material use type street names address floors name class

    geo location year skylights backyards
  40. we got these for several decades since the 1800s

  41. data trapped in a legacy format

  42. gimme all dems datas!

  43. f**k yeah historical data!

  44. citysdk.waag.org/buildings

  45. NYU Stern / Imaginaria3D

  46. NYU Stern / Imaginaria3D

  47. maps.google.com

  48. maps.google.com

  49. None
  50. maps.nypl.org/warper

  51. None
  52. None
  53. hand-crafted, artisanal, locally-sourced data

  54. 500,000+ maps 20,000+ books & atlases

  55. 500,000+ maps 20,000+ books & atlases

  56. 500,000+ maps 20,000+ books & atlases* *imagine how many pages

    an atlas has
  57. None
  58. this will take us a few millennia* *actual number taken

    out from a hat
  59. None
  60. None
  61. None
  62. geo-rectification or: “make it match Open Street Map”

  63. None
  64. None
  65. *this is a simulation. actual process is intensive. consult your

    mathematician before trying
  66. vectorization or: “draw the building shapes”

  67. None
  68. results from maps.nypl.org/warper

  69. None
  70. ~120k footprints produced in three years by staff and volunteers

  71. there has to be a better way

  72. act i: will there be polygons?

  73. requests to geo companies went unanswered

  74. None
  75. can we automate this?

  76. None
  77. WTF!? @mgiraldo

  78. what is a building?

  79. None
  80. completely enclosed by black lines

  81. completely enclosed by black lines dashed lines are not walls

  82. completely enclosed by black lines dashed lines are not walls

    > 20m2 (~180ft2)
  83. completely enclosed by black lines dashed lines are not walls

    > 20m2 (~180ft2) < 3,000m2 (~27,000ft2)
  84. completely enclosed by black lines dashed lines are not walls

    > 20m2 (~180ft2) < 3,000m2 (~27,000ft2) not paper-colored
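
The two size rules above can be sketched as a simple area filter. This is a minimal illustration, not the map-vectorizer's actual code: it assumes each candidate polygon is a list of (x, y) vertices already expressed in metres, and the names `polygon_area` and `plausible_building` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Thresholds taken from the slide: > 20 m2 and < 3,000 m2.
MIN_AREA_M2 = 20.0
MAX_AREA_M2 = 3000.0


def polygon_area(ring: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula.

    Works whether or not the ring repeats its first vertex at the end.
    """
    total = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def plausible_building(ring: List[Point]) -> bool:
    """True when the polygon's area falls strictly inside the size window."""
    return MIN_AREA_M2 < polygon_area(ring) < MAX_AREA_M2
```

A 10 m by 10 m square passes; a 2 m by 2 m shed or a 100 m by 100 m block does not. The other rules (enclosing lines, dashed lines, paper color) need the image itself and are not covered by this sketch.
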
  85. process

  86. github.com/NYPL/map-vectorizer

  87. None
  88. None
  89. None
  90. completely enclosed by black lines dashed lines are not walls

    > 20m2 (~180ft2) < 3,000m2 (~27,000ft2) not paper-colored
  91. completely enclosed by black lines dashed lines are not walls

    > 20m2 (~180ft2) < 3,000m2 (~27,000ft2) not paper-colored
  92. provide the best (possible) input image

  93. None
  94. None
  95. None
  96. None
  97. differences in resampling cubic nearest neighbor

  98. differences in resampling cubic nearest neighbor

  99. make the image a binary bitmap or: “black and white”
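
A minimal sketch of the "black and white" step, assuming the map has already been reduced to a grid of grayscale values from 0 to 255. The function name `to_binary` and the cutoff of 128 are assumptions for illustration; the slides do not say which threshold or tool was used.

```python
def to_binary(gray, threshold=128):
    """Map each grayscale pixel to 0 (black) or 255 (white).

    `gray` is a list of rows, each row a list of ints in 0..255.
    Pixels darker than `threshold` become black, the rest white.
    """
    return [[0 if value < threshold else 255 for value in row] for row in gray]
```

Dark ink lines end up as 0 and the paper and building fills as 255, which leaves contiguous white regions for the polygonize step to trace.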

  100. None
  101. None
  102. polygonize or: “convert contiguous pixels to a single line shape”

  103. None
  104. gdal_polygonize.py test.tif -f "ESRI Shapefile" test.shp test

  105. gdal_polygonize.py test.tif -f "ESRI Shapefile" test.shp test

  106. gdal_polygonize.py test.tif -f "ESRI Shapefile" test.shp test

  107. simplify* *for those polygons that we care about

  108. completely enclosed by black lines dashed lines are not walls

    > 20m2 (~180ft2) < 3,000m2 (~27,000ft2) not paper-colored ✔ ✔
  109. None
  110. None
  111. None
  112. alpha shape convex hull with a sample point set cran.r-project.org/web/packages/alphahull/

    *code basically stolen wholesale from rpubs.com/geospacedman/alphasimple
  113. ﹡ ﹡ ﹡ ﹡ ﹡﹡ ﹡ ﹡ ﹡ ﹡ ﹡

  114. ﹡ ﹡ ﹡ ﹡ ﹡﹡ ﹡ ﹡ ﹡ ﹡ ﹡

  115. ﹡ ﹡ ﹡ ﹡ ﹡﹡ ﹡ ﹡ ﹡ ﹡ ﹡

  116. we need a set of points

  117. None
  118. pts = spsample(polygon, n=1000, type="hexagonal")

  119. pts = spsample(polygon, n=1000, type="regular")

  120. pts = spsample(polygon, n=1000, type="random")

  121. now we alpha shaping

  122. x.as = ashape(pts@coords, alpha=2.0)

  123. there are other point reduction algorithms like Ramer-Douglas-Peucker or Whyatt

    Curve Simplification
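
For comparison with the alpha shape approach, here is a minimal sketch of Ramer-Douglas-Peucker on an open polyline. It is not what the talk used; the function names and the `epsilon` tolerance are illustrative.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b.

    Falls back to the distance from p to a when a and b coincide.
    """
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * px - dx * py + bx * ay - by * ax) / math.hypot(dx, dy)


def rdp(points, epsilon):
    """Drop vertices that lie within epsilon of the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d
    if dmax > epsilon:
        # Keep the farthest vertex and simplify each half recursively.
        left = rdp(points[: index + 1], epsilon)
        right = rdp(points[index:], epsilon)
        return left[:-1] + right
    return [first, last]
```

A larger `epsilon` removes more vertices. A closed building ring would usually be split into two open halves before being passed in, since its first and last points coincide.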
  124. separate the buildings from the chaff

  125. completely enclosed by black lines dashed lines are not walls

    > 20m2 (~180ft2) < 3,000m2 (~27,000ft2) not paper-colored ✔ ✔ ✔ ✔
  126. None
  127. None
  128. [218, 211, 209]

  129. [218, 211, 209] paper [199, 179, 173], [179, 155, 157],

    [206, 193, 189], [199, 195, 163], [207, 204, 179], [195, 189, 154], [209, 203, 181], [255, 225, 40], [194, 198, 192], [161, 175, 190], [137, 174, 163], [166, 176, 172], [149, 156, 141] [205, 200, 186] not paper
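
One way to read the color lists above is as a nearest-reference-color test. The sketch below assumes the first triple is the paper color and the remaining fourteen are non-paper colors, and that each polygon has been reduced to one average RGB value. The name `is_paper` and the plain Euclidean distance in RGB are assumptions; the vectorizer's real comparison may differ.

```python
PAPER = (218, 211, 209)

NOT_PAPER = [
    (199, 179, 173), (179, 155, 157), (206, 193, 189), (199, 195, 163),
    (207, 204, 179), (195, 189, 154), (209, 203, 181), (255, 225, 40),
    (194, 198, 192), (161, 175, 190), (137, 174, 163), (166, 176, 172),
    (149, 156, 141), (205, 200, 186),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def is_paper(rgb):
    """True when the closest reference color is the paper color."""
    nearest = min([PAPER] + NOT_PAPER, key=lambda ref: squared_distance(rgb, ref))
    return nearest == PAPER
```

Polygons whose average color lands closest to the paper reference would be discarded as empty lots or streets, and the rest kept as building candidates.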
  130. None
  131. None
  132. None
  133. this is good enough for our use case

  134. None
  135. None
  136. None
  137. ✔ ✔ ✔ ✔ ✔ completely enclosed by black lines

    dashed lines are not walls > 20m2 (~180ft2) < 3,000m2 (~27,000ft2) not paper-colored
  138. computer-vision for attribute recognition *bonus quest

  139. None
  140. None
  141. None
  142. 66,056 footprints produced in one day for an 1859 atlas

    of manhattan
  143. caveats: ! adjacency not enforced false positives/negatives buildings may also

    overlap
  144. act ii: the vectorizer needs to prove itself

  145. None
  146. None
  147. None
  148. None
  149. multiple inspections for each item and let consensus surface on

    its own
  150. footprint validation or: “tell us what the computer got right

    or wrong”
  151. are people willing to spend time checking building footprints? insurance

    atlases are not exactly the coolest type of maps
  152. None
  153. buildinginspector.nypl.org

  154. github.com/NYPL/building-inspector

  155. None
  156. None
  157. None
  158. None
  159. about a month later…

  160. None
  161. None
  162. None
  163. None
  164. 420k+ flags* 70k+ unique polygons ! consensus: ~84% YES, 7%

    FIX, 9% NO *a “flag” is a YES/NO/FIX by one person for a given polygon
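
A minimal sketch of how a per-polygon consensus could be derived from individual flags. The slides do not give the actual rule, so the minimum of three votes and the 75% agreement share are hypothetical parameters, as is the function name `consensus`.

```python
from collections import Counter


def consensus(flags, min_votes=3, min_share=0.75):
    """Return the agreed label for one polygon, or None if undecided.

    `flags` is a list of "YES" / "NO" / "FIX" strings, one per inspection.
    """
    if len(flags) < min_votes:
        return None
    label, count = Counter(flags).most_common(1)[0]
    return label if count / len(flags) >= min_share else None
```

With these settings, three YES flags and one NO yield "YES", while a polygon with only two flags, or with a three-way split, stays open for more inspections.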
  165. seems people are willing after all… we — our contributors

  166. seems people are willing after all… we — our contributors

  167. act iii: the return of the inspector

  168. footprint material use type street names address floors name class

    geo location year skylights backyards
  169. divide and conquer

  170. footprint material use type street names address floors name class

    geo location year skylights backyards
  171. three new tasks for now… we really want it all!

  172. footprint material use type street names address floors name class

    geo location year skylights backyards
  173. check

  174. check YES

  175. check YES address color

  176. check YES FIX address color

  177. check YES FIX address color fix

  178. check YES FIX address color fix

  179. check YES FIX address color fix *footprints marked as “NO”

    go to building heaven
  180. check YES FIX address color fix *footprints marked as “NO”

    go to building heaven
  181. fix

  182. fix

  183. address

  184. address

  185. classify color

  186. classify color

  187. resulting data available via an API in 100% recyclable GeoJSON

  188. resulting data available via an API in 100% recyclable GeoJSON
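
To show what a footprint looks like in GeoJSON, here is a small sketch that wraps one polygon ring as a Feature. The property names (`year`, `consensus`) and the coordinates are invented for illustration and are not the Building Inspector API's actual schema.

```python
import json


def footprint_feature(ring, properties):
    """Build a GeoJSON Feature for one footprint.

    `ring` is a list of (longitude, latitude) pairs; GeoJSON requires the
    ring to be closed, so the first point is repeated at the end if needed.
    """
    closed = list(ring)
    if closed[0] != closed[-1]:
        closed.append(closed[0])
    return {
        "type": "Feature",
        "geometry": {
            "type": "Polygon",
            "coordinates": [[list(point) for point in closed]],
        },
        "properties": dict(properties),
    }


feature = footprint_feature(
    [(-73.99, 40.73), (-73.98, 40.73), (-73.98, 40.74), (-73.99, 40.74)],
    {"year": 1859, "consensus": "YES"},  # hypothetical property names
)
geojson_text = json.dumps({"type": "FeatureCollection", "features": [feature]})
```

Because the result is plain GeoJSON, it can be loaded directly into most web mapping libraries and desktop GIS tools.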

  189. None
  190. not the end

  191. gracias bocoup! mauricio giraldo arteaga @mgiraldo NYPL Labs images from

    giphy, wikimedia commons & nypl digital collections