
NYPL Labs Building Inspector: Extracting Data from Historic Maps


Slides for the talk given by Mauricio Giraldo Arteaga at OpenVisConf 2014 http://openvisconf.com/

Video here: https://www.youtube.com/watch?v=Oph1o3IZEFU

Errata (Apr 30, 2014): There's some (kind of cool) dispute on the bronze map. More info: http://exhibitions.nypl.org/treasures/items/show/163 and http://news.nationalgeographic.com/news/2013/08/130821-ostrich-globe-map-discovery-science-nation/

Mauricio Giraldo

April 25, 2014



Transcript

  1. mauricio giraldo arteaga
    @mgiraldo
    NYPL Labs


  2. View Slide

  3. hola


  4. my name is mauricio


  5. View Slide

  6. research and circulating library system
    spanning 3 boroughs in NYC


  7. View Slide

  8. NYPL Labs


  9. View Slide

  10. i’m going to talk about maps


  11. The Great Map Data Extraction


  12. an adventure in three acts


  13. prologue


  14. The Lionel Pincus and
    Princess Firyal Map Division


  15. photo by christopher cannon


  16. View Slide

  17. View Slide

  18. View Slide

  19. View Slide

  20. View Slide

  21. 500,000+ maps
    20,000+ books & atlases


  22. View Slide

  23. View Slide

  24. View Slide

  25. View Slide

  26. View Slide

  27. year


  28. street names
    year


  29. use type
    street names
    year


  30. use type
    street names
    name
    year


  31. material
    use type
    street names
    name
    year


  32. material
    use type
    street names
    name
    class
    year


  33. material
    use type
    street names
    address
    name
    class
    year


  34. material
    use type
    street names
    address
    floors
    name
    class
    year


  35. material
    use type
    street names
    address
    floors
    name
    class
    year
    skylights


  36. material
    use type
    street names
    address
    floors
    name
    class
    year
    skylights
    backyards


  37. material
    use type
    street names
    address
    floors
    name
    class
    geo location
    year
    skylights
    backyards


  38. footprint
    material
    use type
    street names
    address
    floors
    name
    class
    geo location
    year
    skylights
    backyards


  39. footprint
    material
    use type
    street names
    address
    floors
    name
    class
    geo location
    year
    skylights
    backyards


  40. we have these for several decades, going back to the 1800s


  41. data trapped in a legacy format


  42. gimme all dems datas!


  43. f**k yeah historical data!


  44. citysdk.waag.org/buildings


  45. NYU Stern / Imaginaria3D


  46. NYU Stern / Imaginaria3D


  47. maps.google.com


  48. maps.google.com


  49. View Slide

  50. maps.nypl.org/warper


  51. View Slide

  52. View Slide

  53. hand-crafted, artisanal,
    locally-sourced data


  54. 500,000+ maps
    20,000+ books & atlases


  55. 500,000+ maps
    20,000+ books & atlases


  56. 500,000+ maps
    20,000+ books & atlases*
    *imagine how many pages an atlas has


  57. View Slide

  58. this will take us a few millennia*
    *actual number pulled out of a hat


  59. View Slide

  60. View Slide

  61. View Slide

  62. geo-rectification
    or: “make it match OpenStreetMap”


  63. View Slide

  64. View Slide

  65. *this is a simulation. actual process is intensive. consult your mathematician before trying


  66. vectorization
    or: “draw the building shapes”


  67. View Slide

  68. results from maps.nypl.org/warper


  69. View Slide

  70. ~120k footprints produced in three years
    by staff and volunteers


  71. there has to be a better way


  72. act i: will there be polygons?


  73. requests to geo companies went unanswered


  74. View Slide

  75. can we automate this?


  76. View Slide

  77. WTF!?
    @mgiraldo


  78. what is a building?


  79. View Slide

  80. completely enclosed by black lines


  81. completely enclosed by black lines
    dashed lines are not walls


  82. completely enclosed by black lines
    dashed lines are not walls
    > 20m2 (~180ft2)


  83. completely enclosed by black lines
    dashed lines are not walls
    > 20m2 (~180ft2)
    < 3,000m2 (~27,000ft2)


  84. completely enclosed by black lines
    dashed lines are not walls
    > 20m2 (~180ft2)
    < 3,000m2 (~27,000ft2)
    not paper-colored
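
The two size rules above can be checked mechanically. The following is a minimal sketch, not the vectorizer's actual code: it assumes polygon vertices are already expressed in metres (a projected coordinate system), computes the area with the shoelace formula, and compares it against the two thresholds from the slide. The function and constant names are hypothetical.

```python
# Hypothetical sketch of the size rule: keep polygons between 20 and 3,000 square metres.
# Assumes vertex coordinates are in metres; names are illustrative only.

MIN_AREA_M2 = 20.0      # lower bound from the slide
MAX_AREA_M2 = 3000.0    # upper bound from the slide


def polygon_area(vertices):
    """Return the unsigned area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def is_building_sized(vertices):
    """True when the polygon area lies strictly between the two thresholds."""
    return MIN_AREA_M2 < polygon_area(vertices) < MAX_AREA_M2


house = [(0, 0), (10, 0), (10, 8), (0, 8)]        # 80 square metres
speck = [(0, 0), (2, 0), (2, 2), (0, 2)]          # 4 square metres
block = [(0, 0), (100, 0), (100, 60), (0, 60)]    # 6,000 square metres
print([is_building_sized(p) for p in (house, speck, block)])
```

Only the first polygon passes: the second is too small to be a building and the third is more likely a whole city block.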


  85. process


  86. github.com/NYPL/map-vectorizer


  87. View Slide

  88. View Slide

  89. View Slide

  90. completely enclosed by black lines
    dashed lines are not walls
    > 20m2 (~180ft2)
    < 3,000m2 (~27,000ft2)
    not paper-colored


  91. completely enclosed by black lines
    dashed lines are not walls
    > 20m2 (~180ft2)
    < 3,000m2 (~27,000ft2)
    not paper-colored


  92. provide the best (possible) input image


  93. View Slide

  94. View Slide

  95. View Slide

  96. View Slide

  97. differences in resampling
    cubic vs. nearest neighbor


  98. differences in resampling
    cubic vs. nearest neighbor


  99. make the image a binary bitmap
    or: “black and white”
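
To make the "black and white" step concrete, here is a minimal sketch of fixed-threshold binarization on a tiny in-memory image. The threshold value and the function names are assumptions chosen for illustration; the talk does not state which threshold or tool was used for this step.

```python
# Minimal sketch: turn RGB pixels into a 0/1 bitmap with a fixed threshold.
# THRESHOLD is a hypothetical value, not one taken from the talk.

THRESHOLD = 160


def luminance(rgb):
    """Approximate perceived brightness using ITU-R BT.601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def binarize(image, threshold=THRESHOLD):
    """Map each pixel to 0 (dark, e.g. an ink line) or 1 (light)."""
    return [[1 if luminance(px) >= threshold else 0 for px in row] for row in image]


tiny = [
    [(218, 211, 209), (30, 30, 30), (218, 211, 209)],
    [(30, 30, 30), (30, 30, 30), (199, 179, 173)],
]
print(binarize(tiny))
```

Dark ink pixels become 0 and everything lighter becomes 1, which is the form the polygonizing step needs.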


  100. View Slide

  101. View Slide

  102. polygonize
    or: “convert contiguous pixels to a single line shape”


  103. View Slide

  104. gdal_polygonize.py test.tif -f "ESRI Shapefile" test.shp test


  105. gdal_polygonize.py test.tif -f "ESRI Shapefile" test.shp test


  106. gdal_polygonize.py test.tif -f "ESRI Shapefile" test.shp test
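
The idea of grouping contiguous pixels into one shape can be illustrated without GDAL. The sketch below is not how gdal_polygonize.py is implemented; it only labels 4-connected regions of equal-valued pixels in a small bitmap (4-connectedness is also the GDAL tool's default). A real polygonizer would then trace each region's outline into a vector shape. The function name is hypothetical.

```python
from collections import deque


def label_regions(bitmap):
    """Label 4-connected regions of equal-valued pixels.

    Returns (labels, count), where labels has the same shape as bitmap and
    every pixel carries the 1-based number of the region it belongs to.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                continue  # already assigned to a region
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx]
                            and bitmap[ny][nx] == bitmap[y][x]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


bitmap = [
    [1, 1, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
]
labels, count = label_regions(bitmap)
print(count)
for row in labels:
    print(row)
```

In this toy bitmap the black line (the 0 pixels) separates two light regions, so three regions come out: two candidate shapes and the line itself.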


  107. simplify*
    *for those polygons that we care about


  108. completely enclosed by black lines
    dashed lines are not walls
    > 20m2 (~180ft2)
    < 3,000m2 (~27,000ft2)
    not paper-colored

  109. View Slide

  110. View Slide

  111. View Slide

  112. alpha shape convex hull with a sample point set
    cran.r-project.org/web/packages/alphahull/
    *code basically stolen wholesale from rpubs.com/geospacedman/alphasimple


  113. ﹡ ﹡ ﹡
    ﹡﹡

  114. ﹡ ﹡ ﹡
    ﹡﹡

  115. ﹡ ﹡ ﹡
    ﹡﹡

  116. we need a set of points


  117. View Slide

  118. pts = spsample(polygon, n=1000, type="hexagonal")


  119. pts = spsample(polygon, n=1000, type="regular")


  120. pts = spsample(polygon, n=1000, type="random")
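
For readers without R at hand, the "random" sampling variant can be approximated in a few lines. This is a stand-in sketch, not the `spsample` implementation: it draws uniform points in the polygon's bounding box and keeps those that pass a ray-casting point-in-polygon test. The function names and the seed parameter are hypothetical.

```python
import random


def point_in_polygon(x, y, vertices):
    """Ray-casting test: True when (x, y) lies inside the simple polygon."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_points(vertices, n, seed=0):
    """Rejection-sample n random points that fall inside the polygon."""
    rng = random.Random(seed)
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    points = []
    while len(points) < n:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, vertices):
            points.append((x, y))
    return points


# An L-shaped footprint: the square with corners (2, 2) and (4, 4) is cut out.
l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
pts = sample_points(l_shape, 100)
print(len(pts))
```

Every sampled point lands inside the L, so a shape fitted to the point set follows the footprint rather than its bounding box.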


  121. now we alpha shaping


  122. x.as = ashape(pts@coords, alpha=2.0)


  123. there are other point reduction algorithms
    like Ramer-Douglas-Peucker or Visvalingam-Whyatt curve simplification
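
As an illustration of one of these alternatives, here is a short sketch of Ramer-Douglas-Peucker on an open polyline. It is a textbook version written for this transcript, not code from the vectorizer, and the `epsilon` tolerance used in the example is an arbitrary value.

```python
import math


def perpendicular_distance(point, start, end):
    """Distance from point to the infinite line through start and end."""
    (x, y), (x1, y1), (x2, y2) = point, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:  # degenerate segment: fall back to point distance
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)


def rdp(points, epsilon):
    """Drop points that deviate less than epsilon from the simplified line."""
    if len(points) < 3:
        return list(points)
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > max_dist:
            index, max_dist = i, d
    if max_dist > epsilon:
        # Keep the farthest point and simplify both halves recursively.
        left = rdp(points[:index + 1], epsilon)
        right = rdp(points[index:], epsilon)
        return left[:-1] + right
    return [points[0], points[-1]]


# Two slightly wobbly walls meeting at a corner.
wall = [(0, 0), (1, 0.05), (2, -0.04), (3, 0), (3, 1), (3.03, 2), (3, 3)]
print(rdp(wall, epsilon=0.1))
```

The wobble along each wall is dropped while the corner survives, which is the behavior wanted for building outlines traced from pixels.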


  124. separate the buildings from the chaff


  125. completely enclosed by black lines
    dashed lines are not walls
    > 20m2 (~180ft2)
    < 3,000m2 (~27,000ft2)
    not paper-colored

  126. View Slide

  127. View Slide

  128. [218, 211, 209]


  129. [218, 211, 209]
    paper
    [199, 179, 173], [179, 155, 157],
    [206, 193, 189], [199, 195, 163],
    [207, 204, 179], [195, 189, 154],
    [209, 203, 181], [255, 225, 40],
    [194, 198, 192], [161, 175, 190],
    [137, 174, 163], [166, 176, 172],
    [149, 156, 141]
    [205, 200, 186]
    not paper
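
One plausible way to apply such a color list is a nearest-reference lookup. The sketch below is an assumption about how the comparison could work, since the slides do not say how colors are matched: it takes the paper color and a handful of the non-paper colors from the slide and labels a polygon's color by whichever reference is closest in RGB space. The function names are hypothetical.

```python
# Sketch only: nearest-reference color matching in RGB space.
# The reference values come from the slide; the matching rule is an assumption.

PAPER = (218, 211, 209)
NOT_PAPER = [
    (199, 179, 173),
    (179, 155, 157),
    (255, 225, 40),
    (161, 175, 190),
    (137, 174, 163),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def is_paper(rgb):
    """True when the closest reference color is the paper color."""
    nearest = min([PAPER] + NOT_PAPER, key=lambda ref: squared_distance(rgb, ref))
    return nearest == PAPER


print(is_paper((220, 212, 207)))  # very close to the paper reference
print(is_paper((250, 220, 50)))   # close to the yellow reference
```

Plain RGB distance is crude for faded inks, so a perceptual color space might separate similar tones better; that choice is left open here.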


  130. View Slide

  131. View Slide

  132. View Slide

  133. this is good enough
    for our use case


  134. View Slide

  135. View Slide

  136. View Slide

  137. completely enclosed by black lines
    dashed lines are not walls
    > 20m2 (~180ft2)
    < 3,000m2 (~27,000ft2)
    not paper-colored


  138. computer-vision for attribute recognition
    *bonus quest


  139. View Slide

  140. View Slide

  141. View Slide

  142. 66,056 footprints produced in one day
    for an 1859 atlas of manhattan


  143. caveats:
    adjacency not enforced
    false positives/negatives
    buildings may also overlap


  144. act ii: the vectorizer needs to prove itself


  145. View Slide

  146. View Slide

  147. View Slide

  148. View Slide

  149. multiple inspections for each item
    and let consensus surface on its own


  150. footprint validation
    or: “tell us what the computer got right or wrong”


  151. are people willing to spend time
    checking building footprints?
    insurance atlases are not exactly the coolest type of maps


  152. View Slide

  153. buildinginspector.nypl.org


  154. github.com/NYPL/building-inspector


  155. View Slide

  156. View Slide

  157. View Slide

  158. View Slide

  159. about a month later…


  160. View Slide

  161. View Slide

  162. View Slide

  163. View Slide

  164. 420k+ flags*
    70k+ unique polygons
    consensus: ~84% YES, 7% FIX, 9% NO
    *a “flag” is a YES/NO/FIX by one person for a given polygon
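
Turning individual flags into a per-polygon consensus can be sketched as a simple vote tally. The minimum number of flags and the agreement ratio below are hypothetical values picked for illustration; the talk does not state the rule Building Inspector actually applies.

```python
from collections import Counter, defaultdict

MIN_FLAGS = 3          # hypothetical: flags needed before deciding
MIN_AGREEMENT = 0.75   # hypothetical: share of flags that must agree


def consensus(flags):
    """Tally (polygon_id, vote) pairs; return {polygon_id: vote or None}.

    None means there is no consensus yet for that polygon.
    """
    votes = defaultdict(Counter)
    for polygon_id, vote in flags:
        votes[polygon_id][vote] += 1

    result = {}
    for polygon_id, counter in votes.items():
        total = sum(counter.values())
        vote, count = counter.most_common(1)[0]
        agreed = total >= MIN_FLAGS and count / total >= MIN_AGREEMENT
        result[polygon_id] = vote if agreed else None
    return result


flags = [
    (1, "YES"), (1, "YES"), (1, "YES"), (1, "FIX"),
    (2, "NO"), (2, "YES"), (2, "NO"),
    (3, "YES"),
]
print(consensus(flags))
```

Polygon 1 reaches consensus, polygon 2 is still contested, and polygon 3 needs more inspections before anything can be said about it.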


  165. seems people are willing after all…
    we ♥ our contributors


  166. seems people are willing after all…
    we ♥ our contributors


  167. act iii: the return of the inspector


  168. footprint
    material
    use type
    street names
    address
    floors
    name
    class
    geo location
    year
    skylights
    backyards


  169. divide and conquer


  170. footprint
    material
    use type
    street names
    address
    floors
    name
    class
    geo location
    year
    skylights
    backyards


  171. three new tasks
    for now… we really want it all!


  172. footprint
    material
    use type
    street names
    address
    floors
    name
    class
    geo location
    year
    skylights
    backyards


  173. check


  174. check
    YES


  175. check
    YES
    address color


  176. check
    YES FIX
    address color


  177. check
    YES FIX
    address color fix


  178. check
    YES FIX
    address color fix


  179. check
    YES FIX
    address color fix
    *footprints marked as “NO” go to building heaven


  180. check
    YES FIX
    address color fix
    *footprints marked as “NO” go to building heaven


  181. fix


  182. fix


  183. address


  184. address


  185. classify color


  186. classify color


  187. resulting data available via an API
    in 100% recyclable GeoJSON


  188. resulting data available via an API
    in 100% recyclable GeoJSON
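
For readers unfamiliar with GeoJSON, the sketch below shows what a single footprint could look like as a GeoJSON Feature. The coordinates and the property names (`address`, `color`) are invented for illustration and are not the actual schema of the Building Inspector API; only the Feature / Polygon / FeatureCollection structure follows the GeoJSON format itself.

```python
import json


def footprint_feature(ring, properties):
    """Wrap a ring of [longitude, latitude] pairs in a GeoJSON Feature.

    GeoJSON polygons need a closed ring, so the first position is
    repeated at the end when the caller has not already done so.
    """
    if ring[0] != ring[-1]:
        ring = ring + [ring[0]]
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": properties,
    }


# Made-up coordinates and attribute values, for illustration only.
feature = footprint_feature(
    [[-73.9900, 40.7300], [-73.9898, 40.7300],
     [-73.9898, 40.7302], [-73.9900, 40.7302]],
    {"address": "123", "color": "pink"},
)
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection, indent=2))
```

Because the output is plain GeoJSON, it can be loaded directly into common mapping tools without a custom parser.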


  189. View Slide

  190. not the end


  191. gracias bocoup!
    mauricio giraldo arteaga
    @mgiraldo
    NYPL Labs
    images from giphy, wikimedia commons & nypl digital collections
