Slide 1

Slide 1 text

mauricio giraldo arteaga @mgiraldo NYPL Labs

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

bonjour (hello)

Slide 4

Slide 4 text

my name is mauricio

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

research and circulating library system spanning the Bronx, Staten Island and Manhattan boroughs in NYC

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

NYPL Labs

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

i’m going to talk about maps

Slide 11

Slide 11 text

The Great Map Data Extraction

Slide 12

Slide 12 text

an adventure in three acts, with a prologue and an epilogue

Slide 13

Slide 13 text

prologue

Slide 14

Slide 14 text

The Lionel Pincus and Princess Firyal Map Division

Slide 15

Slide 15 text

No content

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

No content

Slide 20

Slide 20 text

No content

Slide 21

Slide 21 text

500,000+ maps 20,000+ books & atlases

Slide 22

Slide 22 text

No content

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

No content

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

year

Slide 28

Slide 28 text

street names year

Slide 29

Slide 29 text

use type street names year

Slide 30

Slide 30 text

use type street names name year

Slide 31

Slide 31 text

material use type street names name year

Slide 32

Slide 32 text

material use type street names name class year

Slide 33

Slide 33 text

material use type street names address name class year

Slide 34

Slide 34 text

material use type street names address floors name class year

Slide 35

Slide 35 text

material use type street names address floors name class year skylights

Slide 36

Slide 36 text

material use type street names address floors name class year skylights backyards

Slide 37

Slide 37 text

material use type street names address floors name class geo location year skylights backyards

Slide 38

Slide 38 text

footprint material use type street names address floors name class geo location year skylights backyards

Slide 39

Slide 39 text

footprint material use type street names address floors name class geo location year skylights backyards

Slide 40

Slide 40 text

we have these for several decades, starting in the 1800s; by 1950 every town in the US with a population of 2,000 or more had been mapped

Slide 41

Slide 41 text

data trapped in a legacy format

Slide 42

Slide 42 text

we want all the data!

Slide 43

Slide 43 text

f**k yeah historical data!

Slide 44

Slide 44 text

citysdk.waag.org/buildings

Slide 45

Slide 45 text

citysdk.waag.org/buildings

Slide 46

Slide 46 text

NYU Stern / Imaginaria3D

Slide 47

Slide 47 text

NYU Stern / Imaginaria3D

Slide 48

Slide 48 text

maps.google.com

Slide 49

Slide 49 text

maps.google.com

Slide 50

Slide 50 text

No content

Slide 51

Slide 51 text

data

Slide 52

Slide 52 text

it all starts with a photograph

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

it is only a few clicks away, but it is “just a photo”

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

maps.nypl.org/warper

Slide 57

Slide 57 text

No content

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

geo-rectification or: “make it match OpenStreetMap”

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

No content

Slide 62

Slide 62 text

*this is a simulation. actual process is intensive. consult your mathematician before trying
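
In its simplest form, geo-rectification can be sketched as fitting an affine transform from pixel coordinates to map coordinates using three control points. This is only an illustration: the slides do not say which transformation the Warper applies, and the function names and sample coordinates below are hypothetical.

```python
def solve3(m, v):
    """Solve the 3x3 linear system m * x = v with Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    if abs(d) < 1e-12:
        raise ValueError("control points are collinear")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = v[r]
        solution.append(det(replaced) / d)
    return solution


def fit_affine(pixels, coords):
    """Fit lon = a*px + b*py + c and lat = d*px + e*py + f from three point pairs."""
    m = [[px, py, 1.0] for px, py in pixels]
    a, b, c = solve3(m, [lon for lon, _ in coords])
    d, e, f = solve3(m, [lat for _, lat in coords])
    return a, b, c, d, e, f


def apply_affine(t, px, py):
    """Map one pixel position to (lon, lat) with a fitted transform."""
    a, b, c, d, e, f = t
    return a * px + b * py + c, d * px + e * py + f


# Hypothetical control points: three pixel positions and the matching lon/lat.
transform = fit_affine(
    [(0, 0), (100, 0), (0, 100)],
    [(-74.0, 40.8), (-73.9, 40.8), (-74.0, 40.7)],
)
```

With more than three control points, a least-squares fit or a higher-order transform would be the usual next step.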

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

vectorization or: “draw the building shapes”

Slide 66

Slide 66 text

No content

Slide 67

Slide 67 text

results from maps.nypl.org/warper

Slide 68

Slide 68 text

hand-crafted, artisanal, locally-sourced data

Slide 69

Slide 69 text

500,000+ maps 20,000+ books & atlases

Slide 70

Slide 70 text

500,000+ maps 20,000+ books & atlases* *imagine how many pages an atlas has

Slide 71

Slide 71 text

on the order of tens of millions of building footprints, counting NYC alone

Slide 72

Slide 72 text

No content

Slide 73

Slide 73 text

~120k footprints produced in three years by staff and volunteers

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

this will take us a few millennia* *actual number pulled out of a hat

Slide 76

Slide 76 text

there has to be a better way

Slide 77

Slide 77 text

act i: will there be polygons?

Slide 78

Slide 78 text

requests to geo companies went unanswered

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

can we automate this?

Slide 81

Slide 81 text

No content

Slide 82

Slide 82 text

¿¡quoi!? (what!?) @mgiraldo

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

No content

Slide 85

Slide 85 text

No content

Slide 86

Slide 86 text

No content

Slide 87

Slide 87 text

what is a building?

Slide 88

Slide 88 text

No content

Slide 89

Slide 89 text

completely enclosed by black lines

Slide 90

Slide 90 text

completely enclosed by black lines; dashed lines are not walls

Slide 91

Slide 91 text

completely enclosed by black lines; dashed lines are not walls; > 20 m² (~215 ft²)

Slide 92

Slide 92 text

completely enclosed by black lines; dashed lines are not walls; > 20 m² (~215 ft²); < 3,000 m² (~32,300 ft²)

Slide 93

Slide 93 text

completely enclosed by black lines; dashed lines are not walls; > 20 m² (~215 ft²); < 3,000 m² (~32,300 ft²); not paper-colored
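
The two size criteria can be sketched as a simple area filter. This assumes each candidate polygon is a list of (x, y) vertices in a projected coordinate system measured in metres; the function names are hypothetical and this is not taken from the vectorizer's source.

```python
MIN_AREA_M2 = 20.0      # smaller shapes are unlikely to be buildings
MAX_AREA_M2 = 3000.0    # larger shapes are more likely blocks or streets


def polygon_area(ring):
    """Shoelace formula; works whether or not the ring repeats its first vertex."""
    total = 0.0
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def plausible_building(ring):
    """True when the polygon's area falls strictly between the two thresholds."""
    return MIN_AREA_M2 < polygon_area(ring) < MAX_AREA_M2
```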

Slide 94

Slide 94 text

process

Slide 95

Slide 95 text

github.com/NYPL/map-vectorizer

Slide 96

Slide 96 text

No content

Slide 97

Slide 97 text

No content

Slide 98

Slide 98 text

No content

Slide 99

Slide 99 text

No content

Slide 100

Slide 100 text

completely enclosed by black lines; dashed lines are not walls; > 20 m² (~215 ft²); < 3,000 m² (~32,300 ft²); not paper-colored

Slide 101

Slide 101 text

completely enclosed by black lines; dashed lines are not walls; > 20 m² (~215 ft²); < 3,000 m² (~32,300 ft²); not paper-colored

Slide 102

Slide 102 text

provide the best (possible) input image

Slide 103

Slide 103 text

No content

Slide 104

Slide 104 text

No content

Slide 105

Slide 105 text

No content

Slide 106

Slide 106 text

No content

Slide 107

Slide 107 text

differences in resampling cubic nearest neighbor

Slide 108

Slide 108 text

differences in resampling cubic nearest neighbor
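
Why the resampling method matters can be shown on a single row of pixels. Nearest-neighbor only repeats values that already exist, while interpolating methods invent in-between values that blur ink lines. In this sketch linear interpolation stands in for cubic because it is shorter to write; that substitution is an assumption of the example, not what the slides show.

```python
def resample_nearest(row, factor):
    """Upsample by repeating the closest source pixel."""
    return [row[i // factor] for i in range(len(row) * factor)]


def resample_linear(row, factor):
    """Upsample by linear interpolation between neighbouring pixels."""
    out = []
    last = len(row) - 1
    for i in range(len(row) * factor):
        pos = min(i / factor, last)
        left = int(pos)
        right = min(left + 1, last)
        frac = pos - left
        out.append(row[left] * (1 - frac) + row[right] * frac)
    return out


line = [255, 255, 0, 255, 255]   # white paper with one black ink pixel
nearest = resample_nearest(line, 2)
linear = resample_linear(line, 2)
```

Here `nearest` still contains only pure black and pure white, whereas `linear` contains grey pixels on both sides of the ink line.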

Slide 109

Slide 109 text

make the image a binary bitmap or: “black and white”
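
A minimal sketch of the binarization step, assuming a grayscale image stored as a 2D list of 0-255 values. The threshold of 128 is an arbitrary illustrative value, not one taken from the slides.

```python
def to_binary(gray, threshold=128):
    """Return a bitmap where 1 marks dark (ink) pixels and 0 marks light (paper) pixels."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]
```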

Slide 110

Slide 110 text

No content

Slide 111

Slide 111 text

No content

Slide 112

Slide 112 text

polygonize or: “convert contiguous pixels to a single line shape”
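
The grouping half of polygonizing can be sketched as connected-component labeling: contiguous pixels that share a value receive the same region number. This sketch uses 4-connectedness and stops before tracing each region's outline into a polygon, which a real polygonizer also does; the function name is hypothetical.

```python
from collections import deque


def label_regions(bitmap):
    """Group 4-connected pixels of equal value; returns (labels, region_count)."""
    height, width = len(bitmap), len(bitmap[0])
    labels = [[0] * width for _ in range(height)]   # 0 means "not labelled yet"
    count = 0
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                continue
            count += 1
            value = bitmap[y][x]
            labels[y][x] = count
            queue = deque([(x, y)])
            while queue:                             # breadth-first flood fill
                cx, cy = queue.popleft()
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not labels[ny][nx] and bitmap[ny][nx] == value):
                        labels[ny][nx] = count
                        queue.append((nx, ny))
    return labels, count
```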

Slide 113

Slide 113 text

No content

Slide 114

Slide 114 text

gdal_polygonize.py test.tif -f "ESRI Shapefile" test.shp test

Slide 115

Slide 115 text

gdal_polygonize.py test.tif -f "ESRI Shapefile" test.shp test

Slide 116

Slide 116 text

gdal_polygonize.py test.tif -f "ESRI Shapefile" test.shp test

Slide 117

Slide 117 text

gdal_polygonize.py test.tif -f "ESRI Shapefile" test.shp test

Slide 118

Slide 118 text

No content

Slide 119

Slide 119 text

no no no no no

Slide 120

Slide 120 text

no no no no no yes yes

Slide 121

Slide 121 text

simplify* *for those polygons that we care about

Slide 122

Slide 122 text

completely enclosed by black lines; dashed lines are not walls; > 20 m² (~215 ft²); < 3,000 m² (~32,300 ft²); not paper-colored ✔ ✔

Slide 123

Slide 123 text

No content

Slide 124

Slide 124 text

No content

Slide 125

Slide 125 text

alpha shape *code basically stolen wholesale from rpubs.com/geospacedman/alphasimple

Slide 126

Slide 126 text

﹡ ﹡ ﹡ ﹡ ﹡﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 127

Slide 127 text

﹡ ﹡ ﹡ ﹡ ﹡﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 128

Slide 128 text

﹡ ﹡ ﹡ ﹡ ﹡﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 129

Slide 129 text

we need a set of points

Slide 130

Slide 130 text

No content

Slide 131

Slide 131 text

pts = spsample(polygon, n=1000, type="hexagonal")

Slide 132

Slide 132 text

pts = spsample(polygon, n=1000, type="regular")

Slide 133

Slide 133 text

pts = spsample(polygon, n=1000, type="random")
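
The idea behind the `type="random"` variant can be sketched with rejection sampling: draw points in the polygon's bounding box and keep those that fall inside. This illustrates the concept only and is not how `spsample` is implemented; the function names and the L-shaped test polygon are hypothetical.

```python
import random


def point_in_polygon(x, y, ring):
    """Ray-casting test: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_points(ring, n, seed=0):
    """Draw n uniformly random points inside the polygon (seeded for repeatability)."""
    rng = random.Random(seed)
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    points = []
    while len(points) < n:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, ring):
            points.append((x, y))
    return points


l_shape = [(0, 0), (4, 0), (4, 2), (2, 2), (2, 4), (0, 4)]
pts = sample_points(l_shape, 50)
```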

Slide 134

Slide 134 text

now we do the alpha shaping

Slide 135

Slide 135 text

x.as = ashape(pts@coords, alpha=2.0)

Slide 136

Slide 136 text

x.as = ashape(pts@coords, alpha=2.0)

Slide 137

Slide 137 text

x.as = ashape(pts@coords, alpha=2.0)

Slide 138

Slide 138 text

there are other point reduction algorithms, like Ramer-Douglas-Peucker or Visvalingam-Whyatt curve simplification
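
For comparison with the alpha shape approach, here is a compact sketch of Ramer-Douglas-Peucker on an open polyline. The `epsilon` tolerance and the sample coordinates are illustrative; the deck does not say this algorithm is used in the vectorizer.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def rdp(points, epsilon):
    """Drop vertices that lie within epsilon of the simplified line."""
    if len(points) < 3:
        return list(points)
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= epsilon:
        return [points[0], points[-1]]
    left = rdp(points[:index + 1], epsilon)    # both halves share the split vertex
    right = rdp(points[index:], epsilon)
    return left[:-1] + right
```

A jittery traced wall such as `[(0, 0), (1, 0.05), (2, -0.04), (3, 0), (3, 1), (3.03, 2), (3, 3)]` collapses to its three corners with `epsilon=0.1`.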

Slide 139

Slide 139 text

separate the buildings from the chaff

Slide 140

Slide 140 text

completely enclosed by black lines; dashed lines are not walls; > 20 m² (~215 ft²); < 3,000 m² (~32,300 ft²); not paper-colored ✔ ✔ ✔ ✔

Slide 141

Slide 141 text

No content

Slide 142

Slide 142 text

No content

Slide 143

Slide 143 text

[218, 211, 209]

Slide 144

Slide 144 text

paper: [218, 211, 209]; not paper: [199, 179, 173], [179, 155, 157], [206, 193, 189], [199, 195, 163], [207, 204, 179], [195, 189, 154], [209, 203, 181], [255, 225, 40], [194, 198, 192], [161, 175, 190], [137, 174, 163], [166, 176, 172], [149, 156, 141], [205, 200, 186]
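
One way to use such reference colors is a nearest-color test: a polygon counts as paper when its average color is closer to the paper color than to any of the other reference colors. The sketch below assumes plain Euclidean distance in RGB and reads the first color on the slide as paper and the rest as not paper; neither assumption is confirmed by the deck.

```python
PAPER = (218, 211, 209)
NOT_PAPER = [
    (199, 179, 173), (179, 155, 157), (206, 193, 189), (199, 195, 163),
    (207, 204, 179), (195, 189, 154), (209, 203, 181), (255, 225, 40),
    (194, 198, 192), (161, 175, 190), (137, 174, 163), (166, 176, 172),
    (149, 156, 141), (205, 200, 186),
]


def distance_sq(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def is_paper(rgb):
    """True when rgb is at least as close to the paper color as to any other reference."""
    nearest_other = min(distance_sq(rgb, color) for color in NOT_PAPER)
    return distance_sq(rgb, PAPER) <= nearest_other
```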

Slide 145

Slide 145 text

No content

Slide 146

Slide 146 text

No content

Slide 147

Slide 147 text

No content

Slide 148

Slide 148 text

this is good enough for our use case

Slide 149

Slide 149 text

No content

Slide 150

Slide 150 text

No content

Slide 151

Slide 151 text

No content

Slide 152

Slide 152 text

✔ ✔ ✔ ✔ ✔ completely enclosed by black lines; dashed lines are not walls; > 20 m² (~215 ft²); < 3,000 m² (~32,300 ft²); not paper-colored

Slide 153

Slide 153 text

computer-vision for attribute recognition *bonus quest

Slide 154

Slide 154 text

No content

Slide 155

Slide 155 text

No content

Slide 156

Slide 156 text

No content

Slide 157

Slide 157 text

66,056 footprints produced in one day for an 1859 atlas of Manhattan

Slide 158

Slide 158 text

caveats: adjacency not enforced; false positives/negatives; buildings may also overlap

Slide 159

Slide 159 text

act ii: the vectorizer needs to prove itself

Slide 160

Slide 160 text

No content

Slide 161

Slide 161 text

No content

Slide 162

Slide 162 text

No content

Slide 163

Slide 163 text

No content

Slide 164

Slide 164 text

collect multiple inspections for each item and let consensus surface on its own

Slide 165

Slide 165 text

footprint validation or: “tell us what the computer got right or wrong”

Slide 166

Slide 166 text

are people willing to spend time checking building footprints? insurance atlases are not exactly the coolest type of maps

Slide 167

Slide 167 text

No content

Slide 168

Slide 168 text

buildinginspector.nypl.org

Slide 169

Slide 169 text

github.com/NYPL/building-inspector

Slide 170

Slide 170 text

No content

Slide 171

Slide 171 text

No content

Slide 172

Slide 172 text

No content

Slide 173

Slide 173 text

No content

Slide 174

Slide 174 text

about a month later…

Slide 175

Slide 175 text

No content

Slide 176

Slide 176 text

No content

Slide 177

Slide 177 text

No content

Slide 178

Slide 178 text

No content

Slide 179

Slide 179 text

420k+ flags*; 70k+ unique polygons; consensus: ~84% YES, 7% FIX, 9% NO. *a “flag” is a YES/NO/FIX by one person for a given polygon
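
Turning individual flags into a per-polygon consensus can be sketched as a thresholded majority vote. The minimum of 3 votes and the 75% agreement share are hypothetical values chosen for illustration; the slides do not state the rule Building Inspector uses.

```python
from collections import Counter


def consensus(flags, min_votes=3, min_share=0.75):
    """Return the winning flag ('YES', 'NO' or 'FIX'), or None if there is no consensus yet."""
    if len(flags) < min_votes:
        return None
    value, votes = Counter(flags).most_common(1)[0]
    return value if votes / len(flags) >= min_share else None
```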

Slide 180

Slide 180 text

seems people are willing after all… we ♥ our contributors

Slide 181

Slide 181 text

seems people are willing after all… we ♥ our contributors

Slide 182

Slide 182 text

act iii: the return of the inspector

Slide 183

Slide 183 text

footprint material use type street names address floors name class geo location year skylights backyards

Slide 184

Slide 184 text

divide and conquer

Slide 185

Slide 185 text

footprint material use type street names address floors name class geo location year skylights backyards

Slide 186

Slide 186 text

three new tasks for now… we really want it all!

Slide 187

Slide 187 text

No content

Slide 188

Slide 188 text

footprint material use type street names address floors name class geo location year skylights backyards

Slide 189

Slide 189 text

check

Slide 190

Slide 190 text

check YES

Slide 191

Slide 191 text

check YES address color

Slide 192

Slide 192 text

check YES FIX address color

Slide 193

Slide 193 text

check YES FIX address color fix

Slide 194

Slide 194 text

check YES FIX address color fix

Slide 195

Slide 195 text

check YES FIX address color fix *footprints marked as “NO” go to building heaven

Slide 196

Slide 196 text

check YES FIX address color fix *footprints marked as “NO” go to building heaven

Slide 197

Slide 197 text

fix

Slide 198

Slide 198 text

fix

Slide 199

Slide 199 text

address

Slide 200

Slide 200 text

address

Slide 201

Slide 201 text

classify color

Slide 202

Slide 202 text

classify color

Slide 203

Slide 203 text

865k+ flags

Slide 204

Slide 204 text

check YES FIX address color fix

Slide 205

Slide 205 text

check YES FIX address color fix for 80k+ unique polygons 77k+ 5k+ 42k+ 26k+

Slide 206

Slide 206 text

epilogue

Slide 207

Slide 207 text

address and shape consensus or: how to determine what the right building footprint and address look like?

Slide 208

Slide 208 text

No content

Slide 209

Slide 209 text

No content

Slide 210

Slide 210 text

all points are useful: inclusiveness above all

Slide 211

Slide 211 text

No content

Slide 212

Slide 212 text

No content

Slide 213

Slide 213 text

No content

Slide 214

Slide 214 text

No content

Slide 215

Slide 215 text

No content

Slide 216

Slide 216 text

No content

Slide 217

Slide 217 text

No content

Slide 218

Slide 218 text

No content

Slide 219

Slide 219 text

DBSCAN for the win citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.1980
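
A minimal pure-Python sketch of DBSCAN on 2D points, to show why it suits this problem: the number of clusters does not have to be known in advance, and stray clicks end up as noise. The `eps` and `min_pts` values in any call are illustrative, and this is not the project's own implementation.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if math.hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE                  # may be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster            # border point: joins, does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:          # core point: keep growing the cluster
                queue.extend(reach)
    return labels
```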

Slide 220

Slide 220 text

bit.ly/nypl-consensus

Slide 221

Slide 221 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 222

Slide 222 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 223

Slide 223 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 224

Slide 224 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 225

Slide 225 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 226

Slide 226 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 227

Slide 227 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 228

Slide 228 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 229

Slide 229 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ + +

Slide 230

Slide 230 text

246 246 246 414 246 414 414 246 414 414 414 ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ + +

Slide 231

Slide 231 text

246 246 246 414 246 414 414 246 414 414 414 ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ + +

Slide 232

Slide 232 text

246 414 + +
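
Once the clicks are clustered, each cluster can be reduced to a single address by taking the most frequent transcription in it. This sketch assumes cluster ids like those a DBSCAN run produces, with -1 meaning noise; the function name and the input layout are hypothetical.

```python
from collections import Counter, defaultdict


def address_per_cluster(cluster_ids, texts):
    """Map each cluster id to the most common text typed for it, skipping noise (-1)."""
    grouped = defaultdict(list)
    for cluster_id, text in zip(cluster_ids, texts):
        if cluster_id != -1:
            grouped[cluster_id].append(text)
    return {cluster_id: Counter(values).most_common(1)[0][0]
            for cluster_id, values in grouped.items()}
```

A single mistyped entry such as "248" in a cluster of "246" answers is simply outvoted.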

Slide 233

Slide 233 text

No content

Slide 234

Slide 234 text

No content

Slide 235

Slide 235 text

DBSCAN for shapes also!

Slide 236

Slide 236 text

No content

Slide 237

Slide 237 text

No content

Slide 238

Slide 238 text

No content

Slide 239

Slide 239 text

No content

Slide 240

Slide 240 text

No content

Slide 241

Slide 241 text

No content

Slide 242

Slide 242 text

all points are still useful

Slide 243

Slide 243 text

No content

Slide 244

Slide 244 text

No content

Slide 245

Slide 245 text

﹡ ﹡

Slide 246

Slide 246 text

﹡ ﹡ ﹡

Slide 247

Slide 247 text

﹡ ﹡ ﹡ ﹡

Slide 248

Slide 248 text

﹡ ﹡ ﹡

Slide 249

Slide 249 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 250

Slide 250 text

﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡ ﹡

Slide 251

Slide 251 text

+ + + + + + +

Slide 252

Slide 252 text

+ + + + + + +

Slide 253

Slide 253 text

+ + + + + + +

Slide 254

Slide 254 text

+ + + + + + +

Slide 255

Slide 255 text

+ + + + + + +

Slide 256

Slide 256 text

+ + + + + + +

Slide 257

Slide 257 text

+ + + + + + +

Slide 258

Slide 258 text

+ + + + + + +

Slide 259

Slide 259 text

No content

Slide 260

Slide 260 text

No content

Slide 261

Slide 261 text

No content

Slide 262

Slide 262 text

No content

Slide 263

Slide 263 text

No content

Slide 264

Slide 264 text

No content

Slide 265

Slide 265 text

No content

Slide 266

Slide 266 text

resulting data available via an API

Slide 267

Slide 267 text

resulting data available via an API in 100% recyclable GeoJSON
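
As an illustration of what a footprint looks like in GeoJSON, the sketch below builds a Feature and serializes it. The property names (`address`, `consensus`) and the coordinates are hypothetical and do not describe the actual API response.

```python
import json


def footprint_feature(ring, properties):
    """Build a GeoJSON Feature for one footprint; ring is a list of (lon, lat) pairs."""
    closed = list(ring)
    if closed[0] != closed[-1]:
        closed.append(closed[0])       # GeoJSON linear rings repeat the first position
    return {
        "type": "Feature",
        "geometry": {
            "type": "Polygon",
            "coordinates": [[list(position) for position in closed]],
        },
        "properties": properties,
    }


feature = footprint_feature(
    [(-73.990, 40.730), (-73.989, 40.730), (-73.989, 40.731), (-73.990, 40.731)],
    {"address": "246", "consensus": "YES"},   # hypothetical property names
)
collection = {"type": "FeatureCollection", "features": [feature]}
text = json.dumps(collection)
```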

Slide 268

Slide 268 text

No content

Slide 269

Slide 269 text

photographing

Slide 270

Slide 270 text

photographing ↓

Slide 271

Slide 271 text

photographing ↓ geo-rectification

Slide 272

Slide 272 text

photographing ↓ geo-rectification ↓

Slide 273

Slide 273 text

photographing ↓ geo-rectification ↓ vectorization

Slide 274

Slide 274 text

photographing ↓ geo-rectification ↓ vectorization ↓

Slide 275

Slide 275 text

photographing ↓ geo-rectification ↓ vectorization ↓ inspection

Slide 276

Slide 276 text

photographing ↓ geo-rectification ↓ vectorization ↓ inspection ↓

Slide 277

Slide 277 text

photographing ↓ geo-rectification ↓ vectorization ↓ inspection ↓ check / fix / color / address

Slide 278

Slide 278 text

photographing ↓ geo-rectification ↓ vectorization ↓ inspection ↓ check / fix / color / address ↓

Slide 279

Slide 279 text

photographing ↓ geo-rectification ↓ vectorization ↓ inspection ↓ check / fix / color / address ↓ consensus

Slide 280

Slide 280 text

photographing ↓ geo-rectification ↓ vectorization ↓ inspection ↓ check / fix / color / address ↓ consensus ↓

Slide 281

Slide 281 text

photographing ↓ geo-rectification ↓ vectorization ↓ inspection ↓ check / fix / color / address ↓ consensus ↓ data release

Slide 282

Slide 282 text

not the end

Slide 283

Slide 283 text

No content

Slide 284

Slide 284 text

No content

Slide 285

Slide 285 text

No content

Slide 286

Slide 286 text

¡merci beaucoup! (thank you very much!) mauricio giraldo arteaga, @mgiraldo, NYPL Labs. slides at: bit.ly/nypl-ehess. images from: NYPL digital collections, Wikimedia Commons, Christopher Cannon, Flickr user wallyg, Giphy