Postal address cleaning in CEU's Networks Project

Gabor Nyeki
November 28, 2012

Transcript

  1. Postal address cleaning in CEU's Networks Project. Gábor Nyéki, November 28, 2012
  2. Context • The Networks Project studies social and economic networks. • Extensive data sets; one of them is Hungary's company registry from 1989 to 2011. • Headquarters, establishments, and branches.
  3. Context • The Networks Project studies social and economic networks. • Extensive data sets; one of them is Hungary's company registry from 1989 to 2011. • Headquarters, establishments, and branches. • One of many questions we can ask: Do you learn how to do business from your neighbours?
  4. Anatomy of an address: 1075 Budapest, Károly körút 9. • 1075: postal code • Budapest: settlement • Károly körút: street name & type • 9.: house number
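
The four parts named on this slide can be pulled apart with a regular expression. The following is only a minimal sketch, not the project's code: the `Address` type, the `ADDRESS_RE` pattern, and `parse_address` are hypothetical names, and the pattern assumes a well-formed address with a four-digit postal code, a comma after the settlement, and a house number that starts with a digit.

```python
import re
from typing import NamedTuple, Optional


class Address(NamedTuple):
    postal_code: str
    settlement: str
    street: str        # street name and type, e.g. "Károly körút"
    house_number: str


# Hypothetical pattern: four-digit postal code, settlement up to the comma,
# street name and type, then a house number that starts with a digit.
ADDRESS_RE = re.compile(
    r"^\s*(?P<postal_code>\d{4})\s+"
    r"(?P<settlement>[^,]+?)\s*,\s*"
    r"(?P<street>.+?)\s+"
    r"(?P<house_number>\d\S*)\s*$"
)


def parse_address(raw: str) -> Optional[Address]:
    """Split a raw address into its four parts, or return None if it does not fit."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    return Address(
        match["postal_code"],
        match["settlement"],
        match["street"],
        match["house_number"].rstrip("."),  # drop the trailing period of "9."
    )
```

Calling `parse_address("1075 Budapest, Károly körút 9.")` yields the postal code `"1075"`, the settlement `"Budapest"`, the street `"Károly körút"`, and the house number `"9"`. Real registry data would need a far more forgiving parser, as the following slides show.
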
  5. Wrong postal code 1052 Budapest, Kossuth Lajos tér 3.

  6. Incomplete street name 1055 Budapest, Kossuth tér 3.

  7. Ambiguity 2700 Cegléd, Kossuth Lajos utca 5.

  8. Ambiguity 2700 Cegléd, Kossuth Ferenc utca 5.

  9. Verbose street name 1151 Budapest, Kossuth Lajos utca 7.

  10. Spelling mistakes 1151 Budapest, Kosut utca 7.
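
One possible way to catch a misspelling like the one above is fuzzy matching against a list of known street names. The slides do not say how the project handles this, so the sketch below is an assumption: it uses `difflib.get_close_matches` from Python's standard library, and `suggest_street`, the candidate list, and the 0.8 cutoff are hypothetical.

```python
from difflib import get_close_matches


def suggest_street(street, known_streets, cutoff=0.8):
    """Return the closest known street name, or None if nothing is similar enough.

    The cutoff is an illustrative choice, not a tuned value.
    """
    matches = get_close_matches(street, known_streets, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Illustrative candidate list; a real one would come from a street name whitelist.
known = ["Kossuth utca", "Károly körút"]
suggestion = suggest_street("Kosut utca", known)
```

Here `suggestion` is `"Kossuth utca"`. A cutoff that is too low would silently "correct" addresses to the wrong street, so any such step would itself need test cases.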

  11. Obsolete street name 1106 Budapest, Vörös fény utca 3.

  12. Our take on this: directed acyclic graphs • Break the job down into small, manageable tasks. • Implement them as small, self-evident functions in Python. • Wire them together in a graph.
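
The idea of wiring small functions into a directed acyclic graph can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the three task functions, the abbreviation table, and the dependency map are hypothetical, and the ordering relies on `graphlib.TopologicalSorter` from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def strip_whitespace(record):
    """Trim surrounding whitespace from every field."""
    return {key: value.strip() for key, value in record.items()}


def normalize_settlement(record):
    """Capitalize the settlement name."""
    return {**record, "settlement": record["settlement"].title()}


def expand_street_type(record):
    """Spell out abbreviated street types (hypothetical abbreviation table)."""
    abbreviations = {"u.": "utca", "krt.": "körút"}
    words = [abbreviations.get(word, word) for word in record["street"].split()]
    return {**record, "street": " ".join(words)}


TASKS = {
    "strip_whitespace": strip_whitespace,
    "normalize_settlement": normalize_settlement,
    "expand_street_type": expand_street_type,
}

# Each task maps to the set of tasks that must run before it.
DEPENDENCIES = {
    "strip_whitespace": set(),
    "normalize_settlement": {"strip_whitespace"},
    "expand_street_type": {"strip_whitespace"},
}


def run_pipeline(record):
    """Run every task in an order that respects the dependency graph."""
    for name in TopologicalSorter(DEPENDENCIES).static_order():
        record = TASKS[name](record)
    return record
```

`TopologicalSorter` raises `CycleError` if the dependency map contains a cycle, so the "acyclic" requirement is checked whenever the pipeline runs.
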
  13. Glue them with unit tests • Small functions are easy to understand and easy to test. • Complex stuff is complex, but we can still test it.
  14. Glue them with unit tests • Small functions are easy to understand and easy to test. • Complex stuff is complex, but we can still test it. • Unit tests guarantee output quality no matter how messy the job.
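
A unit test for one small cleaning function might look like the sketch below. The function `normalize_postal_code` and its test cases are hypothetical examples written with the standard `unittest` module; they are not taken from the project.

```python
import unittest


def normalize_postal_code(raw):
    """Return a four-digit postal code, or None if the input is not one."""
    digits = raw.strip()
    if len(digits) == 4 and digits.isdigit():
        return digits
    return None


class NormalizePostalCodeTest(unittest.TestCase):
    def test_strips_surrounding_whitespace(self):
        self.assertEqual(normalize_postal_code(" 1055 "), "1055")

    def test_rejects_wrong_length(self):
        self.assertIsNone(normalize_postal_code("105"))

    def test_rejects_non_digits(self):
        self.assertIsNone(normalize_postal_code("10S5"))
```

Saved in a module, these tests can be run with `python -m unittest`. Each messy input found in the data can become one more test method, which keeps later changes from breaking earlier fixes.
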
  15. Measure of goodness • We've defined an attribute called street_found. • We set street_found = True roughly if the (postal code, settlement, street) triplet is found in our whitelists. • By this measure, about six percent of the input data still misbehaves.
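
The measure can be sketched as a set lookup. The slide says "roughly", so the real rule is presumably more involved; in this simplified sketch the `WHITELIST` contents are illustrative triplets based on the example addresses in the earlier slides, and `share_not_found` is a hypothetical helper for computing the failure rate.

```python
# Illustrative whitelist of (postal code, settlement, street) triplets.
# The real whitelists are those listed in the appendix.
WHITELIST = {
    ("1055", "Budapest", "Kossuth Lajos tér"),
    ("1075", "Budapest", "Károly körút"),
}


def street_found(postal_code, settlement, street, whitelist=WHITELIST):
    """True if the triplet appears in the whitelist (simplified rule)."""
    return (postal_code, settlement, street) in whitelist


def share_not_found(addresses, whitelist=WHITELIST):
    """Fraction of (postal code, settlement, street) triplets not in the whitelist."""
    if not addresses:
        return 0.0
    misses = sum(
        1 for triplet in addresses if not street_found(*triplet, whitelist=whitelist)
    )
    return misses / len(addresses)
```

With this rule, the "wrong postal code" example `("1052", "Budapest", "Kossuth Lajos tér")` counts as not found until an earlier cleaning step corrects the postal code.
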
  16. Concluding remarks • Location data available as postal addresses can be used to learn about how people behave. • Processing these addresses is messy. • But unit testing helps a great deal with quality and consistency.
  17. Concluding remarks • Location data available as postal addresses can be used to learn about how people behave. • Processing these addresses is messy. • But unit testing helps a great deal with quality and consistency. • Stay tuned: we're going to open the code on GitHub.
  18. Appendix: Data sources • Company registry: Complex Kft. • Street name whitelists: • Hungarian Post • official list of election districts