Postal address cleaning in CEU's Networks Project

Gabor Nyeki
November 28, 2012

Transcript

  2. Context

    • The Networks Project studies social and economic networks.
    • Extensive data sets; one of them is Hungary's company registry from 1989 to 2011.
    • Headquarters, establishments, and branches.
    • One of many questions we can ask: Do you learn how to do business from your neighbours?
  3. Anatomy of an address

    1075 (1) Budapest, (2) Károly körút (3) 9. (4)

    (1) postal code, (2) settlement, (3) street name & type, (4) house number
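
To make the four components concrete, here is a minimal sketch that splits a well-formed address like the one on the slide. The regular expression and the name `split_address` are assumptions for illustration, not the project's code; real registry data is far messier than this pattern allows.

```python
import re

# Hypothetical pattern for well-formed addresses such as
# "1075 Budapest, Károly körút 9." (Hungarian postal codes have four digits).
ADDRESS_RE = re.compile(
    r"^\s*(?P<postal_code>\d{4})\s+"   # (1) postal code
    r"(?P<settlement>[^,]+?)\s*,\s*"   # (2) settlement, up to the comma
    r"(?P<street>.+?)\s+"              # (3) street name & type
    r"(?P<house_number>\d+)\.?\s*$"    # (4) house number, optional trailing dot
)


def split_address(raw):
    """Return the four components as a dict, or None if the pattern does not match."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    return match.groupdict()


print(split_address("1075 Budapest, Károly körút 9."))
```

Returning `None` for anything that does not fit keeps the function honest: malformed input is flagged rather than guessed at.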
  4. Our take on this: directed acyclic graphs

    • Break the job down into small, manageable tasks.
    • Implement them as small, self-evident functions in Python.
    • Wire them together in a graph.
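
The slides do not show how the graph is wired, so the following is only a sketch of the idea: a few small functions, a table of which task depends on which, and a runner that executes them in dependency order. All task names are hypothetical, and `graphlib.TopologicalSorter` (standard library, Python 3.9+) is a stand-in for whatever mechanism the project actually uses.

```python
from graphlib import TopologicalSorter


def normalize_spaces(raw):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(raw.split())


def extract_postal_code(normalized):
    """Return the leading four-digit postal code, or None if there is none."""
    head = normalized.split(" ", 1)[0]
    return head if head.isdigit() and len(head) == 4 else None


def extract_rest(normalized):
    """Return everything after the first token."""
    parts = normalized.split(" ", 1)
    return parts[1] if len(parts) > 1 else ""


def make_record(postal_code, rest):
    """Join the outputs of two upstream tasks into one record."""
    return {"postal_code": postal_code, "rest": rest}


# Hypothetical task table: task name -> (function, names of upstream tasks).
TASKS = {
    "normalized": (normalize_spaces, ("raw",)),
    "postal_code": (extract_postal_code, ("normalized",)),
    "rest": (extract_rest, ("normalized",)),
    "record": (make_record, ("postal_code", "rest")),
}


def run_graph(raw):
    """Run every task after its dependencies and return all intermediate results."""
    results = {"raw": raw}
    graph = {name: set(deps) for name, (_, deps) in TASKS.items()}
    for name in TopologicalSorter(graph).static_order():
        if name in TASKS:  # "raw" is an input, not a task
            func, deps = TASKS[name]
            results[name] = func(*(results[dep] for dep in deps))
    return results


print(run_graph("  1075   Budapest,  Károly körút 9. ")["record"])
```

Because each node is an ordinary function, adding a cleaning step means writing one more function and one more row in the table; a cycle in the table makes `TopologicalSorter` raise `CycleError`.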
  6. Glue them with unit tests

    • Small functions are easy to understand and easy to test.
    • Complex stuff is complex, but we can still test it.
    • Unit tests guarantee output quality no matter how messy the job.
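
As an illustration of testing one small function in isolation, here is a sketch using the standard `unittest` module. The helper `normalize_street_type` and its abbreviation table are assumptions made up for this example ("krt." and "u." are common Hungarian abbreviations of "körút" and "utca"); the project's own functions and tests are not shown in the slides.

```python
import unittest

# Illustrative abbreviation table, not the project's actual list.
STREET_TYPE_ABBREVIATIONS = {"krt.": "körút", "u.": "utca"}


def normalize_street_type(street):
    """Expand an abbreviated street type at the end of a street name."""
    words = street.split()
    if words and words[-1] in STREET_TYPE_ABBREVIATIONS:
        words[-1] = STREET_TYPE_ABBREVIATIONS[words[-1]]
    return " ".join(words)


class NormalizeStreetTypeTest(unittest.TestCase):
    def test_expands_korut(self):
        self.assertEqual(normalize_street_type("Károly krt."), "Károly körút")

    def test_expands_utca(self):
        self.assertEqual(normalize_street_type("Dob u."), "Dob utca")

    def test_leaves_full_form_untouched(self):
        self.assertEqual(normalize_street_type("Károly körút"), "Károly körút")

    def test_handles_empty_input(self):
        self.assertEqual(normalize_street_type(""), "")
```

Saved as a module, the tests can be run with `python -m unittest`. Each messy case found in the data can then be pinned down as one more test method.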
  7. Measure of goodness

    • We've defined an attribute called street_found.
    • We set street_found = True roughly if the (postal code, settlement, street) triplet is found in our whitelists.
    • By this measure, about six percent of the input data still misbehaves.
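
The slide says street_found is set "roughly" by a whitelist lookup, so the exact rule is not known. A simplified sketch of such a lookup, and of the share of records that fail it, might look like this. The whitelist contents, the record layout, and the function names are assumptions for illustration.

```python
# Hypothetical whitelist of (postal code, settlement, street) triplets.
WHITELIST = {
    ("1075", "Budapest", "Károly körút"),
}


def street_found(record, whitelist=WHITELIST):
    """True if the record's (postal code, settlement, street) triplet is whitelisted."""
    key = (
        record.get("postal_code"),
        record.get("settlement"),
        record.get("street"),
    )
    return key in whitelist


def share_not_found(records, whitelist=WHITELIST):
    """Fraction of records whose triplet is missing from the whitelist."""
    if not records:
        return 0.0
    misses = sum(1 for record in records if not street_found(record, whitelist))
    return misses / len(records)


records = [
    {"postal_code": "1075", "settlement": "Budapest", "street": "Károly körút"},
    # Invented misspelling, used only to show a miss.
    {"postal_code": "1075", "settlement": "Budapest", "street": "Karoly korut"},
]
print(share_not_found(records))
```

Storing the whitelist as a set of tuples makes each lookup a constant-time membership test, which matters when the registry holds many addresses.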
  9. Concluding remarks

    • Location data available as postal addresses can be used to learn about how people behave.
    • Processing these addresses is messy.
    • But unit testing helps a great deal with quality and consistency.
    • Stay tuned: we're going to open the code on GitHub.
  10. Appendix: Data sources

    • Company registry: Complex Kft.
    • Street name whitelists:
      • Hungarian Post
      • official list of election districts