Postal address cleaning in CEU's Networks Project

Gabor Nyeki
November 28, 2012

Transcript

  2. 3.

    Context

    • The Networks Project studies social and economic networks.
    • Extensive data sets; one of them is Hungary's company registry from 1989 to 2011.
    • Headquarters, establishments, and branches.
    • One of many questions we can ask: Do you learn how to do business from your neighbours?
  3. 4.

    Anatomy of an address

    1075¹ Budapest,² Károly körút³ 9.⁴

    ¹ postal code
    ² settlement
    ³ street name & type
    ⁴ house number
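
The four parts named on this slide can be pulled out of a well-formed address with a regular expression. This is only a minimal sketch: the `Address` type, the `ADDRESS_RE` pattern, and `parse_address` are hypothetical names, not taken from the project, and the pattern assumes the tidy layout of the example (four-digit postal code, comma after the settlement, numeric house number). Real registry data is messier than this.

```python
import re
from typing import NamedTuple, Optional


class Address(NamedTuple):
    postal_code: str
    settlement: str
    street: str        # street name and type, e.g. "Károly körút"
    house_number: str


# Hypothetical pattern for the well-formed case only:
# "<4-digit postal code> <settlement>, <street name & type> <house number>."
ADDRESS_RE = re.compile(
    r"^\s*(?P<postal_code>\d{4})\s+"
    r"(?P<settlement>[^,]+?)\s*,\s*"
    r"(?P<street>.+?)\s+"
    r"(?P<house_number>\d+)\.?\s*$"
)


def parse_address(raw: str) -> Optional[Address]:
    """Return the four components, or None if the input does not match."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    return Address(**match.groupdict())


print(parse_address("1075 Budapest, Károly körút 9."))
```

Returning `None` for anything that does not match keeps the failure visible, so malformed inputs can be counted or routed to further cleaning steps instead of being silently mangled.
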
  4. 12.

    Our take on this: directed acyclic graphs

    • Break the job down into small, manageable tasks.
    • Implement them as small, self-evident functions in Python.
    • Wire them together in a graph.
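
One way to picture the wiring is a dictionary that maps each task to the tasks it depends on, executed in topological order. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9 and later), which did not exist at the time of the talk, so it is certainly not the project's own mechanism. The three task functions and the `GRAPH` and `run` names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical tasks: each reads the record built so far and returns new fields.
def strip_whitespace(record):
    return {"clean": " ".join(record["raw"].split())}


def extract_postal_code(record):
    return {"postal_code": record["clean"][:4]}


def extract_settlement(record):
    rest = record["clean"][4:]
    return {"settlement": rest.split(",")[0].strip()}


# Each task maps to the set of tasks whose output it needs.
GRAPH = {
    strip_whitespace: set(),
    extract_postal_code: {strip_whitespace},
    extract_settlement: {strip_whitespace},
}


def run(raw):
    record = {"raw": raw}
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(GRAPH).static_order():
        record.update(task(record))
    return record


print(run("  1075   Budapest,  Károly körút 9. "))
```

Because the graph is acyclic, an execution order always exists, and adding a new cleaning step only requires writing one function and declaring what it depends on.
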
  6. 14.

    Glue them with unit tests

    • Small functions are easy to understand and easy to test.
    • Complex stuff is complex, but we can still test it.
    • Unit tests guarantee output quality no matter how messy the job.
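
As an illustration of how small such a function and its tests can be, here is a sketch using the standard `unittest` module. The `normalize_street_type` function and its abbreviation table are hypothetical and stand in for whatever tasks the project actually defines; "u." and "krt." are common Hungarian abbreviations of "utca" and "körút".

```python
import unittest


def normalize_street_type(street):
    """Hypothetical task: expand a common abbreviation of the street type."""
    abbreviations = {"u.": "utca", "krt.": "körút"}
    words = street.split()
    if words and words[-1].lower() in abbreviations:
        words[-1] = abbreviations[words[-1].lower()]
    return " ".join(words)


class NormalizeStreetTypeTest(unittest.TestCase):
    def test_expands_abbreviation(self):
        self.assertEqual(normalize_street_type("Károly krt."), "Károly körút")

    def test_leaves_full_form_alone(self):
        self.assertEqual(normalize_street_type("Károly körút"), "Károly körút")

    def test_collapses_extra_whitespace(self):
        self.assertEqual(normalize_street_type("  Nádor   u. "), "Nádor utca")
```

Saved as a module, the tests can be run with `python -m unittest`. Each messy input seen in the data can become one more test case, so a fix for one address cannot quietly break another.
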
  7. 15.

    Measure of goodness

    • We've defined an attribute called street_found.
    • We set street_found = True roughly if the (postal code, settlement, street) triplet is found in our whitelists.
    • By this measure, about six percent of the input data still misbehaves.
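
The measure can be sketched as a set-membership test plus a ratio. The slide says "roughly", so the project's actual rule is likely more forgiving than the exact match used here. The `WHITELIST` contents, the sample records, and the `share_not_found` helper are all made up for illustration; only the name `street_found` comes from the slide.

```python
# Hypothetical whitelist of (postal code, settlement, street) triplets.
WHITELIST = {
    ("1075", "Budapest", "Károly körút"),
}


def street_found(postal_code, settlement, street):
    """True if the exact triplet appears in the whitelist (an assumption)."""
    return (postal_code, settlement, street) in WHITELIST


def share_not_found(addresses):
    """Fraction of addresses whose triplet is missing from the whitelist."""
    if not addresses:
        return 0.0
    misses = sum(1 for address in addresses if not street_found(*address))
    return misses / len(addresses)


# Made-up sample: one triplet on the whitelist, one with a misspelt street.
sample = [
    ("1075", "Budapest", "Károly körút"),
    ("1075", "Budapest", "Karoly korut"),
]
print(share_not_found(sample))
```

Tracking this share over successive versions of the cleaning code gives a single number that shows whether a change helped.
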
  9. 17.

    Concluding remarks

    • Location data available as postal addresses can be used to learn about how people behave.
    • Processing these addresses is messy.
    • But unit testing helps a great deal with quality and consistency.
    • Stay tuned: we're going to open the code on GitHub.
  10. 18.

    Appendix: Data sources

    • Company registry: Complex Kft.
    • Street name whitelists:
        • Hungarian Post
        • official list of election districts