Postal address cleaning in CEU's Networks Project

Gabor Nyeki
November 28, 2012


Transcript

  1. Postal address cleaning
    in CEU's Networks Project
    Gábor Nyéki
    November 28, 2012

  2. Context
    • The Networks Project studies social and economic
    networks.
    • Extensive data sets; one of them is Hungary's company
    registry from 1989 to 2011.

    Headquarters, establishments, and branches.

  3. Context
    • The Networks Project studies social and economic
    networks.
    • Extensive data sets; one of them is Hungary's company
    registry from 1989 to 2011.

    Headquarters, establishments, and branches.
    • One of many questions we can ask:
    Do you learn how to do business from your neighbours?

  4. Anatomy of an address
    1075 [1] Budapest, [2] Károly körút [3] 9. [4]
    [1] postal code
    [2] settlement
    [3] street name & type
    [4] house number
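
The four parts above can be pulled apart with a regular expression. This is a minimal sketch, not the project's code: the pattern, the `Address` type, and `parse_address` are hypothetical names, and the pattern assumes a well-formed address with a four-digit postal code and a comma after the settlement.

```python
import re
from typing import NamedTuple, Optional


class Address(NamedTuple):
    postal_code: str
    settlement: str
    street: str        # street name and type, e.g. "Károly körút"
    house_number: str


# Assumed layout: "<4 digits> <settlement>, <street name & type> <house number>"
ADDRESS_RE = re.compile(
    r"^\s*(?P<postal_code>\d{4})\s+"
    r"(?P<settlement>[^,]+?)\s*,\s*"
    r"(?P<street>.+?)\s+"
    r"(?P<house_number>\d\S*)\s*$"
)


def parse_address(raw: str) -> Optional[Address]:
    """Split a raw address into its four parts, or return None if it does not fit."""
    match = ADDRESS_RE.match(raw)
    if match is None:
        return None
    return Address(**match.groupdict())


address = parse_address("1075 Budapest, Károly körút 9.")
```

Real registry entries will often fail this pattern, which is the point of the slides that follow.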

  5. Wrong postal code
    1052 Budapest, Kossuth Lajos tér 3.

  6. Incomplete street name
    1055 Budapest, Kossuth tér 3.

  7. Ambiguity
    2700 Cegléd, Kossuth Lajos utca 5.

  8. Ambiguity
    2700 Cegléd, Kossuth Ferenc utca 5.

  9. Verbose street name
    1151 Budapest, Kossuth Lajos utca 7.

  10. Spelling mistakes
    1151 Budapest, Kosut utca 7.
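
The deck does not say how misspelled street names are repaired. One possible approach, shown here purely as an illustration, is fuzzy matching against a whitelist with the standard library's `difflib`. The whitelist entries, the cutoff value, and the function name `closest_street` are assumptions made for this sketch.

```python
import difflib
from typing import Optional, Sequence

# Toy whitelist for a single postal code; the entries are illustrative only.
STREETS_1151 = ["Kossuth utca", "Fő utca"]


def closest_street(name: str, whitelist: Sequence[str],
                   cutoff: float = 0.8) -> Optional[str]:
    """Return the whitelisted street most similar to `name`, or None if nothing is close."""
    matches = difflib.get_close_matches(name, whitelist, n=1, cutoff=cutoff)
    return matches[0] if matches else None


suggestion = closest_street("Kosut utca", STREETS_1151)
```

A high cutoff keeps the matcher conservative, which matters when two plausible streets exist in the same settlement, as in the Cegléd example.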

  11. Obsolete street name
    1106 Budapest, Vörös fény utca 3.

  12. Our take on this: directed acyclic graphs
    • Break the job down into small, manageable tasks.
    • Implement them as small, self-evident functions in Python.
    • Wire them together in a graph.
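
The three bullets above can be sketched as follows. This is not the project's implementation: the task names and the record layout are hypothetical, and the sketch uses `graphlib.TopologicalSorter` (Python 3.9 or later) to run each task after the tasks it depends on.

```python
from graphlib import TopologicalSorter


# Small, single-purpose tasks. Each takes a record (dict) and returns a new one.
def strip_whitespace(record):
    return {**record, "raw": " ".join(record["raw"].split())}


def split_postal_code(record):
    code, _, rest = record["raw"].partition(" ")
    return {**record, "postal_code": code, "rest": rest}


def check_postal_code(record):
    code = record["postal_code"]
    return {**record, "postal_code_ok": len(code) == 4 and code.isdigit()}


def split_settlement(record):
    settlement, _, street_part = record["rest"].partition(",")
    return {**record, "settlement": settlement.strip(),
            "street_part": street_part.strip()}


TASKS = {
    "strip_whitespace": strip_whitespace,
    "split_postal_code": split_postal_code,
    "check_postal_code": check_postal_code,
    "split_settlement": split_settlement,
}

# The graph: each task maps to the set of tasks that must run before it.
DEPENDS_ON = {
    "strip_whitespace": set(),
    "split_postal_code": {"strip_whitespace"},
    "check_postal_code": {"split_postal_code"},
    "split_settlement": {"split_postal_code"},
}


def run_pipeline(raw):
    """Run every task in an order that respects the dependency graph."""
    record = {"raw": raw}
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        record = TASKS[name](record)
    return record


result = run_pipeline("  1075   Budapest,  Károly körút 9. ")
```

Because `check_postal_code` and `split_settlement` depend only on `split_postal_code`, they may run in either order. Adding a task means writing one function and one entry in the graph.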

  13. Glue them with unit tests
    • Small functions are easy to understand and easy to test.
    • Complex stuff is complex, but we can still test it.

  14. Glue them with unit tests
    • Small functions are easy to understand and easy to test.
    • Complex stuff is complex, but we can still test it.
    • Unit tests guarantee output quality no matter how messy
    the job.
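
As an illustration of what a test for one small function might look like, here is a sketch using the standard `unittest` module. The function `normalize_whitespace` and its test cases are hypothetical and stand in for whatever small steps the real pipeline contains.

```python
import unittest


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


class NormalizeWhitespaceTest(unittest.TestCase):
    def test_collapses_runs_of_spaces(self):
        self.assertEqual(
            normalize_whitespace("1075   Budapest,  Károly körút 9."),
            "1075 Budapest, Károly körút 9.",
        )

    def test_trims_both_ends(self):
        self.assertEqual(normalize_whitespace("  Budapest \n"), "Budapest")

    def test_leaves_clean_input_unchanged(self):
        clean = "2700 Cegléd, Kossuth Lajos utca 5."
        self.assertEqual(normalize_whitespace(clean), clean)
```

Saved in a module, these tests can be run with `python -m unittest`. Each newly discovered kind of messy input can become one more test case.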

  15. Measure of goodness
    • We've defined an attribute called street_found.
    • We set street_found = True roughly if the
    (postal code, settlement, street) triplet is found in our
    whitelists.
    • By this measure, about six percent of the input data
    still misbehaves.
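
A rough sketch of the measure described above, assuming the whitelist is held as a set of (postal code, settlement, street) triplets. The slide says "roughly", so the real rule is likely more involved. The toy whitelist entries are taken from the example addresses in this deck, and `share_not_found` is a hypothetical helper.

```python
# Toy whitelist of (postal code, settlement, street) triplets.
WHITELIST = {
    ("1075", "Budapest", "Károly körút"),
    ("1055", "Budapest", "Kossuth Lajos tér"),
}


def street_found(postal_code, settlement, street, whitelist=WHITELIST):
    """True if the exact triplet appears in the whitelist."""
    return (postal_code, settlement, street) in whitelist


def share_not_found(records, whitelist=WHITELIST):
    """Fraction of records whose triplet is missing from the whitelist."""
    records = list(records)
    if not records:
        return 0.0
    misses = sum(1 for r in records if not street_found(*r, whitelist=whitelist))
    return misses / len(records)


sample = [
    ("1075", "Budapest", "Károly körút"),       # clean
    ("1052", "Budapest", "Kossuth Lajos tér"),  # wrong postal code
]
share = share_not_found(sample)
```

An exact-match lookup like this only counts an address as found after the cleaning steps have normalised it, so the share of misses measures what the pipeline could not repair.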

  16. Concluding remarks
    • Location data available as postal addresses can be used to
    learn about how people behave.
    • Processing these addresses is messy.
    • But unit testing helps a great deal with quality and
    consistency.

  17. Concluding remarks
    • Location data available as postal addresses can be used to
    learn about how people behave.
    • Processing these addresses is messy.
    • But unit testing helps a great deal with quality and
    consistency.
    • Stay tuned: we're going to open the code on GitHub.

  18. Appendix: Data sources
    • Company registry: Complex Kft.
    • Street name whitelists:

    Hungarian Post

    official list of election districts
