• Extensive data sets; one of them is Hungary's company registry from 1989 to 2011. • Headquarters, establishments, and branches. • One of many questions we can ask: Do you learn how to do business from your neighbours?
to understand and easy to test. • Complex stu is complex; but we can still test them. • Unit tests guarantee output quality no matter how messy the job.
• We set street_found = True roughly if the (postal code, settlement, street) triplet is found in our whitelists. • By this measure, about six percent of the input data does still misbehave.
be used to learn about how people behave. • Processing these addresses is messy. • But unit testing helps a great deal with quality and consistency. • Stay tuned: we're going to open the code on GitHub.