Indian Geographies
- In a nutshell, they’re a mess.
- The same place can appear as "Delhi" in one dataset and "Dilli" in another.
- The system needs to identify Delhi and Dilli as the same entity, i.e. Delhi is standardized to a single entity.
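To make the Delhi/Dilli point concrete, here is a minimal sketch of alias-based standardization. The `ALIASES` table and the `standardize` helper are hypothetical names made up for illustration, not part of the system described in these notes.

```python
# Hypothetical alias table: every known spelling maps to one canonical name.
ALIASES = {
    "delhi": "Delhi",
    "dilli": "Delhi",
}


def standardize(name):
    """Return the canonical entity name, or None if the spelling is unknown."""
    # Collapse stray whitespace and ignore case before the lookup.
    key = " ".join(name.split()).casefold()
    return ALIASES.get(key)


print(standardize("Dilli"))    # Delhi
print(standardize(" DELHI "))  # Delhi
print(standardize("Mumbai"))   # None (not in the table yet)
```

A plain lookup like this only covers spellings someone has already recorded, which is why unknown names still need another route.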
- Human-in-the-loop system.
- Why do this?
  - Entities are often misspelt, and some have multiple names. Letting a machine standardize them on its own is tricky, so human context is required.
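One way to picture the human-in-the-loop idea is a matcher that accepts only near-certain matches automatically and sends everything else to a reviewer. The sketch below is an assumption-laden illustration: the `CANONICAL` list, the `AUTO_ACCEPT` threshold, and the `route` function are invented here, and string similarity from Python's `difflib` stands in for whatever matching the real system uses.

```python
from difflib import SequenceMatcher

CANONICAL = ["Delhi", "Mumbai", "Chennai"]  # hypothetical canonical entities
AUTO_ACCEPT = 0.9                           # assumed confidence threshold


def similarity(a, b):
    """Case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def route(name):
    """Return ("auto", match) for confident matches, else ("review", suggestion)."""
    candidate = max(CANONICAL, key=lambda c: similarity(name, c))
    if similarity(name, candidate) >= AUTO_ACCEPT:
        return ("auto", candidate)
    # Not confident enough: a human decides, with the best guess as a hint.
    return ("review", candidate)


print(route("delhi"))  # exact match apart from case, accepted automatically
print(route("Dehli"))  # misspelling, goes to a reviewer with a suggestion
print(route("Dilli"))  # alternate name, also goes to a reviewer
```

An alternate name such as "Dilli" can score poorly on pure string similarity even though it refers to the same place, which is the kind of case where a reviewer's context matters.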
India.
- More than 3 million (2,000,000) entities in the system.
- Each entity can have multiple aliases or naming errors.
- Dealing with this needs robust infrastructure and efficient algorithms. This session focuses on the former, the infrastructure.
queue/job queue based on distributed message passing.
- Basically, it lets you run any Python code as a process separate from your current application/script.
- Basic architecture:
  - Producer (Python application/script)
  - Queue (RabbitMQ)
  - Consumer (Celery workers)
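The producer/queue/consumer architecture above can be sketched with the standard library alone. This is only an in-process stand-in under stated assumptions: `queue.Queue` plays the role of the RabbitMQ queue, a thread plays the role of a Celery worker (a real worker would be a separate process), and the `standardize` job is a made-up placeholder.

```python
import queue
import threading

task_queue = queue.Queue()  # stands in for the RabbitMQ queue
results = []                # filled by the worker in this sketch


def standardize(name):
    """Placeholder job: the work we want done outside the producer."""
    return name.strip().title()


def worker():
    """Consumer: pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = task_queue.get()
        if job is None:
            task_queue.task_done()
            break
        results.append(standardize(job))
        task_queue.task_done()


consumer = threading.Thread(target=worker)
consumer.start()

# Producer: enqueue work without running it inline.
for raw in [" delhi", "MUMBAI ", "chennai"]:
    task_queue.put(raw)
task_queue.put(None)  # tell the worker to stop

task_queue.join()     # wait until every queued item has been processed
consumer.join()
print(results)        # ['Delhi', 'Mumbai', 'Chennai']
```

With Celery itself, the producer side would typically be a call such as `some_task.delay(...)` on a decorated task, with the broker and workers running as separate processes; the structure is the same, only the transport differs.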
K8s cluster which had been set up during the previous session.
- Demo derived from: https://github.com/fabric8io/gitcontroller/tree/master/vendor/k8s.io/kubernetes/examples/celery-rabbitmq