Slide 1

Slide 1 text

Scaling human-aided data enrichment using Celery and Kubernetes
Google Cloud: Building for the Next Billion

Slide 2

Slide 2 text

Objective
- Understand the challenges that come up when processing data at scale.
- See how a robust infrastructure can help you deal with such challenges.
- Deploy a basic Celery Python application on Kubernetes.

Slide 3

Slide 3 text

Of dashboards and geographies
- Data-driven dashboards
- Nuances of Indian geographies
  - In a nutshell, they’re a mess.
  - Delhi in one dataset, and Dilli in the other.
  - The system needs to identify Delhi and Dilli as the same, i.e. Delhi as an entity is standardized.
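The Delhi/Dilli case can be sketched as a simple alias lookup. This is only an illustration, not the system described in the talk: the `ALIASES` table and the `standardize` function are hypothetical names, and a real system would hold far more variants per entity.

```python
from typing import Optional

# Hypothetical alias table: every known spelling variant (case-folded)
# points at one canonical entity name.
ALIASES = {
    "delhi": "Delhi",
    "dilli": "Delhi",
}


def standardize(raw_name: str) -> Optional[str]:
    """Return the canonical name for a raw spelling, or None if unknown."""
    # Collapse repeated whitespace and ignore case before the lookup.
    key = " ".join(raw_name.split()).casefold()
    return ALIASES.get(key)


if __name__ == "__main__":
    for raw in ["Delhi", " dilli ", "Mumbai"]:
        print(repr(raw), "->", standardize(raw))
```

A spelling that is missing from the table returns `None`, which is the point where a person would have to decide what the entity is.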

Slide 4

Slide 4 text

Geography - example of an entity type
An entity could be:
- Geography or Location
- Person
- Organization
- Money
Or basically, anything at all.

Slide 5

Slide 5 text

Entity standardization and enrichment
- In-house entity enrichment and standardization.
- Human-in-the-loop system.
- Why do this?
  - Entities are often misspelt. Sometimes they may have multiple names. But letting a machine do standardization on its own is tricky. Human context is required.
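One way to picture a human-in-the-loop flow is to let the machine accept only near-exact matches and send everything else to a person. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. The `CANONICAL` list, the `AUTO_ACCEPT` threshold, and the `route` function are assumptions made for illustration, not details from the talk.

```python
import difflib
from typing import List, Optional, Tuple

# Hypothetical list of canonical entity names.
CANONICAL = ["Delhi", "Mumbai", "Kolkata"]

# Assumed threshold: similarity at or above this is accepted automatically.
AUTO_ACCEPT = 0.9


def best_match(raw: str, candidates: List[str]) -> Tuple[str, float]:
    """Return the most similar candidate and its similarity in [0, 1]."""
    scored = [
        (difflib.SequenceMatcher(None, raw.casefold(), c.casefold()).ratio(), c)
        for c in candidates
    ]
    score, name = max(scored)
    return name, score


def route(raw: str, review_queue: list) -> Optional[str]:
    """Auto-accept confident matches; otherwise queue the case for a human."""
    name, score = best_match(raw, CANONICAL)
    if score >= AUTO_ACCEPT:
        return name
    # Low confidence: record the machine's suggestion for a reviewer.
    review_queue.append((raw, name, score))
    return None


if __name__ == "__main__":
    pending: list = []
    print(route("delhi", pending))  # exact match apart from case
    print(route("Dilli", pending))  # falls below the threshold
    print(pending)
```

String similarity alone scores "Dilli" against "Delhi" well below the threshold here, so the case lands in the review queue with "Delhi" as a suggestion. That is the kind of decision where human context is needed.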

Slide 6

Slide 6 text

Problems of scale
- More than 0.6 million villages in India.
- More than 3 million (2,000,000) entities in the system.
- For each entity, you could have multiple aliases or naming errors.
- To deal with this, we need robust infrastructure and efficient algorithms. We will focus on the former in this session.

Slide 7

Slide 7 text

Celery to the rescue
- Celery is an asynchronous task queue/job queue based on distributed message passing.
- Basically, it lets you run any Python code as a process separate from your current application/script.
- Basic architecture
  - Producer (Python application/script)
  - Queue (RabbitMQ)
  - Consumer (Celery workers)
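The producer / queue / consumer shape can be sketched with the Python standard library alone. This is not Celery: `queue.Queue` stands in for RabbitMQ, threads in one process stand in for Celery worker processes, and `standardize_entity` and `run_demo` are hypothetical names. In a real Celery application the task function would be decorated with `@app.task` and enqueued with `.delay()`.

```python
import queue
import threading
from typing import Dict, List


def standardize_entity(name: str) -> str:
    """Hypothetical task body: tidy up a raw entity name."""
    return name.strip().title()


def worker(tasks: queue.Queue, results: Dict[int, str], lock: threading.Lock) -> None:
    """Consumer: pull tasks off the queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            tasks.task_done()
            break
        task_id, name = item
        output = standardize_entity(name)
        with lock:
            results[task_id] = output
        tasks.task_done()


def run_demo(names: List[str], n_workers: int = 2) -> List[str]:
    """Producer: enqueue one task per name and wait for the consumers."""
    tasks: queue.Queue = queue.Queue()  # stands in for the message broker
    results: Dict[int, str] = {}
    lock = threading.Lock()

    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()

    for task_id, name in enumerate(names):
        tasks.put((task_id, name))
    for _ in threads:
        tasks.put(None)  # one stop signal per worker

    tasks.join()
    for t in threads:
        t.join()
    return [results[i] for i in range(len(names))]


if __name__ == "__main__":
    print(run_demo([" delhi ", "MUMBAI"]))
```

The producer never calls the task directly. It only puts messages on the queue, which is what allows workers to be scaled independently of the application that submits the work.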

Slide 8

Slide 8 text

But where do we run this setup?! No prizes for guessing this one.

Slide 9

Slide 9 text

Enter Kubernetes
- We will deploy our application on the K8s cluster which was set up during the previous session.
- Demo derived from: https://github.com/fabric8io/gitcontroller/tree/master/vendor/k8s.io/kubernetes/examples/celery-rabbitmq

Slide 10

Slide 10 text

Setup
- Set up RabbitMQ

Slide 11

Slide 11 text

Setup (contd.)
- Set up RabbitMQ
- Set up Celery

Slide 12

Slide 12 text

Setup (contd.)
- Set up RabbitMQ
- Set up Celery
- Set up Flower

Slide 13

Slide 13 text

Alternate architectures

Slide 14

Slide 14 text

The future: services on top of entities
- Entity recognition: https://cloud.google.com/natural-language/
- Entity graph
- Demo video

Slide 15

Slide 15 text

That’s all folks!