Scaling human-aided data enrichment - GDG meetup at SocialCops (March 2018)

Akash Tandon

March 25, 2018

Transcript

  1. Objective
     - Understand the challenges that come up when processing data at scale.
     - How a robust infrastructure can help you deal with such challenges.
     - Deploy a basic Celery Python application on Kubernetes.
  2. Of dashboards and geographies
     - Data-driven dashboards
     - Nuances of Indian Geographies
     - In a nutshell, they’re a mess.
     - Delhi in one dataset, and Dilli in the other.
     - Need to make sure that the system identifies Delhi and Dilli as the same, i.e. Delhi as an entity is standardized.
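The Delhi/Dilli case can be sketched as a lookup from known spellings to one canonical name. This is only a minimal illustration: the `ALIASES` table and the `normalize` and `standardize` helpers are hypothetical names, not something shown in the talk.

```python
from typing import Optional

# Hypothetical alias table: every known spelling maps to one canonical entity name.
ALIASES = {
    "delhi": "Delhi",
    "dilli": "Delhi",
    "dehli": "Delhi",
}


def normalize(name: str) -> str:
    """Lower-case the name and collapse whitespace so lookups are consistent."""
    return " ".join(name.casefold().split())


def standardize(name: str) -> Optional[str]:
    """Return the canonical entity name, or None if the spelling is unknown."""
    return ALIASES.get(normalize(name))


if __name__ == "__main__":
    for raw in ["Delhi", "  DILLI ", "Mumbai"]:
        print(repr(raw), "->", standardize(raw))
```

An unknown spelling returns `None` rather than a guess, which leaves room for a person to decide what it should map to.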
  3. Geography - example of an entity type
     An entity could be:
     - Geography or Location
     - Person
     - Organization
     - Money
     Or basically, anything at all.
  4. Entity standardization and enrichment
     - In-house entity enrichment and standardization.
     - Human-in-the-loop system.
     - Why do this?
     - Entities are often misspelt. Sometimes they may have multiple names. But letting a machine do standardization on its own is tricky. Human context is required.
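One way to picture a human-in-the-loop step is a matcher that accepts only near-certain matches on its own and sends everything else to a reviewer. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the talk does not say which algorithm is used, and the `CANONICAL` list, the `AUTO_ACCEPT` threshold and the `route` function are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical list of already-standardized entity names.
CANONICAL = ["Delhi", "Mumbai", "Kolkata"]

# Assumed cut-off: only near-exact matches are accepted without a human.
AUTO_ACCEPT = 0.9


def best_match(name: str):
    """Return (similarity, canonical_name) for the closest canonical entity."""
    scored = [
        (SequenceMatcher(None, name.casefold(), c.casefold()).ratio(), c)
        for c in CANONICAL
    ]
    return max(scored)


def route(name: str):
    """Accept confident matches automatically; queue the rest for human review."""
    score, candidate = best_match(name)
    if score >= AUTO_ACCEPT:
        return ("auto", candidate)
    return ("review", candidate)


if __name__ == "__main__":
    for raw in ["delhi", "Dilli", "Dehli"]:
        print(raw, "->", route(raw))
```

Here "Dilli" is not similar enough to "Delhi" to be accepted automatically, so it is passed to a reviewer with "Delhi" as the suggested match. That is the kind of case where human context is needed.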
  5. Problems of scale
     - More than 0.6 million villages in India.
     - More than 3 million (2,000,000) entities in the system.
     - For each entity, you could have multiple aliases or naming errors.
     - To deal with this, we need robust infrastructure and efficient algorithms. We will focus on the former in this session.
  6. Celery to the rescue
     - Celery is an asynchronous task queue/job queue based on distributed message passing.
     - Basically, it lets you run any Python code as a process separate from your current application/script.
     - Basic architecture
       - Producer (Python application/script)
       - Queue (RabbitMQ)
       - Consumer (Celery workers)
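The producer, queue and consumer roles can be illustrated without any infrastructure. The sketch below is not Celery code: it uses the standard library's `queue.Queue` and threads as an in-process stand-in, where in the real setup RabbitMQ would hold the messages and Celery workers would run as separate processes. The `clean_name`, `worker` and `run` names are hypothetical.

```python
import queue
import threading


def clean_name(name: str) -> str:
    """The work item: a trivial stand-in for an entity-standardization task."""
    return name.strip().title()


def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Consumer: pull names off the queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:
            break
        results.put((item, clean_name(item)))


def run(names, n_workers: int = 2) -> dict:
    """Producer: enqueue the names, start the consumers, collect the results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for name in names:
        tasks.put(name)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker so each one stops
    for t in threads:
        t.join()
    out = {}
    while not results.empty():
        raw, cleaned = results.get()
        out[raw] = cleaned
    return out


if __name__ == "__main__":
    print(run([" delhi ", "MUMBAI", "kolkata"]))
```

The producer never calls `clean_name` directly. It only puts work on the queue, so more consumers can be added without changing the producer.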
  7. Enter Kubernetes
     - We will deploy our application on the K8s cluster that was set up during the previous session.
     - Demo derived from: https://github.com/fabric8io/gitcontroller/tree/master/vendor/k8s.io/kubernetes/examples/celery-rabbitmq
  8. The future: services on top of entities
     - Entity recognition: https://cloud.google.com/natural-language/
     - Entity graph
     - Demo video