Making sense of messy data to track disease outbreaks in India - Fifth Elephant 2018

1 www.socialcops.com Akash Tandon Data Engineering @ SocialCops

2 Safest route home? National health budget? Better governance? Best local school? Drought-prone villages? Disaster relief?

3 Let’s have a look at a specific use case.
- Analog to digital conversion: Data is often still collected on paper forms before moving to a digital format, resulting in errors.
- Unstructured data: Data is often stored in dense PDFs or text files, rather than structured formats like CSVs.
- Unreliable data: Data across different systems and data sets is often contradictory, inaccurate and outdated.
- Local language data: Data is often recorded in different languages, making it harder to match and analyze.
- Scattered data: Rather than living in a central repository, data is scattered across disconnected, siloed systems.
- Dirty data: Standard geographic conventions or metadata standards are often not followed.

4
- The Ministry of Health and Family Welfare (MoHFW) in India runs the IDSP scheme to identify disease outbreaks at the sub-district and village level across India. Under this scheme, it releases weekly outbreak data as a PDF document.
- In our bid to derive useful insights from the data, we set up a data pipeline which, among other things:
  - (e)xtracted the PDFs from the IDSP website
  - (t)ransformed them into CSVs
  - (l)oaded them into a data store.
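
The transform and load steps above can be sketched with the Python standard library. This is a minimal illustration, not the pipeline from the talk: the column names, the placeholder rows, the `outbreaks` table and the choice of SQLite as the data store are all assumptions made for the example, and the rows stand in for a table that has already been parsed out of one weekly PDF.

```python
import csv
import io
import sqlite3

# Hypothetical column layout; the real IDSP tables have their own columns.
FIELDS = ["state", "district", "disease", "cases"]

# Placeholder rows standing in for one parsed PDF table (not real outbreak data).
SAMPLE_ROWS = [
    {"state": "State A", "district": "District X", "disease": "Disease 1", "cases": " 5"},
    {"state": "State B", "district": "District Y ", "disease": "Disease 2", "cases": "7 "},
]


def transform(rows):
    """Serialize parsed rows to CSV text, trimming stray whitespace in each cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({field: row[field].strip() for field in FIELDS})
    return buffer.getvalue()


def load(csv_text, connection):
    """Insert the CSV rows into an `outbreaks` table and return the row count."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS outbreaks "
        "(state TEXT, district TEXT, disease TEXT, cases INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    records = [
        (r["state"], r["district"], r["disease"], int(r["cases"])) for r in reader
    ]
    connection.executemany("INSERT INTO outbreaks VALUES (?, ?, ?, ?)", records)
    connection.commit()
    return len(records)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(load(transform(SAMPLE_ROWS), conn), "rows loaded")
```

Keeping the CSV as an explicit intermediate mirrors the (t) and (l) split on the slide: the CSV can be inspected or re-loaded without re-parsing the PDF.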

5 PDF reports released weekly, tracking disease outbreaks across India

6 A data pipeline is an automated set of actions that extracts data from various sources and makes it available for use.

7
- DAG (Directed Acyclic Graph) creation on Airflow
- Data ingestion: getting PDFs from the IDSP website
- Extracting data from PDFs
- Data wrangling using Python and R
- Geography identification
- Insights and alerts
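
To make the DAG idea concrete, the stages listed above can be written as a dependency graph and ordered with a topological sort. This sketch deliberately uses Python's standard-library `graphlib` rather than Airflow's own API, so it only illustrates the dependency-ordering concept that a scheduler like Airflow relies on. The task names are hypothetical labels derived from the slide, and the strictly linear dependencies are an assumption.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# Task names are illustrative, not taken from the actual pipeline code.
PIPELINE = {
    "ingest_pdfs": set(),
    "extract_tables": {"ingest_pdfs"},
    "wrangle_data": {"extract_tables"},
    "identify_geographies": {"wrangle_data"},
    "insights_and_alerts": {"identify_geographies"},
}


def execution_order(dag):
    """Return tasks in an order that respects every dependency.

    Raises graphlib.CycleError if the graph contains a cycle, which is
    exactly the property a *directed acyclic* graph rules out.
    """
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for position, task in enumerate(execution_order(PIPELINE), start=1):
        print(position, task)
    try:
        execution_order({"a": {"b"}, "b": {"a"}})
    except CycleError:
        print("cyclic graphs are rejected")
```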

8
- Problem: We needed a workflow management tool that would execute tasks in our pipeline on a regular basis and allow for:
  - Easy monitoring and logging
  - Effective task scheduling and dependency management
  - Intuitive UI
  - Programmatic control
- Solution: Apache Airflow satisfied most of our needs out of the box.
- Learnings:
  - When to move from cron to a workflow management system (WMS)?
  - Why Airflow over its counterparts?

10
- Problem: A new PDF document is released almost every week (with some lag) on the IDSP website. We needed to scrape new documents every week while keeping track of the documents that had already been fetched into our system.
- Solution: Store state (the last scraped PDF) at a central location, which our scraper consults before its next run. For the IDSP use case, run-of-the-mill web scraping code using lxml/scrapy would do. In production, we use a custom data ingestion solution which acts as the central gateway for data into our system.
- Learnings:
  - Advantages of using a central data ingestion system.
  - Error handling during data ingestion.
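
The state-tracking idea above can be sketched as follows. This is a simplified variation under stated assumptions: instead of remembering only the last scraped PDF, it stores the set of every link already fetched, which tolerates reports that appear late or out of order. The JSON state file and the function names are hypothetical, and the list of links is assumed to come from a separate scraping step (for example lxml or scrapy, as mentioned on the slide).

```python
import json
from pathlib import Path


def load_seen(state_path):
    """Read the set of already-fetched links; empty on the very first run."""
    path = Path(state_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def select_new(links, seen):
    """Return links not fetched before, keeping page order and dropping repeats."""
    new_links = []
    for link in links:
        if link not in seen and link not in new_links:
            new_links.append(link)
    return new_links


def save_seen(state_path, seen):
    """Persist the fetched links so the next run can skip them."""
    Path(state_path).write_text(json.dumps(sorted(seen)), encoding="utf-8")


def run_once(links_on_page, state_path):
    """One scheduled run: find new links, record them, and return them."""
    seen = load_seen(state_path)
    new_links = select_new(links_on_page, seen)
    # Downloading would happen here; state is saved only afterwards, so a
    # run that fails before this point simply retries the same links.
    save_seen(state_path, seen | set(new_links))
    return new_links
```

Writing the state only after the downloads succeed is one simple way to handle the ingestion errors mentioned under the learnings: a crashed run leaves the state untouched and the next run picks up the same documents.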

11
- Problem: IDSP provides tabular data as PDFs, and it’s very difficult to get 100% parsing accuracy on all types of PDFs with a single tool.
- Solution: An in-house library, which uses image recognition along with a few heuristics (to be described) to solve the parsing problem.
- Learnings:
  - PDFs are the worst format in which you can provide tabular data for consumption.
  - As a startup with limited resources, what made us decide to create a custom solution for PDF parsing?

12
- Learnings:
  - Automate as much as you can. Investing in code templates or UI solutions for simple repetitive tasks is a good idea.
  - However, make sure that your pipeline is flexible enough to use custom code whenever necessary. This is another pro of working on top of a system such as Airflow.

13
- Problem: Geographical entities (villages, sub-districts, districts, etc.) in the Indian context are often misspelt or have multiple names (aliases). The IDSP data is no exception. These geographies need to be identified and standardized to derive useful insights or to match the data with other data sources (such as the Census).
- Solution: An entity identification and standardization system designed to tackle the intricacies and quirks of Indian geography.
- Learnings:
  - There is no out-of-the-box NLP solution for working with Indian geographies.
  - Often, a semi-automated system (human-in-the-loop), which offers a higher degree of reliability, is much better than a fully automated one.
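
A human-in-the-loop matcher of the kind described above might look like the sketch below. It is not the system from the talk: it assumes a small list of canonical names, a hand-maintained alias table, and plain string similarity from Python's `difflib`, with an arbitrary threshold of 0.9 deciding which matches are accepted automatically and which are flagged for manual review.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def match_geography(raw_name, canonical_names, aliases=None, auto_threshold=0.9):
    """Return (best_canonical_name, score, needs_review).

    Known aliases resolve directly. Otherwise the closest canonical name by
    string similarity is proposed, and anything scoring below
    `auto_threshold` is flagged for a human to confirm.
    """
    aliases = aliases or {}
    cleaned = normalize(raw_name)
    if cleaned in aliases:
        return aliases[cleaned], 1.0, False

    best_name, best_score = None, 0.0
    for candidate in canonical_names:
        score = SequenceMatcher(None, cleaned, normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name, best_score, best_score < auto_threshold


if __name__ == "__main__":
    canonical = ["Bengaluru", "Mysuru", "Belagavi"]
    known_aliases = {"bangalore": "Bengaluru"}
    for raw in ["  Bangalore ", "mysuru", "Mysore"]:
        print(raw, "->", match_geography(raw, canonical, known_aliases))
```

Once a reviewer confirms a flagged match, adding it to the alias table means the same spelling resolves automatically on the next run, so the share of manual work shrinks over time.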

15
- Reproducible notebook
- Code snippets
