Pakistan Census Data

Case study of collecting Pakistan census data for robust distribution and better availability. This deck discusses the problems faced while accessing public data in general, using this particular case.

Jabran Rafique

August 04, 2014

Transcript

  1. Objectives
     • Data availability
     • Open data
     • Transparency
     • Robust access
     • Widely accessible formats
  2. Best Source
     • Population Census Organization (census.gov.pk)
       Detailed data exists but is not available in reusable, widely accessible formats. In fact, the website itself is unavailable most of the time.
     • World Bank (data.worldbank.org)
     • ReliefWeb (reliefweb.int)
     • USAID (usaid.gov)
       Data is available in several accessible formats, but it is brief, limited and directed.
  3. Problems
     • No downloadable data format available
     • Website inaccessible for most of the day
     • No semantic structure for the available data
     • No easy way to access the data programmatically
  4. Collection Methodology
     • Start with the 1998 census data. [1]
     • Data is available for each district.
     • Each district's data is accessible [2] as an HTML page.
     • Patience!
     [1] Who am I kidding?! That is the only census data available.
     [2] Only when the website is available and accessible.
  5. Scrape, Convert, Save. Easy Peasy!
     PHP library: Simple HTML DOM
     Project website: http://sourceforge.net/projects/simplehtmldom/
  6. Easier said than done!
     1. Server unresponsive to script calls.
     2. Server unavailable after the script comes across an error.
     3. Ridiculous latency. (The patience methodology applies here.)
     4. Non-semantic data, e.g. some districts have extra information columns; as a result, the script returns an error and we are back to #2.
     5. The HTML files were literally saved from Microsoft Office!!
  7. Problems:
     • No looping through data files (server timeout).
     • Even with a delay, if an error occurred, it meant a long server timeout.
     Solution:
     • Manually run the script for each file, one by one.
     The process goes as follows. Scraping, finally: PHP file_get_contents → Simple HTML DOM → JSON → Save file
  8. Further Steps: Making Data Useful
     1. Return to the original objectives of this whole process.
     2. Restructure all data into a standard format.
     3. Acquire missing data.
     4. Make it all available for public use.
     Get it, share it or contribute to it at git.io/pk-census
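
The scrape, convert, save pipeline from slides 5 to 7 can be sketched roughly as below. This is a minimal illustration, not the script used for the deck: it is written in Python with only the standard library, standing in for the PHP file_get_contents and Simple HTML DOM steps, and it assumes the district page has already been saved locally, so no fetching is shown. The names `TableParser`, `table_to_records` and `save_json`, and the `_extra` key, are hypothetical. The treatment of rows with more cells than the header is one possible way to cope with the extra information columns mentioned in slide 6.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside a <tr>
        self._cell = None  # text pieces of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace and line breaks inside a cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_records(html):
    """Convert table rows to dicts keyed by the first (header) row.

    Assumption: cells beyond the header width are kept under the
    hypothetical "_extra" key instead of raising an error.
    """
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    records = []
    for row in body:
        record = dict(zip(header, row))
        extra = row[len(header):]
        if extra:
            record["_extra"] = extra
        records.append(record)
    return records


def save_json(records, path):
    """Write the records to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)
```

Running it one file at a time, as the deck describes, would then amount to reading a saved page, calling `table_to_records` on its contents and passing the result to `save_json` with an output path of your choosing. Keeping the unexpected cells rather than discarding them leaves room for the later restructuring step to decide what they mean.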