Transforming India's Budgets into Open Linked Data

The Fifth Elephant is a conference on data science, engineering, and machine learning. Budget documents across the various tiers of government in India contain detailed information on the allocations made and resources raised in a financial year. Unfortunately, these documents are published as unstructured PDFs, which makes it difficult for researchers, economists, and the general public to analyze and use this crucial data. This session will delve into our journey of developing OpenBudgetsIndia, a collective initiative to make India’s budgets open, usable, and easy to comprehend.

Gaurav Godhwani

July 27, 2017

Transcript

  1. TRANSFORMING INDIA’S BUDGETS INTO OPEN LINKED DATA Gaurav Godhwani (@gggodhwani)

    Open Budgets India - an open data platform on government budgets in India Centre for Budget and Governance Accountability (CBGA) DataKind Bangalore
  2. Government budgets are globally considered as “Moral Documents”*, reflecting priorities

    and values of the state and its people. *Source: https://www.vox.com/policy-and-politics/2017/3/16/14943748/trump-budget-outline-moral
  3. But Budget Documents are Hard To Access & Difficult To

    Comprehend
  4. Major issues with India’s Budgets • Unstructured PDF documents •

    Limited availability of Budgets online • Inconsistent Formats • No Metadata • Inconsistent and incomplete Budget Codes aka Unique IDs
  5. None
  6. Public Accounts Open Budget Data Fiscal Transparency Trust in Governments

  7. • Publicly Accessible (online) • Reusable Format (data points not

    just analysis) • Without any restriction (free/legally open) • Machine-readable (editable) ‘OPEN’ BUDGET DATA
  8. Tim Berners-Lee’s 5-Stars of Open Data *Source: http://ec.europa.eu/newsroom/itemdetail.cfm?item_id=27191&newsletter=126

  9. An Open Source Community Driven Initiative Image Source: https://opensource.org/

  10. Open Data Pipeline Scrape Parse Transform Publish Analyse

  11. Open Data Pipeline Scrape Parse Transform Publish Analyse

  12. Scrape 150+ Budget Source Websites

  13. Scrape Scrape Utils Union Budget Plugin State1 Budget Plugin StateX

    Budget Plugin Municipal1 Budget Plugin MunicipalX Budget Plugin ... ...
  14. Scrape Image Source: Marcin Floryan, https://commons.wikimedia.org/wiki/File:XPath_example.svg
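The XPath idea on this slide can be sketched as follows. This is only an illustration, not the project's scraper: the sample markup, the base URL, and the `find_pdf_links` helper are hypothetical, and the standard-library `ElementTree` used here supports only a subset of XPath and needs well-formed markup, so a real budget site would likely need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample page; real budget sites will differ.
SAMPLE_PAGE = """
<html><body>
  <ul>
    <li><a href="docs/receipts.pdf">Receipts</a></li>
    <li><a href="docs/expenditure.pdf">Expenditure</a></li>
    <li><a href="about.html">About</a></li>
  </ul>
</body></html>
"""


def find_pdf_links(markup, base_url):
    """Return absolute URLs of all <a> elements whose href ends in .pdf."""
    root = ET.fromstring(markup)
    links = []
    # ".//a[@href]" is within the XPath subset that ElementTree supports.
    for anchor in root.findall(".//a[@href]"):
        href = anchor.get("href")
        if href.lower().endswith(".pdf"):
            links.append(urljoin(base_url, href))
    return links


pdf_links = find_pdf_links(SAMPLE_PAGE, "https://example.org/budget/")
for link in pdf_links:
    print(link)
```

A per-source plugin, as on the previous slide, could then differ mainly in the XPath expression and base URL it supplies.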

  15. Open Data Pipeline Scrape Parse Transform Publish Analyse

  16. Parse 150+ Budget Formats

  17. PDF to CSV Union Budget Plugin State1 Budget Plugin StateX

    Budget Plugin ... Parse
  18. Parse Step #1: Loop over each page in the PDF

    & convert them into images …..
  19. Parse Step #2: Get all major vertical & horizontal lines

    in the image using Hough Transform Image Source: Amos Storkey http://homepages.inf.ed.ac.uk/amos/hough.html
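To make the voting idea behind the Hough Transform concrete, here is a small pure-Python sketch. It is an assumption-laden toy, not the pipeline's code: it votes only at 0 and 90 degrees (enough for table rules), takes a hand-built set of dark pixels instead of a rendered PDF page, and the `hough_lines` name and `min_votes` threshold are hypothetical. A production pipeline would more likely rely on an image-processing library.

```python
import math
from collections import Counter


def hough_lines(pixels, min_votes, angles_deg=(0, 90)):
    """Vote in (rho, theta) space and return lines with at least min_votes.

    pixels: iterable of (x, y) coordinates of dark pixels.
    theta = 0 corresponds to vertical lines (x = rho),
    theta = 90 to horizontal lines (y = rho).
    """
    votes = Counter()
    for x, y in pixels:
        for angle in angles_deg:
            theta = math.radians(angle)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(rho, angle)] += 1
    return sorted(line for line, count in votes.items() if count >= min_votes)


# Toy "image": the outline of a 10 x 6 box plus one stray pixel of noise.
pixels = set()
for x in range(10):
    pixels.add((x, 0))
    pixels.add((x, 5))
for y in range(6):
    pixels.add((0, y))
    pixels.add((9, y))
pixels.add((4, 3))

lines = hough_lines(pixels, min_votes=6)
print(lines)  # each entry is (rho, angle in degrees)
```

Only the four box edges collect enough votes; the stray pixel and the box corners seen from the "wrong" angle fall below the threshold.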
  20. Parse Step #3: Get max table bounds + extend vertical

    lines to touch the table bounds
  21. Parse Step #4: Compute coordinates for tables & columns for

    each page aka Table Attributes table_bounds = { "top": …, "left": …., "bottom": ..., "right": … } column_coordinates = [c1, c2, c3, ... , cN]
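Steps #3 and #4 can be sketched as below, under stated assumptions: the detected lines are already reduced to x positions (vertical) and y positions (horizontal), the origin is the top-left corner of the page image, and the outermost vertical lines are treated as table borders rather than column separators. The `table_attributes` helper and the sample numbers are hypothetical.

```python
def table_attributes(vertical_xs, horizontal_ys):
    """Derive table bounds and column separators from detected line positions."""
    xs = sorted(set(vertical_xs))
    ys = sorted(set(horizontal_ys))
    if len(xs) < 2 or len(ys) < 2:
        raise ValueError("need at least two vertical and two horizontal lines")
    table_bounds = {
        "top": ys[0],      # smallest y: image origin assumed at top-left
        "left": xs[0],
        "bottom": ys[-1],
        "right": xs[-1],
    }
    # Interior vertical lines are taken as column boundaries (an assumption).
    column_coordinates = xs[1:-1]
    return table_bounds, column_coordinates


table_bounds, column_coordinates = table_attributes(
    vertical_xs=[40, 300, 120, 210, 40],
    horizontal_ys=[500, 60, 500],
)
print(table_bounds)
print(column_coordinates)
```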
  22. Parse Step #5: Table attributes are passed as input to

    Tabula https://github.com/tabulapdf/tabula { Table Attributes }
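One way the table attributes might be handed to Tabula is through the tabula-java command line, which accepts `--area`, `--columns`, `--pages` and `--format` options. The sketch below only builds the argument list and does not run it; the jar path, the file name, the `tabula_command` helper, and the choice of the command-line route (rather than a wrapper library) are assumptions. Coordinates taken from a page image may also need scaling to the units Tabula expects.

```python
def tabula_command(pdf_path, page, table_bounds, column_coordinates,
                   jar_path="tabula.jar"):
    """Build a tabula-java argument list from the computed table attributes."""
    area = ",".join(
        str(table_bounds[key]) for key in ("top", "left", "bottom", "right")
    )
    columns = ",".join(str(x) for x in column_coordinates)
    return [
        "java", "-jar", jar_path,   # jar_path is a placeholder location
        "--pages", str(page),
        "--area", area,             # top,left,bottom,right
        "--columns", columns,       # x positions of column boundaries
        "--format", "CSV",
        pdf_path,
    ]


command = tabula_command(
    "budget.pdf",  # hypothetical file name
    page=3,
    table_bounds={"top": 60, "left": 40, "bottom": 500, "right": 300},
    column_coordinates=[120, 210],
)
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` on a machine where Java and the Tabula jar are available.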
  23. Parse Step #6: Custom plugins to modify table data wherever

    necessary • Fixing Header Values • Merging & Splitting of Rows & Columns • Filtering out non-Unicode (UTF) characters • Other Data Sanity Checks
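A cleanup plugin of this kind might look like the following sketch. The rules here are assumptions chosen for illustration (dropping control, unassigned and private-use characters plus the U+FFFD replacement mark, then collapsing whitespace); the project's actual plugins and the `clean_cell` and `clean_table` names are not taken from the source.

```python
import unicodedata

# Unicode general categories treated as extraction debris in this sketch.
DEBRIS_CATEGORIES = {"Cc", "Cn", "Co", "Cs"}


def clean_cell(text):
    """Drop debris characters from one cell and collapse runs of whitespace."""
    kept = []
    for ch in text:
        if ch == "\ufffd":  # replacement character left by a failed decode
            continue
        if unicodedata.category(ch) in DEBRIS_CATEGORIES and ch not in "\n\t":
            continue
        kept.append(ch)
    return " ".join("".join(kept).split())


def clean_table(rows):
    """Apply clean_cell to every cell of a table given as a list of rows."""
    return [[clean_cell(cell) for cell in row] for row in rows]


raw = [["Budget\x00 Estimates\ufffd\n2017-18", "  Revised   Estimates "]]
cleaned = clean_table(raw)
print(cleaned)
```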
  24. Parse Step #7: Merge tables extracted from each page into

    one table CSV
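Merging the per-page tables can be sketched with the standard `csv` module. The assumptions are that every page yields CSV text with the same columns and that a header row repeated at the top of later pages should be dropped; the `merge_page_tables` helper and the sample rows are hypothetical.

```python
import csv
import io


def merge_page_tables(page_csv_texts):
    """Concatenate per-page CSV tables, keeping the header row only once."""
    merged = []
    header = None
    for text in page_csv_texts:
        rows = [
            row for row in csv.reader(io.StringIO(text))
            if any(cell.strip() for cell in row)  # skip blank rows
        ]
        if not rows:
            continue
        if header is None:
            header = rows[0]
            merged.append(header)
            rows = rows[1:]
        elif rows[0] == header:
            rows = rows[1:]  # header repeated on a later page
        merged.extend(rows)
    return merged


page_1 = "Head,Amount\nSalaries,120\n"
page_2 = "Head,Amount\nOffice Expenses,30\n\n"
merged = merge_page_tables([page_1, page_2])

output = io.StringIO()
csv.writer(output, lineterminator="\n").writerows(merged)
print(output.getvalue())
```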
  25. Open Data Pipeline Scrape Parse Transform Publish Analyse

  26. Transform State Budgets follow 6-tier classification to record their Receipts

    and Expenditure Detailed Head: Like 'Salaries', 'Office Expenses', Object Object Head: Covers Objects and Sub-schemes
  27. Transform But unfortunately these budget heads are arranged hierarchically in

    the PDF, and the same structure carries over into the CSV, making the CSVs slightly difficult to analyze.
  28. Transform Write custom transform scripts to flatten the CSVs &

    create consumable Budget Heads
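A flattening script could work roughly like this sketch. It assumes each parsed row has already been reduced to a `(level, name, amount)` triple, with `amount` set to `None` on heading rows; that input shape, the `flatten` helper, the `" > "` separator, and the sample figures are all hypothetical.

```python
def flatten(rows):
    """Turn hierarchical rows into flat records with a full budget-head path.

    rows: iterable of (level, name, amount); level 0 is the top tier and
    amount is None for rows that only introduce a heading.
    """
    path = []
    flat = []
    for level, name, amount in rows:
        del path[level:]   # leave any deeper or sibling heading behind
        path.append(name)
        if amount is not None:
            flat.append({"budget_head": " > ".join(path), "amount": amount})
    return flat


# Made-up rows for illustration only.
rows = [
    (0, "Medical and Public Health", None),
    (1, "Urban Health Services", None),
    (2, "Salaries", 120),
    (2, "Office Expenses", 30),
    (1, "Rural Health Services", None),
    (2, "Salaries", 80),
]
flat = flatten(rows)
for record in flat:
    print(record)
```

Each output record carries its full path, so a row no longer depends on the headings printed above it.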
  29. Open Data Pipeline Scrape Parse Transform Publish Analyse

  30. Publish https://openbudgetsindia.org/dataset?q=&tags=union+budget+2017

  31. Publish Architecture Source: http://docs.ckan.org/en/latest/contributing/architecture.html

  32. Publish THEMES VISUALIZATIONS SITEMAPS OTHERS Python Scripts for Pylons Jinja

    Templates JS, CSS, Image Files Add-on Libraries CKAN Plugin Architecture
  33. Publish

  34. https://openbudgetsindia.org/api/action/datastore_search?resource_id=38e553a0-4dd9-46f5-8d62-4938e1f7df3d Publish APIs
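The endpoint on this slide is CKAN's `datastore_search` action. A minimal sketch of querying it from Python follows; the `datastore_search_url` and `fetch_records` helpers are hypothetical names, `limit` and `q` are standard `datastore_search` parameters, and whether this particular resource is still served is not something the sketch checks. Only the URL is built here; the network call is left to the reader.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://openbudgetsindia.org/api/action/datastore_search"
RESOURCE_ID = "38e553a0-4dd9-46f5-8d62-4938e1f7df3d"  # from the slide


def datastore_search_url(resource_id, limit=5, query=None):
    """Build a datastore_search URL for one resource."""
    params = {"resource_id": resource_id, "limit": limit}
    if query:
        params["q"] = query  # full-text search across the resource
    return BASE_URL + "?" + urlencode(params)


def fetch_records(url):
    """Fetch the URL and return the list of records from the CKAN response."""
    with urlopen(url) as response:
        payload = json.load(response)
    return payload["result"]["records"]


url = datastore_search_url(RESOURCE_ID)
print(url)
# records = fetch_records(url)  # uncomment to perform the request
```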

  35. Publish Karnataka Budget 2017-18 Karnataka Budget 2016-17 Karnataka Budget 2015-16

    Karnataka Budget 2014-15 Sikkim Budget 2015-16 Sikkim Budget 2017-18 Sikkim Budget 2016-17 Sikkim Budget 2014-15 OPEN LINKED DATA 2210-01-110 : Urban Health Services Allopathy Hospitals & Dispensaries
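The linking idea, where one budget code such as 2210-01-110 ties together rows from different states and years, can be sketched as below. The reading of the three code segments as Major Head, Sub-Major Head and Minor Head is an assumption about the code layout, the helper names are hypothetical, and the amounts are made-up placeholders, not real budget figures.

```python
from collections import defaultdict


def parse_budget_code(code):
    """Split a code like '2210-01-110' into its assumed tiers."""
    major, sub_major, minor = code.split("-")
    return {"major_head": major, "sub_major_head": sub_major, "minor_head": minor}


def link_by_code(datasets):
    """Index amounts by budget code across datasets.

    datasets: {dataset_name: [{"code": ..., "amount": ...}, ...]}
    returns:  {code: {dataset_name: amount}}
    """
    linked = defaultdict(dict)
    for name, rows in datasets.items():
        for row in rows:
            linked[row["code"]][name] = row["amount"]
    return dict(linked)


# Placeholder amounts for illustration only.
datasets = {
    "Karnataka Budget 2017-18": [{"code": "2210-01-110", "amount": 10}],
    "Karnataka Budget 2016-17": [{"code": "2210-01-110", "amount": 9}],
    "Sikkim Budget 2017-18": [{"code": "2210-01-110", "amount": 2}],
}
linked = link_by_code(datasets)
print(parse_budget_code("2210-01-110"))
print(linked["2210-01-110"])
```

With consistent codes in place, the same head can be compared across states and financial years with a single lookup.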
  36. Open Data Pipeline Scrape Parse Transform Publish Analyse

  37. Analyze https://openbudgetsindia.org/dataset/union-budget-at-a-glance-timeseries

  38. Analyze https://openbudgetsindia.org/dataset/ahmedabad-municipal-corporation-budget-summary-statement

  39. Analyze http://unionbudget2017.cbgaindia.org/

  40. Analyze https://cbgaindia.github.io/story-generator/

  41. Educate https://openbudgetsindia.org/budget-basics/

  42. Future Work

  43. Contribute Help us to: • Generate more Open Budget Data

    • Improve our Algorithms and evolve our Codebase • Cover Budget Data in your Geography • Refine our Designs & much more! We are open to new ideas, suggestions and feedback
  44. THANK YOU Image Source: https://alanorth.github.io/github-pages-2015/#/ Slides URL: http://tiny.cc/obi-fifthel Code: https://github.com/cbgaindia

    Email: gaurav.godhwani@gmail.com @gggodhwani @OpenBudgetsIn @CBGAIndia @DataKindBLR