Transforming India's Budgets into Open Linked Data

The Fifth Elephant is a conference on data science, engineering, and machine learning. Budget documents across the various tiers of government in India contain detailed information on allocations made and resources raised in a financial year. Unfortunately, these documents are published as unstructured PDFs, which makes it difficult for researchers, economists and the general public to analyze and use this crucial data. This session will delve into our journey of developing OpenBudgetsIndia - a collective initiative to make India’s budgets open, usable and easy to comprehend.

Gaurav Godhwani

July 27, 2017

Transcript

  1. TRANSFORMING INDIA’S BUDGETS INTO OPEN LINKED DATA
    Gaurav Godhwani (@gggodhwani)
    Open Budgets India - an open data platform on government budgets in India
    Centre for Budget and Governance Accountability (CBGA)
    DataKind Bangalore
  2. Government budgets are globally considered “Moral Documents”*, reflecting the priorities and values of the state and its people.
    *Source: https://www.vox.com/policy-and-politics/2017/3/16/14943748/trump-budget-outline-moral
  3. Major issues with India’s Budgets
    • Unstructured PDF documents
    • Limited availability of Budgets online
    • Inconsistent Formats
    • No Metadata
    • Inconsistent and incomplete Budget Codes aka Unique IDs
  4. ‘OPEN’ BUDGET DATA
    • Publicly Accessible (online)
    • Reusable Format (data points, not just analysis)
    • Without any restriction (free/legally open)
    • Machine-readable (editable)
  5. Scrape
    Scrape Utils
    • Union Budget Plugin
    • State1 Budget Plugin
    • StateX Budget Plugin
    • Municipal1 Budget Plugin
    • MunicipalX Budget Plugin
    • ...
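
The slide only names the pieces (shared scrape utilities plus one plugin per budget source), so the following is a minimal sketch of how such a layout could look, not the project's actual code. The class names, the `registry` mechanism, and the example.org URLs are all hypothetical.

```python
from abc import ABC, abstractmethod


class BudgetScraper(ABC):
    """Hypothetical base class: one subclass (plugin) per budget source."""

    registry = {}  # filled in automatically as plugins are defined
    name = None    # short identifier each plugin must set

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.name:
            BudgetScraper.registry[cls.name] = cls

    @abstractmethod
    def document_urls(self, financial_year):
        """Return the PDF URLs this source publishes for a financial year."""


class UnionBudgetScraper(BudgetScraper):
    name = "union"

    def document_urls(self, financial_year):
        # Placeholder URL pattern, not a real endpoint.
        return [f"https://example.org/union/{financial_year}/budget.pdf"]


class KarnatakaBudgetScraper(BudgetScraper):
    name = "karnataka"

    def document_urls(self, financial_year):
        # Placeholder URL pattern, not a real endpoint.
        return [f"https://example.org/karnataka/{financial_year}/volume-1.pdf"]


def collect_urls(source, financial_year):
    """Shared utility: look up the plugin for a source and list its documents."""
    scraper = BudgetScraper.registry[source]()
    return scraper.document_urls(financial_year)
```

With this shape, covering a new state or municipality means adding one small subclass, while download and retry logic could live in the shared utilities.
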
  6. Parse Step #1: Loop over each page in the PDF and convert it into an image
  7. Parse Step #2: Get all major vertical & horizontal lines in the image using Hough Transform
    Image Source: Amos Storkey http://homepages.inf.ed.ac.uk/amos/hough.html
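
To make the voting idea concrete, here is a small pure-Python sketch of a Hough transform restricted to two angles, which is enough for axis-aligned table rules. It assumes the page image has already been binarized into rows of 0/1 pixels; the function name and the `min_votes` threshold are illustrative, and the talk does not say which implementation was actually used.

```python
import math
from collections import Counter


def hough_axis_lines(image, min_votes):
    """Find long vertical and horizontal lines in a binary image.

    `image` is a list of rows where 1 marks a dark pixel. Every dark pixel
    votes for rho = x*cos(theta) + y*sin(theta), but only for theta = 0
    degrees (vertical lines, rho is the x position) and theta = 90 degrees
    (horizontal lines, rho is the y position).
    """
    thetas = (0, 90)
    votes = Counter()
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            for theta in thetas:
                t = math.radians(theta)
                rho = round(x * math.cos(t) + y * math.sin(t))
                votes[(theta, rho)] += 1
    # Keep only accumulator cells with enough votes to count as a "major" line.
    vertical = sorted(rho for (theta, rho), n in votes.items()
                      if theta == 0 and n >= min_votes)
    horizontal = sorted(rho for (theta, rho), n in votes.items()
                        if theta == 90 and n >= min_votes)
    return vertical, horizontal
```

On real scans a library routine such as OpenCV's `HoughLinesP` would be the more practical choice, since it handles slightly skewed lines and is far faster than a Python loop.
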
  8. Parse Step #4: Compute coordinates for tables & columns for each page, aka Table Attributes
    table_bounds = { "top": …, "left": …, "bottom": …, "right": … }
    column_coordinates = [c1, c2, c3, ..., cN]
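
As a rough illustration of how these attributes might be derived, the sketch below assumes that line detection has produced a list of x positions for vertical lines and y positions for horizontal lines on a page. It further assumes the outermost lines are the table border and every vertical line in between separates two columns; real pages (several tables, missing borders) would need more care. The function name is hypothetical.

```python
def table_attributes(vertical_xs, horizontal_ys):
    """Derive table bounds and column separators from detected line positions.

    Returns (table_bounds, column_coordinates), or None when the page does
    not contain enough lines to enclose a table.
    """
    if len(vertical_xs) < 2 or len(horizontal_ys) < 2:
        return None
    xs = sorted(vertical_xs)
    ys = sorted(horizontal_ys)
    table_bounds = {
        "top": ys[0],
        "left": xs[0],
        "bottom": ys[-1],
        "right": xs[-1],
    }
    # Inner vertical lines are treated as column boundaries.
    column_coordinates = xs[1:-1]
    return table_bounds, column_coordinates
```
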
  9. Parse Step #5: Table attributes are passed as input to Tabula (https://github.com/tabulapdf/tabula)
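
One way to hand the attributes over, sketched here under assumptions: the extraction is run through the tabula-java command line (its `--pages`, `--area`, `--columns` and `--format` options), and coordinates measured on a rendered page image have to be rescaled to PDF points (1/72 inch). The helper name and the `dpi` parameter are illustrative; the talk does not show how Tabula was invoked.

```python
def tabula_arguments(pdf_path, page, table_bounds, column_coordinates, dpi=72):
    """Build a tabula-java argument list from pixel-space table attributes."""
    scale = 72.0 / dpi  # PDF points per image pixel

    def pt(value):
        # Convert a pixel coordinate to points, formatted for the command line.
        return f"{value * scale:.2f}"

    # --area expects top,left,bottom,right.
    area = ",".join(pt(table_bounds[k]) for k in ("top", "left", "bottom", "right"))
    args = ["--pages", str(page), "--area", area, "--format", "CSV"]
    if column_coordinates:
        args += ["--columns", ",".join(pt(c) for c in column_coordinates)]
    return args + [pdf_path]
```

The returned list could then be appended to a `java -jar` call on whichever Tabula jar is installed, for example via `subprocess.run`.
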
  10. Parse Step #6: Custom plugins to modify table data wherever necessary
    • Fixing Header Values
    • Merging & Splitting of Rows & Columns
    • Filtering out non-Unicode (UTF) characters
    • Other Data Sanity Checks
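
The clean-up steps listed on this slide might look roughly like the sketch below. It is an illustration under assumptions, not the project's plugin code: the table is a list of rows of strings whose first row is the extracted header, "non-Unicode characters" is approximated as control characters plus the Unicode replacement character, and a row with text only in its first column is assumed to be a wrapped label belonging to the previous row.

```python
import re

# Control characters and the replacement character left behind by bad decoding.
JUNK = re.compile(r"[\x00-\x1f\x7f\ufffd]")


def clean_cell(text):
    """Strip junk characters and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", JUNK.sub(" ", text)).strip()


def clean_table(rows, header):
    """Return the table with a fixed header and tidied data rows."""
    width = len(header)
    cleaned = []
    for row in rows[1:]:  # rows[0] is the extracted header, replaced below
        cells = [clean_cell(c) for c in row]
        if not any(cells):
            continue  # drop rows that are blank after cleaning
        if len(cells) != width:
            # Sanity check: every row must match the header width.
            raise ValueError(f"expected {width} columns, got {len(cells)}: {cells}")
        # Assumed rule: text in the first column only continues the previous row.
        if cleaned and cells[0] and not any(cells[1:]):
            cleaned[-1][0] = f"{cleaned[-1][0]} {cells[0]}"
            continue
        cleaned.append(cells)
    return [list(header)] + cleaned
```

The merge rule is the part most likely to differ between documents, which is presumably why such fixes were kept in per-document plugins.
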
  11. Transform: State Budgets follow a 6-tier classification to record their Receipts and Expenditure
    • Detailed Head: Like 'Salaries', 'Office Expenses', Object
    • Object Head: Covers Objects and Sub-schemes
  12. Transform: Unfortunately, these budget heads are arranged hierarchically in the PDF, and the same arrangement carries over into the CSV, making the CSVs slightly difficult to analyze.
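
One common way to make such hierarchical rows easier to analyze is to flatten them, so that every amount carries the names of all its parent heads. The sketch below assumes each row has already been tagged with its tier depth, that depths never skip a level, and that heading rows carry no amount; the row format and function name are illustrative.

```python
def flatten_heads(rows):
    """Turn hierarchically ordered (level, name, amount) rows into flat records.

    `level` is the 1-based tier depth and `amount` is None on heading rows.
    Each output record repeats the names of all ancestor heads.
    """
    path = []
    flat = []
    for level, name, amount in rows:
        # Cut the path back to this row's parent, then descend into the row.
        del path[level - 1:]
        path.append(name)
        if amount is not None:
            flat.append((*path, amount))
    return flat
```

After flattening, filtering or grouping by any tier becomes an ordinary column operation on the CSV.
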
  13. Publish: CKAN Plugin Architecture
    • THEMES, VISUALIZATIONS, SITEMAPS, OTHERS
    • Python Scripts for Pylons
    • Jinja Templates
    • JS, CSS, Image Files
    • Add-on Libraries
  14. Publish: OPEN LINKED DATA
    • Karnataka Budget 2017-18
    • Karnataka Budget 2016-17
    • Karnataka Budget 2015-16
    • Karnataka Budget 2014-15
    • Sikkim Budget 2017-18
    • Sikkim Budget 2016-17
    • Sikkim Budget 2015-16
    • Sikkim Budget 2014-15
    2210-01-110 : Urban Health Services Allopathy Hospitals & Dispensaries
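
The slide suggests that a shared budget code is what links the same head across states and years. A minimal sketch of that idea follows, assuming the three hyphen-separated parts of a code such as 2210-01-110 correspond to the major, sub-major and minor head. The function names, field names and record layout are hypothetical.

```python
from collections import defaultdict


def split_budget_code(code):
    """Split a code such as '2210-01-110' into its hyphen-separated tiers."""
    major, sub_major, minor = code.strip().split("-")
    return {"major_head": major, "sub_major_head": sub_major, "minor_head": minor}


def link_by_code(datasets):
    """Group records from several datasets under their shared budget code.

    `datasets` maps a dataset title to a list of (code, amount) pairs.
    """
    linked = defaultdict(dict)
    for title, records in datasets.items():
        for code, amount in records:
            linked[code][title] = amount
    return dict(linked)
```

Looking up one code then returns its value in every dataset that reports it, which is the kind of cross-year and cross-state comparison the linked view is meant to support.
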
  15. Contribute
    Help us to:
    • Generate more Open Budget Data
    • Improve our Algorithms and evolve our Codebase
    • Cover Budget Data in your Geography
    • Refine our Designs & much more!
    We are open to new ideas, suggestions and feedback