Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Insight Fellowship Demo

muchovej
June 26, 2015

Insight Fellowship Demo

Demo for the Insight Fellowship Program.

muchovej

June 26, 2015
Tweet

Transcript

  1. data • Bill summary/info: sunlight foundation, legiscan • Bill text:

    scraped off CA legislature site. • Location/district: googlemaps, sunlight, SQL • Representative information: sunlight API • Financial information: transparency data API
  2. algorithm • Topic analysis to identify issues in the bills

    • Amass and clean the data (scrape) • Tokenize (stem vs. lemmatize) • N-grams • Identify stop words (nltk -> “df” in tfidf) • Latent Dirichlet Allocation (LDA) for distinct topics
  3. validation • 4 distinct topics (education, health-related, tax/ economy, regulatory

    [changing rules, etc]) • trained on 80% — validated on sample of single- topic bills from test set. • correctly identified 50 out of 50 single-topic bills a (few) examples: 96% health 94%health 98% health 97% education 88% education (valley fever) (stroke) (alzheimers) 
 (apprenticeship) (postsecondary)
  4. numbers!! • 2015-2016 session: ~2600 bills, 35% of which are

    single-topic (>85%) • distribution of single topic bills: education (450), health (183), tax (65), regulations/procedures (304) • sample regulation ones: Alcoholic beverage control, gambling, etc. • trained on 80%, of the 20% left for testing, ~130 are single-topic. Checked a sample of each of those to confirm (~40 total) and they all agreed. • corpus: 1.6million words, dictionary of ~26K tokens when tokenized and throwing out single occurrences. • bigrams: 33%. trigrams: 0! • sample multi-topical: SB65 dealing with establishing olive oil commission for grading olive oil from within the state department of public health; drought regulation and education.
  5. future directions • expand to period from 2009 to 2015

    • on expanded corpus, use word2vec to identify extra topics of interest (i.e., environment, social services) • voting records • track “flip-flopping” • trend of topics • NLP of speeches • correlation between actions and donor interest. • expand to all states nationwide. • include organizations to fund for/against representative. • get picked up by fivethirtyeight