
Insight Fellowship Demo

muchovej
June 26, 2015

Demo for the Insight Fellowship Program.

Transcript

  1. data
    • Bill summary/info: Sunlight Foundation, LegiScan
    • Bill text: scraped off the CA legislature site
    • Location/district: Google Maps, Sunlight, SQL
    • Representative information: Sunlight API
    • Financial information: Transparency Data API
  2. algorithm
    • Topic analysis to identify issues in the bills
    • Amass and clean the data (scrape)
    • Tokenize (stem vs. lemmatize)
    • N-grams
    • Identify stop words (nltk -> “df” in tfidf)
    • Latent Dirichlet Allocation (LDA) for distinct topics
  3. validation
    • 4 distinct topics (education, health-related, tax/economy, regulatory [changing rules, etc.])
    • trained on 80%; validated on a sample of single-topic bills from the test set
    • correctly identified 50 out of 50 single-topic bills
    • a (few) examples:
        96% health (valley fever)
        94% health (stroke)
        98% health (Alzheimer's)
        97% education (apprenticeship)
        88% education (postsecondary)
  4. numbers!!
    • 2015-2016 session: ~2600 bills, 35% of which are single-topic (>85%)
    • distribution of single-topic bills: education (450), health (183), tax (65), regulations/procedures (304)
    • sample regulation ones: alcoholic beverage control, gambling, etc.
    • trained on 80%; of the 20% left for testing, ~130 are single-topic. Checked a sample of each of those to confirm (~40 total) and they all agreed.
    • corpus: 1.6 million words, dictionary of ~26K tokens when tokenized and throwing out single occurrences
    • bigrams: 33%. trigrams: 0!
    • sample multi-topical: SB65, dealing with establishing an olive oil commission for grading olive oil from within the state department of public health; drought regulation and education.
  5. future directions
    • expand to period from 2009 to 2015
    • on expanded corpus, use word2vec to identify extra topics of interest (e.g., environment, social services)
    • voting records
    • track “flip-flopping”
    • trend of topics
    • NLP of speeches
    • correlation between actions and donor interest
    • expand to all states nationwide
    • include organizations to fund for/against representative
    • get picked up by FiveThirtyEight
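
The preprocessing and validation steps on the algorithm, validation and numbers slides can be sketched roughly as below. This is a minimal illustration using only the Python standard library, not the code behind the demo: the function names, the `max_df` cutoff, and the toy sentences are all assumptions made for the example. Stemming/lemmatizing and the LDA fit itself are left out; the per-bill topic weights passed to `dominant_topic` would come from whatever LDA implementation is used, which the slides do not name.

```python
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs; stemming/lemmatizing is omitted."""
    return re.findall(r"[a-z]+", text.lower())


def add_bigrams(tokens):
    """Append adjacent-word bigrams (e.g. 'stroke_care') to the unigrams."""
    return tokens + [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def build_vocabulary(docs, max_df=0.5):
    """Keep tokens that occur more than once in the corpus and appear in at
    most max_df of the documents. The max_df cutoff is an assumed stand-in
    for the slide's document-frequency ("df") stop-word idea."""
    total = Counter(tok for doc in docs for tok in doc)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    limit = max_df * len(docs)
    return {t for t in total if total[t] > 1 and doc_freq[t] <= limit}


def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle reproducibly, then split into train and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def dominant_topic(weights, threshold=0.85):
    """Return the top topic if its weight exceeds the threshold (the slide's
    >85% single-topic rule); return None for a multi-topic bill."""
    label, weight = max(weights.items(), key=lambda kv: kv[1])
    return label if weight > threshold else None


# Made-up sentences standing in for bill text.
texts = [
    "The school district shall fund the apprenticeship program.",
    "The apprenticeship program shall report to the school board.",
    "The department shall fund stroke care.",
    "Stroke care shall be reviewed by the department.",
]
docs = [add_bigrams(tokenize(t)) for t in texts]
vocab = build_vocabulary(docs)
filtered_docs = [[tok for tok in doc if tok in vocab] for doc in docs]
train_docs, test_docs = split_train_test(filtered_docs)
```

Words that appear in most documents ("the", "shall") fall out through the document-frequency cutoff without a hand-written stop list, and tokens seen only once are discarded, mirroring the "throwing out single occurrences" note on the numbers slide. A bill whose largest topic weight is 0.96 would be labeled single-topic by `dominant_topic`, while one split 0.6/0.4 would not.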