CLASSIEfier: Using Machine Learning to Paint a Picture of Social Sector Trends

Tracking the flow of funding and other support to social sector organisations in Australia has historically been difficult because of inconsistent categorisation, or the absence of categorisation entirely. Our Community (a Melbourne-based social enterprise) developed CLASSIE to serve as a universal classification system for Australian social sector initiatives and entities. We are now developing a machine learning algorithm to reduce or remove the need for manual (human) classification. Once released, CLASSIEfier will allow us to classify historical records on behalf of grantmakers and other social sector supporters, and reduce the need for human intervention in the classification of current and future records. In the long term, it will allow us to answer fundamental questions such as: Where is the money going? Are we helping the areas in most need?

I will present the project scope and development of CLASSIEfier, highlighting my experiences using machine learning in the social sector. I will also discuss the difficulties of working with text and sensitive data, and the methodologies used to identify and mitigate algorithmic biases.

Paola Oliva-Altamirano

May 06, 2019

Transcript

  1. CLASSIEFIER: USING MACHINE LEARNING TO PAINT A PICTURE OF SOCIAL TRENDS

    Dr. Paola Oliva-Altamirano, Innovation Lab, Our Community, May 2019
  2. Who am I?

    A foreigner. From Honduras to the US to Australia. From Galaxies to Taxonomies. Our Community - Innovation Lab
  3. Outline:

    • Introducing Our Community’s data initiatives
    • Background: CLASSIE, a social dictionary
    • How did we scope CLASSIEfier?
    • How did CLASSIEfier evolve as a project?
    • Data science for social good concept
    • Results and conclusions
  4. Our Community is a social enterprise and B Corp that provides advice, connections, training and easy-to-use tech tools for community-builders.

    • Donation platform • Grants database • Training and networking • Software for grants applications
  5. Main objective – Classification of grants

    Australia lacked a unified taxonomy to classify subjects, beneficiaries and organisation types. In 2016, OC introduced CLASSIE, the classification system for Australian social sector initiatives and entities. CLASSIE opens the door to standard classification.
  6. CLASSIE

    A social sector dictionary: • Subjects • Populations • Organisation type. Where is the money going? How is the Australian social sector working?
  7. Hierarchical Classification – e.g. Subjects

    Example categories: Social Sciences; Anthropology; Archeology; Biological anthropology; Interdisciplinary studies; Ethnic studies; Indigenous studies; Asian studies; Sport and recreation; Community recreation; Parks; Camps; Sport; Outdoor sport; Mountain and rock climbing; Hiking and walking; Paralympics.
    Level 1: 17 categories. Level 2: 132 categories. Level 3: 492 categories. Level 4: 243 categories.
  8. Questions

    Now we have the dictionary – how do we apply it? • How do we ensure that users are choosing the correct category? • How do we classify historical data? 800,000 grant applications since 2010.
  9. CLASSIEfier – Two different models

    1. To give automatic suggestions to grant applicants. 2. To classify historical data. Seems like you are applying for: ☐ Sports and recreation ☐ Art and culture ☐ Community and development
  10. CLASSIEfier – The Algorithm

    How do we generate more labels? At least 2,000 applications per category. What do we have? 800,000 grant applications; 4,000 grant applications labelled by users since CLASSIE went live.
  11. First phase: simple keyword matching to extract more labels

    CLASSIEfier – The Algorithm
    Keyword matching = the process of searching for literal matches (e.g. “hospital”) in a given piece of text (e.g. a grant description) to identify groups or subjects (e.g. health sector).
    Stages: • Identify keywords for CLASSIE • Extract applications that exhibit a strong match • Score the classification done by users
    We found that: • Keyword matching accuracy differs from one category to another. For example, “orphans” is a confusing category, while “wildlife welfare” is a straightforward category. • On average, accuracy is around 80%.
    Example: “This project will raise awareness and empower deaf people by providing key mental health information in their primary language (Australian Sign Language).” Category: People with hearing impediment.
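The keyword-matching stage on slide 11 can be sketched roughly as follows. This is a minimal illustration, not the CLASSIEfier code: the keyword lists, the `min_hits` threshold used to define a "strong match", and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the real CLASSIE keyword sets are not given in the talk.
KEYWORDS = {
    "Health": ["hospital", "mental health", "clinic"],
    "People with hearing impediment": ["deaf", "sign language", "hearing"],
    "Sport and recreation": ["sport", "hiking", "climbing"],
}


def match_keywords(text, keywords=KEYWORDS):
    """Count literal, whole-word keyword hits per category."""
    lowered = text.lower()
    hits = Counter()
    for category, words in keywords.items():
        for word in words:
            pattern = r"\b" + re.escape(word) + r"\b"
            hits[category] += len(re.findall(pattern, lowered))
    # Drop categories that had no hits at all.
    return {category: n for category, n in hits.items() if n > 0}


def strong_matches(text, min_hits=2, keywords=KEYWORDS):
    """Keep categories with at least min_hits hits (an assumed threshold)."""
    counts = match_keywords(text, keywords)
    return sorted(c for c, n in counts.items() if n >= min_hits)


description = (
    "This project will raise awareness and empower deaf people by providing "
    "key mental health information in their primary language "
    "(Australian Sign Language)."
)
print(match_keywords(description))
print(strong_matches(description))
```

Applications that pass the strong-match filter could then serve as extra training labels, which is the role the slide gives to this first phase.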
  12. Second phase: Training the Machine Learning model

    CLASSIEfier – The Algorithm. Training dataset: 128,000 grant applications classified by keyword matching. DIFFICULTY #1: Multilabel. DIFFICULTY #2: Hierarchy. DIFFICULTY #3: Number of labels per category.
  13. Multilabels and Hierarchy

    Example: A grant application that is aimed at helping teenagers with autism. Beneficiaries: • “Children and youth” at level 1 • “Adolescents” at level 2. And also: • “People with disabilities” at level 1 • “People with intellectual disabilities” at level 2
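To make the combination of multiple labels and hierarchy on slide 13 concrete, here is a small sketch that expands an application's most specific labels into the full set of labels at every level. The `PARENT` mapping is a hypothetical two-entry slice of the Populations tree built only from the slide's example; the real CLASSIE tree is much larger.

```python
# Child -> parent links for a tiny, hypothetical slice of the Populations tree.
PARENT = {
    "Adolescents": "Children and youth",
    "People with intellectual disabilities": "People with disabilities",
}


def with_ancestors(labels, parent=PARENT):
    """Return the given labels plus every ancestor, as {label: level}."""
    result = {}
    for label in labels:
        # Walk up the tree from the label to its root.
        chain = [label]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        # The root of the chain is level 1, its child is level 2, and so on.
        for level, name in enumerate(reversed(chain), start=1):
            result[name] = level
    return result


labels = with_ancestors(["Adolescents", "People with intellectual disabilities"])
for name, level in sorted(labels.items(), key=lambda item: (item[1], item[0])):
    print(f"level {level}: {name}")
```

A single application therefore carries four labels across two branches, which is why the classifier has to be both multilabel and hierarchy-aware.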
  14. DIFFICULTY #3: Number of labels per category

    Niche classification or “black holes”: categories such as Confucius, North American people and Nomadic people, among others, will have fewer than 100 grant applications, 20 times fewer than the 2,000 minimum required.
  15. How do we solve it? – Separate training

    CLASSIEfier reads the application. Classification Level 1: machine learning. Example “Sports and recreation”: at Level 2 we have enough labels, so we use another ML model; Level 3: keyword matching. Example “Information and communications”: at Level 2 we do not have enough labels, so we use keyword matching; Level 3: keyword matching.
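The routing described on slide 15 can be written down as a small decision function. This is a sketch under assumptions: the function name and the example label counts are hypothetical, and the 2,000 threshold is taken from the "minimum required" figure mentioned earlier in the talk.

```python
MIN_LABELS = 2000  # minimum labelled applications per category, as quoted in the talk


def choose_method(level, label_count, min_labels=MIN_LABELS):
    """Pick the classification method for one category at a given level."""
    if level == 1:
        # Level 1 is always handled by the machine learning model.
        return "machine learning"
    if level == 2 and label_count >= min_labels:
        # Enough labels at level 2: train a separate ML model for this branch.
        return "machine learning"
    # Too few labels at level 2, or any deeper level: fall back to keywords.
    return "keyword matching"


# Hypothetical label counts, for illustration only.
print(choose_method(2, 5000))  # a well-populated level 2 branch
print(choose_method(2, 90))    # a niche level 2 branch ("black hole")
print(choose_method(3, 5000))  # level 3 always uses keyword matching
```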
  16. Third phase: Model interpretation: scoring and checking for biases

    CLASSIEfier – The Algorithm. Stages: • Choose the best model – k-nearest neighbours (k-nn) • Choose the best parameters • Choose the best scoring
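For readers unfamiliar with k-nearest neighbours, the idea is to label a new application by looking at the labels of the most similar applications already seen. The sketch below is a toy, pure-Python version: the bag-of-words features, cosine similarity, single-label majority vote, `k=3`, and the training sentences are all assumptions for illustration, since the talk does not state which features, distance or parameters CLASSIEfier uses.

```python
import math
from collections import Counter


def vectorise(text):
    """Bag-of-words term counts (an assumed, deliberately simple feature)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(text, training, k=3):
    """Return the majority label among the k most similar training examples."""
    query = vectorise(text)
    ranked = sorted(
        training,
        key=lambda example: cosine(query, vectorise(example[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented training sentences, for illustration only.
TRAINING = [
    ("new uniforms for the junior football club", "Sport and recreation"),
    ("upgrade lighting at the community tennis courts", "Sport and recreation"),
    ("swimming lessons for the local sport club", "Sport and recreation"),
    ("mural painting workshop for local artists", "Art and culture"),
    ("community theatre production and art exhibition", "Art and culture"),
    ("youth orchestra instruments and music lessons", "Art and culture"),
]

print(knn_predict("football club needs new sport equipment", TRAINING))
```

A production model would need a multilabel variant and better text features; this only shows the neighbour-voting principle.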
  17. Scoring

    Recall: $\frac{TP}{TP + FN}$, an indication of missing labels. Precision: $\frac{TP}{TP + FP}$, an indication of bad predictions.
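For a single application with several categories, recall and precision can be computed directly from the predicted and true label sets. The sketch below follows the standard definitions (true positives TP, false positives FP, false negatives FN); the function name and the example labels, including the wrong prediction "Seniors", are made up for illustration.

```python
def precision_recall(predicted, actual):
    """Precision and recall for one application's predicted vs. true labels."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)   # labels correctly predicted
    fp = len(predicted - actual)   # bad predictions
    fn = len(actual - predicted)   # missing labels
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if actual else 0.0
    return precision, recall


actual = {
    "Children and youth",
    "Adolescents",
    "People with disabilities",
    "People with intellectual disabilities",
}
predicted = {"Children and youth", "People with disabilities", "Seniors"}

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predictions are right (precision 2/3) and two of the four true labels were found (recall 1/2).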
  18. Scoring

    Based on the fact that each application has several categories.
    Recall: how many categories got picked per application? 0 = none; 1 = <45%; 2 = >45%; 3 = perfect match.
    Precision: how many categories are wrong per application? 0 = all; 1 = >55%; 2 = <55%; 3 = none (perfect match).
    Total score: 0 = useless model; 6 = perfect model. CLASSIEfier scores ~4-5.
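The 0 to 6 score on slide 18 can be sketched as two banded scores added together. Several details here are assumptions, not taken from the talk: how values falling exactly on 45% or 55% are treated, the use of predicted labels as the denominator for the "wrong" fraction, and scoring an empty prediction as 0 on precision.

```python
def recall_score(predicted, actual):
    """0-3 band for the share of true categories that were picked.

    Assumes `actual` is non-empty.
    """
    actual = set(actual)
    found = len(set(predicted) & actual) / len(actual)
    if found == 0:
        return 0          # none picked
    if found == 1:
        return 3          # perfect match
    return 1 if found < 0.45 else 2


def precision_score(predicted, actual):
    """0-3 band for the share of predicted categories that are wrong."""
    predicted = set(predicted)
    if not predicted:
        return 0          # assumption: no prediction scores the same as all wrong
    wrong = len(predicted - set(actual)) / len(predicted)
    if wrong == 1:
        return 0          # all wrong
    if wrong == 0:
        return 3          # none wrong: perfect match
    return 1 if wrong > 0.55 else 2


def total_score(predicted, actual):
    """Combined score: 0 is a useless model, 6 is a perfect one."""
    return recall_score(predicted, actual) + precision_score(predicted, actual)


truth = {"Children and youth", "Adolescents",
         "People with disabilities", "People with intellectual disabilities"}
guess = {"Children and youth", "People with disabilities", "Seniors"}
print(total_score(guess, truth))
```

In this made-up example half of the true categories are found (band 2) and a third of the predictions are wrong (band 2), giving a total of 4.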
  19. Misclassifications and black holes will lead to underfunding of minorities that are already overlooked
  20. The Data Science for Social Good Movement “The best minds

    of my generation are thinking about how to make people click ads,” he says. “That sucks.” -- Jeff Hammerbacher (Cloudera and Facebook data leader)
  21. Algorithmic bias

    • This will happen if you feed the algorithm with data that is already biased, or with insufficient data: the algorithm will predict biased classifications. • Algorithms are mirrors. (Slide image label: Sport people)
  22. SHAP (SHapley Additive exPlanations) • WEAT tests proposed in Caliskan et al. 2017 • AI Fairness 360
  23. Document everything! – this is how we tackle biases

    Choose transparency
  24. Results and conclusions

    It is not feasible to classify human natural languages with 100% accuracy. Church / Religion / Christian: Model = Religion; Reality = a fete in a Catholic school.
  25. Results and conclusions

    • CLASSIEfier works similarly to humans, not better, not worse: ~70-80% accuracy. Out of 200 applications classified by users, we found that: 63% right, 18% wrong, 19% half right.
  26. Results and conclusions

    • The model is also discriminating between good and bad applications. Approved grant applications: 85% accuracy. Declined grant applications: 75% accuracy.
  27. Results and conclusions

    CLASSIEfier is now feeding back into CLASSIE. Seems like you are applying for: ☐ Sports and recreation ☐ Art and culture ☐ Community and development
  28. CLASSIEfier – More than just an algorithm

    • Data preprocessing • Writing and testing the algorithm • Production (back-end and front-end product) • Maintenance
  29. DO YOU WANT TO LEARN MORE?

    LinkedIn: paola-oliva-altamirano • Email: [email protected] • Innovation lab: https://www.ourcommunity.com.au/innovationlab