Slide 1

Slide 1 text

CLASSIEFIER: USING MACHINE LEARNING TO PAINT A PICTURE OF SOCIAL TRENDS Dr. Paola Oliva-Altamirano, Innovation Lab, Our Community, May 2019

Slide 2

Slide 2 text

Who am I? • A foreigner: from Honduras to the US to Australia • From galaxies to taxonomies Our Community - Innovation Lab

Slide 3

Slide 3 text

Outline: • Introducing Our Community’s data initiatives • Background: CLASSIE, a social dictionary • How did we scope CLASSIEfier? • How did CLASSIEfier evolve as a project? • Data science for social good concept • Results and conclusions

Slide 4

Slide 4 text

Our Community is a social enterprise and B Corp that provides advice, connections, training and easy-to-use tech tools for community-builders. • Donation platform • Grants database • Training and networking • Software for grant applications

Slide 5

Slide 5 text


Slide 6

Slide 6 text

From CLASSIE to CLASSIEfier

Slide 7

Slide 7 text

Main objective: classification of grants
• Australia lacked a unified taxonomy to classify subjects, beneficiaries and organisation types.
• In 2016, Our Community (OC) introduced CLASSIE, the classification system for Australian social sector initiatives and entities.
• CLASSIE opens the door to standard classification.

Slide 8

Slide 8 text

CLASSIE: a social sector dictionary • Subjects • Populations • Organisation type
Where is the money going? How is the Australian social sector working?

Slide 9

Slide 9 text

Hierarchical classification, e.g. Subjects
Level 1: 17 categories • Level 2: 132 categories • Level 3: 492 categories • Level 4: 243 categories
Example branch, Social Sciences: Anthropology, Archeology, Biological anthropology, Interdisciplinary studies, Ethnic studies, Indigenous studies, Asian studies
Example branch, Sport and recreation: Community recreation, Parks, Camps, Sport, Outdoor sport, Mountain and rock climbing, Hiking and walking, Paralympics

Slide 10

Slide 10 text

Now we have the dictionary. How do we apply it?
Questions: • How do we ensure that users are choosing the correct category? • How do we classify historical data (800,000 grant applications since 2010)?

Slide 11

Slide 11 text

CLASSIEfier is a tool that will automatically classify grants

Slide 12

Slide 12 text

How did we scope CLASSIEfier?

Slide 13

Slide 13 text

Source: “One model to rule them all” by Christoph Molnar

Slide 14

Slide 14 text

CLASSIEfier: two different models
1. To give automatic suggestions to grant applicants. Seems like you are applying for: ☐ Sports and recreation ☐ Art and culture ☐ Community and development
2. To classify historical data

Slide 15

Slide 15 text

CLASSIEfier: How does it work?

Slide 16

Slide 16 text

How did CLASSIEfier evolve?

Slide 17

Slide 17 text

CLASSIEfier – The Algorithm
What do we have? • 800,000 grant applications • 4,000 grant applications labelled by users since CLASSIE went live
What do we need? At least 2,000 labelled applications per category
How do we generate more labels?

Slide 18

Slide 18 text

CLASSIEfier – The Algorithm
First phase: simple keyword matching to extract more labels
Keyword matching is the process of searching for literal matches (e.g. “hospital”) in a given piece of text (e.g. a grant description) to identify groups or subjects (e.g. the health sector).
Stages: • Identify keywords for CLASSIE • Extract applications that exhibit a strong match • Score the classification done by users
Example: “This project will raise awareness and empower deaf people by providing key mental health information in their primary language (Australian Sign Language).” Category: People with hearing impediment.
We found that: • Keyword matching accuracy differs from one category to another. For example, “orphans” is a confusing category, while “wildlife welfare” is a straightforward one. • On average, accuracy is around 80%.
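The keyword-matching phase can be illustrated with a minimal Python sketch. The keyword lists, the category names used as dictionary keys, and the `min_hits` threshold standing in for a "strong match" are all assumptions made for illustration; the slides do not give the real CLASSIE keywords or the matching rule.

```python
import re

# Hypothetical keyword lists; the real CLASSIE keywords are not in the slides.
KEYWORDS = {
    "Health": ["hospital", "mental health", "clinic"],
    "People with hearing impediment": ["deaf", "sign language", "hearing"],
    "Sport and recreation": ["sport", "hiking", "climbing"],
}


def keyword_match(text, keywords=KEYWORDS, min_hits=2):
    """Return {category: hit_count} for categories with at least min_hits literal matches."""
    lowered = text.lower()
    matches = {}
    for category, terms in keywords.items():
        # Count whole-word (or whole-phrase) literal occurrences of each keyword.
        hits = sum(
            len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
            for term in terms
        )
        if hits >= min_hits:
            matches[category] = hits
    return matches


description = (
    "This project will raise awareness and empower deaf people by providing "
    "key mental health information in their primary language "
    "(Australian Sign Language)."
)

# Any match at all versus an (assumed) "strong" match of two or more hits.
any_match = keyword_match(description, min_hits=1)
strong_match = keyword_match(description)
print(any_match)
print(strong_match)
```

Raising `min_hits` trades coverage for cleaner labels, which matters when the matched applications are reused as training data.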

Slide 19

Slide 19 text

CLASSIEfier – The Algorithm
Second phase: training the machine learning model
Training dataset: 128,000 grant applications classified by keyword matching
• Difficulty #1: Multilabel • Difficulty #2: Hierarchy • Difficulty #3: Number of labels per category

Slide 20

Slide 20 text

Multilabels and hierarchy
Example: a grant application that is aimed at helping teenagers with autism. Beneficiaries: • “Children and youth” at level 1 • “Adolescents” at level 2 And also: • “People with disabilities” at level 1 • “People with intellectual disabilities” at level 2
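One way to picture the combination of multiple labels and hierarchy is a child-to-parent map in which every assigned label also implies its ancestors. The sketch below is an illustration only: the `PARENT` map is a hypothetical fragment built from the four labels in the example, not the real CLASSIE data structure.

```python
# Hypothetical fragment of the beneficiary hierarchy (child -> parent).
PARENT = {
    "Children and youth": None,
    "Adolescents": "Children and youth",
    "People with disabilities": None,
    "People with intellectual disabilities": "People with disabilities",
}


def with_ancestors(labels, parent=PARENT):
    """Expand a set of labels so each label is accompanied by all its ancestors."""
    expanded = set()
    for label in labels:
        node = label
        while node is not None:
            expanded.add(node)
            node = parent[node]
    return expanded


def level(label, parent=PARENT):
    """Depth of a label in the hierarchy, with top-level categories at level 1."""
    depth = 1
    while parent[label] is not None:
        label = parent[label]
        depth += 1
    return depth


# Teenagers with autism: two level-2 labels, each implying a level-1 label.
labels = with_ancestors({"Adolescents", "People with intellectual disabilities"})
by_level = {}
for label in sorted(labels):
    by_level.setdefault(level(label), []).append(label)
print(by_level)
```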

Slide 21

Slide 21 text

DIFFICULTY #3: Number of labels per category
Niche classifications or “black holes”: • Categories such as Confucius, North American people and Nomadic people, among others, will have fewer than 100 grant applications. That is 20 times fewer than the 2,000 minimum required.

Slide 22

Slide 22 text

How do we solve it? Separate training
Reads the application, then Classification Level 1: machine learning
• Sports and recreation: Classification Level 2: we have enough labels, so we use another ML model. Classification Level 3: keyword matching.
• Information and communications: Classification Level 2: we do not have enough labels, so we use keyword matching. Classification Level 3: keyword matching.
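The routing described on this slide can be sketched as a small decision function. The 2,000 minimum comes from the slides; the function names and the label counts in the example are hypothetical and only illustrate the branching.

```python
MIN_LABELS = 2000  # minimum labelled applications per category quoted in the slides


def choose_method(level, n_labelled, min_labels=MIN_LABELS):
    """Pick a classification method for one level of the hierarchy."""
    if level == 1:
        return "ml"  # level 1 is always handled by machine learning
    if level == 2:
        # A second ML model only when enough labels exist; otherwise fall back.
        return "ml" if n_labelled >= min_labels else "keywords"
    return "keywords"  # level 3 and deeper: keyword matching


def plan(label_counts):
    """Map {level: labelled applications} to {level: method}."""
    return {lvl: choose_method(lvl, n) for lvl, n in label_counts.items()}


# Hypothetical label counts for the two branches shown on the slide.
sport_plan = plan({1: 9000, 2: 3500, 3: 400})
infocomm_plan = plan({1: 6000, 2: 300, 3: 40})
print(sport_plan)
print(infocomm_plan)
```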

Slide 23

Slide 23 text

CLASSIEfier – The Algorithm
Third phase: model interpretation, scoring and checking for biases
Stages: • Choose the best model: k-nearest neighbours (k-nn) • Choose the best parameters • Choose the best scoring
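To make the k-nearest-neighbours idea concrete, here is a toy multilabel k-nn in plain Python. The slides only name the model; the bag-of-words features, the cosine similarity, the values of `k` and `min_votes`, and the tiny training set are all assumptions chosen for illustration, not the project's actual configuration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(text, training, k=3, min_votes=2):
    """Return labels carried by at least min_votes of the k most similar texts."""
    query = vectorize(text)
    ranked = sorted(
        training,
        key=lambda item: cosine(query, vectorize(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, labels in ranked[:k] for label in labels)
    return {label for label, n in votes.items() if n >= min_votes}


# Hypothetical labelled applications (text, set of categories).
TRAINING = [
    ("junior football club equipment", {"Sport and recreation", "Children and youth"}),
    ("football coaching for youth", {"Sport and recreation", "Children and youth"}),
    ("hiking trail maintenance", {"Sport and recreation"}),
    ("hospital equipment upgrade", {"Health"}),
    ("mental health support line", {"Health"}),
]

predicted = knn_predict("youth football equipment", TRAINING)
print(sorted(predicted))
```

Because each neighbour can carry several labels, the vote threshold is what lets a single application receive more than one category.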

Slide 24

Slide 24 text

Scoring
Recall = TP / (TP + FN): an indication of missing labels
Precision = TP / (TP + FP): an indication of bad predictions
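With the standard definitions, recall is TP / (TP + FN) and precision is TP / (TP + FP), and both can be computed per application by comparing the predicted label set with the true one. The sketch below assumes labels are held as Python sets; the example labels are illustrative.

```python
def precision_recall(predicted, actual):
    """Per-application precision and recall for two sets of labels."""
    tp = len(predicted & actual)   # labels predicted and correct
    fp = len(predicted - actual)   # labels predicted but wrong
    fn = len(actual - predicted)   # correct labels that were missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if actual else 0.0
    return precision, recall


actual = {
    "Children and youth",
    "Adolescents",
    "People with disabilities",
    "People with intellectual disabilities",
}
predicted = {"Children and youth", "Adolescents", "Sport and recreation"}

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predicted labels are right (precision 2/3) and two of the four true labels were found (recall 1/2).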

Slide 25

Slide 25 text

Scoring
Based on the fact that each application has several categories:
Recall: how many categories got picked per application • 0 = None • 1 = <45% • 2 = >45% • 3 = Perfect match
Precision: how many categories are wrong per application • 0 = All • 1 = >55% • 2 = <55% • 3 = None (perfect match)
Overall scale: 0 = useless model, 6 = perfect model. CLASSIEfier scores ~4-5.
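The bucketed scoring can be sketched as two small functions whose points are summed to give the 0 to 6 scale. Two things here are assumptions: that the overall score is the sum of the recall and precision points, and how the exact 45% and 55% boundaries are assigned, since the slide does not say which bucket they belong to.

```python
def recall_points(predicted, actual):
    """0 = none picked, 1 = under 45%, 2 = 45% or more, 3 = all picked."""
    hits = len(predicted & actual)
    if hits == 0:
        return 0
    if hits == len(actual):
        return 3
    return 1 if hits / len(actual) < 0.45 else 2


def precision_points(predicted, actual):
    """0 = all wrong, 1 = over 55% wrong, 2 = 55% or fewer wrong, 3 = none wrong."""
    wrong = len(predicted - actual)
    if not predicted or wrong == len(predicted):
        return 0
    if wrong == 0:
        return 3
    return 1 if wrong / len(predicted) > 0.55 else 2


def score(predicted, actual):
    """Combined score from 0 (useless) to 6 (perfect), assuming a simple sum."""
    return recall_points(predicted, actual) + precision_points(predicted, actual)


actual = {
    "Children and youth",
    "Adolescents",
    "People with disabilities",
    "People with intellectual disabilities",
}
predicted = {"Children and youth", "Adolescents", "Sport and recreation"}
example_score = score(predicted, actual)
print(example_score)
```

In this example half of the true categories are picked (2 points) and one third of the predictions are wrong (2 points), which lands in the same 4 to 5 range the slide reports for CLASSIEfier.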

Slide 26

Slide 26 text

Misclassifications and black holes will cause minorities that are already overlooked to be underfunded

Slide 27

Slide 27 text

The Data Science for Social Good Movement “The best minds of my generation are thinking about how to make people click ads. That sucks.” -- Jeff Hammerbacher (Cloudera and Facebook data leader)

Slide 28

Slide 28 text

Algorithmic bias • If you feed the algorithm data that is already biased, or insufficient data, it will predict biased classifications. • Algorithms are mirrors (Image example: Sport people)

Slide 29

Slide 29 text

Know your model! xkcd.com/1838/

Slide 30

Slide 30 text

• SHAP (SHapley Additive exPlanations) • WEAT tests proposed in Caliskan et al. 2017 • AI Fairness 360

Slide 31

Slide 31 text

Document everything! This is how we tackle biases. Choose transparency.

Slide 32

Slide 32 text

Results and conclusions • It is not feasible to classify natural human language with 100% accuracy. Example: Church, Religion, Christian. Model = Religion. Reality = a fete in a Catholic school.

Slide 33

Slide 33 text

Results and conclusions • CLASSIEfier works similarly to humans, no better and no worse: ~70-80% accuracy. Out of 200 applications classified by users, we found that: 63% right, 18% wrong, 19% half right.

Slide 34

Slide 34 text

Results and conclusions • The model is also discriminating between good and bad applications. Approved grant applications: 85% accuracy. Declined grant applications: 75% accuracy.

Slide 35

Slide 35 text

Results and conclusions • CLASSIEfier is now feeding back into CLASSIE. Seems like you are applying for: ☐ Sports and recreation ☐ Art and culture ☐ Community and development

Slide 36

Slide 36 text

CLASSIEfier – More than just an algorithm • Data preprocessing • Writing and testing the algorithm • Production: back-end and front-end product • Maintenance

Slide 37

Slide 37 text

DO YOU WANT TO LEARN MORE? LinkedIn: paola-oliva-altamirano Email: [email protected] Innovation Lab: https://www.ourcommunity.com.au/innovationlab