CLASSIEFIER: USING MACHINE LEARNING TO PAINT A PICTURE OF SOCIAL TRENDS
Dr. Paola Oliva-Altamirano, Innovation Lab, Our Community, May 2019
Slide 2
Slide 2 text
Who am I?
Our Community - Innovation Lab 2
A foreigner
From Honduras to the US to Australia
From Galaxies to Taxonomies
Slide 3
Slide 3 text
Outline:
• Introducing Our Community’s data initiatives
• Background: CLASSIE, a social sector dictionary
• How did we scope CLASSIEfier?
• How did CLASSIEfier evolve as a project?
• Data science for social good concept
• Results and conclusions
Slide 4
Slide 4 text
Our Community is a social enterprise and B Corp that provides advice, connections, training and easy-to-use tech tools for community-builders.
• Donation platform
• Grants database
• Training and networking
• Software for grants applications
Slide 5
Slide 5 text
Slide 6
Slide 6 text
From CLASSIE to CLASSIEfier
Slide 7
Slide 7 text
Main objective – Classification of grants
Australia lacked a unified taxonomy to classify subjects, beneficiaries and organization types.
In 2016, OC introduced CLASSIE, the classification system for Australian social sector initiatives and entities.
CLASSIE opens the door to standard classification.
Slide 8
Slide 8 text
CLASSIE
A social sector dictionary
• Subjects
• Populations
• Organisation type
Where is the money going? How is the Australian social sector working?
Slide 9
Slide 9 text
Hierarchical Classification – e.g. Subjects
Social Sciences
Anthropology
Archeology
Biological anthropology
Interdisciplinary studies
Ethnic studies
Indigenous studies
Asian studies
Sport and recreation
Community recreation
Parks
Camps
Sport
Outdoor sport
Mountain and rock climbing
Hiking and walking
Paralympics
Level 1: 17 categories
Level 2: 132 categories
Level 3: 492 categories
Level 4: 243 categories
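A hierarchy like the one on this slide can be stored as a mapping from each category to its parent. The sketch below is only an illustration: the parent links are assumptions read off the slide layout, not the official CLASSIE structure, and the names `PARENT`, `path_to_root` and `level_of` are hypothetical.

```python
# Hypothetical fragment of the Subjects hierarchy.
# The parent links are ASSUMED from the slide layout, not taken from CLASSIE itself.
PARENT = {
    "Sport and recreation": None,  # level 1 (no parent)
    "Community recreation": "Sport and recreation",
    "Sport": "Sport and recreation",
    "Parks": "Community recreation",
    "Camps": "Community recreation",
    "Outdoor sport": "Sport",
    "Hiking and walking": "Outdoor sport",
    "Mountain and rock climbing": "Outdoor sport",
}


def path_to_root(category):
    """Return the chain of categories from level 1 down to `category`."""
    path = []
    while category is not None:
        path.append(category)
        category = PARENT[category]
    return path[::-1]


def level_of(category):
    """Level 1 is the top of the hierarchy; each step down adds one."""
    return len(path_to_root(category))


print(path_to_root("Hiking and walking"))
print(level_of("Parks"))
```

Storing only the parent link keeps the structure small, and the level of any category follows from the length of its path.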
Slide 10
Slide 10 text
Now we have the dictionary – How do we apply it?
Questions:
• How do we ensure that users are choosing the correct category?
• How do we classify historical data? 800,000 grant applications since 2010
Slide 11
Slide 11 text
CLASSIEfier is a tool that will automatically classify grants
Slide 12
Slide 12 text
How did we scope CLASSIEfier?
Slide 13
Slide 13 text
Source: “One model to rule them all” by Christoph Molnar
Slide 14
Slide 14 text
CLASSIEfier – Two different models
1. To give automatic suggestions to grant applicants
Seems like you are applying for:
☐ Sports and recreation
☐ Art and culture
☐ Community and development
2. To classify historical data
Slide 15
Slide 15 text
CLASSIEfier: How does it work?
Slide 16
Slide 16 text
How did CLASSIEfier evolve?
Slide 17
Slide 17 text
CLASSIEfier – The Algorithm
What do we have?
• 800,000 grant applications
• 4,000 grant applications labeled by users since CLASSIE went live
How do we generate more labels?
We need at least 2,000 applications per category
Slide 18
Slide 18 text
CLASSIEfier – The Algorithm
First phase: simple keyword matching to extract more labels
Keyword matching = the process of searching for literal matches (e.g. “hospital”) in a given piece of text (e.g. a grant description) to identify groups or subjects (e.g. health sector).
Stages:
• Identify keywords for CLASSIE
• Extract applications that exhibit a strong match
• Score the classification done by users
We found that:
• Keyword matching accuracy differs from one category to another.
• On average, it is around 80%.
For example:
• “Orphans” is a confusing category.
• “Wildlife welfare” is a straightforward category.
Example:
This project will raise awareness and empower deaf people by providing key mental health information in their primary language (Australian Sign Language).
People with hearing impediment.
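The keyword matching described on this slide can be sketched roughly as follows. This is a minimal illustration, not the CLASSIEfier code: the keyword lists, the `keyword_match` function and the idea that a "strong match" means at least two distinct keywords are all assumptions made for the example.

```python
import re

# Hypothetical keyword lists; the real CLASSIE keywords are not given in the slides.
KEYWORDS = {
    "People with hearing impediment": ["deaf", "hearing", "sign language"],
    "Health": ["hospital", "mental health"],
}


def keyword_match(text, keywords=KEYWORDS, min_hits=1):
    """Return {category: number of distinct keywords found} for every
    category with at least `min_hits` literal, whole-word matches."""
    text = text.lower()
    matches = {}
    for category, words in keywords.items():
        hits = sum(
            1 for word in words
            if re.search(r"\b" + re.escape(word) + r"\b", text)
        )
        if hits >= min_hits:
            matches[category] = hits
    return matches


description = (
    "This project will raise awareness and empower deaf people by providing "
    "key mental health information in their primary language "
    "(Australian Sign Language)."
)

print(keyword_match(description))              # every category with a match
print(keyword_match(description, min_hits=2))  # only "strong" matches (assumed rule)
```

Raising `min_hits` trades coverage for reliability, which is one plausible way to keep only applications that exhibit a strong match.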
Slide 19
Slide 19 text
CLASSIEfier – The Algorithm
Second phase: Training the machine learning model
Training dataset: 128,000 grant applications classified by keyword matching
DIFFICULTY #1: Multilabel
DIFFICULTY #2: Hierarchy
DIFFICULTY #3: Number of labels per category
Slide 20
Slide 20 text
Multilabels and Hierarchy
Example: A grant application that is aimed at helping teenagers with autism.
Beneficiaries:
• “Children and youth” at level 1
• “Adolescents” at level 2
And also:
• “People with disabilities” at level 1
• “People with intellectual disabilities” at level 2
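One way to picture the two difficulties in the example above is to expand each level-2 label with its level-1 parent and then encode the result as a 0/1 vector, one position per category. The sketch below assumes that representation; the function names and the extra "Seniors" category are hypothetical and not from the talk.

```python
# Parent links taken from the slide example (level 2 -> level 1).
PARENT = {
    "Adolescents": "Children and youth",
    "People with intellectual disabilities": "People with disabilities",
}


def expand_with_parents(labels):
    """Add the level-1 parent of every level-2 label, without duplicates."""
    expanded = []
    for label in labels:
        for item in (PARENT.get(label), label):
            if item is not None and item not in expanded:
                expanded.append(item)
    return expanded


def to_multi_hot(labels, vocabulary):
    """Encode a collection of labels as a 0/1 vector over a fixed category list."""
    return [1 if category in labels else 0 for category in vocabulary]


VOCABULARY = [
    "Children and youth",
    "Adolescents",
    "People with disabilities",
    "People with intellectual disabilities",
    "Seniors",  # hypothetical extra category, not in the slide example
]

labels = expand_with_parents(["Adolescents", "People with intellectual disabilities"])
print(labels)
print(to_multi_hot(labels, VOCABULARY))
```

A single application ends up with four active labels across two levels, which is why a plain single-label classifier does not fit this problem.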
Slide 21
Slide 21 text
DIFFICULTY #3: Number of labels per category
Niche classification or “black holes”
• Categories such as Confucius, North American people and Nomadic people, among others, have fewer than 100 grant applications.
That is 20 times fewer than the minimum of 2,000 required.
Slide 22
Slide 22 text
How do we solve it? – Separate training
1. CLASSIEfier reads the application.
2. Classification Level 1: machine learning (e.g. Sports and recreation, or Information and communications).
3. Classification Level 2:
• If we have enough labels (e.g. under Sports and recreation), we use another ML model.
• If we do not have enough labels (e.g. under Information and communications), we use keyword matching.
4. Classification Level 3: keyword matching.
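The separate-training flow on this slide can be sketched as a small routing function. Everything here is a hypothetical stand-in: `classify`, the toy models and the toy keyword table (including the "Media" category) are invented for the example, and the real models are not shown in the slides.

```python
def classify(text, level1_model, level2_models, keyword_matcher):
    """Route one application through the three levels.

    level1_model:    callable text -> level-1 category (ML model)
    level2_models:   dict level-1 category -> callable, present only where
                     enough labels exist to train a level-2 ML model
    keyword_matcher: callable (text, parent_category) -> category or None
    """
    level1 = level1_model(text)
    if level1 in level2_models:
        level2 = level2_models[level1](text)    # enough labels: another ML model
        level2_method = "ml"
    else:
        level2 = keyword_matcher(text, level1)  # not enough labels: keyword matching
        level2_method = "keywords"
    # Level 3 always falls back to keyword matching.
    level3 = keyword_matcher(text, level2) if level2 else None
    return {
        "level1": level1,
        "level2": level2,
        "level2_method": level2_method,
        "level3": level3,
    }


# Toy stand-ins so the sketch runs; they are not trained models.
def toy_level1(text):
    if "hiking" in text.lower():
        return "Sports and recreation"
    return "Information and communications"


toy_level2 = {"Sports and recreation": lambda text: "Sport"}


def toy_keywords(text, parent):
    table = {
        "Sport": {"hiking": "Outdoor sport"},
        "Information and communications": {"radio": "Media"},
    }
    for word, category in table.get(parent, {}).items():
        if word in text.lower():
            return category
    return None


print(classify("A hiking club for seniors", toy_level1, toy_level2, toy_keywords))
print(classify("A community radio station", toy_level1, toy_level2, toy_keywords))
```

Keeping the level-2 models in a dictionary means a branch can switch from keyword matching to ML simply by adding an entry once enough labels exist.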
Slide 23
Slide 23 text
CLASSIEfier – The Algorithm
Third phase: Model interpretation: scoring and checking for biases
Stages:
• Choose the best model – k-nearest neighbours (k-nn)
• Choose the best parameters
• Choose the best scoring
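For readers unfamiliar with k-nearest neighbours, the model named on this slide, the idea is to label a new text with the majority label of the k most similar training texts. The sketch below assumes bag-of-words features and cosine similarity, neither of which is stated in the talk, and uses a toy training set invented for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lower-cased, whitespace-separated words."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def knn_predict(text, training, k=3):
    """Return the majority label among the k training texts most similar to `text`."""
    query = bag_of_words(text)
    ranked = sorted(
        training,
        key=lambda pair: cosine(query, bag_of_words(pair[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set, invented for illustration only.
TRAINING = [
    ("new uniforms for the junior football club", "Sports and recreation"),
    ("resurfacing the tennis club courts", "Sports and recreation"),
    ("football training equipment for girls", "Sports and recreation"),
    ("community theatre costume workshop", "Art and culture"),
    ("mural painting workshop for local artists", "Art and culture"),
    ("youth orchestra instruments", "Art and culture"),
]

print(knn_predict("equipment for the football club", TRAINING))
print(knn_predict("painting workshop for artists", TRAINING))
```

The value of k and the similarity measure are examples of the parameters that the "choose the best parameters" stage would tune.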
Scoring
Based on the fact that each application has several categories
Recall: How many categories got picked per application
0 None
1 <45%
2 >45%
3 Perfect match
Precision: How many categories are wrong per application
0 All
1 >55%
2 <55%
3 None – Perfect match
Combined score: from 0 (useless model) to 6 (perfect model!!)
CLASSIEfier: ~4-5
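The scoring scheme above can be written down as two small functions whose results are added to give the 0 to 6 scale. This is a sketch under assumptions: the slide does not say how the exact 45% and 55% boundaries or an empty prediction are handled, so those choices (marked in the comments) are guesses, and the function names are hypothetical.

```python
def recall_score(true_labels, predicted_labels):
    """0-3 score for how many of the true categories were picked."""
    true_labels, predicted_labels = set(true_labels), set(predicted_labels)
    if not true_labels:
        raise ValueError("at least one true category is required")
    found = len(true_labels & predicted_labels) / len(true_labels)
    if found == 0:
        return 0  # none picked
    if found == 1:
        return 3  # perfect match
    return 1 if found < 0.45 else 2  # ASSUMPTION: exactly 45% scores 2


def precision_score(true_labels, predicted_labels):
    """0-3 score for how many of the picked categories are wrong."""
    true_labels, predicted_labels = set(true_labels), set(predicted_labels)
    if not predicted_labels:
        return 0  # ASSUMPTION: predicting nothing counts as the worst case
    wrong = len(predicted_labels - true_labels) / len(predicted_labels)
    if wrong == 1:
        return 0  # all wrong
    if wrong == 0:
        return 3  # none wrong
    return 1 if wrong > 0.55 else 2  # ASSUMPTION: exactly 55% scores 2


def total_score(true_labels, predicted_labels):
    """Combined score: 0 (useless) to 6 (perfect)."""
    return recall_score(true_labels, predicted_labels) + precision_score(
        true_labels, predicted_labels
    )


truth = {
    "Children and youth",
    "Adolescents",
    "People with disabilities",
    "People with intellectual disabilities",
}
predicted = {"Children and youth", "Adolescents", "Seniors"}  # "Seniors" is a made-up wrong label

print(total_score(truth, predicted))
```

In this example half of the true categories are found and one of the three predictions is wrong, so each part scores 2.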
Slide 26
Slide 26 text
Misclassifications and black holes will cause minorities that are already overlooked to be underfunded
Slide 27
Slide 27 text
The Data Science for Social Good Movement
“The best minds of my generation are thinking about how to make people click ads,” he says. “That sucks.”
-- Jeff Hammerbacher (Cloudera and Facebook data leader)
Slide 28
Slide 28 text
Algorithmic bias
• This happens when you feed the algorithm data that is already biased, or too little data: the algorithm will then predict biased classifications.
• Algorithms are mirrors
Sport people
Slide 29
Slide 29 text
Know your Model!
xkcd.com/1838/
Slide 30
Slide 30 text
SHAP (SHapley Additive exPlanations)
WEAT tests proposed in Caliskan et al. 2017
AI Fairness 360
Slide 31
Slide 31 text
Document everything! – this is how we tackle biases
Choose transparency
Slide 32
Slide 32 text
Results and conclusions
It is not feasible to classify human natural language with 100% accuracy
Church, Religion, Christian
Model = Religion
Reality –
A fete in a Catholic school
Slide 33
Slide 33 text
Results and conclusions
• CLASSIEfier performs similarly to humans, no better and no worse: ~70-80% accuracy
Out of 200 applications classified by users, we found that:
• 63% right
• 19% half right
• 18% wrong
Slide 34
Slide 34 text
Results and conclusions
• The model is also discriminating between good and bad applications
Approved grant applications: 85% accuracy
Declined grant applications: 75% accuracy
Slide 35
Slide 35 text
Results and conclusions
CLASSIEfier is now feeding back into CLASSIE
Seems like you are applying for:
☐ Sports and recreation
☐ Art and culture
☐ Community and development
Slide 36
Slide 36 text
CLASSIEfier – More than just an algorithm
• Data preprocessing
• Writing and testing the algorithm
• Production – back and front end product
• Maintenance
Slide 37
Slide 37 text
DO YOU WANT TO LEARN MORE?
LinkedIn: paola-oliva-altamirano
Email: [email protected]
Innovation lab:
https://www.ourcommunity.com.au/innovationlab