
NLP for Social Good - Girlz Code Festival

March 2021.

Julia Wabant

March 31, 2021

Transcript

  1. 20/03/2021 GCF - Omdena Case Studies. The Problem: Lack of sexual education

    • Studies have shown that a lack of sexual education causes violent crimes against women, higher suicide rates among a certain group of teenagers, and the spread of diseases. sexedPL goals: foster responsibility, education, and sexual freedom among the young leaders of tomorrow.
  2. Topics of discussion

    • In collaboration with sexedPL, use AI to analyze the importance of sex education • Our solution: A. Collection of datasets from several sources B. Detailed report on analysis using NLP Source: https://en.wikipedia.org/wiki/Sustainable_Development_Goals#/media/File:Sustainable_Development_Goals.svg
  3. Project Goals

    Using a combination of AI and Natural Language Processing (NLP) techniques, together with text visualization, analyze different data sources (social media, newspaper articles, survey answers) to get insights on which topics should be covered more broadly in books related to sexual education, media, etc.
  4. Task: Teenagers Forums

    • Description: analyze frequently discussed sexuality-related topics in Polish online forums • Key Data Sources: Polish online forums zapytaj.onet.pl and dojrzewamy.pl • Methods/Tools: Python, NLP application (topic modeling algorithms, text classifiers) • Key Results: ◦ Identifying topics from unstructured texts ◦ Automatically classifying sexuality-related questions
  5. Our Process/Approach

    Gather data from forums (dojrzewamy.pl, zapytaj.onet.pl) → Identify topics and/or apply trained classifier → Create visualizations
  6. Topics of discussion

    • In collaboration with the Botnar Foundation, use data to understand youth preferences, aspirations and sentiment in order to understand the future. • Our solution: A. Collection of datasets from several sources B. Detailed report on analysis using NLP and specific insights about mental health
  7. The Problem: Uncovering the ideas, aspirations and sentiment hidden in this web

    • Every day, new events, controversies and reforms are spun into the net, and millions of young minds react to them. • This digital ecosystem offers an excellent opportunity for analysis to understand what young people’s fears, dreams, and thoughts are on different topics. Botnar Foundation goals: better understand how to support young people more effectively and catalyze appropriate initiatives for youth empowerment
  8. Project Goals

    Using a combination of AI and Natural Language Processing (NLP) techniques, together with text visualization, analyze different data sources (social media, newspaper articles) to get insights on the views of the younger generation.
  9. Task: Twitter Analysis

    • Description: determine the topics most frequently discussed by young people, the sentiment they express, and how this sentiment changes over time. • Key Data Sources: Twitter • Methods/Tools: Python, JavaScript, Tableau, NLP application (topic modeling algorithms, text classifiers) • Key Results: ◦ Identifying topics from unstructured texts ◦ Automatically classifying tweets (sentiment: positive, neutral, or negative)
  10. Our Process/Approach

    Scrape and process Twitter data + preprocess it for sentiment analysis → Perform topic modeling on the corpus of tweets and conduct sentiment analysis → Visualize the results
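The preprocessing step above can be sketched in a few lines. This is only an illustration of typical tweet cleaning, not the project's actual pipeline: the function name `clean_tweet` and the specific cleaning rules (drop URLs and @mentions, keep hashtag words, strip punctuation and emojis, lowercase) are assumptions.

```python
import re

# Assumed cleaning rules; the deck does not list the real ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_TEXT_RE = re.compile(r"[^\w\s]")  # punctuation, emojis, other symbols


def clean_tweet(text: str) -> str:
    """Return a lowercased tweet without URLs, mentions, or symbols."""
    text = URL_RE.sub(" ", text)        # remove links first
    text = MENTION_RE.sub(" ", text)    # remove @user handles
    text = text.replace("#", " ")       # keep the hashtag word, drop the '#'
    text = NON_TEXT_RE.sub(" ", text)   # drop remaining punctuation and emojis
    return " ".join(text.lower().split())  # normalize whitespace


print(clean_tweet("Loving the #ClimateStrike today!! @greta https://t.co/abc"))
```

Note that stripping punctuation also splits contractions ("don't" becomes "don t"); whether that matters depends on the sentiment model used downstream.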
  11. Results & Visualizations

    Figures: heat map of total sentiment by region; heat map of average sentiment by region
  12. Results

    • Trends: Most negative: 2013 in South Africa, 2019 in Germany and Australia, 2020 in the UK. Mostly neutral and negative tweets found in 2020. Neutrality is the most common sentiment.
    • Region-wise patterns: Maximum deviation in polarity (most change in sentiment over time): France. Maximum increase in polarity over time: France and Germany. Greatest polarity: South Africa, Canada, Germany. Maximum number of positive tweets: South Africa. Maximum number of negative tweets: USA and Brazil.
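Region-wise figures like these come from grouping per-tweet polarity scores by region and year. The sketch below shows one way to do that with made-up numbers (not the project's data). The deck does not say how "deviation in polarity" was measured, so using the population standard deviation of yearly means is an assumption, as are the function names.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records: (region, year, polarity in [-1, 1]).
scored_tweets = [
    ("France", 2018, -0.4), ("France", 2019, 0.1), ("France", 2020, 0.5),
    ("Germany", 2018, 0.0), ("Germany", 2019, -0.1), ("Germany", 2020, 0.1),
]


def yearly_mean_polarity(records):
    """Map region -> {year: mean polarity of that region's tweets}."""
    buckets = defaultdict(list)
    for region, year, polarity in records:
        buckets[(region, year)].append(polarity)
    by_region = defaultdict(dict)
    for (region, year), values in buckets.items():
        by_region[region][year] = mean(values)
    return by_region


def polarity_deviation(by_region):
    """Assumed metric: standard deviation of the yearly means per region."""
    return {
        region: pstdev(list(years.values()))
        for region, years in by_region.items()
    }


deviation = polarity_deviation(yearly_mean_polarity(scored_tweets))
print(max(deviation, key=deviation.get))  # region whose sentiment varied most
```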
  13. Task: Instagram Analysis

    • Description: Analyze captions from data collected from young users and organizations dedicated to youth and children • Key Data Sources: Instagram • Methods/Tools: Python, NLP application (wordcloud, text classifiers) • Key Results: ◦ Extracting keywords from captions ◦ Automatically classifying the sentiment of captions
  14. Results & Visualizations

    Figure: heatmap for positive, neutral, and negative sentiments worldwide
  15. Task: Reddit Analysis

    • Description: collecting and analyzing data from youth subreddits (Teenagers, ForKids, Under18, GradSchool, College, YouthRights, HighSchool, in English and Spanish) • Key Data Sources: Reddit • Methods/Tools: Python, NLP application (wordclouds) • Key Result: ◦ Identifying keywords from unstructured text
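A word cloud is essentially a plot of term frequencies, so the keyword step can be illustrated with a plain frequency count. This is a minimal sketch with invented posts: the tiny English stopword list and the helper name `top_keywords` are assumptions, and a real run would need proper English and Spanish stopword lists.

```python
import re
from collections import Counter

# Deliberately tiny, English-only stopword list (an assumption for the demo).
STOPWORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
    "i", "my", "for", "on", "that", "this", "with",
}


def top_keywords(posts, n=10):
    """Return the n most frequent non-stopword tokens across all posts."""
    counts = Counter()
    for post in posts:
        # Letters (including common Spanish accents) and apostrophes only.
        tokens = re.findall(r"[a-záéíóúñü']+", post.lower())
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(n)


posts = [
    "My exams are stressing me out",
    "Exams next week, any study tips?",
    "Study group for exams",
]
print(top_keywords(posts, 2))
```

The resulting counts are what a word cloud library would scale font sizes by.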
  16. Task: YouTube Analysis

    • Description: Analyze video titles and descriptions from data collected from young users' public channels and channels dedicated to young people (English, Spanish and French), with a focus on climate, education and relationships. • Key Data Sources: YouTube • Methods/Tools: Python, NLP application (topic modeling algorithms, text classifiers)
  17. Topics of discussion

    • In collaboration with Save the Children, use data to protect children in digital spaces • UN SDG 16.2: “end abuse, exploitation, trafficking and all forms of violence and torture against children”. • Our solution: AI to address online abuse of children A. Collection of datasets from several sources B. Detailed report on analysis using NLP C. Search engine for research articles D. Educational chatbot E. Predator analysis (chatbot with risk) Source: https://en.wikipedia.org/wiki/Sustainable_Development_Goals#/media/File:Sustainable_Development_Goals.svg
  18. The Problem: Online Violence Against Children (OVAC)

    A 21st century problem... • Internet communities of predators who trade or sell child sexual abuse materials (CSAM) and teach methods to approach kids online • Over the past 20 years, online child sexual exploitation has been growing exponentially • Children living in poverty - or subject to humanitarian emergencies - are more likely to become victims. Save the Children goals: if we are not protecting children in digital spaces, then we are not protecting children.
  19. Project Goals

    Using a combination of AI and Natural Language Processing (NLP) techniques, together with text visualization, analyze different data sources (social media, newspaper articles, among others) to get insights on the severity, causes, and actors of online violence against children (OVAC), as well as identify behaviors of predators online.
  20. Summary: Project Highlights

    67 → 43 global contributors - 6 datasets built - 5 final reports - 5 product prototypes - 1-3 scientific publication(s). Figure: strong focus on sexually related forms of Online Violence against Children (left branches) & on cybergrooming (black)
  21. Solution Building Process

    • Based on insights from several data sources, the team performed a broad analysis on OVAC with AI • Omdena’s bottom-up/self-managing approach • Collaborators organized themselves into 9 tasks, building within 2 months prototype applications with great potential to help protect kids online • Consultation with external domain experts: Safe to Net and Human Rights First teams
  22. Task: Newspapers

    Task Managers: Madhu Charan/Shreya Walia • Description: understand how widespread OVAC is by analyzing its coverage by news outlets in different countries • Key Data Sources: Articles collected from media outlets in 22 countries. • Methods/Tools: NLP (TF-IDF), scikit-learn for classifiers, Google Sheets • Key Results: Dataset, Insights
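TF-IDF, named in the methods above, weights a term by how often it appears in an article and how rare it is across all articles. The sketch below uses the plain textbook formula tf × ln(N / df) on invented headlines; it is not the project's code, and `tf_idf` is a hypothetical helper. scikit-learn's `TfidfVectorizer` applies a smoothed and normalized variant by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented headlines, for illustration only.
headlines = [
    "police arrest online predator",
    "online school reopens",
    "online safety law passed",
]
scores = tf_idf(headlines)
print(scores[0])
```

A term that occurs in every document ("online" here) gets weight zero, while terms specific to one article stand out, which is what makes these weights useful as classifier features.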
  23. Our Process/Approach

    Omdena-Save the Children Project, Summer 2020. Newspaper article collection (using scrapers, manually) → Tagging the articles: 2 approaches (using classifier, manually) → Creating dataset and visualizations. Classifiers: OVAC-PVAC-OVAA classifier, OVAC-PVAC-irrelevant classifier
  24. Task: Victim Stories

    Task Manager: Ahmed Jyad/Shreya Walia • Description: Find insights about the behaviour of online groomers and children by analyzing victim stories • Key Data Sources: Reddit, Newspapers, Survivor Forums • Methods/Tools: Keyword extraction using Python APIs and manual classification & labelling • Key Results: Finding answers to questions about how and why kids fall into the hands of online groomers
  25. Our Process/Approach

    Collecting victim stories (Reddit via Python APIs, survivor forums, newspapers) → Generating list of questions and manually reading stories for answers → Creating dataset and visualizations
  26. Questions answered for insights

    Questions asked:
    1. What initial and continued tactics are used by groomers to gain trust?
    2. Which platform was used to establish the connection?
    3. What did the groomer promise the victim?
    4. What threats do groomers make to children in order to convince them to send suggestive or illegal media?
    5. What reasons are given by victims for sending images?
    6. What reasons do children have to feel like talking to strangers online in the first place? (For instance, do they face issues in school and at home, but don’t feel they can open up to their teachers or families?)
    7. What age did the groomer state?
  27. Results & Visualizations

    Figures: pie chart for victim reasons; pie chart for victim struggles
  28. Task: Social Media

    Task Manager: Anjali/Erum • Description: Get insight into conversation (chat) trends, extract keywords, measure sentiment polarity, and find chat patterns • Key Data Sources: Social media (Instagram and Twitter) and chat forums (Psych Forum, Netmums Forum, CyberSmile Forum and Reachout) • Methods/Tools: Pandas, NumPy, Matplotlib, NLTK, spaCy, scikit-learn, TextBlob • Key Results: ◦ Chat patterns ◦ Sentiment analysis ◦ Keyword extraction
  29. Our Process/Approach → Instagram

    #hashtags → Emojis: yes / no → Methods: top2vec, lda, bert, lda+bert, wordcloud, wordcloud cluster
  30. Our Process/Approach → Instagram (lda)

    1. #hashtags: childbeauty, childmodel, girlfashion, lovelygirl, Voguegirl, younggirls 2. Emojis: yes 3. Dataset size: 35,374 rows 4. Result: wordcloud
  31. Observations + Future Improvements

    Observations:
    1. Public datasets do not give deep insight unless we can get access to private chats and chat rooms. → inappropriate data is removed upon request
    2. Age and gender were not revealed. These platforms require users to be 18+ to register.
    3. Chat patterns are not based on any data dictionary (it is difficult to map slang variants such as hey, heyyy, heyy).
    Suggestions and future improvements:
    1. Access to private chat logs (textual, graphics, audio/video)
    2. Effective data cleaning methods
    3. Embedding warning systems to alert about inappropriate behaviours
    4. Grooming through other AI applications
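The slang-mapping difficulty noted above (hey, heyy, heyyy) is often handled by collapsing repeated letters and checking the result against a known vocabulary. The sketch below is one such heuristic, not something the project reports having used; the `normalize` helper and the small `VOCABULARY` set are assumptions.

```python
import re

# Hypothetical vocabulary; in practice this would be a real word list.
VOCABULARY = {"hey", "hello", "cool", "so"}


def normalize(token: str, vocabulary=VOCABULARY) -> str:
    """Map elongated chat spellings to a known word when possible."""
    lowered = token.lower()
    # Collapse runs of 3+ identical characters to 2 (coooool -> cool).
    two = re.sub(r"(.)\1{2,}", r"\1\1", lowered)
    # Collapse any run of identical characters to 1 (heyy -> hey).
    one = re.sub(r"(.)\1+", r"\1", lowered)
    for candidate in (lowered, two, one):
        if candidate in vocabulary:
            return candidate
    return two  # unknown word: keep the mildly squeezed form


for word in ["hey", "heyy", "heyyy", "coooool", "hello"]:
    print(word, "->", normalize(word))
```

Checking the untouched and two-letter forms first keeps legitimate double letters ("hello", "cool") from being squeezed away.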
  32. Task: Game Forums

    Task Manager: Sabrina/Erum • Description: Building an NLP model to infer the sentiment of reviews and to detect the risk of a game for a child. • Key Data Sources: Commonsense website • Methods/Tools: Python, NLP application (text vectorization, ML algorithms) • Key Results: ◦ Quantifying popular games ◦ Sentiment analysis ◦ Risk prediction ◦ Correlating features
  33. Process: Generic Workflow

    • Data Extraction: Extract information from websites, documents.
    • Data Preparation: Prepare the data in accordance with the NLP model. 1. Treat slang words to enrich the text quality. 2. Delete punctuation if it is necessary for the model.
    • Data Analysis: BoW, n-grams, unique characteristics of the dataset. Ask yourself questions that could be answered by analysing the data.
    • Modeling: Choose the appropriate model for your purpose. Save the weights of your model, configurations and features.
    • Deploy the model: Show your results (graphs, API, etc.).
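The bag-of-words (BoW) and n-gram analysis named in the workflow amounts to counting single tokens and short token sequences. A minimal sketch, assuming whitespace tokenization; the helper names `ngrams` and `bag_of_ngrams` and the sample review are made up for illustration:

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def bag_of_ngrams(text, n=1):
    """Count n-grams in a text; n=1 gives a plain bag of words."""
    tokens = text.lower().split()
    return Counter(ngrams(tokens, n))


review = "fun game but violent chat and violent chat is not for kids"
print(bag_of_ngrams(review, 1).most_common(3))
print(bag_of_ngrams(review, 2).most_common(1))
```

Bigrams such as ("violent", "chat") carry context that single-word counts lose, which is why both are usually inspected before choosing model features.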
  34. Extract sentiment from reviews using polarity and quantify risks

    The polarity results are used for developing a RandomForestClassifier. The results show that the model has good precision, as can be seen from the ROC curve. We can also analyse these results using a confusion matrix. Accuracy: 0.9622980251346499. F1 score: 0.980786825251601. Figure: Game Classifier
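Accuracy and F1, as reported above, can both be derived from the four cells of a binary confusion matrix. The sketch below shows the arithmetic on made-up labels; it does not reproduce the project's numbers, and `binary_metrics` is a hypothetical helper rather than a library function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f1": f1,
    }


# Made-up labels: 1 = risky game, 0 = not risky.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
```

Looking at precision and recall separately, rather than accuracy alone, shows which kind of error (missed risky games or false alarms) the classifier makes.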
  35. Observations + Future Improvements

    Observations:
    1. The model is built on reviews (biased/unbiased)
    2. Other features (rating, players’ sentiments, violence and game plot) have not been taken into consideration
    3. In-game chat logs were not reviewed
    Suggestions and future improvements:
    1. Triangulating other features with reviews
    2. Getting in-game chat for review
    3. Plot review
    4. Alert system while playing
  36. Task: Educational Chatbot

    Task Manager: Juber/Erum • Description: A chatbot for educating children on different topics including cyberbullying, gaming, online abuse, unwanted contact, and sending nudes. • Key Data Sources: esafety website and commonsense media • Methods/Tools: Rasa NLU, Google Dialogflow, Python • Key Results: ◦ Prototype is functional ◦ 1000s of intents have been trained
  37. Task: Predator Analysis

    Task Manager: Guneet/Jeremy • Description: Analysis of chat messages to find sentiment, and chatbot development • Key Data Sources: Perverted Justice chat logs, PAN 2012 logs • Methods/Tools: Python, Microsoft Access / Excel • Key Results: Functional chatbot conversations and effective sentiment analysis
  38. Dataset Acquisition + Modelling Approaches

    • Dataset acquisition: 1. Raw chat logs 2. Created dataset. No structured dataset available. Dataset creation critical to the project’s success. More than 840 man hours dedicated. >803,000 messages combined.
    • Modelling approaches tried: 1. Deep Learning NLP (Seq2Seq) 2. Various word-to-vector solutions 3. Transformers (BERT, XLNet, fast.ai)
  39. Summary: Recommendations for Future Work

    What can be done further regarding AI and online abuse of children? • improve the performance of the models built / improve the quality of the data the models were built on • collaborate with providers of social media / gaming platforms to access more valuable data