Pycon ZA
October 11, 2019

"Detecting Propaganda in Fake News using Natural Language Processing" by Aroma Rodrigues

Propaganda Detection in Fake News using Natural Language Processing

What is the real world basis for this problem?

Modern times have modern problems. In India, the past few years have seen a spike in mob violence and killings incited by fake news spread through social media, messaging apps and, at times, even legitimate news channels. Apart from educating people to tell fake news from real news and not get carried away by propaganda, this problem also has a technology-based solution. Fake news has always existed, but social media and a world so connected that stories spread in mere seconds have exacerbated the issue. The checks that a human can perform on a piece of news can be automated too. This paper aims to do exactly that: extract keywords from a claim, check them against mainstream news agencies to verify it, and detect propaganda using natural language processing libraries.

Why attend this talk?

This is a technical talk, but the underlying idea is one that almost anyone can identify with. The implementation and technical know-how are suitable even for beginners in machine learning and natural language processing. The talk is an example of how some societal and civic problems can be treated as technology problems and solved. It also shows how to translate almost any such problem into a technical one: divide it into steps and solve each sub-problem to solve the whole.

The big idea!!!

What exactly is fake news? For the uninitiated, fake news is news that looks real and deludes people into believing it, but is actually fabricated or modified to suit vested interests. Fake news has always existed in the world. Until the end of the Cold War, it existed as either pro-Soviet or pro-American propaganda. Some fake news has political implications, affects trade deals and ends up affecting the lives of people in the countries involved, but this effect is indirect. In the 21st century, with the widespread use of social media and instant messengers, the effects are more dangerously direct. In the last few years fake news has been used to slander communities and incite violence, riots and, on occasion, lynchings and killings. It has also been used on social media to topple governments, swing elections and shape mass opinion for and against individuals and organizations. For democratic ideals to survive in today’s world, the menace of fake news must be addressed.

A few characteristics help a human differentiate fake news from the rest. First, many of the “fake news” messages shared on social media have bad spelling and wrong grammar; a properly researched article from a credible news channel, paper or other outlet is less likely to have either. Secondly, no legitimate sources are mentioned, and searching for keywords from the article either turns up no such news or shows that the original story was skillfully modified to suit the maker’s interest. Images and videos are not spared from such modification either, given today’s photo and video editing software. Propaganda-based fake news also generally either praises or criticizes individuals, communities or organizations. These characteristics can be translated into a technical module that predicts whether a particular article is fake news or not.

Based on these characteristics, this proposal describes how two major steps are implemented and integrated to accomplish this.

The first is to find reliable sources for the piece of news. Here we can use the Rapid Automatic Keyword Extraction (RAKE) algorithm. It scores candidate phrases using the frequency of each word and its co-occurrence with other words, basically an n-gram based approach. The rake_nltk implementation takes care of the English stop words, such as prepositions and articles. RAKE also returns ranked phrases, which makes it easy to take the top “n” phrases and search a neutral, well-established and reputable news aggregator API for articles corresponding to them. The next step is to measure the similarity between any articles retrieved and the suspected fake news. This is done by extracting the sentences containing those keywords from each article and comparing them with the fake news passage using the spaCy similarity feature.
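The source-finding step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the rake_nltk or spaCy implementation: the stop-word list is a tiny hand-picked one, the function names are hypothetical, and cosine similarity over word counts stands in for spaCy's vector-based similarity. The input sentence is made up, modelled on the kind of claim the talk examines.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (assumption); rake_nltk uses NLTK's full English list.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "as", "by", "to",
              "has", "been", "for", "on", "was"}


def candidate_phrases(text):
    """Split text into runs of non-stop-words, broken at stop words and punctuation."""
    phrases = []
    for fragment in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rank_phrases(text):
    """RAKE-style scoring: phrase score = sum of degree(word) / frequency(word)."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences in the phrase, itself included
    scores = {p: sum(degree[w] / freq[w] for w in p) for p in set(phrases)}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity (a stand-in for spaCy's Doc.similarity)."""
    va = Counter(re.findall(r"\w+", text_a.lower()))
    vb = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


claim = "Modi has been declared the best PM in the world by UNESCO."
top_phrase, top_score = rank_phrases(claim)[0]
print(" ".join(top_phrase), top_score)  # best pm 4.0
```

The top-ranked phrases would then be sent to the news API, and `cosine_similarity` (or spaCy's similarity) applied between the claim and the sentences retrieved.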

The second part is identifying propaganda. Here I use the Path Model of Blame to determine whether the news contains propaganda, in the form of either blame or praise. Beyond this, propaganda can also be identified from the contents of the article.

Figure: Path Model of Blame

To quantify propaganda, we can also annotate the data with parameters such as:

location (a town, a country)
labeling
argumentation
emotions (fear, outrage, sympathy, hatred, other, missing)
fabrication
politician (the name of a mentioned politician)

For the scope of this proposal we consider only two stages of the whole model: event detection and the agent implied by the text. This is done by extracting events from the article and then parsing them against a pattern fed to the parser, expressed as a mixture of regular expressions and the part-of-speech tags produced by nltk.
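The idea of matching a regular expression over part-of-speech tags can be sketched without any NLP library. The example below is an illustration only: the tokens are tagged by hand (in practice nltk's tagger would produce them), and the pattern and the function name `find_agent_events` are hypothetical, chosen to capture an agent (proper nouns), a past-tense verb and an event noun.

```python
import re

# Hand-tagged tokens (assumption: nltk.pos_tag would normally supply the tags).
tagged = [("Theresa", "NNP"), ("May", "NNP"), ("ordered", "VBD"), ("use", "NN"),
          ("of", "IN"), ("military", "JJ"), ("force", "NN"), ("against", "IN"),
          ("Syria", "NNP"), (".", ".")]

# Agent: one or more proper nouns; then a past-tense verb; then an event noun,
# optionally preceded by adjectives. Each tag is written as <TAG>.
PATTERN = re.compile(r"((?:<NNP>)+)(<VBD>)((?:<JJ>)*(?:<NN>)+)")


def find_agent_events(tagged_tokens):
    """Return agent/verb/event triples whose tag sequence matches PATTERN."""
    words = [word for word, _ in tagged_tokens]
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    results = []
    for match in PATTERN.finditer(tag_string):
        parts = []
        for group in (1, 2, 3):
            # Counting "<" up to an offset converts it back to a token index.
            start = tag_string.count("<", 0, match.start(group))
            end = tag_string.count("<", 0, match.end(group))
            parts.append(" ".join(words[start:end]))
        results.append({"agent": parts[0], "verb": parts[1], "event": parts[2]})
    return results


print(find_agent_events(tagged))
# [{'agent': 'Theresa May', 'verb': 'ordered', 'event': 'use'}]
```

nltk's `RegexpParser` offers the same kind of tag-pattern matching with a richer grammar and returns a parse tree instead of plain dictionaries.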

Depending on the thresholds defined for both source finding with similarity and propaganda detection, the article is classified as fake news or not.
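One possible way to combine the two signals is sketched below. The source does not say how the thresholds are set or combined, so everything here is an assumption: the threshold values (0.8 and 0.3), the rule that either signal alone is enough to flag an article, and the names `propaganda_fraction` and `classify`.

```python
def propaganda_fraction(sentence_flags):
    """Share of sentences that matched a blame/praise pattern."""
    if not sentence_flags:
        return 0.0
    return sum(sentence_flags) / len(sentence_flags)


def classify(best_source_similarity, sentence_flags,
             similarity_threshold=0.8, propaganda_threshold=0.3):
    """Label an article from its two scores (hypothetical thresholds and rule).

    best_source_similarity: highest similarity to any retrieved mainstream
    article, or None when the search returned nothing.
    sentence_flags: one boolean per sentence, True if it fits a propaganda pattern.
    """
    corroborated = (best_source_similarity is not None
                    and best_source_similarity >= similarity_threshold)
    propagandistic = propaganda_fraction(sentence_flags) >= propaganda_threshold
    if not corroborated or propagandistic:
        return "likely fake"
    return "likely genuine"


print(classify(None, [True, False]))   # likely fake
print(classify(0.9, [False] * 10))     # likely genuine
```

The threshold values would need to be tuned on labelled articles before the labels mean anything.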

Progress so far...

Having run these models on a few popular fake news articles, the system has worked very well for some, at times with 100% accuracy, but this is not the case for all articles. For the cases that failed miserably, the module needs fine-tuning. Though the fake news detector is not an accurate system and should not be considered one, fake news detection is definitely a technology problem and can be solved as such.

A lot of the problems in the modern world are technology problems and can be solved by using the tools we have built and have at hand.

Transcript

  1. A BuzzFeed News analysis found that 50 of the biggest fake stories of 2018 generated roughly 22 million total shares, reactions, and comments on Facebook.
  3. The Knight Foundation analyzed more than 10 million tweets from 700,000 Twitter accounts which had linked to more than 600 fake and conspiracy news outlets. They found that in the lead-up to the 2016 US Presidential election, more than 6.6 million tweets linked to fake news and conspiracy news publishers, a problem which continued afterwards, with 4 million tweets to fake and conspiracy news publishers found from mid-March to mid-April 2017.
  4. And now: "More than 80% of accounts that repeatedly spread misinformation during the 2016 election campaign are still active, and they continue to publish more than a million tweets on a typical day."
  5. A recent Reuters Institute survey of English-language Indian internet users found that 52% of respondents got news via WhatsApp. The same proportion said they got their news from Facebook. But content shared via WhatsApp has led to murder. At least 31 people were killed in 2017 and 2018 as a result of mob attacks fuelled by rumours on WhatsApp and social media, a BBC analysis found.
  6. Identifying Fake News
    • Bad grammar, spelling mistakes
    • No source: find source
    • A lot of praise for propaganda, highly positive
    • A lot of criticism for propaganda, highly negative
    • Keywords, Google search
    • Credible mainstream agencies
  7. Photoshop / image editing: reverse Google image search / differences
    Articles:
    Mainstream: not generally fake news
    Spoof websites: BBCNewspoint
    Blogs: make sure they are credible, personal or professional
    Govt agencies' tweets: could be fake
    Fact-checking websites: Alt News, Social Media Hoax Slayer: for mainstream
  8. Extracting keywords from a text
    >>> from rake_nltk import Rake
    >>> from nltk.corpus import stopwords
    >>> r = Rake()
    >>> r.extract_keywords_from_text(text)  # text holds the article under test; not shown on the slide
    >>> b = r.get_ranked_phrases()
    >>> b
    ['pm narendra', 'best pm', 'world', 'us', 'unesco', 'modi', 'declared', 'congratulation']
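  9. Finding sources for keywords
    import requests
    import spacy
    nlp = spacy.load('en_core_web_sm')  # any installed English model
    params = {
        'q': '"pm narendra" AND "best pm" AND world',  # assumes the slide's '&' meant AND
        'from': '2019-05-06',
        'sortBy': 'popularity',
        'apiKey': 'YOUR_API_KEY',  # placeholder; supply your own key
    }
    response = requests.get('https://newsapi.org/v2/everything', params=params)
    articles = response.json().get('articles', [])
    # extract all article sentences that contain the term "best pm"
    keyword_extracted = [sent.text
                         for article in articles
                         for sent in nlp(article.get('content') or '').sents
                         if 'best pm' in sent.text.lower()]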
  10. Results
    {u'status': u'ok', u'articles': [{u'description': u'Rahul Gandhi has energised a struggling party and has been increasingly setting the agenda.', u'title': u'Can India\u2019s political prince unseat the PM?', u'url': u'https://www.bbc.co.uk/news/world-asia-india-47978944', u'author': u'https://www.facebook.com/bbcnews', u'publishedAt': u'2019-04-24T23:17:51Z', u'content': u"Image copyrightGetty ImagesImage caption\r\n Rahul Gandhi (centre) received a tumultuous welcome during his road show in Amethi\r\nIndia's main opposition leader Rahul Gandhi was all but written off after his crushing defeat in the last elections. But he has ener\u2026 [+7947 chars]", u'source': {u'id': u'bbc-news', u'name': u'BBC News'}, u'urlToImage': u'https://ichef.bbci.co.uk/news/1024/branded_news/15056/production/_106520168_siblings.jpg'},
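  11. Similarity Check
    import spacy
    # The slide used spacy.load('en'); that shortcut was removed in spaCy 3,
    # so load an installed model by name instead.
    nlp = spacy.load('en_core_web_sm')
    doc1 = nlp(u'Hello hi there!')
    doc2 = nlp(u'Hello hi there!')
    doc3 = nlp(u'Hey whatsup?')
    # Values in the comments are those reported on the slide.
    print(doc1.similarity(doc2))  # 0.999999954642
    print(doc2.similarity(doc3))  # 0.699032527716
    print(doc1.similarity(doc3))  # 0.699032527716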
  12. Important Parameters
    • location (a town, a country)
    • labeling
    • argumentation
    • emotions (fear, outrage, sympathy, hatred, other, missing)
    • fabrication
    • politician (the name of a mentioned politician)
  13. Pattern with POS tagger
    import nltk
    ex = '''Theresa May ordered use of military force against Syria. This is what she ordered. Two residential areas have been struck by the Uk/French/US missiles. Reports of 4 dead in one of the strikes. '''
    def preprocess(sent):
        # tokenize, then tag each token with its part of speech
        sent = nltk.word_tokenize(sent)
        sent = nltk.pos_tag(sent)
        return sent
    sent = preprocess(ex)
    print(sent)
    # chunk grammar: optional proper noun, optional past-tense verb, any nouns, then a proper noun
    pattern = 'NP: {<NNP>?<VBD>?<NN>*<NNP>}'
    cp = nltk.RegexpParser(pattern)
    cs = cp.parse(sent)
    print(cs)
  14. POS Tags
    • CC coordinating conjunction
    • CD cardinal digit
    • DT determiner
    • EX existential there (like: "there is" ... think of it like "there exists")
    • FW foreign word
    • IN preposition/subordinating conjunction
    • JJ adjective 'big'
    • JJR adjective, comparative 'bigger'
    • JJS adjective, superlative 'biggest'
    • LS list marker 1)
    • MD modal could, will
    • NN noun, singular 'desk'
    • NNS noun plural 'desks'
  15. POS Tags
    • NNP proper noun, singular 'Harrison'
    • NNPS proper noun, plural 'Americans'
    • PDT predeterminer 'all the kids'
    • POS possessive ending parent's
    • PRP personal pronoun I, he, she
    • PRP$ possessive pronoun my, his, hers
    • RB adverb very, silently
    • RBR adverb, comparative better
    • RBS adverb, superlative best
    • RP particle give up
    • TO to go 'to' the store
    • UH interjection errrrrrrrm
    • VB verb, base form take
  16. POS Tags
    • VBD verb, past tense took
    • VBG verb, gerund/present participle taking
    • VBN verb, past participle taken
    • VBP verb, sing. present, non-3rd person take
    • VBZ verb, 3rd person sing. present takes
    • WDT wh-determiner which
    • WP wh-pronoun who, what
    • WP$ possessive wh-pronoun whose
    • WRB wh-adverb where, when
  17. Patterns to consider
    • Active Voice
      ◦ <Individual/Community/Organization> <Causative Verb> <Event entity>
    • Passive Voice
      ◦ <Event entity> <Causative Verb> <Individual/Community/Organization>
    Causative verbs are verbs that show the reason that something happened. Easier/basic patterns are in scope for now.
    • Thresholding
      ◦ What percentage of sentences fitting this pattern in an article text would be considered propaganda?
  18. References
    • Manipulative Propaganda Techniques - Vít Baisa, Ondřej Herman, and Aleš Horák
    • Detecting Expressions of Blame or Praise in Text - Udochukwu Orizu and Yulan He
    • The BECauSE Corpus 2.0: Annotating Causality and Overlapping Relations - Jesse Dunietz, Lori Levin and Jaime Carbonell
    • Unsupervised Learning of Narrative Event Chains - Nathanael Chambers and Dan Jurafsky
    • Samples: https://medium.com/@VasquezNnenna/different-examples-of-propaganda-in-social-media-758fc98d021d
    • https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
    • https://www.kdnuggets.com/2018/08/emotion-sentiment-analysis-practitioners-guide-nlp-5.html
    • https://towardsdatascience.com/natural-language-processing-event-extraction-f20d634661d3