Transportation in Social Media: an automatic classifier for travel-related tweets

João Pereira
September 15, 2017

18th EPIA Conference on Artificial Intelligence (EPIA 2017)

Transcript

  1. Transportation in Social Media: an automatic classifier for travel-related tweets

    João Pereira, Arian Pasquali, Pedro Saleiro and Rosaldo Rossetti
    18th EPIA Conference on Artificial Intelligence
    Faculdade de Engenharia da Universidade do Porto, Porto, Portugal
    5th-8th September, 2017
  2. Introduction

    Instant connectivity 24/7
    Sharing of events, opinions and activities
    Many research areas have tried to exploit social media data
    Smart Cities and Intelligent Transportation Systems are also obvious candidates
    Information derived from such exploration may benefit city governance, traffic-flow management, etc.
  3. Introduction

    Mining social media data is a laborious and time-consuming process:
    a) Social media platforms have their own specificities
    b) The volume of data retrieved is overwhelming
    c) Social media texts have several restrictions
  4. Related Work

    • Maghrebi et al. [1] tried to extract tweets related to several modes of transport using a keyword-based search method in Melbourne, Australia
    • Carvalho et al. [2] created a travel-related classifier whose training set was unbalanced, since the percentage of tweets known to be travel-related was very low
    • Kuflik et al. [3] proposed a framework to automatically extract and analyse transport-related tweets using conventional features (bag-of-words) to train their classification models
  5. Data

    Rio de Janeiro and São Paulo
    • Data was collected using bounding-box matching
    • Period of collection: March 12 to April 12, 2017
    • Total of Portuguese geo-located tweets: 7.7 million
      ◦ Rio de Janeiro, RJ (5.3 million)
      ◦ São Paulo, SP (2.4 million)
    Two of the top-10 most active cities regarding geo-located tweets
    [Figure panels: RJ, SP]
  6. Classifying Travel-related Tweets: Approach

    • Conventional approaches require the specification of travel-related keywords
    • Bag-of-words tends to produce sparse representations
    • Word embeddings - Mikolov et al. [4]
      ◦ Text representation technique
      ◦ Captures the syntactic and semantic relations between words
      ◦ More cohesive representation where similar words are represented by similar vectors
    • For instance, the words or phrases in each of these groups yield similar vectors:
      ◦ taxi/uber
      ◦ bus/busão/ônibus
      ◦ go to work/go to school
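To make "similar words are represented by similar vectors" concrete, the sketch below compares vectors with cosine similarity. The three-dimensional vectors and the unrelated word "praia" are invented purely for illustration; they are not taken from the trained model described in the talk.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, made up for this example only.
toy_vectors = {
    "taxi": [0.9, 0.1, 0.0],
    "uber": [0.8, 0.2, 0.1],
    "praia": [0.0, 0.1, 0.9],
}

sim_taxi_uber = cosine_similarity(toy_vectors["taxi"], toy_vectors["uber"])
sim_taxi_praia = cosine_similarity(toy_vectors["taxi"], toy_vectors["praia"])
```

With real embeddings the expectation is the same as in this toy case: the taxi/uber pair should score higher than a pair of unrelated words.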
  7. Classifying Travel-related Tweets: Text Pre-processing

    • Lowercasing
      Every Twitter message was converted into lowercase
    • Transforming repeated characters
      Sequences of characters repeated more than three times were collapsed
      For instance: “loooool” -> “lol”
    • Entities cleaning
      Removing URLs and user mentions (@user)
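The three pre-processing steps can be sketched with regular expressions. This is a minimal illustration rather than the authors' code: the patterns and the function name `preprocess` are assumptions, and collapsing a run to a single character is inferred from the “loooool” example.

```python
import re

# Assumed patterns; the slides do not specify the exact rules used.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# A character followed by three or more copies of itself,
# i.e. the same character more than three times in a row.
REPEAT_RE = re.compile(r"(.)\1{3,}")


def preprocess(tweet):
    """Lowercase, strip URLs and mentions, collapse long character runs."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1", text)
    # Normalise whitespace left behind by the removals.
    return " ".join(text.split())
```

For example, `preprocess("Loooool @joao no busão https://t.co/abc")` returns `"lol no busão"`, while a run of exactly three characters, as in `"booom"`, is left unchanged.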
  8. Classifying Travel-related Tweets: Features

    • Bag-of-words (BoW)
      ◦ 3,000 most frequent terms
      ◦ Excluding terms found in more than 60% of the documents
    • Bag-of-embeddings (BoE)
      ◦ Paragraph2vec - Le and Mikolov [5]
      ◦ Trained using 10 iterations over the whole Portuguese dataset
      ◦ Context window of size 2
      ◦ Feature vectors of 100 dimensions
    • BoW + BoE
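The bag-of-words rules (keep the most frequent terms, drop terms that appear in too many documents) can be sketched in plain Python. The function names are hypothetical, ranking by raw term count is an assumption, and joining BoW and BoE by simple concatenation is also an assumption, since the slides do not say how the two were combined.

```python
from collections import Counter


def build_vocabulary(documents, max_terms=3000, max_doc_ratio=0.6):
    """Pick the most frequent terms, skipping overly common ones.

    documents: list of token lists, one per tweet.
    """
    term_freq = Counter()
    doc_freq = Counter()
    for tokens in documents:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each term once per document
    limit = max_doc_ratio * len(documents)
    kept = [(t, c) for t, c in term_freq.items() if doc_freq[t] <= limit]
    # Most frequent first; ties broken alphabetically for determinism.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [term for term, _ in kept[:max_terms]]


def bow_vector(tokens, vocabulary):
    """Term counts of one tweet, ordered as in the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def combine(bow, boe):
    """Assumed combination: concatenate the two feature vectors."""
    return list(bow) + list(boe)
```

In practice, scikit-learn's `CountVectorizer` exposes `max_features` and `max_df` parameters that cover the same two rules.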
  9. Classifying Travel-related Tweets: Training and Test Datasets

    • Training Dataset
      ◦ Search-based method using terms for each mode of transport
      ◦ Manual annotation
      ◦ Balanced
        ▪ 2,000 positive samples + 2,000 negative samples
    • Test Dataset
      ◦ 71 positive samples + 929 negative samples
      ◦ Tweets with terms such as “Uber” and “Busão” were included in the positive sample

    Mode   Terms
    Bike   bicicleta, moto
    Bus    onibus, ônibus
    Car    carro
    Taxi   taxi, táxi
    Train  metro, metrô, trem
    Walk   caminhar
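A search-based selection of candidate tweets, using the per-mode terms listed on this slide, might look like the sketch below. The function name `matched_modes` and the whole-token matching rule are assumptions; matched tweets would still need the manual annotation step mentioned above.

```python
import re

# Search terms per mode of transport, as listed on the slide.
MODE_TERMS = {
    "bike": ["bicicleta", "moto"],
    "bus": ["onibus", "ônibus"],
    "car": ["carro"],
    "taxi": ["taxi", "táxi"],
    "train": ["metro", "metrô", "trem"],
    "walk": ["caminhar"],
}


def matched_modes(tweet):
    """Return the sorted modes whose terms appear as whole tokens."""
    tokens = set(re.findall(r"\w+", tweet.lower()))
    return sorted(
        mode for mode, terms in MODE_TERMS.items()
        if tokens.intersection(terms)
    )
```

Matching whole tokens means "motorista" does not trigger the "moto" term, which keeps the candidate set a little cleaner at the cost of missing inflected forms.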
  10. Classifying Travel-related Tweets: Classifiers and Evaluation Metrics

    • Training Classifiers
      ◦ Support Vector Machines (SVM), with kernel functions:
        ▪ rbf
        ▪ sigmoid
        ▪ linear
      ◦ Logistic Regression (LR)
        ▪ Scikit-learn standard parameters
      ◦ Random Forests (RF)
        ▪ Gini criterion
        ▪ 100 trees in the forest
    • Evaluation Metrics
      ◦ Precision, Recall, F1-score, ROC and AUC
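Three of the listed metrics follow directly from counts of true positives, false positives and false negatives. The helper below is a generic sketch with a hypothetical name, treating label 1 as "travel-related"; it is not the evaluation code used in the talk.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

On a test set as skewed as the one described (71 positives against 929 negatives), these per-class metrics are more informative than plain accuracy, which a classifier could inflate by always predicting the negative class.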
  11. Results

    Classifier           Features   Precision  Recall  F1-score
    Linear SVM           BoW        1.0        0.6761  0.8067
    Linear SVM           BoE        0.4338     0.8309  0.5700
    Linear SVM           BoW + BoE  1.0        0.7465  0.8548
    Logistic Regression  BoW        1.0        0.6338  0.7759
    Logistic Regression  BoE        0.4444     0.8451  0.5825
    Logistic Regression  BoW + BoE  1.0        0.6761  0.8067
    Random Forest        BoW        1.0        0.6338  0.7759
    Random Forest        BoE        0.2298     0.8028  0.3574
    Random Forest        BoW + BoE  1.0        0.6338  0.7759
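As a quick sanity check on how the reported numbers relate, the F1-score is the harmonic mean of precision and recall. Taking the Linear SVM with BoW features, with precision 1.0 and recall 0.6761 from the results above:

```latex
F_1 = \frac{2 P R}{P + R}
    = \frac{2 \times 1.0 \times 0.6761}{1.0 + 0.6761}
    \approx 0.8067
```

This matches the reported F1-score for that configuration, and the same relation holds for the other rows to within rounding.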
  12. Application

    Verifying the location of the travel-related tweets predicted with our classifier.
    Rio de Janeiro
    • Rio-Niterói bridge
    • Central do Brasil
  13. Final Remarks

    • Construction of a fine-grained Twitter training set for the travel domain
    • Combining different word representations yielded better results (for the Linear SVM, an F1-score of 0.8548 with BoW + BoE against 0.8067 with BoW alone), which may indicate that BoW and BoE complement each other
    • Correlate our results with official sources of transportation agencies:
      ▪ traffic congestion
      ▪ other events on the transportation network
  14. References

    [1] Maghrebi, Abbasi, and Waller - Transportation application of social media: Travel mode extraction. In Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference, 2016
    [2] Carvalho, Sarmento, and Rossetti - Real-time sensing of traffic information in twitter messages. In 4th Workshop on Artificial Transportation Systems and Simulation (ATSS), ITSC 2010, 2010
    [3] Kuflik, Minkov, Nocera, Grant-Muller, Gal-Tzur, and Shoor - Automating a framework to extract and analyse transport related social media content: The potential and the challenges. In Transportation Research Part C: Emerging Technologies, 2017
    [4] Mikolov, Sutskever, Chen, Corrado, and Dean - Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 2013
    [5] Le and Mikolov - Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014
  15. Thank You!