João Pereira, Arian Pasquali, Pedro Saleiro and Rosaldo Rossetti
18th EPIA Conference on Artificial Intelligence
Faculdade de Engenharia da Universidade do Porto
Porto, Portugal
5th-8th September, 2017
• Many research areas have tried to exploit social media data
• Smart Cities and Intelligent Transportation Systems are also obvious candidates
• Information derived from such exploration may bring benefits to the cities’ governance, traffic-flow management, etc.
consuming process:
a) Social media platforms have their own specificities
b) The volume of data retrieved is overwhelming
c) Social media texts have several restrictions
tweets related to several modes of transport using a keyword-based search method in Melbourne, Australia
• Carvalho et al. [2] created a travel-related classifier whose training set was unbalanced, since the percentage of tweets known to be travel-related was very low
• Kuflik et al. [3] proposed a framework to automatically extract and analyse transport-related tweets, using conventional features (bag-of-words) to train their classification models
collected using bounding-box matching
• Period of collection: between March 12 and April 12, 2017
• Total of Portuguese geo-located tweets: 7.7 million
  ◦ Rio de Janeiro, RJ (5.3 million)
  ◦ São Paulo, SP (2.4 million)
• Two of the top-10 most active cities regarding geo-located tweets
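Bounding-box matching can be sketched as a simple containment test on a tweet's coordinates. The box coordinates below are rough, illustrative assumptions and not the boxes used in the study; `BOUNDING_BOXES` and `match_city` are hypothetical names.

```python
from typing import Optional

# Illustrative (south, west, north, east) boxes in decimal degrees.
# These values are rough assumptions, not the boxes used in the study.
BOUNDING_BOXES = {
    "RJ": (-23.08, -43.80, -22.74, -43.09),
    "SP": (-24.01, -46.83, -23.35, -46.36),
}


def match_city(lat: float, lon: float) -> Optional[str]:
    """Return the key of the first box containing the point, or None."""
    for city, (south, west, north, east) in BOUNDING_BOXES.items():
        if south <= lat <= north and west <= lon <= east:
            return city
    return None
```

A tweet whose coordinates fall outside every box is simply discarded in this sketch.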
Approach
travel-related keywords
• Bag-of-words tends to produce sparse representations
• Word embeddings - Mikolov et al. [4]
  ◦ Text representation technique
  ◦ Captures the syntactic and semantic relations between words
  ◦ More cohesive representation where similar words are represented by similar vectors
• For instance, the following groups yield similar vectors:
  ◦ taxi / uber
  ◦ bus / busão / ônibus
  ◦ go to work / go to school
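"Similar vectors" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional toy vectors purely for illustration; they are assumptions, not embeddings learned from the tweet dataset.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors (assumed values, for illustration only).
taxi = [0.9, 0.1, 0.0]
uber = [0.8, 0.2, 0.1]
caminhar = [0.0, 0.2, 0.9]

# Related terms should score higher than unrelated ones.
print(cosine(taxi, uber) > cosine(taxi, caminhar))
```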
was converted into lowercase
• Transforming repeated characters
  ◦ Sequences of characters repeated more than three times were reduced to a single occurrence
  ◦ For instance: “loooool” -> “lol”
• Entities cleaning
  ◦ Removing URLs and user mentions (@user)
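The cleaning steps above can be sketched with a few regular expressions. This is a minimal illustration; the exact patterns, the order of the steps, and the function name `preprocess` are assumptions rather than the authors' implementation.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# A character followed by three or more copies of itself,
# i.e. repeated more than three times in a row.
REPEAT_RE = re.compile(r"(.)\1{3,}")


def preprocess(text: str) -> str:
    text = text.lower()                # lowercase
    text = URL_RE.sub("", text)        # remove URLs
    text = MENTION_RE.sub("", text)    # remove user mentions
    text = REPEAT_RE.sub(r"\1", text)  # "loooool" -> "lol"
    return " ".join(text.split())      # tidy leftover whitespace


print(preprocess("Loooool o busão tá cheio @maria https://t.co/xyz"))
# prints: lol o busão tá cheio
```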
frequent terms
  ◦ Excluding the ones found in more than 60% of the documents
• Bag-of-embeddings (BoE)
  ◦ Paragraph2vec - Le and Mikolov [5]
  ◦ Trained using 10 iterations over the whole Portuguese dataset
  ◦ Context window of size 2
  ◦ Feature vectors of 100 dimensions
• BoW + BoE
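The document-frequency cut-off and the BoW + BoE combination can be sketched in plain Python. Assumptions: tweets are already tokenized, the two representations are combined by concatenation, and the embedding vector is supplied by some separately trained model (it is a placeholder here). All function names are hypothetical.

```python
from collections import Counter


def build_vocabulary(docs, max_df=0.6):
    """Keep terms that appear in at most `max_df` of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    n_docs = len(docs)
    return sorted(t for t, c in doc_freq.items() if c / n_docs <= max_df)


def bow_vector(tokens, vocabulary):
    """Term counts of one document over the vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def combine(bow, boe):
    """BoW + BoE as a single concatenated feature vector (assumed scheme)."""
    return list(bow) + list(boe)


docs = [
    ["o", "onibus", "atrasou"],
    ["o", "metro", "parou"],
    ["o", "carro", "quebrou"],
]
vocab = build_vocabulary(docs)  # "o" occurs in 100% of documents and is dropped
```

With the settings listed on the slide, the embedding part would contribute 100 dimensions to each combined vector.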
Training and Test Datasets
terms for each mode of transport
  ◦ Manual annotation
  ◦ Balanced
    ▪ 2,000 positive samples + 2,000 negative samples
• Test Dataset
  ◦ 71 positive samples + 929 negative samples
  ◦ Tweets with terms such as “Uber” and “Busão” were included in the positive sample

Mode    Terms
Bike    bicicleta, moto
Bus     onibus, ônibus
Car     carro
Taxi    taxi, táxi
Train   metro, metrô, trem
Walk    caminhar
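A keyword filter over the terms in the table could look like the sketch below. Whole-token matching and the names `MODE_TERMS` and `travel_modes` are assumptions; the slides do not state how the terms were matched.

```python
import re
import unicodedata

# Terms per mode of transport, as listed in the table above.
MODE_TERMS = {
    "Bike": ["bicicleta", "moto"],
    "Bus": ["onibus", "ônibus"],
    "Car": ["carro"],
    "Taxi": ["taxi", "táxi"],
    "Train": ["metro", "metrô", "trem"],
    "Walk": ["caminhar"],
}
TERM_TO_MODE = {
    unicodedata.normalize("NFC", term): mode
    for mode, terms in MODE_TERMS.items()
    for term in terms
}


def travel_modes(text: str):
    """Return the sorted modes whose terms appear as whole tokens."""
    normalized = unicodedata.normalize("NFC", text.lower())
    tokens = re.findall(r"\w+", normalized)
    return sorted({TERM_TO_MODE[t] for t in tokens if t in TERM_TO_MODE})
```

Matching on whole tokens avoids false hits such as "motorista" for the term "moto". Candidates selected this way would still need the manual annotation mentioned above.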
  ◦ Support Vector Machines (SVM), with kernel functions:
    ▪ rbf
    ▪ sigmoid
    ▪ linear
  ◦ Logistic Regression (LR)
    ▪ Scikit-learn standard parameters
  ◦ Random Forests (RF)
    ▪ Gini criterion
    ▪ 100 trees in the forest
• Evaluation Metrics
  ◦ Precision, Recall, F1-score, ROC and AUC
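With 71 positive and 929 negative test samples, a classifier that always predicts "not travel-related" would already reach 92.9% accuracy, which is why precision, recall and F1 on the positive class are the more informative metrics. A minimal sketch of their computation follows; the labels are toy values for illustration, not results from the study.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1-score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels (assumed): 3 travel-related tweets out of 8.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```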
for the travel domain
• The combination of different word representations yielded better results, which may indicate that BoW and BoE complement each other
• Correlate our results with official sources from transportation agencies:
  ▪ traffic congestion
  ▪ other events on the transportation network
social media: Travel mode extraction. In Intelligent Transportation Systems (ITSC), 2016 IEEE 19th International Conference, 2016
[2] Carvalho, Sarmento, and Rossetti - Real-time sensing of traffic information in Twitter messages. In 4th Workshop on Artificial Transportation Systems and Simulation (ATSS), ITSC 2010, 2010
[3] Kuflik, Minkov, Nocera, Grant-Muller, Gal-Tzur, and Shoor - Automating a framework to extract and analyse transport related social media content: The potential and the challenges. In Transportation Research Part C: Emerging Technologies, 2017
[4] Mikolov, Sutskever, Chen, Corrado, and Dean - Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013
[5] Le and Mikolov - Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014