
LOCATION EXTRACTION AND GEOREFERENCING IN SOCIAL MEDIA


The ability to extract or estimate location in social media content, and to perform location-centric analyses, offers unique and wide-ranging applications. Examples include disaster management, demographic and socio-cultural studies, and spatiotemporal tracking. For instance, location information is critical to reach and rescue disaster-stricken people and to dispatch humanitarian assistance. Consequently, there is a pressing need to better understand how people express location information explicitly and implicitly on social media and, in general, to develop efficient techniques for geospatial computing that span all information channels. Additionally, location information enables a variety of individual-level and community-level analyses.

Location extraction and georeferencing methods leverage user-generated content (textual as well as multimedia data, e.g., images and videos) and users’ connectivity (social network analysis). The applications of these methods range from detecting communities to localizing individual texts and inferring users’ physical locations. However, due to the challenges posed by social media data, some techniques do not work reliably on its informal and ill-formed texts, or scale poorly.

In this tutorial, we present the general problem of georeferencing and location extraction, summarize the state-of-the-art research, discuss challenges, and provide an overview of our recent research accomplishments in the context of disaster management.

Hussein S. Al-Olimat

May 20, 2018


Transcript

  1. rebrand.ly/HazardSEES knoesis.org/resources/geotutorial LOCATION EXTRACTION AND GEOREFERENCING IN SOCIAL MEDIA: CHALLENGES,

    TECHNIQUES, AND APPLICATIONS Hussein S. Al-Olimat, Amir Yazdavar, Krishnaprasad Thirunarayan, and Amit Sheth
  2. Part 1: Outline • Context-aware Computing and some Introductory Material

    • Twitter Location Features • Semantic Ambiguity • Geographic Information Retrieval (GIR) • Place Semantics • Location Extraction from Unstructured Texts • Location Name Extractor (LNEx) • Location Disambiguation (brief introduction)
  3. Context-aware Computing http://www.snw-arts.dk/content/inside-it-context-aware-computing/ • Location-aware apps and services • Tailoring

    computing to specific situations for specific locations • Influence decision-making • e.g., Landslide models taking into consideration the land cover and the rainfall amount.
  4. THE SEMANTIC TRIANGLE Term “Buffalo” Concept Referent Stands for Refers

    to Symbolizes 4th century BC TRIANGLE OF REFERENCE
  5. THE SEMANTIC TRIANGLE Term “Buffalo” Concept Referent Stands for Refers

    to Symbolizes 4th century BC TRIANGLE OF REFERENCE The role of context
  6. • Semantics: “The linguistic and philosophical study of meaning.” ~

    Wikipedia • Syntactics: “a branch of semiotics that deals with the formal relations between signs or expressions in abstraction from their signification and their interpreters.” ~ merriam-webster.com “Standing beside the Statue of Liberty” vs. “Standing beside the ” • Tying Semantics and Syntactics is the outcome of social agreements on the use of certain terms and symbols Knowledge Representation - Semantics and Syntactics
  7. The Geographic coordinate system • “A geographic coordinate system is

    a coordinate system used in geography that enables every location on Earth to be specified by a set of numbers, letters or symbols.” ~ Wikipedia • World Geodetic System (WGS-84) is the standard coordinate system. Hyatt Regency Rochester 125 E Main St, Rochester, NY 14604 43.156449, -77.608364 wiki/File:Latitude_and_Longitude_of_the_Earth.svg
  8. Some Definitions ▪ Geocoding vs. Geoparsing: □ Georeferencing (aka. Geocoding):

    • “Finding the geographical coordinates of a place name or street address (Geocoding)” ~ wiki/Georeferencing • Geocoding usually works with unambiguous structured location references (e.g., addresses such as “125 E Main St, Rochester, NY 14604”). □ Geoparsing: handles ambiguous references in unstructured texts (e.g., “We are at Hyatt Hotel”).
  9. Some Definitions ▪ Toponyms: “Toponym is the general name for

    any place or geographical entity.” ~Wikipedia ▪ Toponym Resolution: associating location names with geographical footprints from a coordinate system such as WGS-84. ▪ Location Name Linking: the part of the toponym resolution process that goes one step further by disambiguating linked location mentions, which is necessary due to the semantic ambiguity of location mentions.
  10. SEMANTIC AMBIGUITY wiki/Buffalo Sense Disambiguation GEO/GEO GEO/NON-GEO ◦ Buffalo (footwear)

    ◦ Buffalo (card game) ◦ Buffalo (band) ◦ Buffalo, Victoria, Australia ◦ Buffalo, Alberta, Canada ◦ Buffalo, USA (24 cities).
  11. The Role of Context Context for GEO/GEO Context for GEO/NON-GEO

    Textual Context Bounding Boxes Background Knowledge I am from Jordan vs. I love watching Jordan playing vs. I love watching Jordan playing with the Bulls • Rochester, NY • 5 miles from 43.156449, -77.608364
  12. ▪ Disaster Management (Response and Recovery) □ Road Closures or

    Evacuations □ Disaster Relief (shelters, food, and donations) ▪ Provide a system for Disaster Assistance Centers Road Closures Help Needed Injuries Reported ... Location Information in Social Media
  13. User profile Tweet Metadata Geo Coordinates Tweet Text User Connectivity

    Twitter Location Features - TLF Profile Description Location Field Timezone Place
  14. TLF - User Profile Location User profile Profile Description Location

    Field Timezone free-text field in the user profile string describing the Time Zone this user declares themselves within
  15. TLF - Tweet Metadata place: Indicates that the tweet is

    associated with (but not necessarily originating from) a Place. coordinates: Represents the geographic location of this Tweet as reported by the user or client application. The inner coordinates array is formatted as geoJSON (longitude first, then latitude). geo: Deprecated. Contains the coordinates field content. Tweet Metadata Geo coordinates place
  16. The Search API & Tweets by Place Twitter Search API

    Web Interface Search https://twitter.com/search?q=place:00c60988621e2c71 https://api.twitter.com/1.1/geo/reverse_geocode.json?lat=43.156603&long=-77.608248 Place ID >>> places = api.geo_search(lat='43.156603', long='-77.608248', max_results=10) >>> places[0].id u'00c60988621e2c71' >>> tweets = api.search(q='place:23e921b82040ccd6', count=100)
  17. Geo-based Stream Filtering Targeted Streams Geo-tagged Content api = tweepy.streaming.Stream(auth,

    CustomStreamListener()) api.filter(locations=[-77.70,43.10,-77.53,43.26]) #ChennaiFloods #HoustonFlood #event Wikipedia: File:2016_Louisiana_floods_map_of_parishe s_declared_federal_disaster_areas.png
  18. Data Integration - Example ▪ Sensor Data (Weather) and Twitter

    Data (Georeferenced Tweets) □ Researchers in North Dakota [#] determined that a weather-station coverage radius of 30 miles should not be exceeded, taking into consideration the terrain and land use. □ Readings not covered by any weather station can be interpolated by averaging the few nearest readings. □ Integration can be done using Geohashes, where we convert the area into grids [#] Surface Transportation Weather Research Center. (2009). Analysis of Environmental Sensor Station Deployment Alternatives. Bismarck, ND: North Dakota Department of Transportation.
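As a sketch of the Geohash gridding step above: a minimal standard Geohash encoder in Python (the 6-character precision and the sample coordinates are illustrative choices, not from the cited study).

```python
# Minimal standard Geohash encoder for gridding an area. Tweets and
# weather readings whose hashes share a prefix fall in the same grid cell.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    code, ch, bit, even = [], 0, 0, True
    while len(code) < precision:
        # Geohash interleaves longitude and latitude bits, longitude first
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch <<= 1
            rng[1] = mid
        even = not even
        bit += 1
        if bit == 5:  # every 5 bits become one base-32 character
            code.append(BASE32[ch])
            ch, bit = 0, 0
    return "".join(code)

geohash_encode(43.156449, -77.608364, 6)
```

Readings can then be bucketed by hash prefix; truncating the hash coarsens the grid.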
  19. Geographic Information Retrieval From Semi-/Structured Data Sources From Unstructured Data

    Sources [#] Jones, Christopher B., and Ross S. Purves. "Geographical information retrieval." International Journal of Geographical Information Science 22, no. 3 (2008): 219-228. Retrieving from any type of data sources the relevant geographic information based on queries [#] Formal Data representation and annotation GeoJSON or W3C Geospatial Vocabulary IE, ML and NLP techniques Sequence labeling, NER, Gazetteer matching
  20. Geographic Information Retrieval Social Network Analysis For Location Inference wiki/Geotagged_photograph

    Geotagged Pictures User Localization Tweet Localization Tweet Metadata profile field, place field, geocoordinates Tweet Text Location Name Extraction using NLP and ML
  21. Place Semantics / Geosemantics Processing User-location interactions User reviews Thematic

    Spatial Spatial Relationships Temporal In a Knowledge Base/Graph --- Geo-computations Hu, Yingjie. “Geospatial Semantics.” Arxiv.org. 2017 Ye et al. "On the semantic annotation of places in location-based social networks." SIGKDD 2011.
  22. Text Preprocessing: Text Normalization and Filtering (1) Case Folding (2)

    Spell Checking (3) Hashtag Breaking (4) User Mention Removal (5) URL Removal (6) Stopword Removal (optional), plus Tokenization. Example: @iscram2018_in_Hyatt_hotl_at_#EastMainStr_http://.. gives Hyatt → hyatt, hotl → hotel, #EastMainStr → east main str, and removes @iscram2018, http://.., and (optionally) the stopwords “in” and “at”
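The preprocessing steps above can be sketched as a small pipeline. The regexes, the ordering, and the tiny stopword set below are illustrative, not the tutorial's actual implementation; spell checking and full hashtag breaking are separate steps covered on their own slides.

```python
import re

STOPWORDS = {"in", "at", "the", "a"}  # illustrative subset

def preprocess(text, remove_stopwords=False):
    text = text.replace("_", " ")               # undo the underscore joining in this example
    text = re.sub(r"http\S+", " ", text)        # (5) URL removal
    text = re.sub(r"@\w+", " ", text)           # (4) user-mention removal
    text = re.sub(r"#(\w+)", r"\1", text)       # strip the hash character (breaking comes later)
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # (1) case folding + tokenization
    if remove_stopwords:                        # (6) optional stopword removal
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

preprocess("@iscram2018_in_Hyatt_hotl_at_#EastMainStr_http://..")
```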
  23. Hashtag Breaking #HoustonFlood Houston Flood p(Houston & Flood) > p(Houst

    & onFlood) Camel-case Subsequence Extraction Statistical Hash Character Removal HoustonFlood Houston Flood using a word list
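The camel-case subsequence extraction step can be sketched with a single regular expression (a minimal sketch; the statistical method handles all-lowercase tags):

```python
import re

# Split a hashtag on case boundaries and digit runs after removing
# the hash character; all-lowercase tags fall through unsplit and need
# the statistical word-segmentation fallback instead.
def split_camel(hashtag):
    return re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+", hashtag.lstrip("#"))

split_camel("#HoustonFlood")
```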
  24. Location Name Extraction from Texts Rule-based vs. Feature-based Gazetteer and

    N-gram Matching/Search Machine Learning Supervised Techniques Semi-supervised Techniques [Middleton et al., 2014; Malmasi and Dras, 2015; Sultanik and Fink, 2012; Li et al., 2014; Gelernter and Zhang, 2013] [Lingad et al., 2013; Yin et al., 2014; Weissenbacher et al., 2015; Han et al., 2014; Khan et al., 2013; Bontcheva et al., 2013; Gelernter and Zhang, 2013] Distant Supervision Co-training
  25. Gazetteer & N-gram Matching/Search Hyatt hotel east main st Chunking

    & Noun Phrase Extraction Locative Expression Extraction (Syntactic Heuristics) 1. east 2. main 3. st 4. east main 5. main st 6. east main st east main st Using lexical cues: Distance & Direction Markers & Prepositions … 50 km south of LOCATION … … across from LOCATION … … away from LOCATION … ... live in LOCATION … n-gram enumeration Location & POI category matching (Semantic Heuristics) Using Regex Shallow Parser: Part-of-speech (POS) tagging + Rules e.g., [Liu et al., 2014] [Malmasi and Dras, 2015] [Gelernter and Balaji, 2013] [Malmasi and Dras, 2015]
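The n-gram enumeration step above is straightforward to sketch; gazetteer and POI-category matching would then filter this candidate list:

```python
# Enumerate all n-grams of a locative phrase as candidate toponyms,
# mirroring the "east main st" enumeration above. Matching each candidate
# against a gazetteer is a separate step.
def ngrams(tokens, max_n=None):
    max_n = max_n or len(tokens)
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

ngrams(["east", "main", "st"])
```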
  26. Training Data @iscram2018 O in O Hyatt B-Loc hotel I-Loc

    at O East B-Loc Main I-Loc Str I-Loc http://.. O @iscram2018 in <START:LOC> Hyatt hotel <END> at <START:LOC> East Main Str <END> http://.. Standoff format <file.txt> @iscram2018 in Hyatt hotel at East Main Str http://.. <file.ann> T1 Location 15 26 Hyatt hotel T2 Location 30 43 East Main Str OpenNLP format CONLL format
  27. Feature Engineering Shape Contextual Word POS Category Gazetteer Preposition Feature

    Set Hyatt Hotel in, hyatt, hotel hyatt, hotel, at @iscram2018 in at ... IN, NNP, NNP low, caps, caps caps, caps, low false true true true true true NNP, NNP, IN in the window of 2 before? in gazetteer? is location category?
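As a sketch of this feature set (the feature names, the window of 2, and the preposition list are illustrative), per-token features might look like:

```python
def token_features(tokens, i, gazetteer, loc_categories=()):
    # Sketch of the per-token features above: word shape, a contextual
    # window of 2 before, preposition cues, and gazetteer lookups.
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "shape": "caps" if tok[:1].isupper() else "low",
        "prev_window": [t.lower() for t in tokens[max(0, i - 2):i]],
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "preposition_before": any(t.lower() in {"in", "at", "on", "near"}
                                  for t in tokens[max(0, i - 2):i]),
        "in_gazetteer": tok.lower() in gazetteer,
        "is_location_category": tok.lower() in loc_categories,
    }

token_features(["in", "Hyatt", "hotel"], 1, {"hyatt"}, {"hotel", "street"})
```

A sequence labeler (e.g., a CRF) would consume one such dictionary per token.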
  28. Challenging examples and types of Location Mentions: 1. Normal LN

    2. Shape Problem 3. Ambiguous Location Name 4. LN in Hashtag 5. LN with Abbreviation 6. Abbreviated LN 7. LN Nickname 8. LN Contraction 9. Numeric LN 10. Misspelled LN 11. Spacing problem 12. Highly Ambiguous LN 13. Missing Punctuation
  29. Problems and Challenges Contractions Abbreviations Ambiguity Nicknames Referring to “Balalok

    Matric Higher Secondary School” as “Balalok School” Referring to “Wright State University” as “WSU” My backyard, My house, Buffalo HTown Mentions in Hashtags #LouisianaFlood Misspellings sou th kr koil street Word Shape Problems west mambalam Ungrammatical Writing Oxford school.west mambalam
  30. Nameheads: theory of “alternate name forms” by John M Carroll

    Shortening Processes Appellation Formation Explicit Metonymy Category Ellipsis Location Ellipsis Referring to “Greater Rochester International Airport” as “The Airport” Referring to “University of Michigan” as “Michigan” Referring to “Balalok Matric Higher Secondary School” as “Balalok School” Dropping specific location references in the location name: RIT - Rochester to RIT
  31. Gazetteers ▪ Significance: □ Help machines to understand the geographic

    meaning of places. □ Model the relations between locations. ▪ Main Components: □ Place Names (N) □ Place Types (T) □ Spatial Footprints (F) ▪ Gazetteer Operations: □ N ➞ F □ N ➞ T □ F (x T) ➞ N Keßler et al. [4]
  32. • Cities • Countries • Street names • Neighborhoods •

    Points of interest • Building names • Organizations • Districts • States Bounding Box https://www.npmjs.com/package/geonames-importer | https://github.com/komoot/photon | https://dbpedia.org/sparql boundingbox.klokantech.com Building Region-specific Gazetteers
  33. OpenStreetMap - Our Choice ▪ Regarded as the Wikipedia of

    maps. ▪ Contains more fine-grained locations than any other resource. ▪ More accurate geo-coordinates in comparison with GeoNames. ▪ It also has a strong volunteer community (such as hotosm.org) that maps thousands of locations during a disaster. http://osmlab.github.io/show-me-the-way/
  34. Lexical Variations ▪ We tried to overcome the lexical variations

    through: □ Using the USPS street suffixes and the English OSM abbreviations dictionaries: □ Apartment → Apt □ Street → Str. □ Avenue → AVE □ Airport → Aprt □ Using the OpenStreetMap gazetteer, which contains the following names: □ English Names □ Default Names □ Alternative Names □ Old Names □ Acronyms (these need more processing and are not directly retrievable most of the time). https://pe.usps.com/text/pub28/28apc_002.htm https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#English
  35. Problem: Stratford School appears in the gazetteer as Stratford High School

    Solution: Lexical Stats. using language models built from gazetteers
  36. ▪ Capturing different forms of a toponym (to improve recall)

    □ “Balalok Matric Higher Secondary School” → “Balalok School”. ▪ Recording alternative names (to improve recall) □ “Anna Salai (Mount Road)” → “Anna Salai” and “Mount Road” ▪ Filtering toponym names (to improve recall) □ Break records: “Tamilnadu Housing Board Road , Ayapakkam” ▪ Filtering out very noisy toponyms (to improve precision) https://foursquare.com/ Ball Town in Louisiana Clinton Town in Louisiana Jackson City in Mississippi Gazetteers Preprocessing
  37. Raw Text 1. Filtering 2. Augmentation (Private Road) - -

    Burnt School (historical) Burnt School - Lindale Park (Southbound) Lindale Park - Cars India - Adyar Cars India Adyar Cars India Adyar | Clinton | - - Stratford High School - Stratford School Modern Senior Secondary School (Primary) Modern Senior Secondary School Modern School Modern Senior School Modern Secondary School Solutions: 1. Word lists and rules 2. Skip-grams Gazetteers Preprocessing Auxiliary Content | Ambiguous Location Names | Collocation Contractions
  38. ▪ Break compound words into discrete ones □ In our

    dataset, around 29% of hashtags include location names ▪ Used Peter Norvig’s Word Segmentation method [Norvig, 2009] □ Unigrams with their counts (333,333 types) from Google’s 1 billion token corpus ▪ Selection mechanism ▪ Candidate model ▪ Language model ▪ Error model #ChennaiRains Split-1 Split-2 Split-3 c1- C hennaiR ains c2- Chen na iRains c3- Ch ennaiRai ns c4- Chennai Rains Hashtag Segmentation
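A minimal sketch of Norvig-style segmentation with memoized recursion; the unigram counts below are made up for illustration, whereas the real method uses the 333,333-type unigram counts:

```python
from functools import lru_cache
import math

# Tiny illustrative unigram counts; these values are invented for the sketch.
COUNTS = {"chennai": 500, "rains": 800, "rain": 900, "chen": 50, "nai": 10}
TOTAL = 1_000_000

def pword(w):
    if w in COUNTS:
        return COUNTS[w] / TOTAL
    # Unseen words get a length-penalized estimate, as in Norvig's method
    return 10.0 / (TOTAL * 10 ** len(w))

@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable word sequence for `text` (language model
    = product of unigram probabilities, maximized over all first-word splits)."""
    if not text:
        return ()
    candidates = (
        (first,) + segment(rest)
        for first, rest in ((text[:i], text[i:]) for i in range(1, len(text) + 1))
    )
    return max(candidates, key=lambda ws: sum(math.log(pword(w)) for w in ws))

segment("chennairains")
```

With real corpus counts the same code also separates tags like #HoustonFlood written in lowercase.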
  39. my house ????? boundingbox.klokantech.com Location Names Categorization ▪ Previous works

    categorize location names based on their types (e.g., building, street) [Matsuda et al., 2015; Gelernter and Balaji, 2013] ▪ We categorize the location names based on their geo-coordinates and meaning into: □ In-area Location Names (inLoc) □ Location names that are inside the area of interest. □ Out-area Location Names (outLoc) □ Location names that are outside the area of interest. □ Ambiguous Location Names (ambLoc) □ Ambiguous in nature, need more context or background data for disambiguation
  40. T1 ambLoc 33 37 homes T2 inLoc 42 53 Ville

    Platte T3 inLoc 57 58 la >1 Annotating Tweets Using Brat (brat.nlplab.org) ▪ 1500 from the 2015 Chennai flood □ 75% inLoc, 4% outLoc, and 21% ambLoc ▪ 1500 from the 2016 Louisiana flood □ 66% inLoc, 13% outLoc, and 22% ambLoc ▪ 1500 from the 2016 Houston flood □ 66% inLoc, 7% outLoc, and 27% ambLoc ▪ 1,000 tweets were used for development and the rest for testing >2 ‘16 Workshop on Noisy User-generated Text (W-NUT) noisy-text.github.io/2016/ > 7245 annotated tweets ▪ Kept company, facility, geo-loc and removed the rest ▪ Used to train two supervised tools for comparison Annotations For Benchmarking
  41. • For partial and overlapping matches: ◦ “The louisiana” instead

    of “Louisiana” ◦ “Avadi Road” instead of “New Avadi Road” • Count all inLoc hits and misses • Ignore all ambLocs and outLocs • For LNEx extraction: ◦ ambLocs and outLocs are all counted as false positives (FPs). Evaluation Strategy
  42. • Led to only a 1% increase in recall • F-Scores

    decreased by 2% on average due to the influence of increased false positives on precision. Spell Checking
  43. • Performance of breaking on hashtags with location names only.

    • Some performance was lost due to the statistical nature of the method: ◦ #LAWX was broken into LAW and X ◦ since the product of their probabilities is higher than that of other combinations Hashtag Breaking
  44. Improvement after using more training data Difference between testing on

    same training data and data never seen before Dataset 1: Chennai Dataset 2: Louisiana Dataset 3: Houston Dataset 4: W-NUT ‘16 Scores of Supervised/Trained Tools
  45. Location Disambiguation -Future Work • LN alone ◦ I am

    at Hayatt Hotel ▪ which Hayatt? ▪ disambiguation resolution should be defined (i.e., city, state, country) • LN with others ◦ no relations ▪ Chase Bank and American Bank are closed today. ▪ facts should be propagated to both. >>> relation extraction ◦ with relations ▪ Chase Bank in Main Street is closed. • only one geo-point ◦ answers questions like the focus of the location mention.. the focus of the tweet ▪ Chase Bank and American Bank in Rochester are closed. • how many are there in Rochester? ▪ Chase Bank in First Street and American Bank in Second Street are closed.
  46. User profile Tweet Metadata Geo Coordinates Tweet Text User Connectivity

    Quick Recap: Location Extraction in Twitter Profile Description Location Field Timezone Place
  47. User Location Prediction Problem • Long-term residential address prediction

    • Applications: ❖ Local content recommendation ❖ Location-based advertising ❖ Public health monitoring ❖ Public opinion polling ❖ Predicting group behaviors and modeling populations (observing linguistic differences) ❖ Social media supporting local communities 1 percent of Twitter posts are associated with a geolocation http://geoawesomeness.com/location-based-ads-for-mobile-and-desktop-are-starting-to-be-more-contextually-aware/
  48. Public Opinion Polling Wiki page: Knoesis Research: US 2016 Election

    prediction Paper: Challenges of Sentiment Analysis for Dynamic Events Blog: Link to Blog
  49. Public Health Monitoring NIH R01 Funded Project on: Modeling Social

    Behavior for Health Care Utilization in Depression NIH R01 Grant #: MH105384-01A1 Project Link: rebrand.ly/depressionProject
  50. Ground truth home locations • Self-declared profiles ◦ Noisy ◦

    Absent ▪ Example: I am living in hell • The most frequent city among a user’s geo-tags • The first valid geotag, or the median (centroid) of the geo-tags
  51. Evaluation Metrics • Common metrics can be categorized into: ◦

    Distance-based: represented by geographical coordinates ◦ Token-based: discrete symbols (city, POI) • Distance-based: ◦ Error Distance: Euclidean distance between the ground-truth and predicted coordinates ◦ For the whole user corpus: ▪ Mean ED ▪ Median ED (less sensitive to outliers) ▪ Mean squared error (MSE): the square of the error distance ▪ ACC@d: define a threshold d, e.g. 100 miles, within which a prediction counts as “tolerably correct”
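The distance-based metrics can be sketched as follows. Error distance is computed here as a great-circle (haversine) distance, a common realization of the distance between true and predicted coordinates, since literal Euclidean distance on raw lat/lon is distorted away from the equator:

```python
import math
import statistics

def error_distance_miles(lat1, lon1, lat2, lon2):
    # Great-circle (haversine) distance in miles
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def summarize(errors, d=100):
    # Corpus-level metrics from per-user error distances
    return {
        "mean_ed": statistics.mean(errors),
        "median_ed": statistics.median(errors),          # less sensitive to outliers
        "mse": statistics.mean(e ** 2 for e in errors),
        "acc@d": sum(e <= d for e in errors) / len(errors),  # fraction within d miles
    }
```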
  52. Evaluation Metrics Token based Metrics: • Discrete symbols • County

    • City • POI Accuracy • Accuracy@k: predict a ranked list of locations instead of a single one; the prediction is considered correct if the true location appears in the top k ◦ Precision ◦ Recall ◦ F1
  53. Home location Prediction • Tweet content ◦ Word Centric: Estimate

    the probability of a location given a word, P(l|w) • Location Centric: ◦ Estimate the probability of generating a tweet d at a given location, P(d|l) Twitter network
  54. Location Centric Methods • Distribution of words over locations ◦

    Maximum Likelihood estimation[1] ◦ Language modeling[2] ◦ Topic modeling ▪ Latent topics with geographical regions[3] • User statistics for local words as features and candidate locations as labels Intuition: • Language in social media is geographically biased • NLP techniques: ◦ Tf-idf vectors ◦ Regularization penalty ◦ Dictionary learning techniques ◦ Tweeting behavior (volume of tweets per time unit) ◦ Detecting travelers https://thumbs.dreamstime.com/b/business-finance-word-cloud-tags-world-map-shape-industry-economics-words-vector-currency-market-money-devaluation-exchange-85258127.jpg
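A word-centric maximum-likelihood predictor in this spirit can be sketched as a tiny naive-Bayes model over cities; the toy data and the add-alpha smoothing are illustrative, not any of the cited systems:

```python
from collections import Counter, defaultdict
import math

def train(examples):
    """examples: iterable of (city, tokens) pairs."""
    word_counts, city_counts = defaultdict(Counter), Counter()
    for city, toks in examples:
        city_counts[city] += 1
        word_counts[city].update(toks)
    return word_counts, city_counts

def predict(tokens, word_counts, city_counts, alpha=1.0):
    # Score each city by log P(l) + sum log P(w | l) with add-alpha smoothing
    vocab = {w for c in word_counts.values() for w in c}
    n_users = sum(city_counts.values())
    best, best_score = None, float("-inf")
    for city, n in city_counts.items():
        total = sum(word_counts[city].values())
        score = math.log(n / n_users) + sum(
            math.log((word_counts[city][w] + alpha) / (total + alpha * len(vocab)))
            for w in tokens)
        if score > best_score:
            best, best_score = city, score
    return best
```

Geographically biased words ("bayou", "marina") carry most of the signal, which is the intuition on the slide.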
  55. Twitris 3.0: • Twitris: a Semantic Web application for understanding

    social perceptions by Semantics-based processing of massive amounts of event-centric data on social media. • Twitris addresses challenges in large scale processing of social data, focusing on multi-dimensional analysis of spatio-temporal-thematic, people-content-network and sentiment-emotion-intent facets. More on: http://wiki.knoesis.org/index.php/Twitris
  56. How Twitris 3.0 Handles Location Inference in Real Time

    • For each tweet, get the user profile location (map of text to location) • Use the Google geocoding service to try to match it. ◦ If there is a match: ▪ Get the timezone of the tweet. ▪ Get the timezone of the location returned from Google. ◦ If they match, add it to the user profile location. ▪ Geo-queries on the coordinates from Google achieve the most granular results possible ◦ Not perfect, but the Google geocoding service helps filter out noisy values such as “middle of nowhere” or “your mom's place”
  57. Social media Geolocation prediction from different social media platforms: •

    Flickr[1] • Facebook[2] • Twitter[3] • WikiPedia[4] Twitter advantages: • Timezone • Self-declared location • Capturing user network https://nickdoucette.wordpress.com/
  58. Network-based Important parameters: • What constitutes a relationship in Twitter

    • Ground truth location Relationships in social media are a strong indicator of spatial proximity. https://www.outsourcing-pharma.com/Article/2015/02/19/Parexel-expands-its-clinical-trial-site-network-to-speed-drug-development https://s3.amazonaws.com/tjn-blog-images/wp-content/uploads/2016/07/20003911/manage-work-relationship-810x540.jpg
  59. Network-based In 2010, seminal research by Facebook: the likelihood

    of friendship decreases with distance. Geography Friendship https://st2.depositphotos.com/1017228/8219/i/950/depositphotos_82194388-stock-photo-friends-hands-together.jpg photos.gograph.com/thumbs/CSP/CSP993/k14300461.jpg https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/World_Map_FIFA.svg/2000px-World_Map_FIFA.svg.png They derive users’ locations from their friends’ geography and conclude that this approach is more accurate than traditional IP-based mapping.
  60. Network-based Cont, Ground truth: 100 million self-reported address in Facebook.

    • 60% parsable to longitude and latitude • 3.5M users with known home addresses • 2.9M users with at least one friend with a valid location • Their network has 30.6M edges between users with valid addresses Data validation: Facebook penetration based on self-reported addresses vs. Facebook penetration based on IP geolocation
  61. Density Distribution: Population Vs Geography • Dividing US into cells

    of 1/100 degree square. • Count the number of FB users in each region. • In low-density areas, the distribution decreases with an exponent of -1.37 • In high-density areas, it decreases with an exponent of -3.07 • The transition begins at 50 people per square mile Distribution of counts: the low-density region is a straight line; the high-density region falls much more sharply. (Power-law distribution)
  62. Density Distribution: Population Vs Geography 2.bp.blogspot.com/_VMAt17gvKp8/TEUAD3ebxQI/AAAAAAAAA1A/riVGzdSTavM/s 1600/confuse.gif Distance and friendship

    are connected to population density. • Example: imagine you live in Manhattan with 10,000 people living within a single block, and you do not know most of them. • Say you know five out of those ten thousand people within 1 mile; your probability of knowing any given individual is 0.0005. • In a small town, however, there are only 1,000 people within 1 mile of you. • If you still know 5 of them, your probability is 0.005
  63. Density Distribution: Population Vs Geography • Dividing population into 3

    groups based on the population density where they live: low, medium, high • Average number of people living x miles away • For instance, for high density the average is higher in general. • For small distances, the curves increase linearly. • After that, population density falls off, and the number of people also falls • Why? About 50 miles from the center we are in less dense areas (small towns).
  64. Friendship vs Distance • The probability of friendship goes down

    with distance. • Compute the distance between all pairs of individuals with known addresses. • Bucket by intervals of 0.1 miles to compute the total number of pairs. • Count the pairs for which an edge is present. • a(b+x)^-1: an inverse relationship • Moreover, if you are in a high-density area you are less likely to know a nearby individual. • At 50 miles, the probability of knowing someone is no longer dependent on density
  65. Prediction framework How do we predict the location of people who have

    not provided their information? • Naive approach: ◦ Take the mean of one's friends’ locations ▪ Does not work for people who live on the coast. bocawatch.org/wp-content/uploads/2016/10/careerconfusion1-e1417093044460.png • Approach: ◦ For each friend v of u whose location l_v is known, compute the probability that the edge forms. ◦ Multiplying all these probabilities together gives the likelihood of all edges given the distances between u and its friends.
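The friend-likelihood idea can be sketched as follows. The a*(b + x)^-c form follows the friendship-vs-distance curve above, but the parameter values and helper names here are illustrative (not the paper's fitted values), and the full model also multiplies in (1 - p) terms for absent edges, which this sketch omits:

```python
import math

# Illustrative parameters for p(friendship | distance x) = A * (B + x)^-C
A, B, C = 0.0019, 0.196, 1.05

def edge_prob(dist_miles):
    return min(1.0, A * (B + dist_miles) ** -C)

def log_likelihood(candidate, friend_locs, dist):
    # dist(candidate, friend) returns miles, e.g. a haversine distance
    return sum(math.log(edge_prob(dist(candidate, f))) for f in friend_locs)

def best_location(candidates, friend_locs, dist):
    # Pick the candidate location maximizing the edge-probability product
    return max(candidates, key=lambda c: log_likelihood(c, friend_locs, dist))
```

Because log edge probability falls with distance, the maximizer sits near the mass of located friends, without the coastline failure mode of a plain mean.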
  66. Network-based Cont. Inference: given the geographic distance between users, what

    is the likelihood of observing a relationship between the 2 users? • P(friendship | distance) More on: Backstrom, L., Sun, E., & Marlow, C. (2010, April). Find me if you can: improving geographical prediction with social and spatial proximity. In Proceedings of the 19th international conference on World Wide Web (pp. 61-70). ACM.
  67. Network-based Many attempts to improve this idea: • Adopt the

    previous approach to glean location from Twitter. ◦ Challenges: ▪ User information on Twitter is very noisy compared to Facebook[cite] ▪ Social relationships on Twitter can be grouped into multiple roles[cite] This motivated researchers to classify users’ Twitter relationships.
  68. Network-based • Users with X number of GPS posts selected.

    • The median latitude/longitude is selected as their location • Separating relationships into different partitions: location information in the partition with the most-predictive relationships will dominate the likelihood. In summary, select the more informative relationships.
  69. Network-based • Different weighting strategies: ◦ Which of a user’s

    friends are likely to be most predictive of their location? For example, user a is considered a friend of user b if a is mentioned by b at least twice in b’s posts. • Given a user, friends are weighted based on their social tightness.
  70. Network-based Intuition: Some users are more informative for predicting the

    locations of the neighbors. • First assign users a random location • Iteratively update the location of the users employing already known locations Liu, Y., Kliman-Silver, C., & Mislove, A. (2014). The Tweets They Are a-Changin: Evolution of Twitter Users and Behavior. ICWSM, 30, 5-314.
  71. Label Propagation • Label propagation - Given a set of

    labels, extend the labels to all nodes in a graph. • Applications: ◦ Network classification ◦ Community detection ◦ Structured inference • The underlying assumption is that edges carry a notion of similarity • Social networks often show assortative mixing
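A minimal label-propagation sketch on a toy graph, with known labels clamped each iteration (the function name, the toy graph, and the iteration count are illustrative):

```python
import numpy as np

def label_propagation(A, seeds, n_classes, iters=50):
    """A: adjacency matrix; seeds: {node: class} for labeled nodes."""
    n = A.shape[0]
    Y = np.zeros((n, n_classes))
    for node, c in seeds.items():
        Y[node, c] = 1.0
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    P = A / deg  # row-normalized: each node averages its neighbors' labels
    for _ in range(iters):
        Y = P @ Y
        for node, c in seeds.items():  # clamp the known labels
            Y[node] = 0.0
            Y[node, c] = 1.0
    return Y.argmax(axis=1)

# Path graph 0-1-2-3 with the endpoints labeled: labels flow inward.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
label_propagation(A, {0: 0, 3: 1}, n_classes=2)
```

With home cities as classes and @mention edges, this is the propagation step used by the hybrid models later in the deck.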
  72. Label Propagation • What is Assortativity? ◦ Measure volume of

    interactions among nodes of like class. • Social networks show assortative mixing: • Nodes with similar characteristics are connected to each other (homophily) • Gender • Political preferences: Democrat, Republican • Age
  73. Network Classification • Given graph nodes V = V_l

    ∪ V_u: ◦ Nodes V_l have labels Y_l ◦ Nodes V_u do not have any labels; we need to find Y_u Can be modeled for both regression and classification: • Binary • Multi-class • Real values Features: • Computed for every node (local node features), e.g. gender, age • Link features available (labels from neighbors, neighbors’ features, node degrees, connectivity patterns)
  74. Network Classification • Graph-based semi-supervised learning • Given partially labeled

    dataset X = X_l ∪ X_u: • a small set of labeled data (X_l, Y_l) • a large set of unlabeled data X_u Transductive learning: • Fit a model that predicts labels Y_u for the unlabeled input X_u. Transductive vs. Inductive learning: take advantage of the patterns (structure) of the data. The label propagation idea is very similar to the idea of transductive learning.
  75. Random Walk It can be explained with continuous diffusions along

    the network. ◦ Information moves through the network ◦ Rumors ◦ Energy ◦ Epidemiological processes Let each node i of a network hold φ_i(0) units of energy at time 0. The goal is to find how much energy is present at node i after time t, for all i in the network. With β(φ_i − φ_j) dt as the energy going from i ⇒ j per small time dt, summing over neighbors gives dφ_i/dt = −β Σ_j A_ij (φ_i − φ_j), i.e., the differential equation dφ/dt = −βLφ: ◦ Solution: ▪ find L = D − A ▪ compute the eigenvalues {λ_i} of L ▪ solve in the eigenbasis of L
  76. Random Walk • A walker carrying energy moving from one

    node to another. ◦ How can we analyze which way it will go? Markov Chain: a directed network where edges are annotated with probabilities such that P_ij is the probability on the edge from i to j. • Probability Transition Matrix Markov process: the state at time t depends only on the state at t−1 Chapman-Kolmogorov Equation: given the transition matrix, we can calculate the probability of being in a specific state.
  77. Random Walk Let's consider a random walk with an absorbing state (trap)

    Intuitive solution: count how many times we end up trapped in a green node and how many times in a blue node, then decide. P_ij: the probability that you start from node i and end up in node j. y_i[c] = a probability distribution over labels
  78. Random Walk • Random Walk Matrix: P = D^-1 A

    If we move from node 3, most likely we are trapped at blue. If we move from node 5, we probably end up at red. The question is: what happens when we raise P to the n-th power?
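A quick numerical check of P = D^-1 A on a toy graph (the adjacency matrix is illustrative, not the slide's figure):

```python
import numpy as np

# Random-walk transition matrix P = D^-1 A on a small toy graph, plus a
# check that rows stay stochastic when P is raised to a power n.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], float)
P = np.diag(1.0 / A.sum(axis=1)) @ A

# P[i, j] is the probability of stepping from i to j, so P^n gives
# n-step walk probabilities; every row of P and P^n sums to 1.
P_n = np.linalg.matrix_power(P, 10)
```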
  79. Random Walk The question is: what happens when

    we raise P to the n-th power? It turns out that P_uu, the unlabeled-to-unlabeled block of the transition matrix, eventually becomes zero when we multiply the transition matrix by itself several times.
  80. Random Walk The whole idea of blocking is to

    obtain something that looks like this: if the matrix is small, we can compute the inverse and get the results. We often need to invert a large sparse matrix • Iterative solution
  81. Hybrid Models Ensemble learning: Hybridize Text-based Network based • Build

    a network by @mentions from tweets, a typical approach for detecting conversations between friends. • Use label propagation to infer user locations. In this manner, they simultaneously capture the notion of location homophily as well as powerful textual features such as tf-idf vectorization
  82. Hybrid Models: Ensemble learning Observation: Bi-directional mentions are too rare

    to be useful • We can treat uni-directional mentions as undirected edges. • Create a weighted undirected graph, weighted by the number of @mentions in tweets by either user. Then run the label propagation algorithm to update the location of each non-training node. Rahimi, A., Vu, D., Cohn, T., & Baldwin, T. (2015). Exploiting text and network context for geolocation of social media users. arXiv preprint arXiv:1506.04803.
  83. Hybrid approach: Deep learning based State of the art: •

    Propose a neural network model that learns unified text, metadata, and user network representations with an attention mechanism. They show their model outperforms the previous ensemble approach on two public datasets.
  84. Hybrid approach: Deep learning based Recurrent Neural Networks (RNNs) show

    great promise in many NLP tasks • Language modeling • Generating text • Predicting the next word in a sentence given its previous words ◦ E.g., for a sequence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word. • Training: ◦ Backpropagation Through Time (BPTT) Sequential Information
  85. Hybrid approach: Deep learning based Drawback of Recurrent Neural Networks:

    ◦ Vanishing gradient problem prevents RNN from learning long-term dependencies. ▪ Solution: LSTM • Gating Mechanism Big Picture
  86. Hybrid Model (RNN with attention) Cite: Miura, Yasuhide, et al.

    "Unifying text, metadata, and user network representations with a neural network for geolocation prediction." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2017.
  87. We acknowledge the support from the National Science Foundation (NSF)

    award: EAR 1520870: Hazards SEES: Social and Physical Sensing Enabled Decision Support for Disaster Management and Response. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF. For further questions and to get slides: [email protected] @halolimat Project Website: https://rebrand.ly/HazardSEES Hussein Al-Olimat TK. Prasad Amir Yazdavar Amit Sheth