
LOCATION EXTRACTION AND GEOREFERENCING IN SOCIAL MEDIA


The ability to extract or estimate location in social media content, and to perform location-centric analyses, offers unique and wide-ranging applications. Examples include disaster management, demographic and socio-cultural studies, and spatiotemporal tracking. For instance, location information is critical to reach and rescue disaster-stricken people and to dispatch humanitarian assistance. Consequently, there is a pressing need to better understand how people express location information explicitly and implicitly on social media and, in general, to develop efficient techniques for geospatial computing that span all information channels. Additionally, location information enables a variety of individual-level and community-level analyses.

Location extraction and georeferencing methods leverage user-generated content (textual as well as multimedia data, e.g., images and videos) and users’ connectivity (social network analysis). The applications of these methods range from detecting communities to localizing individual texts and inferring users’ physical locations. However, due to the challenges posed by social media data, some techniques do not work reliably on its informal and ill-formed texts, or scale poorly.

In this tutorial, we present the general problem of georeferencing and location extraction, summarize the state-of-the-art research, discuss challenges, and provide an overview of our recent research accomplishments in the context of disaster management.

Hussein S. Al-Olimat

May 20, 2018


Transcript

  1. rebrand.ly/HazardSEES knoesis.org/resources/geotutorial LOCATION EXTRACTION AND GEOREFERENCING IN SOCIAL MEDIA: CHALLENGES,

    TECHNIQUES, AND APPLICATIONS Hussein S. Al-Olimat, Amir Yazdavar, Krishnaprasad Thirunarayan, and Amit Sheth
  2. Part 1: Outline • Context-aware Computing and some Introductory Material

    • Twitter Location Features • Semantic Ambiguity • Geographic Information Retrieval (GIR) • Place Semantics • Location Extraction from Unstructured Texts • Location Name Extractor (LNEx) • Location Disambiguation (brief introduction)
  3. Context-aware Computing http://www.snw-arts.dk/content/inside-it-context-aware-computing/ • Location-aware apps and services • Tailoring

    computing to specific situations for specific locations • Influence decision-making • e.g., Landslide models taking into consideration the land cover and the rainfall amount.
  4. THE SEMANTIC TRIANGLE Term “Buffalo” Concept Referent Stands for Refers

    to Symbolizes 4th century BC TRIANGLE OF REFERENCE
  5. THE SEMANTIC TRIANGLE Term “Buffalo” Concept Referent Stands for Refers

    to Symbolizes 4th century BC TRIANGLE OF REFERENCE The role of context
  6. • Semantics: “The linguistic and philosophical study of meaning.” ~

    Wikipedia • Syntactics: “a branch of semiotics that deals with the formal relations between signs or expressions in abstraction from their signification and their interpreters.” ~ merriam-webster.com “Standing beside the Statue of Liberty” vs. “Standing beside the ” • Tying Semantics and Syntactics is the outcome of social agreements on the use of certain terms and symbols Knowledge Representation - Semantics and Syntactics
  7. The Geographic coordinate system • “A geographic coordinate system is

    a coordinate system used in geography that enables every location on Earth to be specified by a set of numbers, letters or symbols.” ~ Wikipedia • World Geodetic System (WGS-84) is the standard coordinate system. Hyatt Regency Rochester 125 E Main St, Rochester, NY 14604 43.156449, -77.608364 wiki/File:Latitude_and_Longitude_of_the_Earth.svg
  8. Some Definitions ▪ Geocoding vs. Geoparsing: □ Georeferencing (aka. Geocoding):

    • “Finding the geographical coordinates of a place name or street address (Geocoding)” ~ wiki/Georeferencing • Geocoding usually works with unambiguous structured location references (e.g., addresses such as “125 E Main St, Rochester, NY 14604”). □ Geoparsing: handles ambiguous references in unstructured texts (e.g., “We are at Hyatt Hotel”).
  9. Some Definitions ▪ Toponyms: “Toponym is the general name for

    any place or geographical entity.” ~Wikipedia ▪ Toponym Resolution: associating location names with geographical footprints from a coordinate system such as WGS-84. ▪ Location Name Linking: the part of the toponym resolution process that goes one step further by disambiguating linked location mentions, which is necessary due to the semantic ambiguity of location mentions.
  10. SEMANTIC AMBIGUITY wiki/Buffalo Sense Disambiguation GEO/GEO GEO/NON-GEO ◦ Buffalo (footwear)

    ◦ Buffalo (card game) ◦ Buffalo (band) ◦ Buffalo, Victoria, Australia ◦ Buffalo, Alberta, Canada ◦ Buffalo, USA (24 cities).
  11. The Role of Context Context for GEO/GEO Context for GEO/NON-GEO

    Textual Context Bounding Boxes Background Knowledge I am from Jordan vs. I love watching Jordan playing vs. I love watching Jordan playing with the Bulls • Rochester, NY • 5 miles from 43.156449, -77.608364
  12. ▪ Disaster Management (Response and Recovery) □ Road Closures or

    Evacuations □ Disaster Relief (shelters, food, and donations) ▪ Provide a system for Disaster Assistance Centers Road Closures Help Needed Injuries Reported ... Location Information in Social Media
  13. User profile Tweet Metadata Geo Coordinates Tweet Text User Connectivity

    Twitter Location Features - TLF Profile Description Location Field Timezone Place
  14. TLF - User Profile Location User profile Profile Description Location

    Field Timezone free-text field in the user profile string describing the Time Zone this user declares themselves within
  15. TLF - Tweet Metadata place: Indicates that the tweet is

    associated with (but not necessarily originating from) a Place. coordinates: Represents the geographic location of this Tweet as reported by the user or client application. The inner coordinates array is formatted as geoJSON (longitude first, then latitude). geo: Deprecated. Contains the coordinates field content. Tweet Metadata Geo coordinates place
  16. The Search API & Tweets by Place Twitter Search API

    Web Interface Search https://twitter.com/search?q=place:00c60988621e2c71 https://api.twitter.com/1.1/geo/reverse_geocode.json?lat=43.156603&long=-77.608248 Place ID >>> places = api.geo_search(lat='43.156603', long='-77.608248', max_results=10) >>> places[0].id u'00c60988621e2c71' >>> tweets = api.search(q='place:23e921b82040ccd6', count=100)
  17. Geo-based Stream Filtering Targeted Streams Geo-tagged Content api = tweepy.streaming.Stream(auth,

    CustomStreamListener()) api.filter(locations=[-77.70,43.10,-77.53,43.26]) #ChennaiFloods #HoustonFlood #event Wikipedia: File:2016_Louisiana_floods_map_of_parishe s_declared_federal_disaster_areas.png
  18. Data Integration - Example ▪ Sensor Data (Weather) and Twitter

    Data (Georeferenced Tweets) □ Researchers in North Dakota [#] determined that a weather-station coverage radius of 30 miles should not be exceeded, taking into consideration the terrain and land use. □ Readings not covered by any weather station can be interpolated by averaging the few nearest readings. □ Integration can be done using Geohashes, where we convert the area into grids [#] Surface Transportation Weather Research Center. (2009). Analysis of Environmental Sensor Station Deployment Alternatives. Bismarck, ND: North Dakota Department of Transportation.
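As a sketch of the Geohash gridding step above: a minimal standard Geohash encoder in Python (the 6-character precision and the sample coordinates are illustrative choices, not from the cited study).

```python
# Minimal standard Geohash encoder for gridding an area. Tweets and
# weather readings whose hashes share a prefix fall in the same grid cell.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    code, ch, bit, even = [], 0, 0, True
    while len(code) < precision:
        # Geohash interleaves longitude and latitude bits, longitude first
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch <<= 1
            rng[1] = mid
        even = not even
        bit += 1
        if bit == 5:  # every 5 bits become one base-32 character
            code.append(BASE32[ch])
            ch, bit = 0, 0
    return "".join(code)

geohash_encode(43.156449, -77.608364, 6)
```

Readings can then be bucketed by hash prefix; truncating the hash coarsens the grid.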
  19. Geographic Information Retrieval From Semi-/Structured Data Sources From Unstructured Data

    Sources [#] Jones, Christopher B., and Ross S. Purves. "Geographical information retrieval." International Journal of Geographical Information Science 22, no. 3 (2008): 219-228. Retrieving from any type of data sources the relevant geographic information based on queries [#] Formal Data representation and annotation GeoJSON or W3C Geospatial Vocabulary IE, ML and NLP techniques Sequence labeling, NER, Gazetteer matching
  20. Geographic Information Retrieval Social Network Analysis For Location Inference wiki/Geotagged_photograph

    Geotagged Pictures User Localization Tweet Localization Tweet Metadata profile field, place field, geocoordinates Tweet Text Location Name Extraction using NLP and ML
  21. Place Semantics / Geosemantics Processing User-location interactions User reviews Thematic

    Spatial Spatial Relationships Temporal In a Knowledge Base/Graph --- Geo-computations Hu, Yingjie. “Geospatial Semantics.” Arxiv.org. 2017 Ye et al. "On the semantic annotation of places in location-based social networks." SIGKDD 2011.
  22. Text Preprocessing: Text Normalization and Filtering (1) Case Folding (2)

    Spell Checking (3) Hashtag Breaking (4) User Mention Removal (5) URL Removal (6) Stopword Removal (optional), plus Tokenization. Example: @iscram2018_in_Hyatt_hotl_at_#EastMainStr_http://.. gives Hyatt → hyatt, hotl → hotel, #EastMainStr → east main str, and removes @iscram2018, http://.., and (optionally) the stopwords “in” and “at”
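The preprocessing steps above can be sketched as a small pipeline. The regexes, the ordering, and the tiny stopword set below are illustrative, not the tutorial's actual implementation; spell checking and full hashtag breaking are separate steps covered on their own slides.

```python
import re

STOPWORDS = {"in", "at", "the", "a"}  # illustrative subset

def preprocess(text, remove_stopwords=False):
    text = text.replace("_", " ")               # undo the underscore joining in this example
    text = re.sub(r"http\S+", " ", text)        # (5) URL removal
    text = re.sub(r"@\w+", " ", text)           # (4) user-mention removal
    text = re.sub(r"#(\w+)", r"\1", text)       # strip the hash character (breaking comes later)
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # (1) case folding + tokenization
    if remove_stopwords:                        # (6) optional stopword removal
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

preprocess("@iscram2018_in_Hyatt_hotl_at_#EastMainStr_http://..")
```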
  23. Hashtag Breaking #HoustonFlood Houston Flood p(Houston & Flood) > p(Houst

    & onFlood) Camel-case Subsequence Extraction Statistical Hash Character Removal HoustonFlood Houston Flood using a word list
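The camel-case subsequence extraction step can be sketched with a single regular expression (a minimal sketch; the statistical method handles all-lowercase tags):

```python
import re

# Split a hashtag on case boundaries and digit runs after removing
# the hash character; all-lowercase tags fall through unsplit and need
# the statistical word-segmentation fallback instead.
def split_camel(hashtag):
    return re.findall(r"[A-Z][a-z]+|[A-Z]+(?![a-z])|[a-z]+|\d+", hashtag.lstrip("#"))

split_camel("#HoustonFlood")
```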
  24. Location Name Extraction from Texts Rule-based vs. Feature-based Gazetteer and

    N-gram Matching/Search Machine Learning Supervised Techniques Semi-supervised Techniques [Middleton et al., 2014; Malmasi and Dras, 2015; Sultanik and Fink, 2012; Li et al., 2014; Gelernter and Zhang, 2013] [Lingad et al., 2013; Yin et al., 2014; Weissenbacher et al., 2015; Han et al., 2014; Khan et al., 2013; Bontcheva et al., 2013; Gelernter and Zhang, 2013] Distant Supervision Co-training
  25. Gazetteer & N-gram Matching/Search Hyatt hotel east main st Chunking

    & Noun Phrase Extraction Locative Expression Extraction (Syntactic Heuristics) 1. east 2. main 3. st 4. east main 5. main st 6. east main st east main st Using lexical cues: Distance & Direction Markers & Prepositions … 50 km south of LOCATION … … across from LOCATION … … away from LOCATION … ... live in LOCATION … n-gram enumeration Location & POI category matching (Semantic Heuristics) Using Regex Shallow Parser: Part-of-speech (POS) tagging + Rules e.g., [Liu et al., 2014] [Malmasi and Dras, 2015] [Gelernter and Balaji, 2013] [Malmasi and Dras, 2015]
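The n-gram enumeration step above is straightforward to sketch; gazetteer and POI-category matching would then filter this candidate list:

```python
# Enumerate all n-grams of a locative phrase as candidate toponyms,
# mirroring the "east main st" enumeration above. Matching each candidate
# against a gazetteer is a separate step.
def ngrams(tokens, max_n=None):
    max_n = max_n or len(tokens)
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

ngrams(["east", "main", "st"])
```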
  26. Training Data @iscram2018 O in O Hyatt B-Loc hotel I-Loc

    at O East B-Loc Main I-Loc Str I-Loc http://.. O @iscram2018 in <START:LOC> Hyatt hotel <END> at <START:LOC> East Main Str <END> http://.. Standoff format <file.txt> @iscram2018 in Hyatt hotel at East Main Str http://.. <file.ann> T1 Location 15 26 Hyatt hotel T2 Location 30 43 East Main Str OpenNLP format CONLL format
  27. Feature Engineering Shape Contextual Word POS Category Gazetteer Preposition Feature

    Set Hyatt Hotel in, hyatt, hotel hyatt, hotel, at @iscram2018 in at ... IN, NNP, NNP low, caps, caps caps, caps, low false true true true true true NNP, NNP, IN in the window of 2 before? in gazetteer? is location category?
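As a sketch of this feature set (the feature names, the window of 2, and the preposition list are illustrative), per-token features might look like:

```python
def token_features(tokens, i, gazetteer, loc_categories=()):
    # Sketch of the per-token features above: word shape, a contextual
    # window of 2 before, preposition cues, and gazetteer lookups.
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "shape": "caps" if tok[:1].isupper() else "low",
        "prev_window": [t.lower() for t in tokens[max(0, i - 2):i]],
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "preposition_before": any(t.lower() in {"in", "at", "on", "near"}
                                  for t in tokens[max(0, i - 2):i]),
        "in_gazetteer": tok.lower() in gazetteer,
        "is_location_category": tok.lower() in loc_categories,
    }

token_features(["in", "Hyatt", "hotel"], 1, {"hyatt"}, {"hotel", "street"})
```

A sequence labeler (e.g., a CRF) would consume one such dictionary per token.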
  28. Challenging examples and types of Location Mentions: 1. Normal LN

    2. Shape Problem 3. Ambiguous Location Name 4. LN in Hashtag 5. LN with Abbreviation 6. Abbreviated LN 7. LN Nickname 8. LN Contraction 9. Numeric LN 10. Misspelled LN 11. Spacing problem 12. Highly Ambiguous LN 13. Missing Punctuation
  29. Problems and Challenges Contractions Abbreviations Ambiguity Nicknames Referring to “Balalok

    Matric Higher Secondary School” as “Balalok School” Referring to “Wright State University” as “WSU” My backyard, My house, Buffalo HTown Mentions in Hashtags #LouisianaFlood Misspellings sou th kr koil street Word Shape Problems west mambalam Ungrammatical Writing Oxford school.west mambalam
  30. Nameheads: theory of “alternate name forms” by John M Carroll

    Shortening Processes Appellation Formation Explicit Metonymy Category Ellipsis Location Ellipsis Referring to “Greater Rochester International Airport” as “The Airport” Referring to “University of Michigan” as “Michigan” Referring to “Balalok Matric Higher Secondary School” as “Balalok School” Dropping specific location references in the location name: RIT - Rochester to RIT
  31. Gazetteers ▪ Significance: □ Help machines to understand the geographic

    meaning of places. □ Model the relations between locations. ▪ Main Components: □ Place Names (N) □ Place Types (T) □ Spatial Footprints (F) ▪ Gazetteer Operations: □ N ➞ F □ N ➞ T □ F (x T) ➞ N Keßler et al. [4]
  32. • Cities • Countries • Street names • Neighborhoods •

    Points of interest • Building names • Organizations • Districts • States Bounding Box https://www.npmjs.com/package/geonames-importer | https://github.com/komoot/photon | https://dbpedia.org/sparql boundingbox.klokantech.com Building Region-specific Gazetteers
  33. OpenStreetMap - Our Choice ▪ Regarded as the Wikipedia of

    maps. ▪ Contains more fine-grained locations than any other resource. ▪ More accurate geo-coordinates in comparison with GeoNames. ▪ It also has a strong volunteer community (such as hotosm.org) that maps thousands of locations during a disaster. http://osmlab.github.io/show-me-the-way/
  34. Lexical Variations ▪ We tried to overcome the lexical variations

    through: □ Using the USPS street suffixes and the English OSM abbreviations dictionaries: □ Apartment → Apt □ Street → Str. □ Avenue → AVE □ Airport → Aprt □ Using the OpenStreetMap gazetteer, which contains the following names: □ English Names □ Default Names □ Alternative Names □ Old Names □ Acronyms (these need more processing and are not directly retrievable most of the time). https://pe.usps.com/text/pub28/28apc_002.htm https://wiki.openstreetmap.org/wiki/Name_finder:Abbreviations#English
  35. Problem: Stratford School appears in the gazetteer as Stratford High School

    Solution: Lexical Stats. using language models built from gazetteers
  36. ▪ Capturing different forms of a toponym (to improve recall)

    □ “Balalok Matric Higher Secondary School” → “Balalok School”. ▪ Recording alternative names (to improve recall) □ “Anna Salai (Mount Road)” → “Anna Salai” and “Mount Road” ▪ Filtering toponym names (to improve recall) □ Break records: “Tamilnadu Housing Board Road , Ayapakkam” ▪ Filtering out very noisy toponyms (to improve precision) https://foursquare.com/ Ball Town in Louisiana Clinton Town in Louisiana Jackson City in Mississippi Gazetteers Preprocessing
  37. Raw Text 1. Filtering 2. Augmentation (Private Road) - -

    Burnt School (historical) Burnt School - Lindale Park (Southbound) Lindale Park - Cars India - Adyar Cars India Adyar Cars India Adyar | Clinton | - - Stratford High School - Stratford School Modern Senior Secondary School (Primary) Modern Senior Secondary School Modern School Modern Senior School Modern Secondary School Solutions: 1. Word lists and rules 2. Skip-grams Gazetteers Preprocessing Auxiliary Content | Ambiguous Location Names | Collocation Contractions
  38. ▪ Break compound words into discrete ones □ In our

    dataset, around 29% of hashtags include location names ▪ Used Peter Norvig’s Word Segmentation method [Norvig, 2009] □ Unigrams with their counts (333,333 types) from Google’s 1 billion token corpus ▪ Selection mechanism ▪ Candidate model ▪ Language model ▪ Error model #ChennaiRains Split-1 Split-2 Split-3 c1- C hennaiR ains c2- Chen na iRains c3- Ch ennaiRai ns c4- Chennai Rains Hashtag Segmentation
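A minimal sketch of Norvig-style segmentation with memoized recursion; the unigram counts below are made up for illustration, whereas the real method uses the 333,333-type unigram counts:

```python
from functools import lru_cache
import math

# Tiny illustrative unigram counts; these values are invented for the sketch.
COUNTS = {"chennai": 500, "rains": 800, "rain": 900, "chen": 50, "nai": 10}
TOTAL = 1_000_000

def pword(w):
    if w in COUNTS:
        return COUNTS[w] / TOTAL
    # Unseen words get a length-penalized estimate, as in Norvig's method
    return 10.0 / (TOTAL * 10 ** len(w))

@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable word sequence for `text` (language model
    = product of unigram probabilities, maximized over all first-word splits)."""
    if not text:
        return ()
    candidates = (
        (first,) + segment(rest)
        for first, rest in ((text[:i], text[i:]) for i in range(1, len(text) + 1))
    )
    return max(candidates, key=lambda ws: sum(math.log(pword(w)) for w in ws))

segment("chennairains")
```

With real corpus counts the same code also separates tags like #HoustonFlood written in lowercase.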
  39. my house ????? boundingbox.klokantech.com Location Names Categorization ▪ Previous works

    categorize location names based on their types (e.g., building, street) [Matsuda et al., 2015; Gelernter and Balaji, 2013] ▪ We categorize the location names based on their geo-coordinates and meaning into: □ In-area Location Names (inLoc) □ Location names that are inside the area of interest. □ Out-area Location Names (outLoc) □ Location names that are outside the area of interest. □ Ambiguous Location Names (ambLoc) □ Ambiguous in nature, need more context or background data for disambiguation
  40. T1 ambLoc 33 37 homes T2 inLoc 42 53 Ville

    Platte T3 inLoc 57 58 la >1 Annotating Tweets Using Brat (brat.nlplab.org) ▪ 1500 from the 2015 Chennai flood □ 75% inLoc, 4% outLoc, and 21% ambLoc ▪ 1500 from the 2016 Louisiana flood □ 66% inLoc, 13% outLoc, and 22% ambLoc ▪ 1500 from the 2016 Houston flood □ 66% inLoc, 7% outLoc, and 27% ambLoc ▪ 1,000 tweets were used for development and the rest for testing >2 ‘16 Workshop on Noisy User-generated Text (W-NUT) noisy-text.github.io/2016/ > 7245 annotated tweets ▪ Kept company, facility, geo-loc and removed the rest ▪ Used to train two supervised tools for comparison Annotations For Benchmarking
  41. • For partial and overlapping matches: ◦ “The louisiana” instead

    of “Louisiana” ◦ “Avadi Road” instead of “New Avadi Road” • Count all inLoc hits and misses • Ignore all ambLocs and outLocs • For LNEx extraction: ◦ ambLocs and outLocs are all counted as false positives (FPs). Evaluation Strategy
  42. • Led to only a 1% increase in recall • F-Scores

    decreased by 2% on average due to the influence of increased false positives on precision. Spell Checking
  43. • Performance of breaking on hashtags with location names only.

    • Some performance was lost due to the statistical nature of the method: ◦ #LAWX was broken into LAW and X ◦ since the product of their probabilities is higher than that of other combinations Hashtag Breaking
  44. Improvement after using more training data Difference between testing on

    same training data and data never seen before Dataset 1: Chennai Dataset 2: Louisiana Dataset 3: Houston Dataset 4: W-NUT ‘16 Scores of Supervised/Trained Tools
  45. Location Disambiguation -Future Work • LN alone ◦ I am

    at Hayatt Hotel ▪ which Hayatt? ▪ disambiguation resolution should be defined (i.e., city, state, country) • LN with others ◦ no relations ▪ Chase Bank and American Bank are closed today. ▪ facts should be propagated to both. >>> relation extraction ◦ with relations ▪ Chase Bank in Main Street is closed. • only one geo-point ◦ answers questions like the focus of the location mention.. the focus of the tweet ▪ Chase Bank and American Bank in Rochester are closed. • how many are there in Rochester? ▪ Chase Bank in First Street and American Bank in Second Street are closed.
  46. User profile Tweet Metadata Geo Coordinates Tweet Text User Connectivity

    Quick Recap: Location Extraction in Twitter Profile Description Location Field Timezone Place
  47. User Location Prediction Problem • Long-term residential address prediction

    • Applications: ❖ Local content recommendation ❖ Location-based advertising ❖ Public health monitoring ❖ Public opinion polling ❖ Predicting group behaviors and modeling populations (observing linguistic differences) ❖ Social media supporting local communities 1 percent of Twitter posts are associated with a geolocation http://geoawesomeness.com/location-based-ads-for-mobile-and-desktop-are-starting-to-be-more-contextually-aware/
  48. Public Opinion Polling Wiki page: Knoesis Research: US 2016 Election

    prediction Paper: Challenges of Sentiment Analysis for Dynamic Events Blog: Link to Blog
  49. Public Health Monitoring NIH R01 Funded Project on: Modeling Social

    Behavior for Health Care Utilization in Depression NIH R01 Grant #: MH105384-01A1 Project Link: rebrand.ly/depressionProject
  50. Ground truth home locations • Self-declared profiles ◦ Noisy ◦

    Absent ▪ Example: I am living in hell • The most frequent city among a user’s geo-tags • The first valid geotag, or the median (centroid) of the geo-tags
  51. Evaluation Metrics • Common metrics can be categorized into: ◦

    Distance-based: represented by geographical coordinates ◦ Token-based: discrete symbols (city, POI) • Distance-based: ◦ Error Distance: Euclidean distance between the ground-truth and predicted coordinates ◦ For the whole user corpus: ▪ Mean ED ▪ Median ED (less sensitive to outliers) ▪ Mean squared error (MSE): the square of the error distance ▪ ACC@d: define a threshold d, e.g. 100 miles, within which a prediction counts as “tolerably correct”
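The distance-based metrics can be sketched as follows. Error distance is computed here as a great-circle (haversine) distance, a common realization of the distance between true and predicted coordinates, since literal Euclidean distance on raw lat/lon is distorted away from the equator:

```python
import math
import statistics

def error_distance_miles(lat1, lon1, lat2, lon2):
    # Great-circle (haversine) distance in miles
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi, dlam = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def summarize(errors, d=100):
    # Corpus-level metrics from per-user error distances
    return {
        "mean_ed": statistics.mean(errors),
        "median_ed": statistics.median(errors),          # less sensitive to outliers
        "mse": statistics.mean(e ** 2 for e in errors),
        "acc@d": sum(e <= d for e in errors) / len(errors),  # fraction within d miles
    }
```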
  52. Evaluation Metrics Token based Metrics: • Discrete symbols • County

    • City • POI Accuracy • Accuracy@k: predict a ranked list of locations instead of a single one; the prediction is considered correct if the true location appears in the top k ◦ Precision ◦ Recall ◦ F1
  53. Home location Prediction • Tweet content ◦ Word Centric: Estimate

    the probability of a location given a word, P(l|w) • Location Centric: ◦ Estimate the probability of generating a tweet d at a given location, P(d|l) Twitter network
  54. Location Centric Methods • Distribution of words over locations ◦

    Maximum Likelihood estimation[1] ◦ Language modeling[2] ◦ Topic modeling ▪ Latent topics with geographical regions[3] • User statistics for local words as features and candidate locations as labels Intuition: • Language in social media is geographically biased • NLP techniques: ◦ Tf-idf vectors ◦ Regularization penalty ◦ Dictionary learning techniques ◦ Tweeting behavior (volume of tweets per time unit) ◦ Detecting travelers https://thumbs.dreamstime.com/b/business-finance-word-cloud-tags-world-map-shape-industry-economics-words-vector-currency-market-money-devaluation-exchange-85258127.jpg
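A word-centric maximum-likelihood predictor in this spirit can be sketched as a tiny naive-Bayes model over cities; the toy data and the add-alpha smoothing are illustrative, not any of the cited systems:

```python
from collections import Counter, defaultdict
import math

def train(examples):
    """examples: iterable of (city, tokens) pairs."""
    word_counts, city_counts = defaultdict(Counter), Counter()
    for city, toks in examples:
        city_counts[city] += 1
        word_counts[city].update(toks)
    return word_counts, city_counts

def predict(tokens, word_counts, city_counts, alpha=1.0):
    # Score each city by log P(l) + sum log P(w | l) with add-alpha smoothing
    vocab = {w for c in word_counts.values() for w in c}
    n_users = sum(city_counts.values())
    best, best_score = None, float("-inf")
    for city, n in city_counts.items():
        total = sum(word_counts[city].values())
        score = math.log(n / n_users) + sum(
            math.log((word_counts[city][w] + alpha) / (total + alpha * len(vocab)))
            for w in tokens)
        if score > best_score:
            best, best_score = city, score
    return best
```

Geographically biased words ("bayou", "marina") carry most of the signal, which is the intuition on the slide.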
  55. Twitris 3.0: • Twitris: a Semantic Web application for understanding

    social perceptions by Semantics-based processing of massive amounts of event-centric data on social media. • Twitris addresses challenges in large scale processing of social data, focusing on multi-dimensional analysis of spatio-temporal-thematic, people-content-network and sentiment-emotion-intent facets. More on: http://wiki.knoesis.org/index.php/Twitris
  56. How Twitris 3.0 Handles Location Inference in Real Time

    • For each tweet, get the user profile location (map of text to location) • Use the Google geocoding service to try to match it. ◦ If there is a match: ▪ Get the timezone of the tweet. ▪ Get the timezone of the location returned from Google. ◦ If they match, add it to the user profile location. ▪ Geo-queries on the coordinates from Google achieve the most granular results possible ◦ Not perfect, but the Google geocoding service helps filter out noisy values such as “middle of nowhere” or “your mom's place”
  57. Social media Geolocation prediction from different social media platforms: •

    Flickr[1] • Facebook[2] • Twitter[3] • WikiPedia[4] Twitter advantages: • Timezone • Self-declared location • Capturing user network https://nickdoucette.wordpress.com/
  58. Network-based Important parameters: • What constitutes a relationship in Twitter

    • Ground truth location Relationships in social media are a strong indicator of spatial proximity. https://www.outsourcing-pharma.com/Article/2015/02/19/Parexel-expands-its-clinical-trial-site-network-to-speed-drug-development https://s3.amazonaws.com/tjn-blog-images/wp-content/uploads/2016/07/20003911/manage-work-relationship-810x540.jpg
  59. Network-based In 2010, seminal research by Facebook: the likelihood

    of friendship decreases with distance. Geography Friendship https://st2.depositphotos.com/1017228/8219/i/950/depositphotos_82194388-stock-photo-friends-hands-together.jpg photos.gograph.com/thumbs/CSP/CSP993/k14300461.jpg https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/World_Map_FIFA.svg/2000px-World_Map_FIFA.svg.png They derive users’ locations from their friends’ geography and conclude that this approach is more accurate than traditional IP-based mapping.
  60. Network-based Cont, Ground truth: 100 million self-reported address in Facebook.

    • 60% parsable to longitude and latitude • 3.5M users with known home addresses • 2.9M users with at least one friend with a valid location • Their network has 30.6M edges between users with valid addresses Data validation: Facebook penetration based on self-reported addresses vs. Facebook penetration based on IP geolocation
  61. Density Distribution: Population Vs Geography • Dividing US into cells

    of 1/100 degree square. • Count the number of FB users in each region. • In low-density areas, the distribution decreases with an exponent of -1.37 • In high-density areas, it decreases with an exponent of -3.07 • The transition begins at 50 people per square mile Distribution of counts: the low-density region is a straight line; the high-density region falls much more sharply. (Power-law distribution)
  62. Density Distribution: Population Vs Geography 2.bp.blogspot.com/_VMAt17gvKp8/TEUAD3ebxQI/AAAAAAAAA1A/riVGzdSTavM/s 1600/confuse.gif Distance and friendship

    are connected to population density. • Example: imagine you live in Manhattan with 10,000 people living within a single block, and you do not know most of them. • Say you know five out of those ten thousand people within 1 mile; your probability of knowing any given individual is 0.0005. • In a small town, however, there are only 1,000 people within 1 mile of you. • If you still know 5 of them, your probability is 0.005
  63. Density Distribution: Population Vs Geography • Dividing population into 3

    groups based on the population density where they live: low, medium, high • Average number of people living x miles away • For instance, for high density the average is higher in general. • For small distances, the curves increase linearly. • After that, population density falls off, and the number of people also falls • Why? About 50 miles from the center we are in less dense areas (small towns).
  64. Friendship vs Distance • The probability of friendship goes down

    with distance. • Compute the distance between all pairs of individuals with known addresses. • Bucket by intervals of 0.1 miles to compute the total number of pairs. • Count the pairs for which an edge is present. • a(b+x)^-1: an inverse relationship • Moreover, if you are in a high-density area you are less likely to know a nearby individual. • At 50 miles, the probability of knowing someone is no longer dependent on density
  65. Prediction framework How do we predict the location of people who have

    not provided their information? • Naive approach: ◦ Take the mean of one's friends’ locations ▪ Does not work for people who live on the coast. bocawatch.org/wp-content/uploads/2016/10/careerconfusion1-e1417093044460.png • Approach: ◦ For each friend v of u whose location l_v is known, compute the probability that the edge forms. ◦ Multiplying all these probabilities together gives the likelihood of all edges given the distances between u and its friends.
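The friend-likelihood idea can be sketched as follows. The a*(b + x)^-c form follows the friendship-vs-distance curve above, but the parameter values and helper names here are illustrative (not the paper's fitted values), and the full model also multiplies in (1 - p) terms for absent edges, which this sketch omits:

```python
import math

# Illustrative parameters for p(friendship | distance x) = A * (B + x)^-C
A, B, C = 0.0019, 0.196, 1.05

def edge_prob(dist_miles):
    return min(1.0, A * (B + dist_miles) ** -C)

def log_likelihood(candidate, friend_locs, dist):
    # dist(candidate, friend) returns miles, e.g. a haversine distance
    return sum(math.log(edge_prob(dist(candidate, f))) for f in friend_locs)

def best_location(candidates, friend_locs, dist):
    # Pick the candidate location maximizing the edge-probability product
    return max(candidates, key=lambda c: log_likelihood(c, friend_locs, dist))
```

Because log edge probability falls with distance, the maximizer sits near the mass of located friends, without the coastline failure mode of a plain mean.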
  66. Network-based Cont. Inference: given the geographic distance between users, what

    is the likelihood of observing a relationship between the 2 users? • P(friendship | distance) More on: Backstrom, L., Sun, E., & Marlow, C. (2010, April). Find me if you can: improving geographical prediction with social and spatial proximity. In Proceedings of the 19th international conference on World Wide Web (pp. 61-70). ACM.
  67. Network-based Many attempts to improve this idea: • Adopt the

    previous approach to glean location from Twitter. ◦ Challenges: ▪ User information on Twitter is very noisy compared to Facebook[cite] ▪ Social relationships on Twitter can be grouped into multiple roles[cite] This motivated researchers to classify users’ Twitter relationships.
  68. Network-based • Users with X number of GPS posts selected.

    • The median latitude/longitude is selected as their location • Separating relationships into different partitions: location information in the partition with the most-predictive relationships will dominate the likelihood. In summary, select the more informative relationships.
  69. Network-based • Different weighting strategies: ◦ Which of a user’s

    friends are likely to be most predictive of their location? For example, user a is considered a friend of user b if a is mentioned by b at least twice in b’s posts. • Given a user, friends are weighted based on their social tightness.
  70. Network-based Intuition: Some users are more informative for predicting the

    locations of the neighbors. • First assign users a random location • Iteratively update the location of the users employing already known locations Liu, Y., Kliman-Silver, C., & Mislove, A. (2014). The Tweets They Are a-Changin: Evolution of Twitter Users and Behavior. ICWSM, 30, 5-314.
  71. Label Propagation • Label propagation - Given a set of

    labels, extend the labels to all nodes in a graph. • Applications: ◦ Network classification ◦ Community detection ◦ Structured inference • The underlying assumption is that edges carry a notion of similarity • Social networks often show assortative mixing
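A minimal label-propagation sketch on a toy graph, with known labels clamped each iteration (the function name, the toy graph, and the iteration count are illustrative):

```python
import numpy as np

def label_propagation(A, seeds, n_classes, iters=50):
    """A: adjacency matrix; seeds: {node: class} for labeled nodes."""
    n = A.shape[0]
    Y = np.zeros((n, n_classes))
    for node, c in seeds.items():
        Y[node, c] = 1.0
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    P = A / deg  # row-normalized: each node averages its neighbors' labels
    for _ in range(iters):
        Y = P @ Y
        for node, c in seeds.items():  # clamp the known labels
            Y[node] = 0.0
            Y[node, c] = 1.0
    return Y.argmax(axis=1)

# Path graph 0-1-2-3 with the endpoints labeled: labels flow inward.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
label_propagation(A, {0: 0, 3: 1}, n_classes=2)
```

With home cities as classes and @mention edges, this is the propagation step used by the hybrid models later in the deck.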
  72. Label Propagation • What is Assortativity? ◦ Measure volume of

    interactions among nodes of like class. • Social networks show assortative mixing: • Nodes with similar characteristics are connected to each other (homophily) • Gender • Political preferences: Democrat, Republican • Age
  73. Network Classification • Given graph nodes V = V_l

    ∪ V_u: ◦ Nodes V_l have labels Y_l ◦ Nodes V_u do not have any labels; we need to find Y_u Can be modeled for both regression and classification: • Binary • Multi-class • Real values Features: • Computed for every node (local node features), e.g. gender, age • Link features available (labels from neighbors, neighbors’ features, node degrees, connectivity patterns)
  74. Network Classification • Graph-based semi-supervised learning • Given partially labeled

    dataset X = X_l ∪ X_u: • a small set of labeled data (X_l, Y_l) • a large set of unlabeled data X_u Transductive learning: • Fit a model that predicts labels Y_u for the unlabeled input X_u. Transductive vs. Inductive learning: take advantage of the patterns (structure) of the data. The label propagation idea is very similar to the idea of transductive learning.
  75. Random Walk It can be explained with continuous diffusions along

    the network. ◦ Information moves through the network ◦ Rumors ◦ Energy ◦ Epidemiological processes Let each node i of a network hold φ_i(0) units of energy at time 0. The goal is to find how much energy is present at node i after time t, for all i in the network. With β(φ_i − φ_j) dt as the energy going from i ⇒ j per small time dt, summing over neighbors gives dφ_i/dt = −β Σ_j A_ij (φ_i − φ_j), i.e., the differential equation dφ/dt = −βLφ: ◦ Solution: ▪ find L = D − A ▪ compute the eigenvalues {λ_i} of L ▪ solve in the eigenbasis of L
  76. Random Walk • A walker carrying energy moving from one

    node to another. ◦ How can we analyze which way it will go? Markov Chain: a directed network where edges are annotated with probabilities such that P_ij is the probability on the edge from i to j. • Probability Transition Matrix Markov process: the state at time t depends only on the state at t−1 Chapman-Kolmogorov Equation: given the transition matrix, we can calculate the probability of being in a specific state.
  77. Random Walk Let's consider a random walk with an absorbing state (trap)

    Intuitive solution: count how many times we end up trapped in a green node and how many times in a blue node, then decide. P_ij: the probability that you start from node i and end up in node j. y_i[c] = a probability distribution over labels
  78. Random Walk • Random Walk Matrix: P = D^-1 A

    If we move from node 3, most likely we are trapped at blue. If we move from node 5, we probably end up at red. The question is: what happens when we raise P to the n-th power?
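A quick numerical check of P = D^-1 A on a toy graph (the adjacency matrix is illustrative, not the slide's figure):

```python
import numpy as np

# Random-walk transition matrix P = D^-1 A on a small toy graph, plus a
# check that rows stay stochastic when P is raised to a power n.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], float)
P = np.diag(1.0 / A.sum(axis=1)) @ A

# P[i, j] is the probability of stepping from i to j, so P^n gives
# n-step walk probabilities; every row of P and P^n sums to 1.
P_n = np.linalg.matrix_power(P, 10)
```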
  79. Random Walk The question is: what happens when

    we raise P to the n-th power? It turns out that P_uu, the unlabeled-to-unlabeled block of the transition matrix, eventually becomes zero when we multiply the transition matrix by itself several times.
  80. Random Walk The whole idea of blocking is to

    obtain something that looks like this: if the matrix is small, we can compute the inverse and get the results. We often need to invert a large sparse matrix • Iterative solution
  81. Hybrid Models Ensemble learning: Hybridize Text-based Network based • Build

    a network by @mentions from tweets, a typical approach for detecting conversations between friends. • Use label propagation to infer user locations. In this manner, they simultaneously capture the notion of location homophily as well as powerful textual features such as tf-idf vectorization
  82. Hybrid Models: Ensemble learning Observation: Bi-directional mentions are too rare

    to be useful • We can treat uni-directional mentions as undirected edges. • Create a weighted undirected graph, weighted by the number of @mentions in tweets by either user. Then run the label propagation algorithm to update the location of each non-training node. Rahimi, A., Vu, D., Cohn, T., & Baldwin, T. (2015). Exploiting text and network context for geolocation of social media users. arXiv preprint arXiv:1506.04803.
  83. Hybrid approach: Deep learning based State of the art: •

    Propose a neural network model that learns unified text, metadata, and user network representations with an attention mechanism. They show their model outperforms the previous ensemble approach on two public datasets.
  84. Hybrid approach: Deep learning based Recurrent Neural Networks (RNNs) show

    great promise in many NLP tasks • Language modeling • Generating text • Predicting the next word in a sentence given its previous words ◦ E.g., for a sequence of 5 words, the network would be unrolled into a 5-layer neural network, one layer for each word. • Training: ◦ Backpropagation Through Time (BPTT) Sequential Information
  85. Hybrid approach: Deep learning based Drawback of Recurrent Neural Networks:

    ◦ Vanishing gradient problem prevents RNN from learning long-term dependencies. ▪ Solution: LSTM • Gating Mechanism Big Picture
  86. Hybrid Model (RNN with attention) Cite: Miura, Yasuhide, et al.

    "Unifying text, metadata, and user network representations with a neural network for geolocation prediction." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2017.
  87. We acknowledge the support from the National Science Foundation (NSF)

    award: EAR 1520870: Hazards SEES: Social and Physical Sensing Enabled Decision Support for Disaster Management and Response. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF. For further questions and to get slides: [email protected] @halolimat Project Website: https://rebrand.ly/HazardSEES Hussein Al-Olimat TK. Prasad Amir Yazdavar Amit Sheth