Sensing the city

Advances in datasets and methods for 'sensing the city'
Robin Lovelace, University of Leeds
9th July, Research Methods Festival, University of Oxford
[email protected] - Slides: robinlovelace.net

What I’m going to talk about
1. Advances in datasets
2. Advances in methods
3. Examples
4. Opportunities for synthesis
5. Conclusion

Part I: Advances in the data
Ongoing and accelerating digital revolution
More data than ever before
Most growth in 'big data'
Or rather 'V data':
• High Volume
• High Velocity
• Highly Variable
• Often Un-Verified
Continuing rush to access this data

Overview of access and rights
[Diagram labels: Availability ->; Right to access/store/use ->; Size/recent/(potential) utility; Direction of movement]
• 2001 o-d flow data (imperfect)
• Twitter API
• Historic tweets (pending Library of Congress action) (£)
• Mobile phone triangulation data (e.g. Telefonica)
• Strava data on running/cycling (£)
• Migration flow data
• Anonymous, non-geo. ind. survey data
• Google Location Services data
• Primary survey data (e.g. Ian Kellar; LAs, Bogota)

Recent developments: Twitter
“a single search of the 21 billion tweets in the fixed 2006-10 archive was taking 24 hours just last year. Twitter acquired Gnip in April, prompting hopes that the archive may be operational in 2014-15, but even so, the archive will only be accessible on-site at the Library in Washington, D.C.”
Source: Poynter.org, 25th June. http://www.poynter.org/latest-news/media-lab/social-media/256811/how-to-do-twitter-research-on-a-shoestring/
“Modeled after the U.S. Consumer Privacy Bill of Rights, Rivers and Lewis outline six guidelines for the ethical use of Twitter data:”
http://phys.org/news/2014-06-scientists-tactics-ethical-twitter.html
Increasing academic interest – data access issues
The rise of “Big Analytics” corporations
http://venturebeat.com/2014/07/02/datasift-works-with-the-united-nations-to-analyze-social-data-for-humanitarian-missions/

Trends
• Public -> private data provision
• Free (samples) -> paid data
• Small -> big
• Pre-processed by provider -> academics pre-process the data (e.g. Sandy Tweets)
• Aggregate -> Individual-level
• Space + time snapshots -> Space-time

Part II: Advances in methods
“Modern computing is now sufficiently powerful to deal with most [urban] models ... models based on individuals are now feasible both in terms of their computation and their representation using new programming languages” (Batty, 2007, p. 5).

New method 1: the radiation model
• Tij: flow from i to j
• Ti: total flow out of i, i.e. the sum of Tij over all j ≠ i
• mi: population of zone i (equiv.: Pi)
• nj: population of destination zone j (equiv.: Wj)
• sij: population within the circle centred on i whose circumference passes through j
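For reference, the symbols above combine as follows in the radiation model as published by Simini et al. (2012). In that paper sij excludes the populations of zones i and j themselves; whether the slide's sij follows the same convention is an assumption.

```latex
T_{ij} = T_i \, \frac{m_i \, n_j}{(m_i + s_{ij})\,(m_i + n_j + s_{ij})},
\qquad
T_i = \sum_{j \neq i} T_{ij}
```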

A visualisation of sij
[Figure: zones i and j, with a circle of radius rij centred on i]
sij = sum(pop %in% circle of radius rij), i.e. the sum of the populations of all black circles
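A minimal Python sketch of how sij and the resulting flow might be computed, assuming zones are given as hypothetical (x, y, population) tuples in planar coordinates and using the radiation formula Tij = Ti * mi * nj / ((mi + sij) * (mi + nj + sij)) from Simini et al. (2012). The function names and data layout are illustrative, not taken from the talk.

```python
import math


def intervening_population(zones, i, j):
    """Sum the population of zones inside the circle centred on zone i
    with radius r_ij, excluding zones i and j themselves (the convention
    used by Simini et al. 2012). `zones` is a list of (x, y, population)."""
    xi, yi, _ = zones[i]
    xj, yj, _ = zones[j]
    r_ij = math.hypot(xj - xi, yj - yi)
    total = 0.0
    for k, (xk, yk, pop) in enumerate(zones):
        if k in (i, j):
            continue
        if math.hypot(xk - xi, yk - yi) <= r_ij:
            total += pop
    return total


def radiation_flow(T_i, m_i, n_j, s_ij):
    """Radiation-model estimate of the flow from zone i to zone j."""
    return T_i * (m_i * n_j) / ((m_i + s_ij) * (m_i + n_j + s_ij))


# Hypothetical toy data: four zones as (x, y, population).
zones = [(0.0, 0.0, 100.0), (3.0, 4.0, 50.0), (1.0, 1.0, 30.0), (10.0, 10.0, 500.0)]
s = intervening_population(zones, 0, 1)
flow = radiation_flow(T_i=10.0, m_i=100.0, n_j=50.0, s_ij=s)
```

Note that the model has no distance-decay parameter to calibrate: distance enters only through which zones fall inside the circle.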

'New' method 2: Bayesian statistics
• Approach to statistics superseding the dominant 'NHST' or “frequentist” paradigm
• You must state expectations – no more unrealistic 'null hypotheses'
• Monte Carlo or numerical solutions move us from prior to posterior probability distributions for each model parameter
• Can also tell us precisely how much more realistic model 1 is than model 2
• New software for handling space-time data (e.g. R packages INLA, spTimer)
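To illustrate the move from prior to posterior by a numerical solution, here is a minimal grid-approximation sketch in Python for a single proportion parameter. The binomial likelihood, the flat default prior and the example counts are all assumptions chosen for illustration; the packages named above (INLA, spTimer) use far more capable methods.

```python
def grid_posterior(successes, trials, prior=None, grid_size=101):
    """Approximate the posterior of a proportion theta on an evenly spaced grid.

    `prior` is an optional function theta -> unnormalised prior density;
    a flat prior is assumed when it is omitted."""
    thetas = [k / (grid_size - 1) for k in range(grid_size)]
    prior_vals = [1.0 if prior is None else prior(t) for t in thetas]
    # Binomial likelihood up to a constant that cancels on normalisation.
    likelihood = [t ** successes * (1 - t) ** (trials - successes) for t in thetas]
    unnormalised = [p * l for p, l in zip(prior_vals, likelihood)]
    total = sum(unnormalised)
    return thetas, [u / total for u in unnormalised]


# Hypothetical data: 3 'successes' in 10 trials.
thetas, posterior = grid_posterior(3, 10)
posterior_mean = sum(t * p for t, p in zip(thetas, posterior))
```

Passing a different `prior` function is how stated expectations would enter the calculation.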

New method 3: Interactive results

Part III: Examples
E.g. 1: 'museum tweets'
445 days (about 15 months) of data.
All 'geotagged' Tweets for Leeds and surroundings.

Preprocessing the Tweets
• Pre-processing
• Filtered out 'museum Tweets' from dataset of 2.8 geo-referenced messages.
• Semantic filters: basically "regex" search terms
• Overall just under 1,000 'museum Tweets' resulted from filters
• Spatial filters: a buffer around each museum with osmar
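The two kinds of filter could be sketched in Python as below. The search terms, the 200 m buffer, the tweet field names and the rule that a tweet passes if either filter matches are all hypothetical choices for illustration. The original work used R (with osmar for the buffers), and its actual terms and thresholds are not given in the slides.

```python
import math
import re

# Hypothetical search terms; the terms used in the study are not listed here.
MUSEUM_PATTERN = re.compile(r"\b(?:museum|gallery|exhibition)s?\b", re.IGNORECASE)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    radius = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def is_museum_tweet(tweet, museums, buffer_m=200.0):
    """Return True if the tweet passes the semantic filter (regex on the text)
    or the spatial filter (within buffer_m metres of any museum)."""
    if MUSEUM_PATTERN.search(tweet["text"]):
        return True
    return any(
        haversine_m(tweet["lat"], tweet["lon"], lat, lon) <= buffer_m
        for lat, lon in museums
    )


# Hypothetical museum location and tweets.
museums = [(53.8000, -1.5500)]
tweets = [
    {"text": "Great day at the Museum!", "lat": 53.9000, "lon": -1.6000},
    {"text": "Coffee time", "lat": 53.8005, "lon": -1.5500},
    {"text": "Stuck in traffic", "lat": 53.9000, "lon": -1.6000},
]
museum_tweets = [t for t in tweets if is_museum_tweet(t, museums)]
```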

Tweets as input into SI model

Calibration
Very simple calibration procedure: reran model for many different beta values
Closest aggregated tweet/model fit selected for different model implementations
Opportunities for Bayesian approaches here
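The rerun-and-compare procedure can be sketched in Python as a simple grid search. The production-constrained model with exponential distance decay and the sum-of-squared-errors fit measure are assumptions made for illustration; the slides do not specify the model form or the fit statistic used.

```python
import math


def si_model(origins, attractiveness, dist, beta):
    """Production-constrained spatial interaction model (assumed form):
    T_ij = O_i * W_j * exp(-beta * d_ij) / sum_k W_k * exp(-beta * d_ik)."""
    flows = []
    for i, o_i in enumerate(origins):
        weights = [w * math.exp(-beta * dist[i][j]) for j, w in enumerate(attractiveness)]
        total = sum(weights)
        flows.append([o_i * w / total for w in weights])
    return flows


def calibrate_beta(origins, attractiveness, dist, observed, betas):
    """Rerun the model for each candidate beta and return the one whose
    flows are closest (smallest sum of squared errors) to the observed,
    aggregated flows."""
    def sse(beta):
        modelled = si_model(origins, attractiveness, dist, beta)
        return sum(
            (m - o) ** 2
            for m_row, o_row in zip(modelled, observed)
            for m, o in zip(m_row, o_row)
        )
    return min(betas, key=sse)


# Hypothetical toy inputs: 2 origin zones, 3 destinations.
origins = [100.0, 50.0]
attractiveness = [1.0, 2.0, 1.5]
dist = [[1.0, 5.0, 10.0], [8.0, 2.0, 4.0]]
observed = si_model(origins, attractiveness, dist, 0.3)  # stand-in for tweet counts
best_beta = calibrate_beta(origins, attractiveness, dist, observed, [k / 10 for k in range(11)])
```

A Bayesian alternative would replace the single best beta with a posterior distribution over beta.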

Example 2: Hurricane Sandy
~20 million tweets, 160,000 of which are geo-located.
10 day worldwide filter on “sandy”.
Purchased for ~£2,000 for project on climate disasters.
Code for filtering the data: https://github.com/Robinlovelace/twitter-sandy/blob/master/R%20Scripts/1-formatting-filtering.R

Part IV: Opportunities
Opportunities of 'Big' data
• New insight into questions previously beyond the reach of surveys
• Diversity, low cost, comprehensive coverage
• High spatial and temporal resolution
• New opportunities for testing
• Need to move away from 'big noise' to processed and verified datasets
• Comparison with official datasets

Opportunities for new methods
• Accuracy
• Simplicity (Masucci et al. 2013)
• Bayesian methods are well-suited to the analysis of large datasets
• Interactive visualisation
• New models and methods need to be tested on 'Big Data'

Risks of Big Data
[Big data is] "a version of cherry-picking that destroys the entire spirit of research and makes the abundance of data extremely harmful to knowledge." (Taleb 2012, 416)

Synthesis
New datasets call for:
• New statistical tools (Bayesian) to deal with uncertainty
• Ways to ingest continual data
• Filtering
• Aggregation
New methods call for:
• More code sharing (e.g. Dennett, 2012)
• Comparative testing
• Ways to input new data streams
• Visualisation of key processes
Convergence in research design?

Conclusion
Opportunities and risks associated with both new datasets and models
Little correspondence between advances in modelling and data sources
New datasets call for new methodologies
Bridging this model-data gap = research priority: thinking behind research at Leeds

Key References
• Lovelace, R., Malleson, N., Harland, K., & Birkin, M. (2014). Geotagged tweets to inform a spatial interaction model: a case study of museums. arXiv preprint arXiv:1403.5118.
• Masucci, A. P., Serras, J., Johansson, A., & Batty, M. (2013). Gravity versus radiation models: On the importance of scale and heterogeneity in commuting flows. Physical Review E, 88(2).
• Simini, F., González, M. C., Maritan, A., & Barabási, A. L. (2012). A universal model for mobility and migration patterns. Nature, 484(7392), 96-100.

Aggregation
Necessary to compare aggregate flow model with individual Tweets
Also vital to 'smooth' the stochasticity inherent to VGI
In reality: LOTS more data needed for reliable results
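As a rough illustration of the aggregation step, individual Tweets can be counted into an origin-by-destination matrix that has the same shape as the output of an aggregate flow model. The field names `origin_zone` and `museum` and the zone labels below are hypothetical; how each Tweet was assigned an origin zone is not described in the slides.

```python
from collections import Counter


def aggregate_flows(tweets, zones, museums):
    """Count tweets per (origin zone, museum) pair and return the counts as a
    matrix with one row per zone and one column per museum."""
    counts = Counter((t["origin_zone"], t["museum"]) for t in tweets)
    # Counter returns 0 for pairs with no tweets, so sparse cells are kept.
    return [[counts[(z, m)] for m in museums] for z in zones]


# Hypothetical individual-level records.
tweets = [
    {"origin_zone": "A", "museum": "M1"},
    {"origin_zone": "A", "museum": "M1"},
    {"origin_zone": "B", "museum": "M2"},
]
flow_matrix = aggregate_flows(tweets, zones=["A", "B"], museums=["M1", "M2"])
```

With under 1,000 Tweets spread over many zone-museum pairs, most cells in such a matrix will be zero or very small, which is one way to see why much more data is needed for reliable results.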

Risks
Of new data:
• Data vs actual behaviour
• Policy relevance
• Time pre-processing
• Unrepresentative (Strava)
• Less use of official data
Of new models:
• New is not always better
• Oversimplification (Masucci et al. 2013)
