Sensing the city

Robin
July 09, 2014

Slides delivered at the Research Methods Festival 2014

Transcript

  1. Advances in datasets and methods for 'sensing the city'

    Robin Lovelace, University of Leeds. 9th July, Research Methods Festival, University of Oxford. [email protected] - Slides: robinlovelace.net
  2. What I’m going to talk about

    1. Advances in datasets
    2. Advances in method
    3. Examples
    4. Opportunities for synthesis
    5. Conclusion
  3. Part I: Advances in the data

    Ongoing and accelerating digital revolution
    More data than ever before
    Most growth in 'big data'
    Or rather 'V data':
    • High Volume
    • High Velocity
    • Highly Variable
    • Often Un-Verified
    Continuing rush to access this data
  4. Overview of access and rights

    Availability -> Right to access/store/use -> 2001 o-d flow data (imperfect) Twitter API Historic tweets (pending Library of Congress action) (£) Mobile phone triangulation data (e.g. Telefonica) Strava data on Running/cycling (£) Migration flow data Anonymous, non-geo. Ind. Survey data Google Location Services data Primary survey data (e.g. Ian Kellar; LAs, Bogota) Size/recent/(potential) utility Direction of movement
  5. Recent developments: Twitter

    “a single search of the 21 billion tweets in the fixed 2006-10 archive was taking 24 hours just last year. Twitter acquired Gnip in April, prompting hopes that the archive may be operational in 2014-15, but even so, the archive will only be accessible on-site at the Library in Washington, D.C.” Source: Poynter.org, 25th June, http://www.poynter.org/latest-news/media-lab/social-media/256811/how-to-do-twitter-research-on-a-shoestring/
    “Modeled after the U.S. Consumer Privacy Bill of Rights, Rivers and Lewis outline six guidelines for the ethical use of Twitter data:” http://phys.org/news/2014-06-scientists-tactics-ethical-twitter.html
    Increasing academic interest – data access issues
    The rise of “Big Analytics” corporations: http://venturebeat.com/2014/07/02/datasift-works-with-the-united-nations-to-analyze-social-data-for-humanitarian-missions/
  6. Trends

    • Public -> private data provision
    • Free (samples) -> paid data
    • Small -> big
    • Pre-processed by provider -> academics pre-process the data (e.g. Sandy Tweets)
    • Aggregate -> Individual-level
    • Space + time snapshots -> Space-time
  7. Part II: Advances in methods

    “Modern computing is now sufficiently powerful to deal with most [urban] models ... models based on individuals are now feasible both in terms of their computation and their representation using new programming languages” (Batty, 2007, p. 5).
  8. New method 1: the radiation model

    • Tij: flow from i to j
    • Ti: total flow out of i, i.e. the sum of Tij over all j ≠ i
    • mi: population of zone i (equiv.: Pi)
    • nj: population of dest. zone (equiv.: Wj)
    • sij: population in the circle centred on i, with circumference touching j
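As a rough sketch, the symbols above can be combined using the radiation model formula published by Simini et al. (2012), listed in the key references: Tij = Ti · mi · nj / ((mi + sij)(mi + nj + sij)). The function name `radiation_flow` and the input values are illustrative assumptions, not taken from the slides. In that paper sij excludes the populations of i and j themselves; whether the talk used the same convention is not stated.

```python
def radiation_flow(T_i, m_i, n_j, s_ij):
    """Expected flow T_ij from zone i to zone j under the radiation model.

    T_i  : total flow leaving zone i
    m_i  : population of the origin zone i
    n_j  : population of the destination zone j
    s_ij : population inside the circle centred on i that reaches j
           (assumed here to exclude the populations of i and j)
    """
    denominator = (m_i + s_ij) * (m_i + n_j + s_ij)
    if denominator == 0:
        # No population anywhere: no flow can be allocated.
        return 0.0
    return T_i * (m_i * n_j) / denominator


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
flow = radiation_flow(T_i=1000, m_i=5000, n_j=2000, s_ij=3000)
print(flow)
```

Note that the model has no free parameter to calibrate: the flow depends only on the populations and on Ti.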
  9. A visualisation of sij

    [Figure: zones i and j, with a circle of radius rij centred on i]
    sij = sum(pop %in% circle of radius rij): the sum of the populations of all black circles
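A minimal sketch of how sij could be computed from zone centroids, assuming planar (projected) coordinates and that the populations of i and j are left out of the sum, as in Simini et al. (2012). The function name `s_ij`, the zone labels and all coordinates and populations are made up for illustration.

```python
import math


def s_ij(zones, i, j):
    """Sum the population of every zone lying within distance r_ij of zone i.

    zones maps a zone id to (x, y, population); x and y are assumed to be
    planar coordinates, so straight-line distance is meaningful.
    """
    xi, yi, _ = zones[i]
    xj, yj, _ = zones[j]
    r_ij = math.hypot(xj - xi, yj - yi)  # radius of the circle centred on i

    total = 0
    for k, (xk, yk, pop) in zones.items():
        if k in (i, j):
            continue  # assumption: origin and destination are excluded
        if math.hypot(xk - xi, yk - yi) <= r_ij:
            total += pop
    return total


# Hypothetical zones: "a" and "b" fall inside the circle, "c" outside.
zones = {
    "i": (0.0, 0.0, 5000),
    "j": (3.0, 4.0, 2000),   # distance 5 from i, so r_ij = 5
    "a": (1.0, 1.0, 800),
    "b": (0.0, -4.0, 1200),
    "c": (6.0, 0.0, 900),
}
s = s_ij(zones, "i", "j")
print(s)
```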
  10. 'New' method 2: Bayesian statistics

    • Approach to statistics superseding the dominant 'NHST' or “frequentist” paradigm
    • You must state expectations – no more unrealistic 'null hypotheses'
    • Monte Carlo or numerical solutions move us from prior to posterior probability distributions for each model parameter
    • Can also tell us precisely how much more realistic model 1 is than model 2
    • New software for handling space-time data (e.g. R packages INLA, spTimer)
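To make the step from prior to posterior concrete, here is a small numerical (grid approximation) sketch for a single parameter: the probability p of a "success" in binomial data. The flat prior, the data (7 successes in 10 trials) and the function name `grid_posterior` are assumptions chosen for illustration; real space-time models would use dedicated software such as the packages named above.

```python
def grid_posterior(successes, trials, grid_size=1001):
    """Approximate the posterior of a binomial proportion on a regular grid."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]

    # Stated expectation: a flat prior, i.e. every value of p equally plausible.
    prior = [1.0] * grid_size

    # Binomial likelihood of the data at each candidate value of p
    # (the constant binomial coefficient cancels on normalisation).
    likelihood = [p ** successes * (1 - p) ** (trials - successes) for p in grid]

    unnormalised = [pr * lk for pr, lk in zip(prior, likelihood)]
    total = sum(unnormalised)
    posterior = [u / total for u in unnormalised]
    return grid, posterior


grid, posterior = grid_posterior(successes=7, trials=10)
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
print(round(posterior_mean, 3))
```

With a flat prior the exact posterior is Beta(8, 4), whose mean is 8/12, so the grid result should land close to 0.667.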
  11. Part III: Examples. E.g. 1: 'museum tweets'

    445 days (15 months) of data. All 'geotagged' Tweets for Leeds and surroundings.
  12. Preprocessing the Tweets

    Pre-processing: filtered out 'museum Tweets' from dataset of 2.8 geo-referenced messages.
    • Semantic filters: basically "regex" search terms
    • Spatial filters: a buffer around each museum, with osmar
    Overall, just under 1,000 'museum Tweets' resulted from the filters
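The two kinds of filter can be sketched as follows. This is not the code used in the study (which worked in R with osmar): the search terms, the 200 m buffer, the museum coordinates and the tweet records are all hypothetical, and the slides do not say how the semantic and spatial filters were combined, so the sketch simply reports both flags.

```python
import math
import re

# Hypothetical search terms; the terms actually used are not given in the slides.
MUSEUM_PATTERN = re.compile(r"\b(museum|gallery|exhibition)s?\b", re.IGNORECASE)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def filter_flags(tweet, museums, buffer_m=200.0):
    """Return (passes_semantic_filter, passes_spatial_filter) for one tweet."""
    semantic = bool(MUSEUM_PATTERN.search(tweet["text"]))
    spatial = any(
        haversine_m(tweet["lat"], tweet["lon"], m_lat, m_lon) <= buffer_m
        for m_lat, m_lon in museums
    )
    return semantic, spatial


# Made-up museum location and tweets, for illustration only.
museums = [(53.8000, -1.5490)]
tweets = [
    {"text": "Great day at the museum!", "lat": 53.9000, "lon": -1.4000},
    {"text": "Coffee time", "lat": 53.8001, "lon": -1.5491},
    {"text": "Stuck in traffic", "lat": 53.7500, "lon": -1.6000},
]
flags = [filter_flags(t, museums) for t in tweets]
print(flags)
```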
  13. Calibration

    Very simple calibration procedure: reran model for many different beta values
    Closest aggregated tweet/model fit selected for different model implementations
    Opportunities for Bayesian approaches here
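The "rerun for many beta values" procedure amounts to a grid search. The sketch below assumes a production-constrained spatial interaction model with exponential distance decay and a sum-of-squared-errors fit measure; both are illustrative choices, not necessarily the model implementations or fit statistic used in the study. All names and numbers are hypothetical, and the "observed" flows are synthetic.

```python
import math


def model_flows(origins, attractiveness, distances, beta):
    """Flows T_ij = O_i * W_j * exp(-beta * d_ij) / sum_k W_k * exp(-beta * d_ik)."""
    flows = []
    for i, o_i in enumerate(origins):
        weights = [w_j * math.exp(-beta * distances[i][j])
                   for j, w_j in enumerate(attractiveness)]
        total = sum(weights)
        flows.append([o_i * w / total for w in weights])
    return flows


def sum_squared_error(observed, modelled):
    """Fit measure: smaller means modelled flows are closer to observed ones."""
    return sum((o - m) ** 2
               for row_o, row_m in zip(observed, modelled)
               for o, m in zip(row_o, row_m))


def calibrate_beta(origins, attractiveness, distances, observed, betas):
    """Rerun the model for each candidate beta and keep the best-fitting one."""
    return min(betas, key=lambda b: sum_squared_error(
        observed, model_flows(origins, attractiveness, distances, b)))


# Synthetic example: "observed" flows are generated with beta = 0.5,
# so the grid search should recover that value.
origins = [100.0, 50.0]
attractiveness = [1.0, 2.0]
distances = [[1.0, 4.0], [3.0, 2.0]]
observed = model_flows(origins, attractiveness, distances, beta=0.5)
betas = [k / 10 for k in range(21)]  # 0.0, 0.1, ..., 2.0
best_beta = calibrate_beta(origins, attractiveness, distances, observed, betas)
print(best_beta)
```

A Bayesian treatment would replace the single best-fitting beta with a posterior distribution over beta, which is presumably the opportunity the slide refers to.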
  14. Example 2: Hurricane Sandy

    ~20 million tweets, 160,000 of which are geo-located.
    10 day worldwide filter on “sandy”.
    Purchased for ~£2,000 for project on climate disasters.
    Code for filtering the data: https://github.com/Robinlovelace/twitter-sandy/blob/master/R%20Scripts/1-formatting-filtering.R
  15. Part IV: Opportunities

    Opportunities of 'Big' data
    • New insight into questions previously beyond the reach of survey
    • Diversity, low cost, comprehensive coverage
    • High spatial and temporal resolution
    • New opportunities for testing
    • Need to move away from 'big noise' to processed and verified datasets
    • Comparison with official datasets
  16. Opportunities for new methods

    • Accuracy
    • Simplicity (Masucci et al. 2013)
    • Bayesian methods are well-suited to the analysis of large datasets
    • Interactive visualisation
    • New models and methods need to be tested on 'Big Data'
  17. Risks of Big Data

    [Big data is] "a version of cherry-picking that destroys the entire spirit of research and makes the abundance of data extremely harmful to knowledge." (Taleb 2012, 416)
  18. Synthesis

    New datasets call for:
    • New statistical tools (Bayesian) to deal with uncertainty
    • Ways to ingest continual data
    • Filtering
    • Aggregation
    New methods call for:
    • More code sharing (e.g. Dennett, 2012)
    • Comparative testing
    • Ways to input new data streams
    • Visualisation of key processes
    Convergence in research design?
  19. Conclusion

    Opportunities and risks associated with both new datasets and models
    Little correspondence between advances in modelling and data sources
    New datasets call for new methodologies
    Bridging this model-data gap = research priority: thinking behind research at Leeds
  20. Key References

    • Lovelace, R., Malleson, N., Harland, K., & Birkin, M. (2014). Geotagged tweets to inform a spatial interaction model: a case study of museums. arXiv preprint arXiv:1403.5118.
    • Masucci, A. P., Serras, J., Johansson, A., & Batty, M. (2013). Gravity versus radiation models: On the importance of scale and heterogeneity in commuting flows. Physical Review E, 88(2).
    • Simini, F., González, M. C., Maritan, A., & Barabási, A. L. (2012). A universal model for mobility and migration patterns. Nature, 484(7392), 96-100.
  21. Aggregation

    Necessary to compare aggregate flow model with individual Tweets
    Also vital to 'smooth' the stochasticity inherent to VGI
    In reality: LOTS more data needed for reliable results
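One simple way to aggregate individual Tweets into flows comparable with an aggregate model is to count them per origin-destination pair. The sketch below assumes each filtered Tweet has already been assigned an origin zone and a museum; the field names `home_zone` and `museum` and the records themselves are hypothetical.

```python
from collections import Counter


def aggregate_flows(tweets):
    """Count individual tweets per (origin zone, museum) pair."""
    return Counter((t["home_zone"], t["museum"]) for t in tweets)


# Made-up records, for illustration only.
tweets = [
    {"home_zone": "A", "museum": "M1"},
    {"home_zone": "A", "museum": "M1"},
    {"home_zone": "A", "museum": "M2"},
    {"home_zone": "B", "museum": "M1"},
]
flows = aggregate_flows(tweets)
print(flows[("A", "M1")])
```

With so few Tweets per pair the counts are noisy, which is the slide's point about needing far more data for reliable results.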
  22. Risks

    Of new data
    • Data vs actual behaviour
    • Policy relevance
    • Time pre-processing
    • Unrepresentative (Strava)
    • Less use of official data
    Of new models
    • New is not always better
    • Oversimplification (Masucci et al. 2013)