Big Data in the Real World

alexsingleton
December 07, 2014

Talk given at UCL, 2/12/14




  1. www.alex-singleton.com @alexsingleton Alex Singleton, Reader in Geographic Information Science, Department of Geography and Planning. “Big Data” in the Real World
  2. Overview • A definition of “Big Data”? • Limits and problems • 4 challenges with examples • Data Management • Linkage • Visualisation • Reproducibility • Future Directions
  3. None
  4. None
  5. None
  6. Maths / Stats Computer Science Data Science

  7. None
  8. None
  9. http://www.npr.org/blogs/parallels/2013/05/31/187316703/Rio-Goes-High-Tech-With-An-Eye-Toward-Olympics-World-Cup http://www.ft.com/cms/s/0/a1f11494-1439-11e3-9289-00144feabdc0.html#axzz3EKlEY87Y CCTV Weather Radar Traffic Social Networks GPS “Models that predict emergencies” prediction, mitigation, preparation and response Many others… Velocity Volume Variety Rio de Janeiro, Brazil
  10. http://upload.wikimedia.org/wikipedia/commons/2/23/Noahs_Ark.jpg Software Systems

  11. http://www.flickr.com/photos/x-ray_delta_one/8184264475/ http://www.flickr.com/photos/neoporcupine/1866929252/ Fantasy Reality

  12. None
  13. Sampling bias?

  14. What we do with data…

  15. http://streetbump.org/ Social implications of bias

  16. Eric Fischer (Twitter Map)

  17. Not really But often are still quite challenging to deal

  18. Hospital School University Retail Property Banking Telecoms Transport Police

  19. Professor Mike Batty, UCL Anything that won’t fit on a

  20. Challenges for “Big Data” Visualisation Linkage Synthesis Reproducibility Pretty

  21. Synthesis Variables V Context

  22. http://www.google.co.uk/intl/en_uk/earth/

  23. http://www.google.co.uk/intl/en_uk/earth/ 52: POORER FAMILIES, MANY CHILDREN, TERRACED HOUSING 51: YOUNG

  24. None
  25. Areas Attributes PCA “Pre” clustering - contiguity constraint

  26. http://newamericanatlas.com

  27. A: Hispanic and Kids B: Wealthy Nuclear Families C: Middle income, single family homes D: Native American E: Wealthy Urbanites F: Low Income and Diverse G: Old, Wealthy White H: Low Income Minority Mix I: African−American Adversity J: Residential Institutions, Young People
  28. http://www.flickr.com/photos/epsos/5591761716/sizes/o/in/photostream/ CO2 Emissions and the School Commute Data Linkage

  29. CO2 Emissions • ~7.5 million school trips • 2007-2012 • Usual Travel Mode. Singleton, A. (2013) A GIS Approach to Modelling CO2 Emissions Associated with the Pupil-School Commute. International Journal of Geographical Information Science, 28(2):256–273.
  30. CO2 Emissions • d distance • p pupil • i pupil home postcode • j school postcode • e CO2 g/km • t transport mode • g location

    $k_p = 2 \left( d(i_p, j_p, t_p) \, e(t_p, g_p) \, w(t_p) \right)$
  31. Transport mode and average CO2 (g/km): Taxi 150.3; Bus (London) 85.7; Bus (Non London) 184.3; Coach 30.0; Light Rail, average 71 (London (DLR) 68.3; Birmingham / Midlands 70.5; Newcastle 103.0; Croydon 44.3; Manchester 39.5; Nottingham #; Sheffield 96.8); National Rail 53.4; London Underground 73.1; Cycling 8.3; Walking 11.4. (Coley 2002, DEFRA 2011, Tranter 2012)
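The per-pupil formula on slide 30 can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function and dictionary names are hypothetical, the weighting term w is treated as an assumed multiplier defaulting to 1 because the slide does not define it, the leading factor of 2 is assumed to stand for the return journey, and the location term g is ignored, so each mode has a single factor. Only a few of the emission factors from slide 31 are included.

```python
# Emission factors in g CO2 per km, copied from the slide 31 table
# (only a subset of modes is shown here).
EMISSION_G_PER_KM = {
    "taxi": 150.3,
    "national_rail": 53.4,
    "london_underground": 73.1,
    "cycling": 8.3,
    "walking": 11.4,
}


def commute_co2_grams(distance_km: float, mode: str, weight: float = 1.0) -> float:
    """Return k_p = 2 * d * e * w for one pupil.

    distance_km -- one-way distance d between home and school postcodes
    mode        -- transport mode t, used to look up the factor e
    weight      -- assumed stand-in for the slide's undefined w term
    """
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    factor = EMISSION_G_PER_KM[mode]  # raises KeyError for unlisted modes
    return 2.0 * distance_km * factor * weight


if __name__ == "__main__":
    # A hypothetical pupil living 3 km from school who travels by taxi.
    print(commute_co2_grams(3.0, "taxi"))
```

Summing this value over all pupils would give a total for a school or area; how the talk aggregates the results is not stated on the slide.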
  32. Data Processing OpenStreetMap Meridian2 Roads / Paths .osm XML Light Rail / Tube railway=light_rail railway=subway Routino QGIS Railways Cleaning 1) Single lines 2) Nodes join 3) Nodes at stations QGIS
  33. Software Infrastructure Query Pupil (origin, destination, mode) Mode? Tube / Light Rail Road Car based? Return LSOA average CO2g/km Yes No Use national averages
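One way to read the flowchart on this slide is as a lookup rule: car-based road trips use an average emission factor for the pupil's LSOA, and other trips use national averages. The sketch below assumes that reading. The function name, the mode labels, the fallback to the national figure when an LSOA is missing, and every number in the example are placeholders rather than details from the talk.

```python
from typing import Mapping

# Assumed labels for car-based modes; the slide does not list them.
CAR_BASED_MODES = frozenset({"car", "car_share"})


def select_emission_factor(
    mode: str,
    lsoa_code: str,
    lsoa_car_average: Mapping[str, float],
    national_average: Mapping[str, float],
) -> float:
    """Pick a g CO2/km factor for one pupil trip.

    Car-based trips use the LSOA average when one is available;
    every other case falls back to the national average for the mode.
    """
    if mode in CAR_BASED_MODES and lsoa_code in lsoa_car_average:
        return lsoa_car_average[lsoa_code]
    return national_average[mode]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    lsoa_factors = {"LSOA_A": 160.0}
    national_factors = {"car": 150.0, "bus": 90.0}
    print(select_emission_factor("car", "LSOA_A", lsoa_factors, national_factors))
    print(select_emission_factor("bus", "LSOA_A", lsoa_factors, national_factors))
```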
  34. [Figure: percentage of trips by mode (BUS, CAR, NON, TRA) against distance, in 0.5 km bands from 0−0.5km to 19.5−20km, plus > 20km]
  35. Results Versus a simple model (straight line, vehicle national averages). +ve = simple model overestimating
  36. Massive Online Open Map Books Visualisation

  37. http://www.educationprofiler.org

  38. 2010 Census of Japan Open Atlas: Alex Singleton [www.alex-singleton.com], Chris Brunsdon, Tomoki Nakaya, Keiji Yano, Version 1.0. 2011 Census Open Atlas: Alex Singleton (www.alex-singleton.com), Version 2.0
  39. 1) Download and process all data 2) Download OA and Ward boundaries 3) Render maps and legends for LAD 4) Write a LaTeX file. pdfcrop PDFTK
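Step 4 of the pipeline on this slide, writing a LaTeX file that pulls the rendered maps together, might look something like the sketch below. It assumes the maps already exist as PDF files, places one map per page, and does not escape special LaTeX characters in the title. The function name and file names are hypothetical, and the cropping and merging done by pdfcrop and PDFTK are left out.

```python
from typing import Iterable


def build_atlas_tex(title: str, map_pdfs: Iterable[str]) -> str:
    """Return LaTeX source that places one pre-rendered map PDF per page."""
    lines = [
        r"\documentclass{article}",
        r"\usepackage{graphicx}",
        r"\begin{document}",
        r"\section*{" + title + "}",
    ]
    for pdf in map_pdfs:
        # Start each map on a fresh page and scale it to the text width.
        lines.append(r"\clearpage")
        lines.append(r"\noindent\includegraphics[width=\textwidth]{" + pdf + "}")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical map files for a single local authority district.
    print(build_atlas_tex("Example District", ["map_001.pdf", "map_002.pdf"]))
```

Generating the document from a list of files is what lets the same script be rerun for every district without manual layout.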
  40. None
  41. None
  42. Savings • A manual map might typically take 5 minutes to create • 134,567 maps • 35 hour working week - 6.9 years. • Median wages of a GIS Technician £138,207.
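The arithmetic behind the savings slide can be checked with a short sketch. The slide gives the map count, the minutes per map and the 35-hour week, but not the number of working weeks per year; the second function works backwards from the quoted 6.9 years, which would correspond to roughly 46 working weeks per year. The function names are illustrative.

```python
MAPS = 134_567          # number of maps, from the slide
MINUTES_PER_MAP = 5     # manual effort per map, from the slide
HOURS_PER_WEEK = 35     # working week, from the slide


def manual_effort(maps: int = MAPS,
                  minutes_per_map: float = MINUTES_PER_MAP,
                  hours_per_week: float = HOURS_PER_WEEK) -> tuple[float, float]:
    """Return (total hours, working weeks) needed to draw every map by hand."""
    total_hours = maps * minutes_per_map / 60
    working_weeks = total_hours / hours_per_week
    return total_hours, working_weeks


def implied_weeks_per_year(working_weeks: float, years: float = 6.9) -> float:
    """Working weeks per year implied by the slide's 6.9-year figure."""
    return working_weeks / years


if __name__ == "__main__":
    hours, weeks = manual_effort()
    print(f"{hours:.0f} hours, {weeks:.1f} working weeks")
    print(f"{implied_weeks_per_year(weeks):.1f} working weeks per year implied")
```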
  43. Open GISc Reproducibility

  44. Complex problems, unstable environment http://www.nature.com/news/when-google-got-flu-wrong-1.12413

  45. Free Data != Open Data

  46. Open Systems

  47. None
  48. James Reid - Northern Ireland Atlas (http://ukbdev.edina.ac.uk/Census2011/) James Trimble (http://ukdataexplorer.com/census/england/)

  49. The Future? • Big Data as a concept is not a new phenomenon: a disjunction between available data and the ability to process it • Moving from relatively clean / ordered data to dirty / unstructured data drawn from multiple sources • New methodology will emerge • Great Opportunity: Should begin with a problem to solve, not a technology or infrastructure “Big Data”
  50. The Future? http://www.theguardian.com/technology/2014/jan/29/uk-government-plans-switch-to-open-source-from-microsoft-office-suite https://www.whatdotheyknow.com/request/133909/response/323829/attach/3/RESPONSE%2028509371.pdf • Open source GIS • Reduced market share for commercial desktop GIS • Commercial GIS, refocus on cloud services Infrastructure
  51. Infrastructure

  52. The Future? Sweave (.Rnw) Markdown (.Rmd) Rpubs Reproducibility

  53. The Future? Education. “Learning to code”, Point of View, Geographical (magazine of the Royal Geographical Society (with IBG)), January 2014, p. 77, www.geographical.co.uk. Alex Singleton is a lecturer in geographic information science at the University of Liverpool.

    In my opinion, a geography curriculum should require students to learn how to code, ensuring that they’re equipped for a changed job market that’s increasingly detached from geographic information systems (GIS) as they were originally conceived.

    The ability to code relates to basic programming and database skills that enable students to manipulate large and small geographic data sets, and to analyse them in automated and transparent ways. Although it might seem odd for a geographer to want to learn programming languages, we only have to look at geography curriculums from the 1980s to realise that these skills used to be taught. For example, it wouldn’t have been unusual for an undergraduate geographer to learn how to programme a basic statistical model (for example, regression) from base principles in Fortran (a programming language popular at the time) as part of a methods course.

    But during the 1990s, the popularisation of graphical user interfaces in software design enabled many statistical, spatial analysis and mapping operations to be wrapped up within visual and menu-driven interfaces, which were designed to lower the barriers of entry for users of these techniques. Gradually, much GIS teaching has transformed into learning how these software systems operate, albeit within a framework of geographic information science (GISc) concerned with the social and ethical considerations of building representations from geographic data. Some Masters degrees in GISc still require students to code, but few undergraduate courses do so.

    The good news is that it’s never been more exciting to be a geographer. Huge volumes of spatial data about how the world looks and functions are being collected and disseminated. However, translating such data safely into useful information is a complex task. During the past ten years, there has been an explosion in new platforms through which geographic data can be processed and visualised. For example, the advent of services such as Google Maps has made it easier for people to create geographical representations online. However, both the analysis of large volumes of data and the use of these new methods of representation or analysis do require some level of basic programming ability. Furthermore, many of these developments haven’t been led by geographers, and there’s a real danger that our skill set will be seen as superfluous to these activities in the future without some level of intervention.

    Indeed, it’s a sobering experience to look through the pages of job advertisements for GIS-type roles in the UK and internationally. Whereas these might once have required knowledge of a particular software package, they increasingly look like advertisements for computer scientists, with expected skills and experience that wouldn’t traditionally be part of an undergraduate geography curriculum. Many of the problems that GIS set out to address can now be addressed with mainstream software or shared online services that are, as such, much easier to use. If I want to determine the most efficient route between two locations, a simple website query can give a response within seconds, accounting for live traffic-volume data. If I want to view the distribution of a census attribute over a given area, there are multiple free services that offer street-level mapping. Such tasks used to be far more complex, involving specialist software and technical skills. There are now far fewer job advertisements for GIS technicians than there were ten years ago.

    Much traditional GIS-type analysis is now sufficiently non-technical that it requires little specialist skill, or has been automated through software services, with a subscription replacing the employment of a technician. The market has moved on.

    Geographers shouldn’t become computer scientists; however, we need to reassert our role in the development and critique of existing and new GIS. For example, we need to ask questions such as which type of geographic representation might be most appropriate for a given dataset. Today’s geographers may be able to talk in general terms about such a question, but they need to be able to provide a more effective answer that encapsulates the technologies that are used for display. Understanding what is and isn’t possible in technical terms is as important as understanding the underlying cartographic principles. Such insights will be more available to a geographer who has learnt how to code.

    Within the area of GIS, technological change has accelerated at an alarming rate in the past decade and geography curriculums need to ensure that they embrace these developments. This does, however, come with challenges. Academics must ensure that they are up to date with market developments and also that there’s sufficient capacity within the system to make up-skilling possible. Prospective geography undergraduates should also consider how the university curriculums have adapted to modern market conditions and whether they offer the opportunity to learn how to code.
  54. Maths / Stats Computer Science Data Science

  55. Maths / Stats Computer Science Geography Data Science

  56. Maths / Stats Computer Science Geography Geographic Data Science

  57. Maths / Stats Computer Science Geography Geocomputation

  58. Many Thanks Any questions?