Geodemographics and digital cultures: Developing a national classification of Internet use and engagement

Talk given at the Geospatial Innovation Facility, University of California, Berkeley. October 2014

Dean Riddlesden

November 12, 2014

Transcript

  1. Dean Riddlesden, Geographic Data Science Laboratory
     Geodemographics and digital cultures: Developing a national classification of Internet use and engagement
  2. Outline
     - Background
     - Why?
     - The three fundamentals of engagement:
       - Infrastructure
       - Behaviors
       - Demographics
     - Developing a national classification
     - Assessment and impact
     - Conclusions
  3. Background
     Geodemographics links the sciences of geography and demography.
     Commercial geodemographic classifications began to emerge in the late 1970s:
     - PRIZM (Claritas), USA
     - Acorn (CACI), UK
     - Mosaic (Experian)
     Predominantly used in the targeting of ‘ideal’ populations for products and services
     - Some public sector significance
     - Recent shift from ‘black-box’ to open source
  4. Background cont.
     Open source geodemographics
     •  Output Area Classification (OAC)
        - Initially 2001
        - Updated 2011
        - In line with Census releases from ONS
     •  E-society Classification
        - 2007
        - Predominantly based on lifestyle and consumer survey data/ some census
     •  Internet User Classification (IUC)
        - Survey/ Crowdsourced data/ Census/ Infrastructure
  5. Background cont.
     Positives:
     - Easy to use measures
     - Widespread application
     - Policy formation/ delivery
     - Insight and Business Intelligence
     - Proven success in industry
     Negatives:
     - One big Ecological Fallacy?
     Up for debate:
     - Re-use of personal information for profit
     - Issues of privacy
     - Spam, spam and more spam
  6. Why build a bespoke classification?
     •  Broadly:
        - Commercial classifications lack focus
        - The age of the Internet
        - Harness crowdsourced data
        - No domain-specific open source classifications
     •  Specifically:
        - Learn from/ improve upon the E-society classification
        - Higher spatial resolution
        - Wide scope for implementation (ONS/ Online Census 2021)
        - An interesting PhD topic
  7. 1. Infrastructure
     •  Comprehensive studies of English broadband infrastructure (fixed-line and mobile)
     •  Performance and access evaluated by indicators of socio-spatial structure (OAC and open geodemographic profiling)
     •  A geography of supply to consider alongside demand
     •  What are the apparent disparities?
  8. Infrastructure Data
     •  Hard to find, but essential to the project
     •  No public data/ Incumbent suppliers don’t like to share
     •  The Ofcom Sitefinder fiasco/ had to get creative
     Data:
     •  7.6m geo-tagged speed test estimates
     •  Exchange location database (web scraped)
     •  Ofcom Sitefinder database (eventually)
     •  Laborious ‘Janitor’ work
  9. [Figure: two histograms of speed test counts by distance to exchange, one for 2010/11 and one for 2012/13. x-axis: Distance to Exchange (m); y-axis: Count.]
  10. [Figure: two bar charts, one for 2010/11 and one for 2012/13. x-axis: Distance From Exchange (Kilometres), in 0.5 km bands from 0.5 to 7.5+; y-axis: Mean Download Speed (Kb/s).]
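
The banded chart on this slide implies two steps: measuring each geo-tagged speed test's distance to an exchange, then averaging download speed within 0.5 km bands. A minimal sketch of that idea follows. It assumes straight-line (haversine) distance to the nearest exchange, which is not necessarily how the talk measured distance, and every function name and coordinate is hypothetical.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def distance_band_km(distance_m, width_km=0.5, top_km=7.5):
    """Label a distance by the upper edge of its band: '0.5', '1.0', ..., '7.5', '7.5+'."""
    km = distance_m / 1000
    if km > top_km:
        return f"{top_km}+"
    upper = max(width_km, math.ceil(km / width_km) * width_km)
    return f"{upper:.1f}"


def mean_speed_by_band(tests, exchanges):
    """tests: iterable of (lat, lon, kbps); exchanges: list of (lat, lon).

    Assumption: each test is served by its nearest exchange. The brute-force
    search below is fine for a sketch; millions of tests would need a spatial index.
    """
    sums = defaultdict(lambda: [0.0, 0])  # band -> [sum of speeds, number of tests]
    for lat, lon, kbps in tests:
        nearest = min(haversine_m(lat, lon, ex_lat, ex_lon) for ex_lat, ex_lon in exchanges)
        band = distance_band_km(nearest)
        sums[band][0] += kbps
        sums[band][1] += 1
    return {band: total / n for band, (total, n) in sums.items()}


# Made-up illustration data, not values from the talk.
exchanges = [(51.5, -0.1)]
tests = [(51.5, -0.1, 8000.0), (51.5, -0.1, 6000.0), (51.52, -0.1, 4000.0)]
result = mean_speed_by_band(tests, exchanges)
```

Copper line length usually exceeds straight-line distance, so a banding built this way is best read as a lower bound on the true line distance.
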
  11. [Figure: two maps of Mean Download Speeds (Kbps) by English District.
      2010/11 classes: under 3543; 3543 to 4249; 4249 to 5021; 5021 to 5853; 5853 to 6650; over 6650.
      2012/13 classes: under 5176; 5176 to 6705; 6705 to 8206; 8206 to 9853; 9853 to 11516; over 11516.]
  12. Who was supplying the data?
      [Figure: Test Results Per Current Postcode (axis 0 to 4) by Preliminary 2011 Output Area Classification group: 1a: Rural Retirement; 1b: Farming Communities; 1c: Country Life; 2a: Aspirational Migrants; 2b: Student Communities; 2c: Settled City Living; 3a: Urban Deprivation; 3b: Connected Achievers; 3c: Aspirational Multicultural Families; 3d: Challenged Ethnic Mix; 4a: Blue Collar Estates; 4b: Blue Collar Transitions; 4c: Blue Collar Terraces; 5a: Socially Mobile Minorities; 5b: Ethnic Communities; 6a: Inner Suburbs; 6b: Established Suburbs; 6c: Suburban Aspiration; 7a: Industrial Legacy; 7b: Hard-Pressed Multi-Ethnic Neighbourhoods; 7c: Elderly in Flats; 8a: Traditional Trades; 8b: Service Sector Urbanities; 8c: Late Retirement.]
  13. Considering spatio-temporal variation
      [Figure: Mean Download Speed (Kbps) by Hour (0 to 23), one series per URBAN_RURAL_INDEX class: Hamlet and Isolated Dwelling (Less Sparse, Sparse); Town and Fringe (Less Sparse, Sparse); Urban >= 10K (Less Sparse, Sparse); Village (Less Sparse, Sparse).]
  14. 2. Behaviors
      Key aims:
      - What are the geographies of demand for various Internet applications?
      - What can we learn about preferences and perceptions?
      - Can we start to identify user groups?
      - How do disparities in supply match with disparities in demand?
      - How can we model this nationally?
      MK41 8RG
  15. Behavioral Data
      •  Again, not easy to find
      •  ESRC Understanding Society – Issues
      •  OXIS / OII offered the most comprehensive survey data (vast dataset)
      •  Partnership between GDSL and OII established
      •  Data exchanged for the purpose of creating small area estimates of Internet engagement
  16. Putting the data to use
      - National coverage required (and at high spatial resolution)
      - Sufficient sample for expansion to national coverage
      - Small area estimation methods tested
      - Initially at small area level using regression
      - Best method was data mining/ decision tree induction using QUEST
      - Create subgroup estimates to fit at area level
  17. •  Data mining allowed for profiling at the respondent level
      •  Granularity in the data was retained / no aggregation before estimation
      •  42 OXIS questions estimated nationally
      •  7 domains: Seeking Information; Perceptions; Household Access; Mobile Access; Access Patterns; Commercial Applications; Rates of personal Access
      •  Age / Social Grade / Pop. Density used for synthetic estimations in QUEST
      •  Produced up to 100 subgroup rates for each Output Area Estimation
      •  7,619,136 Small Area estimates
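
The synthetic estimation step can be illustrated with a small sketch: survey-derived rates for each subgroup are applied to the number of residents in that subgroup in an area, and the expected counts are summed. The talk derived its subgroups with QUEST decision trees; the sketch below swaps that for a plain lookup table, and all names, subgroup keys and numbers are hypothetical. As an aside, 7,619,136 equals 42 × 181,408, which appears consistent with one estimate per question for each 2011 Output Area in England and Wales; that reading is an inference and is not stated on the slide.

```python
def area_estimate(subgroup_counts, subgroup_rates):
    """Expected share of an area's residents giving a 'yes' to one survey question.

    subgroup_counts: {subgroup_key: residents in this area}
    subgroup_rates:  {subgroup_key: survey-derived rate between 0 and 1}
    Returns None for an area with no residents.
    """
    total = sum(subgroup_counts.values())
    if total == 0:
        return None
    expected = sum(n * subgroup_rates[key] for key, n in subgroup_counts.items())
    return expected / total


# Hypothetical subgroup keys (age band, social grade) and made-up rates.
rates = {("16-34", "AB"): 0.75, ("65+", "DE"): 0.25}
counts_for_one_area = {("16-34", "AB"): 150, ("65+", "DE"): 50}
estimate = area_estimate(counts_for_one_area, rates)  # (112.5 + 12.5) / 200

# Arithmetic check on the slide's total. The Output Area count is an assumption
# about how the figure arose, not something the slide states.
N_QUESTIONS = 42
N_OUTPUT_AREAS = 181_408
TOTAL_ESTIMATES = N_QUESTIONS * N_OUTPUT_AREAS
```

Because the rates come from respondents rather than areas, the area estimate varies only through the local mix of subgroups, which is why the choice of subgroup variables matters so much.
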
  18. Next steps
      •  Aggregation from OA to LSOA
      •  Reduce noise in our dataset
      Validation:
      •  Profiling by existing geodemographic classifications
      •  Comparisons with Oxford estimates
      •  Visualizations and Mapping
      •  Compile a master database to be used in cluster formation
      Internet non-use
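
Aggregating Output Area (OA) estimates to Lower Layer Super Output Areas (LSOA) could be done in several ways. The sketch below assumes a population-weighted mean, which is one common choice and not necessarily the method used in the talk. The area codes, lookup and function name are hypothetical.

```python
from collections import defaultdict


def aggregate_to_lsoa(oa_rates, oa_population, oa_to_lsoa):
    """Population-weighted mean of OA rates within each LSOA.

    oa_rates:      {oa_code: estimated rate between 0 and 1}
    oa_population: {oa_code: resident population}
    oa_to_lsoa:    {oa_code: lsoa_code} lookup
    LSOAs whose OAs have zero total population are omitted.
    """
    weighted = defaultdict(float)
    population = defaultdict(int)
    for oa, rate in oa_rates.items():
        lsoa = oa_to_lsoa[oa]
        weighted[lsoa] += rate * oa_population[oa]
        population[lsoa] += oa_population[oa]
    return {lsoa: weighted[lsoa] / population[lsoa]
            for lsoa in weighted if population[lsoa] > 0}


# Made-up codes and values for illustration only.
oa_rates = {"OA1": 0.5, "OA2": 0.25, "OA3": 0.75}
oa_population = {"OA1": 300, "OA2": 100, "OA3": 200}
oa_to_lsoa = {"OA1": "LSOA_A", "OA2": "LSOA_A", "OA3": "LSOA_B"}
lsoa_rates = aggregate_to_lsoa(oa_rates, oa_population, oa_to_lsoa)
```

Weighting by population keeps a small OA with an extreme estimate from dominating its LSOA, which fits the slide's aim of reducing noise.
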
  19. Internet non-use by OAC
      [Figure: % Difference From National Average (axis from -60 to 60) for each Output Area Classification 2011 group, 1a: Rural Retirement to 8c: Late Retirement.]
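
A profile chart of this kind compares each group's rate with the national rate. The sketch below assumes the axis shows a relative difference, 100 × (group − national) / national; the slide does not say whether it is relative or a percentage-point difference, so treat the formula as an assumption. Group labels and rates are made up.

```python
def pct_difference_from_national(group_rates, national_rate):
    """Relative difference of each group's rate from the national rate, in percent."""
    if national_rate == 0:
        raise ValueError("national_rate must be non-zero")
    return {group: 100.0 * (rate - national_rate) / national_rate
            for group, rate in group_rates.items()}


# Hypothetical groups and rates, not values read from the chart.
profile = pct_difference_from_national({"group_A": 0.25, "group_B": 0.10}, national_rate=0.20)
```

With these made-up inputs, group_A sits about 25% above the national rate and group_B about 50% below it.
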
  20. Frequent online shopping by Indices of Multiple Deprivation
      [Figure: QC30b: Estimated % of People Frequently Buying Products Online (axis from 51.0 to 56.2) by IMD Decile (1 = Most Deprived, 10 = Least Deprived).]
  21. % Frequently ordering food or groceries online - London
      [Figure: maps labelled with the 32 London boroughs and the City of London. Legend classes (%): under 17.4; 17.4 to 18.2; 18.2 to 18.9; 18.9 to 19.6; 19.6 to 20.4; 20.4 to 21.4; 21.4 to 22.7; over 22.7. A second legend shows the same breaks rounded to whole numbers.]
  22. 3. Demographic and Contextual Data
      •  2011 Census data
      •  Area level statistics for multiple domains
      •  Assists in cluster formation and unpicking results
      •  Data selected based on previous research
      •  31 contextual indicators, including: Age; Level of qualification; Employment sector; Full time students; Population density
  23. Pre-Build
      •  More ‘Janitor’ work
      •  Compile a master database
      •  Select indicators from all previous studies
      •  Match and join
      •  Final dataset contained over 25m observations
      •  Probably constitutes big data
      •  Too much information
      •  Time to build something user friendly
  24. Building a geodemographic classification
      Assessment of input variables:
      - Distributions
      - Skew
      - Normalize?
      - Standardize
      Cluster:
      - K-means
      - 10,000 iterations
      - Supergroup and Group level hierarchy
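
The build steps on this slide (standardize the inputs, then cluster with k-means) can be sketched in plain Python. This is an illustration of the general technique under stated assumptions, not the talk's actual pipeline: it uses z-scores for standardization and Lloyd's algorithm for k-means, and it treats "10,000 iterations" as an iteration cap, although the slide could equally mean 10,000 repeated runs. All names and data are hypothetical.

```python
import math
import random


def standardize(rows):
    """Z-score each column so variables on different scales contribute equally."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A constant column has zero spread; divide by 1.0 instead to avoid a ZeroDivisionError.
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def kmeans(rows, k, max_iter=10_000, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids); stops early once labels are stable."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]  # k distinct rows as starting centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(row, centroids[j])))
            for row in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members (empty clusters stay put).
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up two-variable data with two obvious groups.
raw = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [8.0, 50.0], [8.3, 52.0], [7.9, 49.0]]
labels, centroids = kmeans(standardize(raw), k=2)
```

A Supergroup and Group hierarchy could then be approximated by running `kmeans` again on the members of each top-level cluster, though whether the talk built its hierarchy top-down in this way is not stated.
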
  25. Hierarchy
      [Figure: Clustergram of the PCA-weighted mean of the k-means clusters vs number of clusters (k). x-axis: Number of clusters (k), 2 to 10; y-axis: PCA weighted Mean of the clusters, -20 to 20.]
  26. Untangling the mess
      •  Identify Supergroup and Group characteristics
      •  Create a logical hierarchy
      •  Cluster naming
      •  ‘Pen Portraits’
      •  User Guide
  27. •  Use and impact:
         - Customer Insight/ BI applications
         - Targeted marketing
         - Policy formation and delivery
      •  Distribution:
         - Open Source and freely available
         - Interactive web maps
  28. Conclusions
      •  Geodemographic classifications can be a powerful tool if built carefully
      •  Input data is crucially important – E-society failures
      •  Requirement for more focused classifications
      •  Classification building involves trial and error. Part art, part science
      •  Data acquisition is time consuming and difficult
      •  Open source classifications are increasing in popularity
      •  IUC success will depend on uptake – time will tell