Where's the data?

James Eggers
February 04, 2013

I gave this talk as a lecture to a 3rd year class of students studying Business in Trinity College Dublin.

Transcript

  1. Hi Where’s the Data?

  2. •  I’m James •  I’m 18 •  Doing the Leaving

    Cert, take pity •  Have a weird obsession with computers @james_eggers
  3. •  Entered the Young Scientist twice, winning my category both times

    and an EMC “Data-Hero” award. •  I made “Better Examinations.ie”, which won an Irish Web Award. •  I have spoken at the Dublin Web Summit on the Main Stage, and at the Central Statistics Office about data.
  4. What I want to talk to you about 1. What data

    really is 2. How can data affect your life 3. Some things I’ve done 4. Where is all that data, anyway? 5. How can you analyse the data, too?
  5. What is data, really?

  6. Data is…

  7. “Data” are lots of individual bits of info that aren’t

    so useful by themselves, but become very useful when you put them together.
  8. Some examples of “data”

  9. Your shoe sizes

  10. The texts you send every day

  11. the people you stalk most on Facebook

  12. the people you don’t stalk on Facebook

  13. Your tweets

  14. The books you read

  15. The foods you like

  16. The number of times Kim Kardashian will get married

  17. The amount of horse-meat in your lunch

  18. Life is full of data. Too much to comprehend.

  19. If you had all the data, you could predict anything.

  20. Think of life as an equation: A² + B² =

    C². If you know just two variables, say A and B, then you know C too.
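As a minimal sketch of the idea on this slide: once A and B are known, C follows directly. The function name `solve_c` and the values 3 and 4 are illustrative assumptions, not from the talk.

```python
import math


def solve_c(a: float, b: float) -> float:
    """Return the positive C that satisfies A^2 + B^2 = C^2."""
    return math.sqrt(a * a + b * b)


# Illustrative values only: knowing A = 3 and B = 4 pins down C.
print(solve_c(3, 4))  # 5.0
```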
  21. A “life” equation If you had enough “data”, you could

    theoretically predict anything in life with a really large equation.
  22. Obviously not so possible (yet)

  23. But, we can roughly get the probability of something happening

  24. Twitter can be used to roughly predict stock market fluctuations

  25. Twitter can be used to predict who might win the

    X-Factor
  26. You see tweets about earthquakes before you feel them. The speed

    of light is faster than the speed of sound.
  27. Not just predicting things

  28. Data helps us understand the past and present, too

  29. What are a company’s customers saying about their products?

  30. Obama’s team constantly monitored Twitter for public opinion (= data

    = lots of individual opinions accumulated).
  31. So really, data = lots of information (= power?)

  32. None
  33. The Static Internet

  34. •  1990s •  The static web •  Websites were always

    the same and rarely changed. •  Information was stagnant and outdated.
  35. Dell in 1996

  36. Google in 1998

  37. •  2000+: we start to see the web become more

    real-time and more widely used. •  Facebook is set up in 2004, which sets the stage for massive amounts of social information moving across the internet. •  Imagine it as an information superhighway.
  38. •  APIs for accessing this information widely + easily available

    to everybody (almost). •  Massive datasets full of information to be accessed and analysed. •  Many avenues of analytics on this data yet to be explored + many ongoing creative experiments.
  39. •  Facebook: 3.2 billion likes + comments per day •  Twitter: 500+ million tweets per day •  LinkedIn: 200+ million people
  40. •  Over 160 million people use Twitter. •  Collectively, these

    people create 500 million tweets per day. •  Each tweet contains meta-information (location, time, names of people mentioned in the tweet, info about the user account, etc.). •  Accessing 2-3% of these tweets is free. •  Data from Twitter is widely used in research and statistical projects, where it has proven to work well. •  Experiments such as predicting stock movements have proven very possible with Twitter data.
  41. Nowadays, there’s a huge amount of data out there for

    us to use.
  42. We should do that

  43. That’s what Big Data is all about

  44. Harnessing the vast stores of data on the internet is

    invaluable.
  45. Everything can be connected to the internet

  46. Your lights, kettle, fridge, car, your chair (Facebook: “James just

    sat down, again”) and anything else you can think of
  47. Your kettle would love to be able to send you

    a notification when it’s boiled
  48. If everybody’s kettle was on the internet, we would have

    a lot of data about kettles
  49. Which we could analyse and correlate with other sets of

    data
  50. None
  51. If you just think, there are so many ways to

    use and harness the data around you.
  52. None
  53. None
  54. How data can affect your life

  55. Advertising

  56. Targeted Advertising, specifically

  57. There’s a lot of data about you online: •  Your

    age •  Relations •  Friends •  Likes •  What you look like •  What you say •  Who you talk to •  Occupation
  58. So, naturally, this info is used to show ads that

    are relevant to you.
  59. Lots of advertisements are targeted. This one has probably been seen by

    everybody here, because of your ages.
  60. Targeting on Facebook

  61. None
  62. Some of the things I’ve done with big data

  63. The Vibes of Ireland

  64. Real-Time Mood Monitoring via tweets

  65. Probably your tweets too

  66. Algorithm analyses millions of tweets and marks them as “happy”,

    “unhappy”, or “neutral”.
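The talk does not say how the algorithm decides on a label. As a rough sketch of one simple way to do it, the snippet below counts words from two small word lists. The lists and the function name `classify_mood` are assumptions made for illustration, and the actual project may have used a very different technique.

```python
import re

# Tiny illustrative word lists (assumed); a real system would need far more.
HAPPY_WORDS = {"happy", "great", "love", "good", "brilliant"}
UNHAPPY_WORDS = {"sad", "awful", "hate", "bad", "terrible"}


def classify_mood(tweet: str) -> str:
    """Label a tweet "happy", "unhappy" or "neutral" by counting mood words."""
    words = re.findall(r"[a-z']+", tweet.lower())
    score = sum(w in HAPPY_WORDS for w in words) - sum(w in UNHAPPY_WORDS for w in words)
    if score > 0:
        return "happy"
    if score < 0:
        return "unhappy"
    return "neutral"


print(classify_mood("I love Fridays, great day"))
print(classify_mood("Traffic is awful, I hate Mondays"))
print(classify_mood("Getting the bus to college"))
```

Word counting like this misses sarcasm and negation ("not happy"), which is one reason to treat the labels as rough.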
  67. It was a big part of the Science Gallery’s HAPPY?

    exhibition in May
  68. It also won its category in the BT Young Scientist

    & Technology Exhibition 2011.
  69. Mood of Ireland on an Average Day (Oct 2010 –

    Dec 2010)
  70. Results

  71. People are happiest on a Friday evening, and least happy

    early on a Thursday morning.
  72. There is a definite dip in the mood during the

    middle of the week.
  73. On an average day, people are happiest at about 18:00

    (6pm) and least happy early in the morning 04:00 – 08:00.
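Results like these come from averaging many labelled tweets by the time they were posted. Here is a minimal sketch of that aggregation, assuming each tweet has already been labelled and that the labels map to +1, 0 and -1. The scoring scheme, the function name and the sample tweets are all invented for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Assumed scoring scheme: one number per mood label.
SCORES = {"happy": 1, "neutral": 0, "unhappy": -1}


def average_mood_by_hour(labelled_tweets):
    """Take (datetime, label) pairs and return {hour: average score}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for when, label in labelled_tweets:
        totals[when.hour] += SCORES[label]
        counts[when.hour] += 1
    return {hour: totals[hour] / counts[hour] for hour in sorted(counts)}


# Made-up sample data, not real results.
sample = [
    (datetime(2010, 10, 1, 18, 5), "happy"),
    (datetime(2010, 10, 1, 18, 40), "neutral"),
    (datetime(2010, 10, 2, 5, 15), "unhappy"),
]
print(average_mood_by_hour(sample))  # {5: -1.0, 18: 0.5}
```

Grouping on `when.weekday()` instead of `when.hour` would give the per-day view in the same way.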
  74. I also found that the East Coast is generally in

    a worse mood than the West Coast. When the Budget 2011 was being read, there was a dip in the overall mood.
  75. Average Mood of all people in Ireland over an Average

    week:
  76. Average Mood of all people in Ireland over an Average

    day:
  77. Average Mood of all people in Ireland over an Average

    day, west coast vs. east coast:
  78. People are nearly always happier on the West coast. The

    east coast seems to consistently lag behind in terms of overall happiness.
  79. Better Examinations.ie

  80. Most of you have had the misfortune of having to

    use examinations.ie
  81. It is one of the most backward websites on the

    internet.
  82. No.

  83. But, there is a lot of data available to use.

  84. By training algorithms to store every word in every exam

    paper, students can now search all the papers for specific questions.
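One common way to make every word in every paper searchable is an inverted index: a lookup from each word to the papers that contain it. The sketch below assumes the paper text has already been extracted. The paper names and questions are invented, and the real site may work differently.

```python
import re
from collections import defaultdict


def build_index(papers):
    """papers: dict of paper name -> full text. Returns word -> set of paper names."""
    index = defaultdict(set)
    for name, text in papers.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return the papers that contain every word in the query."""
    words = re.findall(r"[a-z]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Invented example papers, for illustration only.
papers = {
    "History 2010": "Discuss the role of Martin Luther in the Reformation.",
    "History 2011": "What did Luther hope to achieve?",
    "Biology 2010": "Describe the structure of a cell.",
}
index = build_index(papers)
print(sorted(search(index, "Martin Luther")))  # ['History 2010']
```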
  85. Searching for questions that relate to Martin Luther

  86. The Dept. of Education makes statistics about past exams, which

    is cool.
  87. But they all look like this

  88. Not so easy to understand

  89. But, computers can understand.

  90. So, I made a better way

  91. Comparing four years of exam results of Art vs. Biology.

  92. Chemistry vs. Biology

  93. English vs. Irish

  94. Exam Paper Filter

  95. Better Examinations.ie has been used by hundreds of students all

    over Ireland.
  96. It uses the data the Department of Education makes available

    to make life easier for students.
  97. In this case, the data are the exam papers and

    all of the statistics the Department of Education creates.
  98. .

  99. Break

  100. None
  101. Freeflow: real-time, automatic road traffic detection.

  102. •  Automatically gathered real-time road traffic data with traffic cameras and

    via Twitter. •  Dispensed structured road traffic info via a web app. •  Also displayed data about ice levels on roads, average road temperature and air temperature from the National Roads Authority.
  104. So, instead of this, to deal with traffic reports

  105. We can just let computers do all the work for

    us
  106. How does it detect cars in an image?

  107. It’s a difficult problem

  108. Eventually I came up with something simple

  109. Instead of getting the computer to look for a car

    in an image, it looks for the absence of a car.
  110. So essentially, the more road it can see = the

    less traffic.
  111. O’Connell Bridge. Obviously the road is all a similar shade

    of grey.
  112. The computer simply counts up the 4x4 pixel areas of

    black colour. Red = area of empty space the computer can see.
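The "look for empty road" idea can be sketched as follows, assuming the camera image is already available as rows of greyscale values. The road shade, the tolerance and the function name are guesses made for illustration, and the real system may have pre-processed the image differently.

```python
def road_visible_fraction(image, road_grey=128, tolerance=20, block=4):
    """image: list of rows of greyscale values (0-255).

    Splits the image into block x block tiles and returns the fraction of
    tiles whose pixels are all close to the assumed road colour.
    """
    height, width = len(image), len(image[0])
    road_tiles = total_tiles = 0
    for top in range(0, height - block + 1, block):
        for left in range(0, width - block + 1, block):
            tile = [image[y][x]
                    for y in range(top, top + block)
                    for x in range(left, left + block)]
            total_tiles += 1
            if all(abs(p - road_grey) <= tolerance for p in tile):
                road_tiles += 1
    return road_tiles / total_tiles if total_tiles else 0.0


# An 8x8 "road" with a bright car covering the top-left 4x4 tile.
image = [[128] * 8 for _ in range(8)]
for y in range(4):
    for x in range(4):
        image[y][x] = 250
print(road_visible_fraction(image))  # 0.75
```

A higher fraction means more visible road, so less traffic.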
  113. Twitter was a great source of information. Tweets were analysed

    in real time, looking for words like “accident” or “delays”. Location was also found by searching for words like “on” or “at”.
  114. Bing Maps was then used to complete the address and convert

    it to a latitude/longitude pair for mapping. Tweets that mentioned an incident were kept for 4 hours before being cleared from the system.
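The tweet-handling steps can be sketched roughly like this: spot an incident word, take the text after "on" or "at" as the place, and drop the report after four hours. The regular expression, function names and example tweet are assumptions. The geocoding step is left out because it needs a mapping service.

```python
import re
from datetime import datetime, timedelta

INCIDENT_WORDS = ("accident", "delays")
KEEP_FOR = timedelta(hours=4)

# Assumed pattern: capture the text after "on" or "at", up to punctuation.
LOCATION_PATTERN = re.compile(r"\b(?:on|at)\s+([^,.;!?]+)", re.IGNORECASE)


def parse_incident(text, posted_at):
    """Return an incident record, or None if the tweet mentions no incident."""
    lowered = text.lower()
    if not any(word in lowered for word in INCIDENT_WORDS):
        return None
    match = LOCATION_PATTERN.search(text)
    return {
        "text": text,
        "place": match.group(1).strip() if match else None,
        "expires": posted_at + KEEP_FOR,
    }


def active_incidents(incidents, now):
    """Keep only the incidents that have not yet expired."""
    return [i for i in incidents if i["expires"] > now]


# Invented example tweet.
posted = datetime(2013, 1, 1, 8, 0)
incident = parse_incident("Accident on the M50 northbound, expect delays", posted)
print(incident["place"])
print(len(active_incidents([incident], datetime(2013, 1, 1, 13, 0))))
```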
  116. None
  117. None
  118. Ways of accessing all this data

  119. If you do CS + Business: •  Twitter’s APIs (dev.twitter.com)

    •  Facebook’s APIs (developers.facebook.com) •  Read up on machine learning techniques •  Learn Python, it’s really good for this type of stuff. •  Find unstructured data like road traffic images from cameras, and convert it to structured data to make something cool. •  Take data from anywhere and everywhere (as long as it’s legal), put it all together and see what you get.
  120. Working with data without programming is easy, too.

  121. Step one is to find a source(s) of data you

    want to analyse.
  122. There’s data all over the internet, just waiting to be

    analysed
  123. Find a set of statistics you wish to use, and

    copy it into excel.
  124. Excel will handle most of what you throw at it.

  125. Compare datasets to other datasets.

  126. Compare populations of different countries to their average exam scores.

  127. Compare shoe sizes to the number of people who have

    a disease.
  128. You could find that people with bigger feet have a

    greater chance of getting disease x.
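For anyone who does program, a comparison like this usually comes down to a correlation coefficient (Excel's CORREL function computes the same thing). Here is a minimal sketch of the Pearson correlation. The numbers are made up and deliberately perfectly linear.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: average shoe size per group vs. cases of "disease x".
shoe_sizes = [5, 6, 7, 8, 9]
cases = [10, 12, 14, 16, 18]
print(pearson(shoe_sizes, cases))  # 1.0
```

A value near 1 or -1 means the two datasets move together. It says nothing about why.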
  129. Do remember though, correlation is not causation.

  130. Just because one thing seems to cause the other, it

    doesn’t mean it does.
  131. E.g. The frequency with which Lindsay Lohan finds herself in

    jail may be correlated with the rate of increasing deforestation, but that doesn’t mean the two events have an effect on each other.
  132. Make sure to have additional evidence too, just to be

    sure of your hypothesis.
  133. Thanks for listening! Questions? @james_eggers