Big Data - 01 - Introduction

Ghislain Fourny
September 19, 2017

Lecture given at ETH Zurich, September 19, 2017.


Transcript

  1. Ghislain Fourny. Big Data. 1. Introduction
  2. None
  3. Exploring the infinitely big...
  4. ... means exploring the infinitely small. Image credits: own work by uploader, PBS NOVA [1], Fermilab, Office of Science, United States Department of Energy, Particle Data Group
  5. Data is like matter. Study of the real world: Physics. Study of the data world: Data Science.
  6. A good decision is based on knowledge, not on numbers. - Plato
  7. We have tons of data
  8. We have data on data
  9. Poll. Go now to https://eduapp-app1.ethz.ch/ or install EduApp 2.x

  10. Big Data Lectures. Big Data. Computer Science, Data Science, CBB, MSc with CS background.
  11. Big Data Lectures: Fall 2017. Computer Science, Data Science, CBB, MSc with CS background, Other departments. Big Data, Information Systems for Engineers.
  12. Big Data Lectures: Fall 2017, Spring 2018. Computer Science, Data Science, CBB, MSc with CS background, Other departments. Big Data, Information Systems for Engineers, Big Data (for Engineers).
  13. Big Data Lectures: Fall 2017, Spring 2018. Big Data (this lecture), Information Systems for Engineers, Big Data (for Engineers). Computer Science, Data Science, CBB, MSc with CS background, Other departments.
  14. A Short History of Databases
  15. None
  16. Speaking/Singing
  17. Writing. Picture: Rosetta Stone, © Hans Hillewaert
  18. Accounting. Picture: Plimpton 322 (Public Domain)
  19. Printing. Picture: Willi Heidelbach
  20. Computers. Picture: Ben Franske - DM IBM S360.jpg on en.wiki
  21. 1960s: File Systems. Lorem Ipsum Dolor sit amet Consectetur Adipiscing Elit. In Imperdiet Ipsum ante
  22. None
  23. 1970s: The Relational Era
  24. 1980s: The Object Era
  25. 2000s: The NoSQL Era. foo bar foobar. Key-value stores, Triple stores, Column stores, Document stores
  26. In short? 1970: We threw data at computers.
  27. In short? 1990: We threw computers at computers.
  28. In short? 2000: We threw computers at data.
  29. In short? Now: We are throwing data at data.

  30. Big Data
  31. It's a buzzword!
  32. Big Data goes across disciplines: Applications, Data Management, Algorithms, Distributed Systems, High-Performance Computing, Programming Languages, Statistics, Machine Learning
  33. Big Data involves a lot of proprietary technology
  34. The Big in Big Data
  35. The Three Vs. Big Data: Volume, Variety, Velocity. TB, ZB
  36. Data Volume: MORE MORE MORE
  37. Data Volume: Web, Sensors, Proprietary, Scientific, Content, Usage, Location, IoT, Digital Traces, Experiments, Surveys
  38. Data Volume … because we can! Infrastructure, Hardware, Software, Technology
  39. Data Volume … because data carries value

  40. Data is worth more than the sum of its parts: Utility(A + B) > Utility(A) + Utility(B), where A and B stand for the items shown as pictures on the slide
  41. Data totality: one can have complete data. All flights, All hotels, All shops, ...
  42. Prefixes (International System of Units). You must know this by heart!
    kilo (k): 1,000 (3 zeros)
    Mega (M): 1,000,000 (6 zeros)
    Giga (G): 1,000,000,000 (9 zeros)
    Tera (T): 1,000,000,000,000 (12 zeros)
    Peta (P): 1,000,000,000,000,000 (15 zeros)
    Exa (E): 1,000,000,000,000,000,000 (18 zeros)
    Zetta (Z): 1,000,000,000,000,000,000,000 (21 zeros)
    Yotta (Y): 1,000,000,000,000,000,000,000,000 (24 zeros)
  43. Prefixes (International System of Units). You must NOT know this by heart!
    kibi (ki): 1,024 (2^10)
    Mebi (Mi): 1,048,576 (2^20)
    Gibi (Gi): 1,073,741,824 (2^30)
    Tebi (Ti): 1,099,511,627,776 (2^40)
    Pebi (Pi): 1,125,899,906,842,624 (2^50)
    Exbi (Ei): 1,152,921,504,606,846,976 (2^60)
    Zebi (Zi): 1,180,591,620,717,411,303,424 (2^70)
    Yobi (Yi): 1,208,925,819,614,629,174,706,176 (2^80)
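
The two prefix tables on slides 42 and 43 can be reproduced with a few lines of Python. This is only an illustrative sketch: the dictionary and function names are made up for this example, and the binary symbols follow the standard IEC spelling (`Ki` rather than the slide's `ki`). It also shows how far each binary prefix drifts from its decimal counterpart.

```python
# Decimal prefixes as powers of ten, binary prefixes as powers of two.
# The names below are hypothetical helpers, not part of the lecture material.
SI_EXPONENTS = {"k": 3, "M": 6, "G": 9, "T": 12, "P": 15, "E": 18, "Z": 21, "Y": 24}
BINARY_EXPONENTS = {"Ki": 10, "Mi": 20, "Gi": 30, "Ti": 40, "Pi": 50, "Ei": 60, "Zi": 70, "Yi": 80}


def si_value(symbol: str) -> int:
    """Return the multiplier of a decimal prefix, e.g. 'T' -> 10**12."""
    return 10 ** SI_EXPONENTS[symbol]


def binary_value(symbol: str) -> int:
    """Return the multiplier of a binary prefix, e.g. 'Ti' -> 2**40."""
    return 2 ** BINARY_EXPONENTS[symbol]


def relative_gap(si_symbol: str, binary_symbol: str) -> float:
    """Fraction by which the binary prefix exceeds the decimal one."""
    return binary_value(binary_symbol) / si_value(si_symbol) - 1.0


if __name__ == "__main__":
    # Both dicts are ordered from smallest to largest, so zip pairs k/Ki, M/Mi, ...
    for si, bi in zip(SI_EXPONENTS, BINARY_EXPONENTS):
        print(f"{si:>2} = {si_value(si):,}  {bi:>2} = {binary_value(bi):,}  "
              f"gap = {relative_gap(si, bi):.1%}")
```

The gap grows with every step, from about 2.4% between kilo and kibi to roughly 21% between yotta and yobi, which is one reason to state explicitly which family of prefixes a figure uses.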
  44. Data Variety
  45. Data Shapes: Tables
  46. Data Shapes: Trees
  47. Data Shapes: Graphs
  48. Data Shapes: Cubes
  49. Data Shapes: Text (illustrated with Lorem ipsum placeholder text)
  50. Data Velocity
  51. Data is generated automatically. Picture: Vladimir Voronin/123RF
  52. Data is a realtime byproduct of human activity. Picture: Ash Waechter/123RF
  53. Three paramount factors: Capacity, Throughput, Latency. Picture: Ash Waechter/123RF
  54. Capacity: "How much data can we store?" Picture: Ash Waechter/123RF
  55. Throughput: "How fast can we transmit data?" Picture: Ash Waechter/123RF
  56. Latency: "When do I start receiving data?" Picture: Ash Waechter/123RF
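
One common first-order way to relate the three factors from slides 53 to 56 is to say that a read takes the latency plus the size divided by the throughput, provided the data fits in the available capacity. The sketch below assumes that simple model; the function names and the device figures (100 MB/s, 10 ms) are hypothetical and not taken from the lecture.

```python
def fits(size_bytes: float, capacity_bytes: float) -> bool:
    """Capacity: can we store this much data at all?"""
    return size_bytes <= capacity_bytes


def transfer_time_seconds(size_bytes: float,
                          throughput_bytes_per_s: float,
                          latency_s: float) -> float:
    """Simplified model: wait for the first byte, then stream the rest."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return latency_s + size_bytes / throughput_bytes_per_s


if __name__ == "__main__":
    # Hypothetical device: 100 MB/s throughput, 10 ms latency.
    throughput = 100e6
    latency = 0.010
    small = transfer_time_seconds(4_000, throughput, latency)   # latency dominates
    large = transfer_time_seconds(1e9, throughput, latency)     # throughput dominates
    print(f"4 kB read: {small * 1000:.2f} ms")
    print(f"1 GB read: {large:.2f} s")
```

Under this model, small reads are dominated by latency and large reads by throughput, so the two factors matter for different workloads.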
  57. The progress made (1956-2010): Capacity, Throughput, Latency. Picture: Ash Waechter/123RF. Source: Michael E. Friske, Claus Mikkelsen, The History of Storage, SHARE 2014
  58. The progress made (1956-2010): Capacity 150,000,000x, Throughput 10,000x, Latency 8x. Picture: Ash Waechter/123RF
  59. The progress made (1956-2010), logarithmic: Capacity 150,000,000x, Throughput 10,000x, Latency 8x. Picture: Ash Waechter/123RF
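
To see why slide 59 switches to a logarithmic scale, it helps to express the three improvement factors as orders of magnitude. The following is a small sketch with made-up names; the factors themselves are the ones quoted on slide 58.

```python
import math

# Improvement factors 1956-2010 as quoted on the slide.
GROWTH_1956_2010 = {"Capacity": 150_000_000, "Throughput": 10_000, "Latency": 8}


def orders_of_magnitude(factor: float) -> float:
    """Base-10 logarithm of an improvement factor."""
    return math.log10(factor)


if __name__ == "__main__":
    for name, factor in GROWTH_1956_2010.items():
        print(f"{name:<10} {factor:>12,}x  about 10^{orders_of_magnitude(factor):.1f}")
```

On a linear axis the latency and throughput bars would be invisible next to capacity; on a logarithmic axis the three values are roughly 8.2, 4 and 0.9.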
  60. Capacity: Example. ~600,000 words. Picture: Ash Waechter/123RF
  61. Throughput: Example. ~1,000 words per minute. Picture: Ash Waechter/123RF
  62. Latency: Example. ~1 minute to stand up, go to the shelf, pick the book, find the page.
  63. 2017 – Analogy with a book: 600,000 words, 1,000 words per minute, 10 hours
  64. 2217 – Analogy with a book: 75,000,000,000,000 words, 10,000,000 words per minute, 750 years
  65. The progress made (1956-2010), logarithmic: Capacity 150,000,000x, Throughput 10,000x, Latency 8x. Parallelize! Picture: Ash Waechter/123RF
  66. 2217: 15,000 persons could read it all in 10 hours.
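
The book analogy on slides 63 to 66 is simple arithmetic and can be checked with a short sketch. It assumes that the 2017 figures are scaled directly by the capacity and throughput factors from slide 58, which yields 90,000,000,000,000 words rather than the 75,000,000,000,000 quoted on slide 64; that direct scaling is an assumption of this example, and the function names are hypothetical.

```python
import math

WORDS_2017 = 600_000            # size of the book (slide 63)
WPM_2017 = 1_000                # reading speed in words per minute (slide 63)
CAPACITY_GROWTH = 150_000_000   # factor from slide 58
THROUGHPUT_GROWTH = 10_000      # factor from slide 58


def reading_hours(words: int, words_per_minute: int) -> float:
    """Hours one reader needs to get through all the words."""
    return words / words_per_minute / 60


def readers_needed(words: int, words_per_minute: int, target_hours: float) -> int:
    """Readers required if the text is split evenly and read in parallel."""
    return math.ceil(reading_hours(words, words_per_minute) / target_hours)


if __name__ == "__main__":
    # Assumption: scale the 2017 book directly by the growth factors.
    future_words = WORDS_2017 * CAPACITY_GROWTH
    future_wpm = WPM_2017 * THROUGHPUT_GROWTH
    print("2017, one reader:", reading_hours(WORDS_2017, WPM_2017), "hours")
    print("scaled, one reader:", reading_hours(future_words, future_wpm), "hours")
    print("readers for 10 hours:", readers_needed(future_words, future_wpm, 10))
```

With these inputs a single reader goes from 10 hours to 150,000 hours, and splitting the text across 15,000 readers brings it back to 10 hours, which matches slide 66 and motivates the "Parallelize!" message.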
  67. Data centers: clusters of machines (10,000s). Diagram: one Master and six Slave nodes
  68. The progress made (1956-2010), logarithmic: Capacity 150,000,000x, Throughput 10,000x, Latency 8x. Batch processing! Picture: Ash Waechter/123RF
  69. What is Big Data (my definition)? Big Data is a portfolio of technologies that were designed to store, manage and analyze data that is too large to fit on a single machine, while accommodating the growing discrepancy between capacity, throughput and latency.
  70. Big Data in the Sciences. Picture: pcanzo/123RF
  71. Physics: CERN pioneers, produces 30 PB/year. Picture: CERN
  72. Physics: CERN pioneers, produces 30 PB/year. 600,000,000 collisions/second, 11,000 computers, 100,000+ processors
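
To get a feel for the 30 PB/year figure on slides 71 and 72, it can be converted into an average rate per second. The sketch below assumes decimal petabytes (10^15 bytes, as on slide 42) and production spread evenly over a 365-day year; both are simplifying assumptions, and the function name is made up for this example.

```python
PETABYTE = 10 ** 15                      # decimal prefix, as on slide 42
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # assumption: 365-day year, uniform output


def average_bytes_per_second(bytes_per_year: float) -> float:
    """Average data rate if the yearly volume were produced evenly."""
    return bytes_per_year / SECONDS_PER_YEAR


if __name__ == "__main__":
    rate = average_bytes_per_second(30 * PETABYTE)
    print(f"30 PB/year is about {rate / 10**9:.2f} GB/s on average")
```

Under these assumptions the average is a little under 1 GB per second, sustained around the clock.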
  73. Astronomy: Sloan Digital Sky Survey. Picture: NASA / WMAP Science Team
  74. Astronomy: Sloan Digital Sky Survey. Since 2000, now in phase IV till 2020. The most detailed 3D maps of the Universe ever made. 35% covered so far. 1G objects, 4 spectra
  75. Astronomy: Sloan Digital Sky Survey. 200 GB/night. Picture: Wikipedia/EdPost
  76. Genomics: DNA sequencing. High Volume, Low Cost
  77. Genomics: the complete human genome. 3B base pairs. Picture: Wikipedia/Zephyris
  78. Lecture Scope
  79. Lecture scope
  80. Lecture scope: Databases, Machine Learning, AI
  81. Lecture scope: databases only. Databases, Machine Learning, AI
  82. Lecture Team: Ingo Müller (Head TA), Renato Marroquin (TA), Marco Ancona (TA), Konstantin Taranov (TA), Alexandr Nigay (TA), Damien Desfontaines (TA), Ghislain Fourny
  83. Guest Lecture: Bart Samwel, Senior Staff Software Engineer (... and maybe more)
  84. Lecture Overview
    § Data in the large: Key-value stores (S3, Azure Blob Storage); Distributed file systems (HDFS); Distributed query processing (MapReduce, Spark); Resource management (YARN); Column stores (HBase)
  85. Lecture Overview
    § Data in the large: Key-value stores (S3, Azure Blob Storage); Distributed file systems (HDFS); Distributed query processing (MapReduce, Spark); Resource management (YARN); Column stores (HBase)
    § Data in the small: Document stores (MongoDB); Syntax (XML, JSON); Data models, Schemas, Querying
  86. Lecture Overview
    § Data in the large: Key-value stores (S3, Azure Blob Storage); Distributed file systems (HDFS); Distributed query processing (MapReduce, Spark); Resource management (YARN); Column stores (HBase)
    § Data in the small: Document stores (MongoDB); Syntax (XML, JSON); Data models, Schemas, Querying
    § Data in the very small: Data warehouses (OLAP, ROLAP, XBRL); Graph databases (RDF)
  87. What is expected: attendance of the weekly lecture (3 hours/w, Tuesdays 10-12, Wednesdays 9-10)
  88. What is expected: attendance of the weekly lecture (3 hours/w, Tuesdays 10-12, Wednesdays 9-10); attendance of the exercise session (2 hours/w, Wednesdays/Fridays 13-15)
  89. What is expected: attendance of the weekly lecture (3 hours/w, Tuesdays 10-12, Wednesdays 9-10); attendance of the exercise session (2 hours/w, Wednesdays/Fridays 13-15); hands-on self-study, read the books, play with technology (2-3 hours/w)
  90. What is expected: attendance of the weekly lecture (3 hours/w, Tuesdays 10-12, Wednesdays 9-10); attendance of the exercise session (2 hours/w, Wednesdays/Fridays 13-15); hands-on self-study, read the books, play with technology (2-3 hours/w); passing the written exam in the winter session (150 minutes in January-February)
  91. What is expected: attendance of the weekly lecture (3 hours/w, Tuesdays 10-12, Wednesdays 9-10); attendance of the exercise session (2 hours/w, Wednesdays/Fridays 13-15); hands-on self-study, read the books, play with technology (2-3 hours/w); passing the written exam in the winter session (150 minutes in January-February). 8 KP
  92. Self-study: Microsoft Azure. All students get ~$100 of Azure credit for self-study
  93. Self-study: Registration for Azure. Please log on with your credentials to https://portal.azure.com