Big Data Lecture, Semi-Structured Data, Part 1: Syntax

ETH Zürich
Big Data Lecture HS 2014
Semi-Structured Data Part 1: Syntax
Tuesday, December 2nd, 2014

Ghislain Fourny

December 02, 2014

Transcript

  1. Big Data. Semi-Structured Data, Part 1: Syntax

    Dr. Ghislain Fourny, Tuesday, December 2nd, 2014. © Department of Computer Science | ETH Zürich
  2. INTRODUCTION

  3. Trees...

  4. ... and Graphs

  5. Data in the 1960s

  6. Data in the 1960s. Issue: Data Independence

  7. Edgar Codd, 1970

    Company        Year  Assets     Liabilities  Equity
    AmericanRapid  2012  1,000,000  1,000,000    1,000,000
    AmericanRapid  2013  1,000,000  1,000,000    1,000,000
    Visto          2012  1,000,000  1,000,000    1,000,000
    Visto          2013  1,000,000  1,000,000    1,000,000
  8. Edgar Codd, 1970 (same table as slide 7). Issue: Scaling Up
  9. NoSQL (2000s-2010s): Key-Value Stores

    Keys: foo, bar, foobar
    Values:
    { Company: "Visto", Year: 2012, Assets: 1000000000, Liabilities: 1000000000, Equity: 1000000000 }
    { Company: "AmericanRapid", Year: 2012, Assets: 1000000000, Liabilities: 1000000000, Equity: 1000000000 }
    { Company: "AmericanRapid", Year: 2011, Assets: 1000000000, Liabilities: 1000000000, Equity: 1000000000 }
  10. NoSQL (2000s-2010s): Triple Stores

    [Figure: graph with nodes Visto, 2012, 1000000000, 1000000000, 1000000000 and edge labels Company, Year, Assets, Liabilities, Equity]
  11. NoSQL (2000s-2010s): Column Stores

    Columns: Company, Year, Assets, Liabilities, Equity
    AmericanRapid, 2012: 1,000,000; 1,000,000; 1,000,000
    AmericanRapid, 2013: 1,000,000; 1,000,000 (one cell empty)
    Visto, 2012: 1,000,000; 1,000,000 (one cell empty)
    Visto, 2013: 1,000,000; 1,000,000; 1,000,000
  12. NoSQL (2000s-2010s): Document Stores

    {
      Company: "Visto",
      Year: 2012,
      Assets: 1000000000,
      Liabilities: 1000000000,
      Equity: 1000000000
    }
    {
      Company: "AmericanRapid",
      Year: 2012,
      Assets: 1000000000,
      Liabilities: 1000000000,
      Equity: 1000000000
    }
    {
      Company: "AmericanRapid",
      Year: 2011,
      Assets: 1000000000,
      Liabilities: 1000000000,
      Equity: 1000000000
    }
  13. NoSQL Data Stores

  14. NoSQL Data Stores: Key-value stores, Document Stores, Triple Stores, Column Stores

  15. NoSQL Data Stores: Key-value stores, Document Stores, Triple Stores, Column Stores

  16. Semi-Structured Documents

  17. Semi-Structured Documents: Structured, Unstructured

    Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
  18. Semi-Structured Documents: Structured, Unstructured, Semi-structured

    (Lorem ipsum text as on slide 17.)
    [Figure: tree with node labels a, b, c, d, e:f and the text fragments "This is", "text", "."]
  19. Standards

  20. For whom?

  21. 1. SYNTAX

  22. Well-formedness

  23. Well-formedness. One syntax = one language

  24. Well-formedness. One syntax = one language. D ∈ L?

  25. Well-formedness. One syntax = one language. D ∈ L? If so, D is well-formed; otherwise, D is not well-formed.
  26. XML

    <?xml version="1.0"?>
    <country code="CH">
      <name>Switzerland</name>
      <population>8014000</population>
      <currency code="CHF">Swiss Franc</currency>
      <cities>
        <city>Zurich</city>
        <city>Geneva></city>
        <city>Bern <!-- the Federal City --></city>
      </cities>
      <description>
        We produce <b>very</b> good chocolate.
      </description>
    </country>
  27. JSON

    {
      "code": "CH",
      "name": "Switzerland",
      "population": 8014000,
      "currency": {
        "name": "Swiss Franc",
        "code": "CHF"
      },
      "confederation": true,
      "president": "Didier Burkhalter",
      "capital": null,
      "cities": ["Zurich", "Geneva", "Bern"],
      "description": "We produce very good chocolate."
    }
  28. HTML

    <!DOCTYPE html>
    <html>
      <head>
        <title>Country</title>
      </head>
      <body>
        <h1 class="Title">Switzerland</h1>
        <div>Population: 8014000<br>
          Currency: Swiss Franc (CHF)</div>
        <h2>Cities</h2>
        <ul>
          <li>Zurich</li>
          <li>Geneva></li>
          <li>Bern&nbsp;<!-- the Federal City --></li>
        </ul>
      </body>
    </html>
  29. XML

  30. XML: Element

    <foo>[more XML]</foo>
    <bar/> = <bar></bar>
  31. XML: Element

    <foo>[more XML]</foo>    opening tag: <foo>, closing tag: </foo>
    <bar/> = <bar></bar>     empty tag: <bar/>
  32. XML: Attribute

    <a attr="value"/>
  33. XML: Text

    <a>This is text</a>
  34. XML: Comment

    <!-- This is a comment -->
  35. XML: Processing Instruction

    <?myapp do whatever ?>
    <?xml version="1.0"?>

    Charles Goldfarb: "In a perfect world, processing instructions would not be necessary. However, as you might have noticed, the world is not perfect."
  36. What Appears Where?

                             Top-Level  Between Element Tags  Inside Opening Element Tag
    Elements                 once       yes                   no
    Attributes               no         no                    yes
    Text                     no         yes                   no
    Comments                 yes        yes                   no
    Processing Instructions  yes        yes                   no
  37. XML: Well-formedness

    Not well-formed: <a foo="bar" foo="bar2"/>
    Well-formed:     <a foo="bar" bar="foo"/>
  38. XML: Well-formedness

    Not well-formed: <a><b></a></b>
    Well-formed:     <a><b></b></a>
  39. XML: Well-formedness

    Not well-formed: <a>1 < 2</a>
    Well-formed:     <a>1 &lt; 2</a>
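
The three pairs on slides 37 to 39 can be checked mechanically. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser as the well-formedness checker; the helper name `is_well_formed` is hypothetical and not part of the lecture.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


examples = [
    '<a foo="bar" foo="bar2"/>',  # duplicate attribute name
    '<a foo="bar" bar="foo"/>',
    '<a><b></a></b>',             # tags are not properly nested
    '<a><b></b></a>',
    '<a>1 < 2</a>',               # unescaped "<" in text
    '<a>1 &lt; 2</a>',
]

for doc in examples:
    print(is_well_formed(doc), doc)
```

A parser either accepts a document or rejects it, which mirrors the question "D ∈ L?" from the well-formedness slides.
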
  40. XML: Entity References

    <?xml version="1.0"?>
    <document>
      Lorem ipsum dolor sit amet, consectetur adipiscing elit,
      &lt; &gt; &quot; &apos; &amp;
      eiusmod tempor incididunt ut labore et dolore magna
      aliqua.
    </document>

    &lt; = <, &gt; = >, &quot; = ", &apos; = ', &amp; = &
  41. XML: Character References

    <?xml version="1.0"?>
    <document>
      Lorem ipsum dolor sit amet,
      consectetur adipiscing elit, sed do
      &#x03A0;
      eiusmod tempor incididunt ut
      labore et dolore magna aliqua.
    </document>

    &#x03A0; = Π
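
To see how a parser resolves the predefined entity references and the character reference from slides 40 and 41, here is a small sketch, again assuming Python's standard `xml.etree.ElementTree` (the one-line test document is made up for illustration).

```python
import xml.etree.ElementTree as ET

# The five predefined entity references plus one hexadecimal character reference.
doc = ET.fromstring(
    "<document>&lt; &gt; &quot; &apos; &amp; &#x03A0;</document>"
)

# The parser hands back the characters the references stand for.
print(doc.text)
```

The references exist only in the serialized syntax; after parsing, the application sees plain characters.
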
  42. XML: Well-formedness

    Not well-formed: <a attr="a "quote""/>
    Well-formed:     <a attr="a &quot;quote&quot;"/>
  43. XML: Well-formedness

    Not well-formed: <!-- my -- comment -->
  44. XML: Entity References

    <?xml version="1.0"?>
    <!DOCTYPE document [
      <!ENTITY myownentity "foobar">
    ]>
    <document>
      Lorem ipsum dolor sit amet, consectetur adipiscing elit,
      sed do
      &myownentity;
      eiusmod tempor incididunt ut labore et dolore magna
      aliqua.
    </document>

    &myownentity; = foobar
  45. XML: CDATA sections

    <?xml version="1.0"?>
    <document>
      Lorem ipsum dolor sit amet, consectetur adipiscing elit,
      sed do
      <![CDATA[
        &<<>>"'
      ]]>
      eiusmod tempor incididunt ut labore et dolore magna
      aliqua.
    </document>
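
Custom entities and CDATA sections from slides 44 and 45 can be tried out the same way. This is a sketch under the assumption that Python's standard `xml.etree.ElementTree` is used; the shortened document below is an illustration, not the slide's exact text.

```python
import xml.etree.ElementTree as ET

source = """<!DOCTYPE document [
  <!ENTITY myownentity "foobar">
]>
<document>sed do &myownentity; <![CDATA[ &<<>>"' ]]> eiusmod</document>"""

doc = ET.fromstring(source)

# The entity reference is replaced by its declared value, and the CDATA
# content arrives verbatim, without any escaping being interpreted.
print(repr(doc.text))
```

Inside a CDATA section, characters such as `&` and `<` need no escaping, which is why the section can hold them literally.
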
  46. XML Names

    Not allowed: <1234/>  <a<b/>  <xml/>
    Allowed:     <foo1234/>  <_bar/>
  47. XML: Namespaces

    [Figure: "XML with Namespaces" shown in relation to "XML"]
  48. XML Names

    Namespace:     http://nosql.example.com
    Local name:    entity
    Expanded name: {http://nosql.example.com}entity
  49. XML Names

    Namespace:     http://nosql.example.com
    Local name:    entity
    Expanded name: {http://nosql.example.com}entity
  50. Life without QNames (Clark Notation)

    <{http://www.w3.org/1998/Math/MathML}math>
      <{http://www.w3.org/1998/Math/MathML}apply>
        <{http://www.w3.org/1998/Math/MathML}eq/>
        <{http://www.w3.org/1998/Math/MathML}ci> x </{http://www.w3.org/1998/Math/MathML}ci>
        <{http://www.w3.org/1998/Math/MathML}apply>
          <{http://www.w3.org/1998/Math/MathML}root/>
          <{http://www.w3.org/1998/Math/MathML}cn> 2 </{http://www.w3.org/1998/Math/MathML}cn>
        </{http://www.w3.org/1998/Math/MathML}apply>
      </{http://www.w3.org/1998/Math/MathML}apply>
    </{http://www.w3.org/1998/Math/MathML}math>
  51. Life with Prefixes and QNames

    <m:math xmlns:m="http://www.w3.org/1998/Math/MathML">
      <m:apply>
        <m:eq/>
        <m:ci> x </m:ci>
        <m:apply>
          <m:root/>
          <m:cn> 2 </m:cn>
        </m:apply>
      </m:apply>
    </m:math>
  52. Life with Prefixes and QNames (same document as slide 51). The namespace is represented by a prefix.

  53. Life with Prefixes and QNames (same document as slide 51). Prefix m is bound to a namespace using an xmlns:m attribute.

  54. Life with Prefixes and QNames (same document as slide 51)

    QName:      m:apply
    Prefix:     m
    Namespace:  http://www.w3.org/1998/Math/MathML
    Local name: apply
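
The step from a QName to its expanded name can be sketched in a few lines. The helper `expand_qname` and its `bindings` dictionary are hypothetical names introduced for illustration; the comparison at the end relies on the fact that Python's standard `xml.etree.ElementTree` reports element names in Clark notation.

```python
import xml.etree.ElementTree as ET

MATHML = "http://www.w3.org/1998/Math/MathML"


def expand_qname(qname: str, bindings: dict) -> str:
    """Expand a QName such as 'm:apply' into Clark notation.

    `bindings` maps prefixes to namespace URIs. In this sketch the key ""
    stands for the default namespace (an assumption, not lecture material).
    """
    if ":" in qname:
        prefix, local = qname.split(":", 1)
        # A KeyError here means the prefix is not bound.
        return f"{{{bindings[prefix]}}}{local}"
    namespace = bindings.get("")
    return f"{{{namespace}}}{qname}" if namespace else qname


bindings = {"m": MATHML}
print(expand_qname("m:apply", bindings))

# The parser resolves the prefix in the same way.
root = ET.fromstring(
    f'<m:math xmlns:m="{MATHML}"><m:apply><m:eq/></m:apply></m:math>'
)
print(root[0].tag == expand_qname("m:apply", bindings))
```

The prefix itself carries no meaning; only the namespace it is bound to and the local name identify the element.
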
  55. Default Namespace

    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <apply>
        <eq/>
        <ci> x </ci>
        <apply>
          <root/>
          <cn> 2 </cn>
        </apply>
      </apply>
    </math>
  56. Default Namespace (same document as slide 55). No Prefix: Default Namespace
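
One way to convince oneself that the prefixed and the default-namespace spellings denote the same element names is to compare what a parser reports for each. This is a sketch assuming Python's standard `xml.etree.ElementTree`; the shortened documents and the helper `expanded_names` are illustrative only.

```python
import xml.etree.ElementTree as ET

MATHML = "http://www.w3.org/1998/Math/MathML"

with_prefix = ET.fromstring(
    f'<m:math xmlns:m="{MATHML}"><m:apply><m:eq/><m:ci> x </m:ci></m:apply></m:math>'
)
with_default = ET.fromstring(
    f'<math xmlns="{MATHML}"><apply><eq/><ci> x </ci></apply></math>'
)


def expanded_names(element):
    """List the expanded (Clark notation) names of an element and its descendants."""
    return [node.tag for node in element.iter()]


# Both spellings yield the same expanded names.
print(expanded_names(with_prefix) == expanded_names(with_default))
```
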
  57. Binding scopes

    <m:math xmlns:m="http://www.w3.org/1998/Math/MathML">
      <m:apply>
        <m:eq/>
        <m:ci> x </m:ci>
        <m:apply>
          <m:root/>
          <m:cn> 2 </m:cn>
        </m:apply>
      </m:apply>
    </m:math>

    Lifetime of the prefix binding
  58. (Not) Well-Formed XML

    <?xml version="1.0" encoding="utf-16"?>
    <movies>
      <movie id=”56225”>
        <title>Love Story</title>
        <title></title>
        <year>1980</year>
        <_director name='Coppola'></_director>
        <comment text=”Five start” text=”Average”/>
        <xml>Introduce XML content</xml>
        <newcomment text="An <important> text">Oscar</ newcomment>
        <comment lang=de>&copy; 1980 Warner Bros.</comment>
        <!-- Famous movie of the --80s -->
      </Movie>
    </movies>
  59. (Not) Well-Formed XML (same document as slide 58)
  60. Well-Formedness: How To Tell? An editor (oXygen, ...) will tell you.

  61. Well Formed XML

    <?xml version="1.0" encoding="utf-16"?>
    <!DOCTYPE movies [
      <!ENTITY copy "&#169;">
    ]>
    <movies>
      <Movie id="56225">
        <title>Love Story</title>
        <title></title>
        <year>1980</year>
        <_director name='Coppola'></_director>
        <comment text="Five start"/>
        <comment text="Average"/>
        <newcomment text="An &lt;important&gt; text">Oscar</newcomment>
        <comment lang="de">&copy; 1980 Warner Bros.</comment>
        <!-- Famous movie of the 80s -->
      </Movie>
    </movies>
  62. Which QNames are in which Namespaces?

    <?xml version="1.0"?>
    <!DOCTYPE eth>
    <eth xmlns="http://www.ethz.ch"
         xmlns:xmldb="http://www.dbis.ethz.ch"
         date="11.11.2006"
         xmldb:date="12.11.2006">
      <date>13.11.2006</date>
      <president number="1">Empty</president>
      <Rektor>Name 2</Rektor>
    </eth>
  63. Which QNames are in which Namespaces? (same document as slide 62)
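
The question on slides 62 and 63 can be explored by letting a parser print the expanded names. The sketch below assumes Python's standard `xml.etree.ElementTree` and omits the XML declaration and the DOCTYPE for brevity; it is one way to check an answer, not the lecture's own solution.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("""<eth xmlns="http://www.ethz.ch"
     xmlns:xmldb="http://www.dbis.ethz.ch"
     date="11.11.2006"
     xmldb:date="12.11.2006">
  <date>13.11.2006</date>
  <president number="1">Empty</president>
  <Rektor>Name 2</Rektor>
</eth>""")

# Element names: the default namespace applies to unprefixed elements.
print(doc.tag)
print([child.tag for child in doc])

# Attribute names: the default namespace does NOT apply to unprefixed
# attributes, so "date" and "number" are in no namespace.
print(sorted(doc.attrib))
print(list(doc[1].attrib))
```

The three names spelled `date` therefore end up as three different expanded names: an element in the `http://www.ethz.ch` namespace, an attribute in no namespace, and an attribute in the `http://www.dbis.ethz.ch` namespace.
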
  64. XML: Not covered

    • Notations
    • Unparsed entities
    • Parameter entities
  65. JSON

  66. JSON: String

    "foo"
    "foo\nbar\u005f"
  67. JSON: Number

    3.1415
    -1.2345E+5
  68. JSON: Boolean

    true
    false
  69. JSON: Null

    null
  70. JSON: Array

    [
      3.14159265368979,
      true,
      "This is a string",
      { "foo" : false },
      null
    ]
  71. JSON: Object

    {
      "foo": 3.14159265368979,
      "bar": true,
      "str": "This is a string",
      "obj": { "school" : "ETH" },
      "Q": null
    }
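
The six kinds of JSON values from slides 66 to 71 map onto host-language types when parsed. The following sketch assumes Python's standard `json` module; the sample array simply collects the values shown on the slides.

```python
import json

# A raw string keeps the JSON escapes (\n, \u005f) for the JSON parser to handle.
text = r'[3.1415, -1.2345E+5, "foo\nbar\u005f", true, false, null, {"foo": false}]'

values = json.loads(text)

# Print the Python type that each JSON value was mapped to.
for value in values:
    print(type(value).__name__, repr(value))
```

In this particular library, numbers with a fraction or exponent become floats, strings become `str` with escapes resolved, `true`/`false` become booleans, `null` becomes `None`, and objects become dictionaries.
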
  72. JSON: Well-formedness

    Discouraged: { "foo" : "bar", "foo" : "bar2" }
    Preferred:   { "foo" : "bar", "bar" : "foo" }
    (Keys SHOULD be unique.)
  73. JSON: Well-formedness

    Not well-formed: { [ 1 ] : "bar", 2 : "bar2" }
    Well-formed:     { "1" : "bar", "2" : "foo" }
  74. JSON: Well-formedness

    Not well-formed: { foo: "bar", bar: "bar2" }
    Well-formed:     { "foo" : "bar", "bar" : "foo" }
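
The JSON rules on slides 72 to 74 can be checked with a parser. This is a minimal sketch assuming Python's standard `json` module; the helper names `is_well_formed_json` and `has_duplicate_keys` are hypothetical. Note that this particular parser accepts duplicate keys and keeps the last value, so detecting them needs the extra hook shown here.

```python
import json


def is_well_formed_json(text: str) -> bool:
    """Return True if the JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def has_duplicate_keys(text: str) -> bool:
    """Return True if any object in the (well-formed) text repeats a key."""
    duplicates = []

    def check(pairs):
        keys = [key for key, _ in pairs]
        if len(keys) != len(set(keys)):
            duplicates.append(keys)
        return dict(pairs)

    json.loads(text, object_pairs_hook=check)
    return bool(duplicates)


print(is_well_formed_json('{ "foo" : "bar", "foo" : "bar2" }'))  # accepted by this parser
print(has_duplicate_keys('{ "foo" : "bar", "foo" : "bar2" }'))
print(is_well_formed_json('{ [ 1 ] : "bar", 2 : "bar2" }'))      # keys must be strings
print(is_well_formed_json('{ foo: "bar", bar: "bar2" }'))        # keys must be quoted
```
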
  75. HTML

  76. XHTML syntax, HTML syntax

  77. HTML Syntax

    <!DOCTYPE html>
    <html>
      <head>
        <title>Untitled</title>
      </head>
      <body>
        Dear jane <br>
        <p>You are invited at the weekly meeting</p>
        <p>Yours sincerely, <br> John</p>
      </body>
    </html>
  78. HTML Elements: Void

    <br>
  79. HTML Elements: Raw text

    <style>
      body { color: black; background: white; }
      em { font-style: normal; color: red; }
    </style>
  80. HTML Elements: Escapable raw text

    <title>
      This is a &quot;title&quot;
    </title>
  81. HTML Elements: Foreign

    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <apply>
        <eq/>
        <ci> x </ci>
        <apply>
          <root/>
          <cn> 2 </cn>
        </apply>
      </apply>
    </math>
  82. HTML Elements: Normal

    <ol>
      <li>ett</li>
      <li>två</li>
      <li>tre</li>
    </ol>
  83. XHTML Syntax

    <?xml version="1.0"?>
    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Untitled</title>
      </head>
      <body>
        Dear jane <br/>
        <p>You are invited at the weekly meeting</p>
        <p>Yours sincerely, <br/> John</p>
      </body>
    </html>

    ... looks familiar?
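
The difference between the two syntaxes shows up as soon as both are fed to an XML parser. The sketch below assumes Python's standard `html.parser` and `xml.etree.ElementTree` modules; the shortened fragments and the names `TagCollector` and `parses_as_xml` are illustrative only.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

HTML_SYNTAX = "<body>Dear jane <br> <p>Yours sincerely, <br> John</p></body>"
XHTML_SYNTAX = (
    '<body xmlns="http://www.w3.org/1999/xhtml">'
    "Dear jane <br/> <p>Yours sincerely, <br/> John</p></body>"
)


class TagCollector(HTMLParser):
    """Collect the names of all start tags seen by the HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parses_as_xml(text: str) -> bool:
    """Return True if the text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


collector = TagCollector()
collector.feed(HTML_SYNTAX)
print(collector.tags)               # the HTML parser copes with the void <br>
print(parses_as_xml(HTML_SYNTAX))   # an XML parser sees an unclosed <br>
print(parses_as_xml(XHTML_SYNTAX))  # <br/> is an ordinary empty element
```

XHTML syntax is XML, so any XML tool can read it; HTML syntax needs an HTML-aware parser that knows which elements are void.
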
  84. YAML

  85. YAML

    %YAML 1.2
    ---
    Country:
      code: 'CH'
      name: 'Switzerland'
      population: 8014000
      currency:
        name: 'Swiss Franc'
        code: 'CHF'
      confederation: true
      president: 'Didier Burkhalter'
      capital: null
      cities:
        - 'Zurich'
        - 'Geneva'
        - 'Bern'
      description: 'We produce very good chocolate.'
  86. This presentation contains pictures that are under copyright:

    simo988 @ 123RF Stock Photo
    Ilka Erika Szasz-Fabian @ 123RF Stock Photo
    texelart @ 123RF Stock Photo