
An introduction to Open Data

How do you share data for maximum reuse? And when you’re a consumer, how do you consume as many datasets as possible in your application? With Open Data, we are creating a knowledge graph for humanity.

This knowledge graph is created by humans, and therefore we need to make some arrangements. In this presentation, you’ll learn how to properly use the HTTP protocol to publish data, how to add an open data license to your data, and how to make your dataset linked, and you’ll get an introduction to scalable Web API design.

Pieter Colpaert

March 28, 2019

Transcript

  1. Open Data

    Sharing data for maximum reuse. Consuming data on Web-Scale.
    https://pietercolpaert.be/#me
    Ghent University – IMEC – IDLab
    Guest lecture 2019-03-28
  2. Open Data

    Depending on who’s asking, the main goal is to…
    Share data for maximum reuse
    Or to...
    Consume data on Web-Scale
  3. Can you find data about the LEZ in Antwerp?

    Can you automate your workflow in a script?
    What would you suggest to the city of Antwerp?
  4. Examples of other datasets that must be open

    • Addresses (CRAB)
    • Accommodations with a permit (published by Toerisme Vlaanderen)
    • Company register (KBO)
    • Laws (Staatsblad)
    • Local decisions (cf. your city’s website)
    • Road registry, road works (GIPOD), road signs, traffic light status
    • Public transport timetables
    • Weather observations
    • Air quality observations
    • Water levels
    • Biodiversity statistics
    • Statistics used by the local, Flemish, Federal and EU governments and by the World Bank
    • Research results
    • Catalog of papers written by a university
    • Opening hours
    • The catalog of your supermarket
    • Clinical trials
    • Drug database
    • Molecule databases
    • General knowledge: Wikipedia, Wikidata
    • Maps: OpenStreetMap
  5. Program

    Open Data theory:
    1. The legal aspects: © and sui generis database right
    2. The HTTP protocol: focus on public read access
    3. RDF and why serializations shouldn’t matter
    4. How global identifiers allow discussing semantics on Web-Scale
    5. Public Web API design
    Exercises:
    1. Solving a question over 3 Linked Open Datasets with 1 piece of code
    2. Publishing a 4th dataset
  6. 2 most important laws for data:

    1. Copyright on the container
    2. Sui generis database rights (only when you invested “substantially” in creating the database)
    Both are incredibly vague regarding data (read up on Text and Data Mining)
  7. Cross Origin Resource Sharing: the problem

    Step 1: Open a website (e.g., https://ruben.verborgh.org)
    Step 2: Open the JavaScript console (shortcut: F12)
    Step 3: Execute the following code:
        fetch('https://gmail.com').then(async response => {
          console.log(await response.text());
        });
    Can you explain what happens?
    https://github.com/solid/web-access-control-spec/blob/master/Background.md
  8. Cross Origin Resource Sharing: the solution?

    Respond to HTTP GET requests with:
    • Allow all origins:
        Access-Control-Allow-Origin: *
    • Also tell the browser which headers it can show to the client through:
        Access-Control-Expose-Headers: Content-Type, Link
    Respond to HTTP OPTIONS (preflight) requests and allow common headers:
        Access-Control-Allow-Headers: Accept
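
The headers on this slide could be produced by a small helper like the one below. This is a minimal sketch, not a complete CORS implementation: the function name `corsHeaders` is hypothetical, and it only covers the three headers named above for a public, read-only resource.

```javascript
// Hypothetical helper (not from the slides): builds the CORS response headers
// for a public, read-only resource.
function corsHeaders(method) {
  const headers = {
    // Allow all origins
    'Access-Control-Allow-Origin': '*',
    // Response headers that client-side scripts are allowed to read
    'Access-Control-Expose-Headers': 'Content-Type, Link',
  };
  if (method === 'OPTIONS') {
    // Preflight response: request headers the client may send
    headers['Access-Control-Allow-Headers'] = 'Accept';
  }
  return headers;
}

console.log(corsHeaders('GET'));
console.log(corsHeaders('OPTIONS'));
```

In a Node.js server, the result could for instance be passed to `response.writeHead(200, corsHeaders(request.method))`.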
  9. Cross Origin Resource Sharing: the solution?

    Step 1: Open a website (e.g., https://ruben.verborgh.org)
    Step 2: Open the JavaScript console (shortcut: F12)
    Step 3: Execute the following code:
        fetch('https://pietercolpaert.be').then(async response => {
          console.log(await response.text());
        });
    Can you explain what happens?
    We’re trying to find a better solution: “Proposal: Allow servers to take full responsibility for cross-origin access protection” https://github.com/whatwg/fetch/issues/878
  10. To HTTPS or not to HTTPS?

    For Open Data, we can maybe trust certain intermediate HTTP caches?
    Interesting idea: neighbourhood HTTP caches between peers on a train
    Pauline Folz, Hala Skaf-Molli, Pascal Molli. CyCLaDEs: A Decentralized Cache for Triple Pattern Fragments. ESWC 2016: Extended Semantic Web Conference, May 2016, Heraklion, Greece. 〈hal-01251654v2〉

                                                                        HTTP  HTTPS
    You can cache a response in an intermediate third party HTTP proxy  yes   no
    You know an intermediary did not change the response                no    yes
  11. To HTTPS or not to HTTPS?

    But, does it matter what we think?
    Step 1: Go to an HTTPS website (e.g., https://ruben.verborgh.org)
    Step 2: Execute in the console (F12):
        fetch('http://pietercolpaert.be').then(async response => {
          console.log(await response.text());
        });
    Step 3: An error appears: CORS request is unsafe.
    Conclusion: when we want our data to be used as much as possible, the best strategy is to offer HTTPS
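
The check behind Step 3 can be sketched as follows, assuming the browser treats the request as mixed content (a page served over HTTPS asking for a resource over plain HTTP). The function name `isMixedContent` is hypothetical, and real browsers apply more exceptions (for example for localhost), so this illustrates the idea rather than the actual algorithm.

```javascript
// Hypothetical helper (not from the slides): returns true when a page served
// over HTTPS requests a resource over plain HTTP.
function isMixedContent(pageUrl, resourceUrl) {
  const page = new URL(pageUrl);
  // Relative resource URLs are resolved against the page URL
  const resource = new URL(resourceUrl, page);
  return page.protocol === 'https:' && resource.protocol === 'http:';
}

console.log(isMixedContent('https://ruben.verborgh.org', 'http://pietercolpaert.be'));
console.log(isMixedContent('https://ruben.verborgh.org', 'https://pietercolpaert.be'));
```

A dataset that is only offered over HTTP therefore cannot be fetched from any HTTPS website, which supports the conclusion above.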
  12. Caching: powering Web-scale

    1. Caching based on expiration:
       Response headers:
         Cache-Control: max-age=X (in seconds)
         Age: Y
       expirationTime = responseTime + X - Y
    2. Caching based on the ETag header:
       Response header: ETag: <unique hash>
       Request header: If-None-Match
       Status code: 304 Not Modified
    In both cases: set the Vary header! ⇒ defines which request headers are part of the cache key (e.g., the Accept headers)
    Full story: https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching
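
Both caching strategies can be sketched in a few lines. The function names and the shape of the returned response object are hypothetical, all times are in seconds, and the ETag comparison is simplified to exact string equality (real caches also handle lists of ETags and weak validators).

```javascript
// expirationTime = responseTime + X - Y, with X = max-age and Y = Age (in seconds)
function expirationTime(responseTime, maxAge, age = 0) {
  return responseTime + maxAge - age;
}

// A cached response is fresh as long as the expiration time has not been reached
function isFresh(now, responseTime, maxAge, age = 0) {
  return now < expirationTime(responseTime, maxAge, age);
}

// Hypothetical server-side revalidation: answer 304 without a body when the
// If-None-Match request header equals the current ETag.
function revalidate(ifNoneMatch, currentEtag, body) {
  const headers = { ETag: currentEtag, Vary: 'Accept' };
  if (ifNoneMatch === currentEtag) {
    return { status: 304, headers, body: '' };
  }
  return { status: 200, headers, body };
}

console.log(expirationTime(1000, 60, 10)); // 1050
console.log(revalidate('"abc"', '"abc"', 'data').status); // 304
```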
  13. Caching: powering Web-scale

    Once set, leave the hard work to:
    • different easy to configure self-hosted caches: Apache, nginx, Varnish…
    • different cloud caches or content delivery networks (CDN)
  14. HTTP/2

    • Reuses open TCP connection for future HTTP requests
    • Can promise extra resources through a server push
    • No limit of 6-8 concurrent requests to a server
    • Can already be enabled on e.g., nginx
    • Already implemented in browsers
  15. Experiment

    1. Requesting an HTTP/1.1 server
        for (let i = 0; i < 50; i++)
          fetch('https://linked.open.gent/parking').then();
    2. Requesting an HTTP/2 server
        for (let i = 0; i < 50; i++)
          fetch('https://graph.irail.be/sncb/connections').then();
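
To compare both servers with a number instead of watching the network tab, the loops above could be wrapped in a timing helper. This is a minimal sketch: `timeRequests` is a hypothetical name, and the `fetchFn` parameter only exists so the helper can be tried with a stand-in for `fetch`.

```javascript
// Hypothetical helper (not from the slides): fires `count` requests at once
// and resolves to the elapsed time in milliseconds once all have answered.
async function timeRequests(url, count, fetchFn = fetch) {
  const start = Date.now();
  await Promise.all(Array.from({ length: count }, () => fetchFn(url)));
  return Date.now() - start;
}

// Usage in the browser console (results depend on network and server state):
// timeRequests('https://linked.open.gent/parking', 50).then(console.log);
// timeRequests('https://graph.irail.be/sncb/connections', 50).then(console.log);
```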
  16. Rule 2: publish your data on the Web

    But there are a lot of technical consequences you want to get right:
    1. Make sure your document has a permalink (HTTP URL)
    2. Enable Cross Origin Resource Sharing
    3. Use HTTPS
    4. Enable public HTTP caching
    5. Enable compression (gzip, deflate, br)
    6. Enable HTTP/2
  17. Serialisations

    Table / CSV / Spreadsheet:
        name       type     city  population
        StP-Plein  Parking  Gent  257k
    JSON:
        { "StP-Plein" : { "type" : "Parking", "city" : "Gent", "population" : "257k" } }
    XML:
        <StP-Plein> <type>Parking</type> <city>Gent</city> <population> 257k </population> </StP-Plein>
  18. Table / CSV / Spreadsheet:

        name       type     city  population
        StP-Plein  Parking  Gent  257k
    JSON:
        { "StP-Plein" : { "type" : "Parking", "city" : "Gent", "population" : "257k" } }
    XML:
        <StP-Plein> <type>Parking</type> <city>Gent</city> <population> 257k </population> </StP-Plein>
    Triples (3 times a datum):
        <StP-Plein> <type> <Parking> .
        <StP-Plein> <city> <Gent> .
        <Gent> <population> "257k" .
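
One way to see why triples are convenient: the three statements can be stored as plain arrays and queried with a simple pattern match. This is a minimal sketch with a hypothetical `match` helper, not an RDF library. Note that in the triples the population is attached to Gent, not to the parking site, which the table and JSON versions leave ambiguous.

```javascript
// The three triples from the slide as [subject, predicate, object] arrays
const triples = [
  ['StP-Plein', 'type', 'Parking'],
  ['StP-Plein', 'city', 'Gent'],
  ['Gent', 'population', '257k'],
];

// Hypothetical helper: returns all triples matching a pattern; null is a wildcard
function match(data, subject = null, predicate = null, object = null) {
  return data.filter(([s, p, o]) =>
    (subject === null || s === subject) &&
    (predicate === null || p === predicate) &&
    (object === null || o === object));
}

// Follow the link from the parking site to its city, then to the population
const [[, , city]] = match(triples, 'StP-Plein', 'city');
const [[, , population]] = match(triples, city, 'population');
console.log(city, population); // Gent 257k
```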
  19. Rule 4: make sure your data can be interpreted as triples using RDF*

    * Resource Description Framework https://www.w3.org/TR/rdf-primer/
  20. Thought experiment: decentralized publishing

    World Wide Web:
    • HTTP Machine 1: St-P Plein city Gent
    • HTTP Machine 2: St Pietersplein type Parking
    • HTTP Machine 3: Gent population 257k
    A user agent visiting each machine knows more than any of the machines independently
  21. Solution: Uniform Resource Identifiers (URIs)

    Sint Pietersplein → https://stad.gent/id/parking/P10
    is a → http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    Parking → http://vocab.datex.org/terms#UrbanParkingSite
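
With these three URIs, the statement "Sint Pietersplein is a Parking" can be written as one line in the N-Triples syntax. The helper name `toNTriple` is hypothetical, and the sketch only handles triples whose three terms are all URIs (no literals, no escaping).

```javascript
// Hypothetical helper: serializes a triple of three URIs as one N-Triples line
function toNTriple(subject, predicate, object) {
  return `<${subject}> <${predicate}> <${object}> .`;
}

const line = toNTriple(
  'https://stad.gent/id/parking/P10',
  'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
  'http://vocab.datex.org/terms#UrbanParkingSite'
);
console.log(line);
```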
  22. Which URIs?

    Learn from other datasets. Get started at e.g.:
    • https://data.vlaanderen.be
    • https://stad.gent/linked-data
    • https://lov.linkeddata.es
  23. Summary of our rules to raise interoperability

    • legal: Open licenses
    • technical: Publishing over HTTP
    • syntactic: RDF serializations
    • semantic: HTTP URIs allow discussing semantics globally; international, regional and local domain models
    ↓ Querying
  24. Summary of our rules to raise interoperability

    • legal: Open licenses
    • technical: Publishing over HTTP
    • syntactic: RDF serializations
    • semantic: HTTP URIs allow discussing semantics globally; international, regional and local domain models
    ↓ Querying: Web API design
  25. Trade-off in Web publishing

    An axis from data publishing (cheap/reliable) to data services (rather expensive/unreliable):
    • Data dumps (smart agents)
    • Dataset split in fragments
    • Entire query languages over HTTP, algorithms as a service (smart servers)
    Read more at http://linkeddatafragments.org
  26. Asking questions: data dump or route planning algorithms as a service?

    [Diagram: your system, a 3rd party, and the questions asked between them]
    • Does not scale: extra users come with extra load
    • Does not give the necessary flexibility to companies
  27. Publishing time schedules in fragments on the Web

    [Diagram: a time axis split into Page 1, Page 2, …, Page X, each page linking to the next]
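
On the publishing side, splitting a time-sorted schedule into linked pages could look like the sketch below. Everything here is an assumption for illustration: the helper name `paginate`, the `?page=N` URL scheme, and the `@id`, `items` and `next` property names are not taken from the slides.

```javascript
// Hypothetical helper: splits items (already sorted by time) into pages of
// `pageSize` items, where every page except the last links to the next one.
function paginate(items, pageSize, baseUrl) {
  const pages = [];
  for (let start = 0; start < items.length; start += pageSize) {
    pages.push({
      '@id': `${baseUrl}?page=${pages.length + 1}`,
      items: items.slice(start, start + pageSize),
    });
  }
  pages.forEach((page, index) => {
    if (index + 1 < pages.length) {
      page.next = pages[index + 1]['@id'];
    }
  });
  return pages;
}

const departures = ['08:00', '08:15', '08:30', '08:45', '09:00'];
console.log(paginate(departures, 2, 'https://example.org/connections'));
```

Each page is a static document, so it can be cached like any other HTTP response.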
  28. Exercise!

    Given 3 datasets with multiple pages, build 1 codebase showing an overview of all things (entities/resources) described in them
    Write it on paper, in pseudocode, or implement it in the language of your choice
    Feel free to cut corners with functions you can implement later
    3 datasets:
    • https://linked.open.gent/parking
    • https://graph.irail.be/sncb/connections
    • http://fragments.dbpedia.org/2016-04/en
  29. Quick convenience tool LDFetch

        $ npm install -g ldfetch
        $ ldfetch https://linked.open.gent/parking
        $ ldfetch --help # for more options
  30. Hints

    1. Async Iterators for incremental results ⇒ also a way to raise user-perceived performance
    2. Only hardcode the URLs of the 3 datasets
    3. Make an abstraction that detects building blocks in the dataset
       ◦ Document your “building blocks”
    4. Set an Accept header
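
The first hint can be sketched as an async generator that follows "next" links and yields items as soon as each page arrives. This is one possible starting point, not the reference solution: `crawl` and `fetchPage` are hypothetical names, and `fetchPage` is assumed to resolve to an object of the form `{ items: [...], next: url-or-undefined }`.

```javascript
// Hypothetical sketch: yields every item of every page, following `next` links.
// `fetchPage(url)` is assumed to resolve to { items: [...], next: string | undefined }.
async function* crawl(startUrl, fetchPage) {
  const seen = new Set(); // guards against pages that link in a cycle
  let url = startUrl;
  while (url && !seen.has(url)) {
    seen.add(url);
    const page = await fetchPage(url);
    for (const item of page.items) {
      yield item; // consumers can render this before the next page is fetched
    }
    url = page.next;
  }
}

// Usage with in-memory pages standing in for real HTTP responses
const demoPages = {
  'page-1': { items: ['a', 'b'], next: 'page-2' },
  'page-2': { items: ['c'], next: undefined },
};

(async () => {
  for await (const item of crawl('page-1', async (url) => demoPages[url])) {
    console.log(item);
  }
})();
```

A real `fetchPage` would call `fetch` with an Accept header (hint 4) and extract the items and the next link from whatever building blocks the dataset documents.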
  31. It doesn’t stop at pagination

    • DCAT for describing datasets in data catalogs
    • Triple Pattern Fragments: for solving basic graph patterns
    • The Tree Ontology: describe data in trees
    • Routable Tiles: describe geospatial data
    Demos at https://dexagod.github.io, http://query.linkeddatafragments.org, http://pieter.pm/demo-paper-routable-tiles/
  32. Exercise 2: Publish your own dataset

    Take a dataset from data.stad.gent and publish it properly
    Make sure your code also works with this dataset
  33. WebSockets

    Client subscribes, server pushes updates to clients
    For datasets that update within a short timespan
        // Create WebSocket connection.
        const socket = new WebSocket('ws://localhost:8080');
        // Connection opened
        socket.addEventListener('open', function (event) {
          socket.send('Hello Server!');
        });
        // Listen for messages
        socket.addEventListener('message', function (event) {
          console.log('Message from server ', event.data);
        });