An introduction to Open Data

How do you share data for maximum reuse? And when you’re a consumer, how do you consume as many datasets as possible in your application? With Open Data, we are creating a knowledge graph for humanity.

This knowledge graph is created by humans, so we need to agree on some conventions. In this presentation, you’ll learn how to use the HTTP protocol properly to publish data, how to add an open data license to your data, how to make your dataset linked, and you’ll get an introduction to scalable Web API design.


Pieter Colpaert

March 28, 2019

Transcript

  1. Open Data Sharing data for maximum reuse Consuming data on

    Web-Scale https://pietercolpaert.be/#me Ghent University – IMEC – IDLab Guest lecture 2019-03-28
  2. Open Data Depending on who’s asking, the main goal is

    to… Share data for maximum reuse Or to... Consume data on Web-Scale
  3. None
  4. Can you find data about the LEZ in Antwerp? Can

    you automate your workflow in a script? What would you suggest to the city of Antwerp?
  5. Examples of other datasets that must be open: Addresses (CRAB) •

    Accommodations with a permit (published by Toerisme Vlaanderen) • Company register (KBO) • Laws (Staatsblad) • Local decisions (cf. your city’s website) • Road registry, road works (GIPOD), road signs, traffic light status • Public transport timetables • Weather observations • Air quality observations • Water levels • Biodiversity statistics • Statistics used by the local, Flemish, Federal and EU governments and by the World Bank • Research results • Catalog of papers written by a university • Opening hours • The catalog of your supermarket • Clinical trials • Drug databases • Molecule databases • General knowledge: Wikipedia, Wikidata • Maps: OpenStreetMap
  6. We are building humanity’s knowledge graph

  7. Program Open Data theory: 1. The legal aspects: © and

    sui generis database right 2. The HTTP protocol: focus on public read access 3. RDF and why serializations shouldn’t matter 4. How global identifiers allow discussing semantics on Web-Scale 5. Public Web API design Exercises: 1. Solving a question over 3 Linked Open Datasets with 1 piece of code 2. Publishing a 4th dataset
  8. 2 most important laws for data: 1. Copyright on the

    container 2. Sui generis database rights (only when you invested “substantially” in creating the database) Both are incredibly vague regarding data (read up on Text and Data Mining)
  9. OpenDefinition.org Interested in the full story? https://pietercolpaert.be/opendata/2017/02/23/cc0

  10. Give users the legal certainty they need https://github.com/iRail/stations

  11. Rule 1: document the data’s license

  12. Rule 2: publish your data on the Web

  13. Keep it simple An HTTP URL per document Example: data.vlaanderen.be/dumps

  14. Cross Origin Resource Sharing: the problem Step 1: Open a

    website (e.g., https://ruben.verborgh.org) Step 2: Open the JavaScript console (shortcut: F12) Step 3: Execute the following code: fetch('https://gmail.com').then(async response => { console.log(await response.text()); }); Can you explain what happens? https://github.com/solid/web-access-control-spec/blob/master/Background.md
  15. Cross Origin Resource Sharing: the solution? Respond to HTTP GET

    requests with: Allow all origins: Access-Control-Allow-Origin: * Also tell the browser which headers it can show to the client through: Access-Control-Expose-Headers: Content-Type, Link Respond to HTTP OPTIONS (preflight) requests and allow common headers: Access-Control-Allow-Headers: Accept
  16. Cross Origin Resource Sharing: the solution? Step 1: Open a

    website (e.g., https://ruben.verborgh.org) Step 2: Open the JavaScript console (shortcut: F12) Step 3: Execute the following code: fetch('https://pietercolpaert.be').then(async response => { console.log(await response.text()); }); Can you explain what happens? We’re trying to find a better solution: “Proposal: Allow servers to take full responsibility for cross-origin access protection” https://github.com/whatwg/fetch/issues/878
  17. To HTTPS or not to HTTPS? For Open Data, we

    can maybe trust certain intermediate HTTP caches? Interesting idea: neighbourhood HTTP caches between peers on a train (Pauline Folz, Hala Skaf-Molli, Pascal Molli. CyCLaDEs: A Decentralized Cache for Triple Pattern Fragments. ESWC 2016: Extended Semantic Web Conference, May 2016, Heraklion, Greece. 〈hal-01251654v2〉) The trade-off: with HTTP, a response can be cached by an intermediate third-party HTTP proxy, but you don’t know whether an intermediary changed the response; with HTTPS, intermediaries cannot cache, but you know the response was not tampered with.
  18. To HTTPS or not to HTTPS? But, does it matter

    what we think? Step 1: go to an HTTPS website (e.g., https://ruben.verborgh.org) Step 2: Execute in the console (F12): fetch('http://pietercolpaert.be').then(async response => { console.log(await response.text()); }); Step 3: An error appears: the CORS request is unsafe. Conclusion: when we want to get our data used as much as possible, the best strategy is to offer HTTPS
  19. Caching: powering Web-scale 1. Caching based on expiration: Response headers:

    Cache-Control: max-age=X (in seconds) Age: Y 2. Caching based on the ETag header: Response header: ETag: <unique hash> Request header: If-None-Match Status code: 304 Not Modified In both cases: set the Vary header! ⇒ defines which other headers can change the cache key (e.g., the Accept headers) Full story: https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching expirationTime = responseTime + X - Y
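
The slide’s expiration formula can be sketched as a small function (the function names are mine, not from the slides): a cached response is fresh while responseTime + max-age − Age is still in the future.

```javascript
// Freshness check for expiration-based caching (all times in seconds).
// The slide's formula: expirationTime = responseTime + X - Y,
// where X comes from Cache-Control: max-age=X and Y from the Age header.
function expirationTime(responseTime, maxAge, age) {
  return responseTime + maxAge - age;
}

function isFresh(responseTime, maxAge, age, now) {
  return now < expirationTime(responseTime, maxAge, age);
}

// A response fetched at t=1000 with Cache-Control: max-age=60 and Age: 10
// may be served from cache until t=1050.
console.log(isFresh(1000, 60, 10, 1040)); // true: still fresh
console.log(isFresh(1000, 60, 10, 1060)); // false: revalidate (e.g., with If-None-Match)
```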
  20. Once set up, leave the hard work to: • various

    easy-to-configure self-hosted caches: Apache, nginx, Varnish… • various cloud caches or content delivery networks (CDNs) Caching: powering Web-scale
  21. HTTP/2 Reuses an open TCP connection for future HTTP requests Can

    promise extra resources through a server push No limit of 6–8 concurrent requests per server Can already be enabled on e.g., nginx Already implemented in browsers
  22. Experiment 1. Requesting an HTTP/1.1 server for (let i = 0;

    i < 50; i++) fetch('https://linked.open.gent/parking'); 2. Requesting an HTTP/2 server for (let i = 0; i < 50; i++) fetch('https://graph.irail.be/sncb/connections');
  23. HTTP/1.1 HTTP/2

  24. Rule 2: publish your data on the Web But there are a

    lot of technical details you want to get right: 1. Make sure your document has a permalink (HTTP URL) 2. Enable Cross Origin Resource Sharing 3. Use HTTPS 4. Enable HTTP caching 5. Enable compression (gzip, deflate, br) 6. Enable HTTP/2
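
As a sketch, the checklist above might look like this in an nginx server block (the hostname, certificate paths, and max-age value are placeholders, not from the slides):

```nginx
server {
    listen 443 ssl http2;                  # 3. HTTPS + 6. HTTP/2
    server_name data.example.org;          # placeholder hostname

    ssl_certificate     /etc/ssl/example.pem;    # placeholder paths
    ssl_certificate_key /etc/ssl/example.key;

    gzip on;                               # 5. compression

    location /dumps/ {
        add_header Access-Control-Allow-Origin *;          # 2. CORS
        add_header Cache-Control "public, max-age=300";    # 4. HTTP caching
        # 1. permalinks: one stable HTTP URL per document under /dumps/
    }
}
```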
  25. Rule 3: use an open format

  26. Serialisations Table / CSV / Spreadsheet: name | type | city | population

    → StP-Plein | Parking | Gent | 257k JSON: { "StP-Plein" : { "type" : "Parking", "city" : "Gent", "population" : "257k" } } XML: <StP-Plein> <type>Parking</type> <city>Gent</city> <population>257k</population> </StP-Plein>
  27. Table / CSV / Spreadsheet: name | type | city | population

    → StP-Plein | Parking | Gent | 257k JSON: { "StP-Plein" : { "type" : "Parking", "city" : "Gent", "population" : "257k" } } XML: <StP-Plein> <type>Parking</type> <city>Gent</city> <population>257k</population> </StP-Plein> Triples (3 times a datum): <StP-Plein> <type> <Parking> . <StP-Plein> <city> <Gent> . <Gent> <population> "257k" .
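
A naive sketch of how the nested JSON serialization above flattens into subject–predicate–object triples (the function name is mine; a real RDF mapping would use URIs and could attach population to Gent, as the slide’s triples do):

```javascript
// Flatten a { subject: { predicate: value, ... } } object into triples.
function toTriples(obj) {
  const triples = [];
  for (const [subject, properties] of Object.entries(obj)) {
    for (const [predicate, object] of Object.entries(properties)) {
      triples.push({ subject, predicate, object }); // one datum = one triple
    }
  }
  return triples;
}

const data = { 'StP-Plein': { type: 'Parking', city: 'Gent', population: '257k' } };
console.log(toTriples(data)); // three triples, each with 3 parts
```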
  28. Rule 4: make sure your data can be interpreted as

    triples using RDF* * Resource Description Framework https://www.w3.org/TR/rdf-primer/
  29. Thought experiment: decentralized publishing on the World Wide Web Three

    triples spread over three HTTP machines: <St-Pietersplein> <city> <Gent> (machine 1), <St-Pietersplein> <type> <Parking> (machine 2), <Gent> <population> "257k" (machine 3) A user agent visiting each machine knows more than any of the machines independently
  30. Problem Sint-Pietersplein is a Parking Site ?

  32. Solution Sint Pietersplein → https://stad.gent/id/parking/P10 is a → http://www.w3.org/1999/02/22-rdf-syntax-ns#type Parking

    → http://vocab.datex.org/terms#UrbanParkingSite Uniform Resource Identifiers (URIs)
  33. Rule 5: (re)use global identifiers

  34. Which URIs? Learn from other datasets Get started at e.g.:

    https://data.vlaanderen.be https://stad.gent/linked-data https://lov.linkeddata.es
  35. Remember our Rule 1?

  36. Rule 1: document the data’s license

  37. Document this with a triple: <yourdocumenturl> dcterms:license <https://creativecommons.org/publicdomain/zero/1.0/> .

  38. Summary of our rules to raise interoperability Legal: open licenses

    Technical: publishing over HTTP Syntactic: RDF serializations Semantic: HTTP URIs allow discussing semantics globally; international, regional and local domain models On top: querying
  39. Summary of our rules to raise interoperability Legal: open licenses

    Technical: publishing over HTTP Syntactic: RDF serializations Semantic: HTTP URIs allow discussing semantics globally; international, regional and local domain models On top, querying: Web API design
  40. What happens when your document grows too large?

  41. Trade-off in Web publishing Data dumps and datasets split in

    fragments = data publishing (cheap/reliable), with smart agents on the client; smart servers offering entire query languages over HTTP = data services (rather expensive/unreliable), i.e. algorithms as a service Read more at http://linkeddatafragments.org
  42. http://api.{mycompany}/?from={A}&to={B} &departuretime=2019-03-27T12:02.024Z &wheelchairaccessible=true &transit_modes=plane,railway,bus,car &algorithm_mode=shortest ... Yet this interface will

    need to answer all questions for all third-party apps…
  43. data dump Route planning algorithms as a service Asking questions

    Your system 3rd party Your system Does not scale: extra users come with extra load Does not give the necessary flexibility to companies
  44. Publishing time schedules in fragments on the Web: Page 1

    → next → Page 2 → next → … → Page X (ordered by time)
  45. Try it yourself at https://linkedconnections.org and https://planner.js.org

  46. Exercise! Given 3 datasets with multiple pages, build 1 codebase

    showing an overview of all things (entities/resources) described in them Write it on paper, in pseudocode, or implement it in the language of your choice Feel free to cut corners with functions you can implement later 3 datasets: • https://linked.open.gent/parking • https://graph.irail.be/sncb/connections • http://fragments.dbpedia.org/2016-04/en
  47. Quick convenience tool LDFetch $ npm install -g ldfetch $

    ldfetch https://linked.open.gent/parking $ ldfetch --help # for more options
  48. Hints 1. Async iterators for incremental results ⇒ also a way

    to raise user-perceived performance 2. Only hardcode the URLs of the 3 datasets 3. Make an abstraction that detects building blocks in a dataset ◦ Document your “building blocks” 4. Set an Accept header
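
Hint 1 could be sketched like this: an async generator that yields entities page by page while following next links. Here getPage and the page shape are stand-ins I invented for the sketch; real code would fetch(url) with an Accept header and parse the RDF response.

```javascript
// Stand-in dataset: two pages linked by a "next" field.
const pages = {
  'page1': { items: ['entity-a', 'entity-b'], next: 'page2' },
  'page2': { items: ['entity-c'], next: null }
};

// Stand-in for fetch(url, { headers: { Accept: 'application/ld+json' } }).
async function getPage(url) {
  return pages[url];
}

// Async generator: consumers receive entities incrementally, page by page,
// which also raises user-perceived performance.
async function* allEntities(startUrl) {
  let url = startUrl;
  while (url) {
    const page = await getPage(url);
    yield* page.items;   // emit this page's entities immediately
    url = page.next;     // then follow the pagination link
  }
}

(async () => {
  for await (const entity of allEntities('page1')) {
    console.log(entity); // prints entity-a, entity-b, entity-c
  }
})();
```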
  49. It doesn’t stop at pagination DCAT: for describing datasets in

    data catalogs Triple Pattern Fragments: for solving basic graph patterns The Tree Ontology: for describing data in trees Routable Tiles: for describing geospatial data Demos at https://dexagod.github.io, http://query.linkeddatafragments.org, http://pieter.pm/demo-paper-routable-tiles/
  50. Exercise 2: Publish your own dataset Take a dataset from

    data.stad.gent and publish it properly Make sure your code also works with this dataset
  51. Contact details at https://pietercolpaert.be/#me
  52. Final thoughts

  53. Not to be underestimated: the organizational challenges

  54. Extending these ideas to “shared” datasets: the Solid protocols

  55. Solid applies the same ideas to shared data https://solid.mit.edu/

  56. WebSockets Client subscribes, server pushes updates to clients For datasets

    that update within a short timespan:
    // Create WebSocket connection.
    const socket = new WebSocket('ws://localhost:8080');
    // Connection opened
    socket.addEventListener('open', function (event) {
      socket.send('Hello Server!');
    });
    // Listen for messages
    socket.addEventListener('message', function (event) {
      console.log('Message from server ', event.data);
    });
  57. Should data streams always be push-based? Trade-off: number of clients

    versus CPU / response time, for polling and for pubsub
  58. July 2019: Student job Building route planners Linked Data experience

    needed