An introduction to Open Data

Slide 1

Slide 1 text

Open Data
Sharing data for maximum reuse
Consuming data on Web-Scale
https://pietercolpaert.be/#me
Ghent University – IMEC – IDLab
Guest lecture 2019-03-28

Slide 2

Slide 2 text

Open Data
Depending on who’s asking, the main goal is to…
Share data for maximum reuse
Or to…
Consume data on Web-Scale

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Can you find data about the LEZ in Antwerp?
Can you automate your workflow in a script?
What would you suggest to the city of Antwerp?

Slide 5

Slide 5 text

Examples of other datasets that must be open
● Addresses (CRAB)
● Accommodations with a permit (published by Toerisme Vlaanderen)
● Company register (KBO)
● Laws (Staatsblad)
● Local decisions (cfr. your city’s website)
● Road registry, road works (GIPOD), road signs, traffic light status
● Public transport timetables
● Weather observations
● Air quality observations
● Water levels
● Biodiversity statistics
● Statistics used by the local, Flemish, Federal and EU governments and by the World Bank
● Research results
● Catalog of papers written by a university
● Opening hours
● The catalog of your supermarket
● Clinical trials
● Drug database
● Molecule databases
● General knowledge: Wikipedia, Wikidata
● Maps: OpenStreetMap

Slide 6

Slide 6 text

We are building humanity’s knowledge graph

Slide 7

Slide 7 text

Program
Open Data theory:
1. The legal aspects: © and sui generis database right
2. The HTTP protocol: focus on public read access
3. RDF and why serializations shouldn’t matter
4. How global identifiers allow discussing semantics on Web-Scale
5. Public Web API design
Exercises:
1. Solving a question over 3 Linked Open Datasets with 1 piece of code
2. Publishing a 4th dataset

Slide 8

Slide 8 text

2 most important laws for data:
1. Copyright on the container
2. Sui generis database rights (only when you invested “substantially” in creating the database)
Both are incredibly vague regarding data (read up on Text and Data Mining)

Slide 9

Slide 9 text

OpenDefinition.org
Interested in the full story? https://pietercolpaert.be/open data/2017/02/23/cc0

Slide 10

Slide 10 text

Give users the legal certainty they need https://github.com/iRail/stations

Slide 11

Slide 11 text

Rule 1: document the data’s license

Slide 12

Slide 12 text

Rule 2: publish your data on the Web

Slide 13

Slide 13 text

Keep it simple
An HTTP URL per document
Example: data.vlaanderen.be/dumps

Slide 14

Slide 14 text

Cross Origin Resource Sharing: the problem
Step 1: Open a website (e.g., https://ruben.verborgh.org)
Step 2: Open the JavaScript console (shortcut: F12)
Step 3: Execute the following code:
    fetch('https://gmail.com').then(async response => {
      console.log(await response.text());
    });
Can you explain what happens?
https://github.com/solid/web-access-control-spec/blob/master/Background.md

Slide 15

Slide 15 text

Cross Origin Resource Sharing: the solution?
Respond to HTTP GET requests with:
Allow all origins:
    Access-Control-Allow-Origin: *
Also tell the browser which headers it can show to the client through:
    Access-Control-Expose-Headers: Content-Type, Link
Respond to HTTP OPTIONS (preflight) requests and allow common headers:
    Access-Control-Allow-Headers: Accept
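A minimal sketch of the header set above, written as a pure function so it can be plugged into any HTTP server. The helper name `corsResponse` and the choice of status 204 for the preflight answer are assumptions for illustration, not something the slides prescribe.

```javascript
// Returns the status code and CORS headers for a public, read-only resource.
// `corsResponse` is a hypothetical helper, not part of any library.
function corsResponse(method) {
  const headers = {
    // Allow all origins: fine for Open Data with public read access.
    'Access-Control-Allow-Origin': '*',
    // Let client-side scripts read these response headers.
    'Access-Control-Expose-Headers': 'Content-Type, Link',
  };
  if (method === 'OPTIONS') {
    // Preflight: tell the browser which request headers are allowed.
    headers['Access-Control-Allow-Headers'] = 'Accept';
    return { status: 204, headers }; // assumption: empty preflight body
  }
  return { status: 200, headers };
}

console.log(corsResponse('GET'));
console.log(corsResponse('OPTIONS'));
```

In a real server, the returned headers would be copied onto every response, next to the usual Content-Type.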

Slide 16

Slide 16 text

Cross Origin Resource Sharing: the solution?
Step 1: Open a website (e.g., https://ruben.verborgh.org)
Step 2: Open the JavaScript console (shortcut: F12)
Step 3: Execute the following code:
    fetch('https://pietercolpaert.be').then(async response => {
      console.log(await response.text());
    });
Can you explain what happens?
We’re trying to find a better solution: “Proposal: Allow servers to take full responsibility for cross-origin access protection” https://github.com/whatwg/fetch/issues/878

Slide 17

Slide 17 text

To HTTPS or not to HTTPS?
                                                                            HTTP   HTTPS
You can cache a response in an intermediate third party HTTP proxy          yes    no
You know an intermediary did not change the response                        no     yes
For Open Data, we can maybe trust certain intermediate HTTP caches?
Interesting idea: neighbourhood HTTP caches between peers on a train
Pauline Folz, Hala Skaf-Molli, Pascal Molli. CyCLaDEs: A Decentralized Cache for Triple Pattern Fragments. ESWC 2016: Extended Semantic Web Conference, May 2016, Heraklion, Greece. 〈hal-01251654v2〉

Slide 18

Slide 18 text

To HTTPS or not to HTTPS?
But, does it matter what we think?
Step 1: Go to an HTTPS website (e.g., https://ruben.verborgh.org)
Step 2: Execute in the console (F12):
    fetch('http://pietercolpaert.be').then(async response => {
      console.log(await response.text());
    });
Step 3: An error appears: the request is considered unsafe and is blocked, because an HTTPS page is requesting plain HTTP content (mixed content).
Conclusion: if we want our data to be used as much as possible, the best strategy is to offer HTTPS

Slide 19

Slide 19 text

Caching: powering Web-scale
1. Caching based on expiration:
   Response headers: Cache-Control: max-age=X (in seconds) and Age: Y
   expirationTime = responseTime + X - Y
2. Caching based on the ETag header:
   Response header: ETag
   Request header: If-None-Match
   Status code: 304 Not Modified
In both cases: set the Vary header! ⇒ defines which other headers can change the cache key (e.g., the Accept headers)
Full story: https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching
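A small worked sketch of both mechanisms, under simplifying assumptions: times are plain seconds, and the ETag check compares a single value exactly (real If-None-Match headers may carry a list and weak validators). The function names are illustrative only.

```javascript
// Expiration-based caching: expirationTime = responseTime + X - Y,
// with X the max-age and Y the Age header, all in seconds.
function expirationTime(responseTime, maxAge, age = 0) {
  return responseTime + maxAge - age;
}

// A cached response may be reused as long as it has not expired.
function isFresh(now, responseTime, maxAge, age = 0) {
  return now < expirationTime(responseTime, maxAge, age);
}

// ETag-based caching, server side (simplified): answer 304 Not Modified
// when the client's If-None-Match equals the current ETag, else 200.
function statusFor(currentEtag, ifNoneMatch) {
  return ifNoneMatch === currentEtag ? 304 : 200;
}

// Received at t=1000 with max-age=60 and Age=20: reusable until t=1040.
console.log(expirationTime(1000, 60, 20));
console.log(isFresh(1030, 1000, 60, 20), isFresh(1050, 1000, 60, 20));
console.log(statusFor('"v1"', '"v1"'), statusFor('"v2"', '"v1"'));
```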

Slide 20

Slide 20 text

Caching: powering Web-scale
Once set, leave the hard work to:
● different easy to configure self-hosted caches: Apache, nginx, Varnish…
● different cloud caches or content delivery networks (CDN)

Slide 21

Slide 21 text

HTTP/2
● Reuses an open TCP connection for future HTTP requests
● Can promise extra resources through a server push
● No limit of 6-8 concurrent requests to a server
● Can already be enabled on e.g., nginx
● Already implemented in browsers

Slide 22

Slide 22 text

Experiment
1. Requesting an HTTP/1.1 server
    for (let i = 0; i < 50; i++) fetch('https://linked.open.gent/parking').then();
2. Requesting an HTTP/2 server
    for (let i = 0; i < 50; i++) fetch('https://graph.irail.be/sncb/connections').then();
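To put a number on the experiment instead of only watching the network tab, the loop can be wrapped in a timer. This is a rough sketch: `timeRequests` is a hypothetical helper, and the measurement stops when the response headers of all requests have arrived, not when the bodies are fully downloaded.

```javascript
// Fires `count` parallel requests and resolves with the elapsed milliseconds.
// `fetchFn` defaults to the browser's fetch; it is a parameter so the helper
// can also be tried out without touching the network.
async function timeRequests(url, count, fetchFn = fetch) {
  const start = Date.now();
  const requests = [];
  for (let i = 0; i < count; i++) {
    requests.push(fetchFn(url));
  }
  // fetch() resolves once the response headers are in.
  await Promise.all(requests);
  return Date.now() - start;
}

// Usage in the browser console:
// timeRequests('https://linked.open.gent/parking', 50).then(ms => console.log('HTTP/1.1', ms));
// timeRequests('https://graph.irail.be/sncb/connections', 50).then(ms => console.log('HTTP/2', ms));
```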

Slide 23

Slide 23 text

HTTP/1.1 HTTP/2

Slide 24

Slide 24 text

Rule 2: publish your data on the Web
But there are a lot of technical consequences you want to get right:
1. Make sure your document has a permalink (HTTP URL)
2. Enable Cross Origin Resource Sharing
3. Use HTTPS
4. Enable public HTTP caching
5. Enable compression (gzip, deflate, br)
6. Enable HTTP/2

Slide 25

Slide 25 text

Rule 3: use an open format

Slide 26

Slide 26 text

Serialisations
Table / CSV / Spreadsheet:
    name       type     city  population
    StP-Plein  Parking  Gent  257k
JSON:
    { "StP-Plein" : { "type" : "Parking", "city" : "Gent", "population" : "257k" } }
XML:
    Parking Gent 257k

Slide 27

Slide 27 text

Triples: 3 times a datum
Table / CSV / Spreadsheet:
    name       type     city  population
    StP-Plein  Parking  Gent  257k
JSON:
    { "StP-Plein" : { "type" : "Parking", "city" : "Gent", "population" : "257k" } }
XML:
    Parking Gent 257k
Triples:
    StP-Plein type Parking .
    StP-Plein city Gent .
    Gent population "257k" .
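As a rough illustration of reading the JSON example as triples, the sketch below flattens a two-level object into subject, predicate, object statements. `toTriples` is a hypothetical helper and only handles this simple nesting.

```javascript
// Flattens { subject: { predicate: object, ... } } into a list of triples.
function toTriples(data) {
  const triples = [];
  for (const [subject, properties] of Object.entries(data)) {
    for (const [predicate, object] of Object.entries(properties)) {
      triples.push([subject, predicate, object]);
    }
  }
  return triples;
}

const parking = {
  'StP-Plein': { type: 'Parking', city: 'Gent', population: '257k' },
};

// One line per triple, each closed with a dot.
for (const triple of toTriples(parking)) {
  console.log(triple.join(' ') + ' .');
}
```

Note that this naive reading attaches the population to StP-Plein, because the JSON nesting hides what the number describes. With triples, the statement can be made about Gent itself.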

Slide 28

Slide 28 text

Rule 4: make sure your data can be interpreted as triples using RDF*
* Resource Description Framework https://www.w3.org/TR/rdf-primer/

Slide 29

Slide 29 text

Thought experiment: decentralized publishing
On the World Wide Web, HTTP Machine 1, HTTP Machine 2 and HTTP Machine 3 each publish a part of the data:
    St-P Plein city Gent
    St Pietersplein type Parking
    Gent population 257k
A user agent visiting each machine knows more than any of the machines independently

Slide 30

Slide 30 text

Problem Sint-Pietersplein is a Parking Site ?

Slide 31

Slide 31 text

Problem Sint-Pietersplein is a Parking Site ?

Slide 32

Slide 32 text

Solution
Uniform Resource Identifiers (URIs)
Sint Pietersplein → https://stad.gent/id/parking/P10
is a → http://www.w3.org/1999/02/22-rdf-syntax-ns#type
Parking → http://vocab.datex.org/terms#UrbanParkingSite
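With the three URIs above, the statement can be written down as one line of N-Triples. A minimal sketch, assuming all three terms are URIs (literals such as "257k" would need quoting instead of angle brackets); `toNTriple` is a hypothetical helper.

```javascript
// Wraps a URI in angle brackets, as N-Triples requires.
function iri(value) {
  return `<${value}>`;
}

// Serialises one subject-predicate-object statement made of URIs only.
function toNTriple(subject, predicate, object) {
  return `${iri(subject)} ${iri(predicate)} ${iri(object)} .`;
}

console.log(toNTriple(
  'https://stad.gent/id/parking/P10',
  'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
  'http://vocab.datex.org/terms#UrbanParkingSite'
));
```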

Slide 33

Slide 33 text

Rule 5: (re)use global identifiers

Slide 34

Slide 34 text

Which URIs?
Learn from other datasets
Get started at e.g.:
● https://data.vlaanderen.be
● https://stad.gent/linked-data
● https://lov.linkeddata.es

Slide 35

Slide 35 text

Remember our Rule 1?

Slide 36

Slide 36 text

Rule 1: document the data’s license

Slide 37

Slide 37 text

Document this with a triple: dcterms:license .
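One possible way to write such a license statement is as a small JSON-LD object. This is a sketch under assumptions: the dataset URI `https://example.org/dataset` is a placeholder, and the CC0 URI is picked only as an example of an open license; `dcterms:license` is spelled out as its full URI.

```javascript
// Full URI behind the dcterms:license shorthand.
const DCTERMS_LICENSE = 'http://purl.org/dc/terms/license';

// Builds a JSON-LD node stating: <dataset> dcterms:license <license> .
// `withLicense` is a hypothetical helper name.
function withLicense(datasetUri, licenseUri) {
  return {
    '@id': datasetUri,
    [DCTERMS_LICENSE]: { '@id': licenseUri },
  };
}

const description = withLicense(
  'https://example.org/dataset', // placeholder dataset URI
  'https://creativecommons.org/publicdomain/zero/1.0/' // CC0, as an example
);
console.log(JSON.stringify(description, null, 2));
```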

Slide 38

Slide 38 text

Summary of our rules to raise interoperability
legal: Open licenses
technical: Publishing over HTTP
syntactic: RDF serializations
semantic: HTTP URIs allow discussing semantics globally; international, regional and local domain models
↓ Querying

Slide 39

Slide 39 text

Summary of our rules to raise interoperability
legal: Open licenses
technical: Publishing over HTTP
syntactic: RDF serializations
semantic: HTTP URIs allow discussing semantics globally; international, regional and local domain models
↓ Querying: Web API design

Slide 40

Slide 40 text

What happens when your document grows too large?

Slide 41

Slide 41 text

Trade-off in Web publishing
A spectrum between two extremes:
● Data dumps: data publishing (cheap/reliable), smart agents
● Dataset split in fragments
● Smart servers: data services (rather expensive/unreliable), entire query languages over HTTP, algorithms as a service
Read more at http://linkeddatafragments.org

Slide 42

Slide 42 text

http://api.{mycompany}/?from={A}&to={B}
    &departuretime=2019-03-27T12:02.024Z
    &wheelchairaccessible=true
    &transit_modes=plane,railway,bus,car
    &algoritm_mode=shortest
    ...
Yet this interface will need to answer all questions for all third party apps…

Slide 43

Slide 43 text

data dump / Route planning algorithms as a service
Asking questions: 3rd party → Your system ? ? ? ? ? ?
Does not scale: extra users come with extra load
Does not give the necessary flexibility to companies

Slide 44

Slide 44 text

Publishing time schedules in fragments on the Web
Page 1 → next → Page 2 → next → Page ... → Page X (pages ordered by time)

Slide 45

Slide 45 text

Try it yourselves at https://linkedconnections.org https://planner.js.org

Slide 46

Slide 46 text

Exercise!
Given 3 datasets with multiple pages, build 1 codebase showing an overview of all things (entities/resources) described in them
Write it on paper, in pseudocode, or implement it in the language of your choice
Feel free to cut corners with functions you can implement later
3 datasets:
● https://linked.open.gent/parking
● https://graph.irail.be/sncb/connections
● http://fragments.dbpedia.org/2016-04/en

Slide 47

Slide 47 text

Quick convenience tool: LDFetch
    $ npm install -g ldfetch
    $ ldfetch https://linked.open.gent/parking
    $ ldfetch --help # for more options

Slide 48

Slide 48 text

Hints
1. Async Iterators for incremental results ⇒ also a way to raise user-perceived performance
2. Only hardcode the URLs of the 3 datasets
3. Make an abstraction that detects building blocks in a dataset
   ○ Document your “building blocks”
4. Set an Accept header
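The hints above can be sketched as an async iterator that walks through a paged dataset. This is a starting point, not a solution: it assumes each page is JSON-LD with a `hydra:next` link (as a string or as an object with `@id`), which you should check per dataset, and the names `fetchJsonLd` and `pages` are made up for this sketch.

```javascript
// Hint 4: ask for a specific serialisation with an Accept header.
async function fetchJsonLd(url) {
  const response = await fetch(url, { headers: { Accept: 'application/ld+json' } });
  return response.json();
}

// Hint 1: yield pages one by one, so results can be shown incrementally.
// `maxPages` is a safety limit for datasets with a very long chain of pages.
async function* pages(startUrl, fetchPage = fetchJsonLd, maxPages = 10) {
  let url = startUrl;
  for (let i = 0; url && i < maxPages; i++) {
    const page = await fetchPage(url);
    yield page;
    // Assumption: the link to the following page is called hydra:next.
    const next = page['hydra:next'];
    url = next && typeof next === 'object' ? next['@id'] : next;
  }
}

// Usage (hint 2: only the dataset URL is hardcoded):
// for await (const page of pages('https://graph.irail.be/sncb/connections')) {
//   console.log(page['@id']);
// }
```

Passing the page fetcher as a parameter keeps the pagination logic separate from HTTP, which also makes it easy to try the iterator on in-memory test pages.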

Slide 49

Slide 49 text

It doesn’t stop at pagination
● DCAT for describing datasets in data catalogs
● Triple Pattern Fragments: for solving basic graph patterns
● The Tree Ontology: describe data in trees
● Routable Tiles: describe geospatial data
Demos at https://dexagod.github.io, http://query.linkeddatafragments.org, http://pieter.pm/demo-paper-routable-tiles/

Slide 50

Slide 50 text

Exercise 2: Publish your own dataset
Take a dataset from data.stad.gent and publish it properly
Make sure your code also works with this dataset

Slide 51

Slide 51 text

Contact details at https://pietercolpaert.be/#me
Interoperability layers: legal, technical, syntactic, semantic, ↓ Querying; data dump

Slide 52

Slide 52 text

Final thoughts

Slide 53

Slide 53 text

Not to be underestimated: the organizational challenges

Slide 54

Slide 54 text

Extending these ideas to “shared” datasets SOLID protocols

Slide 55

Slide 55 text

SOLID applies the same ideas to shared data https://solid.mit.edu/

Slide 56

Slide 56 text

WebSockets
Client subscribes, server pushes updates to clients
For datasets that update within a short timespan
    // Create WebSocket connection.
    const socket = new WebSocket('ws://localhost:8080');
    // Connection opened
    socket.addEventListener('open', function (event) {
      socket.send('Hello Server!');
    });
    // Listen for messages
    socket.addEventListener('message', function (event) {
      console.log('Message from server ', event.data);
    });

Slide 57

Slide 57 text

Should data streams always be push-based?
[Figure: CPU / response time against the number of clients, for polling and pubsub]

Slide 58

Slide 58 text

July 2019: Student job
Building route planners
Linked Data experience needed