An introduction to Open Data

How do you share data for maximum reuse? And when you’re a consumer, how do you consume as many datasets as possible in your application? With Open Data, we are creating a knowledge graph for humanity.

This knowledge graph is created by humans, so we need to agree on some conventions. In this presentation, you’ll learn how to use the HTTP protocol properly to publish data, how to add an open data license to your data, how to make your dataset linked, and you’ll get an introduction to scalable Web API design.


Pieter Colpaert

March 28, 2019

Transcript

  1. Open Data Sharing data for maximum reuse Consuming data on

    Web-Scale https://pietercolpaert.be/#me Ghent University – IMEC – IDLab Guest lecture 2019-03-28
  2. Open Data Depending on who’s asking, the main goal is

    to… Share data for maximum reuse Or to... Consume data on Web-Scale
  3. None
  4. Can you find data about the LEZ in Antwerp? Can

    you automate your workflow in a script? What would you suggest to the city of Antwerp?
  5. Examples of other datasets that must be open: Addresses (CRAB) •

    Accommodations with a permit (published by Toerisme Vlaanderen) • Company register (KBO) • Laws (Staatsblad) • Local decisions (cf. your city’s website) • Road registry, road works (GIPOD), road signs, traffic light status • Public transport timetables • Weather observations • Air quality observations • Water levels • Biodiversity statistics • Statistics used by the local, Flemish, Federal and EU governments and by the World Bank • Research results • Catalog of papers written by a university • Opening hours • The catalog of your supermarket • Clinical trials • Drug databases • Molecule databases • General knowledge: Wikipedia, Wikidata • Maps: OpenStreetMap
  6. We are building humanity’s knowledge graph

  7. Program Open Data theory: 1. The legal aspects: © and

    sui generis database right 2. The HTTP protocol: focus on public read access 3. RDF and why serializations shouldn’t matter 4. How global identifiers allow discussing semantics on Web-Scale 5. Public Web API design Exercises: 1. Solving a question over 3 Linked Open Datasets with 1 piece of code 2. Publishing a 4th dataset
  8. 2 most important laws for data: 1. Copyright on the

    container 2. Sui generis database rights (only when you invested “substantially” in creating the database) Both are incredibly vague regarding data (read up on Text and Data Mining)
  9. OpenDefinition.org Interested in the full story? https://pietercolpaert.be/opendata/2017/02/23/cc0

  10. Give users the legal certainty they need https://github.com/iRail/stations

  11. Rule 1: document the data’s license

  12. Rule 2: publish your data on the Web

  13. Keep it simple An HTTP URL per document Example: data.vlaanderen.be/dumps

  14. Cross Origin Resource Sharing: the problem Step 1: Open a

    website (e.g., https://ruben.verborgh.org) Step 2: Open the JavaScript console (shortcut: F12) Step 3: Execute the following code: fetch('https://gmail.com').then(async response => { console.log(await response.text()); }); Can you explain what happens? https://github.com/solid/web-access-control-spec/blob/master/Background.md
  15. Cross Origin Resource Sharing: the solution? Respond to HTTP GET

    requests with: Allow all origins: Access-Control-Allow-Origin: * Also tell the browser which headers it can show to the client through: Access-Control-Expose-Headers: Content-Type, Link Respond to HTTP OPTIONS (preflight) requests and allow common headers: Access-Control-Allow-Headers: Accept
  16. Cross Origin Resource Sharing: the solution? Step 1: Open a

    website (e.g., https://ruben.verborgh.org) Step 2: Open the JavaScript console (shortcut: F12) Step 3: Execute the following code: fetch('https://pietercolpaert.be').then(async response => { console.log(await response.text()); }); Can you explain what happens? We’re trying to find a better solution: “Proposal: Allow servers to take full responsibility for cross-origin access protection” https://github.com/whatwg/fetch/issues/878
  17. To HTTPS or not to HTTPS? For Open Data, we

    can maybe trust certain intermediate HTTP caches? Interesting idea: neighbourhood HTTP caches between peers on a train (Pauline Folz, Hala Skaf-Molli, Pascal Molli. CyCLaDEs: A Decentralized Cache for Triple Pattern Fragments. ESWC 2016: Extended Semantic Web Conference, May 2016, Heraklion, Greece. 〈hal-01251654v2〉) The trade-off: with HTTP, a response can be cached by an intermediate third-party HTTP proxy, but you don’t know whether an intermediary changed the response; with HTTPS, intermediaries cannot cache, but you know the response was not tampered with.
  18. To HTTPS or not to HTTPS? But, does it matter

    what we think? Step 1: go to an HTTPS website (e.g., https://ruben.verborgh.org) Step 2: Execute in the console (F12): fetch('http://pietercolpaert.be').then(async response => { console.log(await response.text()); }); Step 3: An error appears: the CORS request is unsafe. Conclusion: when we want to get our data used as much as possible, the best strategy is to offer HTTPS
  19. Caching: powering Web-scale 1. Caching based on expiration: Response headers:

    Cache-Control: max-age=X (in seconds) Age: Y 2. Caching based on the ETag header: Response header: ETag: <unique hash> Request header: If-None-Match Status code: 304 Not Modified In both cases: set the Vary header! ⇒ defines which other headers can change the cache key (e.g., the Accept headers) Full story: https://developer.mozilla.org/en-US/docs/Web/HTTP/Caching expirationTime = responseTime + X - Y
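
The slide’s expiration formula can be sketched as a small function (the function names are mine, not from the slides): a cached response is fresh while responseTime + max-age − Age is still in the future.

```javascript
// Freshness check for expiration-based caching (all times in seconds).
// The slide's formula: expirationTime = responseTime + X - Y,
// where X comes from Cache-Control: max-age=X and Y from the Age header.
function expirationTime(responseTime, maxAge, age) {
  return responseTime + maxAge - age;
}

function isFresh(responseTime, maxAge, age, now) {
  return now < expirationTime(responseTime, maxAge, age);
}

// A response fetched at t=1000 with Cache-Control: max-age=60 and Age: 10
// may be served from cache until t=1050.
console.log(isFresh(1000, 60, 10, 1040)); // true: still fresh
console.log(isFresh(1000, 60, 10, 1060)); // false: revalidate (e.g., with If-None-Match)
```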
  20. Once set up, leave the hard work to: • various

    easy-to-configure self-hosted caches: Apache, nginx, Varnish… • various cloud caches or content delivery networks (CDNs) Caching: powering Web-scale
  21. HTTP/2 Reuses an open TCP connection for future HTTP requests Can

    promise extra resources through a server push No limit of 6–8 concurrent requests per server Can already be enabled on e.g., nginx Already implemented in browsers
  22. Experiment 1. Requesting an HTTP/1.1 server for (let i = 0;

    i < 50; i++) fetch('https://linked.open.gent/parking'); 2. Requesting an HTTP/2 server for (let i = 0; i < 50; i++) fetch('https://graph.irail.be/sncb/connections');
  23. HTTP/1.1 HTTP/2

  24. Rule 2: publish your data on the Web But there are a

    lot of technical details you want to get right: 1. Make sure your document has a permalink (HTTP URL) 2. Enable Cross Origin Resource Sharing 3. Use HTTPS 4. Enable HTTP caching 5. Enable compression (gzip, deflate, br) 6. Enable HTTP/2
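
As a sketch, the checklist above might look like this in an nginx server block (the hostname, certificate paths, and max-age value are placeholders, not from the slides):

```nginx
server {
    listen 443 ssl http2;                  # 3. HTTPS + 6. HTTP/2
    server_name data.example.org;          # placeholder hostname

    ssl_certificate     /etc/ssl/example.pem;    # placeholder paths
    ssl_certificate_key /etc/ssl/example.key;

    gzip on;                               # 5. compression

    location /dumps/ {
        add_header Access-Control-Allow-Origin *;          # 2. CORS
        add_header Cache-Control "public, max-age=300";    # 4. HTTP caching
        # 1. permalinks: one stable HTTP URL per document under /dumps/
    }
}
```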
  25. Rule 3: use an open format

  26. Serialisations Table / CSV / Spreadsheet: name | type | city | population

    → StP-Plein | Parking | Gent | 257k JSON: { "StP-Plein" : { "type" : "Parking", "city" : "Gent", "population" : "257k" } } XML: <StP-Plein> <type>Parking</type> <city>Gent</city> <population>257k</population> </StP-Plein>
  27. Table / CSV / Spreadsheet: name | type | city | population

    → StP-Plein | Parking | Gent | 257k JSON: { "StP-Plein" : { "type" : "Parking", "city" : "Gent", "population" : "257k" } } XML: <StP-Plein> <type>Parking</type> <city>Gent</city> <population>257k</population> </StP-Plein> Triples (3 times a datum): <StP-Plein> <type> <Parking> . <StP-Plein> <city> <Gent> . <Gent> <population> "257k" .
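
A naive sketch of how the nested JSON serialization above flattens into subject–predicate–object triples (the function name is mine; a real RDF mapping would use URIs and could attach population to Gent, as the slide’s triples do):

```javascript
// Flatten a { subject: { predicate: value, ... } } object into triples.
function toTriples(obj) {
  const triples = [];
  for (const [subject, properties] of Object.entries(obj)) {
    for (const [predicate, object] of Object.entries(properties)) {
      triples.push({ subject, predicate, object }); // one datum = one triple
    }
  }
  return triples;
}

const data = { 'StP-Plein': { type: 'Parking', city: 'Gent', population: '257k' } };
console.log(toTriples(data)); // three triples, each with 3 parts
```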
  28. Rule 4: make sure your data can be interpreted as

    triples using RDF* * Resource Description Framework https://www.w3.org/TR/rdf-primer/
  29. Thought experiment: decentralized publishing on the World Wide Web Three

    triples spread over three HTTP machines: <St-Pietersplein> <city> <Gent> (machine 1), <St-Pietersplein> <type> <Parking> (machine 2), <Gent> <population> "257k" (machine 3) A user agent visiting each machine knows more than any of the machines independently
  30. Problem Sint-Pietersplein is a Parking Site ?

  32. Solution Sint Pietersplein → https://stad.gent/id/parking/P10 is a → http://www.w3.org/1999/02/22-rdf-syntax-ns#type Parking

    → http://vocab.datex.org/terms#UrbanParkingSite Uniform Resource Identifiers (URIs)
  33. Rule 5: (re)use global identifiers

  34. Which URIs? Learn from other datasets Get started at e.g.:

    https://data.vlaanderen.be https://stad.gent/linked-data https://lov.linkeddata.es
  35. Remember our Rule 1?

  36. Rule 1: document the data’s license

  37. Document this with a triple: <yourdocumenturl> dcterms:license <https://creativecommons.org/publicdomain/zero/1.0/> .

  38. Summary of our rules to raise interoperability Legal: open licenses

    Technical: publishing over HTTP Syntactic: RDF serializations Semantic: HTTP URIs allow discussing semantics globally; international, regional and local domain models On top: querying
  39. Summary of our rules to raise interoperability Legal: open licenses

    Technical: publishing over HTTP Syntactic: RDF serializations Semantic: HTTP URIs allow discussing semantics globally; international, regional and local domain models On top, querying: Web API design
  40. What happens when your document grows too large?

  41. Trade-off in Web publishing Data dumps and datasets split in

    fragments = data publishing (cheap/reliable), with smart agents on the client; smart servers offering entire query languages over HTTP = data services (rather expensive/unreliable), i.e. algorithms as a service Read more at http://linkeddatafragments.org
  42. http://api.{mycompany}/?from={A}&to={B} &departuretime=2019-03-27T12:02.024Z &wheelchairaccessible=true &transit_modes=plane,railway,bus,car &algorithm_mode=shortest ... Yet this interface will

    need to answer all questions for all third-party apps…
  43. data dump Route planning algorithms as a service Asking questions

    Your system 3rd party Your system Does not scale: extra users come with extra load Does not give the necessary flexibility to companies
  44. Publishing time schedules in fragments on the Web: Page 1

    → next → Page 2 → next → … → Page X (ordered by time)
  45. Try it yourself at https://linkedconnections.org and https://planner.js.org

  46. Exercise! Given 3 datasets with multiple pages, build 1 codebase

    showing an overview of all things (entities/resources) described in them Write it on paper, in pseudocode, or implement it in the language of your choice Feel free to cut corners with functions you can implement later 3 datasets: • https://linked.open.gent/parking • https://graph.irail.be/sncb/connections • http://fragments.dbpedia.org/2016-04/en
  47. Quick convenience tool LDFetch $ npm install -g ldfetch $

    ldfetch https://linked.open.gent/parking $ ldfetch --help # for more options
  48. Hints 1. Async iterators for incremental results ⇒ also a way

    to raise user-perceived performance 2. Only hardcode the URLs of the 3 datasets 3. Make an abstraction that detects building blocks in a dataset ◦ Document your “building blocks” 4. Set an Accept header
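
Hint 1 could be sketched like this: an async generator that yields entities page by page while following next links. Here getPage and the page shape are stand-ins I invented for the sketch; real code would fetch(url) with an Accept header and parse the RDF response.

```javascript
// Stand-in dataset: two pages linked by a "next" field.
const pages = {
  'page1': { items: ['entity-a', 'entity-b'], next: 'page2' },
  'page2': { items: ['entity-c'], next: null }
};

// Stand-in for fetch(url, { headers: { Accept: 'application/ld+json' } }).
async function getPage(url) {
  return pages[url];
}

// Async generator: consumers receive entities incrementally, page by page,
// which also raises user-perceived performance.
async function* allEntities(startUrl) {
  let url = startUrl;
  while (url) {
    const page = await getPage(url);
    yield* page.items;   // emit this page's entities immediately
    url = page.next;     // then follow the pagination link
  }
}

(async () => {
  for await (const entity of allEntities('page1')) {
    console.log(entity); // prints entity-a, entity-b, entity-c
  }
})();
```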
  49. It doesn’t stop at pagination DCAT: for describing datasets in

    data catalogs Triple Pattern Fragments: for solving basic graph patterns The Tree Ontology: for describing data in trees Routable Tiles: for describing geospatial data Demos at https://dexagod.github.io, http://query.linkeddatafragments.org, http://pieter.pm/demo-paper-routable-tiles/
  50. Exercise 2: Publish your own dataset Take a dataset from

    data.stad.gent and publish it properly Make sure your code also works with this dataset
  51. Contact details at https://pietercolpaert.be/#me
  52. Final thoughts

  53. Not to be underestimated: the organizational challenges

  54. Extending these ideas to “shared” datasets: the Solid protocols

  55. Solid applies the same ideas to shared data https://solid.mit.edu/

  56. WebSockets Client subscribes, server pushes updates to clients For datasets

    that update within a short timespan:
    // Create WebSocket connection.
    const socket = new WebSocket('ws://localhost:8080');
    // Connection opened
    socket.addEventListener('open', function (event) {
      socket.send('Hello Server!');
    });
    // Listen for messages
    socket.addEventListener('message', function (event) {
      console.log('Message from server ', event.data);
    });
  57. Should data streams always be push-based? Trade-off: number of clients

    versus CPU / response time, for polling and for pubsub
  58. July 2019: Student job Building route planners Linked Data experience

    needed