The impact of an extra feature on the scalability of Linked Connections

Pieter Colpaert, Sander Ballieu, Ruben Verborgh, Erik Mannens

How to publish public transport data for everyone?

Proposal: http://api.{mycompany}/?from={A}&to={B} &departuretime=2016-10-16T14:45.024Z &wheelchairaccessible=true &transit_modes=plane,railway,bus,car &algoritm_mode=shortest ...
One service for everything/everyone: unscalable

Linked Connections instead publishes a paged collection of departure/arrival objects (connections).
It sits between the two extremes: a GTFS data dump and route planning algorithms as a service.

When published in pages, route planning needs i requests instead of 1.
Pages are ordered in time (Page 1, Page 2, Page 3, ..., Page i) and each page links to the next one with hydra:next: http://data.{yourcompany}/?page={i}
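A minimal sketch of what such a client could look like, assuming each page carries a list of connections sorted by departure time plus a hydra:next link. The page layout, the field names (`from`, `to`, `dep`, `arr`) and the in-memory `PAGES` dictionary standing in for HTTP requests are all hypothetical, and the earliest-arrival scan is a simplified illustration rather than the actual LC client code.

```python
INF = float("inf")

# Hypothetical pages standing in for http://data.{yourcompany}/?page={i}.
PAGES = {
    "?page=1": {
        "connections": [
            {"from": "A", "to": "B", "dep": 0, "arr": 10},
            {"from": "A", "to": "C", "dep": 5, "arr": 30},
        ],
        "hydra:next": "?page=2",
    },
    "?page=2": {
        "connections": [{"from": "B", "to": "C", "dep": 12, "arr": 20}],
        "hydra:next": None,
    },
}

requests_made = []


def fetch(url):
    """Stand-in for an HTTP GET; records every page request."""
    requests_made.append(url)
    return PAGES[url]


def iter_connections(first_url):
    """Yield connections page by page, following hydra:next links."""
    url = first_url
    while url is not None:
        page = fetch(url)
        yield from page["connections"]
        url = page.get("hydra:next")


def earliest_arrival(connections, origin, destination, departure):
    """Scan time-ordered connections once, relaxing each reachable one."""
    earliest = {origin: departure}
    for c in connections:
        reachable = earliest.get(c["from"], INF) <= c["dep"]
        if reachable and c["arr"] < earliest.get(c["to"], INF):
            earliest[c["to"]] = c["arr"]
    return earliest.get(destination)


result = earliest_arrival(iter_connections("?page=1"), "A", "C", 0)
print(result, requests_made)
```

The route is computed on the client; the server only hands out pages, which is why one query costs as many requests as there are pages to scan.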

Demo at ISWC 2015 http://linkedconnections.org

Open source LC client code: let’s build specialized user-agents*
* A user-agent can be anything, also a third-party API
Diagram: LC server, two HTTP caches, LC client (fetches pages), private API

Can we extend this interface with an extra feature? E.g., for people in a wheelchair?
Axis: GTFS data dump ... route planning algorithms as a service
http://data.{yourcompany}/?page={i}&wheelchair={true/false}
Hypothesis: faster query response times when the server helps filter the connections

Wheelchair accessibility feature
Step 1: trip-based filtering: get all wheelchair accessible connections, ordered in time
Step 2: stop-based filtering: when getting on/off/transferring at a stop, the stop itself must also be wheelchair accessible
In the Linked Connections framework, only step 1 could be done on the server
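The two steps can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the connection fields, the `ACCESSIBLE_TRIPS` and `ACCESSIBLE_STOPS` sets and the sample timetable are hypothetical. Step 1 is a plain filter on the trip, so it can run anywhere. Step 2 depends on where the traveller boards, alights or transfers, which is only known while the route is being computed.

```python
INF = float("inf")

ACCESSIBLE_TRIPS = {"t1", "t2"}      # hypothetical accessible trips database
ACCESSIBLE_STOPS = {"A", "C", "D"}   # hypothetical: stop B is not accessible

# Hypothetical timetable, sorted by departure time.
connections = [
    {"trip": "t1", "from": "A", "to": "B", "dep": 0, "arr": 10},
    {"trip": "t3", "from": "A", "to": "D", "dep": 1, "arr": 5},
    {"trip": "t1", "from": "B", "to": "C", "dep": 11, "arr": 20},
    {"trip": "t2", "from": "B", "to": "D", "dep": 12, "arr": 15},
    {"trip": "t1", "from": "C", "to": "D", "dep": 21, "arr": 30},
]


def filter_trips(conns, accessible_trips):
    """Step 1: keep only connections that belong to an accessible trip."""
    return [c for c in conns if c["trip"] in accessible_trips]


def accessible_earliest_arrival(conns, origin, destination, departure, stops):
    """Step 2: scan connections; boarding and alighting need accessible stops."""
    if origin not in stops or destination not in stops:
        return None
    earliest = {origin: departure}
    boarded = set()  # trips the traveller can already be sitting on
    for c in conns:
        can_board = c["from"] in stops and earliest.get(c["from"], INF) <= c["dep"]
        if c["trip"] in boarded or can_board:
            boarded.add(c["trip"])
            # Getting off is only allowed at an accessible stop.
            if c["to"] in stops and c["arr"] < earliest.get(c["to"], INF):
                earliest[c["to"]] = c["arr"]
    return earliest.get(destination)


step1 = filter_trips(connections, ACCESSIBLE_TRIPS)
arrival = accessible_earliest_arrival(step1, "A", "D", 0, ACCESSIBLE_STOPS)
print(len(step1), arrival)
```

In this made-up timetable the traveller may ride trip t1 through stop B, but may not transfer there, so the quick transfer onto t2 is ruled out even though t2 itself is accessible.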

Evaluation
Re-playing all queries from a route planning API for Belgian railways for 15 min
Each time at x times the normal load, with x: 0.5, 1, 2, 4, 8, 12 and 16
Grab this query mix over here: https://github.com/linkedconnections/belgianrail-query-load

Experiment 1: client filters trips and stops
Request: http://data.{yourcompany}/?page={i}
Diagram: LC server, two HTTP caches, LC client (fetches pages) + trips and stops filter, accessible trips database, accessible stops database

Experiment 2: server filters trips
Request: http://data.{yourcompany}/?page={i}&wheelchair={true/false}
Diagram: LC server + trips filter, two HTTP caches, LC client (fetches pages) + stops filter, accessible trips database, accessible stops database

Results
1. Difference in cache hit-rate
2. Difference in CPU use on the server
3. Difference in CPU use on the client
4. Average time to relax one connection

The cache performance drops by 3-6% with an extra boolean filter on the server
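One plausible reason, stated here as an assumption rather than something the slides spell out: a boolean query parameter gives every page two distinct URLs, so the same number of requests is spread over more cache keys. The toy simulation below uses a made-up request mix and an unbounded cache; its numbers are illustrative only and are unrelated to the measured 3-6%.

```python
def hit_rate(urls):
    """Fraction of requests served from a cache that never evicts."""
    seen = set()
    hits = 0
    for url in urls:
        if url in seen:
            hits += 1
        else:
            seen.add(url)
    return hits / len(urls)


# Made-up mix: 8 clients each scan pages 1..5; every fourth client
# asks for wheelchair accessible connections.
clients = [{"wheelchair": i % 4 == 0} for i in range(8)]

# Without the server-side filter, all clients share the same page URLs.
unfiltered = [f"?page={p}" for _ in clients for p in range(1, 6)]

# With the filter, each page exists in a true and a false variant.
filtered = [
    f"?page={p}&wheelchair={str(c['wheelchair']).lower()}"
    for c in clients
    for p in range(1, 6)
]

rate_unfiltered = hit_rate(unfiltered)
rate_filtered = hit_rate(filtered)
print(rate_unfiltered, rate_filtered)
```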

CPU usage on the server increases when filtering on the server

The CPU usage of the client is higher when filtering on the server

Scanning a connection takes fewer ms (under lower load) when filtering on the client

Conclusion
For Linked Connections: the hypothesis was wrong: more CPU needed for slower overall query times
For the Linked Data consumer/publisher community: adding server functionality when publishing data for maximum reuse does not always mean helping user-agents
Let’s enable the power of datasets published on the Web for the many, not the few → http://linkedconnections.org
