Elasticsearch and Linked Data

Elastic Co
September 18, 2017

Linked data is a method for exposing, sharing and connecting pieces of structured data, information and knowledge based on URIs and the Resource Description Framework (RDF). Traditionally, this type of data is stored in a triplestore optimized for running semantic queries. However, triplestores generally suffer from performance issues when searching and retrieving large quantities of data. We therefore investigated a range of alternatives for storing the data. Building on the JSON-LD serialization of RDF and on Elasticsearch, we were able to develop performant tools for managing change events to legal and regulatory content, as well as for maintaining tax return data to identify accountants’ clients impacted by these changes.

Quentin Reul works as a Content Integration Manager for Wolters Kluwer. In this role, he is responsible for maintaining the Wolters Kluwer semantic model and for developing new solutions that leverage data expressed according to this model. For instance, he was a lead architect on the development of a set of tools to identify changes in legal and regulatory content and to identify accountants’ clients impacted by these changes.

Quentin earned his Bachelor of Science in Computing Science and his Ph.D. on ontology management from the University of Aberdeen (Scotland). Over the years, he has been involved in several W3C groups, including the Semantic Web Deployment Working Group, which developed the Simple Knowledge Organization System (SKOS) specification, and the RDF & XML Interoperability Community Group.

https://www.meetup.com/Chicago-Elastic-Fantastics/events/242335695/

Transcript

  1. About the Speaker
    • Academic Background
      – PhD in Computing Science, 2012
        • Applied Semantic Web technologies to enable jet engine designers to retrieve service information
        • Developed a new approach to map entities in different ontologies based on their definition [http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.558665]
        • Involved in the development of the W3C SKOS Standard
      – BSc in Computing Science and AI, 2005
    • Work Experience (2008-Present)
      – Content Integration Manager @ Wolters Kluwer
        • Maintains and extends the Platform Content Interface (PCI) standard
        • Supports the development of Content Delivery Channel solutions
        • Supports the development of automated enrichment solutions
      – Content Semantics Architect @ Wolters Kluwer
        • Maintains and extends the Platform Content Interface (PCI) standard
        • Collaborates with WK businesses during the migration of content to the WK global publishing platform
      – Researcher @ Vrije Universiteit Brussel
        • Developed semantic models for enabling the exchange of personal data across the web
    Elastic Chicago User Group – 22 August, 2018
  2. “The Semantic Web is not a separate Web but an extension of the current one, in which [data] is given well-defined meaning, better enabling computers and people to work in cooperation.” - Tim Berners-Lee, James Hendler and Ora Lassila; Scientific American, May 2001
  3. Web of Documents, Web of Data
    From knowledge being buried in content and interpretable by humans to data with well-defined meaning that can be interpreted by both computers and humans.
    Source: The Next Web
  4. 1. Use URIs to identify the “things”
    2. Use https:// URIs to fetch data on the web
    3. Data relates to other things
    [Diagram: https://goo.gl/nkhygY, https://goo.gl/L2sHKj and https://goo.gl/9GSS2P, linked by “studied at” and “originates from”]
    Source: The Next Web
  5. • Resource Description Framework (RDF)
      – is for describing resources on the Web
      – is designed to enable the exchange of metadata on the Web
      – uses Uniform Resource Identifiers (URIs) [RFC 1630] and Literals to identify and reference web resources
        • URI = scheme:[//authority]path[?query][#fragment]
        • Literals represent data values
      – is a W3C recommendation
    Source: RDF - Semantic Web Standards
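The URI syntax on this slide can be illustrated with Python's standard `urllib.parse` module. This is only a minimal sketch: the example URI is made up for illustration and does not come from the talk.

```python
from urllib.parse import urlsplit

# Hypothetical URI, chosen only to show every component of
# scheme:[//authority]path[?query][#fragment]
uri = "https://example.org/people/quentin?lang=en#me"

parts = urlsplit(uri)

# Map the generic URI component names onto the attributes urlsplit exposes.
components = {
    "scheme": parts.scheme,
    "authority": parts.netloc,
    "path": parts.path,
    "query": parts.query,
    "fragment": parts.fragment,
}

for name, value in components.items():
    print(f"{name}: {value}")
```

The bracketed parts of the syntax are optional, so `urlsplit` returns an empty string for any component the URI does not contain.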
  6. A fact in RDF is expressed as a triple of the form (Subject, Predicate, Object). The Subject and Predicate are always URIs, whereas the Object can be either a URI or a Literal.
    [Diagram: Subject https://goo.gl/nkhygY (URI), Predicate foaf:name, Object “Quentin Reul” (Literal)]
    Source: RDF - Semantic Web Standards
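The rule on this slide (Subject and Predicate are URIs, the Object is a URI or a Literal) can be sketched as a small data structure. The class names `URI`, `Literal` and `Triple` are illustrative and not taken from the talk or from any RDF library, and the URI check is deliberately loose.

```python
from dataclasses import dataclass
from typing import Union
from urllib.parse import urlsplit


@dataclass(frozen=True)
class URI:
    value: str

    def __post_init__(self):
        # Very loose check: an absolute URI needs at least a scheme.
        if not urlsplit(self.value).scheme:
            raise ValueError(f"not an absolute URI: {self.value!r}")


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Triple:
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]

    def __post_init__(self):
        # Enforce the slide's rule at construction time.
        if not isinstance(self.subject, URI) or not isinstance(self.predicate, URI):
            raise TypeError("subject and predicate must be URIs")
        if not isinstance(self.obj, (URI, Literal)):
            raise TypeError("object must be a URI or a Literal")


# foaf:name written out with the FOAF namespace.
FOAF_NAME = URI("http://xmlns.com/foaf/0.1/name")

# The fact shown on the slide.
fact = Triple(URI("https://goo.gl/nkhygY"), FOAF_NAME, Literal("Quentin Reul"))
print(fact)
```

Putting a `Literal` in the subject position raises a `TypeError`, which mirrors the constraint stated on the slide.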
  7. RDF Serializations
    • RDF supports different serialization formats:
      – RDF/XML
        • “Official” machine readable syntax
        • Not governed by a DTD or XSD
      – Terse RDF Triple Language (Turtle)
        • Concise human readable syntax
      – RDFa
        • Developed to embed RDF triples in HTML and XML
      – JavaScript Object Notation for Linked Data (JSON-LD)
        • Compact data format to exchange data between applications
    Source: RDF Serializations
  8. RDF Serializations
    [Figure: the triple (https://goo.gl/nkhygY, foaf:name, “Quentin Reul”) shown in its RDF/XML serialization and its JSON-LD serialization]
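The serializations shown on this slide did not survive in the transcript. As a rough illustration, the same triple could look like this in JSON-LD. This is a sketch rather than the speaker's actual document, and the `expand` helper is a hand-rolled prefix expansion for this one case, not a full JSON-LD processor.

```python
import json

# One possible JSON-LD document for the triple
# (https://goo.gl/nkhygY, foaf:name, "Quentin Reul").
doc = {
    "@context": {"foaf": "http://xmlns.com/foaf/0.1/"},
    "@id": "https://goo.gl/nkhygY",
    "foaf:name": "Quentin Reul",
}

serialized = json.dumps(doc, indent=2)
print(serialized)


def expand(term: str, context: dict) -> str:
    """Expand a prefixed term such as 'foaf:name' using the @context prefixes."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return term


# Recover the (Subject, Predicate, Object) triple from the document.
triple = (doc["@id"], expand("foaf:name", doc["@context"]), doc["foaf:name"])
print(triple)
```

Because the document is plain JSON, it can be stored in a document store as-is, which is the property the rest of the talk relies on.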
  9. RDF Datastores
    • Store data as triples or graphs
    • Support schemaless data structures
    • Leverage W3C SPARQL as query language
    • Provide inference of implicit facts
    NoSQL Datastores
    • Store data as key/value pairs, documents or graphs
    • Support schemaless data structures
    • Provide Domain Specific Language (DSL) for querying
    • Do not provide inference of implicit facts
  10. Users are browsing content to identify relevant events in the system. Depending on the event, it may trigger closer examination and generation of additional material.
  11. • Blazegraph is an open-source graph database that
      – Supports RDF “natively”
      – Supports SPARQL 1.1 for querying data
      – Supports inferencing of implicit facts
      – Supports high availability (HA) with online backup
    Source: Case Study: Achieving a 200% increase in time on site with the Blazegraph High Availability Graph Database
  12. • Elasticsearch is an open-source document store that:
      – Supports schemaless JSON objects
      – Supports search of unstructured & structured data
      – Supports high availability (HA)
      – Supports multi-tenancy
    Source: “It’s all in the {find} y’know” @ the BBC
  13. Blazegraph vs. Elasticsearch
    [Chart: response time for all events (in ms), BG vs. Elastic; axis ticks from 0 to 6000; data points labelled 153, 194, 797, 1079, 1256, 1482, 1545, 2139, 2733, 4166, 4760, 5043, 5637 and 6514]
    Comparing the performance of Blazegraph vs. Elastic for browse, we observed that using Elastic had less impact on performance as the number of events increased.
  14. Blazegraph vs. Elasticsearch
    [Chart: response time for last 10 events (in ms), BG vs. Elastic; axis ticks from 0 to 800; data points labelled 153, 194, 797, 1079, 1256, 1482, 1545, 2139, 2733, 4166, 4760, 5043, 5637 and 6514]
    Comparing the performance of Blazegraph vs. Elastic for browse, we observed that using Elastic had minimal impact on performance when doing pagination as the number of events increased.
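The “last 10 events” case corresponds to a paginated search. A minimal sketch of such a request body in the Elasticsearch query DSL follows. `from`, `size`, `sort` and `match_all` are standard search body elements, while the field name `eventDate` and the helper `last_events_query` are assumptions made for illustration; the talk does not show the actual index mapping or queries.

```python
import json


def last_events_query(page: int, page_size: int = 10) -> dict:
    """Build a search request body for one page of the most recent events."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {
        "from": page * page_size,  # offset of the first hit on this page
        "size": page_size,         # number of hits to return
        # 'eventDate' is a hypothetical date field on the event documents.
        "sort": [{"eventDate": {"order": "desc"}}],
        "query": {"match_all": {}},
    }


# First page: the 10 most recent events.
body = last_events_query(page=0)
print(json.dumps(body, indent=2))
```

Only one page of documents is returned per request, which may help explain why the response time on this slide stays low as the total number of events grows.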
  15. Trigger Event Occurs
    How does an Accountant process all that information?
    Case law; Legislation; Rulings & announcements; Macro-economic events
    Source: CCH iQ – Experience Predictive Intelligence
  16. HOW can we help?
    Trigger event occurs; Accountant’s clients; Client profiling; Experts + technology; Client match
    Source: CCH iQ – Experience Predictive Intelligence
  17. Event Detection: Detect “trigger event” from websites selected by Subject Matter Experts.
    Client Matching: Client match is achieved by abstracting the information in the “trigger event” into a set of criteria that is used to filter impacted clients.
    [Diagram labels: Event Detection Engine; Is there an event?; iFirm; iKnow; Publish; My Impacted Clients; Do any clients match?; Event; Accountant; Expert tools; Client Profile Engine]
    Source: CCH iQ – Experience Predictive Intelligence
  18. Event Detection
    [Annotated example with labels: Tax Concepts, Tax Entities, Event Signal, Time Period]
    Source: Fuel tax credits – six monthly indexation alert
  19. Event Detection
    [Annotated example with labels: Tax Concepts, Tax Entities, Event Signal, Time Period]
    Source: Fuel tax credits – six monthly indexation alert
  20. Client Matching
    SMEs have defined mappings between tax concepts and tax fields that are used to determine that an event relates to a particular tax field.
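The mapping from tax concepts to tax fields, and its use as client match criteria, might be sketched as follows. Everything specific here is hypothetical: the mapping entries, the field names and both helper functions are invented for illustration, and the talk does not say how the criteria are actually expressed. The generated query uses the standard `bool`, `should`, `minimum_should_match` and `exists` elements of the Elasticsearch query DSL.

```python
# Hypothetical SME-defined mapping from tax concepts to tax return fields.
CONCEPT_TO_FIELDS = {
    "fuel tax credit": ["fuelTaxCredits"],
    "capital gains": ["netCapitalGain", "capitalLosses"],
}


def tax_fields_for_event(concepts):
    """Return the tax fields an event relates to, given its detected concepts."""
    fields = []
    for concept in concepts:
        for field in CONCEPT_TO_FIELDS.get(concept.lower(), []):
            if field not in fields:  # keep order, drop duplicates
                fields.append(field)
    return fields


def client_match_query(fields):
    """Build a query matching clients whose tax return has any of the fields."""
    return {
        "query": {
            "bool": {
                "should": [{"exists": {"field": f}} for f in fields],
                "minimum_should_match": 1,
            }
        }
    }


# Concepts as they might be detected in a trigger event (illustrative only).
event_concepts = ["Fuel tax credit"]
fields = tax_fields_for_event(event_concepts)
query = client_match_query(fields)
print(fields)
print(query)
```

In this sketch a client matches when its tax return document contains at least one of the mapped fields; real match criteria would likely also constrain values and time periods.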
  21. Client Matching
    Expert tools have been developed to enable SMEs to define client match criteria and visualize the number of clients impacted by an event.
  22. • Elasticsearch provides an efficient way to store, query and retrieve Linked Data
    • Elasticsearch allows the retrieval of documents or part thereof
    • Elasticsearch requires Linked Data resources to be expanded in JSON-LD documents
      – Duplication of data
      – Reload data when labels of lists change
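The last point, that referenced resources have to be expanded into each JSON-LD document, can be sketched in plain Python. The URIs, the `label` and `about` keys and the `embed_labels` helper are assumptions made for illustration; they show the general denormalization pattern, not the speaker's implementation.

```python
# Hypothetical label list, keyed by concept URI.
LABELS = {"https://example.org/concept/fuel-tax-credit": "Fuel tax credits"}


def embed_labels(doc, labels):
    """Return a copy of doc in which every {"@id": ...} reference carries its label."""
    def visit(node):
        if isinstance(node, dict):
            out = {key: visit(value) for key, value in node.items()}
            ref = out.get("@id")
            if ref in labels:
                out["label"] = labels[ref]  # duplicated into the document
            return out
        if isinstance(node, list):
            return [visit(item) for item in node]
        return node
    return visit(doc)


# A hypothetical event that only references a concept by URI.
event = {
    "@id": "https://example.org/event/1",
    "about": [{"@id": "https://example.org/concept/fuel-tax-credit"}],
}

# Document as it would be indexed: the label is copied in.
indexed = embed_labels(event, LABELS)

# When a label changes, the already indexed copy is stale and must be rebuilt.
new_labels = {"https://example.org/concept/fuel-tax-credit": "Fuel tax credit rates"}
reindexed = embed_labels(event, new_labels)

print(indexed["about"][0]["label"])
print(reindexed["about"][0]["label"])
```

Every document that mentions the concept holds its own copy of the label, which is the duplication noted on the slide, and a label change means regenerating and reloading all of those documents.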