A Case Study in Pragmatism: exploring the practical failure modes of Linked Data as applied to classical music catalogues

For the last decade, efforts have been made to build the Linked Data-enabled Semantic Web. Some of the earliest efforts in publishing Linked Data included music-specific corpora, modeled using, for example, the Music Ontology. Given that, one might assume that creating a Linked Data client in the music domain would be a reasonably straightforward process. The reality is not so simple. In this paper, we present a case study in building a real-world web service that makes use of third-party Linked Data resources: libmus. We describe the variety of difficulties in reliably linking to several external sources. These range from poor data coverage, to inconsistent serialization, to sporadic availability. We outline the ways in which we have coped with these problems within the libmus system. These include local caching of third-party data to enable search, screen-scraping to expand data availability, and automatically generated regular expressions to emulate fuzzy search. Finally, we consider how the Linked Data ecosystem can be improved to mitigate these problems in other systems and what this means for the future of the Linked Data Web.

Ben Fields

June 24, 2015

Transcript

  1. A Case Study in Pragmatism:
    exploring the practical failure modes of Linked
    Data as applied to classical music catalogues
    Ben Fields @alsothings
    Sam Phippen @samphippen
    Brad Cohen

  2. Introduction

  3. View Slide

  4. Where is the friction?
    What is the distance
    between specification
    and realisation?

  5. Background

  6. Background: Goals and
    design of libmus

  7. Mapping links between
    agents and works

  8. Mapping these concepts
    across disparate
    repositories

  9. View Slide

  10. View Slide

  11. View Slide

  12. Background: Linked Data
    ecosystem reliability

  13. We care about two kinds
    of reliability

  14. Reliable modelling:
    resolvable and
    accessible

  15. Basically

  16. Use obvious
    self-documenting
    class and attribute names

  17. Reliable repositories:
    availability

  18. Uptime: time service is
    reachable

  19. Downtime: time service
    is unreachable

  20. Availability: percentage
    of uptime in given length
    of time

  21. Example!

  22. 10 minutes of downtime
    in 1 week

  23. (10080 - 10) / 10080 × 100 ≈ 99.9%
    (one week = 7 × 24 × 60 = 10080 minutes)
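
The availability definition on the preceding slides can be checked with a few lines of Python. This is only a sketch of the arithmetic, not code from libmus; the function name `availability` and the constant `MINUTES_PER_WEEK` are made up for illustration. A week has 7 × 24 × 60 = 10,080 minutes.

```python
def availability(total_minutes: float, downtime_minutes: float) -> float:
    """Return availability as the percentage of uptime in the given period."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must lie between 0 and total_minutes")
    uptime_minutes = total_minutes - downtime_minutes
    return uptime_minutes / total_minutes * 100


# Hypothetical constant: minutes in one week.
MINUTES_PER_WEEK = 7 * 24 * 60

# The slide's example: 10 minutes of downtime in 1 week.
weekly = availability(MINUTES_PER_WEEK, 10)
print(f"{weekly:.1f}%")  # prints "99.9%"
```

Ten minutes of downtime in a week is already "three nines", which gives a sense of how strict the usual availability targets are.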

  24. SPARQL Endpoint
    Current Availability
    [Chart: percentage of endpoints up vs. down, axis 0 to 100]
    SPARQL availability index live at http://sparqles.ai.wu.ac.at/availability; this data was fetched on 26 April 2015

  25. View Slide

  26. Linking Data: Theory vs.
    Reality

  27. Theory vs Reality: IMSLP

  28. View Slide

  29. View Slide

  30. View Slide

  31. No search endpoint

  32. So we downloaded
    metadata for:
    16,716 composers
    88,464 works

  33. SELECT imslp_id, permalink
    FROM imslp_composers
    WHERE imslp_id
    LIKE
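
The query on this slide is cut off after `LIKE`, so the pattern actually used is unknown. As a rough sketch of how search over a local cache of downloaded metadata could work, the Python below loads rows into an in-memory SQLite table and runs a substring `LIKE` query. The table and column names come from the slide; the helper names, the sample rows, the `example.org` permalinks, and the `%term%` pattern are all assumptions for illustration.

```python
import sqlite3


def build_cache(rows):
    """Create an in-memory cache table and load (imslp_id, permalink) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE imslp_composers (imslp_id TEXT PRIMARY KEY, permalink TEXT)"
    )
    conn.executemany("INSERT INTO imslp_composers VALUES (?, ?)", rows)
    return conn


def search_composers(conn, term):
    """Return rows whose imslp_id contains term (substring match via LIKE)."""
    # Escape LIKE wildcards so user input is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    cursor = conn.execute(
        "SELECT imslp_id, permalink FROM imslp_composers "
        "WHERE imslp_id LIKE ? ESCAPE '\\' ORDER BY imslp_id",
        (f"%{escaped}%",),
    )
    return cursor.fetchall()


# Hypothetical sample data; real identifiers and permalinks would differ.
conn = build_cache([
    ("Handel, George Frideric", "https://example.org/handel"),
    ("Beethoven, Ludwig van", "https://example.org/beethoven"),
])
print(search_composers(conn, "handel"))
```

A pattern with a leading `%` cannot use an ordinary index, so SQLite scans the whole table. That is fine for tens of thousands of rows but would not scale indefinitely.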

  34. search: done ✅

  35. search: done ✅
    (As long as there aren’t
    too many items)

  36. Theory vs Reality: VIAF

  37. No SPARQL endpoint

  38. Let’s have some
    questions

  39. Let’s have some
    questions

  40. Let’s have some
    questions

  41. Let’s have some
    questions

  42. Let’s have some
    questions

  43. Let’s have some
    questions

  44. Let’s have some
    questions

  45. Let’s have some
    questions

  46. Theory vs Reality:
    LinkedBrainz

  47. Really good triples!

  48. Functional SPARQL
    endpoint

  49. Let’s have some
    questions

  50. Let’s have some
    questions

  51. Works notions!

  52. Let’s have some
    questions

  53. Let’s have some
    questions

  54. Let’s have some
    questions

  55. Sonata
    BEETHO QUAR O 18 N 3
    VARIOU ALTE TSCH ORGE
    HANDEL SUIT 5 AIR + DAN

  56. How to align pathological
    data?

  57. Let’s have some
    questions

  58. HANDEL SUIT 5 AIR +
    DAN

  59. (HANDEL (SUIT)? (5)? (AIR)? (\\+)? (DAN)?|HANDEL
    SUIT (5)? (AIR)? (\\+)? (DAN)?|HANDEL 5 (AIR)? (\\+)?
    (DAN)?|HANDEL AIR (\\+)? (DAN)?|HANDEL \\+
    (DAN)?|HANDEL DAN | SUIT (5)? (AIR)? (\\+)? (DAN)?|
    5 (AIR)? (\\+)? (DAN)?| AIR (\\+)? (DAN)?| \\+ (DAN)?|
    DAN |SUIT 5 (AIR)? (\\+)? (DAN)?|SUIT AIR (\\+)?
    (DAN)?|SUIT \\+ (DAN)?|SUIT DAN |5 AIR (\\+)?
    (DAN)?|5 \\+ (DAN)?|5 DAN |AIR \\+ (DAN)?|AIR DAN
    |\\+ DAN )

  60. 393 character regex
    35 optional groups
    20 choices
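
The slides do not spell out the rules libmus uses to generate this expression, so the following Python is a simplified guess at the general idea, not a reconstruction. It splits an abbreviated title into tokens, treats each word-like token as a word prefix, and builds one alternative per starting token with every later token in an optional group. The function names and the prefix-matching rule are assumptions.

```python
import re


def token_pattern(token: str) -> str:
    """Turn one abbreviated token into a regex fragment."""
    escaped = re.escape(token)
    # Assumption: word-like tokens abbreviate a longer word ("SUIT" -> "Suite"),
    # so they match as word prefixes; symbols such as "+" match literally.
    return rf"\b{escaped}\w*" if token[0].isalnum() else escaped


def build_fuzzy_regex(abbreviated: str) -> re.Pattern:
    """Build one alternative per starting token, with later tokens optional."""
    parts = [token_pattern(token) for token in abbreviated.split()]
    if not parts:
        raise ValueError("abbreviated title contains no tokens")
    alternatives = []
    for start in range(len(parts)):
        optional_tail = "".join(rf"(?:\W+{part})?" for part in parts[start + 1:])
        alternatives.append(parts[start] + optional_tail)
    return re.compile("|".join(alternatives), re.IGNORECASE)


pattern = build_fuzzy_regex("HANDEL SUIT 5 AIR + DAN")
print(bool(pattern.search("Handel: Suite No. 5, Air and Variations")))  # True
print(bool(pattern.search("Mozart: Requiem")))  # False
```

Because a single shared token is enough to match, a pattern like this casts a wide net. It only produces candidates, which still need a stricter second pass, and the number of alternatives and optional groups grows quickly with the number of tokens.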

  61. Performance nightmare

  62. Let’s have some
    questions

  63. View Slide

  64. Let’s have some
    questions

  65. We then filter by
    Levenshtein and
    threshold
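
Filtering candidates by Levenshtein distance against a threshold might look like the sketch below. The slides give neither the threshold value nor the normalisation, so the `max_ratio=0.4` default, the division by the longer string's length, and the function names are all assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def filter_candidates(query, candidates, max_ratio=0.4):
    """Keep candidates whose normalised distance to query is within max_ratio."""
    query_lower = query.lower()
    kept = []
    for candidate in candidates:
        distance = levenshtein(query_lower, candidate.lower())
        ratio = distance / max(len(query), len(candidate), 1)
        if ratio <= max_ratio:
            kept.append((ratio, candidate))
    kept.sort()  # closest matches first
    return [candidate for _, candidate in kept]


print(filter_candidates("handel suite", ["Handel Suite", "Mozart Requiem"]))
```

Normalising by length keeps one threshold usable for both short and long titles. An absolute distance cutoff would be an equally reasonable choice.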

  66. Conclusions

  67. Doing Linked Data is
    harder than it should be

  68. When publishing data
    consider:

  69. the obviousness of your
    model,

  70. search and access,

  71. availability and reliability
    of endpoints.

  72. Future Work

  73. What course of action will
    help nurture the Linked
    Data ecosystem?

  74. Make likely paths to
    specific data items as
    obvious as possible

  75. SPARQL endpoint
    reliability

  76. Publish your data!

  77. Let’s have some questions
    Ben Fields @alsothings
    Sam Phippen @samphippen
    Brad Cohen
