Supporting Data Interlinking in Semantic Libraries with Microtask Crowdsourcing

SWIB14
December 03, 2014

Presenter: Cristina Sarasua (Institute for Web Science and Technologies (WeST), University of Koblenz-Landau, Germany)

Abstract:
Semantic Web technologies enable the integration of distributed data sets curated by different organisations for different purposes. Descriptions of particular resources (e.g. events, persons or images) are connected through links that explicitly state the relationship between them. By connecting data from similar or disparate domains, libraries can offer more extensive and detailed information to their visitors, while librarians gain better documentation for their cataloguing activities. Despite the advances in data interlinking technology, human intervention is still a core aspect of the process. Humans, in particular librarians, are crucial both as knowledge providers and as reviewers of the automatically computed links. One problem that arises in this scenario is that libraries might have limited human resources dedicated to authority control, so running the time-consuming interlinking process over external data sets becomes troublesome. Microtask crowdsourcing provides an economical and scalable way to involve humans systematically in data processing. The goal of this talk is to introduce the process of crowdsourced data interlinking in semantic libraries, a paid crowd-powered approach that can support librarians in the interlinking task. Several use cases illustrate how our software, which implements the crowdsourced data interlinking process, could reduce the amount of information that librarians need to process when enriching their data with other sources, or provide a different perspective from potential users. In addition, challenges that become relevant when adopting this approach are listed.

Transcript

  1. Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
    Supporting Data Interlinking
    in Semantic Libraries
    with Microtask Crowdsourcing
    Cristina Sarasua
    SWIB 2014, Bonn

  2. Supporting Data Interlinking in Semantic Libraries with Microtask Crowdsourcing 2
    Cristina Sarasua

  3. Diagram: resources a and b connected by a relation (MARC 21, FRBR, EDM)

  4. Diagram: resources a and b connected by a relation (MARC 21, FRBR, EDM)

  5. Please share your thoughts on interlinking!
    https://etherpad.mozilla.org/4IfZDaTBIe

  6. Interlinking on the Web of Data
    Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/

  7. Cross-dataset links
    (a, r, b) | a in D1, b in D2
    d1:timbl owl:sameAs d2:timbernerslee .
    d1:donostia owl:sameAs d2:sansebastian .
    d1:bjork dc:creator d2:volta .
    d1:Bonn wgs84:location d2:Germany .
    d1:work2012 o:inspiredBy d2:song1900 .
    o1:Conference owl:equivalentClass o2:Congress .
    o1:Democracy skos:related o2:Government .
    o1:Publication skos:broader o2:JournalArticle .
    o1:ImpressionistPainting rdfs:subClassOf o2:Painting .
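
The definition (a, r, b) with a in D1 and b in D2 can be made concrete with a small sketch. It assumes, purely for illustration, that dataset membership can be read off the `d1:` / `d2:` prefixes used on the slide; the `Link` type, the `is_cross_dataset` helper and the `foaf:knows` triple are hypothetical and not part of the talk.

```python
from typing import NamedTuple


class Link(NamedTuple):
    source: str    # resource a, expected to belong to D1
    relation: str  # predicate r, e.g. "owl:sameAs"
    target: str    # resource b, expected to belong to D2


def is_cross_dataset(link: Link, d1_prefix: str = "d1:", d2_prefix: str = "d2:") -> bool:
    """Return True when the link goes from a D1 resource to a D2 resource."""
    return link.source.startswith(d1_prefix) and link.target.startswith(d2_prefix)


links = [
    Link("d1:timbl", "owl:sameAs", "d2:timbernerslee"),
    Link("d1:donostia", "owl:sameAs", "d2:sansebastian"),
    Link("d1:work2012", "o:inspiredBy", "d2:song1900"),
    # Hypothetical triple inside D1 only, so it is not a cross-dataset link.
    Link("d1:timbl", "foaf:knows", "d1:donostia"),
]

cross_links = [link for link in links if is_cross_dataset(link)]
```

Note that the relation is not restricted to `owl:sameAs`: any predicate connecting a D1 resource to a D2 resource counts as a cross-dataset link.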

  8. Why is interlinking important?
    • Enhance the description of local entities
    • Richer queries over aggregated data
    • Cross-data set browsing
    What is known about Berlin?
    x:berlin owl:sameAs dbpedia:Berlin, tour:berlin ;
        o:homeOf authors:berlin .
    x:img09112014 lode:atPlace geo:brandtor .
    SELECT ?city1
    WHERE {
      ?city1 gov:population ?pop .
      ?city1 owl:sameAs ?city2 .
      ?city2 unesco:count ?mon .
      FILTER (?pop > 1000000 && ?mon > 50)
    }
    http://www.w3.org/2005/Incubator/lld/XGR-lld-20111025/
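
To show why the `owl:sameAs` link matters for the query on this slide, here is a minimal pure-Python sketch of the same join over an in-memory triple list. All identifiers and numbers are invented for illustration, and the helper names are hypothetical; a real setup would run the SPARQL query against a triple store instead.

```python
# Hypothetical triples; every identifier and number below is invented for illustration.
TRIPLES = [
    ("gov:cityA", "gov:population", 2_500_000),
    ("gov:cityB", "gov:population", 300_000),
    ("gov:cityA", "owl:sameAs", "unesco:cityA"),
    ("gov:cityB", "owl:sameAs", "unesco:cityB"),
    ("unesco:cityA", "unesco:count", 60),
    ("unesco:cityB", "unesco:count", 70),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return all objects o such that (subject, predicate, o) is in the triple list."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def cities_matching(min_population, min_monuments, triples=TRIPLES):
    """Cities above a population threshold whose linked resource has enough monuments."""
    matches = []
    for subject, predicate, population in triples:
        if predicate != "gov:population" or population <= min_population:
            continue
        # Follow the cross-dataset link to reach data held in the other dataset.
        for same in objects(subject, "owl:sameAs", triples):
            if any(count > min_monuments for count in objects(same, "unesco:count", triples)):
                matches.append(subject)
                break
    return matches


result = cities_matching(1_000_000, 50)
```

Without the `owl:sameAs` triples, the population data and the monument counts could not be combined, because they describe the city under different identifiers.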

  9. Generating links
    Identify the resources in D1 and D2 to be connected with relation R
    Comparison criteria
    Decision boundary between link and non-link
    Picture: https://www.assembla.com/spaces/silk/wiki/Managing_Reference_Links
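
A comparison criterion with a decision boundary can be sketched as a similarity score plus a threshold. The sketch below assumes label similarity from Python's standard `difflib` as the only criterion and an arbitrary threshold of 0.85; neither is taken from the talk, and real link discovery tools combine several criteria.

```python
from difflib import SequenceMatcher


def label_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity of two labels, between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(label1: str, label2: str, threshold: float = 0.85) -> str:
    """Place a candidate pair on one side of the decision boundary."""
    return "link" if label_similarity(label1, label2) >= threshold else "non-link"
```

For example, `classify("Berlin", "berlin")` returns `"link"`, while `classify("Berlin", "Berlinale")` returns `"non-link"`. Pairs whose score falls close to the threshold are the ones where human review is most useful.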

  10. He is already busy
    Attribution: Thomas Leu

  11. He is already busy
    … but still would like correct and useful links
    Attribution: Thomas Leu

  12. Crowdsourced Interlinking

  13. Crowdsourcing
    “Crowdsourcing represents the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call” (Jeff Howe, 2006)
    Fast, scalable
    Microtask crowdsourcing
    – E.g. tweet sentiment analysis
    – Seconds, reward in cents
    – Crowd workers register with a simple profile, limited filtering
    Macrotask crowdsourcing
    – E.g. writing an e-book
    – Months, $30 per hour / hundreds or thousands of dollars
    – Freelancer recruitment, interviews
    Contest-based crowdsourcing
    – E.g. NLP algorithm for a particularly challenging scenario
    – Months, up to thousands of dollars
    – Final evaluation and winner selection
    Citizen Science
    – E.g. classify galaxies in pictures
    – Seconds/minutes, no money
    – Open to everyone

  14. An interlinking microtask

  15. An interlinking microtask

  16. An interlinking microtask

  17. Approach
    Input: datasets D1 and D2, and candidate links cl1, cl2, …, cln, each a triple (s, p, o)
    – Parse RDF links (query D1, D2)
    – Generate and publish microtasks
    – Collect responses
    – Generate RDF file with final links
    Analyse crowd workers
    Collect crowd responses for the candidate links to be processed (cl5, …, cln)
    Aggregated response
    Output: crowd interlinking
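
The "generate microtasks" step can be pictured as batching candidate links into small units of work. The following is a sketch under stated assumptions: the `Microtask` structure and the batch size of three links per task are hypothetical and do not describe the actual software presented in the talk.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]  # a candidate link (s, p, o)


@dataclass
class Microtask:
    task_id: int
    questions: List[Triple]  # candidate links shown to one worker in one task


def build_microtasks(candidates: List[Triple], links_per_task: int = 3) -> List[Microtask]:
    """Split the candidate links into microtasks of at most links_per_task questions."""
    if links_per_task < 1:
        raise ValueError("links_per_task must be at least 1")
    return [
        Microtask(task_id=start // links_per_task,
                  questions=candidates[start:start + links_per_task])
        for start in range(0, len(candidates), links_per_task)
    ]


# Seven hypothetical candidate links cl0 .. cl6.
candidates = [(f"d1:s{i}", "owl:sameAs", f"d2:o{i}") for i in range(7)]
tasks = build_microtasks(candidates)
```

Keeping each task small is what makes it a microtask: a worker can answer it in seconds, and many workers can process different batches in parallel.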

  18. Approach (II)
    • Analyse crowd workers to filter out people
    – With bad intentions (i.e. scammers)
    – Who do not have enough knowledge
    • Select representative links for which the answer is known (ground truth) and assess people → domain expert useful
    Select different matching cases:
    x:b rdfs:label “Berlin” ; rdf:type o:City .
    x:b2 rdfs:label “Berlinale” ; rdf:type o:Event .
    x:b rdfs:label “Córdoba” ; rdf:type o:City .
    x:b2 rdfs:label “Córdoba” ; rdf:type o:City .
    Measure difficulty based on data heuristics:
    x:b rdfs:label “Córdoba” ; rdf:type o:City ; wgs84:lat -31.400 .
    x:b2 rdf:type o:City ; wgs84:lat 37.883 .
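
Assessing workers against ground-truth links can be sketched as an accuracy score per worker followed by a cut-off. The data layout, the function names and the 0.7 cut-off below are assumptions made for illustration, not details given in the talk.

```python
def worker_accuracy(answers, gold):
    """Share of ground-truth links each worker judged correctly.

    answers: {worker_id: {link_id: bool}}, gold: {link_id: bool}
    Workers who saw no ground-truth link get no score.
    """
    scores = {}
    for worker, given in answers.items():
        judged = [link_id for link_id in given if link_id in gold]
        if not judged:
            continue
        correct = sum(given[link_id] == gold[link_id] for link_id in judged)
        scores[worker] = correct / len(judged)
    return scores


def trusted_workers(answers, gold, min_accuracy=0.7):
    """Workers whose accuracy on the ground truth reaches the cut-off."""
    scores = worker_accuracy(answers, gold)
    return sorted(w for w, acc in scores.items() if acc >= min_accuracy)


# Hypothetical ground truth and answers; w2 answers "yes" to everything.
gold = {"g1": True, "g2": False, "g3": False}
answers = {
    "w1": {"g1": True, "g2": False, "g3": False, "cl7": True},
    "w2": {"g1": True, "g2": True, "g3": True, "cl7": True},
}
trusted = trusted_workers(answers, gold)
```

Mixing positive and negative ground-truth cases matters here: a worker who always answers "yes" would pass a test made only of correct links.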

  19. Approach (II)
    Same content as slide 18, with one addition: Two-way feedback
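
The Córdoba example (same label and type, latitudes -31.400 and 37.883) suggests one possible data heuristic for difficulty: a pair is hard when the surface description agrees but other data conflict. The sketch below is only an assumed heuristic with hypothetical field names and a hypothetical tolerance; the talk does not specify how difficulty is measured.

```python
def difficulty(a: dict, b: dict, lat_tolerance: float = 0.5) -> str:
    """Rough difficulty of deciding whether records a and b describe the same thing."""
    same_label = a.get("label") is not None and a.get("label") == b.get("label")
    same_type = a.get("type") is not None and a.get("type") == b.get("type")
    lat_a, lat_b = a.get("lat"), b.get("lat")
    coords_conflict = (
        lat_a is not None and lat_b is not None and abs(lat_a - lat_b) > lat_tolerance
    )
    if same_label and same_type and coords_conflict:
        return "hard"    # looks identical on the surface, but the data disagree
    if same_label and same_type:
        return "easy"    # clear match
    if not same_label and not same_type:
        return "easy"    # clear non-match
    return "medium"      # partial agreement


# Records modelled on the slide; the label on the second Córdoba is an assumption.
cordoba_1 = {"label": "Córdoba", "type": "o:City", "lat": -31.400}
cordoba_2 = {"label": "Córdoba", "type": "o:City", "lat": 37.883}
berlin = {"label": "Berlin", "type": "o:City"}
berlinale = {"label": "Berlinale", "type": "o:Event"}
```

Here `difficulty(cordoba_1, cordoba_2)` returns `"hard"` and `difficulty(berlin, berlinale)` returns `"easy"`, so a ground-truth set could be chosen to cover both kinds of cases.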

  20. Approach
    Same pipeline as slide 17, with three added annotations: agreement, #workers per link, Context information
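
The "agreement" and "#workers per link" settings can be illustrated with a simple majority vote. This is a minimal sketch: the function name, the default of three workers per link and the 0.6 agreement threshold are assumptions, and the talk does not state which aggregation method its software uses.

```python
from collections import Counter


def aggregate(responses, min_workers=3, min_agreement=0.6):
    """Majority vote per candidate link.

    responses: {link_id: [answer, ...]} with one answer per worker.
    Returns {link_id: winning answer, or None when the link is undecided}.
    """
    decisions = {}
    for link_id, votes in responses.items():
        if len(votes) < min_workers:
            decisions[link_id] = None  # not enough workers yet
            continue
        answer, count = Counter(votes).most_common(1)[0]
        agreed = count / len(votes) >= min_agreement
        decisions[link_id] = answer if agreed else None
    return decisions


# Hypothetical responses for three candidate links.
responses = {
    "cl1": ["yes", "yes", "no"],
    "cl2": ["yes", "no"],
    "cl3": ["yes", "no", "no", "yes"],
}
final = aggregate(responses)
```

Undecided links (`None`) could be sent to more workers or passed to a librarian for review, which keeps expert effort focused on the contested cases.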

  21. Approach (II)
    Manual interlinking (D1, D2)
    HCOMP interlinking (D1, D2): Guide, Review, Algorithm

  22. Use cases

  23. Mapping vocabularies
    Run an automatic ontology alignment tool and post-process the results with the crowd
    Context information pre-configured
    See also: [Sarasua et al., 2012]

  24. Discovering links between instances
    a) To extract the patterns of the linkage rules (i.e. labelling)
    b) To post-process irregular multilingual values, different name versions
    c) To automatically identify patterns of errors in a resulting set of links, which may afterwards be reviewed by the experts

  25. Curating mapping extensions to authority files
    Quality control can be done by giving these answers to other crowd workers
    Checking usefulness of links with library users
    • There are different possible targets for the interlinking of a dataset: which possibility to select for the Web portal?
    • Embed Web site in a microtask and ask for specific information or observe next Web site opened

  26. 3 Challenges

  27. # Deciding whether to crowdsource or not
    • Depends to a large extent on the data
    – Specific domains require more crowd management effort
    – Benefit compared to automatically generated links may vary
    – Availability of workers may change over time
    • What should be processed by the crowd
    – Criteria for selecting subsets of the data (e.g. confidence of machine)
    Libraries and the cultural heritage domain have high potential (multilinguality, different naming conventions, knowledge exploration)
    > Trial, error and assessment
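
One way to select the subset of data for the crowd is to use the confidence of the automatic tool. The sketch below assumes each candidate link comes with a confidence score between 0 and 1; the two cut-offs and the function name are hypothetical and would need the "trial, error and assessment" mentioned on the slide.

```python
def route(candidates, accept_above=0.9, reject_below=0.3):
    """Split (link, confidence) pairs into auto-accepted, crowd and auto-rejected links."""
    auto_accept, crowd, auto_reject = [], [], []
    for link, confidence in candidates:
        if confidence >= accept_above:
            auto_accept.append(link)   # machine is confident: keep without review
        elif confidence < reject_below:
            auto_reject.append(link)   # machine is confident it is not a link
        else:
            crowd.append(link)         # uncertain band: ask the crowd
    return auto_accept, crowd, auto_reject


# Hypothetical candidate links with machine confidence scores.
scored = [("cl1", 0.97), ("cl2", 0.55), ("cl3", 0.10), ("cl4", 0.90), ("cl5", 0.30)]
accepted, for_crowd, rejected = route(scored)
```

Narrowing the uncertain band lowers crowdsourcing cost but leaves more machine errors unchecked, so the cut-offs are a cost versus quality decision.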

  28. # Building a loyal workforce
    • Attracting good crowd workers
    – Microtasks are constantly being published
    – Higher reward may also attract more malicious workers
    • Working with people repeatedly is not supported by the majority of crowdsourcing platforms
    • How to keep crowd workers working on these microtasks without them getting demotivated?
    > Be fair (see also Guidelines on Crowd Work for Academic Researchers, 2014)
    > Listen to crowd workers (e.g. direct comments, Twitter, ratings, monitoring online discussions)
    > Recognize their work
    > Be aware that gamification is not always the best solution
    “It's really easy to change people's motivations, [at Zooniverse] we find people are motivated by wanting to contribute, they want a sense that this is something real. And in adding game-like elements you can destroy that quite quickly” Chris Lintott, Zooniverse
    http://www.wired.co.uk/news/archive/2013-09/12/fraxinus-gamifying-science/viewgallery/307960

  29. # Working with unknown humans
    • Open call can be a problem and an opportunity at the same time: people have diverse
    – Motivation and dedication
    – Context and profile
    – Background knowledge
    • Crowdsourcing platforms have limited support for personalisation
    • Working with suitable crowd
    – Identify what they can do best
    ▪ Type of task / data level
    ▪ Competences vs experience, cross-platform analysis
    – Assign work accordingly
    ▪ Weight vs reject
    > Towards a Crowd Work CV
    See also: [Sarasua et al., 2014]
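
"Weight vs reject" can be read as keeping every worker's answer but letting it count in proportion to how reliable the worker has been. The sketch below assumes a per-worker accuracy estimate is available (for example from ground-truth questions); the function name and the default weight of 0.5 for unknown workers are hypothetical.

```python
def weighted_decision(votes, accuracy, default_weight=0.5):
    """Accept a link when the accuracy-weighted "yes" votes outweigh the "no" votes.

    votes: {worker_id: bool}, accuracy: {worker_id: float between 0 and 1}
    """
    yes = sum(accuracy.get(w, default_weight) for w, vote in votes.items() if vote)
    no = sum(accuracy.get(w, default_weight) for w, vote in votes.items() if not vote)
    return yes > no


# Hypothetical case: one reliable worker disagrees with two unreliable ones.
votes = {"w1": True, "w2": False, "w3": False}
accuracy = {"w1": 0.95, "w2": 0.4, "w3": 0.4}
decision = weighted_decision(votes, accuracy)
```

In this example the weighted decision is `True` although a plain majority vote would say "no". Rejecting `w2` and `w3` outright would discard their answers entirely, whereas weighting still lets them tip close cases.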

  30. Plea to this community
    • Interlinking is much more than deduplication; consider also using other relations
    • Consider connecting library datasets to different complementary domains
    • Interlinking to non-editorial data can also be enriching
    • The more datasets you connect, the better
    • Document your interlinking in the VoID description of your dataset
    • Query and make use of available links
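
Documenting interlinking in VoID usually means describing a linkset: which two datasets it connects, with which predicate, and how many links it holds. The sketch below generates such a description in Turtle using VoID's `void:Linkset`, `void:subjectsTarget`, `void:objectsTarget`, `void:linkPredicate` and `void:triples` terms; the helper function, the example.org URIs and the link count are hypothetical.

```python
def void_linkset(linkset_uri, source_dataset, target_dataset, link_predicate, triple_count):
    """Return a Turtle description of a VoID linkset as a string."""
    return "\n".join([
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "",
        f"<{linkset_uri}> a void:Linkset ;",
        f"    void:subjectsTarget <{source_dataset}> ;",
        f"    void:objectsTarget <{target_dataset}> ;",
        f"    void:linkPredicate {link_predicate} ;",
        f"    void:triples {triple_count} .",
    ])


# Hypothetical URIs and count, for illustration only.
description = void_linkset(
    "http://example.org/void/library-to-dbpedia",
    "http://example.org/void/library",
    "http://example.org/void/dbpedia",
    "owl:sameAs",
    120,
)
```

Publishing one linkset description per target dataset and predicate lets others discover which links exist before querying them.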

  31. If you need humans to process data while interlinking datasets, consider crowd intervention because it can be very valuable for enhancing your results.

  32. Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
    Thank you for your attention!
    Cristina Sarasua
    Institute for Web Science and Technologies
    Universität Koblenz-Landau
    [email protected]
    http://de.slideshare.net/cristinasarasua
    https://github.com/criscod

  33. References
    • Sarasua, C., Simperl, E., Noy, N.F.: CrowdMAP: Crowdsourcing ontology alignment with microtasks. In: Proceedings of the 11th International Semantic Web Conference (ISWC). (2012)
    • Sarasua, C., Thimm, M.: Crowd Work CV: Recognition for Micro Work. In: SoHuman workshop, co-located with Social Informatics (SocInfo). (2014)
    • Guidelines on Crowd Work for Academic Researchers (2014). http://wiki.wearedynamo.org/index.php/Guidelines_for_Academic_Requesters