
Supporting Data Interlinking in Semantic Libraries with Microtask Crowdsourcing

SWIB14
December 03, 2014

Presenter: Cristina Sarasua (Institute for Web Science and Technologies (WeST), University of Koblenz-Landau, Germany)

Abstract:
Semantic Web technologies enable the integration of distributed data sets curated by different organisations and for different purposes. Descriptions of particular resources (e.g. events, persons or images) are connected through links that explicitly state the relationship between them. By connecting data from similar or disparate domains, libraries can offer more extensive and detailed information to their visitors, while librarians gain better documentation for their cataloguing activities. Despite the advances in data interlinking technology, human intervention remains a core aspect of the process. Humans, in particular librarians, are crucial both as knowledge providers and as reviewers of the automatically computed links. One of the problems that arises in this scenario is that libraries might have limited human resources dedicated to authority control, so running the time-consuming interlinking process over external data sets becomes troublesome. Microtask crowdsourcing provides an economic and scalable way to involve humans systematically in data processing. The goal of this talk is to introduce the process of crowdsourced data interlinking in semantic libraries, a paid, crowd-powered approach that can support librarians in the interlinking task. Several use cases illustrate how our software, which implements the crowdsourced data interlinking process, could reduce the amount of information that librarians need to process when enriching their data with other sources, or provide a different perspective from potential users. In addition, challenges that become relevant when adopting this approach are listed.

Transcript

  1. Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
     Supporting Data Interlinking in Semantic Libraries with Microtask Crowdsourcing
     Cristina Sarasua · SWIB 2014, Bonn
  2. Please share your thoughts on interlinking! https://etherpad.mozilla.org/4IfZDaTBIe
  3. Interlinking on the Web of Data
     Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/
  4. Cross-dataset links
     (a, r, b) with a in D1, b in D2
     Instance-level links between D1 and D2:
       d1:timbl owl:sameAs d2:timbernerslee .
       d1:donostia owl:sameAs d2:sansebastian .
       d1:bjork dc:creator d2:volta .
       d1:Bonn wgs84:location d2:Germany .
       d1:work2012 o:inspiredBy d2:song1900 .
     Schema-level links between the vocabularies of D1 and D2:
       o1:Conference owl:equivalentClass o2:Congress .
       o1:Democracy skos:related o2:Government .
       o1:Publication skos:broader o2:JournalArticle .
       o1:ImpressionistPainting rdfs:subClassOf o2:Painting .
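     A minimal sketch (not part of the deck) of how such instance-level links could be materialised as an RDF link set with Python and rdflib; the d1/d2 namespaces are illustrative placeholders, not real datasets:

       # Sketch: materialise the instance-level links shown above as an RDF link set.
       # The d1/d2 namespaces are placeholders standing in for the two datasets.
       from rdflib import Graph, Namespace
       from rdflib.namespace import DC, OWL

       D1 = Namespace("http://example.org/d1/")
       D2 = Namespace("http://example.org/d2/")

       links = Graph()
       links.bind("owl", OWL)
       links.bind("dc", DC)
       links.add((D1.timbl, OWL.sameAs, D2.timbernerslee))
       links.add((D1.donostia, OWL.sameAs, D2.sansebastian))
       links.add((D1.bjork, DC.creator, D2.volta))

       # Serialise the link set so it can be published alongside D1 and D2.
       print(links.serialize(format="turtle"))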
  5. Why is interlinking important?
     - Enhance the description of local entities
     - Richer queries over aggregated data
     - Cross-dataset browsing
     What is known about Berlin?
       x:berlin owl:sameAs dbpedia:Berlin , tour:berlin .
       x:berlin o:homeOf authors:berlin .
       x:img09112014 lode:atPlace geo:brandtor .
     SELECT ?city1 WHERE {
       ?city1 gov:population ?pop .
       ?city1 owl:sameAs ?city2 .
       ?city2 unesco:count ?mon .
       FILTER (?pop > 1000000 && ?mon > 50) }
     http://www.w3.org/2005/Incubator/lld/XGR-lld-20111025/
  6. Generating links
     Identify the resources of D1 and D2 to be connected with relation R.
     Comparison criteria; decision boundary between link and non-link.
     Picture: https://www.assembla.com/spaces/silk/wiki/Managing_Reference_Links
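     The deck does not prescribe a specific tool (the picture is from Silk). Purely as an illustration, a toy linkage rule with one string-similarity comparison criterion and a fixed decision boundary could look like this; the threshold and the example labels are assumptions:

       # Toy linkage rule: compare labels of resources from D1 and D2 and accept a
       # candidate link when the similarity exceeds a decision boundary.
       # Illustrative only; real tools (e.g. Silk, LIMES) offer richer criteria.
       from difflib import SequenceMatcher

       DECISION_BOUNDARY = 0.85  # assumed boundary between link and non-link

       def similarity(label_a, label_b):
           return SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()

       def candidate_links(d1_labels, d2_labels):
           """Yield (resource_d1, resource_d2, score) pairs above the boundary."""
           for r1, l1 in d1_labels.items():
               for r2, l2 in d2_labels.items():
                   score = similarity(l1, l2)
                   if score >= DECISION_BOUNDARY:
                       yield r1, r2, score

       # Made-up example: one pair lies above the boundary, the other below it.
       d1 = {"d1:timbl": "Tim Berners-Lee", "d1:donostia": "Donostia"}
       d2 = {"d2:timbernerslee": "Tim Berners Lee", "d2:sansebastian": "Donostia-San Sebastián"}
       for link in candidate_links(d1, d2):
           print(link)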
  7. He is already busy
     Attribution: Thomas Leu
  8. He is already busy … but still would like correct and useful links
     Attribution: Thomas Leu
  9. Crowdsourcing
     "Crowdsourcing represents the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call" (Jeff Howe, 2006)
     Fast and scalable. Four flavours:
     - Microtask crowdsourcing: e.g. tweet sentiment analysis; seconds per task, reward of a few cents; crowd workers register with a simple profile, limited filtering
     - Macrotask crowdsourcing: e.g. writing an e-book; months, around $30 per hour / hundreds or thousands of dollars; freelancer recruitment, interviews
     - Contest-based crowdsourcing: e.g. an NLP algorithm for a particularly challenging scenario; months, up to thousands of dollars; final evaluation and winner selection
     - Citizen science: e.g. classifying galaxies in pictures; seconds/minutes per task, no money; open to everyone
  10. Approach
      Input: candidate links cl1 … cln, each a triple (s, p, o) connecting D1 and D2. Crowd interlinking proceeds in four steps (a skeleton sketch follows below):
      1. Parse the RDF links (query D1 and D2)
      2. Generate and publish microtasks
      3. Collect responses: analyse the crowd workers, collect crowd responses for the candidate links to be processed, and aggregate the responses
      4. Generate the RDF file with the final links
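      The deck does not show the implementation of this pipeline. A skeleton of the four steps, under assumed function names and with the crowdsourcing-platform client deliberately left abstract, could be organised like this:

        # Skeleton of the four-step crowd interlinking pipeline sketched on the slide.
        # Function names and signatures are assumptions; the deck shows no code.
        from rdflib import Graph

        def parse_candidate_links(path):
            """Step 1: parse the RDF file with candidate links cl1..cln as (s, p, o) triples."""
            g = Graph()
            g.parse(path, format="turtle")
            return list(g)

        def generate_microtasks(candidate_links):
            """Step 2: turn each candidate link into a microtask, to be enriched with
            context information queried from D1 and D2 (e.g. labels, types)."""
            return [{"subject": str(s), "predicate": str(p), "object": str(o)}
                    for s, p, o in candidate_links]

        def collect_responses(microtasks):
            """Step 3: publish the microtasks, analyse the crowd workers and collect
            their responses (platform-specific, omitted here)."""
            raise NotImplementedError

        def write_final_links(accepted_triples, out_path):
            """Step 4: generate the RDF file with the final, crowd-confirmed links."""
            g = Graph()
            for triple in accepted_triples:
                g.add(triple)
            g.serialize(destination=out_path, format="turtle")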
  11. Approach (II)
      - Analyse crowd workers to filter out people
        – with bad intentions (i.e. scammers)
        – who do not have enough knowledge
      - Select representative links for which the answer is known (ground truth) and assess people against them → a domain expert is useful here (see the sketch below)
      - Select different matching cases, e.g.:
          x:b rdfs:label "Berlin" ; rdf:type o:City .        x:b rdfs:label "Córdoba" ; rdf:type o:City .
          x:b2 rdfs:label "Berlinale" ; rdf:type o:Event .   x:b2 rdfs:label "Córdoba" ; rdf:type o:City .
      - Measure difficulty based on data heuristics, e.g.:
          x:b rdfs:label "Córdoba" ; rdf:type o:City ; wgs84:lat -31.400 .
          x:b2 rdf:type o:City ; wgs84:lat 37.883 .
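      A sketch of assessing workers against gold links with known answers; the accuracy threshold, the link identifiers and the response layout are assumptions for illustration, not the software's actual data model:

        # Sketch: filter out crowd workers by assessing them on gold candidate links
        # whose correct answer is known. Threshold and data layout are assumptions.
        GOLD_ANSWERS = {                      # candidate link id -> correct judgement
            "cl_berlin_berlin": True,         # same city, the link holds
            "cl_berlin_cordoba": False,       # different cities
            "cl_berlinale_cordoba": False,    # an event vs. a city
        }
        MIN_ACCURACY = 0.7                    # assumed acceptance threshold

        def trusted_workers(responses):
            """responses maps a worker id to a dict {candidate link id: judgement}."""
            trusted = set()
            for worker, answers in responses.items():
                gold = [(link, ans) for link, ans in answers.items() if link in GOLD_ANSWERS]
                if not gold:
                    continue  # no gold links answered yet, the worker cannot be assessed
                correct = sum(1 for link, ans in gold if ans == GOLD_ANSWERS[link])
                if correct / len(gold) >= MIN_ACCURACY:
                    trusted.add(worker)
            return trusted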
  12. Approach (II), continued: the same slide as above, with the addition: Two-way feedback
  13. Approach: the same four-step pipeline as in slide 10, now annotated with: context information, agreement, and #workers per link (an aggregation sketch follows below)
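      How aggregation could work, as a sketch: a majority vote with an agreement threshold. The deck only names "agreement" and "#workers per link"; the concrete parameter values below are assumptions.

        # Sketch: aggregate crowd judgements per candidate link by majority vote.
        # MIN_WORKERS and MIN_AGREEMENT are illustrative values, not from the deck.
        from collections import Counter

        MIN_WORKERS = 3      # minimum number of workers per link
        MIN_AGREEMENT = 0.6  # minimum share of workers agreeing on the winning answer

        def aggregate(judgements_per_link):
            """judgements_per_link maps a candidate link id to a list of True/False
            judgements. Returns link id -> accepted answer, or None if undecided."""
            result = {}
            for link, judgements in judgements_per_link.items():
                if len(judgements) < MIN_WORKERS:
                    result[link] = None  # not enough responses collected yet
                    continue
                answer, votes = Counter(judgements).most_common(1)[0]
                result[link] = answer if votes / len(judgements) >= MIN_AGREEMENT else None
            return result

        print(aggregate({"cl1": [True, True, False], "cl2": [True, False]}))
        # -> {'cl1': True, 'cl2': None}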
  14. Approach (II)
      Manual interlinking between D1 and D2 vs. HCOMP (human computation) interlinking between D1 and D2, where the crowd can guide and review the algorithm.
  15. Mapping vocabularies
      Run an automatic ontology alignment tool and post-process the results with the crowd; context information is pre-configured.
      See also: [Sarasua et al., 2012]
  16. Discovering links between instances
      The crowd can help:
      a) to extract the patterns of the linkage rules (i.e. labelling)
      b) to post-process irregular multilingual values and different name versions
      c) to automatically identify patterns of errors in a resulting set of links, which may afterwards be reviewed by the experts
  17. Curating mapping extensions to authority files / Checking usefulness of links with library users
      - There are different possible targets for interlinking a dataset: which one should be selected for the Web portal?
      - Embed a Web site in a microtask and ask for specific information, or observe which Web site is opened next.
      - Quality control can be done by giving these answers to other crowd workers.
  18. # Deciding whether to crowdsource or not
      - It depends to a large extent on the data
        – specific domains require more crowd-management effort
        – the benefit compared to automatically generated links may vary
        – the availability of workers may change over time
      - What should be processed by the crowd?
        – criteria for selecting subsets of the data (e.g. the confidence of the automatic matcher; see the sketch below)
      Libraries and the cultural heritage domain have high potential (multilinguality, different naming conventions, knowledge exploration).
      > Trial, error and assessment
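      One possible selection criterion, sketched under the assumption that the automatic matcher returns a confidence score per candidate link; the threshold values are invented for illustration:

        # Sketch: decide per candidate link whether to crowdsource it, based on the
        # confidence of the automatic matcher. Threshold values are illustrative.
        ACCEPT_ABOVE = 0.9   # confident enough to accept automatically
        REJECT_BELOW = 0.3   # confident enough to discard automatically

        def route(candidate_links):
            """candidate_links: iterable of (link, confidence). Returns three lists."""
            accepted, rejected, to_crowd = [], [], []
            for link, confidence in candidate_links:
                if confidence >= ACCEPT_ABOVE:
                    accepted.append(link)
                elif confidence < REJECT_BELOW:
                    rejected.append(link)
                else:
                    to_crowd.append(link)  # the uncertain middle band goes to the crowd
            return accepted, rejected, to_crowd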
  19. # Building a loyal workforce
      - Attracting good crowd workers
        – microtasks are constantly being published
        – a higher reward may also attract more malicious workers
      - Working with people repeatedly is not supported by the majority of crowdsourcing platforms
      - How can crowd workers be kept working on these microtasks without getting demotivated?
      > Be fair (see also Guidelines on Crowd Work for Academic Researchers, 2014)
      > Listen to crowd workers (e.g. direct comments, Twitter, ratings; monitor online discussions)
      > Recognize their work
      > Be aware that gamification is not always the best solution
      "It's really easy to change people's motivations. [At Zooniverse] we find people are motivated by wanting to contribute, they want a sense that this is something real. And in adding game-like elements you can destroy that quite quickly." (Chris Lintott, Zooniverse)
      http://www.wired.co.uk/news/archive/2013-09/12/fraxinus-gamifying-science/viewgallery/307960
  20. # Working with unknown humans
      - The open call can be a problem and an opportunity at the same time: people have diverse
        – motivation and dedication
        – context and profile
        – background knowledge
      - Crowdsourcing platforms have limited support for personalisation
      - Working with a suitable crowd
        – identify what they can do best
          ▪ type of task / data level
          ▪ competences vs. experience, cross-platform analysis
        – assign work accordingly
          ▪ weight vs. reject
      > Towards a Crowd Work CV
      See also: [Sarasua et al., 2014]
  21. Plea to this community
      - Interlinking is much more than deduplication; consider using other relations as well
      - Consider connecting library datasets to different, complementary domains
      - Interlinking to non-editorial data can also be enriching
      - The more datasets you connect, the better
      - Document your interlinking in the VoID description of your dataset (a sketch follows below)
      - Query and make use of the available links
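      A sketch, using rdflib, of how an interlinking result could be documented as a void:Linkset in a dataset's VoID description; the dataset URIs, the linkset name and the triple count are placeholders:

        # Sketch: document a set of owl:sameAs links as a void:Linkset.
        # Dataset URIs and the number of triples are placeholders.
        from rdflib import Graph, Literal, Namespace, URIRef
        from rdflib.namespace import OWL, RDF, XSD

        VOID = Namespace("http://rdfs.org/ns/void#")
        EX = Namespace("http://example.org/void/")

        g = Graph()
        g.bind("void", VOID)
        linkset = EX.libraryToDBpediaLinks
        g.add((linkset, RDF.type, VOID.Linkset))
        g.add((linkset, VOID.subjectsTarget, URIRef("http://example.org/dataset/library")))
        g.add((linkset, VOID.objectsTarget, URIRef("http://dbpedia.org/void/Dataset")))
        g.add((linkset, VOID.linkPredicate, OWL.sameAs))
        g.add((linkset, VOID.triples, Literal(1200, datatype=XSD.integer)))

        print(g.serialize(format="turtle"))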
  22. If you need humans to process data while interlinking datasets, consider crowd intervention: it can be very valuable for enhancing your results.
  23. Thank you for your attention!
      Cristina Sarasua
      Institute for Web Science and Technologies · Universität Koblenz-Landau
      [email protected]
      http://de.slideshare.net/cristinasarasua
      https://github.com/criscod
  24. References
      - Sarasua, C., Simperl, E., Noy, N. F.: CrowdMAP: Crowdsourcing Ontology Alignment with Microtasks. In: Proceedings of the 11th International Semantic Web Conference (ISWC), 2012.
      - Sarasua, C., Thimm, M.: Crowd Work CV: Recognition for Micro Work. In: SoHuman Workshop, co-located with Social Informatics (SocInfo), 2014.
      - Guidelines on Crowd Work for Academic Researchers (2014). http://wiki.wearedynamo.org/index.php/Guidelines_for_Academic_Requesters