Slide 1

Slide 1 text

Institute for Web Science and Technologies · University of Koblenz-Landau, Germany Supporting Data Interlinking in Semantic Libraries with Microtask Crowdsourcing Cristina Sarasua SWIB 2014, Bonn

Slide 2

Slide 2 text


Slide 3

Slide 3 text

[Diagram: two resources, a and b, connected by a relation. Labels: MARC 21, FRBR, EDM]

Slide 4

Slide 4 text

[Diagram: two resources, a and b, connected by a relation. Labels: MARC 21, FRBR, EDM]

Slide 5

Slide 5 text

Please share your thoughts on interlinking! https://etherpad.mozilla.org/4IfZDaTBIe

Slide 6

Slide 6 text

Interlinking on the Web of Data
Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/

Slide 7

Slide 7 text

Cross-dataset links
(a,r,b) | a in D1, b in D2

d1:timbl owl:sameAs d2:timbernerslee;
d1:donostia owl:sameAs d2:sansebastian;
d1:bjork dc:creator d2:volta;
d1:Bonn wgs84:location d2:Germany;
d1:work2012 o:inspiredBy d2:song1900;

o1:Conference owl:equivalentClass o2:Congress;
o1:Democracy skos:related o2:Government;
o1:Publication skos:broader o2:JournalArticle;
o1:ImpressionistPainting rdfs:subClassOf o2:Painting;

Slide 8

Slide 8 text

Why is interlinking important?
• Enhance the description of local entities
• Richer queries over aggregated data
• Cross-data set browsing

What is known about Berlin?
x:berlin owl:sameAs dbpedia:Berlin, tour:berlin;
x:berlin o:homeOf authors:berlin;
x:img09112014 lode:atPlace geo:brandtor;

SELECT ?city1 WHERE { ?city1 gov:population ?pop . ?city1 owl:sameAs ?city2 . ?city2 unesco:count ?mon . FILTER (?pop > 1000000 && ?mon > 50) }

http://www.w3.org/2005/Incubator/lld/XGR-lld-20111025/

Slide 9

Slide 9 text

Generating links
• Identify the resources in D1 and D2 to be connected with relation R
• Comparison criteria
• Decision boundary between link and non-link
Picture: https://www.assembla.com/spaces/silk/wiki/Managing_Reference_Links
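
The idea of a comparison criterion plus a decision boundary can be illustrated with a minimal Python sketch. It is not the method of any particular link discovery tool: the label-based criterion, the `generate_links` helper, the sample identifiers and the threshold of 0.85 are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def label_similarity(a: str, b: str) -> float:
    """Comparison criterion (assumed): case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def generate_links(d1, d2, relation="owl:sameAs", threshold=0.85):
    """Return (a, relation, b) for every pair at or above the decision boundary."""
    links = []
    for id1, label1 in d1.items():
        for id2, label2 in d2.items():
            if label_similarity(label1, label2) >= threshold:
                links.append((id1, relation, id2))
    return links


# Hypothetical toy datasets: resource identifier -> label
d1 = {"d1:berlin": "Berlin", "d1:cordoba": "Córdoba"}
d2 = {"d2:berlinale": "Berlinale", "d2:cordoba": "Cordoba"}

links = generate_links(d1, d2)
print(links)
```

With this threshold, "Córdoba" and "Cordoba" fall on the link side of the boundary, while "Berlin" and "Berlinale" do not. Pairs close to the boundary are the natural candidates for human review.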

Slide 10

Slide 10 text

He is already busy
Attribution: Thomas Leu

Slide 11

Slide 11 text

He is already busy … but still would like correct and useful links
Attribution: Thomas Leu

Slide 12

Slide 12 text

Crowdsourced Interlinking

Slide 13

Slide 13 text

Crowdsourcing
“Crowdsourcing represents the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call” Jeff Howe, 2006

Fast, scalable

Microtask crowdsourcing
- E.g. tweet sentiment analysis
- Seconds, reward in cents
- Crowd workers register with a simple profile, limited filtering

Macrotask crowdsourcing
- E.g. writing an e-book
- Months, $30 per hour / hundreds or thousands of dollars
- Freelancer recruitment, interviews

Contest-based crowdsourcing
- E.g. NLP algorithm for a particularly challenging scenario
- Months, up to thousands of dollars
- Final evaluation and winner selection

Citizen Science
- E.g. classify galaxies in pictures
- Seconds/minutes, no money
- Open to everyone

Slide 14

Slide 14 text

An interlinking microtask

Slide 15

Slide 15 text

An interlinking microtask

Slide 16

Slide 16 text

An interlinking microtask

Slide 17

Slide 17 text

Approach
[Diagram: pipeline with four numbered stages, from datasets D1 and D2 to the final links]
Candidate links: cl1: (s,p,o), cl2: (s,p,o), …, cln: (s,p,o)
Stages: analyse crowd workers; collect crowd responses for the candidate links to be processed; aggregated response
Steps: query D1, D2; parse RDF links; generate and publish microtasks; collect responses; generate RDF file with final links
Crowd interlinking: cl5: (s,p,o), …, cln: (s,p,o)
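
The pipeline on this slide could be sketched roughly as follows. This is an illustrative outline only, not the actual tool: the function names, the example.org IRIs, the hard-coded crowd responses and the simple majority vote are assumptions, and publishing to a real crowdsourcing platform is left out.

```python
from collections import Counter


def build_microtasks(candidate_links):
    """Turn each candidate link (s, p, o) into a yes/no microtask."""
    return {
        task_id: {
            "link": link,
            "question": f"Does {link[1]} hold between {link[0]} and {link[2]}?",
        }
        for task_id, link in enumerate(candidate_links)
    }


def aggregate(responses):
    """Majority vote per task (an assumed aggregation rule); ties are rejected."""
    accepted = []
    for task_id, answers in sorted(responses.items()):
        votes = Counter(answers)
        if votes[True] > votes[False]:
            accepted.append(task_id)
    return accepted


def to_ntriples(links):
    """Serialise the accepted links as N-Triples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in links)


SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# Hypothetical candidate links between two datasets
candidates = [
    ("http://example.org/d1/timbl", SAME_AS, "http://example.org/d2/timbernerslee"),
    ("http://example.org/d1/bonn", SAME_AS, "http://example.org/d2/berlin"),
]

tasks = build_microtasks(candidates)
# Stand-in for the answers collected from three crowd workers per task
responses = {0: [True, True, False], 1: [False, False, True]}
final_links = [tasks[task_id]["link"] for task_id in aggregate(responses)]
rdf_output = to_ntriples(final_links)
print(rdf_output)
```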

Slide 18

Slide 18 text

Approach (II)
• Analyse crowd workers to filter out people
 – With bad intentions (i.e. scammers)
 – Who do not have enough knowledge
• Select representative links for which the answer is known (ground truth) and assess people → domain expert useful

Select different matching cases
x:b rdfs:label “Berlin”; rdf:type o:City;
x:b2 rdfs:label “Berlinale”; rdf:type o:Event;
x:b rdfs:label “Córdoba”; rdf:type o:City;
x:b2 rdfs:label “Córdoba”; rdf:type o:City;

Measure difficulty based on data heuristics
x:b rdfs:label “Córdoba”; rdf:type o:City; wgs84:lat -31.400;
x:b2 rdf:type o:City; wgs84:lat 37.883;
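
A minimal sketch of both ideas on this slide, under stated assumptions: workers are scored against ground-truth links and kept only above a hypothetical accuracy of 0.75, and difficulty is a made-up heuristic that counts missing labels and distant latitudes. The function names, thresholds and data layout are illustrative, not taken from the talk.

```python
def worker_accuracy(answers, gold):
    """Fraction of ground-truth links that the worker judged correctly."""
    judged = [link for link in answers if link in gold]
    if not judged:
        return 0.0  # no evidence about this worker yet
    return sum(answers[link] == gold[link] for link in judged) / len(judged)


def trusted_workers(all_answers, gold, min_accuracy=0.75):
    """Keep only workers whose accuracy on the ground truth is high enough."""
    return {
        worker
        for worker, answers in all_answers.items()
        if worker_accuracy(answers, gold) >= min_accuracy
    }


def difficulty(a, b, max_lat_diff=1.0):
    """Assumed heuristic: one point per missing label, one for distant latitudes."""
    score = 0
    if a.get("label") is None or b.get("label") is None:
        score += 1
    if "lat" in a and "lat" in b and abs(a["lat"] - b["lat"]) > max_lat_diff:
        score += 1
    return score


# Ground truth for two candidate links; cl3 has no known answer
gold = {"cl1": True, "cl2": False}
all_answers = {
    "w1": {"cl1": True, "cl2": False, "cl3": True},
    "w2": {"cl1": True, "cl2": True, "cl3": False},
    "w3": {"cl3": True},
}
trusted = trusted_workers(all_answers, gold)

# The Córdoba pair from the slide, as plain dictionaries
x_b = {"label": "Córdoba", "type": "o:City", "lat": -31.400}
x_b2 = {"type": "o:City", "lat": 37.883}
hardness = difficulty(x_b, x_b2)
print(trusted, hardness)
```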

Slide 19

Slide 19 text

Approach (II)
[Same content as slide 18, with one addition:]
• Two-way feedback

Slide 20

Slide 20 text

Approach
[Pipeline diagram from slide 17, annotated with: agreement, #workers per link, context information]
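
Agreement and the number of workers per link are the two knobs named on this slide. A small sketch of how they might interact, assuming hypothetical thresholds of at least 3 workers and 80% agreement (neither value comes from the talk):

```python
from collections import Counter


def agreement(answers):
    """Share of answers that match the most common answer for a link."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


def needs_more_workers(answers, min_workers=3, min_agreement=0.8):
    """Ask for further judgements while a link has too few or too mixed answers."""
    return len(answers) < min_workers or agreement(answers) < min_agreement


print(agreement([True, True, True, False]))
print(needs_more_workers([True, True, True, False]))
```

Links that keep failing the agreement check after several rounds could be passed on to a domain expert instead of consuming more crowd budget.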

Slide 21

Slide 21 text

Approach (II)
[Diagram: manual interlinking of D1 and D2 compared with HCOMP interlinking of D1 and D2. Labels: Guide, Review, Algorithm]

Slide 22

Slide 22 text

Use cases

Slide 23

Slide 23 text

Mapping vocabularies
Run an automatic ontology alignment tool and post-process the results with the crowd
Context information pre-configured
See also: [Sarasua et al., 2012]

Slide 24

Slide 24 text

Discovering links between instances
a) To extract the patterns of the linkage rules (i.e. labelling)
b) To post-process irregular multilingual values, different name versions
c) To automatically identify patterns of errors in a resulting set of links, which may afterwards be reviewed by the experts

Slide 25

Slide 25 text

Curating mapping extensions to authority files
Quality control can be done by giving these answers to other crowd workers

Checking usefulness of links with library users
• There are different possible targets for the interlinking of a dataset: which possibility to select for the Web portal?
• Embed Web site in a microtask and ask for specific information or observe next Web site opened

Slide 26

Slide 26 text

3 Challenges

Slide 27

Slide 27 text

# Deciding whether to crowdsource or not
• Depends to a large extent on the data
 – Specific domains require more crowd management effort
 – Benefit compared to automatically generated links may vary
 – Availability of workers may change over time
• What should be processed by the crowd
 – Criteria for selecting subsets of the data (e.g. confidence of machine)
Libraries and the cultural heritage domain have high potential (multilinguality, different naming conventions, knowledge exploration)
> Trial, error and assessment
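
One possible reading of "confidence of machine" as a selection criterion is to route only uncertain links to the crowd. The sketch below assumes an automatic tool that attaches a confidence score in [0, 1] to each link; the `route` function and both cut-off values are hypothetical and would need the trial, error and assessment the slide calls for.

```python
def route(scored_links, accept_above=0.9, reject_below=0.3):
    """Split machine-generated links into accepted, crowd-reviewed and rejected."""
    accepted, crowd, rejected = [], [], []
    for link, confidence in scored_links:
        if confidence >= accept_above:
            accepted.append(link)   # machine is confident: keep as is
        elif confidence < reject_below:
            rejected.append(link)   # machine is confident it is wrong: drop
        else:
            crowd.append(link)      # uncertain: worth paying for human judgement
    return accepted, crowd, rejected


# Hypothetical output of an automatic interlinking tool
scored = [
    (("d1:a", "owl:sameAs", "d2:a"), 0.97),
    (("d1:b", "owl:sameAs", "d2:b"), 0.55),
    (("d1:c", "owl:sameAs", "d2:c"), 0.10),
]
accepted, crowd, rejected = route(scored)
print(len(accepted), len(crowd), len(rejected))
```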

Slide 28

Slide 28 text

# Building a loyal workforce
• Attracting good crowd workers
 – Microtasks are constantly being published
 – Higher reward may also attract more malicious workers
• Working with people repeatedly is not supported by the majority of crowdsourcing platforms
• How to keep crowd workers working on these microtasks without them getting demotivated?
> Be fair (see also Guidelines on Crowd Work for Academic Researchers, 2014)
> Listen to crowd workers (e.g. direct comments, Twitter, ratings, monitoring online discussions)
> Recognize their work
> Be aware that gamification is not always the best solution
“It's really easy to change people's motivations, [at Zooniverse] we find people are motivated by wanting to contribute, they want a sense that this is something real. And in adding game-like elements you can destroy that quite quickly” Chris Lintott, Zooniverse
http://www.wired.co.uk/news/archive/2013-09/12/fraxinus-gamifying-science/viewgallery/307960

Slide 29

Slide 29 text

# Working with unknown humans
• Open call can be a problem and an opportunity at the same time: people have diverse
 – Motivation and dedication
 – Context and profile
 – Background knowledge
• Crowdsourcing platforms have limited support for personalisation
• Working with suitable crowd
 – Identify what they can do best
  ▪ Type of task / data level
  ▪ Competences vs experience, cross-platform analysis
 – Assign work accordingly
  ▪ Weight vs reject
> Towards a Crowd Work CV
See also: [Sarasua et al., 2014]
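
"Weight vs reject" can be read as two ways of using what is known about a worker. The sketch below contrasts them under assumed inputs: a per-worker accuracy estimate (for instance from ground-truth links), a default weight of 0.5 for unknown workers and a rejection threshold of 0.7. None of these values or function names come from the talk.

```python
def weighted_vote(answers, accuracy, default_weight=0.5):
    """Weight: every answer counts, scaled by the worker's estimated accuracy."""
    yes = sum(accuracy.get(worker, default_weight) for worker, answer in answers.items() if answer)
    no = sum(accuracy.get(worker, default_weight) for worker, answer in answers.items() if not answer)
    return yes > no


def reject_then_vote(answers, accuracy, min_accuracy=0.7):
    """Reject: drop answers from low-accuracy workers, then take a plain majority."""
    kept = [answer for worker, answer in answers.items() if accuracy.get(worker, 0.0) >= min_accuracy]
    yes = sum(kept)
    return yes > len(kept) - yes


# One candidate link judged by three hypothetical workers
answers = {"w1": True, "w2": False, "w3": False}
accuracy = {"w1": 0.95, "w2": 0.40, "w3": 0.45}

by_weight = weighted_vote(answers, accuracy)
by_reject = reject_then_vote(answers, accuracy)
print(by_weight, by_reject)
```

In this toy case a plain majority would reject the link, while both strategies accept it because the single reliable worker outweighs two unreliable ones. Weighting keeps some signal from every worker; rejecting is simpler but discards work that was still paid for.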

Slide 30

Slide 30 text

Plea to this community
• Interlinking is much more than deduplication; consider also using other relations
• Consider connecting library datasets to different complementary domains
• Interlinking to non-editorial data can also be enriching
• The more datasets you connect the better
• Document your interlinking in the VoID description of your dataset
• Query and make use of available links

Slide 31

Slide 31 text

If you need humans to process data while interlinking datasets, consider crowd intervention because it can be very valuable for enhancing your results.

Slide 32

Slide 32 text

Thank you for your attention!
Cristina Sarasua
Institute for Web Science and Technologies
Universität Koblenz-Landau
[email protected]
http://de.slideshare.net/cristinasarasua
https://github.com/criscod

Slide 33

Slide 33 text

References
• Sarasua, C., Simperl, E., Noy, N.F.: CrowdMAP: Crowdsourcing ontology alignment with microtasks. In: Proceedings of the 11th International Semantic Web Conference (ISWC) (2012)
• Sarasua, C., Thimm, M.: Crowd Work CV: Recognition for Micro Work. In: SoHuman workshop, co-located with Social Informatics (SocInfo) (2014)
• Guidelines on Crowd Work for Academic Researchers (2014). http://wiki.wearedynamo.org/index.php/Guidelines_for_Academic_Requesters