Wikidata: A Free Collaborative Knowledge Base

SWIB14
December 02, 2014

Presenter: Markus Krötzsch

Abstract:
Wikidata, the free knowledge base of Wikipedia, is one of the largest collections of human-authored structured information that are freely available on the Web. It is curated by a unique community of tens of thousands of editors who contribute in up to 400 different languages. Data is stored in a language-independent way, so that most users can access information in their native language. To support plurality, Wikidata uses a rich content model that gives up on the idea that the world can be described as a set of “true” facts. Instead, statements in Wikidata provide additional context information, such as temporal validity and provenance (in particular, most statements in Wikidata already provide one or more references). One could easily imagine this leading to a rather chaotic pile of disjointed facts that are hard to use or even navigate. However, large parts of the data are interlinked with international authority files, catalogues, databases, and, of course, Wikipedia. Moreover, the community strives to reach “global” agreement on how to organise knowledge: over 1,000 properties and 40,000 classes are currently used as an ontology of the system, and many aspects of this knowledge model are discussed extensively in the community. Together, this leads to a multilingual knowledge base of increasing quality that has many practical uses. This talk gives an overview of the project, explains design choices, and discusses emerging developments and opportunities related to Wikidata.

Transcript

  1. Wikidata: A Free Collaborative Knowledge Base
    Markus Krötzsch, TU Dresden (Technische Universität Dresden, Fakultät Informatik)
    Semantic Web in Libraries, December 2014
  2. Where is Wikipedia Going?
    Wikipedia in 2014:
    - A project that has shaped the Web
    - Huge global reach (> 500M unique visitors/month)
    - Stable, reliable, … losing momentum?
    - Criticized on a regular basis
  3. None
  4. Wikipedia's Challenges (selection)
    - Community of Contributors
    - Content Size and Quality
    - Mobile markets
    - Editing experience
    - Language diversity
    - Maintenance effort
    - Integration with external sources
    - User engagement
    - Content reuse
  5. Example: Language Diversity
    - There is no one Wikipedia: over 280 language editions
    - English, German, French, Dutch: 1 million+
    - 40 languages: 100,000+
    - 112 languages: 10,000+
    - Great differences in
      - Size
      - Goals (“What is encyclopaedic?” …)
      - Community
      - Coverage
      - Quality
  6. English
  7. French
  8. Catalan
  9. Italian
  10. Greek
  11. Russian
  12. Chinese
  13. English
  14. Example: Content Reuse
    - Wikipedia as an information cul-de-sac
    - Extremely restricted access paths (main access method: reading lengthy pages of text)
    - Information extraction is hard
    - Question answering is hard
    - Adapting to new contexts is hard
    Example: “What are the world's largest cities with a female mayor?”
  15-45. None
  46. None
  47. Wikidata
    - Official “Wikipedia Database”
    - Live at www.wikidata.org
    - Data used by most Wikimedia Projects
      - All 285 language editions of Wikipedia
      - Wikivoyage, Wikiquote, Wikimedia Commons (new!)
    - Large, active community
      - More than 50K editors so far
      - Among the most active Wikimedia projects by edits
  48. None
  49. Wikidata Development
    - Based on free software “Wikibase”
    - Ongoing development led by Wikimedia Germany
    - Funded by Wikimedia Foundation
    - Original funding by donations (ai², Google, Moore Foundation, Yandex)
  50. Important note
    All data is entered by volunteers. The community decides what to enter and how. Wikimedia provides infrastructure, not data. Really.
  51. Data Model
  52. The Content of Wikidata
  53. Statements
    - The richest part of Wikidata's data
    Labelled parts: Property, Value, Reference(s)
  54. Statements
    - The richest part of Wikidata's data
  55. Statements
    - The richest part of Wikidata's data
    Labelled parts: Property, Value, List of qualifiers, Reference = list of property-value pairs, List of references, Rank
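The statement structure named on these slides (property, value, qualifiers, references, rank) can be sketched as a small data structure. This is a minimal illustration, not Wikidata's actual implementation: the class and function names are hypothetical, the rank names "preferred", "normal" and "deprecated" are taken from Wikidata's data model rather than from the slide, and the sample IDs and values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A property-value pair; qualifiers and references are built from these.
PropertyValue = Tuple[str, str]


@dataclass
class Reference:
    """One reference: a list of property-value pairs."""
    pairs: List[PropertyValue] = field(default_factory=list)


@dataclass
class Statement:
    """One statement: main property and value plus context information."""
    prop: str
    value: str
    qualifiers: List[PropertyValue] = field(default_factory=list)
    references: List[Reference] = field(default_factory=list)
    rank: str = "normal"  # assumed rank names: "preferred", "normal", "deprecated"


def best_statements(statements: List[Statement]) -> List[Statement]:
    """Return preferred statements if any exist, else normal ones; never deprecated ones."""
    preferred = [s for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.rank == "normal"]


# Placeholder data: these IDs and values are illustrative, not real Wikidata content.
population = [
    Statement(
        "P_population", "3400000",
        qualifiers=[("P_point_in_time", "2010")],
        references=[Reference([("P_reference_url", "http://example.org/census-2010")])],
    ),
    Statement(
        "P_population", "3500000",
        qualifiers=[("P_point_in_time", "2014")],
        rank="preferred",
    ),
]

if __name__ == "__main__":
    for statement in best_statements(population):
        print(statement.prop, statement.value, statement.qualifiers)
```

Keeping several statements for the same property, each with its own qualifiers and references, is what lets the model record competing or time-dependent claims instead of a single "true" value; the rank only suggests which ones to show first.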
  56. Some Statistics
  57. Size as of October 2014
    - Items: 16,318,300
    - Properties: 1,255
    - Statements: 48,243,540
      - … references: 25,473,820
    - Labels: 54,922,438
    - Aliases: 8,719,665
    - Descriptions: 39,869,556
    - Site links: 40,660,771
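A few ratios can be derived from the figures on this slide. The sketch below is only arithmetic on the slide's numbers; it assumes that the elided "… references" line counts statements that carry at least one reference, which the slide does not state explicitly.

```python
# Figures copied from the "Size as of October 2014" slide.
ITEMS = 16_318_300
STATEMENTS = 48_243_540
LABELS = 54_922_438
# Assumption: this is the number of statements with at least one reference.
REFERENCED_STATEMENTS = 25_473_820


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator as a float."""
    return numerator / denominator


statements_per_item = ratio(STATEMENTS, ITEMS)
labels_per_item = ratio(LABELS, ITEMS)
referenced_share = ratio(REFERENCED_STATEMENTS, STATEMENTS)

if __name__ == "__main__":
    print(f"Statements per item: {statements_per_item:.2f}")
    print(f"Labels per item:     {labels_per_item:.2f}")
    print(f"Referenced share:    {referenced_share:.1%}")
```

Under that assumption, roughly 3 statements and 3.4 labels exist per item, and a little over half of all statements have a reference, which matches the abstract's remark that most statements already provide one.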
  58. Growth (up to Feb 2014)
  59. Activity (Feb 2014)
    - 54k contributors
      - 5k contributors with 5+ edits in Jun 2014
    - Over 150M edits so far
      - up to 500k per day
  60. Wikidata and the Semantic Web

  61. Exporting Wikidata Statements to RDF
    URIs for items: http://www.wikidata.org/entity/<id>
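The URI scheme above can be applied mechanically. The following is a small sketch with hypothetical helper names; it assumes that IDs consist of "Q" (items) or "P" (properties) followed by a positive integer, which goes slightly beyond what the slide states.

```python
import re

ENTITY_URI_PREFIX = "http://www.wikidata.org/entity/"
# Assumption: IDs are "Q" (items) or "P" (properties) followed by a positive integer.
_ID_PATTERN = re.compile(r"[QP][1-9][0-9]*")


def entity_uri(entity_id: str) -> str:
    """Build the entity URI for an ID such as 'Q42'."""
    if not _ID_PATTERN.fullmatch(entity_id):
        raise ValueError(f"not a valid entity ID: {entity_id!r}")
    return ENTITY_URI_PREFIX + entity_id


def entity_id_from_uri(uri: str) -> str:
    """Extract the ID from an entity URI; raise ValueError for anything else."""
    if not uri.startswith(ENTITY_URI_PREFIX):
        raise ValueError(f"not an entity URI: {uri!r}")
    entity_id = uri[len(ENTITY_URI_PREFIX):]
    if not _ID_PATTERN.fullmatch(entity_id):
        raise ValueError(f"not a valid entity ID: {entity_id!r}")
    return entity_id


if __name__ == "__main__":
    print(entity_uri("Q42"))
    print(entity_id_from_uri("http://www.wikidata.org/entity/P31"))
```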

  62. Classification
    - Properties subclass of (P279) and instance of (P31)
    - P31 is the most used property on Wikidata
    - Often (but not always) used without qualifiers
    - Interesting class hierarchy:
      - Entities used as classes: 110,366
      - Subclass of: 110,910 (without qualifiers)
      - Instance of: 11,659,604 (without qualifiers)
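The interplay of the two classification properties can be illustrated on toy data: an entity belongs to a class if one of its direct instance-of classes reaches that class through subclass-of links. This is a sketch under stated assumptions; the dictionaries use readable placeholder names instead of real Wikidata IDs, and the function names are hypothetical.

```python
from collections import deque
from typing import Dict, Set

# Toy data with readable names instead of real Wikidata IDs (hypothetical).
SUBCLASS_OF: Dict[str, Set[str]] = {  # plays the role of P279
    "capital": {"city"},
    "city": {"human settlement"},
    "human settlement": {"geographic location"},
}
INSTANCE_OF: Dict[str, Set[str]] = {  # plays the role of P31
    "Dresden": {"city"},
    "Berlin": {"capital"},
}


def superclasses(cls: str) -> Set[str]:
    """Return cls plus every class reachable via subclass-of; safe against cycles."""
    seen = {cls}
    queue = deque([cls])
    while queue:
        current = queue.popleft()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def is_instance_of(entity: str, cls: str) -> bool:
    """True if any direct class of entity is cls or a (transitive) subclass of cls."""
    return any(cls in superclasses(direct) for direct in INSTANCE_OF.get(entity, ()))


if __name__ == "__main__":
    print(is_instance_of("Berlin", "city"))
    print(is_instance_of("Dresden", "capital"))
```

The `seen` set matters in practice: a community-edited hierarchy can contain cycles, and the traversal should terminate regardless.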
  63. Available RDF Exports
    - RDF/OWL file exports at: http://tools.wmflabs.org/wikidata-exports/rdf/
    - Dumps of Oct 13, 2014:
      - 450M triples RDF dumps (main serializations)
      - 67M triples simplified statements
      - 12M triples unqualified instanceOf/subclassOf
    - LD Fragments/HDT dumps by Cristian Consonni: http://wikidataldf.com
  64. Wikidata and DBpedia: A Superficial Comparison
    Wikidata                     | DBpedia
    Data related to Wikipedia    | Data related to Wikipedia
    Online since late 2012*      | Started in 2006
    Manual editing               | Automated extraction
    One multilingual dataset     | One dataset per language
    Based on statements          | Based on triples (RDF)
    About 1k properties          | >10k properties
    Wikipedia integration        | Stand-alone dataset
    Unique community             | Unique community
    *) influenced by Semantic MediaWiki (started 2005)
  65. Usage & Applications

  66. None
  67. Application Areas
    - Labels and descriptions
    - Identifiers
    - Data access
    - Advanced analytics
  68. Third-party applications: Wikipedia iOS app (beta)
  69. Third-party applications: Reasonator (by Magnus Manske)
  70. Third-party applications: Wikidata Game (by Magnus Manske)
  71. Third-party applications: Wikipedia Gender Ratio analysis (by Max Klein)
  72. Third-party applications: Missing Images Heatmap (by Magnus Manske)
  73. Third-party applications: Vizidata (by Georg Wild)
  74. Third-party applications: Histropedia
  75. Third-party applications: Wikidata Classes and Properties browser

  76. Getting the Data
    See www.wikidata.org/wiki/Wikidata:Data_access
    - Direct access per item (Web API, JSON, RDF, …)
    - Database dumps (JSON)
      - Use Wikidata Toolkit to parse dumps in Java: https://www.mediawiki.org/wiki/Wikidata_Toolkit
    - RDF dumps
    - Useful third-party Web services
      - Wikidata Query (Magnus Manske)
      - Wikidata LDF (Cristian Consonni)
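Direct per-item access in JSON might look roughly like the sketch below. It is not taken from the talk: it assumes the `Special:EntityData` endpoint and the JSON layout (`entities`, `labels`, `claims`, `mainsnak`, `datavalue`) as they exist today, which may differ in detail from the 2014 format; the helper names and the User-Agent string are hypothetical, and the sample document is a hand-trimmed illustration. The download function needs network access and is not called here.

```python
import json
from urllib.request import Request, urlopen


def entity_data_url(entity_id: str) -> str:
    """URL of the per-entity JSON document (Special:EntityData)."""
    return f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"


def fetch_entity(entity_id: str) -> dict:
    """Download one entity document; requires network access."""
    # Hypothetical User-Agent; replace it with one that identifies your own tool.
    request = Request(entity_data_url(entity_id),
                      headers={"User-Agent": "wikidata-slides-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)["entities"][entity_id]


def label(entity: dict, lang: str):
    """Return the label in the given language, or None if it is missing."""
    return entity.get("labels", {}).get(lang, {}).get("value")


def item_values(entity: dict, prop: str) -> list:
    """IDs of item-valued statements for prop, skipping 'no value'/'unknown value'."""
    ids = []
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue
        value = snak.get("datavalue", {}).get("value")
        if isinstance(value, dict) and "id" in value:
            ids.append(value["id"])
    return ids


# Hand-trimmed sample in the assumed layout, so the parsing runs without network access.
SAMPLE = {
    "id": "Q42",
    "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
    "claims": {
        "P31": [{
            "mainsnak": {
                "snaktype": "value",
                "property": "P31",
                "datavalue": {"type": "wikibase-entityid",
                              "value": {"entity-type": "item", "id": "Q5"}},
            },
            "rank": "normal",
        }]
    },
}

if __name__ == "__main__":
    print(label(SAMPLE, "en"), item_values(SAMPLE, "P31"))
```

For bulk processing, the database dumps mentioned on the slide are the better route; per-item requests suit lookups of a handful of entities.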
  77. Conclusions
    - Wikidata is developing rapidly
      - Data size
      - Vocabulary size
      - Technical features and community processes
    - A platform for data integration
      - Including links to many other databases
    - Data access is easy, both legally and technically
      - Further improvements planned for exports
  78. Further reading
    - Denny Vrandečić, Markus Krötzsch. Wikidata: A Free Collaborative Knowledge Base. CACM 2014. To appear. → general first introduction to Wikidata
    - Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, Denny Vrandečić. Introducing Wikidata to the Linked Data Web. 2014. → introduction of the Wikidata RDF export and data model