Wikidata: A Free Collaborative Knowledge Base

SWIB14
December 02, 2014

Presenter: Markus Krötzsch

Abstract:
Wikidata, the free knowledge base of Wikipedia, is one of the largest collections of human-authored structured information freely available on the Web. It is curated by a unique community of tens of thousands of editors who contribute in up to 400 different languages. Data is stored in a language-independent way, so that most users can access information in their native language. To support plurality, Wikidata uses a rich content model that gives up on the idea that the world can be described as a set of “true” facts. Instead, statements in Wikidata provide additional context information, such as temporal validity and provenance (in particular, most statements in Wikidata already provide one or more references). One could easily imagine this leading to a rather chaotic pile of disjointed facts that are hard to use or even navigate. However, large parts of the data are interlinked with international authority files, catalogues, databases, and, of course, Wikipedia. Moreover, the community strives to reach “global” agreement on how to organise knowledge: over 1,000 properties and 40,000 classes are currently used as an ontology of the system, and many aspects of this knowledge model are discussed extensively in the community. Together, this leads to a multilingual knowledge base of increasing quality that has many practical uses. This talk gives an overview of the project, explains design choices, and discusses emerging developments and opportunities related to Wikidata.

Transcript

  1. Technische Universität Dresden, Fakultät Informatik. Wikidata: A Free Collaborative Knowledge Base. Markus Krötzsch, TU Dresden. Semantic Web in Libraries, December 2014.
  2. Where is Wikipedia Going? Wikipedia in 2014:
     - A project that has shaped the Web
     - Huge global reach (> 500M unique visitors/month)
     - Stable, reliable, … losing momentum?
     - Criticized on a regular basis
  3. Wikipedia's Challenges (selection)
     - Community of contributors
     - Content size and quality
     - Mobile markets
     - Editing experience
     - Language diversity
     - Maintenance effort
     - Integration with external sources
     - User engagement
     - Content reuse
  4. Example: Language Diversity
     - There is no one Wikipedia: over 280 language editions
     - English, German, French, Dutch: 1 million+ articles
     - 40 languages: 100,000+
     - 112 languages: 10,000+
     - Great differences in size, goals (“What is encyclopaedic?” …), community, coverage, and quality
  5.–12. [Screenshot slides illustrating language diversity: the same Wikipedia content shown in English, French, Catalan, Italian, Greek, Russian, Chinese, and again English.]
  13. Example: Content Reuse
     - Wikipedia as an information cul-de-sac
     - Extremely restricted access paths (main access method: reading lengthy pages of text)
     - Information extraction is hard
     - Question answering is hard
     - Adapting to new contexts is hard
     Example: “What are the world's largest cities with a female mayor?”
  14. Wikidata
     - Official “Wikipedia Database”
     - Live at www.wikidata.org
     - Data used by most Wikimedia projects: all 285 language editions of Wikipedia; Wikivoyage, Wikiquote, Wikimedia Commons (new!)
     - Large, active community: more than 50K editors so far; among the most active Wikimedia projects by edits
  15. Wikidata Development
     - Based on the free software “Wikibase”
     - Ongoing development led by Wikimedia Germany
     - Funded by the Wikimedia Foundation
     - Original funding by donations (ai², Google, Moore Foundation, Yandex)
  16. Important note: All data is entered by volunteers. The community decides what to enter and how. Wikimedia provides infrastructure, not data. Really.
  17. Statements: the richest part of Wikidata's data. Each statement consists of:
     - a main property-value pair
     - a list of qualifiers (property-value pairs)
     - a list of references, where each reference is itself a list of property-value pairs
     - a rank
     (A JSON sketch of this model follows below.)
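To make the model concrete, here is a minimal sketch of how one statement appears in Wikidata's JSON export. This example is not from the talk; the property and item IDs are real Wikidata identifiers chosen for illustration, and nested values are elided with `...`:

```python
# Sketch of a single Wikidata statement in the JSON export format.
# P39 = "position held", Q30185 = "mayor", P580 = "start time",
# P143 = "imported from"; nested datavalues are elided with "...".
statement = {
    "mainsnak": {  # the main property-value pair
        "snaktype": "value",
        "property": "P39",
        "datavalue": {
            "value": {"entity-type": "item", "numeric-id": 30185},
            "type": "wikibase-entityid",
        },
    },
    "qualifiers": {  # context information: property -> list of value snaks
        "P580": [{"snaktype": "value", "property": "P580", "datavalue": ...}],
    },
    "references": [  # each reference is itself a set of property-value pairs
        {"snaks": {"P143": [...]}},
    ],
    "rank": "normal",  # "preferred", "normal", or "deprecated"
}
```

The rank field lets editors mark preferred or deprecated values without deleting them, which is how Wikidata represents competing or outdated claims side by side.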
  18. Size as of October 2014
     - Items: 16,318,300
     - Properties: 1,255
     - Statements: 48,243,540 (references: 25,473,820)
     - Labels: 54,922,438
     - Aliases: 8,719,665
     - Descriptions: 39,869,556
     - Site links: 40,660,771
  19. Activity (Feb 2014)
     - 54k contributors; 5k contributors with 5+ edits in Jun 2014
     - Over 150M edits so far; up to 500k per day
  20. Classification
     - Properties: subclass of (P279) and instance of (P31)
     - P31 is the most used property on Wikidata
     - Often (but not always) used without qualifiers
     - Interesting class hierarchy (see the traversal sketch below):
       - Entities used as classes: 110,366
       - Subclass of: 110,910 statements (without qualifiers)
       - Instance of: 11,659,604 statements (without qualifiers)
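Conceptually, these two properties induce a directed graph over items. As an illustration (not from the talk, using a deliberately tiny hand-made edge set with real Wikidata IDs), all superclasses of a class can be found by transitively following subclass-of links:

```python
from collections import deque

# Toy subclass-of (P279) edges: class -> list of direct superclasses.
# The IDs are real Wikidata items, but this tiny graph is illustrative.
subclass_of = {
    "Q515": ["Q486972"],      # city -> human settlement
    "Q486972": ["Q2221906"],  # human settlement -> geographic location
}

def all_superclasses(cls):
    """Return every class reachable from cls via P279, transitively."""
    seen, queue = set(), deque([cls])
    while queue:
        current = queue.popleft()
        for parent in subclass_of.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(all_superclasses("Q515"))  # {'Q486972', 'Q2221906'}
```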
  21. Available RDF Exports
     - RDF/OWL file exports at: http://tools.wmflabs.org/wikidata-exports/rdf/
     - Dumps of Oct 13, 2014:
       - 450M triples: RDF dumps (main serializations)
       - 67M triples: simplified statements
       - 12M triples: unqualified instanceOf/subclassOf (see the rdflib sketch below)
     - LD Fragments/HDT dumps by Cristian Consonni: http://wikidataldf.com
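The unqualified instanceOf/subclassOf dump is small enough to load whole. A sketch, not from the talk: the file name is a placeholder, and the predicate URI below follows the current "truthy" wdt: scheme, which may differ from the form used in the 2014 exports, so check the downloaded dump:

```python
from rdflib import Graph, URIRef

# Load the (comparatively small) taxonomy dump with rdflib.
# "taxonomy.nt" is a placeholder for the downloaded file; the predicate
# URI below is the current direct-claim form and may differ in old dumps.
g = Graph()
g.parse("taxonomy.nt", format="nt")

P279 = URIRef("http://www.wikidata.org/prop/direct/P279")  # subclass of
print(sum(1 for _ in g.triples((None, P279, None))), "subclass-of triples")
```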
  22. Wikidata and DBpedia: A Superficial Comparison
     Wikidata:
     - Data related to Wikipedia
     - Online since late 2012*
     - Manual editing
     - One multilingual dataset
     - Based on statements
     - About 1k properties
     - Wikipedia integration
     - Unique community
     DBpedia:
     - Data related to Wikipedia
     - Started in 2006
     - Automated extraction
     - One dataset per language
     - Based on triples (RDF)
     - >10k properties
     - Stand-alone dataset
     - Unique community
     *) influenced by Semantic MediaWiki (started 2005)
  23. Getting the Data: see www.wikidata.org/wiki/Wikidata:Data_access
     - Direct access per item (Web API, JSON, RDF, …); a minimal fetch example follows below
     - Database dumps (JSON); use Wikidata Toolkit to parse dumps in Java: https://www.mediawiki.org/wiki/Wikidata_Toolkit
     - RDF dumps
     - Useful third-party Web services: Wikidata Query (Magnus Manske), Wikidata LDF (Cristian Consonni)
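For per-item access, here is a minimal sketch (not from the talk) using only the Python standard library; Special:EntityData serves the full JSON record for any entity:

```python
import json
import urllib.request

# Fetch the complete JSON record for one item; Q42 (Douglas Adams) is
# just an example -- any item or property ID works the same way.
url = "https://www.wikidata.org/wiki/Special:EntityData/Q42.json"
with urllib.request.urlopen(url) as response:
    data = json.load(response)

entity = data["entities"]["Q42"]
print(entity["labels"]["en"]["value"])  # English label of the item
print(len(entity["claims"]), "properties with statements")
```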
  24. Conclusions
     - Wikidata is developing rapidly: data size, vocabulary size, technical features and community processes
     - A platform for data integration, including links to many other databases
     - Data access is easy, both legally and technically
     - Further improvements planned for exports
  25. Further reading
     - Denny Vrandečić, Markus Krötzsch: Wikidata: A Free Collaborative Knowledge Base. CACM 2014. To appear. → a general first introduction to Wikidata
     - Fredo Erxleben, Michael Günther, Markus Krötzsch, Julian Mendez, Denny Vrandečić: Introducing Wikidata to the Linked Data Web. 2014. → introduction of the Wikidata RDF export and data model