Information Organization in the Web Age

Slide 1

Slide 1 text

Information Organization in the Web Age. Masao Takaku (高久雅生) [email protected]. Friday, February 14, 2020. TSUKUBA Short-term Study Program (TSSP) 2020

Slide 2

Slide 2 text

Me? • Masao Takaku (高久雅生; たかくまさお) • Research interests: information retrieval, information seeking behaviour, digital libraries, Linked Open Data (LOD) • Contact: Email: [email protected]; Twitter: @tmasao

Slide 3

Slide 3 text

My research area? [Diagram: Information System; Contents (document collections); User & Community (information needs); Organization] My main research focus is to understand these elements and the relationships among them.

Slide 4

Slide 4 text

Contents • Introduction • What is Information Organization? • What is the Web? • Web & Information Organization • Discussions

Slide 5

Slide 5 text

WHAT DOES “INFORMATION ORGANIZATION” REALLY MEAN? (What is “organization”?)

Slide 6

Slide 6 text

Let’s start with the conclusion... • Information organization: makes the target information resources findable and understandable; complements human embodiment and subjectivity (how it does so depends on the genre of information needs and resources); adds “value-added information” (second-order, third-order, and N-th-order information) • User tasks: Identification, Find, and Access • Methodology: Description (record) and Classification

Slide 7

Slide 7 text

What is information organization? • Large amounts of information should be organized well to make them easier to find and understand → Group common or similar items together → Describe items with common attributes and structure → Explain common properties with the same name. In general, “information organization” covers the following: • Describe and extract the common structured information (metadata) as a record, and make it searchable → Cataloguing and description • Analyze the contents based on certain criteria, assign labels, and bring together information resources with common contents → Subject analysis, classification, subject headings, and indexing

Slide 8

Slide 8 text

Ex. Description and classification

Slide 9

Slide 9 text

Ex. Description and classification: Tent, Rocket, Rabbit, Grape, Gorilla, Desk, Paper airplane, Pants, Performer, Motorbike, Plumber, Apple, Pencil, Tuba, One piece

Slide 10

Slide 10 text

Ex. Description and classification: Tent, Rocket, Rabbit, Grape, Gorilla, Desk, Paper airplane, Pants, Performer, Motorbike, Plumber, Apple, Pencil, Tuba, One piece

Slide 11

Slide 11 text

Ex. Description and classification (Ordering): Tent, Rocket, Rabbit, Grape, Gorilla, Desk, Paper airplane, Pants, Motorbike, Apple, Pencil, Tuba, One piece, Plumber, Performer

Slide 12

Slide 12 text

Ex. Organize with attributes and values
Class | Item (value)
Animal | Rabbit, Gorilla
Human | Plumber, Performer
Fruit | Apple, Grape
Vehicle | Motorbike, Rocket
Tool | One piece, Pants; Tent, Desk, Tuba, Pencil, Paper airplane
Note that the classification scheme needs more documentation if we want to classify more samples, or classify them more precisely.
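The class/item table above can be sketched as attribute-value data. This is a minimal illustration, not part of the original slides: the names `CLASS_TO_ITEMS`, `build_item_index`, and `find_by_class` are hypothetical, and placing "One piece" and "Pants" under "Tool" simply follows the order in which the flattened slide text lists them.

```python
# Hypothetical sketch: the class/item table as a Python mapping.
# Assumption: every item after "Tool" in the slide text belongs to "Tool".
CLASS_TO_ITEMS = {
    "Animal": ["Rabbit", "Gorilla"],
    "Human": ["Plumber", "Performer"],
    "Fruit": ["Apple", "Grape"],
    "Vehicle": ["Motorbike", "Rocket"],
    "Tool": ["One piece", "Pants", "Tent", "Desk", "Tuba", "Pencil", "Paper airplane"],
}


def build_item_index(class_to_items):
    """Invert the table so that each item can be looked up by its name."""
    index = {}
    for class_name, items in class_to_items.items():
        for item in items:
            index[item] = class_name
    return index


def find_by_class(class_name):
    """Return the items assigned to a class, in alphabetical order."""
    return sorted(CLASS_TO_ITEMS.get(class_name, []))


ITEM_TO_CLASS = build_item_index(CLASS_TO_ITEMS)
```

Once items carry a class label, both directions of lookup (item to class, class to items) become simple dictionary operations, which is the practical payoff of describing items with common attributes.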

Slide 13

Slide 13 text

Information organization in the context of information seeking [Diagram: Information System; Contents (document collections); User & Community (information needs); Organization]

Slide 14

Slide 14 text

Information organization in the context of information seeking (cont.) • Information organization helps users (and user communities) carry out the following tasks: 1. Identification 2. Find (subject search, content analysis) 3. Access (acquisition, referring to the location of the item)

Slide 15

Slide 15 text

What are user tasks? • Find task → Use any keyword or category to browse and discover what the content or subject matter is. • In traditional information retrieval research, find tasks are divided into “subject search” and “known item search”. • Identification task → Distinguish one thing from another. → What counts as a “different thing” varies with the domain and use; for example, items with the same title may still need to be clearly distinguished as different versions or editions. • Access task → Check the location of the item, get it, and/or access the resource on the network. Note that all of these tasks are often carried out without the actual material, because we usually work with surrogates (descriptive metadata).

Slide 16

Slide 16 text

In the context of description

Slide 17

Slide 17 text

In the context of description (cont.) Web page: 田中宏和.com | 田中宏和宣言!! http://www.tanakahirokazu.com/ Book: 田中宏和. 田中宏和さん. リーダーズノート, 2010, 192p.

Slide 18

Slide 18 text

WORLD WIDE WEB (WWW) (What is the Web?)

Slide 19

Slide 19 text

World Wide Web • WWW (World Wide Web), or just “Web” • web (noun): A network of silken thread spun especially by the larvae of various insects (as a tent caterpillar) and usually serving as a nest or shelter. Image: https://commons.wikimedia.org/wiki/File:Spider_web_Belgium_Luc_Viatour.jpg

Slide 20

Slide 20 text

Three elements of the Web • HTTP, URI, and HTML are the three main components of the Web. • HTTP specifies how data is transmitted over the network and how the type of the document format is declared. • URI specifies the address of web pages, and it enables hyperlinks among them on the network. • HTML specifies the markup format of the documents themselves.

Slide 21

Slide 21 text

Knight Foundation (2008) http://www.flickr.com/photos/knightfoundation/2467553359/

Slide 22

Slide 22 text

CERN • International research institute for high-energy physics in Europe. • Big science using high-speed accelerators: material science, particle physics, etc. • Large amounts of device information • Massive needs for documenting and sharing information: about 2,500 employees and about 15,000 visiting scholars

Slide 23

Slide 23 text

Collaboration by many scientists around the world! ATLAS Collaboration: “Dynamics of isolated-photon plus jet production in pp collisions at √s = 7 TeV with the ATLAS detector”. Nuclear Physics B, 875, 438-533 (2013). The number of authors: over 5,800

Slide 24

Slide 24 text

Brief history of the Web • 1989 – 1991: Proposed (design and establishment of the specifications) • 1992 – 1993: Gradually became popular • 1993 – 1994: Gained popularity exponentially (Mosaic, Netscape, Yahoo!) • 1994 – 1995: Gained popularity in society at large (Windows 95, Amazon, …)

Slide 25

Slide 25 text

Very beginning of the Web. Screenshot of the original NeXT web browser in 1993 http://info.cern.ch/

Slide 26

Slide 26 text

[side story] From hypermedia and hypertext to the Web: the rise and spread of the Web, and its conflicts • The concept of “hypermedia” was coined and spread: Memex (Vannevar Bush), 1945; Xanadu (Ted Nelson), 1963?; WWW (Tim Berners-Lee), 1989 • What the Web has lost: integration of browsing and editing; version control; diverse and extensible hyperlinks; copyright management and micropayment. Reference: Tim Berners-Lee: “Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web”. Harper Business, 2000, 256p.

Slide 27

Slide 27 text

Memex by Vannevar Bush (1945)

Slide 28

Slide 28 text

SEMANTIC WEB (The world of the Semantic Web)

Slide 29

Slide 29 text

Semantic Web (1) Tim Berners-Lee, James Hendler, Ora Lassila. The Semantic Web. Scientific American, 2001, Vol.284, No.5, pp.35-43. • From the Web to the “Semantic Web” • Web markup that enables semantic description and machine understanding

Slide 30

Slide 30 text

Semantic Web application (1) • Example: “I want to find a dentist I can stop by after work” • After work: working hours are weekdays 9:00-18:00, so the clinic must offer consultations after 18:00 • Stop by: along the Tsukuba Express Line (TX), i.e. stations along the TX line: Tsukuba, Kenkyu-gakuen, …, Minami-nagareyama, Kita-senju, … • Within a 500-meter walk from the station • (Personal assistant / agent)

Slide 31

Slide 31 text

Semantic Web application (2) • Disambiguation: 月 = 月曜日 = Monday = Mon.; “9:00-13:00 ; 15:00-19:00”; closed days, consultation hours; holidays, public holidays, open all year round • Understanding common sense: one week = Mon, Tue, Wed, Thu, Fri, Sat, Sun; weekdays = Monday to Friday • Information extraction from Web markup
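The disambiguation step above can be sketched as a small normalization routine. This is a minimal sketch under stated assumptions, not the method used by any real agent: the alias table `DAY_ALIASES` and the helper names are hypothetical, and only the Monday labels and the hours string from the slide are covered.

```python
# Hypothetical alias table: the different surface forms from the slide
# all map to one canonical day name.
DAY_ALIASES = {
    "月": "Monday",
    "月曜日": "Monday",
    "monday": "Monday",
    "mon": "Monday",
    "mon.": "Monday",
}

WEEK = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
WEEKDAYS = WEEK[:5]  # "Week days = Monday to Friday"


def normalize_day(label):
    """Map a day label to its canonical name, or None if it is unknown."""
    return DAY_ALIASES.get(label.strip().lower())


def to_minutes(hhmm):
    """Convert 'H:MM' to minutes after midnight."""
    hours, minutes = hhmm.strip().split(":")
    return int(hours) * 60 + int(minutes)


def parse_hours(text):
    """Parse '9:00-13:00 ; 15:00-19:00' into (start, end) pairs in minutes."""
    ranges = []
    for part in text.split(";"):
        part = part.strip()
        if part:
            start, end = part.split("-")
            ranges.append((to_minutes(start), to_minutes(end)))
    return ranges


def is_open(ranges, hhmm):
    """Check whether a time falls inside any of the opening ranges."""
    t = to_minutes(hhmm)
    return any(start <= t < end for start, end in ranges)
```

For example, `is_open(parse_hours("9:00-13:00 ; 15:00-19:00"), "18:30")` tells an agent whether a clinic with those hours could still be visited after an 18:00 finish.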

Slide 32

Slide 32 text

Semantic Web components [Layer diagram] Identifier: URI; Character set: Unicode; Notation: XML; Data exchange: RDF; Vocabulary: RDFS; Ontology: OWL; Rule: RIF/SWRL; Search: SPARQL; Digital Signature; Logic; Reasoning; Trust; User Interface / Application

Slide 33

Slide 33 text

Issues of the Semantic Web • Decentralization and the massive scale of the Web: a huge web space; big data with varied concepts and descriptions can be obtained; diverse information providers; multilingual and multicultural; strict use of controlled vocabularies and custom conventions cannot be assumed • Difficulty of a general-purpose model: it is difficult for computer applications to understand the meaning of things

Slide 34

Slide 34 text

RDF data model • RDF (Resource Description Framework) • Graph-based data model: directed graph with labels; triple representation • Features: simple and highly expressive data model; writing rules tend to be complicated; processing takes time; resource (node) = URI (Uniform Resource Identifier) • Inherits the decentralized features of the Web. Example graph: Harry Potter → (Author) → J.K. Rowling

Slide 35

Slide 35 text

Description with triples (1) • Consider the book itself as a “subject” resource, and build triples from its attributes and values • { subject, predicate, object } ⇒ { this book, property, value }
Property | Value
Title | Weaving the Web
Author | Tim Berners-Lee
Publisher | Harper Business
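The mapping from a property/value record to { subject, predicate, object } triples can be sketched in a few lines. This is an illustrative sketch only: `record_to_triples` is a hypothetical helper, and plain strings stand in for what would be URIs in real RDF data.

```python
def record_to_triples(subject, record):
    """Turn one property/value record into (subject, predicate, object) triples."""
    return [(subject, prop, value) for prop, value in record.items()]


# The record from the slide, with the book itself as the subject.
book_record = {
    "Title": "Weaving the Web",
    "Author": "Tim Berners-Lee",
    "Publisher": "Harper Business",
}

triples = record_to_triples("this book", book_record)
```

Each row of the table becomes exactly one triple, so the record and the triple list carry the same information in two shapes.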

Slide 36

Slide 36 text

Description with triples (2) Graphical representation of triples: This book → (Title) → Weaving the Web; This book → (Author) → Tim Berners-Lee; This book → (Publisher) → Harper Business

Slide 37

Slide 37 text

Description with triples (3) Nodes that denote the same resource are merged into a single node: This book → (Title) → Weaving the Web; → (Author) → Tim Berners-Lee; → (Publisher) → Harper Business
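Merging triples that share a subject can be sketched as a grouping step. This is a hedged illustration rather than how any RDF library works internally: `aggregate_by_subject` is a hypothetical helper, and values are kept in lists because a property may occur more than once.

```python
from collections import defaultdict


def aggregate_by_subject(triples):
    """Group triples so that each subject appears once with all its properties."""
    grouped = defaultdict(dict)
    for subject, predicate, obj in triples:
        grouped[subject].setdefault(predicate, []).append(obj)
    return dict(grouped)


triples = [
    ("this book", "Title", "Weaving the Web"),
    ("this book", "Author", "Tim Berners-Lee"),
    ("this book", "Publisher", "Harper Business"),
]

merged = aggregate_by_subject(triples)
```

Here `merged` has a single key, "this book", which mirrors the single node with three outgoing edges in the slide.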

Slide 38

Slide 38 text

Description with triples (4) • “Literal values” cannot be extended to other resources: only a “resource” node can become the subject of a triple • In this example, if the author is a “resource” node, another triple can be connected. Author as a literal: This book → (Author) → “Tim Berners-Lee”; a triple such as (Birth) → 1955 cannot be attached to the literal. Author as a resource: This book → (Author) → [Tim Berners-Lee]; [Tim Berners-Lee] → (Name) → “Tim Berners-Lee”; [Tim Berners-Lee] → (Birth) → 1955
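The rule that only a resource node can be the subject of a triple can be sketched with two small node types. The classes `Resource`, `Literal`, and `Graph` below are hypothetical and written only for illustration; they are not the API of any RDF library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A node that can be extended with further triples."""
    name: str


@dataclass(frozen=True)
class Literal:
    """A plain value; it can only appear as the object of a triple."""
    value: str


class Graph:
    def __init__(self):
        self.triples = []

    def add(self, subject, predicate, obj):
        # Enforce the rule from the slide: literals cannot be subjects.
        if not isinstance(subject, Resource):
            raise TypeError("only a resource node can be the subject of a triple")
        self.triples.append((subject, predicate, obj))

    def objects(self, subject, predicate):
        return [o for s, p, o in self.triples if s == subject and p == predicate]


g = Graph()
book = Resource("this-book")
author = Resource("tim-berners-lee")
g.add(book, "Author", author)
g.add(author, "Name", Literal("Tim Berners-Lee"))
g.add(author, "Birth", Literal("1955"))
```

Because the author is modelled as a resource, the birth year hangs off the author node; trying `g.add(Literal("Tim Berners-Lee"), "Birth", Literal("1955"))` raises `TypeError`.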

Slide 39

Slide 39 text

Advantages of RDF data • Formalized as a simple triple data model • Highly extensible by linking as a graph (network), and therefore highly expressive • It does not matter who writes the data or where • Uses only URI identifiers → Can be extended by combining RDF data described separately elsewhere • Distribute RDF data further on the Web → Turn a web space composed of hypertext documents into a web space with linked data descriptions → The Linked Data framework, proposed by Tim Berners-Lee

Slide 40

Slide 40 text

The role of URIs in RDF resources • In the RDF data model, every resource is identified by assigning it a URI • It is important to assign an appropriate URI to a resource • Properties (predicates) are also identified by URIs: there is no bare “title” property; the property is actually identified by the URI http://purl.org/dc/terms/title • Since URIs tend to be long, they are written with short names (prefixes) for convenience: http://purl.org/dc/terms/title → dc:title • Here the URI http://purl.org/dc/terms/ is shortened to the prefix “dc:”
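Prefix shortening and expansion can be sketched as simple string operations. This is a minimal sketch: `PREFIXES`, `expand`, and `shorten` are hypothetical names, and only the `dc:` prefix given in the slide is registered.

```python
# Prefix table; only the "dc" namespace from the slide is assumed here.
PREFIXES = {"dc": "http://purl.org/dc/terms/"}


def expand(short_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'dc:title' into its full URI."""
    prefix, sep, local = short_name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {short_name!r}")
    return prefixes[prefix] + local


def shorten(uri, prefixes=PREFIXES):
    """Shorten a full URI to a prefixed name when a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri
```

The two functions are inverses for registered namespaces: `expand("dc:title")` gives `http://purl.org/dc/terms/title`, and `shorten` maps it back.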

Slide 41

Slide 41 text

Example of RDF data • The data representation for the following information: the title of a resource (URI) is “Home page of Masao Takaku”, and its creator’s name is “Masao Takaku”. Graph: https://masao.jpn.org → (dc:title) → “Home page of Masao Takaku”; https://masao.jpn.org → (dc:creator) → [creator node]; [creator node] → (foaf:name) → “Masao Takaku”; [creator node] → (foaf:mbox) → mailto:[email protected]

Slide 42

Slide 42 text

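Example of RDF data with RDF/Turtle format
@prefix dc: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<https://masao.jpn.org> dc:title "Home page of Masao Takaku" ;
    dc:creator [ foaf:name "Masao Takaku" ; foaf:mbox <mailto:[email protected]> ] .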

Slide 43

Slide 43 text

For reference: URI (Uniform Resource Identifier) • Works as an address that points to resources on the Web: if you type it into the browser's address field, you will reach that resource • Since each web server has its own separate address space, a URI can also be used as a simple identifier • Example: http://klis.tsukuba.ac.jp/school_affairs.html → Access scheme: http; Server address: klis.tsukuba.ac.jp; Location within the server: /school_affairs.html
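The three parts of the example URI can be extracted with Python's standard `urllib.parse.urlsplit`. The function name `describe_uri` and the dictionary keys are hypothetical labels chosen to match the slide's wording.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Split a URI into the three parts named on the slide."""
    parts = urlsplit(uri)
    return {
        "access_scheme": parts.scheme,          # e.g. "http"
        "server_address": parts.netloc,         # host name (and port, if any)
        "location_within_server": parts.path,   # path on that server
    }


info = describe_uri("http://klis.tsukuba.ac.jp/school_affairs.html")
```

Because the server address is part of the URI, two servers can use the same path without clashing, which is why a URI works as a globally unique identifier.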

Slide 44

Slide 44 text

LINKED DATA

Slide 45

Slide 45 text

What is Linked Data? • A proposal to make it easy to create applications for each individual area with a simple data model • Structuring information on individual resources → It is fine to start from wherever possible → Add links (properties) one by one • Data model → Uses the RDF data model (triples) → Data types are resources and literals • Resources act as identifiers (URIs) with addresses on the Web • Transforms the current “Web of Documents” into a “Web of Data”

Slide 46

Slide 46 text

Linked Data Principles 1. Use URIs as names for things. 2. Use HTTP URIs so that people can look up those names. 3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL). 4. Include links to other URIs, so that they can discover more things. https://www.w3.org/DesignIssues/LinkedData.html

Slide 47

Slide 47 text

[Related topic] Cool URIs don't change • Designing URIs that will not change: URIs for 2 years, 20 years, and 200 years from now? • What to leave out: Technical issues (filename suffixes; software mechanisms; drive names); Document management (author's name; topics; status; access permissions) https://www.w3.org/Provider/Style/URI.html

Slide 48

Slide 48 text

Examples of Linked Open Data • Web search engines: entity search and rich snippets • LOD dataset providers: DBpedia: http://ja.dbpedia.org/ ; NDL Web Authorities: https://id.ndl.go.jp/auth/ndla ; CiNii Articles: http://ci.nii.ac.jp/ ; LC Linked Data: http://id.loc.gov/

Slide 49

Slide 49 text

Use of Linked Data: Entity search https://www.google.co.jp/search?q=嘉納治五郎

Slide 50

Slide 50 text

Use of Linked Data: Rich snippets https://www.google.co.jp/search?q=京王プラザホテル

Slide 51

Slide 51 text

Use of Linked Data: Rich snippets https://www.google.com/search?q=ダイワロイネットホテルつくば

Slide 52

Slide 52 text

Metadata vocabulary for the Web: Schema.org • A vocabulary for simple metadata descriptions of various types of things, for use by web search engines • Proposed and maintained by major search engine companies (Google, Microsoft, Yahoo!, etc.) • Used for rich snippets on search results pages • Examples: https://schema.org/Hotel , https://schema.org/Book , etc.

Slide 53

Slide 53 text

Example of Schema.org metadata embedded in web pages [The embedded markup was not preserved in this transcript; the only surviving values are 4.2 and 1,614.] https://www.daiwaroynet.jp/tsukuba/
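Since the slide's markup did not survive, the following is only a sketch of what Schema.org metadata for a hotel page could look like, generated as JSON-LD from Python. Several things here are assumptions and not taken from the original page: the choice of JSON-LD (rather than microdata or RDFa), the hotel name string, and the reading of 4.2 as a rating value and 1,614 as a review count. `Hotel`, `AggregateRating`, `ratingValue`, and `reviewCount` are existing Schema.org terms; the helper names are hypothetical.

```python
import json


def hotel_jsonld(name, url, rating_value, review_count):
    """Build a Schema.org Hotel description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Hotel",
        "name": name,
        "url": url,
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating_value,   # assumed meaning of "4.2"
            "reviewCount": review_count,   # assumed meaning of "1,614"
        },
    }


def to_script_tag(data):
    """Wrap JSON-LD in the script element used to embed it in an HTML page."""
    body = json.dumps(data, ensure_ascii=False, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical values for illustration only.
hotel = hotel_jsonld(
    name="Daiwa Roynet Hotel Tsukuba",
    url="https://www.daiwaroynet.jp/tsukuba/",
    rating_value=4.2,
    review_count=1614,
)
snippet = to_script_tag(hotel)
```

A search engine that reads such markup can show the rating and review count directly in a rich snippet, without having to guess them from the page text.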

Slide 54

Slide 54 text

Examples of Linked Data datasets: DBpedia • Example: http://ja.dbpedia.org/page/つくば市 • Structured data is extracted and integrated from Wikipedia, the free encyclopedia: http://mappings.dbpedia.org/index.php/Mapping_ja

Slide 55

Slide 55 text

https://ja.wikipedia.org/wiki/つくば市

Slide 56

Slide 56 text

http://ja.dbpedia.org/page/つくば市

Slide 57

Slide 57 text

Linked Open Data Cloud http://lod-cloud.net/ (as of March 2019) • The number of datasets: 1,239 • LOD datasets around the world → Cross-domain, Geography, Government, Life sciences, Linguistics, Media, Publications, Social networking, User generated → (From Japan) NDL Web Authorities, Textbook LOD

Slide 58

Slide 58 text

Summary (keywords) • Information organization: description and classification; user tasks (identify, find, access) • Web: HTTP, URI, and HTML • Semantic Web: RDF data model, triples, URIs • Linked Data: URI resources, Schema.org, datasets