
Information Organization in the Web Age

Masao Takaku
February 14, 2020

TSUKUBA Short-term Study Program (TSSP) 2020
February 14th, 2020
University of Tsukuba, Japan

Lecture by Masao Takaku


Transcript

  1. Me? • Masao Takaku (高久 雅生; たかく まさお) • Research

    interests Information retrieval, information seeking behaviour Digital library Linked Open Data (LOD) • Contact: Email: [email protected] Twitter: @tmasao 2
  2. My research area? 3 Information System Contents Document collections User

    & Community Information Needs My main research focus is to understand these elements and the relationships among them. Organization
  3. Contents • Introduction • What is Information Organization? • What

    is Web? • Web & Information Organization • Discussions 4
  4. Let’s start with the conclusion... • Information organization does… make

    the target information resources findable and understandable complement the human embodiments and subjectivity • It depends on the genre of information needs and resources add “Value-added information” • Second order, third order, and N-th order information • User tasks: Identification, Find, and Access Methodology: Description (record) and Classification
  5. What is information organization? • Large amounts of information should

    be organized well to make it easier to find and understand - Group the common/similar items together - Describe items with the common attributes and structure - Explain the common properties with the same name In general, “information organization” covers the following: • Describe and extract the common structured information (metadata) as the record, and make it searchable. - Cataloguing and description • Analyze the contents based on certain criteria, assign labels, and group information resources with common contents together. - Subject analysis, classification, subject headings, and indexing
  6. Ex. Description and classification 9 Tent Rocket Rabbit Grape Gorilla

    Desk Paper airplane Pants Performer Motorbike Plumber Apple Pencil Tuba One piece
  7. Ex. Description and classification 10 Tent Rocket Rabbit Grape Gorilla

    Desk Paper airplane Pants Performer Motorbike Plumber Apple Pencil Tuba One piece
  8. Ex. Description and classification (Ordering) 11 Tent Rocket Rabbit Grape

    Gorilla Desk Paper airplane Pants Motorbike Apple Pencil Tuba One piece Plumber Performer
  9. Ex. Organize with attributes and values Class Item (value) Animal

    Rabbit, Gorilla Human Plumber, Performer Fruit Apple, Grape Vehicle Motorbike, Rocket Tool One piece, Pants Tent, Desk, Tuba, Pencil, Paper airplane 12 Note that we may need more documentation on the classification scheme, if we need to classify more samples and more precisely.
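The class/item table on this slide can be sketched as a small grouping routine. This is only an illustration: it uses the four rows whose class label is unambiguous in the transcript (Animal, Human, Fruit, Vehicle), and the names `LABELLED_ITEMS` and `group_by_class` are made up for the example.

```python
from collections import defaultdict

# (item, class) pairs taken from the unambiguous rows of the slide's table.
LABELLED_ITEMS = [
    ("Rabbit", "Animal"),
    ("Gorilla", "Animal"),
    ("Plumber", "Human"),
    ("Performer", "Human"),
    ("Apple", "Fruit"),
    ("Grape", "Fruit"),
    ("Motorbike", "Vehicle"),
    ("Rocket", "Vehicle"),
]


def group_by_class(pairs):
    """Group items under their class label, keeping the input order."""
    groups = defaultdict(list)
    for item, label in pairs:
        groups[label].append(item)
    return dict(groups)


classified = group_by_class(LABELLED_ITEMS)
for label, items in classified.items():
    print(f"{label}: {', '.join(items)}")
```

Classifying a new sample means deciding on its label first, which is where the documentation of the classification scheme mentioned in the note becomes necessary.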
  10. Information organization in the context of information seeking 13 Information

    System Contents Document collections User & Community Information Needs Organization
  11. Information organization in the context of information seeking (cont.) •

    Information organization helps users (& user community) to do the following tasks: 1. Identification 2. Find (subject search, content analysis) 3. Access (acquisition, referring to the location of the item)
  12. What are user tasks? • Find task - Use any

    keyword or category to browse through and discover what the content or subject matter is. • In traditional information retrieval research, find tasks are divided into “subject search” and “known item search” • Identification task - Distinguish one thing from another. - The unit of identification (the "difference in things") varies depending on the area and use, such as clearly distinguishing different versions or editions that have the same title • Access task - Check the location of the item, get it, and/or access the resources on the network. Note that every task is often conducted without the actual material, because we usually use surrogates (described metadata).
  13. In the context of description (cont.) 17 田中宏和.com | 田中宏和宣言!!

    http://www.tanakahirokazu.com/ 田中宏和. 田中宏和さん. リーダーズノート, 2010, 192p.
  14. World Wide Web • WWW (World Wide Web) Or just

    “Web” • 【web】 (noun) A network of silken thread spun especially by the larvae of various insects (as a tent caterpillar) and usually serving as a nest or shelter. https://commons.wikimedia.org/wiki/File:Spider_web_Belgium_Luc_Viatour.jpg
  15. Three elements of the Web • HTTP, URI and HTML

    are the three main components of the Web. • HTTP specifies the data transmission on the network and the type of the document format. • URI specifies the address of web pages, and it enables the hyperlinks among them on the network.
  16. CERN • International research institute of high energy physics in

    Europe. • Big science using High-speed accelerator Material science, Particle physics, etc. • Large amounts of device information • Massive needs for documenting and sharing information Employee: about 2,500 Visiting scholars: about 15,000 22
  17. Collaboration by many scientists around the world! ATLAS Collaboration: “Dynamics

    of isolated-photon plus jet production in pp collisions at √s = 7 TeV with the ATLAS detector”. Nuclear Physics B, 875, 438-533 (2013) The number of authors: over 5,800
  18. Brief history of Web • 1989 – 1991: Proposed (design

    and establishing the specifications) • 1992 – 1993: Became popular gradually… • 1993 – 1994: Gained popularity exponentially (Mosaic, Netscape, Yahoo!) • 1994 – 1995: Gained popularity in society (Windows 95, Amazon, …)
  19. Very beginning of Web 25 Screenshot of the original NeXT

    web browser in 1993 http://info.cern.ch/
  20. [side story] Hypermedia and hypertext to the Web The rise

    and spread of the Web, its conflict • The concept “Hypermedia” coined and spread Memex (Vannevar Bush) - 1945 Xanadu (Ted Nelson) - 1963? WWW (Tim Berners-Lee) – 1989 • What the Web has lost Integration of browsing and editing Version control Diverse and extensible hyperlinks Copyright management & Micro payment 26 Tim Berners-Lee: “Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web”. Harper Business, 2000, 256p.
  21. Semantic Web (1) Tim Berners-Lee, James Hendler, Ora Lassila. The

    Semantic Web. Scientific American, 2001, Vol.284, No.5, pp.35-43. • From Web to “Semantic Web” • Web markups to enable semantic description and machine understanding
  22. Semantic Web application (1) • Example: “I want to find

    a dentist I can stop by after work” After work: week days 9:00-18:00 Stop by after work: Tsukuba Express Line (TX) • Having a consultation after 18:00 • Stations along the TX line: Tsukuba, Kenkyu-gakuen, …, Minami-nagareyama, Kita-senju, … • Within a 500-minute walk from the station • (Personal assistant / agent)
  23. Semantic Web application (2) • Disambiguation 月=月曜日 = Monday =

    Mon. “9:00-13:00 ; 15:00-19:00” Closed days, Medical hours Holidays, public holidays, open all year round • Understanding common sense One week = Mon, Tue, Wed, Thu, Fri, Sat, Sun Week days = Monday to Friday • Information extraction from the Web markups 31
  24. Semantic Web components Identifier: URI Character set: Unicode Notation: XML Data

    exchange: RDF Vocabulary: RDFS Ontology: OWL Rule: RIF/SWRL Search: SPARQL Digital Signature Logic Reasoning Trust User Interface / Application
  25. Issues of Semantic Web • Decentralization + massive nature of

    the Web Huge amounts of web spaces Big data with various concepts and descriptions can be obtained Diverse information provider Multi-languages and multi-culture Cannot assume strict use of controlled vocabularies and custom conventions • Difficulty of general-purpose model It is difficult for computer applications to understand the meaning of things 33
  26. RDF data model • RDF (Resource Description Framework) • Graph-based

    data model Directed graph with labels Triple representation • Feature Simple and highly expressive data model Writing rules tend to be complicated Processing operation takes time Resource (node) = URI (Uniform Resource Identifier) • Inherit the decentralized features of the web 34 J.K. Rowling Harry potter Author
  27. Description with triples (1) • Consider the book itself as

    a “subject” resource, and build triples by its attributes and values • { subject, predicate, object } ⇒ { this book, property, value } 35 Property Value Title Weaving the Web Author Tim Berners-Lee Publisher Harper Business
  28. Description with triples (2) Graphical representation of triples This book

    Weaving the Web Title This book Tim Berners-Lee Author This book Harper Business Publisher
  29. Description with triples (3) Aggregates the same resource as a

    single resource This book Weaving the Web Title Tim Berners-Lee Author Harper Business Publisher
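As a rough sketch of slides 27 to 29, the three statements about the book can be held as plain (subject, predicate, object) tuples and then merged under their shared subject. The string "this book" stands in for a real identifier, and `aggregate` is a name invented for this example.

```python
from collections import defaultdict

# One tuple per statement: (subject, predicate, object).
triples = [
    ("this book", "Title", "Weaving the Web"),
    ("this book", "Author", "Tim Berners-Lee"),
    ("this book", "Publisher", "Harper Business"),
]


def aggregate(statements):
    """Merge triples that share a subject into one node with its properties."""
    graph = defaultdict(dict)
    for subject, predicate, obj in statements:
        # A predicate may carry several values, so each one holds a list.
        graph[subject].setdefault(predicate, []).append(obj)
    return dict(graph)


book_graph = aggregate(triples)
for predicate, values in book_graph["this book"].items():
    print(predicate, "->", values)
```

The three separate arrows of slide 28 become one node with three outgoing properties, which is the aggregated picture of slide 29.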
  30. Description with triples (4) • “Literal values” cannot be extended

    to other resources Only the “resource” node is possible to become a subject of a triple • In this example, if the author is a “resource” node, another triple can be connected This book Tim Berners-Lee Author Birth 1955 This book Author Birth 1955 Tim Berners-Lee Tim Berners-Lee Name
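The difference between a literal value and a resource node can be sketched with two small types. This is an illustration under assumptions: the `example.org` URIs are hypothetical placeholders, and `make_triple` is a helper written for this example, not part of any RDF library.

```python
from typing import NamedTuple, Union


class Resource(NamedTuple):
    """A node identified by a URI; it may appear as a subject."""
    uri: str


class Literal(NamedTuple):
    """A plain value; it can only appear as an object."""
    value: str


class Triple(NamedTuple):
    subject: Resource
    predicate: str
    obj: Union[Resource, Literal]


def make_triple(subject, predicate, obj):
    """Build a triple, rejecting literals in the subject position."""
    if not isinstance(subject, Resource):
        raise TypeError("only a resource node can be the subject of a triple")
    return Triple(subject, predicate, obj)


# Hypothetical URIs, used only for illustration.
book = Resource("http://example.org/book/weaving-the-web")
author = Resource("http://example.org/person/tim-berners-lee")

graph = [
    make_triple(book, "Author", author),
    make_triple(author, "Name", Literal("Tim Berners-Lee")),
    make_triple(author, "Birth", Literal("1955")),
]
print(len(graph), "triples")
```

Because the author is a resource here, the birth year attaches to the author node. Trying `make_triple(Literal("Tim Berners-Lee"), "Birth", Literal("1955"))` raises a `TypeError`, which mirrors the first diagram's dead end.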
  31. Advantages of RDF data • Formalized as a simple triple

    data model • Highly expandable by linking as a graph (network) (highly expressive) • Doesn't matter who writes and where • Uses only URI identifiers  Extend by combining RDF data separately described in another place • Distribute RDF data further on the Web  Turn a web space composed of hypertext documents into a web space with linked data descriptions.  → Linked Data framework, proposed by Tim Berners-Lee 39
  32. The role of URI in RDF resources • In the

    RDF data model, all the resources are identified by assigning a URI • It is important to assign an appropriate URI to a resource • Properties (predicates) are also identified by assigning a URI There is no bare “title” property; the property is actually identified by the URI http://purl.org/dc/terms/title Since URIs tend to be long, they are written with short names (prefixes) for convenience http://purl.org/dc/terms/title → dc:title • The URI http://purl.org/dc/terms/ is shortened with the prefix “dc:”
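The prefix mechanism is plain string substitution, as this small sketch suggests. The `dc:` and `foaf:` namespaces are the ones used in this lecture; the function names `expand` and `shorten` are chosen for the example.

```python
# Prefix table: short name -> namespace URI.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(prefixed_name):
    """Turn 'dc:title' into the full property URI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local_name


def shorten(uri):
    """Turn a full URI back into its prefixed form when a prefix matches."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no known prefix: leave the URI unchanged


print(expand("dc:title"))
print(shorten("http://xmlns.com/foaf/0.1/name"))
```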
  33. Example of RDF data • The data representation for the

    following information: The title of a resource (URI) is “Home page of Masao Takaku”, its creator’s name is “Masao Takaku”. 41 https://masao.jpn.org Masao Takaku dc:creator foaf:name mailto:[email protected] foaf:mbox Home page of Masao Takaku dc:title
  34. Example of RDF data in RDF/Turtle format

    @prefix dc: <http://purl.org/dc/terms/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <https://masao.jpn.org/> dc:title "Home page of Masao Takaku" ;
        dc:creator [ foaf:name "Masao Takaku" ; foaf:mbox <mailto:[email protected]> ] .
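To make the abbreviated Turtle syntax more concrete, the same description can be written out as four explicit triples, one per line, in the style of N-Triples. This is a hand-rolled sketch rather than a real RDF serializer. The blank node label `_:creator` and the address `mailto:someone@example.org` are placeholders assumed for the example, since the real address is masked in the transcript.

```python
DC = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

page = "<https://masao.jpn.org/>"
creator = "_:creator"  # blank node standing in for the [ ... ] in Turtle

triples = [
    (page, f"<{DC}title>", '"Home page of Masao Takaku"'),
    (page, f"<{DC}creator>", creator),
    (creator, f"<{FOAF}name>", '"Masao Takaku"'),
    # Placeholder mailbox, not the address from the slide.
    (creator, f"<{FOAF}mbox>", "<mailto:someone@example.org>"),
]


def to_ntriples(statements):
    """Write each triple on its own line, terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in statements)


output = to_ntriples(triples)
print(output)
```

The square brackets in Turtle hide the intermediate node. Written out this way, the creator is visibly a node of its own that carries the name and the mailbox.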
  35. For the reference: URI (Uniform Resource Identifier) • Works as

    an address that points to resources on the Web If you type it into the browser's address field, you will reach that resource Since it has a separate address space for each web server, it can be used as a simple identifier http://klis.tsukuba.ac.jp/school_affairs.html Server address Location within the server Access scheme
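The three parts labelled on this slide can be pulled out of the example URI with Python's standard `urllib.parse` module. This is a small illustration only; the variable names are chosen to match the slide's labels.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://klis.tsukuba.ac.jp/school_affairs.html")

access_scheme = parts.scheme             # how to access the resource
server_address = parts.netloc            # which server holds it
location_within_server = parts.path      # where it lives on that server

print(access_scheme)
print(server_address)
print(location_within_server)
```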
  36. What is Linked Data? • A proposal to make it

    easy to create applications for each individual area with a simple data model • Structuring information on individual resources - It is fine to start wherever possible - Add links (properties) one by one • Data models - Uses RDF data model = Triples - Data types are resources and literals • Resources act as identifiers (URIs) with addresses on the web • Transforms the current Web of documents into a “Web of Data”
  37. Linked Data Principles 1. Use URIs as names for things

    2. Use HTTP URIs so that people can look up those names. 3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL) 4. Include links to other URIs, so that they can discover more things. https://www.w3.org/DesignIssues/LinkedData.html
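Principles 2 and 3 amount to an ordinary HTTP request that asks for RDF instead of HTML. The sketch below only builds such a request and does not send it. The URI `http://example.org/id/tsukuba` is a hypothetical placeholder, and whether a given server honours the `Accept` header is an assumption that has to be checked per dataset.

```python
from urllib.request import Request


def build_lookup_request(uri, media_type="text/turtle"):
    """Prepare a GET request that asks the server for an RDF serialization."""
    return Request(uri, headers={"Accept": media_type})


# Hypothetical URI naming a thing (principle 1), using HTTP (principle 2).
request = build_lookup_request("http://example.org/id/tsukuba")

print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual lookup. Any links found in the returned RDF can then be followed in the same way, which is principle 4.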
  38. [Related topic] Cool URIs don't change • Designing URIs which

    will not change URIs for 2 years, 20 years and 200 years later? • What to leave out Technical issues • Filename (suffixes) • Software (mechanisms) • Drive names Document management • Authors name • Topics • Status • Access permissions 47 https://www.w3.org/Provider/Style/URI.html
  39. Examples of Linked Open Data • Web search engines Entity

    search and rich snippets • LOD dataset providers DBpedia : http://ja.dbpedia.org/ NDL Web Authorities : https://id.ndl.go.jp/auth/ndla CiNii Articles : http://ci.nii.ac.jp/ LC Linked Data : http://id.loc.gov/ 48
  40. Metadata vocabulary for the Web: Schema.org • Vocabulary for simple

    metadata description of various types of things for use by web search engines • Proposed and maintained by major search engine companies, Google, Microsoft, Yahoo!, etc. • Used in rich snippets at search results pages https://schema.org/Hotel https://schema.org/Book etc. 52
  41. Example of Schema.org metadata embedded in web pages

    <div itemscope itemtype="http://schema.org/LodgingBusiness">
      <meta itemprop="name" content="Daiwa Roynet Hotel Tsukuba"/>
      <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
        <meta itemprop="addressCountry" content="JP"/>
        <meta itemprop="addressLocality" content="Tsukuba"/>
        <meta itemprop="addressRegion" content="Ibaraki Prefecture"/>
        <meta itemprop="streetAddress" content="1-5-7 AzumaTsukuba-shi"/>
        <meta itemprop="postalCode" content="305-0031"/></div>
      <div itemscope itemprop="aggregateRating" itemtype="http://schema.org/AggregateRating">
        <div class="rating-score" itemprop="ratingValue">4.2</div>
        <span itemprop="reviewCount" content="1614">1,614</span>
        <meta itemprop="bestRating" content="5"/>
        <meta itemprop="worstRating" content="0"/></div>
    </div>
    https://www.daiwaroynet.jp/tsukuba/
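A consumer such as a search engine reads these `itemprop` attributes back out of the page. The sketch below does a much simplified version with Python's standard `html.parser`: it collects only properties that carry a `content` attribute and ignores nesting, so it is not a full microdata parser. The class name `ItemPropCollector` and the shortened `SAMPLE` markup are made up for the example.

```python
from html.parser import HTMLParser


class ItemPropCollector(HTMLParser):
    """Collect (itemprop, content) pairs from microdata markup."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Only elements that give their value in a content attribute are kept.
        if "itemprop" in attributes and "content" in attributes:
            self.properties.append(
                (attributes["itemprop"], attributes["content"])
            )


# Shortened excerpt of the markup shown on the slide.
SAMPLE = """
<div itemscope itemtype="http://schema.org/LodgingBusiness">
  <meta itemprop="name" content="Daiwa Roynet Hotel Tsukuba"/>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <meta itemprop="addressCountry" content="JP"/>
    <meta itemprop="addressLocality" content="Tsukuba"/>
    <meta itemprop="postalCode" content="305-0031"/>
  </div>
</div>
"""

collector = ItemPropCollector()
collector.feed(SAMPLE)
for name, value in collector.properties:
    print(f"{name}: {value}")
```

Each collected pair is in effect a property and value of the hotel or of its address, which is the same attribute-and-value description used for triples earlier in the lecture.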
  42. Examples of Linked Data dataset: DBpedia • Example: http://ja.dbpedia.org/page/つくば市 •

    Structured data is extracted and integrated from Wikipedia, the free encyclopedia http://mappings.dbpedia.org/index.php/Mapping_ja
  43. Linked Open Data Cloud http://lod-cloud.net/ (as of March 2019)

    • The number of datasets: 1,239 • LOD datasets around the world - Cross-domain, Geography, Government, Life sciences, Linguistics, Media, Publications, Social networking, User generated - (From Japan) NDL Web Authorities, Textbook LOD
  44. Summary (keywords) • Information organization Description and classification User tasks:

    identify, find, access • Web HTTP, URI, and HTML • Semantic Web RDF data model, triples, URIs • Linked Data URI resources, Schema.org, datasets 58