Information Organization in the Web Age

Masao Takaku
February 14, 2020

TSUKUBA Short-term Study Program (TSSP) 2020
February 14th, 2020
University of Tsukuba, Japan

Lecture by Masao Takaku

Transcript

  1. Information Organization
    in the Web Age
    ウェブ時代の情報組織化
    Masao Takaku (高久雅生)
    [email protected]
    Friday, February 14, 2020
    TSUKUBA Short-term Study Program (TSSP) 2020

  2. Me?
    • Masao Takaku (高久 雅生; たかく まさお)
    • Research interests
    Information retrieval, information seeking
    behaviour
    Digital library
    Linked Open Data (LOD)
    • Contact:
    Email: [email protected]
    Twitter: @tmasao

  3. My research area?
    Diagram labels: Information System; Contents; Document collections; User & Community; Information Needs; Organization
    My main research focus is to understand these elements and
    the relationships among them.

  4. Contents
    • Introduction
    • What is Information Organization?
    • What is the Web?
    • Web & Information Organization
    • Discussions

  5. WHAT DOES “INFORMATION ORGANIZATION”
    REALLY MEAN?
    What is “organization”?

  6. Let’s start with the conclusion...
    • Information organization…
    makes the target information resources findable and
    understandable
    complements human embodiment and
    subjectivity
    • It depends on the genre of information needs and resources
    adds “value-added information”
    • Second-order, third-order, and N-th-order information
    • User tasks: Identification, Find, and Access
    Methodology: Description (record) and Classification

  7. What is information organization?
    • Large amounts of information should be organized well to
    make it easier to find and understand
    - Group common/similar items together
    - Describe items with common attributes and structure
    - Refer to common properties by the same name
    In general, “information organization” covers the following:
    • Describe and extract the common structured information
    (metadata) as a record, and make it searchable.
    → Cataloguing and description
    • Analyze the contents based on certain criteria, assign
    labels, and bring together information resources with common
    contents.
    → Subject analysis, classification, subject headings, and indexing

  8. Ex. Description and classification

  9. Ex. Description and classification
    Tent
    Rocket
    Rabbit
    Grape
    Gorilla
    Desk
    Paper airplane
    Pants
    Performer
    Motorbike
    Plumber
    Apple
    Pencil
    Tuba
    One piece

  10. Ex. Description and classification
    Tent
    Rocket
    Rabbit
    Grape
    Gorilla
    Desk
    Paper airplane
    Pants
    Performer
    Motorbike
    Plumber
    Apple
    Pencil
    Tuba
    One piece

  11. Ex. Description and classification
    (Ordering)
    Tent
    Rocket
    Rabbit
    Grape
    Gorilla
    Desk
    Paper airplane
    Pants
    Motorbike
    Apple
    Pencil
    Tuba
    One piece
    Plumber
    Performer

  12. Ex. Organize with attributes and values
    Class Item (value)
    Animal Rabbit, Gorilla
    Human Plumber, Performer
    Fruit Apple, Grape
    Vehicle Motorbike, Rocket
    Tool One piece, Pants
    Tent, Desk, Tuba, Pencil,
    Paper airplane
    Note that the classification scheme needs more documentation
    if we want to classify more samples, or to classify
    them more precisely.
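
    The class/item table above can be sketched as a small lookup structure. This is only an illustration, assuming the four rows whose class label is unambiguous in the table (Animal, Human, Fruit, Vehicle); the function name, the "Unclassified" fallback, and the extra item "Kite" are hypothetical and not from the lecture.

```python
from collections import defaultdict

# Class assignments copied from the first four rows of the table.
ITEM_CLASS = {
    "Rabbit": "Animal",
    "Gorilla": "Animal",
    "Plumber": "Human",
    "Performer": "Human",
    "Apple": "Fruit",
    "Grape": "Fruit",
    "Motorbike": "Vehicle",
    "Rocket": "Vehicle",
}


def group_by_class(items, item_class=ITEM_CLASS, fallback="Unclassified"):
    """Group items under their class label; unknown items go to `fallback`."""
    groups = defaultdict(list)
    for item in items:
        groups[item_class.get(item, fallback)].append(item)
    return dict(groups)


# "Kite" is a hypothetical item the scheme does not cover yet.
groups = group_by_class(["Rocket", "Rabbit", "Grape", "Gorilla", "Kite"])
print(groups)
```

    Items that fall into the fallback group are exactly the cases where the scheme needs further documentation before they can be classified consistently.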

  13. Information organization in the context
    of information seeking
    Diagram labels: Information System; Contents; Document collections; User & Community; Information Needs; Organization

  14. Information organization in the context
    of information seeking (cont.)
    • Information organization helps users (& user
    community) to do the following tasks:
    1. Identification
    2. Find (subject search, content analysis)
    3. Access (acquiring the item, or looking up
    its location)

  15. What are user tasks?
    • Find task
    - Use any keyword or category to browse through and discover
    what the content or subject matter is.
    • In traditional information retrieval research, find tasks are
    divided into “subject search” and “known item search”
    • Identification task
    - Distinguish one thing from another.
    - The unit of identification (what counts as a "different thing")
    varies by domain and use; for example, works with the same title
    may need to be clearly distinguished by edition or version
    • Access task
    - Check the location of the item, get it, and/or access the
    resource on the network.
    Note that all of these tasks are often carried out without the actual
    material, because we usually use surrogates (descriptive metadata).

  16. In the context of description

  17. In the context of description (cont.)
    田中宏和.com | 田中宏和宣言!!
    http://www.tanakahirokazu.com/
    田中宏和. 田中宏和さん. リーダーズノート, 2010, 192p.

  18. WORLD WIDE WEB (WWW)
    What is the Web?

  19. World Wide Web
    • WWW (World Wide Web)
    Or just “Web”
    • 【web】 (noun)
    A network of silken thread spun especially by the
    larvae of various insects (as a tent caterpillar) and
    usually serving as a nest or shelter.
    https://commons.wikimedia.org/wiki/File:Spider_web_Belgium_Luc_Viatour.jpg

  20. Three elements of the Web
    • HTTP, URI, and HTML are the three main
    components of the Web.
    • HTTP specifies data transmission on the
    network and the type of the document format.
    • URI specifies the address of web pages, and it
    enables hyperlinks among them on the
    network.

  21. 21
    Knight Foundation (2008) http://www.flickr.com/photos/knightfoundation/2467553359/

  22. CERN
    • An international research institute for high-energy
    physics in Europe.
    • Big science using high-speed accelerators
    Materials science, particle physics, etc.
    • Large amounts of device information
    • Massive needs for documenting and sharing
    information
    Employees: about 2,500
    Visiting scholars: about 15,000

  23. Collaboration by many scientists around the world!
    ATLAS Collaboration: “Dynamics of isolated-photon
    plus jet production in pp collisions at √s = 7
    TeV with the ATLAS detector”. Nuclear Physics B,
    875, 438-533 (2013)
    Number of authors: over 5,800

  24. Brief history of the Web
    • 1989 – 1991: Proposed (designing and
    establishing the specifications)
    • 1992 – 1993: Gradually became popular…
    • 1993 – 1994: Gained popularity exponentially
    Mosaic, Netscape, Yahoo!
    • 1994 – 1995: Gained popularity in society
    Windows 95, Amazon, …

  25. The very beginning of the Web
    Screenshot of the original NeXT web browser in 1993
    http://info.cern.ch/

  26. [side story] Hypermedia and hypertext to the Web
    The rise and spread of the Web, and its conflicts
    • The concept of “hypermedia” was coined and spread
    Memex (Vannevar Bush) - 1945
    Xanadu (Ted Nelson) - 1963?
    WWW (Tim Berners-Lee) – 1989
    • What the Web has lost
    Integration of browsing and editing
    Version control
    Diverse and extensible hyperlinks
    Copyright management & micropayments
    Tim Berners-Lee: “Weaving the Web: The Original
    Design and Ultimate Destiny of the World Wide Web”.
    Harper Business, 2000, 256p.

  27. Memex by Vannevar Bush (1945)

  28. SEMANTIC WEB
    The world of the Semantic Web

  29. Semantic Web (1)
    Tim Berners-Lee, James Hendler, Ora Lassila.
    The Semantic Web. Scientific American, 2001,
    Vol.284, No.5, pp.35-43.
    • From Web to “Semantic Web”
    • Web markup to enable
    semantic description and
    machine understanding

  30. Semantic Web application (1)
    • Example: “I want to find a dentist I can stop by after
    work”
    After work: weekdays 9:00-18:00
    Stop by after work: Tsukuba Express Line (TX)
    • Open for consultation after 18:00
    • Stations along the TX line: Tsukuba, Kenkyu-gakuen, …,
    Minami-nagareyama, Kita-senju, …
    • Within a 500-meter walk from the station
    • (Personal assistant / agent)

  31. Semantic Web application (2)
    • Disambiguation
    月=月曜日 = Monday = Mon.
    “9:00-13:00 ; 15:00-19:00”
    Closed days, consultation hours
    Holidays, public holidays,
    open all year round
    • Understanding common sense
    One week = Mon, Tue, Wed, Thu, Fri,
    Sat, Sun
    Weekdays = Monday to Friday
    • Information extraction from Web
    markup
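
    The disambiguation steps above can be sketched in a few lines. This is a minimal illustration, not a description of any real Semantic Web agent: the alias table, the function names, and the choice of "Mon" as the canonical label are assumptions made for the example, and only the opening-hours string is taken from the slide.

```python
# Different surface forms that all denote the same day (from the slide).
DAY_ALIASES = {
    "月": "Mon",
    "月曜日": "Mon",
    "Monday": "Mon",
    "Mon.": "Mon",
    "Mon": "Mon",
}

# "Common sense" that a program has to be told explicitly.
WEEK = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
WEEKDAYS = WEEK[:5]  # Monday to Friday


def normalize_day(label):
    """Map a day label in any known notation to its canonical form."""
    return DAY_ALIASES[label.strip()]


def to_minutes(hhmm):
    """Convert 'H:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def parse_hours(text):
    """Parse '9:00-13:00 ; 15:00-19:00' into a list of (start, end) minutes."""
    ranges = []
    for part in text.split(";"):
        start, end = part.strip().split("-")
        ranges.append((to_minutes(start), to_minutes(end)))
    return ranges


def open_after(ranges, hhmm):
    """True if any opening range extends past the given time."""
    limit = to_minutes(hhmm)
    return any(end > limit for _, end in ranges)


hours = parse_hours("9:00-13:00 ; 15:00-19:00")
print(normalize_day("月曜日"), hours, open_after(hours, "18:00"))
```

    Every table and rule here had to be written by hand, which is the gap that machine-readable markup on the Web is meant to close.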

  32. Semantic Web components
    Identifier: URI
    Character set: Unicode
    Notation: XML
    Data exchange: RDF
    Vocabulary: RDFS
    Ontology: OWL
    Rule: RIF/SWRL
    Search: SPARQL
    Digital Signature
    Logic
    Reasoning
    Trust
    User Interface / Application

  33. Issues of Semantic Web
    • Decentralization + massive nature of the Web
    A huge amount of web space
    Big data with various concepts and descriptions can
    be obtained
    Diverse information providers
    Multiple languages and cultures
    Cannot assume strict use of controlled vocabularies
    and custom conventions
    • Difficulty of a general-purpose model
    It is difficult for computer applications to
    understand the meaning of things

  34. RDF data model
    • RDF (Resource Description Framework)
    • Graph-based data model
    Directed graph with labels
    Triple representation
    • Features
    Simple and highly expressive data model
    Writing rules tend to be complicated
    Processing takes time
    Resource (node) = URI (Uniform Resource Identifier)
    • Inherits the decentralized features of the web
    Example triple: Harry Potter → Author → J.K. Rowling

  35. Description with triples (1)
    • Consider the book itself as the “subject” resource, and
    build triples from its attributes and values
    • { subject, predicate, object }
    ⇒ { this book, property, value }
    Property | Value
    Title | Weaving the Web
    Author | Tim Berners-Lee
    Publisher | Harper Business
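
    As a rough sketch, the property/value table above turns into triples like this. The `Triple` type and the placeholder subject string "this book" are assumptions for illustration; real RDF would use a URI for the subject.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One { subject, predicate, object } statement."""
    subject: str
    predicate: str
    obj: str


BOOK = "this book"  # placeholder subject; a real description would use a URI

# Rows of the property/value table.
rows = [
    ("Title", "Weaving the Web"),
    ("Author", "Tim Berners-Lee"),
    ("Publisher", "Harper Business"),
]

# Every row becomes one triple that shares the same subject.
triples = [Triple(BOOK, prop, value) for prop, value in rows]

for triple in triples:
    print(triple)
```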

  36. Description with triples (2)
    Graphical representation of triples
    This book → Title → Weaving the Web
    This book → Author → Tim Berners-Lee
    This book → Publisher → Harper Business

  37. Description with triples (3)
    Aggregate nodes for the same resource into a single
    node
    This book (single node)
    → Title → Weaving the Web
    → Author → Tim Berners-Lee
    → Publisher → Harper Business
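
    The aggregation step can be illustrated by grouping triples on their subject, so that one node carries all of its outgoing edges. This is a sketch with plain tuples and a hypothetical helper name, not an RDF library API.

```python
from collections import defaultdict

triples = [
    ("this book", "Title", "Weaving the Web"),
    ("this book", "Author", "Tim Berners-Lee"),
    ("this book", "Publisher", "Harper Business"),
]


def merge_by_subject(triples):
    """Collect all (predicate, object) pairs under a single node per subject."""
    nodes = defaultdict(dict)
    for subject, predicate, obj in triples:
        # A predicate may repeat (e.g. several authors), so keep a list.
        nodes[subject].setdefault(predicate, []).append(obj)
    return dict(nodes)


merged = merge_by_subject(triples)
print(merged)
```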

  38. Description with triples (4)
    • “Literal values” cannot be extended to other
    resources
    Only a “resource” node can become the
    subject of a triple
    • In this example, if the author is a “resource”
    node, another triple can be connected
    Author as a literal: This book → Author → “Tim Berners-Lee” (Birth → 1955 cannot be attached)
    Author as a resource: This book → Author → (Tim Berners-Lee)
    (Tim Berners-Lee) → Name → “Tim Berners-Lee”
    (Tim Berners-Lee) → Birth → 1955
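
    The difference between literal and resource nodes can be made concrete with a small sketch. The `Resource` and `Literal` classes, the `add_triple` helper, and the identifier strings are hypothetical names chosen for this example; they only mimic the rule that a literal cannot be the subject of a triple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A node that can be described further (stand-in for a URI)."""
    identifier: str


@dataclass(frozen=True)
class Literal:
    """A plain value; it cannot be the subject of a triple."""
    value: str


def add_triple(graph, subject, predicate, obj):
    """Append a triple, enforcing that the subject is a resource node."""
    if not isinstance(subject, Resource):
        raise TypeError("only a resource node can be the subject of a triple")
    graph.append((subject, predicate, obj))


graph = []
book = Resource("this-book")
author = Resource("tim-berners-lee")

# With the author as a resource, further triples can hang off it.
add_triple(graph, book, "Author", author)
add_triple(graph, author, "Name", Literal("Tim Berners-Lee"))
add_triple(graph, author, "Birth", Literal("1955"))

# With the author as a literal, the birth year has nowhere to attach.
try:
    add_triple(graph, Literal("Tim Berners-Lee"), "Birth", Literal("1955"))
    literal_rejected = False
except TypeError:
    literal_rejected = True

print(len(graph), literal_rejected)
```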

  39. Advantages of RDF data
    • Formalized as a simple triple data model
    • Highly extensible by linking as a graph (network) (highly
    expressive)
    • It doesn't matter who writes the data or where
    • Uses only URI identifiers
    - Extend by combining RDF data described separately in
    another place
    • Distribute RDF data further on the Web
    - Turn a web space composed of hypertext documents into a
    web space with linked data descriptions.
    → Linked Data framework, proposed by Tim Berners-Lee
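
    Because resources are identified by shared URIs, descriptions written in different places can be combined by a plain set union. The sketch below assumes two hypothetical sources and two hypothetical `example.org` URIs; the `dc:creator` and `foaf:name` property URIs are the real Dublin Core and FOAF terms.

```python
DC_TITLE = "http://purl.org/dc/terms/title"
DC_CREATOR = "http://purl.org/dc/terms/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical URIs, used only for this illustration.
BOOK = "http://example.org/book/weaving-the-web"
AUTHOR = "http://example.org/person/tim-berners-lee"

# Source A describes the book; source B describes the person.
graph_a = {
    (BOOK, DC_TITLE, "Weaving the Web"),
    (BOOK, DC_CREATOR, AUTHOR),
}
graph_b = {
    (AUTHOR, FOAF_NAME, "Tim Berners-Lee"),
    (BOOK, DC_CREATOR, AUTHOR),  # duplicate statements collapse in a set
}

merged = graph_a | graph_b


def creator_names(graph, book):
    """Follow book -> creator -> name across the graph."""
    creators = {o for s, p, o in graph if s == book and p == DC_CREATOR}
    return sorted(o for s, p, o in graph if s in creators and p == FOAF_NAME)


print(creator_names(graph_a, BOOK))  # the name is unknown to source A alone
print(creator_names(merged, BOOK))
```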

  40. The role of URI in RDF resources
    • In the RDF data model, all resources are
    identified by assigning a URI
    • It is important to assign an appropriate URI to a
    resource
    • Properties (predicates) are also identified by
    assigning a URI
    There is no bare “title” property; the property is actually
    identified by the URI http://purl.org/dc/terms/title
    Since URIs tend to be long, they are abbreviated with
    short names (prefixes) for convenience
    http://purl.org/dc/terms/title → dc:title
    • The URI http://purl.org/dc/terms/ is shortened to the
    prefix “dc:”
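
    Prefix handling is simple string substitution, as the following sketch shows. The function names `expand` and `shorten` are hypothetical; the `dc:` namespace is the one given on the slide.

```python
PREFIXES = {"dc": "http://purl.org/dc/terms/"}


def expand(short_name, prefixes=PREFIXES):
    """Turn a prefixed name such as 'dc:title' into its full URI."""
    prefix, sep, local = short_name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {short_name!r}")
    return prefixes[prefix] + local


def shorten(uri, prefixes=PREFIXES):
    """Turn a full URI into a prefixed name when a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return uri  # no matching prefix: leave the URI as it is


print(expand("dc:title"))
print(shorten("http://purl.org/dc/terms/title"))
```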

  41. Example of RDF data
    • The data representation for the following
    information: the title of a resource (URI) is
    “Home page of Masao Takaku”, and its creator’s
    name is “Masao Takaku”.
    https://masao.jpn.org → dc:title → “Home page of Masao Takaku”
    https://masao.jpn.org → dc:creator → (creator)
    (creator) → foaf:name → “Masao Takaku”
    (creator) → foaf:mbox → mailto:[email protected]

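  42. Example of RDF data in RDF/Turtle
    format
    @prefix dc: <http://purl.org/dc/terms/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    <https://masao.jpn.org>
      dc:title "Home page of Masao Takaku";
      dc:creator [
        foaf:name "Masao Takaku";
        foaf:mbox <mailto:[email protected]>
      ] .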

  43. For reference:
    URI (Uniform Resource Identifier)
    • Works as an address that points to resources on
    the Web
    If you type it into the browser address field, you will
    reach that resource
    Since each web server has a separate address space,
    it can be used as a simple identifier
    http://klis.tsukuba.ac.jp/school_affairs.html
    Access scheme: http
    Server address: klis.tsukuba.ac.jp
    Location within the server: /school_affairs.html
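
    The three parts of the example URI can be checked with Python's standard `urllib.parse` module; the variable names below are only for illustration.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://klis.tsukuba.ac.jp/school_affairs.html")

access_scheme = parts.scheme    # how to reach the resource
server_address = parts.netloc   # which server holds it
location = parts.path           # where it lives within that server

print(access_scheme, server_address, location)
```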

  44. LINKED DATA

  45. What is Linked Data?
    • A proposal to make it easy to create applications
    for each individual domain with a simple data model
    • Structuring information on individual resources
    - It is fine to start from wherever it is possible
    - Add links (properties) one by one
    • Data models
    - Uses the RDF data model = triples
    - Data types are resources and literals
    • Resources act as identifiers (URIs) with addresses on the web
    • Transforms the current Web of documents into a “Web of Data”

  46. Linked Data Principles
    1. Use URIs as names for things
    2. Use HTTP URIs so that people can look up
    those names.
    3. When someone looks up a URI, provide useful
    information, using the standards (RDF*,
    SPARQL)
    4. Include links to other URIs, so that they can
    discover more things.
    https://www.w3.org/DesignIssues/LinkedData.html

  47. [Related topic] Cool URIs don't change
    • Designing URIs which will not change
    URIs for 2 years, 20 years, and 200 years from now?
    • What to leave out
    Technical issues
    • Filename (suffixes)
    • Software (mechanisms)
    • Drive names
    Document management
    • Author's name
    • Topics
    • Status
    • Access permissions
    https://www.w3.org/Provider/Style/URI.html

  48. Examples of Linked Open Data
    • Web search engines
    Entity search and rich snippets
    • LOD dataset providers
    DBpedia: http://ja.dbpedia.org/
    NDL Web Authorities: https://id.ndl.go.jp/auth/ndla
    CiNii Articles: http://ci.nii.ac.jp/
    LC Linked Data: http://id.loc.gov/

  49. Use of Linked Data: Entity search
    https://www.google.co.jp/search?q=嘉納治五郎

  50. Use of Linked Data: Rich snippets
    https://www.google.co.jp/search?q=京王プラザホテル

  51. Use of Linked Data: Rich snippets
    https://www.google.com/search?q=ダイワロイネットホテルつくば

  52. Metadata vocabulary for the Web:
    Schema.org
    • Vocabulary for simple metadata description of
    various types of things for use by web search
    engines
    • Proposed and maintained by major search
    engine companies such as Google, Microsoft, and Yahoo!
    • Used in rich snippets on search results pages
    https://schema.org/Hotel
    https://schema.org/Book
    etc.

  53. Example of Schema.org metadata
    embedded in web pages


    itemtype="http://schema.org/PostalAddress"
    itemtype="http://schema.org/AggregateRating"
    • 4.2
    • 1,614
    https://www.daiwaroynet.jp/tsukuba/
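
    To show how such embedded metadata can be read by a program, here is a minimal sketch using Python's standard `html.parser`. The HTML snippet is a hypothetical reconstruction, not the hotel page's actual markup: only the two `itemtype` URLs and the values 4.2 and 1,614 come from the slide, while the hotel name, the locality, and the assignment of 1,614 to `reviewCount` are assumptions. The collector deliberately ignores nesting and is not a full microdata parser.

```python
from html.parser import HTMLParser

# Hypothetical markup in the HTML microdata style (itemscope/itemtype/itemprop).
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Example Hotel</span>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="addressLocality">Tsukuba</span>
  </div>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4.2</span>
    <span itemprop="reviewCount">1,614</span>
  </div>
</div>
"""


class MicrodataCollector(HTMLParser):
    """Collect itemtype URLs and (itemprop, text) pairs, ignoring nesting."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = []
        self._pending = None  # itemprop waiting for its text value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        # Properties that open a nested item have no text value of their own.
        if "itemprop" in attrs and "itemscope" not in attrs:
            self._pending = attrs["itemprop"]

    def handle_data(self, data):
        if self._pending and data.strip():
            self.properties.append((self._pending, data.strip()))
            self._pending = None


collector = MicrodataCollector()
collector.feed(SAMPLE_HTML)
print(collector.item_types)
print(collector.properties)
```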

  54. Example of a Linked Data dataset:
    DBpedia
    • Example: http://ja.dbpedia.org/page/つくば市
    • Structured data is extracted and integrated
    from the free encyclopedia Wikipedia
    http://mappings.dbpedia.org/index.php/Mapping_ja

  55. 55
    https://ja.wikipedia.org/wiki/つくば市

  56. 56
    http://ja.dbpedia.org/page/つくば市

  57. 57
    Linked Open Data Cloud
    http://lod-cloud.net/
    (as of March 2019)
    • The number of datasets: 1,239
    • LOD datasets around the world
    - Cross-domain, Geography, Government, Life sciences, Linguistics,
    Media, Publications, Social networking, User generated
    - (From Japan) NDL Web Authorities, Textbook LOD

  58. Summary (keywords)
    • Information organization
    Description and classification
    User tasks: identify, find, access
    • Web
    HTTP, URI, and HTML
    • Semantic Web
    RDF data model, triples, URIs
    • Linked Data
    URI resources, Schema.org, datasets