PANDA: A Platform for Academic Knowledge Discovery and Acquisition

Zhaoan Dong (1); Jiaheng Lu (2,1); Tok Wang Ling (3)
(1) Renmin University of China; (2) University of Helsinki; (3) National University of Singapore

CONTENT
1. Motivation and background
2. Definitions and problem statement
3. Our hybrid framework
4. Current system implementation
5. Related work
6. Conclusion and future work

1. Motivation and background
• Existing popular web-based academic search systems
  • provide literature search and retrieval services through a user-friendly interface
  • Keyword search
  • return a long list of paper titles and other textual information
• To find the papers they want, users often need to scan the long list and download some papers to read one by one.
  • Time-consuming
  • Costly

PandaSearch: A Fine-grained Academic Search Engine for Research Documents on Computer Science (ICDE 2015)

Knowledge cells
• Meaningful information objects within academic documents, e.g. Figures, Tables, Definitions, etc.
Example 1: Some examples of different types of Knowledge Cells
[Images: Figure, Table, Definition, Algorithm]

• Some relationships among knowledge cells are usually implied or hidden in the sentences of the articles.
Example 2:
(1) K-Medoids is compared with PIL.
(2) The K-Medoids algorithm depends on Equation 4.
(3) PAM is a kind of K-Medoids algorithm.
[Diagram: nodes PAM, PIL, K-medoids, Equation 4; edge label DEPD]

• Some relationships among knowledge cells are usually implied or hidden in the sentences of the articles.
Example 3: According to the sentences, we can find the relationships among three algorithms: HKMed, LKMed and PAM.
(4) HKMed is adapted from LKMed.
(5) HKMed is adapted from PAM.
[Diagram: nodes HKMed, LKMed, PAM, PIL, K-medoids, Equation 4; edge labels VARNT, DEPD]

Example 4: A fragment of an Academic Knowledge Graph
[Diagram: node types Figure, Algorithm, Definition, Theorem, Table, …; edge labels CMP, DEPD, REF]

The academic knowledge graph can provide:
• More accurate paper-level results
  • Improving the ranking of relevant papers for keyword queries.
• Fine-grained search
  • "Looking inside" the documents to search research data within scientific articles.
  • Returning fine-grained information objects, not only a flat list of paper-level information.
• Deep-level information exploring
  • Academic knowledge discovery
  • Academic information exploring (developers)

In the future:
• On the one hand, we want to add "Advanced Search" to PandaSearch for common users, as below.
• On the other hand, we can provide SQL-like APIs for external systems, as demonstrated in the following examples.

Example 5: Find the Figures that contain "inverted list" in their captions, and the papers these Figures come from.
• We use a non-standard SQL statement to illustrate what the query language looks like.
    SELECT p.pid, p.title, k.name, k.content
    FROM papers p, cells k
    WHERE contains(k.name, "inverted list")
      AND k.type = "Figure"
      AND p.pid = k.pid;
• "papers" and "cells" can be either relational tables or non-relational data collections.
• "contains" can be a built-in function.
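The query in Example 5 can be tried out with a small sketch. This is only an illustration under stated assumptions, not the PANDA API: it uses Python's standard `sqlite3` module, registers a hypothetical `contains` function as a case-insensitive substring test, and fills `papers` and `cells` with made-up rows. The column `kid` is also an assumption.

```python
import sqlite3

def contains(text, phrase):
    """Hypothetical stand-in for the slide's contains(): case-insensitive substring test."""
    return int(phrase.lower() in (text or "").lower())

conn = sqlite3.connect(":memory:")
# Register the Python function so it can be called from SQL with two arguments.
conn.create_function("contains", 2, contains)
conn.executescript("""
    CREATE TABLE papers (pid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE cells (kid INTEGER PRIMARY KEY, pid INTEGER,
                        type TEXT, name TEXT, content TEXT);
""")
# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO papers VALUES (?, ?)",
                 [(1, "Paper A"), (2, "Paper B")])
conn.executemany("INSERT INTO cells VALUES (?, ?, ?, ?, ?)", [
    (10, 1, "Figure", "Figure 3: Inverted list layout", "fig3.png"),
    (11, 1, "Table", "Table 1: Inverted list sizes", "tab1.csv"),
    (12, 2, "Figure", "Figure 1: System overview", "fig1.png"),
])

rows = conn.execute("""
    SELECT p.pid, p.title, k.name, k.content
    FROM papers p JOIN cells k ON p.pid = k.pid
    WHERE contains(k.name, ?) AND k.type = ?
""", ("inverted list", "Figure")).fetchall()
print(rows)
```

Only the Figure of Paper A matches: the Table caption also mentions "inverted list" but is excluded by the type condition. String values are passed as parameters because standard SQL reserves double quotes for identifiers.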

Example 6: Search for algorithms from different papers that are variants of, or have been compared with, an Algorithm whose name is related to "hash join".
    SELECT k1.pid, k1.name, k2.pid, k2.name
    FROM cells k1, cells k2
    WHERE relations(k1, k2) IN ("CMP", "VARNT")
      AND contains(k2.name, "hash join")
      AND k1.type = "Algorithm" AND k2.type = "Algorithm"
      AND k1.pid != k2.pid;
• "cells" can be either relational tables or non-relational data collections.
• "relations" can be a built-in function.
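The semantics of Example 6 can be sketched over a non-relational collection. The following is a minimal illustration, assuming cells are plain dictionaries and `relations(k1, k2)` is a lookup in a hypothetical `relation_of` mapping; all data values are toy examples, not extracted results.

```python
from itertools import product

# Toy cell collection; field names mirror the query, values are made up.
cells = [
    {"kid": 1, "pid": 100, "type": "Algorithm", "name": "baseline hash join"},
    {"kid": 2, "pid": 200, "type": "Algorithm", "name": "sort-merge join"},
    {"kid": 3, "pid": 300, "type": "Algorithm", "name": "partitioned hash join"},
    {"kid": 4, "pid": 100, "type": "Figure", "name": "hash join cost"},
]
# Hypothetical relationship store: (k1 id, k2 id) -> relationship label.
relation_of = {(2, 1): "CMP", (3, 1): "VARNT", (4, 1): "REF"}

def contains(text, phrase):
    """Case-insensitive substring test standing in for contains()."""
    return phrase.lower() in text.lower()

def related_algorithms(cells, relation_of, phrase, labels=("CMP", "VARNT")):
    """Return (k1.pid, k1.name, k2.pid, k2.name) for every qualifying pair."""
    results = []
    for k1, k2 in product(cells, repeat=2):
        if (relation_of.get((k1["kid"], k2["kid"])) in labels
                and contains(k2["name"], phrase)
                and k1["type"] == "Algorithm" and k2["type"] == "Algorithm"
                and k1["pid"] != k2["pid"]):
            results.append((k1["pid"], k1["name"], k2["pid"], k2["name"]))
    return results

matches = related_algorithms(cells, relation_of, "hash join")
print(matches)
```

The nested scan over all pairs is quadratic in the number of cells; a real system would presumably index the relationships, but that design is outside what the slides describe.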

Objectives and challenges
(1) Correctly identify and extract the contents of each Knowledge Cell.
  • PDFs lack sufficient structural information.
  • Journals are diverse, published in different years and with different layouts.
(2) Extract the attributes, key phrases and contexts of the Knowledge Cells.
  • The captions of Figures, the specifications of Algorithms, etc. are hard for computers to understand.
(3) Identify and extract various relationships between Knowledge Cells.
  • The relationships are usually implied in text, rare or invisible.
  • Some even require expertise to be recognized.

Example 8: The layouts of Knowledge Cells vary with the format of different documents, conferences and journals.

Example 9: Even within one document, the layouts differ.
• There are at least three different layouts for 11 logical objects, including one Table and ten Figures.

Example 10:
• Missing information makes some attributes of the knowledge cells null.
• Information overload makes it hard to extract the attributes and relationships.
[Figure annotations: missing Caption; Text, Number, Formula…]

To overcome the challenges, we propose a hybrid framework combining the accuracy of human workers with the speed of computer algorithms.
• Automatic computer algorithms
  • Low cost, high speed.
  • Can hardly be extended to handle diverse journals and layouts as the amount of scientific publications increases.
• Human workers in crowdsourcing
  • Higher accuracy, higher performance.
  • Expensive crowdsourcing cost, e.g. time, money.
• The cooperation of humans and machines can help researchers resolve large-scale complex problems more efficiently.

2. Definitions and problem statement
The definition of Knowledge Cell
Definition 1: A Knowledge Cell is a meaningful information object within an academic document. Each Knowledge Cell should have some attributes, including an identifier, paper identifier, type, name, content, key phrases, and so on.
• More generally, papers can also be regarded as a special kind of Knowledge Cell with attributes such as paper identifier (e.g. pid), title, authors, pages, conference or journal, date, references, etc.
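Definition 1 can be mirrored by a small record type. The sketch below is an assumption about one possible representation, not the schema PANDA uses: field names such as `kid` and `key_phrases` are hypothetical, and optional fields default to empty so that a cell with not-yet-extracted attributes can still be stored.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class KnowledgeCell:
    kid: str                       # identifier of the cell (hypothetical name)
    pid: str                       # identifier of the paper the cell comes from
    type: str                      # e.g. "Figure", "Table", "Definition"
    name: Optional[str] = None     # e.g. the caption of a Figure
    content: Optional[str] = None  # e.g. a path to the cropped image or the text
    key_phrases: List[str] = field(default_factory=list)

    def missing_attributes(self) -> List[str]:
        """List the optional attributes that have not been filled in yet."""
        missing = [a for a in ("name", "content") if getattr(self, a) is None]
        if not self.key_phrases:
            missing.append("key_phrases")
        return missing

cell = KnowledgeCell(kid="k1", pid="p1", type="Figure", name="Figure 2: Index layout")
print(cell.missing_attributes())
```

Tracking which attributes are still missing is what lets a record be completed later, by an algorithm or by a human worker.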

The definition of Academic Knowledge Graph
Definition 2: An Academic Knowledge Graph is a directed graph AKG = (K, R), where K is the set of Knowledge Cells extracted from a collection of academic documents and R = { (k1, k2, r) | k1, k2 ∈ K; k1 ≠ k2; and r is the relationship between k1 and k2 }.
• Note that k1 and k2 are two Knowledge Cells from either one PDF file or two different files.
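Definition 2 translates fairly directly into a data structure. The following is a minimal sketch, assuming cells are referred to by string identifiers and relationships by labels such as CMP and DEPD; the class and method names are hypothetical.

```python
class AcademicKnowledgeGraph:
    """Directed graph AKG = (K, R) with labelled edges (k1, k2, r)."""

    def __init__(self):
        self.cells = set()      # K: identifiers of Knowledge Cells
        self.relations = set()  # R: triples (k1, k2, r)

    def add_cell(self, kid):
        self.cells.add(kid)

    def add_relation(self, k1, k2, r):
        # Definition 2 requires k1 != k2 and both endpoints to be in K.
        if k1 == k2:
            raise ValueError("a relationship needs two distinct Knowledge Cells")
        if k1 not in self.cells or k2 not in self.cells:
            raise KeyError("both Knowledge Cells must be added to K first")
        self.relations.add((k1, k2, r))

    def outgoing(self, k1):
        """Return the sorted (k2, r) pairs for edges that start at k1."""
        return sorted((k2, r) for (src, k2, r) in self.relations if src == k1)

akg = AcademicKnowledgeGraph()
for kid in ("K-medoids", "PIL", "Equation 4"):
    akg.add_cell(kid)
akg.add_relation("K-medoids", "PIL", "CMP")          # compared with
akg.add_relation("K-medoids", "Equation 4", "DEPD")  # depends on
print(akg.outgoing("K-medoids"))
```

The two edges reuse the relationships from Example 2; storing R as a set of triples keeps the structure close to the definition.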

We obtain a more general Academic Knowledge Graph (GAKG) as a hypergraph if it also contains the relationships between:
• each paper and its citations;
• each paper and the Knowledge Cells within it;
• Knowledge Cells.
[Diagram: a fragment of a general Academic Knowledge Graph; node types Figure, Algorithm, Definition, Theorem, Table, …; edge label CITE]

Problem statement: the problem of academic knowledge discovery and acquisition can be modeled as a crowd-sourced database problem, where scholarly papers, Knowledge Cells and their relationships are represented as rows/records with some missing attributes that can be supplied by either automatic algorithms or anonymous human workers.
• We focus on how to design hybrid workflows that combine automatic algorithms and crowdsourced tasks efficiently and effectively.
• Our objective is to identify and extract these records and attributes, by either automatic algorithms or anonymous human workers, for further queries.

3. Our hybrid framework
A generic framework for knowledge discovery and acquisition from PDF documents.
[Diagram: the hybrid workflows. Labels: PDF Pages, Automated Extracting Algorithm, High confidence, Low confidence, HIT Candidates, HITs generating, HITs, Crowdsourcing, Crowd training, Confirmed Knowledge Cells]

Our hybrid workflows can be regarded as a multi-stage process.
(1) Preprocessing stage
• Metadata of papers can be harvested from public websites beforehand.
  • title, authors, publication date, page number, etc.
• Format conversion
  • PDF documents → text files
  • PDF pages → JPEG/PNG images
• Page filtering by rule-based filters
  • PDF pages that obviously do not contain the target objects to be extracted should be filtered out.
• …

(2) Extracting academic knowledge using automatic algorithms
• Heuristic methods and machine learning algorithms are employed to:
  • Locate the area of each Knowledge Cell.
  • Analyze the texts and extract the attributes, contexts and key phrases of each Knowledge Cell.
  • Provide a confidence estimate of how accurate and reliable an identified result is likely to be.
  • Adjust the confidence filtering threshold dynamically, considering time cost, result quality and crowdsourcing budget.
  • …
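The role of the confidence estimate and its threshold can be illustrated with a short sketch. It assumes each automatic result carries a `confidence` value between 0 and 1 and that the threshold is a plain parameter the caller may adjust; the function name and record layout are hypothetical, not taken from PANDA.

```python
def route_by_confidence(results, threshold):
    """Split automatic results into retained ones and crowdsourcing candidates."""
    retained, hit_candidates = [], []
    for result in results:
        if result["confidence"] >= threshold:
            retained.append(result)        # trusted as extracted
        else:
            hit_candidates.append(result)  # to be checked by human workers
    return retained, hit_candidates

# Made-up detector output for three pages.
results = [
    {"page": 1, "confidence": 0.95},
    {"page": 2, "confidence": 0.40},
    {"page": 3, "confidence": 0.80},
]
retained, hit_candidates = route_by_confidence(results, threshold=0.8)
print([r["page"] for r in retained], [r["page"] for r in hit_candidates])
```

Raising the threshold sends more pages to human workers, which tends to improve quality at a higher cost; lowering it does the opposite. That is the trade-off the dynamic adjustment has to balance.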

(3) Crowdsourcing task design
• Results with a high confidence value will be retained.
• Otherwise, the current page will be passed to the crowdsourcing layer as a Human Intelligence Task Candidate (HITC).
• Human Intelligence Tasks (HITs) for extracting certain Knowledge Cells or information will be designed and generated.
• A web-based task-oriented crowdsourcing system
  • Identifying tasks
  • Reviewing tasks
  • Tutorial tasks
  • Test tasks
  • …

(4) Crowdsourcing process management and cost model
• Answer aggregation and quality control
  • Majority vote, etc.
  • A tutorial module
  • A test module
• A crowdsourcing cost model
  • How to achieve higher quality with a fixed budget.
  • How to reduce the overall cost under quality constraints.
• User management module
  • Registration
  • Ranking and reputation
  • …
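Majority voting, the aggregation method named above, can be sketched in a few lines. This is a generic illustration, assuming each worker's answer is a hashable value (for example a label or a bounding-box tuple); the `min_agreement` parameter is an assumption added here to show how undecided tasks could be detected.

```python
from collections import Counter

def majority_vote(answers, min_agreement=0.5):
    """Return the most frequent answer if its share exceeds min_agreement, else None."""
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    return answer if votes / len(answers) > min_agreement else None

print(majority_vote(["Figure", "Figure", "Table"]))
print(majority_vote(["Figure", "Table"]))
```

A `None` result marks a task without a clear majority, which could be reissued to more workers or sent to a reviewing task.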

4. Current system implementation
Platform for Academic kNowledge Discovery and Acquisition (PANDA)
[Diagram labels: PDFs, Internet, Crowds, PANDA, Academic Knowledge Base, PandaSearch, User, Query, Result]
• PANDA serves as a data provider for PandaSearch.

[Figure: The system architecture of PANDA]

(1) Data Storage
• 2.9 million PDF documents in computer science.
• We currently focus on the extraction of Figures.
Statistics of current data stores:
    Data Type      Number
    Papers         2975828
    Figures        15492
    Definitions    1939
    Lemmas         757
    Theorems       726
    Algorithms     671
    Propositions   52
    Examples       1038
• So far, we have extracted 15492 Figures from 5000 papers, including nearly 4000 SIGMOD papers published from 1980 to 2014.
• The number of Figures is therefore far smaller than the number of papers. This is why we want to develop PANDA: to process the remaining papers, whose number is still increasing.

(2) Algorithmic Layer
• We have built an algorithm using rule-based and machine learning methods to automatically extract Figures:
1. Splitting the PDF document into pages.
2. Converting the PDF file into standard text file format.
3. Filtering out the pages that obviously do not contain figures.
4. Locating the boundary of the figure's content area with a detector. (PDFBox and libSVM are used.)
5. Cropping the Figures' content with an Extractor or a Cropper according to the position information.
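Step 3 can be illustrated with one possible rule. The slides do not state the actual filtering rules, so the following is an assumed example only: a page is kept when its extracted text mentions something that looks like a figure caption or reference ("Figure 1", "Fig. 2").

```python
import re

# Assumed rule: a page may contain a figure if its text mentions "Figure N" or "Fig. N".
CAPTION_PATTERN = re.compile(r"\b(?:Figure|Fig\.)\s*\d+", re.IGNORECASE)

def pages_possibly_containing_figures(page_texts):
    """Return the 1-based numbers of pages whose text matches the caption pattern."""
    return [number for number, text in enumerate(page_texts, start=1)
            if CAPTION_PATTERN.search(text)]

# Made-up page texts standing in for the converted text files of step 2.
page_texts = [
    "Abstract. We study join processing.",
    "Figure 1: System overview",
    "As shown in Fig. 2, the cost grows linearly.",
    "References",
]
kept_pages = pages_possibly_containing_figures(page_texts)
print(kept_pages)
```

The rule is deliberately permissive: a page that only refers to a figure is kept as well, leaving the final decision to the boundary detector in step 4.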

• We performed an initial experiment extracting Figures from nearly 4,000 SIGMOD papers published from 1980 to 2014.
• We use Completeness and Purity to evaluate the results of the boundary detector, in addition to Precision, Recall and F-Measure.
  • Complete: the result region includes all parts of the Knowledge Cell content.
  • Pure: the result region does not contain anything that does not belong to the Knowledge Cell.
• A correctly identified component of a Knowledge Cell is therefore both complete and pure.
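These measures can be made concrete with a small sketch. It rests on a simplifying assumption that is not stated in the slides: both the detected region and the true content area are axis-aligned boxes (x0, y0, x1, y1), so that completeness and purity reduce to box containment. Precision, Recall and F-Measure follow their standard definitions.

```python
def box_contains(outer, inner):
    """True if box `inner` lies entirely within box `outer`; boxes are (x0, y0, x1, y1)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def is_complete(detected, truth):
    """The detected region includes all of the true content area."""
    return box_contains(detected, truth)

def is_pure(detected, truth):
    """The detected region includes nothing outside the true content area."""
    return box_contains(truth, detected)

def precision_recall_f(num_correct, num_detected, num_truth):
    """Standard Precision, Recall and F-Measure from raw counts."""
    precision = num_correct / num_detected if num_detected else 0.0
    recall = num_correct / num_truth if num_truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

truth = (10, 10, 100, 100)
cropped = (30, 10, 100, 100)   # misses the left part of the figure
oversized = (0, 0, 120, 150)   # also covers surrounding text
print(is_complete(cropped, truth), is_pure(cropped, truth))
print(is_complete(oversized, truth), is_pure(oversized, truth))
```

Under this box model a result is counted as correct only when it is both complete and pure, that is, when the two boxes coincide; a practical evaluation would likely allow a small tolerance.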

Example 11: The identified results in the left page are not correct, since the first one discards the left part and the second one covers too much text.

Preliminary experimental results of the current algorithms for Figure extraction.
• This figure shows that the performance for papers from 1980 to 1989 is lower than that for later years.
[Chart: Recall, Precision and F-Measure (scale 0.00 to 1.00) per period: 1980-1984, 1985-1989, 1990-1994, 1995-2000, 2000-2004, 2005-2009, 2010-2014]

Example 12: PDF pages in earlier years
• This is because the PDF files from earlier years usually have low quality or resolution. The extracted texts usually contain various types of noise from the character recognition process, e.g. typos. This may affect the discovery and locating of some Knowledge Cells.

(3) Crowdsourcing Layer
An example of the web-based interfaces for extracting Figures

(4) Crowds/human workers
• Who might contribute to the crowdsourced tasks?
  • Common users
  • Authors
  • Student volunteers
  • Published on Mechanical Turk? CrowdFlower?
  • …
• How to motivate and retain human workers?
  • Game?
  • Award points?
  • reCAPTCHA?

5. Related work
• More and more interest has been devoted to the extraction and management of research data within scientific literature.
• Digital Curation (DC)
  • is the selection, preservation, maintenance, collection and archiving of digital assets.
  • establishes, maintains and adds value to repositories of digital data for present and future use.
• Deep Indexing (DI)
  • Indexing the research data within articles that are invisible to traditional bibliographic searches.
  • Deep Indexing is now available in ProQuest, CiteSeerX, ScienceDirect, etc.

• Figures and tables are also displayed when the paper they come from is returned as a search result.
• In CiteSeer, users can search tables by entering some keywords.

However:
• The extraction and management of each kind of Knowledge Cell is independent.
• The query and display of Knowledge Cells depend on the query of academic papers, not on the attributes of the Knowledge Cells themselves.
• No published work focuses on the relationships among the various kinds of Knowledge Cells.
• No related work uses these relationships to build an Academic Knowledge Graph as we propose.

Automatic Information Extraction
• A number of methods, techniques and tools have been employed to analyze the structure of PDFs and identify different layout blocks within PDFs.
  • J. Hu and Y. Liu, "Analysis of documents born digital," in Handbook of Document Image Processing and Recognition. Springer London, 2014, pp. 775–804.
  • S. Klampfl et al., "Unsupervised document structure analysis of digital scientific articles," International Journal on Digital Libraries, vol. 14, no. 3, pp. 83–99, 2014.
  • J. Wu, K. Williams, H. Chen, M. Khabsa, C. Caragea, A. Ororbia, D. Jordan, and C. L. Giles, "CiteSeerX: AI in a digital library search engine," in AAAI, 2014, pp. 2930–2937.
• Most of them focus on the structure analysis of PDF documents to identify and extract the content of Figures and Tables.
• We want to extend them to the extraction of other kinds of Knowledge Cells and their attributes.

Task-Oriented Crowdsourcing
  • C. Lofi and K. E. Maarry, "Design patterns for hybrid algorithmic crowdsourcing workflows," in CBI, 2014, pp. 1–8.
  • N. Luz, N. Silva, and P. Novais, "A survey of task-oriented crowdsourcing," Artificial Intelligence Review, pp. 1–27, 2014.
  • N. Luz, N. Silva, and P. Novais, "Generating human-computer microtask workflows from domain ontologies," in Human-Computer Interaction. Theories, Methods, and Tools. Springer, 2014, pp. 98–109.
  • E. Kamar, S. Hacker, and E. Horvitz, "Combining human and machine intelligence in large-scale crowdsourcing," in AAMAS, 2012, pp. 467–474.
  • S. K. Kondreddi, P. Triantafillou, and G. Weikum, "Combining information extraction and human computing for crowdsourced knowledge acquisition," in ICDE, 2014, pp. 988–999.
• There is no related work on academic knowledge discovery and acquisition using crowdsourcing methods.

6. Conclusion and future work
• The objective of this research is
  • to identify and extract academic knowledge using a hybrid framework integrating the accuracy of human workers and the speed of algorithms.
• The contributions of this paper
  • Stated the problem of academic knowledge discovery and acquisition as a crowd-sourced database problem, based on the definitions of Knowledge Cells and the Academic Knowledge Graph.
  • Proposed a hybrid framework integrating the accuracy of human workers and the speed of automatic algorithms.
  • Designed a web-based crowdsourcing module for Figure extraction, with some preliminary achievements.

We have a lot of work to do:
• Improving the feasibility of the crowdsourcing interfaces and optimizing the design of HITs.
• Making the algorithms confidence-aware and able to interact iteratively with the crowdsourcing modules.
  • Strategies for switching tasks.
  • Optimization of the algorithms using human contributions.
  • Trade-off considerations.
• Extending the framework to identify and extract various attributes and information of Knowledge Cells.
  • Different Knowledge Cells have different features.

Thank you!
Email: [email protected]
