
Phrase Mining and Entity Mining (词汇挖掘与实体挖掘)

孙玉龙
September 02, 2019


Transcript

  1. 2 Unstructured Text Data (accounts for ~80% of all data in organizations) → Knowledge & Insights (Chakraborty, 2016). Machine? Human? Mining Structures from Massive Text Data. (Icons: flaticon.com)
  2. 4 Structure Mining. Example: from text such as "The Mona Lisa is a half-length portrait painting by the Italian Renaissance artist Leonardo da Vinci that has …" and "Leonardo di ser Piero da Vinci (15 April 1452 – 2 May 1519) …", mine the entity "Mona Lisa", the relation "paint" linking it to "Leonardo di ser Piero da Vinci", and attribute names & values such as the birth and death dates. Massive Text Corpus → Entities, Relations, Attribute Names & Values.
  3. 5 A Product Use Case: Finding "Interesting Hotel Collections". Technology transfer to TripAdvisor: grouping hotels based on structured facts extracted from the review text. Features for the "Catch a Show" collection: (1) broadway shows, (2) beacon theater, (3) broadway dance center, (4) broadway plays, (5) david letterman show, (6) radio city music hall, (7) theatre shows. Features for the "Near The High Line" collection: (1) high line park, (2) chelsea market, (3) highline walkway, (4) elevated park, (5) meatpacking district, (6) west side, (7) old railway. http://engineering.tripadvisor.com/using-nlp-to-find-interesting-collections-of-hotels/
  4. 6 Why Text to Structure? 6 Structured Search & Exploration

    Facet Taxonomy Construction Structured Feature Generation Graph Mining & Network Analysis
  5. 7 Prior Art: Extracting Structure with Domain Expert Effort. News, Reviews, Scientific Papers, … Domain Experts → Extraction Rules, Machine-Learning Models (Stanford CoreNLP, CMU NELL, UW KnowItAll, USC AMR, IBM Alchemy APIs, Google Knowledge Graph, Microsoft Satori, …). Text Corpus (Yelp reviews, NYTimes News, PubMed Papers) → Entities, Relations, and Attribute Names & Values. • Models for the same task may require different labeled data in different domains
  6. 8 News Reviews Scientific Papers … This Lecture: Automatic Structure

    Mining from Massive Text Corpora 8 Public Knowledge Bases Extraction Rules Machine-Learning Models AutoPhrase CoType MetaPAD … Text Corpus Yelp reviews NYTimes News PubMed Papers • Enables quick development of applications in various domains. • Extracts complex structures without introducing additional human effort.
  7. 9 “Automatic” Definition  Automatic  Minimal Human Effort 

    Using only existing general knowledge bases without any other human effort. Rapidly growing! Freely available! • Common knowledge • Life sciences • Art … ERA structures: entity names, entity types, typed relationships ... That’s it? Problem solved? Everything can be found in KBs? Number of Wikipedia articles https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
  8. 10 Automatic Structure Mining: Methodology. Knowledge Bases + Massive Text Corpus → Automatic Phrase Mining Methods (SIGMOD'15, arXiv'17) → Entity Names & Context Units → Typing and Relation Extraction Methods (KDD'15, KDD'16, EMNLP'16, WWW'17) → Typed Entities & Relations → Meta Pattern-Driven Attribute Name & Value Discovery Methods (KDD'17) → Attribute Names & Values, more General Relations → Other Tasks and Applications (ECMLPKDD'17, KDD'17, WWW'16)
  9. 11 Lecture Outline  Introduction  Part I: Entity Extraction

    through Phrase Mining  Part II: Entity Typing
  10. 13 Automatic Structure Mining: Methodology. Knowledge Bases + Massive Text Corpus → Automatic Phrase Mining Methods (SIGMOD'15, arXiv'17) → Entity Names & Context Units → Typing and Relation Extraction Methods (KDD'15, KDD'16, EMNLP'16, WWW'17) → Typed Entities & Relations → Meta Pattern-Driven Attribute Name & Value Discovery Methods (KDD'17) → Attribute Names & Values, more General Relations → Other Tasks and Applications (ECMLPKDD'17, KDD'17, WWW'16)
  11. 14 Definition: Quality Phrase Mining  Quality phrase mining seeks

    to extract a ranked list of phrases with decreasing quality from a large collection of documents  Examples: Scientific Papers News Articles Expected Results US President Anderson Cooper Barack Obama … Obama administration … a town … Expected Results data mining machine learning information retrieval … support vector machine … the paper …
  12. 15 Why Phrase Mining? w/o phrase mining: • What is "united"? • Which Dao? w/ phrase mining: • United Airlines! • David Dao!  Applications in NLP, IR, Text Mining  Document analysis  Indexing in search engines  Keyphrases for topic modeling  Summarization
  13. 16 What Kind of Phrases Are of "High Quality"?  Popularity: frequency ("information retrieval" vs. "cross-language information retrieval")  Concordance: a sequence of words that co-occurs more frequently than expected ("powerful tea" vs. "strong tea"; "active learning" vs. "learning classification")  Informativeness: "this paper" is frequent but not discriminative, hence not informative  Completeness: "vector machine" vs. "support vector machine". Concordance can be measured with many statistical measures, e.g., significance score, mutual information, t-test, z-test, chi-squared test, likelihood ratio, … (see the sketch below)
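To make the concordance idea concrete, here is a minimal sketch (not from the deck) that scores two candidate bigrams with pointwise mutual information, one of the measures listed above; all corpus counts are invented for illustration:

```python
import math

# Hypothetical corpus statistics, invented for illustration.
total_tokens = 1_000_000
count = {"strong": 1200, "tea": 800, "strong tea": 150,
         "powerful": 900, "powerful tea": 2}

def pmi(w1, w2):
    """Pointwise mutual information: log p(w1 w2) / (p(w1) * p(w2))."""
    p_joint = count[f"{w1} {w2}"] / total_tokens
    p1 = count[w1] / total_tokens
    p2 = count[w2] / total_tokens
    return math.log(p_joint / (p1 * p2))

print(pmi("strong", "tea"))    # large: the pair co-occurs far more often than chance
print(pmi("powerful", "tea"))  # small: both words frequent, but a rare collocation
```

Any of the other listed measures (t-test, z-test, chi-squared, likelihood ratio) can be dropped into the same scoring slot.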
  14. 17 Our Recent Efforts on Phrase Mining  KERT (2014)  Marina Danilevsky, Chi Wang, Nihit Desai, Xiang Ren, Jingyi Guo, and Jiawei Han. "Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents", SIAM Data Mining Conf. (SDM), 2014  ToPMine (2014-2015) Code package downloadable at http://elkishk2.web.engr.illinois.edu  Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R. Voss, and Jiawei Han, "Scalable Topical Phrase Mining from Text Corpora", 2015 Int. Conf. on Very Large Data Bases (VLDB'15)  SegPhrase (2015) GitHub source: https://github.com/shangjingbo1226/SegPhrase  Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han, "Mining Quality Phrases from Massive Text Corpora", 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15)  Received the Grand Prize of the 2015 Yelp Dataset Challenge; used in industry, e.g., TripAdvisor, MSR, Google Research  AutoPhrase (2017) GitHub source: https://github.com/shangjingbo1226/AutoPhrase  Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, Jiawei Han, "AutoPhrase: Automated Phrase Mining from Massive Text Corpora", arXiv, Feb. 2017  A New E-Book on Phrase Mining and Its Applications  Jialu Liu, Jingbo Shang, and Jiawei Han, Phrase Mining from Massive Text and Its Applications, Synthesis Lectures on Data Mining and Knowledge Discovery, Morgan & Claypool Publishers, 2017
  15. 19 Supervised Phrase Mining  Phrase mining originated in the NLP community  How to use linguistic analyzers to extract phrases?  Parsing (e.g., Stanford NLP parsers)  Noun Phrase (NP) Chunking  How to rank extracted phrases?  C-value [Frantzi et al.'00]  TextRank [Mihalcea et al.'04]  TF-IDF
  16. 20 Linguistic Analyzer – Parsing (Full-text Parsing): raw text sentence (string) → full parse tree (grammatical analysis). Example: "The chef cooks the soup."  Minimal grammatical segments → phrases, e.g., "the chef", "the soup"
  17. 21 Linguistic Analyzer – Chunking  Noun phrase chunking is

    a light version of parsing 1. Apply tokenization and part-of-speech (POS) tagging to each sentence 2. Search for noun phrase chunks
  18. 22 Inefficiencies of Linguistic Analyzers  Difficult to apply pre-trained models directly to new domains (e.g., Twitter, biomedical, Yelp)  unless sophisticated, manually curated, domain-specific training data are provided  Computationally slow: cannot be applied to web-scale data to support emerging applications  No use of corpus-level information  NPs sometimes can't meet the requirements of quality phrases  We need "shallow" phrase mining techniques
  19. 23 Ranking  C-Value: prefers "maximal" phrases (popularity & completeness)  TextRank: similar to PageRank (popularity & informativeness)  TF-IDF: term frequency × inverse document frequency (popularity & informativeness). Example text for ranking: "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. …"
  20. 25 Unsupervised Phrase Mining  Statistics based on massive text

    corpora  Popularity  Raw frequency  Frequency distribution based on Zipfian ranks [Deane’05]  Concordance  Significance score [Church et al.’91][El-Kishky et al.’14]  Completeness  Comparison to super/sub-sequences [Parameswaran et al.’10]
  21. 26 Raw Frequency  Raw frequency alone could NOT reflect the quality of phrases, so it is combined with topic modeling:  Merge adjacent unigrams of the same topic [Blei & Lafferty'09]  Frequent pattern mining within the same topic [Danilevsky et al.'14]  Limitations  Tokens in the same phrase may be assigned to different topics  E.g., "knowledge discovery using least squares support vector machine classifiers …"
  22. 27 Frequency Distribution  Idea: ranks in a Zipfian frequency distribution are more reliable than raw frequency  Heuristic: Actual Rank / Expected Rank  Example: given a phrase like "east end"  Actual Rank: rank of "east end" among all occurrences of "east" (e.g., "east end", "east side", "the east", "towards the east", etc.)  Expected Rank: rank of "__ end" among all contexts of "east" (e.g., "__ end", "__ side", "the __", "towards the __", etc.)
  23. 28 Significance Score  Significance score [Church et al.'91], a.k.a. z-score; used in ToPMine [El-Kishky et al.'15]  If a phrase can be decomposed into two parts, P = P1 • P2, then α(P1, P2) ≈ (f(P1•P2) − µ0(P1, P2)) / √f(P1•P2), where f(P1•P2) is the observed frequency of the merged phrase and µ0(P1, P2) its expected frequency if P1 and P2 occurred independently; high α indicates quality phrases
  24. 29 TopMine: First Mine Frequent Contiguous Patterns, Then Conduct Topic Modeling. First perform frequent contiguous pattern mining to extract candidate phrases and their counts, then rectify raw counts to real frequencies by phrasal segmentation, merging by the significance score [Church et al.'91]: α(P1, P2) ≈ (f(P1•P2) − µ0(P1, P2)) / √f(P1•P2) (see the sketch below). Example segmentations: [Markov blanket] [feature selection] for [support vector machines]; [knowledge discovery] using [least squares] [support vector machine] [classifiers]; … [support vector] for [machine learning] … Phrase / raw freq. / true freq.: [support vector machine] 90 / 80; [vector machine] 95 / 0; [support vector] 100 / 20
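A minimal sketch of the significance score above (not the ToPMine implementation): µ0 is approximated as the expected count of P1 immediately followed by P2 under independence, and the corpus counts are invented:

```python
import math

def significance(f_joint, f1, f2, n):
    """alpha(P1, P2) = (f(P1•P2) - mu0(P1, P2)) / sqrt(f(P1•P2)).
    mu0 = n * p(P1) * p(P2): expected adjacent co-occurrences if the
    two parts occurred independently across n word positions."""
    mu0 = n * (f1 / n) * (f2 / n)
    return (f_joint - mu0) / math.sqrt(f_joint)

n = 1_000_000  # corpus size in word positions (toy value)
# "support vector" + "machine": a strong collocation despite modest counts.
print(significance(f_joint=80, f1=100, f2=90, n=n))    # large and positive
# Two individually frequent units that rarely appear together.
print(significance(f_joint=3, f1=5000, f2=8000, n=n))  # negative: do not merge
```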
  25. 31 Comparison to Super/Sub-sequences  Frequency ratio between an n-gram phrase and its two (n-1)-gram sub-phrases  Example (raw frequencies: "San" 14585, "Antonio" 2855, "San Antonio" 2385):  Pre-confidence of "San Antonio": 2385 / 14585  Post-confidence of "San Antonio": 2385 / 2855  Expand / terminate based on thresholds (see the sketch below)
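A sketch of the pre/post-confidence test using the slide's counts; the 0.5 threshold is an invented value, since the deck's point is precisely that such thresholds must be tuned:

```python
freq = {"San": 14585, "Antonio": 2855, "San Antonio": 2385}

def pre_confidence(bigram):
    """freq(bigram) / freq(first word): how reliably the first word extends right."""
    first, _ = bigram.split()
    return freq[bigram] / freq[first]

def post_confidence(bigram):
    """freq(bigram) / freq(second word): how reliably the second word extends left."""
    _, second = bigram.split()
    return freq[bigram] / freq[second]

THRESHOLD = 0.5  # hypothetical cut-off
bigram = "San Antonio"
pre, post = pre_confidence(bigram), post_confidence(bigram)
print(pre, post)  # ~0.16 and ~0.84: "Antonio" is almost always preceded by "San"
print("expand" if max(pre, post) >= THRESHOLD else "terminate")
```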
  26. 32 Comparison to Super/Sub-sequences (cont'd)  Assumption: for an n-gram quality phrase, at least one of its two (n-1)-gram sub-phrases is not a quality phrase  Counter-example: "relational database system" is a quality phrase, yet both "relational database" and "database system" can be quality phrases
  27. 33 Limitations of Statistical Signals  The thresholds must be carefully chosen  Each signal only considers a subset of the quality-phrase requirements  Combining different signals in an unsupervised manner is difficult  Introducing some supervision may help!
  28. 35 Phrase Mining: From Raw Corpus to Quality Phrases and Segmented Corpus. Input: raw corpus + a small set of labels or a general KB → quality phrases + segmented corpus, integrating phrase mining with phrasal segmentation. Example documents: Document 1: "Citation recommendation is an interesting but challenging research problem in data mining area." Document 2: "In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique." Document 3: "Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications." ToPMine: A. El-Kishky, et al., "Scalable Topical Phrase Mining from Text Corpora", VLDB'15. SegPhrase: J. Liu, et al., "Mining Quality Phrases from Massive Text Corpora", SIGMOD'15 (Grand Prize in the Yelp Dataset Challenge). AutoPhrase: J. Shang, et al., "Automated Phrase Mining from Massive Text Corpora", 2017 (TKDE'17)
  29. 36 SegPhrase. Pipeline: raw corpus → phrase mining → quality phrases → phrasal segmentation → segmented corpus (same example documents as above)  Outperforms all the above methods on domain-specific corpora (e.g., Yelp reviews)
  30. 37 Quality Estimation  Weakly supervised  Labels: whether a phrase is a quality one or not, e.g., "support vector machine": 1; "the experiment shows": 0  For a ~1GB corpus, only 300 labels  Pros: binary annotations are easy  Cons: selecting hundreds of varying-quality phrases from millions of candidates must be done carefully
  31. 38 SegPhrase: Segmentation of Phrases  Partition a sequence of words by maximizing the likelihood, considering  Phrase quality score: ClassPhrase assigns a quality score to each phrase  Probability in the corpus  Length penalty: when > 1, it favors shorter phrases  Filter out phrases with low rectified frequency: bad phrases are expected to rarely occur in the segmentation results (see the dynamic-programming sketch below)
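The deck does not spell out the exact objective, so here is a minimal dynamic-programming sketch in the spirit of SegPhrase's segmentation: choose the split that maximizes the sum of log quality scores minus a length penalty that grows with segment length. The quality table and ALPHA are invented:

```python
import math

# Hypothetical quality scores from a ClassPhrase-style classifier (toy values).
quality = {"support vector machine": 0.95, "support vector": 0.80,
           "vector machine": 0.05, "support": 0.30, "vector": 0.30, "machine": 0.30}

ALPHA = 1.5  # length penalty; > 1 penalizes longer segments, favoring shorter phrases

def segment(tokens, max_len=3):
    """best[i] = max over j of best[j] + log q(tokens[j:i]) - (i-j-1)*log(ALPHA)."""
    n = len(tokens)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            cand = " ".join(tokens[j:i])
            q = quality.get(cand, 1e-6)  # unseen spans get negligible quality
            score = best[j] + math.log(q) - (i - j - 1) * math.log(ALPHA)
            if score > best[i]:
                best[i], back[i] = score, j
    segs, i = [], n
    while i > 0:  # walk the back-pointers to recover the chosen segments
        segs.append(" ".join(tokens[back[i]:i]))
        i = back[i]
    return segs[::-1]

print(segment("support vector machine".split()))  # -> ['support vector machine']
```

Here the high quality score of the full trigram outweighs the length penalty, so the phrase survives as one segment instead of being split.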
  32. 39 SegPhrase+: Enhancing Phrasal Segmentation  SegPhrase+: one more round for enhanced phrasal segmentation  Feedback: using rectified frequencies, re-compute the features previously computed from raw frequencies  Process: classification → phrasal segmentation (SegPhrase) → classification → phrasal segmentation (SegPhrase+)  Effects on computed quality scores, e.g.:  "np hard in the strong sense"  "np hard in the strong"  "data base management system"
  33. 40 Experimental Results: Interesting Phrases Generated (from titles & abstracts of SIGKDD). Query: SIGKDD. Rank: SegPhrase+ | Chunking (TF-IDF & C-Value)
    1: data mining | data mining
    2: data set | association rule
    3: association rule | knowledge discovery
    4: knowledge discovery | frequent itemset
    5: time series | decision tree
    …
    51: association rule mining | search space
    52: rule set | domain knowledge
    53: concept drift | important problem
    54: knowledge acquisition | concurrency control
    55: gene expression data | conceptual graph
    …
    201: web content | optimal solution
    202: frequent subgraph | semantic relationship
    203: intrusion detection | effective way
    204: categorical attribute | space complexity
    205: user preference | small set
    (Some phrases appear only in SegPhrase+, others only in Chunking.)
  34. 41 Mining Quality Phrases in Multiple Languages  Both ToPMine and SegPhrase+ are extensible to mining quality phrases in multiple languages  SegPhrase+ on Chinese (from Chinese Wikipedia); rank: phrase (English gloss)
    62: 首席_执行官 (CEO)
    63: 中间_偏右 (Middle-right)
    …
    84: 百度_百科 (Baidu Pedia)
    85: 热带_气旋 (Tropical cyclone)
    86: 中国科学院_院士 (Fellow of Chinese Academy of Sciences)
    …
    1001: 十大_中文_金曲 (Top-10 Chinese Songs)
    1002: 全球_资讯网 (Global News Website)
    1003: 天一阁_藏_明代_科举_录_选刊 (a Chinese book name)
    …
    9934: 国家_戏剧_院 (National Theater)
    9935: 谢谢_你 (Thank you)
     ToPMine on Arabic (from the Quran, Fus7a Arabic, no preprocessing); example phrases: كفروا (Those who disbelieve); بسم الله الرحمن الرحيم (In the name of God, the Gracious and Merciful)
  35. 42 AutoPhrase: Automated Phrase Mining  Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, Jiawei Han, "AutoPhrase: Automated Phrase Mining from Massive Text Corpora", arXiv, Feb. 2017  Automatic extraction of high-quality phrases (e.g., scientific terms and general entity names) from a given corpus (e.g., research papers and news)  No human effort  Multiple languages  High performance: precision, recall, efficiency
  36. 43 AutoPhrase: Label Generation by Distant Supervision  Completely removes the human effort of labeling phrases  Distant training: utilize high-quality phrases in KBs (e.g., Wikipedia) as positive phrase labels  Method: sampling-based label generation  Positive labels: Wikipedia contains many high-quality phrases in titles, keywords, and internal links (e.g., more than 20,000 in Chinese); uniformly draw 100 samples as positive labels for single-word and multi-word phrases respectively  Negative labels: the pool of phrase candidates meeting the popularity requirement is huge, and the majority are actually of poor quality (e.g., "francisco opera and"); a small Chinese corpus has about 4 million frequent phrase candidates, of which more than 90% are of poor quality
  37. 44 Robust Positive-Only Distant Training  For each base classifier, randomly draw K phrase candidates with replacement from the positive pool (e.g., wiki titles, keywords, links) and K from the noisy negative pool respectively  Noisy negative pool: about δ quality phrases hide among the K negative labels  This yields a perturbed training set: a size-2K subset of all phrases in which the labels of some quality phrases are switched from positive to negative  Grow an unpruned decision tree to the point of separating all these phrases  Use an ensemble classifier that averages the results of T independently trained base classifiers (see the sketch below)
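A runnable sketch of this idea under toy assumptions: the feature matrices, pool sizes, K, and T are all invented, and real AutoPhrase features would be the statistical signals from the earlier slides (frequency, concordance, informativeness, …):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical phrase feature vectors (toy stand-ins for the statistical signals).
X_pos = rng.normal(1.0, 1.0, size=(500, 4))    # KB-matched phrases: positive pool
X_neg = rng.normal(-1.0, 1.0, size=(5000, 4))  # unmatched candidates: noisy negative pool

K, T = 100, 20  # labels sampled per pool per base classifier; ensemble size

def train_ensemble():
    """T unpruned trees, each fit on K positives + K noisy negatives (with replacement)."""
    trees = []
    for _ in range(T):
        X = np.vstack([X_pos[rng.integers(0, len(X_pos), K)],
                       X_neg[rng.integers(0, len(X_neg), K)]])
        y = np.array([1] * K + [0] * K)
        trees.append(DecisionTreeClassifier().fit(X, y))  # unpruned: separates all phrases
    return trees

def quality_score(trees, x):
    """Average the base classifiers' votes; label noise tends to cancel in the mean."""
    return float(np.mean([t.predict(x.reshape(1, -1))[0] for t in trees]))

trees = train_ensemble()
print(quality_score(trees, np.array([1.2, 0.8, 1.1, 0.9])))  # near 1: likely a quality phrase
```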
  38. 45 Is this idea robust?  Theoretical analysis: with T base classifiers, the ensemble's error from label noise decreases exponentially in T (model error remains)  Empirical performance: AUC is used to evaluate the ranking
  39. 46 Automatic vs. Manual  4 types of label pools:  EP = domain experts give positives  DP = distant supervision provides positives  EN = domain experts give negatives  DN = all unlabeled phrases form the negatives  2 cases:  When we have enough labels in EP  After we exhaust labels in EP
  40. 47 Case 1: When we have enough labels in EP

     DP has reasonable quality, but EP is better  DN has been successfully denoised
  41. 48 Case 2: After we exhaust labels in EP  DP+DN can finally beat domain experts as DP becomes large enough!
  42. 49 Generating High-Quality Phrases in Multiple Languages  Complicated pre-processing models, such as dependency parsing, rely heavily on human effort and thus cannot be smoothly applied to multiple languages  Minimum language dependency = tokenization + POS tagging  Drawbacks of frequency-based signals only: over-decomposition & under-decomposition  "Sophia Smith" vs. "Sophia" and "Smith"  "Great Firewall" vs. "firewall software"  Drawbacks of POS only:  "classifier SVM" vs. "discriminative classifier" and "SVM"  Remedy: context-aware phrasal segmentation
  43. 50 Single-Word Modeling: Enhancing Recall  AutoPhrase simultaneously models single-word and multi-word phrases  A phrase is not only a group of multiple words but possibly a single word, as long as it functions as a constituent in the syntax of a sentence, e.g., "UIUC", "Illinois"  In our experiments, 10%~30% of quality phrases are single-word phrases  Modeling single-word phrases: re-examining the requirements of quality multi-word phrases  Popularity: sufficiently frequent in the given corpus  Concordance: tokens collocate with frequency significantly higher than random  Informativeness: indicative of a specific topic or concept  Completeness: a complete semantic unit  Of these, only concordance cannot be defined for single-word phrases; it is replaced by  Independence: a quality single-word phrase is more likely a complete semantic unit in the given documents
  44. 51 Experiments and Performance Comparison  Datasets  Comparison methods:  SegPhrase/WrapSegPhrase (with encoding preprocessing for handling non-English)  TF-IDF/TextRank. Phrase Mining Results
  45. 52 References  Deane, P., 2005, June. A nonparametric method

    for extraction of candidate phrasal terms. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 605-613). Association for Computational Linguistics.  Koo, T., Carreras Pérez, X. and Collins, M., 2008. Simple semi-supervised dependency parsing. In 46th Annual Meeting of the Association for Computational Linguistics (pp. 595-603).  Xun, E., Huang, C. and Zhou, M., 2000, October. A unified statistical model for the identification of English baseNP. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (pp. 109-116). Association for Computational Linguistics.  Zhang, Z., Iria, J., Brewster, C. and Ciravegna, F., 2008, May. A comparative evaluation of term recognition algorithms. In LREC.  Park, Y., Byrd, R.J. and Boguraev, B.K., 2002, August. Automatic glossary extraction: beyond terminology identification. In Proceedings of the 19th international conference on Computational linguistics-Volume 1 (pp. 1-7). Association for Computational Linguistics.  Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and Nevill-Manning, C.G., 1999, August. KEA: Practical automatic keyphrase extraction. In Proceedings of the fourth ACM conference on Digital libraries (pp. 254-255). ACM.  Liu, Z., Chen, X., Zheng, Y. and Sun, M., 2011, June. Automatic keyphrase extraction by bridging vocabulary gap. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (pp. 135-144). Association for Computational Linguistics.  Evans, D.A. and Zhai, C., 1996, June. Noun-phrase analysis in unrestricted text for information retrieval. In Proceedings of the 34th annual meeting on Association for Computational Linguistics (pp. 17-24). Association for Computational Linguistics.
  46. 53 References (Continued)  Frantzi, K., Ananiadou, S. and Mima, H., 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2), pp.115-130.  Mihalcea, R. and Tarau, P., 2004, July. TextRank: Bringing order into texts. Association for Computational Linguistics.  Blei, D.M. and Lafferty, J.D., 2009. Topic models. Text mining: classification, clustering, and applications, 10(71), p.34.  Danilevsky, M., Wang, C., Desai, N., Ren, X., Guo, J. and Han, J., 2014, April. Automatic construction and ranking of topical keyphrases on collections of short documents. In Proceedings of the 2014 SIAM International Conference on Data Mining (pp. 398-406). Society for Industrial and Applied Mathematics.  Church, K., Gale, W., Hanks, P. and Hindle, D., 1991. Using statistics in lexical analysis. Lexical acquisition: exploiting on-line resources to build a lexicon, 115, p.164.  El-Kishky, A., Song, Y., Wang, C., Voss, C.R. and Han, J., 2014. Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3), pp.305-316.  Parameswaran, A., Garcia-Molina, H. and Rajaraman, A., 2010. Towards the web of concepts: Extracting concepts from large datasets. Proceedings of the VLDB Endowment, 3(1-2), pp.566-577.  Liu, J., Shang, J., Wang, C., Ren, X. and Han, J., 2015, May. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 1729-1744). ACM.  Shang, J., Liu, J., Jiang, M., Ren, X., Voss, C.R. and Han, J., 2017. Automated Phrase Mining from Massive Text Corpora. arXiv preprint arXiv:1702.04457.
  47. 55 Automatic Structure Mining: Methodology. Knowledge Bases + Massive Text Corpus → Automatic Phrase Mining Methods (SIGMOD'15, arXiv'17) → Entity Names & Context Units → Typing and Relation Extraction Methods (KDD'15, KDD'16, EMNLP'16, WWW'17) → Typed Entities & Relations → Meta Pattern-Driven Attribute Name & Value Discovery Methods (KDD'17) → Attribute Names & Values, more General Relations → Other Tasks and Applications (ECMLPKDD'17, KDD'17, WWW'16)
  48. 56 Reading the Reviews: From Text to Typed Entities and Relationships. Entity types: Restaurant, Location, Organization, Event. Review (TripAdvisor): "This hotel is my favorite Hilton property in NYC! It is located right on 42nd street near Times Square, it is close to all subways, Broadways shows, and next to great restaurants like Junior's Cheesecake, Virgil's BBQ and many others." Structured facts: 1. "typed" entities; 2. "typed" relationships, e.g., (hotel, is a, Hilton property); (hotel, located at, NYC / Times Square); (hotel, located near, Junior's Cheesecake / Virgil's BBQ); (hotel, close to, Broadway shows)
  49. 57 Prior Art: Extracting Structures with Repeated Human Effort. Text corpus examples: "This hotel is my favorite Hilton property in NYC! It is located right on 42nd street near Times Square, it is close to all subways, Broadways shows, and next to many great …"; "The June 2013 Egyptian protests were mass protest events that occurred in Egypt on 30 June 2013, …"; "We had a room facing Times Square and a room facing the Empire State Building. The location is close to everything and we love …" Human labeling → labeled data ("Broadways shows", "NYC", "Hilton property", "Times Square hotel") → extraction rules, machine-learning models (Stanford CoreNLP, CMU NELL, UW KnowItAll, USC AMR, IBM Alchemy APIs, Google Knowledge Graph, Microsoft Satori, …) → structured facts
  50. 58 Our Work: Effort-Light StructMine Corpus-specific Models Text Corpus Structures

    • Enables quick development of applications over various corpora • Extracts complex structures without introducing human errors News articles PubMed papers Knowledge Bases (KB)
  51. 59 Effort–Light StructMine: Where Are We? Human labeling effort Feature

    engineering effort Weakly-supervised learning systems Hand-crafted Systems Supervised learning systems Distantly-supervised Learning Systems CMU NELL, 2009 - present UW KnowItAll, Open IE, 2005 - present Max-Planck YAGO, 2008 - present Stanford CoreNLP, 2005 - present UT Austin Dependency Kernel, 2005 IBM Watson Language APIs UCB Hearst Pattern, 1992 NYU Proteus, 1997 Stanford DeepDive, MIML-RE 2012 - present UW FIGER, MultiR, 2012 Effort-Light StructMine (WWW’15, KDD’15, 16, 17, EMNLP’16, WWW’17, …)
  52. 60 "Distant" Supervision: What Is It? Match a text corpus against knowledge bases: "matchable" structures (entity names, entity types, typed relationships, …) receive labels from the KB, while the rest remain "un-matchable" (otherwise requiring human crowds). KBs are freely available (common knowledge, life sciences, art, …) and rapidly growing (number of Wikipedia articles: https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia). (Mintz et al., 2009), (Riedel et al., 2010), (Lin et al., 2012), (Ling et al., 2012), (Surdeanu et al., 2012), (Xu et al., 2013), (Nagesh et al., 2014), …
  53. 61 Learning with Distant Supervision: Challenges 1. Sparsity of “Matchable”

     Incomplete knowledge bases  Low-confidence matching 2. Accuracy of “Expansion”  For “matchable”: Are all the labels assigned accurately?  For “un-matchable”: How to perform inference accurately? (Ren et al., KDD’15) It is my favorite city in the United States The United States needs a new strategy to meet this challenge Government Location … next to restaurants like Junior’s Cheesecake ✗
  54. 62 Effort-Light StructMine: Contributions. Challenge → solution idea:  Sparsity of "matchable" → effective expansion from "matchable" to "un-matchable"  Accuracy of "expansion" → pick the "best" labels based on the context (for both "matchable" and "un-matchable"); harness the "data redundancy" using graph-based joint optimization. Example: "It is my favorite city in the United States" (Location) vs. "The United States needs a new strategy to meet this challenge" (Government)
  55. 63 Effort-Light StructMine: Methodology. Corpus to Structured Network: The Roadmap. Text corpus → data-driven text segmentation (SIGMOD'15, WWW'16) → entity names & context units → (with knowledge bases) partially-labeled corpus → learning corpus-specific models (KDD'15, KDD'16, EMNLP'16, WWW'17) → structures from the remaining unlabeled data. Components: Entity Recognition and Coarse-grained Typing (KDD'15); Fine-grained Entity Typing (KDD'16); Joint Entity and Relation Extraction (WWW'17)
  56. 64 Corpus to Structured Network: The Roadmap. Text corpus → data-driven text segmentation (SIGMOD'15, WWW'16) → entity names & context units → (with knowledge bases) partially-labeled corpus → learning corpus-specific models (KDD'15, KDD'16, EMNLP'16, WWW'17) → structures from the remaining unlabeled data. Components: Entity Recognition and Coarse-grained Typing (KDD'15); Fine-grained Entity Typing (KDD'16)
  57. 65 Recognizing Entities of Target Types in Text The best

    BBQ I’ve tasted in Phoenix ! I had the pulled pork sandwich with coleslaw and baked beans for lunch. The owner is very nice. … food location person The best BBQ I’ve tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. The owner is very nice. …
  58. 66 Traditional Named Entity Recognition (NER) Systems  Heavy reliance on corpus-specific human labeling via a manual annotation interface, e.g., "The best BBQ I've tasted in Phoenix" → O O Food O O O Location  Training sequence models is slow  e.g., (McCallum & Li, 2003), (Finkel et al., 2005), (Ratinov & Roth, 2009), …  NER systems: Stanford NER, Illinois Name Tagger, IBM Alchemy APIs, …
  59. 67 Leveraging Distant Supervision  (Lin et al., 2012), (Ling et al., 2012), (Nakashole et al., 2013)  1. Detect entity names from text  2. Match name strings to KB entities  3. Propagate types to the un-matchable names (see the sketch below). Example sentences: S1: "Phoenix is my all-time favorite dive bar in New York City." S2: "The best BBQ I've tasted in Phoenix." S3: "Phoenix has become one of my favorite bars in NY." Matched: New York City → Location, BBQ → Food; un-matchable: Phoenix → ???, NY → ???
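A toy sketch of steps 2-3 (invented KB entries and triples; real systems such as ClusType perform the propagation jointly with relation-phrase clustering rather than by exact phrase matching):

```python
kb_types = {"new york city": "Location"}  # hypothetical KB entries

# (left argument, relation phrase, right argument) triples from segmented text.
triples = [
    ("Phoenix", "dive bar in", "New York City"),  # NYC matches the KB
    ("Mario's", "dive bar in", "Brooklyn"),       # Brooklyn does not
]

# Step 2: record which argument slot of each relation phrase links to a KB type.
slot_type = {}
for left, rel, right in triples:
    if right.lower() in kb_types:
        slot_type[(rel, "right")] = kb_types[right.lower()]
    if left.lower() in kb_types:
        slot_type[(rel, "left")] = kb_types[left.lower()]

# Step 3: propagate to un-matchable names through the shared relation phrase.
for left, rel, right in triples:
    t = kb_types.get(right.lower()) or slot_type.get((rel, "right"), "???")
    print(f"{right}: {t}")
# -> New York City: Location  (direct KB match)
# -> Brooklyn: Location       (propagated via "dive bar in")
```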
  60. 68 ID Sentence S1 Phoenix is my all-time favorite dive

    bar in New York City . S2 The best BBQ I’ve tasted in Phoenix. S3 Phoenix has become one of my favorite bars in NY . Current Distant Supervision: Limitation I 1. Context-agnostic type prediction  Predict types for each mention regardless of context 2. Sparsity of contextual bridges
  61. 69 Current Distant Supervision: Limitation II 1. Context-agnostic type prediction

    2. Sparsity of contextual bridges  Some relational phrases are infrequent in the corpus  ineffective type propagation ID Sentence S1 Phoenix is my all-time favorite dive bar in New York City . S3 Phoenix has become one of my favorite bars in NY .
  62. 70 Our Solution: Data-Driven Entity Mention Detection. Combine corpus-level concordance with syntactic quality; quality of merging = significance of a merge between two sub-phrases. Pattern → example: (J*)N* → "support vector machine"; VP → "tasted in", "damage on"; VW*(P) → "train a classifier with". Good concordance examples: "The best BBQ I've tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. …"; "This place serves up the best cheese steak sandwich in west of Mississippi."
  63. 71 Our Solution: ClusType (KDD’15) BBQ NY tasted in has

    become one of my favorite bars in New York City is my all-time favorite dive bar in ID Segmented Sentences S1 Phoenix is my all-time favorite dive bar in New York City . S2 The best BBQ I’ve tasted in Phoenix. S3 Phoenix has become one of my favorite bars in NY . S2: BBQ S3: NY S1: New York City S2: Phoenix S3: Phoenix Putting two sub-tasks together: 1. Type label propagation 2. Relation phrase clustering Similar relation phrases Correlated mentions Phoenix S1: Phoenix Represents object interactions https://github.com/shanzhenren/ClusType
  64. 72 ClusType: Comparing with State-of-the-Art Systems (F1 score). Methods: NYT / Yelp / Tweet
    Pattern (Stanford, CoNLL'14) [bootstrapping]: 0.301 / 0.199 / 0.223
    SemTagger (U Utah, ACL'10) [bootstrapping]: 0.407 / 0.296 / 0.236
    NNPLB (UW, EMNLP'12) [label propagation]: 0.637 / 0.511 / 0.246
    APOLLO (THU, CIKM'12) [label propagation]: 0.795 / 0.283 / 0.188
    FIGER (UW, AAAI'12) [classifier with linguistic features]: 0.881 / 0.198 / 0.308
    ClusType (KDD'15): 0.939 / 0.808 / 0.451
    Precision (P) = #correctly-typed mentions / #predicted mentions; Recall (R) = #correctly-typed mentions / #ground-truth mentions; F1 = 2PR / (P + R). Datasets: NYT: 118k news articles (1k manually labeled for evaluation); Yelp: 230k business reviews (2.5k manually labeled for evaluation); Tweet: 302k tweets (3k manually labeled for evaluation)
  65. 73 Corpus to Structured Network: The Roadmap. Text corpus → data-driven text segmentation (SIGMOD'15, WWW'16) → entity names & context units → (with knowledge bases) partially-labeled corpus → learning corpus-specific models (KDD'15, KDD'16, EMNLP'16, WWW'17) → structures from the remaining unlabeled data. Components: Entity Recognition and Coarse-grained Typing (KDD'15); Fine-grained Entity Typing (KDD'16)
  66. 74 From Coarse-Grained Typing to Fine-Grained Entity Typing. S1: "Donald Trump spent 14 television seasons presiding over a game show, NBC's The Apprentice." Coarse: a few common types (Person, Location, Organization). Fine: a type hierarchy with 100+ types from the knowledge base, e.g., root → product, person, location, organization, …; person → politician, artist, businessman, …; artist → author, actor, singer, …  (Ling et al., 2012), (Nakashole et al., 2013), (Yogatama et al., 2015)
  67. 75 Current Distant Supervision: Context-Agnostic Labeling. Entity types from the knowledge base: root → person → politician, artist, businessman; artist → author, actor, singer, … Entity: Donald Trump. S1: "Donald Trump spent 14 television seasons presiding over a game show, NBC's The Apprentice" is labeled with ALL of the entity's KB types: person, artist, actor, author, businessman, politician  Inaccurate labels in the training data  Prior work assumes all labels are "perfect"
  68. 76 Our Solution: Partial Label Embedding (KDD’16) “De-noised” labeled data

    ID Sentence S1 Donald Trump spent 14 television seasons presiding over a game show, NBC’s The Apprentice Extract Text Features “Label Noise Reduction” with PLE Train Classifiers on De-noised Data Prediction on New Data S1: Donald Trump Entity Types: person, artist, actor, author, businessman, politician Text features: TOKEN_Donald, CONTEXT: television, CONTEXT: season, TOKEN_trump, SHAPE: AA More effective classifiers (Ren et al., KDD’16) https://github.com/shanzhenren/PLE
  69. 77 PLE: Modeling Clean and Noisy Mentions Separately (Ren et al., KDD'16)  For a clean mention, its "positive types" should be ranked higher than all its "negative types", e.g., Si: "Ted Cruz", types in KB: person, politician  For a noisy mention, its "best candidate type" should be ranked higher than all its "non-candidate types", e.g., S1: "Donald Trump spent 14 television seasons presiding over a game show, NBC's The Apprentice", types in KB: person, artist, actor, author, businessman, politician; types ranked: (+) actor 0.88, (+) artist 0.74, (+) person 0.55, (+) author 0.41, (+) politician 0.33, (+) business 0.31; "best" candidate type (+) actor outranks (-) singer, (-) coach, (-) doctor, (-) location, (-) organization (see the loss sketch below)
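A minimal sketch of the two ranking constraints as a margin loss (not PLE's exact objective; the scores, margin, and type sets are invented):

```python
def partial_label_loss(scores, candidates, is_clean, margin=1.0):
    """Clean mention: every candidate type must outrank every non-candidate.
    Noisy mention: only the BEST candidate type must outrank every non-candidate.
    scores: dict type -> relevance score for one mention."""
    neg = [s for t, s in scores.items() if t not in candidates]
    if is_clean:
        pos = [s for t, s in scores.items() if t in candidates]
    else:
        pos = [max(s for t, s in scores.items() if t in candidates)]
    return sum(max(0.0, margin - p + n) for p in pos for n in neg)

scores = {"actor": 0.88, "artist": 0.74, "person": 0.55,
          "politician": 0.33, "singer": 0.20, "location": 0.05}
cands = {"actor", "artist", "person", "politician"}
print(partial_label_loss(scores, cands, is_clean=False))  # only "actor" must win
print(partial_label_loss(scores, cands, is_clean=True))   # every candidate must win
```

The noisy case deliberately ignores low-scoring candidates such as "politician" here, which is how context-aware labeling avoids forcing every KB type onto every mention.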
  70. 78 Type Inference in PLE (Ren et al., KDD'16)  Top-down nearest-neighbor search in the given type hierarchy (from the knowledge base), in a low-dimensional vector space shared by text features and types  Test mention Si: "_Trump" in "President Trump gave an all-hands address to troops at the U.S. Central Command headquarters …"; the mention vector is composed from the vectors of its text features (e.g., "President", "gave", "speech") and matched down the hierarchy root → person → politician (see the sketch below)
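A sketch of such a top-down search, with invented 2-D embeddings and an invented similarity threshold:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical low-dimensional embeddings, jointly learned for types and features.
type_vec = {"person": np.array([1.0, 0.1]), "politician": np.array([1.1, 0.6]),
            "artist": np.array([0.9, -0.5]), "actor": np.array([0.8, -0.7])}
children = {"root": ["person"], "person": ["politician", "artist"],
            "politician": [], "artist": ["actor"]}

def infer_types(mention_vec, threshold=0.8):
    """Descend the hierarchy, at each level picking the child type nearest to the
    mention vector; stop when similarity falls below the (illustrative) threshold."""
    path, node = [], "root"
    while children.get(node):
        best = max(children[node], key=lambda t: cos(mention_vec, type_vec[t]))
        if cos(mention_vec, type_vec[best]) < threshold:
            break
        path.append(best)
        node = best
    return path

# Mention vector for "_Trump", e.g., the sum of its feature vectors.
print(infer_types(np.array([1.15, 0.7])))  # -> ['person', 'politician']
```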
  71. 79 PLE: Performance of Fine-Grained Entity Typing  Raw: candidate types from distant supervision  WSABIE (Google, ACL'15): joint feature and type embedding  Predictive Text Embedding (PTE) (MSR, WWW'15): joint mention, feature, and type embedding  Both WSABIE and PTE suffer from "noisy" training labels  PLE (KDD'16): partial-label loss for context-aware labeling. Accuracy (= #correctly-typed mentions / #total mentions) by type level:
    Level-1: Raw 0.70, WSABIE 0.79, PTE 0.78, PLE 0.81
    Level-2: Raw 0.45, WSABIE 0.49, PTE 0.51, PLE 0.62
    Level-3: Raw 0.05, WSABIE 0.14, PTE 0.19, PLE 0.48
    Dataset: OntoNotes public dataset (Weischedel et al., 2011; Gillick et al., 2014): 13,109 news articles, 77 annotated documents, 89 entity types
  72. 80 Conclusion  Introduction  Part I: Entity Extraction through

    Phrase Mining  Part II: Entity Typing Thank you! Questions can be sent to [email protected]