
Phrase Mining and Entity Mining (词汇挖掘与实体挖掘)

孙玉龙
September 02, 2019


Transcript

  1. 2 Unstructured Text Data (accounts for ~80% of all data in organizations) → Knowledge & Insights (Chakraborty, 2016). Machine? Human? Mining Structures from Massive Text Data. (Icons: flaticon.com)
  2. 4 Structure Mining. Example: from text such as "The Mona Lisa is a half-length portrait painting by the Italian Renaissance artist Leonardo da Vinci that has …" and "Leonardo di ser Piero da Vinci (15 April 1452 – 2 May 1519) …", mine the entity "Mona Lisa", the relation "paint" linking it to "Leonardo di ser Piero da Vinci", and attribute names & values such as the birth and death dates. Massive Text Corpus → Entities, Relations, Attribute Names & Values.
  3. 5 A Product Use Case: Finding "Interesting Hotel Collections". Technology transfer to TripAdvisor: grouping hotels based on structured facts extracted from the review text. Features for the "Catch a Show" collection: (1) broadway shows, (2) beacon theater, (3) broadway dance center, (4) broadway plays, (5) david letterman show, (6) radio city music hall, (7) theatre shows. Features for the "Near The High Line" collection: (1) high line park, (2) chelsea market, (3) highline walkway, (4) elevated park, (5) meatpacking district, (6) west side, (7) old railway. http://engineering.tripadvisor.com/using-nlp-to-find-interesting-collections-of-hotels/
  4. 6 Why Text to Structure? 6 Structured Search & Exploration

    Facet Taxonomy Construction Structured Feature Generation Graph Mining & Network Analysis
  5. 7 Prior Art: Extracting Structure with Domain Expert Effort. News, Reviews, Scientific Papers, … Domain Experts → Extraction Rules, Machine-Learning Models (Stanford CoreNLP, CMU NELL, UW KnowItAll, USC AMR, IBM Alchemy APIs, Google Knowledge Graph, Microsoft Satori, …). Text Corpus (Yelp reviews, NYTimes News, PubMed Papers) → Entities, Relations, and Attribute Names & Values. • Models for the same task may require different labeled data in different domains
  6. 8 News Reviews Scientific Papers … This Lecture: Automatic Structure

    Mining from Massive Text Corpora 8 Public Knowledge Bases Extraction Rules Machine-Learning Models AutoPhrase CoType MetaPAD … Text Corpus Yelp reviews NYTimes News PubMed Papers • Enables quick development of applications in various domains. • Extracts complex structures without introducing additional human effort.
  7. 9 “Automatic” Definition  Automatic  Minimal Human Effort 

    Using only existing general knowledge bases without any other human effort. Rapidly growing! Freely available! • Common knowledge • Life sciences • Art … ERA structures: entity names, entity types, typed relationships ... That’s it? Problem solved? Everything can be found in KBs? Number of Wikipedia articles https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia
  8. 10 Automatic Structure Mining: Methodology. Knowledge Bases + Massive Text Corpus → Automatic Phrase Mining Methods (SIGMOD'15, arXiv'17) → Entity Names & Context Units → Typing and Relation Extraction Methods (KDD'15, KDD'16, EMNLP'16, WWW'17) → Typed Entities & Relations → Meta Pattern-Driven Attribute Name & Value Discovery Methods (KDD'17) → Attribute Names & Values, more General Relations → Other Tasks and Applications (ECMLPKDD'17, KDD'17, WWW'16)
  9. 11 Lecture Outline  Introduction  Part I: Entity Extraction

    through Phrase Mining  Part II: Entity Typing
  10. 13 Automatic Structure Mining: Methodology. Knowledge Bases + Massive Text Corpus → Automatic Phrase Mining Methods (SIGMOD'15, arXiv'17) → Entity Names & Context Units → Typing and Relation Extraction Methods (KDD'15, KDD'16, EMNLP'16, WWW'17) → Typed Entities & Relations → Meta Pattern-Driven Attribute Name & Value Discovery Methods (KDD'17) → Attribute Names & Values, more General Relations → Other Tasks and Applications (ECMLPKDD'17, KDD'17, WWW'16)
  11. 14 Definition: Quality Phrase Mining  Quality phrase mining seeks

    to extract a ranked list of phrases with decreasing quality from a large collection of documents  Examples: Scientific Papers News Articles Expected Results US President Anderson Cooper Barack Obama … Obama administration … a town … Expected Results data mining machine learning information retrieval … support vector machine … the paper …
  12. 15 Why Phrase Mining? w/o phrase mining: • What is "united"? • Which Dao? w/ phrase mining: • United Airlines! • David Dao!  Applications in NLP, IR, Text Mining  Document analysis  Indexing in search engines  Keyphrases for topic modeling  Summarization
  13. 16 What Kind of Phrases Are of "High Quality"?  Popularity: frequency ("information retrieval" vs. "cross-language information retrieval")  Concordance: a sequence of words that co-occurs more frequently than expected ("powerful tea" vs. "strong tea"; "active learning" vs. "learning classification")  Informativeness: "this paper" is frequent but not discriminative, hence not informative  Completeness: "vector machine" vs. "support vector machine". Concordance can be measured with many statistical measures, e.g., significance score, mutual information, t-test, z-test, chi-squared test, likelihood ratio, … (see the sketch below)
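To make the concordance idea concrete, here is a minimal sketch (not from the deck) that scores two candidate bigrams with pointwise mutual information, one of the measures listed above; all corpus counts are invented for illustration:

```python
import math

# Hypothetical corpus statistics, invented for illustration.
total_tokens = 1_000_000
count = {"strong": 1200, "tea": 800, "strong tea": 150,
         "powerful": 900, "powerful tea": 2}

def pmi(w1, w2):
    """Pointwise mutual information: log p(w1 w2) / (p(w1) * p(w2))."""
    p_joint = count[f"{w1} {w2}"] / total_tokens
    p1 = count[w1] / total_tokens
    p2 = count[w2] / total_tokens
    return math.log(p_joint / (p1 * p2))

print(pmi("strong", "tea"))    # large: the pair co-occurs far more often than chance
print(pmi("powerful", "tea"))  # small: both words frequent, but a rare collocation
```

Any of the other listed measures (t-test, z-test, chi-squared, likelihood ratio) can be dropped into the same scoring slot.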
  14. 17 Our Recent Efforts on Phrase Mining  KERT (2014)  Marina Danilevsky, Chi Wang, Nihit Desai, Xiang Ren, Jingyi Guo, and Jiawei Han. "Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents", SIAM Data Mining Conf. (SDM), 2014  ToPMine (2014-2015) Code package downloadable at http://elkishk2.web.engr.illinois.edu  Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R. Voss, and Jiawei Han, "Scalable Topical Phrase Mining from Text Corpora", 2015 Int. Conf. on Very Large Data Bases (VLDB'15)  SegPhrase (2015) GitHub source: https://github.com/shangjingbo1226/SegPhrase  Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han, "Mining Quality Phrases from Massive Text Corpora", 2015 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'15)  Received the Grand Prize of the 2015 Yelp Dataset Challenge; used in industry, e.g., TripAdvisor, MSR, Google Research  AutoPhrase (2017) GitHub source: https://github.com/shangjingbo1226/AutoPhrase  Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, Jiawei Han, "AutoPhrase: Automated Phrase Mining from Massive Text Corpora", arXiv, Feb. 2017  A New E-Book on Phrase Mining and Its Applications  Jialu Liu, Jingbo Shang, and Jiawei Han, Phrase Mining from Massive Text and Its Applications, Synthesis Lectures on Data Mining and Knowledge Discovery, Morgan & Claypool Publishers, 2017
  15. 19 Supervised Phrase Mining  Phrase mining originated in the NLP community  How to use linguistic analyzers to extract phrases?  Parsing (e.g., Stanford NLP parsers)  Noun Phrase (NP) Chunking  How to rank extracted phrases?  C-value [Frantzi et al.'00]  TextRank [Mihalcea et al.'04]  TF-IDF
  16. 20 Linguistic Analyzer – Parsing (Full-text Parsing): raw text sentence (string) → full parse tree (grammatical analysis). Example: "The chef cooks the soup."  Minimal grammatical segments → phrases, e.g., "the chef", "the soup"
  17. 21 Linguistic Analyzer – Chunking  Noun phrase chunking is

    a light version of parsing 1. Apply tokenization and part-of-speech (POS) tagging to each sentence 2. Search for noun phrase chunks
  18. 22 Inefficiencies of Linguistic Analyzers  Difficult to apply pre-trained models directly to new domains (e.g., Twitter, biomedical, Yelp)  unless sophisticated, manually curated, domain-specific training data are provided  Computationally slow: cannot be applied to web-scale data to support emerging applications  No use of corpus-level information  NPs sometimes can't meet the requirements of quality phrases  We need "shallow" phrase mining techniques
  19. 23 Ranking  C-Value: prefers "maximal" phrases (popularity & completeness)  TextRank: similar to PageRank (popularity & informativeness)  TF-IDF: term frequency × inverse document frequency (popularity & informativeness). Example text for ranking: "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. …"
  20. 25 Unsupervised Phrase Mining  Statistics based on massive text

    corpora  Popularity  Raw frequency  Frequency distribution based on Zipfian ranks [Deane’05]  Concordance  Significance score [Church et al.’91][El-Kishky et al.’14]  Completeness  Comparison to super/sub-sequences [Parameswaran et al.’10]
  21. 26 Raw Frequency  Raw frequency alone could NOT reflect the quality of phrases, so it is combined with topic modeling:  Merge adjacent unigrams of the same topic [Blei & Lafferty'09]  Frequent pattern mining within the same topic [Danilevsky et al.'14]  Limitations  Tokens in the same phrase may be assigned to different topics  E.g., "knowledge discovery using least squares support vector machine classifiers …"
  22. 27 Frequency Distribution  Idea: ranks in a Zipfian frequency distribution are more reliable than raw frequency  Heuristic: Actual Rank / Expected Rank  Example: given a phrase like "east end"  Actual Rank: rank of "east end" among all occurrences of "east" (e.g., "east end", "east side", "the east", "towards the east", etc.)  Expected Rank: rank of "__ end" among all contexts of "east" (e.g., "__ end", "__ side", "the __", "towards the __", etc.)
  23. 28 Significance Score  Significance score [Church et al.'91], a.k.a. z-score; used in ToPMine [El-Kishky et al.'15]  If a phrase can be decomposed into two parts, P = P1 • P2, then α(P1, P2) ≈ (f(P1•P2) − µ0(P1, P2)) / √f(P1•P2), where f(P1•P2) is the observed frequency of the merged phrase and µ0(P1, P2) its expected frequency if P1 and P2 occurred independently; high α indicates quality phrases
  24. 29 TopMine: First Mine Frequent Contiguous Patterns, Then Conduct Topic Modeling. First perform frequent contiguous pattern mining to extract candidate phrases and their counts, then rectify raw counts to real frequencies by phrasal segmentation, merging by the significance score [Church et al.'91]: α(P1, P2) ≈ (f(P1•P2) − µ0(P1, P2)) / √f(P1•P2) (see the sketch below). Example segmentations: [Markov blanket] [feature selection] for [support vector machines]; [knowledge discovery] using [least squares] [support vector machine] [classifiers]; … [support vector] for [machine learning] … Phrase / raw freq. / true freq.: [support vector machine] 90 / 80; [vector machine] 95 / 0; [support vector] 100 / 20
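A minimal sketch of the significance score above (not the ToPMine implementation): µ0 is approximated as the expected count of P1 immediately followed by P2 under independence, and the corpus counts are invented:

```python
import math

def significance(f_joint, f1, f2, n):
    """alpha(P1, P2) = (f(P1•P2) - mu0(P1, P2)) / sqrt(f(P1•P2)).
    mu0 = n * p(P1) * p(P2): expected adjacent co-occurrences if the
    two parts occurred independently across n word positions."""
    mu0 = n * (f1 / n) * (f2 / n)
    return (f_joint - mu0) / math.sqrt(f_joint)

n = 1_000_000  # corpus size in word positions (toy value)
# "support vector" + "machine": a strong collocation despite modest counts.
print(significance(f_joint=80, f1=100, f2=90, n=n))    # large and positive
# Two individually frequent units that rarely appear together.
print(significance(f_joint=3, f1=5000, f2=8000, n=n))  # negative: do not merge
```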
  25. 31 Comparison to Super/Sub-sequences  Frequency ratio between an n-gram phrase and its two (n-1)-gram sub-phrases  Example (raw frequencies: "San" 14585, "Antonio" 2855, "San Antonio" 2385):  Pre-confidence of "San Antonio": 2385 / 14585  Post-confidence of "San Antonio": 2385 / 2855  Expand / terminate based on thresholds (see the sketch below)
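A sketch of the pre/post-confidence test using the slide's counts; the 0.5 threshold is an invented value, since the deck's point is precisely that such thresholds must be tuned:

```python
freq = {"San": 14585, "Antonio": 2855, "San Antonio": 2385}

def pre_confidence(bigram):
    """freq(bigram) / freq(first word): how reliably the first word extends right."""
    first, _ = bigram.split()
    return freq[bigram] / freq[first]

def post_confidence(bigram):
    """freq(bigram) / freq(second word): how reliably the second word extends left."""
    _, second = bigram.split()
    return freq[bigram] / freq[second]

THRESHOLD = 0.5  # hypothetical cut-off
bigram = "San Antonio"
pre, post = pre_confidence(bigram), post_confidence(bigram)
print(pre, post)  # ~0.16 and ~0.84: "Antonio" is almost always preceded by "San"
print("expand" if max(pre, post) >= THRESHOLD else "terminate")
```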
  26. 32 Comparison to Super/Sub-sequences (cont'd)  Assumption: for an n-gram quality phrase, at least one of its two (n-1)-gram sub-phrases is not a quality phrase  Counter-example: "relational database system" is a quality phrase, yet both "relational database" and "database system" can be quality phrases
  27. 33 Limitations of Statistical Signals  The thresholds must be carefully chosen  Each signal only considers a subset of the quality-phrase requirements  Combining different signals in an unsupervised manner is difficult  Introducing some supervision may help!
  28. 35 Phrase Mining: From Raw Corpus to Quality Phrases and Segmented Corpus. Input: raw corpus + a small set of labels or a general KB → quality phrases + segmented corpus, integrating phrase mining with phrasal segmentation. Example documents: Document 1: "Citation recommendation is an interesting but challenging research problem in data mining area." Document 2: "In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique." Document 3: "Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications." ToPMine: A. El-Kishky, et al., "Scalable Topical Phrase Mining from Text Corpora", VLDB'15. SegPhrase: J. Liu, et al., "Mining Quality Phrases from Massive Text Corpora", SIGMOD'15 (Grand Prize in the Yelp Dataset Challenge). AutoPhrase: J. Shang, et al., "Automated Phrase Mining from Massive Text Corpora", 2017 (TKDE'17)
  29. 36 SegPhrase. Pipeline: raw corpus → phrase mining → quality phrases → phrasal segmentation → segmented corpus (same example documents as above)  Outperforms all the above methods on domain-specific corpora (e.g., Yelp reviews)
  30. 37 Quality Estimation  Weakly supervised  Labels: whether a phrase is a quality one or not, e.g., "support vector machine": 1; "the experiment shows": 0  For a ~1GB corpus, only 300 labels  Pros: binary annotations are easy  Cons: selecting hundreds of varying-quality phrases from millions of candidates must be done carefully
  31. 38 SegPhrase: Segmentation of Phrases  Partition a sequence of words by maximizing the likelihood, considering  Phrase quality score: ClassPhrase assigns a quality score to each phrase  Probability in the corpus  Length penalty: when > 1, it favors shorter phrases  Filter out phrases with low rectified frequency: bad phrases are expected to rarely occur in the segmentation results (see the dynamic-programming sketch below)
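The deck does not spell out the exact objective, so here is a minimal dynamic-programming sketch in the spirit of SegPhrase's segmentation: choose the split that maximizes the sum of log quality scores minus a length penalty that grows with segment length. The quality table and ALPHA are invented:

```python
import math

# Hypothetical quality scores from a ClassPhrase-style classifier (toy values).
quality = {"support vector machine": 0.95, "support vector": 0.80,
           "vector machine": 0.05, "support": 0.30, "vector": 0.30, "machine": 0.30}

ALPHA = 1.5  # length penalty; > 1 penalizes longer segments, favoring shorter phrases

def segment(tokens, max_len=3):
    """best[i] = max over j of best[j] + log q(tokens[j:i]) - (i-j-1)*log(ALPHA)."""
    n = len(tokens)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            cand = " ".join(tokens[j:i])
            q = quality.get(cand, 1e-6)  # unseen spans get negligible quality
            score = best[j] + math.log(q) - (i - j - 1) * math.log(ALPHA)
            if score > best[i]:
                best[i], back[i] = score, j
    segs, i = [], n
    while i > 0:  # walk the back-pointers to recover the chosen segments
        segs.append(" ".join(tokens[back[i]:i]))
        i = back[i]
    return segs[::-1]

print(segment("support vector machine".split()))  # -> ['support vector machine']
```

Here the high quality score of the full trigram outweighs the length penalty, so the phrase survives as one segment instead of being split.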
  32. 39 SegPhrase+: Enhancing Phrasal Segmentation  SegPhrase+: one more round for enhanced phrasal segmentation  Feedback: using rectified frequencies, re-compute the features previously computed from raw frequencies  Process: classification → phrasal segmentation (SegPhrase) → classification → phrasal segmentation (SegPhrase+)  Effects on computed quality scores, e.g.:  "np hard in the strong sense"  "np hard in the strong"  "data base management system"
  33. 40 Experimental Results: Interesting Phrases Generated (from titles & abstracts of SIGKDD). Query: SIGKDD. Rank: SegPhrase+ | Chunking (TF-IDF & C-Value)
    1: data mining | data mining
    2: data set | association rule
    3: association rule | knowledge discovery
    4: knowledge discovery | frequent itemset
    5: time series | decision tree
    …
    51: association rule mining | search space
    52: rule set | domain knowledge
    53: concept drift | important problem
    54: knowledge acquisition | concurrency control
    55: gene expression data | conceptual graph
    …
    201: web content | optimal solution
    202: frequent subgraph | semantic relationship
    203: intrusion detection | effective way
    204: categorical attribute | space complexity
    205: user preference | small set
    (Some phrases appear only in SegPhrase+, others only in Chunking.)
  34. 41 Mining Quality Phrases in Multiple Languages  Both ToPMine and SegPhrase+ are extensible to mining quality phrases in multiple languages  SegPhrase+ on Chinese (from Chinese Wikipedia); rank: phrase (English gloss)
    62: 首席_执行官 (CEO)
    63: 中间_偏右 (Middle-right)
    …
    84: 百度_百科 (Baidu Pedia)
    85: 热带_气旋 (Tropical cyclone)
    86: 中国科学院_院士 (Fellow of Chinese Academy of Sciences)
    …
    1001: 十大_中文_金曲 (Top-10 Chinese Songs)
    1002: 全球_资讯网 (Global News Website)
    1003: 天一阁_藏_明代_科举_录_选刊 (a Chinese book name)
    …
    9934: 国家_戏剧_院 (National Theater)
    9935: 谢谢_你 (Thank you)
     ToPMine on Arabic (from the Quran, Fus7a Arabic, no preprocessing); example phrases: كفروا (Those who disbelieve); بسم الله الرحمن الرحيم (In the name of God, the Gracious and Merciful)
  35. 42 AutoPhrase: Automated Phrase Mining  Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, Jiawei Han, "AutoPhrase: Automated Phrase Mining from Massive Text Corpora", arXiv, Feb. 2017  Automatic extraction of high-quality phrases (e.g., scientific terms and general entity names) from a given corpus (e.g., research papers and news)  No human effort  Multiple languages  High performance: precision, recall, efficiency
  36. 43 AutoPhrase: Label Generation by Distant Supervision  Completely removes the human effort of labeling phrases  Distant training: utilize high-quality phrases in KBs (e.g., Wikipedia) as positive phrase labels  Method: sampling-based label generation  Positive labels: Wikipedia contains many high-quality phrases in titles, keywords, and internal links (e.g., more than 20,000 in Chinese); uniformly draw 100 samples as positive labels for single-word and multi-word phrases respectively  Negative labels: the pool of phrase candidates meeting the popularity requirement is huge, and the majority are actually of poor quality (e.g., "francisco opera and"); a small Chinese corpus has about 4 million frequent phrase candidates, of which more than 90% are of poor quality
  37. 44 Robust Positive-Only Distant Training  For each base classifier, randomly draw K phrase candidates with replacement from the positive pool (e.g., wiki titles, keywords, links) and K from the noisy negative pool respectively  Noisy negative pool: about δ quality phrases hide among the K negative labels  This yields a perturbed training set: a size-2K subset of all phrases in which the labels of some quality phrases are switched from positive to negative  Grow an unpruned decision tree to the point of separating all these phrases  Use an ensemble classifier that averages the results of T independently trained base classifiers (see the sketch below)
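A runnable sketch of this idea under toy assumptions: the feature matrices, pool sizes, K, and T are all invented, and real AutoPhrase features would be the statistical signals from the earlier slides (frequency, concordance, informativeness, …):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical phrase feature vectors (toy stand-ins for the statistical signals).
X_pos = rng.normal(1.0, 1.0, size=(500, 4))    # KB-matched phrases: positive pool
X_neg = rng.normal(-1.0, 1.0, size=(5000, 4))  # unmatched candidates: noisy negative pool

K, T = 100, 20  # labels sampled per pool per base classifier; ensemble size

def train_ensemble():
    """T unpruned trees, each fit on K positives + K noisy negatives (with replacement)."""
    trees = []
    for _ in range(T):
        X = np.vstack([X_pos[rng.integers(0, len(X_pos), K)],
                       X_neg[rng.integers(0, len(X_neg), K)]])
        y = np.array([1] * K + [0] * K)
        trees.append(DecisionTreeClassifier().fit(X, y))  # unpruned: separates all phrases
    return trees

def quality_score(trees, x):
    """Average the base classifiers' votes; label noise tends to cancel in the mean."""
    return float(np.mean([t.predict(x.reshape(1, -1))[0] for t in trees]))

trees = train_ensemble()
print(quality_score(trees, np.array([1.2, 0.8, 1.1, 0.9])))  # near 1: likely a quality phrase
```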
  38. 45 Is this idea robust?  Theoretical analysis: with T base classifiers, the ensemble's error from label noise decreases exponentially in T (model error remains)  Empirical performance: AUC is used to evaluate the ranking
  39. 46 Automatic vs. Manual  4 types of label pools:  EP = domain experts give positives  DP = distant supervision provides positives  EN = domain experts give negatives  DN = all unlabeled phrases form the negatives  2 cases:  When we have enough labels in EP  After we exhaust labels in EP
  40. 47 Case 1: When we have enough labels in EP

     DP has reasonable quality, but EP is better  DN has been successfully denoised
  41. 48 Case 2: After we exhaust labels in EP  DP+DN can finally beat domain experts as DP becomes large enough!
  42. 49 Generating High-Quality Phrases in Multiple Languages  Complicated pre-processing models, such as dependency parsing, rely heavily on human effort and thus cannot be smoothly applied to multiple languages  Minimum language dependency = tokenization + POS tagging  Drawbacks of frequency-based signals only: over-decomposition & under-decomposition  "Sophia Smith" vs. "Sophia" and "Smith"  "Great Firewall" vs. "firewall software"  Drawbacks of POS only:  "classifier SVM" vs. "discriminative classifier" and "SVM"  Remedy: context-aware phrasal segmentation
  43. 50 Single-Word Modeling: Enhancing Recall  AutoPhrase simultaneously models single-word and multi-word phrases  A phrase is not only a group of multiple words but possibly a single word, as long as it functions as a constituent in the syntax of a sentence, e.g., "UIUC", "Illinois"  In our experiments, 10%~30% of quality phrases are single-word phrases  Modeling single-word phrases: re-examining the requirements of quality multi-word phrases  Popularity: sufficiently frequent in the given corpus  Concordance: tokens collocate with frequency significantly higher than random  Informativeness: indicative of a specific topic or concept  Completeness: a complete semantic unit  Of these, only concordance cannot be defined for single-word phrases; it is replaced by  Independence: a quality single-word phrase is more likely a complete semantic unit in the given documents
  44. 51 Experiments and Performance Comparison  Datasets  Comparison methods:  SegPhrase/WrapSegPhrase (with encoding preprocessing for handling non-English)  TF-IDF/TextRank. Phrase Mining Results
  45. 52 References  Deane, P., 2005, June. A nonparametric method

    for extraction of candidate phrasal terms. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 605-613). Association for Computational Linguistics.  Koo, T., Carreras Pérez, X. and Collins, M., 2008. Simple semi-supervised dependency parsing. In 46th Annual Meeting of the Association for Computational Linguistics (pp. 595-603).  Xun, E., Huang, C. and Zhou, M., 2000, October. A unified statistical model for the identification of English baseNP. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (pp. 109-116). Association for Computational Linguistics.  Zhang, Z., Iria, J., Brewster, C. and Ciravegna, F., 2008, May. A comparative evaluation of term recognition algorithms. In LREC.  Park, Y., Byrd, R.J. and Boguraev, B.K., 2002, August. Automatic glossary extraction: beyond terminology identification. In Proceedings of the 19th international conference on Computational linguistics-Volume 1 (pp. 1-7). Association for Computational Linguistics.  Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and Nevill-Manning, C.G., 1999, August. KEA: Practical automatic keyphrase extraction. In Proceedings of the fourth ACM conference on Digital libraries (pp. 254-255). ACM.  Liu, Z., Chen, X., Zheng, Y. and Sun, M., 2011, June. Automatic keyphrase extraction by bridging vocabulary gap. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (pp. 135-144). Association for Computational Linguistics.  Evans, D.A. and Zhai, C., 1996, June. Noun-phrase analysis in unrestricted text for information retrieval. In Proceedings of the 34th annual meeting on Association for Computational Linguistics (pp. 17-24). Association for Computational Linguistics.
  46. 53 References (Continued)  Frantzi, K., Ananiadou, S. and Mima, H., 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2), pp.115-130.  Mihalcea, R. and Tarau, P., 2004, July. TextRank: Bringing order into texts. Association for Computational Linguistics.  Blei, D.M. and Lafferty, J.D., 2009. Topic models. Text mining: classification, clustering, and applications, 10(71), p.34.  Danilevsky, M., Wang, C., Desai, N., Ren, X., Guo, J. and Han, J., 2014, April. Automatic construction and ranking of topical keyphrases on collections of short documents. In Proceedings of the 2014 SIAM International Conference on Data Mining (pp. 398-406). Society for Industrial and Applied Mathematics.  Church, K., Gale, W., Hanks, P. and Hindle, D., 1991. Using statistics in lexical analysis. Lexical acquisition: exploiting on-line resources to build a lexicon, 115, p.164.  El-Kishky, A., Song, Y., Wang, C., Voss, C.R. and Han, J., 2014. Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3), pp.305-316.  Parameswaran, A., Garcia-Molina, H. and Rajaraman, A., 2010. Towards the web of concepts: Extracting concepts from large datasets. Proceedings of the VLDB Endowment, 3(1-2), pp.566-577.  Liu, J., Shang, J., Wang, C., Ren, X. and Han, J., 2015, May. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 1729-1744). ACM.  Shang, J., Liu, J., Jiang, M., Ren, X., Voss, C.R. and Han, J., 2017. Automated Phrase Mining from Massive Text Corpora. arXiv preprint arXiv:1702.04457.
  47. 55 Automatic Structure Mining: Methodology. Knowledge Bases + Massive Text Corpus → Automatic Phrase Mining Methods (SIGMOD'15, arXiv'17) → Entity Names & Context Units → Typing and Relation Extraction Methods (KDD'15, KDD'16, EMNLP'16, WWW'17) → Typed Entities & Relations → Meta Pattern-Driven Attribute Name & Value Discovery Methods (KDD'17) → Attribute Names & Values, more General Relations → Other Tasks and Applications (ECMLPKDD'17, KDD'17, WWW'16)
  48. 56 Reading the Reviews: From Text to Typed Entities and Relationships. Entity types: Restaurant, Location, Organization, Event. Review (TripAdvisor): "This hotel is my favorite Hilton property in NYC! It is located right on 42nd street near Times Square, it is close to all subways, Broadways shows, and next to great restaurants like Junior's Cheesecake, Virgil's BBQ and many others." Structured facts: 1. "typed" entities; 2. "typed" relationships, e.g., (hotel, is a, Hilton property); (hotel, located at, NYC / Times Square); (hotel, located near, Junior's Cheesecake / Virgil's BBQ); (hotel, close to, Broadway shows)
  49. 57 Prior Art: Extracting Structures with Repeated Human Effort. Text corpus examples: "This hotel is my favorite Hilton property in NYC! It is located right on 42nd street near Times Square, it is close to all subways, Broadways shows, and next to many great …"; "The June 2013 Egyptian protests were mass protest events that occurred in Egypt on 30 June 2013, …"; "We had a room facing Times Square and a room facing the Empire State Building. The location is close to everything and we love …" Human labeling → labeled data ("Broadways shows", "NYC", "Hilton property", "Times Square hotel") → extraction rules, machine-learning models (Stanford CoreNLP, CMU NELL, UW KnowItAll, USC AMR, IBM Alchemy APIs, Google Knowledge Graph, Microsoft Satori, …) → structured facts
  50. 58 Our Work: Effort-Light StructMine Corpus-specific Models Text Corpus Structures

    • Enables quick development of applications over various corpora • Extracts complex structures without introducing human errors News articles PubMed papers Knowledge Bases (KB)
  51. 59 Effort–Light StructMine: Where Are We? Human labeling effort Feature

    engineering effort Weakly-supervised learning systems Hand-crafted Systems Supervised learning systems Distantly-supervised Learning Systems CMU NELL, 2009 - present UW KnowItAll, Open IE, 2005 - present Max-Planck YAGO, 2008 - present Stanford CoreNLP, 2005 - present UT Austin Dependency Kernel, 2005 IBM Watson Language APIs UCB Hearst Pattern, 1992 NYU Proteus, 1997 Stanford DeepDive, MIML-RE 2012 - present UW FIGER, MultiR, 2012 Effort-Light StructMine (WWW’15, KDD’15, 16, 17, EMNLP’16, WWW’17, …)
  52. 60 "Distant" Supervision: What Is It? Match a text corpus against knowledge bases: "matchable" structures (entity names, entity types, typed relationships, …) receive labels from the KB, while the rest remain "un-matchable" (otherwise requiring human crowds). KBs are freely available (common knowledge, life sciences, art, …) and rapidly growing (number of Wikipedia articles: https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia). (Mintz et al., 2009), (Riedel et al., 2010), (Lin et al., 2012), (Ling et al., 2012), (Surdeanu et al., 2012), (Xu et al., 2013), (Nagesh et al., 2014), …
  53. 61 Learning with Distant Supervision: Challenges 1. Sparsity of “Matchable”

     Incomplete knowledge bases  Low-confidence matching 2. Accuracy of “Expansion”  For “matchable”: Are all the labels assigned accurately?  For “un-matchable”: How to perform inference accurately? (Ren et al., KDD’15) It is my favorite city in the United States The United States needs a new strategy to meet this challenge Government Location … next to restaurants like Junior’s Cheesecake ✗
  54. 62 Effort-Light StructMine: Contributions. Challenge → solution idea:  Sparsity of "matchable" → effective expansion from "matchable" to "un-matchable"  Accuracy of "expansion" → pick the "best" labels based on the context (for both "matchable" and "un-matchable"); harness the "data redundancy" using graph-based joint optimization. Example: "It is my favorite city in the United States" (Location) vs. "The United States needs a new strategy to meet this challenge" (Government)
  55. 63 Effort-Light StructMine: Methodology. Corpus to Structured Network: The Roadmap. Text corpus → data-driven text segmentation (SIGMOD'15, WWW'16) → entity names & context units → (with knowledge bases) partially-labeled corpus → learning corpus-specific models (KDD'15, KDD'16, EMNLP'16, WWW'17) → structures from the remaining unlabeled data. Components: Entity Recognition and Coarse-grained Typing (KDD'15); Fine-grained Entity Typing (KDD'16); Joint Entity and Relation Extraction (WWW'17)
  56. 64 Corpus to Structured Network: The Roadmap. Text corpus → data-driven text segmentation (SIGMOD'15, WWW'16) → entity names & context units → (with knowledge bases) partially-labeled corpus → learning corpus-specific models (KDD'15, KDD'16, EMNLP'16, WWW'17) → structures from the remaining unlabeled data. Components: Entity Recognition and Coarse-grained Typing (KDD'15); Fine-grained Entity Typing (KDD'16)
  57. 65 Recognizing Entities of Target Types in Text The best

    BBQ I’ve tasted in Phoenix ! I had the pulled pork sandwich with coleslaw and baked beans for lunch. The owner is very nice. … food location person The best BBQ I’ve tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. The owner is very nice. …
  58. 66 Traditional Named Entity Recognition (NER) Systems  Heavy reliance on corpus-specific human labeling via a manual annotation interface, e.g., "The best BBQ I've tasted in Phoenix" → O O Food O O O Location  Training sequence models is slow  e.g., (McCallum & Li, 2003), (Finkel et al., 2005), (Ratinov & Roth, 2009), …  NER systems: Stanford NER, Illinois Name Tagger, IBM Alchemy APIs, …
  59. 67 Leveraging Distant Supervision  (Lin et al., 2012), (Ling et al., 2012), (Nakashole et al., 2013)  1. Detect entity names from text  2. Match name strings to KB entities  3. Propagate types to the un-matchable names (see the sketch below). Example sentences: S1: "Phoenix is my all-time favorite dive bar in New York City." S2: "The best BBQ I've tasted in Phoenix." S3: "Phoenix has become one of my favorite bars in NY." Matched: New York City → Location, BBQ → Food; un-matchable: Phoenix → ???, NY → ???
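A toy sketch of steps 2-3 (invented KB entries and triples; real systems such as ClusType perform the propagation jointly with relation-phrase clustering rather than by exact phrase matching):

```python
kb_types = {"new york city": "Location"}  # hypothetical KB entries

# (left argument, relation phrase, right argument) triples from segmented text.
triples = [
    ("Phoenix", "dive bar in", "New York City"),  # NYC matches the KB
    ("Mario's", "dive bar in", "Brooklyn"),       # Brooklyn does not
]

# Step 2: record which argument slot of each relation phrase links to a KB type.
slot_type = {}
for left, rel, right in triples:
    if right.lower() in kb_types:
        slot_type[(rel, "right")] = kb_types[right.lower()]
    if left.lower() in kb_types:
        slot_type[(rel, "left")] = kb_types[left.lower()]

# Step 3: propagate to un-matchable names through the shared relation phrase.
for left, rel, right in triples:
    t = kb_types.get(right.lower()) or slot_type.get((rel, "right"), "???")
    print(f"{right}: {t}")
# -> New York City: Location  (direct KB match)
# -> Brooklyn: Location       (propagated via "dive bar in")
```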
  60. 68 ID Sentence S1 Phoenix is my all-time favorite dive

    bar in New York City . S2 The best BBQ I’ve tasted in Phoenix. S3 Phoenix has become one of my favorite bars in NY . Current Distant Supervision: Limitation I 1. Context-agnostic type prediction  Predict types for each mention regardless of context 2. Sparsity of contextual bridges
  61. 69 Current Distant Supervision: Limitation II 1. Context-agnostic type prediction

    2. Sparsity of contextual bridges  Some relational phrases are infrequent in the corpus  ineffective type propagation ID Sentence S1 Phoenix is my all-time favorite dive bar in New York City . S3 Phoenix has become one of my favorite bars in NY .
  62. 70 Our Solution: Data-Driven Entity Mention Detection. Combine corpus-level concordance with syntactic quality; quality of merging = significance of a merge between two sub-phrases. Pattern → example: (J*)N* → "support vector machine"; VP → "tasted in", "damage on"; VW*(P) → "train a classifier with". Good concordance examples: "The best BBQ I've tasted in Phoenix! I had the pulled pork sandwich with coleslaw and baked beans for lunch. …"; "This place serves up the best cheese steak sandwich in west of Mississippi."
  63. 71 Our Solution: ClusType (KDD’15) BBQ NY tasted in has

    become one of my favorite bars in New York City is my all-time favorite dive bar in ID Segmented Sentences S1 Phoenix is my all-time favorite dive bar in New York City . S2 The best BBQ I’ve tasted in Phoenix. S3 Phoenix has become one of my favorite bars in NY . S2: BBQ S3: NY S1: New York City S2: Phoenix S3: Phoenix Putting two sub-tasks together: 1. Type label propagation 2. Relation phrase clustering Similar relation phrases Correlated mentions Phoenix S1: Phoenix Represents object interactions https://github.com/shanzhenren/ClusType
  64. 72 ClusType: Comparing with State-of-the-Art Systems (F1 score). Methods: NYT / Yelp / Tweet
    Pattern (Stanford, CoNLL'14) [bootstrapping]: 0.301 / 0.199 / 0.223
    SemTagger (U Utah, ACL'10) [bootstrapping]: 0.407 / 0.296 / 0.236
    NNPLB (UW, EMNLP'12) [label propagation]: 0.637 / 0.511 / 0.246
    APOLLO (THU, CIKM'12) [label propagation]: 0.795 / 0.283 / 0.188
    FIGER (UW, AAAI'12) [classifier with linguistic features]: 0.881 / 0.198 / 0.308
    ClusType (KDD'15): 0.939 / 0.808 / 0.451
    Precision (P) = #correctly-typed mentions / #predicted mentions; Recall (R) = #correctly-typed mentions / #ground-truth mentions; F1 = 2PR / (P + R). Datasets: NYT: 118k news articles (1k manually labeled for evaluation); Yelp: 230k business reviews (2.5k manually labeled for evaluation); Tweet: 302k tweets (3k manually labeled for evaluation)
  65. 73 Corpus to Structured Network: The Roadmap. Text corpus → data-driven text segmentation (SIGMOD'15, WWW'16) → entity names & context units → (with knowledge bases) partially-labeled corpus → learning corpus-specific models (KDD'15, KDD'16, EMNLP'16, WWW'17) → structures from the remaining unlabeled data. Components: Entity Recognition and Coarse-grained Typing (KDD'15); Fine-grained Entity Typing (KDD'16)
  66. 74 From Coarse-Grained Typing to Fine-Grained Entity Typing. S1: "Donald Trump spent 14 television seasons presiding over a game show, NBC's The Apprentice." Coarse: a few common types (Person, Location, Organization). Fine: a type hierarchy with 100+ types from the knowledge base, e.g., root → product, person, location, organization, …; person → politician, artist, businessman, …; artist → author, actor, singer, …  (Ling et al., 2012), (Nakashole et al., 2013), (Yogatama et al., 2015)
  67. 75 Current Distant Supervision: Context-Agnostic Labeling. Entity types from the knowledge base: root → person → politician, artist, businessman; artist → author, actor, singer, … Entity: Donald Trump. S1: "Donald Trump spent 14 television seasons presiding over a game show, NBC's The Apprentice" is labeled with ALL of the entity's KB types: person, artist, actor, author, businessman, politician  Inaccurate labels in the training data  Prior work assumes all labels are "perfect"
  68. 76 Our Solution: Partial Label Embedding (KDD’16) “De-noised” labeled data

    ID Sentence S1 Donald Trump spent 14 television seasons presiding over a game show, NBC’s The Apprentice Extract Text Features “Label Noise Reduction” with PLE Train Classifiers on De-noised Data Prediction on New Data S1: Donald Trump Entity Types: person, artist, actor, author, businessman, politician Text features: TOKEN_Donald, CONTEXT: television, CONTEXT: season, TOKEN_trump, SHAPE: AA More effective classifiers (Ren et al., KDD’16) https://github.com/shanzhenren/PLE
  69. 77 PLE: Modeling Clean and Noisy Mentions Separately (Ren et al., KDD'16)  For a clean mention, its "positive types" should be ranked higher than all its "negative types", e.g., Si: "Ted Cruz", types in KB: person, politician  For a noisy mention, its "best candidate type" should be ranked higher than all its "non-candidate types", e.g., S1: "Donald Trump spent 14 television seasons presiding over a game show, NBC's The Apprentice", types in KB: person, artist, actor, author, businessman, politician; types ranked: (+) actor 0.88, (+) artist 0.74, (+) person 0.55, (+) author 0.41, (+) politician 0.33, (+) business 0.31; "best" candidate type (+) actor outranks (-) singer, (-) coach, (-) doctor, (-) location, (-) organization (see the loss sketch below)
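A minimal sketch of the two ranking constraints as a margin loss (not PLE's exact objective; the scores, margin, and type sets are invented):

```python
def partial_label_loss(scores, candidates, is_clean, margin=1.0):
    """Clean mention: every candidate type must outrank every non-candidate.
    Noisy mention: only the BEST candidate type must outrank every non-candidate.
    scores: dict type -> relevance score for one mention."""
    neg = [s for t, s in scores.items() if t not in candidates]
    if is_clean:
        pos = [s for t, s in scores.items() if t in candidates]
    else:
        pos = [max(s for t, s in scores.items() if t in candidates)]
    return sum(max(0.0, margin - p + n) for p in pos for n in neg)

scores = {"actor": 0.88, "artist": 0.74, "person": 0.55,
          "politician": 0.33, "singer": 0.20, "location": 0.05}
cands = {"actor", "artist", "person", "politician"}
print(partial_label_loss(scores, cands, is_clean=False))  # only "actor" must win
print(partial_label_loss(scores, cands, is_clean=True))   # every candidate must win
```

The noisy case deliberately ignores low-scoring candidates such as "politician" here, which is how context-aware labeling avoids forcing every KB type onto every mention.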
  70. 78 Type Inference in PLE (Ren et al., KDD'16)  Top-down nearest-neighbor search in the given type hierarchy (from the knowledge base), in a low-dimensional vector space shared by text features and types  Test mention Si: "_Trump" in "President Trump gave an all-hands address to troops at the U.S. Central Command headquarters …"; the mention vector is composed from the vectors of its text features (e.g., "President", "gave", "speech") and matched down the hierarchy root → person → politician (see the sketch below)
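A sketch of such a top-down search, with invented 2-D embeddings and an invented similarity threshold:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical low-dimensional embeddings, jointly learned for types and features.
type_vec = {"person": np.array([1.0, 0.1]), "politician": np.array([1.1, 0.6]),
            "artist": np.array([0.9, -0.5]), "actor": np.array([0.8, -0.7])}
children = {"root": ["person"], "person": ["politician", "artist"],
            "politician": [], "artist": ["actor"]}

def infer_types(mention_vec, threshold=0.8):
    """Descend the hierarchy, at each level picking the child type nearest to the
    mention vector; stop when similarity falls below the (illustrative) threshold."""
    path, node = [], "root"
    while children.get(node):
        best = max(children[node], key=lambda t: cos(mention_vec, type_vec[t]))
        if cos(mention_vec, type_vec[best]) < threshold:
            break
        path.append(best)
        node = best
    return path

# Mention vector for "_Trump", e.g., the sum of its feature vectors.
print(infer_types(np.array([1.15, 0.7])))  # -> ['person', 'politician']
```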
  71. 79 PLE: Performance of Fine-Grained Entity Typing  Raw: candidate types from distant supervision  WSABIE (Google, ACL'15): joint feature and type embedding  Predictive Text Embedding (PTE) (MSR, WWW'15): joint mention, feature, and type embedding  Both WSABIE and PTE suffer from "noisy" training labels  PLE (KDD'16): partial-label loss for context-aware labeling. Accuracy (= #correctly-typed mentions / #total mentions) by type level:
    Level-1: Raw 0.70, WSABIE 0.79, PTE 0.78, PLE 0.81
    Level-2: Raw 0.45, WSABIE 0.49, PTE 0.51, PLE 0.62
    Level-3: Raw 0.05, WSABIE 0.14, PTE 0.19, PLE 0.48
    Dataset: OntoNotes public dataset (Weischedel et al., 2011; Gillick et al., 2014): 13,109 news articles, 77 annotated documents, 89 entity types
  72. 80 Conclusion  Introduction  Part I: Entity Extraction through

    Phrase Mining  Part II: Entity Typing Thank you! Questions can be sent to [email protected]