
Paper Reading: How Large Are Lions? Inducing Distributions over Quantitative Attributes

Reo

December 04, 2019

Transcript

1. How Large Are Lions? Inducing Distributions over Quantitative Attributes
   Yanai Elazar, Abhijit Mahabal, Deepak Ramachandran, Tania Bedrax-Weiss, Dan Roth (ACL 2019)
   Presenter: Reo Hirao (TMU, B4, Komachi Lab)
   3 December, 2019 @ paper reading session
2. Abstract
   • Distribution over Quantities (DoQ)
     ◦ An unsupervised method for collecting quantitative information
     ◦ Covers objects, adjectives, and verbs
   • Contrasts with recent work in this area
     ◦ Prior work handles only relative comparisons
     ◦ e.g. "Is a lion bigger than a wolf?"
   • Contributions
     ◦ A new method for collecting expressive quantitative information
     ◦ A large resource of distributions over quantitative attributes
     ◦ Shown to be superior to existing datasets
3. Introduction
   • Questions
     ◦ How much does a lion weigh?
     ◦ How tall can they be?
     ◦ When do people typically eat breakfast?
   • Dataset
     ◦ Acquires distributions over ten dimensions (time, currency, length, …)
     ◦ Nouns (e.g. elephant, airplane, NBA game)
     ◦ Adjectives (e.g. cold, hot, lukewarm)
     ◦ Verbs (e.g. eating, walking, running)
     ◦ Can easily be extended to other languages
4. Distribution over Quantities: Method (1)
   • Measurement Identification and Normalization (see the sketch below)
     ◦ Sentences whose measurements are not recognized by the parser are not extracted
     ◦ The data contains some typos (such as "17 C" where Centigrade is meant)
     ◦ Units are normalized, e.g. "inch" = 0.0254 meters, "acre foot" = 1233.48 cubic meters, …
   • Object Collection
     ◦ 1-token words and more complex phrases such as noun phrases ("race car", "electric car")
     ◦ Also retrieves the syntactic head (to compare a "fast car" to a "car")
     ◦ Collects the objects that co-occur with a measurement within a certain context window
     ◦ Processed billions of English web pages
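A minimal sketch, in Python, of what the unit-normalization step might look like. The unit table, dimension names, and the normalize function are illustrative assumptions rather than the authors' code; only the two conversions quoted on the slide come from the paper.

```python
# Hypothetical sketch of measurement normalization (not the DoQ implementation).
# Each surface unit is mapped to a dimension and a factor converting it to SI units.
UNIT_TO_SI = {
    "inch": ("length_m", 0.0254),          # 1 inch = 0.0254 m
    "foot": ("length_m", 0.3048),
    "acre foot": ("volume_m3", 1233.48),   # 1 acre-foot ≈ 1233.48 m^3
    "pound": ("mass_kg", 0.453592),
    "kg": ("mass_kg", 1.0),
}

def normalize(value: float, unit: str):
    """Return (dimension, value in SI units), or None if the unit is unknown."""
    entry = UNIT_TO_SI.get(unit.lower())
    if entry is None:
        return None  # unrecognized measurements are simply dropped
    dimension, factor = entry
    return dimension, value * factor

if __name__ == "__main__":
    print(normalize(50, "inch"))       # ('length_m', 1.27)
    print(normalize(2, "acre foot"))   # ('volume_m3', 2466.96)
```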
5. Distribution over Quantities: Method (2)
   • De-noising (see the sketch below)
     ◦ Distance-based co-occurrences (within the same sentence, or within a token distance k)
     ◦ Simply discard all negations ("The dimension of the car is not 50cm.")
   • Distribution over Quantities statistics
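A sketch of the two de-noising heuristics above, assuming sentences are represented as token lists with known object and measurement positions; keep_pair, the negation list, and the default k are illustrative choices, not the paper's implementation.

```python
# Hypothetical sketch of the de-noising filters (not the DoQ implementation).
NEGATION_TOKENS = {"not", "n't", "never", "no"}

def keep_pair(tokens, obj_idx, meas_idx, k=10):
    """Keep an (object, measurement) pair only if it passes both filters."""
    if any(t.lower() in NEGATION_TOKENS for t in tokens):
        return False                      # discard all negated sentences
    return abs(obj_idx - meas_idx) <= k   # distance-based co-occurrence filter

if __name__ == "__main__":
    sent = "The dimension of the car is not 50 cm .".split()
    print(keep_pair(sent, obj_idx=4, meas_idx=7))   # False: negation present
    sent2 = "The car is about 4 m long .".split()
    print(keep_pair(sent2, obj_idx=1, meas_idx=4))  # True: close and not negated
```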
6. Evaluation Data: Commonsense Property Comparison (1)
   • ORIG F&C dataset
     ◦ Labels the typical relation between two objects along five dimensions
       ▪ SIZE, WEIGHT, STRENGTH, RIGIDITY, and SPEED
       ▪ Whether the first object is typically greater than, less than, or equal to the second
     ◦ 47% of the annotations were not comparable
       ▪ Broad objects: e.g. (father, clothes, big)
       ▪ Abstract objects: e.g. (seal, place, big)
       ▪ Ill-defined dimension: e.g. (friend, bed, strong)
     ◦ Leakage (see the sketch below)
       ▪ 8% transitivity leakage ((o1, o2, d) and (o2, o3, d) in train, (o1, o3, d) in dev/test)
       ▪ 95% object leakage (the same object appears in train and dev/test)
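The two leakage figures could be estimated roughly as below, assuming the splits are lists of (object1, object2, dimension) triples; the function names and counting rules are assumptions made for illustration.

```python
# Hypothetical leakage checks over (object1, object2, dimension) triples.
def object_leakage(train, test):
    """Fraction of test triples that share at least one object with train."""
    train_objects = {o for o1, o2, _ in train for o in (o1, o2)}
    leaked = [t for t in test if t[0] in train_objects or t[1] in train_objects]
    return len(leaked) / len(test)

def transitivity_leakage(train, test):
    """Fraction of test triples (o1, o3, d) implied by (o1, o2, d) and (o2, o3, d) in train."""
    train_set = set(train)
    objects = {o for o1, o2, _ in train for o in (o1, o2)}
    leaked = sum(
        1
        for o1, o3, d in test
        if any((o1, o2, d) in train_set and (o2, o3, d) in train_set for o2 in objects)
    )
    return leaked / len(test)

if __name__ == "__main__":
    train = [("lion", "wolf", "size"), ("wolf", "cat", "size")]
    test = [("lion", "cat", "size")]
    print(object_leakage(train, test))        # 1.0: both test objects occur in train
    print(transitivity_leakage(train, test))  # 1.0: implied via "wolf"
```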
7. Evaluation Data: Commonsense Property Comparison (2)
   • NO-LEAK F&C (8,209 pairs)
     ◦ Formed new train/dev/test splits (Table 2)
   • CLEAN F&C (2,964)
     ◦ Re-annotated the dataset because of the ill-defined comparisons
     ◦ Used three crowdsource workers
   • New Data (+4,773)
     ◦ Created a new dataset because the F&C dataset became small after filtering
     ◦ Used only as a test set
   • The Relative Size dataset
     ◦ 486 object pairs over 41 physical objects
8. Evaluation Data: Scalar Adjectives & Intrinsic Evaluation
   • De Melo and Bansal (2013)
     ◦ Used adjective clusters based on the "dumbbell" structure of adjectives in WordNet
     ◦ e.g. "cold < frigid < frozen"
   • Wilkinson and Oates (2016)
     ◦ Created another test set by defining a total order over the adjectives in each cluster, spanning the entire scale range
     ◦ e.g. "minuscule < tiny < small < big < large < huge < enormous < gigantic"
   • Removed all of the non-measurable clusters
     ◦ e.g. "known < famous < legendary"
   • Intrinsic Evaluation (see the sketch below)
     ◦ Expand the median of the distribution for an object-dimension pair into a range, and ask human raters whether this range overlaps with the true range for that pair
     ◦ e.g. the median for the object "car" (speed) is 99.7 km/h -> does 10-100 km/h overlap?
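A sketch of how the median-to-range expansion in the intrinsic evaluation might work. The power-of-ten expansion rule is an assumption made only to reproduce the slide's example (99.7 km/h -> 10-100 km/h); the paper's exact rule may differ.

```python
import math

def median_to_range(median: float):
    """Expand a positive median into its enclosing power-of-ten range (assumed rule)."""
    lower_exp = math.floor(math.log10(median))
    return 10 ** lower_exp, 10 ** (lower_exp + 1)

if __name__ == "__main__":
    # Median speed for "car" is 99.7 km/h -> raters are shown the range 10-100 km/h.
    print(median_to_range(99.7))  # (10, 100)
```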
9. Experimental Results (1)
   • Noun Comparison (an illustrative comparison sketch follows below)
     ◦ F&C CLEAN
       ▪ Lower than Yang et al. (2018), which benefits from fine-tuning on a train set and from information in pre-trained word embeddings
     ◦ New Data
       ▪ Better on the new data, but lower than on F&C overall
     ◦ RELATIVE
       ▪ State-of-the-art result with k=10
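One way to picture the noun-comparison setup: reduce each object's DoQ distribution along a dimension to its median and compare the medians. This is only an illustrative reduction with made-up numbers and a hypothetical DOQ table, not necessarily the decision rule used in the paper.

```python
import statistics

# Toy DoQ-style table: made-up sample values, for illustration only.
DOQ = {
    ("lion", "mass_kg"): [120, 150, 190, 200, 250],
    ("wolf", "mass_kg"): [30, 35, 40, 45, 60],
}

def compare(obj1: str, obj2: str, dimension: str) -> str:
    """Return '>', '<', or '=' by comparing the medians of the two distributions."""
    m1 = statistics.median(DOQ[(obj1, dimension)])
    m2 = statistics.median(DOQ[(obj2, dimension)])
    return ">" if m1 > m2 else "<" if m1 < m2 else "="

if __name__ == "__main__":
    print("lion", compare("lion", "wolf", "mass_kg"), "wolf")  # lion > wolf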
10. Experimental Results (2)
    • Adjective Comparison
      ◦ Achieves good results on the full scale range of Wilk-all
      ◦ There was no extreme difference in the errors
      ◦ Good at differentiating between adjectives at the two tips of the scale
    • Intrinsic Evaluation
      ◦ Total agreement is 69%
      ◦ Currency was re-annotated (differences between India and the US)
11. Discussion: Reporting Bias and Exaggeration
    • Temperatures are exaggerated, reported higher than actually measured
    • Since the data was collected from English websites, the seasonal temperatures reflect the Northern Hemisphere
    • The weights of alfalfa and watermelon are very different; this bias arises because alfalfa is shipped in tons
    • There is also bias due to polysemy
12. Conclusion
    • Distribution over Quantities (DoQ)
      ◦ An unsupervised method for collecting quantitative information
      ◦ Covers objects, adjectives, and verbs
    • Contributions
      ◦ A new method for collecting expressive quantitative information
      ◦ A large resource of distributions over quantitative attributes
      ◦ Shown to be superior to existing datasets