collecting quantitative information
  ◦ Objects, adjectives, and verbs
• Contrasts with recent work in this area
  ◦ Only relative comparisons
  ◦ “Is a lion bigger than a wolf?”
• Contributions
  ◦ A new method for collecting expressive quantitative information
  ◦ A large resource of distributions over quantitative attributes
  ◦ Shown to be superior to existing datasets
  ◦ How tall can they be?
  ◦ When do people typically eat breakfast?
• Datasets
  ◦ Acquiring distributions over ten dimensions
    ▪ time, currency, length, …
  ◦ Nouns (e.g. elephant, airplane, NBA game)
  ◦ Adjectives (e.g. cold, hot, lukewarm)
  ◦ Verbs (e.g. eating, walking, running)
  ◦ The method can be extended to other languages easily
  ◦ Do not extract sentences that are not recognized by the parser
  ◦ Data contains some typos (such as “17 C” where Centigrade is meant)
  ◦ Units are normalized: “inch” = 0.0254 meters, “acre foot” = 1233.48 cubic meters, …
• Object Collection
  ◦ 1-token words and more complex phrases, e.g. noun phrases (“race car”, “electric car”)
  ◦ Also retrieve each phrase’s syntactic head (to compare a “fast car” to a “car”)
  ◦ Collect the objects that co-occur within a certain context window
  ◦ Processed billions of English webpages
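The unit normalization above can be sketched as a lookup table mapping surface units to a dimension and an SI conversion factor. The factors for “inch” and “acre foot” are from the notes; the table and function names are illustrative, not the paper’s.

```python
# Map each surface unit to (dimension, factor to SI base unit).
# "inch" and "acre foot" follow the notes; others are standard conversions.
TO_SI = {
    "inch": ("length_m", 0.0254),         # meters
    "foot": ("length_m", 0.3048),
    "mile": ("length_m", 1609.344),
    "acre foot": ("volume_m3", 1233.48),  # cubic meters
    "pound": ("mass_kg", 0.45359237),
}

def normalize(value: float, unit: str):
    """Convert a (value, unit) pair to (dimension, value in SI units)."""
    dimension, factor = TO_SI[unit]
    return dimension, value * factor

print(normalize(12, "inch"))      # length in meters
print(normalize(2, "acre foot"))  # volume in cubic meters
```

Normalizing everything to a single unit per dimension is what makes values extracted from different sentences comparable within one distribution.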
Labeled the typical relation between two objects along five dimensions
    ▪ SIZE, WEIGHT, STRENGTH, RIGIDITY, and SPEED
    ▪ Whether the first object was typically greater than, less than, or equal to the second
  ◦ 47% of the annotations were not comparable
    ▪ Broad objects: e.g. (father, clothes, big)
    ▪ Abstract objects: e.g. (seal, place, big)
    ▪ Ill-defined dimension: e.g. (friend, bed, strong)
  ◦ Leakage
    ▪ 8% transitivity leakage: (o1, o2, d) and (o2, o3, d) in train, (o1, o3, d) in dev/test
    ▪ 95% object leakage: same object in train and dev/test
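The two leakage conditions above can be checked mechanically over (object1, object2, dimension) triples. This is a minimal sketch; the function names are mine, not the paper’s.

```python
def object_leakage(train, test):
    """Fraction of test triples whose objects both also appear in train."""
    seen = {o for o1, o2, _ in train for o in (o1, o2)}
    leaked = [t for t in test if t[0] in seen and t[1] in seen]
    return len(leaked) / len(test)

def transitivity_leakage(train, test):
    """Fraction of test triples (o1, o3, d) implied by (o1, o2, d)
    and (o2, o3, d) both appearing in train."""
    train_set = set(train)
    candidates = {o2 for _, o2, _ in train}
    leaked = [
        (o1, o3, d) for o1, o3, d in test
        if any((o1, o2, d) in train_set and (o2, o3, d) in train_set
               for o2 in candidates)
    ]
    return len(leaked) / len(test)

train = [("lion", "wolf", "size"), ("wolf", "cat", "size")]
test = [("lion", "cat", "size")]
print(transitivity_leakage(train, test))  # 1.0: answer follows from train
print(object_leakage(train, test))        # 1.0: both objects seen in train
```

A model can exploit either kind of leakage to score well without actually learning the dimension, which is why the paper re-splits the data.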
  ◦ Formed new splits of train/dev/test (Table 2)
• CLEAN F&C (2,964)
  ◦ Re-annotated the dataset due to the ill-defined comparisons
  ◦ Used three crowdsourced workers
• New Data (+4,773)
  ◦ Created a new dataset because the F&C dataset became small after filtering
  ◦ Only used as a test set
• The Relative Size Dataset
  ◦ 486 object pairs between 41 physical objects
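One way to form splits free of the object leakage described earlier is to partition the *objects* first and keep only triples whose two objects land in the same bucket. This is an illustrative sketch under that assumption; the split ratios and function name are not from the paper.

```python
import random

def object_disjoint_split(triples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign objects to train/dev/test buckets, then keep only triples
    whose two objects fall in the same bucket (cross-bucket pairs drop)."""
    objects = sorted({o for o1, o2, _ in triples for o in (o1, o2)})
    random.Random(seed).shuffle(objects)
    n = len(objects)
    c1 = int(n * ratios[0])
    c2 = int(n * (ratios[0] + ratios[1]))
    buckets = [set(objects[:c1]), set(objects[c1:c2]), set(objects[c2:])]
    splits = ([], [], [])
    for t in triples:
        for split, bucket in zip(splits, buckets):
            if t[0] in bucket and t[1] in bucket:
                split.append(t)
                break
    return splits
```

The price of object-disjoint splits is that triples spanning two buckets are discarded, which is consistent with the dataset shrinking after filtering.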
and Bansal (2013)
  ◦ Used adjective clusters based on the “dumbbell” structure of adjectives in WordNet
    ▪ e.g. “cold < frigid < frozen”
• Wilkinson and Oates (2016)
  ◦ Created another test set by defining a total order between adjectives in the same cluster, spanning the entire scale range
    ▪ e.g. “minuscule < tiny < small < big < large < huge < enormous < gigantic”
• Removed all of the non-measurable clusters
  ◦ e.g. “known < famous < legendary”
• Intrinsic Evaluation
  ◦ Expanded the median of the distribution for a given object and dimension into a range, then asked human raters whether this range overlaps with the range of the target object-dimension pair
    ▪ e.g. the median speed of “car” is 99.7 km/h → does it match 10-100 km/h?
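The range-overlap check behind the intrinsic evaluation can be sketched as below. The factor-of-10 expansion around the median is an assumption for illustration; in the paper the overlap judgment is made by human raters, not a fixed rule.

```python
def median_to_range(median: float, factor: float = 10.0):
    """Expand a predicted median into an order-of-magnitude range."""
    return (median / factor, median * factor)

def ranges_overlap(a, b) -> bool:
    """True if closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# e.g. predicted median speed of "car" = 99.7 km/h vs a 10-100 km/h target
pred = median_to_range(99.7)
print(ranges_overlap(pred, (10, 100)))  # True
```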
Lower than Yang et al. (2018) because they
    ▪ fine-tune on a training set
    ▪ use information from pre-trained word embeddings
  ◦ New Data
    ▪ Better on the new data, but lower than on F&C overall
  ◦ RELATIVE
    ▪ State-of-the-art result with k=10
on the full range scale of Wilk-all
  ◦ There was no extreme difference in the errors
  ◦ Good at differentiating between adjectives at the two ends of the scale
• Intrinsic Evaluation
  ◦ The total agreement is 69%
  ◦ Re-annotated the currency dimension (differences between India and the US)
higher than actually measured
• Since the data was collected from English webpages, the seasonal temperatures reflect the Northern Hemisphere
• The collected weights of alfalfa and watermelon are very different; this bias arises because alfalfa is shipped in tons
• There is also a bias due to polysemy
collecting quantitative information
  ◦ Objects, adjectives, and verbs
• Contributions
  ◦ A new method for collecting expressive quantitative information
  ◦ A large resource of distributions over quantitative attributes
  ◦ Shown to be superior to existing datasets