MA
- We tested MeCab, the most popular morphological analyzer for
Japanese
- All metrics were in the 89–91% range, although the tool
achieved over 98% on newspaper articles in Kudo et al. (2004)
Cookpad Parsed Corpus: Linguistic Annotations of Japanese Recipes
Jun Harashima and Makoto Hiramatsu (Cookpad Inc.)
The 14th Linguistic Annotation Workshop
Background
Cookpad Parsed Corpus
Name                                    Year  Main Content
CURD                                    2008  Machine-readable language representations
Flow Graph Corpus                       2014  Graph representations and named entities
SIMMR Recipe Dataset                    2015  Graph representations
Cookpad Recipe Dataset                  2016  Reviews and meals
Cookpad Image Dataset                   2017  Food images and cooking images
Recipe1M                                2017  Food images
RecipeQA                                2018  Question-answer pairs
Storyboarding Data                      2019  Cooking images
r-FG BB dataset                         2019  Bounding boxes for cooking images
English Recipe Flow Graph Corpus        2020  Graph representations and named entities
Multimodal Aligned Recipe Corpus        2020  URLs to YouTube videos
Multi-modal Recipe Structure dataset    2020  Graph representations and cooking images
Cookpad Parsed Corpus                   2020  Linguistic annotations
Name                            Year  Target documents
KU Text Corpus                  2002  Newspaper articles
GDA Corpus                      2005  Newspaper articles and dictionary entries
NAIST Text Corpus               2007  Newspaper articles
KU and NTT Blog Corpus          2011  Blogs
KU Web Document Leads Corpus    2012  Web documents
BCCWJ                           2014  Newspaper articles, books, magazines, etc
Cookpad Parsed Corpus           2020  Cooking recipes
[Figure 1 body: each bunsetsu header line is followed by the morphemes of that bunsetsu. The Japanese surface forms and morpheme features did not survive extraction, so each morpheme is shown as English gloss/named entity tag.]
# Step-ID:1
# Sentence-ID:1-1
* 0 4D 1/2    raw/B-Fi  salmon/I-Fi  (topic marker)/O
* 1 2D 1/2    a bite/B-Sf  size/I-Sf  (dative)/O
* 2 4P 0/0    cut/B-Ap
* 3 4D 0/1    salt/B-Fi  (accusative)/O
* 4 -1O 0/0   sprinkle/B-Ap  ./O
EOS
- The number of cooking recipes on the Internet has grown
- Recipe-related studies and datasets are also increasing
- However, few datasets provide linguistic annotations for
recipe-related studies, even though such annotations should
form the basis of those studies
Table 1. Existing recipe-related datasets and our corpus
Figure 1. Linguistic annotations for an example sentence
("Cut the raw salmon into bite-size chunks and sprinkle them
with salt.") in our corpus.
Table 3. Benchmark results for MA
                              Precision  Recall  F1
MeCab                         88.91      88.95   88.93
MeCab w/ domain adaptation    91.12      91.04   91.08

Table 4. Benchmark results for NER
                        Accuracy  Precision  Recall  F1
Sasada et al. (2015)    88.30     74.65      82.77   78.50
Lample et al. (2016)    91.41     88.17      87.18   87.67

Table 5. Benchmark results for DP
                                Accuracy
CaboCha                         92.21
CaboCha w/ domain adaptation    94.68
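The F1 columns in Tables 3 and 4 can be cross-checked against the precision and recall columns. The sketch below assumes F1 is the usual harmonic mean of precision and recall; the poster does not state the formula, and the function name `f1_score` is ours. The numbers are copied from the tables.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, as listed in Tables 3 and 4.
reported = {
    "MeCab": (88.91, 88.95, 88.93),
    "MeCab w/ domain adaptation": (91.12, 91.04, 91.08),
    "Sasada et al. (2015)": (74.65, 82.77, 78.50),
    "Lample et al. (2016)": (88.17, 87.18, 87.67),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: computed F1 = {f1_score(p, r):.2f}, reported F1 = {f1:.2f}")
```

Under that assumption, the recomputed values agree with the reported ones to two decimal places.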
- We divided our corpus into training (400 recipes), validation
(100 recipes), and test sets (100 recipes) and tested popular
tools or methods for Japanese MA, NER, and DP
- We also tested the tools with domain adaptation
NER
- We trained/tested two recognizers using our training/test sets
- Many errors were caused by domain-specific unknown words
DP
- We tested CaboCha, the most popular dependency parser for
Japanese
- Accuracy was 92–95%, but over 20% of the sentences in our test
set had at least one parsing error
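The two figures in the DP results measure different things: accuracy counts individual bunsetsus, while the sentence-level figure counts sentences with any wrong head. A minimal sketch of both, assuming each sentence is given as a list of head indices and that every bunsetsu (including the last one) is counted; the poster does not say how its accuracy was computed, and the function name and toy data are hypothetical.

```python
from typing import Sequence, Tuple


def dependency_scores(
    gold: Sequence[Sequence[int]], pred: Sequence[Sequence[int]]
) -> Tuple[float, float]:
    """Return (bunsetsu-level accuracy, share of sentences with >= 1 error)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of sentences")
    total = correct = bad_sentences = 0
    for g_heads, p_heads in zip(gold, pred):
        if len(g_heads) != len(p_heads):
            raise ValueError("sentence length mismatch")
        errors = sum(g != p for g, p in zip(g_heads, p_heads))
        total += len(g_heads)
        correct += len(g_heads) - errors
        bad_sentences += errors > 0
    if total == 0:
        raise ValueError("no bunsetsus to evaluate")
    return correct / total, bad_sentences / len(gold)


# Toy data (hypothetical): four sentences, one wrong head in the second.
gold = [[4, 2, 4, 4, -1], [1, 2, -1], [1, -1], [2, 2, -1]]
pred = [[4, 2, 4, 4, -1], [2, 2, -1], [1, -1], [2, 2, -1]]
accuracy, sentence_error_rate = dependency_scores(gold, pred)
print(f"accuracy = {accuracy:.3f}, sentences with errors = {sentence_error_rate:.2f}")
```

In the toy data, a single wrong head out of 13 leaves accuracy above 92% while a quarter of the sentences contain an error, which is the same pattern the benchmark shows.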
- We randomly selected 500 recipes from the Cookpad Recipe
Dataset
- 4,738 sentences in the 500 recipes were annotated with
morphemes, named entities, and dependency relations
- Construction of a novel corpus, which contains linguistic
annotations of 500 Japanese recipes
- Benchmark results on the corpus for Japanese morphological
analysis (MA), named entity recognition (NER), and dependency
parsing (DP)
Contributions of this study
Morphemes
- We determined the boundaries and part of speech of each
morpheme based on the IPA dictionary, which is commonly used
for MA
Named entities
- Morphemes were annotated with 17 tags, such as Fi (food
ingredient) and Sf (state of food), in the IOB2 format
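To illustrate how IOB2 tags group morphemes into named entities, here is a minimal sketch that converts a tag sequence into labeled spans. The function name `iob2_to_spans` is ours, and the glosses and tags are those of the example sentence in Figure 1; stray `I-` tags without a matching open span are simply ignored, which is one possible convention.

```python
from typing import List, Tuple


def iob2_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Convert IOB2 tags to (label, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if its label matches.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((label, start, i))
            start, label = None, None
        # A B- tag opens a new span; "O" and stray I- tags open nothing.
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


glosses = ["raw", "salmon", "(topic marker)", "a bite", "size", "(dative)",
           "cut", "salt", "(accusative)", "sprinkle", "."]
tags = ["B-Fi", "I-Fi", "O", "B-Sf", "I-Sf", "O", "B-Ap", "B-Fi", "O", "B-Ap", "O"]

spans = iob2_to_spans(tags)
for label, start, end in spans:
    print(label, glosses[start:end])
```

Here "raw salmon" becomes one Fi span and "a bite size" one Sf span, while the particles tagged O fall outside every entity.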
Dependency relations
- Bunsetsus were annotated with relations such as D (normal
dependency) and P (coordination dependency)
- A bunsetsu is a unit of Japanese that consists of one or more
content words and zero or more function words
- Bunsetsus were also annotated with 7 types such as
(Topic)
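The dependency annotations can be read from the bunsetsu header lines in Figure 1, such as `* 2 4P 0/0`. The sketch below assumes these lines follow the convention used by CaboCha's output, where the second field is the bunsetsu index and the third is the head index fused with the relation label; the class and function names are hypothetical, and only the first three fields are parsed.

```python
import re
from typing import Iterable, List, NamedTuple


class Bunsetsu(NamedTuple):
    index: int     # position of the bunsetsu in the sentence
    head: int      # index of the head bunsetsu (-1 in the last header of Figure 1)
    relation: str  # relation label, e.g. "D" or "P"


# Matches lines such as "* 2 4P 0/0"; anything after the label is ignored.
HEADER = re.compile(r"^\*\s+(\d+)\s+(-?\d+)([A-Z])")


def parse_headers(lines: Iterable[str]) -> List[Bunsetsu]:
    """Extract (index, head, relation) from bunsetsu header lines."""
    result = []
    for line in lines:
        match = HEADER.match(line)
        if match:
            result.append(
                Bunsetsu(int(match.group(1)), int(match.group(2)), match.group(3))
            )
    return result


# Header lines of the example sentence in Figure 1.
lines = ["* 0 4D 1/2", "* 1 2D 1/2", "* 2 4P 0/0", "* 3 4D 0/1", "* 4 -1O 0/0"]
parsed = parse_headers(lines)
for b in parsed:
    print(f"bunsetsu {b.index} -> head {b.head} ({b.relation})")
```

Read this way, the bunsetsu glossed "cut" (index 2) attaches to "sprinkle" (index 4) with a P relation, matching the coordination of the two actions in the example sentence.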
Other
- Content in the Cookpad Recipe and Image Datasets, which
include the same 500 recipes, can also be used
- There is still room for improvement in Japanese MA, NER, and
DP of cooking recipes
- By improving the analyses using our corpus, a variety of recipe-
related studies based on them can also be improved
Table 2. Existing Japanese parsed corpora and our corpus