Slide 16
From each document, extract the HTML, identify paragraphs, sentences, and words, then tag each word with its part of speech.
Raw Corpus
HTML
corpus = [('How', 'WRB'),
          ('long', 'RB'),
          ('will', 'MD'),
          ('this', 'DT'),
          ('go', 'VB'),
          ('on', 'IN'),
          ('?', '.'),
          ...
]
Pipeline stages: Paras, Sents, Tokens, Tags
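The stages on this slide can be sketched end to end with the Python standard library alone. This is a minimal illustration, not the pipeline the slide's author used: the names `ParagraphExtractor`, `split_sentences`, `tokenize`, `tag`, and `build_corpus` are hypothetical, the sentence and word splitting is naive regex work, and `TOY_TAGS` is a hand-written lookup table standing in for a trained part-of-speech tagger (unknown words fall back to `NN`).

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element as one paragraph (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paras = []
        self._buf = None  # None while we are outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            # Join the collected text and normalise whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paras.append(text)
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)


# ASSUMPTION: a toy lookup table, not a real tagger. It only knows the
# words from the slide's example sentence plus sentence-final punctuation.
TOY_TAGS = {
    "how": "WRB", "long": "RB", "will": "MD", "this": "DT",
    "go": "VB", "on": "IN", "?": ".", ".": ".", "!": ".",
}


def split_sentences(para):
    """Naive split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", para) if s]


def tokenize(sent):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sent)


def tag(tokens):
    """Pair each token with a tag from TOY_TAGS, defaulting to 'NN'."""
    return [(tok, TOY_TAGS.get(tok.lower(), "NN")) for tok in tokens]


def build_corpus(html):
    """HTML -> paragraphs -> sentences -> tokens -> (token, tag) pairs."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    corpus = []
    for para in parser.paras:
        for sent in split_sentences(para):
            corpus.extend(tag(tokenize(sent)))
    return corpus


corpus = build_corpus("<html><body><p>How long will this go on?</p></body></html>")
print(corpus)
```

For the sample input, `corpus` holds the same seven (word, tag) pairs shown on the slide. In practice the regex splitters and the lookup table would be replaced by a trained sentence splitter, tokenizer, and tagger from an NLP library.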