Slide 36
doc1 = "Hello world"
doc2 = "World travelers welcome!"
index = {
    "hello": ["doc1"],
    "world": ["doc1", "doc2"],
    "travel": ["doc2"],
    "welcome": ["doc2"],
    # ...
}
* Think of a Python dictionary
* Split your documents on whitespace to tokenize the content
* Talk about stemming, stop words, etc. (stemming is why "travelers" is indexed as "travel")
* Keys in the dictionary are the (now-unique) words
* Values in the dictionary are lists of document ids
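
The bullets above can be sketched in a few lines of Python. This is a minimal illustration, not a real search engine: the names `STOP_WORDS`, `tokenize`, and `build_index` are hypothetical, the stop-word list is made up for the example, and no stemming is applied, so "travelers" is stored as-is rather than as "travel".

```python
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "of"}


def tokenize(text):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'()")
        if word and word not in STOP_WORDS:
            tokens.append(word)
    return tokens


def build_index(documents):
    """Map each unique word to the list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # dict.fromkeys de-duplicates the tokens while keeping their order,
        # so each document id is added at most once per word.
        for word in dict.fromkeys(tokenize(text)):
            index[word].append(doc_id)
    return dict(index)


documents = {
    "doc1": "Hello world",
    "doc2": "World travelers welcome!",
}
index = build_index(documents)

# Looking up a word is now a single dictionary access.
matches = index.get("world", [])
```

A real engine would add a stemmer at the end of `tokenize` and would usually store positions or term counts alongside each document id.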