Introduction to Information Retrieval Chapter 1

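Presentation slides for a reading group on Chapter 1, "Boolean retrieval", of Introduction to Information Retrieval (情報検索の基礎).
The book is a textbook on web information retrieval, published in 2008.
It covers web search and related topics such as text classification, clustering methods, and crawlers.
It is based on graduate-level computer science courses at Stanford University and the University of Stuttgart. Stanford University publishes the PDF and ppt files.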

nishiokya

April 02, 2019

Transcript

  1. A textbook on web information retrieval published in 2008, based on graduate courses at Stanford University and the University of Stuttgart [...]
    PDF and ppt: https://nlp.stanford.edu/IR-book/ppt/ https://www.kyoritsu-pub.co.jp/bookdetail/9784320123229
  2. [...] XML [...]
  3. § Information retrieval (IR) [...]
    § Search [...]
    Data type             Example             Handled by
    Structured data       [...]               DataBase
    Unstructured data     Free text, [...]    Information Retrieval
    Semi-structured data  XML, HTML           Information Retrieval
  4. [...] HTML, pdf [...]
  5. [...] https://chalow.net/2008-01-19-5.html
  6. § UNIX grep [...] (GREPPING)
    § [...] BRUTUS AND CAESAR, but not CALPURNIA [...]
    § grep BRUTUS XXX.txt | grep CAESAR | grep -v CALPURNIA
    § [...] (GREPPING) [...]
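The grepping approach on this slide amounts to a linear scan: every query reads every document in full. A minimal Python sketch of that idea follows. The three toy documents and the `linear_scan` helper are hypothetical, invented only for illustration. Note that the shell pipeline filters line by line, whereas this sketch tests whole documents.

```python
# Hypothetical toy collection: document name -> full text.
docs = {
    "doc_a": "Brutus killed Caesar",
    "doc_b": "Caesar and Calpurnia",
    "doc_c": "Brutus, Caesar and Calpurnia",
}


def linear_scan(docs, must_have, must_not_have):
    """Scan every document in full; the cost grows with the collection size."""
    hits = []
    for name, text in docs.items():
        # Crude word splitting: uppercase, drop commas, split on whitespace.
        words = set(text.upper().replace(",", " ").split())
        has_all = all(w in words for w in must_have)
        has_none = not any(w in words for w in must_not_have)
        if has_all and has_none:
            hits.append(name)
    return hits


# BRUTUS AND CAESAR, but not CALPURNIA
result = linear_scan(docs, ["BRUTUS", "CAESAR"], ["CALPURNIA"])
print(result)
```

Because nothing is precomputed, the whole collection is read again for each new query, which is the cost an index is meant to avoid.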
  7. § Term-document incidence matrix [...]
    § 1 if the play contains the word, 0 otherwise
                 Anthony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth  . . .
    ANTHONY      1                      1              0            0       0        1
    BRUTUS       1                      1              0            1       0        0
    CAESAR       1                      1              0            1       1        1
    CALPURNIA    0                      1              0            0       0        0
    CLEOPATRA    1                      0              0            0       0        0
    MERCY        1                      0              1            1       1        1
    WORSER       1                      0              1            1       1        0
    . . .
  8.             Anthony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth  . . .
    ANTHONY      1                      1              0            0       0        1
    BRUTUS       1                      1              0            1       0        0
    CAESAR       1                      1              0            1       1        1
    CALPURNIA    0                      1              0            0       0        0
    CLEOPATRA    1                      0              0            0       0        0
    MERCY        1                      0              1            1       1        1
    WORSER       1                      0              1            1       1        0
    . . .
    result:      1                      0              0            1       0        0
    § BRUTUS and CAESAR must be 1, CALPURNIA must be 0 [...]
    § 110100 AND 110111 AND 101111 = 100100
    [...] BRUTUS AND CAESAR, but not CALPURNIA [...]
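The bitwise computation on this slide can be checked with a short sketch. It assumes each row of the incidence matrix is stored as a 6-bit integer whose leftmost bit stands for the first play; the variable names are illustrative only.

```python
# Incidence rows as 6-bit integers; the leftmost bit is the first play.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# Complement CALPURNIA, masking so that only the 6 document bits survive.
mask = 0b111111
not_calpurnia = ~calpurnia & mask  # 0b101111

result = brutus & caesar & not_calpurnia
bits = format(result, "06b")
print(bits)

plays = [
    "Anthony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]
# Keep the plays whose bit is set in the result vector.
matches = [play for play, bit in zip(plays, bits) if bit == "1"]
print(matches)
```

The mask is needed because Python integers are unbounded, so `~calpurnia` on its own is a negative number, not a 6-bit complement.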
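  9. § [...] § 1 million documents § 1,000 words per document § 6 bytes per word § 1 million * 1,000 * 6 bytes = 6 GB
    § 500,000 [...] § [...] 1 million * 500,000 = 5 * 10^11 (500 billion) cells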
  10. § [...] § [...] at least 99.8% of the cells are 0 (Sparse) [...]
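The size estimates on the two slides above can be reproduced with a few lines of arithmetic. The figures below are assumptions read from the partly garbled slides and matched against the textbook: 1 million documents, about 1,000 words per document, about 6 bytes per word, and about 500,000 distinct terms.

```python
# Assumed collection statistics (see the lead-in above).
n_docs = 1_000_000
words_per_doc = 1_000
bytes_per_word = 6
n_terms = 500_000

# Raw text size of the collection in bytes (6 * 10^9, about 6 GB).
collection_bytes = n_docs * words_per_doc * bytes_per_word

# Number of cells in the full term-document incidence matrix (5 * 10^11).
matrix_cells = n_terms * n_docs

# Upper bound on the number of 1 cells: one per word occurrence (10^9).
max_ones = n_docs * words_per_doc

# Minimum fraction of cells that must be 0.
zero_fraction = 1 - max_ones / matrix_cells

print(collection_bytes, matrix_cells, round(zero_fraction * 100, 1))
```

With at most 10^9 ones among 5 * 10^11 cells, storing only the positions of the ones is far cheaper than storing the full matrix.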
  11. [...]
    docID  Title
    1      Anthony and Cleopatra
    2      Julius Caesar
    3      The Tempest
    4      Hamlet
    5      Othello
    6      Macbeth
    . . .
    docID      1  2  3  4  5  6
    BRUTUS     1  1  0  1  0  0
    CAESAR     1  1  0  1  1  1
    CALPURNIA  0  1  0  0  0  0
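One way to read this slide is as the step from incidence rows to postings lists: for each term, keep only the docIDs whose cell is 1. The sketch below assumes that reading; `to_postings` is a hypothetical helper name.

```python
# Incidence rows from the slide, indexed by docID 1..6.
incidence = {
    "BRUTUS":    [1, 1, 0, 1, 0, 0],
    "CAESAR":    [1, 1, 0, 1, 1, 1],
    "CALPURNIA": [0, 1, 0, 0, 0, 0],
}


def to_postings(row):
    """Return the sorted docIDs (starting at 1) whose cell is 1."""
    return [doc_id for doc_id, bit in enumerate(row, start=1) if bit]


inverted_index = {term: to_postings(row) for term, row in incidence.items()}
for term, postings in inverted_index.items():
    print(term, postings)
```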
  12. § [...] 1. Collect the documents 2. Tokenize the text (Tokenizer) 3. Linguistic preprocessing (Linguistic Model) 4. Build the index (Indexer)
    Friends, Romans, countrymen. So let it be with Caesar …
    Tokenizer: Friends Romans countrymen So …
    Linguistic Model: friend roman countryman so …
    [...] (Chapter 2-2)
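The pipeline on this slide (tokenizer, linguistic preprocessing, indexer) can be sketched in a few lines. This is a toy version under stated assumptions: the `LEMMAS` table is a hypothetical stand-in for real linguistic preprocessing, and splitting the example sentence into two documents is done here only so that the index has more than one docID.

```python
import re
from collections import defaultdict

# Hypothetical lookup table standing in for real stemming or lemmatization.
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Lowercase the token and map it to a base form when one is known."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def build_index(docs):
    """Map each normalized term to a sorted list of docIDs containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(t) for t in tokenize(docs[doc_id])}
        for term in terms:
            index[term].append(doc_id)
    return {term: index[term] for term in sorted(index)}


docs = {1: "Friends, Romans, countrymen.", 2: "So let it be with Caesar"}
tokens = tokenize(docs[1])
terms = [normalize(t) for t in tokens]
index = build_index(docs)
print(tokens, terms, index)
```

Documents are visited in increasing docID order, so each postings list comes out sorted without a separate sort step.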
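  13. Brutus and Calpurnia [...]
    1. Locate Brutus in the dictionary
    2. Retrieve the postings for Brutus
    3. Locate Calpurnia in the dictionary
    4. Retrieve the postings for Calpurnia
    5. Intersect the postings of Brutus and Calpurnia
    6. [...]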
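The intersection step for a two-term AND query is usually done as a linear merge of two sorted postings lists. Below is a minimal sketch of that merge. The two postings lists are the ones shown for BRUTUS and CALPURNIA elsewhere in this deck; the function name `intersect` is an assumption.

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The docID is in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that sits on the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 100]
result = intersect(brutus, calpurnia)
print(result)
```

The merge relies on both lists being sorted by docID, which is why postings are kept in that order.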
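  14. § AND of 3 terms [...] § BRUTUS and CALPURNIA and CAESAR [...]
    1. Order the terms by postings size, smallest first (CAESAR(2) -> Calpurnia(4) -> BRUTUS(8))
    2. Retrieve the postings for CAESAR
    3. Retrieve the postings for CALPURNIA
    4. Intersect the postings of CAESAR and CALPURNIA (31)
    5. Retrieve the postings for BRUTUS
    6. Intersect the CAESAR and CALPURNIA result with the postings of BRUTUS
    Postings size: BRUTUS (8), CALPURNIA (4), CAESAR (2)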
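The ordering heuristic for a multi-term AND query can be sketched as follows: process the terms from the shortest postings list to the longest, so that intermediate results stay small. The postings lists are the ones shown elsewhere in this deck (sizes 8, 4, and 2); `intersect_many` is a hypothetical name.

```python
def intersect(p1, p2):
    """Linear merge intersection of two sorted postings lists."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_by_term):
    """AND all terms together, starting from the shortest postings list."""
    ordered = sorted(postings_by_term.items(), key=lambda item: len(item[1]))
    order = [term for term, _ in ordered]
    result = ordered[0][1]
    for _, postings in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return order, result


index = {
    "BRUTUS": [1, 2, 4, 11, 31, 45, 173, 174],
    "CALPURNIA": [2, 31, 54, 100],
    "CAESAR": [5, 31],
}
order, result = intersect_many(index)
print(order, result)
```

The intermediate result can never be longer than the shortest list, so starting with CAESAR bounds all later work by two docIDs.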
  15. Merge
    BRUTUS or CALPURNIA
      BRUTUS:    1,2,4,11,31,45,173,174
      CALPURNIA: 2,31,54,100
      result:    1,2,4,11,31,45,54,100,173,174
    (BRUTUS or CALPURNIA) and not CAESAR
      BRUTUS or CALPURNIA: 1,2,4,11,31,45,54,100,173,174
      CAESAR:              5,31
      result:              1,2,4,11,45,54,100,173,174
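The OR and NOT merges on this slide follow the same two-pointer pattern as the AND merge. A minimal sketch under that assumption is shown below; `union` and `and_not` are hypothetical function names, and the postings lists are the ones on the slide.

```python
def union(p1, p2):
    """Merge two sorted postings lists, keeping each docID once (OR)."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # One list is exhausted; the rest of the other list is appended as is.
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer


def and_not(p1, p2):
    """Keep the docIDs of p1 that do not occur in p2 (AND NOT)."""
    answer = []
    j = 0
    for doc_id in p1:
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            answer.append(doc_id)
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 100]
caesar = [5, 31]

either = union(brutus, calpurnia)
without_caesar = and_not(either, caesar)
print(either)
print(without_caesar)
```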
  16. Merge [...] (tangerine or trees) and (marmalade or skies) and (kaleidoscope or eyes)
    1st: (kaleidoscope or eyes) = 87,009 + 213,312 = 300,321
    2nd: (tangerine or trees) = 46,653 + 316,812 = 363,465
    3rd: (marmalade or skies) = 107,913 + 271,658 = 379,571
    Term          Postings size
    eyes          213,312
    kaleidoscope  87,009
    marmalade     107,913
    skies         271,658
    tangerine     46,653
    trees         316,812
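The processing order on this slide can be derived mechanically: estimate the size of each OR group as the sum of its postings sizes (an upper bound on the union), then process the groups from smallest to largest estimate. A short sketch, with variable names chosen only for illustration:

```python
# Postings sizes from the slide.
freq = {
    "eyes": 213_312,
    "kaleidoscope": 87_009,
    "marmalade": 107_913,
    "skies": 271_658,
    "tangerine": 46_653,
    "trees": 316_812,
}

# Each tuple is one OR group of the query, in the order the query lists them.
groups = [
    ("tangerine", "trees"),
    ("marmalade", "skies"),
    ("kaleidoscope", "eyes"),
]

# Upper bound on the size of each OR group: the sum of its postings sizes.
estimates = {group: sum(freq[term] for term in group) for group in groups}

# Process the AND from the smallest estimated group to the largest.
order = sorted(groups, key=estimates.get)
for group in order:
    print(group, estimates[group])
```

The sum overestimates the union whenever the two terms share documents, but it needs only the stored postings sizes, so it is cheap to compute before any list is read.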
  17. Google [...]
    § Google: [w1 w2 ... wn] is w1 AND w2 AND ... AND wn
    § Cases where you get hits that do not contain one of the wi:
      § anchor text
      § page contains variant of wi (morphology, spelling correction, synonym)
      § long queries (n large)
      § boolean expression generates very few hits
    § Simple Boolean vs. Ranking of result set
      § Simple Boolean retrieval returns matching documents in no particular order.
      § Google (and most well designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.
    YES, [...]
  18. [...] § Phrase queries (“Stanford University”) § Proximity queries (“Gates /s Microsoft”)
    § [...] term frequency (TF) [...] docID [...]
  19. 1. [...] 2. [...] And, Or, Not [...] 3. [...] 4. [...] 5. [...]
  20. 1. 2008 [...]
      1. https://chalow.net/2008-01-12-1.html
      2. http://d.hatena.ne.jp/naoya/20080205/1202208135
    2. 2008 [...]
      1. http://naoya.dyndns.org/~naoya/iir/ppt/
    3. Udemy: Information Retrieval and Mining Massive Data Sets [...]
      1. [...]
      2. https://www.udemy.com/information-retrieval-and-mining-massive-data-sets/learn/v4/content
    4. [...]
      1. [...]
      2. https://www.youtube.com/watch?v=bOlvDIjUmf8