Introduction to Information Retrieval Chapter 1

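Slides for a reading-group session on Chapter 1, "Boolean retrieval", of 情報検索の基礎 (Introduction to Information Retrieval).
The book is a textbook-style treatment of web information retrieval, published in 2008.
It covers web search and related topics: text classification, clustering methods, and crawlers.
It compiles lectures given to computer science graduate students at Stanford University and the University of Stuttgart. PDF and ppt versions are published by Stanford University.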

nishiokya

April 02, 2019

Transcript

  1. Chapter.1  1 

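  2. A textbook-style book on web information retrieval, published in 2008. It covers web search and related text classification, clustering methods, and crawlers. It compiles lectures for computer science graduate students at Stanford University and the University of Stuttgart.

    PDF and ppt versions are published by Stanford University: https://nlp.stanford.edu/IR-book/ppt/ https://www.kyoritsu-pub.co.jp/bookdetail/9784320123229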
  3. https://nlp.stanford.edu/IR-book/ppt/ http://web.stanford.edu/class/cs276/

  4. #  2 9 3 8 3 8 1 ,

    , 8 1 2 8 86 XML #  0 7 3 3 8 8 8 9 8 9 3 8 45 45 8 9 3 8 8
  5. 1. $*/ 2. . %)#, 3. -' ' 4. -'

     $* 5. (!&+"
  6. § Information retrieval (IR)

    § Search, by type of data:
      Structured data: DataBase
      Unstructured data (Free text): Information Retrieval
      Semi-structured data (XML, HTML): Information Retrieval
  7. § IR  v

  8. E? >D#,64 . 9  >D HTML,pdf + $)@57>D G=>D@FC0

    %$) : /-B;8>D $*(>D! >D §E? 64 §3H 64 3H .  ( ">D &1A  <2 &') (*!5'3I 
  9. 1.   1.    2.  

    1.  2.   https://chalow.net/2008-01-19-5.html
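  10. § The simplest approach: a linear scan of the text with UNIX grep (GREPPING)

    § Query over Shakespeare's works: BRUTUS AND CAESAR, but not CALPURNIA
    § grep BRUTUS XXX.txt | grep CAESAR | grep -v CALPURNIA
    § Adequate for a collection of about one million words
    § Drawbacks of GREPPING
      § Slow on large collections: every query is a linear scan, O(N)
      § No flexible matching operations
      § No ranked retrieval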
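To make the cost of the linear scan concrete, here is a minimal Python sketch of the same query. The `docs` dictionary and the `linear_scan` helper are hypothetical: each play is reduced to a toy word list consistent with the term-document incidence data used in this deck, not the real text.

```python
# Toy stand-ins for the plays (hypothetical word lists, not the real texts).
docs = {
    "Anthony and Cleopatra": "ANTHONY BRUTUS CAESAR CLEOPATRA MERCY WORSER",
    "Julius Caesar": "ANTHONY BRUTUS CAESAR CALPURNIA",
    "The Tempest": "MERCY WORSER",
    "Hamlet": "BRUTUS CAESAR MERCY WORSER",
    "Othello": "CAESAR MERCY WORSER",
    "Macbeth": "ANTHONY CAESAR MERCY",
}


def linear_scan(docs, must_have, must_not_have):
    """Read every document in full for every query: O(N) in collection size."""
    hits = []
    for title, text in docs.items():
        words = set(text.split())
        has_all = all(term in words for term in must_have)
        has_none = not any(term in words for term in must_not_have)
        if has_all and has_none:
            hits.append(title)
    return hits


result = linear_scan(docs, ["BRUTUS", "CAESAR"], ["CALPURNIA"])
print(result)
```

Unlike the grep pipeline, which filters line by line, this sketch applies the Boolean condition per document, which is what the query actually asks for.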
  11. § Term-document incidence matrix: 1 if the play contains the term, 0 otherwise

    Term       Anthony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth  . . .
    ANTHONY    1                      1              0            0       0        1
    BRUTUS     1                      1              0            1       0        0
    CAESAR     1                      1              0            1       1        1
    CALPURNIA  0                      1              0            0       0        0
    CLEOPATRA  1                      0              0            0       0        0
    MERCY      1                      0              1            1       1        1
    WORSER     1                      0              1            1       1        0
    . . .
  12. § Answering BRUTUS AND CAESAR, but not CALPURNIA with the incidence matrix

    Term       Anthony and Cleopatra  Julius Caesar  The Tempest  Hamlet  Othello  Macbeth  . . .
    ANTHONY    1                      1              0            0       0        1
    BRUTUS     1                      1              0            1       0        0
    CAESAR     1                      1              0            1       1        1
    CALPURNIA  0                      1              0            0       0        0
    CLEOPATRA  1                      0              0            0       0        0
    MERCY      1                      0              1            1       1        1
    WORSER     1                      0              1            1       1        0
    . . .
    result:    1                      0              0            1       0        0

    § Find the plays where BRUTUS and CAESAR are 1 and CALPURNIA is 0
    § 110100 AND 110111 AND 101111 = 100100 (101111 is the complement of the CALPURNIA vector 010000)
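The bit-vector computation on this slide can be checked with Python integers. This is a minimal sketch: the variable names are illustrative, and the vectors are copied from the incidence matrix, with the leftmost bit standing for the first play.

```python
# One bit per play, in the column order of the incidence matrix.
plays = [
    "Anthony and Cleopatra", "Julius Caesar", "The Tempest",
    "Hamlet", "Othello", "Macbeth",
]
vectors = {"BRUTUS": 0b110100, "CAESAR": 0b110111, "CALPURNIA": 0b010000}

# Python integers are unbounded, so mask the complement down to 6 bits.
mask = (1 << len(plays)) - 1
not_calpurnia = ~vectors["CALPURNIA"] & mask  # 101111

answer = vectors["BRUTUS"] & vectors["CAESAR"] & not_calpurnia
bits = format(answer, "06b")

# Map the 1 bits back to play titles.
matching = [play for play, bit in zip(plays, bits) if bit == "1"]
print(bits, matching)
```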
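  13. § A larger collection

    § 1 million documents
    § About 1,000 words per document
    § About 6 bytes per word
    § 1 million * 1,000 * 6 bytes = 6 GB
    § About 500,000 distinct terms
    § Term-document matrix: 1 million * 500,000 = 500 billion (5 * 10^11) cells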
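The arithmetic on this slide can be reproduced in a few lines. The sketch assumes the slide's figures stand for 1,000,000 documents and 500,000 distinct terms, as in the book's example; the variable names are illustrative.

```python
n_docs = 1_000_000        # documents in the collection
words_per_doc = 1_000     # average words per document
bytes_per_word = 6        # average bytes per word
n_terms = 500_000         # distinct terms

collection_bytes = n_docs * words_per_doc * bytes_per_word
matrix_cells = n_docs * n_terms  # one 0/1 cell per (term, document) pair

print(f"collection size: {collection_bytes / 10**9:.0f} GB")
print(f"matrix cells: {matrix_cells:.1e}")
```

The cell count comes out at 5 * 10^11, i.e. 500 billion, which is why a dense matrix is not a practical representation.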
  14. § The term-document matrix is extremely sparse: at least 99.8% of its cells are 0 (Sparse)

    § A better representation records only the positions of the 1 entries
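The 99.8% figure follows from a simple bound, sketched below under the book's assumptions of 1,000,000 documents of about 1,000 words each and 500,000 distinct terms. Each word occurrence can set at most one matrix cell to 1.

```python
n_docs = 1_000_000
words_per_doc = 1_000
n_terms = 500_000

# Upper bound on the number of 1 entries: one per word occurrence.
max_ones = n_docs * words_per_doc
cells = n_docs * n_terms

zero_fraction = 1 - max_ones / cells
print(f"at least {zero_fraction:.1%} of the cells are 0")
```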
  15. § Assign each document a docID

    docID  Title
    1      Anthony and Cleopatra
    2      Julius Caesar
    3      The Tempest
    4      Hamlet
    5      Othello
    6      Macbeth
    . . .

    docID      1  2  3  4  5  6
    BRUTUS     1  1  0  1  0  0
    CAESAR     1  1  0  1  1  1
    CALPURNIA  0  1  0  0  0  0
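Recording only the 1 entries of these rows gives one sorted list of docIDs per term, which is the postings list of an inverted index. A minimal sketch, with illustrative names and the rows taken from the table on this slide:

```python
# Rows of the incidence matrix, indexed by docID 1..6.
incidence = {
    "BRUTUS":    [1, 1, 0, 1, 0, 0],
    "CAESAR":    [1, 1, 0, 1, 1, 1],
    "CALPURNIA": [0, 1, 0, 0, 0, 0],
}

# Keep only the docIDs whose cell is 1; docIDs start at 1.
postings = {
    term: [doc_id for doc_id, bit in enumerate(row, start=1) if bit]
    for term, row in incidence.items()
}

for term, doc_ids in postings.items():
    print(term, "->", doc_ids)
```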
  16. 1. $*/ 2. . %)#, 3. -' ' 4. -'

     $ * 5. (!&+"
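  17. § Building the inverted index

    1. Collect the documents to be indexed: "Friends, Romans, countrymen." "So let it be with Caesar" …
    2. Tokenize the text (Tokenizer): Friends Romans countrymen So …
    3. Apply linguistic preprocessing to normalize the tokens (Linguistic Model): friend roman countryman so …
    4. Index the documents in which each term occurs (Indexer)
    Details: Chapter2-2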
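The four steps can be sketched end to end on the two example sentences. Everything here is a toy under stated assumptions: the docIDs 1 and 2, the `NORMALIZE` lookup table (a hand-written stand-in for a real stemmer or lemmatizer), and the function names are all hypothetical.

```python
import re
from collections import defaultdict

# Step 1: collect the documents (docIDs 1 and 2 are assumed for illustration).
docs = {1: "Friends, Romans, countrymen.", 2: "So let it be with Caesar"}

# Hand-written stand-in for linguistic preprocessing; not a real lemmatizer.
NORMALIZE = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}


def tokenize(text):
    """Step 2: split the text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def normalize(token):
    """Step 3: lowercase, then apply the toy normalization table."""
    lowered = token.lower()
    return NORMALIZE.get(lowered, lowered)


def build_index(docs):
    """Step 4: sort unique (term, docID) pairs and group them into postings."""
    pairs = sorted(
        {(normalize(tok), doc_id) for doc_id, text in docs.items() for tok in tokenize(text)}
    )
    index = defaultdict(list)
    for term, doc_id in pairs:
        index[term].append(doc_id)
    return dict(index)


index = build_index(docs)
print(index)
```

Sorting the pairs before grouping keeps both the dictionary terms and each postings list in order, which the merge-based query algorithms rely on.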
  18. ,# + (" ' ID(docID) Step1. # %  Step2.!&)

    %* #$
  19. Step3.(+-&*  .,    ) " ) $%

    (+ '!# (
  20. Step4. !% "  &#'  $   Step5.!%

    &#   
  21. 256- !%$ )+ 7%   7%08Adhoc/3'1 .:,* 6-"#$!%$ 94

     ( 6-!%$ &  6- !%$
  22. 1. $*/ 2. . %)#, 3. -' ' 4. -'

     $* 5. (!&+"
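  23. Processing the query Brutus and Calpurnia

    1. Locate Brutus in the dictionary
    2. Retrieve the postings for Brutus
    3. Locate Calpurnia in the dictionary
    4. Retrieve the postings for Calpurnia
    5. Intersect the postings of Brutus and Calpurnia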
  24. Intersection of the BRUTUS and CALPURNIA postings lists

    Operation = O(x + y), where x and y are the lengths of the two postings lists
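The O(x + y) bound comes from walking both sorted postings lists once with two pointers. A minimal sketch of that merge follows; the function name is illustrative, and the postings are the BRUTUS and CALPURNIA lists that appear later in this deck.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 100]
print(intersect(brutus, calpurnia))
```

The merge is only correct if both lists are sorted by the same ordering, which is why postings are kept sorted by docID.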
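  25. § Query optimization for an AND of three terms

    § BRUTUS and CALPURNIA and CAESAR
    1. Order the terms by increasing postings-list length: CAESAR(2) -> Calpurnia(4) -> BRUTUS(8)
    2. Retrieve the CAESAR postings
    3. Retrieve the CALPURNIA postings
    4. Intersect CAESAR and CALPURNIA (31)
    5. Retrieve the BRUTUS postings
    6. Intersect the CAESAR and CALPURNIA result with BRUTUS
    Postings-list lengths: BRUTUS (8), CALPURNIA (4), CAESAR (2)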
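Processing the shortest lists first keeps every intermediate result small. The sketch below applies that ordering; the helper names are illustrative, and the CAESAR postings [5, 31] are an assumption read off the two-element list used elsewhere in this deck.

```python
def intersect(p1, p2):
    """Two-pointer merge of postings lists sorted by docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """AND together one or more postings lists, shortest first."""
    ordered = sorted(postings_lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, postings)
    return result


brutus = [1, 2, 4, 11, 31, 45, 173, 174]   # 8 postings
calpurnia = [2, 31, 54, 100]               # 4 postings
caesar = [5, 31]                           # 2 postings (assumed values)

print(intersect_many([brutus, calpurnia, caesar]))
```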
  26. None
  27. Merge for OR and NOT

    BRUTUS or CALPURNIA
      BRUTUS:     1,2,4,11,31,45,173,174
      CALPURNIA:  2,31,54,100
      Result:     1,2,4,11,31,45,54,100,173,174

    (BRUTUS or CALPURNIA) and not CAESAR
      BRUTUS or CALPURNIA:  1,2,4,11,31,45,54,100,173,174
      CAESAR:               5,31
      Result:               1,2,4,11,45,54,100,173,174
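OR and AND NOT can use the same single-pass merge idea as AND. A minimal sketch with illustrative function names, run on the postings lists from this slide:

```python
def union(p1, p2):
    """OR: merge two sorted postings lists, keeping each docID once."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    answer.extend(p1[i:])  # at most one of these tails is non-empty
    answer.extend(p2[j:])
    return answer


def and_not(p1, p2):
    """AND NOT: keep the docIDs of p1 that do not occur in p2."""
    answer = []
    j = 0
    for doc_id in p1:
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            answer.append(doc_id)
    return answer


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 100]
caesar = [5, 31]

either = union(brutus, calpurnia)
print(either)
print(and_not(either, caesar))
```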
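  28. Merge order for a more general query

    (tangerine or trees) and (marmalade or skies) and (kaleidoscope or eyes)

    Estimate the size of each OR by the sum of its term frequencies, then process in increasing order:
    1st (kaleidoscope or eyes) = 87,009 + 213,312 = 300,321
    2nd (tangerine or trees) = 46,653 + 316,812 = 363,465
    3rd (marmalade or skies) = 107,913 + 271,658 = 379,571

    Term          Frequency
    eyes          213,312
    kaleidoscope  87,009
    marmalade     107,913
    skies         271,658
    tangerine     46,653
    trees         316,812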
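The processing order can be derived mechanically from the frequency table. In this sketch the size of each OR group is estimated by the sum of its term frequencies, which is a conservative upper bound; the variable names are illustrative.

```python
freq = {
    "eyes": 213_312,
    "kaleidoscope": 87_009,
    "marmalade": 107_913,
    "skies": 271_658,
    "tangerine": 46_653,
    "trees": 316_812,
}

# Each tuple is one OR group of the conjunctive query.
groups = [("tangerine", "trees"), ("marmalade", "skies"), ("kaleidoscope", "eyes")]

# Upper bound on the size of each OR result: sum of the term frequencies.
estimates = {group: sum(freq[term] for term in group) for group in groups}

# Process the AND in increasing order of estimated size.
order = sorted(groups, key=estimates.get)
for rank, group in enumerate(order, start=1):
    print(rank, " or ".join(group), estimates[group])
```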
  29. 1. $*/ 2. . %)#, 3. -' ' 4. -'

     $* 5. (!&+"
  30. Does Google use the Boolean model?

    § On Google, the default interpretation of a query [w1 w2 . . . wn] is w1 AND w2 AND . . . AND wn
    § Cases where you get hits that do not contain one of the wi:
      § anchor text
      § page contains variant of wi (morphology, spelling correction, synonym)
      § long queries (n large)
      § boolean expression generates very few hits
    § Simple Boolean vs. Ranking of result set
      § Simple Boolean retrieval returns matching documents in no particular order.
      § Google (and most well designed Boolean engines) rank the result set: they rank good hits (according to some estimator of relevance) higher than bad hits.
    YES
  31. -L 3C " 0GI J=&AIM/,A #%?D(“Stanford University”) K:?D(“Gates /s Microsoft”)

    $97+@'F (2 H;1+@AI8+ .I+@O6(TF)  (1B <=*O6P N5docIDN %E>3 $$ "!4)
  32. 1. 6! 3*,9 "'$ 2. 50-2 And,Or,Not14:  " 3.

    30)%#  & 4. (8-2!! /+ 5. -2 !.7
  33. 1. $*/ 2. . %)#, 3. -' ' 4. -'

     $* 5. (!&+"
  34. 1. 2008
        1. https://chalow.net/2008-01-12-1.html
        2. http://d.hatena.ne.jp/naoya/20080205/1202208135

    2. 2008
        1. http://naoya.dyndns.org/~naoya/iir/ppt/
    3. Udemy: Information Retrieval and Mining Massive Data Sets
        1. https://www.udemy.com/information-retrieval-and-mining-massive-data-sets/learn/v4/content
    4. https://www.youtube.com/watch?v=bOlvDIjUmf8