Apache Lucene/Solr Internals for JEEConf

Anatoliy Sokolenko

May 23, 2014

Transcript

  1. 100 GB disk space • 18 066 980 records • Indexing took 1.5 hours in 200 threads, batch size 1000 • 4 CPU cores • 16 GB memory
  2. Data Model • document oriented • flat • store • index
    Example document: score:1, tag:java, type:answer; Document boost = 1.1; docID = 23
  3. Basic Flow (indexing). Document (score:1, tag:java, type:answer; Document boost = 1.1) → addDocument → Index Writer + Analyzer → Directory.
    Lucene components on the slide: Directory, Index Writer, Index Searcher, Analyzer, Index Reader
  4. Basic Flow (query). The query tag:java goes to the Index Searcher, which reads the Directory through the Index Reader.
  5. Basic Flow (search). search returns the matching Document (score:1, tag:java, type:answer; Document boost = 1.1).
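The flow on slides 3 to 5 can be sketched with a toy in-memory index. This is plain Java, not the Lucene API: `ToyIndex`, its `addDocument`, `search` and `doc` methods, and the lowercasing that stands in for an Analyzer are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of the add/search flow. Hypothetical class, NOT the Lucene API.
class ToyIndex {
    // term ("field:value") -> posting list of doc IDs in insertion order
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    // docID -> stored fields (the "store" part of the data model)
    private final List<Map<String, String>> stored = new ArrayList<>();

    // Writer side: assign the next docID and add one posting per field.
    int addDocument(Map<String, String> fields) {
        int docId = stored.size();
        stored.add(new HashMap<>(fields));
        for (Map.Entry<String, String> f : fields.entrySet()) {
            // Lowercasing is a stand-in for a real analysis chain.
            String term = f.getKey() + ":" + f.getValue().toLowerCase();
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
        return docId;
    }

    // Searcher side: look up a single term and return the stored documents.
    List<Map<String, String>> search(String field, String value) {
        List<Map<String, String>> hits = new ArrayList<>();
        String term = field + ":" + value.toLowerCase();
        for (int docId : postings.getOrDefault(term, new ArrayList<>())) {
            hits.add(stored.get(docId));
        }
        return hits;
    }

    // Helper that builds a document shaped like the one on the slides.
    static Map<String, String> doc(String score, String tag, String type) {
        Map<String, String> d = new HashMap<>();
        d.put("score", score);
        d.put("tag", tag);
        d.put("type", type);
        return d;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument(doc("1", "java", "answer"));
        index.addDocument(doc("5", "mysql", "question"));
        System.out.println(index.search("tag", "java"));
    }
}
```

The sketch leaves out boosts, scoring and persistence; it only shows that indexing fills a term-to-docID map and that a term query is a lookup in that map.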
  6. Inverted index. Term Infos: score:0, score:1, score:5, ...; tag:java, tag:mysql, tag:css, ...; type:answer, type:question.
    Term Frequencies: posting lists of doc IDs, each stored as a first ID followed by gaps (3 +1 +2; 10 +3 +1; 4 +11; 5 +2; 6 +52 +1; 1 +30 +27; ...).
  7. Same diagram with a Term Info Index added: a sparse sample of the terms (score:0, ..., tag:mysql, ..., type:question) that points into Term Infos.
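The "+N" entries on slides 6 and 7 are gaps between consecutive doc IDs. A minimal sketch of that gap (delta) encoding follows; `DeltaPostings`, `encode` and `decode` are hypothetical names, and real Lucene codecs also compress the gaps, which is omitted here.

```java
import java.util.Arrays;

// Gap (delta) encoding of a sorted posting list. Illustrative names only.
class DeltaPostings {
    // Sorted doc IDs -> first ID followed by the gap to the previous ID.
    static int[] encode(int[] docIds) {
        int[] gaps = new int[docIds.length];
        int previous = 0;
        for (int i = 0; i < docIds.length; i++) {
            gaps[i] = docIds[i] - previous;
            previous = docIds[i];
        }
        return gaps;
    }

    // First ID followed by gaps -> absolute doc IDs (running sum).
    static int[] decode(int[] gaps) {
        int[] docIds = new int[gaps.length];
        int current = 0;
        for (int i = 0; i < gaps.length; i++) {
            current += gaps[i];
            docIds[i] = current;
        }
        return docIds;
    }

    public static void main(String[] args) {
        // "6 +52 +1" from the slide
        System.out.println(Arrays.toString(decode(new int[] {6, 52, 1})));  // [6, 58, 59]
        System.out.println(Arrays.toString(encode(new int[] {6, 58, 59}))); // [6, 52, 1]
    }
}
```

Gaps are small numbers for dense terms, which is why this layout compresses well once a variable-length integer encoding is applied on top.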
  8. Directory with three segments: Segment α, Segment β, Segment ɣ. Documents A B C D E F G H I J K L, each with a live flag of 1.
  9. New copies of K and L appear next to the existing segments. The twelve existing flags are still 1.
  10. A new Segment δ holds the new K and L. The flags of the old K and L become 0; all other flags stay 1.
  11. Same state: the old L and K carry flag 0, the new L and K in Segment δ carry flag 1.
  12. Segment ε appears alongside Segment α, β, ɣ and δ. Only flags of 1 remain on the slide.
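Slides 8 to 12 can be read as: an update marks the old copy as deleted and writes the new copy into a fresh segment, and a later merge drops the deleted copies. The sketch below models that under simplifying assumptions: `ToySegments` and its methods are hypothetical, documents are identified by a one-letter key, and `mergeAll` merges every segment at once, whereas the slide shows ε next to the other segments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Toy model of segments with live flags. Hypothetical class, NOT the Lucene API.
class ToySegments {
    // A segment's document list never changes; only its live bits do.
    static class Segment {
        final List<String> docs;
        final BitSet live = new BitSet();

        Segment(List<String> docs) {
            this.docs = new ArrayList<>(docs);
            live.set(0, docs.size()); // every document starts with flag 1
        }
    }

    final List<Segment> segments = new ArrayList<>();

    // Write a batch of documents as a new segment.
    void flush(List<String> docs) {
        segments.add(new Segment(docs));
    }

    // Update = flip the old copies to 0, then write the new copies to a new segment.
    void update(List<String> docs) {
        for (Segment s : segments) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (docs.contains(s.docs.get(i))) s.live.clear(i);
            }
        }
        flush(docs);
    }

    // Merge = copy only live documents into one new segment, dropping the rest.
    void mergeAll() {
        List<String> survivors = new ArrayList<>();
        for (Segment s : segments) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (s.live.get(i)) survivors.add(s.docs.get(i));
            }
        }
        segments.clear();
        flush(survivors);
    }

    int totalDocs() {
        int n = 0;
        for (Segment s : segments) n += s.docs.size();
        return n;
    }

    int liveDocs() {
        int n = 0;
        for (Segment s : segments) n += s.live.cardinality();
        return n;
    }

    public static void main(String[] args) {
        ToySegments dir = new ToySegments();
        dir.flush(Arrays.asList("A", "B", "C", "D"));
        dir.flush(Arrays.asList("E", "F", "G", "H"));
        dir.flush(Arrays.asList("I", "J", "K", "L"));
        dir.update(Arrays.asList("K", "L"));
        System.out.println(dir.segments.size() + " segments, " + dir.totalDocs()
                + " stored, " + dir.liveDocs() + " live"); // 4 segments, 14 stored, 12 live
        dir.mergeAll();
        System.out.println(dir.segments.size() + " segments, " + dir.totalDocs()
                + " stored, " + dir.liveDocs() + " live"); // 1 segments, 12 stored, 12 live
    }
}
```

Until a merge runs, the deleted copies still occupy space, which is one reason updates cost more in this model than in a store that overwrites in place.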
  13. NLP

  14. Analyzer: Char filter → Tokenizer → Filter, applied at Index time and at Query time.
    Index time, input: <strong>There are no pointers in Java!</strong>
    Index time, Char filter output: There are no pointers in Java!
  15. Index time, Tokenizer output: There | are | no | pointers | in | Java
  16. Index time, Filter output: ? ? ? pointer ? java (each ? is a position whose token was removed)
  17. Query time, input: pointers in Java
  18. Query time, Char filter output: pointers in Java
  19. Query time, Tokenizer output: pointers | in | Java
  20. Query time, Filter output: pointer ? java
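The three stages on slides 14 to 20 can be sketched as three small functions. Everything here is an assumption for illustration: `ToyAnalyzer` is not a Lucene class, the stop-word list contains only the words removed on the slide, and the "stemmer" merely strips a trailing "s". Unlike the slide, the sketch drops removed positions instead of keeping gaps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: char filter -> tokenizer -> token filters. Hypothetical class.
class ToyAnalyzer {
    // Assumed stop words: only the ones removed in the slide example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("there", "are", "no", "in"));

    // Char filter: strip HTML-like tags before tokenizing.
    static String charFilter(String text) {
        return text.replaceAll("<[^>]+>", "");
    }

    // Tokenizer: split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Token filters: lowercase, drop stop words, naive plural stemming.
    static List<String> filter(List<String> tokens) {
        List<String> terms = new ArrayList<>();
        for (String t : tokens) {
            String term = t.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(term)) continue;
            if (term.length() > 3 && term.endsWith("s")) {
                term = term.substring(0, term.length() - 1);
            }
            terms.add(term);
        }
        return terms;
    }

    static List<String> analyze(String text) {
        return filter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        // Index time and query time run the same chain, so the terms match.
        System.out.println(analyze("<strong>There are no pointers in Java!</strong>")); // [pointer, java]
        System.out.println(analyze("pointers in Java"));                                // [pointer, java]
    }
}
```

The point of running one chain on both sides is visible in the output: the document and the query reduce to the same terms, so the term lookup succeeds.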
  21. Shingle. Input: There are no pointers in Java
    Shingles (1 to 4 words): There • There are • There are no • There are no pointers • are • are no • are no pointers • are no pointers in • no • no pointers • no pointers in • no pointers in Java • pointers • pointers in • pointers in Java • in • in Java • Java
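Shingles are word n-grams taken at every start position. The sketch below generates the shingles of slide 21, assuming a maximum size of four words with single words included; `Shingles.shingles` is a hypothetical helper, not Lucene's shingle filter.

```java
import java.util.ArrayList;
import java.util.List;

// Word n-grams ("shingles") of size 1..maxSize. Hypothetical helper.
class Shingles {
    static List<String> shingles(String text, int maxSize) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return out;
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder shingle = new StringBuilder();
            // Grow the shingle one word at a time until maxSize or end of input.
            for (int len = 1; len <= maxSize && start + len <= words.length; len++) {
                if (len > 1) shingle.append(' ');
                shingle.append(words[start + len - 1]);
                out.add(shingle.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : shingles("There are no pointers in Java", 4)) {
            System.out.println(s);
        }
    }
}
```

For the six-word input this yields 4 + 4 + 4 + 3 + 2 + 1 = 18 shingles, in the same order as on the slide.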
  22. Algorithm. Index: tag:java → 5 +2; tag:mysql → 6 +52 +1; tag:css → 1 +30 +27 +2. Query Result: 7, 31, 58, 59.
  23. Posting lists decoded to doc IDs: tag:java → 5, 7; tag:mysql → 6, 58, 59; tag:css → 1, 31, 58, 60.
  24. Each posting list intersected with the Query Result: tag:java → 7; tag:mysql → 58, 59; tag:css → 31, 58.
  25. Facet counts: tag:java = 1, tag:mysql = 2, tag:css = 2.
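The facet counts on slide 25 are the sizes of the intersections between the query result and each term's posting list. A minimal sketch with the numbers from the slides follows; `FacetCounts` and its methods are hypothetical, and the two-pointer intersection is one straightforward way to do it, not necessarily what Lucene or Solr does internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Facet counting as sorted-list intersection. Hypothetical class.
class FacetCounts {
    // Count doc IDs present in both sorted arrays with a two-pointer walk.
    static int intersectionSize(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                count++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return count;
    }

    // One count per facet term: how many result docs carry that term.
    static Map<String, Integer> facets(Map<String, int[]> postings, int[] result) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            counts.put(e.getKey(), intersectionSize(e.getValue(), result));
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, int[]> postings = new LinkedHashMap<>();
        postings.put("tag:java", new int[] {5, 7});
        postings.put("tag:mysql", new int[] {6, 58, 59});
        postings.put("tag:css", new int[] {1, 31, 58, 60});
        int[] result = {7, 31, 58, 59};
        System.out.println(facets(postings, result)); // {tag:java=1, tag:mysql=2, tag:css=2}
    }
}
```

Because both inputs are sorted doc ID lists, each facet value costs one linear pass, which is the sense in which facets are cheap on this kind of index.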
  26. Lucene Index vs RDBMS
    Lucene • Text search • Updates are expensive • Requests are N times faster • Facets are cheap • Values are modified
    DB • Value search • Updates are cheap • Requests are N times slower • Facets are expensive • Values are stored as is