
Apache Lucene/Solr Internals

Anatoliy Sokolenko

December 08, 2013

Transcript

  1. 4 nodes × 12 GB disk space. June 2013 database: 14 630 209 records. Indexing took 2.5 hours in 200 threads, in batches of 1000. VM: 16 CPU cores, 16 GB memory.
  2. Data Model • document oriented • flat • store • index. Example document: score:1, tag:java, type:answer; document boost = 1.1; docID = 23.
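
The data model on this slide can be pictured with a small Python sketch. The `Field` and `Document` classes and their attribute names are hypothetical and are not Lucene's API; they only illustrate that a document is a flat list of fields, each of which may be stored, indexed, or both.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    name: str
    value: str
    stored: bool = True    # value can be returned with search results
    indexed: bool = True   # value is searchable through the inverted index


@dataclass
class Document:
    # Flat structure: a list of fields, no nesting (hypothetical model).
    fields: List[Field] = field(default_factory=list)
    boost: float = 1.0

    def terms(self) -> List[str]:
        """Return the searchable "name:value" terms of this document."""
        return [f"{f.name}:{f.value}" for f in self.fields if f.indexed]


# The example document from the slide.
doc = Document(
    fields=[Field("score", "1"), Field("tag", "java"), Field("type", "answer")],
    boost=1.1,
)
```

Calling `doc.terms()` yields the three `name:value` pairs shown on the slide.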
  3. Basic Flow (diagram). A Document (score:1, tag:java, type:answer; boost = 1.1) goes through addDocument to the Index Writer, which uses the Analyzer and writes to the Lucene Directory. The Index Searcher sits on top of an Index Reader over the same Directory.
  4. Basic Flow (continued). The query tag:java is sent to the Index Searcher.
  5. Basic Flow (continued). The search returns the matching Document (score:1, tag:java, type:answer; boost = 1.1).
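
The add-then-search flow can be imitated in a few lines of Python. This is only a toy, in-memory sketch: `ToyIndex`, `add_document` and `search` are hypothetical names, there is no analyzer, and nothing here reflects how Lucene's writer, reader and searcher are actually implemented.

```python
from collections import defaultdict


class ToyIndex:
    """Hypothetical stand-in for the writer/searcher pair; not Lucene's API."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> ascending doc IDs
        self.stored = []                   # doc ID -> original document

    def add_document(self, doc):
        """Store the document and index every "name:value" pair as a term."""
        doc_id = len(self.stored)
        self.stored.append(doc)
        for name, value in doc.items():
            self.postings[f"{name}:{value}"].append(doc_id)
        return doc_id

    def search(self, term):
        """Look the term up in the inverted index and return stored documents."""
        return [self.stored[doc_id] for doc_id in self.postings.get(term, [])]


index = ToyIndex()
index.add_document({"score": "1", "tag": "java", "type": "answer"})
index.add_document({"score": "5", "tag": "css", "type": "question"})
hits = index.search("tag:java")
```

Here `hits` contains only the first document, mirroring the query for tag:java on the slides.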
  6. Directory (diagram). Segments α, β, γ hold documents A to L; every document has a live bit of 1.
  7. Directory (continued). Second copies of K and L appear next to the existing segments.
  8. Directory (continued). A fourth segment, δ, is added; two live bits are now 0.
  9. Directory (continued). Same four segments, with the 0 bits drawn next to K and L.
  10. Directory (continued). A new segment ε with two live bits of 1 appears; no 0 bits remain.
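
One way to read this sequence is as the usual segment life cycle: segments are written once, an update marks the old copy as deleted and writes the new copy into a fresh segment, and a later merge drops the deleted copies. The sketch below assumes that reading. The `Segment` class, the `update` and `merge` helpers, and the grouping of documents A to L into three segments of four are all assumptions made for illustration.

```python
class Segment:
    """Hypothetical write-once segment: documents plus one live bit each."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.live = [1] * len(self.docs)

    def delete(self, doc):
        for i, d in enumerate(self.docs):
            if d == doc and self.live[i]:
                self.live[i] = 0  # flip the bit only; document data is not rewritten


def update(segments, docs):
    """Mark old copies as deleted, then write the new copies to a new segment."""
    for seg in segments:
        for doc in docs:
            seg.delete(doc)
    segments.append(Segment(docs))


def merge(segments):
    """Copy only the live documents of the given segments into one new segment."""
    return Segment(d for seg in segments for d, bit in zip(seg.docs, seg.live) if bit)


segments = [Segment("ABCD"), Segment("EFGH"), Segment("IJKL")]  # assumed grouping
update(segments, ["K", "L"])
live_bits = [seg.live for seg in segments]
merged = merge(segments[2:])  # merge the third and fourth segments
```

After the update, the third segment's live bits are `[1, 1, 0, 0]`; after the merge, the new segment holds I, J, K and L with all bits set.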
  11. Term Infos and Term Frequencies (diagram). Term Infos lists the terms in sorted order (score:0, score:1, score:5, ..., tag:java, tag:mysql, tag:css, ..., type:answer, type:question). Each term points to a delta-encoded list of document IDs, for example 5 +2, 6 +52 +1 and 1 +30 +27, with Term Frequencies stored alongside.
  12. Term Info Index (continued). A sparse Term Info Index (score:0, ..., tag:mysql, ..., type:question) points into Term Infos.
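
The "+n" notation in these lists is delta encoding: the first document ID is stored as is, and every following entry is the gap to the previous ID. A minimal Python sketch, with hypothetical helper names `delta_encode` and `delta_decode`:

```python
from itertools import accumulate


def delta_encode(doc_ids):
    """Turn ascending doc IDs into a first ID followed by gaps."""
    return [b - a for a, b in zip([0] + doc_ids, doc_ids)]


def delta_decode(gaps):
    """Rebuild the doc IDs as running sums of the gaps."""
    return list(accumulate(gaps))


encoded = delta_encode([6, 58, 59])   # the list written as 6 +52 +1
decoded = delta_decode([5, 2])        # the list written as 5 +2
```

Gaps are usually much smaller than the IDs themselves, which is what makes them cheap to store in a variable-length integer format.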
  13. Analyzer, index time (diagram). Input: <strong>There are no pointers in Java!</strong> Char filter output: There are no pointers in Java! Tokenizer output: There, are, no, pointers, in, Java.
  14. Analyzer, index time (continued). Filter output: ?, ?, ?, pointer, ?, java.
  15. Analyzer (continued). A Query time column is added next to Index time.
  16. Analyzer, query time (continued). Input: pointers in Java.
  17. Analyzer, query time (continued). Char filter output: pointers in Java.
  18. Analyzer, query time (continued). Tokenizer output: pointers, in, Java.
  19. Analyzer, query time (continued). Filter output: pointer, ?, java.
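
The three analyzer stages can be sketched as three small Python functions. Several things here are assumptions: the stop-word list, the crude plural stemming rule, and the reading of the "?" marks as positions left empty by removed stop words. The function names are hypothetical and do not correspond to Lucene classes.

```python
import re

STOP_WORDS = {"there", "are", "no", "in"}  # assumed stop list for this example


def char_filter(text):
    """Strip HTML tags before tokenizing."""
    return re.sub(r"<[^>]+>", "", text)


def tokenize(text):
    """Split into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Lowercase, blank out stop words (keeping positions), strip a plural 's'."""
    out = []
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:
            out.append(None)  # assumed meaning of "?" on the slides
        elif token.endswith("s") and len(token) > 3:
            out.append(token[:-1])  # crude stemming, for illustration only
        else:
            out.append(token)
    return out


def analyze(text):
    return token_filter(tokenize(char_filter(text)))


index_terms = analyze("<strong>There are no pointers in Java!</strong>")
query_terms = analyze("pointers in Java")
```

Because the same chain runs at index time and at query time, the query terms `pointer` and `java` line up with the terms stored in the index.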
  20. Algorithm (diagram). Index: tag:java = 5 +2; tag:mysql = 6 +52 +1; tag:css = 1 +30 +27 +2. Query Result: 7 31 58 59.
  21. Algorithm (continued). Decoded document IDs: tag:java = 5 7; tag:mysql = 6 58 59; tag:css = 1 31 58 60.
  22. Algorithm (continued). Intersection with the Query Result: tag:java = 7; tag:mysql = 58 59; tag:css = 31 58.
  23. Algorithm (continued). Facet counts: tag:java = 1; tag:mysql = 2; tag:css = 2.
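
The numbers on these slides can be reproduced with a short Python sketch: decode each term's delta-encoded list, intersect it with the query result, and count what is left. The two-pointer `intersect` helper is a common way to intersect sorted lists and is used here as an assumption; the slides do not say which intersection strategy Lucene or Solr applies.

```python
from itertools import accumulate

# Delta-encoded postings and the query result, as given on the slides.
index = {
    "tag:java": [5, 2],
    "tag:mysql": [6, 52, 1],
    "tag:css": [1, 30, 27, 2],
}
query_result = [7, 31, 58, 59]


def intersect(a, b):
    """Two-pointer intersection of two ascending doc ID lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


matches = {
    term: intersect(list(accumulate(gaps)), query_result)
    for term, gaps in index.items()
}
facets = {term: len(doc_ids) for term, doc_ids in matches.items()}
```

The resulting `facets` dictionary holds 1 for tag:java and 2 each for tag:mysql and tag:css, matching the Facet row.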
  24. Levenshtein Distance. html vs. htmm: Levenshtein distance = 1. hlmz vs. html: Levenshtein distance = 2. Terms: tag:php, tag:jquery, tag:json, tag:java, tag:c#, tag:apache, tag:osx, tag:html.
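
The two distances on this slide can be checked with the textbook dynamic-programming algorithm. This is a generic sketch of Levenshtein distance, not the code Lucene uses, and comparing the query against every term one by one is shown only as the naive baseline.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


tags = ["php", "jquery", "json", "java", "c#", "apache", "osx", "html"]
fuzzy_hits = [t for t in tags if levenshtein("htmm", t) <= 1]
```

With a maximum distance of 1, the misspelled query `htmm` matches only `html` among the listed tags.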
  25. Levenshtein Automaton (diagram). Automaton for html with Levenshtein distance = 1; its transitions are labeled with the letters h, t, m and l.
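
A Levenshtein automaton accepts exactly the words within a given edit distance of a pattern, and it reads each candidate one character at a time, so it can give up as soon as no match is possible. The sketch below simulates the nondeterministic form of such an automaton by tracking a set of (pattern position, edits used) states. It is an illustration under that assumption: the class and method names are hypothetical, and a production implementation would typically precompile a deterministic automaton instead.

```python
class LevenshteinAutomaton:
    """Simulated NFA: maps pattern position -> fewest edits used to reach it."""

    def __init__(self, pattern, max_edits):
        self.pattern = pattern
        self.max_edits = max_edits

    def _close(self, states):
        """Add states reached by skipping pattern characters (deletions)."""
        closed = dict(states)
        for pos in range(len(self.pattern)):
            if pos in closed and closed[pos] < self.max_edits:
                cost = closed[pos] + 1
                if cost < closed.get(pos + 1, self.max_edits + 1):
                    closed[pos + 1] = cost
        return closed

    def start(self):
        return self._close({0: 0})

    def step(self, states, char):
        """Consume one input character and return the new state set."""
        nxt = {}

        def offer(pos, cost):
            if cost <= self.max_edits and cost < nxt.get(pos, self.max_edits + 1):
                nxt[pos] = cost

        for pos, cost in states.items():
            offer(pos, cost + 1)  # insertion: extra character in the input
            if pos < len(self.pattern):
                same = self.pattern[pos] == char
                offer(pos + 1, cost + (0 if same else 1))  # match or substitution
        return self._close(nxt)

    def accepts(self, word):
        states = self.start()
        for char in word:
            states = self.step(states, char)
            if not states:  # dead end: no continuation can match
                return False
        return len(self.pattern) in states


automaton = LevenshteinAutomaton("html", 1)
results = {w: automaton.accepts(w) for w in ["html", "htmm", "htm", "hlmz", "java"]}
```

For `java` the state set is empty after two characters, so the word is rejected before it is fully read.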
  26. Solr is... • enterprise-level search engine • vertically scalable • horizontally scalable, but... • tunable • poorly documented • with an active community