A Tour of the Bleve

Marty Schoch
November 09, 2015

Nearly two years ago we set out to build Bleve, an open-source full-text search library for Go. Since then we've worked hard to build not just the code, but also the community around the project. We'll start with a brief introduction to the capabilities of the library. Then we'll take a tour of the unique ways the community has shaped the project. These vignettes will guide our technical dive into select parts of the Bleve project.


Transcript

  1. dotGo November 9, 2015 A Tour of the Bleve Marty Schoch @mschoch
  2. None
  3. full text search golang

  4. full-text searching gloang

  5. French Text Analysis
    Input: "Un article de Wikipédia, l'encyclopédie libre"
    Tokenization: Un article de Wikipédia l'encyclopédie libre
    Lowercase: un article de wikipédia l'encyclopédie libre
    Article Elision: un article de wikipédia encyclopédie libre
    Stop Words: article wikipédia encyclopédie libre
    Stemming: articl wikipedia encycloped libr
  6. Full-Text Search
    Query: encyclopedies
    Document: { "body": "Un article de Wikipédia, l'encyclopédie libre." }
    Index Terms: encycloped
  7. Install Bleve $ go get github.com/blevesearch/bleve/...

  8. Import Packages
    import (
        "github.com/blevesearch/bleve"
        "github.com/blevesearch/bleve/analysis/language/fr"
    )

  9. Index Mapping
    mapping := bleve.NewIndexMapping()
    mapping.DefaultAnalyzer = fr.AnalyzerName

  10. New Index
    index, err := bleve.New("wiki.bleve", mapping)
    if err != nil {
        log.Fatal(err)
    }
  11. Index Data
    article := Article{
        ID:   "Wikipedia",
        Body: "Un article de Wikipédia, l'encyclopédie libre.",
    }
    err = index.Index(article.ID, article)
    if err != nil {
        log.Fatal(err)
    }

  12. Query/Request
    query := bleve.NewMatchQuery("encyclopedies")
    request := bleve.NewSearchRequest(query)

  13. Search
    results, err := index.Search(request)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(results)
  14. Search
    package main

    import (
        "fmt"
        "log"

        "github.com/blevesearch/bleve"
        "github.com/blevesearch/bleve/analysis/language/fr"
    )

    type Article struct {
        ID   string
        Body string
    }

    func main() {
        mapping := bleve.NewIndexMapping()
        mapping.DefaultAnalyzer = fr.AnalyzerName

        index, err := bleve.New("wiki.bleve", mapping)
        if err != nil {
            log.Fatal(err)
        }

        article := Article{
            ID:   "Wikipedia",
            Body: "Un article de Wikipédia, l'encyclopédie libre.",
        }
        err = index.Index(article.ID, article)
        if err != nil {
            log.Fatal(err)
        }

        query := bleve.NewMatchQuery("encyclopedies")
        request := bleve.NewSearchRequest(query)
        results, err := index.Search(request)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(results)
    }

    Output:
    1 matches, showing 1 through 1, took 66.466µs
        1. Wikipedia (0.137229)
    Program exited.
  15. Community

  16. BoltDB Storage Adapter Conrad Pankoff

  17. Extending Bleve through Go Interfaces
    Bleve: Analysis, Index, Search, KV Storage, Disk
  18. KVStore
    type KVStore interface {
        Reader() (KVReader, error)
        Writer() (KVWriter, error)
        Close() error
    }
  19. Reader
    type KVReader interface {
        Get(key []byte) ([]byte, error)
        PrefixIterator(prefix []byte) KVIterator
        RangeIterator(start, end []byte) KVIterator
        Close() error
    }
    Note: Readers must provide a consistent view isolated from concurrent writes.
  20. Writer
    type KVWriter interface {
        NewBatch() KVBatch
        ExecuteBatch(batch KVBatch) error
        Close() error
    }
  21. Batch
    type KVBatch interface {
        Set(key, val []byte)
        Delete(key []byte)
        Merge(key, val []byte)
        Reset()
    }
    Note: Batches must be atomic, readers see all or none of the changes.
  22. Bleve KV Store Landscape (before)
    ❖ leveldb (cgo)
    ❖ inmem (no disk persistence)
  23. No pure Go storage???

  24. Bleve KV Store Landscape (today)
    ❖ leveldb (cgo)
    ❖ gtreap (no disk persistence)
    ❖ cznicb (no disk persistence)
    ❖ boltdb
    ❖ goleveldb
    ❖ rocksdb (cgo)
    ❖ forestdb (cgo)
  25. FOSDEM 2015 Go Developer Room

  26. Arabic Text Analysis سلمان الجمّاز

  27. Text Analysis Interfaces
    Bleve: Analysis, Index, Search
    Analysis: Tokenizers, Token Filters

  28. Tokenizer
    type Tokenizer interface {
        Tokenize([]byte) TokenStream
    }
    Example: Split flat []byte into discrete words.
  29. Token Filter
    type TokenFilter interface {
        Filter(TokenStream) TokenStream
    }
    Example: Any transformation on the tokens, including removal.
  30. Analyzer
    type Analyzer struct {
        Tokenizer    Tokenizer
        TokenFilters []TokenFilter
    }

  31. Arabic Analyzer
    analysis.Analyzer{
        Tokenizer: unicodeTokenizer,
        TokenFilters: []analysis.TokenFilter{
            toLowerFilter,
            normalizeFilter,
            stopArFilter,
            normalizeArFilter,
            stemmerArFilter,
        },
    }
  32. Arabic Stemmer
    func stem(input []byte) []byte {
        runes := bytes.Runes(input)
        // Strip at most one prefix: stop at the first match.
        for _, p := range prefixes {
            if canStemPrefix(runes, p) {
                runes = runes[len(p):]
                break
            }
        }
        // Strip every suffix that still applies, in list order.
        for _, s := range suffixes {
            if canStemSuffix(runes, s) {
                runes = runes[:len(runes)-len(s)]
            }
        }
        return analysis.BuildTermFromRunes(runes)
    }

    var prefixes = [][]rune{
        []rune("ال"),
        []rune("وال"),
        []rune("بال"),
        []rune("كال"),
        []rune("فال"),
        []rune("لل"),
        []rune("و"),
    }

    var suffixes = [][]rune{
        []rune("ها"),
        []rune("ان"),
        []rune("ات"),
        []rune("ون"),
        []rune("ين"),
        []rune("يه"),
        []rune("ية"),
        []rune("ه"),
        []rune("ة"),
        []rune("ي"),
    }
  33. Bleve Languages ❖ Arabic ❖ CJK ❖ Danish ❖ Dutch ❖ English ❖ Finnish ❖ French ❖ German ❖ Hindi ❖ Hungarian ❖ Italian ❖ Japanese ❖ Norwegian ❖ Persian ❖ Portuguese ❖ Romanian ❖ Russian ❖ Sorani ❖ Spanish ❖ Swedish ❖ Thai ❖ Turkish
  34. A Failing Test Case
    $ go test -v -run=ArabicAnalyzer
    === RUN TestArabicAnalyzer
    --- FAIL: TestArabicAnalyzer (0.00s)
        analyzer_ar_test.go:175: expected [Start: 0 End: 16 Position: 1 Token: امريك Type: 0], got [Start: 0 End: 16 Position: 1 Token: امريكي Type: 0]
        analyzer_ar_test.go:176: expected d8 a7 d9 85 d8 b1 d9 8a d9 83, got d8 a7 d9 85 d8 b1 d9 8a d9 83 d9 8a
    FAIL
    exit status 1
    FAIL github.com/blevesearch/bleve/analysis/language/ar 0.007s
  35. A Closer Look
    expected Token: امريك
    got Token: امريكي

  36. Sanity Check (hexdump Go source file)
    00000680 2f 2f 20 70 6c 75 72 61 6c 20 2d 69 6e 0a 09 09 |// plural -in...|
    00000690 7b 0a 09 09 09 69 6e 70 75 74 3a 20 5b 5d 62 79 |{....input: []by|
    000006a0 74 65 28 22 d8 a3 d9 85 d8 b1 d9 8a d9 83 d9 8a |te("............|
    000006b0 d9 8a d9 86 22 29 2c 0a 09 09 09 6f 75 74 70 75 |...."),....outpu|
    000006c0 74 3a 20 61 6e 61 6c 79 73 69 73 2e 54 6f 6b 65 |t: analysis.Toke|
    000006d0 6e 53 74 72 65 61 6d 7b 0a 09 09 09 09 26 61 6e |nStream{.....&an|
    000006e0 61 6c 79 73 69 73 2e 54 6f 6b 65 6e 7b 0a 09 09 |alysis.Token{...|
    000006f0 09 09 09 54 65 72 6d 3a 20 20 20 20 20 5b 5d 62 |...Term:     []b|
    00000700 79 74 65 28 22 d8 a7 d9 85 d8 b1 d9 8a d9 83 22 |yte(".........."|
    00000710 29 2c 0a 09 09 09 09 09 50 6f 73 69 74 69 6f 6e |),......Position|
  37. Sanity Check (hexdump Go test output)
    00000050 65 73 74 2e 67 6f 3a 31 37 35 3a 20 65 78 70 65 |est.go:175: expe|
    00000060 63 74 65 64 20 5b 53 74 61 72 74 3a 20 30 20 20 |cted [Start: 0  |
    00000070 45 6e 64 3a 20 31 36 20 20 50 6f 73 69 74 69 6f |End: 16  Positio|
    00000080 6e 3a 20 31 20 20 54 6f 6b 65 6e 3a 20 d8 a7 d9 |n: 1  Token: ...|
    00000090 85 d8 b1 d9 8a d9 83 20 20 54 79 70 65 3a 20 30 |.......  Type: 0|
    expected [Start: 0 End: 16 Position: 1 Token: امريك
  38. Mixing LTR and RTL Text

  39. GopherCon Denver 2015 Lightning Talk

  40. Hugo Site Integration
    Hugo, a fast and flexible static site generator built with love by spf13 and friends in Go
  41. Search extension for Caddy Pedro Nasser

  42. Caddy Integration
    ❖ Caddy is an alternative web server that is easy to configure and use.
    ❖ Search add-on activates a site search engine that includes a search page and JSON API
    ❖ HTML, Markdown, and .txt files are easily indexed automatically
  43. Global Community

  44. dotGo 2015

  45. blevesearch.com @blevesearch
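
The analysis pipeline from slide 5 and the Tokenizer, TokenFilter, and Analyzer shapes from slides 28 to 30 can be sketched with the standard library alone. This is a simplified illustration, not Bleve's implementation: the token stream here is a plain slice of strings, the Analyze method, the filter types, the word lists, and the analyzeToString helper are all hypothetical, and the stemming step is left out.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// TokenStream is a simplified stand-in: plain strings instead of token structs.
type TokenStream []string

// Tokenizer and TokenFilter mirror the shapes shown on slides 28 and 29.
type Tokenizer interface {
	Tokenize(input []byte) TokenStream
}

type TokenFilter interface {
	Filter(TokenStream) TokenStream
}

// Analyzer pairs one tokenizer with an ordered list of filters (slide 30).
type Analyzer struct {
	Tokenizer    Tokenizer
	TokenFilters []TokenFilter
}

// Analyze is a hypothetical helper: tokenize, then apply each filter in order.
func (a Analyzer) Analyze(input []byte) TokenStream {
	tokens := a.Tokenizer.Tokenize(input)
	for _, f := range a.TokenFilters {
		tokens = f.Filter(tokens)
	}
	return tokens
}

// wordTokenizer splits on every rune that is not a letter, digit, or apostrophe.
type wordTokenizer struct{}

func (wordTokenizer) Tokenize(input []byte) TokenStream {
	return strings.FieldsFunc(string(input), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r) && r != '\''
	})
}

// lowercaseFilter lowercases every term.
type lowercaseFilter struct{}

func (lowercaseFilter) Filter(in TokenStream) TokenStream {
	out := make(TokenStream, len(in))
	for i, t := range in {
		out[i] = strings.ToLower(t)
	}
	return out
}

// elisionFilter drops a leading elided article, such as the "l'" in "l'encyclopédie".
type elisionFilter struct{ articles map[string]bool }

func (f elisionFilter) Filter(in TokenStream) TokenStream {
	out := make(TokenStream, len(in))
	for i, t := range in {
		if j := strings.IndexByte(t, '\''); j > 0 && f.articles[t[:j]] {
			t = t[j+1:]
		}
		out[i] = t
	}
	return out
}

// stopFilter removes terms found in its stop-word set.
type stopFilter struct{ stop map[string]bool }

func (f stopFilter) Filter(in TokenStream) TokenStream {
	out := make(TokenStream, 0, len(in))
	for _, t := range in {
		if !f.stop[t] {
			out = append(out, t)
		}
	}
	return out
}

// analyzeToString runs the sketch pipeline and joins the surviving terms.
// The article and stop-word sets are tiny illustrative samples.
func analyzeToString(text string) string {
	a := Analyzer{
		Tokenizer: wordTokenizer{},
		TokenFilters: []TokenFilter{
			lowercaseFilter{},
			elisionFilter{articles: map[string]bool{"l": true, "d": true, "j": true, "qu": true}},
			stopFilter{stop: map[string]bool{"un": true, "une": true, "de": true, "le": true, "la": true}},
		},
	}
	return strings.Join(a.Analyze([]byte(text)), " ")
}

func main() {
	// Prints: article wikipédia encyclopédie libre
	fmt.Println(analyzeToString("Un article de Wikipédia, l'encyclopédie libre."))
}
```

The output matches the Stop Words row of slide 5. Adding a stemming filter as one more TokenFilter at the end of the list would be the natural next step, which is the same extension point the Arabic analyzer on slide 31 uses.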