A Tour of the Bleve

Marty Schoch
November 09, 2015

Nearly two years ago we set out to build Bleve, an open-source full-text search library for Go. Since then we've worked hard to build not just the code, but also the community around the project. We'll start with a brief introduction to the capabilities of the library. Then we'll take a tour of the unique ways the community has shaped the project. These vignettes will guide our technical dive into select parts of the Bleve project.

Transcript

  1. French Text Analysis

    Input: "Un article de Wikipédia, l'encyclopédie libre"
    Tokenization: Un article de Wikipédia l'encyclopédie libre
    Lowercase: un article de wikipédia l'encyclopédie libre
    Article Elision: un article de wikipédia encyclopédie libre
    Stop Words: article wikipédia encyclopédie libre
    Stemming: articl wikipedia encycloped libr
  2. Index Data

    article := Article{
        ID:   "Wikipedia",
        Body: "Un article de Wikipédia, l'encyclopédie libre.",
    }
    err = index.Index(article.ID, article)
    if err != nil {
        log.Fatal(err)
    }
  3. Search

    results, err := index.Search(request)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(results)
  4. Search

    package main

    import (
        "fmt"
        "log"

        "github.com/blevesearch/bleve"
        "github.com/blevesearch/bleve/analysis/language/fr"
    )

    type Article struct {
        ID   string
        Body string
    }

    func main() {
        mapping := bleve.NewIndexMapping()
        mapping.DefaultAnalyzer = fr.AnalyzerName
        index, err := bleve.New("wiki.bleve", mapping)
        if err != nil {
            log.Fatal(err)
        }
        article := Article{
            ID:   "Wikipedia",
            Body: "Un article de Wikipédia, l'encyclopédie libre.",
        }
        err = index.Index(article.ID, article)
        if err != nil {
            log.Fatal(err)
        }
        query := bleve.NewMatchQuery("encyclopedies")
        request := bleve.NewSearchRequest(query)
        results, err := index.Search(request)
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(results)
    }

    Output:
    1 matches, showing 1 through 1, took 66.466µs
        1. Wikipedia (0.137229)
    Program exited.
  5. Reader

    type KVReader interface {
        Get(key []byte) ([]byte, error)
        PrefixIterator(prefix []byte) KVIterator
        RangeIterator(start, end []byte) KVIterator
        Close() error
    }

    Note: Readers must provide a consistent view isolated from concurrent writes.
  6. Batch

    type KVBatch interface {
        Set(key, val []byte)
        Delete(key []byte)
        Merge(key, val []byte)
        Reset()
    }

    Note: Batches must be atomic, readers see all or none of the changes.
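The two notes on the Reader and Batch slides (isolated readers, atomic batches) can be illustrated with a copy-on-write map. This is a sketch under stated assumptions, not one of Bleve's store implementations: `memStore`, `batch`, `reader`, and `Apply` are hypothetical names, only `Get`, `Set`, `Delete`, and `Reset` are modeled, and iterators and `Merge` are omitted.

```go
package main

import (
	"fmt"
	"sync"
)

// memStore is a hypothetical copy-on-write in-memory store (not part of Bleve).
type memStore struct {
	mu   sync.Mutex
	data map[string][]byte // never mutated after it has been published
}

// op is one pending change inside a batch.
type op struct {
	key string
	val []byte
	del bool
}

// batch collects operations; nothing is visible to readers until Apply.
type batch struct{ ops []op }

func (b *batch) Set(key, val []byte) {
	b.ops = append(b.ops, op{key: string(key), val: append([]byte(nil), val...)})
}
func (b *batch) Delete(key []byte) { b.ops = append(b.ops, op{key: string(key), del: true}) }
func (b *batch) Reset()            { b.ops = b.ops[:0] }

// reader keeps the map that was current when it was opened.
type reader struct{ snap map[string][]byte }

func (r *reader) Get(key []byte) ([]byte, error) { return r.snap[string(key)], nil }
func (r *reader) Close() error                   { return nil }

func newMemStore() *memStore { return &memStore{data: map[string][]byte{}} }

// Reader returns a snapshot that later batches cannot change.
func (s *memStore) Reader() *reader {
	s.mu.Lock()
	defer s.mu.Unlock()
	return &reader{snap: s.data}
}

// Apply copies the current map, applies every op to the copy, and publishes
// the copy in a single step, so readers see all of the batch or none of it.
func (s *memStore) Apply(b *batch) {
	s.mu.Lock()
	defer s.mu.Unlock()
	next := make(map[string][]byte, len(s.data))
	for k, v := range s.data {
		next[k] = v
	}
	for _, o := range b.ops {
		if o.del {
			delete(next, o.key)
		} else {
			next[o.key] = o.val
		}
	}
	s.data = next
}

func main() {
	s := newMemStore()
	before := s.Reader()

	b := &batch{}
	b.Set([]byte("a"), []byte("1"))
	b.Set([]byte("b"), []byte("2"))
	s.Apply(b)

	after := s.Reader()
	v1, _ := before.Get([]byte("a"))
	v2, _ := after.Get([]byte("a"))
	fmt.Println(v1 == nil, string(v2)) // prints: true 1
}
```

Copying the whole map on every batch is too slow for a real store; the point is only that publishing a fully built replacement is one simple way to satisfy both notes at once.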
  7. Bleve KV Store Landscape (today)

    ❖ leveldb (cgo)
    ❖ gtreap (no disk persistence)
    ❖ cznicb (no disk persistence)
    ❖ boltdb
    ❖ goleveldb
    ❖ rocksdb (cgo)
    ❖ forestdb (cgo)
  8. Arabic Stemmer

    func stem(input []byte) []byte {
        runes := bytes.Runes(input)
        // Strip at most one prefix.
        for _, p := range prefixes {
            if canStemPrefix(runes, p) {
                runes = runes[len(p):]
                break
            }
        }
        // Strip every suffix that applies, in list order.
        for _, s := range suffixes {
            if canStemSuffix(runes, s) {
                runes = runes[:len(runes)-len(s)]
            }
        }
        return analysis.BuildTermFromRunes(runes)
    }

    var prefixes = [][]rune{
        []rune("ال"),
        []rune("وال"),
        []rune("بال"),
        []rune("كال"),
        []rune("فال"),
        []rune("لل"),
        []rune("و"),
    }

    var suffixes = [][]rune{
        []rune("ها"),
        []rune("ان"),
        []rune("ات"),
        []rune("ون"),
        []rune("ين"),
        []rune("يه"),
        []rune("ية"),
        []rune("ه"),
        []rune("ة"),
        []rune("ي"),
    }
  9. Bleve Languages

    ❖ Arabic ❖ CJK ❖ Danish ❖ Dutch ❖ English ❖ Finnish ❖ French ❖ German ❖ Hindi ❖ Hungarian ❖ Italian ❖ Japanese ❖ Norwegian ❖ Persian ❖ Portuguese ❖ Romanian ❖ Russian ❖ Sorani ❖ Spanish ❖ Swedish ❖ Thai ❖ Turkish
  10. A Failing Test Case

    $ go test -v -run=ArabicAnalyzer
    === RUN TestArabicAnalyzer
    --- FAIL: TestArabicAnalyzer (0.00s)
        analyzer_ar_test.go:175: expected [Start: 0 End: 16 Position: 1 Token: امريك Type: 0], got [Start: 0 End: 16 Position: 1 Token: امريكي Type: 0]
        analyzer_ar_test.go:176: expected d8 a7 d9 85 d8 b1 d9 8a d9 83, got d8 a7 d9 85 d8 b1 d9 8a d9 83 d9 8a
    FAIL
    exit status 1
    FAIL github.com/blevesearch/bleve/analysis/language/ar 0.007s
  11. Sanity Check (hexdump Go source file)

    00000680 2f 2f 20 70 6c 75 72 61 6c 20 2d 69 6e 0a 09 09 |// plural -in...|
    00000690 7b 0a 09 09 09 69 6e 70 75 74 3a 20 5b 5d 62 79 |{....input: []by|
    000006a0 74 65 28 22 d8 a3 d9 85 d8 b1 d9 8a d9 83 d9 8a |te("............|
    000006b0 d9 8a d9 86 22 29 2c 0a 09 09 09 6f 75 74 70 75 |...."),....outpu|
    000006c0 74 3a 20 61 6e 61 6c 79 73 69 73 2e 54 6f 6b 65 |t: analysis.Toke|
    000006d0 6e 53 74 72 65 61 6d 7b 0a 09 09 09 09 26 61 6e |nStream{.....&an|
    000006e0 61 6c 79 73 69 73 2e 54 6f 6b 65 6e 7b 0a 09 09 |alysis.Token{...|
    000006f0 09 09 09 54 65 72 6d 3a 20 20 20 20 20 5b 5d 62 |...Term:     []b|
    00000700 79 74 65 28 22 d8 a7 d9 85 d8 b1 d9 8a d9 83 22 |yte(".........."|
    00000710 29 2c 0a 09 09 09 09 09 50 6f 73 69 74 69 6f 6e |),......Position|
  12. Sanity Check (hexdump Go test output)

    00000050 65 73 74 2e 67 6f 3a 31 37 35 3a 20 65 78 70 65 |est.go:175: expe|
    00000060 63 74 65 64 20 5b 53 74 61 72 74 3a 20 30 20 20 |cted [Start: 0  |
    00000070 45 6e 64 3a 20 31 36 20 20 50 6f 73 69 74 69 6f |End: 16  Positio|
    00000080 6e 3a 20 31 20 20 54 6f 6b 65 6e 3a 20 d8 a7 d9 |n: 1  Token: ...|
    00000090 85 d8 b1 d9 8a d9 83 20 20 54 79 70 65 3a 20 30 |.......  Type: 0|

    expected [Start: 0 End: 16 Position: 1 Token: امريك
  13. Hugo Site Integration

    Hugo, a fast and flexible static site generator built with love by spf13 and friends in Go
  14. Caddy Integration

    ❖ Caddy is an alternative web server that is easy to configure and use.
    ❖ Search add-on activates a site search engine that includes a search page and JSON API
    ❖ HTML, Markdown, and .txt files are easily indexed automatically