Slide 1

Slide 1 text

dotGo, November 9, 2015
A Tour of the Bleve
Marty Schoch (@mschoch)

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

full text search golang

Slide 4

Slide 4 text

full-text searching gloang

Slide 5

Slide 5 text

French Text Analysis
Input: "Un article de Wikipédia, l'encyclopédie libre"
Tokenization: Un article de Wikipédia l'encyclopédie libre
Lowercase: un article de wikipédia l'encyclopédie libre
Article Elision: un article de wikipédia encyclopédie libre
Stop Words: article wikipédia encyclopédie libre
Stemming: articl wikipedia encycloped libr

Slide 6

Slide 6 text

Full-Text Search
Query: encyclopedies
Document: { "body": "Un article de Wikipédia, l'encyclopédie libre." }
Index Terms: encycloped

Slide 7

Slide 7 text

Install Bleve
$ go get github.com/blevesearch/bleve/...

Slide 8

Slide 8 text

Import Packages
import (
	"github.com/blevesearch/bleve"
	"github.com/blevesearch/bleve/analysis/language/fr"
)

Slide 9

Slide 9 text

Index Mapping
mapping := bleve.NewIndexMapping()
mapping.DefaultAnalyzer = fr.AnalyzerName

Slide 10

Slide 10 text

New Index
index, err := bleve.New("wiki.bleve", mapping)
if err != nil {
	log.Fatal(err)
}

Slide 11

Slide 11 text

Index Data
article := Article{
	ID:   "Wikipedia",
	Body: "Un article de Wikipédia, l'encyclopédie libre.",
}
err = index.Index(article.ID, article)
if err != nil {
	log.Fatal(err)
}

Slide 12

Slide 12 text

Query/Request
query := bleve.NewMatchQuery("encyclopedies")
request := bleve.NewSearchRequest(query)

Slide 13

Slide 13 text

Search
results, err := index.Search(request)
if err != nil {
	log.Fatal(err)
}
fmt.Println(results)

Slide 14

Slide 14 text

Search
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve"
	"github.com/blevesearch/bleve/analysis/language/fr"
)

type Article struct {
	ID   string
	Body string
}

func main() {
	mapping := bleve.NewIndexMapping()
	mapping.DefaultAnalyzer = fr.AnalyzerName

	index, err := bleve.New("wiki.bleve", mapping)
	if err != nil {
		log.Fatal(err)
	}

	article := Article{
		ID:   "Wikipedia",
		Body: "Un article de Wikipédia, l'encyclopédie libre.",
	}
	err = index.Index(article.ID, article)
	if err != nil {
		log.Fatal(err)
	}

	query := bleve.NewMatchQuery("encyclopedies")
	request := bleve.NewSearchRequest(query)
	results, err := index.Search(request)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(results)
}

Output:
1 matches, showing 1 through 1, took 66.466µs
    1. Wikipedia (0.137229)
Program exited.

Slide 15

Slide 15 text

Community

Slide 16

Slide 16 text

BoltDB Storage Adapter
Conrad Pankoff

Slide 17

Slide 17 text

Extending Bleve through Go Interfaces
[Diagram labels: Bleve, Analysis, Index, Search, KV Storage, Disk]

Slide 18

Slide 18 text

KVStore
type KVStore interface {
	Reader() (KVReader, error)
	Writer() (KVWriter, error)
	Close() error
}

Slide 19

Slide 19 text

Reader
type KVReader interface {
	Get(key []byte) ([]byte, error)
	PrefixIterator(prefix []byte) KVIterator
	RangeIterator(start, end []byte) KVIterator
	Close() error
}
Note: Readers must provide a consistent view isolated from concurrent writes.

Slide 20

Slide 20 text

Writer
type KVWriter interface {
	NewBatch() KVBatch
	ExecuteBatch(batch KVBatch) error
	Close() error
}

Slide 21

Slide 21 text

Batch
type KVBatch interface {
	Set(key, val []byte)
	Delete(key []byte)
	Merge(key, val []byte)
	Reset()
}
Note: Batches must be atomic, readers see all or none of the changes.
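The two notes above (isolated readers, atomic batches) can be illustrated with a toy in-memory store. This is a simplified sketch, not one of Bleve's adapters: the `Store`, `Batch`, and `Snapshot` types are hypothetical, iterators and `Merge` are omitted, and isolation is achieved by the naive approach of copying the map for each reader.

```go
package main

import (
	"fmt"
	"sync"
)

// op is one buffered mutation; a nil val marks a delete.
type op struct {
	key string
	val []byte
}

// Batch buffers mutations until the store applies them together.
type Batch struct{ ops []op }

func (b *Batch) Set(key, val []byte) {
	b.ops = append(b.ops, op{string(key), append([]byte{}, val...)})
}
func (b *Batch) Delete(key []byte) { b.ops = append(b.ops, op{string(key), nil}) }
func (b *Batch) Reset()            { b.ops = b.ops[:0] }

// Store is a toy map-backed store guarded by a mutex.
type Store struct {
	mu   sync.RWMutex
	data map[string][]byte
}

func NewStore() *Store { return &Store{data: map[string][]byte{}} }

// ExecuteBatch applies every operation under one write lock, so a reader
// created before or after sees none or all of the batch.
func (s *Store) ExecuteBatch(b *Batch) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, o := range b.ops {
		if o.val == nil {
			delete(s.data, o.key)
		} else {
			s.data[o.key] = o.val
		}
	}
	return nil
}

// Snapshot is a reader over a private copy of the key space.
type Snapshot struct{ data map[string][]byte }

// Reader copies the map under a read lock, isolating it from later writes.
func (s *Store) Reader() *Snapshot {
	s.mu.RLock()
	defer s.mu.RUnlock()
	cp := make(map[string][]byte, len(s.data))
	for k, v := range s.data {
		cp[k] = v
	}
	return &Snapshot{data: cp}
}

// Get returns nil when the key is absent.
func (r *Snapshot) Get(key []byte) ([]byte, error) { return r.data[string(key)], nil }

func main() {
	s := NewStore()
	b := &Batch{}
	b.Set([]byte("a"), []byte("1"))
	b.Set([]byte("b"), []byte("2"))

	before := s.Reader() // taken before the batch is applied
	_ = s.ExecuteBatch(b)
	after := s.Reader()

	v1, _ := before.Get([]byte("a"))
	v2, _ := after.Get([]byte("a"))
	fmt.Printf("before=%q after=%q\n", v1, v2)
}
```

Copying the whole map per reader is only reasonable for a demonstration; a persistent store would typically rely on its own snapshot or transaction mechanism to give the same guarantee.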

Slide 22

Slide 22 text

Bleve KV Store Landscape (before)
❖ leveldb (cgo)
❖ inmem (no disk persistence)

Slide 23

Slide 23 text

No pure Go storage???

Slide 24

Slide 24 text

Bleve KV Store Landscape (today)
❖ leveldb (cgo)
❖ gtreap (no disk persistence)
❖ cznicb (no disk persistence)
❖ boltdb
❖ goleveldb
❖ rocksdb (cgo)
❖ forestdb (cgo)

Slide 25

Slide 25 text

FOSDEM 2015 Go Developer Room

Slide 26

Slide 26 text

Arabic Text Analysis
سلمان الجمّاز

Slide 27

Slide 27 text

Text Analysis Interfaces
[Diagram labels: Bleve, Analysis, Index, Search, Tokenizers, Token Filters]

Slide 28

Slide 28 text

Tokenizer
type Tokenizer interface {
	Tokenize([]byte) TokenStream
}
Example: Split flat []byte into discrete words.

Slide 29

Slide 29 text

Token Filter
type TokenFilter interface {
	Filter(TokenStream) TokenStream
}
Example: Any transformation on the tokens, including removal.

Slide 30

Slide 30 text

Analyzer
type Analyzer struct {
	Tokenizer    Tokenizer
	TokenFilters []TokenFilter
}

Slide 31

Slide 31 text

Arabic Analyzer
analysis.Analyzer{
	Tokenizer: unicodeTokenizer,
	TokenFilters: []analysis.TokenFilter{
		toLowerFilter,
		normalizeFilter,
		stopArFilter,
		normalizeArFilter,
		stemmerArFilter,
	},
}

Slide 32

Slide 32 text

Arabic Stemmer
func stem(input []byte) []byte {
	runes := bytes.Runes(input)
	// Strip at most one prefix.
	for _, p := range prefixes {
		if canStemPrefix(runes, p) {
			runes = runes[len(p):]
			break
		}
	}
	// Strip every matching suffix, in list order.
	for _, s := range suffixes {
		if canStemSuffix(runes, s) {
			runes = runes[:len(runes)-len(s)]
		}
	}
	return analysis.BuildTermFromRunes(runes)
}

var prefixes = [][]rune{
	[]rune("ال"),
	[]rune("وال"),
	[]rune("بال"),
	[]rune("كال"),
	[]rune("فال"),
	[]rune("لل"),
	[]rune("و"),
}

var suffixes = [][]rune{
	[]rune("ها"),
	[]rune("ان"),
	[]rune("ات"),
	[]rune("ون"),
	[]rune("ين"),
	[]rune("يه"),
	[]rune("ية"),
	[]rune("ه"),
	[]rune("ة"),
	[]rune("ي"),
}
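The slide does not show `canStemPrefix` or `canStemSuffix`. The sketch below fills them in under an assumption: a prefix or suffix is only stripped when it matches and at least two runes would remain. That minimum length is a guess for illustration, not taken from the slides, and this version returns a string instead of calling `analysis.BuildTermFromRunes` so that it runs with the standard library alone.

```go
package main

import "fmt"

var prefixes = [][]rune{
	[]rune("ال"), []rune("وال"), []rune("بال"), []rune("كال"),
	[]rune("فال"), []rune("لل"), []rune("و"),
}

var suffixes = [][]rune{
	[]rune("ها"), []rune("ان"), []rune("ات"), []rune("ون"), []rune("ين"),
	[]rune("يه"), []rune("ية"), []rune("ه"), []rune("ة"), []rune("ي"),
}

// canStemPrefix reports whether input starts with prefix and at least
// two runes would remain afterwards (assumed minimum).
func canStemPrefix(input, prefix []rune) bool {
	if len(input)-len(prefix) < 2 {
		return false
	}
	for i, r := range prefix {
		if input[i] != r {
			return false
		}
	}
	return true
}

// canStemSuffix reports whether input ends with suffix and at least
// two runes would remain afterwards (assumed minimum).
func canStemSuffix(input, suffix []rune) bool {
	offset := len(input) - len(suffix)
	if offset < 2 {
		return false
	}
	for i, r := range suffix {
		if input[offset+i] != r {
			return false
		}
	}
	return true
}

// stem strips at most one prefix, then every matching suffix in list order.
func stem(input string) string {
	runes := []rune(input)
	for _, p := range prefixes {
		if canStemPrefix(runes, p) {
			runes = runes[len(p):]
			break
		}
	}
	for _, s := range suffixes {
		if canStemSuffix(runes, s) {
			runes = runes[:len(runes)-len(s)]
		}
	}
	return string(runes)
}

func main() {
	// Strips the suffix "ين" and then "ي".
	fmt.Println(stem("امريكيين"))
	// Strips the prefix "ال".
	fmt.Println(stem("الكتاب"))
}
```

Working on runes rather than bytes matters here: each Arabic letter is two bytes in UTF-8, so slicing by byte length would cut characters in half.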

Slide 33

Slide 33 text

Bleve Languages
❖ Arabic
❖ CJK
❖ Danish
❖ Dutch
❖ English
❖ Finnish
❖ French
❖ German
❖ Hindi
❖ Hungarian
❖ Italian
❖ Japanese
❖ Norwegian
❖ Persian
❖ Portuguese
❖ Romanian
❖ Russian
❖ Sorani
❖ Spanish
❖ Swedish
❖ Thai
❖ Turkish

Slide 34

Slide 34 text

A Failing Test Case
$ go test -v -run=ArabicAnalyzer
=== RUN TestArabicAnalyzer
--- FAIL: TestArabicAnalyzer (0.00s)
	analyzer_ar_test.go:175: expected [Start: 0 End: 16 Position: 1 Token: امريك Type: 0], got [Start: 0 End: 16 Position: 1 Token: امريكي Type: 0]
	analyzer_ar_test.go:176: expected d8 a7 d9 85 d8 b1 d9 8a d9 83, got d8 a7 d9 85 d8 b1 d9 8a d9 83 d9 8a
FAIL
exit status 1
FAIL github.com/blevesearch/bleve/analysis/language/ar 0.007s

Slide 35

Slide 35 text

A Closer Look
expected Token: امريك
got Token: امريكي

Slide 36

Slide 36 text

Sanity Check (hexdump Go source file)
00000680  2f 2f 20 70 6c 75 72 61 6c 20 2d 69 6e 0a 09 09  |// plural -in...|
00000690  7b 0a 09 09 09 69 6e 70 75 74 3a 20 5b 5d 62 79  |{....input: []by|
000006a0  74 65 28 22 d8 a3 d9 85 d8 b1 d9 8a d9 83 d9 8a  |te("............|
000006b0  d9 8a d9 86 22 29 2c 0a 09 09 09 6f 75 74 70 75  |...."),....outpu|
000006c0  74 3a 20 61 6e 61 6c 79 73 69 73 2e 54 6f 6b 65  |t: analysis.Toke|
000006d0  6e 53 74 72 65 61 6d 7b 0a 09 09 09 09 26 61 6e  |nStream{.....&an|
000006e0  61 6c 79 73 69 73 2e 54 6f 6b 65 6e 7b 0a 09 09  |alysis.Token{...|
000006f0  09 09 09 54 65 72 6d 3a 20 20 20 20 20 5b 5d 62  |...Term:     []b|
00000700  79 74 65 28 22 d8 a7 d9 85 d8 b1 d9 8a d9 83 22  |yte(".........."|
00000710  29 2c 0a 09 09 09 09 09 50 6f 73 69 74 69 6f 6e  |),......Position|

Slide 37

Slide 37 text

Sanity Check (hexdump Go test output)
00000050  65 73 74 2e 67 6f 3a 31 37 35 3a 20 65 78 70 65  |est.go:175: expe|
00000060  63 74 65 64 20 5b 53 74 61 72 74 3a 20 30 20 20  |cted [Start: 0  |
00000070  45 6e 64 3a 20 31 36 20 20 50 6f 73 69 74 69 6f  |End: 16  Positio|
00000080  6e 3a 20 31 20 20 54 6f 6b 65 6e 3a 20 d8 a7 d9  |n: 1  Token: ...|
00000090  85 d8 b1 d9 8a d9 83 20 20 54 79 70 65 3a 20 30  |.......  Type: 0|
expected [Start: 0 End: 16 Position: 1 Token: امريك
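The same sanity check can be done from Go itself, without relying on how a terminal or editor renders right-to-left text. This small sketch prints the UTF-8 bytes of the two tokens in hex; `hexBytes` is a hypothetical helper name, and the two strings are the tokens whose byte sequences appear in the failing test output.

```go
package main

import (
	"bytes"
	"fmt"
)

// hexBytes renders b as space-separated hex pairs, like the hexdump rows.
func hexBytes(b []byte) string {
	return fmt.Sprintf("% x", b)
}

func main() {
	expected := []byte("امريك")
	got := []byte("امريكي")

	// Compare and print bytes, which are unaffected by display direction.
	fmt.Println("equal:   ", bytes.Equal(expected, got))
	fmt.Println("expected:", hexBytes(expected))
	fmt.Println("got:     ", hexBytes(got))
}
```

Printing the bytes makes the difference unambiguous: the second token carries two extra bytes, `d9 8a`, at the end.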

Slide 38

Slide 38 text

Mixing LTR and RTL Text

Slide 39

Slide 39 text

GopherCon Denver 2015 Lightning Talk

Slide 40

Slide 40 text

Hugo Site Integration
Hugo, a fast and flexible static site generator built with love by spf13 and friends in Go

Slide 41

Slide 41 text

Search extension for Caddy
Pedro Nasser

Slide 42

Slide 42 text

Caddy Integration
❖ Caddy is an alternative web server that is easy to configure and use.
❖ The search add-on activates a site search engine that includes a search page and a JSON API.
❖ HTML, Markdown, and .txt files are indexed automatically.

Slide 43

Slide 43 text

Global Community

Slide 44

Slide 44 text

dotGo 2015

Slide 45

Slide 45 text

blevesearch.com @blevesearch