An Introduction to search-index.js

Slide 1

Slide 1 text

An Introduction to search-index.js
Fergus McDowall
Web Rebels, Oslo 2015

Slide 2

Slide 2 text

Deepish, but not too deep

Slide 3

Slide 3 text

Fergus McDowall
https://github.com/fergiemcdowall
https://twitter.com/fergiemcdowall
https://no.linkedin.com/in/fergusmcdowall

Slide 4

Slide 4 text

search-index.js: Node.js module on the LeBroN stack containing core search functionality (a bit like Lucene)
Norch: HTTP-GET wrapper around search-index.js (a bit like Solr/Elastic)

Slide 5

Slide 5 text

LeBroN (Level*, Browserify, Node.js) http://lebron.technology/

Slide 6

Slide 6 text

Old ’n’ Busted: Search indexes on the server
All magic happening on the server
Clients sending search queries to the server and reading returned responses

Slide 7

Slide 7 text

New Hotness: Search indexes in the browser
Less magic happening on the server
Clients replicating and building their own indexes, query<->result is local

Slide 8

Slide 8 text

Make “Small Data” Search Apps

Slide 9

Slide 9 text

The GitHubs you need
https://github.com/fergiemcdowall/search-index
https://github.com/fergiemcdowall/norch

Slide 10

Slide 10 text

The Document Processing Pipeline

Slide 11

Slide 11 text

Document processing

Slide 12

Slide 12 text

Lodash is your friend!
For example you might process:
https://raw.githubusercontent.com/fergiemcdowall/world-bank-dataset/master/world-bank-projects.json
With
https://gist.github.com/fergiemcdowall/dceec9930327cb92467b
EXPLORE!
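As a rough sketch of this kind of processing, the snippet below reshapes a raw record into a flat document that is ready for indexing. It uses plain Array methods so that it runs without dependencies; Lodash's _.map would do the same job. The sample record and the toDocument helper are made up for illustration, and apart from mjtheme and totalamt (which appear later in this talk) the field names are assumptions, not taken from the dataset or the gist.

```javascript
// Hypothetical raw record, loosely shaped like an exported JSON dataset.
const rawProjects = [
  {
    _id: { $oid: 'abc123' },
    project_name: 'Rural Roads',
    mjtheme: ['Rural development'],
    totalamt: 50000000,
    url: 'http://example.com/1' // dropped below: not worth searching
  }
];

// Keep only the fields worth searching and flatten the nested id.
// Numbers become zero-padded strings (see "A word on numeric sorting").
function toDocument(raw) {
  return {
    id: raw._id.$oid,
    title: raw.project_name,
    mjtheme: raw.mjtheme,
    totalamt: [String(raw.totalamt).padStart(15, '0')]
  };
}

const batch = rawProjects.map(toDocument);
console.log(JSON.stringify(batch, null, 2));
```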

Slide 13

Slide 13 text

Norch.js Apps

Slide 14

Slide 14 text

Install norch.js:
➜ norchdir npm install norch
Run norch.js:
➜ norchdir ./node_modules/norch/bin/norch
Index some data into norch.js:
➜ datadir curl --form document=@node_modules/reuters-21578-json/data/full/reuters-000.json http://localhost:3030/indexer --form filterOn=places,topics,organisations

Slide 15

Slide 15 text

An ng (Angular) front end that talks to your norch server.
git clone https://github.com/fergiemcdowall/norch-angular-app
cd norch-angular-app
curl --form [email protected] http://localhost:3030/indexer --form filterOn=mjtheme,totalamt
(for localhost development, be aware of the Access-Control-Allow-Origin header)
norch -c http://localhost:8000

Slide 16

Slide 16 text

Node.js Apps

Slide 17

Slide 17 text

main.js
Run main.js:
➜ examples node main.js

Slide 18

Slide 18 text

Replicating an entire index over the net: MADNESS…?

Slide 19

Slide 19 text

…or perhaps not?
Runs on ALL browsers
Persistent (because IndexedDB)
Indexes are small
Lower server costs
Network caching is magical
Net getting faster
User experience

Slide 20

Slide 20 text

Browser Apps (Data from Browser)

Slide 21

Slide 21 text

index.html:
main.js:
Browserify that bad boy:
➜ dir browserify main.js -o bundle.js
…and open index.html in a browser:

Slide 22

Slide 22 text

Browser Apps (Replicate to Browser)

Slide 23

Slide 23 text

Index some data and then use the replicate API to create a snapshot:
Code snippet in main.js that handles replication:
Run
➜ browserifydir node indexgenerator.js
➜ browserifydir gunzip backup.gz
➜ browserifydir browserify main.js -o bundle.js

Slide 24

Slide 24 text

Browser Apps (data from PouchDB)

Slide 25

Slide 25 text

Index an entire PouchDB instance:

Slide 26

Slide 26 text

For source code, deeper explanation, and full demos check out
https://github.com/fergiemcdowall/search-index/tree/master/examples

Slide 27

Slide 27 text

Indexing Documents (getting stuff in)

Slide 28

Slide 28 text

Document Format
All fields are optional, but if id isn’t present, it will be autogenerated.
{
  id: 'aTotallyOptionalID',
  title: 'A Really Cool Title',
  tags: ['coolness', 'awesomeness'],
  body: 'Bla bla bla bla, lots of text here…'
}

Slide 29

Slide 29 text

Batch Format
Use batches to index lots of data. Bigger batches are faster if your hardware can cope.
[
  {
    id: '1',
    title: 'A Really Cool Title',
    tags: ['coolness', 'awesomeness'],
    body: 'Sparkly w00p w00p, lots of text here…'
  },
  {
    id: 'two',
    title: 'A Really Boring Title',
    tags: ['dullness', 'boringness'],
    body: 'Bla bla bla bla, lots of text here…'
  }
]
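One way to tune batch size is to split a large document array into chunks before indexing. The sketch below is only an illustration: toBatches is a hypothetical helper, not part of search-index, and the documents are made up.

```javascript
// Split an array of documents into batches of at most `size` items.
function toBatches(docs, size) {
  const batches = [];
  for (let i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size));
  }
  return batches;
}

// Five made-up documents, batched two at a time.
const docs = ['1', '2', '3', '4', '5'].map((id) => ({
  id,
  title: 'Title ' + id,
  body: 'Lots of text here'
}));

const batches = toBatches(docs, 2);
console.log(batches.map((b) => b.length)); // [ 2, 2, 1 ]
```

Each element of batches can then be handed to the indexer in turn; a larger size means fewer, heavier indexing calls.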

Slide 30

Slide 30 text

A word on numeric sorting
search-index sorts alphabetically, so all numbers have to be stored as zero-padded strings of equal length.
[
  {
    id: '1',
    name: 'Ruckus',
    price: ['000000000050000'],
    manufacturer: 'Honda'
  },
  {
    id: '2',
    name: 'Grom',
    price: ['000000000100000'],
    manufacturer: 'Honda'
  }
]
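To see why the padding matters, compare how plain JavaScript sorts the two prices above as unpadded and as padded strings. This is only a demonstration of alphabetical ordering in general, not of search-index internals.

```javascript
// The two prices from the example, as unpadded strings.
const raw = ['50000', '100000'];

// The same prices padded with zeros to 15 characters.
const padded = raw.map((n) => n.padStart(15, '0'));

// Alphabetical sort compares character by character, so '1' < '5'
// puts 100000 ahead of 50000.
console.log(raw.slice().sort());
// [ '100000', '50000' ]

// With equal-length strings, alphabetical order matches numeric order.
console.log(padded.slice().sort());
// [ '000000000050000', '000000000100000' ]
```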

Slide 31

Slide 31 text

A word on numeric sorting
Here’s a nice number stringify-padding function:
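The function shown on the original slide did not survive in this transcript. As a stand-in, here is a minimal sketch of such a helper; padNumber and its default width of 15 (the width used in the examples in this talk) are assumptions, not the author's code.

```javascript
// Hypothetical helper: turn a non-negative integer into a fixed-width,
// zero-padded string so that alphabetical order matches numeric order.
function padNumber(n, width = 15) {
  if (!Number.isSafeInteger(n) || n < 0) {
    throw new RangeError('expected a non-negative safe integer');
  }
  const digits = String(n);
  if (digits.length > width) {
    throw new RangeError('number does not fit in ' + width + ' digits');
  }
  return digits.padStart(width, '0');
}

console.log(padNumber(50000));  // 000000000050000
console.log(padNumber(100000)); // 000000000100000
```

Decimals and negative numbers would need their own scheme (for example a fixed scale factor), which is left out here.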

Slide 32

Slide 32 text

Working with Queries (getting stuff out)

Slide 33

Slide 33 text

Basic Queries
Search all fields for “africa bank”
{
  "query": {"*": ["africa", "bank"]}
}
Search title field for “africa bank”
{
  "query": {"title": ["africa", "bank"]}
}
Search title field for “africa”, body for “bank”
{
  "query": {"title": ["africa"], "body": ["bank"]}
}
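In an app, these query objects are usually built from whatever the user types. A small sketch, assuming terms should be lower-cased and split on whitespace (buildQuery is a hypothetical helper, not part of search-index):

```javascript
// Hypothetical helper: split user input into lower-cased terms and
// wrap them in a query object for one field ('*' means all fields).
function buildQuery(input, field = '*') {
  const terms = input.toLowerCase().split(/\s+/).filter(Boolean);
  return { query: { [field]: terms } };
}

console.log(JSON.stringify(buildQuery('Africa Bank')));
// {"query":{"*":["africa","bank"]}}

console.log(JSON.stringify(buildQuery('africa bank', 'title')));
// {"query":{"title":["africa","bank"]}}
```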

Slide 34

Slide 34 text

Basic Queries
Return everything in index
{
  "query": {"*": ["*"]}
}

Slide 36

Slide 36 text

Facets
Simple facets
{
  "query": {"*": ["africa", "bank"]},
  "facets": {"totalamt": {}}
}
Or define ranges of values
{
  "query": {"*": ["africa", "bank"]},
  "facets": {
    "totalamt": {
      "ranges": [
        ["000000000000000", "000000050000000"],
        ["000000050000001", "100000000000000"]
      ]
    }
  }
}
You can also sort and limit your facets.

Slide 37

Slide 37 text

Filters
Filters allow you to query on facets
{
  "query": {"*": ["africa", "bank"]},
  "filter": {
    "totalamt": [["000000000000000", "000000050000000"]]
  }
}
You always specify a range, so to filter on one value use the same value for both ends
{
  "query": {"*": ["africa", "bank"]},
  "filter": {
    "totalamt": [["000000050000000", "000000050000000"]]
  }
}
You can filter on as many ranges as you want.
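Since a filter always takes a range, a single-value filter is simply a range whose two ends are equal. The sketch below builds both kinds from plain numbers. It assumes that a filter maps a field name to a list of [from, to] pairs and that values are padded to 15 characters; rangeFilter, valueFilter and pad are hypothetical helpers, not part of search-index.

```javascript
// Pad a number to the 15-character strings used in this talk's examples.
const pad = (n) => String(n).padStart(15, '0');

// Hypothetical helper: one [from, to] range on a field.
// The field -> list-of-ranges shape is an assumption.
function rangeFilter(field, from, to) {
  return { [field]: [[pad(from), pad(to)]] };
}

// Hypothetical helper: a single value is a range with equal ends.
function valueFilter(field, value) {
  return rangeFilter(field, value, value);
}

const filteredQuery = {
  query: { '*': ['africa', 'bank'] },
  filter: valueFilter('totalamt', 50000000)
};

console.log(JSON.stringify(filteredQuery, null, 2));
```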

Slide 38

Slide 38 text

Other stuff
pageSize: hits per page
offset: used for paging
teaser: creates a small text preview containing query terms in the document
weight: used to create relevancy models
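For paging, offset and pageSize are typically derived from a page number. A small sketch, assuming that offset counts the hits to skip and that both options sit at the top level of the query object (pagedQuery is a hypothetical helper):

```javascript
// Hypothetical helper: page numbers start at 1.
// Assumes `offset` is the number of hits to skip.
function pagedQuery(terms, page, pageSize = 10) {
  return {
    query: { '*': terms },
    offset: (page - 1) * pageSize,
    pageSize
  };
}

console.log(JSON.stringify(pagedQuery(['africa', 'bank'], 3, 20)));
// {"query":{"*":["africa","bank"]},"offset":40,"pageSize":20}
```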

Slide 39

Slide 39 text

Results Example here

Slide 40

Slide 40 text

Parsing Results
Caveman JavaScript
Angular
…and anything else you can think of

Slide 41

Slide 41 text

Replication

Slide 42

Slide 42 text

Replication in Norch
Make snapshot of mother index
  curl http://localhost:3030/snapshot -o snapshot.gz
Empty target index (if necessary)
  curl http://localhost:3030/empty
Replicate into new or emptied index
  curl -X POST http://localhost:3030/replicate --data-binary @snapshot.gz -H "Content-Type: application/gzip"
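The same three steps can be scripted. The sketch below is untested against a real norch server and rests on several assumptions: Node.js 18+ (for the global fetch), two norch instances at hypothetical base URLs, and the /snapshot, /empty and /replicate endpoints behaving as in the curl commands above. The replicate function and its fetchImpl parameter are illustrative names only.

```javascript
// Sketch: copy an index from one norch server to another.
// `fetchImpl` is injectable so the flow can be exercised without a server.
async function replicate(sourceUrl, targetUrl, fetchImpl = fetch) {
  // 1. Make a snapshot of the mother index.
  const snapshot = await fetchImpl(sourceUrl + '/snapshot');
  if (!snapshot.ok) throw new Error('snapshot failed: ' + snapshot.status);
  const body = Buffer.from(await snapshot.arrayBuffer());

  // 2. Empty the target index (skip this step for a brand new index).
  const emptied = await fetchImpl(targetUrl + '/empty');
  if (!emptied.ok) throw new Error('empty failed: ' + emptied.status);

  // 3. Replicate the snapshot into the target index.
  const replicated = await fetchImpl(targetUrl + '/replicate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/gzip' },
    body
  });
  if (!replicated.ok) throw new Error('replicate failed: ' + replicated.status);
  return body.length; // size of the snapshot in bytes
}

// Usage (hypothetical hosts):
// replicate('http://localhost:3030', 'http://localhost:3031').then(console.log);
```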

Slide 43

Slide 43 text

Replication in search-index.js
Make snapshot of mother index
Empty target index
Replicate into new or emptied index

Slide 44

Slide 44 text

Strengths and Weaknesses

Slide 45

Slide 45 text

Strengths:
Super-portable
Easy to install and use
Performant for simple queries
Runs on low-end hardware (server and browser)
Replication

Weaknesses:
Strictly a small-data tool
Limited feature set
Relatively small community compared to other search technologies (Elastic, Solr)

Slide 46

Slide 46 text

Future Direction

Slide 47

Slide 47 text

Richer query syntax
Better docs, examples and tutorials
Service for generating indexes
Better compression
Mad science
Performance, bugfixes, etc.

Slide 48

Slide 48 text

Hey Browsers! Allow OpenSearch to talk to search indexes in the webpage

Slide 49

Slide 49 text

Get Involved
Submit a pull request
Make cool stuff
Anything on iOS or Android

Slide 50

Slide 50 text

Thanks For Listening!
@fergiemcdowall
https://github.com/fergiemcdowall/norch
https://github.com/fergiemcdowall/search-index