Slide 1

Slide 1 text

No content

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Jurriaan Persyn @oemebamo – CTO Engagor

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

No content

Slide 7

Slide 7 text

SELECT * FROM myauwesomewebsite WHERE `text` LIKE '%shizzle%';

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

No content

Slide 10

Slide 10 text

TEXT SEARCH WITH PHP/MYSQL
•  FULLTEXT index
•  Only for CHAR, VARCHAR & TEXT columns
•  For MyISAM & InnoDB tables (InnoDB as of MySQL 5.6)
•  Configurable Stop Words
•  Types:
   •  Natural Language
   •  Natural Language with Query Expansion
   •  Boolean Full Text

Slide 11

Slide 11 text

MYSQL FULLTEXT BOOLEAN MODE
Operators:
•  +  AND (word must be present)
•  -  NOT (word must be absent)
•  OR is implied (no operator)
•  ( )  Nesting
•  *  Wildcard (appended to a word)
•  " "  Literal phrase matches

Slide 12

Slide 12 text

TEXT SEARCH WITH PHP/MYSQL (CONT’D)
•  Typical columns for search table:
   •  Type
   •  Id
   •  Text
   •  Priority
•  Process:
   •  Blog posts, comments, …
      •  Save (filtered) duplicate of text in search table.
   •  When searching …
      •  Search table and translate to original data via type/id
This is how most PHP/MySQL sites implement their search, right?

Slide 13

Slide 13 text

SELECT * FROM mysearchtable WHERE MATCH(text) AGAINST ('shizzle');

Slide 14

Slide 14 text

SELECT * FROM mysearchtable WHERE MATCH(text) AGAINST ('+shizzle -"ma nizzle"' IN BOOLEAN MODE);

Slide 15

Slide 15 text

SELECT * FROM jobs WHERE role = 'DEVELOPER' AND MATCH(job_description) AGAINST ('node.js');

Slide 16

Slide 16 text

SELECT * FROM jobs j JOIN jobs_benefits jb ON j.id = jb.job_id WHERE j.role = 'DEVELOPER' AND MATCH(j.job_description) AGAINST ('node.js -asp' IN BOOLEAN MODE) AND jb.free_espresso = TRUE;

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

WHAT IS A SEARCH ENGINE?
•  Efficient indexing of data
   •  On all fields / combination of fields
•  Analyzing data
   •  Text Search
      •  Tokenizing
      •  Stemming
      •  Filtering
   •  Understanding locations
   •  Date parsing
•  Relevance scoring

Slide 19

Slide 19 text

TOKENIZING
•  Finding word boundaries
   •  Not just explode(' ', $text);
   •  Chinese has no spaces. (Not every single character is a word.)
•  Understand patterns:
   •  URLs
   •  Emails
   •  #hashtags
   •  Twitter @mentions
   •  Currencies (EUR, €, …)
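As a rough illustration of why tokenizing is more than splitting on spaces, here is a minimal Python sketch that recognises a few of the patterns listed above before falling back to plain words. The regular expressions are deliberately simplified assumptions made for this example, not the rules any real analyzer uses, and the sketch does not attempt Chinese word segmentation or currencies.

```python
import re

# Order matters: the specific patterns are tried before the generic word pattern.
# These expressions are simplified on purpose; real tokenizers are far more thorough.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<url>https?://[^\s]+)
  | (?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)
  | (?P<hashtag>\#\w+)
  | (?P<mention>@\w+)
  | (?P<word>\w+(?:'\w+)?)
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return a list of (kind, token) pairs found in text."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(text)]


sample = "Mail me@example.com or @oemebamo about #elasticsearch http://example.com/a?b=1"
for kind, token in tokenize(sample):
    print(kind, token)
```

A plain split on spaces would keep these tokens intact too, but it could not tell a URL from a word, and it would glue punctuation onto neighbouring words.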

Slide 20

Slide 20 text

STEMMING
•  “Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form.”
   •  Conjugations
   •  Plurals
•  Example:
   •  Fishing, Fished, Fish, Fisher > Fish
   •  Better > Good
•  Several ways to find the stem:
   •  Lookup tables
   •  Suffix-stripping
   •  Lemmatization
   •  …
•  Different stemmers for every language.
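Two of the approaches above, a lookup table and suffix-stripping, can be combined in a few lines of Python. This is a toy sketch: the suffix list, the minimum stem length of three characters and the single lookup entry are assumptions chosen to reproduce the slide's examples, and a real stemmer (Porter, Snowball, …) applies many more rules.

```python
# Irregular forms that no suffix rule can handle go into a lookup table.
IRREGULAR = {"better": "good"}

# Longest suffixes first, so "fishing" loses "ing" rather than nothing at all.
SUFFIXES = ("ing", "ed", "er", "es", "s")


def naive_stem(word):
    """Reduce an English word to a crude stem (illustrative only)."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


for word in ["Fishing", "Fished", "Fish", "Fisher", "Better"]:
    print(word, ">", naive_stem(word))
```

Without the lookup table, "Better" would be stripped to "bett", which is why pure suffix-stripping is usually paired with a dictionary or replaced by lemmatization.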

Slide 21

Slide 21 text

FILTERING
•  Remove stop words
   •  Different for every language
•  HTML
   •  If you’re indexing web content, not every character is meaningful.
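Both filtering steps can be sketched with the Python standard library: first drop the markup, then drop the stop words. The stop-word set here is a tiny made-up English sample for the example. Real lists are longer and differ per language.

```python
from html.parser import HTMLParser

# Tiny illustrative stop-word list (an assumption for this example).
STOP_WORDS = {"the", "a", "an", "is", "of", "and"}


class TextExtractor(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)


def filter_terms(markup):
    """Strip HTML, lowercase, split on whitespace and remove stop words."""
    words = strip_html(markup).lower().split()
    return [w for w in words if w not in STOP_WORDS]


print(filter_terms("<p>The <b>art</b> of search</p>"))
```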

Slide 22

Slide 22 text

UNDERSTANDING LOCATIONS
•  Geocoding of locations to longitude & latitude
•  Search on location:
   •  Bounding box searches
   •  Distance searches
      •  Searching nearby
   •  Geo Polygons
      •  Searching a country
(Note: MySQL also has geospatial indexes.)
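To make the two simplest location searches concrete, here is a small Python sketch of a bounding-box test and a great-circle distance using the haversine formula. It treats the earth as a sphere with a mean radius of 6371 km and ignores boxes that cross the 180° meridian. Both are simplifying assumptions, and the sketch says nothing about how a search engine actually indexes coordinates.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def in_bounding_box(lat, lon, top, left, bottom, right):
    """True if the point lies inside the box (no antimeridian handling)."""
    return bottom <= lat <= top and left <= lon <= right


def within_distance(lat, lon, center_lat, center_lon, radius_km):
    """A "searching nearby" check: is the point within radius_km of the centre?"""
    return haversine_km(lat, lon, center_lat, center_lon) <= radius_km


# Brussels, tested against a rough box around Belgium (coordinates are approximate).
print(in_bounding_box(50.85, 4.35, top=51.5, left=2.5, bottom=49.5, right=6.4))
print(round(haversine_km(0.0, 0.0, 1.0, 0.0), 2))  # one degree of latitude
```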

Slide 23

Slide 23 text

RELEVANCE SCORING
•  From the matched documents, which ones do you show first?
•  Several strategies:
   •  How many matches in document?
   •  How many matches in document as percentage of length?
   •  Custom scoring algorithms
      •  At index time
      •  At search time
   •  … A combination
Think of Google PageRank.
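The first two strategies can give opposite rankings, which a short Python sketch makes visible. The two documents are invented for the example and words are split naively on whitespace. Real engines use more elaborate formulas than either of these.

```python
def match_count(term, text):
    """Strategy 1: raw number of occurrences of the term."""
    return text.lower().split().count(term.lower())


def match_ratio(term, text):
    """Strategy 2: occurrences as a fraction of the document length."""
    words = text.lower().split()
    return words.count(term.lower()) / len(words) if words else 0.0


docs = {
    "long": "search engines index documents so that search queries "
            "run fast and search results rank well",
    "short": "search fast",
}


def rank(score, term):
    """Document names ordered from most to least relevant under a scoring function."""
    return sorted(docs, key=lambda name: score(term, docs[name]), reverse=True)


print(rank(match_count, "search"))  # favours the long document
print(rank(match_ratio, "search"))  # favours the short document
```

The long document contains the term three times, but the short one is half made up of it, so the choice of strategy decides which one comes first.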

Slide 24

Slide 24 text

“There’s an app, or rather software, for that.”

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

APACHE LUCENE
•  “Information retrieval software library”
•  Free/open source
•  Supported by the Apache Software Foundation
•  Created by Doug Cutting
•  Written in 1999

Slide 27

Slide 27 text

“There’s software, or rather a Java library, for that.”

Slide 28

Slide 28 text

No content

Slide 29

Slide 29 text

No content

Slide 30

Slide 30 text

ELASTICSEARCH
•  “You know, for Search”
•  Also Free & Open Source
•  Built on top of Lucene
•  Created by Shay Banon @kimchy
•  Versions
   •  First public release, v0.4, in February 2010
      •  A rewrite of the earlier “Compass” project, now with scalability built in from the very core
   •  Stable version now at 0.20.6
   •  Beta branch at 0.90 (working towards 1.0 release)
•  In Java, so inherently cross-platform

Slide 31

Slide 31 text

WHAT DOES IT ADD TO LUCENE?
•  RESTful Service
   •  JSON API over HTTP
   •  Want to use it from PHP?
      •  cURL requests, just as you would make requests to the Facebook Graph API.
•  High Availability & Performance
   •  Clustering
•  Long Term Persistence
   •  Write-through to a persistent storage system.

Slide 32

Slide 32 text

$ cd ~/Downloads
$ wget https://github.com/…/elasticsearch-0.20.5.tar.gz
$ tar -xzf elasticsearch-0.20.5.tar.gz
$ cd elasticsearch-0.20.5/
$ ./bin/elasticsearch

Slide 33

Slide 33 text

$ cd ~/Downloads
$ wget https://github.com/…/elasticsearch-0.20.5.tar.gz
$ tar -xzf elasticsearch-0.20.5.tar.gz
$ git clone https://github.com/elasticsearch/elasticsearch-servicewrapper.git elasticsearch-servicewrapper
$ sudo mv elasticsearch-0.20.5 /usr/local/share
$ cd elasticsearch-servicewrapper
$ sudo mv service /usr/local/share/elasticsearch-0.20.5/bin
$ cd /usr/local/share
$ sudo ln -s elasticsearch-0.20.5 elasticsearch
$ sudo chown -R root:wheel elasticsearch
$ cd /usr/local/share/elasticsearch
$ sudo bin/service/elasticsearch start

Slide 34

Slide 34 text

$ sudo bin/service/elasticsearch start
Starting ElasticSearch...
Waiting for ElasticSearch...
.
.
.
running: PID:83071
$

Slide 35

Slide 35 text

No content

Slide 36

Slide 36 text

$ curl -XPUT http://localhost:9200/test/stupid-hypes/planking -d '{"name":"Planking", "stupidity_level":"5"}'

{"ok":true,"_index":"test","_type":"stupid-hypes","_id":"planking","_version":1}

Slide 37

Slide 37 text

$ curl -XPUT http://localhost:9200/test/stupid-hypes/gallon-smashing -d '{"name":"Gallon Smashing", "stupidity_level":"5"}'

{"ok":true,"_index":"test","_type":"stupid-hypes","_id":"gallon-smashing","_version":1}

Slide 38

Slide 38 text

$ curl -XPUT http://localhost:9200/test/stupid-hypes/gallon-smashing -d '{"name":"Gallon Smashing", "stupidity_level":"10"}'

{"ok":true,"_index":"test","_type":"stupid-hypes","_id":"gallon-smashing","_version":2}

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

$ curl -XPUT http://localhost:9200/test/stupid-hypes/gallon-smashing -d '{"name":"Gallon Smashing", "stupidity_level":"10", "lifetime":30}'

{"ok":true,"_index":"test","_type":"stupid-hypes","_id":"gallon-smashing","_version":3}

Slide 41

Slide 41 text

No content

Slide 42

Slide 42 text

SCHEMALESS, DOCUMENT ORIENTED
•  No need to configure schema upfront
•  No need for slow ALTER TABLE-like operations
•  You can define a mapping (schema) to customize the indexing process
   •  Require fields to be of a certain type
   •  Mark text fields that should not be analyzed (stemming, …)
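A mapping is just another JSON document. The Python sketch below only builds one for the "stupid-hypes" type used in the earlier curl examples and prints it. It does not contact a server. The field types chosen here, and the "string" / "not_analyzed" spelling, are assumptions based on the mapping format of Elasticsearch releases from this era. Later versions changed these names, so check the documentation for the version you run.

```python
import json

# Assumed mapping for the example type. Nothing here is required by Elasticsearch;
# it only illustrates "force a type" and "do not analyze this text field".
mapping = {
    "stupid-hypes": {
        "properties": {
            # Keep the name as one exact term: no tokenizing, no stemming.
            "name": {"type": "string", "index": "not_analyzed"},
            # Require a numeric type instead of letting it be guessed.
            "stupidity_level": {"type": "integer"},
            "lifetime": {"type": "integer"},
        }
    }
}

body = json.dumps(mapping, indent=2)
print(body)
# In releases of this era, such a body would typically be sent with an HTTP PUT to
# http://localhost:9200/test/stupid-hypes/_mapping
```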

Slide 43

Slide 43 text

“OK, so it’s a NoSQL store?”

Slide 44

Slide 44 text

No content

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

TERMINOLOGY
MySQL                     Elastic Search
Database                  Index
Table                     Type
Row                       Document
Column                    Field
Schema                    Mapping
Index                     Everything is indexed
SQL                       Query DSL
SELECT * FROM table …     GET http://…
UPDATE table SET …        PUT http://…

Slide 47

Slide 47 text

DISTRIBUTED & HIGHLY AVAILABLE
•  Multiple servers (nodes) running in a cluster
   •  Acting as a single service
   •  Nodes in the cluster either store data or just help speed up search queries.
•  Sharding
   •  Indices are sharded (# shards is configurable)
   •  Each shard can have zero or more replicas
   •  Replicas on different servers (server pools) for failover
   •  One node in the cluster goes down? No problem.
•  Master
   •  Automatic master detection + failover
   •  Responsible for distribution/balancing of shards

Slide 48

Slide 48 text

No content

Slide 49

Slide 49 text

SCALING ISSUES?
•  No need for an external load balancer
   •  Since the cluster does its own routing.
   •  Ask any server in the cluster; it will delegate to the correct node.
•  What if …
   •  More data > More shards.
   •  More availability > More replicas per shard.
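The idea that any node can work out where a document lives can be sketched as hash-based routing: hash the document id and take it modulo the number of shards. The Python below is a conceptual illustration only. CRC32 stands in for whatever hash function the cluster really uses, and the node layout is made up.

```python
import zlib


def pick_shard(doc_id, number_of_shards):
    """Map a document id to a shard number in a deterministic way.

    CRC32 is a stand-in hash chosen for this sketch, not Elasticsearch's own.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % number_of_shards


# Hypothetical layout: which node currently holds which shard.
shard_to_node = {0: "node-a", 1: "node-b", 2: "node-a", 3: "node-c", 4: "node-b"}


def route(doc_id):
    """Any node can run this and forward the request to the right place."""
    shard = pick_shard(doc_id, len(shard_to_node))
    return shard, shard_to_node[shard]


for doc_id in ["planking", "gallon-smashing"]:
    print(doc_id, "->", route(doc_id))
```

Because every node computes the same answer for the same id, a client can send its request to whichever node it likes.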

Slide 50

Slide 50 text

PERFORMANCE TWEAKING
•  Bulk Indexing
•  Multi-Get
   •  Avoids network latency (HTTP API)
•  API with administrative & monitoring interface
   •  Cluster’s availability state
   •  Health
   •  Nodes’ memory footprint
•  Alternatives to the HTTP API?
   •  Java library
   •  PHP wrappers (Sherlock, Elastica, …)
   •  But the simplicity of the HTTP API is brilliant to work with; latency is hardly an issue.
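Bulk indexing saves round trips by packing many operations into one request body: one JSON line describing the action, followed by one JSON line with the document, each terminated by a newline. The Python sketch below only builds such a body for the two example documents from the curl slides. The helper name `bulk_body` is made up for this example, and the body is printed rather than sent.

```python
import json


def bulk_body(index, doc_type, docs):
    """Build a newline-delimited bulk request body from a dict of id -> document."""
    lines = []
    for doc_id, source in docs.items():
        # Action line: what to do and where.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


docs = {
    "planking": {"name": "Planking", "stupidity_level": "5"},
    "gallon-smashing": {"name": "Gallon Smashing", "stupidity_level": "10"},
}

body = bulk_body("test", "stupid-hypes", docs)
print(body, end="")
# One HTTP POST of this body to http://localhost:9200/_bulk would replace two PUT requests.
```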

Slide 51

Slide 51 text

Still with me?

Slide 52

Slide 52 text

Some Examples

Slide 53

Slide 53 text

No content

Slide 54

Slide 54 text

No content

Slide 55

Slide 55 text

No content

Slide 56

Slide 56 text

No content

Slide 57

Slide 57 text

Query DSL Example:
(language:nl OR location.country:be OR location.country:aa) (tag:sentiment.negative) author.followers:[1000 TO *] (-sub_category:like) ((-status:857.assigned) (-status:857.done))
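The expression above uses Lucene's query-string syntax. One way to hand such a string to Elasticsearch is to wrap it in a `query_string` query inside a JSON request body. The Python sketch below only builds and prints that body. The helper name and the `size` default are assumptions for the example, and the expression is a shortened version of the one on the slide.

```python
import json


def query_string_body(expression, size=10):
    """Wrap a Lucene query-string expression in a JSON search request body."""
    return json.dumps({"query": {"query_string": {"query": expression}}, "size": size})


expression = (
    "(language:nl OR location.country:be) "
    "(tag:sentiment.negative) "
    "author.followers:[1000 TO *]"
)

body = query_string_body(expression)
print(body)
# Such a body is typically sent to a _search endpoint, for example
# http://localhost:9200/test/_search
```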

Slide 58

Slide 58 text

No content

Slide 59

Slide 59 text

No content

Slide 60

Slide 60 text

No content

Slide 61

Slide 61 text

FACETS
•  Instead of returning the matching documents …
•  … return data about the distribution of values in the set of matching documents
   •  Or a subset of the matching documents
•  Possibilities:
   •  Totals per unique value
   •  Averages of values
   •  Distributions of values
   •  …
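What a facet returns can be mimicked locally in Python on a handful of already-matched documents. This sketch shows the kind of summary a facet produces (totals per unique value, and an average), not how Elasticsearch computes it. The documents and field names are invented for the example.

```python
from collections import Counter
from statistics import mean

# Pretend these are the documents that matched some query.
matching = [
    {"language": "nl", "followers": 1200},
    {"language": "nl", "followers": 3000},
    {"language": "fr", "followers": 1500},
]


def terms_facet(docs, field):
    """Totals per unique value of a field."""
    return Counter(doc[field] for doc in docs if field in doc)


def average_facet(docs, field):
    """Average of a numeric field across the matching documents."""
    return mean(doc[field] for doc in docs if field in doc)


print(terms_facet(matching, "language"))
print(average_facet(matching, "followers"))
```

The first call is the local equivalent of `SELECT language, COUNT(*) … GROUP BY language`.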

Slide 62

Slide 62 text

TERMINOLOGY (CONT’D)
MySQL                                               Elastic Search
SELECT field, COUNT(*) FROM table GROUP BY field    Facet

Slide 63

Slide 63 text

No content

Slide 64

Slide 64 text

No content

Slide 65

Slide 65 text

ADVANCED FEATURES
•  Nested documents (Child-Parent)
   •  Like MySQL joins?
•  Percolation Index
   •  Store queries in Elastic
   •  Send it documents
   •  Get back which queries match
•  Index Warming
   •  Register search queries that cause heavy load
   •  New data added to index will be warmed
   •  So next time the query is executed: pre-cached
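Percolation turns search around: the queries are stored and each incoming document is tested against them. The toy Python sketch below shows only that idea. Each stored "query" is simply a set of terms that must all appear in the document, which is an assumption made for the example and far simpler than real queries.

```python
# Stored queries: name -> set of terms that must all be present (AND semantics).
stored_queries = {
    "mentions-elasticsearch": {"elasticsearch"},
    "search-and-php": {"search", "php"},
}


def percolate(document_text, queries):
    """Return the names of the stored queries that match the document."""
    terms = set(document_text.lower().split())
    return sorted(name for name, required in queries.items() if required <= terms)


print(percolate("Using Elasticsearch from PHP", stored_queries))
print(percolate("Full text search in PHP and MySQL", stored_queries))
```

This is the pattern behind alerts and notifications: register what you are interested in once, then check every new document as it arrives.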

Slide 66

Slide 66 text

WHAT ARE MY OTHER OPTIONS?
•  RDBMS
   •  MySQL, …
•  NoSQL
   •  MongoDB, …
•  Search Engines
   •  Solr
   •  Sphinx
   •  Xapian
   •  Lucene itself
•  SaaS
   •  Amazon CloudSearch

Slide 67

Slide 67 text

… VS. SOLR
•  +
   •  Also built on Lucene
      •  So similar feature set
   •  Also exposes Lucene functionality, like Elastic Search, so easy to extend.
   •  A part of the Apache Lucene project
   •  Perfect for single-server search
•  -
   •  Clustering is there, but it’s definitely not as simple as ElasticSearch’s
   •  Fragmented code base. (Lots of branches.)
Engagor used to run on Solr.

Slide 68

Slide 68 text

… VS. SPHINX
•  +
   •  Great for single-server full text searches
   •  Has graceful integration with SQL databases
      •  (E.g. for indexing data)
   •  Faster than the others for simple searches
•  -
   •  No out-of-the-box clustering
   •  Not built on Lucene; lacks some advanced features
Netlog & Twoo use Sphinx.

Slide 69

Slide 69 text

WANT TO USE IT?
•  In an existing project:
   •  As an extra layer next to your data …
      •  Send to both your database & elasticsearch
      •  Consistency problems?
   •  Or as replacement for database
      •  Elastic is as persistent as MySQL
      •  If you don’t need RDBMS features
      •  @Engagor: Our social messages are only in Elastic

Slide 70

Slide 70 text

“Users are incredibly bad at finding and researching things on the web.”
Nielsen (March 2013)
http://www.nngroup.com/articles/search-navigation/

Slide 71

Slide 71 text

“Pathetic and useless are words that come to mind after this year’s user testing.”
Nielsen (March 2013)
http://www.nngroup.com/articles/search-navigation/

Slide 72

Slide 72 text

“I’m searching for apples and pears.”

Slide 73

Slide 73 text

“apples AND pears”

Slide 74

Slide 74 text

No content

Slide 75

Slide 75 text

“apples OR pears”

Slide 76

Slide 76 text

“It’s too young. Is it even stable enough?”
Your boss (Tomorrow Morning)

Slide 77

Slide 77 text

No content

Slide 78

Slide 78 text

No content

Slide 79

Slide 79 text

No content

Slide 80

Slide 80 text

No content

Slide 81

Slide 81 text

elasticsearch.org irc.freenode.net #elasticsearch

Slide 82

Slide 82 text

elasticsearch node.js socket.io real time notifications rabbitmq backbone.js gearman [email protected]

Slide 83

Slide 83 text

No content

Slide 84

Slide 84 text

$ cd /usr/local/share/elasticsearch
$ sudo bin/service/elasticsearch stop

Slide 85

Slide 85 text

No content

Slide 86

Slide 86 text

Sources include:
•  http://www.elasticsearch.org/videos/2010/02/07/es-introduction.html
•  http://www.elasticsearchtutorial.com/
•  http://www.slideshare.net/clintongormley/cool-bonsai-cool-an-introduction-to-elasticsearch
•  http://www.slideshare.net/medcl/elastic-search-quick-intro
•  http://www.slideshare.net/macrochen/elastic-search-apachesolr-10881377
•  http://www.slideshare.net/cyber_jso/elastic-search-introduction
•  http://www.slideshare.net/infochimps/elasticsearch-v4
•  http://engineering.foursquare.com/2012/08/09/foursquare-now-uses-elastic-search-and-on-a-related-note-slashem-also-works-with-elastic-search/
•  http://stackoverflow.com/questions/10213009/solr-vs-elasticsearch
•  http://stackoverflow.com/questions/11115523/how-does-amazon-cloudsearch-compares-to-elasticsearch-solr-or-sphinx-in-terms-o
•  http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/