Basic Introduction to Elasticsearch

Yet another introductory presentation about Elasticsearch.

Bhaskar V. Karambelkar

December 12, 2014

Transcript

1. What is it?
   • Full text search engine based on the Apache Lucene full text search library.
   • NoSQL data store for structured/unstructured data.
   • Open source, but commercially supported by elasticsearch.com, and part of the ELK (Elasticsearch-Logstash-Kibana) product stack.
   • Competing products:
     • None, really, in terms of feature set and scalability
     • Apache Solr/SolrCloud, Cloudera Search (Hadoop + Solr)
     • Oracle Text
     • Splunk for machine data

2. Features
   • Full text search as well as structured query support.
   • All interaction via REST APIs (API bindings available for all major languages).
   • Supports fault-tolerant, automatic fail-over operation, as well as out-of-the-box data replication.
   • Various options for pre-processing data before it is indexed (analyzers/tokenizers/filters).
   • Supports integration via rivers, ES-Hadoop, and native integration.
   • Comes with sensible defaults out of the box, for ease of development/deployment.

3. Core Concepts
   • Cluster: 1+ nodes; by default, all nodes are identical. Each cluster has a name that cannot change after being set up.
   • Node: a single JVM process. (It is possible to run 2+ nodes on a single box/VM.)
   • Index: equivalent to a DB schema; contains 1+ types.
   • Type: equivalent to a DB table; contains 1+ documents.
   • Document: equivalent to a DB row; JSON-encoded structured/unstructured text.
   • Shard + Replicas: an index is broken into 1+ shards; each shard lives on exactly one node in its entirety and is replicated within the cluster per the replication policy (see the sketch below).

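   Shard and replica counts are set when an index is created. A minimal sketch, assuming a local node on port 9200; the index name and the counts are illustrative:

     # Create an index with 5 shards, each replicated once
     curl -XPUT 'localhost:9200/customer' -d '{
       "settings": {
         "number_of_shards": 5,
         "number_of_replicas": 1
       }
     }'
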
4. How is it deployed?
   • Scalable from the get-go: just add more nodes to scale horizontally. (Current limit is about 150 nodes/cluster.)
   • The cluster has a master node, elected from among the master-eligible nodes, with automatic failover to another master-eligible node in case the master goes down.
   • By default, all nodes are master-eligible and also act as data nodes and client nodes.
   • Nodes auto-discover each other using multicast.
   • 1 node per box.

5. How SHOULD it be deployed?
   • Dedicated master/data/client nodes, to avoid split brain and for better load balancing.
   • Data nodes with locally attached disks.
   • Typical data node: 8 cores, 64 GB RAM, 4-6 disks of 1-2 TB each. Assign 30 GB of RAM to the JVM and leave the rest to the OS + FS cache.
   • Change the default cluster name and use unicast discovery to keep unexpected nodes from joining (see the sketch below).
   • Nodes should be on a single subnet and talk to each other over 1G/10G links.
   • Don't grow a cluster beyond 150 nodes; prefer separate clusters and use a "tribe node" for cross-cluster searches.

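   A minimal elasticsearch.yml sketch along those lines, for a dedicated master node (ES 1.x settings; the cluster name and host names are placeholders, and minimum_master_nodes: 2 assumes three master-eligible nodes):

     cluster.name: my-prod-cluster          # never keep the default name
     node.master: true                      # dedicated master: eligible, holds no data
     node.data: false
     discovery.zen.ping.multicast.enabled: false
     discovery.zen.ping.unicast.hosts: ["es-master-1", "es-master-2", "es-master-3"]
     discovery.zen.minimum_master_nodes: 2  # quorum, guards against split brain
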
6. Data Ingestion
   • An HTTP PUT to /<index>/<type>/<id> will:
     • Auto-create the index and type
     • Use the default analyzer to analyze the input document
     • Store the input document at the given ID in the index, along with the indexed data
     • Replicate it per the replication policy
     • Make it available for immediate GET via ID, and in near-realtime for search

7. e.g.

     curl -XPUT 'localhost:9200/customer/external/1' \
          -d '{"name": "John Doe"}'

   And the response:

     {
       "_index" : "customer",
       "_type" : "external",
       "_id" : "1",
       "_version" : 1,
       "created" : true
     }

8. Types of APIs
   • Document: Index/Get/Delete/Update + Bulk
   • Search: Search/Count/Validate/Explain/MLT/Percolate
   • Indices: index/alias/mappings management + monitoring
   • Cluster: cluster-wide health/state/stats
   • Cat: APIs that output compact, aligned text instead of JSON (handy for scripting); one example per family below

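   One illustrative call per API family, reusing the customer/external example (all endpoints are stock Elasticsearch):

     curl -XGET 'localhost:9200/customer/external/1'          # Document: get by ID
     curl -XGET 'localhost:9200/customer/_count?q=name:john'  # Search: count matches
     curl -XGET 'localhost:9200/_cluster/health?pretty'       # Cluster: health
     curl -XGET 'localhost:9200/_cat/indices?v'               # Cat: aligned text
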
9. Analysis
   • An analyzer contains 1 tokenizer & 0+ token filters.
   • The tokenizer can be preceded by 1+ char filters.
   • The tokenizer breaks strings down into a stream of tokens, e.g. "one two three" becomes 'one', 'two', 'three'. Examples: standard/edge-ngram/keyword/pattern.
   • Token filters modify/delete/add tokens.
   • Char filters strip out/replace characters. (The _analyze call below shows a tokenizer at work.)

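   The _analyze API shows what an analyzer does to a piece of text; the analyzer and the sample string here are chosen arbitrarily:

     # The standard analyzer lowercases and tokenizes,
     # so the response lists the tokens 'one', 'two', 'three'
     curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'One two THREE'
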
10. Mapping
    • Unstructured data works, but structured is even better.
    • Use the Mapping API to pre-populate the metadata about documents belonging to a certain index/type.
    • Used for specifying analyzers, field data types, transformations, etc.
    • Highly recommended even if the default mappings are good enough (see the sketch below).

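    A hedged Mapping API sketch, extending the customer/external example; the field names and types are illustrative:

      # Declare "name" as an analyzed string and "age" as an integer
      curl -XPUT 'localhost:9200/customer/_mapping/external' -d '{
        "external": {
          "properties": {
            "name": { "type": "string", "analyzer": "standard" },
            "age":  { "type": "integer" }
          }
        }
      }'
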
11. QUERY DSL
    • All queries are JSON.
    • The simplest is query_string, e.g. field:'one or two or three'.
    • query_string is not recommended for production use.
    • A query can comprise a search plus a filter.
    • The search assigns a score; a filter is just that, a boolean gate (see the example below).
    • The Query DSL can be quite daunting at first; learn it like SQL.

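    A sketch of combining a scored search with a filter, via the 1.x "filtered" query; the index, field names, and values are illustrative:

      # The "match" clause is scored; the "term" filter is a pure boolean gate
      curl -XPOST 'localhost:9200/customer/external/_search?pretty' -d '{
        "query": {
          "filtered": {
            "query":  { "match": { "name": "john" } },
            "filter": { "term":  { "active": true } }
          }
        }
      }'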