Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Introduction to Elasticsearch

Haydar Külekci
March 24, 2023
110

Introduction to Elasticsearch

Haydar Külekci

March 24, 2023
Tweet

Transcript

  1. • Elasticsearch is the distributed search and analytics engine at

    the heart of the Elastic Stack. • Elasticsearch provides near real-time search and analytics for all types of data. • Whether you have structured or unstructured text, numerical data, or geospatial data, Elasticsearch can efficiently store and index it in a way that supports fast searches. • You can go far beyond simple data retrieval and aggregate information to discover trends and patterns in your data. • As your data and query volume grows, the distributed nature of Elasticsearch enables your deployment to grow seamlessly right along with it. What is Elasticsearch?
  2. • For a search box in an application or website

    • Store and analyzing the logs, metrics, …etc. • Automate the business workflow as a storage engine • For analyzing spatial data. (Geographic Information System / GIS) • Analyze genetic data using Elasticsearch as a bioinformatic research tool So, what can we use it for?
  3. • Elasticsearch has an API and using JSON format for

    the interact with the users. • There are lots of clients for the most of the language to interact with API. We will mention at the end of the presentation. • When a document is stored, it is indexed and fully searchable in near real time-- within 1 second. • Elasticsearch also has the ability to be schema-less, which means that documents can be indexed without explicitly specifying how to handle each of the different fields that might occur in a document. • When dynamic mapping is enabled, Elasticsearch automatically detects and adds new fields to the index. More?
  4. • Cluster: A cluster is a collection of one or

    more nodes that together hold the entire data. It provides distributed indexing, high availability and search capabilities across all nodes. • Node: A node is a single server in a cluster, which stores data and participates in the cluster’s indexing and search capabilities. • Shard: Elasticsearch facilitates High availability, fault tolerance, scalability by providing the ability to subdivide the index into multiple pieces called shards. Each shard can be hosted on any node within the cluster. We can set a number of shards manually inside index schema for a particular Index. Some Keywords
  5. • Replica: Elasticsearch allows you to make one or more

    copies of your index’s shards called replica shards or replica. We can set a number of replicas manually inside index schema for a particular Index. • Index: An index is a collection of documents and is identified by a name. • Document: A document is a basic unit of information that can be indexed. It is stored in JSON format which is a global internet data interchange format. • Inverted Index: Elasticsearch uses a data structure called an inverted index that supports very fast full-text searches. More Keywords
  6. What is an Index? • An index in Elasticsearch is

    a collection of documents that share similar characteristics. • An index can be thought of as an optimized collection of documents and each document is a collection of fields, which are the key-value pairs that contain your data. • Elasticsearch uses an indexing process that automatically detects and maps new fields to the appropriate Elasticsearch data types, making it easy to index and explore your data without requiring you to explicitly specify how to handle each field which is stored in index mapping. • An index also stores settings for the data, such as the number of shards and replicas, which determine how the data is distributed and replicated across the Elasticsearch cluster.
  7. • Index name can be lowercase only • Cannot include

    \, /, *, ?, ", <, >, |, ` ` (space character), ,, # • Cannot start with -, _, + • Cannot be . or .. • Cannot be longer than 255 bytes Creating Index PUT accounts { "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }
  8. • There are three body parameter: ◦ Aliases: to set

    alias for the indices ◦ Settings: to set the configuration related with index. ◦ Mappings: to set field types for the index documents Creating Index PUT accounts { "aliases": { "alias-name": {} }, "settings": { "number_of_shards": 1, "number_of_replicas": 1, "analysis": {} }, "mappings": {} }
  9. • Elasticsearch is a distributed document store that uses JSON

    documents to store complex data structures. • Elasticsearch stores data using an inverted index that lists every unique word and identifies all the documents each word appears in. • Elasticsearch is schema-less, which means it can automatically detect and map new fields to the appropriate Elasticsearch data types. • You can define rules to control dynamic mapping and explicitly define mappings to take full control of how fields are stored and indexed. Document { "id": 1, "name": "AHMET", "surname": "MERT EKİNCİOĞLU", "iban": "TR910006200000100001234567", "balance": 15 }
  10. • Elasticsearch enables you to distinguish between full-text string fields

    and exact value string fields, perform language-specific text analysis, optimize fields for partial matching, use custom date formats, and use data types such as geo_point and geo_shape. • It's useful to index the same field in different ways for different purposes, such as indexing a string field as both a text field and as a keyword field. • The analysis chain that is applied to a full-text field during indexing is also used at search time. Document { "id": 1, "name": "AHMET", "surname": "MERT EKİNCİOĞLU", "iban": "TR910006200000100001234567", "balance": 15, "last_location": { "type": "Point", "coordinates": [-71.34, 41.12] } "account_options": { "option1": "value1", "option2": { "value": "value2", "group": 1 } } }
  11. Using Kibana For Request • We will use Kibana for

    sending requests to Elasticsearch • Please follow the path as Management > Dev Tools
  12. • We are using Rest API to create a document

    • We will use the Kibana representation of the request instead of using cURL command representation. Creating a Document POST accounts/_doc { "id": 1, "name": "AHMET", "surname": "MERT EKİNCİOĞLU", "iban": "TR910006200000100001234567", "balance": 15 }
  13. Creating a Document (cURL) curl -XPOST "http://127.0.0.1:9200/accounts/_doc" -H "kbn-xsrf: reporting"

    - H "Content-Type: application/json" -d' { "id": 1, "name": "AHMET", "surname": "MERT EKİNCİOĞLU", "iban": "TR910006200000100001234567", "balance": 15 }'
  14. • Elasticsearch will return to you some information when you

    index a new document. • “_index” will refer the which index you index the data • “_id” will refer the document id of you saved. This document id can be used to fetch the document back. • “_version” will refer how many times this document changed. • ”result” refers what is the action you did with the document. Creating Document Response { "_index": "accounts", "_id": "ZxOc-YYBXc7yhtrXGiTG", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 2, "failed": 0 }, "_seq_no": 0, "_primary_term": 1 }
  15. • GET request with ”_id” of documents we see on

    the response. • There are some meta data which are using by Elasticsearch for the document Fetching a Document GET accounts/_doc/ZxOc-YYBXc7yhtrXGiTG { "_index": "accounts", "_id": "ZxOc-YYBXc7yhtrXGiTG", "_version": 1, "_seq_no": 9, "_primary_term": 1, "found": true, "_source": { "id": 1, "name": "AHMET", "surname": "MERT EKİNCİOĞLU", "iban": "TR910006200000100001234567", "balance": 15 } }
  16. • We are using Rest API to update a document

    • PUT request with document “_id” and the document itself. • This request will overwrite the document. Updating a Document PUT accounts/_doc/ZxOc-YYBXc7yhtrXGiTG { "id": 1, "name": "AHMET MERT", "surname": "EKİNCİOĞLU", "iban": "TR910006200000100001234567", "balance": 15 }
  17. • “result” filed will be “updated” • The ”version” field

    will increase when we compare with older. Updating Document Response { "_index": "accounts", "_id": "ZxOc-YYBXc7yhtrXGiTG", "_version": 2, "result": "updated", "_shards": { "total": 2, "successful": 2, "failed": 0 }, "_seq_no": 2, "_primary_term": 1 }
  18. • POST request with document “_id” to “_update” endpoint •

    You can send parts of the document instead of all document • You can send new field even to update the document. Partially Updating a Document POST accounts/_update/ZxOc-YYBXc7yhtrXGiTG { "doc": { "name": "AHMET SALİM" } }
  19. Partially Updating Responses { "_index": "accounts", "_id": "ZxOc-YYBXc7yhtrXGiTG", "_version": 2,

    "result": "noop", "_shards": { "total": 0, "successful": 0, "failed": 0 }, "_seq_no": 10, "_primary_term": 1 } { "_index": "accounts", "_id": "ZxOc-YYBXc7yhtrXGiTG", "_version": 2, "result": "updated", "_shards": { "total": 0, "successful": 0, "failed": 0 }, "_seq_no": 10, "_primary_term": 1 } Updated Response No Change Response
  20. • We are using Rest API to create a document

    • We will use the Kibana representation of the request instead of using cURL command representation. Deleting a Document DELETE accounts/_doc/ZxOc-YYBXc7yhtrXGiTG
  21. • “result” filed will be “deleted”. • The ”version” field

    will increase when we compare with older. • The version increased because delete action won’t delete the document totally. At first it will be flagged as deleted internally. Deleting Document Response { "_index": "accounts", "_id": "ZxOc-YYBXc7yhtrXGiTG", "_version": 3, "result": "deleted", "_shards": { "total": 2, "successful": 2, "failed": 0 }, "_seq_no": 8, "_primary_term": 1 }
  22. • Every request have network latency • Bulk operations will

    decrease the network latency Bulk Operations POST accounts/_bulk {action-1} {body-1} {action-2} {body-2} {action-3} {body-3} {action-4} {body-4}
  23. Bulk Operations POST _bulk {"index": {"_index": "accounts", "_id": "1"}} {"id":

    1, "name": "AHMET", "surname": "MERT EKİNCİOĞLU", "iban": "TR910006200000100001234567", "balance": 15} {"index": {"_index": "accounts", "_id": "2"}} {"id": 2, "name": "GÜLNAME", "surname": "KINDIR", "iban": "TR910006200000100001234568", "balance": 27} {"index": {"_index": "accounts", "_id": "3"}} {"id": 3, "name": "SALİM", "surname": "İLTEMİZ", "iban": "TR910006200000100001234569", "balance": 54} {"index": {"_index": "accounts", "_id": "4"}} {"id": 4, "name": "MÜBERRA", "surname": "GÜVEN MEŞE", "iban": "TR910006200000100001234570", "balance": 81}
  24. Reindex • Copies documents from a source to a destination.

    • The source can be any existing index, alias, or data stream. • The destination must differ from the source. For example, you cannot reindex a data stream into itself. • The read index privilege for the source is necessary. • The write index privilege for the destination is necessary. POST _reindex { "source": { "index": "accounts" }, "dest": { "index": "accounts_haydar" } }
  25. Reindex with Specific Data • We need to reindex the

    specific data to our new index • We can use query parameter for source with a Elasticsearch query or filter. • By default _reindex uses scroll batches of 1000. You can change the batch size with the “size” field in the source element. POST _reindex { "source": { "index": "accounts", "size": 100, "query": { "term": { "type": "retirement" } } }, "dest": { "index": "accounts_haydar" } }
  26. Reindex Query Parameters • You can use “max_docs” parameter if

    you want to reindex specific number of documents only. • You can specify ”require_alias” parameter as true if you want to index documents into just aliases instead of using real index name. POST _reindex?require_alias=true&max_docs=10 { "source": { "index": "accounts" }, "dest": { "index": "accounts_haydar" } }
  27. Delete By Query • You can delete the documents with

    bulk operation. • There is an “_delete_by_query” API endpoint to manage this operation POST accounts/_delete_by_query { "query": { "term": { "type": "retirement" } } }
  28. Update By Query • You can update the documents with

    bulk operation. • There is an “_update_by_query” API endpoint to manage this operation POST accounts/_update_by_query?conflicts=proceed { "script": { "source": "ctx._source.type = 'retired';", "lang": "painless" }, "query": { "term": { "type": "retirement" } } }