ElasticSearch: Introduction and lessons learned

ElasticSearch is a full text search engine based on Apache Lucene. It’s the new kid on the block and competes with other projects like Apache Solr. It is open source under the Apache Licence and backed by the (well-funded) ElasticSearch company, which offers support and training.

This talk will provide an introduction to ElasticSearch, the community around it and the lessons learned when working with ElasticSearch to index hundreds of millions of documents in a multitude of languages and character sets. The rough outline of the talk will be:

- Introduction to myself, Artirix and the work that we do.
- ElasticSearch introduction - Project background and interesting features.
- An introduction to the community and third party plugins.
- Quick overview of usage (inserting, querying) and the different versions.
- Lessons learned when creating a large index.

Dougal Matthews

July 02, 2013

Transcript

  1. 2.

    WHAT DO I DO?
    • Working as a senior Python developer for Artirix.
    • Building backend systems and services.
    • Organiser of Python Glasgow.

    Maximising the Value of Content, Data & Information
    Tuesday, 2 July 13
  2. 3.

    elasticsearch
    • Open Source - Apache Licence.
    • Backed by the ElasticSearch company.
    • Careful feature development.
    • Primary Author is Shay Banon.
  3. 4.

    elasticsearch
    • Mostly by Shay Banon
    • Open Source - Apache Licence
    • Java
    • Backed by the ElasticSearch company
    • Careful feature development.
  4. 5.

    elasticsearch
    • Full text search
    • Big data
    • Faceting
    • GIS
    • Clustering
    • Logging and more.
  5. 6.

    Data Model
    • Document store - JSON everywhere.
    • Speaks HTTP (and Thrift).
    • Schemaless (kinda).
    • Indexes, Types and Documents.
  6. 8.

    Getting started. OSX

        $ brew install elasticsearch
        $ elasticsearch -f -D es.config=/usr/local/opt/elasticsearch/config/elasticsearch.yml
  7. 9.

        $ curl -s -XGET 'localhost:9200/'
        {
          "ok" : true,
          "status" : 200,
          "name" : "Gigantus",
          "version" : {
            "number" : "0.90.2",
            "snapshot_build" : false,
            "lucene_version" : "4.3.1"
          },
          "tagline" : "You Know, for Search"
        }
  8. 11.

    Indexing

        curl -XPUT localhost:9200/events/talk/123 -d '
        {"title": "ElasticSearch: Introduction."}
        ' | python -m json.tool
        {
            "_id": "123",
            "_index": "events",
            "_type": "talk",
            "_version": 1,
            "ok": true
        }
  9. 12.

    Fetching

        curl -XGET localhost:9200/events/talk/123
        {
            "_id": "123",
            "_index": "events",
            "_source": {
                "title": "ElasticSearch: Introduction."
            },
            "_type": "talk",
            "_version": 1,
            "exists": true
        }
  10. 13.

    Searching

        curl -XGET 'localhost:9200/events/_search?q=_id:123'
        {
            "_shards": {
                "failed": 0,
                "successful": 5,
                "total": 5
            },
            "hits": {
                "hits": [
                    {
                        "_id": "123",
                        "_index": "events",
                        "_score": 1.0,
                        "_source": {
                            "title": "ElasticSearch: Introduction."
                        },
                        "_type": "talk"
                    }
                ],
                "max_score": 1.0,
                "total": 1
            },
  11. 14.

    Query DSL
    • Filters
      • Fast
      • Cached
      • Boolean
    • Queries
      • Fuzzy
      • Scored
  12. 15.

        {
            "bool": {
                "must": {
                    "range": {
                        "year": {"from": 2011, "to": 2013}
                    }
                },
                "must_not": {
                    "term": {"language": "PHP"}
                },
                "should": [
                    {"term": {"tag": "elasticsearch"}},
                    {"term": {"tag": "python"}}
                ],
                "minimum_number_should_match": 1,
                "boost": 1.0
            }
        }
  13. 17.

    Reverse Indexes

    Documents
      1  The quick brown Fox
      2  jumps over the lazy dog
      3  The brown fox jumps

    Term     Documents
    quick    1
    brown    1, 3
    fox      1, 3
    jumps    2, 3
    lazy     2
    dog      2
  14. 18.

    Some Lessons!
    • Indexing is really fast.
    • Use it alongside a separate canonical storage database.
    • Bulk index around 5 MB at a time.
    • Run the latest version of Oracle Java.
    • Define your schema.
    • OOM can be a problem.
    • Lots of facets = lots of memory.
    • IDs are not guaranteed to be unique with routing.
    • Don’t write Java plugins - hard to keep relevant.
    • Avoid using “Rivers” - use the Java API instead.
  15. 20.

    Python Integration
    • pyes - oldest, a bit hairy
    • pyelasticsearch - newer, nicer, low level
    • elasticutils - built on pyelasticsearch, feels ORM’y
    • django-haystack - Very easy integration with Django
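
The curl call on the "Indexing" slide can be approximated in plain Python. This is a minimal sketch using only the standard library, not one of the client libraries listed in the talk. `BASE_URL`, `build_index_request` and `send` are hypothetical names chosen for illustration, and the node address is assumed to be the local default used on the slides.

```python
import json
import urllib.request

# Assumption: a local node on the default HTTP port, as in the slides.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one document."""
    url = "{}/{}/{}/{}".format(BASE_URL, index, doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    # The 0.90 release shown in the talk did not require a Content-Type
    # header; later releases do, so it is set here to be safe.
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_index_request(
    "events", "talk", "123", {"title": "ElasticSearch: Introduction."}
)
# send(request) performs the call; it needs a running node, so it is
# left out of this sketch.
```

Building the request separately from sending it keeps the URL layout (index, type, ID) easy to inspect without a running server.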
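
The bool query from the "Query DSL" slides is just nested JSON, so it can be assembled from ordinary Python values. The helper below, `build_bool_query`, is a hypothetical name for illustration. The field names (`year`, `language`, `tag`) and the `minimum_number_should_match` key are taken from the slide as-is; newer releases may spell that key differently.

```python
import json


def build_bool_query(year_from, year_to, excluded_language, tags):
    """Return the slide's bool query as a plain dict (hypothetical helper)."""
    return {
        "bool": {
            # must: every hit has a year inside the range
            "must": {"range": {"year": {"from": year_from, "to": year_to}}},
            # must_not: drop documents with this language
            "must_not": {"term": {"language": excluded_language}},
            # should: optional clauses, at least one of which has to match
            "should": [{"term": {"tag": tag}} for tag in tags],
            "minimum_number_should_match": 1,
            "boost": 1.0,
        }
    }


query = build_bool_query(2011, 2013, "PHP", ["elasticsearch", "python"])
# A search request body wraps the query in a top-level "query" key.
search_body = json.dumps({"query": query}, indent=2)
print(search_body)
```

The resulting `search_body` string is what would be sent as the body of a request to the `_search` endpoint.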
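
The "Reverse Indexes" slide can be reproduced with a few lines of Python. This is a toy sketch, not how Lucene stores its index. It assumes the slide shows three documents ("The quick brown Fox", "jumps over the lazy dog", "The brown fox jumps"), a split inferred from the posting lists on the slide, and that "the" and "over" are dropped as stop words.

```python
from collections import defaultdict

# Assumption: "the" and "over" are treated as stop words, which is what
# the slide's table implies; a real analyzer is configurable.
STOP_WORDS = {"the", "over"}

# Assumption: document boundaries inferred from the slide's posting lists.
DOCUMENTS = {
    1: "The quick brown Fox",
    2: "jumps over the lazy dog",
    3: "The brown fox jumps",
}


def build_inverted_index(documents):
    """Map each lowercased term to the sorted ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            if term not in STOP_WORDS:
                postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_inverted_index(DOCUMENTS)
for term in ["quick", "brown", "fox", "jumps", "lazy", "dog"]:
    print(term, index[term])
```

Looking up a term is then a single dictionary access, which is why term queries against such an index are cheap compared with scanning every document.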
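
One of the lessons is to bulk index around 5 MB at a time. A rough way to do that is to group documents into bulk request bodies by encoded size. The sketch below assumes the slide's figure means megabytes and uses the newline-delimited action/source format of the bulk API; `bulk_bodies` and `MAX_BULK_BYTES` are hypothetical names, and the `_type` field matches the index/type/ID layout used in the talk.

```python
import json

# Assumption: the slide's figure means roughly 5 megabytes per bulk request.
MAX_BULK_BYTES = 5 * 1024 * 1024


def bulk_bodies(documents, index, doc_type, max_bytes=MAX_BULK_BYTES):
    """Yield newline-delimited bulk bodies no larger than max_bytes.

    documents is an iterable of (doc_id, source_dict) pairs. A single
    document larger than max_bytes is still emitted, alone in its own body.
    """
    lines, size = [], 0
    for doc_id, source in documents:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        entry = (json.dumps(action) + "\n" + json.dumps(source) + "\n").encode("utf-8")
        # Start a new body when adding this entry would exceed the limit.
        if lines and size + len(entry) > max_bytes:
            yield b"".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += len(entry)
    if lines:
        yield b"".join(lines)


# Usage with a small limit so the batching is visible.
docs = ((str(i), {"title": "Talk %d" % i}) for i in range(1000))
bodies = list(bulk_bodies(docs, "events", "talk", max_bytes=4096))
print(len(bodies), "bulk requests")
```

Each yielded body would be posted to the `_bulk` endpoint. Because the function is a generator, the full document set never has to sit in memory at once, which matters when indexing hundreds of millions of documents.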