Supporting Dynamic Growth with the Elastic Stack - WIDS 2016

World Internet Developer Summit 2016 - Supporting Dynamic Growth with the Elastic Stack

Elastic Co

June 07, 2016

Transcript

  1. What is the ‘Show Me The Movies’ offering?

    • A SaaS offering for media providers to enable streaming of their content
    • Each media provider will have video and metadata stored in the system
    • Analytical data on videos streamed will also be stored and offered to media providers
    • Our service provides a public API for media providers to use the metadata and analytical data to build their solutions
  2. As our business grows, how does this affect our system?

    • Increased traffic / load to our public APIs
    • Increased storage of media metadata
    • Increased storage of analytical data
    • Increased reliance on system stability
  3. Types and Mapping: Example Mapping

    { "person" : { "properties" : { "name" : { "properties" : { "first" : { "type" : "string" } } } } } }
  4. How do we know how much data a shard can hold?

    Let’s talk about sizing our cluster
  5. What’s the best cluster configuration? Well, it depends

    1. Start with a single shard
    2. Ingest the data we expect for production into a test cluster
    3. Performance test the application and evaluate metrics to find the upper limit of a single shard
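
Once a single-shard limit has been measured this way, the shard count for an index becomes simple arithmetic. The sketch below is only an illustration: it assumes the limit is expressed as a data size in GB (the slides do not say which metric was used), and the `headroom` factor is a hypothetical allowance for growth, not a figure from the talk.

```python
import math


def primary_shards_needed(expected_gb, single_shard_limit_gb, headroom=1.0):
    """Estimate primary shards from a measured single-shard limit.

    expected_gb: data volume we expect the index to hold (assumed unit: GB).
    single_shard_limit_gb: upper limit found by performance testing one shard.
    headroom: hypothetical growth multiplier, e.g. 1.5 for 50% spare capacity.
    """
    if single_shard_limit_gb <= 0:
        raise ValueError("single_shard_limit_gb must be positive")
    # Round up: a partially filled shard is still a whole shard.
    return max(1, math.ceil(expected_gb * headroom / single_shard_limit_gb))


if __name__ == "__main__":
    print(primary_shards_needed(100, 30))
    print(primary_shards_needed(100, 30, headroom=1.5))
```

A modest `headroom` corresponds to the "little overallocation" the deck recommends; very large values drift towards the many-shards problem it warns about.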
  6. We found our shard limits, now how can we scale for the future?

    Talking shard overallocation
  7. “I don’t know how big this is going to be, and I can’t change the index size later on, so to be on the safe side, I’ll just give this index 1,000 shards…”

    New Elasticsearch user
  8. The ‘Kagillion Shard’ Problem

    • A shard is a Lucene index under the covers, which uses file handles, memory, and CPU cycles.
    • Every search request needs to hit a copy of every shard in the index.
    • Term statistics, used to calculate relevance, are per shard. Having a small amount of data in many shards leads to poor relevance.

    A little overallocation is good. A kagillion shards is bad.
  9. Index Per Media Provider

    • Easy to implement by appending the media provider ID to the index name
    • Data is easily separated between media providers
    • Indexes can have specific shards and replicas to deal with the amount of data for the media provider
    • Works really well while the number of media providers is small
  10. Index Per Media Provider: Example creating 3 indexes for 3 Media Providers

    PUT /media_provider_1 { "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }
    PUT /media_provider_2 { "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }
    PUT /media_provider_3 { "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }
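
The three requests above follow one pattern, so they could be generated from the provider ID. This is a minimal sketch that only builds the method, path and body; the function name `provider_index_request` is hypothetical, and sending the request is left to whatever HTTP or Elasticsearch client is in use.

```python
def provider_index_request(provider_id, shards=1, replicas=1):
    """Build the create-index request for one media provider.

    Returns (method, path, body). The index name is the provider ID appended
    to the "media_provider_" prefix, as on the slide.
    """
    path = "/media_provider_{}".format(provider_id)
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return "PUT", path, body


if __name__ == "__main__":
    for pid in (1, 2, 3):
        print(provider_index_request(pid))
```

Passing different `shards` and `replicas` values per provider is how the "specific shards and replicas" point from the previous slide would show up in practice.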
  13. Single Shared Index

    • Resources can be dedicated to the single index
    • Utilising filters and aliases, we can separate media providers’ data based on their ID
    • Define routing to increase search performance
    • Able to easily move heavy users off the shared index to their own dedicated index
  14. Single Shared Index: New Media Providers Index

    PUT /media_providers { "settings": { "number_of_shards": 5, "number_of_replicas": 1 } }
  15. Single Shared Index: Add new metadata for Media Provider

    PUT /media_providers/media/1 { "media_provider_id": 123, "title": "Zootopia" }
  16. Single Shared Index: Alias

    POST /_aliases { "actions": [ { "add": { "alias": "media_provider_1", "index": "media_providers" } } ] }
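
The alias on this slide points at the whole shared index. The filters and routing mentioned on slide 13 are not shown in the deck, so the following is a sketch of how they might be added, using the `filter` and `routing` options of the alias `add` action. It assumes documents carry the `media_provider_id` field from slide 15 and that the provider ID is also used as the routing value; routing through an alias only helps if documents were indexed with that same routing value.

```python
def provider_alias_action(provider_id, index="media_providers"):
    """Build an alias 'add' action scoped to one media provider.

    Assumptions (not from the slides): the alias filters on the
    media_provider_id field and routes on the provider ID as a string.
    """
    return {
        "add": {
            "index": index,
            "alias": "media_provider_{}".format(provider_id),
            # Only documents belonging to this provider are visible via the alias.
            "filter": {"term": {"media_provider_id": provider_id}},
            # Searches via the alias go only to the shard(s) for this routing value.
            "routing": str(provider_id),
        }
    }


if __name__ == "__main__":
    import json
    print(json.dumps({"actions": [provider_alias_action(1)]}, indent=2))
```

The printed body would be sent to `POST /_aliases` in place of the unfiltered action above.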
  17. Separating Out Large Media Providers

    • We can create a separate index specifically for the large media provider
    • Migrating data from the existing shared index can be done using a scroll query and the bulk API
    • We can then set an index alias to point to the new index
  18. Separating Out Large Media Providers

    PUT /big_media_provider_1 { "settings": { "number_of_shards": 2, "number_of_replicas": 1 } }
    POST /_aliases { "actions": [
      { "remove": { "alias": "media_provider_1", "index": "media_providers" } },
      { "add": { "alias": "media_provider_1", "index": "big_media_provider_1" } }
    ] }
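
The deck names the scroll query and the bulk API for the migration but does not show that step. As a rough sketch, one page of scroll results can be turned into a bulk request body like this. The helper name `hits_to_bulk_body` is hypothetical, the hit shape (`_id`, optional `_type`, `_source`) is the standard search response format, and running the scroll and posting to `/_bulk` are left to the client in use.

```python
import json


def hits_to_bulk_body(hits, target_index):
    """Convert one page of scroll hits into a newline-delimited bulk body.

    Each document becomes two lines: an 'index' action line followed by the
    document source. Every line, including the last, ends with a newline.
    """
    lines = []
    for hit in hits:
        meta = {"_index": target_index, "_id": hit["_id"]}
        # Keep the mapping type only when the hit reports one.
        if "_type" in hit:
            meta["_type"] = hit["_type"]
        lines.append(json.dumps({"index": meta}))
        lines.append(json.dumps(hit["_source"]))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    page = [
        {"_id": "1", "_type": "media",
         "_source": {"media_provider_id": 123, "title": "Zootopia"}},
    ]
    print(hits_to_bulk_body(page, "big_media_provider_1"), end="")
```

Doing the alias switch only after every scroll page has been copied means readers of `media_provider_1` never see a half-populated index.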
  19. Time based indices

    • A single large index would run out of space and resources quickly
    • Time based indexes are created for a specific time period, e.g. monthly
    • You can change the index configuration much more quickly, as new indexes are created more often
    • Use of aliases makes querying and ingesting data easy
    • You’re able to remove old data (old indexes) easily to save resources
    • Elastic Curator https://github.com/elastic/curator
  20. Time based indices: Creating a time based index

    PUT /logs_2016-06 { "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }
  21. Time based indices: Aliases

    POST /_aliases { "actions": [
      { "add": { "alias": "logs_current", "index": "logs_2016-06" }},
      { "remove": { "alias": "logs_current", "index": "logs_2016-05" }},
      { "add": { "alias": "last_month", "index": "logs_2016-05" }},
      { "remove": { "alias": "last_month", "index": "logs_2016-04" }}
    ] }
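
The same four actions recur every month with the dates shifted by one, so they could be generated instead of written by hand. A minimal sketch, assuming the `logs_YYYY-MM` naming and the `logs_current` / `last_month` aliases from the slide; the function names are hypothetical.

```python
def month_index(year, month, prefix="logs"):
    """Index name for a given month, e.g. logs_2016-06."""
    return "{}_{:04d}-{:02d}".format(prefix, year, month)


def previous_month(year, month):
    """Return (year, month) for the month before the given one."""
    return (year - 1, 12) if month == 1 else (year, month - 1)


def rollover_actions(year, month):
    """Alias actions that move logs_current and last_month forward one month."""
    prev = previous_month(year, month)
    prev2 = previous_month(*prev)
    return [
        {"add": {"alias": "logs_current", "index": month_index(year, month)}},
        {"remove": {"alias": "logs_current", "index": month_index(*prev)}},
        {"add": {"alias": "last_month", "index": month_index(*prev)}},
        {"remove": {"alias": "last_month", "index": month_index(*prev2)}},
    ]


if __name__ == "__main__":
    import json
    print(json.dumps({"actions": rollover_actions(2016, 6)}, indent=2))
```

Because all actions in one `POST /_aliases` request are applied atomically, the two aliases never point at the wrong month in between.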
  22. Ingest Pipeline - coming in 5.0

    • Elasticsearch will have an ingest node, which will do document enrichment
    • Great for processing simple documents with some enrichment
    • Logstash should still be used for more complex document enrichment
  23. Time based indexes for analytical data

    • Analytical data generally is associated with a timestamp
    • Allows us to query easily over a certain time period without the overhead of querying all data
    • Example: How many plays of a particular media item over the last day?
    • We can easily remove old data that we no longer offer to our media providers
    • We only want to offer the last 3 years of analytical data
  24. Time based indexes for analytical data: Example time based index for our analytical data

    PUT /media_plays_01_06_2016 { "settings": { "number_of_shards": 1, "number_of_replicas": 1 } }
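
With one index per period, the three-year retention rule from the previous slide reduces to comparing dates encoded in index names. The sketch below rests on two assumptions the slides do not confirm: that the suffix is day_month_year (so `01_06_2016` is 1 June 2016) and that each index holds a single day of data. The function names are hypothetical, and a tool such as Curator, mentioned earlier in the deck, covers the same need.

```python
from datetime import date, datetime

PREFIX = "media_plays_"


def index_date(name):
    """Parse the date from an index name, assuming a DD_MM_YYYY suffix."""
    return datetime.strptime(name[len(PREFIX):], "%d_%m_%Y").date()


def expired_indexes(names, today, keep_years=3):
    """Return the index names dated before the retention cutoff."""
    try:
        cutoff = today.replace(year=today.year - keep_years)
    except ValueError:
        # today is 29 February and the cutoff year is not a leap year.
        cutoff = today.replace(year=today.year - keep_years, day=28)
    return [
        name for name in names
        if name.startswith(PREFIX) and index_date(name) < cutoff
    ]


if __name__ == "__main__":
    names = ["media_plays_01_06_2016", "media_plays_01_06_2013"]
    print(expired_indexes(names, date(2016, 6, 7)))
```

Each name returned could then be removed with a single `DELETE /<index>` request, which is far cheaper than deleting individual documents from one large index.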
  25. Kibana

    • Visualise all your data
    • Seamless integration into Elasticsearch
    • Build sophisticated dashboards for analytics
    • Visually interact with Elasticsearch
  27. • Security (formerly Shield)

    • Alerting (formerly Watcher)
    • Monitoring (formerly Marvel)
    • Graph
    • Reporting