
elasticsearch @ ferret go

Short presentation on how we use (and *failed* to use properly) elasticsearch at ferret go, a media analysis startup. Given at the ES User Group Berlin, November 2012.

Fabian Neumann

November 27, 2012

Transcript

  1. elasticsearch @ ferret go
     ES UG Berlin Meetup, 2012-11-27
     Fabian Neumann (@hellp), Daniel Trümper (@truemped)
  2. "ferret go" -- THE PROJECT * media analysis * online,

    print, social * -> rss/atom -> storm -> ES -> web app * linguistics (sentiment, entity recognition etc.) * Also:redis, Python, Pyramid, ...
  3. "ferret go" -- THE LOCATION * Bernau b. Berlin, Brandenburg

    * Zickenschulze (German food) * No (good) Asian food * Rollator races * Like Kreuzberg without the fancy
  4. THE BLACK WEEK
     * WHY suddenly?
     * more data (1 October = 4 Julys)
     * moving indexes; (bulk) re-indexing
     * more users
     * long-term queries now even more long-term
     * config-/brain-less ES setup (which is nice!) only worked for so long
  5. ES SETUP
     * 2 indices
     * 6 data nodes (i7, 8 cores, 32G mem, 16G for ES)
     * each index: 12 shards
     * 3 replicas = 72 shards per index (too much, we know ...)
  6. SENSIBLE SHARD-BALANCING?
             INDX 1 (▒)      INDX 2 (▓)
     NODE 1  ▒ ▒ ▒ ▒ ▒       ▓
     NODE 2  ▒ ▒             ▓ ▓ ▓ ▓
     NODE 3  ▒               ▓ ▓ ▓ ▓ ▓
     NODE 4  ▒ ▒ ▒ ▒ ▒       ▓
     ...
     shard sizes: ▒ = 12G, ▓ = 0.5G
  7. SENSIBLE SHARD-BALANCING?
             INDX 1 (▒)      INDX 2 (▓)
     NODE 1  ▒ ▒ ▒ ▒ ▒       ▓
     NODE 2  ▒ ▒             ▓ ▓ ▓ ▓
     NODE 3  ▒               ▓ ▓ ▓ ▓ ▓
     NODE 4  ▒ ▒ ▒ ▒ ▒       ▓
     ...
     ^-- INDX 1 also gets the more complex queries
     shard sizes: ▒ = 12G, ▓ = 0.5G
  8. SENSIBLE LOAD-BALANCING?
     NODE 1  ▒ ▒ ▒ ▒ ▒   <-
     NODE 2  ▒ ▒         <-
     NODE 3  ▒           <-
     NODE 4  ▒ ▒ ▒ ▒ ▒   <-

     > import pyes
     > # All nodes in a list, passed to urllib3 PoolManager,
     > # free load-balancing, yay!
     > conn = pyes.ES([node1, node2, node3, node4])
     > res = conn.search(query_model.to_es_query())
     > return res
  9. SENSIBLE LOAD-BALANCING?
     NODE 1  ▒ ▒ ▒ ▒ ▒   <- <- <- <- <- <- <-
     NODE 2  ▒ ▒         <-
     NODE 3  ▒
     NODE 4  ▒ ▒ ▒ ▒ ▒

     > import pyes
     > # All nodes in a list, passed to urllib3 PoolManager,
     > # free load-balancing, yay! NOT! 3 are just fallback. Oops.
     > conn = pyes.ES([node1, node2, node3, node4])

     "The PoolManager will take care of reusing connections for you
     whenever you request the same host."
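
     The gotcha above: urllib3's PoolManager keys its connection pools by
     host, so one pyes.ES over a host list keeps talking to the first host
     and treats the rest as mere fallback. Spreading the load client-side
     means giving each request a different single-host connection. A minimal
     sketch of per-request rotation; the node names are placeholders, not
     the real topology:

        import itertools
        import pyes

        # One single-host connection per node; cycling through them spreads
        # the queries, because urllib3 reuses its pool per host.
        nodes = ["node1:9200", "node2:9200", "node3:9200", "node4:9200"]
        conns = itertools.cycle([pyes.ES(node) for node in nodes])

        def search(query):
            return next(conns).search(query)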
  10. SENSIBLE NODE CONFIGURATION?
      NODE 1  ▒ ▒ ▒ ▒ ▒   /(x.x)\  <-- JVM
      NODE 2  ▒ ▒
      NODE 3  ▒
      NODE 4  ▒ ▒ ▒ ▒ ▒   /(x.x)\

      $ grep cache /etc/elasticsearch/elasticsearch.yml
      $ (hey, that looked like /dev/null ...)
      $ grep OutOfMemoryErr /var/log/elasticsearch/heck.log | wc -l
      1337
      $ # ... or rather n00b
  11. SENSIBLE NODE CONFIGURATION?
      NODE 1  ▒ ▒ ▒ ▒ ▒   \(^.^)/  <-- JVM
      NODE 2  ▒ ▒
      NODE 3  ▒
      NODE 4  ▒ ▒ ▒ ▒ ▒   \(^.^)/

      $ grep cache /etc/elasticsearch/elasticsearch.yml
      index.cache.field.type: soft
      $ grep OutOfMemoryErr /var/log/elasticsearch/heck.log | wc -l
      0
      $ # much better
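
      The one-liner works because index.cache.field.type: soft keeps the
      field (facet) cache behind Java soft references, which the JVM may
      evict under memory pressure instead of throwing OutOfMemoryError.
      That ES generation also had companion settings to bound the cache
      outright; the values below are illustrative, not a recommendation:

        # /etc/elasticsearch/elasticsearch.yml
        index.cache.field.type: soft       # cache evictable by the GC
        index.cache.field.max_size: 50000  # cap the number of cache entries
        index.cache.field.expire: 10m      # expire idle entries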
  12. MANUALLY BALANCED SHARDS
              INDX 1 (▒)   INDX 2 (▓)
      NODE 1  ▒ ▒          ▓ ▓ ▓
      NODE 2  ▒ ▒          ▓ ▓ ▓
      NODE 3  ▒ ▒          ▓ ▓ ▓
      NODE 4  ▒ ▒          ▓ ▓ ▓
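
      ES balanced by shard count, not shard size, so a layout like this had
      to be arranged by hand. One way to script such moves is the
      _cluster/reroute API; a sketch with invented index, shard and node
      names:

        import json
        import urllib2  # Python 2, matching the pyes-era code above

        # Move shard 0 of index "indx1" between two (hypothetical) nodes.
        body = {"commands": [{"move": {
            "index": "indx1", "shard": 0,
            "from_node": "node1", "to_node": "node3",
        }}]}
        req = urllib2.Request("http://node1:9200/_cluster/reroute",
                              json.dumps(body))  # Request with data => POST
        print(urllib2.urlopen(req).read())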
  13. NO-DATA NODES FOR LOAD-BALANCING
              INDX 1 (▒)   INDX 2 (▓)
      NODE 1  ▒ ▒          ▓ ▓ ▓      <- <-
      NODE 2  ▒ ▒          ▓ ▓ ▓      <- <-
      NODE 3  ▒ ▒          ▓ ▓ ▓      <- <-
      NODE 4  ▒ ▒          ▓ ▓ ▓      <- <-
      NODE 5  (no data)                <- <- <- <- <- <- <- <- <-
      NODE 6  (no data)                <- <- <- <- <- <- <-
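
      A no-data node is a one-line setting: the node still joins the
      cluster and scatters/gathers search requests, but never hosts
      shards. In the style of the config snippets above, on NODE 5 and
      NODE 6:

        # /etc/elasticsearch/elasticsearch.yml
        node.data: false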
  14. no-data nodes :) free LB, as easy as HAProxy
      still too many shards :/
      <- ... and still too many queries/s :(
  15. NEXT STEPS -- TECH LEVEL
      * time slicing (flexibility in shard/index layout) -- see the sketch below
      * request/shard routing (but no good routing criteria yet)
      * further config optimizations (flush/refresh intervals etc.)
      * smoother recovery phases
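
      Time slicing here means one index per time period (say, per month),
      so a query only touches the slices its date range covers, and old
      slices can be dropped or re-sharded wholesale. A minimal sketch of
      the idea; the "articles-YYYY.MM" naming scheme is invented for
      illustration:

        from datetime import date

        def monthly_indices(start, end, prefix="articles"):
            # One index per month, e.g. "articles-2012.11".
            names = []
            y, m = start.year, start.month
            while (y, m) <= (end.year, end.month):
                names.append("%s-%04d.%02d" % (prefix, y, m))
                y, m = (y + 1, 1) if m == 12 else (y, m + 1)
            return names

        # A query covering Oct-Nov 2012 hits two small indices
        # instead of the whole dataset:
        print(monthly_indices(date(2012, 10, 1), date(2012, 11, 27)))
        # -> ['articles-2012.10', 'articles-2012.11']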
  16. NEXT STEPS -- APP LEVEL
      * less query load (e.g. re-implement the clustering process)
      * query optimizing (never cover the whole index, good, right?) -- see the sketch below
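
      "Never cover the whole index" mostly means constraining every query
      by date. In the query DSL of that ES generation this is a filtered
      query wrapping the user's query with a range filter; the field name,
      index name and date window are invented for illustration:

        import json
        import urllib2  # Python 2, matching the pyes-era code above

        query = {"query": {"filtered": {
            "query": {"query_string": {"query": "some topic"}},
            # The range filter keeps the search inside one week of data.
            "filter": {"range": {"published": {
                "from": "2012-11-20", "to": "2012-11-27"}}},
        }}}
        req = urllib2.Request("http://node1:9200/articles/_search",
                              json.dumps(query))
        print(urllib2.urlopen(req).read())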
  17. AFTERMATH -- USER GROUP INSIGHTS
      * some problems are known to the ES core devs
      * some will be fixed
      * ferret is a faceting-heavy app, which uses lots of memory; we need to be more careful about that
      * JVM choice matters
      * to avoid many growing pains, read this:
        http://asquera.de/opensource/2012/11/25/elasticsearch-pre-flight-checklist/