Before Elasticsearch
• Logs, events, incidents and tickets scattered across disks and RDBMS stores.
• No easy way to fetch/search logs.
• Hard to scale.
• Untapped potential in log data.
DRUM ROLLS…… Log Management.
• Must be able to store massive amounts of data.
• Must be able to get data in at a very high rate and get data out at an acceptable rate.
• Should be schema-agnostic.
• Must be able to search/filter/analyze data.
• Must support MULTI-TENANCY.
• Distributed/fault-tolerant/load-balanced/etc.
Growth of the cluster, Nov '13 to Dec '14 (the two intermediate column dates were lost in extraction):

                     Nov '13                         Dec '14
Boxes/Nodes          14/14      28/28     28/56      128/128
Avg daily volume     500 M      1 B       2.5/3 B    10+ B
Total volume         ~10 B      ~100 B    ~200 B     >500 B & counting

Cores/RAM/Disk per box: 12 / 128 GB / 3 TB x 12, and 8 / 64 GB / 1 TB x 6 (two box profiles).
• Volume / Velocity / Variety / Veracity will affect your choices, and their interactions even more so.
• Select a proper base config for your nodes. Get the CPU cores x RAM x disk ratio right (a worked example follows this list).
• Decide on self-hosted vs. cloud-hosted.
• Prefer JBODs for data disks over RAID or SAN/NAS.
• Virtualization vs. bare metal: know the trade-offs.
• SSDs vs. spinning disks (speed vs. capacity).
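As a worked example of that ratio, take the second box profile from the growth table above: 8 cores / 64 GB RAM / 1 TB x 6 disks gives 8 GB of RAM per core and roughly 96 GB of raw disk per GB of RAM (6144 GB / 64 GB). Whether that balance is right depends on your retention window and query patterns, not on any universal rule.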
• Give your cluster a unique name.
• Use dedicated Master, Data and Client nodes.
• Use aliases from the get-go (a sketch follows this list).
• Keep all nodes in the same subnet. Use unicast discovery.
• Have enough memory for the JVM heap plus the FS cache.
• Tune kernel parameters and user/process/file/network limits.
• Always CHECK JVM version compatibility. Also, stick to the Oracle JVM.
• Learn the Query DSL and the Elasticsearch APIs.
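A minimal sketch of putting an alias in front of a physical index from day one; the index and alias names here are illustrative, not from the deck:

    # Point a stable alias at the physical index so applications
    # never hard-code the index name
    curl -XPOST 'http://localhost:9200/_aliases' -d '{
      "actions": [
        { "add": { "index": "logs-2014.12", "alias": "logs-current" } }
      ]
    }'

Later, rolling to a new index is a single atomic swap of the alias, with no application change.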
• Tune JVM params, but avoid going overboard.
• Tune network/connectivity parameters.
• Tune recovery parameters.
• Configure gateway parameters.
• Tune thread pools: bulk/index/search.
• Prefer bulk indexing.
• Tune caching parameters, especially field data (a sketch follows this list).
• In general, don't be afraid to tweak and tune until you hit the performance sweet spot.
• Having knowledge of text analytics on the team will go a long way.
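A hedged example of the kind of tuning meant here, assuming ES 1.4+, where thread pool and field data circuit-breaker settings can be changed dynamically; the values are placeholders, not recommendations:

    # Grow the bulk thread pool queue and cap field data memory use
    # via the cluster settings API
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "persistent": {
        "threadpool.bulk.queue_size": 500,
        "indices.breaker.fielddata.limit": "60%"
      }
    }'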
• Don't run Elasticsearch along with another service on the same box.
• Avoid vertical scaling, i.e. avoid running 2+ nodes per box.
• Don't grow a cluster beyond ~150 nodes. Deploy multiple clusters and use a Tribe node.
• Don't allow unrestricted/unsupervised querying unless you know the user base.
• Never send data for indexing or search queries to Master nodes (a quick topology check follows this list).
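One quick way to verify where traffic should and should not go, using the _cat API (available since ES 1.0):

    # Lists every node with its roles, so you can confirm which nodes
    # are master-eligible, data, or client before pointing traffic at them
    curl 'http://localhost:9200/_cat/nodes?v'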
• While bulk loading, set the refresh interval to -1 (a sketch follows this list). Send data directly to data nodes.
• Set aside more resources for the bulk thread pool.
• We saw a slight performance gain when using raw TCP over HTTP.
• Build a new index per month/week/day/hour(?) and use aliases.
• The number of indexes x shards x replicas affects not only storage but also memory.
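A sketch of that bulk-loading pattern, assuming ES 1.x and an illustrative index name:

    # Turn off refresh for the duration of a heavy load
    curl -XPUT 'http://localhost:9200/logs-2014.12/_settings' -d '{
      "index": { "refresh_interval": "-1" }
    }'

    # Index documents in batches via the _bulk endpoint
    # (one action line plus one source line per document, newline-delimited)
    curl -XPOST 'http://localhost:9200/logs-2014.12/_bulk' -d '
    { "index": { "_type": "event" } }
    { "@timestamp": "2014-12-01T10:00:00Z", "host": "app-01", "message": "disk usage at 91%" }
    { "index": { "_type": "event" } }
    { "@timestamp": "2014-12-01T10:05:00Z", "host": "app-02", "message": "disk usage back to 60%" }
    '

    # Restore a sane refresh interval once the load is done
    curl -XPUT 'http://localhost:9200/logs-2014.12/_settings' -d '{
      "index": { "refresh_interval": "1s" }
    }'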
• Avoid having a lot of "types" in a single index.
• Use mappings/templates for pre-defining field types (a sketch follows this list).
• Disable the '_all' field unless really needed. Avoid 'storing' fields outside of '_source'.
• Know which fields NOT to analyze. Decide which analyzer/token-filter/char-filter works best.
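A minimal index template sketch tying these points together (template, type and field names are illustrative): it applies to any logs-* index, disables '_all', and leaves the host field un-analyzed:

    curl -XPUT 'http://localhost:9200/_template/logs_template' -d '{
      "template": "logs-*",
      "mappings": {
        "event": {
          "_all": { "enabled": false },
          "properties": {
            "@timestamp": { "type": "date" },
            "host":       { "type": "string", "index": "not_analyzed" },
            "message":    { "type": "string", "analyzer": "standard" }
          }
        }
      }
    }'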
• Be careful with the 'query string' query.
• Know the difference between the "bool" and 'and/or/not' filters.
• Know the impact of faceting/aggregations/sorting on the field data cache for high-cardinality fields like "timestamp".
• For bulk searching, use 'scroll' (a sketch follows this list).
• Rely on explain/validate queries for performance tuning.
• Use Search Templates for simplified queries.
• Send searches to client nodes, not to data nodes, and never ever to Master nodes.
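A hedged sketch of scroll-based bulk searching against ES 1.x; the alias and filter are illustrative:

    # Open a scroll context that stays alive for 1 minute between fetches
    curl -XGET 'http://localhost:9200/logs-current/_search?scroll=1m' -d '{
      "size": 1000,
      "query": {
        "filtered": {
          "filter": { "term": { "host": "app-01" } }
        }
      }
    }'

    # Each response returns a _scroll_id; feed it back to get the next page
    curl -XGET 'http://localhost:9200/_search/scroll?scroll=1m&scroll_id=<scroll_id_from_previous_response>'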
• For monitoring we prefer Nagios, but Marvel works really well too.
• Use automated deployment / configuration management via Chef/Puppet/Salt/Ansible.
• Prefer to share a single config file across all node types.
• We retain raw data in HDFS for a year in case of data loss or a need to re-index. Data in Elasticsearch is retained for at most 90 days.
• Know when to use rolling upgrades vs. full upgrades (a sketch follows this list).
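One common rolling-upgrade step, sketched for ES 1.x (the technique is standard practice, not taken from the deck itself):

    # Pause shard allocation before bouncing a node, so the cluster
    # does not start rebalancing the moment the node goes down
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.enable": "none" }
    }'

    # ...upgrade and restart the node, then let allocation resume
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.enable": "all" }
    }'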
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License (CC-BY-ND 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-nd/4.0/ or send a letter to: Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.