
5 years of running Elasticsearch in production

At today's "Search Usergroup Berlin" ([1]) I gave this talk about how we operate Elasticsearch in production here at Infopark ([2]).

In this presentation I show our Elasticsearch cluster setup and the lessons we have learned over the years.

This presentation was prepared by Anne Schulz ([3]) and me ([4]). All cat pictures without a source reference are by Anne.

[1] https://www.meetup.com/de-DE/Search-UG-Berlin/events/239101829/
[2] https://infopark.com/
[3] https://twitter.com/AnneMoneSchulz
[4] https://twitter.com/_apepper

Alexander Pepper

May 30, 2017

Transcript

1. Index Size
   • ~8 million documents
   • ~45 GB data
   • ~300 search requests/min
   • ~120 index requests/min
2. Our history with Elasticsearch
   • 2011: started with version 0.17
   • 2014: migrated to 1.x (with new setup, regular maintenance and backups)
   • 2016: migrated to 2.x
3. Cluster Location
   • Amazon Web Services (AWS)
   • Region: eu-west-1 (Ireland)
   • Using AWS Elastic Compute Cloud (EC2)
   • Managed by AWS OpsWorks
   • Not accessible via the internet
4. 3x EC2 Instances
   • r3.xlarge instance type
   • CPU: Intel Xeon 2.5 GHz
   • RAM: 30 GB
   • Hard drive: 80 GB SSD
   • OS: Amazon Linux (based on Red Hat)
5. Cluster Discovery
   • External
   • Private instances inside a Virtual Private Cloud (VPC)
   • AWS Elastic Load Balancer (ELB) - only accessible from the VPC
   • API instances have access to the ELB
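To make the discovery setup a bit more concrete: a minimal sketch of the discovery-related elasticsearch.yml settings such a three-node VPC cluster might use, written from a Chef recipe. The cluster name and private IPs are placeholders, not our actual values, and the snippet assumes Elasticsearch 2.x with unicast discovery.

    # Sketch of a recipe fragment that writes the discovery-related settings.
    # IPs and cluster name are placeholders; in a real OpsWorks setup they would
    # come from the stack configuration instead of being hard-coded.
    es_nodes = %w[10.0.1.10 10.0.1.11 10.0.1.12]

    settings = [
      'cluster.name: production-search',
      'network.host: _site_',  # bind to the private (site-local) address
      "discovery.zen.ping.unicast.hosts: [#{es_nodes.join(', ')}]"
    ]

    file '/etc/elasticsearch/elasticsearch.yml' do
      owner   'elasticsearch'
      group   'elasticsearch'
      mode    '0644'
      content settings.join("\n") + "\n"
    end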
6. VPC Pitfalls
   • Network Address Translation (NAT) instance needed
   • Disable OpsWorks auto healing (for private instances)
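Auto healing is a per-layer OpsWorks setting; here is a one-off sketch of turning it off with the AWS SDK for Ruby. The layer ID is a placeholder and the snippet is illustrative, not part of our cookbooks.

    # Sketch: disable auto healing for the Elasticsearch layer, so OpsWorks does
    # not replace a private instance it merely cannot reach.
    require 'aws-sdk'  # aws-sdk v2

    opsworks = Aws::OpsWorks::Client.new(region: 'eu-west-1')

    opsworks.update_layer(
      layer_id:            'REPLACE-WITH-LAYER-ID',  # placeholder
      enable_auto_healing: false
    )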
7. Installation
   • OpsWorks uses Chef Cookbooks
   • Comparable to Ansible and Puppet
   • Standard Cookbooks from https://supermarket.chef.io
   • Custom Cookbooks
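For the cookbook dependencies, a Berksfile is one way to combine Supermarket cookbooks with custom ones; the cookbook names below are illustrative, not our exact list.

    # Berksfile (sketch): community cookbooks from the Chef Supermarket,
    # custom cookbooks from paths inside our own repository.
    source 'https://supermarket.chef.io'

    cookbook 'java'           # community
    cookbook 'elasticsearch'  # community
    cookbook 'custom-es-backup', path: 'cookbooks/custom-es-backup'  # custom (name is a placeholder)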
8. Packaging
   • On AWS Simple Storage Service (S3):
     • Cookbooks
     • Java
     • Elasticsearch
     • Elasticsearch plugins
9. Cookbooks
   • disable swappiness
   • mount data volume
   • install Java
   • install Elasticsearch (with Monit)
   • install Elasticsearch plugins (Kibana, Marvel, Sense, etc.)
   • install backups
   • install monitoring
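As an illustration of what such recipes look like, a sketch covering the first two bullets (disable swappiness, mount the data volume). The device name and mount point are placeholders, not our exact recipe.

    # Sketch of a custom recipe fragment for two of the steps above.

    # Disable swapping now and persist the setting across reboots
    execute 'disable-swappiness' do
      command 'sysctl -w vm.swappiness=0'
      not_if  'sysctl -n vm.swappiness | grep -qx 0'
    end

    file '/etc/sysctl.d/60-elasticsearch.conf' do
      content "vm.swappiness = 0\n"
      mode    '0644'
    end

    # Mount the SSD data volume where Elasticsearch keeps its indices
    # (device and mount point are placeholders)
    directory '/data/elasticsearch' do
      recursive true
    end

    mount '/data/elasticsearch' do
      device  '/dev/xvdb'
      fstype  'ext4'
      options 'noatime'
      action  [:mount, :enable]
    end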
10. Backup Cronjob
   • Ruby script
   • backup runs only on the master node
   • daily snapshots into a repository on AWS S3
   • 30 days data retention
   • snapshots from the 1st of each month: 365 days data retention
   • data retention via S3 lifecycle rules
   • hourly incremental backups
   • current size per day: 50 GB
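A minimal sketch of what such a cronjob can look like (not our actual script): it checks that it runs on the elected master, makes sure the S3 repository exists and then takes a snapshot. Host, repository and bucket names are placeholders, the hostname comparison assumes node names equal hostnames, and S3 repositories need the cloud-aws plugin on Elasticsearch 2.x.

    #!/usr/bin/env ruby
    # Sketch of a snapshot cronjob. Repository, bucket and host are placeholders.
    require 'net/http'
    require 'json'
    require 'socket'

    ES   = URI('http://localhost:9200')
    REPO = 'daily-s3-backup'

    def request(req)
      Net::HTTP.start(ES.host, ES.port) { |http| http.request(req) }
    end

    # Run only on the elected master (assumes node.name == hostname)
    master = request(Net::HTTP::Get.new('/_cat/master?h=node')).body.strip
    exit 0 unless master == Socket.gethostname

    # Ensure the S3-backed snapshot repository exists (idempotent)
    repo      = Net::HTTP::Put.new("/_snapshot/#{REPO}", 'Content-Type' => 'application/json')
    repo.body = { type: 's3', settings: { bucket: 'example-es-backups' } }.to_json
    request(repo)

    # Snapshots are incremental: only segments not yet in S3 are uploaded
    name = Time.now.utc.strftime('snapshot-%Y-%m-%d-%H')
    res  = request(Net::HTTP::Put.new("/_snapshot/#{REPO}/#{name}?wait_for_completion=true"))
    abort "snapshot failed: #{res.body}" unless res.is_a?(Net::HTTPSuccess)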
11. Restore
   • Ruby script
   • clones the OpsWorks stack
   • starts instances
   • restores the requested backup
   • Current runtime:
     • instance boot: ~7 min
     • restore snapshot: ~22 min
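Once the cloned stack is booted, the restore itself boils down to one call against the snapshot API. A sketch with placeholder repository and snapshot names (the OpsWorks cloning part is omitted):

    #!/usr/bin/env ruby
    # Sketch: restore the requested snapshot on the freshly booted cluster.
    require 'net/http'
    require 'json'

    snapshot = ARGV.fetch(0) { abort 'usage: restore.rb <snapshot-name>' }
    uri      = URI("http://localhost:9200/_snapshot/daily-s3-backup/#{snapshot}/_restore?wait_for_completion=true")

    req      = Net::HTTP::Post.new(uri.request_uri, 'Content-Type' => 'application/json')
    req.body = { include_global_state: false }.to_json  # restore the data indices only

    res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
    abort "restore failed: #{res.body}" unless res.is_a?(Net::HTTPSuccess)
    puts "restored #{snapshot}"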
12. Monitoring
   • Pingdom Server Monitoring (formerly known as Scout)
   • CPU
   • Disk space / open files
   • Memory / swap
   • Cluster status
   • Number of nodes
   • Backup ("Say cheese")
   • AWS ELB
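The Elasticsearch-specific checks (cluster status, number of nodes) come down to the cluster health API. A sketch of such a check, assuming three expected nodes and a locally reachable node to query:

    #!/usr/bin/env ruby
    # Sketch of a cluster check: status must be green and all three nodes present.
    require 'net/http'
    require 'json'

    health = JSON.parse(Net::HTTP.get(URI('http://localhost:9200/_cluster/health')))

    problems = []
    problems << "cluster status is #{health['status']}"   unless health['status'] == 'green'
    problems << "only #{health['number_of_nodes']} nodes" unless health['number_of_nodes'] == 3

    abort "CRITICAL: #{problems.join(', ')}" unless problems.empty?
    puts 'OK'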
13. Maintenance
   • Quarterly
   • Check for new versions:
     • OS
     • Cookbooks
     • Java
     • Elasticsearch
     • Plugins (Kibana, Marvel, etc.)
14. Maintenance
   • Check restore
   • Full reindex
   • For another product: snapshot restore + partial reindex
15. Pitfalls
   • Minimum Master Nodes
   • 50% of RAM for Elasticsearch
   • VPC: Network Address Translation (NAT) instance needed
   • Private VPC instances: disable OpsWorks auto healing
   • OpsWorks: start Elasticsearch via Monit
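To make the first two pitfalls concrete for this cluster (three master-eligible nodes, 30 GB RAM), here is a small worked sketch; the resulting values follow from the setup described above, but the snippet itself is illustrative.

    # Sketch: applying the two sizing rules from the pitfalls above.
    master_eligible_nodes = 3
    ram_gb                = 30

    # Quorum to avoid split brain: floor(n / 2) + 1 => 2 for three nodes
    minimum_master_nodes = master_eligible_nodes / 2 + 1

    # Give Elasticsearch at most half of the RAM; the rest is left to the
    # OS page cache that Lucene depends on.
    heap = "#{ram_gb / 2}g"  # "15g", exported as ES_HEAP_SIZE on 1.x/2.x

    puts "discovery.zen.minimum_master_nodes: #{minimum_master_nodes}"  # => 2
    puts "ES_HEAP_SIZE=#{heap}"                                         # => 15g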
16. Picture Sources
   • https://www.flickr.com/photos/sigalrm/31560595165/
   • https://www.flickr.com/photos/selda_eigler/8686009651/
   • https://www.flickr.com/photos/aon/7817771968/
   • https://www.flickr.com/photos/nathanf/2314676429/
   • https://www.flickr.com/photos/renarl/3400468165
   • https://www.flickr.com/photos/aon/6272938468/
   • https://www.flickr.com/photos/muratlivaneli/6104145120
   • https://www.flickr.com/photos/30884177@N08/4107269864/
   • https://www.flickr.com/photos/aon/7817811212/
   • https://www.flickr.com/photos/29278394@N00/4689679306/
   • https://www.flickr.com/photos/pustovit/15867520885/