From Hackathon to Production: Elasticsearch @ Facebook

Elasticsearch @ Facebook: From Hackathon to Production
Peter Vulgaris, March 11, 2015

In The Beginning: Google Search Appliance
▪ Sits in a rack, takes URLs and scrapes them
▪ Basically a black box
▪ No structured data other than rendered HTML pages
▪ No selective boosting
▪ Not really the hacker way

Also In The Beginning: Apache Solr
▪ No built-in way to handle querying multiple indices
▪ Manual sharding
▪ Verbose XML queries :(
▪ Built tools to handle what Elasticsearch does out of the box

Why Elasticsearch? It's easy to hack on
▪ Same power of Lucene with simpler REST + JSON interface
▪ Quick to get it up and running
▪ Automatic replication and rebalancing
▪ Great community

2012
▪ 1 cluster (0.18)
▪ 1 application (internal search)
▪ Tens of thousands of docs

2015
▪ 7 clusters (that we know of, 1.2+)
▪ Dozens of applications
▪ 100+ nodes in multiple datacenters
▪ ~4 billion documents
▪ 4+ TB data
▪ 1500+ QPS (for WWW at least)
▪ 1 common deployment infrastructure
▪ 2-3 indexing frameworks

Goals for Elasticsearch: Products
▪ Engineer comes in knowing nothing about search
▪ Able to index docs and add search ASAP
▪ Path to more advanced usage

Goals for Elasticsearch: Infrastructure
▪ Spin up a new cluster with nodes in multiple DCs in minutes with common settings
▪ Survive the "storms"
▪ Transparent transitions for clients to new clusters
▪ Management, logging, alarms, etc.

Help Community
▪ fb.com/help/community
▪ Indexes and searches through user-generated questions and answers
▪ 211 QPS to the cluster

Threat Exchange
▪ threatexchange.fb.com
▪ Platform for distributing threat intelligence
▪ 4M malware scans/week and just getting started

Getting Started With Elasticsearch: The Old Way
▪ "I don't know anything about search, but I'm tasked with adding it to my tool. Help!" -Engineer
▪ "Good luck!" -Me
▪ Lucene in Action
▪ elasticsearch.org

Getting Started With Elasticsearch: The New Way
▪ Copy/paste documentation
▪ Sandbox environments
▪ Sample settings/mappings
▪ Indexing framework
▪ One config per index or type
▪ Scheduling, consistency, live updates, retries, etc.
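The slide names what the indexing framework takes care of but not how. As a rough illustration only, "one config per index or type" plus retries could look like the sketch below. Every name here (`IndexConfig`, `run_indexer`, `send_bulk`, `fetch_docs`) is hypothetical and not taken from the talk; scheduling, consistency and live updates are left out.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class IndexConfig:
    """Hypothetical per-index config: where docs come from and how to retry."""
    index: str
    doc_type: str
    fetch_docs: Callable[[], Iterable[Dict]]  # pulls docs from the backing store
    batch_size: int = 500
    max_retries: int = 3
    backoff_seconds: float = 1.0


def batches(docs: Iterable[Dict], size: int) -> Iterable[List[Dict]]:
    """Yield lists of at most `size` docs."""
    batch: List[Dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_indexer(config: IndexConfig,
                send_bulk: Callable[[str, str, List[Dict]], None],
                sleep: Callable[[float], None] = time.sleep) -> int:
    """Index every doc for one config and return how many were sent.

    `send_bulk(index, doc_type, docs)` is an assumed callable that raises on
    failure. A failed batch is retried with exponential backoff; after
    `max_retries` retries the last error is re-raised.
    """
    indexed = 0
    for batch in batches(config.fetch_docs(), config.batch_size):
        for attempt in range(config.max_retries + 1):
            try:
                send_bulk(config.index, config.doc_type, batch)
                indexed += len(batch)
                break
            except Exception:
                if attempt == config.max_retries:
                    raise
                sleep(config.backoff_seconds * (2 ** attempt))
    return indexed
```

With this shape, an engineer adding search to a tool would only write the `IndexConfig`; the batching and retry behavior stays in shared code.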

Getting Started With Elasticsearch: The New Way, Continued
▪ Finding stuff is still a little wild west
▪ Google-y query string query
▪ Elastica PHP library
▪ This tends to be the easier bit
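For readers who have not seen a query string query, here is a minimal sketch of the request body it takes. The talk mentions the Elastica PHP library; this example uses plain Python dictionaries instead, and the field names (`title`, `body`, `owner`) are made up for illustration.

```python
import json
from typing import Dict, List, Optional


def query_string_body(user_input: str,
                      fields: Optional[List[str]] = None,
                      size: int = 10) -> Dict:
    """Build a search body that hands the user's text to `query_string`.

    The query text may use the Lucene-style syntax (field:value, AND/OR,
    quoted phrases), which is what makes it feel "Google-y".
    """
    query: Dict = {"query": user_input, "default_operator": "AND"}
    if fields:
        # "title^3" boosts matches in the (hypothetical) title field.
        query["fields"] = fields
    return {"size": size, "query": {"query_string": query}}


body = query_string_body('"code review" OR owner:pv', fields=["title^3", "body"])
print(json.dumps(body, indent=2))
```

The resulting JSON would be sent as the body of a `_search` request against the index.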

Ramping Up With Elasticsearch: Old Way vs New Way
▪ No longer spending entire internships adding indexing and search to products
▪ More tools teams have a “cool search guy” (quote is mine)
▪ More adoption and spreading to product teams
▪ Bottom line: less time learning Elasticsearch and more time searching

Intern Search: Putting it all together
▪ Configs for wiki, dex, tasks, code, employees, etc.
▪ CTR, bounce rates and pins
▪ Query string query
▪ A/B testing
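The slide does not say how CTR is computed or how A/B groups are assigned. A common approach, sketched here purely as an assumption, is to hash the user and experiment name into a stable bucket and compare click-through rate per bucket. The function names and the two-bucket split are hypothetical.

```python
import hashlib
from typing import Sequence


def ab_bucket(user_id: str, experiment: str,
              buckets: Sequence[str] = ("control", "test")) -> str:
    """Assign a user to a bucket deterministically, so repeat visits match."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return buckets[int(digest, 16) % len(buckets)]


def click_through_rate(searches: int, searches_with_click: int) -> float:
    """Fraction of searches that led to at least one result click."""
    return searches_with_click / searches if searches else 0.0
```

Including the experiment name in the hash keeps bucket assignments independent across experiments, so a user in `test` for one ranking change is not automatically in `test` for the next.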

Single Cluster: LOL
▪ Just one cluster for all projects
▪ Pros:
    ▪ Simpler migrations to new versions for Elasticsearch
    ▪ Easier to debug issues
    ▪ Faster ramp-up for other teams
▪ Cons:
    ▪ When the cluster goes down...

Multiple Clusters: AKA Get More Sleep
▪ Engineers are dangerous
▪ Engineers don't want to worry about usage quotas
▪ Move fast and over-index your data
▪ Add search, head home and crack open a beer
▪ How do we scale this?

Deployment: Tupperware
▪ Simple config
▪ LXC containers
▪ Add/remove nodes
▪ Health checks
▪ Automatic node replacement
▪ Log aggregation

Monitoring: Scuba
▪ Free with config
▪ Watch CPU, heap, requests by endpoint, etc.
▪ Alarms
▪ Why not Marvel?

Dealing With Disaster

Multiple Datacenters: Redundancy
▪ Masters need low latency
▪ Still experimenting
▪ Data nodes in multiple datacenters
▪ Needs SHIELD-ing

Rebuilding Indices: AKA Disaster Recovery
▪ Cron job to save settings and mappings with version control
▪ Indexer configs can rebuild most core indices
▪ Product-specific fallbacks
▪ Full-text query on backing DB
▪ Watch for shard failures
▪ We need snapshots
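The cron job itself is not shown in the talk. A minimal sketch of the idea: fetch `/<index>/_settings` and `/<index>/_mapping` for each index and write them as stable, sorted JSON files into a directory that is under version control. The `fetch_json` callable, the file naming and the fake responses below are assumptions for illustration; the commit step is left to whatever wraps this script.

```python
import json
import os
import tempfile
from typing import Callable, Dict, Iterable, List


def save_index_metadata(indices: Iterable[str],
                        fetch_json: Callable[[str], Dict],
                        out_dir: str) -> List[str]:
    """Write each index's settings and mappings to `out_dir`; return the paths.

    `fetch_json(path)` is an assumed helper that performs the HTTP GET against
    the cluster and returns the parsed JSON response.
    """
    os.makedirs(out_dir, exist_ok=True)
    written: List[str] = []
    for index in indices:
        for kind in ("_settings", "_mapping"):
            data = fetch_json(f"/{index}/{kind}")
            path = os.path.join(out_dir, f"{index}{kind}.json")
            with open(path, "w", encoding="utf-8") as handle:
                # Sorted keys keep diffs small between runs.
                json.dump(data, handle, indent=2, sort_keys=True)
                handle.write("\n")
            written.append(path)
    return written


# Usage with canned responses standing in for a real cluster.
fake_cluster = {
    "/wiki/_settings": {"wiki": {"settings": {"index": {"number_of_shards": "5"}}}},
    "/wiki/_mapping": {"wiki": {"mappings": {}}},
}
with tempfile.TemporaryDirectory() as tmp:
    paths = save_index_metadata(["wiki"], fake_cluster.__getitem__, tmp)
    names = [os.path.basename(p) for p in paths]
    with open(paths[0], encoding="utf-8") as handle:
        saved_settings = json.load(handle)
```

Together with the indexer configs, these files are enough to recreate an empty index with the right settings and then refill it from the backing store.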

Migrations
▪ Attempt #1: 0.18 -> 0.19
    ▪ Shut down cluster, back up data, build new version and restart
    ▪ Oops, only partial data copy
    ▪ Corruptions = complete rebuild
▪ Attempt #2: 0.19 -> 0.20
    ▪ Shut down cluster, update nodes and restart
    ▪ Worked great with about an hour of downtime

Live Migration!
▪ Attempt #3: 0.20 -> 0.90
    ▪ Lots more teams using ES now
    ▪ Build cross-cluster replication mechanism based on elasticsearch-changes-plugin
    ▪ Live migration to new cluster...
    ▪ ...rollback to old cluster when boosting bug was found
    ▪ Second live migration attempt a success...

fml.sh

Migrations Today: Aliases for clusters
▪ Run shadow traffic to new cluster
▪ Cluster data in sync
▪ Check for exceptions
▪ Stats for good measure
▪ Flip a switch when we're ready
▪ Can flip back too!
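The mechanics behind "aliases for clusters" are not described in the talk. One way to picture it, as a sketch under assumptions: clients talk to a logical name, requests are answered by the live cluster and mirrored to a shadow cluster whose errors are only recorded, and a flip swaps the two. The `ClusterAlias` class and its callables are hypothetical, and a real setup would mirror traffic asynchronously rather than inline.

```python
from typing import Callable, Dict, List, Optional

SearchFn = Callable[[dict], dict]


class ClusterAlias:
    """Hypothetical client-side alias: one logical name, a live cluster,
    and an optional shadow cluster that receives mirrored traffic."""

    def __init__(self, clusters: Dict[str, SearchFn], live: str,
                 shadow: Optional[str] = None) -> None:
        self.clusters = clusters
        self.live = live
        self.shadow = shadow
        self.shadow_errors: List[str] = []

    def search(self, request: dict) -> dict:
        """Answer from the live cluster; mirror the request to the shadow."""
        response = self.clusters[self.live](request)
        if self.shadow is not None:
            try:
                self.clusters[self.shadow](request)
            except Exception as exc:
                # Shadow failures never reach the caller; they are only logged.
                self.shadow_errors.append(repr(exc))
        return response

    def flip(self) -> None:
        """Swap live and shadow. Calling it again flips back."""
        if self.shadow is None:
            raise ValueError("no shadow cluster to flip to")
        self.live, self.shadow = self.shadow, self.live
```

Because the old cluster becomes the shadow after a flip, it keeps receiving traffic and stays a valid rollback target, which matches the "can flip back too" bullet.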

Shield: Seamless security
▪ Shadow cluster with 1.4 and Shield
▪ On the fly HTTP -> HTTPS
▪ Let's try it in production!
▪ #yolo
▪ Didn't read the manual
▪ Now it's rock solid
▪ Running for weeks now
▪ H1 deploying to all clusters

What's next? Upgrades
▪ Default HTTPS
▪ Automated snapshots to GlusterFS
▪ Role-based access control
▪ Not for security, but for sanity
▪ Wild-west cluster

Lessons Learned: The hard way
▪ One cluster is easy to manage. And easy to bring down.
▪ Search ranking is hard. So cheat.
▪ Make it easy for engineers and they will come.

Questions? Also, we're hiring ([email protected])
Peter Vulgaris, March 11, 2015

