
From Hackathon to Production: Elasticsearch @ Facebook


This talk was presented at the inaugural Elastic{ON} conference, http://elasticon.com

Session Abstract:

Facebook has been using Elasticsearch for more than three years, growing from a simple enterprise search deployment to over 40 tools across multiple clusters, serving 60+ million queries a day and counting. This talk follows the entire Elasticsearch journey, from a hackathon project to a self-service infrastructure used across internal tools and public production sites.

Presented by Peter Vulgaris, Facebook

Elastic Co

March 11, 2015

Transcript

  2. Elasticsearch@Facebook: From Hackathon to Production. Peter Vulgaris, March 11, 2015


  3. In The Beginning: Google Search Appliance
     ▪ Sits in a rack, takes URLs and scrapes them
     ▪ Basically a black box
     ▪ No structured data other than rendered HTML pages
     ▪ No selective boosting
     ▪ Not really the hacker way

  4. Also In The Beginning: Apache Solr
     ▪ No built-in way to handle querying multiple indices
     ▪ Manual sharding
     ▪ Verbose XML queries :(
     ▪ Built tools to handle what Elasticsearch does out of the box

  5. Why Elasticsearch? It's easy to hack on
     ▪ The power of Lucene with a simpler REST + JSON interface (see the sketch below)
     ▪ Quick to get up and running
     ▪ Automatic replication and rebalancing
     ▪ Great community

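A rough illustration of that REST + JSON simplicity, as a minimal sketch: it assumes a local node on the default port, hypothetical index and field names, and current-era URLs (the 0.x-era paths from the talk differed slightly). Any HTTP client works; no special library is required.

    import requests

    # Index a document: one HTTP request with a JSON body.
    # refresh=true makes it immediately searchable for this demo.
    requests.put(
        "http://localhost:9200/wiki/_doc/1?refresh=true",
        json={"title": "Search at Facebook", "body": "From hackathon to production"},
    )

    # Search it back: another HTTP request with a JSON query.
    resp = requests.post(
        "http://localhost:9200/wiki/_search",
        json={"query": {"match": {"title": "facebook"}}},
    )
    print(resp.json()["hits"]["hits"])
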
  6. 2012 vs. 2015
     2012:
     ▪ 1 cluster (0.18)
     ▪ 1 application (internal search)
     ▪ Tens of thousands of docs
     2015:
     ▪ 7 clusters (that we know of, 1.2+)
     ▪ Dozens of applications
     ▪ 100+ nodes in multiple datacenters
     ▪ ~4 billion documents
     ▪ 4+ TB data
     ▪ 1500+ QPS (for WWW at least)
     ▪ 1 common deployment infrastructure
     ▪ 2-3 indexing frameworks

  7. Goals for Elasticsearch Products
     ▪ Engineer comes in knowing nothing about search
     ▪ Able to index docs and add search ASAP
     ▪ Path to more advanced usage

  8. Goals for Elasticsearch Infrastructure
     ▪ Spin up a new cluster with nodes in multiple DCs in minutes, with common settings
     ▪ Survive the "storms"
     ▪ Transparent transitions for clients to new clusters
     ▪ Management, logging, alarms, etc.

  9. Help Community
     ▪ fb.com/help/community
     ▪ Indexes and searches through user-generated questions and answers
     ▪ 211 QPS to the cluster

  10. Threat Exchange
      ▪ threatexchange.fb.com
      ▪ Platform for distributing threat intelligence
      ▪ 4M malware scans/week, and just getting started

  11. Tasks

  12. Getting Started With Elasticsearch: The Old Way
      ▪ "I don't know anything about search, but I'm tasked with adding it to my tool. Help!" -Engineer
      ▪ "Good luck!" -Me
      ▪ Lucene in Action
      ▪ elasticsearch.org

  13. Getting Started With Elasticsearch: The New Way
      ▪ Copy/paste documentation
      ▪ Sandbox environments
      ▪ Sample settings/mappings (sketched below)
      ▪ Indexing framework
        ▪ One config per index or type
        ▪ Scheduling, consistency, live updates, retries, etc.

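The internal indexing framework's config format isn't public, so this is only an illustrative shape for "sample settings/mappings, one config per index" (hypothetical names, current mapping syntax):

    # Hypothetical per-index config: settings and mappings in one place.
    WIKI_INDEX_CONFIG = {
        "settings": {
            "number_of_shards": 5,    # how the index is partitioned
            "number_of_replicas": 2,  # redundant copies per shard
        },
        "mappings": {
            "properties": {
                "title":      {"type": "text"},
                "body":       {"type": "text"},
                "updated_at": {"type": "date"},
            }
        },
    }

    # Creating the index is a single PUT of this body:
    # requests.put("http://localhost:9200/wiki", json=WIKI_INDEX_CONFIG)
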
  14. Getting Started With Elasticsearch: The New Way, Continued
      ▪ Finding stuff is still a little wild west
      ▪ Google-y query string query (example below)
      ▪ Elastica PHP library
      ▪ This tends to be the easier bit

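The "Google-y" query string query accepts user-typed syntax (quotes, AND/OR, field:value) directly. The deck mentions the Elastica PHP library, but the JSON body is the same from any client; a sketch with hypothetical fields:

    import requests

    resp = requests.post(
        "http://localhost:9200/wiki/_search",
        json={
            "query": {
                "query_string": {
                    "query": 'title:"intern onboarding" AND author:peterv',
                    "default_field": "body",  # where bare terms are searched
                }
            }
        },
    )
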
  15. Ramping Up With Elasticsearch: Old Way vs. New Way
      ▪ No longer spending entire internships adding indexing and search to products
      ▪ More tools teams have a "cool search guy" (quote is mine)
      ▪ More adoption, spreading to product teams
      ▪ Bottom line: less time learning Elasticsearch and more time searching

  16. Intern Search: Putting it all together
      ▪ Configs for wiki, dex, tasks, code, employees, etc.
      ▪ CTR, bounce rates and pins
      ▪ Query string query
      ▪ A/B testing

  17. Single Cluster, LOL
      ▪ Just one cluster for all projects
      ▪ Pros:
        ▪ Simpler migrations to new versions of Elasticsearch
        ▪ Easier to debug issues
        ▪ Faster ramp-up for other teams
      ▪ Cons:
        ▪ When the cluster goes down...

  18. Multiple Clusters, AKA Get More Sleep
      ▪ Engineers are dangerous
      ▪ Engineers don't want to worry about usage quotas
      ▪ Move fast and over-index your data
      ▪ Add search, head home and crack open a beer
      ▪ How do we scale this?

  19. Deployment: Tupperware
      ▪ Simple config
      ▪ LXC containers
      ▪ Add/remove nodes
      ▪ Health checks
      ▪ Automatic node replacement
      ▪ Log aggregation

  20. Monitoring: Scuba
      ▪ Free with config
      ▪ Watch CPU, heap, requests by endpoint, etc.
      ▪ Alarms
      ▪ Why not Marvel?

  21. Dealing With Disaster

  22. Multiple Datacenters: Redundancy
      ▪ Masters need low latency
      ▪ Still experimenting
      ▪ Data nodes in multiple datacenters
      ▪ Needs SHIELD-ing

  23. Rebuilding Indices, AKA Disaster Recovery
      ▪ Cron job to save settings and mappings with version control (sketched below)
      ▪ Indexer configs can rebuild most core indices
      ▪ Product-specific fallbacks
        ▪ Full-text query on backing DB
      ▪ Watch for shard failures
      ▪ We need snapshots

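A minimal sketch of that cron job, assuming a dump directory and a separate step that commits it to version control: it pulls every index's settings and mappings over the REST API so a lost cluster can be rebuilt from known-good configs.

    import json
    import requests

    BASE = "http://localhost:9200"
    DUMP_DIR = "/var/backups/es"  # assumed path; a later step commits this to VCS

    # GET /_settings and GET /_mapping return the config for every index.
    for endpoint in ("_settings", "_mapping"):
        data = requests.get(f"{BASE}/{endpoint}").json()
        with open(f"{DUMP_DIR}/{endpoint.lstrip('_')}.json", "w") as f:
            json.dump(data, f, indent=2, sort_keys=True)
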
  24. Migrations
      ▪ Attempt #1: 0.18 -> 0.19
        ▪ Shut down cluster, back up data, build new version and restart
        ▪ Oops, only a partial data copy
        ▪ Corruptions = complete rebuild
      ▪ Attempt #2: 0.19 -> 0.20
        ▪ Shut down cluster, update nodes and restart
        ▪ Worked great, with about an hour of downtime

  25. Live Migration!
      ▪ Attempt #3: 0.20 -> 0.90
      ▪ Lots more teams using ES now
      ▪ Built a cross-cluster replication mechanism based on elasticsearch-changes-plugin
      ▪ Live migration to the new cluster...
      ▪ ...rollback to the old cluster when a boosting bug was found
      ▪ Second live migration attempt was a success...

  26. fml.sh

  27. Migrations Today: Aliases for clusters
      ▪ Run shadow traffic to the new cluster
      ▪ Keep cluster data in sync
      ▪ Check for exceptions
      ▪ Stats for good measure
      ▪ Flip a switch when we're ready
      ▪ Can flip back too! (alias sketch below)

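The cluster-level aliases described here live in Facebook's client routing layer, which isn't public; Elasticsearch's built-in index aliases illustrate the same flip-a-switch pattern (hypothetical index names):

    import requests

    # Atomically repoint the "wiki" alias from the old index to the new one.
    requests.post(
        "http://localhost:9200/_aliases",
        json={
            "actions": [
                {"remove": {"index": "wiki_v1", "alias": "wiki"}},
                {"add": {"index": "wiki_v2", "alias": "wiki"}},
            ]
        },
    )
    # Flipping back is the same call with the indices swapped.
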
  28. Shield: Seamless security
      ▪ Shadow cluster with 1.4 and Shield
      ▪ On-the-fly HTTP -> HTTPS (client sketch below)
      ▪ Let's try it in production! #yolo
      ▪ Didn't read the manual
      ▪ Now it's rock solid; running for weeks now
      ▪ H1: deploying to all clusters

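Shield put TLS and basic authentication in front of the same REST API, so the client-side change is mostly a URL scheme plus credentials. A sketch with placeholder host, user, password and CA path:

    import requests

    resp = requests.get(
        "https://es.internal.example.com:9200/_cluster/health",
        auth=("svc_search", "placeholder-password"),  # Shield basic auth
        verify="/etc/ssl/internal-ca.pem",            # trust the internal CA
    )
    print(resp.json()["status"])
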
  29. What's next?
      ▪ Upgrades
      ▪ Default HTTPS
      ▪ Automated snapshots to GlusterFS
      ▪ Role-based access control
        ▪ Not for security, but for sanity
      ▪ Wild-west cluster

  30. Lessons Learned, The Hard Way
      ▪ One cluster is easy to manage. And easy to bring down.
      ▪ Search ranking is hard. So cheat.
      ▪ Make it easy for engineers and they will come.

  31. Questions? Also, we're hiring: peterv@fb.com. Peter Vulgaris, March 11, 2015