From Hackathon to Production: Elasticsearch @ Facebook

This talk was presented at the inaugural Elastic{ON} conference, http://elasticon.com

Session Abstract:

Facebook has been using Elasticsearch for 3 plus years, having gone from a simple enterprise search to over 40 tools across multiple clusters with 60+ million queries a day and growing. This talk will focus on the entire Elasticsearch journey, from a hackathon project to a self-service infrastructure used across internal tools and public production sites.

Presented by Peter Vulgaris, Facebook

Elastic Co

March 11, 2015

Transcript

  2. Elasticsearch@Facebook
    From Hackathon to Production
    Peter Vulgaris
    March 11, 2015

  3. In The Beginning
    Google Search Appliance
    ▪ Sits in a rack, takes URLs and
    scrapes them
    ▪ Basically a black box
    ▪ No structured data other than
    rendered HTML pages
    ▪ No selective boosting
    ▪ Not really the hacker way

  4. Also In The Beginning
    Apache Solr
    ▪ No built-in way to handle
    querying multiple indices
    ▪ Manual sharding
    ▪ Verbose XML queries :(
    ▪ Built tools to handle what
    Elasticsearch does out of the box

  5. Why Elasticsearch?
    It's easy to hack on
    ▪ Same power of Lucene with
    simpler REST + JSON interface
    ▪ Quick to get it up and running
    ▪ Automatic replication and
    rebalancing
    ▪ Great community

  6. 2012 vs. 2015
    2012
    ▪ 1 cluster (0.18)
    ▪ 1 application (internal search)
    ▪ Tens of thousands of docs
    2015
    ▪ 7 clusters (that we know of, 1.2+)
    ▪ Dozens of applications
    ▪ 100+ nodes in multiple datacenters
    ▪ ~4 billion documents
    ▪ 4+ TB data
    ▪ 1500+ QPS (for WWW at least)
    ▪ 1 common deployment infrastructure
    ▪ 2-3 indexing frameworks

  7. Goals for Elasticsearch
    Products
    ▪ Engineer comes in knowing nothing about search
    ▪ Able to index docs and add search ASAP
    ▪ Path to more advanced usage

  8. Goals for Elasticsearch
    Infrastructure
    ▪ Spin up a new cluster with nodes in multiple DCs in minutes with common
    settings
    ▪ Survive the "storms"
    ▪ Transparent transitions for clients to new clusters
    ▪ Management, logging, alarms, etc.

  9. Help Community
    ▪ fb.com/help/community
    ▪ Indexes and searches
    through user-generated
    questions and answers
    ▪ 211 QPS to the cluster

  10. Threat Exchange
    ▪ threatexchange.fb.com
    ▪ Platform for distributing
    threat intelligence
    ▪ 4M malware scans/week
    and just getting started

  11. Tasks

  12. Getting Started With Elasticsearch
    The Old Way
    ▪ "I don't know anything about
    search, but I'm tasked with adding
    it to my tool. Help!" -Engineer
    ▪ "Good luck!" -Me
    ▪ Lucene in Action
    ▪ elasticsearch.org

  13. Getting Started With Elasticsearch
    The New Way
    ▪ Copy/paste documentation
    ▪ Sandbox environments
    ▪ Sample settings/mappings
    ▪ Indexing framework
    ▪ One config per index or type
    ▪ Scheduling, consistency, live
    updates, retries, etc.
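A minimal sketch of the "retries" piece of such an indexing framework, assuming exponential backoff. The names `index_with_retries`, `send`, and `flaky_send` are hypothetical and not taken from the talk; `send` stands in for whatever call actually ships a document to the cluster.

```python
import time


def index_with_retries(send, doc, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send(doc), retrying with exponential backoff on any exception.

    `send` is a hypothetical callable that indexes one document.
    `sleep` is injectable so the backoff can be observed without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send(doc)
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated transport that fails twice, then succeeds.
attempts_seen = []


def flaky_send(doc):
    attempts_seen.append(doc["id"])
    if len(attempts_seen) < 3:
        raise ConnectionError("simulated transient failure")
    return "indexed"


delays = []
status = index_with_retries(flaky_send, {"id": 42}, sleep=delays.append)
print(status, delays)
```

Passing `sleep=delays.append` records the backoff intervals instead of waiting, which keeps the sketch fast to run.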

  14. Getting Started With Elasticsearch
    The New Way Continued
    ▪ Finding stuff is still a little wild west
    ▪ Google-y query string query
    ▪ Elastica PHP library
    ▪ This tends to be the easier bit
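As an illustration of the "Google-y" query string query mentioned above, the sketch below builds a search request body around Elasticsearch's `query_string` query. The helper name, the field names (`title`, `body`), and the boost value are assumptions for the example; the talk does not show the actual queries used.

```python
import json


def build_query_string_search(user_input, fields, size=10):
    """Build a search body that passes user input to a query_string query."""
    return {
        "size": size,
        "query": {
            "query_string": {
                "query": user_input,           # Lucene-style syntax typed by the user
                "fields": list(fields),        # "title^3" boosts matches in title
                "default_operator": "AND",     # require all terms unless OR is given
            }
        },
    }


body = build_query_string_search('deploy "code review"', ["title^3", "body"])
print(json.dumps(body, indent=2))
```

The resulting dictionary could be sent as the JSON body of a search request by any client; the talk mentions the Elastica PHP library for that step.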

  15. Ramping Up With Elasticsearch
    Old Way vs New Way
    ▪ No longer spending entire internships adding indexing and search to products
    ▪ More tools teams have a “cool search guy” (quote is mine)
    ▪ More adoption and spreading to product teams
    ▪ Bottom line: less time learning Elasticsearch and more time searching

  16. Intern Search
    Putting it all together
    ▪ Configs for wiki, dex,
    tasks, code, employees,
    etc.
    ▪ CTR, bounce rates and
    pins
    ▪ Query string query
    ▪ A/B testing

  17. Single Cluster
    LOL
    ▪ Just one cluster for all projects
    ▪ Pros:
      ▪ Simpler migrations to new versions
      of Elasticsearch
      ▪ Easier to debug issues
      ▪ Faster ramp-up for other teams
    ▪ Cons:
      ▪ When the cluster goes down...

  18. Multiple Clusters
    AKA Get More Sleep
    ▪ Engineers are dangerous
    ▪ Engineers don't want to worry about
    usage quotas
    ▪ Move fast and over-index your data
    ▪ Add search, head home and crack
    open a beer
    ▪ How do we scale this?


  19. Deployment
    Tupperware
    ▪ Simple config
    ▪ LXC containers
    ▪ Add/remove nodes
    ▪ Health checks
    ▪ Automatic node replacement
    ▪ Log aggregation

  20. Monitoring
    Scuba
    ▪ Free with config
    ▪ Watch CPU, heap, requests by
    endpoint, etc.
    ▪ Alarms
    ▪ Why not Marvel?

  21. Dealing With Disaster

  22. Multiple Datacenters
    Redundancy
    ▪ Masters need low latency
    ▪ Still experimenting
    ▪ Data nodes in multiple datacenters
    ▪ Needs SHIELD-ing

  23. Rebuilding Indices
    AKA Disaster Recovery
    ▪ Cron job to save settings and
    mappings with version control
    ▪ Indexer configs can rebuild most
    core indices
    ▪ Product-specific fallbacks
    ▪ Full-text query on backing DB
    ▪ Watch for shard failures
    ▪ We need snapshots
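The cron job that saves settings and mappings could look roughly like the sketch below. It covers only the "write files that diff cleanly" step and assumes the settings and mappings have already been fetched into a dictionary. `dump_index_configs` and the sample `wiki` index are hypothetical; fetching from the cluster and committing to version control are left out.

```python
import json
import tempfile
from pathlib import Path


def dump_index_configs(configs, out_dir):
    """Write one JSON file per index and return the paths written.

    `configs` maps index name -> {"settings": ..., "mappings": ...}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index_name in sorted(configs):
        path = out / f"{index_name}.json"
        # Sorted keys and fixed indentation keep version-control diffs small.
        text = json.dumps(configs[index_name], indent=2, sort_keys=True)
        path.write_text(text + "\n")
        written.append(path)
    return written


# Hypothetical data standing in for what the cluster would return.
example = {
    "wiki": {
        "settings": {"number_of_shards": 5},
        "mappings": {"page": {"properties": {"title": {"type": "string"}}}},
    }
}
snapshot_dir = tempfile.mkdtemp()
paths = dump_index_configs(example, snapshot_dir)
print([p.name for p in paths])
```

Because the output is deterministic, an unchanged index produces an identical file and therefore no new commit.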

  24. Migrations
    ▪ Attempt #1: 0.18 -> 0.19
      ▪ Shut down cluster, back up data,
      build new version and restart
      ▪ Oops, only partial data copy
      ▪ Corruptions = complete rebuild
    ▪ Attempt #2: 0.19 -> 0.20
      ▪ Shut down cluster, update nodes
      and restart
      ▪ Worked great with about an hour
      of downtime

  25. Live Migration!
    ▪ Attempt #3: 0.20 -> 0.90
      ▪ Lots more teams using ES now
      ▪ Build cross-cluster replication
      mechanism based on
      elasticsearch-changes-plugin
      ▪ Live migration to new cluster...
      ▪ ...rollback to old cluster when
      boosting bug was found
      ▪ Second live migration attempt a
      success...

  26. fml.sh

  27. Migrations Today
    Aliases for clusters
    ▪ Run shadow traffic to new cluster
    ▪ Cluster data in sync
    ▪ Check for exceptions
    ▪ Stats for good measure
    ▪ Flip a switch when we're ready
    ▪ Can flip back too!
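The alias-plus-shadow-traffic idea above can be sketched as a small router. This is an illustrative model, not the mechanism described in the talk: `ClusterAlias`, `old_cluster`, and `new_cluster` are hypothetical names, and each "cluster" is just a callable that answers a query.

```python
class ClusterAlias:
    """Route queries to a primary cluster and mirror them to a shadow cluster."""

    def __init__(self, primary, shadow=None):
        self.primary = primary
        self.shadow = shadow
        self.shadow_errors = []  # exceptions raised by the shadow cluster
        self.mismatches = 0      # shadow answers that differ from the primary's

    def search(self, query):
        result = self.primary(query)
        if self.shadow is not None:
            try:
                if self.shadow(query) != result:
                    self.mismatches += 1
            except Exception as exc:
                # Shadow failures are recorded but never reach the caller.
                self.shadow_errors.append(exc)
        return result

    def flip(self):
        """Swap primary and shadow; calling it again flips back."""
        if self.shadow is None:
            raise RuntimeError("no shadow cluster configured")
        self.primary, self.shadow = self.shadow, self.primary


calls = []


def old_cluster(query):
    calls.append("old")
    return [query]


def new_cluster(query):
    calls.append("new")
    return [query]


alias = ClusterAlias(old_cluster, new_cluster)
first = alias.search("tasks")   # served by old, mirrored to new
alias.flip()
second = alias.search("tasks")  # served by new, mirrored to old
print(calls, alias.mismatches, len(alias.shadow_errors))
```

Keeping the old cluster as the shadow after the flip is what makes flipping back cheap: it keeps receiving the same traffic.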

  28. Shield
    Seamless security
    ▪ Shadow cluster with 1.4 and
    Shield
    ▪ On the fly HTTP -> HTTPS
    ▪ Let's try it in production!
    ▪ #yolo
    ▪ Didn't read the manual
    ▪ Now it's rock solid
    ▪ Running for weeks now
    ▪ H1 deploying to all clusters

  29. What's next?
    Upgrades
    ▪ Default HTTPS
    ▪ Automated snapshots to
    GlusterFS
    ▪ Role-based access control
    ▪ Not for security, but for sanity
    ▪ Wild-west cluster

  30. Lessons Learned
    The hard way
    ▪ One cluster is easy to manage.
    And easy to bring down.
    ▪ Search ranking is hard. So cheat.
    ▪ Make it easy for engineers and
    they will come.

  31. Questions?
    Also, we're hiring ([email protected])
    Peter Vulgaris
    March 11, 2015
