Rapid Full-Text Indexing with ElasticSearch and MySQL

Rapid Full-Text Indexing with ElasticSearch and MySQL

ElasticSearch is the premier solution for full-text search of structured documents in the cloud. But we all know that MySQL is everybody's favorite database, whose durability and performance in the field is battle-tested and well-understood. In this presentation, we review common cases for bidirectional integration of ElasticSearch and MySQL, and look at techniques that enable developers to use ElasticSearch and MySQL more efficiently together. We'll look inside a MySQL "River" (ElasticSearch indexing plugin) for ElasticSearch that allows binlog updates to propagate to ElasticSearch in real time, as well as an ElasticSearch plugin that allows ElasticSearch document updates to be applied to MySQL tables in real time. We'll pay special attention to the features and characteristics that make ElasticSearch and MySQL great in their own way, and why they will both have a place in your "cloud stack" for years to come.

Ac1a1d449af259bfb7191105db5212a2?s=128

Sunny Gleason

April 16, 2015
Tweet

Transcript

  1. Rapid Full-Text Indexing
 With ElasticSearch & MySQL Sunny Gleason Distributed

    Systems Engineer, SunnyCloud April 16, 2015
  2. Get The Slides • The slides for this presentation are

    available at:
 http://speakerdeck.com/sunnygleason • Check them out if you’d like to follow along! 2
  3. Who am I? 3 • Sunny Gleason – Distributed Systems

    Engineer – SunnyCloud, Boston MA • Prior Web Services Work – Amazon – Ning • Focus: Scalable, Reliable Storage Systems for Structured & Unstructured Data
  4. What’s this all about? • ElasticSearch is a premier search

    system • MySQL is performant & reliable for primary data • So, how can we use MySQL with ElasticSearch? 4
  5. Why are we using MySQL? • Strong ACID guarantees •

    High-performance DB engine • Battle-tested operations • Well understood on SSD • SQL Skill set • Memcache API • Straightforward data sizing 5 • BLOB support • Hosted offerings • Mature security model • Well-defined index model • Backup/Recovery • Mature replication • Ecosystem/Community
  6. Why are we using MySQL? 6 Trust Performance Support

  7. Why would we need more than MySQL? • Active-active is

    hard • Failover is still an expert operation • Binlog/file formats are complex • Full-Text support still maturing 7 • Not designed for cloud • Strict relational model • Schema changes are hard • Trendy developers • No REST API • Copyleft license
  8. What are the major limitations of MySQL? 8 Expert Operations

    Relational Model No Dynamic Scaling Licensing & Extension
  9. What is ElasticSearch? • Clustered, search-focused structured data store •

    Designed for cloud operations • Symmetric cluster model • Internal Sharding & Scaling 9 • Based on Apache Lucene • REST API • Dynamic schema • Open Source / Apache License • Commercial support
  10. Why should we consider ElasticSearch? • Fully dynamic clustering, REST

    API • Optimized for full-text search • Lightweight index filters & aliases • Extremely pluggable via Java or scripting: Scoring, Analysis, Tokenization, Indexing, Server Plugins • More automated sharding & scaling • Based on Apache Lucene (1999) 10 • High-performance • High availability (tunable) • Dynamic schema • Index & Data are same • Hosted offerings (found.no) • Strong community • Non-Copyleft License (Apache) • Commercial support
  11. Why would we use ElasticSearch? 11 Feature-Rich Search Performance &

    Availability Operational Model Community & Support
  12. What are some limitations of ElasticSearch? • Not ACID •

    Eventually consistent • Cluster authority vs. master/slave • Java failure modes (OOM) • Data set sizing is tricky • Data placement is tricky to control • Snapshot/Restore is tricky, new • Plugin deployment requires restart • Not good for blob data 12 • Dynamic schema • Young security model • Weak compliance story • Limited forensic tools • No equivalent to mysqldump • Vulnerable to block-level file system corruption • Not recommended as primary data store
  13. What are the key limitations of ElasticSearch? 13 Durability Consistency

    Model Operational Transparency Schema Migration Model Security
  14. What can we do with MySQL *and* ElasticSearch? 14 •

    Primary storage • System of record • Compact indexes • Higher consistency • SQL querying • Full-text search • Lightweight indexes • Custom scoring & analysis • Higher availability • Horizontal Scaling
  15. Bi-directional replication for ElasticSearch & MySQL 15 Sources: http://support.smartbear.com/ MySQL

    master MySQL slave ElasticSearch Node A ElasticSearch Node B ElasticSearch Node C
  16. Bi-directional replication for ElasticSearch & MySQL 16 Sources: http://support.smartbear.com/ MySQL

    master MySQL slave ElasticSearch Node A ElasticSearch Node B ElasticSearch Node C • Goal: move data from MySQL to ElasticSearch and vice-versa • Facilitate wide-area replication
 and integration • Use JSON in intermediate channel • Create useful primitives
 for connectors
  17. Getting Data out of MySQL • Solution: binlog-based replication client

    • Use open-replicator (java binlog client and parser) • Row-based replication • Turns row updates into a JSON stream • Tricky bit: table metadata 17
  18. Getting Data into MySQL • Solution: Java bridge from JSON

    to JDBC to MySQL • Use JDBI for easy SQL
 operations / queries • JSON data includes table name, has column names as field names • insert -> INSERT, update -> UPDATE, delete -> DELETE • tricky bit: unique id column 18
  19. Getting Data out of ElasticSearch • Solution: ElasticSearch “Changes”
 (updates

    plugin) to JSON • Updates include change type and document data • JSON documents include index name (corresponds to table name in MySQL) • Field names are same as column names in MySQL • Plugin runs on all nodes, but events only fire from primary 19
  20. Getting Data into ElasticSearch • Solution: ElasticSearch “River”
 (indexing plugin)

    from JSON • JSON data includes index name (corresponds to table name in MySQL) • Field names are same as column names in MySQL • River runs on all nodes, but events only fire from primary 20
  21. Covering all our bases • ElasticSearch -> MySQL • MySQL

    -> ElasticSearch • ElasticSearch -> ElasticSearch • MySQL -> MySQL 21 MySQL master MySQL slave ElasticSearch Node A ElasticSearch Node B ElasticSearch Node C
  22. What do we use as the channel? • Publish/Subscribe model

    • Reliable • Ordered • Appropriate for WAN use • Multi-Region, Multi-Availability zone • Encrypted 22 ?
  23. Intermediate Channel Options 23

  24. What do we use as the channel? 24 • Redis:

    not clustered,
 not fault-tolerant • ZeroMQ: not fault-tolerant • RabbitMQ: not multi-AZ, complex to manage • PubNub: high availability, reliable, encrypted, global message propagation within 250ms
  25. How does this all work? 25 • Each primary data

    store has a connector that writes to a
 PubNub channel • PubNub propagates data
 to all listeners • Listeners apply data updates to local data stores • PubNub provides reliability, ordering, easy integration
  26. Additional Connectors 26 • This talk covers MySQL and ElasticSearch

    connectors • Additional connectors in open source for MongoDB and Redis • MySQL will map into future support for Postgres & other relational databases that support replication clients • All you need to connect a data store is an operations log client and insert/ update/delete operations
  27. Thank You! 27