Slide 1

Rapid Full-Text Indexing With ElasticSearch & MySQL
Sunny Gleason
Distributed Systems Engineer, SunnyCloud
April 16, 2015

Slide 2

Get The Slides
• The slides for this presentation are available at: http://speakerdeck.com/sunnygleason
• Check them out if you’d like to follow along!

Slide 3

Who am I?
• Sunny Gleason – Distributed Systems Engineer – SunnyCloud, Boston MA
• Prior Web Services Work – Amazon, Ning
• Focus: Scalable, Reliable Storage Systems for Structured & Unstructured Data

Slide 4

What’s this all about?
• ElasticSearch is a premier search system
• MySQL is performant & reliable for primary data
• So, how can we use MySQL with ElasticSearch?

Slide 5

Why are we using MySQL?
• Strong ACID guarantees
• High-performance DB engine
• Battle-tested operations
• Well understood on SSD
• SQL skill set
• Memcache API
• Straightforward data sizing
• BLOB support
• Hosted offerings
• Mature security model
• Well-defined index model
• Backup/Recovery
• Mature replication
• Ecosystem/Community

Slide 6

Why are we using MySQL?
• Trust
• Performance
• Support

Slide 7

Why would we need more than MySQL?
• Active-active is hard
• Failover is still an expert operation
• Binlog/file formats are complex
• Full-text support still maturing
• Not designed for cloud
• Strict relational model
• Schema changes are hard
• Trendy developers
• No REST API
• Copyleft license

Slide 8

What are the major limitations of MySQL?
• Expert Operations
• Relational Model
• No Dynamic Scaling
• Licensing & Extension

Slide 9

What is ElasticSearch?
• Clustered, search-focused structured data store
• Designed for cloud operations
• Symmetric cluster model
• Internal sharding & scaling
• Based on Apache Lucene
• REST API
• Dynamic schema
• Open Source / Apache License
• Commercial support

Slide 10

Why should we consider ElasticSearch?
• Fully dynamic clustering, REST API
• Optimized for full-text search
• Lightweight index filters & aliases
• Extremely pluggable via Java or scripting: scoring, analysis, tokenization, indexing, server plugins
• More automated sharding & scaling
• Based on Apache Lucene (first released in 1999)
• High performance
• High availability (tunable)
• Dynamic schema
• Index & data are stored together
• Hosted offerings (found.no)
• Strong community
• Non-copyleft license (Apache)
• Commercial support

Slide 11

Why would we use ElasticSearch?
• Feature-Rich Search
• Performance & Availability
• Operational Model
• Community & Support

Slide 12

What are some limitations of ElasticSearch?
• Not ACID
• Eventually consistent
• Cluster authority vs. master/slave
• Java failure modes (OOM)
• Data set sizing is tricky
• Data placement is tricky to control
• Snapshot/Restore is tricky, new
• Plugin deployment requires restart
• Not good for blob data
• Dynamic schema
• Young security model
• Weak compliance story
• Limited forensic tools
• No equivalent to mysqldump
• Vulnerable to block-level file system corruption
• Not recommended as primary data store

Slide 13

What are the key limitations of ElasticSearch?
• Durability
• Consistency Model
• Operational Transparency
• Schema Migration Model
• Security

Slide 14

What can we do with MySQL *and* ElasticSearch?
MySQL:
• Primary storage
• System of record
• Compact indexes
• Higher consistency
• SQL querying
ElasticSearch:
• Full-text search
• Lightweight indexes
• Custom scoring & analysis
• Higher availability
• Horizontal scaling

Slide 15

Bi-directional replication for ElasticSearch & MySQL
[Diagram: MySQL master & slave replicating bi-directionally with ElasticSearch Nodes A, B, and C; icon source: http://support.smartbear.com/]

Slide 16

Bi-directional replication for ElasticSearch & MySQL
[Diagram: MySQL master & slave and ElasticSearch Nodes A, B, and C connected through a shared channel; icon source: http://support.smartbear.com/]
• Goal: move data from MySQL to ElasticSearch and vice-versa
• Facilitate wide-area replication and integration
• Use JSON in intermediate channel (see the example message below)
• Create useful primitives for connectors
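
One illustrative shape for such a channel message (the field names here are assumptions for illustration, not the connector's actual wire format):

{
  "table": "users",
  "op": "update",
  "id": 42,
  "data": { "name": "Sunny Gleason", "city": "Boston" }
}

Because the table/index name and the column/field names travel inside every message, the same format can feed a MySQL sink and an ElasticSearch sink alike.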

Slide 17

Getting Data out of MySQL
• Solution: binlog-based replication client
• Use open-replicator (Java binlog client and parser); see the listener sketch below
• Row-based replication
• Turns row updates into a JSON stream
• Tricky bit: table metadata
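
A minimal sketch of the source side, assuming open-replicator's listener API (host, credentials, server id, and binlog position are placeholders; the JSON/publish step is elided):

import com.google.code.or.OpenReplicator;
import com.google.code.or.binlog.BinlogEventListener;
import com.google.code.or.binlog.BinlogEventV4;
import com.google.code.or.binlog.impl.event.TableMapEvent;
import com.google.code.or.binlog.impl.event.WriteRowsEvent;

public class BinlogTail {
  public static void main(String[] args) throws Exception {
    OpenReplicator replicator = new OpenReplicator();
    replicator.setHost("127.0.0.1");
    replicator.setPort(3306);
    replicator.setUser("repl");          // needs REPLICATION SLAVE privilege
    replicator.setPassword("secret");
    replicator.setServerId(6789);        // must be unique among replicas
    replicator.setBinlogFileName("mysql-bin.000001");
    replicator.setBinlogPosition(4);

    replicator.setBinlogEventListener(new BinlogEventListener() {
      public void onEvents(BinlogEventV4 event) {
        if (event instanceof TableMapEvent) {
          // the "tricky bit": row events carry values by position only, so
          // table/column metadata must be tracked from TableMapEvents (and
          // the schema) in order to emit named JSON fields
        } else if (event instanceof WriteRowsEvent) {
          // map each inserted row to a JSON message and publish to the channel
        }
        // UpdateRowsEvent / DeleteRowsEvent are handled similarly
      }
    });
    replicator.start();
  }
}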

Slide 18

Getting Data into MySQL
• Solution: Java bridge from JSON to JDBC to MySQL
• Use JDBI for easy SQL operations / queries
• JSON data includes table name, has column names as field names
• insert -> INSERT, update -> UPDATE, delete -> DELETE
• Tricky bit: unique id column (see the sketch below)
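
A sketch of the sink side using JDBI 2 (the table and column names are hypothetical; a real connector would build the SQL from the field names in each JSON message, and the MySQL JDBC driver must be on the classpath):

import org.skife.jdbi.v2.DBI;
import org.skife.jdbi.v2.Handle;

public class MySqlSink {
  private final DBI dbi =
      new DBI("jdbc:mysql://localhost/mydb", "app", "secret");

  // apply one change message; op is "insert", "update", or "delete"
  public void apply(String op, long id, String name) {
    Handle h = dbi.open();
    try {
      if ("insert".equals(op)) {
        h.createStatement("INSERT INTO users (id, name) VALUES (:id, :name)")
         .bind("id", id).bind("name", name).execute();
      } else if ("update".equals(op)) {
        // the "tricky bit": updates and deletes need a unique id column
        // to address the row being changed
        h.createStatement("UPDATE users SET name = :name WHERE id = :id")
         .bind("id", id).bind("name", name).execute();
      } else if ("delete".equals(op)) {
        h.createStatement("DELETE FROM users WHERE id = :id")
         .bind("id", id).execute();
      }
    } finally {
      h.close();
    }
  }
}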

Slide 19

Getting Data out of ElasticSearch
• Solution: ElasticSearch “Changes” (updates plugin) to JSON
• Updates include change type and document data (illustrated below)
• JSON documents include index name (corresponds to table name in MySQL)
• Field names are same as column names in MySQL
• Plugin runs on all nodes, but events only fire from primary
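
For illustration, a delete observed on a primary shard might surface as an event like this (a hypothetical shape; the plugin's exact output may differ):

{
  "index": "users",
  "type": "delete",
  "id": "42"
}

Since the index and field vocabulary matches the MySQL messages, these events can be funneled into the same channel.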

Slide 20

Getting Data into ElasticSearch
• Solution: ElasticSearch “River” (indexing plugin) from JSON; see the sketch below
• JSON data includes index name (corresponds to table name in MySQL)
• Field names are same as column names in MySQL
• River runs on all nodes, but events only fire from primary
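
Roughly what the river does for each message, sketched against the ElasticSearch 1.x Java client (the "row" type name is an assumption; index, id, and document come from the message):

import org.elasticsearch.client.Client;

public class EsSink {
  private final Client client; // provided by the plugin environment

  public EsSink(Client client) { this.client = client; }

  // apply one JSON change message to the local cluster
  public void apply(String op, String index, String id, String jsonDoc) {
    if ("delete".equals(op)) {
      client.prepareDelete(index, "row", id).execute().actionGet();
    } else {
      // insert and update are both an index operation in ElasticSearch
      client.prepareIndex(index, "row", id)
            .setSource(jsonDoc)
            .execute()
            .actionGet();
    }
  }
}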

Slide 21

Covering all our bases
• ElasticSearch -> MySQL
• MySQL -> ElasticSearch
• ElasticSearch -> ElasticSearch
• MySQL -> MySQL
[Diagram: MySQL master & slave and ElasticSearch Nodes A, B, and C, each connected to the shared channel]

Slide 22

What do we use as the channel?
• Publish/Subscribe model
• Reliable
• Ordered
• Appropriate for WAN use
• Multi-Region, Multi-Availability Zone
• Encrypted

Slide 23

Intermediate Channel Options

Slide 24

What do we use as the channel?
• Redis: not clustered, not fault-tolerant
• ZeroMQ: not fault-tolerant
• RabbitMQ: not multi-AZ, complex to manage
• PubNub: high availability, reliable, encrypted, global message propagation within 250ms

Slide 25

How does this all work?
• Each primary data store has a connector that writes to a PubNub channel
• PubNub propagates data to all listeners
• Listeners apply data updates to local data stores
• PubNub provides reliability, ordering, easy integration (see the sketch below)
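
A minimal sketch of the channel hookup, assuming the 2015-era PubNub Java SDK (the keys, channel name, and message are placeholders):

import com.pubnub.api.Callback;
import com.pubnub.api.Pubnub;

public class ChannelDemo {
  public static void main(String[] args) throws Exception {
    Pubnub pubnub = new Pubnub("your-publish-key", "your-subscribe-key");

    // listener side: apply each incoming message to the local data store
    pubnub.subscribe("replication-channel", new Callback() {
      @Override
      public void successCallback(String channel, Object message) {
        // parse the JSON change message and apply it locally
        System.out.println("received: " + message);
      }
    });

    // connector side: publish a change message drawn from the binlog/oplog
    pubnub.publish("replication-channel",
        "{\"table\":\"users\",\"op\":\"insert\",\"id\":42}",
        new Callback() {});
  }
}

Note that subscribe callbacks fire asynchronously; a real connector and listener would run as separate long-lived processes next to their data stores.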

Slide 26

Additional Connectors
• This talk covers the MySQL and ElasticSearch connectors
• Additional connectors in open source for MongoDB and Redis
• The MySQL approach will map onto future support for Postgres & other relational databases that support replication clients
• All you need to connect a data store is an operations log client and insert/update/delete operations (see the interface sketch below)
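
The connector contract implied by that last point, as a hypothetical Java interface (all names here are illustrative, not taken from the talk's codebase):

import java.util.Map;

// Source side: tail the store's operation log and hand each change
// to a handler that publishes it to the channel as JSON.
interface ChangeHandler {
  void onChange(String table, String op, Object id, Map<String, Object> row);
}

// Sink side: the operations every participating data store must support.
interface StoreConnector {
  void startTailing(ChangeHandler handler);
  void insert(String table, Map<String, Object> row);
  void update(String table, Object id, Map<String, Object> row);
  void delete(String table, Object id);
}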

Slide 27

Thank You!