Rapid Full-Text Indexing with ElasticSearch and MySQL

Sunny Gleason, Distributed Systems Engineer, SunnyCloud
April 16, 2015

Get The Slides
• The slides for this presentation are available at: http://speakerdeck.com/sunnygleason
• Check them out if you’d like to follow along!

Who am I?
• Sunny Gleason
  – Distributed Systems Engineer
  – SunnyCloud, Boston MA
• Prior Web Services Work
  – Amazon
  – Ning
• Focus: Scalable, Reliable Storage Systems for Structured & Unstructured Data

What’s this all about?
• ElasticSearch is a premier search system
• MySQL is performant & reliable for primary data
• So, how can we use MySQL with ElasticSearch?

Why are we using MySQL?
• Strong ACID guarantees
• High-performance DB engine
• Battle-tested operations
• Well understood on SSD
• SQL Skill set
• Memcache API
• Straightforward data sizing
• BLOB support
• Hosted offerings
• Mature security model
• Well-defined index model
• Backup/Recovery
• Mature replication
• Ecosystem/Community

Why are we using MySQL?
• Trust
• Performance
• Support

Why would we need more than MySQL?
• Active-active is hard
• Failover is still an expert operation
• Binlog/file formats are complex
• Full-Text support still maturing
• Not designed for cloud
• Strict relational model
• Schema changes are hard
• Trendy developers
• No REST API
• Copyleft license

What are the major limitations of MySQL?
• Expert Operations
• Relational Model
• No Dynamic Scaling
• Licensing & Extension

What is ElasticSearch?
• Clustered, search-focused structured data store
• Designed for cloud operations
• Symmetric cluster model
• Internal Sharding & Scaling
• Based on Apache Lucene
• REST API
• Dynamic schema
• Open Source / Apache License
• Commercial support

Why should we consider ElasticSearch?
• Fully dynamic clustering, REST API
• Optimized for full-text search
• Lightweight index filters & aliases
• Extremely pluggable via Java or scripting: Scoring, Analysis, Tokenization, Indexing, Server Plugins
• More automated sharding & scaling
• Based on Apache Lucene (1999)
• High-performance
• High availability (tunable)
• Dynamic schema
• Index & Data are same
• Hosted offerings (found.no)
• Strong community
• Non-Copyleft License (Apache)
• Commercial support

Why would we use ElasticSearch?
• Feature-Rich Search
• Performance & Availability
• Operational Model
• Community & Support

What are some limitations of ElasticSearch?
• Not ACID
• Eventually consistent
• Cluster authority vs. master/slave
• Java failure modes (OOM)
• Data set sizing is tricky
• Data placement is tricky to control
• Snapshot/Restore is tricky, new
• Plugin deployment requires restart
• Not good for blob data
• Dynamic schema
• Young security model
• Weak compliance story
• Limited forensic tools
• No equivalent to mysqldump
• Vulnerable to block-level file system corruption
• Not recommended as primary data store

What are the key limitations of ElasticSearch?
• Durability
• Consistency Model
• Operational Transparency
• Schema Migration Model
• Security

What can we do with MySQL *and* ElasticSearch?
• Primary storage
• System of record
• Compact indexes
• Higher consistency
• SQL querying
• Full-text search
• Lightweight indexes
• Custom scoring & analysis
• Higher availability
• Horizontal Scaling

Bi-directional replication for ElasticSearch & MySQL
[Diagram: MySQL master, MySQL slave, ElasticSearch Node A, ElasticSearch Node B, ElasticSearch Node C. Sources: http://support.smartbear.com/]

Bi-directional replication for ElasticSearch & MySQL
[Diagram: MySQL master, MySQL slave, ElasticSearch Node A, ElasticSearch Node B, ElasticSearch Node C. Sources: http://support.smartbear.com/]
• Goal: move data from MySQL to ElasticSearch and vice-versa
• Facilitate wide-area replication and integration
• Use JSON in intermediate channel
• Create useful primitives for connectors

Getting Data out of MySQL
• Solution: binlog-based replication client
• Use open-replicator (Java binlog client and parser)
• Row-based replication
• Turns row updates into a JSON stream
• Tricky bit: table metadata
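A minimal sketch of the "row updates into a JSON stream" step, written in Python rather than the Java used in the talk. One plausible reading of the "table metadata" bullet is that row-based binlog events carry column values by position rather than by name, so the column names have to be looked up separately. The `TABLE_COLUMNS` mapping, the `row_event_to_json` function, and the message layout (`table`, `op`, `row`) are all assumptions made for illustration; they are not taken from open-replicator or from the talk's code.

```python
import json

# Hypothetical table metadata, e.g. loaded once from information_schema.columns.
# The row image below only has positional values, so names come from here.
TABLE_COLUMNS = {
    "articles": ["id", "title", "body"],
}


def row_event_to_json(table, op, values, columns=TABLE_COLUMNS):
    """Turn one positional row image into a JSON change message."""
    names = columns[table]
    if len(names) != len(values):
        # Metadata is stale (e.g. after an ALTER TABLE); refuse to guess.
        raise ValueError(f"column count mismatch for table {table!r}")
    message = {"table": table, "op": op, "row": dict(zip(names, values))}
    return json.dumps(message, sort_keys=True)


if __name__ == "__main__":
    print(row_event_to_json("articles", "insert", [1, "Hello", "First post"]))
```

The length check is the part that matters: if the cached metadata drifts from the real schema, a silent mismatch would attach values to the wrong column names.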

Getting Data into MySQL
• Solution: Java bridge from JSON to JDBC to MySQL
• Use JDBI for easy SQL operations / queries
• JSON data includes table name, has column names as field names
• insert -> INSERT, update -> UPDATE, delete -> DELETE
• Tricky bit: unique id column
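The insert/update/delete mapping above can be sketched as a function that turns one JSON change message into a parameterized SQL statement. This is a Python illustration only, not the talk's Java/JDBI bridge. The message layout (`table`, `op`, `row`), the function name `to_sql`, and the default `id` column are assumptions; the `?` placeholders follow the JDBC convention.

```python
import json


def to_sql(message_json, id_column="id"):
    """Map a JSON change message to (sql, params) with ? placeholders."""
    msg = json.loads(message_json)
    table, op, row = msg["table"], msg["op"], msg["row"]

    # Identifiers cannot be bound as parameters, so validate them strictly.
    for name in [table, *row]:
        if not name.replace("_", "").isalnum():
            raise ValueError(f"unsafe identifier: {name!r}")
    if id_column not in row:
        # The "unique id column" bullet: without it, UPDATE and DELETE
        # cannot address a single row.
        raise ValueError(f"message has no {id_column!r} field")

    if op == "insert":
        cols = list(row)
        marks = ", ".join("?" for _ in cols)
        sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({marks})"
        return sql, [row[c] for c in cols]
    if op == "update":
        cols = [c for c in row if c != id_column]
        if not cols:
            raise ValueError("update message carries no columns to set")
        sets = ", ".join(f"{c} = ?" for c in cols)
        sql = f"UPDATE {table} SET {sets} WHERE {id_column} = ?"
        return sql, [row[c] for c in cols] + [row[id_column]]
    if op == "delete":
        return f"DELETE FROM {table} WHERE {id_column} = ?", [row[id_column]]
    raise ValueError(f"unknown operation: {op!r}")


if __name__ == "__main__":
    change = {"table": "articles", "op": "update",
              "row": {"id": 1, "title": "Hello"}}
    print(to_sql(json.dumps(change)))
```

Values always travel as bound parameters; only table and column names are interpolated, which is why they are checked before use.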

Getting Data out of ElasticSearch
• Solution: ElasticSearch “Changes” (updates plugin) to JSON
• Updates include change type and document data
• JSON documents include index name (corresponds to table name in MySQL)
• Field names are same as column names in MySQL
• Plugin runs on all nodes, but events only fire from primary

Getting Data into ElasticSearch
• Solution: ElasticSearch “River” (indexing plugin) from JSON
• JSON data includes index name (corresponds to table name in MySQL)
• Field names are same as column names in MySQL
• River runs on all nodes, but events only fire from primary
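As a rough illustration of the table-to-index mapping, the sketch below converts change messages into a newline-delimited body in the shape accepted by ElasticSearch's bulk API. It is not the River plugin from the talk. The message layout (`table`, `op`, `row`), the function name `to_bulk_body`, and the use of the `id` field as the document id are assumptions. ElasticSearch versions contemporary with the talk also expected a `_type` (in the URL or the action line), which is omitted here for brevity.

```python
import json


def to_bulk_body(messages, id_field="id"):
    """Build a bulk request body: one action line, plus a source line for writes."""
    lines = []
    for msg in messages:
        # Table name maps to index name; the unique id column maps to _id.
        meta = {"_index": msg["table"], "_id": str(msg["row"][id_field])}
        if msg["op"] == "delete":
            lines.append(json.dumps({"delete": meta}))
        elif msg["op"] in ("insert", "update"):
            # "index" replaces the whole document, which suits full row images.
            lines.append(json.dumps({"index": meta}))
            lines.append(json.dumps(msg["row"]))
        else:
            raise ValueError(f"unknown operation: {msg['op']!r}")
    # The bulk format requires every line, including the last, to end in \n.
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    changes = [
        {"table": "articles", "op": "insert", "row": {"id": 1, "title": "Hello"}},
        {"table": "articles", "op": "delete", "row": {"id": 2}},
    ]
    print(to_bulk_body(changes), end="")
```

Treating both inserts and updates as `index` actions keeps the listener idempotent: replaying the same message leaves the document unchanged.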

Covering all our bases
• ElasticSearch -> MySQL
• MySQL -> ElasticSearch
• ElasticSearch -> ElasticSearch
• MySQL -> MySQL
[Diagram: MySQL master, MySQL slave, ElasticSearch Node A, ElasticSearch Node B, ElasticSearch Node C]

What do we use as the channel?
• Publish/Subscribe model
• Reliable
• Ordered
• Appropriate for WAN use
• Multi-Region, Multi-Availability zone
• Encrypted

Intermediate Channel Options

What do we use as the channel?
• Redis: not clustered, not fault-tolerant
• ZeroMQ: not fault-tolerant
• RabbitMQ: not multi-AZ, complex to manage
• PubNub: high availability, reliable, encrypted, global message propagation within 250ms

How does this all work?
• Each primary data store has a connector that writes to a PubNub channel
• PubNub propagates data to all listeners
• Listeners apply data updates to local data stores
• PubNub provides reliability, ordering, easy integration
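The flow above can be sketched with an in-memory stand-in for the channel. Nothing here uses the PubNub API: `InMemoryChannel`, `LocalStore`, and the message layout (`table`, `op`, `row`) are hypothetical names chosen for illustration, and the sketch only models ordered fan-out to listeners, not reliability or encryption.

```python
import json


class InMemoryChannel:
    """Stand-in for the pub/sub channel: every listener sees every message, in order."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def publish(self, message):
        payload = json.dumps(message)  # JSON is the intermediate format
        for callback in self._listeners:
            callback(json.loads(payload))  # each listener gets its own copy


class LocalStore:
    """Toy local data store keyed by table name and row id."""

    def __init__(self, name):
        self.name = name
        self.tables = {}

    def apply(self, msg):
        rows = self.tables.setdefault(msg["table"], {})
        key = msg["row"]["id"]
        if msg["op"] == "delete":
            rows.pop(key, None)
        else:  # insert and update both store the full row image
            rows[key] = msg["row"]


channel = InMemoryChannel()
search_replica = LocalStore("elasticsearch")
sql_replica = LocalStore("mysql-replica")
channel.subscribe(search_replica.apply)
channel.subscribe(sql_replica.apply)

# A connector on the primary store would publish these as it reads its log.
channel.publish({"table": "articles", "op": "insert", "row": {"id": 1, "title": "Hello"}})
channel.publish({"table": "articles", "op": "update", "row": {"id": 1, "title": "Hello, world"}})

if __name__ == "__main__":
    print(search_replica.tables)
    print(sql_replica.tables)
```

Because both listeners receive the same messages in the same order, they end in the same state, which is why the ordering guarantee on the channel matters.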

Additional Connectors
• This talk covers MySQL and ElasticSearch connectors
• Additional connectors in open source for MongoDB and Redis
• MySQL will map into future support for Postgres & other relational databases that support replication clients
• All you need to connect a data store is an operations log client and insert/update/delete operations
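The last bullet suggests a small contract for any connector: something that reads the store's operations log, and something that applies insert/update/delete operations. A minimal sketch of that contract follows. The `Connector` base class, the `DictConnector` toy store, and the `replicate` helper are hypothetical and are not the interface of the open source connectors mentioned above.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical minimal contract for plugging in a data store."""

    @abstractmethod
    def read_changes(self):
        """Yield change messages from the store's operations log, in order."""

    @abstractmethod
    def apply(self, change):
        """Apply one insert/update/delete change to the local store."""


class DictConnector(Connector):
    """Toy store: a dict of rows plus an append-only operations log."""

    def __init__(self):
        self.rows = {}
        self.log = []

    def write(self, op, row):
        """A local write: change the data and record it in the log."""
        change = {"op": op, "row": row}
        self._apply_local(change)
        self.log.append(change)

    def read_changes(self):
        yield from self.log

    def apply(self, change):
        # Replicated changes are applied without being logged again,
        # so they are not echoed back to where they came from.
        self._apply_local(change)

    def _apply_local(self, change):
        key = change["row"]["id"]
        if change["op"] == "delete":
            self.rows.pop(key, None)
        else:
            self.rows[key] = change["row"]


def replicate(source, target):
    """Copy every logged change from source to target, preserving order."""
    for change in source.read_changes():
        target.apply(change)


if __name__ == "__main__":
    primary, replica = DictConnector(), DictConnector()
    primary.write("insert", {"id": 1, "name": "a"})
    primary.write("update", {"id": 1, "name": "b"})
    replicate(primary, replica)
    print(replica.rows)
```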

Thank You!
