Upgrade to Pro — share decks privately, control downloads, hide ads and more …

pg_chameleon a MySQL to PostgreSQL replica

pg_chameleon a MySQL to PostgreSQL replica

pg_chameleon is a lightweight replication system written in
python. The tool connects to the mysql replication protocol and replicates the data in PostgreSQL.
The history, the logic and the future of the tool.

Federico Campoli

May 30, 2017
Tweet

More Decks by Federico Campoli

Other Decks in Programming

Transcript

  1. pg chameleon MySQL to PostgreSQL replica Federico Campoli Transferwise 30

    May 2017 Federico Campoli (Transferwise) pg chameleon 30 May 2017 1 / 48
  2. Few words about the speaker Born in 1972 Passionate about

    IT since 1982 Federico Campoli (Transferwise) pg chameleon 30 May 2017 2 / 48
  3. Few words about the speaker Born in 1972 Passionate about

    IT since 1982 mostly because of the TRON movie Joined the Oracle DBA secret society in 2004 In love with PostgreSQL since 2006 Currently runs the Brighton PostgreSQL User group Works at Transferwise as Data Engineer Federico Campoli (Transferwise) pg chameleon 30 May 2017 2 / 48
  4. Table of contents 1 Some history 2 MySQL Replica in

    a nutshell 3 A chameleon in the middle 4 Replica in action 5 Lessons learned 6 Wrap up Federico Campoli (Transferwise) pg chameleon 30 May 2017 3 / 48
  5. Table of contents 1 Some history 2 MySQL Replica in

    a nutshell 3 A chameleon in the middle 4 Replica in action 5 Lessons learned 6 Wrap up Federico Campoli (Transferwise) pg chameleon 30 May 2017 4 / 48
  6. The beginnings Years 2006/2012 neo my2pg.py I wrote the script

    because of a struggling phpbb on MySQL The database migration was successful However phpbb didn’t work very well with PostgreSQL Federico Campoli (Transferwise) pg chameleon 30 May 2017 5 / 48
  7. The beginnings Years 2006/2012 neo my2pg.py I wrote the script

    because of a struggling phpbb on MySQL The database migration was successful However phpbb didn’t work very well with PostgreSQL The script is in python 2.6 It’s a monolith script And it’s slow, very slow Federico Campoli (Transferwise) pg chameleon 30 May 2017 5 / 48
  8. The beginnings Years 2006/2012 neo my2pg.py I wrote the script

    because of a struggling phpbb on MySQL The database migration was successful However phpbb didn’t work very well with PostgreSQL The script is in python 2.6 It’s a monolith script And it’s slow, very slow It’s a good checklist for things to avoid when coding https://github.com/the4thdoctor/neo my2pg Federico Campoli (Transferwise) pg chameleon 30 May 2017 5 / 48
  9. I’m not scared of using the ORMs Years 2013/2015 First

    attempt of pg chameleon Developed in Python 2.7 Used SQLAlchemy for extracting the MySQL’s metadata Proof of concept only Federico Campoli (Transferwise) pg chameleon 30 May 2017 6 / 48
  10. I’m not scared of using the ORMs Years 2013/2015 First

    attempt of pg chameleon Developed in Python 2.7 Used SQLAlchemy for extracting the MySQL’s metadata Proof of concept only It was built during the years of the roller coaster Therefore it was a just a way to discharge frustration Abandoned after a while Federico Campoli (Transferwise) pg chameleon 30 May 2017 6 / 48
  11. I’m not scared of using the ORMs Years 2013/2015 First

    attempt of pg chameleon Developed in Python 2.7 Used SQLAlchemy for extracting the MySQL’s metadata Proof of concept only It was built during the years of the roller coaster Therefore it was a just a way to discharge frustration Abandoned after a while SQLAlchemy’s limitations were frustrating as well And there was already pgloader doing the same job Federico Campoli (Transferwise) pg chameleon 30 May 2017 6 / 48
  12. pg chameleon reborn Year 2016 I revamped the project because

    I needed to replicate the data data from MySQL to PostgreSQL. And the library python-mysql-replication looked very promising for reading the mysql replica protocol. Trying won’t harm they said. Federico Campoli (Transferwise) pg chameleon 30 May 2017 7 / 48
  13. pg chameleon v1 Compatible with CPython 2.7/3.3+ Removed SQLAlchemy Replaced

    the mysqldb driver with PyMySQL Added a command line helper Installs in virtualenv and system wide Shipped via pypi for easy installation Federico Campoli (Transferwise) pg chameleon 30 May 2017 8 / 48
  14. Table of contents 1 Some history 2 MySQL Replica in

    a nutshell 3 A chameleon in the middle 4 Replica in action 5 Lessons learned 6 Wrap up Federico Campoli (Transferwise) pg chameleon 30 May 2017 9 / 48
  15. MySQL Replica MySQL replica is logical When configured the data

    changes are stored in the master’s binary log files The slave gets from the master the data changes The data changes are saved in the slave’s relay logs The relay logs are used to replay the data in the slave Federico Campoli (Transferwise) pg chameleon 30 May 2017 10 / 48
  16. Log formats STATEMENT: It logs the statements which are replayed

    on the slave. It’s the best solution for performance, however when replaying statements with not deterministic functions this format generates different values o the slave (e.g. using an insert wit the uuid function). ROW: It’s deterministic. This format logs the row image and the DDL queries. This format is compulsory for using pg chameleon. MIXED takes the best of both worlds. The master logs the statements unless a not deterministic function is used. In that case it logs the row image. Federico Campoli (Transferwise) pg chameleon 30 May 2017 11 / 48
  17. Table of contents 1 Some history 2 MySQL Replica in

    a nutshell 3 A chameleon in the middle 4 Replica in action 5 Lessons learned 6 Wrap up Federico Campoli (Transferwise) pg chameleon 30 May 2017 13 / 48
  18. pg chameleon pg chameleon mimics a mysql slave’s behaviour Reads

    the replica Stores the decoded rows into a PostgreSQL table PostgreSQL acts as relay log and replication slave A plpgSQL function decodes the rows and replay the changes Federico Campoli (Transferwise) pg chameleon 30 May 2017 14 / 48
  19. Features Read the schema and data from MySQL and restore

    it into a target PostgreSQL schema Setup PostgreSQL to act as a MySQL slave Basic DDL Support (CREATE/DROP/ALTER TABLE, DROP PRIMARY KEY/TRUNCATE) Handles the rubbish data coming from the replica stream and saves the problematic rows in sch chameleon.t discarded rows Supports multiple MySQL sources for replica There is a basic replica monitoring Can detach replica from MySQL leaving PostgreSQL ready to work as standalone server Federico Campoli (Transferwise) pg chameleon 30 May 2017 16 / 48
  20. Table of contents 1 Some history 2 MySQL Replica in

    a nutshell 3 A chameleon in the middle 4 Replica in action 5 Lessons learned 6 Wrap up Federico Campoli (Transferwise) pg chameleon 30 May 2017 17 / 48
  21. MySQL configuration The mysql configuration file is usually stored in

    /etc/mysql/my.cnf To enable the binary logging find the section [mysqld] and check the following parameters are set. binlog_format= ROW log-bin = mysql-bin server-id = 1 binlog-row-image = FULL Federico Campoli (Transferwise) pg chameleon 30 May 2017 18 / 48
  22. MySQL user for replica Setup a replication user on MySQL

    CREATE USER usr_replica ; SET PASSWORD FOR usr_replica =PASSWORD(’replica ’); GRANT ALL ON sakila .* TO ’usr_replica ’; GRANT RELOAD ON *.* to ’usr_replica ’; GRANT REPLICATION CLIENT ON *.* to ’usr_replica ’; GRANT REPLICATION SLAVE ON *.* to ’usr_replica ’; FLUSH PRIVILEGES; In our example we are using the sakila test database. https://dev.mysql.com/doc/sakila/en/ Federico Campoli (Transferwise) pg chameleon 30 May 2017 19 / 48
  23. PostgreSQL setup Add an user on PostgreSQL capable to create

    schemas and relations in the destination database CREATE USER usr_replica WITH PASSWORD ’replica ’; CREATE DATABASE db_replica WITH OWNER usr_replica; Federico Campoli (Transferwise) pg chameleon 30 May 2017 20 / 48
  24. Install pg chameleon The simplest way to install pg chameleon

    is with a virtual environment. However if you have root access on your system the installation can be system wide. It’s important to upgrade pip before installing the package. python3 -m venv venv source venv/bin/activate pip install pip --upgrade pip install pg_chameleon Execute chameleon.py to create the configuration directory $HOME/.pg chameleon/ chameleon.py Federico Campoli (Transferwise) pg chameleon 30 May 2017 21 / 48
  25. Replica setup cd in $HOME/.pg chameleon/ and copy config-yaml.example to

    default.yaml then edit the file adding the connection settings and the source and destination schemas. my_database: sakila pg_database: db_replica dest_schema: ’my_schema’ mysql_conn: host: derpy port: 3306 user: usr_replica passwd: replica pg_conn: host: derpy port: 5432 user: usr_replica password: replica Federico Campoli (Transferwise) pg chameleon 30 May 2017 22 / 48
  26. Init replica Activate the virtualenv and run chameleon.py create_schema --config

    default chameleon.py add_source --config default chameleon.py init_replica --config default Wait for the init replica to complete. If the database is large consider running the init replica in a screen or tmux session. Federico Campoli (Transferwise) pg chameleon 30 May 2017 23 / 48
  27. Start replica Start the replica with chameleon.py start_replica --config default

    Federico Campoli (Transferwise) pg chameleon 30 May 2017 24 / 48
  28. Project structure project directory scripts chameleon.py pg chameleon lib global

    lib.py mysql lib.py pg lib.py sqlutil lib.py Federico Campoli (Transferwise) pg chameleon 30 May 2017 25 / 48
  29. chameleon.py Command line wrapper Use argparse to execute the commands

    Supports several commands Federico Campoli (Transferwise) pg chameleon 30 May 2017 26 / 48
  30. chameleon.py Command line wrapper Use argparse to execute the commands

    Supports several commands After the installation executing chameleon.py creates the configuration directory $HOME/.pg chameleon/ with three subdirectories pid logs config Federico Campoli (Transferwise) pg chameleon 30 May 2017 26 / 48
  31. chameleon.py Commands drop schema Drops the service schema sch chameleon

    with cascade option. create schema Create the service schema sch chameleon. upgrade schema Upgrade an existing schema sch chameleon to an newer version. init replica Creates the table structure from the mysql to PostgreSQL. The mysql tables are locked in read only mode and the data is copied into the PostgreSQL database. The master’s coordinates are stored in the PostgreSQL replica catalogue. start replica Starts the replication from mysql to PostgreSQL using the master data stored in sch chameleon.t replica batch. The master’s position is updated when a new batch is processed. list config List the available configurations and their status (’ready’, ’initialising’,’initialised’,’stopped’,’running’,’error’) Federico Campoli (Transferwise) pg chameleon 30 May 2017 27 / 48
  32. chameleon.py add source register a new configuration file as source

    drop source remove the configuration from the registered sources stop replica ends the replica process gracefully disable replica ends the replica process and disable the restart enable replica enable the replica process sync replica sync the data between mysql and postgresql without dropping the tables show status displays the replication status for each source, with the lag in seconds and the last received event detach replica stops the replica stream, discards the replica setup and resets the sequences in PostgreSQL to work as a standalone db. Federico Campoli (Transferwise) pg chameleon 30 May 2017 28 / 48
  33. global lib.py class global config: loads the configuration parameters into

    the class attributes class replica engine: wraps the classes mysql engine and pgsql engine and setup the logging method. The global config instance is used to track the configuration settings. Federico Campoli (Transferwise) pg chameleon 30 May 2017 29 / 48
  34. mysql lib.py class mysql connection: connects to mysql using the

    parameters provided by replica engine class mysql engine: does all the magic for the replication setup and execution Federico Campoli (Transferwise) pg chameleon 30 May 2017 30 / 48
  35. pg lib.py class pg encoder: extends the class JSON and

    adds some special handling for types like decimal and datetime class pgsql connection: connects to the PostgreSQL database class pgsql engine: does all the magic for rebuilding the data structure, loading data and migrating the schema Federico Campoli (Transferwise) pg chameleon 30 May 2017 31 / 48
  36. sqlutil lib.py Consists in just one class sql token which

    tokenise the mysql queries using the regular expressions. Federico Campoli (Transferwise) pg chameleon 30 May 2017 32 / 48
  37. sqlutil lib.py Consists in just one class sql token which

    tokenise the mysql queries using the regular expressions. Yes, I have two problems now! Federico Campoli (Transferwise) pg chameleon 30 May 2017 32 / 48
  38. Limitations Tables for being replicated require primary keys No Daemonisation

    Binary data are hexified to avoid issues with PostgreSQL Federico Campoli (Transferwise) pg chameleon 30 May 2017 33 / 48
  39. The future pg chameleon v2 development is already started. The

    first alpha will come out soon. The new version is a reorganisation of the version 1 with several improvements. Reorganised configuration files Background copy with parallel processes Separate daemon for read and replay Improved monitoring Python 3 only Federico Campoli (Transferwise) pg chameleon 30 May 2017 34 / 48
  40. Table of contents 1 Some history 2 MySQL Replica in

    a nutshell 3 A chameleon in the middle 4 Replica in action 5 Lessons learned 6 Wrap up Federico Campoli (Transferwise) pg chameleon 30 May 2017 35 / 48
  41. init replica tune The replica initialisation required several improvements. The

    OOM killer is always happy to kill processes using large amount of memory Using a general slice size doesn’t work well because with large rows the process crashes Estimating the total rows for user’s feedback is faster but the output can be odd. Using not buffered cursors improves the speed and the memory usage. Federico Campoli (Transferwise) pg chameleon 30 May 2017 36 / 48
  42. Strictness is an illusion. MySQL doubly so MySQL’s lack of

    strictness is not a mystery. The funny way the default with NOT NULL is managed by MySQL can break the replica. Therefore any field with NOT NULL added after the initialisation are created always as NULLable in PostgreSQL. Federico Campoli (Transferwise) pg chameleon 30 May 2017 37 / 48
  43. I feel your lack of constraint disturbing Rubbish data in

    MySQL can be stored without errors raised by the DBMS. When this happens the replicator traps the error when the change is replayed on PostgreSQL and discards the problematic row. The value is stored hexified in the table t discarded rows for later analysis. Federico Campoli (Transferwise) pg chameleon 30 May 2017 38 / 48
  44. Table of contents 1 Some history 2 MySQL Replica in

    a nutshell 3 A chameleon in the middle 4 Replica in action 5 Lessons learned 6 Wrap up Federico Campoli (Transferwise) pg chameleon 30 May 2017 39 / 48
  45. Igor, the green little guy The chameleon logo has been

    developed by Elena Toma, a talented Italian Lady. https://www.facebook.com/Tonkipapperoart/ The name Igor is inspired by Martin Feldman’s Igor portraited in Young Frankenstein movie. Federico Campoli (Transferwise) pg chameleon 30 May 2017 40 / 48
  46. Some numbers Lines of code global lib.py 327 mysql lib.py

    401 pg lib.py 670 sql util.py 228 chameleon.py 58 Total lines of code 1684 Federico Campoli (Transferwise) pg chameleon 30 May 2017 41 / 48
  47. pg chameleon’s license 2 clause BSD License Copyright (c) 2016,2017

    Federico Campoli All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Federico Campoli (Transferwise) pg chameleon 30 May 2017 42 / 48
  48. Feedback please! Please report any issue on github! https://github.com/the4thdoctor/pg chameleon

    Federico Campoli (Transferwise) pg chameleon 30 May 2017 43 / 48
  49. Boring legal stuff MySQL Image source WikiCommons Hard Disk image

    source WikiCommons Tron image source Tron Wikia Federico Campoli (Transferwise) pg chameleon 30 May 2017 44 / 48
  50. Contacts and license Twitter: 4thdoctor scarf Blog:http://www.pgdba.co.uk Brighton PostgreSQL Meetup:

    http://www.meetup.com/Brighton-PostgreSQL-Meetup/ This document is distributed under the terms of the Creative Commons Federico Campoli (Transferwise) pg chameleon 30 May 2017 47 / 48
  51. pg chameleon MySQL to PostgreSQL replica Federico Campoli Transferwise 30

    May 2017 Federico Campoli (Transferwise) pg chameleon 30 May 2017 48 / 48