Upgrade to Pro — share decks privately, control downloads, hide ads and more …

pg_chameleon, MySQL to PostgreSQL replica made easy

pg_chameleon, MySQL to PostgreSQL replica made easy

pg_chameleon is a lightweight replication system written in python. The tool can connect to the mysql replication protocol and replicate the data changes in PostgreSQL.

Whether the user needs to setup a permanent replica between MySQL and PostgreSQL or perform an engine migration, pg_chamaleon is the perfect tool for the job.

Federico Campoli

December 19, 2017
Tweet

More Decks by Federico Campoli

Other Decks in Programming

Transcript

  1. pg chameleon MySQL to PostgreSQL replica made easy Federico Campoli

    Transferwise 19 Dec 2017 https://github.com/the4thdoctor @4thdoctor scarf Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 1 / 50
  2. Few words about the speaker Born in 1972 Passionate about

    IT since 1982 mostly because of the TRON movie Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 2 / 50
  3. Few words about the speaker Born in 1972 Passionate about

    IT since 1982 mostly because of the TRON movie Joined the Oracle DBA secret society in 2004 In love with PostgreSQL since 2006 Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 2 / 50
  4. Few words about the speaker Born in 1972 Passionate about

    IT since 1982 mostly because of the TRON movie Joined the Oracle DBA secret society in 2004 In love with PostgreSQL since 2006 Founded the Brighton PostgreSQL User group Works at Transferwise as Data Engineer Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 2 / 50
  5. Table of contents 1 History 2 MySQL Replica in a

    nutshell 3 A chameleon in the middle 4 Replica in action 5 Lessons learned 6 Wrap up Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 3 / 50
  6. The beginnings Years 2006/2012 neo my2pg.py I wrote the script

    because of a struggling phpbb on MySQL The database migration was successful However phpbb didn’t work very well with PostgreSQL.1 1Opening a new connection for each query is not the smartest thing to do. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 5 / 50
  7. The beginnings Years 2006/2012 neo my2pg.py I wrote the script

    because of a struggling phpbb on MySQL The database migration was successful However phpbb didn’t work very well with PostgreSQL.1 The script is in python 2.6 It’s a monolith script And it’s slow, very slow 1Opening a new connection for each query is not the smartest thing to do. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 5 / 50
  8. The beginnings Years 2006/2012 neo my2pg.py I wrote the script

    because of a struggling phpbb on MySQL The database migration was successful However phpbb didn’t work very well with PostgreSQL.1 The script is in python 2.6 It’s a monolith script And it’s slow, very slow It’s a good checklist for things to avoid when coding https://github.com/the4thdoctor/neo my2pg 1Opening a new connection for each query is not the smartest thing to do. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 5 / 50
  9. I’m not scared of using the ORMs Years 2013/2015 First

    attempt of pg chameleon Developed in Python 2.7 Used SQLAlchemy for extracting the MySQL’s metadata Proof of concept only It was built during the years of the life on a roller coaster2 Therefore it was a just a way to discharge frustration 2Recording available here: https://youtu.be/R86dJcXofZg Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 6 / 50
  10. I’m not scared of using the ORMs Years 2013/2015 First

    attempt of pg chameleon Developed in Python 2.7 Used SQLAlchemy for extracting the MySQL’s metadata Proof of concept only It was built during the years of the life on a roller coaster2 Therefore it was a just a way to discharge frustration Abandoned after a while SQLAlchemy’s limitations were frustrating as well And pgloader did the same job much much better 2Recording available here: https://youtu.be/R86dJcXofZg Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 6 / 50
  11. pg chameleon reborn Year 2016 I revamped the project because

    I needed to replicate the data from MySQL to PostgreSQL. Luckily the amazing library python-mysql-replication can read the mysql replica protocol https://github.com/noplay/python-mysql-replication Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 7 / 50
  12. MySQL Replica The MySQL replica is logical The master stores

    the logical changes into the local binary log files The slave streams the binary log files from the master The slave saves the stramed data in local relay logs The relay logs are used to replay the data against the slave Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 9 / 50
  13. Log formats MySQL stores the binlog data with different strategies.

    STATEMENT: It logs the statements which are replayed on the slave. It’s the best solution for the bandwidth. However, when replaying statements with not deterministic functions this format generates different values on the slave (e.g. using an insert with a column autogenerated by the uuid function). ROW: It’s deterministic. This format logs the row images. MIXED takes the best of both worlds. The master logs the statements unless a not deterministic function is used. In that case it logs the row image. All three formats always log the DDL query events. The python-mysql-replication library used in pg chameleon requires the ROW format to work properly. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 11 / 50
  14. pg chameleon pg chameleon mimics a mysql slave’s behaviour It

    performs the initial load for the replicated tables It connects to the MySQL replica protocol It stores the row images into a PostgreSQL table A plpgSQL function decodes the rows and replay the changes It can detach the replica for minimal downtime migrations PostgreSQL acts as relay log and replication slave Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 13 / 50
  15. Why pg chameleon? PostgreSQL supports federated tables and tools like

    pgloader can migrate easily the data from MySQL. Why I should setup a replica instead? Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 15 / 50
  16. Why pg chameleon? PostgreSQL supports federated tables and tools like

    pgloader can migrate easily the data from MySQL. Why I should setup a replica instead? I’m happy with MySQL, I just need a database for analytics Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 15 / 50
  17. Why pg chameleon? PostgreSQL supports federated tables and tools like

    pgloader can migrate easily the data from MySQL. Why I should setup a replica instead? I’m happy with MySQL, I just need a database for analytics I want to migrate to PostgreSQL with minimal downtime Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 15 / 50
  18. Why pg chameleon? PostgreSQL supports federated tables and tools like

    pgloader can migrate easily the data from MySQL. Why I should setup a replica instead? I’m happy with MySQL, I just need a database for analytics I want to migrate to PostgreSQL with minimal downtime I have multiple MySQL databases and I want to aggregate data in a single machine Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 15 / 50
  19. Why pg chameleon? PostgreSQL supports federated tables and tools like

    pgloader can migrate easily the data from MySQL. Why I should setup a replica instead? I’m happy with MySQL, I just need a database for analytics I want to migrate to PostgreSQL with minimal downtime I have multiple MySQL databases and I want to aggregate data in a single machine My ginormous queries will kill the network if I’d use the foreign data wrapper Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 15 / 50
  20. Why pg chameleon? PostgreSQL supports federated tables and tools like

    pgloader can migrate easily the data from MySQL. Why I should setup a replica instead? I’m happy with MySQL, I just need a database for analytics I want to migrate to PostgreSQL with minimal downtime I have multiple MySQL databases and I want to aggregate data in a single machine My ginormous queries will kill the network if I’d use the foreign data wrapper I can’t get rid of MySQL but I need that super cool PostgreSQL’s feature Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 15 / 50
  21. pg chameleon 1.8.2 Final relese of the branch 1.x Compatible

    with CPython 2.7/3.3+ No more SQLAlchemy The MySQL driver is PyMySQL Command line helper Supports type override on the fly (Danger!) Installs in virtualenv and system wide Shipped via pypi for easy installation Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 16 / 50
  22. Version 1.8’s limitations Tables for being replicated require primary keys

    No daemon, the process always stays in foreground Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 17 / 50
  23. Version 1.8’s limitations Tables for being replicated require primary keys

    No daemon, the process always stays in foreground Single schema replica Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 17 / 50
  24. Version 1.8’s limitations Tables for being replicated require primary keys

    No daemon, the process always stays in foreground Single schema replica One configuration process per each schema Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 17 / 50
  25. Version 1.8’s limitations Tables for being replicated require primary keys

    No daemon, the process always stays in foreground Single schema replica One configuration process per each schema Network inefficient Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 17 / 50
  26. Version 1.8’s limitations Tables for being replicated require primary keys

    No daemon, the process always stays in foreground Single schema replica One configuration process per each schema Network inefficient All the affected tables are locked in read only mode during the init replica process A single error in the replay process and the replica is broken If using –thread, the threading library is not very efficient During the init replica the data is not accessible Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 17 / 50
  27. pg chameleon 2.0 #1 Currently in Beta, should become stable

    the in January 2018 Compatible with python 3.3+ Installs in virtualenv and system wide via pypi Replicates multiple schemas from a single MySQL into a target PostgreSQL database Conservative approach to the replica. Tables which generate errors are automatically excluded from the replica Daemonised replica process with two distinct subprocesses, for concurrent read and replay Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 18 / 50
  28. pg chameleon 2.0 #2 Soft locking replica initialisation. The tables

    are locked only during the copy Their log coordinates are used by the replica damon to gain the consistent status gradually Rollbar integration for a simpler error detection Experimental support for the PostgreSQL source type The tables are loaded in a separate schema which is swapped with the existing. This approach requires more space but makes less painful to re init a replica, leaving the old data accessible until the init replica is complete. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 19 / 50
  29. Version 2.0’s limitations Tables for being replicated require primary or

    unique keys When detaching the replica the foreign keys are created always ON DELETE/UPDATE RESTRICT The source type PostgreSQL supports only the init replica process Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 20 / 50
  30. Replica initialisation The version 1.8 runs the replica initialisation like

    stated on the mysql online manual. Flush the tables with read lock Get the master’s coordinates Copy the data Release the locks However... Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 22 / 50
  31. Replica initialisation The version 1.8 runs the replica initialisation like

    stated on the mysql online manual. Flush the tables with read lock Get the master’s coordinates Copy the data Release the locks However... pg chameleon 2.0 flushes the tables with read lock one by one. The lock is held only during the copy. The log coordinates are stored in the replica catalogue along the table’s name and used by the replica process to determine whether the table’s binlog data should be used or not. The replica starts inconsistent and gains consistency over time. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 22 / 50
  32. Fallback on failure The data is pulled from mysql using

    the CSV format in slices. This approach prevents the memory overload. Once the file is saved then is pushed into PostgreSQL using the COPY command. However... Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 23 / 50
  33. Fallback on failure The data is pulled from mysql using

    the CSV format in slices. This approach prevents the memory overload. Once the file is saved then is pushed into PostgreSQL using the COPY command. However... COPY is fast but is single transaction One failure and the entire batch is rolled back If this happens the procedure loads the same data using the INSERT statements Which can be very slow But at least discards only the problematic rows Cleaning the NUL markers allowed by MySQL Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 23 / 50
  34. MySQL configuration The mysql configuration file is usually stored in

    /etc/mysql/my.cnf To enable the binary logging find the section [mysqld] and check that the following parameters are set. binlog_format= ROW log-bin = mysql-bin server-id = 1 binlog-row-image = FULL Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 24 / 50
  35. MySQL user for replica Setup a replication user on MySQL

    CREATE USER usr_replica ; SET PASSWORD FOR usr_replica =PASSWORD(’replica ’); GRANT ALL ON sakila .* TO ’usr_replica ’; GRANT RELOAD ON *.* to ’usr_replica ’; GRANT REPLICATION CLIENT ON *.* to ’usr_replica ’; GRANT REPLICATION SLAVE ON *.* to ’usr_replica ’; FLUSH PRIVILEGES; In our example we are using the sakila test database. https://dev.mysql.com/doc/sakila/en/ Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 25 / 50
  36. PostgreSQL setup Add an user on PostgreSQL capable to create

    schemas and relations in the destination database CREATE USER usr_replica WITH PASSWORD ’replica ’; CREATE DATABASE db_replica WITH OWNER usr_replica; Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 26 / 50
  37. Install pg chameleon Install pg chameleon and create the configuration

    files pip install pip --upgrade pip install pg_chameleon==2.0.0b1 chameleon.py set_configuration_files cd ~/.pg_chameleon/configuration cp config-example.yml default.yml Edit the file default.yml setting the correct values for connection and source. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 27 / 50
  38. Configure the target database 1 2 # global settings 3

    p i d d i r : ’~/. pg_chameleon / pid / ’ 4 l o g d i r : ’~/. pg_chameleon / logs / ’ 5 l o g d e s t : f i l e 6 l o g l e v e l : debug 7 l o g d a y s k e e p : 2 8 r o l l b a r k e y : ’ rollbar_super_complex_key ’ 9 r o l l b a r e n v : ’london - pgug ’ 0 1 2 # postgres destination connection 3 pg conn: 4 host: " pgserver " 5 p o r t : " 5432 " 6 u s e r : " usr_replica " 7 password: " replica " 8 database: " db_replica " 9 c h a r s e t : " utf8 " Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 28 / 50
  39. Configure the source database and type override 1 2 s

    o u r c e s : 3 mysql: 4 db conn: 5 host: " myserver " 6 p or t : " 3306 " 7 u s e r : " usr_replica " 8 password: r e p l i c a 9 c h a r s e t : ’utf8 ’ 0 connect time out: 20 1 schema mappings: 2 s a k i l a : s c h s a k i l a 3 l i m i t t a b l e s : 4 s k i p t a b l e s : 5 g r a n t s e l e c t t o : 6 l o c k t i m e o u t : " 120 s" 7 m y s e r v e r i d : 100 8 r e p l i c a b a t c h s i z e : 100000 9 rep lay ma x r o w s: 20000 0 b a t c h r e t e n t i o n : ’1 day ’ 1 copy max memory: " 300 M" 2 copy mode: ’file ’ 3 o u t d i r : /tmp 4 s l e e p l o o p : 5 5 o n e r r o r r e p l a y : ’ continue ’ 6 type: mysql 7 8 t y p e o v e r r i d e : 9 " tinyint (1) ": 0 o v e r r i d e t o : b o o l e a n 1 o v e r r i d e t a b l e s : 2 - "*" Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 29 / 50
  40. Add the source and initialise the replica Add the source

    mysql and initialise the replica for it. We are using debug in order to get the logging on the console. chameleon.py create_replica_schema --debug chameleon.py add_source --config default --source mysql --debug chameleon.py init_replica --config default --source mysql --debug Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 30 / 50
  41. Start the replica Start the replica process chameleon.py start_replica --config

    default --source mysql Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 31 / 50
  42. Start the replica Start the replica process chameleon.py start_replica --config

    default --source mysql Show the replica status chameleon.py show_status --config default --source mysql Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 31 / 50
  43. init replica tune The replica initialisation required several improvements. The

    OOM killer is always happy to kill processes using large amount of memory Using a general slice size doesn’t work well because with large rows the process crashes Estimating the total rows for user’s feedback is faster but the output can be odd. Using not buffered cursors improves the speed and the memory usage. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 35 / 50
  44. Strictness is an illusion. MySQL doubly so MySQL’s lack of

    strictness is not a mystery. The funny way the default with NOT NULL is managed by MySQL can break the replica. Therefore any field with NOT NULL added after the initialisation are created always as NULLable in PostgreSQL. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 36 / 50
  45. I feel your lack of constraint disturbing Rubbish data in

    MySQL can be stored without errors raised by the DBMS. When this happens the replicator traps the error when the change is replayed on PostgreSQL and discards the problematic row. The value is stored hexified in the table t discarded rows for later analysis. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 37 / 50
  46. The DDL. A real pain in the back I initially

    tried to use sqlparse for tokenising the DDL emitted by MySQL. Unfortunately didn’t worked as I expected. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 38 / 50
  47. The DDL. A real pain in the back I initially

    tried to use sqlparse for tokenising the DDL emitted by MySQL. Unfortunately didn’t worked as I expected. So I decided to use the regular expressions. Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. -- Jamie Zawinski Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 38 / 50
  48. The DDL. A real pain in the back I initially

    tried to use sqlparse for tokenising the DDL emitted by MySQL. Unfortunately didn’t worked as I expected. So I decided to use the regular expressions. Some people, when confronted with a problem, think "I know, I’ll use regular expressions." Now they have two problems. -- Jamie Zawinski MySQL even in ROW format emits the DDL as statements The class sql token uses the regular expressions to tokenise the DDL The tokenised data is used to build the DDL in the PostgreSQL dialect Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 38 / 50
  49. Time for a demo Demo! Obviously the demo will fail

    miserably and you will hate this project forever. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 39 / 50
  50. To boldly go where no chameleon has gone before pg

    chameleon 2.1 will appear around april or may. Some of the ideas I’d like to add Parallel copy and index creation in order to speed up the init replica Re sync automatically the tables when they error on replay Logical decoding plugin for PostgreSQL Web management interface in flask Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 41 / 50
  51. Igor, the green little guy The chameleon logo has been

    developed by Elena Toma, a talented Italian Lady. https://www.facebook.com/Tonkipapperoart/ The name Igor is inspired by Martin Feldman’s Igor portraited in Young Frankenstein movie. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 42 / 50
  52. Some numbers Lines of python code for the version 2.0.0.beta1

    global lib.py 530 mysql lib.py 679 pg lib.py 1437 sql util.py 295 chameleon.py 49 Total lines of code 2990 Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 43 / 50
  53. pg chameleon’s license 2 clause BSD License Copyright (c) 2016,2017,2018

    Federico Campoli All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 44 / 50
  54. Feedback please! Please report any issue on github and follow

    pg chameleon on twitter for the announcements. https://github.com/the4thdoctor/pg chameleon @pg chameleon Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 45 / 50
  55. That’s all folks! Thank you for listening! Any questions? Federico

    Campoli (Transferwise) pg chameleon 19 Dec 2017 47 / 50
  56. Image credits MySQL Image source, WikiCommons Hard Disk image, source

    WikiCommons Tron image, source Tron Wikia Twitter icon, source Open Icon Library GitHub Octocat, source GitHub Logos The PostgreSQL logo, copyright the PostgreSQL global development group Boromir get rid of mysql, source imgflip Morpheus, source imgflip Keep calm chameleon, source imgflip The dolphin picture - Copyright artnoose It could work. Young Frankenstein - source quickmeme Perseus, Framed - Copyright Federico Campoli Pinkie Pie that’s all folks, Copyright by dan232323, used with permission Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 48 / 50
  57. License This document is distributed under the terms of the

    Creative Commons Attribution, Not Commercial, Share Alike Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 49 / 50
  58. pg chameleon MySQL to PostgreSQL replica made easy Federico Campoli

    Transferwise 19 Dec 2017 https://github.com/the4thdoctor @4thdoctor scarf Federico Campoli (Transferwise) pg chameleon 19 Dec 2017 50 / 50