Federico Campoli (Transferwise) - The ninja elephant - 4th July 2017

- IT since 1982, mostly because of the TRON movie
- Joined the Oracle DBA secret society in 2004
- In love with PostgreSQL since 2006
- Currently runs the Brighton PostgreSQL User group
- Works at Transferwise as Data Engineer
1 ... are late!
2 The eye of the storm
3 How we did it
4 Maximum effort
5 Lessons learned
6 Wrap up
When I started the data engineer job in July 2016 I was involved in a task that was not customer facing. However, the task was very critical to the business: I had to fix the performance issues on the MySQL analytics database, which performed badly despite the considerable resources assigned to the VM.
If you drop a frog in a pot of boiling water, it will of course frantically try to clamber out. But if you place it gently in a pot of tepid water and turn the heat on low, it will be slowly boiled to death.

- The performance issues worsened day by day without a single apparent reason
- The data size on the MySQL master increased over time
- Sensitive data were obfuscated with custom views
- MySQL's optimiser materialised the views on the fly for each query
- The analytics tools struggled even under normal load
- In busy periods the database became almost unusable
- Analysts were busy tuning existing queries rather than writing new ones
- The replica had several conflicts with the analytics queries
- A new solution was needed
MySQL was no longer a good fit. However, the new solution had to meet some specific requirements.

- Data updated in almost real time from the live database
- Personally identifiable information (PII) obfuscated for the analysts
- PII available in clear for the power users
- The system should be able to scale out for several years
- Modern SQL for better analytics queries
We evaluated four solutions. Each solution partially covered the requirements.

- Google BigQuery
- Amazon RedShift
- Snowflake
- PostgreSQL

Google BigQuery and Amazon RedShift did not satisfy the analytics requirements and were removed from the list. Both PostgreSQL and Snowflake offered very good performance and modern SQL. Neither of them offered a replication system from the MySQL system.
Snowflake is a cloud based data warehouse service. It's based on Amazon S3 and comes with different sizing options. Its pricing system is very appealing, and the preliminary tests showed Snowflake outperforming PostgreSQL [1]. Using FiveTran, an impressive multi technology data pipeline, the data would flow in real time from our production server to Snowflake.

Unfortunately there was just one little catch: there was no support for obfuscation.

[1] PostgreSQL single machine vs cloud based parallel processing
Transferwise takes the customers' data security very seriously. Our policy for PII data is that any personal information moving outside our perimeter shall be obfuscated.
It was already possible to have a MySQL to PostgreSQL replica: in my spare time I had built a proof of concept derived from pg_chameleon, a replica tool I was playing with. The initial tests on a reduced dataset were successful. Adding the obfuscation in real time required minimal changes.
The initial plan was to use PostgreSQL to perform the obfuscation and then transfer the data into Snowflake using FiveTran. However, PostgreSQL proved to have good performance with a good margin for scaling up. The final decision was to keep the analytics data behind our perimeter.
The tool is based on pg_chameleon. The project was forked into a Transferwise repository, then I added the obfuscation capabilities. As this project adds some custom features to the pure replica provided by pg_chameleon, we decided to rename it pg_ninja.
The init replica process locks the tables in read only mode. To avoid locking the main database for several hours we used a secondary MySQL replica with the local query logging enabled. On this replica we also enabled the ROW binlog format required by pg_ninja. On the master we still have the MIXED strategy.
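A quick sanity check of the binlog format on the replica source can be sketched with PyMySQL; the host and credentials below are hypothetical placeholders, not the actual setup.

import pymysql

# connect to the secondary MySQL replica (hypothetical credentials)
conn = pymysql.connect(host="mysql-replica", user="check", password="secret")
with conn.cursor() as cursor:
    # pg_ninja decodes row images, so the binlog must be in ROW format
    cursor.execute("SHOW VARIABLES LIKE 'binlog_format'")
    variable, value = cursor.fetchone()
    print(variable, value)  # expect: binlog_format ROW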
The data is exported from MySQL in CSV format for a fast load into PostgreSQL with the COPY command. Unfortunately MySQL doesn't have the COPY command.

SELECT
    CASE
        WHEN data_type="enum"
        THEN SUBSTRING(COLUMN_TYPE,5)
    END AS enum_list,
    CASE
        WHEN data_type IN ('"""+"','".join(self.hexify)+"""')
        THEN concat('hex(',column_name,')')
        WHEN data_type IN ('bit')
        THEN concat('cast(`',column_name,'` AS unsigned)')
        ELSE concat('`',column_name,'`')
    END AS column_csv
FROM information_schema.COLUMNS
WHERE table_schema=%s
  AND table_name=%s
ORDER BY ordinal_position;
The fields are then combined in a MySQL specific syntax which generates the CSV output.

for column in table_columns:
    # quote each value and turn NULLs into a literal marker
    column_list.append("COALESCE(REPLACE(" + column["column_csv"] + ", '\"', '\"\"'),'NULL')")
columns_csv = "REPLACE(CONCAT('\"',CONCAT_WS('\",\"'," + ','.join(column_list) + "),'\"'),'\"NULL\"','NULL')"
csv_data = ""
sql_out = "SELECT " + columns_csv + " as data FROM " + table_name + ";"
self.mysql_con.connect_db_ubf()
try:
    self.logger.debug("Executing query for table %s" % (table_name, ))
    self.mysql_con.my_cursor_ubf.execute(sql_out)
except:
    self.logger.debug("error when pulling the data from the table %s - sql executed: %s" % (table_name, sql_out))
The data is pulled in slices in order to avoid memory overload. The file is then pushed into PostgreSQL using the COPY command. However...

- COPY is fast but runs in a single transaction
- One failure and the entire batch is rolled back
- If this happens the procedure loads the same data using INSERT statements
- Which can be very slow
- But at least discards only the problematic rows

A minimal sketch of this fallback strategy follows.
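The sketch below illustrates the COPY-then-INSERT fallback, assuming psycopg2 and a hypothetical load_batch helper; it is not the actual pg_ninja code.

import csv
import io
import psycopg2

def load_batch(pg_conn, table, rows):
    """Fast path: one COPY per batch. Slow path: row by row INSERT."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    buffer.seek(0)
    cursor = pg_conn.cursor()
    try:
        # COPY is a single transaction: one bad row rolls back the whole batch
        cursor.copy_expert("COPY %s FROM STDIN WITH CSV" % table, buffer)
        pg_conn.commit()
    except psycopg2.Error:
        pg_conn.rollback()
        placeholders = ",".join(["%s"] * len(rows[0]))
        for row in rows:
            try:
                cursor.execute("INSERT INTO %s VALUES (%s)" % (table, placeholders), row)
                pg_conn.commit()
            except psycopg2.Error:
                # discard only the problematic row
                pg_conn.rollback()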
The obfuscation is driven by a YAML file listing, for each table, column and obfuscation strategy:

user_details:
    last_name:
        mode: normal
        nonhash_start: 0
        nonhash_length: 0
    phone_number:
        mode: normal
        nonhash_start: 1
        nonhash_length: 2
    date_of_birth:
        mode: date
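As a rough illustration of how such an entry could be interpreted (this is my reading of the config, not pg_ninja's actual code): mode "normal" hashes the value, while nonhash_start and nonhash_length preserve a portion of it in clear.

import hashlib

def obfuscate_normal(value, nonhash_start, nonhash_length):
    """Keep a slice of the value in clear, hash the rest (illustrative only)."""
    digest = hashlib.sha256(value.encode()).hexdigest()
    if nonhash_start == 0 or nonhash_length == 0:
        return digest  # hash the entire value
    start = nonhash_start - 1  # assuming a 1-based position in the config
    clear = value[start:start + nonhash_length]
    return clear + digest

# phone_number with nonhash_start: 1, nonhash_length: 2 keeps the first
# two characters (e.g. the country prefix) readable for the analysts
print(obfuscate_normal("+44 7700 900123", 1, 2))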
The obfuscation uses the pgcrypto extension for hashing in SHA-256.

- When the replica is initialised the data is copied into the schema in clear
- The table locks are released
- The tables with personally identifiable information (PII) are copied and obfuscated in a separate schema
- The process builds the indices on both the clear and the obfuscated schemas
- The tables without PII data are exposed to the normal users using simple views
- All the varchar fields in the obfuscated schema are converted into text fields

A database-side sketch of the hashing follows.
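On the database side the hashing can be done with pgcrypto's digest(); a hedged sketch with hypothetical schema and table names:

import psycopg2

conn = psycopg2.connect("dbname=analytics")
cursor = conn.cursor()
# digest() is provided by the pgcrypto extension
cursor.execute("CREATE EXTENSION IF NOT EXISTS pgcrypto")
# copy the table into the obfuscated schema, hashing the PII column
cursor.execute("""
    CREATE TABLE obf.user_details AS
    SELECT id,
           encode(digest(last_name, 'sha256'), 'hex') AS last_name
    FROM clear.user_details;
""")
conn.commit()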
The obfuscation is applied in real time while the data is replicated. The approach is very simple.

- When a row image is captured the process checks if the table contains PII data
- In that case the process generates a second jsonb element with the PII data obfuscated

A sketch of this double row image follows.
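A minimal sketch of the double row image idea, with hypothetical names and hashlib standing in for the pgcrypto hashing:

import hashlib
import json

def row_images(table, row, pii_columns):
    """Emit the clear row image plus, for PII tables, an obfuscated copy."""
    images = [json.dumps(row)]
    if table in pii_columns:
        obfuscated = dict(row)
        for column in pii_columns[table]:
            if obfuscated.get(column) is not None:
                obfuscated[column] = hashlib.sha256(obfuscated[column].encode()).hexdigest()
        images.append(json.dumps(obfuscated))
    return images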
(continued from the previous slide)

- No invalid views
- No performance issues with views
- Complex analytics functions
- Partitioning (thanks pg_pathman!)
- BRIN indices

"Some code was optimised inside, but actually very little - maybe 10-20% was improved. We'll do more of that in the future, but not yet. The good thing is that the performance gains we have can mostly be attributed just to PG vs MySQL. So there's a lot of scope to improve further."
Jeff McClelland - Growth Analyst, data guru
MySQL's lack of strictness is not a mystery. The replica broke down several times because of the funny way NOT NULL is managed by MySQL. To prevent any further replica breakdowns, the fields added with NOT NULL via ALTER TABLE are always created as NULLable in PostgreSQL.

MySQL automatically truncates character strings at the varchar size. This is a problem when the field is obfuscated on PostgreSQL, because the hashed string may not fit into the corresponding varchar field. Therefore all the character varying fields on the obfuscated schema are converted to text.
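The size mismatch is easy to see: a SHA-256 hex digest is always 64 characters, longer than many varchar columns.

import hashlib

# any value, however short, hashes to a 64 character hex string,
# which would overflow e.g. a varchar(30) column on the obfuscated schema
digest = hashlib.sha256(b"Smith").hexdigest()
print(len(digest))  # 64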
Invalid data in MySQL can be stored without errors raised by the DBMS.

Evidence video #1 - Why you shouldn't use MySQL

When this happens the replicator traps the error and discards the problematic row. The failed row image is hexified and saved in the table t_discarded_rows for further analysis.
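A hedged sketch of the trap-and-park step (the table name t_discarded_rows comes from the talk; the schema and column names here are hypothetical):

import binascii
import json

def park_discarded_row(cursor, table_name, row):
    """Hexify the failed row image and store it for later analysis."""
    hex_image = binascii.hexlify(json.dumps(row).encode()).decode()
    cursor.execute(
        "INSERT INTO sch_ninja.t_discarded_rows (t_table_name, t_row_data) VALUES (%s, %s)",
        (table_name, hex_image),
    )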
The DDL replica is possible with the usage of regular expressions. Yes, I know. I have 2 problems now!

- MySQL, even in ROW format, emits the DDL as statements
- The class sql_token uses regular expressions to tokenise the DDL
- The tokenised data is used to build the DDL in the PostgreSQL dialect

A toy example of this translation follows.
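A toy sketch of the regex approach, inspired by (not copied from) sql_token; the pattern and type map are simplified assumptions:

import re

# hypothetical and heavily simplified: real DDL needs a proper tokeniser
ADD_COLUMN = re.compile(
    r"ALTER\s+TABLE\s+`?(?P<table>\w+)`?\s+ADD\s+(?:COLUMN\s+)?"
    r"`?(?P<column>\w+)`?\s+(?P<type>\w+)(?:\((?P<size>\d+)\))?",
    re.IGNORECASE,
)
TYPE_MAP = {"int": "integer", "datetime": "timestamp", "varchar": "character varying"}

def mysql_ddl_to_pg(statement):
    match = ADD_COLUMN.match(statement.strip())
    if not match:
        return None
    pg_type = TYPE_MAP.get(match.group("type").lower(), match.group("type"))
    if match.group("size") and pg_type == "character varying":
        pg_type += "(%s)" % match.group("size")
    return 'ALTER TABLE "%s" ADD COLUMN "%s" %s' % (
        match.group("table"), match.group("column"), pg_type)

print(mysql_ddl_to_pg("ALTER TABLE `users` ADD COLUMN `age` int(11)"))
# -> ALTER TABLE "users" ADD COLUMN "age" integer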
pg_ninja is now open source. You can download the tool on GitHub.

github: https://github.com/transferwise/pg_ninja
pypi coming soon: https://pypi.python.org/pypi/pg_ninja
License: Apache 2.0

Limitations
- Python 2.7
- No daemon
- Replica requires primary keys
- Still optimising the newest additions (add table, sync obfuscation etc.)

PR absolutely welcome!
- The eye, phantom playground, light end tunnel - Copyright Federico Campoli
- The dolphin picture - Copyright artnoose
- It could work. Young Frankenstein - source quickmeme
- Tron image - source Tron Wikia
- Deadpool Maximum Effort - source Deadpool Zoeiro
- Pinkie Pie that's all folks - Copyright dan232323, used with permission