Slide 1

Slide 1 text

MySQL and ProxySQL at Zuora

Slide 2

Slide 2 text

The following presentation is intended for information purposes only. The development, release, and timing of any features or functionality described for Zuora’s products remains at the sole discretion of Zuora and is subject to change. Any unreleased services or features referenced in this presentation are not currently available and may not be delivered on time or at all. Customers should make purchase decisions based upon product features that are currently available. In addition, statements in this presentation that refer to forecasts, future plans and expectations are forward-looking statements that involve risks, uncertainties, and assumptions, many of which involve factors or circumstances that are beyond our control. If any such uncertainties materialize, if any of the assumptions prove incorrect, or if priorities or resourcing changes, actual results could differ materially from such forward-looking statements. You are cautioned not to place undue reliance on these forward-looking statements as they reflect Zuora’s expectations as of the date of this presentation. In addition, actual results could differ materially from those expressed or implied in the forward-looking statements due to a number of factors, including but not limited to, risks detailed in our Annual Report on Form 10-K filed with the Securities and Exchange Commission (SEC) on April 18, 2019, our most recent Quarterly Report on Form 10-Q filed with the SEC on September 13, 2019, as well as other documents that may be filed by us from time to time with the SEC. Copies of the Annual Report on Form 10-K , the Quarterly Report on Form 10-Q, and other filings with the SEC may be obtained by visiting our Investor Relations website at investor.zuora.com or the SEC's website at www.sec.gov. Zuora assumes no obligation and does not intend to update these forward-looking statements after the date of this presentation to conform these statements to actual results or to changes in our expectations. 
SAFE HARBOR DISCLAIMER

Slide 3

Slide 3 text

● Checking out the nice title slide
● Go through agenda <- we are here!
● About us
● Introduction - abstracting the topology
● Fast failover with ProxySQL
  ○ How it leads to cost savings
● Multiplexing and connection pools
● Creating scrubbed data sets

Agenda

Slide 4

Slide 4 text

About us

Slide 5

Slide 5 text

It all began with a vision in a tiny office in Mountain View, California. In 2007, after several years at WebEx & salesforce.com, K.V. Rao, Cheng Zou and Tien Tzuo began to evangelize a fundamentally new business model. One where companies of all sizes would offer broad libraries of services via subscriptions. And one that would challenge the 20th century manufacturing economy by shifting the focus to delivering services instead of products. They called this the Subscription Economy. Since then, Zuora has become the leading Subscription Economy® evangelist, providing the only SaaS platform that automates all subscription order-to-revenue operations in real-time for any business. About Zuora

Slide 6

Slide 6 text

● Zuora crossed the $10B Quarterly Transaction Volume threshold in Q4 2018
● Over 1,200 employees throughout North America, Europe, China, India, Japan, and Australia
● Zuora named a best workplace by Glassdoor, SF Business Times and Great Places to Work

About Zuora

Slide 7

Slide 7 text

● Founder and author of ProxySQL, since 2013
● 14 years of experience with MySQL
● Passion for:
  ○ Performance
  ○ High Availability
  ○ Solving problems

About Rene

Slide 8

Slide 8 text

● Principal Database Architect @ Zuora
● More than 10 years of MySQL experience
● Interests
  ○ Low level details, performance tuning
  ○ High availability
  ○ Tooling

About Peter

Slide 9

Slide 9 text

Introduction - Abstracting the topology

Slide 10

Slide 10 text

We use replication, with nice topologies

Slide 11

Slide 11 text

We use replication

Slide 12

Slide 12 text

● ProxySQL provides endpoints rather than exact machines somewhere
● Load balancing an endpoint across multiple database instances is possible
● Freedom to change the structure, while leaving the interfaces untouched

Abstracting the topology

Slide 13

Slide 13 text

Fast failover with ProxySQL

Slide 14

Slide 14 text

ProxySQL has built-in support for various replication topologies:
● Traditional async replication has been supported since ProxySQL 1.x
  ○ Support for custom replication topologies was available via external scheduler scripts
● ProxySQL 1.3.x introduced some support for MySQL Group Replication
● Native support for MySQL Group Replication was added in ProxySQL 1.4.x
● Native support for Galera Replication was added in ProxySQL 2.x
● Native support for Amazon Aurora and RDS was added in ProxySQL 2.x

Replication Topology Support

Slide 15

Slide 15 text

Before looking at High Availability we first need to look at how ProxySQL monitors the availability of MySQL backends.
● ProxySQL is designed to perform various checks against backend servers
● The checks target various layers of communication
● Checks are built with dependencies: if a lower-level check fails there is no need for a higher-level check

Monitoring Availability

Slide 16

Slide 16 text

The Monitor Module checks backends and ensures traffic is only sent to operational servers. Let’s look at the monitoring checks for traditional async / semi-sync replication:
● Connect – tests new connections to backend servers (MySQL connect)
● Ping – pings backend servers (MySQL ping)
● Replication Lag – checks Seconds_Behind_Master status
● Read only – checks read_only status

The Monitor Module
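The results of these checks are recorded in log tables in ProxySQL's monitor schema. As a sketch, recent check results can be inspected from the admin interface like this (table and column names per the ProxySQL monitor schema):

```sql
-- Most recent ping check results per backend
SELECT hostname, port, ping_success_time_us, ping_error
FROM monitor.mysql_server_ping_log
ORDER BY time_start_us DESC
LIMIT 10;

-- Most recent read_only check results (the input for failover detection)
SELECT hostname, port, read_only, error
FROM monitor.mysql_server_read_only_log
ORDER BY time_start_us DESC
LIMIT 10;
```

A non-NULL ping_error or error column points at the layer where a backend is failing its checks.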

Slide 17

Slide 17 text

The monitor module determines the status of a backend server, available in the “runtime_mysql_servers” table (this can be set manually in “mysql_servers”):
● ONLINE - backend server is fully operational
● SHUNNED - backend server is temporarily taken out of use, either because of too many connection errors in too short a time, or because replication lag exceeded the allowed threshold (automatically set by monitor)

Backend Server States
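The effective state of each backend can be checked at any time from the admin interface; a minimal sketch:

```sql
-- Runtime view: the states ProxySQL is actually acting on
SELECT hostgroup_id, hostname, port, status
FROM runtime_mysql_servers;
```

Comparing this against the configured mysql_servers table shows which states were set manually and which were applied by the monitor.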

Slide 18

Slide 18 text

OFFLINE_SOFT - when a server is put into OFFLINE_SOFT mode, new incoming connections are no longer accepted, while existing connections are kept until they become inactive. In other words, connections are kept in use until the current transaction is completed. This allows a backend to be detached gracefully (manually set). Backend Server States

Slide 19

Slide 19 text

OFFLINE_HARD - when a server is put into OFFLINE_HARD mode, the existing connections are dropped and no new incoming connections are accepted. This is equivalent to deleting the server from a hostgroup (indefinitely or for maintenance) and is done by ProxySQL on failover (manually / automatically set by monitor) Backend Server States

Slide 20

Slide 20 text

The most common topology for MySQL deployments is traditional primary / replica async (or semi-sync): Replication and Failover
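For this topology, ProxySQL's read_only-based failover detection is configured via the mysql_replication_hostgroups table. A minimal sketch, assuming hostgroup 0 for the writer and 1 for readers (the check_type column is available in ProxySQL 2.x):

```sql
-- Pair a writer and a reader hostgroup; the monitor's read_only check
-- moves servers between them automatically
INSERT INTO mysql_replication_hostgroups
  (writer_hostgroup, reader_hostgroup, check_type, comment)
VALUES (0, 1, 'read_only', 'primary/replica cluster');

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
```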

Slide 21

Slide 21 text

Since there is no native support for failover in MySQL, external mechanisms are used to promote / demote servers: Replication Managers

Slide 22

Slide 22 text

During a failover replication managers always perform “roughly” the same actions, specifically: The Failover Process

Slide 23

Slide 23 text

During a failover replication managers always perform “roughly” the same actions, specifically: Unplanned Failover

Slide 24

Slide 24 text

During a failover replication managers always perform “roughly” the same actions, specifically: The Failover Process

Slide 25

Slide 25 text

During a failover replication managers always perform “roughly” the same actions, specifically: The Failover Process

Slide 26

Slide 26 text

During a failover replication managers always perform “roughly” the same actions, specifically: The Failover Process

Slide 27

Slide 27 text

ProxySQL transparently detects and adapts to these topology changes via the monitor module’s checks ProxySQL Failover Detection

Slide 28

Slide 28 text

When the status of a Primary server changes to read-only ProxySQL will remove the server from the writer hostgroup (OFFLINE_HARD) ProxySQL Failover Detection

Slide 29

Slide 29 text

Momentarily there are no RW nodes available; ProxySQL holds on to the connections for up to mysql-connect_timeout_server_max ms ProxySQL Failover Detection
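This queueing window is a global variable, tunable from the admin interface. A sketch (the 10000 ms value is an example, not a recommendation):

```sql
-- How long ProxySQL will hold connections waiting for a writer (ms)
UPDATE global_variables
SET variable_value = '10000'
WHERE variable_name = 'mysql-connect_timeout_server_max';

LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```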

Slide 30

Slide 30 text

Once a RW node is detected it is added to the writer_hostgroup (ONLINE) and connections are routed to the new Primary server ProxySQL Failover Detection

Slide 31

Slide 31 text

● ProxySQL will automatically re-route “writer” connections to the new Primary after a failover
● The failover time is configurable based on the read only monitoring check
● If the PRIMARY MySQL instance is lost while multiplexing is disabled or a transaction is active, the statement will fail and the full transaction must be retried

Graceful Failover

Slide 32

Slide 32 text

Although the standard failover is acceptable in an “emergency situation”, failovers are also required for common maintenance activities including:
● Scaling up hardware and resources
● Operating system updates
● Upgrading MySQL to a newer release
● Executing certain blocking ALTERs, e.g. converting a primary key from INT to BIGINT

Graceful Failover

Slide 33

Slide 33 text

ProxySQL facilitates “planned failovers” as well as “unplanned failovers”
● In the “mysql_servers” table each server can have a status of ONLINE, SHUNNED, OFFLINE_SOFT or OFFLINE_HARD configured
● OFFLINE_SOFT is a special status that allows all currently executing connections (statements and transactions) to complete while preventing any new connections from being made; new connections are either sent to another available server in the same HG or held until a server is available

Graceful Failover
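Draining a backend this way is a single admin-interface statement; a sketch (the hostname and hostgroup are placeholders):

```sql
-- Gracefully drain the current Primary in hostgroup 0
UPDATE mysql_servers
SET status = 'OFFLINE_SOFT'
WHERE hostgroup_id = 0 AND hostname = '10.0.0.1';

LOAD MYSQL SERVERS TO RUNTIME;
```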

Slide 34

Slide 34 text

Let’s look at a graceful failover. The initial state of the healthy MySQL cluster The Graceful Failover Process

Slide 35

Slide 35 text

Firstly we remove the replication hostgroup to prevent automatic failover as a safety precaution The Graceful Failover Process

Slide 36

Slide 36 text

Next we set the status of the Primary to OFFLINE_SOFT to allow current statements and transactions to complete their execution The Graceful Failover Process

Slide 37

Slide 37 text

Once completed we’ll see the connection count on hostgroup=0 drop to zero The Graceful Failover Process

Slide 38

Slide 38 text

Now it is time to perform the failover either manually or using a tool such as Orchestrator to promote a Replica to Primary The Graceful Failover Process

Slide 39

Slide 39 text

Traffic will continue to be served from the existing nodes and maintenance activities can be performed The Graceful Failover Process

Slide 40

Slide 40 text

At this point the replication hostgroup can be reconfigured to allow for any unplanned failover that may occur The Graceful Failover Process

Slide 41

Slide 41 text

Once maintenance is completed we can add the previous Primary to the Replica hostgroup=1 by dynamically updating the configuration The Graceful Failover Process

Slide 42

Slide 42 text

The steps for graceful failover can be executed manually or scripted
● Manual execution allows for more control, while scripts are more precise and likely to execute in a smaller time-frame
● NOTE: The time between draining connections and promoting a replica must be shorter than “mysql-connect_timeout_server_max”, else the new waiting connections will return an error – you can tweak this before and after failover

Implementing Graceful Failover
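The slides above can be sketched as an admin-interface sequence. Hostgroups 0/1 and the hostname are placeholders, and the replica promotion itself happens outside ProxySQL (e.g. via Orchestrator):

```sql
-- 1. Remove the replication hostgroup to prevent automatic failover
DELETE FROM mysql_replication_hostgroups WHERE writer_hostgroup = 0;
LOAD MYSQL SERVERS TO RUNTIME;

-- 2. Drain the current Primary
UPDATE mysql_servers SET status = 'OFFLINE_SOFT'
WHERE hostgroup_id = 0 AND hostname = '10.0.0.1';
LOAD MYSQL SERVERS TO RUNTIME;

-- 3. Poll until connections on the writer hostgroup drop to zero
SELECT srv_host, ConnUsed
FROM stats.stats_mysql_connection_pool
WHERE hostgroup = 0;

-- 4. Promote a replica externally, then restore the replication
--    hostgroup so read_only monitoring takes over again
INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup)
VALUES (0, 1);
LOAD MYSQL SERVERS TO RUNTIME;
```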

Slide 43

Slide 43 text

● Ensure that the number of available nodes supports the maximum throughput required by applications
● Use the same sized servers for the Primary and the Replicas which are candidates for promotion (at least one pair should be equal)
● Prepare scripts to execute “planned failover” to ensure a smooth and repeatable process
● When integrating with Orchestrator, consider implementing ProxySQL planned failover logic in the available hooks

General Failover Considerations

Slide 44

Slide 44 text

Zuora’s implementation

Slide 45

Slide 45 text

● Consul for service discovery
● Head nodes are running consul servers and orchestrator-raft
● If a proxy can’t reach raft quorum, it wipes itself from the scheduler
● The application connects to all proxies in failover mode with a multi-host connection at the JDBC level

Zuora’s implementation

Slide 46

Slide 46 text

jdbc:mysql://db1:3306,db2:3306/shard11
    ?characterEncoding=UTF-8
    &secondsBeforeRetryMaster=0
    &queriesBeforeRetryMaster=0
    &retriesAllDown=2

Zuora’s implementation

Slide 47

Slide 47 text

Zuora’s implementation

Slide 48

Slide 48 text

● Using the OFFLINE_SOFT method with a wrapper for graceful failover
● Using orchestrator and reconfiguring ProxySQL from an orchestrator hook for non-graceful failover
● We also update metadata in consul
● One shard can have any number of proxies

Zuora’s implementation

Slide 49

Slide 49 text

● Because the failover is non-impactful, we can do it all the time and automatically
  ○ Takes less than 10 seconds total
  ○ Queueing time is typically 3-4 sec
● Faster administrative tasks
  ○ Version upgrades
  ○ OS upgrades
● Faster iteration on infrastructure level changes

What it allowed us to do

Slide 50

Slide 50 text

● Choose your timeouts carefully; some long-running transactions might be killed at the time of failover
● Too long in OFFLINE_SOFT can be just as bad as too short, e.g. a long transaction holding the OFFLINE_SOFT state

Advice for OFFLINE_SOFT

Slide 51

Slide 51 text

● AWS released the new r5 instances to the public
● We were on r3; we benchmarked r5 and found it faster and cheaper
● Migrated the whole infrastructure from r3 to r5 fully online, with minimal impact to the application

Benefits example - instance type

Slide 52

Slide 52 text

● Benchmarked available storage
● Using larger gp2 volumes instead of smaller provisioned-IOPS volumes had performance and cost benefits
● We could switch the volumes online
● The instance type and volume changes together meant a 20% cost reduction

Benefits example - storage

Slide 53

Slide 53 text

Multiplexing and connection pools

Slide 54

Slide 54 text

Multiplexing is a method whereby established backend connections are re-used for multiple frontend connections:
● The implementation is similar to a thread pool
● Backend connections are allocated to frontend connections while active
● ProxySQL maintains idle frontend connections asynchronously to offload the work from the database
● There are certain conditions whereby multiplexing is disabled to safeguard the state of a connection

What is Multiplexing in ProxySQL?

Slide 55

Slide 55 text

Multiplexing: reduce number of connections

Slide 56

Slide 56 text

When a transaction is active in a connection, multiplexing is disabled until the transaction commits or rolls back. A transaction becomes active when:
● “START TRANSACTION” is executed
● “BEGIN” is executed
● The first statement is executed in a session where “AUTOCOMMIT = OFF”

Conditions that disable multiplexing

Slide 57

Slide 57 text

● If LOCK TABLE, LOCK TABLES or FLUSH TABLES WITH READ LOCK is executed, multiplexing is disabled until UNLOCK TABLES is executed
● If GET_LOCK() is executed, multiplexing is disabled and never enabled again
● If a query contains SQL_CALC_FOUND_ROWS, multiplexing is disabled
● If CREATE TEMPORARY TABLE is executed, multiplexing is disabled
● If PREPARE is executed (creation of prepared statements using the TEXT protocol, not the BINARY protocol), multiplexing is not enabled again on the connection
● If SQL_LOG_BIN is set to 0, multiplexing is disabled until SQL_LOG_BIN is set back to 1

Conditions that disable multiplexing

Slide 58

Slide 58 text

All queries that have @ in their syntax will disable multiplexing, and it will never be enabled again. The same applies to the following:
● SET SQL_SAFE_UPDATES / SQL_SELECT_LIMIT / MAX_JOIN_SIZE (not from 2.0.7)
● SET FOREIGN_KEY_CHECKS
● SET UNIQUE_CHECKS
● SET AUTO_INCREMENT_INCREMENT
● SET AUTO_INCREMENT_OFFSET
● SET GROUP_CONCAT_MAX_LEN
There are a few exceptions that will NOT disable multiplexing, e.g.:
● SELECT @@tx_isolation
● SELECT @@version

Conditions that disable multiplexing

Slide 59

Slide 59 text

Often, multiplexing will be disabled due to default session variables set by an ORM or by an application component designed for generic use; however, this is not necessarily required.
● In these cases a ProxySQL query rule can be implemented to explicitly allow multiplexing to remain enabled
● A typical scenario is related to the max_allowed_packet queries; a query rule to counteract this would look like:

INSERT INTO mysql_query_rules (active, match_digest, multiplex)
VALUES ('1', '^SELECT @@max_allowed_packet', 2);

Optimizing Multiplexing
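Note that rows inserted into mysql_query_rules only take effect once loaded to runtime (and only survive a restart if persisted):

```sql
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
```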

Slide 60

Slide 60 text

Typically the queries that disable multiplexing can be identified by querying the stats_mysql_query_digest table.
● We can query the table and look for the specific query syntax which disables multiplexing, to evaluate whether it is mandatory for multiplexing to be disabled
● Starting from ProxySQL 2.0.6, SET queries that can’t be parsed disable multiplexing and are logged to ProxySQL’s error log at runtime, making this process even more straightforward

Optimizing Multiplexing
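As a sketch, a quick way to surface the most frequent candidates (here matching the @ syntax mentioned earlier; the LIKE pattern is only one possible filter):

```sql
-- Most frequent digests containing '@', which typically disables multiplexing
SELECT digest_text, count_star
FROM stats.stats_mysql_query_digest
WHERE digest_text LIKE '%@%'
ORDER BY count_star DESC
LIMIT 10;
```

Digests that show up here in large numbers are the first candidates for a multiplex=2 query rule.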

Slide 61

Slide 61 text

ProxySQL maintains a connection pool per backend server
● Connections to a backend are never pre-allocated if there is no need, so at start-up there will be 0 connections to the backend
● When the application starts sending traffic to ProxySQL, it identifies which backend it needs to send traffic to. If there is a connection in the connection pool for that backend, that connection is used; otherwise a new connection is created
● When the connection completes serving the client's request, it is sent back to the pool if the connection is safe to share and the connection pool isn't full

Connection Pooling in ProxySQL

Slide 62

Slide 62 text

Connection pools are configured by default via the “mysql_servers” table in the admin interface; below we can see an example from a typical configuration:

+----+----------------+--------+----------+-----------------+
| HG | Host           | status | weight   | max_connections |
+----+----------------+--------+----------+-----------------+
| 0  | xxx.xx.xxx.121 | ONLINE | 10000000 | 2000            |
| 1  | xxx.xx.xxx.121 | ONLINE | 1        | 2000            |
| 1  | xxx.xx.xxx.122 | ONLINE | 10000000 | 2000            |
| 2  | xxx.xx.xxx.221 | ONLINE | 10000000 | 2000            |
+----+----------------+--------+----------+-----------------+

Connection Pooling in ProxySQL

Slide 63

Slide 63 text

We can also view the status of the connection pools in the stats_mysql_connection_pool table. Notice that the xxx.xx.xxx.121 node in the Reader HG1 is not used, due to the low weight configured in the mysql_servers table:

+----+----------------+--------+----------+----------+--------+---------+------------+------------+
| HG | Host           | status | ConnUsed | ConnFree | ConnOK | ConnERR | Queries    | Latency_us |
+----+----------------+--------+----------+----------+--------+---------+------------+------------+
| 0  | xxx.xx.xxx.121 | ONLINE | 1168     | 160      | 2991   | 44      | 1453995458 | 1442       |
| 1  | xxx.xx.xxx.121 | ONLINE | 0        | 0        | 0      | 0       | 0          | 1442       |
| 1  | xxx.xx.xxx.122 | ONLINE | 768      | 76       | 13     | 0       | 16083      | 1322       |
| 2  | xxx.xx.xxx.221 | ONLINE | 0        | 10       | 10     | 0       | 20572      | 382        |
+----+----------------+--------+----------+----------+--------+---------+------------+------------+

Connection Pooling in ProxySQL

Slide 64

Slide 64 text

Each connection pool will maintain a certain number of free connections for re-use; this can be controlled within ProxySQL at a global level using the mysql-free_connections_pct variable.
● This variable controls the percentage of open idle connections out of the total maximum number of connections for a specific server in a hostgroup
● For each hostgroup/backend pair, the Hostgroup Manager will keep in the connection pool up to mysql-free_connections_pct * mysql_servers.max_connections / 100 connections
● Connections are kept open with periodic pings
● A connection is idle if it hasn't been used since the last round of pings. The time interval between two such rounds of pings for idle connections is controlled by the variable mysql-ping_interval_server_msec

Connection Pooling in ProxySQL
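For example, with mysql-free_connections_pct = 10 and max_connections = 2000 (the values from the earlier mysql_servers example), up to 10 * 2000 / 100 = 200 idle connections would be kept per hostgroup/backend pair. The percentage can be adjusted from the admin interface:

```sql
-- Keep up to 10% of max_connections as idle, re-usable connections
UPDATE global_variables
SET variable_value = '10'
WHERE variable_name = 'mysql-free_connections_pct';

LOAD MYSQL VARIABLES TO RUNTIME;
```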

Slide 65

Slide 65 text

Case study: shard overloaded

Slide 66

Slide 66 text

● Around 6:30 PM, we received an alert that a database server had high concurrency (Threads_running around 700)
● Lots of short queries were overloading the server
● We had multiplexing enabled, so we lowered the number of backend connections to 100
● On the next slide, spot the time when this happened

Case study: shard overloaded

Slide 67

Slide 67 text

Multiplexing in real life: QPS increase during overload

Slide 68

Slide 68 text

● Reducing parallelism allowed the database server to use the same amount of CPU not for context switching, but to do more meaningful work
● QPS increased, but what about individual query types? Maybe one query type became faster, or a side load was added.

Case study: shard overloaded

Slide 69

Slide 69 text

QPS increase during overload per query type

Slide 70

Slide 70 text

● Throughput increased for every query type
● What about response time? Better throughput with worse response time might be a tradeoff.

Case study: shard overloaded

Slide 71

Slide 71 text

Response time per query type during overload

Slide 72

Slide 72 text

● The last 2 graphs were created by sending metadata from stats_mysql_query_digest to our monitoring system
● We gained lots of new insights from ProxySQL this way

Case study: shard overloaded

Slide 73

Slide 73 text

SQL state [HY000]; error code [9001]; Max connect timeout reached while reaching hostgroup 0 after 15080ms

In this case we know exactly that the transaction didn’t start yet (compared to the overloaded case, when we hit the timeout at random places).

Be aware: new error with multiplexing

Slide 74

Slide 74 text

● Earlier it was always the database
  ○ MySQL hostname and port in the exception
● Now it’s always ProxySQL
  ○ ProxySQL hostname and port in the exception

New thing to blame

Slide 75

Slide 75 text

Query Rewrite

Slide 76

Slide 76 text

This functionality is one of the main reasons to implement a proxy that sits between the application and the database server:
● Query rewriting on the fly, without having to involve the development team to correct misbehaving queries or force indexes
● Query rewriting is achieved by means of query rules that can be chained, i.e. a series of query rules can be applied to a specific query which matches the conditions of the query rules

Query rewrite on the fly

Slide 77

Slide 77 text

As with all query rules, the criteria to identify a query that requires a rewrite include the following (and more!):
● Match on a specific digest
● Match on a specific string sequence of a query digest text
● Match on a specific query value or sequence of values
● Regexp either on a query or digest text
● And many others, such as username, schemaname, client address, proxy address, previous query rule executed, etc.

Query rewrite criteria
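Chaining works via the flagIN/flagOUT columns of mysql_query_rules: a rule with apply=0 can set flagOUT, and subsequent rules whose flagIN equals that value continue processing the query. A hypothetical sketch (the username, flag value, and hostgroup are illustrative only):

```sql
-- Rule 1: tag queries from user 'report' and keep evaluating rules
INSERT INTO mysql_query_rules (rule_id, active, username, flagOUT, apply)
VALUES (10, 1, 'report', 100, 0);

-- Rule 2: applies only to queries tagged above; route them to hostgroup 1
INSERT INTO mysql_query_rules (rule_id, active, flagIN, destination_hostgroup, apply)
VALUES (20, 1, 100, 1, 1);

LOAD MYSQL QUERY RULES TO RUNTIME;
```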

Slide 78

Slide 78 text

Here is a simple example of a query rewrite rule which will force a specific index to be used on a query; it will match a pattern and replace it on the fly:

INSERT INTO mysql_query_rules (rule_id, active, match_pattern, replace_pattern)
VALUES (1, 1, "^SELECT first FROM names", "SELECT first FROM names FORCE INDEX(id)");

Query rewrite example 1

Slide 79

Slide 79 text

Here is another example of a query rewrite rule which will mask a value for a specific query; naturally, depending on the criteria, this can also be customised with a regexp to match a range of queries and conditions:

INSERT INTO mysql_query_rules (rule_id, active, match_pattern, replace_pattern)
VALUES (2, 1, "^SELECT first", "SELECT 'secret' AS first");

Query rewrite example 2

Slide 80

Slide 80 text

Creating scrubbed data sets

Slide 81

Slide 81 text

● Rene and Fred have an excellent talk about data masking
  ○ https://archive.fosdem.org/2017/schedule/event/proxysqldatamasking/
● Which has an even more excellent follow-up blog post
  ○ https://proxysql.com/blog/obfuscate-data-from-mysqldump

TL;DR dump scrubbed data from one interface

Slide 82

Slide 82 text

● Create an instance from a snapshot
● Configure an ephemeral ProxySQL for scrubbing
● Parallel-dump the instance through ProxySQL
  ○ Without replication running
● Much faster and needs fewer resources compared to our previous methods

Our implementation

Slide 83

Slide 83 text

Q&A

Slide 84

Slide 84 text

Thanks!