
MySQL with ProxySQL at Zuora

ProxySQL LLC
September 19, 2019


This presentation was given by Rene Cannao (ProxySQL) and Peter Boros (Zuora) at Oracle Code One 2019.

The talk abstract is as follows:

"Zuora, which believes that the world is moving from products to subscription services, is 100% focused on building the world’s first platform for subscription businesses. ProxySQL is Sakila’s most scalable and widely used high-performance MySQL proxy. Implementing ProxySQL was an essential improvement for Zuora’s database infrastructure. ProxySQL gave Zuora the opportunity to abstract the provided services away from the physical topology/implementation. With fast and reliable failover, the company can change its architecture fast, knowing that it isn’t breaking anything by keeping the same interface. This session discusses Zuora’s high-availability architecture, which is based on MySQL, ProxySQL, Consul, and Orchestrator".



Transcript

  1. The following presentation is intended for information purposes only. The

    development, release, and timing of any features or functionality described for Zuora’s products remains at the sole discretion of Zuora and is subject to change. Any unreleased services or features referenced in this presentation are not currently available and may not be delivered on time or at all. Customers should make purchase decisions based upon product features that are currently available. In addition, statements in this presentation that refer to forecasts, future plans and expectations are forward-looking statements that involve risks, uncertainties, and assumptions, many of which involve factors or circumstances that are beyond our control. If any such uncertainties materialize, if any of the assumptions prove incorrect, or if priorities or resourcing changes, actual results could differ materially from such forward-looking statements. You are cautioned not to place undue reliance on these forward-looking statements as they reflect Zuora’s expectations as of the date of this presentation. In addition, actual results could differ materially from those expressed or implied in the forward-looking statements due to a number of factors, including but not limited to, risks detailed in our Annual Report on Form 10-K filed with the Securities and Exchange Commission (SEC) on April 18, 2019, our most recent Quarterly Report on Form 10-Q filed with the SEC on September 13, 2019, as well as other documents that may be filed by us from time to time with the SEC. Copies of the Annual Report on Form 10-K, the Quarterly Report on Form 10-Q, and other filings with the SEC may be obtained by visiting our Investor Relations website at investor.zuora.com or the SEC's website at www.sec.gov. Zuora assumes no obligation and does not intend to update these forward-looking statements after the date of this presentation to conform these statements to actual results or to changes in our expectations. SAFE HARBOR DISCLAIMER
  2. • Checking out the nice title slide • Go through

    agenda <- we are here! • About us • Introduction - abstracting the topology • Fast failover with ProxySQL ◦ How it leads to cost savings • Multiplexing and connection pools • Creating scrubbed data sets Agenda
  3. It all began with a vision in a tiny office

    in Mountain View, California. In 2007, after several years at WebEx & salesforce.com, K.V. Rao, Cheng Zou and Tien Tzuo began to evangelize a fundamentally new business model. One where companies of all sizes would offer broad libraries of services via subscriptions. And one that would challenge the 20th century manufacturing economy by shifting the focus to delivering services instead of products. They called this the Subscription Economy. Since then, Zuora has become the leading Subscription Economy® evangelist, providing the only SaaS platform that automates all subscription order-to-revenue operations in real-time for any business. About Zuora
  4. • Zuora crossed the $10B Quarterly Transaction Volume threshold in Q4 2018 • Over 1200 employees throughout North America, Europe, China, India, Japan, and Australia • Zuora named a best workplace by Glassdoor, SF Business Times and Great Places to Work About Zuora
  5. • Founder and Author of ProxySQL since 2013 • 14

    years of experience in MySQL • Passion for: ◦ Performance ◦ High Availability ◦ Solving problems About Rene
  6. • Principal Database Architect @ Zuora • More than 10

    years of MySQL experience • Interests ◦ Low level details, performance tuning ◦ High availability ◦ Tooling About Peter
  7. • ProxySQL provides endpoints rather than exact machines somewhere • Load balancing an endpoint across multiple database instances is possible • Freedom to change the structure, while leaving the interfaces untouched Abstracting the topology
  8. ProxySQL has built-in support for various replication topologies: • Traditional

    async replication has been supported since ProxySQL 1.x ◦ Support for custom replication topologies was available via external scheduler scripts • ProxySQL 1.3.x introduced some support for MySQL Group Replication • Native support for MySQL Group Replication was added in ProxySQL 1.4.x • Native support for Galera Replication was added in ProxySQL 2.x • Native support for Amazon Aurora and RDS was added in ProxySQL 2.x Replication Topology Support
  9. Before looking at High Availability we first need to look into how ProxySQL monitors the availability of MySQL backends. • ProxySQL is designed to perform various checks against backend servers • The checks target various layers of communication • Checks are built with dependencies: if a lower-level check fails, there is no need for a higher-level check Monitoring Availability
  10. The Monitor Module checks backends and ensures traffic is only sent to operational servers. Let's look at the monitoring checks for traditional async / semi-sync replication: • Connect – tests new connections to backend servers (MySQL connect) • Ping – pings backend servers (MySQL ping) • Replication Lag – checks Seconds_Behind_Master status • Read only – checks read_only status The Monitor Module
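     To see how these checks are configured, the monitor settings can be inspected in the ProxySQL admin interface; a quick sketch (variable values will differ per installation):

       SELECT variable_name, variable_value
       FROM global_variables
       WHERE variable_name LIKE 'mysql-monitor%';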
  11. The monitor module determines the status of a backend server, available in the “runtime_mysql_servers” table (this can be set manually in “mysql_servers”): • ONLINE - backend server is fully operational • SHUNNED - backend server is temporarily taken out of use because of either too many connection errors in a short period of time, or because replication lag exceeded the allowed threshold (automatically set by the monitor) Backend Server States
  12. OFFLINE_SOFT - when a server is put into OFFLINE_SOFT mode, new incoming connections are no longer accepted, while existing connections are kept until they become inactive. In other words, connections are kept in use until the current transaction completes. This allows a backend to be gracefully detached (manually set). Backend Server States
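     For illustration, gracefully draining a backend via the admin interface might look like this (the hostname and hostgroup are hypothetical):

       UPDATE mysql_servers SET status='OFFLINE_SOFT'
       WHERE hostname='10.0.0.10' AND hostgroup_id=0;
       LOAD MYSQL SERVERS TO RUNTIME;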
  13. OFFLINE_HARD - when a server is put into OFFLINE_HARD mode, the existing connections are dropped and no new incoming connections are accepted. This is equivalent to deleting the server from a hostgroup (indefinitely or for maintenance); it is set manually, or automatically by ProxySQL on failover Backend Server States
  14. The most common topology for MySQL deployments is traditional primary

    / replica async (or semi-sync): Replication and Failover
  15. Since there is no native support for failover in MySQL,

    external mechanisms are used to promote / demote servers: Replication Managers
  16. ProxySQL transparently detects and adapts to these topology changes via the monitor module's checks ProxySQL Failover Detection
  17. When the status of a Primary server changes to read-only, ProxySQL will remove the server from the writer hostgroup (OFFLINE_HARD) ProxySQL Failover Detection
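     This read_only based promotion and demotion is driven by the mysql_replication_hostgroups table; a minimal sketch, using the writer/reader hostgroup IDs that appear later in this deck:

       INSERT INTO mysql_replication_hostgroups (writer_hostgroup, reader_hostgroup, comment)
       VALUES (0, 1, 'primary/replica cluster');
       LOAD MYSQL SERVERS TO RUNTIME;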
  18. Momentarily there are no RW nodes available; ProxySQL holds on to the connections for up to mysql-connect_timeout_server_max ms ProxySQL Failover Detection
  19. Once a RW node is detected, it is added to the writer_hostgroup (ONLINE) and connections are routed to the new Primary server ProxySQL Failover Detection
  20. • ProxySQL will automatically re-route “writer” connections to the new Primary after a failover • The failover time is configurable based on the read_only monitoring check • If the PRIMARY MySQL instance is lost while multiplexing is disabled or a transaction is active, the statement will fail, since the full transaction must be retried. Graceful Failover
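     For example, how quickly the read_only check reacts is governed by the monitor variables; a sketch of tightening them (the values are purely illustrative, not a recommendation):

       UPDATE global_variables SET variable_value='1000'
       WHERE variable_name='mysql-monitor_read_only_interval';
       UPDATE global_variables SET variable_value='500'
       WHERE variable_name='mysql-monitor_read_only_timeout';
       LOAD MYSQL VARIABLES TO RUNTIME;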
  21. Although the standard failover is acceptable in an “emergency situation”, failovers are also required for common maintenance activities including: • Scaling up hardware and resources • Operating system updates • Upgrading MySQL to a newer release • Executing certain blocking ALTERs e.g. converting a primary key from INT to BIGINT Graceful Failover
  22. ProxySQL facilitates “planned failovers” as well as “unplanned failovers” • In the “mysql_servers” table each server can have a status of ONLINE, SHUNNED, OFFLINE_SOFT or OFFLINE_HARD configured • OFFLINE_SOFT is a special status that allows all currently executing connections (statements and transactions) to complete while preventing any new connections from being made; new connections are either sent to another available server in the same HG or held until a server is available Graceful Failover
  23. Let’s look at a graceful failover. The initial state of

    the healthy MySQL cluster The Graceful Failover Process
  24. First, we remove the replication hostgroup to prevent automatic failover as a safety precaution The Graceful Failover Process
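     A sketch of that step on the admin interface, assuming writer hostgroup 0 as in the earlier examples:

       DELETE FROM mysql_replication_hostgroups WHERE writer_hostgroup=0;
       LOAD MYSQL SERVERS TO RUNTIME;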
  25. Next we set the status of the Primary to OFFLINE_SOFT

    to allow current statements and transactions to complete their execution The Graceful Failover Process
  26. Now it is time to perform the failover either manually

    or using a tool such as Orchestrator to promote a Replica to Primary The Graceful Failover Process
  27. Traffic will continue to be served from the existing nodes

    and maintenance activities can be performed The Graceful Failover Process
  28. At this point the replication hostgroup can be reconfigured to

    allow for any unplanned failover that may occur The Graceful Failover Process
  29. Once maintenance is completed we can add the previous Primary to the Replica hostgroup (hostgroup=1) by dynamically updating the configuration, as sketched below The Graceful Failover Process
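     A minimal sketch of that step (the hostname is hypothetical):

       INSERT INTO mysql_servers (hostgroup_id, hostname, port)
       VALUES (1, 'old-primary.example.com', 3306);
       LOAD MYSQL SERVERS TO RUNTIME;
       SAVE MYSQL SERVERS TO DISK;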
  30. The steps for graceful failover can be executed manually or scripted • Manual execution allows for more control, while scripts are more precise and are likely to execute in a smaller time-frame • NOTE: The time between draining connections and promoting a replica must be shorter than “mysql-connect_timeout_server_max”, or the new waiting connections will return an error – you can tweak this variable before and after the failover Implementing Graceful Failover
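     Adjusting that timeout around a planned failover could look like the following (the 30-second value is only an example):

       UPDATE global_variables SET variable_value='30000'
       WHERE variable_name='mysql-connect_timeout_server_max';
       LOAD MYSQL VARIABLES TO RUNTIME;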
  31. • Ensure that the number of available nodes supports the maximum throughput required by applications • Use the same sized servers for the Primary and the Replicas which are candidates for promotion (at least one pair should be equal) • Prepare scripts to execute “planned failover” to ensure a smooth and repeatable process • When integrating with Orchestrator, consider implementing ProxySQL planned-failover logic in the available hooks General Failover Considerations
  32. • Consul for service discovery • Head nodes are running consul servers and orchestrator-raft • If a proxy can’t reach raft quorum, it wipes itself from the scheduler • The application connects to all proxies in failover mode, using a multi-host connection at the JDBC level Zuora’s implementation
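     Such a multi-host JDBC URL might look like the following (hostnames and database name are hypothetical; Connector/J tries the hosts in order and fails over to the next one):

       jdbc:mysql://proxy1:6033,proxy2:6033,proxy3:6033/zuora?failOverReadOnly=false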
  33. • Using the OFFLINE_SOFT method with a wrapper for graceful failover • Using orchestrator and reconfiguring ProxySQL from an orchestrator hook for non-graceful failover • We also update metadata in consul • One shard can have any number of proxies Zuora’s implementation
  34. • Because the failover is non-impactful, we can do it

    all the time and automatically ◦ Takes less than 10 seconds total ◦ Queueing time is typically 3-4 sec • Faster administrative tasks ◦ Version upgrades ◦ OS upgrades • Faster iteration on infrastructure level changes What it allowed us to do
  35. • A long transaction can hold the OFFLINE_SOFT state for too long, which can be just as bad as the failure itself • Choose your timeouts carefully; some long-running transactions might be killed at the time of failover • Too long in OFFLINE_SOFT can be just as bad as too short Advice for OFFLINE_SOFT
  36. • AWS released the new r5 instances to the public • We were on r3; we benchmarked r5 and found it faster and cheaper • Migrated the whole infrastructure from r3 to r5 completely online, with minimal impact to the application Benefits example - instance type
  37. • Benchmarked available storage • Using larger gp2 volumes instead of smaller provisioned-IOPS volumes had performance and cost benefits • We could switch the volumes online • The instance type and the volume change together meant a 20% cost reduction Benefits example - storage
  38. Multiplexing is a method whereby established backend connections are re-used for multiple frontend connections: • The implementation is similar to a thread pool • Backend connections are allocated to frontend connections while they are active • ProxySQL maintains idle frontend connections asynchronously to offload that work from the database • There are certain conditions under which multiplexing is disabled to safeguard the state of a connection What is Multiplexing in ProxySQL?
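     Multiplexing is enabled by default; a quick way to confirm this from the admin interface:

       SELECT variable_name, variable_value FROM global_variables
       WHERE variable_name='mysql-multiplexing';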
  39. When a transaction is active in a connection, multiplexing is disabled until the transaction commits or rolls back. A transaction becomes active when: • “START TRANSACTION” is executed • “BEGIN” is executed • The first statement is executed in a session where “AUTOCOMMIT = OFF” Conditions that disable multiplexing
  40. • If LOCK TABLE, LOCK TABLES or FLUSH TABLES WITH READ LOCK is executed, multiplexing is disabled until UNLOCK TABLES is executed • If GET_LOCK() is executed, multiplexing is disabled and is never enabled again • If a query contains SQL_CALC_FOUND_ROWS, multiplexing is disabled • If CREATE TEMPORARY TABLE is executed, multiplexing is disabled • If PREPARE is executed (creation of prepared statements using the TEXT protocol rather than the BINARY protocol), multiplexing is not enabled again on the connection • If SQL_LOG_BIN is set to 0, multiplexing is disabled until SQL_LOG_BIN is set back to 1 Conditions that disable multiplexing
  41. All queries that have @ in their query syntax will disable multiplexing, and it will never be enabled again on that connection. The same applies to the following: • SET SQL_SAFE_UPDATES / SQL_SELECT_LIMIT / MAX_JOIN_SIZE (no longer the case from 2.0.7) • SET FOREIGN_KEY_CHECKS • SET UNIQUE_CHECKS • SET AUTO_INCREMENT_INCREMENT • SET AUTO_INCREMENT_OFFSET • SET GROUP_CONCAT_MAX_LEN There are a few exceptions that will NOT disable multiplexing e.g.: • SELECT @@tx_isolation • SELECT @@version Conditions that disable multiplexing
  42. Often, multiplexing will be disabled due to default session variables set by an ORM or by an application component designed for generic use; however, disabling it is not necessarily required. • In these cases a ProxySQL Query Rule can be implemented to explicitly allow multiplexing to remain enabled • A typical scenario is related to the max_allowed_packet queries; a query rule to counteract this would look like:

       INSERT INTO mysql_query_rules (active, match_digest, multiplex)
       VALUES ('1', '^SELECT @@max_allowed_packet', 2);

     Optimizing Multiplexing
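     Note that a rule inserted via the admin interface only takes effect once it is loaded to runtime (and persisted, if desired):

       LOAD MYSQL QUERY RULES TO RUNTIME;
       SAVE MYSQL QUERY RULES TO DISK;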
  43. Typically the queries that disable multiplexing can be identified by querying the stats_mysql_query_digest table, as sketched below. • We can query the table and look for the specific query syntax which disables multiplexing, to evaluate whether disabling it is actually necessary • Starting from ProxySQL 2.0.6, SET queries that can’t be parsed disable multiplexing and are logged to ProxySQL’s error log at runtime, making this process even more straightforward Optimizing Multiplexing
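     A sketch of such a check (the LIKE patterns are just examples; adapt them to the constructs listed on the previous slides):

       SELECT hostgroup, digest_text, count_star
       FROM stats_mysql_query_digest
       WHERE digest_text LIKE '%SQL_CALC_FOUND_ROWS%'
          OR digest_text LIKE '%GET_LOCK%'
       ORDER BY count_star DESC;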
  44. ProxySQL maintains a connection pool per backend server • Connections to a backend are never pre-allocated if there is no need, so at start-up there will be 0 connections to the backend • When the application starts sending traffic to ProxySQL, it identifies which backend it needs to send traffic to. If there is a connection in the connection pool for that backend, that connection is used; otherwise a new connection is created • When the connection completes serving the client's request, it is returned to the pool if the connection is safe to share and the connection pool isn't full Connection Pooling in ProxySQL
  45. Connection pools are configured by default via the “mysql_servers” table in the admin interface; below we can see an example from a typical configuration:

       +----+----------------+--------+----------+-----------------+
       | HG | Host           | status | weight   | max_connections |
       +----+----------------+--------+----------+-----------------+
       | 0  | xxx.xx.xxx.121 | ONLINE | 10000000 | 2000            |
       | 1  | xxx.xx.xxx.121 | ONLINE | 1        | 2000            |
       | 1  | xxx.xx.xxx.122 | ONLINE | 10000000 | 2000            |
       | 2  | xxx.xx.xxx.221 | ONLINE | 10000000 | 2000            |
       +----+----------------+--------+----------+-----------------+

     Connection Pooling in ProxySQL
  46. We can also view the status of the connection pools in the stats_mysql_connection_pool table. Notice that the xxx.xx.xxx.121 node in the Reader HG1 is not used, due to the low weight configured for it in the mysql_servers table:

       +----+----------------+--------+----------+----------+--------+---------+------------+------------+
       | HG | Host           | status | ConnUsed | ConnFree | ConnOK | ConnERR | Queries    | Latency_us |
       +----+----------------+--------+----------+----------+--------+---------+------------+------------+
       | 0  | xxx.xx.xxx.121 | ONLINE | 1168     | 160      | 2991   | 44      | 1453995458 | 1442       |
       | 1  | xxx.xx.xxx.121 | ONLINE | 0        | 0        | 0      | 0       | 0          | 1442       |
       | 1  | xxx.xx.xxx.122 | ONLINE | 768      | 76       | 13     | 0       | 16083      | 1322       |
       | 2  | xxx.xx.xxx.221 | ONLINE | 0        | 10       | 10     | 0       | 20572      | 382        |
       +----+----------------+--------+----------+----------+--------+---------+------------+------------+

     Connection Pooling in ProxySQL
  47. Each connection pool will maintain a certain number of free connections for re-use; this can be controlled within ProxySQL at a global level using the mysql-free_connections_pct variable. • This variable controls the percentage of open idle connections out of the total maximum number of connections for a specific server in a hostgroup • For each hostgroup/backend pair, the Hostgroup Manager will keep in the connection pool up to mysql-free_connections_pct * mysql_servers.max_connections / 100 connections • Connections are kept open with periodic pings • A connection is considered idle if it hasn't been used since the last round of pings. The time interval between two such rounds of pings for idle connections is controlled by the variable mysql-ping_interval_server_msec Connection Pooling in ProxySQL
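     As a worked example: with a mysql-free_connections_pct of 10 (the default) and max_connections = 2000 as in the earlier mysql_servers example, up to 10 * 2000 / 100 = 200 idle connections would be kept per hostgroup/backend pair.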
  48. • Around 6:30 PM, we received an alert that a database server had high concurrency (Threads_running around 700) • It was overloaded with lots of short queries • We had multiplexing enabled, so we lowered the number of backend connections to 100 • On the next slide, spot the time when this happened Case study: shard overloaded
  49. • Reducing parallelism allowed the database server to spend the same amount of CPU not on context switching, but on more meaningful work • QPS increased, but what about individual query types? Maybe one query type became faster or a side load was added. Case study: shard overloaded
  50. • Throughput was increased for every query type • What

    about response time? Better throughput with worse response time might be a tradeoff. Case study: shard overloaded
  51. • The last 2 graphs were created by sending metadata

    from stats_mysql_query_digest to our monitoring system • We gained lots of new insights this way from ProxySQL Case study: shard overloaded
  52. SQL state [HY000]; error code [9001]; Max connect timeout reached while reaching hostgroup 0 after 15080ms. In this case we know for certain that the transaction hadn't started yet (compared to the overloaded case, when we hit the timeout at random places). Be aware: new error with multiplexing
  53. • Earlier it was always the database ◦ MySQL hostname

    and port in exception • Now it’s always ProxySQL ◦ ProxySQL hostname and port in exception New thing to blame
  54. This functionality is one of the main reasons to implement a proxy that sits between the application and the database server: • Query rewriting on the fly, without having to involve the development team to correct misbehaving queries or force indexes • Query rewriting is achieved by means of query rules that can be chained, i.e. a series of query rules can be applied to a specific query which matches the conditions of the query rules Query rewrite on the fly
  55. As with all query rules, the criteria used to identify a query that requires a rewrite include the following (and more!): • Match of a specific digest • Match on a specific string sequence of a query digest text • Match on a specific query value or sequence of values • Regexp either on a query or digest text • And many others such as username, schemaname, client address, proxy address, previous query rule executed etc. Query rewrite criteria
  56. Here is a simple example of a query rewrite rule which will force a specific index to be used on a query; it will match a pattern and replace it on the fly:

       INSERT INTO mysql_query_rules (rule_id, active, match_pattern, replace_pattern)
       VALUES (1, 1, "^SELECT first FROM names", "SELECT first FROM names FORCE INDEX(id)");

     Query rewrite example 1
  57. Here is another example of a query rewrite rule which will mask a value for a specific query; naturally, depending on the criteria, this can also be customised with a regexp to match a range of queries and conditions:

       INSERT INTO mysql_query_rules (rule_id, active, match_pattern, replace_pattern)
       VALUES (2, 1, "^SELECT first", "SELECT 'secret' AS first");

     Query rewrite example 2
  58. • Rene and Fred have an excellent talk about data masking ◦ https://archive.fosdem.org/2017/schedule/event/proxysqldatamasking/ • Which has an even more excellent follow-up blog post ◦ https://proxysql.com/blog/obfuscate-data-from-mysqldump TL;DR dump scrubbed data from one interface
  59. • Create an instance from a snapshot • Configure an ephemeral ProxySQL for scrubbing • Parallel-dump the instance through ProxySQL ◦ Without replication running • Much faster and needs fewer resources compared to our previous methods Our implementation
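     To illustrate the idea (the table and column names here are hypothetical; the real rules follow the talk and blog post above), the ephemeral proxy can scrub data with rewrite rules that mask sensitive columns in the dump's SELECTs:

       INSERT INTO mysql_query_rules (rule_id, active, match_pattern, replace_pattern)
       VALUES (10, 1, "SELECT .* FROM accounts", "SELECT id, 'scrubbed' AS email FROM accounts");
       LOAD MYSQL QUERY RULES TO RUNTIME;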
  60. Q&A