Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Postgres load balancing is secretly broken-the cancellation problem | Uptime 2022 | Jelte Fennema

Citus Data
September 14, 2022

Postgres load balancing is secretly broken-the cancellation problem | Uptime 2022 | Jelte Fennema

On certain workloads a single Postgres server cannot deliver the desired performance. If the traffic is read heavy, you could already resolve this by adding read replicas. If the traffic is write heavy then the upcoming Citus 11 release comes to the rescue.

In both cases the queries of your application need to be sent to a randomly chosen Postgres server, to make these servers share the load. A TCP load balancer can do this easily. However, such a load balancer has a hidden downside. When you use it and try to cancel a query, you'll notice that cancelling only works some of the time.

This talk will explain why this problem with cancelations occurs. And it shows a few ways to work around it, including changes that I proposed to Postgres and PgBouncer.

If any of the following topics sound interesting then this talk is for you:
1. Postgres read replicas
2. Scaling writes with Citus 11
3. Running PgBouncer on multiple CPU cores
4. Implementation details of query cancellations

Citus Data

September 14, 2022
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. Postgres load balancing is secretly broken: The cancellation problem Jelte

    Fennema (@JelteF) Developing Citus and Postgres at Microsoft & PgBouncer maintainer 2022-09-14
  2. ©Microsoft Corporation Azure What am I going to talk about?

    • Load balancing across Postgres servers • Read replicas • Citus • Low level details about cancelling queries • PgBouncer
  3. ©Microsoft Corporation Azure What is load balancing? • Sharing workload

    across servers • Different servers handle independent requests
  4. ©Microsoft Corporation Azure Why load balancing for Postgres? • Performance

    • Scaling reads with Postgres read replicas • Scaling writes with the Citus extension for Postgres
  5. ©Microsoft Corporation Azure How to do load balancing? • Option

    1: Client-side load balancing Client knows about all servers and connects to all of them • Option 2: Server-side load balancing Client only knows a single logical server, that server “secretly” forwards requests to multiple servers
  6. ©Microsoft Corporation Azure Client-side load balancing explained Common approaches: •

    Hardcoded list of IPs or domains used by client • Multiple DNS records for the same domain, one for each server • With smart client • Or with round-robin resolver Downsides: • Every client needs to know about all the servers • Every client needs logic to choose a server • Caching: what happens when you add a server
  7. ©Microsoft Corporation Azure Client-side load balancing with Postgres • JDBC

    (Java) and Npgsql (C#) support • PgBouncer released client-side load balancing last March in v1.17.0 • libpq does not support client-side load balancing natively I submitted a patch to add support for this: https://commitfest.postgresql.org/39/3679/ • Many DNS resolvers always return a single fixed result So, a single client does not load balance
  8. ©Microsoft Corporation Azure Server-side load balancing explained Common approaches: •

    Proxy server • Software defined networking (SDN) Problems: • Proxy server introduces latency • Costs extra money Big advantage: • Clients don’t need special support and/or configuration
  9. ©Microsoft Corporation Azure Server-side load balancing with Postgres Options: 1.

    A dedicated PgBouncer server configured in client-side load balancing mode • Can use transaction or session load balancing • Extra network hop introduces latency • Single threaded 2. Off the shelf TCP load balancer • Each TCP stream is assigned to a different server • Probably has “fancy” things like health-checks • No extra network hop in case of SDN based load balancer
  10. ©Microsoft Corporation Azure So what happens with cancellations and a

    load balancer Client Load balancer CANCEL SECRET-TOKEN-123
  11. ©Microsoft Corporation Azure So what happens with cancellations and a

    load balancer Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123
  12. ©Microsoft Corporation Azure So what happens with cancellations and a

    load balancer Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 I don’t know SECRET-TOKEN-123. This is probably a hacker. Let’s ignore this request
  13. ©Microsoft Corporation Azure Not so fast (continued) • Important assumption:

    Different TCP streams are independent • Correct: Query on same Postgres session == same TCP stream • Correct: New session == new TCP stream • Wrong: Query cancellation request == new TCP stream
  14. ©Microsoft Corporation Azure The cancellation problem • Query cancellations end

    up at the wrong server most of the time • Workaround: Trigger cancellations multiple time at the client side • Solutions: None exist so far
  15. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer CANCEL SECRET-TOKEN-123
  16. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123
  17. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 I don’t know SECRET-TOKEN-123 This was probably meant for another server. Let me ask Server A
  18. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123
  19. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123
  20. ©Microsoft Corporation Azure How to make this idea work? •

    Postgres does not support this out of the box • What can we do to make it work?
  21. ©Microsoft Corporation Azure Option 1: Postgres extension • No extension

    hooks exist for cancellation related code  • Getting hooks (or built-in functionality) in Postgres will take a long time and won’t work on old Postgres versions • Need to build cancellation forwarding support
  22. ©Microsoft Corporation Azure Option 2: Modify PgBouncer • PgBouncer can

    run in front of any Postgres server version • PgBouncer already has cancellation forwarding code to forward cancellations to Postgres • This code only needs to be modified to send to other PgBouncer servers • PR for this can be found here: https://github.com/pgbouncer/pgbouncer/pull/666
  23. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B
  24. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B
  25. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B I don’t know SECRET-TOKEN-123 This was probably meant for another server. Let me ask Server A
  26. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B
  27. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B
  28. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B
  29. ©Microsoft Corporation Azure Making PgBouncer performant • Single-threaded design becomes

    a bottleneck on high throughput servers • PgBouncer supports multi-process by using so_reuseport=1 • The Linux kernel then load-balances TCP streams across all processes • So, what we end up with is a multi-layered load balancer
  30. ©Microsoft Corporation Azure What we end up with Client Load

    balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  31. ©Microsoft Corporation Azure What we end up with Client Load

    balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  32. ©Microsoft Corporation Azure What we end up with Client Load

    balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  33. ©Microsoft Corporation Azure What we end up with Client Load

    balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  34. ©Microsoft Corporation Azure What we end up with Client Load

    balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  35. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  36. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  37. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  38. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  39. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  40. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  41. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  42. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  43. ©Microsoft Corporation Azure The final problem • To which one

    should PgBouncer forward it? • Sending the cancellation to all servers is quite heavy
  44. ©Microsoft Corporation Azure The final solution • Encode an identifier

    in the cancellation token: e.g. SECRET-TOKEN-123-A1