$30 off During Our Annual Pro Sale. View Details »

Postgres load balancing is secretly broken-the cancellation problem | Uptime 2022 | Jelte Fennema

Citus Data
September 14, 2022

Postgres load balancing is secretly broken-the cancellation problem | Uptime 2022 | Jelte Fennema

On certain workloads a single Postgres server cannot deliver the desired performance. If the traffic is read heavy, you could already resolve this by adding read replicas. If the traffic is write heavy then the upcoming Citus 11 release comes to the rescue.

In both cases the queries of your application need to be sent to a randomly chosen Postgres server, to make these servers share the load. A TCP load balancer can do this easily. However, such a load balancer has a hidden downside. When you use it and try to cancel a query, you'll notice that cancelling only works some of the time.

This talk will explain why this problem with cancelations occurs. And it shows a few ways to work around it, including changes that I proposed to Postgres and PgBouncer.

If any of the following topics sound interesting then this talk is for you:
1. Postgres read replicas
2. Scaling writes with Citus 11
3. Running PgBouncer on multiple CPU cores
4. Implementation details of query cancellations

Citus Data

September 14, 2022
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. Postgres load balancing is secretly broken: The cancellation problem Jelte

    Fennema (@JelteF) Developing Citus and Postgres at Microsoft & PgBouncer maintainer 2022-09-14
  2. ©Microsoft Corporation Azure What am I going to talk about?

    • Load balancing across Postgres servers • Read replicas • Citus • Low level details about cancelling queries • PgBouncer
  3. ©Microsoft Corporation Azure What is load balancing? • Sharing workload

    across servers • Different servers handle independent requests
  4. ©Microsoft Corporation Azure Why load balancing for Postgres? • Performance

    • Scaling reads with Postgres read replicas • Scaling writes with the Citus extension for Postgres
  5. ©Microsoft Corporation Azure Scaling reads or writes? Why? Clients

  6. ©Microsoft Corporation Azure Scaling reads or writes? Why? Clients

  7. ©Microsoft Corporation Azure Scaling reads or writes? Why? Clients

  8. ©Microsoft Corporation Azure Scaling reads or writes? Why? Clients

  9. ©Microsoft Corporation Azure What are read replicas? Client Client Clients

  10. ©Microsoft Corporation Azure What is Citus?

  11. ©Microsoft Corporation Azure How to do load balancing? • Option

    1: Client-side load balancing Client knows about all servers and connects to all of them • Option 2: Server-side load balancing Client only knows a single logical server, that server “secretly” forwards requests to multiple servers
  12. ©Microsoft Corporation Azure Client-side load balancing explained Common approaches: •

    Hardcoded list of IPs or domains used by client • Multiple DNS records for the same domain, one for each server • With smart client • Or with round-robin resolver Downsides: • Every client needs to know about all the servers • Every client needs logic to choose a server • Caching: what happens when you add a server
  13. ©Microsoft Corporation Azure Client-side load balancing with Postgres • JDBC

    (Java) and Npgsql (C#) support • PgBouncer released client-side load balancing last March in v1.17.0 • libpq does not support client-side load balancing natively I submitted a patch to add support for this: https://commitfest.postgresql.org/39/3679/ • Many DNS resolvers always return a single fixed result So, a single client does not load balance
  14. ©Microsoft Corporation Azure Server-side load balancing explained Common approaches: •

    Proxy server • Software defined networking (SDN) Problems: • Proxy server introduces latency • Costs extra money Big advantage: • Clients don’t need special support and/or configuration
  15. ©Microsoft Corporation Azure Server-side load balancing with Postgres Options: 1.

    A dedicated PgBouncer server configured in client-side load balancing mode • Can use transaction or session load balancing • Extra network hop introduces latency • Single threaded 2. Off the shelf TCP load balancer • Each TCP stream is assigned to a different server • Probably has “fancy” things like health-checks • No extra network hop in case of SDN based load balancer
  16. ©Microsoft Corporation Azure Decision seems simple Off the shelf TCP

    based load balancer it is
  17. ©Microsoft Corporation Azure Not so fast • Important assumption: Different

    TCP streams are independent
  18. ©Microsoft Corporation Azure How Postgres cancellation requests work Client

  19. ©Microsoft Corporation Azure How Postgres cancellation requests work Client

  20. ©Microsoft Corporation Azure How Postgres cancellation requests work Client

  21. ©Microsoft Corporation Azure How Postgres cancellation requests work Client

  22. ©Microsoft Corporation Azure How Postgres cancellation requests work Client CANCEL

    SECRET-TOKEN-123
  23. ©Microsoft Corporation Azure How Postgres cancellation requests work Client CANCEL

    SECRET-TOKEN-123
  24. ©Microsoft Corporation Azure So what happens with cancellations and a

    load balancer Client Load balancer
  25. ©Microsoft Corporation Azure So what happens with cancellations and a

    load balancer Client Load balancer
  26. ©Microsoft Corporation Azure So what happens with cancellations and a

    load balancer Client Load balancer
  27. ©Microsoft Corporation Azure So what happens with cancellations and a

    load balancer Client Load balancer CANCEL SECRET-TOKEN-123
  28. ©Microsoft Corporation Azure So what happens with cancellations and a

    load balancer Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123
  29. ©Microsoft Corporation Azure So what happens with cancellations and a

    load balancer Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 I don’t know SECRET-TOKEN-123. This is probably a hacker. Let’s ignore this request
  30. ©Microsoft Corporation Azure Not so fast (continued) • Important assumption:

    Different TCP streams are independent • Correct: Query on same Postgres session == same TCP stream • Correct: New session == new TCP stream • Wrong: Query cancellation request == new TCP stream
  31. ©Microsoft Corporation Azure The cancellation problem • Query cancellations end

    up at the wrong server most of the time • Workaround: Trigger cancellations multiple time at the client side • Solutions: None exist so far
  32. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer
  33. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer
  34. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer
  35. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer CANCEL SECRET-TOKEN-123
  36. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123
  37. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 I don’t know SECRET-TOKEN-123 This was probably meant for another server. Let me ask Server A
  38. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123
  39. ©Microsoft Corporation Azure Idea: Make Postgres servers know about each

    other Client Load balancer CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123
  40. ©Microsoft Corporation Azure How to make this idea work? •

    Postgres does not support this out of the box • What can we do to make it work?
  41. ©Microsoft Corporation Azure Option 1: Postgres extension • No extension

    hooks exist for cancellation related code  • Getting hooks (or built-in functionality) in Postgres will take a long time and won’t work on old Postgres versions • Need to build cancellation forwarding support
  42. ©Microsoft Corporation Azure Option 2: Modify PgBouncer • PgBouncer can

    run in front of any Postgres server version • PgBouncer already has cancellation forwarding code to forward cancellations to Postgres • This code only needs to be modified to send to other PgBouncer servers • PR for this can be found here: https://github.com/pgbouncer/pgbouncer/pull/666
  43. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    PgBouncer B PgBouncer A Postgres A Postgres B
  44. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    PgBouncer B PgBouncer A Postgres A Postgres B
  45. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    PgBouncer B PgBouncer A Postgres A Postgres B
  46. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    PgBouncer B PgBouncer A Postgres A Postgres B
  47. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B
  48. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B
  49. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B I don’t know SECRET-TOKEN-123 This was probably meant for another server. Let me ask Server A
  50. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B
  51. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B
  52. ©Microsoft Corporation Azure Option 2: Modify PgBouncer Client Load balancer

    CANCEL SECRET-TOKEN-123 CANCEL SECRET-TOKEN-123 PgBouncer B PgBouncer A Postgres A Postgres B
  53. ©Microsoft Corporation Azure Making PgBouncer performant • Single-threaded design becomes

    a bottleneck on high throughput servers • PgBouncer supports multi-process by using so_reuseport=1 • The Linux kernel then load-balances TCP streams across all processes • So, what we end up with is a multi-layered load balancer
  54. ©Microsoft Corporation Azure What we end up with Client Load

    balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  55. ©Microsoft Corporation Azure What we end up with Client Load

    balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  56. ©Microsoft Corporation Azure What we end up with Client Load

    balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  57. ©Microsoft Corporation Azure What we end up with Client Load

    balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  58. ©Microsoft Corporation Azure What we end up with Client Load

    balancer PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  59. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  60. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  61. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  62. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  63. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  64. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  65. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  66. ©Microsoft Corporation Azure What we end up with Client Load

    balancer CANCEL SECRET-TOKEN-123 PgBouncer B1 PgBouncer A1 Postgres A Postgres B PgBouncer A2 Linux kernel PgBouncer B2 Linux kernel
  67. ©Microsoft Corporation Azure The final problem • To which one

    should PgBouncer forward it? • Sending the cancellation to all servers is quite heavy
  68. ©Microsoft Corporation Azure The final solution • Encode an identifier

    in the cancellation token: e.g. SECRET-TOKEN-123-A1
  69. © Copyright Microsoft Corporation. All rights reserved. Any questions?