Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Postgres load balancing is secretly broken-the cancellation problem | Uptime 2022 | Jelte Fennema

Citus Data
September 14, 2022

Postgres load balancing is secretly broken-the cancellation problem | Uptime 2022 | Jelte Fennema

On certain workloads a single Postgres server cannot deliver the desired performance. If the traffic is read heavy, you could already resolve this by adding read replicas. If the traffic is write heavy then the upcoming Citus 11 release comes to the rescue.

In both cases the queries of your application need to be sent to a randomly chosen Postgres server, to make these servers share the load. A TCP load balancer can do this easily. However, such a load balancer has a hidden downside. When you use it and try to cancel a query, you'll notice that cancelling only works some of the time.

This talk will explain why this problem with cancelations occurs. And it shows a few ways to work around it, including changes that I proposed to Postgres and PgBouncer.

If any of the following topics sound interesting then this talk is for you:
1. Postgres read replicas
2. Scaling writes with Citus 11
3. Running PgBouncer on multiple CPU cores
4. Implementation details of query cancellations

Citus Data

September 14, 2022
Tweet

More Decks by Citus Data

Other Decks in Technology

Transcript

  1. Postgres load balancing
    is secretly broken:
    The cancellation problem
    Jelte Fennema (@JelteF)
    Developing Citus and Postgres at Microsoft & PgBouncer maintainer
    2022-09-14

    View Slide

  2. ©Microsoft Corporation
    Azure
    What am I going to talk about?
    • Load balancing across Postgres servers
    • Read replicas
    • Citus
    • Low level details about cancelling queries
    • PgBouncer

    View Slide

  3. ©Microsoft Corporation
    Azure
    What is load balancing?
    • Sharing workload across servers
    • Different servers handle independent requests

    View Slide

  4. ©Microsoft Corporation
    Azure
    Why load balancing for Postgres?
    • Performance
    • Scaling reads with Postgres read replicas
    • Scaling writes with the Citus extension for Postgres

    View Slide

  5. ©Microsoft Corporation
    Azure
    Scaling reads or writes? Why?
    Clients

    View Slide

  6. ©Microsoft Corporation
    Azure
    Scaling reads or writes? Why?
    Clients

    View Slide

  7. ©Microsoft Corporation
    Azure
    Scaling reads or writes? Why?
    Clients

    View Slide

  8. ©Microsoft Corporation
    Azure
    Scaling reads or writes? Why?
    Clients

    View Slide

  9. ©Microsoft Corporation
    Azure
    What are read replicas?
    Client
    Client
    Clients

    View Slide

  10. ©Microsoft Corporation
    Azure
    What is Citus?

    View Slide

  11. ©Microsoft Corporation
    Azure
    How to do load balancing?
    • Option 1: Client-side load balancing
    Client knows about all servers and connects to all of them
    • Option 2: Server-side load balancing
    Client only knows a single logical server, that server “secretly” forwards requests to multiple servers

    View Slide

  12. ©Microsoft Corporation
    Azure
    Client-side load balancing explained
    Common approaches:
    • Hardcoded list of IPs or domains used by client
    • Multiple DNS records for the same domain, one for each server
    • With smart client
    • Or with round-robin resolver
    Downsides:
    • Every client needs to know about all the servers
    • Every client needs logic to choose a server
    • Caching: what happens when you add a server

    View Slide

  13. ©Microsoft Corporation
    Azure
    Client-side load balancing with Postgres
    • JDBC (Java) and Npgsql (C#) support
    • PgBouncer released client-side load balancing last March in v1.17.0
    • libpq does not support client-side load balancing natively
    I submitted a patch to add support for this: https://commitfest.postgresql.org/39/3679/
    • Many DNS resolvers always return a single fixed result
    So, a single client does not load balance

    View Slide

  14. ©Microsoft Corporation
    Azure
    Server-side load balancing explained
    Common approaches:
    • Proxy server
    • Software defined networking (SDN)
    Problems:
    • Proxy server introduces latency
    • Costs extra money
    Big advantage:
    • Clients don’t need special support and/or configuration

    View Slide

  15. ©Microsoft Corporation
    Azure
    Server-side load balancing with Postgres
    Options:
    1. A dedicated PgBouncer server configured in client-side load balancing mode
    • Can use transaction or session load balancing
    • Extra network hop introduces latency
    • Single threaded
    2. Off the shelf TCP load balancer
    • Each TCP stream is assigned to a different server
    • Probably has “fancy” things like health-checks
    • No extra network hop in case of SDN based load balancer

    View Slide

  16. ©Microsoft Corporation
    Azure
    Decision seems simple
    Off the shelf TCP based load balancer it is

    View Slide

  17. ©Microsoft Corporation
    Azure
    Not so fast
    • Important assumption: Different TCP streams are independent

    View Slide

  18. ©Microsoft Corporation
    Azure
    How Postgres cancellation requests work
    Client

    View Slide

  19. ©Microsoft Corporation
    Azure
    How Postgres cancellation requests work
    Client

    View Slide

  20. ©Microsoft Corporation
    Azure
    How Postgres cancellation requests work
    Client

    View Slide

  21. ©Microsoft Corporation
    Azure
    How Postgres cancellation requests work
    Client

    View Slide

  22. ©Microsoft Corporation
    Azure
    How Postgres cancellation requests work
    Client
    CANCEL SECRET-TOKEN-123

    View Slide

  23. ©Microsoft Corporation
    Azure
    How Postgres cancellation requests work
    Client
    CANCEL SECRET-TOKEN-123

    View Slide

  24. ©Microsoft Corporation
    Azure
    So what happens with cancellations and a load balancer
    Client
    Load balancer

    View Slide

  25. ©Microsoft Corporation
    Azure
    So what happens with cancellations and a load balancer
    Client
    Load balancer

    View Slide

  26. ©Microsoft Corporation
    Azure
    So what happens with cancellations and a load balancer
    Client
    Load balancer

    View Slide

  27. ©Microsoft Corporation
    Azure
    So what happens with cancellations and a load balancer
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123

    View Slide

  28. ©Microsoft Corporation
    Azure
    So what happens with cancellations and a load balancer
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123

    View Slide

  29. ©Microsoft Corporation
    Azure
    So what happens with cancellations and a load balancer
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123
    I don’t know SECRET-TOKEN-123.
    This is probably a hacker.
    Let’s ignore this request

    View Slide

  30. ©Microsoft Corporation
    Azure
    Not so fast (continued)
    • Important assumption: Different TCP streams are independent
    • Correct: Query on same Postgres session == same TCP stream
    • Correct: New session == new TCP stream
    • Wrong: Query cancellation request == new TCP stream

    View Slide

  31. ©Microsoft Corporation
    Azure
    The cancellation problem
    • Query cancellations end up at the wrong server most of the time
    • Workaround: Trigger cancellations multiple time at the client side
    • Solutions: None exist so far

    View Slide

  32. ©Microsoft Corporation
    Azure
    Idea: Make Postgres servers know about each other
    Client
    Load balancer

    View Slide

  33. ©Microsoft Corporation
    Azure
    Idea: Make Postgres servers know about each other
    Client
    Load balancer

    View Slide

  34. ©Microsoft Corporation
    Azure
    Idea: Make Postgres servers know about each other
    Client
    Load balancer

    View Slide

  35. ©Microsoft Corporation
    Azure
    Idea: Make Postgres servers know about each other
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123

    View Slide

  36. ©Microsoft Corporation
    Azure
    Idea: Make Postgres servers know about each other
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123

    View Slide

  37. ©Microsoft Corporation
    Azure
    Idea: Make Postgres servers know about each other
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123
    I don’t know SECRET-TOKEN-123
    This was probably meant for another
    server. Let me ask Server A

    View Slide

  38. ©Microsoft Corporation
    Azure
    Idea: Make Postgres servers know about each other
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123

    View Slide

  39. ©Microsoft Corporation
    Azure
    Idea: Make Postgres servers know about each other
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123

    View Slide

  40. ©Microsoft Corporation
    Azure
    How to make this idea work?
    • Postgres does not support this out of the box
    • What can we do to make it work?

    View Slide

  41. ©Microsoft Corporation
    Azure
    Option 1: Postgres extension
    • No extension hooks exist for cancellation related code 
    • Getting hooks (or built-in functionality) in Postgres will take a long time and won’t
    work on old Postgres versions
    • Need to build cancellation forwarding support

    View Slide

  42. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    • PgBouncer can run in front of any Postgres server version
    • PgBouncer already has cancellation forwarding code to forward cancellations to
    Postgres
    • This code only needs to be modified to send to other PgBouncer servers
    • PR for this can be found here:
    https://github.com/pgbouncer/pgbouncer/pull/666

    View Slide

  43. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    Client
    Load balancer
    PgBouncer B
    PgBouncer A
    Postgres A Postgres B

    View Slide

  44. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    Client
    Load balancer
    PgBouncer B
    PgBouncer A
    Postgres A Postgres B

    View Slide

  45. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    Client
    Load balancer
    PgBouncer B
    PgBouncer A
    Postgres A Postgres B

    View Slide

  46. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    Client
    Load balancer
    PgBouncer B
    PgBouncer A
    Postgres A Postgres B

    View Slide

  47. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    PgBouncer B
    PgBouncer A
    Postgres A Postgres B

    View Slide

  48. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123
    PgBouncer B
    PgBouncer A
    Postgres A Postgres B

    View Slide

  49. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123
    PgBouncer B
    PgBouncer A
    Postgres A Postgres B
    I don’t know SECRET-TOKEN-123
    This was probably meant for another
    server. Let me ask Server A

    View Slide

  50. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123
    PgBouncer B
    PgBouncer A
    Postgres A Postgres B

    View Slide

  51. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123
    PgBouncer B
    PgBouncer A
    Postgres A Postgres B

    View Slide

  52. ©Microsoft Corporation
    Azure
    Option 2: Modify PgBouncer
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    CANCEL SECRET-TOKEN-123
    PgBouncer B
    PgBouncer A
    Postgres A Postgres B

    View Slide

  53. ©Microsoft Corporation
    Azure
    Making PgBouncer performant
    • Single-threaded design becomes a bottleneck on high throughput servers
    • PgBouncer supports multi-process by using so_reuseport=1
    • The Linux kernel then load-balances TCP streams across all processes
    • So, what we end up with is a multi-layered load balancer

    View Slide

  54. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  55. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  56. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  57. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  58. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  59. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  60. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  61. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  62. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  63. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  64. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  65. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  66. ©Microsoft Corporation
    Azure
    What we end up with
    Client
    Load balancer
    CANCEL SECRET-TOKEN-123
    PgBouncer B1
    PgBouncer A1
    Postgres A Postgres B
    PgBouncer A2
    Linux kernel
    PgBouncer B2
    Linux kernel

    View Slide

  67. ©Microsoft Corporation
    Azure
    The final problem
    • To which one should PgBouncer forward it?
    • Sending the cancellation to all servers is quite heavy

    View Slide

  68. ©Microsoft Corporation
    Azure
    The final solution
    • Encode an identifier in the cancellation token: e.g. SECRET-TOKEN-123-A1

    View Slide

  69. © Copyright Microsoft Corporation. All rights reserved.
    Any questions?

    View Slide