Failsafe Patroni 3.0

What?! Patroni is the tool for implementing PostgreSQL high availability and automatic failover; isn't it already failsafe on its own?

If you are an experienced Patroni user, you know that it relies on a DCS (Distributed Configuration Store) to keep PostgreSQL cluster information consistent, ensuring that there is only one leader at a time. And of course, you also know that the primary is demoted if Patroni can't update the leader lock while the DCS (Etcd, Consul, ZooKeeper, or the Kubernetes API) is inaccessible or experiencing temporary problems, which can be very frustrating.

In this talk we will introduce a new Patroni feature, DCS failsafe mode, which aims to keep the primary running in case of a DCS failure. We will reveal the ideas behind it, share important implementation details, do a live demo, and give guidance on whether the feature should be used in a given environment or whether it is better to refrain from it.

Alexander Kukushkin

February 06, 2023

Transcript

  1. Failsafe Patroni 3.0
    Prague PostgreSQL Developer Day
    Presented by Alexander Kukushkin & Polina Bungina
    2023 • 02 • 01

  2. About us
    Alexander Kukushkin
    • Principal Software Engineer @Microsoft
    • The Patroni guy
    • [email protected]
    • Twitter: @cyberdemn
    Polina Bungina
    • Software Engineer @ZalandoTech
    • [email protected]
    • Twitter: @hugh_capet

  3. Agenda
    Introduction to Patroni
    Observer problem
    Demo 1
    DCS failsafe feature
    Demo 2
    Conclusion

  4. Do we need it at all?
    ● Service-Level Agreement (SLA)
    ● Recovery Point Objective (RPO)
    ● Recovery Time Objective (RTO)

  5. Architecture overview
    ● Cluster state is stored in the Distributed Configuration Store (DCS)
    ○ ZooKeeper
    ○ Etcd
    ○ Consul
    ○ Kubernetes control-plane
    ● Session/TTL to expire data (i.e. the leader key)
    ● Atomic CAS operations
    ● Watches for important keys
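
    A minimal sketch of this layout, using the etcd v2 HTTP API with the
    requests library (the endpoint and the scope name "demo" are assumptions
    for illustration, not Patroni's actual code):

    import requests

    BASE = "http://127.0.0.1:2379/v2/keys/service/demo"

    # All cluster state lives under a single prefix in the DCS.
    r = requests.get(BASE, params={"recursive": "true"})
    for node in r.json()["node"].get("nodes", []):
        # e.g. /service/demo/leader (carries a TTL),
        # /service/demo/members/..., /service/demo/config
        print(node["key"], "ttl =", node.get("ttl"))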

  6. Leader race
    Both nodes try to create the leader key with an atomic CAS; the DCS
    guarantees that only one of them can succeed:
    A: CREATE "/leader", "A", ttl=30, prevExists=False → Success → promote
    B: CREATE "/leader", "B", ttl=30, prevExists=False → Fail
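
    To make the race concrete, here is a minimal sketch against the etcd v2
    HTTP API (the endpoint and helper name are assumptions): an atomic
    "create if not exists" guarantees that exactly one contender wins.

    import requests

    LEADER_KEY = "http://127.0.0.1:2379/v2/keys/service/demo/leader"

    def try_acquire_leader(my_name: str, ttl: int = 30) -> bool:
        # prevExist=false makes the PUT fail if the key already exists (CAS).
        r = requests.put(LEADER_KEY,
                         params={"prevExist": "false"},
                         data={"value": my_name, "ttl": ttl})
        return r.status_code == 201  # 201 Created: we won; 412: somebody else did

    if try_acquire_leader("A"):
        print("won the race: promote")
    else:
        print("lost the race: keep following")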

  7. Normal operational mode
    The primary A keeps the leader key alive with an atomic CAS update,
    while the replica B watches the key:
    A: UPDATE "/leader", "A", ttl=30, prevValue="A" → Success
    B: WATCH("/leader")
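
    A matching sketch of the steady state (same assumed etcd v2 endpoint):
    the primary refreshes the key only if it still owns it, and the replica
    blocks on a watch so that it reacts to changes immediately.

    import requests

    LEADER_KEY = "http://127.0.0.1:2379/v2/keys/service/demo/leader"

    def renew_leader(my_name: str, ttl: int = 30) -> bool:
        # prevValue ensures we only refresh the TTL if the key is still ours.
        r = requests.put(LEADER_KEY,
                         params={"prevValue": my_name},
                         data={"value": my_name, "ttl": ttl})
        return r.status_code == 200

    def watch_leader() -> dict:
        # wait=true blocks until the key is updated, deleted, or expires.
        return requests.get(LEADER_KEY, params={"wait": "true"}).json()

    # Primary: renew_leader("A") on every HA-loop cycle; demote if it fails.
    # Replica: watch_leader() returns e.g. {"action": "expire", ...} and
    # triggers a new leader race.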

  8. Normal operational mode
    If the leader key is not renewed before its TTL runs out, it expires
    and the DCS notifies the watchers:
    DCS → B (replica): NOTIFY("/leader", expired=True)

  9. Normal operational mode
    The replica B takes over the expired leader key and promotes:
    B: CREATE "/leader", "B", ttl=30, prevExists=False → Success → promote

  10. DCS can't be accessed
    The primary A fails to renew the leader key:
    A: UPDATE "/leader", "A", ttl=30, prevValue="A" → Fail

  11. DCS can't be accessed
    If only the replica has lost its connection to the DCS, the primary's
    update still succeeds:
    A: UPDATE "/leader", "A", ttl=30, prevValue="A" → Success

  12. Why did the update fail?
    ● DCS is down?
    ● Network issues?

  13. Network partition
    The primary A is cut off from the DCS and cannot renew the leader key:
    A: UPDATE "/leader", "A", ttl=30, prevValue="A" → Fail

  14. Leader key expired
    A's updates keep failing, the leader key expires, and B takes over:
    A: UPDATE "/leader", "A", ttl=30, prevValue="A" → Fail
    B: CREATE "/leader", "B", ttl=30, prevExists=False → Success → promote
    A is still running as primary, so now there are two primaries at once.

  15. So, to be on the safe side…
    When the update fails, the primary demotes itself:
    A: UPDATE "/leader", "A", ttl=30, prevValue="A" → Fail → demote

  16. So, to be on the safe side…
    B acquires the expired leader key and promotes, while A continues as a
    replica:
    B: CREATE "/leader", "B", ttl=30, prevExists=False → Success → promote

  17. Still not perfect
    If the DCS itself is down, nobody can take over the leader key:
    B: CREATE "/leader", "B", ttl=30, prevExists=False → Fail
    Both A and B end up as replicas: the cluster is left without a primary.

  19. DCS down
    ● Etcd, ZooKeeper: very unlikely (if configured correctly)
    ● Consul: the local agent is a SPoF!
    ● Kubernetes control-plane: a typical SLA for managed services is
      99.95%, i.e. 0.05% × 8760 h ≈ 4h22m of allowed downtime per year

  20. What if…
    Instead of demoting right away, the primary A asks the standby B
    directly: "Do you see DCS?" The answer is "NO". If nobody sees the DCS,
    it looks like a DCS outage rather than a network partition, so the
    primary could keep running.

  21. Split-brain!
    A asks B: "Do you see DCS?" The answer is "NO". But a node C that A
    knows nothing about is already running as primary: if A keeps running
    too, we end up with a split-brain.

  22. Idea
    ● Continue to run as primary only if the primary can see ALL Patroni
      nodes
    ● Don't allow "unknown" nodes to become primary!

  23. "Unknown" node?
    ● Patroni clusters are mostly "static", but nodes can join and leave
    ● If the topology changes, the list of Patroni node names is written
      to DCS
    ● Nodes outside of this list are "unknown" and not allowed to become
      primary

  24. DCS failsafe mode
    The /failsafe key in DCS holds the member list: node1, node2, …, nodeN
    1. A: UPDATE "/leader", "A", ttl=30 → Fail
    2. A: POST /failsafe to node1 … nodeN
    3. Every node caches the primary data for the ttl
    4. Every node answers 200 OK
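
    A minimal sketch of the primary's side of this exchange (the helper name
    and member URLs are assumptions; the real logic lives inside Patroni's
    HA loop): after a failed leader-key update, the primary may keep running
    only if EVERY member from the cached failsafe list acknowledges it.

    import requests

    def failsafe_is_reassuring(failsafe_members: dict, payload: dict) -> bool:
        """failsafe_members: {"node1": "http://10.0.0.1:8008/failsafe", ...}
        payload: data about the current primary (name, API URL, slots)."""
        for url in failsafe_members.values():
            try:
                r = requests.post(url, json=payload, timeout=2)
            except requests.RequestException:
                return False      # unreachable member: must demote
            if r.status_code != 200:
                return False      # member refused: must demote
        return True               # ALL members answered 200 OK: stay primary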

  25. Implementation details
    ● Introduce the /failsafe key: the list of members currently present in
      the cluster
    ○ Maintained by the leader
    ○ Its value is cached by Patroni (on all nodes)
    ● Introduce a POST /failsafe REST API endpoint
    ○ The payload contains information about the primary and the permanent
      logical slots
    ○ The primary checks the response code and demotes if it is not 200
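
    A minimal sketch of the replica's side of the POST /failsafe exchange
    (using the stdlib http.server purely for illustration; Patroni's real
    REST API is different): accept the primary's announcement, remember it
    for one ttl, and answer 200.

    import json, time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    TTL = 30
    cached_primary = {"data": None, "expires": 0.0}

    class FailsafeHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/failsafe":
                self.send_response(404); self.end_headers(); return
            length = int(self.headers["Content-Length"])
            cached_primary["data"] = json.loads(self.rfile.read(length))
            cached_primary["expires"] = time.time() + TTL
            self.send_response(200)  # tells the primary it may keep running
            self.end_headers()

    # While the cached data is fresh, this node keeps following the announced
    # primary and does not enter the leader race itself.
    HTTPServer(("127.0.0.1", 8008), FailsafeHandler).serve_forever()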

  26. Implementation details (continued)
    ● A replica disqualifies itself from the leader race if it is not
      listed in the /failsafe key in DCS
    ● The primary executes the failsafe check only against the nodes from
      the failsafe list
    ○ It continues as primary if ALL nodes are accessible
    ○ Otherwise it demotes
    ● Replicas call pg_replication_slot_advance() if necessary
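
    For reference, advancing a slot boils down to one SQL call; a minimal
    sketch with psycopg2 (the connection string, slot name, and LSN are made
    up; Patroni does this internally with the LSN from the failsafe payload):

    import psycopg2

    conn = psycopg2.connect("dbname=postgres user=postgres")
    conn.autocommit = True
    with conn.cursor() as cur:
        # Move the slot's confirmed position forward so it doesn't fall
        # behind while the DCS (and the normal feedback) is unavailable.
        cur.execute("SELECT pg_replication_slot_advance(%s, %s)",
                    ("my_logical_slot", "0/4050C40"))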

  27. How to enable failsafe mode
    $ patronictl edit-config
    ---
    +++
    @@ -4,3 +4,4 @@
     use_pg_rewind: true
     retry_timeout: 10
     ttl: 30
    +failsafe_mode: on
    Apply these changes? [y/N]: y
    Configuration changed

    $ etcdctl get /service/batman/failsafe
    {
      "postgresql0": "http://127.0.0.1:8008/patroni",
      "postgresql1": "http://127.0.0.1:8009/patroni"
    }

    $ curl http://127.0.0.1:8008/failsafe
    {
      "postgresql0": "http://127.0.0.1:8008/patroni",
      "postgresql1": "http://127.0.0.1:8009/patroni"
    }

  28. Monitoring
    $ curl -s http://127.0.0.1:8008/patroni | jq .
    {
      "state": "running",
      "postmaster_start_time": "2023-01-26 16:11:04.848424+00:00",
      "role": "master",
      "server_version": 150001,
      "xlog": {"location": 67419584},
      "timeline": 2,
      "replication": [
        {"usename": "replicator", "application_name": "postgresql1",
         "client_addr": "127.0.0.1", "state": "streaming",
         "sync_state": "async", "sync_priority": 0}],
      "cluster_unlocked": true,
      "failsafe_mode_is_active": true,
      "dcs_last_seen": 1674749503,
      "database_system_identifier": "7192993973708324892",
      "patroni": {"version": "3.0.0", "scope": "demo"}
    }
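
    The failsafe_mode_is_active and dcs_last_seen fields make failsafe-aware
    alerting straightforward; a minimal sketch of an external check (the
    endpoint and thresholds are assumptions):

    import time, requests

    status = requests.get("http://127.0.0.1:8008/patroni", timeout=2).json()

    if status.get("failsafe_mode_is_active"):
        print("WARNING: failsafe mode is active (DCS problems)")

    # dcs_last_seen is a unix timestamp of the last successful DCS contact.
    if time.time() - status["dcs_last_seen"] > 60:
        print("WARNING: no DCS contact for more than 60 seconds")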

  30. When not to use it
    ● When nodes can change their names after a "restart" (while keeping
      the old storage)
    ○ If ALL nodes are restarted at the same time, the cluster will not
      recover automatically
    Example:
    ● K8s deployments without a StatefulSet
    ○ Crunchy Postgres Operator (PGO)

  31. Thank you!
    Questions?
