Upgrading Log-Analytics Clusters to OpenSearch (Amitai Stern, LogzIO) | RTA Summit 2023

Here at Logz.io, the open-source observability and security company, we run Elasticsearch for over 1,300 companies in highly scalable multi-cloud deployments. After the license changes of 2021 we needed to migrate to an open-source platform, and OpenSearch was both the project we planned to contribute to and the engine we wanted to run in production.

Many compare upgrading from Elasticsearch to OpenSearch in production to changing the tires on a moving bus. Upgrading carries many risks, and when the cluster is in continuous production use, ingesting terabytes of data daily, those risks can seem overwhelming.

In this talk, we will cover multiple upgrade strategies, including their version requirements and their pros and cons. We will also cover a different option: the way we at Logz.io upgraded all our clusters to OpenSearch without significant extra cost while minimizing risk. Not only did we upgrade to OpenSearch, we also migrated our AWS workloads to Graviton2 instances.

StarTree

May 23, 2023

Transcript

  1. Upgrading log-analytics clusters to OpenSearch
    Amitai Stern (@amitaistern)
    Software Engineer and Telemetry Storage Team Lead at Logz.io

  2. [No text on slide]

  3. [Version timeline. Elasticsearch: 6.8, 7.0.0, 7.10, 7.11, 8.7. OpenSearch: 1.0.0, 2.6]

  4. [Version timeline, annotated: January 2021]

  5. [Version timeline, annotated: January 2021, July 2021]

  6. [Version timeline, annotated: Apache 2.0-licensed open source; Server Side Public License (SSPL)]

  7. [Version timeline, annotated: Supports both AMD64 and ARM64 architectures; Supports AMD64 architecture]

  8. [Version timeline, no annotations]

  9. Log Analytics Cluster Architecture
    [Diagram labels: Log-Engine, Application, Kibana, Query-Service, Other Microservices, search, ingest, Amazon S3 (cluster snapshots)]

  10. Log Analytics Cluster Architecture
    [Diagram labels: Log-Engine, Application, Kibana, Query-Service, Other Microservices, search, ingest, Amazon S3 (cluster snapshots)]

  11. Preparing for the upgrade
    Index versions
    Deprecated APIs
    Breaking changes
    Test env
    Cluster Settings thresholds
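The "Index versions" item on slide 11 can be checked mechanically before an upgrade. The sketch below is only an illustration: it assumes you have already collected, for each index, the version string of the release that created it, and that the target release can open indices only from a given major version upward. The function name, the input shape, and the index names are hypothetical, and the right `min_major` for a given target should be taken from the upgrade documentation rather than from this example.

```python
def indices_needing_reindex(created_versions, min_major):
    """Return the names of indices created by a major version below min_major.

    created_versions maps index name -> version string such as "6.8.23".
    This input shape is an assumption made for the sketch.
    """
    stale = []
    for name, version in sorted(created_versions.items()):
        major = int(version.split(".")[0])
        if major < min_major:
            stale.append(name)
    return stale


# Hypothetical inventory of indices and the versions that created them.
created = {
    "logs-2019": "5.6.16",
    "logs-2021": "6.8.23",
    "logs-2023": "7.10.2",
}
print(indices_needing_reindex(created, min_major=6))
```

Indices flagged this way would need to be reindexed (or aged out) before the new nodes can host them.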

  12. Common Upgrading Strategies

  13. Common Upgrading Strategies: Blue/Green
    [Diagram labels: Cluster, Cluster, read, write]

  14. Common Upgrading Strategies: Blue/Green
    [Diagram labels: Cluster, Cluster, read, write]

  15. Common Upgrading Strategies: In-Place
    [Diagram labels: Cluster; Data nodes, Coordinator nodes, Cluster manager nodes]

  16. Balancing Risk, Cost, and Speed

  17. The Drain Method
    [Diagram labels: Data nodes, 172.22.4.9]
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.exclude._ip": "172.22.4.9"
      }
    }

  18. The Drain Method
    [Diagram labels: Data nodes, 172.22.4.9]
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.exclude._ip": "172.22.4.9",
        "indices.recovery.max_bytes_per_sec": "150mb"
      }
    }

  19. The Drain Method
    [Diagram labels: Data nodes, 172.22.4.9, 172.33.14.1]
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.include._ip": "172.33.14.1,172.22.4.9"
      }
    }

  20. The Drain Method
    [Diagram labels: Data nodes]
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.include._ip": "172.33.14.1,172.22.4.9"
      }
    }
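The drain step on slides 17 and 18 can be sketched in code. This is a minimal illustration, not the tooling used at Logz.io: `drain_settings` builds the same `_cluster/settings` body shown on the slides, and `is_drained` checks rows of the shape returned by `GET _cat/shards?format=json` (which include an `ip` field) to see whether any shard is still on the node. Both function names and the sample rows are hypothetical, and no request is actually sent.

```python
import json


def drain_settings(ips, recovery_rate=None):
    """Build a _cluster/settings body that excludes the given node IPs from shard allocation."""
    persistent = {"cluster.routing.allocation.exclude._ip": ",".join(ips)}
    if recovery_rate is not None:
        # Optional recovery throttle, as on slide 18.
        persistent["indices.recovery.max_bytes_per_sec"] = recovery_rate
    return {"persistent": persistent}


def is_drained(shard_rows, ip):
    """Return True when no shard row is still located on the given IP."""
    return all(row.get("ip") != ip for row in shard_rows)


body = drain_settings(["172.22.4.9"], recovery_rate="150mb")
print(json.dumps(body, indent=2))

# Hypothetical sample of _cat/shards rows while the drain is still in progress.
rows = [
    {"index": "logs-1", "shard": "0", "state": "STARTED", "ip": "172.33.14.1"},
    {"index": "logs-1", "shard": "1", "state": "RELOCATING", "ip": "172.22.4.9"},
]
print(is_drained(rows, "172.22.4.9"))
```

Polling a check like `is_drained` before shutting a node down keeps the removal from racing the shard relocation.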

  21. The Drain Method: Upgrade Process Overview
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.include._ip": ""
      }
    }
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  22. The Drain Method: Upgrade Process Overview
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.include._ip": "",
        "cluster.routing.allocation.exclude._ip": ""
      }
    }
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  23. The Drain Method: Upgrade Process Overview
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.include._ip": "",
        "cluster.routing.allocation.exclude._ip": ""
      }
    }
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  24. The Drain Method: Upgrade Process Overview
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.include._ip": "",
        "cluster.routing.allocation.exclude._ip": ""
      }
    }
    "indices.recovery.max_bytes_per_sec": "300mb"
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  25. The Drain Method: Upgrade Process Overview
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.include._ip": "",
        "cluster.routing.allocation.exclude._ip": ""
      }
    }
    "indices.recovery.max_bytes_per_sec": "0mb"
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  26. The Drain Method: Upgrade Process Overview
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.include._ip": null,
        "cluster.routing.allocation.exclude._ip": null
      }
    }
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]
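Slide 26 ends the data-node phase by setting both allocation filters to `null`, which in the cluster settings API resets a setting to its default. A small sketch of producing that body from code follows; the function name is hypothetical, and the only point it illustrates is that Python's `None` serializes to JSON `null`.

```python
import json


def reset_allocation_filters():
    """Body that clears both IP-based allocation filters (None becomes JSON null)."""
    return {
        "persistent": {
            "cluster.routing.allocation.include._ip": None,
            "cluster.routing.allocation.exclude._ip": None,
        }
    }


print(json.dumps(reset_allocation_filters(), indent=2))
```

Sending `null` removes the persistent override entirely, so no stale IP filter is left behind to surprise the next operator.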

  27. The Drain Method: Upgrade Process Overview
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes, load balancer, DNS record]

  28. The Drain Method: Upgrade Process Overview
    - New LB
    - OpenSearch Coordinator Nodes
    [Diagram labels: Data nodes, load balancer, Coordinator nodes, Cluster manager nodes, load balancer, DNS record]

  29. The Drain Method: Upgrade Process Overview
    - New LB
    - OpenSearch Coordinator Nodes
    - Override DNS record
    [Diagram labels: DNS record, Data nodes, load balancer, Coordinator nodes, Cluster manager nodes, load balancer]

  30. The Drain Method: Upgrade Process Overview
    - New LB
    - OpenSearch Coordinator Nodes
    - Override DNS record
    - Remove old Coordinating Nodes
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes, load balancer]

  31. The Drain Method: Upgrade Process Overview
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  32. The Drain Method: Upgrade Process Overview
    - Add 3 more Cluster manager Nodes
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  33. The Drain Method: Upgrade Process Overview
    - Add 3 more Cluster manager Nodes
    - Remove the old ones one at a time (elected one last)
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  34. The Drain Method: Upgrade Process Overview
    - Add 3 more Cluster manager Nodes
    - Remove the old ones one at a time (elected one last)
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  35. The Drain Method: Upgrade Process Overview
    - Add 3 more Cluster manager Nodes
    - Remove the old ones one at a time (elected one last)
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  36. The Drain Method: Upgrade Process Overview
    - Add 3 more Cluster manager Nodes
    - Remove the old ones one at a time (elected one last)
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  37. The Drain Method: Upgrade Process Overview
    - Add 3 more Cluster manager Nodes
    - Remove the old ones one at a time (elected one last)
    - Await Cluster Manager Node reelection
    [Diagram labels: Data nodes, Coordinator nodes, ???, Cluster manager nodes]
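The removal rule on slides 33 to 37 ("one at a time, elected one last") is easy to get wrong by hand, so here is a minimal sketch of the ordering. The function and node names are hypothetical; in practice the currently elected node would be looked up from the cluster before each removal.

```python
def removal_order(old_managers, elected):
    """Order old cluster manager nodes for removal, keeping the elected one for last."""
    if elected not in old_managers:
        # The elected node is already one of the new nodes, so any order works.
        return list(old_managers)
    return [node for node in old_managers if node != elected] + [elected]


# Hypothetical node names; cm-old-2 is the currently elected cluster manager.
print(removal_order(["cm-old-1", "cm-old-2", "cm-old-3"], elected="cm-old-2"))
```

Leaving the elected node for last means only one re-election is triggered, which is the step slide 37 waits on.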

  38. The Drain Method: Upgrade Process Overview
    DONE :)
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  39. The Drain Method: Upgrade Process Overview
    [Diagram labels: Data nodes, Coordinator nodes, Cluster manager nodes]

  40. The Drain Method: Managing risk
    [Diagram labels: Log-Engine, Application, Kibana, Query-Service, search, ingest, Amazon S3 (cluster snapshots)]

  41. The Drain Method: Managing risk
    [Diagram labels: Backup cluster, Backup Log-Engine, Application, Kibana, Query-Service, search, Log-Engine, ingest, Amazon S3 (cluster snapshots)]

  42. The Drain Method: Managing risk
    [Diagram labels: Backup cluster, Application, Kibana, Query-Service, search, Log-Engine, ingest, Amazon S3 (cluster snapshots), Backup Log-Engine]

  43. The Drain Method: Managing risk
    [Diagram labels: Log-Engine, Application, Kibana, Query-Service, search, ingest, Amazon S3 (cluster snapshots)]

  44. The Drain Method: Managing risk
    [Diagram labels: Backup cluster, Log-Engine, ingest, Amazon S3 (cluster snapshots), Backup Log-Engine, Application, Kibana, Query-Service, search]

  45. The Drain Method: Managing risk
    [Diagram labels: Backup cluster, Application, Kibana, Query-Service, search, Log-Engine, ingest, Amazon S3 (cluster snapshots), Backup Log-Engine]

  46. The Drain Method: Managing risk
    [Diagram labels: Backup cluster, Backup Log-Engine, Application, Kibana, Query-Service, search, ingest, Amazon S3 (cluster snapshots)]

  47. The Drain Method: Managing risk
    [Diagram labels: Backup Log-Engine, Application, Kibana, Query-Service, search, ingest, Amazon S3 (cluster snapshots), Restore from Snapshot, Backup cluster]

  48. The Drain Method: Managing risk
    [Diagram labels: Backup cluster, Backup Log-Engine, Application, Kibana, Query-Service, search, ingest, Amazon S3 (cluster snapshots)]
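Slide 47 shows the backup cluster being filled by "Restore from Snapshot". The sketch below builds the request for the snapshot restore API (`POST _snapshot/<repository>/<snapshot>/_restore`) without sending it. The repository, snapshot, and index names are hypothetical, and the choice to skip global state is an assumption made for this example, not something the slides state.

```python
import json


def restore_request(repository, snapshot, indices):
    """Return (path, body) for a snapshot restore call; nothing is sent here."""
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        "indices": ",".join(indices),
        # Assumption for this sketch: restore index data only, not cluster-wide state.
        "include_global_state": False,
        "ignore_unavailable": True,
    }
    return path, body


# Hypothetical repository, snapshot, and index names.
path, body = restore_request("s3-snapshots", "nightly-2023-05-22", ["logs-2023.05.21"])
print("POST", path)
print(json.dumps(body, indent=2))
```

Restoring only the indices that are needed keeps the backup cluster small, which matters when it exists purely as a fallback for the duration of the upgrade.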

  49. Summary
    Blue/Green
      Pros: Fully revertable (instantly); Can replace hardware as well
      Cons: Slow upgrade (days/weeks); Complexity grows over time; Double the cluster cost for the duration
    In Place
      Pros: Fast (within a few hours); Cheap (0 extra nodes)
      Cons: No rolling back; No hardware change
    Drain
      Pros: Fully revertable (within hours); Rather fast (many hours); Can replace hardware as well; Cheaper than Blue/Green
      Cons: Costs more than In Place; Complex upgrade process; Complex rollback

  50. Upgrading log-analytics clusters to OpenSearch
    Q&A
    Amitai Stern (@amitaistern)
    Software Engineer and Telemetry Storage Team Lead at Logz.io