
10 ways to deploy Apache Kafka® and have fun along the way


If you have ever been involved in deploying an Apache Kafka cluster,
I’m sure you have faced the question: how do I deploy it? Whether you
run everything in one data center, across multiple data centers, or in
the cloud, Apache Kafka gives you the flexibility to adapt to your
situation.

In this talk we’re going to review the different scenarios one might
face when installing a new cluster, from a single cluster to many
clusters and stretched clusters. For each of them we’re going to review
the pros, the cons and the guarantees we can expect.

By the end of this talk you will know how to deploy Apache Kafka and
get the most out of every situation, with an understanding of how the
cluster is going to behave.

Pere Urbón

June 05, 2019



Transcript

  1. 1
    Deploying Apache Kafka, a
    journey recap
    Pere Urbon-Bayes
    @purbon
    Technology Architect
    Confluent


  2. 2
    Topics for today
    1. Apache Kafka, the different components
    2. Deployment situations
       1. Single Data Center
       2. Multi Data Center
          1. Active - Passive
          2. Active - Active
       3. Stretched Cluster
          1. 3 DC
          2. 2 DC
          3. 2.5 DC
       4. The cloud, or someone else’s machines


  3. 3
    Apache Kafka
    internals report


  4. 4
    What is Kafka?


  5. 5
    Apache Kafka, a distributed system


  6. 6


  7. 7
    Understanding the process of a Request


  8. 8
    Deploying Apache Kafka
    (and Apache Zookeeper)


  9. 9
    Ways to deploy a fresh and shiny Kafka Cluster
    ● Manually, though you are probably not considering this.
    ○ Available as rpm, deb, zip and tar.gz
    ○ https://docs.confluent.io/current/installation/index.html
    ● Infrastructure as Code:
    ○ Ansible: https://github.com/confluentinc/cp-ansible
    ○ Puppet: https://forge.puppet.com/modules?utf-8=%E2%9C%93&page_size=25&sort=rank&q=confluent
    ○ Chef: https://supermarket.chef.io/cookbooks?utf8=%E2%9C%93&q=confluent&platforms%5B%5D= (a bit outdated)
    ○ Terraform:
    ■ https://github.com/Mongey/terraform-provider-kafka
    ■ https://github.com/astubbs/cp-cluster-multi-region-terraform
    ● Available on Docker Hub as well.


  10. 10
    1 Data Center


  11. 11
    1 Data Center
    ● Full deployment under the same location (data center).
    ● Good latency numbers, all the relevant actors are nearby.
    ● In case of data center problems, it is all or nothing.
    ○ But probably all your other apps are having problems as well.


  12. 12
    Single Cluster
    Deployment
    ● Zookeeper ensemble: a quorum needs floor(N/2) + 1 of N nodes
    ○ 3 nodes
    ○ 5 nodes
    ○ Do I need more? …
    ● N number of brokers
    ● The challenge of co-location
    ● Rack awareness
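
The ensemble sizes above follow from majority-quorum arithmetic. A minimal sketch, assuming a plain majority quorum (no weights or hierarchical groups); the function names are made up for illustration:

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with `ensemble_size` nodes."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one node")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Note that 4 nodes tolerate no more failures than 3, which is why odd ensemble sizes such as 3 and 5 are the usual choice.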


  13. 13
    1 Data center
    Thinking where to deploy your Apache Kafka Cluster?
    ○ Zookeeper is well known to be sensitive to latency. Please try to avoid deploying Apache Kafka
    on any type of SAN.
    ○ Long pauses, for example due to GC, might make Zookeeper think a broker is dead (while it is only
    paused).
    ■ With VMware and vMotion, you should test the latency impact while vMotion is active.
    ○ Apache Kafka does not need a lot of Java heap; it uses zero-copy.
    ○ If possible, use SSDs for your Zookeeper deployment.


  14. 14
    1 Data center
    Thinking where to deploy your Apache Kafka Cluster?
    ○ RAID or JBOD?
    ○ Using virtualization
    ■ Where are your VMs hosted?
    ■ Having noisy neighbors
    ○ Using SAN?


  15. 15
    1 Data center
    Have you added your monitoring?


  16. 16
    1 Data center (security)
    ● Using security with TLS
    ○ Zero-copy is no longer an option
    ○ Increased Java heap requirements, minimum 4 GB.
    ○ Impact on throughput (around 30%), though no longer with Java 11
    ● Handling certificate revocation (CRL or OCSP) has to go through the JVM
    ● Managing the JVM stores and others [KIP-226]


  17. 17
    1 Data center (but generally for everyone)
    Understanding the moving parts.
    ○ Topics have partitions, with
    ■ Leaders: each partition has one leader and many followers
    ■ replication.factor: how many copies of each partition are going to be created
    ■ min.insync.replicas: minimum number of replicas that are required to be in sync
    ○ acks: number of replicas that need to acknowledge receiving the message
    ○ Batching: the producer will batch a number of messages together to increase performance
    ○ Retries: if something goes wrong, messages will be retried up to a certain limit.
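
How acks and min.insync.replicas interact is easier to see in a small model. The following is a simplified sketch, not broker code: it assumes the partition has a live leader whenever the in-sync replica (ISR) set is non-empty, and `write_accepted` is a hypothetical helper name.

```python
def write_accepted(acks: str, isr_size: int, min_insync_replicas: int) -> bool:
    """Model whether a produce request to one partition is accepted.

    acks="0" or "1": only the leader matters, min.insync.replicas is not checked.
    acks="all" (alias "-1"): the ISR must hold at least min.insync.replicas members.
    """
    if isr_size < 1:
        return False  # no in-sync replica left, so no leader to write to
    if acks in ("0", "1"):
        return True
    if acks in ("all", "-1"):
        return isr_size >= min_insync_replicas
    raise ValueError(f"unsupported acks value: {acks!r}")


if __name__ == "__main__":
    # replication.factor=3, min.insync.replicas=2
    for isr in (3, 2, 1):
        print(f"ISR={isr}: acks=all -> {write_accepted('all', isr, 2)}, "
              f"acks=1 -> {write_accepted('1', isr, 2)}")
```

With replication.factor=3 and min.insync.replicas=2, one replica can fall behind without blocking acks=all producers; once a second one drops out, only acks=0 or acks=1 writes still go through.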


  18. 18
    Multi data center


  19. 19
    Active - Passive
    ● The active part is the main cluster, where your apps connect.
    ● There is a standby, or follower, cluster to which data is replicated.
    ● This configuration is good for:
    ○ Disaster recovery (natural failover)
    ○ You can leverage the follower cluster for offline workloads
    ● On the other hand, this adds challenges:
    ○ Maintenance and monitoring burden
    ○ HW cost


  20. 20


  21. 21
    Active - Active
    ● There are two, or more, clusters where “online” applications are writing and
    reading data.
    ● Both clusters are now more utilized.
    ● Replication now needs to be set up in both directions (add namespaces)
    ● In case of disaster recovery this mode adds extra failover
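
One way to read "add namespaces" is to prefix replicated topics with the name of the cluster they came from, so that bidirectional replication does not copy a topic back to its origin. A minimal sketch of that idea; the cluster names, the dot separator and both helper functions are assumptions for illustration, not the API of any replication tool:

```python
from typing import Dict, Iterable


def remote_topic_name(source_cluster: str, topic: str) -> str:
    """Name a replicated topic gets on the target cluster."""
    return f"{source_cluster}.{topic}"


def plan_replication(topics: Iterable[str], source_cluster: str,
                     target_cluster: str) -> Dict[str, str]:
    """Map each source topic to its name on the target cluster.

    Topics that already carry the target's prefix originated there,
    so they are skipped to avoid a replication loop.
    """
    plan = {}
    for topic in topics:
        if topic.startswith(f"{target_cluster}."):
            continue  # came from the target in the first place
        plan[topic] = remote_topic_name(source_cluster, topic)
    return plan


if __name__ == "__main__":
    dc2_topics = ["orders", "dc1.orders"]  # local topic + copy of dc1's topic
    print(plan_replication(dc2_topics, source_cluster="dc2", target_cluster="dc1"))
```

Consumers that need the global view then read both `orders` and the prefixed copy on their local cluster.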


  22. 22


  23. 23
    Stretching your cluster


  24. 24
    Stretching your cluster
    Very important:
    Do you know what you’re doing? Think again,
    lots of things can go wrong with this
    architecture.


  25. 25
    Stretching your cluster
    ● A stretched cluster can exist in different setups:
    ○ Over 3 datacenters.
    ■ With Brokers and Zookeepers in each location.
    ■ With Brokers in 2 locations and Zookeepers in 3 (two and a half locations).
    ○ Over 2 datacenters. What happens now with consensus?
    ● You might want to stretch your cluster to make internal Apache Kafka replication
    work for you.


  26. 26
    3 Data Centers
    ● The most natural way to stretch a cluster is over three datacenters
    ● Brokers (N) and Zookeepers (5) are distributed in all the locations
    ● Remember latency will be critical to the success of this deployment
    ● Replication factor and acks become very important for ensuring cluster health


  27. 27


  28. 28
    3 Data Centers
    ● Good things about this architecture
    ○ Easy to set up: it is a single cluster over different locations.
    ○ Transparent out-of-the-box failover (included in the Apache Kafka protocol / clients)
    ○ Can survive one full DC failure without downtime.
    ● But there are challenges as well
    ○ Latency could become a problem (like in many other situations)
    ○ Clients are not location-aware; they read from their partition leaders
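
Surviving a full DC failure depends on every partition keeping a replica in each location. The sketch below illustrates this with a deliberately simple round-robin placement; it is not Kafka's actual rack-aware assignment algorithm, and the broker ids and DC names are invented for the example.

```python
from typing import Dict, List


def place_replicas(brokers_by_dc: Dict[str, List[int]], partition: int,
                   replication_factor: int) -> List[int]:
    """Pick replicas by walking the DCs in turn, one broker per step."""
    dcs = sorted(brokers_by_dc)
    replicas = []
    for i in range(replication_factor):
        dc = dcs[(partition + i) % len(dcs)]
        brokers = brokers_by_dc[dc]
        # Move to the next broker of a DC each time the walk wraps around.
        replicas.append(brokers[(partition + i // len(dcs)) % len(brokers)])
    return replicas


def surviving_replicas(replicas: List[int], brokers_by_dc: Dict[str, List[int]],
                       failed_dc: str) -> List[int]:
    """Replicas still reachable after every broker in `failed_dc` is lost."""
    failed = set(brokers_by_dc[failed_dc])
    return [b for b in replicas if b not in failed]


if __name__ == "__main__":
    layout = {"dc1": [1, 2], "dc2": [3, 4], "dc3": [5, 6]}
    for p in range(4):
        replicas = place_replicas(layout, p, replication_factor=3)
        left = surviving_replicas(replicas, layout, failed_dc="dc2")
        print(f"partition {p}: replicas={replicas}, after losing dc2: {left}")
```

With replication factor 3 over three DCs, two replicas always remain after one DC is lost, so producers using acks=all with min.insync.replicas=2 can keep writing.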


  29. 29
    You can add a backup cluster in
    the 3 DC setup for recovery / backup
    purposes.


  30. 30
    2 Data Centers
    ● More common than having three data centers; not many orgs have 3 DCs.
    ● These DCs are usually nearby, with a good connection link, so latency is lower.
    ● But many questions arise:
    ○ How are we going to set up coordination / quorum with Zookeeper?
    ○ How many brokers should I have?
    ○ What happens if I lose one data center?
    ○ Is there anything the clients should be enforcing?


  31. 31


  32. 32
    What happens if:
    ● Zookeeper 3 in DC3 is
    unavailable (for example due to
    performance)
    How do we maintain:
    ● ISR list
    ● quorum / leader election?
    ● …
    We might try to bring an ensemble
    back, but data loss, duplication and
    divergence can easily happen.


  33. 33


  34. 34
    Two and a half Data Centers
    What if we could add a
    new DC under your desk?


  35. 35
    Two and a half Data Centers
    ● The most common scenario is an organisation with only two data centers
    available, but what if:
    ○ You could use a third one, perhaps with lower resilience?
    ○ Or the cloud?
    ● Are we going to have a more resilient deployment?


  36. 36


  37. 37
    Remember
    ● In a 2 DC deployment:
    ○ Use Zookeeper hierarchical quorums to achieve consistency.
    ○ You will have to choose between availability and consistency
    ● Use acks=all and min.insync.replicas > 50% of the replicas to ensure data is
    replicated across the different nodes in the data centers.
    ● Remember: rack awareness is only enforced during topic creation.
    ● If you can, avoid stretching your cluster; but if you must, use 3 data centers.
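
The "more than 50%" rule for the minimum in-sync replicas can be checked by brute force. A minimal sketch, assuming the replicas of a partition are split evenly over two DCs; broker ids, DC names and function names are illustrative only:

```python
from itertools import combinations
from typing import Dict


def majority_min_isr(replication_factor: int) -> int:
    """Smallest min.insync.replicas that is more than half of the replicas."""
    return replication_factor // 2 + 1


def every_isr_spans_both_dcs(replica_dc: Dict[int, str], min_isr: int) -> bool:
    """True if no in-sync set of `min_isr` replicas fits inside a single DC.

    Checking the smallest allowed sets is enough: any larger in-sync set
    contains one of them.
    """
    for isr in combinations(replica_dc, min_isr):
        if len({replica_dc[broker] for broker in isr}) < 2:
            return False
    return True


if __name__ == "__main__":
    replicas = {1: "dc1", 2: "dc1", 3: "dc2", 4: "dc2"}  # replication factor 4
    for min_isr in (2, majority_min_isr(len(replicas))):
        print(f"min.insync.replicas={min_isr}: "
              f"acked writes always in both DCs = "
              f"{every_isr_spans_both_dcs(replicas, min_isr)}")
```

The flip side is the availability trade-off: with 4 replicas and min.insync.replicas=3, losing one DC leaves only 2 replicas, so acks=all writes are rejected until the setting is lowered or the DC returns.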


  38. 38
    Because not everyone runs on-prem,
    what about the cloud?


  39. 39
    CLOUD THERE IS NOT,
    ANOTHER PERSON’S
    COMPUTER IT IS


  40. 40
    Deploying in the cloud
    ● A region is a collection of nearby data centers; cross-region clusters are
    discouraged.
    ● Stretch your cluster over 3 availability zones in your region to make it
    more resilient.
    ● Think about your storage: instance storage, volumes (EBS) or temporary storage?
    ○ Faster recovery time?
    ○ Do I need replication.factor anymore if I have shared volumes (EBS)?
    ○ Is temporary storage of any benefit?
    ● Every cloud has network usage limits; remember that
    replica.fetch.min.bytes can help here.


  41. 41
    Deploying in the cloud
    ● Better to benchmark to be sure; keep an eye on whether IO performance is as
    expected.
    ● If latency becomes a problem, you can try increasing the Zookeeper timeouts,
    but do so responsibly.
    ● Immutable infrastructure for your installations and upgrades.
    ● Monitoring is more important than ever:
    ○ Keep an eye on brokers that show a sudden increase in latency for produce or fetch requests.
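
What counts as a "sudden increase" has to be defined somewhere in your monitoring. One possible rule, sketched here under assumptions (the window size, the factor of 3 and the function name are arbitrary choices, and real samples would come from your metrics system), compares the latest latency sample against the median of the samples just before it:

```python
from statistics import median
from typing import Sequence


def has_latency_spike(samples_ms: Sequence[float], window: int = 5,
                      factor: float = 3.0) -> bool:
    """Flag the latest sample if it exceeds `factor` times the recent median.

    `samples_ms` is ordered oldest to newest; the baseline is the median of
    the `window` samples that precede the latest one.
    """
    if len(samples_ms) <= window:
        return False  # not enough history for a baseline
    baseline = median(samples_ms[-window - 1:-1])
    return baseline > 0 and samples_ms[-1] > factor * baseline


if __name__ == "__main__":
    produce_latency = {
        "broker-1": [10, 11, 9, 10, 12, 10, 14],
        "broker-2": [10, 11, 9, 10, 12, 10, 45],
    }
    for broker, samples in produce_latency.items():
        print(broker, "spike" if has_latency_spike(samples) else "ok")
```

Using the median rather than the mean keeps a single earlier outlier from inflating the baseline.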


  42. 42
    Deploying in the cloud
    ● Doing autoscaling? With or without Kubernetes?
    ○ How do you handle volume reassignment?
    ○ What about partition reassignment?
    ○ When do you trigger it?


  43. 43
    Thanks!
    Questions?
    Pere Urbon-Bayes
    @purbon
    Technology Architect
    Confluent
