
10 ways to deploy Apache Kafka® and have fun along the way

If you have ever been involved in deploying an Apache Kafka cluster,
I’m sure you have faced the question: how do I deploy it? Whether you
run everything in one data center, across multiple data centers, or
in the cloud, Apache Kafka gives you the flexibility to adapt to your
situation.

In this talk we’re going to review the different scenarios you might
face when installing a new cluster: a single cluster, many clusters,
and stretched clusters. For each one we’ll review the pros, the cons,
and the guarantees we can expect.

By the end of this talk you will know how to deploy Apache Kafka in
each of these situations and get the most out of your deployment,
with a clear understanding of how the cluster is going to behave.

Pere Urbón

June 05, 2019

Transcript

  1. 1
    Deploying Apache Kafka, a
    journey recap
    Pere Urbon-Bayes
    @purbon
    Technology Architect
    Confluent

  2. 2
    Topics for today
    1. Apache Kafka, the different components
    2. Deployment situations
    1. Single Data Center
    2. Multi Data Center
    1. Active – Passive
    2. Active - Active
    3. Stretched Cluster
    1. 3 DC
    2. 2 DC
    3. 2.5 DC
    4. The cloud, or someone else’s machines

  3. 3
    Apache Kafka
    internals report

  4. 4
    What is Kafka?

  5. 5
    Apache Kafka, a distributed system

  6. 7
    Understanding the process of a Request

  7. 8
    Deploying Apache Kafka
    (and Apache Zookeeper)

  8. 9
    Ways to deploy a fresh and shiny Kafka Cluster
    ● Manually, you probably are not considering this.
    ○ Available as rpm, deb, zip and tar.gz
    ○ https://docs.confluent.io/current/installation/index.html
    ● Infrastructure as Code:
    ○ Ansible: https://github.com/confluentinc/cp-ansible
    ○ Puppet: https://forge.puppet.com/modules?utf-8=%E2%9C%93&page_size=25&sort=rank&q=confluent
    ○ Chef: https://supermarket.chef.io/cookbooks?utf8=%E2%9C%93&q=confluent&platforms%5B%5D= (a bit outdated)
    ○ Terraform:
    ■ https://github.com/Mongey/terraform-provider-kafka
    ■ https://github.com/astubbs/cp-cluster-multi-region-terraform
    ● Also available on Docker Hub (a quick connectivity check is sketched below).
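
    Whichever route you pick, once the brokers are up it helps to verify the cluster is reachable
    before pointing applications at it. A minimal sketch with the Java AdminClient; the bootstrap
    address is an assumption, adjust it to your environment:

        import java.util.Properties;
        import org.apache.kafka.clients.admin.AdminClient;
        import org.apache.kafka.clients.admin.AdminClientConfig;
        import org.apache.kafka.clients.admin.DescribeClusterResult;

        public class ClusterCheck {
            public static void main(String[] args) throws Exception {
                Properties props = new Properties();
                // Hypothetical bootstrap address; point it at your own brokers.
                props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092");

                try (AdminClient admin = AdminClient.create(props)) {
                    DescribeClusterResult cluster = admin.describeCluster();
                    // Print the cluster id and its current members.
                    System.out.println("Cluster id: " + cluster.clusterId().get());
                    cluster.nodes().get().forEach(node -> System.out.println("Broker: " + node));
                }
            }
        }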

  9. 10
    1 Data Center

  10. 11
    1 Data Center
    ● Full deployment in a single location (data center).
    ● Good latency numbers; all the relevant actors are nearby.
    ● In case of data center problems, it is all or nothing.
    ○ But probably all your other apps are having problems as well.

  11. 12
    Single Cluster
    Deployment
    ● A ZooKeeper ensemble needs a majority (floor(N/2) + 1 of N nodes) up, so it tolerates floor((N-1)/2) failures
    ○ 3 nodes (tolerates 1 failure)
    ○ 5 nodes (tolerates 2 failures)
    ○ Do I need more? …..
    ● N brokers
    ● The challenge of co-location
    ● Rack awareness

  12. 13
    1 Data center
    Thinking about where to deploy your Apache Kafka cluster?
    ○ ZooKeeper is well known to be sensitive to latency. Please try to avoid deploying Apache Kafka
    on any type of SAN.
    ○ Long pauses, for example due to GC, might make ZooKeeper think a broker is dead (when it is
    only paused).
    ■ If you run VMware with vMotion, test the latency impact while vMotion is active.
    ○ Apache Kafka does not need a lot of Java heap; it uses zero-copy.
    ○ If possible, use SSDs for your ZooKeeper deployment

  13. 14
    1 Data center
    Thinking about where to deploy your Apache Kafka cluster?
    ○ RAID or JBOD?
    ○ Using virtualization
    ■ Where are your VMs hosted?
    ■ Watch out for noisy neighbors
    ○ Using a SAN?

  14. 15
    1 Data center
    Have you added your monitoring?

  15. 16
    1 Data center (security)
    ● Using security with TLS (a client-side configuration sketch follows)
    ○ Zero-copy is no longer an option
    ○ Increased Java heap requirements, min. 4 GB
    ○ Throughput impact (around 30%), much reduced with Java 11
    ● Handling certificate revocation (CRL or OCSP) has to go through the JVM
    ● Managing the JVM key and trust stores and their updates [KIP-226]
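
    For reference, the client side of a TLS-enabled cluster is mostly configuration. A minimal
    producer sketch, assuming an SSL listener on port 9093 and locally provisioned JKS stores
    (paths, passwords and the topic name are placeholders):

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;
        import org.apache.kafka.common.serialization.StringSerializer;

        public class TlsProducer {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "broker-1:9093");  // assumed SSL listener
                props.put("security.protocol", "SSL");
                props.put("ssl.truststore.location", "/etc/kafka/secrets/client.truststore.jks");
                props.put("ssl.truststore.password", "changeit");  // placeholder
                // Only needed when brokers require client authentication (ssl.client.auth=required).
                props.put("ssl.keystore.location", "/etc/kafka/secrets/client.keystore.jks");
                props.put("ssl.keystore.password", "changeit");    // placeholder
                props.put("key.serializer", StringSerializer.class.getName());
                props.put("value.serializer", StringSerializer.class.getName());

                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    producer.send(new ProducerRecord<>("test-topic", "key", "hello over TLS"));
                }
            }
        }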

  16. 17
    1 Data center (but this applies to every deployment)
    Understanding the moving parts.
    ○ Topics have partitions, with:
    ■ Leaders: each partition has one leader and many followers
    ■ replication.factor: how many copies of each partition will be created
    ■ min.insync.replicas: the minimum number of replicas that are required to be in sync
    ○ acks: the number of replicas that need to acknowledge receiving a message
    ○ Batching: the producer will batch a number of messages together to increase performance
    ○ Retries: if something goes wrong, messages will be retried up to a certain limit (see the
    producer sketch below)
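
    A hedged producer sketch that puts these settings together, tuned towards durability; the
    bootstrap address, topic name and batch sizes are illustrative, not recommendations:

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerConfig;
        import org.apache.kafka.clients.producer.ProducerRecord;
        import org.apache.kafka.common.serialization.StringSerializer;

        public class DurableProducer {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // assumed address
                props.put(ProducerConfig.ACKS_CONFIG, "all");               // wait for the in-sync replicas
                props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
                props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);  // avoid duplicates on retry
                props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);     // up to 32 KB per batch
                props.put(ProducerConfig.LINGER_MS_CONFIG, 10);             // wait up to 10 ms to fill a batch
                props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
                props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    producer.send(new ProducerRecord<>("orders", "order-1", "payload"));
                }
            }
        }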

  17. 18
    Multi data center

  18. 19
    Active - Passive
    ● The active side is the main cluster where your applications connect.
    ● There is a standby, or follower, cluster where data is replicated to (the replication flow is
    sketched below).
    ● This configuration is good for:
    ○ Disaster recovery (natural failover)
    ○ You can leverage the follower cluster for offline workloads
    ● On the other hand, this adds challenges:
    ○ Maintenance and monitoring burden
    ○ Hardware cost
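
    Cross-cluster replication is normally handled by a dedicated tool (MirrorMaker, Confluent
    Replicator), but conceptually it boils down to consuming from the active cluster and producing
    to the passive one. A toy sketch to illustrate the data flow only; cluster addresses and the
    topic are assumptions, and this is not a substitute for a real replication tool:

        import java.time.Duration;
        import java.util.Collections;
        import java.util.Properties;
        import org.apache.kafka.clients.consumer.ConsumerRecord;
        import org.apache.kafka.clients.consumer.KafkaConsumer;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;

        public class ToyMirror {
            public static void main(String[] args) {
                Properties c = new Properties();
                c.put("bootstrap.servers", "active-dc:9092");   // assumed active cluster
                c.put("group.id", "toy-mirror");
                c.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
                c.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

                Properties p = new Properties();
                p.put("bootstrap.servers", "passive-dc:9092");  // assumed passive cluster
                p.put("acks", "all");
                p.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
                p.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

                try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(c);
                     KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(p)) {
                    consumer.subscribe(Collections.singletonList("orders"));
                    while (true) {
                        for (ConsumerRecord<byte[], byte[]> r : consumer.poll(Duration.ofMillis(500))) {
                            // Forward each record to the same topic on the passive cluster.
                            producer.send(new ProducerRecord<>(r.topic(), r.key(), r.value()));
                        }
                    }
                }
            }
        }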

  19. 21
    Active - Active
    ● There are two, or more, clusters where “online” applications are writing and
    reading data.
    ● Both clusters are now better utilized.
    ● Replication needs to be set up in both directions now (namespace the replicated topics, e.g.
    with a cluster prefix, to avoid loops).
    ● In case of disaster recovery this mode adds extra failover capacity.

  20. 23
    Stretching your cluster

  21. 24
    Stretching your cluster
    Very important:
    Do you know what you’re doing? Think again,
    lots of things can go wrong with this
    architecture.

  22. 25
    Stretching your cluster
    ● A stretched cluster can exist in different setups:
    ○ Over 3 data centers.
    ■ With brokers and ZooKeepers in each location.
    ■ With brokers in 2 locations and ZooKeepers in 3 (two and a half locations).
    ○ Over 2 data centers. What happens now with consensus?
    ● You might want to stretch your cluster to make Apache Kafka’s internal replication
    work for you.

  23. 26
    3 Data Centers
    ● The most natural way to stretch a cluster is over three data centers
    ● Brokers (N) and ZooKeepers (5) are distributed across all the locations
    ● Remember that latency will be critical to the success of this deployment
    ● Replication factor and acks become very important to ensure cluster health

  24. 28
    3 Data Centers
    ● Good things about this architecture
    ○ Easy to set up; it is a single cluster over different locations.
    ○ Transparent out-of-the-box failover (included in the Apache Kafka protocol / clients)
    ○ Can survive a full DC failure without downtime.
    ● But there are challenges as well
    ○ Latency could become a problem (like in many other situations)
    ○ Clients are not location-aware; they read from their partition leaders wherever those are

  25. 29
    You can add a backup cluster to
    the 3 DC setup for recovery / backup
    purposes.

  26. 30
    2 Data Centers
    ● More common than having three data centers; not many orgs have 3 DCs.
    ● These DCs are usually nearby, with a good connection link, so lower latency.
    ● But many questions arise:
    ○ How are we going to set up coordination / quorum with ZooKeeper?
    ○ How many brokers should I have?
    ○ What happens if I lose one data center?
    ○ Is there anything the clients should be enforcing?

  27. 32
    What happens if:
    ● Zookeeper 3 in DC3 is
    unavailable (for example due to
    performance)
    How do we maintain?
    ● ISR list
    ● quorum / leader election?
    ● …
    We might try to bring an ensemble
    back, but data loss, duplication and
    divergence can easily happen.

  28. 34
    Two and a half Data Centers
    What if we could add a
    new DC under your desk?

  29. 35
    Two and a half Data Centers
    ● The most common scenario is organisations with only two data centers
    available, but what if:
    ○ You could use a third one, maybe with lower resilience?
    ○ Or the cloud?
    ● Are we going to have a more resilient deployment?

  30. 37
    Remember
    ● In a 2 DC deployment:
    ○ Use ZooKeeper hierarchical quorums to achieve consistency.
    ○ You will have to choose between availability and consistency
    ● Use acks=all and min.insync.replicas > 50% of the replicas to ensure data is replicated
    across the nodes in both data centers.
    ● Remember: rack awareness is only enforced during topic creation.
    ● If you can, avoid stretching your cluster, but if you must, use 3 data centers (a topic-creation
    sketch follows).
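
    A hedged sketch of creating a topic that matches these recommendations; the topic name,
    partition count and bootstrap address are illustrative. Note that the (rack-aware) replica
    placement is decided here, at creation time:

        import java.util.Collections;
        import java.util.Properties;
        import org.apache.kafka.clients.admin.AdminClient;
        import org.apache.kafka.clients.admin.NewTopic;

        public class CreateDurableTopic {
            public static void main(String[] args) throws Exception {
                Properties props = new Properties();
                props.put("bootstrap.servers", "broker-1:9092");  // assumed address

                try (AdminClient admin = AdminClient.create(props)) {
                    // 6 partitions, replication factor 3; with min.insync.replicas=2 and acks=all,
                    // every write must reach at least 2 of the 3 replicas before it is acknowledged.
                    NewTopic topic = new NewTopic("orders", 6, (short) 3)
                            .configs(Collections.singletonMap("min.insync.replicas", "2"));
                    // Rack / data-center placement of the replicas is fixed at this point.
                    admin.createTopics(Collections.singletonList(topic)).all().get();
                }
            }
        }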

  31. 38
    Because not many have that on premises,
    what about the cloud?

  32. 39
    CLOUD THERE IS NOT,
    ANOTHER PERSON’S
    COMPUTER IT IS

  33. 40
    Deploying in the cloud
    ● A region is a collection of nearby data centers, cross region clusters are
    discouraged.
    ● Stretch your cluster over 3 availability zones in your region, making your cluster
    more resilient.
    ● Think about your storage: network volumes (EBS) or temporary instance storage?
    ○ Faster recovery time?
    ○ Do I still need replication.factor if I have network volumes (EBS)?
    ○ Is temporary storage of any benefit?
    ● Every cloud has network usage limits; remember that replica.fetch.min.bytes can be
    of help here.

  34. 41
    Deploying in the cloud
    ● Better to benchmark to be sure; keep an eye on whether the I/O performance is what you
    expect.
    ● If latency becomes a problem, you can try increasing the ZooKeeper timeouts, but do so
    responsibly.
    ● Immutable infrastructure for your installations and upgrades.
    ● Monitoring is more important than ever:
    ○ Keep an eye on brokers that show a sudden increase in latency for produce or fetch requests
    (a probe sketch follows).
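
    One way to watch produce latency per broker is Kafka’s own JMX metrics. A minimal probe
    sketch, assuming remote JMX is enabled on the broker (for example by starting it with
    JMX_PORT=9999); the hostname and port are placeholders:

        import javax.management.MBeanServerConnection;
        import javax.management.ObjectName;
        import javax.management.remote.JMXConnector;
        import javax.management.remote.JMXConnectorFactory;
        import javax.management.remote.JMXServiceURL;

        public class ProduceLatencyProbe {
            public static void main(String[] args) throws Exception {
                JMXServiceURL url =
                    new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi");
                try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                    MBeanServerConnection mbsc = connector.getMBeanServerConnection();
                    // Total time taken to serve produce requests on this broker, in milliseconds.
                    ObjectName produceTime = new ObjectName(
                        "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce");
                    System.out.println("mean=" + mbsc.getAttribute(produceTime, "Mean")
                        + " p99=" + mbsc.getAttribute(produceTime, "99thPercentile"));
                }
            }
        }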

  35. 42
    Deploying in the cloud
    ● Doing autoscaling? With or without Kubernetes?
    ○ How do you handle volume reassignment?
    ○ What about partition reassignment?
    ○ When do you trigger it?

  36. 43
    Thanks!
    Questions?
    Pere Urbon-Bayes
    @purbon
    Technology Architect
    Confluent
