DataEngConf 2017 - Beyond 50,000 Partitions: How Heroku Operates and Pushes the Limits of Kafka at Scale

Beyond 50,000 Partitions: How Heroku Operates and Pushes the Limits of Kafka at Scale
Jeff Chao, DataEngConf 2017

Core concepts & fundamentals of Kafka

Apache Kafka is a distributed messaging system

And consists of five fundamental components

Brokers, Topics and Partitions, Producers and Consumers

[Diagram: a Broker holds a Topic, which is split into Partitions p0, p1, p2]

[Diagram: partition p0 as an ordered sequence of messages m1, m2, ..., mn]

[Diagram: topics user_events, system_events, alerts, and control_messages, each made up of partitions holding messages m1, m2, ..., mn]

[Diagram: partitions holding messages m1, m2, ..., mn, labeled (low throughput) to (high throughput) and (oldest) to (newest)]

[Diagram: a Broker holds a Topic, which is split into Partitions p0, p1, p2]

[Diagram: Producers write to topic user_events (p0, p1, p2); Consumers read from it]

[Diagram: Producers write { "user_id": 1, "message": "foo" } to topic user_events (p0, p1, p2)]

[Diagram: Consumers of topic user_events. Consumer 1 and Consumer 2 track their own positions: p0, offset: 10; p1, offset: 5; p2, offset: 25]

Allows consumers to not only process at their own speeds

Consumers can also individually recover their progress after a failure occurs
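
The offset idea above can be sketched in a few lines. This is a toy in-memory model, not the Kafka client API: the `Partition` and `Consumer` classes and the `poll` method are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """An append-only sequence of messages, addressed by offset."""
    messages: list = field(default_factory=list)

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1  # offset of the new message


@dataclass
class Consumer:
    """Hypothetical consumer that tracks its own committed offset per partition."""
    committed: dict = field(default_factory=dict)

    def poll(self, name, partition, max_messages):
        start = self.committed.get(name, 0)
        batch = partition.messages[start:start + max_messages]
        self.committed[name] = start + len(batch)  # record progress
        return batch


p0 = Partition()
for i in range(10):
    p0.append({"user_id": i, "message": "foo"})

fast, slow = Consumer(), Consumer()
fast.poll("p0", p0, max_messages=10)  # reads everything available
slow.poll("p0", p0, max_messages=3)   # reads only the first three

# A restarted consumer resumes from the committed offset, not from zero.
restarted = Consumer(committed=dict(slow.committed))
resumed = restarted.poll("p0", p0, max_messages=2)
```

Because each consumer owns its position, a slow consumer never holds back a fast one, and recovery only requires remembering one integer per partition.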

Operating Kafka at scale: many other moving pieces

Replication: How does Kafka achieve HA?

Quota management and security: How do we prevent abuse from misbehaving users?

OS, JVM, Kafka tuning: heap, off-heap, I/O threads, network request threads, controllers, leaders, logs, log segments

Managing Zookeeper: a whole other talk

Underlying physical or virtual infrastructure

Networking: VPC, SSL
Monitoring & Metrics: JMX, RMI, Jolokia
Disk: automatic volume growth, corruption & replacement
Logging

Version upgrades: How do you upgrade to new versions? What do you do when faced with breaking changes?

Maintenance: How do you detect underlying server or disk corruption and automate a replacement strategy?

[Diagram: Fully-managed Apache Kafka, showing producers, consumers, broker, topic, and partition]

Hundreds of clusters, thousands of partitions per broker

Individual clusters: 150 MB/s sustained produce and consume traffic, < 5 ms latency

Operate clusters in data centers around the world

All brokers and Zookeeper nodes striped across availability zones

Rack-aware partition placement strategy across availability zones

Interested in pushing the limits of Kafka: get ahead of challenges that might come up

So our users wouldn’t have to face them in production

Built internal tooling and infrastructure

Variety of challenges operating Kafka at scale

Observed behavior 1: Brokers were crashing and being restarted
Observed behavior 2: Users were exceeding their produce quotas
Observed behavior 3: Network threads getting NPEs

Major key alert: Quotas

At Heroku, we leverage Kafka’s quota implementation

Kafka favors traffic shaping over traffic policing

Motivation: avoid dropping messages

Enforced on either Producers or Consumers: can throttle on ingest or consumption

[Diagram: Producers send messages ({ "id": 1, "msg": "..." }, ...) with quota = 5 MB/s and write throughput = 10 MB/s. User's Perspective panel: throughput = 10 MB/s, messages marked (delayed)]

Requires enough space on the broker for holding all of these queued requests
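
To make shaping (delay) versus policing (drop) concrete, here is a minimal sketch. It is a simplified model and not Kafka's exact computation: the function `throttle_delay_ms`, the class `ThrottledResponses`, and the one-second window are assumptions made for illustration.

```python
import heapq


def throttle_delay_ms(observed_rate, quota, window_ms):
    """Delay that brings the average rate over the window back down to the quota."""
    if observed_rate <= quota:
        return 0.0
    return (observed_rate - quota) / quota * window_ms


class ThrottledResponses:
    """Holds responses until their release time instead of dropping them."""

    def __init__(self):
        self._heap = []

    def add(self, release_at_ms, response):
        heapq.heappush(self._heap, (release_at_ms, response))

    def release_due(self, now_ms):
        due = []
        while self._heap and self._heap[0][0] <= now_ms:
            due.append(heapq.heappop(self._heap)[1])
        return due


# A producer writing at 10 MB/s against a 5 MB/s quota, measured over 1 second.
delay = throttle_delay_ms(observed_rate=10.0, quota=5.0, window_ms=1000)

queue = ThrottledResponses()
queue.add(release_at_ms=delay, response="produce-response-1")
early = queue.release_due(now_ms=500)   # nothing is due yet
late = queue.release_due(now_ms=1000)   # the response is released, not dropped
```

Everything sitting in the queue occupies broker memory until it is released, which is why the amount of state held per throttled request matters.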

Kafka does a great job in making sure that it stores only what’s necessary

Turns out... there was a bug

pri=ERROR t=kafka-network-thread-0-SSL-0 at=Processor Processor got uncaught exception.
java.lang.OutOfMemoryError: Java heap space

Observed behavior 1: Brokers were crashing and being restarted
Observed behavior 2: Users were exceeding their produce quotas
Observed behavior 3: Network threads getting NPEs

First thing: dealing with heap-based OOM problems

Produce Request Lifecycle: writing messages to Kafka

KafkaApis#handleProducerRequest
  replicaManager.appendRecords(
    produceRequest.timeout.toLong,
    produceRequest.acks,
    internalTopicsAllowed,
    authorizedRequestInfo,
    sendResponseCallback)
  produceRequest.clearPartitionRecords()

ReplicaManager#appendRecords
  // true if required acks = all, data to append, or partial success
  if (delayedProduceRequestRequired(requiredAcks, ...)) {
    ...
    // calls responseCallback eventually
    delayedProducePurgatory.tryCompleteElseWatch(...)
  } else {
    ...
    responseCallback(produceResponseStatus)
  }

ClientQuotaManager#recordAndMaybeThrottle
  try {
    // trigger the callback immediately if quota is not violated
    callback(0)
  } catch {
    case _: QuotaViolationException =>
      val throttleTimeMs = ... // Compute the delay
      clientSensors.throttleTimeSensor.record(throttleTimeMs)
      delayQueue.add(new ThrottledResponse(...))
  }

KafkaApis#handleProducerRequest
  replicaManager.appendRecords(
    produceRequest.timeout.toLong,
    produceRequest.acks,
    internalTopicsAllowed,
    authorizedRequestInfo,
    sendResponseCallback)
  produceRequest.clearPartitionRecords()

After this point: messages already extracted from the request

If held: unnecessarily taking up memory

Turns out... wasn’t entirely true

Heap-based OOM: not actually clearing out all references

Filed a JIRA

Blocked the release

Submitted a patch upstream

Landed in Kafka 0.10.2

[Progress: 1. Heap-Based OOM; 2. up next]

Observed behavior 1: Brokers restarted, not able to boot up
Observed behavior 2: Cascading failure of brokers
Observed behavior 3: Some brokers booted, but suddenly fail again

Services can fail: shouldn’t expect availability 100% of the time

Services hosted on AWS

Still odd that brokers were failing regularly and unable to restart

What is Kafka’s HA Model?
What happens when a broker goes down?
What happens when a broker comes up?

[Diagram: topic user_events with partitions [ p0, p1, p2 ]. One broker is p0-lead and serves its writes and reads; the other brokers are p1-lead and p2-lead. Each broker also holds (replicated) copies of the other partitions]

In general, Kafka tries to balance leadership among brokers

Brokers are generally lead for only a subset of all partitions

Example - load distribution: 5 brokers, 1 topic, 50 partitions

Example - load distribution: a broker will be leader for 10 partitions (on average, 50 partitions / 5 brokers)

Example - load distribution: process read and write requests for 10 out of 50 partitions for that given topic

Example - fault tolerance: replication factor = 3 is set per topic
2 additional copies of each partition will live on other brokers in the cluster
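
The arithmetic in these examples can be checked with a small sketch. It uses a plain round-robin placement, which is an assumption for illustration and not Kafka's actual assignment algorithm; `assign_replicas` is a hypothetical helper.

```python
from collections import Counter


def assign_replicas(num_brokers, num_partitions, replication_factor):
    """Round-robin placement; the first replica in each list is treated as the leader."""
    return {
        p: [(p + r) % num_brokers for r in range(replication_factor)]
        for p in range(num_partitions)
    }


assignment = assign_replicas(num_brokers=5, num_partitions=50, replication_factor=3)

# How many partitions each broker leads, and how many replicas it hosts in total.
leaders = Counter(replicas[0] for replicas in assignment.values())
hosted = Counter(broker for replicas in assignment.values() for broker in replicas)
```

With these inputs every broker leads 10 partitions but hosts 30 partition replicas, so the work a broker does on recovery is driven by the larger number.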

Replication is pull-based: followers run ReplicaFetcherThreads and continuously pull messages from their leaders

Each leader maintains In-Sync Replicas: leaders manage a set of In-Sync Replicas (ISR) that represent all followers that are caught up

Leaders will evict lagging replicas: if followers get stuck or lagged, their leader will remove them from the ISR

Important for when a broker fails over

On failover: only followers in the ISR may be elected
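
A rough sketch of ISR maintenance and election follows. It is a simplified model: the time-based lag check, the data layout, and the lowest-broker-id tie-break are all assumptions for illustration, and the function names are hypothetical.

```python
def in_sync_replicas(leader_end_offset, followers, now_ms, max_lag_ms):
    """followers maps broker id -> (fetched_offset, last_caught_up_ms)."""
    isr = set()
    for broker, (fetched_offset, last_caught_up_ms) in followers.items():
        caught_up = fetched_offset >= leader_end_offset
        # Keep a follower if it is caught up now or was caught up recently enough.
        if caught_up or now_ms - last_caught_up_ms <= max_lag_ms:
            isr.add(broker)
    return isr


def elect_leader(isr, live_brokers):
    """Pick a new leader only from replicas that are both in sync and alive."""
    candidates = sorted(isr & live_brokers)
    return candidates[0] if candidates else None


followers = {
    2: (100, 9_000),  # caught up to the leader's end offset
    3: (40, 1_000),   # stuck far behind for a long time
}
isr = in_sync_replicas(leader_end_offset=100, followers=followers,
                       now_ms=10_000, max_lag_ms=5_000)

new_leader = elect_leader(isr, live_brokers={2, 3})
no_leader = elect_leader(isr, live_brokers={3})
```

In the last call the only live broker is not in the ISR, so no leader is chosen, which corresponds to a partition going offline.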

Controller: 1 per cluster, KafkaController.scala. Manages states of partitions and replicas using state machines

Important for administrative tasks such as determining which brokers are leader for which partitions

PartitionStateMachine.scala
  NonExistentPartition -> NewPartition
  NewPartition -> OnlinePartition
  OnlinePartition, OfflinePartition -> OnlinePartition
  NewPartition, OnlinePartition, OfflinePartition -> OfflinePartition
  OfflinePartition -> NonExistentPartition
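
The transition table above can be expressed as a small validity check. This is an illustrative sketch in Python, not the Scala implementation; the `PartitionStates` class and its method names are hypothetical.

```python
# Allowed previous states for each target state, taken from the transitions listed above.
VALID_PREVIOUS_STATES = {
    "NewPartition": {"NonExistentPartition"},
    "OnlinePartition": {"NewPartition", "OnlinePartition", "OfflinePartition"},
    "OfflinePartition": {"NewPartition", "OnlinePartition", "OfflinePartition"},
    "NonExistentPartition": {"OfflinePartition"},
}


class PartitionStates:
    def __init__(self):
        self._state = {}

    def state_of(self, partition):
        return self._state.get(partition, "NonExistentPartition")

    def transition(self, partition, target):
        current = self.state_of(partition)
        if current not in VALID_PREVIOUS_STATES[target]:
            raise ValueError(f"{partition}: illegal transition {current} -> {target}")
        self._state[partition] = target


states = PartitionStates()
states.transition("user_events-0", "NewPartition")
states.transition("user_events-0", "OnlinePartition")
states.transition("user_events-0", "OfflinePartition")  # e.g. its leader's broker died
states.transition("user_events-0", "OnlinePartition")   # a new leader was elected
```

Every partition led by a failed broker has to pass through these transitions, so the amount of controller work grows with the number of partitions.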

When a broker goes down, generally many partitions will go Offline at once

Brokers are generally lead for a subset of all partitions

Two cases for brokers failing: Clean Shutdown, Unclean Shutdown

KafkaServer#shutdown (Clean Shutdown)
  controlledShutdown()
  brokerState.newState(BrokerShuttingDown)
  if (socketServer != null) socketServer.shutdown()
  if (requestHandlerPool != null) requestHandlerPool.shutdown()
  // ... Shutdown other managers and listeners ...
  info("shut down completed")

Unclean shutdown: brokers won’t be able to notify the controller

KafkaHealthcheck#register (Unclean Shutdown)
  // Creates Zookeeper ephemeral znode on broker start
  zkUtils.registerBrokerInZk(...)

Broker dies = Zookeeper session dies

When a Zookeeper session dies the ephemeral znode goes away

PartitionStateMachine#handleStateChange (Unclean Shutdown)
  try {
    targetState match {
      ...
      case OfflinePartition =>
        // pre: partition should be in New or Online state
        assertValidPreviousStates(...)
        partitionState.put(topicAndPartition, OfflinePartition)
        // post: partition has no alive leader
      ...
    }
  }

KafkaController#onBrokerFailure (Unclean Shutdown)
  val deadBrokersThatWereShuttingDown = ...
  val partitionsWithoutLeader = ...
  partitionStateMachine.handleStateChanges(partitionsWithoutLeader, ...)
  partitionStateMachine.triggerOnlinePartitionStateChange()
  if (partitionsWithoutLeader.isEmpty) {
    sendUpdateMetadataRequest(context.liveOrShuttingDownBrokerIds.toSeq)
  }

KafkaServer#startup (abridged) (Broker Starting)
  logManager.startup()
  metadataCache = new MetadataCache(config.brokerId)
  credentialProvider = new CredentialProvider(config.saslEnabledMechanisms)
  socketServer.startup()
  replicaManager.startup()
  kafkaController.startup()
  groupCoordinator.startup()
  apis = new KafkaApis(...)
  requestHandlerPool = new KafkaRequestHandlerPool(...)
  dynamicConfigManager.startup()

KafkaServer#startup (cont.) (Broker Starting)
  /* tell everyone we are alive */
  val listeners = config.advertisedListeners.map { endpoint => ... }
  kafkaHealthcheck.startup()
  checkpointBrokerId(config.brokerId)
  registerStats()
  brokerState.newState(RunningAsBroker)
  startupComplete.set(true)
  isStartingUp.set(false)
  AppInfoParser.registerAppInfo(jmxPrefix, config.brokerId.toString)
  info("started")

Major key takeaway: a lot of work for failovers and restarts

Work = f(number of partitions on a broker)

More partitions = more work

More work = more time required for brokers to recover

Thousands of partitions per broker: this could take a while

Unfortunately at the time... automation did not consider this

Automation would restart brokers just fine
Recovery could take longer than expected
Automation would time out the broker

Automation was a little too aggressive: Wow. Much restarts. Such remediation.

The Fix: introduce recovery state

The Fix: allow brokers enough time to recover

The Fix: monitor and alert on unusually long recovery

The Fix: on long recovery, page a human. Don’t attempt automatic remediation
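
One way to picture the fix is as a remediation policy with an explicit recovery state. The state names, the `next_action` function, and the threshold below are hypothetical; this is not Heroku's actual tooling.

```python
def next_action(broker_state, seconds_in_state, recovery_alert_after_s):
    """Decide what automation should do for a broker in the given state."""
    if broker_state == "running":
        return "none"
    if broker_state == "recovering":
        # Never restart a recovering broker; escalate to a person if it drags on.
        if seconds_in_state > recovery_alert_after_s:
            return "page_human"
        return "wait"
    return "restart"  # any other state, e.g. "down"


still_loading = next_action("recovering", seconds_in_state=600, recovery_alert_after_s=3600)
too_long = next_action("recovering", seconds_in_state=7200, recovery_alert_after_s=3600)
crashed = next_action("down", seconds_in_state=30, recovery_alert_after_s=3600)
```

The key design choice is that no branch of the recovering state leads back to a restart, so automation cannot interrupt a broker that is still loading its partitions.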

[Progress: 1. Heap-Based OOM; 2. Automation & Remediation; 3. up next]

Observed behavior 1: Brokers crashing at various stages on boot
Observed behavior 2: Long period of stability, sudden failure
Observed behavior 3: Under-replicated partitions, request timeouts

Automation was not interfering: What do we say about interfering with recovery? “Not today!”

Recall that brokers go into recovery on boot up: brokers load all of their partitions

What does it mean to load a partition?

Anatomy of a Partition (a Log)
  log segment 1: m1 m2 m3 m4 m5 (00:00 - 03:00 or 0 GB - 1 GB)
  log segment 2: m6 m7 m8 m9 m10 (03:00 - 06:00 or 1 GB - 2 GB)
  log segment 3: m11 m12 m13 m14 m15 (06:00 - 09:00 or 2 GB - 3 GB)
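
The segment layout above can be reproduced with a toy size-based roll. This sketch only models rolling by size; `roll_segments` and the byte sizes are assumptions for illustration, and real brokers also roll by time.

```python
def roll_segments(messages, max_segment_bytes):
    """messages is a list of (offset, size_bytes); returns a list of segments of offsets."""
    segments, current, current_bytes = [], [], 0
    for offset, size in messages:
        # Start a new segment when the next message would overflow the current one.
        if current and current_bytes + size > max_segment_bytes:
            segments.append(current)
            current, current_bytes = [], 0
        current.append(offset)
        current_bytes += size
    if current:
        segments.append(current)
    return segments


# Fifteen 200-byte messages with a 1000-byte segment limit.
segments = roll_segments([(i, 200) for i in range(1, 16)], max_segment_bytes=1000)
active_segment = segments[-1]  # only the last segment still receives writes
```

More traffic or longer retention means more rolled segments per partition, and each segment is backed by its own files on disk.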

In aggregate: many files on disk

Thousands of partitions per broker

That’s a lot of file descriptors!

Recall that on boot up, brokers need to read and completely load all log segments

Otherwise during normal operation, Kafka tries to use FDs only for active segments

While logs are being rolled: temporarily hold onto FDs for old segments

Looking at the service’s operational logs

pri=ERROR t=kafka-socket-acceptor-ListenerName at=Acceptor Error while accepting connection
java.io.IOException: Too many open files in system

$ cat /proc/sys/fs/file-nr
100_000 0 100_000
$ echo 'Need more file descriptors!'

The Fix
fs.file-max={system_file_descriptor_limit}  // sysctl.conf
{system_user} - nofile {file_descriptor_limit}  // limits.conf
$ echo 'Increase file descriptors and keep a buffer for the system.'

Rule of Thumb: ~1.5 file descriptors per log segment
Log segments consist of 3 actual files on disk: .log, .index, and .timeindex
However, in our testing, we found on average ~1.5 FDs per log segment
Kafka tries to hold FDs only for active segments

Rule of Thumb: Inputs (per broker)
  partitions_replicas = (topics * partitions * replication_factor) / brokers
  segments_per_partition_replicas = retention_time / log.roll.hours
  fd_buffer = tunable (e.g., 1, 1.5, 2)
  fd = partitions_replicas * segments_per_partition_replicas * fd_buffer
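
As a worked instance of the rule of thumb, the sketch below plugs in made-up cluster numbers. The input values are hypothetical, `partitions` is assumed to mean partitions per topic, and `estimate_fds` is an illustrative helper.

```python
import math


def estimate_fds(topics, partitions, replication_factor, brokers,
                 retention_hours, log_roll_hours, fd_buffer):
    """Estimated file descriptors needed per broker, following the rule of thumb."""
    partition_replicas = (topics * partitions * replication_factor) / brokers
    segments_per_partition_replica = retention_hours / log_roll_hours
    return math.ceil(partition_replicas * segments_per_partition_replica * fd_buffer)


# 100 topics x 50 partitions x 3 replicas over 5 brokers = 3000 partition replicas per broker.
# 168 hours of retention with a 24 hour roll = 7 segments per partition replica.
fds = estimate_fds(topics=100, partitions=50, replication_factor=3, brokers=5,
                   retention_hours=168, log_roll_hours=24, fd_buffer=1.5)
```

For these inputs the estimate is 3000 * 7 * 1.5 = 31,500 descriptors per broker, before leaving headroom for sockets and the rest of the system.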

Can the broker boot now? Not yet.

Observed behavior 1: Brokers crashing at various stages on boot
Observed behavior 2: Long period of stability, sudden failure
Observed behavior 3: Under-replicated partitions, request timeouts

In the broker’s service logs: 3 different errors

pri=FATAL t=ReplicaFetcherThread-0-6 at=ReplicaFetcherThread [ReplicaFetcherThread-0-6], Disk error while replicating data for salmon-84320.messages4-24
kafka.common.KafkaStorageException: I/O exception in append to log

pri=ERROR t=kafka-request-handler-6 at=logger Error on broker 1 while processing LeaderAndIsr request with correlationId 8 received from controller 1 epoch 244
java.io.IOException: Map failed
  at sun.nio.ch.FileChannelImpl.map
  ...
  at kafka.utils.CoreUtils$.inLock
  at kafka.log.AbstractIndex.resize
  at kafka.log.LogSegment.truncateTo

pri=FATAL t=main at=KafkaServerStartable Fatal error during KafkaServerStartable startup. Prepare to shutdown
java.io.IOException: Map failed
  at sun.nio.ch.FileChannelImpl.map
  ...
  at kafka.log.AbstractIndex.resize
  at kafka.log.AbstractIndex.trimToValidSize
  at kafka.log.LogSegment.recover
  at kafka.log.Log.recoverLog
  at kafka.log.Log.loadSegments

Another out-of-memory problem? Another memory leak?

When brokers crashed, memory was not even saturated

[GC pause (G1 Evacuation Pause) (young), 0.0167712 secs]
  ...
  [Eden: 420.0M(192.0M)->0.0B(998.0M) Survivors: 12.0M->12.0M Heap: 1809.5M(4096.0M)->1389.5M(4096.0M)]
  [Times: user=0.28 sys=0.00, real=0.03 secs]

$ htop Mem[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||4911/1604MB]

FileChannelImpl.c#Java_..._FileChannelImpl_map0
  mapAddress = mmap64(...);
  if (mapAddress == MAP_FAILED) {
    if (errno == ENOMEM) {
      JNU_ThrowOutOfMemoryError(env, "Map failed");
      return IOS_THROWN;
    }
    return handle(env, -1, "Map failed");
  }

Kafka internally leverages native memory

Not a heap problem

Not apparent from GC logs or heap dumps

Diagnose by correlating JMX metrics using various JMX tooling

Not sure if legit

Given that the issue was around native memory

Mapped byte buffers growing uncontrollably

$ cat /proc/$pid/maps | wc -l

That’s a lot of mmaps!

Why would this cause a broker to fail?

For native memory: plenty of memory, disk, and other resources

Actually, that was a lie

$ cat /proc/sys/vm/max_map_count
65535

Seems fine?

Remember that: many partitions per broker

Anatomy of a partition, or a log: many log segments

Each and every one of these log segments must be loaded completely when a broker boots up

Additional context, no funny business: 64-bit JVM

Not enough memory-mapped regions

Why does this matter?

Brokers survived restarts many times before

Not reproducible in staging environment

A subtle difference...

Log rolling

Need enough traffic or retention: enough to force logs to be rolled

More log segments

More memory-mapped files required on broker bootup

This is why memory looked fine during a sudden crash

This is why brokers would crash during recovery

The Fix
vm.max_map_count={vm_max_map_count}  // sysctl.conf
$ echo 'Make memory-mapped regions sufficiently large and keep a buffer.'
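
A back-of-the-envelope sizing for the mapped-region limit might look like the sketch below. It assumes two memory maps per log segment (one each for the .index and .timeindex files) and a hypothetical 1.5x buffer; both helpers and all input numbers are illustrative.

```python
import math

MAPS_PER_SEGMENT = 2  # assumption: one map each for the .index and .timeindex files


def estimate_max_map_count(partition_replicas, segments_per_replica, buffer=1.5):
    """Rough number of memory-mapped regions a broker needs when loading all segments."""
    return math.ceil(partition_replicas * segments_per_replica * MAPS_PER_SEGMENT * buffer)


def current_map_count(maps_text):
    """Count mappings in the text of /proc/<pid>/maps, which has one mapping per line."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


# 3000 partition replicas with 12 segments each (hypothetical numbers).
needed = estimate_max_map_count(partition_replicas=3000, segments_per_replica=12)
exceeds_limit = needed > 65535  # compare with the limit shown earlier
```

With these inputs the estimate is 108,000 regions, well above the limit, even though memory itself is far from exhausted.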

Brokers able to boot successfully again

A lot of file descriptors and mmaps

Ran with new configuration in staging: for completeness, under heavy load for a long period of time
Ruled out a potential leak scenario: no FD or mmap leaks

Everything was fine

Wrap up: 1. Heap-Based OOM; 2. Automation & Remediation; 3. Insufficient Resources ✓

Kafka & internals: supports many use cases, many things to configure, scalable
Heroku offers fully-managed Apache Kafka: integrated with the Heroku ecosystem
Lots to consider when operating at scale: an operational undertaking


Challenge 1: How quotas work
Found a release-blocking heap-based OOM bug around quota management and shipped a patch upstream that landed in 0.10.2

Challenge 2: How Kafka achieves HA
Automation aggressively remediating brokers on recovery

Challenge 3: How partitions are implemented
What kind of resources are needed
More partitions require more file descriptors and memory-mapped byte buffers

Thank you. Jeff Chao jchao [at] heroku.com
