Observability and the Glorious Future

Slide 1

Slide 1 text

V6-21 Charity Majors (slides by Liz Fong-Jones) CTO, Honeycomb @mipsytipsy at Infrastructure & Ops Superstream: Observability Observability And the Glorious Future w/ illustrations by @emilywithcurls! “You may experience a bit of cognitive dissonance during this talk, since you are probably familiar with liz’s slide style, even handedness, and general diplomatic approach. I have tried to play liz on TV, it didn’t fool anybody. So anytime you see an annoying little pony pop up mouthing off: don’t blame liz.”

Slide 89

Slide 89 text

V6-21 © 2021 Hound Technology, Inc. All Rights Reserved. Our month of Kafka pain Read more: go.hny.co/kafka-lessons Longtime Confluent Kafka users First to use Kafka on Graviton2 at scale Changed multiple variables at once ● move to tiered storage ● i3en → c6gn ● AWS Nitro between shepherd and retriever sits kafka it allows us to decouple those two services and replay event streams We were having scalability issues with our kafka and needed to improve the reliability of them by consolidating. Instead of having 30 kafka nodes with very very large SSDs, we realized that because we are only replaying the most recent hour or two of data (unless something goes catastrophically wrong) on local ssd. Not only that, but there were out of these 30 individual kafka brokers, if any one of them went bad, you would be be in the middle of reshuffling nodes, and then if you lost another one it would just be siting idle because you can’t do a kafka rebalance while another rebalance is in process. So we tried tiered storage which would let us shrink from 30 to 6 kafka nodes. And the disks on those kafka brokers might be a little larger, but not 5x larger. So instead we’re sending that extra data off to aws s3. Then liz, loving arm 64 so much, was like why are we even using these monolithic nodes and local disks, isn’t ebs good enough? Can’t we use the highest compute power nodes and the highest performance disk perf. So we are now doing three changes at the same time. we were actually testing Kafka on Graviton2 before even Confluent did probably the first to use it for production workloads changed too many variables at once wanted to move to tiered storage to reduce the number of instances but also tried the arch switch from i3en to c6gn+EBS at the same time we also introduced AWS Nitro (hypervisor) that was a mistake we published a blog post on this experience as well as a full incident report I highly recommend that you go read it to better understand the decisions we made and what we learned

Slide 1

Slide 1 text

Slide 2

Slide 2 text

Slide 3

Slide 3 text

Slide 4

Slide 4 text

Slide 5

Slide 5 text

Slide 6

Slide 6 text

Slide 7

Slide 7 text

Slide 8

Slide 8 text

Slide 9

Slide 9 text

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Slide 12

Slide 12 text

Slide 13

Slide 13 text

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Slide 16

Slide 16 text

Slide 17

Slide 17 text

Slide 18

Slide 18 text

Slide 19

Slide 19 text

Slide 20

Slide 20 text

Slide 21

Slide 21 text

Slide 22

Slide 22 text

Slide 23

Slide 23 text

Slide 24

Slide 24 text

Slide 25

Slide 25 text

Slide 26

Slide 26 text

Slide 27

Slide 27 text

Slide 28

Slide 28 text

Slide 29

Slide 29 text

Slide 30

Slide 30 text

Slide 31

Slide 31 text

Slide 32

Slide 32 text

Slide 33

Slide 33 text

Slide 34

Slide 34 text

Slide 35

Slide 35 text

Slide 36

Slide 36 text

Slide 37

Slide 37 text

Slide 38

Slide 38 text

Slide 39

Slide 39 text

Slide 40

Slide 40 text