Balthazar Rouberol showcases the tooling his Data Reliability Team built at Datadog to alleviate operational toil when running large Kafka clusters. He dives into the sources of toil and time consumption, the tools implemented to reduce that toil, and monitoring and general good practices.