This talk was given at SF Big Analytics in July 2017. It covers a subset of the challenges Heroku has encountered while operating Apache Kafka as a service in production over the last several years. For each challenge and incident, we describe the observed behavior, take a deep dive into the relevant Kafka internals, and then share key insights, takeaways, and practical fixes.