Beyond Shuffling covers a number of important topics for writing scalable Apache Spark programs, from RDD reuse to considerations for working with key/value data, why avoiding groupByKey is important, and more. We end with a preview of codegen improvements coming to Spark ML/MLlib (still a work in progress).
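
As a rough illustration of the groupByKey point, here is a minimal plain-Python sketch (no Spark required). It counts how many records would cross the shuffle boundary when every record is shuffled before aggregation, as with groupByKey, versus when values are first combined within each partition, as reduceByKey does. The partition layout and the helper names `shuffle_group_by_key` and `shuffle_reduce_by_key` are hypothetical and only model the idea; they are not Spark APIs.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (word, count) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]


def shuffle_group_by_key(parts):
    """Model groupByKey: every record is shuffled, then summed per key."""
    shuffled = [pair for part in parts for pair in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_reduce_by_key(parts):
    """Model reduceByKey: combine inside each partition, then shuffle."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        # Only one record per key per partition crosses the shuffle.
        shuffled.extend(local.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


grouped_totals, grouped_records = shuffle_group_by_key(partitions)
reduced_totals, reduced_records = shuffle_reduce_by_key(partitions)

print("groupByKey-style shuffle:", grouped_records, "records")
print("reduceByKey-style shuffle:", reduced_records, "records")
```

Both approaches produce the same per-key totals, but the per-partition combine sends fewer records over the network. On real data with many repeated keys per partition, the gap is typically far larger than in this toy input.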