The public is beginning to recognize the effects of ML-based decision-making, and this growing awareness is only one of several reasons to pay attention to non-functional characteristics such as fairness and data protection. How can we ensure that ML-based decisions are made "fairly" and without algorithmic bias? At the same time, testing ML-based software remains an open field without established best practices. What can we do to meet these challenges? And what exactly makes testing ML in running systems so complicated?