Production-quality Mesos frameworks must be able to continue managing tasks despite unreliable networks and faulty computers. Mesos provides tools to help developers do fault-tolerant task management, but putting these tools together effectively remains something of a black art. This talk will offer practical guidance to current and prospective framework developers to help them understand how Mesos deals with failures and the tools it provides to enable fault tolerant frameworks. Mesos operators will also benefit from a discussion of exactly how Mesos behaves during network partitions and other failure scenarios.