Complex distributed systems are hard to operate and have complex failure modes. In this talk, we will discuss how to build confidence in large-scale distributed systems by introducing random but controlled failures into them in production, observing how services degrade, and working toward healing and recovering from failures automatically. We will also discuss patterns and techniques for designing highly available and resilient distributed systems.