Speech synthesis is a technology that generates speech waveforms corresponding to input text, and it has been a subject of sustained research for many years. The fundamental framework of statistical speech synthesis is to generate speech for any given input text based on a large database of paired speech waveforms and corresponding text. This generation process is generally divided into multiple stages (text analysis, acoustic modeling, and waveform generation), each of which is modeled using statistical machine learning techniques, and more recently, deep learning methods. In this talk, I will provide an overview of the progress made in statistical approaches to speech synthesis over the past few decades, incorporating personal episodes and even some failure stories from my own experience. I will also discuss ongoing challenges related to speech quality, controllability, and application diversity, and present my personal perspective on future directions in the field.