
Introduction to SPTK: A Toolkit for Speech Signal Processing

The Speech Signal Processing Toolkit (SPTK) is an open-source suite of tools for speech signal processing, including speech analysis and synthesis. It has been actively maintained and widely used in the speech processing community since its initial release in 1998. This lecture introduces the core concepts of SPTK, along with a brief overview of the fundamentals of speech signal processing. In addition, a differentiable extension of SPTK, called diffsptk, is also introduced. Designed for integration with deep learning workflows, diffsptk helps bridge the gap between classical signal processing and modern neural network architectures.


Takenori Yoshimura

September 16, 2025


Transcript

  1. Introduction to SPTK: A Toolkit for Speech Signal Processing Takenori

    Yoshimura (Nagoya Institute of Technology) Voice of Wellness 2025 09/16
  2. Self-Introduction ◆ Academic background ⚫ Ph.D. from Nitech (2018) ⚫

    Visiting researcher at University of Edinburgh (2015) ◆ Work experience ⚫ Researcher at Nitech (2020–) ⚫ Researcher at Techno-Speech (2021–) ⚫ Engineer at Human Dataware Lab. (2019–) ◆ Technical contributions ⚫ Main maintainer of SPTK ⚫ Contributor to ESPnet 2
  3. Outline ◆ SPTK ⚫ History ⚫ Concepts ⚫ Installation ⚫

    Features ◆ diffsptk ⚫ Motivation ⚫ Concepts ⚫ Installation ⚫ Features 3
  4. What Is SPTK? ◆ OSS for speech signal processing ⚫

    Developed at Tokyo Institute of Technology ➢ Maintained at Nagoya Institute of Technology ⚫ Provides 100+ Unix-like commands ➢ Speech analysis / synthesis ➢ Speech coding, etc. ⚫ Provides a static library ⚫ Developed in C / C++ ⚫ Released under a permissive license ⚫ Actively maintained on GitHub 4
  5. History of SPTK ⚫ 1998: SPTK 1.0 (70+ cmds, written

    in C) ⚫ 2000: SPTK 2.0 (90+ cmds) ⚫ 2002: SPTK 3.0 (changed to modified BSD) ⚫ 2007: SPTK 3.1 (hosted on SourceForge) ⚫ 2017: SPTK 3.11 (130+ cmds) ⚫ 2021: SPTK 4.0 (rewritten in C++, changed to Apache 2.0, hosted on GitHub) ⚫ 2023: diffsptk 1.0 (rewritten in PyTorch, 70+ modules) ⚫ 2024: SPTK 4.3 (140+ cmds) ⚫ 2025: diffsptk 3.3 (120+ modules) ⚫ Annual releases 5
  6. Concepts of SPTK (1/3) ◆ Standard-I/O based ⚫ SPTK commands

    use stdin/stdout ⚫ Allows users to chain multiple commands using pipes (|) ➢ $ x2x +sd data.short | frame | window | lpc > data.lpc ⚫ Users can perform complex signal processing ➢ without the need for temporary files ⚫ Can be combined with shell commands ➢ e.g., cat, less, wc, sox ➢ Enables seamless interaction with existing tools 6
  7. Concepts of SPTK (2/3) ◆ Raw data format ⚫ SPTK

    uses a pure binary format without headers ⚫ No compression is applied ⚫ In contrast to structured formats ➢ NumPy (.npy), Kaldi (.ark), HDF (.h5) ⚫ Users can see file contents via binary dump tools ⚫ Enables reading by other tools (e.g., numpy.fromfile) ⚫ The default data type used in SPTK4 is float64 ➢ SPTK<4 uses float32 7
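Because SPTK files are headerless raw binaries, they can be read and written directly from other tools. A minimal sketch with NumPy (the file name data.raw and the 3-dim vectors are illustrative assumptions, not part of the deck):

```python
import numpy as np

# Write a few float64 values as a headerless raw binary,
# the same layout SPTK 4 commands read and write.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=np.float64)
x.tofile("data.raw")  # no header, no compression

# Read it back; the dtype must be given explicitly since there is no header.
y = np.fromfile("data.raw", dtype=np.float64)

# For vector sequences, reshape in C-order (time-major, dimension-last):
frames = y.reshape(-1, 3)  # here: 2 frames of 3-dim vectors
```

Note that for data produced by SPTK<4 the dtype would be np.float32 instead.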
  8. Concepts of SPTK (2/3) ◆ Data ordering is C-order (row-major order) ⚫

    Vectors for 𝑡 = 0, 1, 2, … are stored contiguously from the top of the stream to the end of the stream, with the vector dimension as the fastest-varying axis 8
  9. Concepts of SPTK (3/3) ◆ Minimum requirements ⚫ External libraries

    (e.g., Boost, Eigen) are useful but … ➢ May cause installation issues ➢ May complicate software licensing ⚫ We avoid relying on external libraries ➢ Excluding pitch extraction algorithms (due to their highly specialized nature) ⚫ Core DSP, including FFT, is written from scratch 9
  10. Installation of SPTK ◆ Requirements ⚫ GCC 4.8.5+ / Clang

    3.5.0+ / Visual Studio 2015+ ⚫ CMake 3.1+ ◆ Linux / macOS ⚫ $ git clone https://github.com/sp-nitech/SPTK.git ⚫ $ cd SPTK && make ◆ Windows ⚫ $ git clone https://github.com/sp-nitech/SPTK.git ⚫ $ cd SPTK && make.bat 10
  11. Features of SPTK ◆ Data type conversion ◆ Feature extraction

    (Speech analysis) ◆ Graph drawing ◆ Linear time-variant filtering (Speech synthesis) ◆ Parameter transformation ◆ Speech coding ◆ Subband decomposition ◆ Statistics computation ◆ etc... 11
  12. Data Type Conversion ◆ x2x: The most frequently used command

    in SPTK ⚫ Bridge between SPTK commands and inputs/outputs ⚫ Example: x2x +sa data.short | less ➢ +xy means conversion from x to y ➢ s: short, i: int, f: float, d: double, a: ascii ➢ Output: 7 14 19 22 ◆ dmp ⚫ Example: dmp +s data.short | less ➢ +x means conversion from x to ASCII ➢ Line numbers are printed together with values ➢ Output: 0 7 / 1 14 / 2 19 / 3 22 12
  13. Features of SPTK ◆ Data type conversion ◆ Feature extraction

    (Speech analysis) ◆ Graph drawing ◆ Linear time-variant filtering (Speech synthesis) ◆ Parameter transformation ◆ Speech coding ◆ Subband decomposition ◆ Statistics computation ◆ etc... 13
  14. Process of Feature Extraction ◆ Audio is a non-stationary signal

    ⚫ Its statistical properties change over time ⇒ Break down a waveform into small, manageable parts ⚫ Each frame is multiplied by a window and analyzed, yielding one feature vector per frame ⚫ Terminology: frame length (window length); frame shift (frame period, hop length) 14
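The framing arithmetic can be sketched in plain Python; the frame length of 400 and shift of 80 mirror the mgcep example on the next slide (SPTK's frame command may pad at the edges, so its exact frame count can differ from this simple count):

```python
# Split a signal into overlapping frames: frame length 400, frame shift 80.
frame_length = 400
frame_shift = 80
num_samples = 16000  # 1 second at 16 kHz

# Number of frames when each frame must fit entirely inside the signal.
num_frames = 1 + (num_samples - frame_length) // frame_shift

def frame_start(n):
    """Start index of the n-th frame."""
    return n * frame_shift

# Adjacent frames overlap by frame_length - frame_shift samples.
overlap = frame_length - frame_shift
```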
  15. Example of Feature Extraction ⚫ x2x – Converts data type

    ➢ +sd means conversion from short to double ⚫ frame – Divides audio into segments ➢ -p and -l specify frame shift and frame length (in samples) ⚫ window – Applies a window function ➢ -l means frame length, -L 512 pads with zeros for FFT ⚫ mgcep – Extracts mel-cepstral coefficients ➢ -m specifies the order of coefficients 15 $ x2x +sd data.short | frame -p 80 -l 400 | window -l 400 -L 512 | mgcep -m 24 -l 512 -a 0.42 > data.mc
  16. Feature Extraction Tools ⚫ spec – Amplitude spectrum ⚫ phase

    – Phase spectrum ⚫ grpdelay – Group delay ⚫ acorr – Autocorrelation ⚫ lpc – LPC coefficients ⚫ fftcep – Cepstral coefficients ⚫ mgcep – Mel-cepstral coefficients ⚫ smcep – Mel-cepstral coefficients ⚫ fbank – Mel spectrogram ⚫ mfcc – MFCC ⚫ plp – PLP coefficients The following commands do not require the frame command, as they perform frame processing internally: ⚫ pitch_spec – Amplitude spectrum ⚫ ap – Aperiodicity ⚫ pitch – Pitch (F0) ⚫ pitch_mark – GCI ⚫ zcross – Zero-crossings 16
  17. Graph Drawing ◆ Written in Python, not C ⚫ Powered

    by Plotly ➢ Can generate modern, high-quality images ➢ Supports PNG, JPEG, PDF, SVG, and WebP 17 $ bcp -m 24 -s 0 -e 0 data.mc | fdrw data.mc.png Lets users visually sanity-check their data
  18. Features of SPTK ◆ Data type conversion ◆ Feature extraction

    (Speech analysis) ◆ Graph drawing ◆ Linear time-variant filtering (Speech synthesis) ◆ Parameter transformation ◆ Speech coding ◆ Subband decomposition ◆ Statistics computation ◆ etc... 19
  19. Speech Production Mechanism 1. Lungs (Air source) ⚫ Air is

    pushed from the lungs 2. Vocal folds ⚫ Airflow passes through the vocal folds ⚫ Vibration creates voiced sounds 3. Vocal tract ⚫ Sound is shaped by the vocal tract ⚫ Produces different speech sounds 20
  20. Simulating Speech Production Mechanism The source–filter model approximates human speech

    production in a simplified way ⚫ Source (vocal folds): excitation 𝑒(𝑛), a pulse train (voiced) or white noise (unvoiced) ⚫ Filter (vocal tract): linear time-variant system ℎ(𝑛) ⚫ Output: 𝑥(𝑛) = ℎ(𝑛) ∗ 𝑒(𝑛) ≈ natural speech 21
  21. Example of Speech Synthesis ⚫ excite – Generates a simple

    excitation signal ➢ data.pit is a file containing the extracted pitch sequence ➢ -p specifies frame shift ⚫ mglsadf – Performs filtering using the mel-cepstrum ➢ data.mc is a file containing the extracted mel-cepstrum ➢ -m specifies the order of the mel-cepstrum ⚫ x2x – Converts data type ➢ +ds means conversion from double to short 22 $ excite -p 80 data.pit | mglsadf -p 80 -m 24 -a 0.42 data.mc | x2x +ds > data.syn
  22. Speech Synthesis Tools ⚫ poledf – LPC ⚫ zerodf –

    impulse response ⚫ ltcdf – PARCOR ⚫ lspdf – LSP ⚫ mglsadf – Mel-cepstral coefficients Inverse filtering: ⚫ iltcdf – PARCOR ⚫ imglsadf – Mel-cepstral coefficients The following command assumes a mixed excitation signal rather than a simple excitation signal for better waveform reconstruction: ⚫ world_synth 23
  23. Note: Handling F0 in Unvoiced Regions ◆ In SPTK, F0

    values in unvoiced regions are: ⚫ 0 (linear scale) ⚫ -1.0e+10 (logarithmic scale) ◆ There are gaps between voiced/unvoiced regions ⚫ The discontinuities are not suitable for NN training ⚫ Linear interpolation is commonly used to fill these regions ⚫ A smooth contour can be obtained using magic_intpl 24
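As an illustration of that interpolation step (a NumPy sketch, not the magic_intpl implementation; the F0 values below are made up), unvoiced gaps marked by 0 can be filled linearly:

```python
import numpy as np

# Linear-scale F0 contour; 0 marks unvoiced regions in SPTK.
f0 = np.array([0.0, 0.0, 100.0, 110.0, 0.0, 0.0, 120.0, 0.0])

voiced = f0 > 0.0
idx = np.arange(len(f0))

# Linearly interpolate across unvoiced gaps;
# edges take the nearest voiced value.
f0_interp = np.interp(idx, idx[voiced], f0[voiced])
```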
  24. Features of SPTK ◆ Data type conversion ◆ Feature extraction

    (Speech analysis) ◆ Graph drawing ◆ Linear time-variant filtering (Speech synthesis) ◆ Parameter transformation ◆ Speech coding ◆ Subband decomposition ◆ Statistics computation ◆ etc... 25
  25. Parameter Transformation 26 Green arrows represent SPTK commands, yellow circles

    represent acoustic features, and red arrows represent synthesis filters
  26. Example of Parameter Transformation ⚫ lpc2lsp – Converts from LPC

    to LSP ➢ data.lpc is a file containing LPC coefficients ➢ -m specifies the order of coefficients ➢ -o specifies the output format (unit) of LSP ⚫ lsp2lpc – Converts from LSP to LPC ➢ data.lpc2 should be identical to data.lpc ➢ -m specifies the order of coefficients ➢ -q specifies the input format (unit) of LSP 27 $ lpc2lsp -m 24 -o 0 data.lpc > data.lsp $ lsp2lpc -m 24 -q 0 data.lsp > data.lpc2
  27. Speech Coding ◆ Transmit speech signals at low bit rates

    ⚫ Waveform coding ➢ Transmits quantized waveform data ⚫ Parametric coding ➢ Transmits quantized acoustic features ⚫ Coding can be lossless or lossy ➢ i.e., whether the waveform can be perfectly reconstructed (Diagram: original waveform → encoder → transmission → decoder → reconstructed waveform) 28
  28. Example of Waveform Coding 29 ⚫ ulaw / iulaw –

    𝜇-law companding / expanding ➢ Apply a nonlinear function to the waveform ➢ Because speech samples concentrate around zero, this improves quantization efficiency ⚫ quantize / dequantize – Perform scalar quantization ➢ -b specifies the number of quantization bits ➢ -t specifies the quantization type $ x2x +sd data.short | ulaw | quantize -b 8 -t 0 > data.transmit $ dequantize -b 8 -t 0 < data.transmit | iulaw | x2x +ds > data.rec.short
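The nonlinear function behind ulaw / iulaw can be sketched directly; this is the textbook 𝜇-law formula with 𝜇 = 255 (whether SPTK uses this exact default is an assumption, not stated in the deck):

```python
import numpy as np

def ulaw_compress(x, mu=255.0):
    """mu-law companding for x normalized to [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def ulaw_expand(y, mu=255.0):
    """Inverse of ulaw_compress."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.linspace(-1.0, 1.0, 101)
y = ulaw_compress(x)      # compresses large amplitudes, expands small ones
x_rec = ulaw_expand(y)    # round-trips back to the original
```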
  29. Example of Parametric Coding 30 ⚫ lbg – Codebook generation

    ➢ LBG: a variant of the 𝑘-means algorithm ➢ -e specifies the size of the codebook ⚫ msvq / imsvq – Perform vector quantization ➢ -s specifies the codebook file ➢ "ms" means multi-stage ⇒ the -s option can be specified multiple times $ lbg -m 24 -e 32 data.mc > mc.cb $ msvq -m 24 -s mc.cb < data.mc > data.index $ imsvq -m 24 -s mc.cb < data.index > data.rec.mc
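At its core, the quantization step performed by msvq / imsvq is a nearest-codeword search. A minimal NumPy sketch (the 2-dim toy codebook here is made up; lbg would learn a real one from data):

```python
import numpy as np

# A toy codebook with 4 codewords of dimension 2.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def quantize(v, cb):
    """Return the index of the nearest codeword (Euclidean distance)."""
    return int(np.argmin(np.sum((cb - v) ** 2, axis=1)))

def dequantize(i, cb):
    """Map an index back to its codeword."""
    return cb[i]

idx = quantize(np.array([0.9, 0.1]), codebook)  # nearest to [1, 0]
rec = dequantize(idx, codebook)
```

Only the index is transmitted; the decoder recovers an approximation of the original vector from its copy of the codebook.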
  30. Subband Decomposition 31 ◆ Motivation ⚫ Signals often span a

    wide frequency range ⚫ Direct processing of full-band signals can be inefficient ◆ Concept ⚫ A signal is split into multiple frequency bands (= subbands) ⚫ Each subband signal can be processed separately ⚫ Subband signals can be reconstructed (Diagram: original waveform → analyzer → low-pass / high-pass signals → synthesizer → reconstructed waveform)
  31. Example of Subband Decomposition ⚫ pqmf / ipqmf – Analyzer

    and synthesizer based on PQMF ➢ -k specifies the number of subbands ➢ -m specifies the order of the filter (the order determines how sharp the filter response is) ⚫ PQMF ➢ Provides near-perfect reconstruction ➢ Produces uniform subbands ➢ Can be implemented efficiently using FIR filters 32 $ x2x +sd data.short | pqmf -k 2 -m 20 | ipqmf -k 2 -m 20 | x2x +ds > data.rec.short
  32. Features of SPTK ◆ Data type conversion ◆ Feature extraction

    (Speech analysis) ◆ Graph drawing ◆ Linear time-variant filtering (Speech synthesis) ◆ Parameter transformation ◆ Speech coding ◆ Subband decomposition ◆ Statistics computation ◆ etc... 33
  33. Statistics Computation 34 ◆ Numerical descriptors capture properties of data

    ◆ Uses: ⚫ Feature aggregation ➢ Summarize feature vectors over time ➢ Use aggregated statistics as inputs to models ⚫ Feature normalization ➢ Apply (zero-mean unit-variance) scaling ➢ Helps stability and convergence in DNN training ⚫ Evaluation metrics ⚫ Outlier detection
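The zero-mean unit-variance scaling mentioned above, sketched with NumPy over a toy (N, D) feature matrix (the sizes and distribution are illustrative; in an SPTK pipeline the per-dimension mean and variance would come from vstat):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(100, 25))  # toy (N, D) features

# Per-dimension statistics over time.
mean = features.mean(axis=0)
std = features.std(axis=0)

# Zero-mean, unit-variance scaling.
normalized = (features - mean) / std
```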
  34. Example of Statistics Computation ⚫ vstat – Computes vector statistics

    ➢ data.mc is a file containing mel-cepstral coefficients ➢ -m specifies the order of coefficients ➢ -o specifies the output format ➢ -d means only printing diagonal entries ⚫ sopr – Performs scalar operations ➢ -SQRT applies the square root operation 35 $ vstat -m 24 -o 2 -d data.mc | sopr -SQRT > data.mc.sdev
  35. Statistics Computation Tools ⚫ average – Mean ⚫ vstat –

    Mean and variance ⚫ vsum – Summation ⚫ vprod – Product ⚫ median – Median ⚫ mode – Mode ⚫ minmax – Minimum and maximum Tip: If you encounter NaN values in your statistics, you can identify where they occur in your data using the nan command 36
  36. Note: How to Use sopr Command ◆ Operations are processed

    sequentially ⚫ echo 0 1 2 3 | x2x +ad | sopr -m 2 -a 1 | x2x +da ⇒ 1 3 5 7 ⚫ echo 0 1 2 3 | x2x +ad | sopr -a 1 -m 2 | x2x +da ⇒ 2 4 6 8 ◆ Magic number processing ⚫ echo 0 1 | x2x +ad | sopr -magic 0 -MAGIC 2 | x2x +da ⇒ 2 1 ⚫ echo 0 1 | x2x +ad | sopr -magic 0 -LOG -MAGIC 0 | x2x +da ⇒ 0 0 37
  37. One Direction of Research ◆ DSP (Digital signal processing)

    ⚫ Fewer free parameters ⚫ Highly controllable and efficient ◆ DNN (Deep neural network) ⚫ Many free parameters ⚫ High accuracy, but less efficient ⇒ Combine classical DSP with modern DNNs ⚫ Embed DSP as a differentiable module within a DNN 39 The SPTK working group provides PyTorch-based DSP modules under the name diffsptk
  38. Combining DSP and DNN 40 (Diagram: input → DNN → DSP → DNN → output,

    compared against a target via loss computation and backpropagation) If the DSP block is not differentiable, the DNN before it cannot be updated via backpropagation
  39. Example of Combining DSP and DNN ◆ Neural vocoder ⚫

    LPCNet [Valin; '19]: LPC ⚫ Multi-band MelGAN [Yang; '20]: PQMF ⚫ DDSP [Engel; '20]: Harmonic plus noise ⚫ MLSANet [Yoshimura; '22]: MLSA filter ◆ Neural codec ⚫ SoundStream [Zeghidour; '21]: Multi-stage VQ ◆ Feature extractor ⚫ SincNet [Ravanelli; '18]: Band-pass filter ⚫ CombNet [Churchwell; '25]: IIR filter 41
  40. What Is diffsptk? ◆ OSS for differentiable DSP ⚫ Developed

    at Nagoya Institute of Technology ⚫ Available as a pip-installable package ➢ Provides SPTK-compatible features ➢ Complementary to other packages such as TorchAudio ⚫ Implemented in PyTorch ⚫ Released under a permissive license ⚫ Actively maintained on GitHub 42
  41. Concept of diffsptk (1/2) ◆ Non-recursive ⚫ DSP algorithms often

    involve recursion ⚫ Recursion is efficient for certain tasks, but: ➢ Not well-suited for neural network training (less compatible with GPU parallel computing) ⚫ Recursive parts are replaced with non-recursive ones whenever possible ⚫ Key techniques: ➢ Matrix multiplication ➢ FFT 43
  42. Concept of diffsptk (2/2) ◆ Dimension-last ⚫ The shape of

    tensors is assumed to be (B, N, D) ➢ B: mini-batch size ➢ N: data length ➢ D: data dimensionality ⚫ Compatible with SPTK ➢ Data ordering is C-order ⚫ Note that Conv1d in PyTorch assumes (B, D, N) ➢ Need to transpose tensors depending on the operation 44
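The shape convention can be illustrated with NumPy (PyTorch's transpose behaves the same way on these axes; the sizes B=8, N=100, D=25 are arbitrary):

```python
import numpy as np

B, N, D = 8, 100, 25  # batch size, data length, data dimensionality

x = np.zeros((B, N, D))        # diffsptk's dimension-last layout

# A Conv1d-style layout puts channels (D) before length (N):
x_conv = np.swapaxes(x, 1, 2)  # (B, D, N)

# ... and is swapped back after the convolution:
x_back = np.swapaxes(x_conv, 1, 2)  # (B, N, D) again
```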
  43. Installation of diffsptk ◆ Requirements (version 1.0.0) ⚫ Python 3.8+

    ⚫ PyTorch 1.11.0+ ◆ Requirements (version 3.3.1) ⚫ Python 3.10+ ⚫ PyTorch 2.3.1+ ◆ How to install ⚫ Prepare your Python environment ⚫ $ pip install diffsptk 45
  44. Features of diffsptk ◆ Includes almost all features of SPTK

    ◆ Additional features: ⚫ CQT / inverse CQT ⚫ MDCT / inverse MDCT ⚫ Gammatone filter bank analysis / synthesis ⚫ Griffin-Lim phase reconstruction ◆ Supports both a Module class and a functional API ⚫ Similar to torch.nn.Module / torch.nn.functional ⚫ The Module class is more efficient for repeated use 46
  45. Example of Feature Extraction ◆ SPTK ◆ diffsptk 47

    import diffsptk

    x, sr = diffsptk.read("assets/data.wav")

    # Compute the STFT amplitude of x.
    stft = diffsptk.STFT(frame_length=400, frame_period=80, fft_length=512)
    X = stft(x)

    # Estimate the mel-cepstrum of x.
    alpha = diffsptk.get_alpha(sr)
    mcep = diffsptk.MelCepstralAnalysis(fft_length=512, cep_order=24, alpha=alpha)
    mc = mcep(X)

    $ sox -t wav data.wav -c 1 -t s16 -r 16000 - | x2x +sd | frame -p 80 -l 400 | window -l 400 -L 512 | mgcep -m 24 -l 512 -a 0.42 > data.mc
  46. How to Combine DSP with DNN ◆ Declare a diffsptk module

    instance ⚫ All diffsptk modules inherit torch.nn.Module ◆ Combine them with existing PyTorch modules ⚫ Example: 48

    import torch
    import diffsptk

    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256),
        diffsptk.DCT(256),
        torch.nn.Linear(256, 1),
    )
    inputs = torch.randn(8, 256)
    outputs = model(inputs)
  47. Hands-on Tutorial ◆ SPTK ⚫ https://colab.research.google.com/drive/1vmbIJQDhT5F26eCE5iYKQuEEGxYUv-uJ?usp=drive_link ◆ diffsptk ⚫

    https://colab.research.google.com/drive/1xAoUKqXadvJXJ7RzN0OceB6y7q5i7Sn6?usp=drive_link 49 Try them on your own computer or on Google Colab to better understand SPTK!
  48. Thanks! 50 Any questions? The next lecture will be a

    hands-on practice session on SPTK / diffsptk