
Introduction to SPTK: A Toolkit for Speech Signal Processing

The Speech Signal Processing Toolkit (SPTK) is an open-source suite of tools for speech signal processing, including speech analysis and synthesis. It has been actively maintained and widely used in the speech processing community since its initial release in 1998. This lecture introduces the core concepts of SPTK, along with a brief overview of the fundamentals of speech signal processing. In addition, a differentiable extension of SPTK, called diffsptk, is also introduced. Designed for integration with deep learning workflows, diffsptk helps bridge the gap between classical signal processing and modern neural network architectures.


Takenori Yoshimura

September 16, 2025


Transcript

  1. Introduction to SPTK: A Toolkit for Speech Signal Processing Takenori

    Yoshimura (Nagoya Institute of Technology) Voice of Wellness 2025 09/16
  2. Self-Introduction ◆ Academic background ⚫ Ph.D. from Nitech (2018) ⚫

    Visiting researcher at University of Edinburgh (2015) ◆ Work experience ⚫ Researcher at Nitech (2020–) ⚫ Researcher at Techno-Speech (2021–) ⚫ Engineer at Human Dataware Lab. (2019–) ◆ Technical contributions ⚫ Main maintainer of SPTK ⚫ Contributor to ESPnet 2
  3. Outline ◆ SPTK ⚫ History ⚫ Concepts ⚫ Installation ⚫

    Features ◆ diffsptk ⚫ Motivation ⚫ Concepts ⚫ Installation ⚫ Features 3
  4. What Is SPTK? ◆ OSS for speech signal processing ⚫

    Developed at Tokyo Institute of Technology ➢ Maintained at Nagoya Institute of Technology ⚫ Provides 100+ Unix-like commands ➢ Speech analysis / synthesis ➢ Speech coding, etc. ⚫ Provides a static library ⚫ Developed in C / C++ ⚫ Released under a permissive license ⚫ Actively maintained on GitHub 4
  5. History of SPTK ⚫ 1998: SPTK 1.0 (70+ cmds, written

    in C) ⚫ 2000: SPTK 2.0 (90+ cmds) ⚫ 2002: SPTK 3.0 (changed to modified BSD) ⚫ 2007: SPTK 3.1 (hosted on SourceForge) ⚫ 2017: SPTK 3.11 (130+ cmds) ⚫ 2021: SPTK 4.0 (rewritten in C++, changed to Apache 2.0, hosted on GitHub) ⚫ 2023: diffsptk 1.0 (rewritten in PyTorch, 70+ modules) ⚫ 2024: SPTK 4.3 (140+ cmds) ⚫ 2025: diffsptk 3.3 (120+ modules) ⚫ Annual releases 5
  6. Concepts of SPTK (1/3) ◆ Standard-I/O based ⚫ SPTK commands

    use stdin/stdout ⚫ Allows users to chain multiple commands using pipes (|) ➢ $ x2x +sd data.short | frame | window | lpc > data.lpc ⚫ Users can perform complex signal processing ➢ without the need for temporary files ⚫ Can be combined with shell commands ➢ e.g., cat, less, wc, sox ➢ Enables seamless interaction with existing tools 6
  7. Concepts of SPTK (2/3) ◆ Raw data format ⚫ SPTK

    uses a pure binary format without headers ⚫ No compression is applied ⚫ In contrast to structured formats ➢ NumPy (.npy), Kaldi (.ark), HDF (.h5) ⚫ Users can see file contents via binary dump tools ⚫ Enables reading by other tools (e.g., numpy.fromfile) ⚫ The default data type used in SPTK4 is float64 ➢ SPTK<4 uses float32 7
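Because SPTK files are headerless raw binaries, they can be read and written directly from other tools. A minimal sketch with NumPy (the file name data.raw and the 3-dim vectors are illustrative assumptions, not part of the deck):

```python
import numpy as np

# Write a few float64 values as a headerless raw binary,
# the same layout SPTK 4 commands read and write.
x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=np.float64)
x.tofile("data.raw")  # no header, no compression

# Read it back; the dtype must be given explicitly since there is no header.
y = np.fromfile("data.raw", dtype=np.float64)

# For vector sequences, reshape in C-order (time-major, dimension-last):
frames = y.reshape(-1, 3)  # here: 2 frames of 3-dim vectors
```

Note that for data produced by SPTK<4 the dtype would be np.float32 instead.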
  8. Concepts of SPTK (2/3) ◆ Data ordering is C-order (row-major order) ⚫

    Vectors for 𝑡 = 0, 1, 2, … are stored contiguously from the top of the stream to the end of the stream, with the vector dimension as the fastest-varying axis 8
  9. Concepts of SPTK (3/3) ◆ Minimum requirements ⚫ External libraries

    (e.g., Boost, Eigen) are useful but … ➢ May cause installation issues ➢ May complicate software licensing ⚫ We avoid relying on external libraries ➢ Excluding pitch extraction algorithms (due to their highly specialized nature) ⚫ Core DSP, including FFT, is written from scratch 9
  10. Installation of SPTK ◆ Requirements ⚫ GCC 4.8.5+ / Clang

    3.5.0+ / Visual Studio 2015+ ⚫ CMake 3.1+ ◆ Linux / macOS ⚫ $ git clone https://github.com/sp-nitech/SPTK.git ⚫ $ cd SPTK && make ◆ Windows ⚫ $ git clone https://github.com/sp-nitech/SPTK.git ⚫ $ cd SPTK && make.bat 10
  11. Features of SPTK ◆ Data type conversion ◆ Feature extraction

    (Speech analysis) ◆ Graph drawing ◆ Linear time-variant filtering (Speech synthesis) ◆ Parameter transformation ◆ Speech coding ◆ Subband decomposition ◆ Statistics computation ◆ etc... 11
  12. Data Type Conversion ◆ x2x: The most frequently used command

    in SPTK ⚫ Bridge between SPTK commands and inputs/outputs ⚫ Example: x2x +sa data.short | less ➢ +xy means conversion from x to y ➢ s: short, i: int, f: float, d: double, a: ascii ➢ Output: 7 14 19 22 ◆ dmp ⚫ Example: dmp +s data.short | less ➢ +x means conversion from x to ASCII ➢ Line numbers are printed together with values ➢ Output: 0 7 / 1 14 / 2 19 / 3 22 12
  13. Features of SPTK ◆ Data type conversion ◆ Feature extraction

    (Speech analysis) ◆ Graph drawing ◆ Linear time-variant filtering (Speech synthesis) ◆ Parameter transformation ◆ Speech coding ◆ Subband decomposition ◆ Statistics computation ◆ etc... 13
  14. Process of Feature Extraction ◆ Audio is a non-stationary signal

    ⚫ Its statistical properties change over time ⇒ Break down a waveform into small, manageable parts ⚫ Each frame is multiplied by a window and analyzed, yielding one feature vector per frame ⚫ Terminology: frame length (window length); frame shift (frame period, hop length) 14
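The framing arithmetic can be sketched in plain Python; the frame length of 400 and shift of 80 mirror the mgcep example on the next slide (SPTK's frame command may pad at the edges, so its exact frame count can differ from this simple count):

```python
# Split a signal into overlapping frames: frame length 400, frame shift 80.
frame_length = 400
frame_shift = 80
num_samples = 16000  # 1 second at 16 kHz

# Number of frames when each frame must fit entirely inside the signal.
num_frames = 1 + (num_samples - frame_length) // frame_shift

def frame_start(n):
    """Start index of the n-th frame."""
    return n * frame_shift

# Adjacent frames overlap by frame_length - frame_shift samples.
overlap = frame_length - frame_shift
```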
  15. Example of Feature Extraction ⚫ x2x – Converts data type

    ➢ +sd means conversion from short to double ⚫ frame – Divides audio into segments ➢ -p and -l specify frame shift and frame length (in samples) ⚫ window – Applies a window function ➢ -l means frame length, -L 512 pads with zeros for FFT ⚫ mgcep – Extracts mel-cepstral coefficients ➢ -m specifies the order of coefficients 15 $ x2x +sd data.short | frame -p 80 -l 400 | window -l 400 -L 512 | mgcep -m 24 -l 512 -a 0.42 > data.mc
  16. Feature Extraction Tools ⚫ spec – Amplitude spectrum ⚫ phase

    – Phase spectrum ⚫ grpdelay – Group delay ⚫ acorr – Autocorrelation ⚫ lpc – LPC coefficients ⚫ fftcep – Cepstral coefficients ⚫ mgcep – Mel-cepstral coefficients ⚫ smcep – Mel-cepstral coefficients ⚫ fbank – Mel spectrogram ⚫ mfcc – MFCC ⚫ plp – PLP coefficients The following commands do not require the frame command, as they perform frame processing internally: ⚫ pitch_spec – Amplitude spectrum ⚫ ap – Aperiodicity ⚫ pitch – Pitch (F0) ⚫ pitch_mark – GCI ⚫ zcross – Zero-crossings 16
  17. Graph Drawing ◆ Written in Python, not C ⚫ Powered

    by Plotly ➢ Can generate modern, high-quality images ➢ Supports PNG, JPEG, PDF, SVG, and WebP 17 $ bcp -m 24 -s 0 -e 0 data.mc | fdrw data.mc.png Lets users visually sanity-check their data
  18. Features of SPTK ◆ Data type conversion ◆ Feature extraction

    (Speech analysis) ◆ Graph drawing ◆ Linear time-variant filtering (Speech synthesis) ◆ Parameter transformation ◆ Speech coding ◆ Subband decomposition ◆ Statistics computation ◆ etc... 19
  19. Speech Production Mechanism 1. Lungs (Air source) ⚫ Air is

    pushed from the lungs 2. Vocal folds ⚫ Airflow passes through the vocal folds ⚫ Vibration creates voiced sounds 3. Vocal tract ⚫ Sound is shaped by the vocal tract ⚫ Produces different speech sounds 20
  20. Simulating Speech Production Mechanism The source–filter model approximates human speech

    production in a simplified way ⚫ Source (vocal folds): excitation 𝑒(𝑛), a pulse train (voiced) or white noise (unvoiced) ⚫ Filter (vocal tract): linear time-variant system ℎ(𝑛) ⚫ Output: 𝑥(𝑛) = ℎ(𝑛) ∗ 𝑒(𝑛) ≈ natural speech 21
  21. Example of Speech Synthesis ⚫ excite – Generates a simple

    excitation signal ➢ data.pit is a file containing the extracted pitch sequence ➢ -p specifies frame shift ⚫ mglsadf – Performs filtering using the mel-cepstrum ➢ data.mc is a file containing the extracted mel-cepstrum ➢ -m specifies the order of the mel-cepstrum ⚫ x2x – Converts data type ➢ +ds means conversion from double to short 22 $ excite -p 80 data.pit | mglsadf -p 80 -m 24 -a 0.42 data.mc | x2x +ds > data.syn
  22. Speech Synthesis Tools ⚫ poledf – LPC ⚫ zerodf –

    impulse response ⚫ ltcdf – PARCOR ⚫ lspdf – LSP ⚫ mglsadf – Mel-cepstral coefficients Inverse filtering: ⚫ iltcdf – PARCOR ⚫ imglsadf – Mel-cepstral coefficients The following command assumes a mixed excitation signal rather than a simple excitation signal for better waveform reconstruction: ⚫ world_synth 23
  23. Note: Handling F0 in Unvoiced Regions ◆ In SPTK, F0

    values in unvoiced regions are: ⚫ 0 (linear scale) ⚫ -1.0e+10 (logarithmic scale) ◆ There are gaps between voiced/unvoiced regions ⚫ The discontinuities are not suitable for NN training ⚫ Linear interpolation is commonly used to fill these regions ⚫ A smooth contour can be obtained using magic_intpl 24
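As an illustration of that interpolation step (a NumPy sketch, not the magic_intpl implementation; the F0 values below are made up), unvoiced gaps marked by 0 can be filled linearly:

```python
import numpy as np

# Linear-scale F0 contour; 0 marks unvoiced regions in SPTK.
f0 = np.array([0.0, 0.0, 100.0, 110.0, 0.0, 0.0, 120.0, 0.0])

voiced = f0 > 0.0
idx = np.arange(len(f0))

# Linearly interpolate across unvoiced gaps;
# edges take the nearest voiced value.
f0_interp = np.interp(idx, idx[voiced], f0[voiced])
```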
  24. Features of SPTK ◆ Data type conversion ◆ Feature extraction

    (Speech analysis) ◆ Graph drawing ◆ Linear time-variant filtering (Speech synthesis) ◆ Parameter transformation ◆ Speech coding ◆ Subband decomposition ◆ Statistics computation ◆ etc... 25
  25. Parameter Transformation 26 Green arrows represent SPTK commands, yellow circles

    represent acoustic features, and red arrows represent synthesis filters
  26. Example of Parameter Transformation ⚫ lpc2lsp – Converts from LPC

    to LSP ➢ data.lpc is a file containing LPC coefficients ➢ -m specifies the order of coefficients ➢ -o specifies the output format (unit) of LSP ⚫ lsp2lpc – Converts from LSP to LPC ➢ data.lpc2 should be identical to data.lpc ➢ -m specifies the order of coefficients ➢ -q specifies the input format (unit) of LSP 27 $ lpc2lsp -m 24 -o 0 data.lpc > data.lsp $ lsp2lpc -m 24 -q 0 data.lsp > data.lpc2
  27. Speech Coding ◆ Transmit speech signals at low bit rates

    ⚫ Waveform coding ➢ Transmits quantized waveform data ⚫ Parametric coding ➢ Transmits quantized acoustic features ⚫ Coding can be lossless or lossy ➢ i.e., whether the waveform can be perfectly reconstructed (Diagram: original waveform → encoder → transmission → decoder → reconstructed waveform) 28
  28. Example of Waveform Coding 29 ⚫ ulaw / iulaw –

    𝜇-law companding / expanding ➢ Apply a nonlinear function to the waveform ➢ Because speech samples concentrate around zero, this improves quantization efficiency ⚫ quantize / dequantize – Perform scalar quantization ➢ -b specifies the number of quantization bits ➢ -t specifies the quantization type $ x2x +sd data.short | ulaw | quantize -b 8 -t 0 > data.transmit $ dequantize -b 8 -t 0 < data.transmit | iulaw | x2x +ds > data.rec.short
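The nonlinear function behind ulaw / iulaw can be sketched directly; this is the textbook 𝜇-law formula with 𝜇 = 255 (whether SPTK uses this exact default is an assumption, not stated in the deck):

```python
import numpy as np

def ulaw_compress(x, mu=255.0):
    """mu-law companding for x normalized to [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def ulaw_expand(y, mu=255.0):
    """Inverse of ulaw_compress."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.linspace(-1.0, 1.0, 101)
y = ulaw_compress(x)      # compresses large amplitudes, expands small ones
x_rec = ulaw_expand(y)    # round-trips back to the original
```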
  29. Example of Parametric Coding 30 ⚫ lbg – Codebook generation

    ➢ LBG: a variant of the 𝑘-means algorithm ➢ -e specifies the size of the codebook ⚫ msvq / imsvq – Perform vector quantization ➢ -s specifies the codebook file ➢ "ms" means multi-stage ⇒ the -s option can be specified multiple times $ lbg -m 24 -e 32 data.mc > mc.cb $ msvq -m 24 -s mc.cb < data.mc > data.index $ imsvq -m 24 -s mc.cb < data.index > data.rec.mc
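At its core, the quantization step performed by msvq / imsvq is a nearest-codeword search. A minimal NumPy sketch (the 2-dim toy codebook here is made up; lbg would learn a real one from data):

```python
import numpy as np

# A toy codebook with 4 codewords of dimension 2.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def quantize(v, cb):
    """Return the index of the nearest codeword (Euclidean distance)."""
    return int(np.argmin(np.sum((cb - v) ** 2, axis=1)))

def dequantize(i, cb):
    """Map an index back to its codeword."""
    return cb[i]

idx = quantize(np.array([0.9, 0.1]), codebook)  # nearest to [1, 0]
rec = dequantize(idx, codebook)
```

Only the index is transmitted; the decoder recovers an approximation of the original vector from its copy of the codebook.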
  30. Subband Decomposition 31 ◆ Motivation ⚫ Signals often span a

    wide frequency range ⚫ Direct processing of full-band signals can be inefficient ◆ Concept ⚫ A signal is split into multiple frequency bands (= subbands) ⚫ Each subband signal can be processed separately ⚫ Subband signals can be reconstructed (Diagram: original waveform → analyzer → low-pass / high-pass signals → synthesizer → reconstructed waveform)
  31. Example of Subband Decomposition ⚫ pqmf / ipqmf – Analyzer

    and synthesizer based on PQMF ➢ -k specifies the number of subbands ➢ -m specifies the order of the filter (the order determines how sharp the filter response is) ⚫ PQMF ➢ Provides near-perfect reconstruction ➢ Produces uniform subbands ➢ Can be implemented efficiently using FIR filters 32 $ x2x +sd data.short | pqmf -k 2 -m 20 | ipqmf -k 2 -m 20 | x2x +ds > data.rec.short
  32. Features of SPTK ◆ Data type conversion ◆ Feature extraction

    (Speech analysis) ◆ Graph drawing ◆ Linear time-variant filtering (Speech synthesis) ◆ Parameter transformation ◆ Speech coding ◆ Subband decomposition ◆ Statistics computation ◆ etc... 33
  33. Statistics Computation 34 ◆ Numerical descriptors capture properties of data

    ◆ Uses: ⚫ Feature aggregation ➢ Summarize feature vectors over time ➢ Use aggregated statistics as inputs to models ⚫ Feature normalization ➢ Apply (zero-mean unit-variance) scaling ➢ Helps stability and convergence in DNN training ⚫ Evaluation metrics ⚫ Outlier detection
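The zero-mean unit-variance scaling mentioned above, sketched with NumPy over a toy (N, D) feature matrix (the sizes and distribution are illustrative; in an SPTK pipeline the per-dimension mean and variance would come from vstat):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(loc=3.0, scale=2.0, size=(100, 25))  # toy (N, D) features

# Per-dimension statistics over time.
mean = features.mean(axis=0)
std = features.std(axis=0)

# Zero-mean, unit-variance scaling.
normalized = (features - mean) / std
```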
  34. Example of Statistics Computation ⚫ vstat – Computes vector statistics

    ➢ data.mc is a file containing mel-cepstral coefficients ➢ -m specifies the order of coefficients ➢ -o specifies the output format ➢ -d means only printing diagonal entries ⚫ sopr – Performs scalar operations ➢ -SQRT applies the square root operation 35 $ vstat -m 24 -o 2 -d data.mc | sopr -SQRT > data.mc.sdev
  35. Statistics Computation Tools ⚫ average – Mean ⚫ vstat –

    Mean and variance ⚫ vsum – Summation ⚫ vprod – Product ⚫ median – Median ⚫ mode – Mode ⚫ minmax – Minimum and maximum Tip: If you encounter NaN values in your statistics, you can identify where they occur in your data using the nan command 36
  36. Note: How to Use sopr Command ◆ Operations are processed

    sequentially ⚫ echo 0 1 2 3 | x2x +ad | sopr -m 2 -a 1 | x2x +da ⇒ 1 3 5 7 ⚫ echo 0 1 2 3 | x2x +ad | sopr -a 1 -m 2 | x2x +da ⇒ 2 4 6 8 ◆ Magic number processing ⚫ echo 0 1 | x2x +ad | sopr -magic 0 -MAGIC 2 | x2x +da ⇒ 2 1 ⚫ echo 0 1 | x2x +ad | sopr -magic 0 -LOG -MAGIC 0 | x2x +da ⇒ 0 0 37
  37. One Direction of Research ◆ DSP (Digital signal processing)

    ⚫ Fewer free parameters ⚫ Highly controllable and efficient ◆ DNN (Deep neural network) ⚫ Many free parameters ⚫ High accuracy, but less efficient ⇒ Combine classical DSP with modern DNNs ⚫ Embed DSP as a differentiable module within a DNN 39 The SPTK working group provides PyTorch-based DSP modules under the name diffsptk
  38. Combining DSP and DNN 40 (Diagram: input → DNN → DSP → DNN → output,

    compared against a target via loss computation and backpropagation) If the DSP block is not differentiable, the DNN before it cannot be updated via backpropagation
  39. Example of Combining DSP and DNN ◆ Neural vocoder ⚫

    LPCNet [Valin; '19]: LPC ⚫ Multi-band MelGAN [Yang; '20]: PQMF ⚫ DDSP [Engel; '20]: Harmonic plus noise ⚫ MLSANet [Yoshimura; '22]: MLSA filter ◆ Neural codec ⚫ SoundStream [Zeghidour; '21]: Multi-stage VQ ◆ Feature extractor ⚫ SincNet [Ravanelli; '18]: Band-pass filter ⚫ CombNet [Churchwell; '25]: IIR filter 41
  40. What Is diffsptk? ◆ OSS for differentiable DSP ⚫ Developed

    at Nagoya Institute of Technology ⚫ Available as a pip-installable package ➢ Provides SPTK-compatible features ➢ Complementary to other packages such as TorchAudio ⚫ Implemented in PyTorch ⚫ Released under a permissive license ⚫ Actively maintained on GitHub 42
  41. Concept of diffsptk (1/2) ◆ Non-recursive ⚫ DSP algorithms often

    involve recursion ⚫ Recursion is efficient for certain tasks, but: ➢ Not well-suited for neural network training (less compatible with GPU parallel computing) ⚫ Recursive parts are replaced with non-recursive ones whenever possible ⚫ Key techniques: ➢ Matrix multiplication ➢ FFT 43
  42. Concept of diffsptk (2/2) ◆ Dimension-last ⚫ The shape of

    tensors is assumed to be (B, N, D) ➢ B: mini-batch size ➢ N: data length ➢ D: data dimensionality ⚫ Compatible with SPTK ➢ Data ordering is C-order ⚫ Note that Conv1d in PyTorch assumes (B, D, N) ➢ Need to transpose tensors depending on the operation 44
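The shape convention can be illustrated with NumPy (PyTorch's transpose behaves the same way on these axes; the sizes B=8, N=100, D=25 are arbitrary):

```python
import numpy as np

B, N, D = 8, 100, 25  # batch size, data length, data dimensionality

x = np.zeros((B, N, D))        # diffsptk's dimension-last layout

# A Conv1d-style layout puts channels (D) before length (N):
x_conv = np.swapaxes(x, 1, 2)  # (B, D, N)

# ... and is swapped back after the convolution:
x_back = np.swapaxes(x_conv, 1, 2)  # (B, N, D) again
```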
  43. Installation of diffsptk ◆ Requirements (version 1.0.0) ⚫ Python 3.8+

    ⚫ PyTorch 1.11.0+ ◆ Requirements (version 3.3.1) ⚫ Python 3.10+ ⚫ PyTorch 2.3.1+ ◆ How to install ⚫ Prepare your Python environment ⚫ $ pip install diffsptk 45
  44. Features of diffsptk ◆ Includes almost all features of SPTK

    ◆ Additional features: ⚫ CQT / inverse CQT ⚫ MDCT / inverse MDCT ⚫ Gammatone filter bank analysis / synthesis ⚫ Griffin-Lim phase reconstruction ◆ Supports both a Module class and a functional API ⚫ Similar to torch.nn.Module / torch.nn.functional ⚫ The Module class is more efficient for repeated use 46
  45. Example of Feature Extraction ◆ SPTK ◆ diffsptk 47

    import diffsptk

    x, sr = diffsptk.read("assets/data.wav")

    # Compute the STFT amplitude of x.
    stft = diffsptk.STFT(frame_length=400, frame_period=80, fft_length=512)
    X = stft(x)

    # Estimate the mel-cepstrum of x.
    alpha = diffsptk.get_alpha(sr)
    mcep = diffsptk.MelCepstralAnalysis(fft_length=512, cep_order=24, alpha=alpha)
    mc = mcep(X)

    $ sox -t wav data.wav -c 1 -t s16 -r 16000 - | x2x +sd | frame -p 80 -l 400 | window -l 400 -L 512 | mgcep -m 24 -l 512 -a 0.42 > data.mc
  46. How to Combine DSP with DNN ◆ Declare a diffsptk module

    instance ⚫ All diffsptk modules inherit torch.nn.Module ◆ Combine them with existing PyTorch modules ⚫ Example: 48

    import torch
    import diffsptk

    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256),
        diffsptk.DCT(256),
        torch.nn.Linear(256, 1),
    )
    inputs = torch.randn(8, 256)
    outputs = model(inputs)
  47. Hands-on Tutorial ◆ SPTK ⚫ https://colab.research.google.com/drive/1vmbIJQDhT5F26eCE5iYKQuEEGxYUv-uJ?usp=drive_link ◆ diffsptk ⚫

    https://colab.research.google.com/drive/1xAoUKqXadvJXJ7RzN0OceB6y7q5i7Sn6?usp=drive_link 49 Try them on your own computer or on Google Colab to better understand SPTK!
  48. Thanks! 50 Any questions? The next lecture will be a

    hands-on practice session on SPTK / diffsptk