wav2vec 2.0: Overview

• A pre-trained model for speech data
◦ Combined with other models (e.g., a Transformer) for inference
• Trained on raw speech waveform data
◦ No filter banks or spectrograms (!)
• Self-supervised learning from unlabeled data
◦ Learns vector representations via masking, as in BERT
• WER 4.8/8.2 (Librispeech clean/other) after transfer learning on only 10 minutes of data
◦ Pre-training on 53,000 hours of unlabeled data + supervised learning on 10 minutes of labeled data
◦ With 960 hours of labeled data: WER 1.8/3.3 (Librispeech clean/other)

Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. "wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations." In Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS '20). Curran Associates Inc., Red Hook, NY, USA, Article 1044, 12449–12460.
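The BERT-style masking above operates on spans of latent frames rather than single tokens. A minimal sketch, assuming the paper's default hyperparameters (span-start probability p = 0.065, span length M = 10; spans may overlap):

```python
import random

def sample_mask(num_frames, p=0.065, span=10, rng=None):
    """Sample a boolean mask over latent frames, wav2vec 2.0 style:
    each frame is chosen as a span start with probability p, and the
    following `span` frames are masked (overlapping spans allowed)."""
    rng = rng or random.Random(0)
    mask = [False] * num_frames
    for i in range(num_frames):
        if rng.random() < p:
            for j in range(i, min(i + span, num_frames)):
                mask[j] = True
    return mask

mask = sample_mask(200)
print(sum(mask), "of", len(mask), "frames masked")
```

The masked latent vectors are the positions where the model must identify the true quantized target among distractors (the contrastive task).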
Z = Latent speech representations

Ba, Jimmy, Jamie Ryan Kiros, and Geoffrey E. Hinton. "Layer Normalization." arXiv abs/1607.06450 (2016).
Hendrycks, Dan, and Kevin Gimpel. "Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units." arXiv abs/1606.08415 (2016).
ref. https://cvml-expertguide.net/terms/dl/layers/activation-function/relu-like-activation/gelu/
ref. https://qiita.com/nishiha/items/207f6d7eacce344e3c5e
ref. https://data-analytics.fun/2020/07/16/understanding-layer-normalization/
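The two references above cover the feature encoder's building blocks, GELU and layer normalization. A pure-Python sketch to make the formulas concrete (exact GELU via the normal CDF; LayerNorm shown without the learned gain/bias for brevity):

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def layer_norm(v, eps=1e-5):
    """LayerNorm over one feature vector: subtract the mean, divide by
    the standard deviation (learned gain/bias omitted for brevity)."""
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]

print(round(gelu(1.0), 4))  # Phi(1) ~ 0.8413, so gelu(1) ~ 0.8413
```

Unlike ReLU, GELU is smooth and weights inputs by how large they are relative to a standard normal, which is why it pairs naturally with normalized activations.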
Z (latent speech representations) -> Q = Quantized representations

(figure labels: codebook of representative entries; Gumbel distribution; temperature hyperparameter)

Jégou, H., M. Douze, and C. Schmid. "Product Quantization for Nearest Neighbor Search." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 1, pp. 117-128, Jan. 2011. doi: 10.1109/TPAMI.2010.57.
Jang, Eric, Shixiang Shane Gu, and Ben Poole. "Categorical Reparameterization with Gumbel-Softmax." arXiv abs/1611.01144 (2016).
ref. https://www.pinecone.io/learn/product-quantization/
ref. https://data-analytics.fun/2021/04/06/understanding-gumbel-max-trick/
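Product quantization (the Jégou et al. reference) splits a vector into G sub-vectors and snaps each one to the nearest entry of its own small codebook. A toy sketch — the dimensions and codebook values are made up purely for illustration:

```python
def product_quantize(z, codebooks):
    """Product quantization sketch: split z into G equal sub-vectors,
    replace each with the nearest entry (squared-distance) of its own
    codebook, and concatenate the chosen entries.
    codebooks[g] is a list of V candidate sub-vectors for group g."""
    G = len(codebooks)
    d = len(z) // G
    out = []
    for g in range(G):
        sub = z[g * d:(g + 1) * d]
        best = min(codebooks[g],
                   key=lambda e: sum((a - b) ** 2 for a, b in zip(sub, e)))
        out.extend(best)
    return out

# Toy example: 4-dim vector, G = 2 groups, V = 2 entries per codebook.
cbs = [[[0.0, 0.0], [1.0, 1.0]],
       [[0.0, 1.0], [1.0, 0.0]]]
print(product_quantize([0.9, 1.1, 0.1, 0.8], cbs))  # [1.0, 1.0, 0.0, 1.0]
```

With G codebooks of V entries each, the quantizer can express V^G distinct targets while storing only G·V sub-vectors.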
Z (latent speech representations) -> C = Context representations

(figure labels: GELU & LN; sinusoids; difference between positions i and j)

Mohamed, Abdel-rahman, Dmytro Okhonko, and Luke Zettlemoyer. "Transformers with convolutional context for ASR." arXiv abs/1904.11660 (2019).
Dai, Zihang, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. "Transformer-XL: Attentive Language Models beyond a Fixed-Length Context." arXiv abs/1901.02860 (2019).
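The Transformer-XL reference replaces absolute positions with sinusoidal encodings of the position *difference* i - j, so attention depends only on relative distance. A small sketch (d_model = 8 is chosen only for illustration):

```python
import math

def rel_pos_embedding(rel, d_model=8):
    """Sinusoidal embedding of a relative offset rel = i - j
    (Transformer-XL style): the encoding depends only on the distance
    between positions, never on their absolute values."""
    emb = []
    for k in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (k / d_model))
        emb.append(math.sin(rel * freq))
        emb.append(math.cos(rel * freq))
    return emb

# Same offset -> same embedding, regardless of absolute position.
print(rel_pos_embedding(5 - 2) == rel_pos_embedding(103 - 100))  # True
```

Note that wav2vec 2.0 itself realizes relative positions with a convolutional layer in front of the Transformer; the sinusoidal form above illustrates the Transformer-XL idea the slide cites.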
Process

• norm -> GELU = Z
• Z -> Linear -> one-hot (by Straight-Through Gumbel-Softmax) -> Product quantization -> Linear = Q
• Z -> Transformer (with Relative Positional Encoding) -> GELU -> Layer norm -> Linear = C
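The "one-hot by Straight-Through Gumbel-Softmax" step above can be sketched as follows. This is a forward-pass-only illustration in pure Python (tau = 1.0 is an illustrative temperature; in the real model the soft probabilities carry the gradient while the hard one-hot is used in the forward pass):

```python
import math
import random

def gumbel_softmax_onehot(logits, tau=1.0, rng=None):
    """Straight-through Gumbel-softmax sketch: perturb logits with
    Gumbel(0,1) noise, return a hard argmax one-hot plus the soft
    softmax probabilities that would back-propagate in training."""
    rng = rng or random.Random(0)
    noisy = [l - math.log(-math.log(max(rng.random(), 1e-12)))
             for l in logits]
    m = max(n / tau for n in noisy)           # stabilize the softmax
    exps = [math.exp(n / tau - m) for n in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    hard = [0.0] * len(logits)
    hard[noisy.index(max(noisy))] = 1.0       # forward-pass one-hot
    return hard, soft

hard, soft = gumbel_softmax_onehot([2.0, 0.1, -1.0])
print(hard)
```

Sampling via Gumbel noise keeps the codebook choice stochastic and differentiable-in-expectation, which is what lets the quantizer train end-to-end.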
Graves, Alex, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks." In Proceedings of the 23rd International Conference on Machine Learning (ICML '06). Association for Computing Machinery, New York, NY, USA, 369–376. https://doi.org/10.1145/1143844.1143891
ref. http://musyoku.github.io/2017/06/16/Connectionist-Temporal-Classification/
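CTC aligns frame-level predictions to a shorter label sequence through a many-to-one mapping: merge consecutive repeated symbols, then delete blanks. A minimal sketch of that collapse rule:

```python
def ctc_collapse(path, blank="-"):
    """CTC decoding rule: merge consecutive repeats, then drop the
    blank symbol. A blank between two identical symbols keeps them
    distinct (e.g. "a-a" -> "aa")."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

print(ctc_collapse("hh-e-ll-lo"))  # hello
```

Training sums the probability of every frame path that collapses to the target transcript, so no frame-level alignment labels are needed.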
Park, Daniel S., William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin Dogus Cubuk, and Quoc V. Le. "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition." arXiv abs/1904.08779 (2019).
ref. https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html
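SpecAugment masks blocks of consecutive time steps and frequency channels in the input features. A toy sketch on a list-of-lists spectrogram (the mask count and widths here are illustrative, not the paper's settings):

```python
import random

def spec_augment(spec, num_masks=1, max_width=2, rng=None):
    """SpecAugment-style masking sketch: zero out one random block of
    consecutive time frames and one of frequency channels per mask.
    `spec` is a list of time frames, each a list of channel values."""
    rng = rng or random.Random(0)
    T, F = len(spec), len(spec[0])
    out = [row[:] for row in spec]            # leave the input intact
    for _ in range(num_masks):
        w = rng.randint(1, max_width)         # time mask
        t0 = rng.randint(0, T - w)
        for t in range(t0, t0 + w):
            out[t] = [0.0] * F
        w = rng.randint(1, max_width)         # frequency mask
        f0 = rng.randint(0, F - w)
        for row in out:
            for f in range(f0, f0 + w):
                row[f] = 0.0
    return out

masked = spec_augment([[1.0] * 5 for _ in range(6)])
print(sum(v == 0.0 for row in masked for v in row), "cells zeroed")
```

Because the augmentation edits features rather than raw audio, it is cheap and needs no extra data; in wav2vec 2.0 fine-tuning, a closely related masking is applied to the latent representations instead.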
Experimental Setup: Language Models and Decoding

• 4-gram LM
◦ beam search, top 500
• Transformer LM
◦ beam search, top 50
◦ uses the model from [G. Synnaeve et al. 2020] as-is

*1 G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert. "End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures." arXiv abs/1911.08460 (2020).
* Ax: https://github.com/facebook/Ax
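The "top 500" / "top 50" above refer to the beam width: at each decoding step only the highest-scoring hypotheses survive. A minimal sketch with plain per-step log-probabilities (the paper's decoder additionally mixes in a weighted LM score, which is omitted here):

```python
import math

def beam_search(step_logprobs, beam_size=2):
    """Beam-search sketch: extend every surviving hypothesis by every
    symbol, score by summed log-probability, and keep the top
    `beam_size`. step_logprobs[t][v] is log P(symbol v at step t)."""
    beams = [([], 0.0)]
    for dist in step_logprobs:
        candidates = [(seq + [v], score + lp)
                      for seq, score in beams
                      for v, lp in enumerate(dist)]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams

steps = [[math.log(0.6), math.log(0.4)],
         [math.log(0.3), math.log(0.7)]]
best_seq, best_score = beam_search(steps)[0]
print(best_seq)  # [0, 1]
```

Larger beams (500 for the cheap 4-gram LM vs. 50 for the expensive Transformer LM) trade decoding time for search quality, which matches the setup above.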
Baevski, Alexei, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. "data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language." arXiv, 25 October 2022. https://doi.org/10.48550/arXiv.2202.03555
Baevski, Alexei, Arun Babu, Wei-Ning Hsu, and Michael Auli. "Efficient Self-Supervised Learning with Contextualized Target Representations for Vision, Speech and Language." arXiv (2022).
Babu, Arun, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Miguel Pino, Alexei Baevski, Alexis Conneau, and Michael Auli. "XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale." Interspeech (2021).
Cooper, Erica, Wen-Chin Huang, Tomoki Toda, and Junichi Yamagishi. "Generalization Ability of MOS Prediction Networks." arXiv, 14 February 2022. https://doi.org/10.48550/arXiv.2110.02635
Derivative work: Conformer + wav2vec 2.0

Quantization is shown to be effective for noisy speech.

Zhang, Yu, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, and Yonghui Wu. "Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition." arXiv abs/2010.10504 (2020).