Offline transcription (STT) with OpenAI's Whisper

Kenichiro Matohara (matoken)

Kenichiro Matohara (matoken) https://matoken.org
Joining from 南隅 (the lower-right corner of Kagoshima)
Favorite Linux distribution: Debian
Looking for work: mailto:work＠matohara.org

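Speech To Text (STT)
Speech recognition that turns audio into text data.
Several SaaS offerings exist, but they need a network connection: "Speech To Text To Translation (Azure version)" (Kagoshima Linux Study Group 2022.04).
OpenAI (an AI research lab) has released Whisper, an STT that runs locally and free of charge, together with its model data, for free!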

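Whisper
Developed by OpenAI
Written in Python, MIT license (are the models MIT too?)
Trained on 680,000 hours of audio (680000/24/365 = 77.6… years)
transcribe (transcription) supports 99 languages
translate (transcription + translation) translates into English only
openai/whisper: Robust Speech Recognition via Large-Scale Weak Supervision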
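The arithmetic on the slide above can be checked with a few lines of Python. This is only a sketch of the unit conversion; the function name is made up for illustration, and it assumes 24-hour days and 365-day years as the slide does.

```python
HOURS_OF_AUDIO = 680_000  # training-data figure quoted on the slide


def hours_to_years(hours: float) -> float:
    """Convert hours to years, assuming 24-hour days and 365-day years."""
    return hours / 24 / 365


years = hours_to_years(HOURS_OF_AUDIO)
print(f"{HOURS_OF_AUDIO} hours is about {years:.1f} years of continuous audio")
```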

Trying it on an Oracle Cloud Free Tier Ampere A1 Compute VM
CPU: Ampere A1 Compute (aarch64) x 4
RAM: 24 GB
OS: Ubuntu 20.04.5 LTS

Environment setup
1. Python 3.9.9 or later is needed, so install 3.10 from a PPA
2. Install Whisper with pip
3. ffmpeg is also required
$ sudo add-apt-repository ppa:deadsnakes/ppa
$ sudo apt install python3.10-minimal python3.10-venv
$ python3.10 -m venv venv
$ source venv/bin/activate
$ pip install git+https://github.com/openai/whisper.git
$ sudo apt install ffmpeg
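Before running Whisper it can help to confirm the two prerequisites named above. The following is a minimal pre-flight sketch, not part of Whisper itself: the helper names are hypothetical, and the minimum version is simply the one stated on the slide.

```python
import shutil
import sys

MIN_PYTHON = (3, 9, 9)  # minimum version as stated on the slide (assumption)


def python_ok(version=None) -> bool:
    """Return True if the given (or running) Python version is new enough."""
    current = tuple(sys.version_info[:3]) if version is None else tuple(version)
    return current >= MIN_PYTHON


def missing_tools(tools=("ffmpeg",)):
    """Return the external commands that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


print("Python version OK:", python_ok())
print("Missing tools:", missing_tools())
```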

usage
$ whisper
usage: whisper [-h] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large}] [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,f
               [--temperature TEMPERATURE] [--best_of BEST_OF] [--beam_size BEAM_SIZE] [--patie
               [--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT] [--fp16 FP16] [--tempe
               [--logprob_threshold LOGPROB_THRESHOLD] [--no_speech_threshold NO_SPEECH_THRESHO
               audio [audio ...]
whisper: error: the following arguments are required: audio

Whisper models
If a required model is not present when Whisper runs, it is downloaded automatically.
There are 4 English-only models and 5 multilingual models. The larger the model, the more resources it consumes, but the better the accuracy.
$ ls -sS1 ~/.cache/whisper/
total 7374952
3014656 large.pt
1492200 medium.pt
1492200 medium.en.pt
 472284 small.pt
 472284 small.en.pt
 141860 base.pt
 141860 base.en.pt
  73804 tiny.pt
  73804 tiny.en.pt
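To read the listing above as file sizes, the block counts can be converted to MiB. This sketch assumes the numbers are 1024-byte blocks (the usual GNU ls default, which the slide does not state); the dictionary just copies the figures from the listing.

```python
# Block counts copied from the `ls -sS1 ~/.cache/whisper/` listing.
LS_BLOCKS = {
    "large.pt": 3014656,
    "medium.pt": 1492200,
    "medium.en.pt": 1492200,
    "small.pt": 472284,
    "small.en.pt": 472284,
    "base.pt": 141860,
    "base.en.pt": 141860,
    "tiny.pt": 73804,
    "tiny.en.pt": 73804,
}


def blocks_to_mib(blocks: int, block_bytes: int = 1024) -> float:
    """Convert an ls block count to MiB (block size is an assumption)."""
    return blocks * block_bytes / 2**20


for name, blocks in LS_BLOCKS.items():
    print(f"{name:14s} {blocks_to_mib(blocks):8.1f} MiB")
print(f"{'total':14s} {blocks_to_mib(sum(LS_BLOCKS.values())):8.1f} MiB")
```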

Trying transcription of Japanese audio
Getting sample data from YouTube
$ yt-dlp -F https://www.youtube.com/watch?v=GiglWCcVi5o | grep -i audio
599 m4a  audio only 2 | 224.57KiB  31k https | audio only mp4a.40.5  31k 22k ultr
600 webm audio only 2 | 238.16KiB  33k https | audio only opus       33k 48k ultr
139 m4a  audio only 2 | 354.97KiB  49k https | audio only mp4a.40.5  49k 22k low,
249 webm audio only 2 | 342.58KiB  47k https | audio only opus       47k 48k low,
250 webm audio only 2 | 524.67KiB  72k https | audio only opus       72k 48k low,
140 m4a  audio only 2 | 939.76KiB 130k https | audio only mp4a.40.2 130k 44k medi
251 webm audio only 2 |   1.02MiB 144k https | audio only opus      144k 48k medi
$ yt-dlp -f 140 https://www.youtube.com/watch?v=GiglWCcVi5o
$ ffprobe -i ./大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし\ \[GiglWCcVi5o\].m4a 2>&1 | grep In
:
  Duration: 00:00:59.42, start: 0.000000, bitrate: 129 kb/s
  Stream #0:0[0x1](eng): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, stereo, fltp, 127 kb/s
:
$ yt-dlp --no-download --write-auto-subs --sub-langs ja https://www.youtube.com/watch?v=GiglWCc

tiny
$ time whisper ./大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし\ \[GiglWCcVi5o\].webm --language
/home/ubuntu/src/whisper/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarnin
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:04.000] 予報センターから自信の情報を伝えしています。
[00:04.000 --> 00:10.000] レージ25%九州地方で最大シーンと500の自信が発生しました。
[00:10.000 --> 00:14.000] 新源中はオースミハントと方向を起き、
[00:14.000 --> 00:18.000] 自信の希望示すまぐに中度は5.8
[00:18.000 --> 00:22.000] 新源の重はおよそ30kmとなっています。
[00:22.000 --> 00:26.000] この自信によるつなびの心配はありません。
[00:26.000 --> 00:29.000] この自信によるつなびの心配はありません。
[00:29.000 --> 00:33.000] 自信の御着を観測したのは、
[00:33.000 --> 00:36.000] 西断し、新度御着を観測したのは、
[00:36.000 --> 00:39.000] 西断し、新度4を観測したのは、
[00:39.000 --> 00:42.000] 高なベッチョー、シントミッチョー、
[00:42.000 --> 00:44.000] 宮崎し、
[00:44.000 --> 00:45.000] 崎しまし、
[00:45.000 --> 00:49.000] 宮崎の上司、こばやししとなっています。
[00:49.000 --> 00:52.000] そのほか九州エリア広い入れ、
[00:52.000 --> 00:55.000] 新度3を観測しています。
[00:55.000 --> 00:58.000] この自信によるつなびの心配はありません。
[00:58.000 --> 01:00.000] ではありません。
real 6m0.515s
user 11m10.055s
sys 1m36.165s

base
$ time whisper ./大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし\ \[GiglWCcVi5o\].webm --language
/home/ubuntu/src/whisper/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarnin
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:10.180] 余補センターから自身の情報をお伝えしています冷時2分後ろ九州地方で最大進度5弱の自身
[00:10.180 --> 00:18.260] 新原地は大済み半島東方大き自身の規模を示すマグに中度は5.8
[00:18.260 --> 00:25.800] 新原の重はおよそ30kmとなっていますこの自身による津波の心配はありません
[00:25.800 --> 00:49.460] 新度5弱を観測したのは日南市新度5弱を観測したのは日南市新度4を観測したのは高なベ
[00:49.460 --> 00:59.960] その他九州エリア広い配で新度3 新度にを観測していますこの自身による津波の心配はあり
real 8m36.310s
user 16m39.758s
sys 2m29.937s

small
$ time whisper ./大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし\ \[GiglWCcVi5o\].webm --language
/home/ubuntu/src/whisper/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarnin
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:03.840] 予報センターから地震の情報をお伝えしています。
[00:03.840 --> 00:10.140] 0時2分ごろ九州地方で最大震度5弱の地震が発生しました。
[00:10.140 --> 00:18.300] 震源地は大隅半島東方置き、地震の希望を示すマグニチュードは5.8。
[00:18.300 --> 00:22.400] 震源の深さはおよそ30kmとなっています。
[00:22.400 --> 00:25.800] この地震による津波の心配はありません。
[00:25.800 --> 00:29.700] この地震による津波の心配はありません。
[00:29.700 --> 00:36.640] 震度5弱を観測したのは日南市、震度5弱を観測したのは日南市、
[00:36.640 --> 00:44.040] 震度4を観測したのは高鍋町、震都道町、宮崎市、
[00:44.040 --> 00:49.400] 駆島市、宮子の城市、小林市となっています。
[00:49.400 --> 00:55.800] その他、九州エリア広い灰で震度3、震度2を観測しています。
[00:55.800 --> 00:59.800] この地震による津波の心配はありません。
real 35m3.326s
user 74m50.110s
sys 8m49.869s

medium
$ time whisper ./大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし\ \[GiglWCcVi5o\].webm --language
/home/ubuntu/src/whisper/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarnin
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:10.240] 予報センターから地震の情報をお伝えしています 0時2分ごろ九州地方で最大震度5弱の地震
[00:10.240 --> 00:18.320] 震源地は大隅半島東方起地震の規模を示すマグニチュードは5.8
[00:18.320 --> 00:25.920] 震源の深さはおよそ30kmとなっていますこの地震による津波の心配はありません
[00:25.920 --> 00:33.360] 震度5弱を観測したのは日南市震度5弱を観測したのは日南市
[00:33.360 --> 00:45.840] 震度4を観測したのは高鍋町新富町宮崎市福島市都の城市小林市となっています
[00:45.840 --> 00:52.320] その他九州エリア広井配で震度3 震度2を観測しています
[00:52.320 --> 01:00.880] この地震による津波の心配はありません
real 84m22.932s
user 191m7.199s
sys 23m17.116s

large
$ time whisper ./大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし\ \[GiglWCcVi5o\].webm --language
/home/ubuntu/src/whisper/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarnin
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:03.760] 予報センターから地震の情報をお伝えしています。
[00:03.760 --> 00:10.080] 0時2分ごろ九州地方で最大震度5弱の地震が発生しました。
[00:10.080 --> 00:13.680] 震源地は大隅半島東方沖。
[00:13.680 --> 00:18.240] 地震の規模を示すマグニ中度は5.8。
[00:18.240 --> 00:22.320] 震源の深さはおよそ30キロメートルとなっています。
[00:22.320 --> 00:25.760] この地震による津波の心配はありません。
[00:25.760 --> 00:29.640] この地震による津波の心配はありません。
[00:29.640 --> 00:33.240] 震度5弱を観測したのは日南市。
[00:33.240 --> 00:36.520] 震度5弱を観測したのは日南市。
[00:36.520 --> 00:43.920] 震度4を観測したのは高鍋町、新富町、宮崎市、
[00:43.920 --> 00:49.280] 久島市、宮古の城市、小林市となっています。
[00:49.280 --> 00:55.640] その他九州エリア広い範囲で震度3、震度2を観測しています。
[00:55.640 --> 00:59.760] この地震による津波の心配はありません。
real 210m5.865s
user 435m17.317s
sys 80m58.978s

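The sample looks like data recorded under good conditions, with no background music.
tiny and base are a bit rough.
Quality improves a lot from small onward. medium and large get slightly better still, but processing time grows considerably.
Even large seems unable to handle regional place names (for example, it writes 久島市 and 宮古の城市 where the cities meant are presumably 串間市 and 都城市).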

It also appears to recognize quoted speech and wrap it in 「」 brackets.
https://nitter.matoken.org/miyagawa/status/1592766792283607040#m

Transcribing English audio
$ wget https://www3.nhk.or.jp/nhkworld/upld/medias/en/radio/news/20221010183000_english_1.mp3
$ ffmpeg -i ./20221010183000_english_1.mp3 -map 0 -c copy -f segment -segment_time 60 -reset_ti
$ ls -s1 20221010183000_english_1*
 476 20221010183000_english_1-00.mp3
 472 20221010183000_english_1-01.mp3
 476 20221010183000_english_1-02.mp3
 476 20221010183000_english_1-03.mp3
 456 20221010183000_english_1-04.mp3
2332 20221010183000_english_1.mp3
$ ffprobe ./20221010183000_english_1-00.mp3 2>&1 | grep Input -A10
Input #0, mp3, from './20221010183000_english_1-00.mp3':
  Metadata:
    encoder : Lavf59.27.100
  Duration: 00:01:00.00, start: 0.025057, bitrate: 64 kb/s
  Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 64 kb/s
    Metadata:
      encoder : Lavc57.64

tiny
$ time whisper ./20221010183000_english_1-00.mp3 --language English --model tiny
/home/ubuntu/src/whisper/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarnin
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:06.500] This is Asian View from N.H.K. Roll in Japan.
[00:06.500 --> 00:08.000] I'm here with you.
[00:08.000 --> 00:12.000] Japanese Foreign Minister Hayashi Yoshimasa met his Malaysian counte
[00:12.000 --> 00:15.000] Sifuiting Abdullah and Kuala Lumpur on Sunday.
[00:15.000 --> 00:19.000] Hayashi conveyed his strong opposition to any attempt to unilaterall
[00:19.000 --> 00:24.000] change the status quo in the east and south China seas by force.
[00:24.000 --> 00:30.000] Malaysia is a strategic partner that shares our basic values and str
[00:30.000 --> 00:35.000] I think we were able to have a very meaningful discussion on future
[00:35.000 --> 00:41.000] Hayashi also explained the importance of maintaining and strengtheni
[00:41.000 --> 00:44.000] and responding to economic coercion.
[00:44.000 --> 00:49.000] On Russia's invasion of Ukraine, Hayashi sent the act goes against i
[00:49.000 --> 00:51.000] and should not be condoned.
[00:51.000 --> 00:57.000] Hayashi and Sifuiting confirmed that they will continue to coordinat
[00:57.000 --> 01:13.000] Thai Prime Minister Pryu Chan Ocha has it.
real 2m58.637s
user 6m42.815s
sys 0m46.548s

base
$ time whisper ./20221010183000_english_1-00.mp3 --language English --model base
/home/ubuntu/src/whisper/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarnin
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:06.800] This is Asian View from NHK Roll Japan.
[00:06.800 --> 00:08.800] I'm Hiragokitadai.
[00:08.800 --> 00:13.720] Japanese Foreign Minister Hayashi Yoshimasa met his Malaysian counte
[00:13.720 --> 00:15.880] in Kuala Lumpur on Sunday.
[00:15.880 --> 00:20.460] Hayashi conveyed his strong opposition to any attempt to unilaterall
[00:20.460 --> 00:25.120] quo in the East and South China Seas by force.
[00:25.120 --> 00:30.320] Malaysia is a strategic partner that shares our basic values and str
[00:30.320 --> 00:35.400] I think we were able to have a very meaningful discussion on future
[00:35.400 --> 00:40.400] Hayashi also explained the importance of maintaining and strengtheni
[00:40.400 --> 00:43.760] order and responding to economic coercion.
[00:43.760 --> 00:49.000] On Russia's invasion of Ukraine, Hayashi said the act goes against i
[00:49.000 --> 00:51.000] should not be condoned.
[00:51.000 --> 00:55.960] Hayashi and Saifudin confirmed that they will continue to coordinate
[00:55.960 --> 00:57.840] the conflict.
[00:57.840 --> 01:27.400] The Thai Prime Minister Priyut Chan Ocha has it.
real 6m41.061s
user 14m49.295s
sys 1m46.802s

small
$ time whisper ./20221010183000_english_1-00.mp3 --language English --model small
/home/ubuntu/src/whisper/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarnin
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:13.720] Japanese Foreign Minister Hayashi Yoshimasa met his Malaysian counte
[00:13.720 --> 00:15.880] in Kuala Lumpur on Sunday.
[00:15.880 --> 00:20.440] Hayashi conveyed his strong opposition to any attempt to unilaterall
[00:20.440 --> 00:24.320] quo in the East and South China Seas by force.
[00:24.320 --> 00:30.280] Malaysia is a strategic partner that shares our basic values and str
[00:30.280 --> 00:35.400] I think we were able to have a very meaningful discussion on future
[00:35.400 --> 00:40.400] Hayashi also explained the importance of maintaining and strengtheni
[00:40.400 --> 00:43.800] order and responding to economic coercion.
[00:43.800 --> 00:48.840] On Russia's invasion of Ukraine, Hayashi said the act goes against i
[00:48.840 --> 00:51.040] and should not be condoned.
[00:51.040 --> 00:55.840] Hayashi and Saifudin confirmed that they will continue to coordinate
[00:55.840 --> 01:00.000] to the conflict.
real 27m32.630s
user 56m35.610s
sys 8m19.331s

medium
$ time whisper ./20221010183000_english_1-00.mp3 --language English --model medium
/home/ubuntu/src/whisper/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarnin
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:06.800] This is Asian View from NHK World Japan.
[00:06.800 --> 00:08.600] I'm Hiroko Kitadai.
[00:08.600 --> 00:13.720] Japanese Foreign Minister Hayashi Yoshimasa met his Malaysian counte
[00:13.720 --> 00:15.800] in Kuala Lumpur on Sunday.
[00:15.800 --> 00:20.440] Hayashi conveyed his strong opposition to any attempt to unilaterall
[00:20.440 --> 00:25.560] quo in the East and South China Seas by force.
[00:25.560 --> 00:30.360] Russia is a strategic partner that shares our basic values and strat
[00:30.360 --> 00:35.400] I think we were able to have a very meaningful discussion on future
[00:35.400 --> 00:40.440] Hayashi also explained the importance of maintaining and strengtheni
[00:40.440 --> 00:43.840] order and responding to economic coercion.
[00:43.840 --> 00:49.000] On Russia's invasion of Ukraine, Hayashi said the act goes against i
[00:49.000 --> 00:51.000] should not be condoned.
[00:51.000 --> 00:55.800] Hayashi and Saifuddin confirmed that they will continue to coordinat
[00:55.800 --> 00:57.800] to the conflict.
[00:57.800 --> 01:24.760] Thai Prime Minister Prayut Chan-o-cha has a
real 79m3.566s
user 178m34.357s
sys 22m48.464s

large
$ time whisper ./20221010183000_english_1-00.mp3 --language English --model large
/home/ubuntu/src/whisper/venv/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarnin
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
[00:00.000 --> 00:08.000] This is Asian View from NHK World Japan. I'm Hiroko Kitadai.
[00:08.000 --> 00:15.000] Japanese Foreign Minister Hayashi Yoshimasa met his Malaysian counte
[00:15.000 --> 00:24.000] Hayashi conveyed his strong opposition to any attempt to unilaterall
[00:24.000 --> 00:30.000] Malaysia is a strategic partner that shares our basic values and str
[00:30.000 --> 00:35.000] I think we were able to have a very meaningful discussion on future
[00:35.000 --> 00:43.000] Hayashi also explained the importance of maintaining and strengtheni
[00:43.000 --> 00:50.000] On Russia's invasion of Ukraine, Hayashi said the act goes against i
[00:50.000 --> 00:57.000] Hayashi and Saifuddin confirmed that they will continue to coordinat
real 133m44.308s
user 289m38.870s
sys 49m56.902s

Time comparison (ja / en)
model    ja       en
tiny     6m       2m58
base     8m36     6m41
small    35m3     27m32
medium   84m22    79m3
large    210m5    133m44
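One way to read the table above is as a real-time factor: processing time divided by audio length. The sketch below assumes the audio lengths reported by ffprobe earlier in the deck (59.42 s for the Japanese clip, 60.00 s for the English segment); the helper names are made up for illustration.

```python
import re

AUDIO_SECONDS = {"ja": 59.42, "en": 60.00}  # durations taken from the ffprobe output

# Wall-clock times copied from the comparison table (minutes and seconds).
ELAPSED = {
    "tiny": {"ja": "6m", "en": "2m58"},
    "base": {"ja": "8m36", "en": "6m41"},
    "small": {"ja": "35m3", "en": "27m32"},
    "medium": {"ja": "84m22", "en": "79m3"},
    "large": {"ja": "210m5", "en": "133m44"},
}


def to_seconds(text: str) -> float:
    """Parse strings such as '6m', '8m36' or '84m22.932s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)?s?", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2) or 0)


def real_time_factor(elapsed: str, audio_seconds: float) -> float:
    """How many seconds of processing each second of audio needed."""
    return to_seconds(elapsed) / audio_seconds


for model, times in ELAPSED.items():
    factors = ", ".join(
        f"{lang}: {real_time_factor(t, AUDIO_SECONDS[lang]):.1f}x"
        for lang, t in times.items()
    )
    print(f"{model:7s} {factors}")
```

A factor above 1 means the model runs slower than real time on this VM.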

Transcription results and subtitle files
$ ls -1 大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし\ \[GiglWCcVi5o\].webm.*
'大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし [GiglWCcVi5o].webm.srt'
'大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし [GiglWCcVi5o].webm.txt'
'大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし [GiglWCcVi5o].webm.vtt'
$ cat 大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし\ \[GiglWCcVi5o\].webm.txt
予報センターから地震の情報をお伝えしています。
0時2分ごろ九州地方で最大震度5弱の地震が発生しました。
震源地は大隅半島東方沖。
地震の規模を示すマグニ中度は5.8。
震源の深さはおよそ30キロメートルとなっています。
この地震による津波の心配はありません。
この地震による津波の心配はありません。
震度5弱を観測したのは日南市。
震度5弱を観測したのは日南市。
震度4を観測したのは高鍋町、新富町、宮崎市、
久島市、宮古の城市、小林市となっています。
その他九州エリア広い範囲で震度3、震度2を観測しています。
この地震による津波の心配はありません。
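For readers unfamiliar with the .srt file listed above, here is a rough sketch of how timestamped segments map onto the SubRip layout (index, time range with a comma before the milliseconds, then text). It is an illustration only and not Whisper's own writer; the function names are hypothetical and the two sample segments are taken from the large-model output.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style used by SubRip."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) tuples as numbered SubRip blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


segments = [
    (0.0, 3.76, "予報センターから地震の情報をお伝えしています。"),
    (3.76, 10.08, "0時2分ごろ九州地方で最大震度5弱の地震が発生しました。"),
]
print(to_srt(segments))
```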

Embedding subtitles
YouTube's automatic transcription is less accurate than Whisper's, so swapping in the Whisper output looks worthwhile.
https://trac.ffmpeg.org/wiki/HowToBurnSubtitlesIntoVideo
$ ffmpeg -i video.webm -vf \
    subtitles=video.webm.vtt \
    video_sub.webm
For subtitles, the following looks useful (untested):
jianfch/stable-ts: Stabilizing timestamps of OpenAI's Whisper outputs down to word-level

Speech to Text to Translate
English only
To translate into languages other than English, translate-shell, or a trans that can use みんなの自動翻訳@TexTra, is also handy (needs a network connection)
$ whisper './大隅半島東方沖で地震　宮崎県で震度5弱　津波の心配なし [GiglWCcVi5o].m4a' \
    --language Japanese --model small --task translate 2>/dev/null
[00:00.000 --> 00:03.640] We have information about the earthquake from both centers.
[00:03.640 --> 00:09.960] The largest earthquake in the world occurred at around 2 a.m.
[00:09.960 --> 00:13.640] The earthquake was in Osumi-Hanto, Toho-Oki.
[00:13.640 --> 00:18.040] The magnitude of the earthquake was 5.8.
[00:18.040 --> 00:22.280] The height of the earthquake is approximately 30 km.
[00:22.280 --> 00:25.800] There is no worry about the tsunami caused by this earthquake.
[00:25.800 --> 00:29.600] There is no worry about the tsunami caused by this earthquake.
[00:29.600 --> 00:33.280] The earthquake that affected the 5-degrees in Shinto is Nichi-Nan-sh
[00:33.280 --> 00:36.360] The earthquake that affected the 5-degrees in Shinto is Nichi-Nan-sh
[00:36.360 --> 00:49.240] The earthquake that affected the 4-degrees in Shinto is Takanabe-cho
[00:49.240 --> 00:54.560] In addition, the earthquake in the Kyushu area and the Shinto-san in
[00:54.560 --> 00:59.160] There is no worry about the tsunami caused by this earthquake.
Translating on the command line
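To hand the English text on to another translator, the timestamp prefix printed by the whisper command has to be stripped first. The sketch below parses lines in the bracketed format shown in the output above; the function names are hypothetical, and it assumes each segment sits on a single line.

```python
import re

# Matches "[MM:SS.mmm --> MM:SS.mmm] text", with an optional leading "HH:".
SEGMENT_RE = re.compile(
    r"\[((?:\d+:)?\d+:\d+\.\d+) --> ((?:\d+:)?\d+:\d+\.\d+)\]\s*(.*)"
)


def clock_to_seconds(clock: str) -> float:
    """Convert 'MM:SS.mmm' or 'HH:MM:SS.mmm' into seconds."""
    total = 0.0
    for part in clock.split(":"):
        total = total * 60 + float(part)
    return total


def parse_segment(line: str):
    """Return (start, end, text) for a segment line, or None for other lines."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        return None
    start, end, text = match.groups()
    return clock_to_seconds(start), clock_to_seconds(end), text


line = "[00:13.640 --> 00:18.040] The magnitude of the earthquake was 5.8."
print(parse_segment(line))
```

Lines that are not segments, such as the FP16 warning, come back as None and can simply be skipped.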

Whisper.cpp
A C++ reimplementation of the Python-based Whisper
Lets you run Whisper quickly and with few resources
No GPU support
https://github.com/ggerganov/whisper.cpp
(I learned about it from "[Galene] Whisper transcriptions?")

build
$ git clone https://github.com/ggerganov/whisper.cpp
$ cd whisper.cpp
$ make
cc -I. -O3 -std=c11 -pthread -mfma -mf16c -mavx -mavx2 -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -pthread -c whisper.cpp -o whisper.o
g++ -I. -I./examples -O3 -std=c++11 -pthread examples/main/main.cpp ggml.o whisper.o -o main
./main -h
usage: ./main [options] file0.wav file1.wav ...
options:
  -h, --help                 show this help message and exit
  -s SEED, --seed SEED       RNG seed (default: -1)
  -t N, --threads N          number of threads to use during computation (default: 4)
  -p N, --processors N       number of processors to use during computation (default: 1)
  -ot N, --offset-t N        time offset in milliseconds (default: 0)
  -on N, --offset-n N        segment index offset (default: 0)
  -d N, --duration N         duration of audio to process in milliseconds (default: 0)
  -mc N, --max-context N     maximum number of text context tokens to store (default: max)
  -ml N, --max-len N         maximum segment length in characters (default: 0)
  -wt N, --word-thold N      word timestamp probability threshold (default: 0.010000)
  -su, --speed-up            speed up audio by factor of 2 (faster processing, reduced accuracy
  -v, --verbose              verbose output
  --translate                translate from source language to english
  -otxt, --output-txt        output result in a text file
  -ovtt, --output-vtt        output result in a vtt file
  -osrt, --output-srt        output result in a srt file
  -owts, --output-words      output script for generating karaoke video
  -ps, --print_special       print special tokens
  -pc, --print_colors        print colors
  -nt, --no_timestamps       do not print timestamps
  -l LANG, --language LANG   spoken language (default: en)
  -m FNAME, --model FNAME    model path (default: models/ggml-base.en.bin)
  -f FNAME, --file FNAME     input WAV file path

sample
$ ./main -m models/ggml-base.en.bin -f samples/jfk.wav
whisper_model_load: loading model from models/ggml-base.en.bin
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 2
whisper_model_load: mem_required = 506.00 MB
whisper_model_load: adding 1607 extra tokens
whisper_model_load: ggml ctx size = 140.60 MB
whisper_model_load: memory size = 22.83 MB
whisper_model_load: model size = 140.54 MB
system_info: n_threads = 4 / 4 | AVX2 = 1 | AVX512 = 0 | NEON = 0 | FP16_VA = 0 | WASM_SIMD = 0
main: processing samples/jfk.wav (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en
[00:00:00.000 --> 00:00:11.000] And so my fellow Americans, ask not what your country can do
whisper_print_timings: load time = 542.31 ms
whisper_print_timings: mel time = 174.94 ms
whisper_print_timings: sample time = 14.82 ms
whisper_print_timings: encode time = 6282.06 ms / 1047.01 ms per layer

Downloading models
$ bash ./models/download-ggml-model.sh
Usage: ./models/download-ggml-model.sh <model>
Available models: tiny.en tiny base.en base small.en small medium.en medium large
$ bash ./models/download-ggml-model.sh large
$ bash ./models/download-ggml-model.sh medium
:

Whisper.cpp input format
Whisper happily processed even video files, but Whisper.cpp seems to require WAV input
error: failed to open 'input.webm' as WAV file
The WAV also seems to need to be 16 kHz
./main: WAV file 'input.wav' must be 16 kHz
Conversion example
$ ffmpeg -i input.webm -ar 16000 out.wav
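A quick way to avoid the two errors above is to check a file before passing it to ./main. The following sketch uses only Python's standard wave module; the helper names are hypothetical, and the 16 kHz requirement is taken from the error message on the slide.

```python
import wave

REQUIRED_RATE = 16000  # sample rate demanded by the error message above


def write_silence(path: str, rate: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit silent WAV file (used here only for the demo)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))


def is_16khz_wav(path: str) -> bool:
    """Return True only if the file parses as WAV and has a 16 kHz sample rate."""
    try:
        with wave.open(path, "rb") as wav:
            return wav.getframerate() == REQUIRED_RATE
    except (wave.Error, EOFError):
        return False
```

Files that fail this check are candidates for the ffmpeg conversion shown on the slide.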

Speed comparison: Whisper vs. Whisper.cpp [1]
model    Whisper   cpp
large    29m38     8m9s
medium   12m46s    4m4s
small    2m28s     1m6s
base     47s       20s
tiny     27s       19s
1. One minute of audio on an Intel Core i5-7300U
Benchmark results → https://github.com/ggerganov/whisper.cpp/issues/89
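The table above can be turned into speed-up ratios (Whisper time divided by Whisper.cpp time). This is a small sketch over the figures copied from the slide; the parser and function names are made up for illustration.

```python
import re

# (Whisper, Whisper.cpp) times copied from the comparison table.
TIMES = {
    "large": ("29m38", "8m9s"),
    "medium": ("12m46s", "4m4s"),
    "small": ("2m28s", "1m6s"),
    "base": ("47s", "20s"),
    "tiny": ("27s", "19s"),
}


def to_seconds(text: str) -> int:
    """Parse strings such as '29m38', '8m9s' or '47s' into whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s?)?", text)
    if not text or match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1) or 0) * 60 + int(match.group(2) or 0)


def speedup(model: str) -> float:
    """Ratio of Whisper time to Whisper.cpp time for one model."""
    whisper_time, cpp_time = TIMES[model]
    return to_seconds(whisper_time) / to_seconds(cpp_time)


for model in TIMES:
    print(f"{model:7s} Whisper.cpp is about {speedup(model):.1f}x faster")
```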

Streaming transcription with Whisper.cpp
On an Intel Core i5-7300U, the base model cannot keep up and the message below is printed in large numbers. Even tiny sometimes seems unable to keep up.
$ sudo apt-get install libsdl2-dev
$ make stream
$ ./stream --language ja -m models/ggml-tiny.bin
main: WARNING: cannot process audio fast enough, dropping audio ...

Transcribing English audio in real time while translating it into Japanese with translate-shell
Setting up a loopback device and feeding in the audio of, say, a video meeting held in a language you do not understand is handy
$ ./stream -t 4 --language en -m models/ggml-tiny.bin | pee cat "trans :ja -b"
:
 Thanks for the main menu and the me use this for points for sure.
メインメニューに感謝します。私はこれをポイントとして使用しています。
[2K the fights are hot and we hope that we will be able to provide all the amenities that we
戦いは熱いので、私たちが提供できるすべてのアメニティを提供できることを願っています
[2K I don't know if I can take the phone to the phone I don't know if I can take the phone to the
電話を電話に持っていけるかどうかわからない電話を電話に持っていけるかどうかわからない
[2K I want to be a Nielsen, you all are even happy to be here. Yeah, be a Nielsen. Yeah, be a Niel
私はニールセンになりたいです。皆さんもここにいられて幸せです。ええ、ニールセンになりましょう。ええ、ニールセンに
^C
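The "[2K" fragments in the output above look like the remains of terminal control sequences (assumed here to be the ANSI "erase line" code, ESC[2K). If the stream output is post-processed before translation, such sequences can be stripped first; the sketch below is one hedged way to do that, with a hypothetical function name.

```python
import re

# ANSI CSI sequences: ESC [ parameters, optional intermediates, final byte.
CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def clean_line(line: str) -> str:
    """Remove ANSI control sequences and carriage returns, then trim whitespace."""
    return CSI_RE.sub("", line).replace("\r", "").strip()


raw = "\x1b[2K\r I don't know if I can take the phone to the phone\n"
print(clean_line(raw))
```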

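Summary
Whisper, the STT released by OpenAI, is accurate and conveniently usable locally
On CPU-only environments Whisper.cpp is faster and recommended (with a GPU, Whisper with its GPU option enabled is probably faster)
Combined with translate-shell and similar tools, the dream(?) of transcribing speech in 99 languages in real time and translating it into Japanese becomes possible
However, my own machines lack the processing power, so a somewhat stronger machine or the cloud may be needed?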

Colophon
Presented at: Kagoshima Linux Study Group 2022.11 (held online), 2022-11-27 (Sun)
Presenter: Kenichiro Matohara (matoken)
Software used: Asciidoctor Reveal.js
License: CC BY 4.0
