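OpenAI's open-source Whisper model is the leader in open-source speech-to-text. Its one flaw is that it cannot take advantage of Apple's M-series chips to speed up transcription. Whisper.cpp is a C/C++ port of the Whisper model: it has no dependencies and low memory usage, and, most importantly, it adds Core ML support, which suits Apple M-series chips well.
Whisper.cpp's tensor operators are heavily optimized for the CPU of Apple M chips. Depending on the size of the computation, it uses either Arm Neon SIMD intrinsics or CBLAS routines from the Accelerate framework. The latter are especially effective for larger sizes, because the Accelerate framework can use the dedicated AMX coprocessor available in Apple M-series chips.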
Configuring Whisper.cpp
As usual, run the git command to clone the Whisper.cpp project:
git clone https://github.com/ggerganov/whisper.cpp.git
Then enter the project directory:
cd whisper.cpp
The project's default base model does not support Chinese, so the medium model is recommended here. Download it with the shell script:
bash ./models/download-ggml-model.sh medium
Once the download finishes, the ggml-medium.bin model file, 1.53 GB in size, is saved in the project's models directory:
➜ whisper.cpp git:(master) cd models
➜ models git:(master) ll
total 3006000
-rw-r--r-- 1 liuyue staff 3.2K 4 21 07:21 README.md
-rw-r--r-- 1 liuyue staff 7.2K 4 21 07:21 convert-h5-to-ggml.py
-rw-r--r-- 1 liuyue staff 9.2K 4 21 07:21 convert-pt-to-ggml.py
-rw-r--r-- 1 liuyue staff 13K 4 21 07:21 convert-whisper-to-coreml.py
drwxr-xr-x 4 liuyue staff 128B 4 22 00:33 coreml-encoder-medium.mlpackage
-rwxr-xr-x 1 liuyue staff 2.1K 4 21 07:21 download-coreml-model.sh
-rw-r--r-- 1 liuyue staff 1.3K 4 21 07:21 download-ggml-model.cmd
-rwxr-xr-x 1 liuyue staff 2.0K 4 21 07:21 download-ggml-model.sh
-rw-r--r-- 1 liuyue staff 562K 4 21 07:21 for-tests-ggml-base.bin
-rw-r--r-- 1 liuyue staff 573K 4 21 07:21 for-tests-ggml-base.en.bin
-rw-r--r-- 1 liuyue staff 562K 4 21 07:21 for-tests-ggml-large.bin
-rw-r--r-- 1 liuyue staff 562K 4 21 07:21 for-tests-ggml-medium.bin
-rw-r--r-- 1 liuyue staff 573K 4 21 07:21 for-tests-ggml-medium.en.bin
-rw-r--r-- 1 liuyue staff 562K 4 21 07:21 for-tests-ggml-small.bin
-rw-r--r-- 1 liuyue staff 573K 4 21 07:21 for-tests-ggml-small.en.bin
-rw-r--r-- 1 liuyue staff 562K 4 21 07:21 for-tests-ggml-tiny.bin
-rw-r--r-- 1 liuyue staff 573K 4 21 07:21 for-tests-ggml-tiny.en.bin
-rwxr-xr-x 1 liuyue staff 1.4K 4 21 07:21 generate-coreml-interface.sh
-rwxr-xr-x@ 1 liuyue staff 769B 4 21 07:21 generate-coreml-model.sh
-rw-r--r-- 1 liuyue staff 1.4G 3 22 16:04 ggml-medium.bin
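The listing shows 1.4G for ggml-medium.bin while the text quotes 1.53 GB. The two figures agree once binary and decimal units are told apart. Below is a minimal sketch of that arithmetic. It assumes the 1462.12 MB that whisper.cpp prints as "model size" when loading this model is counted in binary megabytes (MiB), and that the listing uses binary units as well; the helper name mib_to_units is made up for illustration.

```python
MIB = 1024 ** 2  # bytes in one binary megabyte (assumed unit of the loader's "MB")


def mib_to_units(size_mib: float) -> dict:
    """Convert a size given in MiB to decimal GB and binary GiB."""
    size_bytes = size_mib * MIB
    return {
        "GB": size_bytes / 1000 ** 3,   # decimal gigabytes, as quoted in the text
        "GiB": size_bytes / 1024 ** 3,  # binary gigabytes, as the listing appears to use
    }


# 1462.12 is the "model size" value from the load log shown later in the article.
sizes = mib_to_units(1462.12)
print(f"{sizes['GB']:.2f} GB, {sizes['GiB']:.1f} GiB")
```

Under those assumptions, the same file rounds to 1.53 in decimal gigabytes and 1.4 in binary gigabytes.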
With the model downloaded, compile the executable from the root directory:
make
The program returns:
➜ whisper.cpp git:(master) make
I whisper.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS: -framework Accelerate
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
c++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread examples/bench/bench.cpp ggml.o whisper.o -o bench -framework Accelerate
At this point, Whisper.cpp is configured.
A Quick Test
Now let's test a speech clip and see how it does:
./main -osrt -m ./models/ggml-medium.bin -f samples/jfk.wav
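This command uses the ggml-medium.bin model downloaded earlier to recognize the samples/jfk.wav audio file that ships with the project. The clip comes from a famous speech by John F. Kennedy, the US president who was later assassinated. The program returns: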
➜ whisper.cpp git:(master) ./main -osrt -m ./models/ggml-medium.bin -f samples/jfk.wav
whisper_init_from_file_no_state: loading model from "./models/ggml-medium.bin"
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem required = 1725.00 MB (+ 43.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 1462.35 MB
whisper_model_load: model size = 1462.12 MB
whisper_init_state: kv self size = 42.00 MB
whisper_init_state: kv cross size = 140.62 MB
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing "samples/jfk.wav" (176000 samples, 11.0 sec), 4 threads, 1 processors, lang = en, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:11.000] And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
output_srt: saving output to "samples/jfk.wav.srt"
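That covers the entire 11-second clip, and the subtitles are also written to the samples/jfk.wav.srt file.
The English transcription of this sample is completely accurate.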
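Each transcribed segment in the console output has the form "[start --> end] text". As a rough sketch (not part of whisper.cpp), such lines can be turned into numeric timestamps for further processing. The names SEGMENT_RE, to_seconds and parse_segment are hypothetical helpers introduced here.

```python
import re

# Matches console lines such as "[00:00:00.000 --> 00:00:11.000] And so, ..."
SEGMENT_RE = re.compile(
    r"\[(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> "
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]\s*(.*)"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the four timestamp fields (as strings) to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_segment(line):
    """Return (start_seconds, end_seconds, text), or None for non-segment lines."""
    match = SEGMENT_RE.match(line.strip())
    if match is None:
        return None
    groups = match.groups()
    return to_seconds(*groups[0:4]), to_seconds(*groups[4:8]), groups[8]


line = ("[00:00:00.000 --> 00:00:11.000] And so, my fellow Americans, ask not what "
        "your country can do for you, ask what you can do for your country.")
start, end, text = parse_segment(line)
print(start, end, text)
```

Log lines that are not segments, such as the whisper_model_load lines, simply yield None, so the whole console output can be fed through parse_segment line by line.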
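Now let's switch to Chinese speech. You can record any clip you like. Note that Whisper.cpp only accepts audio files in WAV format, so first convert the mp3 file to WAV with ffmpeg: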
ffmpeg -i ./test1.mp3 -ar 16000 -ac 1 -c:a pcm_s16le ./test1.wav
The program returns:
ffmpeg version 5.1.2 Copyright (c) 2000-2022 the FFmpeg developers
built with Apple clang version 14.0.0 (clang-1400.0.29.202)
configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/5.1.2_1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libbluray --enable-libdav1d --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-neon
libavutil 57. 28.100 / 57. 28.100
libavcodec 59. 37.100 / 59. 37.100
libavformat 59. 27.100 / 59. 27.100
libavdevice 59. 7.100 / 59. 7.100
libavfilter 8. 44.100 / 8. 44.100
libswscale 6. 7.100 / 6. 7.100
libswresample 4. 7.100 / 4. 7.100
libpostproc 56. 6.100 / 56. 6.100
[mp3 @ 0x130e05580] Estimating duration from bitrate, this may be inaccurate
Input #0, mp3, from "./test1.mp3":
Duration: 00:05:41.33, start: 0.000000, bitrate: 48 kb/s
Stream #0:0: Audio: mp3, 24000 Hz, mono, fltp, 48 kb/s
Stream mapping:
Stream #0:0 -> #0:0 (mp3 (mp3float) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to "./test1.wav":
Metadata:
ISFT : Lavf59.27.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Metadata:
encoder : Lavc59.37.100 pcm_s16le
[mp3float @ 0x132004260] overread, skip -6 enddists: -4 -4ed=N/A
Last message repeated 1 times
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1
[mp3float @ 0x132004260] overread, skip -7 enddists: -2 -2
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1
[mp3float @ 0x132004260] overread, skip -9 enddists: -2 -2
[mp3float @ 0x132004260] overread, skip -5 enddists: -1 -1
Last message repeated 1 times
[mp3float @ 0x132004260] overread, skip -7 enddists: -3 -3
[mp3float @ 0x132004260] overread, skip -8 enddists: -5 -5
[mp3float @ 0x132004260] overread, skip -5 enddists: -2 -2
[mp3float @ 0x132004260] overread, skip -6 enddists: -1 -1
[mp3float @ 0x132004260] overread, skip -7 enddists: -3 -3
[mp3float @ 0x132004260] overread, skip -6 enddists: -2 -2
[mp3float @ 0x132004260] overread, skip -6 enddists: -3 -3
[mp3float @ 0x132004260] overread, skip -7 enddists: -6 -6
[mp3float @ 0x132004260] overread, skip -9 enddists: -6 -6
[mp3float @ 0x132004260] overread, skip -5 enddists: -3 -3
[mp3float @ 0x132004260] overread, skip -5 enddists: -2 -2
[mp3float @ 0x132004260] overread, skip -5 enddists: -3 -3
[mp3float @ 0x132004260] overread, skip -7 enddists: -1 -1
size= 10667kB time=00:05:41.32 bitrate= 256.0kbits/s speed=2.08e+03x
video:0kB audio:10666kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.000714%
This converts a recording of five minutes and forty-one seconds into a WAV file.
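The ffmpeg flags above request 16 kHz (-ar 16000), mono (-ac 1), 16-bit PCM (pcm_s16le). Before transcribing, it can be worth confirming that a file really has that layout. Here is a minimal sketch using Python's standard wave module. The helper names is_whisper_ready and write_silence are made up for this example, and the silent files exist only to exercise the check.

```python
import os
import tempfile
import wave


def is_whisper_ready(path):
    """Check for the layout requested by the ffmpeg flags: 16 kHz, mono, 16-bit PCM."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000   # -ar 16000
            and wav.getnchannels() == 1   # -ac 1
            and wav.getsampwidth() == 2   # pcm_s16le: 2 bytes per sample
        )


def write_silence(path, rate, seconds=1):
    """Write a silent mono 16-bit WAV file; demo input only."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.wav")
    bad = os.path.join(tmp, "bad.wav")
    write_silence(good, 16000)  # the target rate
    write_silence(bad, 24000)   # the rate of the original mp3 in the log above
    results = (is_whisper_ready(good), is_whisper_ready(bad))

print(results)
```

In practice the same function would be pointed at the converted test1.wav rather than at generated silence.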
Then run the command to start transcribing (the WAV file is read from the samples directory here):
./main -osrt -m ./models/ggml-medium.bin -f samples/test1.wav -l zh
The -l parameter must be added here to tell the program that the speech is Chinese. The program returns:
➜ whisper.cpp git:(master) ./main -osrt -m ./models/ggml-medium.bin -f samples/test1.wav -l zh
whisper_init_from_file_no_state: loading model from "./models/ggml-medium.bin"
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 1024
whisper_model_load: n_audio_head = 16
whisper_model_load: n_audio_layer = 24
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 1024
whisper_model_load: n_text_head = 16
whisper_model_load: n_text_layer = 24
whisper_model_load: n_mels = 80
whisper_model_load: f16 = 1
whisper_model_load: type = 4
whisper_model_load: mem required = 1725.00 MB (+ 43.00 MB per decoder)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: model ctx = 1462.35 MB
whisper_model_load: model size = 1462.12 MB
whisper_init_state: kv self size = 42.00 MB
whisper_init_state: kv cross size = 140.62 MB
system_info: n_threads = 4 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 | COREML = 0 |
main: processing "samples/test1.wav" (5461248 samples, 341.3 sec), 4 threads, 1 processors, lang = zh, task = transcribe, timestamps = 1 ...
[00:00:00.000 --> 00:00:03.340] Hello 大家好,这里是刘越的技术博客。
[00:00:03.340 --> 00:00:05.720] 最近的事情大家都晓得了,
[00:00:05.720 --> 00:00:07.880] 某公司技术经理魅上欺下,
[00:00:07.880 --> 00:00:10.380] 打工人应对进队,不易快灾,
[00:00:10.380 --> 00:00:12.020] 不易壮灾,
[00:00:12.020 --> 00:00:14.280] 所谓魅上者必欺下,
[00:00:14.280 --> 00:00:16.020] 古人诚不我窃。
[00:00:16.020 --> 00:00:17.360] 技术经理者,
[00:00:17.360 --> 00:00:20.160] 公然在聊天群里大玩职场PUA,
[00:00:20.160 --> 00:00:22.400] 气焰嚣张,有恃无恐,
[00:00:22.400 --> 00:00:23.700] 最终引发众目,
[00:00:23.700 --> 00:00:26.500] 嘿嘿,技术经理,团队领导,
[00:00:26.500 --> 00:00:29.300] 原来团队领导这四个字是这么用的,
[00:00:29.300 --> 00:00:31.540] 奴媚显达,构陷下属,
[00:00:31.540 --> 00:00:32.780] 人文巨损,
[00:00:32.780 --> 00:00:33.840] 逢迎上意,
[00:00:33.840 --> 00:00:34.980] 傲然下欺,
[00:00:34.980 --> 00:00:36.080] 装腔作势,
[00:00:36.080 --> 00:00:37.180] 极尽投机,
[00:00:37.180 --> 00:00:38.320] 负他人之负,
[00:00:38.320 --> 00:00:39.620] 康他人之愷,
[00:00:39.620 --> 00:00:42.180] 如此者,可谓团队领导也。
[00:00:42.180 --> 00:00:43.980] 中国的所谓传统文化,
[00:00:43.980 --> 00:00:45.320] 除了仁义理智性,
[00:00:45.320 --> 00:00:46.620] 除了金石子极,
[00:00:46.620 --> 00:00:47.820] 除了争争风骨,
[00:00:47.820 --> 00:00:49.560] 其实还有很多别的东西,
[00:00:49.560 --> 00:00:52.020] 被大家或有意或无意的忽视了,
[00:00:52.020 --> 00:00:53.300] 比如功利实用,
[00:00:53.300 --> 00:00:54.300] 屈颜附示,
[00:00:54.300 --> 00:00:55.360] 以兼至善,
[00:00:55.360 --> 00:01:01.000] 官本位和钱规则的传统,在某种程度上,传统文化这没硬币的另一面,
[00:01:01.000 --> 00:01:03.900] 才是更需要我们去面对和正视的,
[00:01:03.900 --> 00:01:07.140] 我以为,这在目前盛行实惠价值观的时候,
[00:01:07.140 --> 00:01:08.940] 提一提还是必要的,
[00:01:08.940 --> 00:01:10.240] 有的人说了,
[00:01:10.240 --> 00:01:13.740] 在开发群里对领导,非常痛快,非常爽,
[00:01:13.740 --> 00:01:17.180] 但是,然后呢,有用吗?
[00:01:17.180 --> 00:01:19.260] 倒霉的还不是自己,
[00:01:19.260 --> 00:01:22.520] 没错,这就是功利且实用的传统,
[00:01:22.520 --> 00:01:28.780] 各种精神,思辨,反抗,愤怒,都抵不过三个字,有用吗?
[00:01:28.780 --> 00:01:31.820] 事实上,但凡叫做某种精神的,
[00:01:31.820 --> 00:01:33.320] 那就是哲学思辨,
[00:01:33.320 --> 00:01:36.220] 就是一种相对无用的思辨和学术,
[00:01:36.220 --> 00:01:39.180] 而中国职场有很强的实用传统,
[00:01:39.180 --> 00:01:42.140] 但这不是学术思辨,也没有理论构架,
[00:01:42.140 --> 00:01:44.380] 仅仅是一种短视的经验论,
[00:01:44.380 --> 00:01:47.220] 所以,功利主义,是密尔,
[00:01:47.220 --> 00:01:48.980] 编庆的伦理价值学说,
[00:01:48.980 --> 00:01:52.700] 强调的是,追求幸福,如何获得最大效用,
[00:01:52.700 --> 00:01:55.580] 实用主义,是西方的一个学术流派,
[00:01:55.580 --> 00:01:58.260] 比如杜威,胡适,就是代表,
[00:01:58.260 --> 00:02:01.180] 实用主义的另一个名字,叫人本主义,
[00:02:01.180 --> 00:02:04.780] 意思是,以人作为经验和万物的尺度,
[00:02:04.780 --> 00:02:06.080] 换句话说,
[00:02:06.080 --> 00:02:09.420] 功利主义,反对的正是那种短视的功利,
[00:02:09.420 --> 00:02:13.220] 实用主义,反对的也正是那种凡是看对自己,
[00:02:13.220 --> 00:02:15.220] 是不是有利的局限判断,
[00:02:15.220 --> 00:02:17.260] 而在中国职场功利,
[00:02:17.260 --> 00:02:21.060] 实用的传统中,恰恰是不会有这些理论构架的,
[00:02:21.060 --> 00:02:23.700] 并且,不仅没有理论构架,
[00:02:23.700 --> 00:02:26.140] 还要对那些无用的,思辨的,
[00:02:26.140 --> 00:02:29.980] 纯粹的精神,视如避喜,吃之以鼻,
[00:02:29.980 --> 00:02:32.260] 没错,在技术团队里,
[00:02:32.260 --> 00:02:35.260] 我们重视技术,重视实用的科学,
[00:02:35.260 --> 00:02:38.900] 但是主流职场并不鼓励去搞那些看似无用的东西,
[00:02:38.900 --> 00:02:41.380] 比如普通劳动者的合法权益,
[00:02:41.380 --> 00:02:43.580] 张义谋的满江红,
[00:02:43.580 --> 00:02:45.220] 大家想必也都看了的,
[00:02:45.220 --> 00:02:46.820] 人们总觉得很奇怪,
[00:02:46.820 --> 00:02:48.300] 为什么那么坏的人,
[00:02:48.300 --> 00:02:50.020] 皇帝为啥不罢免他?
[00:02:50.020 --> 00:02:53.140] 为什么小人能当权来构陷好人呢?
[00:02:53.140 --> 00:02:55.980] 当我们了解了传统文化中的法家思想,
[00:02:55.980 --> 00:02:57.300] 就了然了,
[00:02:57.300 --> 00:02:59.260] 在法家的思想规则下,
[00:02:59.260 --> 00:03:01.660] 小人得是,忠良备辱,
[00:03:01.660 --> 00:03:03.140] 事事所必然,
[00:03:03.140 --> 00:03:04.900] 因为他一开始的设定,
[00:03:04.900 --> 00:03:07.540] 就使得劣币驱逐良币的游戏规则,
[00:03:07.540 --> 00:03:09.940] 所以,在这种观念下,
[00:03:09.940 --> 00:03:12.460] 古代常见的一种职场智慧就是,
[00:03:12.460 --> 00:03:14.820] 自污名节,以求自保,
[00:03:14.820 --> 00:03:16.420] 在这种环境下,
[00:03:16.420 --> 00:03:17.780] 要想生存,
[00:03:17.780 --> 00:03:19.260] 就只有一条出路,
[00:03:19.260 --> 00:03:20.900] 那就是依附权力,
[00:03:20.900 --> 00:03:23.700] 并且,谁能拥有更大的权力,
[00:03:23.700 --> 00:03:25.700] 谁就能生存得更好,
[00:03:25.700 --> 00:03:27.500] 如何依附权力呢?
[00:03:27.500 --> 00:03:29.180] 那就是现在正在发生的,
[00:03:29.180 --> 00:03:31.900] 肆无忌惮的大腕职场PUA,
[00:03:31.900 --> 00:03:33.060] 除此之外,
[00:03:33.060 --> 00:03:34.340] 这种权力关系,
[00:03:34.340 --> 00:03:36.900] 在古代会渗透到方方面面,
[00:03:36.900 --> 00:03:40.300] 因为权力系统是一个复杂而高效的运行机器,
[00:03:40.300 --> 00:03:42.940] CPU,内存,硬盘,
[00:03:42.940 --> 00:03:44.900] 甚至一颗C面底螺丝钉,
[00:03:44.900 --> 00:03:47.140] 都是权力机器上的一个环节,
[00:03:47.140 --> 00:03:48.060] 于是,
[00:03:48.060 --> 00:03:50.420] 官僚体系之外的一切职场人,
[00:03:50.420 --> 00:03:52.340] 都会面临一个尴尬的处境,
[00:03:52.340 --> 00:03:54.340] 一方面遭遇权力的打压,
[00:03:54.340 --> 00:03:55.340] 另一方面,
[00:03:55.340 --> 00:03:57.900] 也都会多少尝到权力的甜头,
[00:03:57.900 --> 00:03:58.900] 于是乎,
[00:03:58.900 --> 00:04:01.420] 权力的细胞渗透到角角落落,
[00:04:01.420 --> 00:04:02.980] 即便没有组织权力,
[00:04:02.980 --> 00:04:04.620] 也要追求文化权力,
[00:04:04.620 --> 00:04:05.500] 父权,
[00:04:05.500 --> 00:04:06.380] 夫权,
[00:04:06.380 --> 00:04:07.460] 家长权力,
[00:04:07.460 --> 00:04:08.580] 宗族权力,
[00:04:08.580 --> 00:04:09.660] 老师权力,
[00:04:09.660 --> 00:04:10.780] 公司权力,
[00:04:10.780 --> 00:04:12.140] 团队领导权力,
[00:04:12.140 --> 00:04:13.100] 点点滴滴,
[00:04:13.100 --> 00:04:15.580] 滴滴点点,追逐权力,
[00:04:15.580 --> 00:04:18.140] 几乎成为人们生活的全部意义,
[00:04:18.140 --> 00:04:18.980] 故而,
[00:04:18.980 --> 00:04:19.980] 服从权力,
[00:04:19.980 --> 00:04:21.180] 服从上级,
[00:04:21.180 --> 00:04:22.420] 不得罪同事,
[00:04:22.420 --> 00:04:23.660] 不得罪朋友,
[00:04:23.660 --> 00:04:25.060] 不得罪陌生人,
[00:04:25.060 --> 00:04:26.100] 因为你不知道,
[00:04:26.100 --> 00:04:28.260] 他们背后有什么的权力关系,
[00:04:28.260 --> 00:04:30.940] 他们又会不会用这个权力来对付你,
[00:04:30.940 --> 00:04:31.940] 没错,
[00:04:31.940 --> 00:04:34.380] 当我们解构群里那位领导的行为时,
[00:04:34.380 --> 00:04:36.220] 我们也在解构我们自己,
[00:04:36.220 --> 00:04:37.420] 毫无疑问,
[00:04:37.420 --> 00:04:39.380] 对于这位敢于发声的职场人,
[00:04:39.380 --> 00:04:41.180] 深安职场底层逻辑的,
[00:04:41.180 --> 00:04:43.220] 我们一定能猜到他的结局,
[00:04:43.220 --> 00:04:44.700] 他的结局是注定的,
[00:04:44.700 --> 00:04:46.220] 同时也是悲哀的,
[00:04:46.220 --> 00:04:47.340] 问题是,
[00:04:47.340 --> 00:04:48.540] 这样做,
[00:04:48.540 --> 00:04:49.660] 值得吗?
[00:04:49.660 --> 00:04:52.580] 香港著名导演王家卫拍过一部电影,
[00:04:52.580 --> 00:04:54.420] 叫做东邪西毒,
[00:04:54.420 --> 00:04:56.340] 电影中有这样一个情节,
[00:04:56.340 --> 00:04:59.620] 有个女人的弟弟被太尉府的一群刀客杀了,
[00:04:59.620 --> 00:05:00.860] 他想报仇,
[00:05:00.860 --> 00:05:02.300] 可自己没有武功,
[00:05:02.300 --> 00:05:04.060] 只能请刀客出手,
[00:05:04.060 --> 00:05:05.540] 但家里穷没钱,
[00:05:05.540 --> 00:05:08.540] 最有价值的资产是一篮子鸡蛋,
[00:05:08.540 --> 00:05:09.260] 于是,
[00:05:09.260 --> 00:05:10.900] 他提着那一篮子鸡蛋,
[00:05:10.900 --> 00:05:13.420] 天天站在刀客剑客们经过的路口,
[00:05:13.420 --> 00:05:14.700] 请求他们出手,
[00:05:14.700 --> 00:05:16.220] 报仇就是鸡蛋,
[00:05:16.220 --> 00:05:17.860] 没有人愿意为了鸡蛋,
[00:05:17.860 --> 00:05:20.020] 去单挑太尉府的刀客,
[00:05:20.020 --> 00:05:21.460] 除了洪七,
[00:05:21.460 --> 00:05:24.260] 洪七独自力战太尉府那帮刀客,
[00:05:24.260 --> 00:05:26.780] 所得的报仇是一个鸡蛋,
[00:05:26.780 --> 00:05:29.020] 但是洪七付出的代价太大,
[00:05:29.020 --> 00:05:30.060] 混战中,
[00:05:30.060 --> 00:05:32.700] 洪七被对手砍断了一根手指,
[00:05:32.700 --> 00:05:33.820] 为了一个鸡蛋,
[00:05:33.820 --> 00:05:35.500] 而失去一只手指,
[00:05:35.500 --> 00:05:36.740] 值得吗?
[00:05:36.740 --> 00:05:37.860] 不值得,
[00:05:37.860 --> 00:05:39.300] 但是我觉得痛快,
[00:05:39.300 --> 00:05:40.540] 因為這才是我自己
output_srt: saving output to "samples/test1.wav.srt"
whisper_print_timings: load time = 978.82 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 438.81 ms
whisper_print_timings: sample time = 980.66 ms / 2343 runs ( 0.42 ms per run)
whisper_print_timings: encode time = 31476.10 ms / 13 runs ( 2421.24 ms per run)
whisper_print_timings: decode time = 47833.70 ms / 2343 runs ( 20.42 ms per run)
whisper_print_timings: total time = 81797.88 ms
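The recording of five minutes and forty-one seconds is transcribed in about 82 seconds (total time = 81797.88 ms), roughly four times faster than real time. Full marks for efficiency.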
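The speed claim can be checked directly from the log: the audio lasts 341.3 seconds and whisper_print_timings reports the total time in milliseconds. A small sketch of that calculation follows; total_time_seconds and speedup are illustrative names, not whisper.cpp functions.

```python
import re


def total_time_seconds(log_text):
    """Extract the 'total time' value (printed in ms) from whisper_print_timings output."""
    match = re.search(r"total time\s*=\s*([\d.]+)\s*ms", log_text)
    if match is None:
        raise ValueError("no 'total time' line found")
    return float(match.group(1)) / 1000


def speedup(audio_seconds, log_text):
    """How many times faster than real time the transcription ran."""
    return audio_seconds / total_time_seconds(log_text)


log = "whisper_print_timings: total time = 81797.88 ms"
factor = speedup(341.3, log)  # 341.3 sec is the audio length from the 'processing' line
print(f"{factor:.2f}x real time")
```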
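Of course, the accuracy still leaves room for improvement. The large model can be chosen for better accuracy, but transcription time will increase accordingly.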
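To compare models or languages without retyping the command, the invocation can be wrapped in a small helper. This is only a sketch: build_command and transcribe are hypothetical names, and the file name ggml-large.bin is an assumption that the large model follows the same naming pattern as ggml-medium.bin. Only the flags already used in this article (-osrt, -m, -f, -l) appear.

```python
import subprocess


def build_command(wav_path, model="medium", language=None, binary="./main"):
    """Assemble the argument list for the whisper.cpp example binary."""
    # Assumption: every model is stored as ./models/ggml-<name>.bin
    cmd = [binary, "-osrt", "-m", f"./models/ggml-{model}.bin", "-f", wav_path]
    if language is not None:
        cmd += ["-l", language]  # for example "zh" for Chinese
    return cmd


def transcribe(wav_path, **kwargs):
    """Run whisper.cpp; needs a compiled ./main and a downloaded model."""
    return subprocess.run(build_command(wav_path, **kwargs), check=True)


cmd = build_command("samples/test1.wav", model="large", language="zh")
print(" ".join(cmd))
```

Passing the arguments as a list keeps file names with spaces intact, which a single shell string would not.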
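Apple M-Chip Model Conversion
Users on Apple's Mac systems are in luck: Whisper.cpp can run encoder inference on the Apple Neural Engine (ANE) via Core ML, which can be more than three times faster than running on the CPU alone.
First, install the conversion dependencies: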
pip install ane_transformers
pip install openai-whisper
pip install coremltools
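Before running the conversion, a quick check that the three packages can actually be imported may save a failed run. The sketch below uses importlib from the standard library. The mapping reflects that openai-whisper is imported as whisper, and missing_packages is a made-up helper name.

```python
from importlib.util import find_spec

# pip package name -> module name used by `import`
REQUIRED = {
    "ane_transformers": "ane_transformers",
    "openai-whisper": "whisper",
    "coremltools": "coremltools",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose modules cannot be found in this environment."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


# An empty list means all three conversion dependencies are importable.
print(missing_packages())
```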
Next, run the conversion script:
./models/generate-coreml-model.sh medium
The argument here is the name of the model.
The program returns:
➜ models git:(master) python3 convert-whisper-to-coreml.py --model medium --encoder-only True
scikit-learn version 1.2.0 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
ModelDimensions(n_mels=80, n_audio_ctx=1500, n_audio_state=1024, n_audio_head=16, n_audio_layer=24, n_vocab=51865, n_text_ctx=448, n_text_state=1024, n_text_head=16, n_text_layer=24)
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:166: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
/opt/homebrew/lib/python3.10/site-packages/whisper/model.py:97: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the "trunc" function NOT "floor"). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode="trunc"), or for actual floor division, use torch.div(a, b, rounding_mode="floor").
scale = (n_state // self.n_head) ** -0.25
Converting PyTorch Frontend ==> MIL Ops: 100%|▉| 1971/1972 [00:00<00:00, 3247.25
Running MIL frontend_pytorch pipeline: 100%|█| 5/5 [00:00<00:00, 54.69 passes/s]
Running MIL default pipeline: 100%|████████| 57/57 [00:09<00:00, 6.29 passes/s]
Running MIL backend_mlprogram pipeline: 100%|█| 10/10 [00:00<00:00, 444.13 passe
done converting
Once the conversion is done, recompile:
make clean
WHISPER_COREML=1 make -j
Then simply transcribe with the converted model:
./main -m models/ggml-medium.bin -f samples/jfk.wav
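One way to confirm that the rebuilt binary was compiled with Core ML is to read the system_info line, which showed COREML = 0 in the CPU-only runs earlier. The sketch below parses that line into a dictionary. The sample line is abbreviated from the log above, parse_system_info is a hypothetical helper, and the expectation that a Core ML build prints COREML = 1 is an assumption, since the article only shows the CPU-only log.

```python
def parse_system_info(line):
    """Turn 'system_info: n_threads = 4 / 10 | AVX = 0 | ...' into a dict of strings."""
    _, _, body = line.partition("system_info:")
    flags = {}
    for field in body.split("|"):
        key, sep, value = field.partition("=")
        if sep:  # skip the empty field after the trailing '|'
            flags[key.strip()] = value.strip()
    return flags


# Abbreviated copy of the system_info line from the CPU-only run.
line = ("system_info: n_threads = 4 / 10 | AVX = 0 | NEON = 1 | ARM_FMA = 1 | "
        "BLAS = 1 | COREML = 0 | ")
info = parse_system_info(line)
print("Core ML enabled:", info["COREML"] == "1")
```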
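With that, Mac users are instantly promoted to first-class citizens.
Conclusion
Whisper.cpp is Whisper reborn in both spirit and body. It inherits all of Whisper's functionality and, on top of that, improves the speed and efficiency of speech-to-text transcription as well as cross-platform portability. The rapid progress of open-source technology teaches us one thing: spreading high-quality technology is far more valuable than the technology itself.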