Faster Whisper Skill Usage Guide

2026-03-27 Source: 网淘吧

Faster Whisper

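Local speech-to-text with faster-whisper, a CTranslate2-based reimplementation of OpenAI's Whisper that runs 4 to 6 times faster at the same accuracy. With GPU acceleration, transcription reaches roughly 20x real-time speed (a 10-minute audio file takes about 30 seconds).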

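When to Use

Use this skill when you need to:

Transcribe audio/video files: meetings, interviews, podcasts, lectures, YouTube videos
Generate subtitles: SRT, VTT, ASS, LRC, or TTML broadcast-standard captions
Identify speakers: diarization labels who said what (--diarize)
Transcribe from URLs: YouTube links and direct audio URLs (auto-downloaded via yt-dlp)
Transcribe podcast feeds: --rss <feed URL> fetches and transcribes episodes
Batch process files: glob patterns, directories, and skip-existing are supported; an ETA is shown automatically
Run speech-to-text locally: no API costs, works offline (after the model is downloaded)
Translate to English: translate any language to English with --translate
Transcribe in many languages: 99+ languages with auto-detection
Batch-transcribe files in different languages: --language-map assigns a language per file
Transcribe multilingual audio: --multilingual handles mixed-language audio
Transcribe audio with specific terminology: use --initial-prompt for jargon-heavy content
Preprocess noisy audio: --normalize and --denoise run before transcription
Stream output: --stream shows segments as they are transcribed
Clip time ranges: --clip-timestamps transcribes specific sections
Search the transcript: --search "term" finds every timestamp where a word or phrase appears
Detect chapters: --detect-chapters finds chapter breaks from silence gaps
Export speaker audio: --export-speakers DIR saves each speaker's turns as separate WAV files
Produce spreadsheet output: --format csv writes a properly quoted CSV with timestamps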

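Trigger phrases: "transcribe this audio", "speech to text", "what did they say", "make a transcript", "audio to text", "subtitle this video", "who's speaking", "translate this audio", "translate to English", "find where X is mentioned", "search the transcript", "when did they say", "at what timestamp", "add chapters", "detect chapters", "find breaks in the audio", "table of contents for this recording", "TTML subtitles", "DFXP subtitles", "broadcast format subtitles", "Netflix format", "ASS subtitles", "aegisub format", "advanced substation alpha", "mpv subtitles", "LRC subtitles", "timed lyrics", "karaoke subtitles", "music player lyrics", "HTML transcript", "confidence-colored transcript", "color-coded transcript", "separate audio by speaker", "export speaker audio", "split by speaker", "transcript as CSV", "spreadsheet output", "transcribe podcast", "podcast RSS feed", "batch with different languages", "per-file language", "transcribe in multiple formats", "srt and txt at the same time", "output both srt and text", "remove filler words", "clean up ums and uhs", "remove hesitation sounds", "strip 'you know' and 'I mean'", "transcribe left channel", "transcribe right channel", "stereo channel", "left channel only", "wrap subtitle lines", "character limit per line", "max chars per subtitle line", "detect paragraphs", "paragraph breaks", "group into paragraphs", "add paragraph spacing"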

⚠️ Agent guidance: keep invocations minimal

Core rule: the default command (./scripts/transcribe audio.mp3) is the fastest path; add a flag only when the user explicitly asks for that capability.

Transcription:

Only add --diarize if the user asks "who said what" / "identify speakers" / "label speakers"
Only add --format srt/vtt/ass/lrc/ttml if the user asks for subtitles/captions in that format
Only add --format csv if the user asks for CSV or spreadsheet output
Only add --word-timestamps if the user needs word-level timing
Only add --initial-prompt if there is domain-specific jargon to prime
Only add --translate if the user wants non-English audio translated to English
Only add --normalize/--denoise if the user mentions bad audio quality or noise
Only add --stream if the user wants live/progressive output for long files
Only add --clip-timestamps if the user wants a specific time range
Only add --temperature 0.0 if the model is hallucinating on music/silence
Only add --vad-threshold if VAD is cutting off speech too aggressively or letting noise through
Only add --min-speakers/--max-speakers when you know the speaker count
Only add --hf-token if the token is not cached at ~/.cache/huggingface/token
Only add --max-words-per-line for subtitle readability on long segments
Only add --filter-hallucinations if the transcript contains obvious artifacts (music markers, duplicates)
Only add --merge-sentences if the user asks for sentence-level subtitle cues
Only add --clean-filler if the user asks to remove filler words (um, uh, you know, I mean, hesitation sounds)
Only add --channel left|right if the user mentions stereo tracks, dual-channel recordings, or a specific channel
Only add --max-chars-per-line N when the user specifies a character limit per subtitle line (e.g., "Netflix format", "42 chars per line"); it takes priority over --max-words-per-line
Only add --detect-paragraphs if the user asks for paragraph breaks or structured text output; add --paragraph-gap (default 3.0 s) only if they want a custom gap
Only add --speaker-names "Alice,Bob" when the user provides real names to replace SPEAKER_1/2; it always requires --diarize
Only add --hotwords WORDS when the user names specific rare terms that --initial-prompt handles poorly; prefer --initial-prompt for general domain jargon
Only add --prefix TEXT when the user knows the exact words the audio starts with
Only add --detect-language-only when the user only wants to identify the language, not transcribe
Only add --stats-file PATH if the user asks for performance stats, RTF, or benchmark info
Only add --parallel N for large CPU batch jobs; the GPU handles one file efficiently on its own, so skip it for single files or small batches
Only add --retries N for unreliable inputs (URLs, network files) where transient failures are expected
Only add --burn-in OUTPUT when the user explicitly asks to embed/burn subtitles into the video; it requires ffmpeg and a video file input
Only add --keep-temp when the user may re-process the same URL, to avoid re-downloading
Only add --output-template in batch mode when the user specifies a custom naming pattern
Multi-format output (--format srt,text): use only when the user explicitly wants several formats in one pass; always pair it with -o <dir>
Any word-level feature automatically runs wav2vec2 alignment (about 5-10 s of extra overhead)
--diarize adds about 20-30 s on top of that
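
The core rule above can be sketched as a small command builder. This is an illustrative sketch only: `build_command` and its keyword arguments are hypothetical names, not part of the skill; the flags themselves are the ones listed above.

```python
from typing import List, Optional


def build_command(
    audio: str,
    *,
    want_speakers: bool = False,
    subtitle_format: Optional[str] = None,
    translate: bool = False,
) -> List[str]:
    """Build a transcribe invocation (hypothetical helper, not part of the skill).

    Starts from the bare default command and appends a flag only when the
    matching user request is present.
    """
    cmd = ["./scripts/transcribe", audio]
    if want_speakers:                 # user asked "who said what"
        cmd.append("--diarize")
    if subtitle_format is not None:   # user asked for srt/vtt/ass/lrc/ttml
        cmd += ["--format", subtitle_format]
    if translate:                     # user asked for an English translation
        cmd.append("--translate")
    return cmd


print(build_command("audio.mp3"))
print(build_command("meeting.wav", want_speakers=True, subtitle_format="srt"))
```

With no explicit request the result is just the default command, which is the fastest path.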

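Search:

Only add --search "term" when the user asks to find, locate, or search for a specific word or phrase in the audio
--search replaces the normal transcript output; it prints only the matching segments with their timestamps
Add --search-fuzzy only when the user mentions approximate/partial matching or typos
To save search results to a file, use -o results.txt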
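
As a rough illustration of the two search modes, the sketch below does a case-insensitive substring match and, optionally, a fuzzy match with `difflib.SequenceMatcher` at the 0.6 ratio the options reference mentions. Comparing the term word by word is an assumption made for this example, as are the segment dictionaries and the function name; the skill's actual matching strategy may differ.

```python
from difflib import SequenceMatcher


def search_segments(segments, term, fuzzy=False, threshold=0.6):
    """Return segments whose text contains `term` (hypothetical helper).

    Exact mode: case-insensitive substring match.
    Fuzzy mode: additionally accept a segment when any single word has a
    SequenceMatcher ratio >= threshold against the term (assumed strategy).
    """
    needle = term.lower()
    hits = []
    for seg in segments:
        text = seg["text"].lower()
        if needle in text:
            hits.append(seg)
            continue
        if fuzzy:
            words = (w.strip(".,!?;:") for w in text.split())
            if any(SequenceMatcher(None, needle, w).ratio() >= threshold for w in words):
                hits.append(seg)
    return hits


segments = [
    {"start": 12.0, "text": "This is approximate."},
    {"start": 30.0, "text": "Nothing here"},
]
for seg in search_segments(segments, "aproximate", fuzzy=True):
    print(f"{seg['start']:.1f}s  {seg['text']}")
```

The misspelled query finds nothing in exact mode but matches the first segment in fuzzy mode.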

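Chapter detection:

Only add --detect-chapters when the user asks for chapters, sections, a table of contents, or "where the topic changes"
The default --chapter-gap 8 (8 seconds of silence = new chapter) works for most podcasts/lectures; lower it for dense content
--chapter-format youtube (default) outputs YouTube-ready timestamps; use json for programmatic use
Always use --chapters-file PATH when combining chapters with transcript output, so chapter markers do not get mixed into the transcript text
If the user wants only the chapters (not the transcript), send the transcript to -o /dev/null and write the chapters with --chapters-file
Batch-mode limitation: --chapters-file takes a single path, so in batch mode each file's chapters overwrite the previous file's. For batch chapter detection, omit --chapters-file (chapters are printed to stdout under a === Chapters (N) === header) or run the command once per file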
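
A minimal sketch of silence-gap chapter detection, assuming segments are dictionaries with `start` and `end` times in seconds. The function names are hypothetical, and the real script may choose titles and boundaries differently; only the gap rule and the YouTube-style line format come from the text above.

```python
def detect_chapters(segments, gap=8.0):
    """Return chapter start times: the first segment, plus every segment
    that begins at least `gap` seconds after the previous one ended."""
    starts = []
    prev_end = None
    for seg in segments:
        if prev_end is None or seg["start"] - prev_end >= gap:
            starts.append(seg["start"])
        prev_end = seg["end"]
    return starts


def youtube_lines(starts):
    """Format chapter starts as YouTube description lines (M:SS or H:MM:SS)."""
    lines = []
    for number, t in enumerate(starts, 1):
        hours, rem = divmod(int(t), 3600)
        minutes, secs = divmod(rem, 60)
        stamp = f"{hours}:{minutes:02d}:{secs:02d}" if hours else f"{minutes}:{secs:02d}"
        lines.append(f"{stamp} Chapter {number}")
    return lines


segments = [
    {"start": 0.0, "end": 10.0},
    {"start": 11.0, "end": 20.0},   # 1 s gap: same chapter
    {"start": 30.0, "end": 40.0},   # 10 s gap: new chapter
]
print("\n".join(youtube_lines(detect_chapters(segments))))
```

Lowering `gap` yields more chapters, which mirrors the advice to reduce --chapter-gap for dense content.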

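Speaker audio export:

Only add --export-speakers DIR when the user explicitly asks to save each speaker's audio separately
Always pair it with --diarize; if no speaker labels exist, the export is silently skipped
Requires ffmpeg; output files are SPEAKER_1.wav, SPEAKER_2.wav, etc. (or real names if --speaker-names is set)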

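Language map:

Only add --language-map in batch mode, when the user confirms that different files are in different languages
Inline format: "interview*.mp3=en,lecture*.mp3=fr" (fnmatch glob matching on filenames)
JSON file format: @/path/to/map.json, where the file contains {"pattern": "lang_code"}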
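
The lookup can be sketched with Python's `fnmatch`, following the priority order given in the options reference (exact filename, exact stem, glob on filename, glob on stem, then fallback). The helper names are hypothetical and the parsing is deliberately simplified, so treat this as an illustration rather than the script's implementation.

```python
from fnmatch import fnmatch
from pathlib import Path


def parse_inline_map(spec):
    """Parse 'pattern=lang,pattern=lang' into a dict (hypothetical helper)."""
    mapping = {}
    for item in spec.split(","):
        pattern, _, lang = item.partition("=")
        mapping[pattern.strip()] = lang.strip()
    return mapping


def resolve_language(filename, mapping, fallback=None):
    """Pick a language for one file using the documented priority order."""
    name = Path(filename).name
    stem = Path(filename).stem
    if name in mapping:                      # 1. exact filename
        return mapping[name]
    if stem in mapping:                      # 2. exact stem
        return mapping[stem]
    for pattern, lang in mapping.items():    # 3. glob on filename
        if fnmatch(name, pattern):
            return lang
    for pattern, lang in mapping.items():    # 4. glob on stem
        if fnmatch(stem, pattern):
            return lang
    return fallback                          # 5. --language or auto-detect


mapping = parse_inline_map("interview*.mp3=en,lecture*.mp3=fr")
for f in ["interview_01.mp3", "lecture-2.mp3", "other.mp3"]:
    print(f, "->", resolve_language(f, mapping, fallback="auto"))
```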

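RSS / podcasts:

Only add --rss URL when the user provides a podcast RSS feed URL; it fetches the latest 5 episodes by default
--rss-latest 0 fetches all episodes; --skip-existing makes it safe to resume
Always pair --rss with -o <dir>; otherwise every episode's transcript is concatenated to stdout, which is hard to use. With -o <dir>, each episode gets its own file

Output formats for agent relay: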

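Search results (--search) → print directly to the user; the output is human-readable
Chapter output → if --chapters-file is not given, chapters appear on stdout after a === Chapters (N) === header; with --format json, chapters are also embedded under the "chapters" key
Subtitle formats (SRT, VTT, ASS, LRC, TTML) → always write to an -o file; tell the user the output path and never paste raw subtitle content
Data formats (CSV, HTML, TTML, JSON) → always write to an -o file; tell the user the output path and do not paste raw XML/CSV/HTML
ASS format → for Aegisub, VLC, mpv; write the file and tell the user it can be opened in Aegisub or played in VLC/mpv
LRC format → timed lyrics for music players (Foobar2000, AIMP, VLC); write to a file
Multi-format (--format srt,text) → requires -o <dir>; each format is saved to its own file; tell the user every path written
JSON format → for programmatic post-processing; not suited to pasting in full
Text/transcript → safe to show directly for short files; summarize long ones
Stats output (--stats-file) → summarize the key fields (duration, processing time, RTF) for the user instead of pasting raw JSON
Language detection (--detect-language-only) → print the result directly; it is a single line
ETA → printed to stderr automatically for batch jobs; no action needed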

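When not to use:

Cloud-only environments with no local compute
Files shorter than 10 seconds, where API call latency does not matter

faster-whisper vs. whisperx: this skill covers everything whisperx does, including diarization (--diarize), word-level timestamps (--word-timestamps), and SRT/VTT subtitles, so whisperx is not needed. Use whisperx only if you specifically need its pyannote pipeline or batched GPU features not covered here.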

Quick Reference

Task	Command	Notes
Basic transcription	`./scripts/transcribe audio.mp3`	Batched inference, VAD on, distil-large-v3.5 model
SRT subtitles	`./scripts/transcribe audio.mp3 --format srt -o subs.srt`	Word timestamps enabled automatically
VTT subtitles	`./scripts/transcribe audio.mp3 --format vtt -o subs.vtt`	WebVTT format
Word timestamps	`./scripts/transcribe audio.mp3 --word-timestamps --format srt`	wav2vec2 alignment (~10 ms)
Speaker diarization	`./scripts/transcribe audio.mp3 --diarize`	Requires pyannote.audio
Translate → English	`./scripts/transcribe audio.mp3 --translate`	Any language → English
Stream output	`./scripts/transcribe audio.mp3 --stream`	Live segments as they are transcribed
Clip time range	`./scripts/transcribe audio.mp3 --clip-timestamps "30,60"`	Only 30 s to 60 s
Denoise + normalize	`./scripts/transcribe audio.mp3 --denoise --normalize`	Cleans up noisy audio first
Reduce hallucinations	`./scripts/transcribe audio.mp3 --hallucination-silence-threshold 1.0`	Skips silent sections where the model hallucinates
YouTube/URL	`./scripts/transcribe https://youtube.com/watch?v=...`	Auto-downloads via yt-dlp
Batch processing	`./scripts/transcribe *.mp3 -o ./transcripts/`	Output to a directory
Batch with skip-existing	`./scripts/transcribe *.mp3 --skip-existing -o ./out/`	Resumes an interrupted batch
Domain terminology	`./scripts/transcribe audio.mp3 --initial-prompt 'Kubernetes gRPC'`	Improves rare-term recognition
Hotword boosting	`./scripts/transcribe audio.mp3 --hotwords 'JIRA Kubernetes'`	Biases the decoder toward specific words
Prefix conditioning	`./scripts/transcribe audio.mp3 --prefix 'Good morning,'`	Seeds the first segment with a known opening
Pin model revision	`./scripts/transcribe audio.mp3 --revision v1.2.0`	Reproducible transcription with a pinned revision
Debug library logs	`./scripts/transcribe audio.mp3 --log-level debug`	Shows faster_whisper internal logs
Turbo model	`./scripts/transcribe audio.mp3 -m turbo`	Alias for large-v3-turbo
Faster English transcription	`./scripts/transcribe audio.mp3 --model distil-medium.en -l en`	English-only, 6.8x faster
Maximum accuracy	`./scripts/transcribe audio.mp3 --model large-v3 --beam-size 10`	Full model
JSON output	`./scripts/transcribe audio.mp3 --format json -o out.json`	Programmatic access, includes stats
Filter noise	`./scripts/transcribe audio.mp3 --min-confidence 0.6`	Drops low-confidence segments
Hybrid quantization	`./scripts/transcribe audio.mp3 --compute-type int8_float16`	Saves VRAM, minimal quality loss
Reduce batch size	`./scripts/transcribe audio.mp3 --batch-size 4`	If the GPU runs out of memory
TSV output	`./scripts/transcribe audio.mp3 --format tsv -o out.tsv`	OpenAI Whisper-compatible TSV
Fix hallucinations	`./scripts/transcribe audio.mp3 --temperature 0.0 --no-speech-threshold 0.8`	Locks temperature + skips silence
Tune VAD sensitivity	`./scripts/transcribe audio.mp3 --vad-threshold 0.6 --min-silence-duration 500`	Stricter speech detection
Known speaker count	`./scripts/transcribe meeting.wav --diarize --min-speakers 2 --max-speakers 3`	Constrains diarization
Subtitle line wrapping	`./scripts/transcribe audio.mp3 --format srt --word-timestamps --max-words-per-line 8`	Splits long subtitle cues
Private/gated model	`./scripts/transcribe audio.mp3 --hf-token hf_xxx`	Passes the token directly
Show version	`./scripts/transcribe --version`	Prints the faster-whisper version
In-place upgrade	`./setup.sh --update`	Upgrades without a full reinstall
System check	`./setup.sh --check`	Verifies GPU, Python, ffmpeg, venv, yt-dlp, pyannote
Detect language only	`./scripts/transcribe audio.mp3 --detect-language-only`	Fast language identification, no transcription
Language detection as JSON	`./scripts/transcribe audio.mp3 --detect-language-only --format json`	Machine-readable language detection
LRC subtitles	`./scripts/transcribe audio.mp3 --format lrc -o lyrics.lrc`	Timed-lyrics format for music players
ASS subtitles	`./scripts/transcribe audio.mp3 --format ass -o subtitles.ass`	Advanced SubStation Alpha (Aegisub, mpv, VLC)
Merge sentences	`./scripts/transcribe audio.mp3 --format srt --merge-sentences`	Merges segments into sentence chunks
Stats sidecar	`./scripts/transcribe audio.mp3 --stats-file stats.json`	Writes a performance-stats JSON after transcription
Batch stats	`./scripts/transcribe *.mp3 --stats-file ./stats/`	One stats file per input file in the directory
Template naming	`./scripts/transcribe audio.mp3 -o ./out/ --output-template "{stem}_{lang}.{ext}"`	Custom batch output filenames
Standard input	`ffmpeg -i input.mp4 -f wav - | ./scripts/transcribe -`	Pipes audio directly from stdin
Custom model directory	`./scripts/transcribe audio.mp3 --model-dir ~/my-models`	Custom HuggingFace cache directory
Local model	`./scripts/transcribe audio.mp3 -m ./my-model-ct2`	CTranslate2 model directory
HTML transcript	`./scripts/transcribe audio.mp3 --format html -o out.html`	Confidence-colored
Burn in subtitles	`./scripts/transcribe video.mp4 --burn-in output.mp4`	Requires ffmpeg + video input
Name speakers	`./scripts/transcribe audio.mp3 --diarize --speaker-names "Alice,Bob"`	Replaces SPEAKER_1/2
Filter hallucinations	`./scripts/transcribe audio.mp3 --filter-hallucinations`	Removes artifacts
Keep temp files	`./scripts/transcribe https://... --keep-temp`	For re-processing URLs
Parallel batch	`./scripts/transcribe *.mp3 --parallel 4 -o ./out/`	CPU multi-file processing
RTX 3070 recommended	`./scripts/transcribe audio.mp3 --compute-type int8_float16`	Saves ~1 GB VRAM, minimal quality loss
CPU thread count	`./scripts/transcribe audio.mp3 --threads 8`	Forces the CPU thread count (default: auto)
Podcast RSS (latest 5)	`./scripts/transcribe --rss https://feeds.example.com/podcast.xml`	Downloads and transcribes the 5 newest episodes
Podcast RSS (all episodes)	`./scripts/transcribe --rss https://... --rss-latest 0 -o ./episodes/`	All episodes, one file each
Podcast + SRT subtitles	`./scripts/transcribe --rss https://... --format srt -o ./subs/`	Generates subtitles for the fetched episodes
Retry on failure	`./scripts/transcribe *.mp3 --retries 3 -o ./out/`	Retries up to 3 times on error (with backoff)
CSV output	`./scripts/transcribe audio.mp3 --format csv -o out.csv`	Spreadsheet-ready, with header row; properly quoted
CSV with diarization	`./scripts/transcribe audio.mp3 --diarize --format csv -o out.csv`	Adds a speaker column
Language map (inline)	`./scripts/transcribe *.mp3 --language-map "interview*.mp3=en,lecture.wav=fr"`	Per-file language in a batch
Language map (JSON)	`./scripts/transcribe *.mp3 --language-map @langs.json`	JSON file: {"pattern": "lang"}
Batch with ETA	`./scripts/transcribe *.mp3 -o ./out/`	Automatic ETA shown for each file in the batch
TTML subtitles	`./scripts/transcribe audio.mp3 --format ttml -o subtitles.ttml`	Broadcast-standard DFXP/TTML (Netflix, BBC, Amazon)
TTML with speaker labels	`./scripts/transcribe audio.mp3 --diarize --format ttml -o subtitles.ttml`	Speaker labels included in the TTML
Search transcript	`./scripts/transcribe audio.mp3 --search "keyword"`	Finds the timestamps where the keyword appears
Search to file	`./scripts/transcribe audio.mp3 --search "keyword" -o results.txt`	Saves search results
Fuzzy search	`./scripts/transcribe audio.mp3 --search "aproximate" --search-fuzzy`	Approximate/partial matching
Detect chapters	`./scripts/transcribe audio.mp3 --detect-chapters`	Auto-detects chapters from silence gaps
Chapter gap tuning	`./scripts/transcribe audio.mp3 --detect-chapters --chapter-gap 5`	Splits chapters at gaps ≥5 s (default: 8 s)
Chapters to file	`./scripts/transcribe audio.mp3 --detect-chapters --chapters-file ch.txt`	Saves a YouTube-format chapter list
Chapters as JSON	`./scripts/transcribe audio.mp3 --detect-chapters --chapter-format json`	Machine-readable chapter list
Export speaker audio	`./scripts/transcribe audio.mp3 --diarize --export-speakers ./speakers/`	Saves each speaker's audio as a separate WAV file
Multi-format output	`./scripts/transcribe audio.mp3 --format srt,text -o ./out/`	Writes SRT + TXT in one pass
Remove filler words	`./scripts/transcribe audio.mp3 --clean-filler`	Strips um/uh/er/ah/hmm and discourse markers
Left channel only	`./scripts/transcribe audio.mp3 --channel left`	Extracts the left stereo channel before transcription
Right channel only	`./scripts/transcribe audio.mp3 --channel right`	Extracts the right stereo channel
Max chars per line	`./scripts/transcribe audio.mp3 --format srt --max-chars-per-line 42`	Character-based subtitle wrapping
Detect paragraphs	`./scripts/transcribe audio.mp3 --detect-paragraphs`	Inserts paragraph breaks in text output
Paragraph gap tuning	`./scripts/transcribe audio.mp3 --detect-paragraphs --paragraph-gap 5.0`	Adjusts the paragraph gap threshold (default: 3.0 s)
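
The paragraph rule behind the last two rows is spelled out in the options reference: a new paragraph starts when the silence gap reaches --paragraph-gap, or when the previous segment ends a sentence and the gap is at least 1.5 s. A minimal sketch of that rule follows; the segment dictionaries, the function name, and the choice of sentence-ending characters are assumptions made for illustration.

```python
def group_paragraphs(segments, paragraph_gap=3.0, sentence_gap=1.5):
    """Join segment texts into paragraphs separated by blank lines.

    A break is inserted when the gap to the previous segment is
    >= paragraph_gap, or when the previous segment ends a sentence
    and the gap is >= sentence_gap.
    """
    paragraphs, current, prev = [], [], None
    for seg in segments:
        if prev is not None:
            gap = seg["start"] - prev["end"]
            ends_sentence = prev["text"].rstrip().endswith((".", "!", "?"))
            if gap >= paragraph_gap or (ends_sentence and gap >= sentence_gap):
                paragraphs.append(" ".join(current))
                current = []
        current.append(seg["text"].strip())
        prev = seg
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)


segments = [
    {"start": 0.0, "end": 2.0, "text": "Hello there."},
    {"start": 2.2, "end": 4.0, "text": "How are you"},
    {"start": 4.1, "end": 5.0, "text": "today?"},
    {"start": 7.0, "end": 9.0, "text": "New topic."},   # sentence end + 2.0 s gap
    {"start": 14.0, "end": 15.0, "text": "Later"},      # 5.0 s gap
]
print(group_paragraphs(segments))
```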

Model Selection

Choose a model based on your needs:

digraph model_selection {
    rankdir=LR;
    node [shape=box, style=rounded];

    start [label="Start", shape=doublecircle];
    need_accuracy [label="Need maximum\naccuracy?", shape=diamond];
    multilingual [label="Multilingual\ncontent?", shape=diamond];
    resource_constrained [label="Resource\nconstraints?", shape=diamond];

    large_v3 [label="large-v3\nor\nlarge-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    large_turbo [label="large-v3-turbo", style="rounded,filled", fillcolor=lightblue];
    distil_large [label="distil-large-v3.5\n(default)", style="rounded,filled", fillcolor=lightgreen];
    distil_medium [label="distil-medium.en", style="rounded,filled", fillcolor=lightyellow];
    distil_small [label="distil-small.en", style="rounded,filled", fillcolor=lightyellow];

    start -> need_accuracy;
    need_accuracy -> large_v3 [label="yes"];
    need_accuracy -> multilingual [label="no"];
    multilingual -> large_turbo [label="yes"];
    multilingual -> resource_constrained [label="no (English)"];
    resource_constrained -> distil_small [label="mobile/edge"];
    resource_constrained -> distil_medium [label="some limits"];
    resource_constrained -> distil_large [label="no"];
}

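Model Tables

Standard models (full Whisper)

Model	Size	Speed	Accuracy	Use case
`tiny`/`tiny.en`	39M	Fastest	Basic	Quick drafts
`base`/`base.en`	74M	Very fast	Good	General use
`small`/`small.en`	244M	Fast	Better	Most tasks
`medium`/`medium.en`	769M	Medium	High	High-quality transcription
`large-v1/v2/v3`	1.5GB	Slower	Best	Maximum accuracy
`large-v3-turbo`	809M	Fast	Excellent	High accuracy (slower than distilled models)

Distilled models (~6x faster, within ~1% WER)

Model	Size	Speed vs. standard	Accuracy	Use case
`distil-large-v3.5`	756M	~6.3x faster	7.08% WER	Default, best balance
`distil-large-v3`	756M	~6.3x faster	7.53% WER	Previous default
`distil-large-v2`	756M	~5.8x faster	10.1% WER	Fallback
`distil-medium.en`	394M	~6.8x faster	11.1% WER	English-only, resource-constrained
`distil-small.en`	166M	~5.6x faster	12.1% WER	Mobile/edge devices

.en models are English-only and slightly faster/better on English content.

Note on distilled models: HuggingFace recommends disabling condition_on_previous_text for all distilled models to prevent repetition loops. The script applies --no-condition-on-previous-text automatically when a distil-* model is detected. Pass --condition-on-previous-text to override.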

Custom and Fine-Tuned Models

WhisperModel accepts both local CTranslate2 model directories and HuggingFace repo names; no code changes are needed.

Load a local CTranslate2 model

./scripts/transcribe audio.mp3 --model /path/to/my-model-ct2

Convert a HuggingFace model to CTranslate2 format

pip install ctranslate2
ct2-transformers-converter \
  --model openai/whisper-large-v3 \
  --output_dir whisper-large-v3-ct2 \
  --copy_files tokenizer.json preprocessor_config.json \
  --quantization float16
./scripts/transcribe audio.mp3 --model ./whisper-large-v3-ct2

Load a model by HuggingFace repo name (auto-downloads)

./scripts/transcribe audio.mp3 --model username/whisper-large-v3-ct2

Custom model cache directory

By default, models are cached in ~/.cache/huggingface/. Use --model-dir to override the path:

./scripts/transcribe audio.mp3 --model-dir ~/my-models

Installation

Linux / macOS / WSL2

# Base install (creates venv, installs deps, auto-detects GPU)
./setup.sh

# With speaker diarization support
./setup.sh --diarize

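Requirements:

Python 3.10+
ffmpeg is not required for basic transcription; PyAV (bundled with faster-whisper) handles audio decoding. ffmpeg is needed only for --burn-in, --normalize, --denoise, --channel, and --export-speakers.
Optional: yt-dlp (for URL/YouTube input)
Optional: pyannote.audio (for --diarize; installed via setup.sh --diarize)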

Platform Support

Platform	Acceleration	Speed
Linux + NVIDIA GPU	CUDA	~20x realtime 🚀
WSL2 + NVIDIA GPU	CUDA	~20x realtime 🚀
macOS Apple Silicon	CPU*	~3-5x realtime
macOS Intel	CPU	~1-2x realtime
Linux (no GPU)	CPU	~1x realtime

*faster-whisper uses CTranslate2, which is CPU-only on macOS, but Apple Silicon is fast enough for practical use.

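GPU Support (Important!)

The setup script auto-detects your GPU and installs PyTorch with CUDA support. Always use the GPU when one is available; CPU transcription is very slow.

Hardware	Speed	9-minute video
RTX 3070 (GPU)	~20x realtime	~27 s
CPU (int8)	~0.3x realtime	~30 min

RTX 3070 tip: use --compute-type int8_float16 for hybrid quantization; it saves about 1 GB of VRAM with minimal quality loss. Ideal for running diarization alongside transcription.

If setup does not detect your GPU, install PyTorch with CUDA manually: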

# For CUDA 12.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.x
uv pip install --python .venv/bin/python torch --index-url https://download.pytorch.org/whl/cu118

WSL2 users: make sure the NVIDIA CUDA driver for WSL is installed on Windows.

Usage

# Basic transcription
./scripts/transcribe audio.mp3

# SRT subtitles
./scripts/transcribe audio.mp3 --format srt -o subtitles.srt

# WebVTT subtitles
./scripts/transcribe audio.mp3 --format vtt -o subtitles.vtt

# Transcribe from YouTube URL
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ --language en

# Speaker diarization
./scripts/transcribe meeting.wav --diarize

# Diarized VTT subtitles
./scripts/transcribe meeting.wav --diarize --format vtt -o meeting.vtt

# Prime with domain terminology
./scripts/transcribe lecture.mp3 --initial-prompt "Kubernetes, gRPC, PostgreSQL, NGINX"

# Batch process a directory
./scripts/transcribe ./recordings/ -o ./transcripts/

# Batch with glob, skip already-done files
./scripts/transcribe *.mp3 --skip-existing -o ./transcripts/

# Filter low-confidence segments
./scripts/transcribe noisy-audio.mp3 --min-confidence 0.6

# JSON output with full metadata
./scripts/transcribe audio.mp3 --format json -o result.json

# Specify language (faster than auto-detect)
./scripts/transcribe audio.mp3 --language en

Options

Input:
  AUDIO                 Audio file(s), directory, glob pattern, or URL
                        Accepts: mp3, wav, m4a, flac, ogg, webm, mp4, mkv, avi, wma, aac
                        URLs auto-download via yt-dlp (YouTube, direct links, etc.)

Model & Language:
  -m, --model NAME      Whisper model (default: distil-large-v3.5; "turbo" = large-v3-turbo)
  --revision REV        Model revision (git branch/tag/commit) to pin a specific version
  -l, --language CODE   Language code, e.g. en, es, fr (auto-detects if omitted)
  --initial-prompt TEXT  Prompt to condition the model (terminology, formatting style)
  --prefix TEXT         Prefix to condition the first segment (e.g. known starting words)
  --hotwords WORDS      Space-separated hotwords to boost recognition
  --translate           Translate any language to English (instead of transcribing)
  --multilingual        Enable multilingual/code-switching mode (helps smaller models)
  --hf-token TOKEN      HuggingFace token for private/gated models and diarization
  --model-dir PATH      Custom model cache directory (default: ~/.cache/huggingface/)

Output Format:
  -f, --format FMT      text | json | srt | vtt | tsv | lrc | html | ass | ttml (default: text)
                        Accepts comma-separated list: --format srt,text writes both in one pass
                        Multi-format requires -o <dir> when saving to files
  --word-timestamps     Include word-level timestamps (wav2vec2 aligned automatically)
  --stream              Output segments as they are transcribed (disables diarize/alignment)
  --max-words-per-line N  For SRT/VTT, split segments into sub-cues of at most N words
  --max-chars-per-line N  For SRT/VTT/ASS/TTML, split lines so each fits within N characters
                        Takes priority over --max-words-per-line when both are set
  --clean-filler        Remove hesitation fillers (um, uh, er, ah, hmm, hm) and discourse markers
                        (you know, I mean, you see) from transcript text. Off by default.
  --detect-paragraphs   Insert paragraph breaks (blank lines) in text output at natural boundaries.
                        A new paragraph starts when: silence gap ≥ --paragraph-gap, OR the previous
                        segment ends a sentence AND the gap ≥ 1.5s.
  --paragraph-gap SEC   Minimum silence gap in seconds to start a new paragraph (default: 3.0).
                        Used with --detect-paragraphs.
  --channel {left,right,mix}
                        Stereo channel to transcribe: left (c0), right (c1), or mix (default: mix).
                        Extracts the channel via ffmpeg before transcription. Requires ffmpeg.
  --merge-sentences     Merge consecutive segments into sentence-level chunks
                        (improves SRT/VTT readability; groups by terminal punctuation or >2s gap)
  -o, --output PATH     Output file or directory (directory for batch mode)
  --output-template TEMPLATE
                        Batch output filename template. Variables: {stem}, {lang}, {ext}, {model}
                        Example: "{stem}_{lang}.{ext}" → "interview_en.srt"

Inference Tuning:
  --beam-size N         Beam search size; higher = more accurate but slower (default: 5)
  --temperature T       Sampling temperature or comma-separated fallback list, e.g.
                        '0.0' or '0.0,0.2,0.4' (default: faster-whisper's schedule)
  --no-speech-threshold PROB
                        Probability threshold to mark segments as silence (default: 0.6)
  --batch-size N        Batched inference batch size (default: 8; reduce if OOM)
  --no-vad              Disable voice activity detection (on by default)
  --vad-threshold T     VAD speech probability threshold (default: 0.5)
  --vad-neg-threshold T VAD negative threshold for ending speech (default: auto)
  --vad-onset T         Alias for --vad-threshold (legacy)
  --vad-offset T        Alias for --vad-neg-threshold (legacy)
  --min-speech-duration MS  Minimum speech segment duration in ms (default: 0)
  --max-speech-duration SEC Maximum speech segment duration in seconds (default: unlimited)
  --min-silence-duration MS Minimum silence before splitting a segment in ms (default: 2000)
  --speech-pad MS       Padding around speech segments in ms (default: 400)
  --no-batch            Disable batched inference (use standard WhisperModel)
  --hallucination-silence-threshold SEC
                        Skip silent sections where model hallucinates (e.g. 1.0)
  --no-condition-on-previous-text
                        Don't condition on previous text (reduces repetition/hallucination loops;
                        auto-enabled for distil models per HuggingFace recommendation)
  --condition-on-previous-text
                        Force-enable conditioning on previous text (overrides auto-disable for distil models)
  --compression-ratio-threshold RATIO
                        Filter segments above this compression ratio (default: 2.4)
  --log-prob-threshold PROB
                        Filter segments below this avg log probability (default: -1.0)
  --max-new-tokens N    Maximum tokens per segment (prevents runaway generation)
  --clip-timestamps RANGE
                        Transcribe specific time ranges: '30,60' or '0,30;60,90' (seconds)
  --progress            Show transcription progress bar
  --best-of N           Candidates when sampling with non-zero temperature (default: 5)
  --patience F          Beam search patience factor (default: 1.0)
  --repetition-penalty F  Penalty for repeated tokens (default: 1.0)
  --no-repeat-ngram-size N  Prevent n-gram repetitions of this size (default: 0 = off)

Advanced Inference:
  --no-timestamps       Output text without timing info (faster; incompatible with
                        --word-timestamps, --format srt/vtt/tsv, --diarize)
  --chunk-length N      Audio chunk length in seconds for batched inference (default: auto)
  --language-detection-threshold T
                        Confidence threshold for language auto-detection (default: 0.5)
  --language-detection-segments N
                        Audio segments to sample for language detection (default: 1)
  --length-penalty F    Beam search length penalty; >1 favors longer, <1 favors shorter (default: 1.0)
  --prompt-reset-on-temperature T
                        Reset initial prompt when temperature fallback hits threshold (default: 0.5)
  --no-suppress-blank   Disable blank token suppression (may help soft/quiet speech)
  --suppress-tokens IDS Comma-separated token IDs to suppress in addition to default -1
  --max-initial-timestamp T
                        Maximum timestamp for the first segment in seconds (default: 1.0)
  --prepend-punctuations CHARS
                        Punctuation characters merged into preceding word (default: "'¿([{-)
  --append-punctuations CHARS
                        Punctuation characters merged into following word (default: "'.。,，!！?？:：")]}、")

Preprocessing:
  --normalize           Normalize audio volume (EBU R128 loudnorm) before transcription
  --denoise             Apply noise reduction (high-pass + FFT denoise) before transcription

Advanced:
  --diarize             Speaker diarization (requires pyannote.audio)
  --min-speakers N      Minimum number of speakers hint for diarization
  --max-speakers N      Maximum number of speakers hint for diarization
  --speaker-names NAMES Comma-separated names to replace SPEAKER_1, SPEAKER_2 (e.g. 'Alice,Bob')
                        Requires --diarize
  --min-confidence PROB Filter segments below this avg word confidence (0.0–1.0)
  --skip-existing       Skip files whose output already exists (batch mode)
  --detect-language-only
                        Detect language and exit (no transcription). Output: "Language: en (probability: 0.984)"
                        With --format json: {"language": "en", "language_probability": 0.984}
  --stats-file PATH     Write JSON stats sidecar after transcription (processing time, RTF, word count, etc.)
                        Directory path → writes {stem}.stats.json inside; file path → exact path
  --burn-in OUTPUT      Burn subtitles into the original video (single-file mode only; requires ffmpeg)
  --filter-hallucinations
                        Filter common Whisper hallucinations: music/applause markers, duplicate segments,
                        'Thank you for watching', lone punctuation, etc.
  --keep-temp           Keep temp files from URL downloads (useful for re-processing without re-downloading)
  --parallel N          Number of parallel workers for batch processing (default: sequential)
  --retries N           Retry failed files up to N times with exponential backoff (default: 0;
                        incompatible with --parallel)

Batch ETA:
  Automatically shown for sequential batch jobs (no flag needed). After each file completes,
  the next file's progress line includes:  [current/total] filename | ETA: Xm Ys
  ETA is calculated from average time per file × remaining files.
  Shown to stderr (surfaced to users via OpenClaw/Clawdbot output).

Language Map (per-file language override):
  --language-map MAP    Per-file language override for batch mode. Two forms:
                          Inline: "interview*.mp3=en,lecture.wav=fr,keynote.wav=de"
                          JSON file: "@/path/to/map.json"  (must be {pattern: lang} dict)
                        Patterns support fnmatch globs on filename or stem.
                        Priority: exact filename > exact stem > glob on filename > glob on stem > fallback.
                        Files not matched fall back to --language (or auto-detect if not set).

Transcript Search:
  --search TERM         Search the transcript for TERM and print matching segments with timestamps.
                        Replaces normal transcript output (use -o to save results to a file).
                        Case-insensitive exact substring match by default.
  --search-fuzzy        Enable fuzzy/approximate matching with --search (useful for typos, phonetic
                        near-misses, or partial words; uses SequenceMatcher ratio ≥ 0.6)

Chapter Detection:
  --detect-chapters     Auto-detect chapter/section breaks from silence gaps and print chapter markers.
                        Output is printed after the transcript (or to --chapters-file).
  --chapter-gap SEC     Minimum silence gap in seconds between consecutive segments to start a new
                        chapter (default: 8.0). Tune down for dense speech, up for sparse content.
  --chapters-file PATH  Write chapter markers to this file (default: stdout after transcript)
  --chapter-format FMT  youtube | text | json — chapter output format:
                          youtube: "0:00 Chapter 1" (YouTube description ready)
                          text:    "Chapter 1: 00:00:00"
                          json:    JSON array with chapter, start, title fields
                        (default: youtube)

Speaker Audio Export:
  --export-speakers DIR After diarization, export each speaker's audio turns concatenated into
                        separate WAV files saved in DIR. Requires --diarize and ffmpeg.
                        Output: SPEAKER_1.wav, SPEAKER_2.wav, … (or real names if --speaker-names set)

RSS / Podcast:
  --rss URL             Podcast RSS feed URL — extracts audio enclosures and transcribes them.
                        AUDIO positional is optional when --rss is used.
  --rss-latest N        Number of most-recent episodes to process (default: 5; 0 = all episodes)

Device:
  --device DEV          auto | cpu | cuda (default: auto)
  --compute-type TYPE   auto | int8 | int8_float16 | float16 | float32 (default: auto)
                        int8_float16 = hybrid mode for GPU (saves VRAM, minimal quality loss)
  --threads N           CPU thread count for CTranslate2 (default: auto)
  -q, --quiet           Suppress progress and status messages
  --log-level LEVEL     Set faster_whisper library logging level: debug | info | warning | error
                        (default: warning; use debug to see CTranslate2/VAD internals)

Utility:
  --version             Print installed faster-whisper version and exit
  --update              Upgrade faster-whisper in the skill venv and exit

Output Formats

Text (default)

Plain-text transcript. With --diarize, speaker labels are inserted:

[SPEAKER_1]
 Hello, welcome to the meeting.
[SPEAKER_2]
 Thanks for having me.

JSON (`--format json`)

Complete metadata including segments, timestamps, language detection, and performance stats:

{
  "file": "audio.mp3",
  "text": "Hello, welcome...",
  "language": "en",
  "language_probability": 0.98,
  "duration": 600.5,
  "segments": [...],
  "speakers": ["SPEAKER_1", "SPEAKER_2"],
  "stats": {
    "processing_time": 28.3,
    "realtime_factor": 21.2
  }
}
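
When relaying this output, a short summary is usually more useful than the raw JSON. The sketch below reads the fields shown in the sample and recomputes the realtime factor as duration divided by processing time, which is an assumption that happens to match the sample's numbers (600.5 / 28.3 ≈ 21.2). The function name is hypothetical.

```python
import json


def summarize_result(result_json: str) -> str:
    """Summarize the key stats from a --format json result (hypothetical helper)."""
    data = json.loads(result_json)
    duration = data["duration"]                      # audio length in seconds
    processing = data["stats"]["processing_time"]    # wall-clock seconds
    rtf = duration / processing                      # assumed definition of realtime factor
    return (
        f"{data['file']}: {duration:.1f}s audio in {processing:.1f}s "
        f"({rtf:.1f}x realtime), language {data['language']}"
    )


sample = json.dumps({
    "file": "audio.mp3",
    "language": "en",
    "duration": 600.5,
    "segments": [],
    "stats": {"processing_time": 28.3, "realtime_factor": 21.2},
})
print(summarize_result(sample))
```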

SRT (`--format srt`)

Standard subtitle format for video players:

1
00:00:00,000 --> 00:00:02,500
[SPEAKER_1] Hello, welcome to the meeting.

2
00:00:02,800 --> 00:00:04,200
[SPEAKER_2] Thanks for having me.

VTT (`--format vtt`)

WebVTT format for web video players:

WEBVTT

1
00:00:00.000 --> 00:00:02.500
[SPEAKER_1] Hello, welcome to the meeting.

2
00:00:02.800 --> 00:00:04.200
[SPEAKER_2] Thanks for having me.
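
SRT and VTT cues differ mainly in the decimal marker of the timestamp (a comma in SRT, a period in VTT), as the two samples show. A small sketch of that formatting, using a hypothetical helper name:

```python
def format_timestamp(seconds: float, decimal_marker: str = ",") -> str:
    """Format seconds as HH:MM:SS<marker>mmm (',' for SRT, '.' for VTT)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{millis:03d}"


# SRT cue line
print(f"{format_timestamp(0.0)} --> {format_timestamp(2.5)}")
# VTT cue line
print(f"{format_timestamp(2.8, '.')} --> {format_timestamp(4.2, '.')}")
```

Working in integer milliseconds avoids floating-point drift when splitting into hours, minutes, and seconds.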

TSV (`--format tsv`)

Tab-separated values, compatible with OpenAI Whisper. Columns: start_ms, end_ms, text:

0	2500	Hello, welcome to the meeting.
2800	4200	Thanks for having me.

Suited to piping into other tools or spreadsheets. No header row.

ASS/SSA (`--format ass`)

Advanced SubStation Alpha format, supported by Aegisub, VLC, mpv, MPC-HC, and most video editors. Its [V4+ Styles] section offers richer styling than SRT (font, size, color, position):

[Script Info]
ScriptType: v4.00+
...

[V4+ Styles]
Style: Default,Arial,20,&H00FFFFFF,...

[Events]
Format: Layer, Start, End, Style, Name, ..., Text
Dialogue: 0,0:00:00.00,0:00:02.50,Default,,[SPEAKER_1] Hello, welcome.
Dialogue: 0,0:00:02.80,0:00:04.20,Default,,[SPEAKER_2] Thanks for having me.

Timestamps use H:MM:SS.cc (centiseconds). Edit the [V4+ Styles] block in Aegisub to customize font, color, and position without re-transcribing.

LRC (`--format lrc`)

Timed-lyrics format used by music players (e.g., Foobar2000, VLC, AIMP). Timestamps use [mm:ss.xx], where xx = centiseconds:

[00:00.50]Hello, welcome to the meeting.
[00:02.80]Thanks for having me.

With diarization, speaker labels are included:

[00:00.50][SPEAKER_1] Hello, welcome to the meeting.
[00:02.80][SPEAKER_2] Thanks for having me.

Default file extension: .lrc. Suited to music transcription, karaoke, and any workflow that needs timestamped text compatible with music players.
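
The [mm:ss.xx] layout can be reproduced with a few lines of Python. This is a sketch with a hypothetical helper name; it mirrors the two samples above, including the optional speaker label.

```python
from typing import Optional


def lrc_line(start: float, text: str, speaker: Optional[str] = None) -> str:
    """Format one LRC line: [mm:ss.xx] followed by an optional [SPEAKER] label."""
    centis = round(start * 100)                 # work in integer centiseconds
    minutes, centis = divmod(centis, 6000)
    secs, centis = divmod(centis, 100)
    label = f"[{speaker}] " if speaker else ""
    return f"[{minutes:02d}:{secs:02d}.{centis:02d}]{label}{text}"


print(lrc_line(0.5, "Hello, welcome to the meeting."))
print(lrc_line(2.8, "Thanks for having me.", speaker="SPEAKER_2"))
```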

Speaker Diarization

Identifies who spoke when, using pyannote.audio.

Setup:

./setup.sh --diarize

Requirements:

HuggingFace token at ~/.cache/huggingface/token (huggingface-cli login)
Accepted model agreements:
- https://hf.co/pyannote/speaker-diarization-3.1
- https://hf.co/pyannote/segmentation-3.0

Usage:

# Basic diarization (text output)
./scripts/transcribe meeting.wav --diarize

# Diarized subtitles
./scripts/transcribe meeting.wav --diarize --format srt -o meeting.srt

# Diarized JSON (includes speakers list)
./scripts/transcribe meeting.wav --diarize --format json

Speakers are labeled SPEAKER_1, SPEAKER_2, etc. in order of first appearance. Diarization runs on the GPU automatically if CUDA is available.

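Precise Word Timestamps

Whenever word-level timestamps are computed (--word-timestamps, --diarize, or --min-confidence), a wav2vec2 forced-alignment pass automatically refines them from Whisper's ~100-200 ms accuracy to ~10 ms. No extra flag is needed.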

# Word timestamps with automatic wav2vec2 alignment
./scripts/transcribe audio.mp3 --word-timestamps --format json

# Diarization also gets precise alignment automatically
./scripts/transcribe meeting.wav --diarize

# Precise subtitles
./scripts/transcribe audio.mp3 --word-timestamps --format srt -o subtitles.srt

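Alignment uses torchaudio's MMS (Massively Multilingual Speech) model, which supports more than 1,000 languages. The model is cached after its first load, so batch processing stays fast.

URL and YouTube Input

Pass any URL; the audio is downloaded automatically via yt-dlp: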

# YouTube video
./scripts/transcribe https://youtube.com/watch?v=dQw4w9WgXcQ

# Direct audio URL
./scripts/transcribe https://example.com/podcast.mp3

# With options
./scripts/transcribe https://youtube.com/watch?v=... --language en --format srt -o subs.srt

Requires yt-dlp (looked up on PATH and at ~/.local/share/pipx/venvs/yt-dlp/bin/yt-dlp).

Batch Processing

Process several files at once using glob patterns, directories, or multiple paths:
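
The two lookup locations mentioned above (PATH first, then the pipx venv) can be checked from Python before a URL run. This sketch only mirrors that description; `find_yt_dlp` is a hypothetical helper and does not reproduce the skill's actual lookup code.

```python
import shutil
from pathlib import Path

# pipx location quoted in the text above
PIPX_FALLBACK = Path.home() / ".local/share/pipx/venvs/yt-dlp/bin/yt-dlp"


def find_yt_dlp(fallback: Path = PIPX_FALLBACK):
    """Return the yt-dlp executable path as a string, or None if it is not found."""
    on_path = shutil.which("yt-dlp")
    if on_path:
        return on_path
    if fallback.is_file():
        return str(fallback)
    return None


location = find_yt_dlp()
print(location or "yt-dlp not found; install it with: pipx install yt-dlp")
```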

# All MP3s in current directory
./scripts/transcribe *.mp3

# Entire directory (auto-filters audio files)
./scripts/transcribe ./recordings/

# Output to directory (one file per input)
./scripts/transcribe *.mp3 -o ./transcripts/

# Skip already-transcribed files (resume interrupted batch)
./scripts/transcribe *.mp3 --skip-existing -o ./transcripts/

# Mixed inputs
./scripts/transcribe file1.mp3 file2.wav ./more-recordings/

# Batch SRT subtitles
./scripts/transcribe *.mp3 --format srt -o ./subtitles/

When writing to a directory, each file is named {input stem}.{extension} (for example, audio.mp3 → audio.srt).
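
The stem-plus-extension naming rule can be expressed with pathlib. This is a sketch of the rule as described, with `output_path` as a hypothetical name; the skill's real implementation may handle collisions or templates differently.

```python
from pathlib import Path


def output_path(input_file: str, out_dir: str, extension: str) -> Path:
    """Build <out_dir>/<input stem>.<extension> for one input file."""
    stem = Path(input_file).stem  # "audio.mp3" -> "audio"
    return Path(out_dir) / f"{stem}.{extension.lstrip('.')}"


print(output_path("audio.mp3", "./subtitles/", "srt"))
```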

Batch mode prints a summary once all files have finished:

📊 Done: 12 files, 3h24m audio in 10m15s (19.9× realtime)
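
The realtime factor in that summary is simply audio duration divided by processing time. A short Python check using the figures from the sample line (3h24m of audio in 10m15s); the function name is illustrative only.

```python
def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / processing_seconds


audio = 3 * 3600 + 24 * 60  # 3h24m = 12240 s
elapsed = 10 * 60 + 15      # 10m15s = 615 s
print(f"{realtime_factor(audio, elapsed):.1f}x realtime")
```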

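Workflows

End-to-end pipelines for common use cases.

Podcast Transcription Workflow

Fetch and transcribe the 5 most recent episodes from any podcast RSS feed: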

# Transcribe latest 5 episodes → one .txt per episode
./scripts/transcribe --rss https://feeds.megaphone.fm/mypodcast -o ./transcripts/

# All episodes, as SRT subtitles
./scripts/transcribe --rss https://... --rss-latest 0 --format srt -o ./subtitles/

# Skip already-done episodes (safe to re-run)
./scripts/transcribe --rss https://... --skip-existing -o ./transcripts/

# With diarization (who said what) + retry on flaky network
./scripts/transcribe --rss https://... --diarize --retries 2 -o ./transcripts/

Meeting Notes Workflow

Transcribe a meeting recording with speaker labels, then output clean text:

# Diarize + name speakers (replace SPEAKER_1/2 with real names)
./scripts/transcribe meeting.wav --diarize --speaker-names "Alice,Bob" -o meeting.txt

# Diarized JSON for post-processing (summaries, action items)
./scripts/transcribe meeting.wav --diarize --format json -o meeting.json

# Stream live while it transcribes (long meetings)
./scripts/transcribe meeting.wav --stream

Video Subtitle Workflow

Generate ready-to-use subtitles for a video file:

# SRT subtitles with sentence merging (better readability)
./scripts/transcribe video.mp4 --format srt --merge-sentences -o subtitles.srt

# Burn subtitles directly into the video
./scripts/transcribe video.mp4 --format srt --burn-in video_subtitled.mp4

# Word-level SRT (karaoke-style), capped at 8 words per cue
./scripts/transcribe video.mp4 --format srt --word-timestamps --max-words-per-line 8 -o subs.srt

YouTube Batch Workflow

Transcribe several YouTube videos in one go:

# One-liner: transcribe a playlist video + output SRT
./scripts/transcribe "https://youtube.com/watch?v=abc123" --format srt -o subs.srt

# Batch from a text file of URLs (one per line)
cat urls.txt | xargs ./scripts/transcribe -o ./transcripts/

# Download audio first, then transcribe (for re-use without re-downloading)
./scripts/transcribe https://youtube.com/watch?v=abc123 --keep-temp
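
For readers who prefer to drive the URL batch from Python rather than xargs, the sketch below builds the same command line from a list of URLs. The helpers `parse_urls`, `build_command`, and `run_batch` are hypothetical; skipping `#` comment lines is an extra convenience that the xargs one-liner does not provide.

```python
import subprocess
from pathlib import Path


def parse_urls(text: str) -> list[str]:
    """One URL per line; blank lines and '#' comment lines are ignored."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def build_command(urls: list[str], out_dir: str = "./transcripts/") -> list[str]:
    """Mirror `xargs ./scripts/transcribe -o ./transcripts/`: URLs are appended last."""
    return ["./scripts/transcribe", "-o", out_dir, *urls]


def run_batch(url_file: str = "urls.txt") -> None:
    """Read the URL file and run a single batch invocation."""
    urls = parse_urls(Path(url_file).read_text(encoding="utf-8"))
    if urls:
        subprocess.run(build_command(urls), check=True)


print(build_command(parse_urls("https://example.com/a.mp3\n\n# skipped\nhttps://example.com/b.mp3\n")))
```

Calling `run_batch()` from the skill's directory would then start the actual transcription.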

Noisy Audio Workflow

Clean up a poor-quality recording before transcription:

# Denoise + normalize, then transcribe
./scripts/transcribe interview.mp3 --denoise --normalize -o interview.txt

# Noisy batch with aggressive hallucination filtering
./scripts/transcribe *.mp3 --denoise --filter-hallucinations -o ./out/

Batch Recovery Workflow

Process a large folder with retries; re-running after a failure is safe:

# Retry each failed file up to 3 times, skip already-done
./scripts/transcribe ./recordings/ --skip-existing --retries 3 -o ./transcripts/

# Check what failed (printed in batch summary at the end)
# Re-run the same command — skips successes, retries failures

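Server Mode (OpenAI-Compatible API)

Speaches serves faster-whisper behind an OpenAI-compatible /v1/audio/transcriptions endpoint. It is a drop-in replacement for the OpenAI Whisper API, with support for streaming, Docker, and real-time transcription.

Quick Start (Docker)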

docker run --gpus all -p 8000:8000 ghcr.io/speaches-ai/speaches:latest-cuda

Test

# Transcribe a file via the API (same format as OpenAI)
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model=Systran/faster-whisper-large-v3

Use with Any OpenAI SDK

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
with open("audio.mp3", "rb") as f:
    result = client.audio.transcriptions.create(model="Systran/faster-whisper-large-v3", file=f)
print(result.text)

Useful when transcription needs to be exposed as a local API for other tools (for example Home Assistant, n8n, or custom apps).

Common Mistakes

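Mistake	Problem	Solution
Using the CPU when a GPU is available	Transcription is 10-20× slower	Check `nvidia-smi`; verify the CUDA installation
Not specifying the language	Time is wasted on auto-detection for known content	Use `--language en` when you know the language
Using the wrong model	Unnecessarily slow or inaccurate	The default `distil-large-v3.5` performs very well; use `large-v3` only if you hit accuracy problems
Ignoring distilled models	Missing a roughly 6× speedup at <1% accuracy loss	Try `distil-large-v3.5` before considering the standard models
Forgetting ffmpeg	Setup fails or audio cannot be processed	The setup script handles this; manual installs need ffmpeg installed separately
Out-of-memory errors	Model too large for the available VRAM/RAM	Use a smaller model, `--compute-type int8`, or `--batch-size 4`
Over-tuning beam size	Diminishing returns beyond a beam size of 5-7	The default of 5 is fine; try 10 for critical transcripts
Using --diarize without pyannote	Import error at runtime	Run `setup.sh --diarize` first
Using --diarize without a HuggingFace token	Model download fails	Run `huggingface-cli login` and accept the model agreements
Using URL input without yt-dlp	Download fails	Install: `pipx install yt-dlp`
Setting --min-confidence too high	Drops valid segments that contain natural pauses	Start at 0.5 and raise gradually; inspect the probabilities in the JSON output
Using --word-timestamps for basic transcription	Adds about 5-10 s of overhead for little benefit	Use it only when word-level precision is needed
Batch processing without an -o directory	All output is mixed together on stdout	Use `-o ./transcripts/` to write a separate file per input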

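Performance Notes

First run: downloads the model to ~/.cache/huggingface/ (one-time)
Batched inference: enabled by default through BatchedInferencePipeline; about 3× faster than standard mode, with VAD on by default
GPU: CUDA is used automatically when available
Quantization: INT8 quantization on CPU gives roughly a 4× speedup with minimal accuracy loss
Performance stats: every transcription reports audio duration, processing time, and realtime factor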
Benchmark (RTX 3070, 21-minute file): about 24 s; about 16 s with batched inference (distil-large-v3 and v3.5, both batched), about 69 s without batching
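Precise alignment overhead: loading the wav2vec2 model plus alignment adds about 5-10 s (the model stays cached during batch processing)
Diarization overhead: adds about 10-30 s depending on audio length (runs on the GPU when one is available)
Memory:
- distil-large-v3: about 2 GB RAM / about 1 GB VRAM
- large-v3-turbo: about 4 GB RAM / about 2 GB VRAM
- tiny/base: <1 GB RAM
- Diarization: roughly 1-2 GB of additional VRAM
Out of memory: if you hit OOM errors, lower --batch-size (try 4)
Pre-converting to WAV (optional): ffmpeg -i input.mp3 -ar 16000 -ac 1 input.wav converts to 16 kHz mono WAV before transcription. The gain for one-off use is small (about 5%) because PyAV decodes efficiently; it pays off mainly when the same file is processed many times (research, experiments) or when a format causes PyAV decoding problems. Note: --normalize and --denoise already perform this conversion automatically.
Silero VAD V6: faster-whisper 1.2.1 upgraded to Silero VAD V6 (improved speech detection). Run ./setup.sh --update to get it.
Batched silence removal: faster-whisper 1.2.0+ removes silence automatically in BatchedInferencePipeline (used by default). If you installed before August 2024, upgrade with ./setup.sh --update to get this feature.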
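
For repeated experiments on the same file, the optional WAV pre-conversion above can be scripted. The sketch below builds exactly the ffmpeg command quoted in the notes; `wav_convert_command` and `convert_to_wav` are hypothetical helper names, and the conversion itself only runs when `convert_to_wav` is called.

```python
import shutil
import subprocess
from pathlib import Path


def wav_convert_command(src: str, sample_rate: int = 16000) -> list[str]:
    """Build: ffmpeg -i <src> -ar 16000 -ac 1 <src stem>.wav"""
    dst = Path(src).with_suffix(".wav")
    if dst == Path(src):
        raise ValueError("input is already a .wav file; choose a different output name")
    return ["ffmpeg", "-i", src, "-ar", str(sample_rate), "-ac", "1", str(dst)]


def convert_to_wav(src: str) -> str:
    """Run the conversion and return the output path."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    command = wav_convert_command(src)
    subprocess.run(command, check=True)
    return command[-1]


print(" ".join(wav_convert_command("input.mp3")))
```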

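Why faster-whisper?

Speed: about 4-6× faster than OpenAI's original Whisper
Accuracy: identical (same model weights)
Efficiency: lower memory use through quantization
Production-ready: stable C++ backend (CTranslate2)
Distilled models: about 6× faster with less than 1% accuracy loss
Subtitles: native SRT/VTT/HTML output
Precise alignment: automatic wav2vec2 refinement (about 10 ms word-boundary accuracy)
Diarization: optional speaker identification via pyannote; --speaker-names maps labels to real names
URL support: YouTube links and other URLs are accepted directly; --keep-temp keeps the download for reuse
Custom models: load a local CTranslate2 directory or a HuggingFace repo; --model-dir controls the cache directory
Quality control: --filter-hallucinations strips music/applause markers and repeated content
Parallel batches: --parallel N for multi-threaded batch processing
Subtitle burn-in: --burn-in overlays subtitles directly onto the video via ffmpeg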

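What's New in v1.5.0

Multi-format output:

--format srt,text: write several formats in a single pass (for example SRT and plain text together)
Accepts comma-separated lists such as srt,vtt,json or srt,text
Multi-format output requires -o <directory>; single-format output is unchanged

Filler-word removal:

--clean-filler: strips hesitation sounds (um, uh, ah, hmm, and the like) and discourse markers (you know, I mean, you see) from the transcript; off by default
Uses conservative regex matching at word boundaries to avoid false positives
Segments left empty after cleaning are dropped automatically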
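
The word-boundary idea behind filler removal can be sketched as follows. This is an illustrative approximation, not the skill's actual pattern: the filler list, the `clean_filler` and `clean_segments` names, and the segment dictionaries are all assumptions, and naive matching of phrases such as "I mean" will also remove legitimate uses.

```python
import re

# Assumed filler list for illustration only
FILLERS = ["um", "uh", "er", "ah", "hmm", "you know", "i mean"]
FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(f) for f in FILLERS) + r")\b[,.]?\s*",
    re.IGNORECASE,
)


def clean_filler(text: str) -> str:
    """Remove whole-word fillers plus an immediately following comma or period."""
    cleaned = FILLER_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


def clean_segments(segments: list[dict]) -> list[dict]:
    """Clean each segment's text and drop segments that end up empty."""
    result = []
    for segment in segments:
        text = clean_filler(segment["text"])
        if text:
            result.append({**segment, "text": text})
    return result


print(clean_filler("Um, I think, uh, we should start."))
print(clean_filler("The umbrella is here."))  # "um" inside a word is left alone
```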

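Stereo channel selection:

--channel left|right|mix: extract a specific stereo channel before transcription (default: mix)
Useful for dual-track recordings (interviewer on the left channel, interviewee on the right)
Uses ffmpeg's channel-manipulation filter; falls back gracefully to the full mix if ffmpeg is not found

Character-based subtitle wrapping:

--max-chars-per-line N: split subtitle cues so that no line exceeds N characters
Applies to SRT, VTT, ASS, and TTML; takes precedence over --max-words-per-line
Requires word-level timestamps; falls back to the whole segment when no word data is available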
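
One plausible way to split a cue by character count, given word-level timestamps, is a greedy pass over the words. The sketch below assumes each word is a dict with `word`, `start`, and `end` keys; that shape and the `split_cue` name are illustrative and may not match the skill's internals. A single word longer than the limit still gets its own cue.

```python
def split_cue(words: list[dict], max_chars: int) -> list[dict]:
    """Greedily group timed words into cues whose text is at most max_chars long."""
    groups, current = [], []
    for word in words:
        candidate = " ".join(w["word"].strip() for w in current + [word])
        if current and len(candidate) > max_chars:
            groups.append(current)  # close the current cue and start a new one
            current = [word]
        else:
            current.append(word)
    if current:
        groups.append(current)
    return [
        {
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(w["word"].strip() for w in group),
        }
        for group in groups
    ]


words = [
    {"word": "Hello,", "start": 0.0, "end": 0.4},
    {"word": "welcome", "start": 0.5, "end": 0.9},
    {"word": "to", "start": 1.0, "end": 1.1},
    {"word": "the", "start": 1.1, "end": 1.3},
    {"word": "meeting.", "start": 1.4, "end": 2.5},
]
for cue in split_cue(words, max_chars=14):
    print(cue)
```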

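Paragraph detection:

--detect-paragraphs: insert \n\n paragraph breaks at natural boundaries in text output
--paragraph-gap SEC: minimum silence gap for a paragraph break (default: 3.0 s)
A break is also detected when the previous segment ends a sentence and the gap is ≥1.5 s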
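
The two paragraph rules above (a long silence, or a sentence end followed by a shorter silence) can be sketched in a few lines. The segment dictionaries and the `join_paragraphs` name are assumptions for illustration; the thresholds are the defaults quoted in the text.

```python
SENTENCE_END = (".", "!", "?")


def join_paragraphs(segments, paragraph_gap=3.0, sentence_gap=1.5):
    """Join segment texts, inserting a blank line where a paragraph break is detected."""
    parts = []
    previous = None
    for segment in segments:
        if previous is not None:
            gap = segment["start"] - previous["end"]
            ends_sentence = previous["text"].strip().endswith(SENTENCE_END)
            is_break = gap >= paragraph_gap or (ends_sentence and gap >= sentence_gap)
            parts.append("\n\n" if is_break else " ")
        parts.append(segment["text"].strip())
        previous = segment
    return "".join(parts)


segments = [
    {"start": 0.0, "end": 2.0, "text": "Welcome everyone."},
    {"start": 2.3, "end": 4.0, "text": "Let's begin"},
    {"start": 6.0, "end": 8.0, "text": "with the agenda."},
    {"start": 9.6, "end": 11.0, "text": "Next topic."},
    {"start": 14.5, "end": 16.0, "text": "Budget"},
]
print(join_paragraphs(segments))
```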

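Subtitle formats:

--format ass: Advanced SubStation Alpha (Aegisub, VLC, mpv, MPC-HC)
--format lrc: timed-lyrics format for music players
--format html: confidence-colored HTML transcript (each word marked green/yellow/red)
--format ttml: W3C TTML 1.0 (DFXP) broadcast standard (used by Netflix, Amazon Prime, BBC)
--format csv: spreadsheet-friendly CSV with a header row; RFC 4180 quoting; includes a speaker column when diarizing

--search TERM: find every timestamp where a word or phrase occurs; replaces normal output; use -o to save
--search-fuzzy: approximate/partial matching, used together with --search
--detect-chapters: detect chapter breaks automatically from silence gaps; --chapter-gap SEC (default: 8 s)
--chapters-file PATH: write chapters to a file instead of stdout; --chapter-format youtube|text|json
--export-speakers DIR: after --diarize, save each speaker's turns as separate WAV files via ffmpeg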
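
Chapter detection from silence gaps can be pictured as a scan over consecutive segments. The sketch below is an approximation under stated assumptions: segments are dicts with `start` and `end` in seconds, the 8-second default comes from the text, and `detect_chapters` and `chapter_line` are hypothetical names. The `M:SS Title` line layout follows the common YouTube description convention and is not confirmed as the skill's exact output.

```python
def detect_chapters(segments, chapter_gap=8.0):
    """Return start times: the first segment, plus each segment that follows a long gap."""
    if not segments:
        return []
    starts = [segments[0]["start"]]
    for previous, current in zip(segments, segments[1:]):
        if current["start"] - previous["end"] >= chapter_gap:
            starts.append(current["start"])
    return starts


def chapter_line(seconds, title):
    """Format one chapter as 'M:SS Title' (or 'H:MM:SS Title' past one hour)."""
    minutes, secs = divmod(int(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    stamp = f"{hours}:{minutes:02d}:{secs:02d}" if hours else f"{minutes}:{secs:02d}"
    return f"{stamp} {title}"


segments = [
    {"start": 0.0, "end": 50.0},
    {"start": 52.0, "end": 120.0},
    {"start": 130.0, "end": 200.0},  # follows a 10 s gap, so a new chapter starts here
    {"start": 205.0, "end": 300.0},
]
for index, start in enumerate(detect_chapters(segments), 1):
    print(chapter_line(start, f"Chapter {index}"))
```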

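Batch improvements:

ETA: [current/total] filename | ETA: Xm Ys is shown before each file in sequential batches; no flag needed
--language-map "pat=lang,...": per-file language override; fnmatch glob patterns; also accepts the @file.json form
--retries N: retry failed files with exponential backoff; a summary of failed files is printed at the end
--rss URL: transcribe a podcast RSS feed; --rss-latest N sets the number of episodes
--skip-existing / --parallel N / --output-template / --stats-file / --merge-sentences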
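
To show how a `pat=lang,...` map with fnmatch patterns might resolve a language per file, here is a small sketch. It assumes the first matching pattern wins and that matching is done on the file name only; both are guesses, the helper names are hypothetical, and the @file.json form is not handled.

```python
from fnmatch import fnmatch
from pathlib import Path


def parse_language_map(spec: str) -> list[tuple[str, str]]:
    """Parse 'pattern=lang,pattern=lang' into an ordered list of pairs."""
    mapping = []
    for item in spec.split(","):
        if "=" in item:
            pattern, lang = item.split("=", 1)
            mapping.append((pattern.strip(), lang.strip()))
    return mapping


def language_for(filename: str, mapping, default=None):
    """Return the language of the first pattern that matches the file name."""
    name = Path(filename).name
    for pattern, lang in mapping:
        if fnmatch(name, pattern):
            return lang
    return default


mapping = parse_language_map("interview_*.mp3=en,lecture*=fr")
print(language_for("recordings/interview_01.mp3", mapping))
print(language_for("lecture-3.wav", mapping))
print(language_for("misc.wav", mapping))
```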

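Models and inference:

distil-large-v3.5 is now the default (replacing distil-large-v3)
condition_on_previous_text is disabled automatically for distilled models (prevents repetition loops)
--condition-on-previous-text overrides this; --log-level controls debug output from the libraries
--model-dir PATH: custom HuggingFace cache directory; local CTranslate2 models are supported
--no-timestamps, --chunk-length, --length-penalty, --repetition-penalty, --no-repeat-ngram-size
--clip-timestamps, --stream, --progress, --best-of, --patience, --max-new-tokens
--hotwords, --prefix, --revision, --suppress-tokens, --max-initial-timestamp

Speakers and quality:

--speaker-names "Alice,Bob": replace SPEAKER_1/2 with real names (requires --diarize)
--filter-hallucinations: remove music/applause markers, repeated content, and "Thanks for watching"
--burn-in OUTPUT: burn subtitles into the video via ffmpeg
--keep-temp: keep audio downloaded from URLs for reprocessing

Setup:

setup.sh --check: system diagnostics covering GPU, CUDA, Python, ffmpeg, pyannote, and the HuggingFace token (finishes in about 12 seconds)
ffmpeg is no longer required for basic transcription (PyAV handles decoding); skill.json has been updated to reflect this (ffmpeg is now listed under optionalBins)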

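Troubleshooting

"CUDA not available - using CPU": install PyTorch with CUDA support (see the GPU Support section above)
Setup fails: make sure Python 3.10 or later is installed
Out of memory: use a smaller model, --compute-type int8, or --batch-size 4
Slow on CPU: expected; use a GPU for real transcription work
Model download fails: check the permissions on ~/.cache/huggingface/
Diarization model fails: make sure a HuggingFace token exists and the model agreements are accepted, or pass the token directly with --hf-token hf_xxx
URL download fails: check that yt-dlp is installed (pipx install yt-dlp)
No audio files found in a batch: check that the file extensions match the supported formats
Check the installed version: run ./scripts/transcribe --version
Upgrade faster-whisper: run ./setup.sh --update (in-place upgrade, no full reinstall needed)
Hallucinated text on silent or music sections: try --temperature 0.0 --no-speech-threshold 0.8, or tune --vad-threshold 0.3 (a lower threshold) or --min-silence-duration 300
Improve speech detection: run ./setup.sh --update to upgrade faster-whisper to the latest version (includes Silero VAD V6)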

References

faster-whisper GitHub
Distil-Whisper paper
HuggingFace model hub
pyannote.audio (speaker diarization)
yt-dlp (URL/YouTube downloader)
