Speech to text
Learn how to turn audio into text
ChatGPT is a large language model that combines artificial intelligence and natural language processing. It can interact with users through text, speech, or images. With speech-to-text in particular, ChatGPT can turn what a user says into text on the spot, analyze it, and respond in written form. This kind of interaction greatly improves the efficiency of communication between ChatGPT and its users.
The speech to text API provides two endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model. They can be used to transcribe audio into whatever language the audio is in, or to translate and transcribe the audio into English.
File uploads are currently limited to 25 MB, and the following input file types are supported: mp3, mp4, mpeg, mpga, m4a, wav, and webm.
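Before uploading, it can help to check that a file is within the size limit and uses a supported extension. A minimal sketch of such a pre-flight check (the helper name and structure are our own; the limit and extension list come from the text above):

import os

SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # the documented 25 MB upload limit

def is_uploadable(path):
    # Hypothetical helper: verifies extension and size before calling the API.
    ext = os.path.splitext(path)[1].lower()
    return ext in SUPPORTED_EXTENSIONS and os.path.getsize(path) <= MAX_UPLOAD_BYTES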
The transcriptions API takes as input the audio file you want to transcribe and the desired output file format for the transcription of the audio. We currently support multiple input and output file formats.
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai
audio_file = open("/path/to/file/audio.mp3", "rb")
transcript = openai.Audio.transcribe("whisper-1", audio_file)
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header 'Authorization: Bearer TOKEN' \
--header 'Content-Type: multipart/form-data' \
--form file=@/path/to/file/openai.mp3 \
--form model=whisper-1
By default, the response type will be json with the raw text included.
{
  "text": "Imagine the wildest idea that you’ve ever had, and you’re curious about how it might scale to something that’s a 100, a 1,000 times bigger.
…
}
To set additional parameters in a request, you can add more --form lines with the relevant options. For example, if you want to set the output format as text, you would add the following line:
...
--form file=@openai.mp3 \
--form model=whisper-1 \
--form response_format=text
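The same option can also be passed through the Python bindings as a keyword argument. A minimal sketch, assuming OpenAI Python v0.27.0 as in the examples above:

import openai

audio_file = open("/path/to/file/audio.mp3", "rb")
# Extra request parameters such as response_format are passed as keyword arguments.
transcript = openai.Audio.transcribe("whisper-1", audio_file, response_format="text")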
The translations API takes as input the audio file in any of the supported languages and transcribes, if necessary, the audio into English. This differs from our /transcriptions endpoint since the output is not in the original input language and is instead translated to English text.
# Note: you need to be using OpenAI Python v0.27.0 for the code below to work
import openai
audio_file = open("/path/to/file/german.mp3", "rb")
transcript = openai.Audio.translate("whisper-1", audio_file)
curl --request POST \
--url https://api.openai.com/v1/audio/translations \
--header 'Authorization: Bearer TOKEN' \
--header 'Content-Type: multipart/form-data' \
--form file=@/path/to/file/german.mp3 \
--form model=whisper-1
In this case, the input audio was German and the output text looks like:
Hello, my name is Wolfgang and I come from Germany. Where are you heading today?
We only support translation into English at this time.
We currently support the following languages through both the transcriptions and translations endpoints:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
While the underlying model was trained on 98 languages, we only list the languages that came in under a 50% word error rate (WER), which is an industry standard benchmark for speech to text model accuracy. The model will return results for languages not listed above, but the quality will be low.
By default, the Whisper API only supports files that are less than 25 MB. If you have an audio file that is longer than that, you will need to break it up into chunks of 25 MB or less, or use a compressed audio format. To get the best performance, we suggest that you avoid breaking the audio up mid-sentence, as this may cause some context to be lost.
One way to handle this is to use the PyDub open source Python package to split the audio:
from pydub import AudioSegment
song = AudioSegment.from_mp3("good_morning.mp3")
# PyDub handles time in milliseconds
ten_minutes = 10 * 60 * 1000
first_10_minutes = song[:ten_minutes]
first_10_minutes.export("good_morning_10.mp3", format="mp3")
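Building on that snippet, a minimal sketch that splits a longer file into consecutive 10-minute chunks (the chunk length and output file names are our own choices, and a fixed-length split like this may still cut mid-sentence):

from pydub import AudioSegment

song = AudioSegment.from_mp3("good_morning.mp3")
chunk_length = 10 * 60 * 1000  # 10 minutes, in milliseconds

# Slice the audio into consecutive chunks and export each one as its own mp3.
for i, start in enumerate(range(0, len(song), chunk_length)):
    chunk = song[start:start + chunk_length]
    chunk.export(f"good_morning_part_{i}.mp3", format="mp3")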
OpenAI makes no guarantees about the usability or security of 3rd party software like PyDub.
You can use a prompt to improve the quality of the transcripts generated by the Whisper API. The model will try to match the style of the prompt, so it will be more likely to use capitalization and punctuation if the prompt does too. However, the current prompting system is much more limited than our other language models and only provides limited control over the generated transcript. Here are some examples of how prompting can help in different scenarios:
One use is correcting specific words or acronyms that the model often misrecognizes in the audio. For example, the following prompt helps the model spell product names like DALL·E and GPT-3 correctly:
The transcript is about OpenAI which makes technology like DALL·E, GPT-3, and ChatGPT with the hope of one day building an AGI system that benefits all of humanity
To preserve the context of a file that was split into segments, you can prompt the model with the transcript of the preceding segment. This will make the transcript more accurate, as the model will use the relevant information from the previous audio. The model will only consider the final 224 tokens of the prompt and ignore anything earlier.
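A minimal sketch of that chaining, assuming OpenAI Python v0.27.0 and the hypothetical chunk files from the splitting example above:

import openai

previous_transcript = ""
# Pass each segment's transcript as the prompt for the next segment.
for path in ["good_morning_part_0.mp3", "good_morning_part_1.mp3"]:
    with open(path, "rb") as audio_file:
        result = openai.Audio.transcribe("whisper-1", audio_file, prompt=previous_transcript)
    previous_transcript = result["text"]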
Sometimes the model might skip punctuation in the transcript. You can avoid this by using a simple prompt that includes punctuation:
Hello, welcome to my lecture.
The model also tends to leave out common filler words in the audio. If you want to keep the filler words in your transcript, you can use a prompt that contains them:
Umm, let me think like, hmm… Okay, here’s what I’m, like, thinking.
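In the Python bindings, a prompt like the ones above is supplied the same way as the other options; a minimal sketch under the v0.27.0 interface:

import openai

audio_file = open("/path/to/file/audio.mp3", "rb")
# The prompt nudges the model toward the prompt's capitalization and punctuation.
transcript = openai.Audio.transcribe(
    "whisper-1",
    audio_file,
    prompt="Hello, welcome to my lecture.",
)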