In this article, we'll use Whisper to create a speech-to-text application. Whisper requires a Python backend, so we'll create the server for the application with Flask.
React Native serves as the framework for building the mobile client. I hope you enjoy the process of creating this application as much as I did. Let's dive right in.
Speech recognition enables a program to process human speech into a written format. Grammar, syntax, structure, and audio are all essential for understanding and processing human speech.
Speech recognition algorithms are among the most complex areas of computer science. Artificial intelligence, machine learning, the development of unsupervised pre-training techniques, and frameworks like Wav2Vec 2.0, which are effective at self-supervised learning and learning from raw audio, have advanced their capabilities.
A speech recognizer is made up of the following components (a toy sketch of the decoding step follows this list):
Speech input
A decoder, which relies on acoustic models, a pronunciation dictionary, and language models for its output
A word output
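To make the decoder's role concrete, here is a toy, purely illustrative Python sketch (this is not how Whisper works internally). It assumes each candidate transcription already carries an acoustic model score and a language model score, and the decoder simply picks the best weighted combination:

from dataclasses import dataclass

@dataclass
class Candidate:
    words: list            # candidate word sequence
    acoustic_score: float  # log-probability of the audio under this hypothesis
    lm_score: float        # log-probability of the words under a language model

def decode(candidates, lm_weight=0.8):
    # The decoder picks the hypothesis with the best combined score
    best = max(candidates, key=lambda c: c.acoustic_score + lm_weight * c.lm_score)
    return " ".join(best.words)

hypotheses = [
    Candidate(["recognize", "speech"], acoustic_score=-12.0, lm_score=-3.1),
    Candidate(["wreck", "a", "nice", "beach"], acoustic_score=-11.5, lm_score=-9.4),
]
print(decode(hypotheses))  # -> recognize speech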
These components and technical advances make it possible to consume large datasets of unlabeled speech. Pre-trained audio encoders are capable of learning high-quality representations of speech; their only drawback is their unsupervised nature.
A high-performing decoder maps speech representations to usable output. The decoder solves the supervision problem for audio encoders. However, the decoder also limits the effectiveness of frameworks like Wav2Vec for speech recognition: decoders can be very complex to use and require a skilled practitioner, particularly because technologies like Wav2Vec 2.0 are difficult to work with.
The key is to combine as many high-quality speech recognition datasets as possible. Models trained this way are more effective than models trained on a single source.
Whisper, or WSPR, stands for Web-scale Supervised Pretraining for Speech Recognition. Whisper models are trained to be able to predict the text of transcripts.
Whisper relies on sequence-to-sequence models to map between utterances and their transcribed forms, which makes the speech recognition pipeline more effective. Whisper also comes with an audio language detector, a fine-tuned model trained on VoxLingua107.
The Whisper dataset consists of audio paired with transcripts sourced from the internet. The quality of the dataset is improved through the use of automated filtering methods.
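Before building a server around it, it helps to see Whisper's Python API on its own. Below is a minimal sketch, assuming an audio file named sample.wav exists on disk (a placeholder); the same load_model and transcribe calls appear in the Flask endpoint later in this article:

import whisper

# Load one of the pretrained checkpoints: tiny, base, small, medium, or large
model = whisper.load_model("base")

# Transcribe a local audio file (ffmpeg must be installed for audio decoding)
result = model.transcribe("sample.wav")
print(result["text"])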
To use Whisper, we need to rely on Python for our backend. Whisper also requires the command-line tool ffmpeg, which enables our application to record, convert, and stream both audio and video.
Here are the commands for installing ffmpeg on different machines:
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
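As a quick illustration of the conversion step that ffmpeg enables, here is a minimal sketch using the ffmpeg-python bindings that appear in the requirements.txt later in this article. The file names and sample rate are placeholder assumptions, and the ffmpeg binary installed above must be on your PATH:

import ffmpeg

# Convert a recording to 16 kHz mono WAV (placeholder file names)
(
    ffmpeg
    .input("recording.m4a")
    .output("recording.wav", ar=16000, ac=1)
    .overwrite_output()
    .run()
)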
In this section, we'll create the backend service for the application. Flask is a web framework written in Python. I chose Flask for this application because it is easy to set up.
The Flask development team recommends using the latest version of Python, although Flask still supports Python ≥ 3.7.
Once the prerequisite installation is complete, we can create the project folder that will hold both the client and backend applications.
mkdir translateWithWhisper && cd translateWithWhisper && mkdir backend && cd backend
Flask makes use of virtual environments to manage project dependencies; Python ships with a venv module out of the box for creating them.
Use the following command in the terminal window to create a venv folder. This folder will hold our dependencies.
python3 -m venv venv
Specify the necessary dependencies in a requirements.txt file. The requirements.txt file lives in the root of the backend directory.
touch requirements.txt
code requirements.txt
Copy and paste the following into the requirements.txt file:
numpy
tqdm
transformers>=4.19.0
ffmpeg-python==0.2.0
pyaudio
SpeechRecognition
pydub
git+https://github.com/openai/whisper.git
--extra-index-url https://download.pytorch.org/whl/cu113
torch
flask
flask_cors
In the root project directory, create a Bash shell script file. The Bash script handles the installation of dependencies for the Flask application.
In the root project directory, open a terminal window and use the following commands to create the shell script:
touch install_dependencies.sh
code install_dependencies.sh
Copy and paste the following code block into the install_dependencies.sh file:
# install and run backend
cd backend && python3 -m venv venv
source venv/Scripts/activate
pip install wheel
pip install -r requirements.txt
Now, open a terminal window in the root directory and run the following command:
sh ./install_dependencies.sh
Now, we'll create a transcribe endpoint in our application that will receive audio input from the client. The application will transcribe the input and return the transcribed text to the client.
This endpoint accepts POST requests and processes the input. When the response is a 200 HTTP response, the client receives the transcribed text back.
Create an app.py file to hold the logic for processing the input. Open a new terminal window and create the app.py file in the backend directory:
touch backend/app.py
code backend/app.py
Copy and paste the code block below into the app.py file:
import os
import tempfile
import flask
from flask import request
from flask_cors import CORS
import whisper

app = flask.Flask(__name__)
CORS(app)

# endpoint for handling the transcribing of audio inputs
@app.route('/transcribe', methods=['POST'])
def transcribe():
    if request.method == 'POST':
        language = request.form['language']
        model = request.form['model_size']

        # there are no english models for large
        if model != 'large' and language == 'english':
            model = model + '.en'
        audio_model = whisper.load_model(model)

        temp_dir = tempfile.mkdtemp()
        save_path = os.path.join(temp_dir, 'temp.wav')

        wav_file = request.files['audio_data']
        wav_file.save(save_path)

        if language == 'english':
            result = audio_model.transcribe(save_path, language='english')
        else:
            result = audio_model.transcribe(save_path)

        return result['text']
    else:
        return "This endpoint only processes POST wav blob"
In a terminal window with the venv activated, run the following commands to start the application:
$ cd backend
$ flask run --port 8000
Expect the application to start without any errors; if so, the terminal window will show Flask's startup output confirming that the server is running on port 8000.
This concludes the creation of the transcribe endpoint in the Flask application.
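Before building the client, you can sanity-check the endpoint from Python. The sketch below is an example under a few assumptions: it uses the requests library (not in requirements.txt, so install it separately if needed), it assumes the Flask server is running locally on port 8000, and it posts a placeholder test.wav file using the same language, model_size, and audio_data fields the endpoint reads:

import requests

with open("test.wav", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:8000/transcribe",
        data={"language": "english", "model_size": "base"},
        files={"audio_data": ("test.wav", f, "audio/wav")},
    )

print(response.status_code)  # expect 200
print(response.text)         # the transcribed text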
To make a network request from iOS to the HTTP endpoint we created, we need to route to an HTTPS server. ngrok solves the problem of creating that reroute.
Download ngrok, then install the package and open it. A terminal window starts; enter the following command to host the server with ngrok:
ngrok http 8000
ngrok generates a hosted URL, which will be used in the client application for requests.
For this part of the tutorial, you'll need to install a few things:
Expo CLI: a command-line tool for interfacing with Expo tools
Expo Go app for Android and iOS: used to open the apps served through the Expo CLI
In a new terminal window, initialize the React Native project:
npx create-expo-app client
cd client
Now, start the development server:
npx expo start
To open the app on an iOS device, open the camera and scan the QR code shown in the terminal. On an Android device, press Scan QR code on the Home tab of the Expo Go app.
Our Expo Go app
Expo-av handles the recording of audio in our application. Our Flask server expects the file in .wav format, and the expo-av package allows us to specify that format before saving.
Install the necessary packages in the terminal:
yarn add axios expo-av react-native-picker-select
The application must be able to select a model size. There are five options to choose from:
Tiny
Base
Small
Medium
Large
The selected size determines which Whisper model the server loads to transcribe the input; for example, choosing Small while the language is English makes the server load the small.en model.
Again in the terminal, use the commands below to create a src folder with a /components subfolder and a Mode.tsx file inside it:
mkdir src
mkdir src/components
touch src/components/Mode.tsx
code src/components/Mode.tsx
Paste the following code block into the Mode.tsx file:
import React from "react"; import { View, Text, StyleSheet } from "react-native"; import RNPickerSelect from "react-native-picker-select"; const Mode = ({ onModelChange, transcribeTimeout, onTranscribeTimeoutChanged, }: any) => { function onModelChangeLocal(value: any) { onModelChange(value); } function onTranscribeTimeoutChangedLocal(event: any) { onTranscribeTimeoutChanged(event.target.value); } return (); }; export default Mode; const styles = StyleSheet.create({ title: { fontWeight: "200", fontSize: 25, float: "left", }, }); const customPickerStyles = StyleSheet.create({ inputIOS: { fontSize: 14, paddingVertical: 10, paddingHorizontal: 12, borderWidth: 1, borderColor: "green", borderRadius: 8, color: "black", paddingRight: 30, // to ensure the text is never behind the icon }, inputAndroid: { fontSize: 14, paddingHorizontal: 10, paddingVertical: 8, borderWidth: 1, borderColor: "blue", borderRadius: 8, color: "black", paddingRight: 30, // to ensure the text is never behind the icon }, }); Model Size onModelChangeLocal(value)} useNativeAndroidPickerStyle={false} placeholder={{ label: "Select model", value: null }} items={[ { label: "tiny", value: "tiny" }, { label: "base", value: "base" }, { label: "small", value: "small" }, { label: "medium", value: "medium" }, { label: "large", value: "large" }, ]} style={customPickerStyles} /> Timeout :{transcribeTimeout}
The server returns output containing the transcribed text. This component receives the output data and displays it. Create the component file with the following commands:
mkdir src
mkdir src/components
touch src/components/TranscribeOutput.tsx
code src/components/TranscribeOutput.tsx
Paste the following code block into the TranscribeOutput.tsx file:
import React from "react"; import { Text, View, StyleSheet } from "react-native"; const TranscribedOutput = ({ transcribedText, interimTranscribedText, }: any) => { if (transcribedText.length === 0 && interimTranscribedText.length === 0) { return... ; } return (); }; const styles = StyleSheet.create({ box: { borderColor: "black", borderRadius: 10, marginBottom: 0, }, text: { fontWeight: "400", fontSize: 30, }, }); export default TranscribedOutput; {transcribedText} {interimTranscribedText}
The application relies on Axios to send and receive data from the Flask server; we installed it in a previous section. The default language for testing the application is English.
In the App.tsx file, import the necessary dependencies:
import * as React from "react"; import { Text, StyleSheet, View, Button, ActivityIndicator, } from "react-native"; import { Audio } from "expo-av"; import FormData from "form-data"; import axios from "axios"; import Mode from "./src/components/Mode"; import TranscribedOutput from "./src/components/TranscribeOutput";
The application needs to keep track of the recordings, the transcribed data, whether recording is in progress, and whether transcription is in progress. The language, model, and timeout are set in state by default.
export default () => {
  const [recording, setRecording] = React.useState(false as any);
  const [recordings, setRecordings] = React.useState([]);
  const [message, setMessage] = React.useState("");
  const [transcribedData, setTranscribedData] = React.useState([] as any);
  const [interimTranscribedData] = React.useState("");
  const [isRecording, setIsRecording] = React.useState(false);
  const [isTranscribing, setIsTranscribing] = React.useState(false);
  const [selectedLanguage, setSelectedLanguage] = React.useState("english");
  const [selectedModel, setSelectedModel] = React.useState(1);
  const [transcribeTimeout, setTranscribeTimout] = React.useState(5);
  const [stopTranscriptionSession, setStopTranscriptionSession] =
    React.useState(false);
  const [isLoading, setLoading] = React.useState(false);

  return (
    <View style={styles.root}>
      {/* the full UI is assembled later in this article */}
    </View>
  );
};

const styles = StyleSheet.create({
  root: {
    display: "flex",
    flex: 1,
    alignItems: "center",
    textAlign: "center",
    flexDirection: "column",
  },
});
The useRef Hook enables us to track the currently initialized properties. We want to set up refs for the transcription session, the language, and the model.
Paste the code block below the setLoading useState Hook:
const [isLoading, setLoading] = React.useState(false);

const intervalRef: any = React.useRef(null);

const stopTranscriptionSessionRef = React.useRef(stopTranscriptionSession);
stopTranscriptionSessionRef.current = stopTranscriptionSession;

const selectedLangRef = React.useRef(selectedLanguage);
selectedLangRef.current = selectedLanguage;

const selectedModelRef = React.useRef(selectedModel);
selectedModelRef.current = selectedModel;

const supportedLanguages = [
  "english", "chinese", "german", "spanish", "russian", "korean", "french",
  "japanese", "portuguese", "turkish", "polish", "catalan", "dutch", "arabic",
  "swedish", "italian", "indonesian", "hindi", "finnish", "vietnamese",
  "hebrew", "ukrainian", "greek", "malay", "czech", "romanian", "danish",
  "hungarian", "tamil", "norwegian", "thai", "urdu", "croatian", "bulgarian",
  "lithuanian", "latin", "maori", "malayalam", "welsh", "slovak", "telugu",
  "persian", "latvian", "bengali", "serbian", "azerbaijani", "slovenian",
  "kannada", "estonian", "macedonian", "breton", "basque", "icelandic",
  "armenian", "nepali", "mongolian", "bosnian", "kazakh", "albanian",
  "swahili", "galician", "marathi", "punjabi", "sinhala", "khmer", "shona",
  "yoruba", "somali", "afrikaans", "occitan", "georgian", "belarusian",
  "tajik", "sindhi", "gujarati", "amharic", "yiddish", "lao", "uzbek",
  "faroese", "haitian creole", "pashto", "turkmen", "nynorsk", "maltese",
  "sanskrit", "luxembourgish", "myanmar", "tibetan", "tagalog", "malagasy",
  "assamese", "tatar", "hawaiian", "lingala", "hausa", "bashkir", "javanese",
  "sundanese",
];

const modelOptions = ["tiny", "base", "small", "medium", "large"];

React.useEffect(() => {
  return () => clearInterval(intervalRef.current);
}, []);

function handleTranscribeTimeoutChange(newTimeout: any) {
  setTranscribeTimout(newTimeout);
}
In this section, we'll write five functions to handle the audio transcription.
The first function is the startRecording function. This function enables the application to request permission to use the microphone. The desired audio format is preset, and we use a ref for tracking the transcription timeout:
async function startRecording() {
  try {
    console.log("Requesting permissions..");
    const permission = await Audio.requestPermissionsAsync();
    if (permission.status === "granted") {
      await Audio.setAudioModeAsync({
        allowsRecordingIOS: true,
        playsInSilentModeIOS: true,
      });
      alert("Starting recording..");
      const RECORDING_OPTIONS_PRESET_HIGH_QUALITY: any = {
        android: {
          extension: ".mp4",
          outputFormat: Audio.RECORDING_OPTION_ANDROID_OUTPUT_FORMAT_MPEG_4,
          audioEncoder: Audio.RECORDING_OPTION_ANDROID_AUDIO_ENCODER_AMR_NB,
          sampleRate: 44100,
          numberOfChannels: 2,
          bitRate: 128000,
        },
        ios: {
          extension: ".wav",
          audioQuality: Audio.RECORDING_OPTION_IOS_AUDIO_QUALITY_MIN,
          sampleRate: 44100,
          numberOfChannels: 2,
          bitRate: 128000,
          linearPCMBitDepth: 16,
          linearPCMIsBigEndian: false,
          linearPCMIsFloat: false,
        },
      };
      const { recording }: any = await Audio.Recording.createAsync(
        RECORDING_OPTIONS_PRESET_HIGH_QUALITY
      );
      setRecording(recording);
      console.log("Recording started");
      setStopTranscriptionSession(false);
      setIsRecording(true);
      intervalRef.current = setInterval(
        transcribeInterim,
        transcribeTimeout * 1000
      );
      console.log("erer", recording);
    } else {
      setMessage("Please grant permission to app to access microphone");
    }
  } catch (err) {
    console.error(" Failed to start recording", err);
  }
}
The stopRecording function enables the user to stop the recording. The recording state variable stores and holds the updated recordings.
async function stopRecording() {
  console.log("Stopping recording..");
  setRecording(undefined);
  await recording.stopAndUnloadAsync();
  const uri = recording.getURI();
  let updatedRecordings = [...recordings] as any;
  const { sound, status } = await recording.createNewLoadedSoundAsync();
  updatedRecordings.push({
    sound: sound,
    duration: getDurationFormatted(status.durationMillis),
    file: recording.getURI(),
  });
  setRecordings(updatedRecordings);
  console.log("Recording stopped and stored at", uri);
  // Fetch audio binary blob data
  clearInterval(intervalRef.current);
  setStopTranscriptionSession(true);
  setIsRecording(false);
  setIsTranscribing(false);
}
To get the duration of each recording and render the recorded lines, create the getDurationFormatted and getRecordingLines functions. As an example, a recording of 83,000 ms is formatted as 1:23:
function getDurationFormatted(millis: any) {
  const minutes = millis / 1000 / 60;
  const minutesDisplay = Math.floor(minutes);
  const seconds = Math.round((minutes - minutesDisplay) * 60);
  const secondDisplay = seconds < 10 ? `0${seconds}` : seconds;
  return `${minutesDisplay}:${secondDisplay}`;
}

function getRecordingLines() {
  return recordings.map((recordingLine: any, index) => {
    return (
      <View key={index}>
        <Text>
          {" "}
          Recording {index + 1} - {recordingLine.duration}
        </Text>
      </View>
    );
  });
}
The transcribeRecording function allows us to communicate with the Flask server. We access the audio we created with the getURI() function from the expo-av library. The language, model_size, and audio_data fields are the key pieces of data we send to the server.
A 200 response indicates success. We store the response with the setTranscribedData useState Hook; this response contains our transcribed text.
function transcribeInterim() {
  clearInterval(intervalRef.current);
  setIsRecording(false);
}

async function transcribeRecording() {
  const uri = recording.getURI();
  const filetype = uri.split(".").pop();
  const filename = uri.split("/").pop();
  setLoading(true);
  const formData: any = new FormData();
  formData.append("language", selectedLangRef.current);
  formData.append("model_size", modelOptions[selectedModelRef.current]);
  formData.append(
    "audio_data",
    {
      uri,
      type: `audio/${filetype}`,
      name: filename,
    },
    "temp_recording"
  );
  axios({
    url: "https://2c75-197-210-53-169.eu.ngrok.io/transcribe",
    method: "POST",
    data: formData,
    headers: {
      Accept: "application/json",
      "Content-Type": "multipart/form-data",
    },
  })
    .then(function (response) {
      console.log("response :", response);
      setTranscribedData((oldData: any) => [...oldData, response.data]);
      setLoading(false);
      setIsTranscribing(false);
      intervalRef.current = setInterval(
        transcribeInterim,
        transcribeTimeout * 1000
      );
    })
    .catch(function (error) {
      console.log("error :", error);
    });
  if (!stopTranscriptionSessionRef.current) {
    setIsRecording(true);
  }
}
Let's put together all the pieces created so far:
import * as React from "react"; import { Text, StyleSheet, View, Button, ActivityIndicator, } from "react-native"; import { Audio } from "expo-av"; import FormData from "form-data"; import axios from "axios"; import Mode from "./src/components/Mode"; import TranscribedOutput from "./src/components/TranscribeOutput"; export default () => { const [recording, setRecording] = React.useState(false as any); const [recordings, setRecordings] = React.useState([]); const [message, setMessage] = React.useState(""); const [transcribedData, setTranscribedData] = React.useState([] as any); const [interimTranscribedData] = React.useState(""); const [isRecording, setIsRecording] = React.useState(false); const [isTranscribing, setIsTranscribing] = React.useState(false); const [selectedLanguage, setSelectedLanguage] = React.useState("english"); const [selectedModel, setSelectedModel] = React.useState(1); const [transcribeTimeout, setTranscribeTimout] = React.useState(5); const [stopTranscriptionSession, setStopTranscriptionSession] = React.useState(false); const [isLoading, setLoading] = React.useState(false); const intervalRef: any = React.useRef(null); const stopTranscriptionSessionRef = React.useRef(stopTranscriptionSession); stopTranscriptionSessionRef.current = stopTranscriptionSession; const selectedLangRef = React.useRef(selectedLanguage); selectedLangRef.current = selectedLanguage; const selectedModelRef = React.useRef(selectedModel); selectedModelRef.current = selectedModel; const supportedLanguages = [ "english", "chinese", "german", "spanish", "russian", "korean", "french", "japanese", "portuguese", "turkish", "polish", "catalan", "dutch", "arabic", "swedish", "italian", "indonesian", "hindi", "finnish", "vietnamese", "hebrew", "ukrainian", "greek", "malay", "czech", "romanian", "danish", "hungarian", "tamil", "norwegian", "thai", "urdu", "croatian", "bulgarian", "lithuanian", "latin", "maori", "malayalam", "welsh", "slovak", "telugu", "persian", "latvian", "bengali", "serbian", "azerbaijani", "slovenian", "kannada", "estonian", "macedonian", "breton", "basque", "icelandic", "armenian", "nepali", "mongolian", "bosnian", "kazakh", "albanian", "swahili", "galician", "marathi", "punjabi", "sinhala", "khmer", "shona", "yoruba", "somali", "afrikaans", "occitan", "georgian", "belarusian", "tajik", "sindhi", "gujarati", "amharic", "yiddish", "lao", "uzbek", "faroese", "haitian creole", "pashto", "turkmen", "nynorsk", "maltese", "sanskrit", "luxembourgish", "myanmar", "tibetan", "tagalog", "malagasy", "assamese", "tatar", "hawaiian", "lingala", "hausa", "bashkir", "javanese", "sundanese", ]; const modelOptions = ["tiny", "base", "small", "medium", "large"]; React.useEffect(() => { return () => clearInterval(intervalRef.current); }, []); function handleTranscribeTimeoutChange(newTimeout: any) { setTranscribeTimout(newTimeout); } async function startRecording() { try { console.log("Requesting permissions.."); const permission = await Audio.requestPermissionsAsync(); if (permission.status === "granted") { await Audio.setAudioModeAsync({ allowsRecordingIOS: true, playsInSilentModeIOS: true, }); alert("Starting recording.."); const RECORDING_OPTIONS_PRESET_HIGH_QUALITY: any = { android: { extension: ".mp4", outputFormat: Audio.RECORDING_OPTION_ANDROID_OUTPUT_FORMAT_MPEG_4, audioEncoder: Audio.RECORDING_OPTION_ANDROID_AUDIO_ENCODER_AMR_NB, sampleRate: 44100, numberOfChannels: 2, bitRate: 128000, }, ios: { extension: ".wav", audioQuality: Audio.RECORDING_OPTION_IOS_AUDIO_QUALITY_MIN, sampleRate: 44100, 
numberOfChannels: 2, bitRate: 128000, linearPCMBitDepth: 16, linearPCMIsBigEndian: false, linearPCMIsFloat: false, }, }; const { recording }: any = await Audio.Recording.createAsync( RECORDING_OPTIONS_PRESET_HIGH_QUALITY ); setRecording(recording); console.log("Recording started"); setStopTranscriptionSession(false); setIsRecording(true); intervalRef.current = setInterval( transcribeInterim, transcribeTimeout * 1000 ); console.log("erer", recording); } else { setMessage("Please grant permission to app to access microphone"); } } catch (err) { console.error(" Failed to start recording", err); } } async function stopRecording() { console.log("Stopping recording.."); setRecording(undefined); await recording.stopAndUnloadAsync(); const uri = recording.getURI(); let updatedRecordings = [...recordings] as any; const { sound, status } = await recording.createNewLoadedSoundAsync(); updatedRecordings.push({ sound: sound, duration: getDurationFormatted(status.durationMillis), file: recording.getURI(), }); setRecordings(updatedRecordings); console.log("Recording stopped and stored at", uri); // Fetch audio binary blob data clearInterval(intervalRef.current); setStopTranscriptionSession(true); setIsRecording(false); setIsTranscribing(false); } function getDurationFormatted(millis: any) { const minutes = millis / 1000 / 60; const minutesDisplay = Math.floor(minutes); const seconds = Math.round(minutes - minutesDisplay) * 60; const secondDisplay = seconds < 10 ? `0${seconds}` : seconds; return `${minutesDisplay}:${secondDisplay}`; } function getRecordingLines() { return recordings.map((recordingLine: any, index) => { return (); }); } function transcribeInterim() { clearInterval(intervalRef.current); setIsRecording(false); } async function transcribeRecording() { const uri = recording.getURI(); const filetype = uri.split(".").pop(); const filename = uri.split("/").pop(); setLoading(true); const formData: any = new FormData(); formData.append("language", selectedLangRef.current); formData.append("model_size", modelOptions[selectedModelRef.current]); formData.append( "audio_data", { uri, type: `audio/${filetype}`, name: filename, }, "temp_recording" ); axios({ url: "https://2c75-197-210-53-169.eu.ngrok.io/transcribe", method: "POST", data: formData, headers: { Accept: "application/json", "Content-Type": "multipart/form-data", }, }) .then(function (response) { console.log("response :", response); setTranscribedData((oldData: any) => [...oldData, response.data]); setLoading(false); setIsTranscribing(false); intervalRef.current = setInterval( transcribeInterim, transcribeTimeout * 1000 ); }) .catch(function (error) { console.log("error : error"); }); if (!stopTranscriptionSessionRef.current) { setIsRecording(true); } } return ( {" "} Recording {index + 1} - {recordingLine.duration} Speech to Text. {message} {!isRecording && !isTranscribing && (
Run the React Native application with the following command:
yarn start
The project repository is publicly available.
In this article, we learned how to create a speech-to-text feature in a React Native application. I foresee Whisper changing how narration and dictation work in everyday life. The techniques covered in this article enable the creation of a dictation app.
I am excited to see the new and innovative ways developers extend Whisper, for example, using Whisper to perform actions on our mobile and web devices, or using Whisper to improve the accessibility of our websites and apps.