Abstract
AISHELL-1 is by far the largest open-source speech corpus available for Mandarin speech recognition research. It was released with a baseline system containing solid training and testing pipelines for Mandarin ASR. In AISHELL-2, 1000 hours of clean read-speech data recorded on iOS devices is published, free for academic use. On top of the AISHELL-2 corpus, an improved recipe is developed and released, containing key components for industrial applications, such as Chinese word segmentation, flexible vocabulary expansion, and phone set transformation. The pipelines support various state-of-the-art techniques, such as time-delay neural networks and the lattice-free MMI objective function. In addition, we release dev and test data from other channels (Android and Mic). For the research community, we hope the AISHELL-2 corpus can be a solid resource for topics such as transfer learning and robust ASR. For industry, we hope the AISHELL-2 recipe can be a helpful reference for building meaningful industrial systems and products.
Index Terms: Speech recognition, Mandarin ASR, Industrial Speech Recognition
Introduction
Automatic Speech Recognition (ASR) is a major application domain in the current boom of Artificial Intelligence (AI). Both the research community and industry have invested heavily in improving ASR system performance. Among the solutions proposed, deep learning approaches have dominated for the last half decade. Given enough data, neural network (NN) models generally achieve better recognition accuracy and turn out to be more robust. From an industrial perspective, accessing and collecting large amounts of speech data has become easier than ever, thanks to the emerging market of smartphones and other smart devices. The research community, however, still has limited access to real-world application data. As a result, improvements reported in research do not always scale well to industrial scenarios. In computer vision, there are many high-quality free data sets that help turn research efforts into industrial applications, such as ImageNet [1] and COCO [2]. In Mandarin ASR, although corpora such as thchs30 [3] and hkust [4] exist, a large-scale, high-quality free corpus is still needed.
AISHELL-2 is a 1000-hour Mandarin Chinese speech corpus. Of these, 718 hours come from AISHELL-ASR0009-[ZH-CN] and 282 hours from AISHELL-ASR0010-[ZH-CN]. The utterances cover 12 domains, including keywords, voice commands, smart home, autonomous driving, and industrial production. Recording took place in a quiet indoor environment using three devices in parallel: a high-fidelity microphone (44.1 kHz, 16-bit), an Android mobile phone (16 kHz, 16-bit), and an iOS mobile phone (16 kHz, 16-bit). AISHELL-2 uses the audio recorded on the iOS phone. A total of 1991 speakers from different accent areas in China participated in the recording. Manual transcription accuracy is above 96%, ensured by professional speech annotation and strict quality inspection. (This database is free for academic research; commercial use without permission is prohibited.)
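The audio format stated above (16 kHz, 16-bit for the iOS channel) can be checked on local files before training. The following is a minimal sketch, assuming the recordings are stored as uncompressed PCM WAV files, which the description above does not state. The function names and constants are hypothetical and not part of the AISHELL-2 recipe; only Python's standard `wave` module is used.

```python
import wave

# Values taken from the corpus description for the iOS channel.
EXPECTED_RATE_HZ = 16000
EXPECTED_BITS = 16


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, duration_seconds) of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        duration = wav.getnframes() / rate
    return rate, bits, duration


def matches_corpus_format(path):
    """True if the file has the 16 kHz, 16-bit format described for the iOS channel."""
    rate, bits, _ = describe_wav(path)
    return rate == EXPECTED_RATE_HZ and bits == EXPECTED_BITS


def total_hours(paths):
    """Sum the durations of the given WAV files, in hours."""
    return sum(describe_wav(p)[2] for p in paths) / 3600.0
```

Running `total_hours` over each subset of a local copy gives a quick sanity check against the 718-hour and 282-hour figures, while `matches_corpus_format` flags files from the 44.1 kHz microphone channel that were mixed in by mistake.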
1000 hours
1991 Mandarin Chinese speakers
Speech recognition and speaker recognition experiments
Kaldi recipe, integrated into the Kaldi system
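AISHELL: innovation in AI big data and technology. Beijing Shell Shell Technology Co., Ltd. (北京希尔贝壳科技有限公司), founded in 2017, is a company focused on AI big data and technical services. It produces scenario-specific speech data and delivers solutions for voice-enabled products in the home, in-vehicle, and robotics domains. Using its machine learning platform, it has built a core technology stack for speech data evaluation, assisted transcription, data analysis, and intelligent voice customer service. http://www.aishelltech.com/aishell_2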