A Roundup of Free Chinese Speech Datasets

As of January 11, 2019

  1. AISHELL-1
  2. AISHELL-2 (free on application for universities and research institutes)
  3. THCHS30
  4. ST-CMDS
  5. Primewords Chinese Corpus Set 1

As of July 7, 2020

  1. Speechocean 10 Hours Chinese Mandarin Speech Recognition Corpus
  2. MobvoiHotwords
  3. CN-Celeb
  4. MAGICDATA

Download: all of these are available on OpenSLR except AISHELL-2, which can be requested or purchased from AISHELL (希尔贝壳).

 

Each entry below gives the dataset name, its duration in hours (where stated), and a description.

AISHELL-1

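Duration (hours): 178

The recording texts of AISHELL-ASR0009 cover 11 domains, including smart home, autonomous driving, and industrial production. Recording took place in a quiet indoor environment with three devices used simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android phone (16 kHz, 16-bit), and an iOS phone (16 kHz, 16-bit). The audio recorded with the high-fidelity microphone was downsampled to 16 kHz to produce AISHELL-ASR0009-OS1. 400 speakers from different accent regions of China took part in the recording. After transcription and annotation by professional speech annotators and strict quality inspection, the text accuracy of this corpus is above 95%. It is split into training, development, and test sets. (Academic research is supported; commercial use is prohibited without permission.)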
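The downsampling from 44.1 kHz to 16 kHz mentioned above is a rational rate change. As a minimal sketch (not the procedure AISHELL actually used, which the source does not describe), the following computes the integer up/down factors such a conversion needs and the resulting sample count. The function names are illustrative assumptions.

```python
from fractions import Fraction

def resample_factors(src_rate: int, dst_rate: int) -> tuple[int, int]:
    """Return the smallest integer (up, down) pair with dst_rate = src_rate * up / down."""
    ratio = Fraction(dst_rate, src_rate)  # Fraction reduces to lowest terms
    return ratio.numerator, ratio.denominator

def resampled_length(num_samples: int, up: int, down: int) -> int:
    """Number of output samples after scaling the rate by up/down (rounded up)."""
    return (num_samples * up + down - 1) // down

up, down = resample_factors(44_100, 16_000)
print(up, down)                            # 160 441
print(resampled_length(44_100, up, down))  # 16000, i.e. one second of audio
```

Factors like these are what a polyphase resampler such as `scipy.signal.resample_poly(x, up, down)` takes as arguments.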

AISHELL-2

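Duration (hours): 1000

AISHELL-2, the Mandarin Chinese speech corpus from AISHELL (希尔贝壳), contains 1000 hours of speech, of which 718 hours come from AISHELL-ASR0009-[ZH-CN] and 282 hours from AISHELL-ASR0010-[ZH-CN]. The recording texts cover 12 domains, including wake words, voice control commands, smart home, autonomous driving, and industrial production. Recording took place in a quiet indoor environment with three devices used simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android phone (16 kHz, 16-bit), and an iOS phone (16 kHz, 16-bit). AISHELL-2 uses the speech recorded with the iOS phones. 1991 speakers from different accent regions of China took part in the recording. After transcription and annotation by professional speech annotators and strict quality inspection, the text accuracy of this corpus is above 96%. (Academic research is supported; commercial use is prohibited without permission.)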

AISHELL-EVAL (AISHELL2-2018A-EVAL)

Test data: 5000 utterances from 10 speakers

Dev data: 2500 utterances from 5 speakers

Sampling rate: 16 kHz

Sample format: 16-bit

Environment: Indoor

Speech data type: PCM

Number of channels: 1

Recording equipment: iOS / Android / High Fidelity Microphone
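A quick way to confirm that a downloaded WAV file matches the format listed above (16 kHz, 16-bit, single channel) is to read its header with Python's standard `wave` module. The sketch below is only an illustration: the helper names are assumptions, and it builds a short silent file in memory so that it runs without any corpus data.

```python
import io
import wave

def wav_params(source):
    """Return (sample rate, sample width in bytes, channels) for a WAV path or file object."""
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth(), wav.getnchannels()

def matches_spec(source, rate=16000, sample_width=2, channels=1):
    """True if the WAV header matches 16 kHz, 16-bit (2 bytes), mono by default."""
    return wav_params(source) == (rate, sample_width, channels)

def make_silent_wav(rate, sample_width, channels, seconds=0.1):
    """Create an in-memory WAV file of silence, used here only for demonstration."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (int(rate * seconds) * sample_width * channels))
    buf.seek(0)
    return buf

print(matches_spec(make_silent_wav(16000, 2, 1)))  # True
print(matches_spec(make_silent_wav(44100, 2, 1)))  # False
```

For real data, pass the path of a corpus file to `matches_spec` instead of the in-memory buffer.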

THCHS30

Duration (hours): 30

THCHS30 is an open Chinese speech database published by the Center for Speech and Language Technology (CSLT) at Tsinghua University.

ST-CMDS

Duration (hours): 500

A free Chinese Mandarin corpus by Surfingtech (www.surfing.ai), containing 102,600 utterances from 855 speakers. The corpus was recorded in a quiet indoor environment using cellphones. Each speaker has 120 utterances. All utterances were carefully transcribed and checked by humans. Transcription accuracy is guaranteed.

Primewords Chinese Corpus Set 1

Duration (hours): 100

This free Chinese Mandarin speech corpus is released by Shanghai Primewords Information Technology Co., Ltd. The corpus was recorded with smartphones by 296 native Chinese speakers. The transcription accuracy is higher than 98% at a 95% confidence level. It is free for academic use. The mapping between transcripts and utterances is given in JSON format.
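Since the transcript-to-utterance mapping is distributed as JSON, it can be loaded into a lookup table with the standard `json` module. The sketch below is hedged: the source does not describe the JSON layout, so the record structure, the key names `file` and `text`, and the sample sentences are all assumptions made for illustration. Inspect the actual file and adjust the keys before using it.

```python
import json

# Hypothetical sample: the array-of-records layout and the key names
# "file" and "text" are assumptions, not taken from the corpus documentation.
SAMPLE_JSON = """
[
  {"file": "utt_0001.wav", "text": "今天 天气 很 好"},
  {"file": "utt_0002.wav", "text": "请 打开 空调"}
]
"""

def load_transcripts(json_text: str, file_key: str = "file", text_key: str = "text") -> dict[str, str]:
    """Build a {utterance file name: transcript} lookup from a JSON array of records."""
    records = json.loads(json_text)
    return {record[file_key]: record[text_key] for record in records}

transcripts = load_transcripts(SAMPLE_JSON)
print(transcripts["utt_0002.wav"])  # 请 打开 空调
```

Keeping the key names as parameters makes it easy to adapt the loader once the real field names are known.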

Speechocean 10 Hours Chinese Mandarin Speech Recognition Corpus

Duration (hours): 10.33

The Chinese Mandarin speech recognition corpus is provided by speechocean.

This is a 10.33 hours corpus, which is collected over 4 different microphones simultaneously.

The corpus was recorded by 20 speakers (10 males and 10 females) in a quiet office. Around 120 utterances were recorded from each speaker per channel.

Transcription files are included.

The sentence transcription accuracy is higher than 98%.

It is free to use for academic purposes.

This corpus is a subset of a larger corpus (159 hours). Contact speechocean if you are interested.

MobvoiHotwords

MobvoiHotwords is a corpus of wake-up words collected from a commercial Mobvoi smart speaker. It consists of keyword and non-keyword utterances.

For the keyword data, utterances containing either 'Hi xiaowen' or 'Nihao Wenwen' were collected. For each keyword, there are about 36k utterances. All keyword data was collected from 788 subjects, aged 3-65, at different distances from the smart speaker (1, 3 and 5 meters). Different noises (typical home environment noises such as music and TV) at varying sound pressure levels were played in the background during collection.

CN-Celeb

This is a large-scale speaker recognition dataset collected 'in the wild'. It contains more than 130,000 utterances from 1,000 Chinese celebrities and covers 11 different real-world genres. All audio files are single-channel, sampled at 16 kHz with 16-bit precision.

MAGICDATA

Duration (hours): 755

  • The corpus contains 755 hours of speech data, which is mostly mobile recorded data.
  • 1080 speakers from different accent areas in China were invited to participate in the recording.
  • The sentence transcription accuracy is higher than 98%.
  • Recordings are conducted in a quiet indoor environment.
  • The database is divided into a training set, a validation set, and a testing set in a ratio of 51:1:2.
  • Detailed information such as speech data coding and speaker information is preserved in the metadata file.
  • The domain of recording texts is diversified, including interactive Q&A, music search, SNS messages, home command and control, etc.
  • Segmented transcripts are also provided.
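To get a feel for what the 51:1:2 split means in practice, the sketch below divides the 755 hours proportionally. This is an assumption-laden illustration: the source does not say whether the ratio applies to hours, utterances, or speakers, so the figures are rough estimates rather than official set sizes, and `split_by_ratio` is a hypothetical helper.

```python
from fractions import Fraction

def split_by_ratio(total, ratio):
    """Divide `total` into parts proportional to the integers in `ratio`."""
    denominator = sum(ratio)
    return [Fraction(total) * part / denominator for part in ratio]

# Assumption: the 51:1:2 ratio is applied to total duration in hours.
train, valid, test = split_by_ratio(755, (51, 1, 2))
for name, hours in (("train", train), ("validation", valid), ("test", test)):
    print(f"{name}: {float(hours):.1f} h")
# train: 713.1 h
# validation: 14.0 h
# test: 28.0 h
```

Exact fractions are used so that the three parts always sum back to the total without rounding drift.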

 
