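In the previous article, I described how to write a crawler to scrape a large amount of MIDI data from the Free Midi Files Download website. This article covers using the pretty_midi library to convert MIDI files into matrices and building a dataset with PyTorch's Dataset class, in preparation for feeding tensors into training and testing later.
The first step in building the dataset is to extract the musical information in each MIDI file as a (time, pitch) matrix and save it to an npz file in sparse form. pretty_midi lets you iterate over the notes (Note) of every instrument track and read each note's pitch (pitch), start time (start), and end time (end). Dividing the start and end times by the length of a sixteenth note (60 s / 120 BPM / 4 = 0.125 s, which assumes a fixed tempo of 120 BPM) gives the positions of the note's start and end along the time axis of the matrix.
For the code, see MusicCritique/util/data/create_database.py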
import os

import numpy as np
import pretty_midi


def generate_nonzeros_by_notes():
    root_dir = 'E:/merged_midi/'
    # get_midi_collection / get_genre_collection are the project's MongoDB helpers
    midi_collection = get_midi_collection()
    genre_collection = get_genre_collection()
    for genre in genre_collection.find():
        genre_name = genre['Name']
        print(genre_name)
        npy_file_root_dir = 'E:/midi_matrix/one_instr/' + genre_name + '/'
        os.makedirs(npy_file_root_dir, exist_ok=True)
        for midi in midi_collection.find({'Genre': genre_name, 'OneInstrNpyGenerated': False}, no_cursor_timeout=True):
            path = root_dir + genre_name + '/' + midi['md5'] + '.mid'
            save_path = npy_file_root_dir + midi['md5'] + '.npz'
            pm = pretty_midi.PrettyMIDI(path)
            # segment_num = math.ceil(pm.get_end_time() / 8)
            note_range = (24, 108)  # keep MIDI pitches 24..107 (84 values)
            # data = np.zeros((segment_num, 64, 84), np.bool_)
            nonzeros = []
            sixteenth_length = 60 / 120 / 4  # seconds per sixteenth note at 120 BPM
            for instr in pm.instruments:
                if instr.is_drum:
                    continue  # skip percussion tracks
                for note in instr.notes:
                    start = int(note.start / sixteenth_length)
                    end = int(note.end / sixteenth_length)
                    pitch = note.pitch
                    if pitch < note_range[0] or pitch >= note_range[1]:
                        continue
                    pitch -= note_range[0]
                    # one coordinate per sixteenth-note step the note covers
                    for time_raw in range(start, end):
                        segment = time_raw // 64  # index of the four-bar segment
                        time = time_raw % 64      # step inside that segment
                        nonzeros.append([segment, time, pitch])
            nonzeros = np.array(nonzeros)
            np.savez_compressed(save_path, nonzeros)  # stored under the key 'arr_0'
            midi_collection.update_one({'_id': midi['_id']}, {'$set': {'OneInstrNpyGenerated': True}})
            # count_documents replaces Collection.count, which newer PyMongo versions removed
            print('Progress: {:.2%}'.format(
                midi_collection.count_documents({'Genre': genre_name, 'OneInstrNpyGenerated': True})
                / midi_collection.count_documents({'Genre': genre_name})), end='\n')
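Taking the above into account, the matrix obtained from each MIDI file has the shape [number of four-bar segments × 1 × 64 × 84] (64 time steps = 4 bars × 16 sixteenth notes; 84 pitches = MIDI notes 24 through 107). To reduce disk usage, the file stores only the coordinates of each nonzero entry of the matrix; the sparse matrix can be rebuilt from these coordinates later.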
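To make the coordinate scheme concrete, here is a minimal sketch of how a single note maps to [segment, time, pitch] entries. The helper name `note_to_coords` and the example note are hypothetical and only for illustration; the constants mirror the ones used in the function above.

```python
SIXTEENTH = 60 / 120 / 4         # 0.125 s per sixteenth note at 120 BPM
STEPS_PER_SEGMENT = 64           # 4 bars x 16 sixteenth notes
PITCH_LOW, PITCH_HIGH = 24, 108  # kept pitch range [24, 108), i.e. 84 pitches


def note_to_coords(start, end, pitch):
    """Return the [segment, time, pitch] cells covered by one note (hypothetical helper)."""
    if not PITCH_LOW <= pitch < PITCH_HIGH:
        return []  # note lies outside the kept pitch range
    first = int(start / SIXTEENTH)
    last = int(end / SIXTEENTH)
    return [[step // STEPS_PER_SEGMENT, step % STEPS_PER_SEGMENT, pitch - PITCH_LOW]
            for step in range(first, last)]


# A made-up middle C (pitch 60) held from 7.75 s to 8.25 s crosses
# the boundary between the first and the second four-bar segment.
coords = note_to_coords(7.75, 8.25, 60)
print(coords)  # [[0, 62, 36], [0, 63, 36], [1, 0, 36], [1, 1, 36]]
```

Each segment spans 64 × 0.125 s = 8 s, so a note that straddles the 8-second mark is split across two segments.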
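The previous step stored each MIDI file's musical information as sparse-matrix coordinates in its own npz file. To make building the dataset easier, I then merged all the sparse matrices of each genre into a single file.
For the code, see MusicCritique/util/data/create_database.py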
def merge_all_sparse_matrices():
    midi_collection = get_midi_collection()
    genre_collection = get_genre_collection()
    root_dir = 'E:/midi_matrix/one_instr/'
    time_step = 64
    valid_range = (24, 108)
    for genre in genre_collection.find({'DatasetGenerated': False}):
        save_dir = 'd:/data/' + genre['Name']
        os.makedirs(save_dir, exist_ok=True)
        print(genre['Name'])
        whole_length = genre['ValidPiecesNum']
        shape = np.array([whole_length, time_step, valid_range[1] - valid_range[0]])
        processed = 0
        last_piece_num = 0  # offset of the current file's segments in the merged matrix
        whole_num = midi_collection.count_documents({'Genre': genre['Name']})
        non_zeros = []
        for midi in midi_collection.find({'Genre': genre['Name']}, no_cursor_timeout=True):
            path = root_dir + genre['Name'] + '/' + midi['md5'] + '.npz'
            valid_pieces_num = midi['PiecesNum'] - 1  # drop the file's last segment
            with np.load(path) as f:
                matrix = f['arr_0'].copy()
            print(valid_pieces_num, matrix.shape[0])
            for data in matrix:
                try:
                    data = data.tolist()
                    if data[0] < valid_pieces_num:
                        piece_order = last_piece_num + data[0]
                        non_zeros.append([piece_order, data[1], data[2]])
                except Exception:
                    # report the file whose coordinates could not be read
                    print(path)
            last_piece_num += valid_pieces_num
            processed += 1
            print('Progress: {:.2%}\n'.format(processed / whole_num))
        non_zeros = np.array(non_zeros)
        print(non_zeros.shape)
        np.savez_compressed(save_dir + '/data_sparse' + '.npz', nonzeros=non_zeros, shape=shape)
        genre_collection.update_one({'_id': genre['_id']}, {'$set': {'DatasetGenerated': True}})
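The ValidPiecesNum field of genre used in this function was added beforehand. It is the total number of four-bar segments across all MIDI files of a genre, with the incomplete trailing part deducted (in the code, the last segment of each file is dropped via PiecesNum - 1).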
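The offsetting logic is easier to follow on toy data. The sketch below is a simplified stand-in for the merge step, without MongoDB or files; `merge_coords` and the two made-up inputs are hypothetical and only illustrate how segment indices are shifted and how each file's last segment is discarded.

```python
def merge_coords(files):
    """Merge per-file coordinates into one list (hypothetical helper).

    files: list of (pieces_num, coords) pairs, one pair per MIDI file,
    where coords holds [segment, time, pitch] entries.
    """
    merged = []
    offset = 0  # number of valid segments contributed by earlier files
    for pieces_num, coords in files:
        valid = pieces_num - 1  # the last segment of each file is dropped
        for segment, time, pitch in coords:
            if segment < valid:
                merged.append([offset + segment, time, pitch])
        offset += valid
    return merged, offset


file_a = (3, [[0, 0, 36], [1, 5, 40], [2, 1, 36]])  # 3 segments, last one dropped
file_b = (2, [[0, 3, 50], [1, 0, 50]])              # 2 segments, last one dropped
merged, total_segments = merge_coords([file_a, file_b])
print(merged)          # [[0, 0, 36], [1, 5, 40], [2, 3, 50]]
print(total_segments)  # 3
```

The returned total plays the role of ValidPiecesNum here: it is the first dimension of the merged matrix.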
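Since all nonzero coordinates are already stored in the npz file, the dense matrix is obtained by iterating over these coordinates and setting the value at each of them to 1.0.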
def generate_sparse_matrix_of_genre(genre):
    npy_path = 'D:/data/' + genre + '/data_sparse.npz'
    with np.load(npy_path) as f:
        shape = f['shape']
        # np.float_ was removed in NumPy 2.0; np.float64 is the same type
        data = np.zeros(shape, np.float64)
        nonzeros = f['nonzeros']
        for x in nonzeros:
            data[(x[0], x[1], x[2])] = 1.
    return data
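As a possible alternative to the per-coordinate loop, NumPy's integer array indexing can set all entries in one assignment. This is only a sketch on made-up data, and it assumes that nonzeros is an integer array of shape (N, 3); it is not the code used in the project.

```python
import numpy as np

# Toy stand-ins for the arrays stored in data_sparse.npz
shape = (2, 64, 84)
nonzeros = np.array([[0, 62, 36], [0, 63, 36], [1, 0, 36]])

dense = np.zeros(shape, dtype=np.float64)
# One index array per axis: every listed cell is set to 1.0 at once
dense[nonzeros[:, 0], nonzeros[:, 1], nonzeros[:, 2]] = 1.0

# np.argwhere recovers the coordinates of the nonzero cells in row-major order
recovered = np.argwhere(dense)
print(dense.shape, int(dense.sum()))
```

For a genre with many segments, this avoids a Python-level loop over every coordinate.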
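The dataset itself is built by subclassing PyTorch's Dataset class and overriding a few key methods; see the official documentation for reference.
For the code, see MusicCritique/util/data/dataset.py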
import numpy as np
from torch.utils import data

# get_genre_collection, generate_sparse_matrix_of_genre and
# generate_sparse_matrix_from_multiple_genres come from the project's own modules


class SteelyDataset(data.Dataset):
    def __init__(self, genreA, genreB, phase, use_mix):
        assert phase in ['train', 'test'], 'not valid dataset type'
        sources = ['metal', 'punk', 'folk', 'newage', 'country', 'bluegrass']
        genre_collection = get_genre_collection()
        self.data_path = 'D:/data/'
        numA = genre_collection.find_one({'Name': genreA})['ValidPiecesNum']
        numB = genre_collection.find_one({'Name': genreB})['ValidPiecesNum']
        # 90/10 split, sized by the smaller of the two genres
        train_num = int(min(numA, numB) * 0.9)
        test_num = min(numA, numB) - train_num
        if phase == 'train':  # compare strings with ==, not `is`
            self.length = train_num
            dataA = np.expand_dims(generate_sparse_matrix_of_genre(genreA)[:self.length], 1)
            dataB = np.expand_dims(generate_sparse_matrix_of_genre(genreB)[:self.length], 1)
            if use_mix:
                mixed = generate_sparse_matrix_from_multiple_genres(sources)
                np.random.shuffle(mixed)
                data_mixed = np.expand_dims(mixed[:self.length], 1)
                self.data = np.concatenate((dataA, dataB, data_mixed), axis=1)
            else:
                self.data = np.concatenate((dataA, dataB), axis=1)
        else:
            self.length = test_num
            # take the segments after the training part, so that the test split
            # does not overlap the training split (the original sliced [:test_num])
            test_slice = slice(train_num, train_num + test_num)
            dataA = np.expand_dims(generate_sparse_matrix_of_genre(genreA)[test_slice], 1)
            dataB = np.expand_dims(generate_sparse_matrix_of_genre(genreB)[test_slice], 1)
            self.data = np.concatenate((dataA, dataB), axis=1)

    def __getitem__(self, index):
        return self.data[index, :, :, :]

    def __len__(self):
        return self.length
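The key to subclassing is overriding the initializer, __getitem__, and __len__. When building the dataset, to make the data easier to access I merged dataA and dataB into one array and used the size of the smaller genre to determine the overall dataset size, so that both genres contribute the same number of samples. Along the way, NumPy's expand_dims adds a new dimension, and concatenate joins the two arrays along that new dimension.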
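The shape bookkeeping can be checked on small made-up arrays. In this sketch, data_a and data_b are hypothetical stand-ins for the two genre matrices (5 and 7 segments respectively), not the real data.

```python
import numpy as np

# Hypothetical genre matrices with different numbers of four-bar segments
data_a = np.zeros((5, 64, 84))
data_b = np.ones((7, 64, 84))

# Truncate both to the size of the smaller genre
length = min(len(data_a), len(data_b))

a = np.expand_dims(data_a[:length], 1)   # (5, 64, 84) -> (5, 1, 64, 84)
b = np.expand_dims(data_b[:length], 1)   # (5, 64, 84) -> (5, 1, 64, 84)
paired = np.concatenate((a, b), axis=1)  # joined on the new axis: (5, 2, 64, 84)

sample = paired[0]  # what __getitem__(0) would return: shape (2, 64, 84)
print(paired.shape, sample.shape)  # (5, 2, 64, 84) (2, 64, 84)
```

Indexing a sample then yields one segment from each genre: sample[0] comes from genre A and sample[1] from genre B.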
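If you need it, the dataset can be downloaded from Baidu Cloud (extraction code: nsfi). If you run into problems while using it, please leave a comment below. Thanks for reading!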