Code: https://github.com/hangzhaomit/Sound-of-Pixels
Paper: https://arxiv.org/abs/1804.03160
Project page: http://sound-of-pixels.csail.mit.edu/
Dataset: https://github.com/roudimit/MUSIC_dataset
Abstract
We introduce PixelPlayer, a system that, by leveraging large amounts of unlabeled videos, learns to locate image regions which produce sounds and separate the input sounds into a set of components that represents the sound from each pixel. Our approach capitalizes on the natural synchronization of the visual and audio modalities to learn models that jointly parse sounds and images, without requiring additional manual supervision. Experimental results on a newly collected MUSIC dataset show that our proposed Mix-and-Separate framework outperforms several baselines on source separation. Qualitative results suggest our model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources.
1 Introduction
The world generates a rich source of visual and auditory signals. Our visual and auditory systems are able to recognize objects in the world, segment image regions covered by the objects, and isolate sounds produced by objects. While auditory scene analysis [5] is widely studied in the fields of environmental sound recognition [26,18] and source separation [4,6,52,41,42,9], the natural synchronization between vision and sound can provide a rich supervisory signal for grounding sounds in vision [17,21,28]. Training systems to recognize objects from vision or sound typically requires large amounts of supervision. In this paper, however, we leverage joint audio visual learning to discover objects that produce sound in the world without manual supervision [36,30,1].
We show that by working with both auditory and visual information, we can learn in an unsupervised way to recognize objects from their visual appearance or the sound they make, to localize objects in images, and to separate the audio component coming from each object. We introduce a new system called PixelPlayer. Given an input video, PixelPlayer jointly separates the accompanying audio into components and spatially localizes them in the video. PixelPlayer enables us to listen to the sound originating from each pixel in the video.
Fig. 1 shows a working example of PixelPlayer (check the project website for sample videos and interactive demos). In this example, the system has been trained with a large number of videos containing people playing instruments in different combinations, including solos and duets. No label is provided on what instruments are present in each video, where they are located, or how they sound. During test time, the input (Fig. 1.a) is a video of several instruments played together, containing the visual frames I(x, y, t) and the mono audio S(t). PixelPlayer performs audio-visual source separation and localization, splitting the input sound signal to estimate output sound components S_out(x, y, t), each one corresponding to the sound coming from a spatial location (x, y) in the video frame. As an illustration, Fig. 1.c shows the recovered audio signals for 11 example pixels. The flat blue lines correspond to pixels that the system considers silent. The non-silent signals correspond to the sounds coming from each individual instrument. Fig. 1.d shows the estimated sound energy, or volume of the audio signal, from each pixel. Note that the system correctly detects that the sounds are coming from the two instruments and not from the background. Fig. 1.e shows how pixels are clustered according to their component sound signals. The same color is assigned to pixels that generate very similar sounds.
The capability to incorporate sound into vision will have a large impact on a range of applications involving the recognition and manipulation of video. PixelPlayer’s ability to separate and locate sounds sources will allow more isolated processing of the sound coming from each object and will aid auditory recognition. Our system could also facilitate sound editing in videos, enabling, for instance, volume adjustments for specific objects or removal of the audio from particular sources.
Concurrent to this work, there are papers [11,29] at the same conference that also show the power of combining vision and audio to decompose sounds into components. [11] shows how person appearance could help solving the cocktail party problem in speech domain. [29] demonstrates an audio-visual system that separates on-screen sound vs. background sounds not visible in the video.
This paper is presented as follows. In Section 2, we first review related work in both the vision and sound communities. In Section 3, we present our system that leverages cross-modal context as a supervisory signal. In Section 4, we describe a new dataset for visual-audio grounding. In Section 5, we present several experiments to analyze our model. Subjective evaluations are presented in Section 6.
2 Related Work
Our work relates mainly to the fields of sound source separation, visual-audio cross-modal learning, and self-supervised learning, which will be briefly discussed in this section.
Sound source separation. Sound source separation, also known as the "cocktail party problem" [25,14], is a classic problem in engineering and perception. Classical approaches include signal processing methods such as Nonnegative Matrix Factorization (NMF) [42,8,40]. More recently, deep learning methods have gained popularity [45,7]. Sound source separation methods enable applications ranging from music/vocal separation [39], to speech separation and enhancement [16,12,27]. Our problem differs from classic sound source separation problems because we want to separate sounds into visually and spatially grounded components.
Learning visual-audio correspondence. Recent work in computer vision has explored the relationship between vision and sound. One line of work has developed models for generating sound from silent videos [30,51]. The correspondence between vision and sound has also been leveraged for learning representations. For example, [31] used audio to supervise visual representations, [3,18] used vision to supervise audio representations, and [1] used sound and vision to jointly supervise each other. In work related to our paper, people have studied how to localize sounds in vision according to motion [19] or semantic cues [2,37]; however, they do not separate multiple sounds from a mixed signal.
Self-supervised learning. Our work builds off efforts to learn perceptual models that are "self-supervised" by leveraging natural contextual signals in images [10,22,33,38,24], videos [46,32,43,44,13,20], and even radio signals [48]. These approaches utilize the power of supervised learning while not requiring manual annotations, instead deriving supervisory signals from the structure in natural data. Our model is similarly self-supervised, but uses self supervision to learn to separate and ground sound in vision.
3 Audio-Visual Source Separation and Localization
In this section, we introduce the model architectures of PixelPlayer, and the proposed Mix-and-Separate training framework that learns to separate sound according to vision.
3.1 Model architectures
Our model is composed of a video analysis network, an audio analysis network, and an audio synthesizer network, as shown in Fig. 2.
Video analysis network. The video analysis network extracts visual features from video frames. Its backbone can be an arbitrary architecture used for visual classification tasks. Here we use a dilated variant of the ResNet-18 model [15], which will be described in detail in the experiment section. For an input video of size T×H×W×3, the ResNet model extracts per-frame features of size T×(H/16)×(W/16)×K. After temporal pooling and sigmoid activation, we obtain a visual feature i_k(x, y) of size K for each pixel.
Audio analysis network. The audio analysis network takes the form of a U-Net [35] architecture, which splits the input sound into K components s_k, k = 1, ..., K. We empirically found that working with audio spectrograms gives better performance than using raw waveforms, so the network described in this paper uses the Time-Frequency (T-F) representation of sound. First, a Short-Time Fourier Transform (STFT) is applied to the input mixture sound to obtain its spectrogram. Then the magnitude of the spectrogram is transformed into a log-frequency scale (analyzed in Sec. 5) and fed into the U-Net, which yields K feature maps containing features of different components of the input sound.
Audio synthesizer network. The synthesizer network predicts the separated sound by taking the pixel-level visual feature i_k(x, y) and the audio features s_k. The output sound spectrogram is generated by a vision-based spectrogram masking technique. Specifically, a mask M(x, y) that separates the sound of the pixel from the input is estimated and multiplied with the input spectrogram. Finally, to get the waveform of the prediction, we combine the predicted spectrogram magnitude with the phase of the input spectrogram and use an inverse STFT for recovery.
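As a concrete illustration of this masking-and-recovery step, here is a minimal sketch assuming NumPy and librosa; `mask` and `mix_spec` (the complex STFT of the input mixture) are illustrative names, not identifiers from the released code.

```python
# Minimal sketch of vision-based spectrogram masking and waveform recovery.
import numpy as np
import librosa

def recover_waveform(mask, mix_spec, hop_length=256):
    pred_mag = mask * np.abs(mix_spec)               # masked magnitude spectrogram
    mix_phase = np.angle(mix_spec)                   # reuse the phase of the mixture
    pred_spec = pred_mag * np.exp(1j * mix_phase)    # recombine magnitude and phase
    return librosa.istft(pred_spec, hop_length=hop_length)  # inverse STFT -> waveform
```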
3.2 Mix-and-Separate framework for Self-supervised Training
The idea of the Mix-and-Separate training procedure is to artificially create a complex auditory scene and then solve the auditory scene analysis problem of separating and grounding sounds. Leveraging the fact that audio signals are approximately additive, we mix sounds from different videos to generate a complex audio input signal. The learning objective of the model is to separate a sound source of interest conditioned on the visual input associated with it.
Concretely, to generate a complex audio input, we randomly sample N videos {I_n, S_n} from the training dataset, where n = 1, ..., N. I_n and S_n represent the visual frames and audio of the n-th video, respectively. The input sound mixture is created through a linear combination of the audio inputs, S_mix = Σ_{n=1}^{N} S_n. The model f learns to estimate the sound of each video given the audio mixture and the visual frames of the corresponding video: Ŝ_n = f(S_mix, I_n).
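A tiny sketch of this sampling-and-mixing step is shown below, assuming each dataset item is a (frames, waveform) pair; the function and variable names are hypothetical, not from the released code.

```python
# Sketch of the Mix-and-Separate input construction: sample N clips and add
# their (approximately additive) waveforms to form S_mix.
import random

def sample_and_mix(dataset, N=2):
    clips = random.sample(dataset, N)              # each clip: (frames, waveform)
    s_mix = sum(wave for _, wave in clips)         # S_mix = sum_n S_n
    return s_mix, clips                            # train f so that S_n ~= f(S_mix, I_n)
```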
Fig. 3 shows the training framework in the case of N = 2. The training phase differs from the testing phase in that 1) we sample multiple videos randomly from the training set, mix the sampled audios, and aim to recover each of them given its corresponding visual input; 2) video-level visual features are obtained by spatial-temporal max pooling instead of pixel-level features. Note that although we have clear targets to learn in the training process, it is still unsupervised, as we do not use the data labels and do not make assumptions about the sampled data.
The learning targets in our system are the spectrogram masks, which can be binary or ratio masks. In the case of binary masks, the value of the ground truth mask of the n-th video is calculated by observing whether the target sound is the dominant component of the mixed sound in each T-F unit: M_n(u, v) = 1 if S_n(u, v) ≥ S_m(u, v) for all m = 1, ..., N, and 0 otherwise,
where (u, v) represents the coordinates in the T-F representation and S represents the spectrogram. Per-pixel sigmoid cross entropy loss is used for learning. For ratio masks, the ground truth mask of a video is calculated as the ratio of the magnitudes of the target sound and the mixed sound, M_n(u, v) = S_n(u, v) / S_mix(u, v).
In this case, a per-pixel L1 loss [47] is used for training. Note that the values of the ground truth mask do not necessarily stay within [0, 1] because of interference.
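As a sketch of how these targets and losses could be computed in PyTorch (shapes and names are assumptions for illustration, not the released training code):

```python
# Ground-truth masks and losses for Mix-and-Separate, as described above.
# spec_sources: list of source magnitude spectrograms, each (B, 1, 256, 256);
# spec_mix: magnitude spectrogram of the mixture; pred_mask*: synthesizer output.
import torch
import torch.nn.functional as F

def binary_mask_loss(pred_mask_logits, spec_sources, n):
    # Ground truth is 1 where the n-th source dominates every source in a T-F unit.
    gt = torch.ones_like(spec_sources[n])
    for spec_m in spec_sources:
        gt = gt * (spec_sources[n] >= spec_m).float()
    return F.binary_cross_entropy_with_logits(pred_mask_logits, gt)

def ratio_mask_loss(pred_mask, spec_n, spec_mix, eps=1e-7):
    gt = spec_n / (spec_mix + eps)                 # may exceed 1 because of interference
    return F.l1_loss(pred_mask, gt)
```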
4 MUSIC Dataset
The most commonly used videos with audio-visual correspondence are musical recordings, so we introduce a musical instrument video dataset for the proposed task, called MUSIC (Multimodal Sources of Instrument Combinations) dataset.
We retrieved the MUSIC videos from YouTube by keyword query. During the search, we added keywords such as "cover" to find more videos that were not post-processed or edited.
The MUSIC dataset has 685 untrimmed videos of musical solos and duets; some sample videos are shown in Fig. 4. The dataset spans 11 instrument categories: accordion, acoustic guitar, cello, clarinet, erhu, flute, saxophone, trumpet, tuba, violin, and xylophone. Fig. 5 shows the dataset statistics.
Statistics reveal that due to the natural distribution of videos, duet performances are less balanced than the solo performances. For example, there are almost no videos of tuba and violin duets, while there are many videos of guitar and violin duets.
5 Experiments
5.1 Audio data processing
There are several steps we take before feeding the audio data into our model. To speed up computation, we sub-sample the audio signals to 11kHz, such that the highest signal frequency preserved is 5.5kHz. This preserves the most perceptually important frequencies of instruments and only slightly degrades the overall audio quality. Each audio sample is approximately 6 seconds long, randomly cropped from the untrimmed videos during training. An STFT with a window size of 1022 and a hop length of 256 is computed on the audio samples, resulting in a 512 × 256 Time-Frequency (T-F) representation of the sound. We further re-sample this signal on a log-frequency scale to obtain a 256 × 256 T-F representation. This step is similar to the common practice of using a Mel-frequency scale, e.g. in speech recognition [23]. The log-frequency scale has the dual advantages of (1) similarity to the frequency decomposition of the human auditory system (frequency discrimination is better in absolute terms at low frequencies) and (2) translation invariance for harmonic sounds such as musical instruments (whose fundamental frequency and higher-order harmonics translate on the log-frequency scale as the pitch changes), which fits well with a ConvNet framework. The log magnitude values of the T-F units are used as the input to the audio analysis network. After obtaining the output mask from our model, we use an inverse sampling step to convert the mask back to the linear frequency scale with size 512 × 256, which can be applied to the input spectrogram. We finally perform an inverse STFT to obtain the recovered signal.
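The pipeline above can be sketched as follows, assuming librosa and NumPy; the log-frequency grid construction and function names are illustrative assumptions rather than the released preprocessing code.

```python
# Sketch of the audio preprocessing: resample to 11 kHz, STFT (window 1022,
# hop 256) -> 512 frequency bins, then re-sample the frequency axis on a log
# scale to 256 bins and take log magnitudes (a ~6 s crop gives ~256 frames).
import numpy as np
import librosa

def preprocess_audio(wav, sr, target_sr=11000, n_fft=1022, hop=256, log_bins=256):
    wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    spec = librosa.stft(wav, n_fft=n_fft, hop_length=hop)      # complex, 512 x T
    mag, phase = np.abs(spec), np.angle(spec)
    lin_grid = np.linspace(0.0, 1.0, mag.shape[0])             # 512 linear-frequency bins
    log_grid = (np.logspace(0.0, 1.0, log_bins) - 1.0) / 9.0   # 256 log-spaced points in [0, 1]
    log_mag = np.stack([np.interp(log_grid, lin_grid, mag[:, t])
                        for t in range(mag.shape[1])], axis=1)
    return np.log(log_mag + 1e-7), phase                       # log magnitudes feed the audio net
```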
5.2 Model configurations
In all the experiments, we use a variant of the ResNet-18 model for the video analysis network, with the following modifications: (1) removing the last average pooling layer and fc layer; (2) removing the stride of the last residual block and making the convolution layers in this block have a dilation of 2; (3) adding a final 3 × 3 convolution layer with K output channels. For each video sample, it takes T frames of size 224 × 224 × 3 as input and outputs a feature of size K after spatiotemporal max pooling.
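A sketch of these modifications in PyTorch/torchvision is given below; it assumes torchvision's `replace_stride_with_dilation` option implements modification (2), and the class and attribute names are illustrative, not the released model code.

```python
# Sketch of the modified ResNet-18 frame encoder: drop avgpool/fc, dilate the
# last residual block instead of striding, and add a final 3x3 conv with K channels.
import torch
import torch.nn as nn
import torchvision

class VideoAnalysisNet(nn.Module):
    def __init__(self, K=16):
        super().__init__()
        base = torchvision.models.resnet18(
            weights="IMAGENET1K_V1",                            # older torchvision: pretrained=True
            replace_stride_with_dilation=[False, False, True])  # dilation 2 in the last block
        self.features = nn.Sequential(*list(base.children())[:-2])  # remove avgpool + fc
        self.conv_k = nn.Conv2d(512, K, kernel_size=3, padding=1)   # final 3x3 conv, K channels

    def forward(self, frames):                     # frames: (B, T, 3, 224, 224)
        B, T = frames.shape[:2]
        feats = self.conv_k(self.features(frames.flatten(0, 1)))    # (B*T, K, 14, 14) = H/16 x W/16
        feats = feats.view(B, T, *feats.shape[1:])
        return torch.sigmoid(feats.max(dim=1).values)  # temporal max pool + sigmoid -> (B, K, 14, 14)
```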
The audio analysis network is modified from U-Net. It has 7 convolutions (or down-convolutions) and 7 de-convolutions (or up-convolutions) with skip connections in between. It takes an audio spectrogram of size 256 × 256 × 1 and outputs K feature maps of size 256 × 256 × K.
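A compact sketch of such a 7-down / 7-up U-Net in PyTorch is shown below; the channel widths, normalization, and activations are assumptions for illustration, and only the 256 × 256 × 1 input and 256 × 256 × K output follow the text.

```python
# Sketch of a 7-down / 7-up U-Net with skip connections for audio analysis.
import torch
import torch.nn as nn

class AudioUNet(nn.Module):
    def __init__(self, K=16, ch=(32, 64, 128, 256, 512, 512, 512)):
        super().__init__()
        self.downs, self.ups = nn.ModuleList(), nn.ModuleList()
        c_in = 1
        for c in ch:                                   # 7 down-convolutions, stride 2
            self.downs.append(nn.Sequential(
                nn.Conv2d(c_in, c, 4, stride=2, padding=1),
                nn.BatchNorm2d(c), nn.LeakyReLU(0.2, inplace=True)))
            c_in = c
        for i, c in enumerate(reversed(ch)):           # 7 up-convolutions with skip connections
            last = (i + 1 == len(ch))
            c_out = K if last else ch[::-1][i + 1]
            in_ch = c if i == 0 else c * 2             # concatenated skip doubles the channels
            layers = [nn.ConvTranspose2d(in_ch, c_out, 4, stride=2, padding=1)]
            if not last:
                layers += [nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
            self.ups.append(nn.Sequential(*layers))

    def forward(self, spec):                           # spec: (B, 1, 256, 256) log spectrogram
        skips, x = [], spec
        for down in self.downs:
            x = down(x)
            skips.append(x)
        for i, up in enumerate(self.ups):
            x = up(x if i == 0 else torch.cat([x, skips[-1 - i]], dim=1))
        return x                                       # (B, K, 256, 256) audio feature maps
```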
The audio synthesizer takes the outputs from the video and audio analysis networks, fuses them with a weighted summation, and outputs a mask that will be applied to the spectrogram. The audio synthesizer is a linear layer which has very few trainable parameters (K weights + 1 bias). It could be designed to have more complex computations, but we choose this simple operation in this work to show interpretable intermediate representations, which will be shown in Sec. 5.6.
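A minimal sketch of such a linear synthesizer, assuming PyTorch and the feature shapes described above (the variable names are illustrative):

```python
# Sketch of the linear audio synthesizer: channel-wise fusion of the K visual
# channels with the K audio channels, followed by a K-weight + 1-bias linear layer.
import torch
import torch.nn as nn

class AudioSynthesizer(nn.Module):
    def __init__(self, K=16):
        super().__init__()
        self.fc = nn.Linear(K, 1)                     # K weights + 1 bias

    def forward(self, visual_feat, audio_feats):
        # visual_feat: (B, K) feature of one pixel (or the pooled video feature)
        # audio_feats: (B, K, 256, 256) feature maps from the audio U-Net
        fused = visual_feat[:, :, None, None] * audio_feats           # channel-wise product
        mask_logits = self.fc(fused.permute(0, 2, 3, 1)).squeeze(-1)  # weighted sum over K
        return mask_logits                            # (B, 256, 256); apply sigmoid for the mask
```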
Our best model takes 3 frames as visual input and uses K = 16 feature channels.
5.3 Implementation details
Our goal in the model training is to learn on natural videos (with both solos and duets), evaluate quantitatively on the validation set, and finally solve the source separation and localization problem on the natural videos with mixtures. Therefore, we split our MUSIC dataset into 500 videos for training, 130 videos for validation, and 84 videos for testing. Among them, 500 training videos contain both solos and duets, the validation set only contains solos, and the test set only contains duets.
During training, we randomly sample N = 2 videos from our MUSIC dataset, which can be solos, duets, or silent background. Silent videos are made by randomly pairing silent audio waveforms with images from the ADE dataset [50], which contains images of natural environments. Introducing more silent videos in this way regularizes the model and improves its localization of sounding objects. To recap, the input audio mixture can contain 0 to 4 instruments. We also experimented with combining more sounds, but that made the task more challenging and the model did not learn better.
In the optimization process, we use an SGD optimizer with momentum 0.9. We set the learning rate of the audio analysis network and the audio synthesizer both to 0.001, and the learning rate of the video analysis network to 0.0001, since we adopt a CNN model pre-trained on ImageNet.
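In PyTorch, this per-network learning-rate setup could look like the sketch below; `audio_net`, `synthesizer`, and `video_net` are assumed to be the three sub-networks described above.

```python
# Sketch of SGD with momentum 0.9 and per-network learning rates.
import torch

optimizer = torch.optim.SGD(
    [
        {"params": audio_net.parameters()},               # uses the base lr (0.001)
        {"params": synthesizer.parameters()},             # uses the base lr (0.001)
        {"params": video_net.parameters(), "lr": 1e-4},   # lower lr: pre-trained on ImageNet
    ],
    lr=1e-3, momentum=0.9)
```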
5.4 Sound Separation Performance
To evaluate the performance of our model, we also use the Mix-and-Separate process to build a validation set of synthetic audio mixtures, on which separation quality is evaluated.
Fig. 6 shows qualitative results of our best model, which predicts binary masks that are applied to the mixture spectrogram. The first row shows one frame from each of the sampled videos that we mix together; the second row shows the spectrogram (in log frequency scale) of the audio mixture, which is the actual input to the audio analysis network. The third and fourth rows show the ground truth masks and the predicted masks, which are the targets and outputs of our model. The fifth and sixth rows show the ground truth spectrograms and the predicted spectrograms after applying the masks to the input spectrogram. We observe that even with the complex patterns in the mixed spectrogram, our model can "segment" the target instrument components out successfully.
To quantify the performance of the proposed model, we use the following metrics on the validation set of our synthetic videos: the Normalized Signal-to-Distortion Ratio (NSDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR). The NSDR is defined as the difference between the SDR of the separated signal with respect to the ground truth signal and the SDR of the mixture signal with respect to the ground truth signal. This represents the improvement of using the separated signal over using the mixture as each separated source. The results reported in this paper were obtained using the open-source mir_eval [34] library.
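A sketch of how these metrics could be computed with mir_eval is given below; the NSDR convention follows the definition above, and the function and variable names are illustrative.

```python
# Sketch of NSDR/SIR/SAR computation with mir_eval.
# reference_sources, estimated_sources: (n_sources, n_samples); mixture: (n_samples,)
import numpy as np
import mir_eval

def nsdr_sir_sar(reference_sources, estimated_sources, mixture):
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
        reference_sources, estimated_sources)
    # SDR obtained by using the mixture itself as the estimate of every source.
    sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(
        reference_sources, np.tile(mixture, (reference_sources.shape[0], 1)))
    return sdr - sdr_mix, sir, sar                 # NSDR = SDR(separated) - SDR(mixture)
```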
Results are shown in Table 1. Among all the models, the baseline approaches NMF [42] and DeepConvSep [7] use audio and ground-truth labels to perform source separation. All variants of our model use the same architecture we described and take both visual and sound input for learning. Spectral Regression refers to the model that directly regresses output spectrogram values given an input mixture spectrogram, instead of outputting spectrogram mask values. From the numbers in the table, we can conclude that (1) masking-based approaches are generally better than direct regression; (2) working in the log frequency scale performs better than in the linear frequency scale; (3) the binary-masking-based method achieves similar performance to ratio masking.
Meanwhile, we found that the NSDR/SIR/SAR metrics are not the best metrics for evaluating perceptual separation quality, so in Sec. 6 we further conduct user studies on audio separation quality.
5.5 Visual Grounding of Sounds
As the title of the paper indicates, we are fundamentally solving two problems: localization and separation of sounds.
Sound localization. The first problem is related to the spatial grounding question, "which pixels are making sounds?" This is answered in Fig. 7: for natural videos in the dataset, we calculate the sound energy (or volume) of each pixel in the image and plot their distributions as heatmaps. As can be seen, the model accurately localizes the sounding instruments.
Clustering of sounds. The second problem is related to a further question: "what sounds do these pixels make?" In order to answer this, we visualize the sound each pixel makes in the following way: for each pixel in a video frame, we take the feature of its sound, namely the vectorized log spectrogram magnitudes, and project it onto 3D RGB space using PCA for visualization purposes. Results are shown in Fig. 8: different instruments and the background in the same video frame have different color embeddings, indicating the different sounds that they make.
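This per-pixel embedding visualization can be sketched as follows, assuming scikit-learn for the PCA projection; the normalization and names are illustrative, not the released visualization code.

```python
# Sketch: project each pixel's vectorized log spectrogram onto 3 PCA components
# and map them to RGB colors.
import numpy as np
from sklearn.decomposition import PCA

def sound_embedding_to_rgb(pixel_specs):           # pixel_specs: (H*W, F*T) log magnitudes
    emb = PCA(n_components=3).fit_transform(pixel_specs)
    emb = (emb - emb.min(0)) / (emb.max(0) - emb.min(0) + 1e-7)  # scale each channel to [0, 1]
    return emb                                     # (H*W, 3) RGB values per pixel
```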
Discriminative channel activations. Given that our model can separate sounds of different instruments, we explore its channel activations for different categories. For validation samples of each category, we find the most strongly activated channel, and then sort them to generate a confusion matrix. Fig. 9 shows the (a) visual and (b) audio confusion matrices from our best model. If we simply evaluate classification by assigning one category to one channel, the accuracy is 46.2% for vision and 68.9% for audio. Note that no learning is involved here; we expect much higher performance by using a linear classifier. This experiment demonstrates that the model has implicitly learned to discriminate instruments visually and auditorily.
In a similar fashion, we evaluate the object localization performance of the video analysis network based on the channel activations. To generate a bounding box from the channel activation map, we follow [49] to threshold the map. We first segment the regions whose values are above 20% of the maximum value of the activation map, and then take the bounding box that covers the largest connected component in the segmentation map. Localization accuracy under different intersection-over-union (IoU) criteria is shown in Table 2.
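Following the description above, here is a sketch of this thresholding-and-bounding-box procedure, assuming SciPy for connected-component labeling; the function name is illustrative.

```python
# Sketch: threshold the activation map at 20% of its maximum and take the
# bounding box of the largest connected component.
import numpy as np
from scipy import ndimage

def activation_to_bbox(act_map, thresh_ratio=0.2):
    mask = act_map >= thresh_ratio * act_map.max()
    labels, n = ndimage.label(mask)                # connected-component labeling
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labels, index=list(range(1, n + 1)))
    largest = np.argmax(sizes) + 1
    ys, xs = np.where(labels == largest)
    return xs.min(), ys.min(), xs.max(), ys.max()  # (x1, y1, x2, y2)
```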
5.6 Visual-audio corresponding activations
As our proposed model is a form of self-supervised learning and is designed such that both the visual and audio networks learn to activate simultaneously on the same channel, we further explore the representations learned by the model. Specifically, we look at the K channel activations of the video analysis network before max pooling, and their corresponding channel activations in the audio analysis network. The model has learned to detect important features of specific objects across the individual channels. In Fig. 10 we show the top activated videos of channels 6, 11, and 14. These channels have emerged as violin, guitar, and xylophone detectors respectively, in both the visual and audio domains. Channel 6 responds strongly to the visual appearance of the violin and to the higher-order harmonics in violin sounds. Channel 11 responds to guitars and the low-frequency region in sounds. And channel 14 responds to the visual appearance of the xylophone and to the brief, pulse-like patterns in the spectrogram domain. Some of the other channels also detect specific instruments, while others detect specific features of instruments.
6 Subjective Evaluations
The objective and quantitative evaluations in Sec. 5.4 are mainly performed on the synthetic mixture videos; the performance on natural videos needs to be further investigated. On the other hand, the popular NSDR/SIR/SAR metrics used are not closely related to perceptual quality. Therefore, we conducted crowd-sourced subjective evaluations as a complementary evaluation. Two studies are conducted by human raters on Amazon Mechanical Turk (AMT): a sound separation quality evaluation and a visual-audio correspondence evaluation.
6.1 Sound separation quality
For the sound separation evaluation, we used a subset of the solos from the dataset as ground truth. We prepared the outputs of the baseline NMF model and the outputs of our models, including spectral regression, ratio masking and binary masking, all in log frequency scale. For each model, we take 256 audio outputs from the same set for evaluation and each audio is evaluated by 3 independent AMT workers. Audio samples are randomly presented to the workers, and the following question is asked: "Which sound do you hear? 1. A, 2. B, 3. Both, or 4. None of them". Here A and B are replaced by their mixture sources, e.g. A=clarinet, B=flute.
Subjective evaluation results are shown in Table 3. We show the percentages of workers who heard only the correct solo instrument (Correct), who heard only the incorrect solo instrument (Wrong), who heard both of the instruments (Both), and who heard neither of the instruments (None). First, we observe that although the NMF baseline did not have good NSDR numbers in the quantitative evaluation, it has competitive results in our human study. Second, among our models, the binary masking model outperforms all other models by a margin, showing its advantage in separation as a classification model. The binary masking model gives the highest correct rate, the lowest error rate, and the lowest confusion (percentage of Both), indicating that the binary model performs source separation perceptually better than the other models. It is worth noticing that even the ground truth solos do not give a 100% correct rate, which represents the upper bound of performance.
6.2 Visual-sound correspondence evaluations
The second study focuses on the evaluation of the visual-sound correspondence problem. For a pixel-sound pair, we ask the binary question: "Is the sound coming from this pixel?" For this task, we only evaluate our models for comparison, as the task requires visual input, so audio-only baselines are not applicable. We select 256 pixel positions (50% on instruments and 50% on background objects) to generate corresponding sounds with different models, and record the percentage of Yes responses from the workers, which indicates the percentage of pixels with good source separation and localization. Results are shown in Table 4. This evaluation also demonstrates that the binary masking-based model gives the best performance on the vision-related source separation problem.