Stanford University, CVPR 2017
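This work from Stanford University applies deep learning directly to 3D point cloud data and is the first neural network to process raw point clouds directly. The same group also proposed PointNet++, published at NIPS 2017. Because of the conference page limit, it is worth reading the full version with the appendix, which is almost as long as the main text.
Author's GitHub repository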
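Original | Paraphrase |
---|---|
Point cloud is an important type of geometric data structure. Due to its irregular format, most researchers transform such data to regular 3D voxel grids or collections of images. This, however, renders data unnecessarily voluminous and causes issues. | Point clouds are an important kind of geometric data structure. Because they are irregular, most methods first convert them into a regular form that is easier to process, such as voxel grids or multi-view images. These conversions make the data unnecessarily bulky and introduce other problems. |
In this paper, we design a novel type of neural network that directly consumes point clouds, which well respects the permutation invariance of points in the input. | This paper proposes a neural network that consumes point clouds directly and respects the permutation invariance of the input points. |
Our network, named PointNet, provides a unified architecture for applications ranging from object classification, part segmentation, to scene semantic parsing. Though simple, PointNet is highly efficient and effective. | The proposed PointNet offers a unified architecture for object classification, part segmentation, and scene semantic parsing. The network is simple, yet efficient and effective. |
Empirically, it shows strong performance on par or even better than state of the art. Theoretically, we provide analysis towards understanding of what the network has learnt and why the network is robust with respect to input perturbation and corruption. | It not only performs well empirically, on par with or better than the state of the art, but also comes with a theoretical analysis of what the network learns and why it is robust to input perturbation and corruption. |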
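The PointNet architecture, shown in the figure above, has two parts: classification and segmentation. The two tasks share part of the structure, and the segmentation branch is somewhat more complex.
The network has three key components: the max pooling layer, a local and global information combination structure, and two joint alignment networks.
Max pooling is used to handle the fact that the input points are unordered.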
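There are currently three ways to deal with the unordered nature of point clouds: (1) sort the input into a canonical order; (2) treat the input as a sequence and train an RNN, augmenting the training data with permutations; (3) aggregate the information from each point with a simple symmetric function.
The paper takes the third approach and uses max pooling to approximate such a symmetric function: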
$$f(\{x_1, \dots, x_n\}) \approx g(h(x_1), \dots, h(x_n))$$
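Here $h$ is an MLP and $g$ is max pooling. The authors use max pooling to approximate a symmetric function and give a proof, which I have not understood yet; I will come back to this once I have.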
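To make the role of the symmetric function concrete, here is a minimal NumPy sketch of $g(h(x_1), \dots, h(x_n))$. It is not the authors' code: the two-layer `shared_mlp`, the layer widths (16 and 32), and the random weights are assumptions chosen only for illustration. The point is that shuffling the rows of the input leaves the max-pooled feature unchanged.

```python
import numpy as np


def shared_mlp(points, w1, b1, w2, b2):
    """Apply the same two-layer MLP h to every point (one point per row)."""
    hidden = np.maximum(points @ w1 + b1, 0.0)  # ReLU
    return np.maximum(hidden @ w2 + b2, 0.0)


def global_feature(points, params):
    """g(h(x_1), ..., h(x_n)) with g = element-wise max over the points."""
    return shared_mlp(points, *params).max(axis=0)


rng = np.random.default_rng(0)
# Hypothetical random weights; a real network would learn these.
params = (rng.normal(size=(3, 16)), rng.normal(size=16),
          rng.normal(size=(16, 32)), rng.normal(size=32))

points = rng.normal(size=(100, 3))                # 100 points with xyz
shuffled = points[rng.permutation(len(points))]   # same set, different order

feat = global_feature(points, params)
feat_shuffled = global_feature(shuffled, params)
print(feat.shape)                        # (32,)
print(np.allclose(feat, feat_shuffled))  # True
```

Because `h` is applied to each point independently and the maximum does not depend on row order, any permutation of the input produces the same feature vector.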
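The output of $f$ from the previous section is a vector $[f_1, \dots, f_K]$ that serves as a global feature of the point cloud. For classification, an SVM or an MLP can be trained on it directly. Segmentation, however, also needs local features, so the authors concatenate the global feature with the original per-point features to obtain a feature that carries both global and local information.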
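The concatenation step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `concat_local_global` is made up, and the sizes (64 for the per-point feature, 1024 for the global feature) are borrowed from the layer widths of the classification network shown later in these notes.

```python
import numpy as np


def concat_local_global(point_features, global_feature):
    """Append the same global feature vector to every per-point feature row."""
    num_points = point_features.shape[0]
    tiled = np.tile(global_feature, (num_points, 1))         # (N, K)
    return np.concatenate([point_features, tiled], axis=1)   # (N, C + K)


rng = np.random.default_rng(0)
point_features = rng.normal(size=(5, 64))   # local features, one row per point
global_feature = rng.normal(size=1024)      # max-pooled global feature

combined = concat_local_global(point_features, global_feature)
print(combined.shape)  # (5, 1088)
```

Each row of `combined` now starts with that point's own feature and ends with the shared global feature, which is what lets a per-point classifier use both kinds of information.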
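A point cloud that undergoes a rigid transformation is still a point cloud and still represents the same thing, so the features extracted by the network should be invariant to such transformations.
Borrowing the idea of the spatial transformer network (STN) from 2D images, the paper applies an STN both to the input point cloud and to the extracted features, giving an input STN and a feature STN. An STN is a small sub-network that can be plugged in anywhere in the network.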
For the input STN, the input is the raw point cloud and the output is a 3x3 transformation matrix that is applied to it.
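As a rough illustration of how such a matrix is applied, the sketch below multiplies a batch of points by a fixed rotation about the z axis. The rotation is a stand-in chosen for this example; in the network the matrix is predicted by the input STN. `apply_transform` is a hypothetical helper name, and the row-vector convention (points on the left) is assumed.

```python
import numpy as np


def apply_transform(points, transform):
    """points: (B, N, 3), transform: (B, 3, 3) -> transformed points (B, N, 3)."""
    return np.matmul(points, transform)


theta = np.pi / 2
# Rotation by 90 degrees about z, written for row vectors (p @ R).
rot_z = np.array([[np.cos(theta), np.sin(theta), 0.0],
                  [-np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
points = rng.normal(size=(2, 4, 3))
points[:, 0] = [1.0, 0.0, 0.0]            # a known point to check the result
transform = np.tile(rot_z, (2, 1, 1))     # one 3x3 matrix per batch element

rotated = apply_transform(points, transform)
print(rotated.shape)  # (2, 4, 3)
```

The point (1, 0, 0) ends up at (0, 1, 0) up to floating-point error, and the distance of every point from the origin is preserved, as expected for a rigid rotation.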
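For the feature STN, the input is the per-point feature and the output is a 64x64 transformation matrix that is applied to it. Because this matrix has far more entries than the 3x3 one, the feature STN needs a regularization term; experiments show that it helps the network converge.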
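The regularizer described in the paper pushes the feature transform $A$ toward an orthogonal matrix through the penalty $\|I - AA^T\|_F^2$. Below is a minimal NumPy sketch of that penalty. The function name `orthogonality_loss` and the averaging over the batch are assumptions made for this example, and the repository's TensorFlow version may scale the term differently.

```python
import numpy as np


def orthogonality_loss(transform):
    """Batch mean of ||I - A A^T||_F^2 for A of shape (B, K, K)."""
    k = transform.shape[-1]
    gram = np.matmul(transform, np.transpose(transform, (0, 2, 1)))
    diff = np.eye(k) - gram
    return float(np.mean(np.sum(diff ** 2, axis=(1, 2))))


identity = np.eye(64)[None]   # shape (1, 64, 64), already orthogonal
scaled = 2.0 * identity       # not orthogonal: A A^T = 4 I

print(orthogonality_loss(identity))  # 0.0
print(orthogonality_loss(scaled))    # 576.0
```

An orthogonal matrix incurs no penalty, while scaling it by 2 gives $I - 4I = -3I$, whose squared Frobenius norm is $9 \times 64 = 576$.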
The authors' GitHub repository contains the model and the training code. Here we look at two files, pointnet_cls.py and transform_nets.py.
input_transform:
input→conv(3,64)→conv(64,128)→conv(128,1024)→maxpool(N,1)→fc(1024,512)→fc(512,256)→fc(256,9)
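The last fc layer in the authors' code looks odd at first. Its goal is to produce a 3x3 transformation matrix, so it is a fully connected layer with 9 outputs. It is written by hand rather than through tf_util.fully_connected so that the weights can start at zero and the bias at the flattened identity matrix, which makes the initial transform the identity.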
```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API
import tf_util  # helper module from the authors' repository


def input_transform_net(point_cloud, is_training, bn_decay=None, K=3):
    """ Input (XYZ) Transform Net, input is BxNx3 gray image
        Return:
            Transformation matrix of size 3xK """
    batch_size = point_cloud.get_shape()[0].value
    num_point = point_cloud.get_shape()[1].value

    # Treat the point cloud as a BxNx3x1 "image"; the [1,3] kernel spans xyz.
    input_image = tf.expand_dims(point_cloud, -1)
    net = tf_util.conv2d(input_image, 64, [1,3],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='tconv1', bn_decay=bn_decay)
    net = tf_util.conv2d(net, 128, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='tconv2', bn_decay=bn_decay)
    net = tf_util.conv2d(net, 1024, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='tconv3', bn_decay=bn_decay)
    # Max pool over all N points to get one 1024-d vector per cloud.
    net = tf_util.max_pool2d(net, [num_point,1],
                             padding='VALID', scope='tmaxpool')

    net = tf.reshape(net, [batch_size, -1])
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training,
                                  scope='tfc1', bn_decay=bn_decay)
    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training,
                                  scope='tfc2', bn_decay=bn_decay)

    # Hand-written final layer: zero weights plus an identity bias,
    # so the predicted transform starts out as the identity matrix.
    with tf.variable_scope('transform_XYZ') as sc:
        assert(K==3)
        weights = tf.get_variable('weights', [256, 3*K],
                                  initializer=tf.constant_initializer(0.0),
                                  dtype=tf.float32)
        biases = tf.get_variable('biases', [3*K],
                                 initializer=tf.constant_initializer(0.0),
                                 dtype=tf.float32)
        biases += tf.constant([1,0,0,0,1,0,0,0,1], dtype=tf.float32)
        transform = tf.matmul(net, weights)
        transform = tf.nn.bias_add(transform, biases)

    transform = tf.reshape(transform, [batch_size, 3, K])
    return transform
```
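The effect of the zero weights and the identity bias in that final layer can be checked with a small NumPy sketch. It mirrors only the last matmul, bias add, and reshape; `final_layer` is a hypothetical stand-in, and the random `features` play the role of the 256-dimensional output of the preceding fc layer.

```python
import numpy as np


def final_layer(features, weights, biases, k=3):
    """Hand-written last layer: features (B, 256) -> transforms (B, 3, k)."""
    out = features @ weights + biases
    return out.reshape(-1, 3, k)


rng = np.random.default_rng(0)
features = rng.normal(size=(4, 256))          # arbitrary upstream activations

weights = np.zeros((256, 9))                  # zero-initialised weights
biases = np.zeros(9) + np.eye(3).flatten()    # zero bias plus flattened identity

transforms = final_layer(features, weights, biases)
print(transforms.shape)                       # (4, 3, 3)
print(np.allclose(transforms, np.eye(3)))     # True
```

Whatever the upstream features are, the layer outputs the identity at initialization, so the point cloud passes through the STN unchanged until training moves the weights away from zero.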
feature_transform:
feature→conv(64,64)→conv(64,128)→conv(128,1024)→maxpool(N,1)→fc(1024,512)→fc(512,256)→fc(256,64)
```python
def feature_transform_net(inputs, is_training, bn_decay=None, K=64):
    """ Feature Transform Net, input is BxNx1xK
        Return:
            Transformation matrix of size KxK """
    batch_size = inputs.get_shape()[0].value
    num_point = inputs.get_shape()[1].value

    net = tf_util.conv2d(inputs, 64, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='tconv1', bn_decay=bn_decay)
    net = tf_util.conv2d(net, 128, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='tconv2', bn_decay=bn_decay)
    net = tf_util.conv2d(net, 1024, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='tconv3', bn_decay=bn_decay)
    # Max pool over all N points to get one 1024-d vector per cloud.
    net = tf_util.max_pool2d(net, [num_point,1],
                             padding='VALID', scope='tmaxpool')

    net = tf.reshape(net, [batch_size, -1])
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training,
                                  scope='tfc1', bn_decay=bn_decay)
    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training,
                                  scope='tfc2', bn_decay=bn_decay)

    # Same trick as in the input transform: zero weights plus an identity
    # bias, so the KxK transform starts out as the identity matrix.
    with tf.variable_scope('transform_feat') as sc:
        weights = tf.get_variable('weights', [256, K*K],
                                  initializer=tf.constant_initializer(0.0),
                                  dtype=tf.float32)
        biases = tf.get_variable('biases', [K*K],
                                 initializer=tf.constant_initializer(0.0),
                                 dtype=tf.float32)
        biases += tf.constant(np.eye(K).flatten(), dtype=tf.float32)
        transform = tf.matmul(net, weights)
        transform = tf.nn.bias_add(transform, biases)

    transform = tf.reshape(transform, [batch_size, K, K])
    return transform
```
input→input_stn(3,3)→conv(3,64)→conv(64,64)→feature_stn(64,64)→conv(64,64)→conv(64,128)→conv(128,1024)→maxpool(N,1)→fc(1024,512)→fc(512,256)→fc(256,num_classes)
```python
import tensorflow as tf  # TensorFlow 1.x API
import tf_util  # helper module from the authors' repository
from transform_nets import input_transform_net, feature_transform_net


def get_model(point_cloud, is_training, bn_decay=None):
    """ Classification PointNet, input is BxNx3, output Bx40 """
    batch_size = point_cloud.get_shape()[0].value
    num_point = point_cloud.get_shape()[1].value
    end_points = {}

    # Input STN: predict a 3x3 matrix and apply it to the raw points.
    with tf.variable_scope('transform_net1') as sc:
        transform = input_transform_net(point_cloud, is_training, bn_decay, K=3)
    point_cloud_transformed = tf.matmul(point_cloud, transform)
    input_image = tf.expand_dims(point_cloud_transformed, -1)

    net = tf_util.conv2d(input_image, 64, [1,3],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='conv1', bn_decay=bn_decay)
    net = tf_util.conv2d(net, 64, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='conv2', bn_decay=bn_decay)

    # Feature STN: predict a 64x64 matrix and apply it to the point features.
    # The matrix is kept in end_points so a regularizer can use it.
    with tf.variable_scope('transform_net2') as sc:
        transform = feature_transform_net(net, is_training, bn_decay, K=64)
    end_points['transform'] = transform
    net_transformed = tf.matmul(tf.squeeze(net, axis=[2]), transform)
    net_transformed = tf.expand_dims(net_transformed, [2])

    net = tf_util.conv2d(net_transformed, 64, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='conv3', bn_decay=bn_decay)
    net = tf_util.conv2d(net, 128, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='conv4', bn_decay=bn_decay)
    net = tf_util.conv2d(net, 1024, [1,1],
                         padding='VALID', stride=[1,1],
                         bn=True, is_training=is_training,
                         scope='conv5', bn_decay=bn_decay)

    # Symmetric function: max pooling
    net = tf_util.max_pool2d(net, [num_point,1],
                             padding='VALID', scope='maxpool')

    # Classifier head on the 1024-d global feature.
    net = tf.reshape(net, [batch_size, -1])
    net = tf_util.fully_connected(net, 512, bn=True, is_training=is_training,
                                  scope='fc1', bn_decay=bn_decay)
    net = tf_util.dropout(net, keep_prob=0.7, is_training=is_training,
                          scope='dp1')
    net = tf_util.fully_connected(net, 256, bn=True, is_training=is_training,
                                  scope='fc2', bn_decay=bn_decay)
    net = tf_util.dropout(net, keep_prob=0.7, is_training=is_training,
                          scope='dp2')
    net = tf_util.fully_connected(net, 40, activation_fn=None, scope='fc3')

    return net, end_points
```
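To follow how the tensor shapes change through the classification network, the NumPy sketch below traces a simplified forward pass. It is not the authors' model: the two STNs, batch normalization, and dropout are left out, the weights are random, and `dense_relu` and `trace_shapes` are hypothetical helpers. Only the layer widths are taken from the code above.

```python
import numpy as np


def dense_relu(x, out_dim, rng):
    """Randomly initialised dense layer + ReLU, applied to the last axis."""
    w = rng.normal(scale=0.1, size=(x.shape[-1], out_dim))
    return np.maximum(x @ w, 0.0)


def trace_shapes(batch_size=2, num_point=128, num_classes=40, seed=0):
    """Return the shape after every stage of the simplified pipeline."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch_size, num_point, 3))
    shapes = [x.shape]
    # Shared per-point layers (conv1 .. conv5 in the TensorFlow code).
    for width in (64, 64, 64, 128, 1024):
        x = dense_relu(x, width, rng)
        shapes.append(x.shape)
    # Symmetric function: max over the point axis.
    x = x.max(axis=1)
    shapes.append(x.shape)
    # Classifier head (fc1, fc2, fc3).
    for width in (512, 256):
        x = dense_relu(x, width, rng)
        shapes.append(x.shape)
    logits = x @ rng.normal(scale=0.1, size=(x.shape[-1], num_classes))
    shapes.append(logits.shape)
    return shapes


shapes = trace_shapes()
for shape in shapes:
    print(shape)
```

The point axis (128 here) is present until the max pooling step and disappears afterwards, which is why the classifier head sees a fixed-size 1024-dimensional vector regardless of how many points the cloud contains.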