Date | Author | Version | Note |
---|---|---|---|
2023.08.16 | Dog Tao | V1.0 | Completed the first draft of the document (in English). |
2023.09.09 | Dog Tao | V1.1 | Revised the document; added notes on tensor concatenation and related operations. |
The tensor is a fundamental data type in PyTorch, and it is essential for deep learning computations.
Definition: A tensor in PyTorch is a multi-dimensional array, similar to NumPy’s ndarray. Tensors can be used on a GPU to accelerate computing.
Utility: Tensors are crucial for deep learning frameworks like PyTorch as they allow for efficient mathematical operations on GPUs. They’re used to store the input, output, and intermediate data as well as model parameters (like weights and biases of a neural network).
Types & Shapes: Tensors can have various data types such as float, integer, and boolean. They can exist in multiple shapes, representing scalar values (0-dimensional), vectors (1-dimensional), matrices (2-dimensional), or higher-dimensional structures.
Device Agnostic: One of the notable features of PyTorch tensors is their ability to be device agnostic. This means you can move tensors between CPU and GPU without much hassle, using the .to() method or the .cuda() and .cpu() methods.
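A minimal sketch of moving a tensor between devices (this assumes a CUDA GPU may or may not be present, so the target device is chosen at runtime):
import torch
x = torch.randn(2, 3)                                    # created on the CPU by default
device = "cuda" if torch.cuda.is_available() else "cpu"
x = x.to(device)                                         # move to the GPU if one is available
x = x.cpu()                                              # .cpu() always returns a CPU tensor; .cuda() is the GPU counterpart
print(x.device)                                          # cpu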
Creation: You can create tensors from Python lists, from NumPy arrays, or directly in PyTorch using functions like torch.tensor(), torch.zeros(), torch.ones(), torch.randn(), and many others.
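For example, a short sketch of these creation routines (torch.from_numpy is used here for the NumPy case):
import numpy as np
import torch
a = torch.tensor([[1, 2], [3, 4]])            # from a nested Python list
b = torch.from_numpy(np.array([1.0, 2.0]))    # from a NumPy array (shares memory with it)
c = torch.zeros(2, 3)                         # 2x3 tensor filled with zeros
d = torch.ones(4)                             # 1-D tensor filled with ones
e = torch.randn(3, 3)                         # 3x3 tensor of standard-normal samples
print(a.shape, b.dtype, c.shape, d.shape, e.shape)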
Operations: Tensors support a plethora of operations, including arithmetic operations, reshaping, indexing, and mathematical functions. PyTorch provides an automatic differentiation system, which makes it easy to compute gradients with respect to tensors (important for training neural networks).
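As a small illustration (a sketch, not tied to any particular model), arithmetic on a tensor created with requires_grad=True builds a graph that backward() can differentiate:
import torch
w = torch.tensor([2.0, 3.0], requires_grad=True)  # gradients will be tracked for w
x = torch.tensor([1.0, 4.0])
y = (w * x).sum()                                 # elementwise product, then sum
y.backward()                                      # compute dy/dw via autograd
print(w.grad)                                     # tensor([1., 4.])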
PyTorch’s tensor library provides the necessary tools for efficient computation needed in deep learning. The familiar syntax (especially if you come from a NumPy background) combined with its GPU acceleration capabilities makes it a go-to choice for many researchers and practitioners in the machine learning community.
In deep learning and PyTorch, the dimensions of a tensor often have specific meanings based on the context in which they are used. However, it’s essential to note that the exact meaning of each dimension can vary based on the data type, the neural network architecture, or the specific operation being performed.
Here are some common interpretations of tensor dimensions based on different contexts:
Standard Images (e.g., from torchvision datasets): [batch_size, channels, height, width]
- batch_size: Number of images in a mini-batch.
- channels: Number of color channels (e.g., 3 for RGB, 1 for grayscale).
- height: Height of the image in pixels.
- width: Width of the image in pixels.
Sequences (e.g., for RNNs, LSTMs): [seq_len, batch_size, feature_size] or [batch_size, seq_len, feature_size] (depends on the batch_first argument)
- seq_len: Length of the sequence.
- batch_size: Number of sequences in a mini-batch.
- feature_size: Number of features at each sequence step.
Time Series: [batch_size, sequence_length, num_features]
- batch_size: Number of time series in a mini-batch.
- sequence_length: Number of time steps in the time series.
- num_features: Number of features at each time step.
Embeddings: [num_words, embedding_dim]
- num_words: Number of words or unique tokens in the vocabulary.
- embedding_dim: Dimensionality of the embedding vector for each word.
FC Layers (Fully Connected Layers): [batch_size, num_features]
- batch_size: Number of samples in a mini-batch.
- num_features: Number of features for each sample.
3D Medical Images (e.g., MRI scans): [batch_size, channels, depth, height, width]
- batch_size: Number of scans in a mini-batch.
- channels: Number of channels (could be different modalities or types of scans).
- depth: Depth or number of slices in the 3D scan.
- height: Height of each slice.
- width: Width of each slice.
In practice, it’s crucial to consult the documentation or specific context in which you’re working to determine the precise meaning of each dimension.
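As a quick illustration of these conventions (the sizes below are arbitrary):
import torch
images = torch.randn(32, 3, 224, 224)       # [batch_size, channels, height, width]
sequences = torch.randn(16, 10, 8)          # [batch_size, seq_len, feature_size] (batch_first style)
scans = torch.randn(4, 1, 64, 128, 128)     # [batch_size, channels, depth, height, width]
print(images.shape, sequences.shape, scans.shape)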
In PyTorch, squeeze(), unsqueeze(), and view() are used to change the dimensions (or shape) of a tensor, but they do so in different ways.
The squeeze() method removes dimensions of size 1 from the shape of a tensor. Examples:
import torch
# Tensor with shape (1, 3, 1, 2)
x = torch.zeros(1, 3, 1, 2)
# Remove all dimensions of size 1
y = x.squeeze()
print(y.shape) # torch.Size([3, 2])
# Squeeze only the 0th dimension
z = x.squeeze(0)
print(z.shape) # torch.Size([3, 1, 2])
In PyTorch, when you use negative indices with functions like squeeze() and unsqueeze(), the counting of dimensions starts from the end (rightmost) of the tensor shape, similar to negative indexing in Python lists.
Example:
import torch
# Tensor with shape (3, 4, 1)
x = torch.zeros(3, 4, 1)
# Remove the last dimension, as it is of size 1
y = x.squeeze(-1)
print(y.shape) # torch.Size([3, 4])
The unsqueeze() method adds a dimension of size 1 at a specified position. Examples:
# Tensor with shape (3, 2)
x = torch.zeros(3, 2)
# Add a dimension at position 0
y = x.unsqueeze(0)
print(y.shape) # torch.Size([1, 3, 2])
# Add a dimension at position 2
z = x.unsqueeze(2)
print(z.shape) # torch.Size([3, 2, 1])
Example:
# Tensor with shape (3, 4)
x = torch.zeros(3, 4)
# Add a new last dimension
y = x.unsqueeze(-1)
print(y.shape) # torch.Size([3, 4, 1])
In deep learning, especially when dealing with models like CNNs or RNNs, the input tensor’s shape often needs to match the model’s expected shape. For instance, a CNN may expect a 4D tensor as input (batch size, channels, height, width), but sometimes you might have a single image of shape (channels, height, width). In this case, you’d use unsqueeze() to add a batch dimension of size 1 before passing the image to the model. Conversely, the output from the model might have a singleton batch dimension that you want to remove with squeeze() before further processing.
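A small sketch of that pattern (the Conv2d layer and sizes here are stand-ins chosen for illustration):
import torch
import torch.nn as nn
model = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # expects 4D input [N, C, H, W]
image = torch.randn(3, 64, 64)                     # a single image, [C, H, W]
batch = image.unsqueeze(0)                         # add a batch dim -> [1, 3, 64, 64]
out = model(batch)                                 # -> [1, 8, 64, 64]
single = out.squeeze(0)                            # drop the singleton batch dim -> [8, 64, 64]
print(batch.shape, out.shape, single.shape)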
In practice, unsqueeze(1) is commonly used when you want to insert a new dimension at position 1 (e.g., a channel dimension), turning, for instance, a 2D tensor of shape [batch_size, features] into a 3D tensor of shape [batch_size, 1, features]. This is handy in various deep learning scenarios, such as when prepping data to meet the shape expectations of certain 1D convolutional layers, as shown below.
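For example (sizes assumed for illustration), adding a channel dimension before a 1D convolution:
import torch
import torch.nn as nn
signals = torch.randn(32, 100)   # [batch_size, features]
x = signals.unsqueeze(1)         # -> [32, 1, 100], i.e. [batch, channels, length]
conv = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3, padding=1)
print(conv(x).shape)             # torch.Size([32, 4, 100])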
The tensor.view() method in PyTorch is used to reshape a tensor. It returns a new tensor with the specified shape. The new tensor will share the same underlying data with the original tensor, which means if you modify the original tensor, the reshaped tensor will also get modified and vice versa. This behavior ensures efficient memory usage.
Here’s a breakdown of how tensor.view() works:
Reshaping: You can provide the desired shape as arguments to the view() method to reshape the tensor.
Automatic Inference: You can specify one dimension as -1, and PyTorch will automatically compute the correct size for that dimension based on the other dimensions you’ve provided. This is particularly useful when you don’t know the size of a specific dimension in advance.
Requirements:
- Same number of elements: The total number of elements must remain the same. If the original tensor has shape [4, 5] (i.e., 20 elements), the reshaped tensor might have shapes like [10, 2], [20], [2, 10], etc., but not [3, 7] (because that would be 21 elements).
- Contiguous memory: The tensor must be contiguous in memory. If it is not (for example, after a transpose), you may need to call tensor.contiguous() before using view().
Examples:
import torch
# Create a tensor of shape [2, 3]
x = torch.tensor([[1, 2, 3], [4, 5, 6]])
# Reshape to [3, 2]
y = x.view(3, 2)
print(y)
# tensor([[1, 2],
# [3, 4],
# [5, 6]])
# Reshape to a 1D tensor with 6 elements
z = x.view(-1)
print(z)
# tensor([1, 2, 3, 4, 5, 6])
# Reshape to [6, 1]
w = x.view(6, -1)
print(w)
# tensor([[1],
# [2],
# [3],
# [4],
# [5],
# [6]])
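The contiguity requirement above can be seen with a transposed tensor (a minimal sketch):
import torch
x = torch.randn(2, 3)
t = x.t()                     # the transpose shares the data but is no longer contiguous
# t.view(6) would raise a RuntimeError here because t is not contiguous
y = t.contiguous().view(6)    # make a contiguous copy first, then reshape
print(y.shape)                # torch.Size([6])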
torch.stack is a function in PyTorch used to stack tensors along a new dimension. This operation is similar to torch.cat, but it introduces an additional dimension.
When you have a series of tensors and wish to stack them into a larger tensor, you can utilize torch.stack. This is particularly useful when you want to stack a series of vectors into a matrix or stack matrices into a 3D tensor.
Examples and Usage:
Let’s say you have the following two 1-D tensors:
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
If you wish to stack these two 1-D tensors into a 2-D tensor (matrix):
c = torch.stack((a, b))
Now, c would be:
tensor([[1, 2, 3],
[4, 5, 6]])
Another parameter for torch.stack is dim, which specifies along which dimension you want to stack the tensors. The default is 0, but by adjusting it, you can alter the stacking direction.
In short, torch.stack allows you to stack tensors of the same shape into a higher-dimensional tensor.
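For instance, stacking the same a and b from above along dim=1 instead of the default dim=0:
d = torch.stack((a, b), dim=1)
print(d)
# tensor([[1, 4],
#         [2, 5],
#         [3, 6]])
print(d.shape)  # torch.Size([3, 2])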
torch.cat is a function in PyTorch used to concatenate tensors along a specified dimension. It lets you merge multiple tensors into a larger one.
The main difference between torch.cat and torch.stack is that torch.cat doesn’t introduce a new dimension; it extends the tensor along an existing dimension.
Examples and Usage:
For two 1-D tensors:
a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5, 6])
Use torch.cat to concatenate:
c = torch.cat((a, b))
Now, c is:
tensor([1, 2, 3, 4, 5, 6])
For two 2-D tensors:
x = torch.tensor([[1, 2], [3, 4]])
y = torch.tensor([[5, 6]])
To concatenate along dimension 0 (rows):
z = torch.cat((x, y), dim=0)
Now, z is:
tensor([[1, 2],
[3, 4],
[5, 6]])
Or if you have a y of the same shape as x:
y = torch.tensor([[5, 6], [7, 8]])
Concatenate along dimension 1 (columns):
z = torch.cat((x, y), dim=1)
Now, z is:
tensor([[1, 2, 5, 6],
[3, 4, 7, 8]])
Note: For torch.cat, the sizes of all dimensions except the one you concatenate along must match.
In summary, torch.cat enables you to concatenate tensors along a specified dimension, creating a larger tensor without adding new dimensions.
The kernel size (often also referred to as the filter size) in a convolutional layer directly affects the size of the output (also called the feature map or activation map).
Here’s a breakdown of how the kernel size, along with other parameters, affects the output size:
Kernel Size: The dimensions of the filter used in the convolution operation. Common sizes include (1 × 1), (3 × 3), (5 × 5), etc. in 2D convolutions. The kernel size determines how big of a region in the input we are looking at.
Stride: The number of positions the kernel slides over the input tensor. A stride of 1 means the kernel moves one position at a time, while a stride of 2 means it jumps over one position. The greater the stride, the smaller the output size.
Padding: The number of zeroes added to the border of the input tensor. Padding can be used to control the spatial dimensions of the output tensor. With an appropriate amount of zero padding, the spatial dimensions can stay the same even when a kernel larger than (1 × 1) is used with a stride of 1.
To compute the spatial dimensions of the output feature map for a 2D convolution (assuming square inputs and filters for simplicity):
$$\text{output\_size} = \frac{\text{input\_size} - \text{kernel\_size} + 2 \times \text{padding}}{\text{stride}} + 1$$
For example, let’s consider a 2D input of size (28 × 28):
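With values assumed here purely for illustration, say a (3 × 3) kernel, stride 1, and no padding, the formula gives:
$$\text{output\_size} = \frac{28 - 3 + 2 \times 0}{1} + 1 = 26$$
so the resulting feature map is (26 × 26); with a padding of 1, the output would remain (28 × 28).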
Remember that the exact formula for calculating output size can change depending on the specific type of convolution (e.g., transposed convolution, dilated convolution).
To make the output size the same as the input size (often referred to as “same” padding), the padding P can be set based on the kernel size K and the stride S.
For a convolution operation with a stride of 1, the padding needed to maintain the same spatial dimensions for input and output is:
$$P = \frac{K - 1}{2}$$
For instance, with a kernel size of (3 × 3) (K = 3) and stride of 1, you’d need:
$$P = \frac{3 - 1}{2} = 1$$
So, a padding of 1 would maintain the same dimensions.
However, when using a stride greater than 1, it becomes trickier to maintain exact input-output dimensions. Generally, a stride greater than 1 will downsample the input, and the exact amount of padding needed to keep dimensions consistent will depend on both the input size and the desired output size.
It’s also worth noting that, in deep learning libraries like TensorFlow or PyTorch, you can often specify padding as “same” to automatically ensure the output size matches the input size, at least for a stride of 1. But if you’re implementing convolutions from scratch or need a deep understanding for some advanced architectures or troubleshooting, knowing how to compute the padding manually is useful.
self.convs.append(nn.Conv1d(in_channels, out_channels, kernel_size=kernel_size, stride=1, padding="same"))
Using padding='same' with even kernel lengths and odd dilation may require a zero-padded copy of the input be created.
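A minimal, self-contained check of this behavior (the channel counts and length below are assumed for illustration; note that PyTorch only supports padding="same" with stride 1):
import torch
import torch.nn as nn
conv = nn.Conv1d(in_channels=8, out_channels=16, kernel_size=5, stride=1, padding="same")
x = torch.randn(4, 8, 100)   # [batch, in_channels, length]
print(conv(x).shape)         # torch.Size([4, 16, 100]) -- the length is preserved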
Setting the padding value to kernel_size // 2 is a common practice when the stride is 1, especially for odd-sized kernels. This choice makes it simple to ensure that the output dimensions match the input dimensions.
Odd-sized Kernels: When the kernel size is odd (e.g., 3, 5, 7, …), kernel_size // 2 effectively implements the formula for “same” padding:
$$P = \frac{K - 1}{2}$$
Using integer division (// in Python) ensures a whole number. For example, for a (3 × 3) kernel:
$$P = \frac{3 - 1}{2} = 1$$
For a (5 × 5) kernel:
$$P = \frac{5 - 1}{2} = 2$$
and so on.
Even-sized Kernels: For even-sized kernels, using kernel_size // 2 as the padding doesn’t perfectly preserve dimensions. This is part of the reason why odd-sized kernels are more commonly used in practice. However, if even-sized kernels are used, the designer must decide on a specific padding scheme or adjust the kernel size.
Stride: The above rationale holds when the stride is set to 1. If the stride is greater than 1, the output dimensions will be reduced even with the padding set to kernel_size // 2.
The practice of using kernel_size // 2 makes it easier to design and adjust architectures without constantly recalculating padding, especially when using odd-sized kernels with a stride of 1.
Pooling layers in neural networks, especially in convolutional neural networks (CNNs), are used to reduce the spatial dimensions of the data (i.e., width and height). This downsampling reduces the amount of computation and the number of parameters in later layers, and makes the learned features more robust to small shifts in the input.
There are several types of pooling operations; the most common are max pooling (which keeps the maximum value in each window) and average pooling (which keeps the mean value).
The formula to compute the output size after pooling is similar to the formula used for convolution:
$$\text{output\_size} = \frac{\text{input\_size} - \text{pooling\_size}}{\text{stride}} + 1$$
Where:
- input_size is the width or height of the input data.
- pooling_size is the size of the pooling kernel.
- stride is the number of pixels the pooling kernel moves per step. If not specified, it’s usually the same as the pooling size.
Examples:
import torch.nn as nn
# Assume we have an input tensor of shape [batch_size, channels, height, width]
# For this example: [32, 3, 64, 64]
pooling_layer = nn.MaxPool2d(kernel_size=2, stride=2)
# This will reduce the spatial dimensions (height and width) by half.
# Output shape: [32, 3, 32, 32]
pooling_layer = nn.AvgPool2d(kernel_size=2, stride=2)
# Again, this will reduce the spatial dimensions by half.
# Output shape: [32, 3, 32, 32]
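A runnable shape check of the two pooling layers above (the input is random; only the shape matters here):
import torch
import torch.nn as nn
x = torch.randn(32, 3, 64, 64)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
print(max_pool(x).shape)  # torch.Size([32, 3, 32, 32])
print(avg_pool(x).shape)  # torch.Size([32, 3, 32, 32])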
Note: In practice, modern architectures sometimes prefer using strided convolutions for downsampling instead of pooling layers, but pooling remains an important concept in the understanding and history of CNNs.
In a Convolutional Neural Network (CNN), a fully connected (FC) layer, also known as a dense layer, typically appears after a series of convolutional and pooling layers, and is used to make predictions or classifications based on the extracted features.
To properly set up the input and output dimensions for the FC layers, you need to understand the flow of the data:
Input Dimension of the First FC Layer:
- The input to the first FC layer is the flattened output of the preceding convolutional and pooling layers. For example, if the output of your last convolutional/pooling layer has shape [batch_size, 128, 5, 5] (with 128 feature maps of size 5x5), then the input dimension for your FC layer after flattening would be 128 * 5 * 5 = 3200 (channels × height × width).
Output Dimension of the FC Layer(s):
- For hidden FC layers, the output dimension is a design choice (for example, 512 in the illustration below). For the final FC layer, the output dimension typically equals the number of classes (for classification) or the number of target values (for regression).
Here’s a simple illustration:
import torch.nn as nn
class SimpleCNN(nn.Module):
    def __init__(self, num_classes):
        super(SimpleCNN, self).__init__()
        self.conv_layers = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),  # Assuming 3-channel images as input
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.fc_layers = nn.Sequential(
            nn.Linear(128 * 16 * 16, 512),  # Assuming input image size is 128x128
            nn.ReLU(),
            nn.Linear(512, num_classes)
        )

    def forward(self, x):
        x = self.conv_layers(x)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.fc_layers(x)
        return x
In the above example, for an input image of size 128x128 and 3 channels, the size of the tensor before the FC layers is [batch_size, 128, 16, 16]. The flattened size is 128 * 16 * 16 = 32768. The FC layers reduce this to 512 features and, finally, to num_classes outputs.
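A quick usage check of the SimpleCNN sketch above (the batch size and number of classes are chosen arbitrarily):
import torch
model = SimpleCNN(num_classes=10)
dummy = torch.randn(4, 3, 128, 128)   # a batch of 4 RGB images of size 128x128
out = model(dummy)
print(out.shape)                      # torch.Size([4, 10])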