reworked network arch diagrams
@@ -1027,7 +1027,7 @@ The LeNet-inspired autoencoder can be split into an encoder network (figure~\ref
The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) is a compact convolutional neural network that reduces image data into a lower-dimensional latent space. It consists of two stages of convolution, normalization, non-linear activation, and pooling, followed by a dense layer that defines the latent representation. Conceptually, the convolutional layers learn small filters that detect visual patterns in the input (such as edges or textures). Batch normalization keeps these learned signals numerically stable during training, while a LeakyReLU activation introduces non-linearity, allowing the network to capture more complex relationships. Pooling operations then downsample the feature maps, which reduces the spatial size of the data and emphasizes the most salient features. Finally, a dense layer transforms the extracted feature maps into the latent space, yielding the data's reduced-dimensionality representation.
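To make this structure concrete, the following PyTorch-style sketch shows how such an encoder could be assembled. The class name \texttt{LeNetEncoder} and the default \texttt{latent\_dim} are hypothetical, and the channel counts and padding anticipate the concrete values discussed in the next paragraph; this is an illustration under those assumptions, not the exact implementation used here.

\begin{verbatim}
import torch
import torch.nn as nn

class LeNetEncoder(nn.Module):
    # Hypothetical sketch; class name, latent_dim default, and
    # padding choices are assumptions, not the thesis implementation.
    def __init__(self, latent_dim=64):           # latent size is a tunable hyperparameter
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1x2048x32 -> 8x2048x32
            nn.BatchNorm2d(8),                           # stabilize learned signals
            nn.LeakyReLU(),                              # non-linear activation
            nn.MaxPool2d(2),                             # -> 8x1024x16
            nn.Conv2d(8, 4, kernel_size=3, padding=1),   # -> 4x1024x16
            nn.BatchNorm2d(4),
            nn.LeakyReLU(),
            nn.MaxPool2d(2),                             # -> 4x512x8
        )
        self.to_latent = nn.Linear(4 * 512 * 8, latent_dim)

    def forward(self, x):
        x = self.features(x)                  # extract and downsample features
        return self.to_latent(x.flatten(1))   # flatten, project into latent space
\end{verbatim}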
Concretely, the first convolutional layer uses a $3\times 3$ kernel with 8 output channels, corresponding to 8 learnable filters. For input images of size $1\times 2048\times 32$, this produces an intermediate representation of shape $8\times 2048\times 32$, which is reduced to $8\times 1024\times 16$ by a $2\times 2$ pooling layer. The second convolutional layer again applies a $3\times 3$ kernel but outputs 4 channels, yielding $4\times 1024\times 16$, which a second pooling step reduces to a feature map of shape $4\times 512\times 8$. This feature map is flattened and passed into a fully connected layer, whose output dimensionality defines the size of the latent space, a tunable hyperparameter chosen according to the needs of the application.
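Under the same assumptions, this shape bookkeeping can be checked with a short forward pass through the hypothetical \texttt{LeNetEncoder} sketched above:

\begin{verbatim}
enc = LeNetEncoder(latent_dim=64)
x = torch.randn(2, 1, 2048, 32)    # batch of two 1x2048x32 input images
z = enc(x)
assert z.shape == (2, 64)          # one 64-dimensional latent vector per image
\end{verbatim}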
%Concretely, the first convolutional layer uses a $5\times 5$ kernel with 8 output channels, corresponding to 8 learnable filters. For input images of size $1\times 2048\times 32$, this produces an intermediate representation of shape $8\times 2048\times 32$, which is reduced to $8\times 1024\times 16$ by a $2\times 2$ pooling layer. The second convolutional layer again applies a $5\times 5$ kernel but outputs 4 channels, followed by another pooling step, resulting in a feature map of shape $4\times 512\times 8$. This feature map is flattened and passed into a fully connected layer. The dimensionality of the output of this layer corresponds to the latent space, whose size is a tunable hyperparameter chosen according to the needs of the application.
% Its decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) is a mirrored version of the encoder: a dense layer expands the latent vector, followed by two pairs of $2\times 2$ upsampling and transpose convolution layers with 4 and 8 input channels, respectively. The second transpose convolution reduces its output to a single channel, restoring the $1\times 2048\times 32$ dimensionality of the input, which the autoencoding objective requires.
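Although the decoder description above is commented out in the source, a sketch mirroring it could look as follows. The \texttt{LeNetDecoder} name, the $3\times 3$ transpose kernels with padding to preserve spatial size, and the LeakyReLU activation are assumptions made by analogy with the encoder.

\begin{verbatim}
class LeNetDecoder(nn.Module):
    # Hypothetical mirror of the encoder; kernel size, padding, and
    # activation are assumptions, not taken from the source.
    def __init__(self, latent_dim=64):
        super().__init__()
        self.from_latent = nn.Linear(latent_dim, 4 * 512 * 8)
        self.features = nn.Sequential(
            nn.Upsample(scale_factor=2),                         # 4x512x8 -> 4x1024x16
            nn.ConvTranspose2d(4, 8, kernel_size=3, padding=1),  # -> 8x1024x16
            nn.LeakyReLU(),                                      # assumed, by analogy
            nn.Upsample(scale_factor=2),                         # -> 8x2048x32
            nn.ConvTranspose2d(8, 1, kernel_size=3, padding=1),  # -> 1x2048x32
        )

    def forward(self, z):
        x = self.from_latent(z).view(-1, 4, 512, 8)  # undo the encoder's flatten
        return self.features(x)                      # upsample back to input size
\end{verbatim}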