reworked rf explanation and motivation

Jan Kowalczyk
2025-08-18 11:57:58 +02:00
parent e1c13be697
commit 63f2005caf


@@ -1006,41 +1006,50 @@ The LeNet-inspired autoencoder can be split into an encoder network (figure~\ref
%The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) consists of two convolution steps with pooling layers, and finally a dense layer which populates the latent space. \todo[inline]{lenet explanation from chatgpt?} The first convolutional layer uses a 3x3 kernel and outputs 8 channels, which depicts the number of features/structures/patterns the network can learn to extract from the input and results in an output dimensionality of 2048x32x8 which is reduced to 1024x16x8 by a 2x2 pooling layer. \todo[inline]{batch normalization, and something else like softmax or relu blah?} The second convolution reduces the 8 channels to 4 with another 3x3 kernel \todo[inline]{why? explain rationale} and is followed by another 2x2 pooling layer resulting in a 512x8x4 dimensionality, which is then flattened and input into a dense layer. The dense layer's output dimension is the chosen latent space dimensionality, which is as previously mentioned another tuneable hyperparameter.
\fig{setup_arch_lenet_encoder}{diagrams/arch_lenet_encoder}{UNFINISHED - Visualization of the original LeNet-inspired encoder architecture.}
The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) is a compact convolutional neural network that reduces image data into a lower-dimensional latent space. It consists of two stages of convolution, normalization, non-linear activation, and pooling, followed by a dense layer that defines the latent representation. At a high level, the convolutional layers learn small filters that detect visual patterns in the input (such as edges or textures). Batch normalization ensures that these learned signals remain numerically stable during training, while the LeakyReLU activation introduces non-linearity, allowing the network to capture more complex relationships. Pooling operations then downsample the feature maps, which reduces the spatial size of the data and emphasizes the most important features. Finally, a dense layer transforms the extracted feature maps into the latent vector, which serves as the data's representation in the reduced-dimensionality latent space.
\fig{setup_arch_lenet_decoder}{diagrams/arch_lenet_decoder}{UNFINISHED - Visualization of the original LeNet-inspired decoder architecture.}
Concretely, the first convolutional layer uses a $3\times 3$ kernel with 8 output channels, corresponding to 8 learnable filters. For input images of size $2048\times 32\times 1$, this produces an intermediate representation of shape $2048\times 32\times 8$, which is reduced to $1024\times 16\times 8$ by a $2\times 2$ pooling layer. The second convolutional layer again applies a $3\times 3$ kernel but outputs 4 channels, followed by another pooling step, resulting in a feature map of shape $512\times 8\times 4$. This feature map is flattened and passed into a fully connected layer. The dimensionality of the output of this layer corresponds to the latent space, whose size is a tunable hyperparameter chosen according to the needs of the application.
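To make these layer dimensions concrete, the following is a minimal PyTorch-style sketch of such an encoder, assuming a spatial ordering of 32 rows (beams) by 2048 columns (azimuth steps); the class and parameter names (\texttt{LeNetEncoder}, \texttt{latent\_dim}) are illustrative and not taken from the actual implementation.
\begin{verbatim}
import torch
import torch.nn as nn

class LeNetEncoder(nn.Module):
    """Sketch of a LeNet-style encoder for 2048 x 32 x 1 range images."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 8 x 32 x 2048
            nn.BatchNorm2d(8),
            nn.LeakyReLU(),
            nn.MaxPool2d(2),                             # 8 x 16 x 1024
            nn.Conv2d(8, 4, kernel_size=3, padding=1),   # 4 x 16 x 1024
            nn.BatchNorm2d(4),
            nn.LeakyReLU(),
            nn.MaxPool2d(2),                             # 4 x 8 x 512
        )
        self.to_latent = nn.Linear(4 * 8 * 512, latent_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 32, 2048) -- one grayscale channel,
        # 32 vertical beams, 2048 azimuth steps
        x = self.features(x)
        return self.to_latent(torch.flatten(x, start_dim=1))
\end{verbatim}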
% Its decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) is a mirrored version of the encoder, with a dense layer after the latent space and two pairs of 2x2 upsampling and transpose convolution layers which use 4 and 8 input channels respectively with the second one reducing its output to one channel resulting in the 2048x32x1 output dimensionality, equal to the input's, which is required for the autoencoding objective to be possible.
The decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) mirrors the encoder and reconstructs the input from its latent representation. A dense layer first expands the latent vector into a feature map of shape $512\times 8\times 4$, which is then upsampled and refined in two successive stages. Each stage consists of an interpolation step that doubles the spatial resolution, followed by a transpose convolution that learns how to add structural detail. The first stage operates on 4 channels, and the second on 8 channels, with the final transpose convolution reducing the output to a single channel. The result is a reconstructed output of size $2048\times 32 \times 1$, matching the original input dimensionality required for the autoencoding objective.
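Analogously, a minimal sketch of the mirrored decoder is given below; the interpolation mode of the upsampling steps and all names are again assumptions for illustration only.
\begin{verbatim}
import torch
import torch.nn as nn

class LeNetDecoder(nn.Module):
    """Sketch of the mirrored decoder reconstructing 2048 x 32 x 1 images."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.from_latent = nn.Linear(latent_dim, 4 * 8 * 512)
        self.reconstruct = nn.Sequential(
            nn.Upsample(scale_factor=2),                         # 4 x 16 x 1024
            nn.ConvTranspose2d(4, 8, kernel_size=3, padding=1),  # 8 x 16 x 1024
            nn.BatchNorm2d(8),
            nn.LeakyReLU(),
            nn.Upsample(scale_factor=2),                         # 8 x 32 x 2048
            nn.ConvTranspose2d(8, 1, kernel_size=3, padding=1),  # 1 x 32 x 2048
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.from_latent(z).view(-1, 4, 8, 512)  # unflatten to 4 x 8 x 512
        return self.reconstruct(x)
\end{verbatim}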
%\todo[inline]{what problems and possible improvements did we find when investigating this architecture}
%\todo[inline]{starting point - receptive field, possible loss of information due to narrow RF during convolutions, which motivated us to investigate the impact of an improved arch}
%Even though this architecture is quite simplistic, it also appeared capable to achieve our degradation quantification objective during initial testing. Nonetheless, we identified a few possible shortcomings and improvements, which motivated us to improve upon this architecture and evaluate the impact of choosing a capable architecture for DeepSAD on our usecase.
%The biggest deficiency we identified was the unproportional aspect ratio of the convolutional layer's receptive field. We proposed that it may lead to worse performance and generalization by hindering the capture of certain degradation patterns during the encoding step, due to its badly chosen size and aspect ratio.
%The receptive field of a neural network is a measure of how many input data affect each output data. Is is oftentimes calculated for convolutional networks, which have twodimensional images as input data and can be useful for understanding the network's field of view. The RF size has been shown to be an important factor in a network's capability of extracting/identifying a feature/pattern, in that the RF size has to be at least as large as the pattern meant to capture but should also not be chosen too large, since this can lead to bad performance when capturing smaller more fine-grained patterns. \todo[inline]{refer to graphic 1}
%\todo[inline]{2 subgraphics, left overall neural field , right zoomed in n x n RF affecting single output neuron}
%The receptieve field of convolutional neural networks is often given as a single integer $n$, implying a symmetric receptive field of size $n \times n$ affects each output neuron \todo[inline]{refer to graphic 2}. Formally the RF can be calculated independently for each input dimension and does not have to be square.
%For the LeNet-inspired encoder we calculated the RF size to be \todo[inline]{result and calculation here}. The problem we identified stems from the vast difference in resolution per axis in the input data. Due to the working mechanics of spinning lidar sensors, it is typical for the point cloud's produced by them to have a denser resolution in their x axis than in their y axis. This is the result of a fixed number of vertical channels (typically 32 to 128) rotating on the sensor's x axis and producing oftentimes over 1000 measurements per channel over the 360 degree field of view. We can therefore calculate the measurements per degree for both axis, which is tantamount to pixels per degree for our spherical projection images. The vertical pixel per degree are with A ppd nearly 10 times larger than the B ppd of the horizontal axis \todo[inline]{show calculation and correct values of ppd}. This discrapency in resolution per axis means that a square pixel RF results in a unproportional RF when viewed in terms of degrees, namely N by M degrees of RF.
%Since our understanding of the kinds of degradation present in lidar data is limited we want to make sure the network is capable of capturing many types of degradation patterns. To increase the network's chance of learning to identify such patterns we explored a new network architecture possessing a more square RF when viewed in terms of degrees and also included some other improvements.
Even though the LeNet-inspired encoder proved capable of achieving our degradation quantification objective in initial experiments, we identified several shortcomings that motivated the design of a second, more efficient architecture. The most important issue concerns the shape of the receptive field (RF) in relation to the anisotropic resolution of our LiDAR input data.
The receptive field of a convolutional neural network describes the region of the input that influences a single output activation. Its size and aspect ratio determine which structures the network can effectively capture: if the RF is too small, larger patterns cannot be detected, while an excessively large RF may blur fine details. For standard image data, the RF is often expressed as a symmetric $n \times n$ region \todo[inline]{add schematic of square RF}, but in principle it can be computed independently per axis.
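For a plain stack of convolution and pooling layers, this per-axis computation follows the standard recursion
\[
r_l = r_{l-1} + (k_l - 1)\prod_{i=1}^{l-1} s_i, \qquad r_0 = 1,
\]
where $r_l$ denotes the RF after layer $l$, and $k_l$ and $s_l$ are the kernel size and stride of layer $l$ along the considered axis (symbols introduced here only for illustration). Evaluating the recursion separately for the vertical and horizontal axis yields the per-axis RF and, in general, a non-square region.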
In the case of spherical LiDAR projections, the input has a highly unbalanced resolution due to the sensor geometry. A fixed number of vertical channels (typically 32--128) sweeps across the horizontal axis, producing thousands of measurements per channel. This results in an angular resolution of approximately $0.99^{\circ}$ per pixel vertically and $0.18^{\circ}$ per pixel horizontally \todo[inline]{double-check with calculation graphic/table}. Consequently, the LeNet-inspired encoder's calculated receptive field of $16 \times 16$ pixels translates to an angular size of $15.88^{\circ} \times 2.81^{\circ}$, which is highly rectangular in angular space. Such a mismatch risks limiting the network's ability to capture degradation patterns that extend differently across the two axes. \todo[inline]{add schematic showing rectangular angular RF overlaid on LiDAR projection}
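As a brief sanity check of these values, assuming the 2048 azimuth samples cover the full $360^{\circ}$ sweep and the 32 channels cover a vertical field of view of roughly $31.7^{\circ}$ (the vertical field of view being an assumption rather than a datasheet value), the angular resolutions follow as
\[
\frac{360^{\circ}}{2048} \approx 0.18^{\circ}/\text{pixel (horizontal)}, \qquad \frac{31.7^{\circ}}{32} \approx 0.99^{\circ}/\text{pixel (vertical)},
\]
so a $16 \times 16$ pixel RF spans roughly $16 \cdot 0.99^{\circ} \approx 15.9^{\circ}$ vertically but only $16 \cdot 0.18^{\circ} \approx 2.8^{\circ}$ horizontally.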
To address this, we developed an efficient network architecture with asymmetric convolution kernels, resulting in a receptive field of $10 \times 52$ pixels. In angular terms, this corresponds to $9.93^{\circ} \times 9.14^{\circ}$, which is far more balanced between vertical and horizontal directions. This adjustment increases the likelihood of capturing a broader variety of degradation patterns. Additional design improvements were incorporated as well, which will be described in the following section.
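To make the per-axis bookkeeping tangible, the short helper below applies the recursion from above independently to each axis; the layer configurations in the example are purely hypothetical and do not reproduce the kernels of the actual efficient architecture.
\begin{verbatim}
def receptive_field(layers):
    """Per-axis receptive field for (kernel, stride) pairs, each given as
    ((k_vertical, k_horizontal), (s_vertical, s_horizontal))."""
    rf, jump = [1, 1], [1, 1]
    for kernel, stride in layers:
        for axis in range(2):
            rf[axis] += (kernel[axis] - 1) * jump[axis]
            jump[axis] *= stride[axis]
    return tuple(rf)

# Hypothetical examples: widening only the horizontal kernel widens only
# the horizontal receptive field.
print(receptive_field([((3, 3), (1, 1)), ((2, 2), (2, 2))]))  # (4, 4)
print(receptive_field([((3, 9), (1, 1)), ((2, 2), (2, 2))]))  # (4, 10)
\end{verbatim}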
%\todo[inline]{start by explaining lenet architecture, encoder and decoder split, encoder network is the one being trained during the main training step, together as autoencoder during pretraining, decoder of lenet pretty much mirrored architecture of encoder, after preprocessing left with image data (2d projections, grayscale = 1 channel) so input is 2048x32x1. convolutional layers with pooling afterwards (2 convolution + pooling) convolutions to multiple channels (8, 4?) each channel capable of capturing a different pattern/structure of input. fully connected layer before latent space, latent space size not fixed since its also a hyperparameter and depended on how well the normal vs anomalous data can be captured and differentiated in the dimensionality of the latent space}
\todo[inline]{batch normalization, relu? something....}
\fig{setup_arch_ef_encoder}{diagrams/arch_ef_encoder}{UNFINISHED - Visualization of the efficient encoder architecture.}
\fig{setup_arch_ef_decoder}{diagrams/arch_ef_decoder}{UNFINISHED - Visualization of the efficient decoder architecture.}