The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) consists of two convolution steps with pooling layers, followed by a dense layer which populates the latent space. \todo[inline]{lenet explanation from chatgpt?} The first convolutional layer uses a $3\times3$ kernel and outputs 8 channels, corresponding to the number of features/structures/patterns the network can learn to extract from the input; this yields an output dimensionality of $2048\times32\times8$, which a $2\times2$ pooling layer reduces to $1024\times16\times8$. \todo[inline]{batch normalization, and something else like softmax or relu blah?} The second convolution reduces the 8 channels to 4 with another $3\times3$ kernel \todo[inline]{why? explain rationale} and is followed by another $2\times2$ pooling layer, resulting in a $512\times8\times4$ dimensionality, which is then flattened and fed into a dense layer. The dense layer's output dimension is the chosen latent space dimensionality, which, as previously mentioned, is another tuneable hyperparameter.
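The layer-by-layer dimensionality above can be sketched as simple shape bookkeeping. This is a hypothetical sketch that assumes ``same'' padding for the $3\times3$ convolutions, so that only the $2\times2$ pooling layers change the spatial resolution:

```python
def conv2d_same(h, w, c_out):
    """3x3 convolution with padding 1: spatial size unchanged, channels set to c_out."""
    return h, w, c_out

def pool2x2(h, w, c):
    """2x2 pooling: both spatial dimensions halved, channels unchanged."""
    return h // 2, w // 2, c

shape = (2048, 32, 1)                 # spherical projection input
shape = conv2d_same(*shape[:2], 8)    # -> 2048 x 32 x 8
shape = pool2x2(*shape)               # -> 1024 x 16 x 8
shape = conv2d_same(*shape[:2], 4)    # -> 1024 x 16 x 4
shape = pool2x2(*shape)               # -> 512 x 8 x 4
flat = shape[0] * shape[1] * shape[2]
print(shape, flat)                    # (512, 8, 4) 16384
```

The flattened vector of 16384 values is what the final dense layer maps down to the chosen latent dimensionality.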
Its decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) is a mirrored version of the encoder: a dense layer follows the latent space, then two pairs of \todo[inline]{upscale, then deconv?} \todo[inline]{deconv = convtranspose why??} $2\times2$ upsampling and transpose convolution layers, which take 4 and 8 input channels respectively, with the second pair reducing its output to a single channel. This yields a $2048\times32\times1$ output dimensionality equal to the input's, which is required for the autoencoding objective to be possible.
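Mirroring the encoder sketch, the decoder's shape bookkeeping can be traced the same way. This is a hypothetical sketch assuming the $2\times2$ upsampling steps exactly invert the encoder's pooling and the transpose convolutions preserve spatial size:

```python
def upsample2x2(h, w, c):
    """2x2 upsampling: both spatial dimensions doubled, channels unchanged."""
    return h * 2, w * 2, c

def tconv_same(h, w, c_out):
    """Transpose convolution, assumed to preserve spatial size; channels set to c_out."""
    return h, w, c_out

shape = (512, 8, 4)                 # dense layer output, reshaped
shape = upsample2x2(*shape)         # -> 1024 x 16 x 4
shape = tconv_same(*shape[:2], 8)   # -> 1024 x 16 x 8
shape = upsample2x2(*shape)         # -> 2048 x 32 x 8
shape = tconv_same(*shape[:2], 1)   # -> 2048 x 32 x 1, matching the input
```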
\todo[inline]{what problems and possible improvements did we find when investigating this architecture}
\todo[inline]{starting point - receptive field, possible loss of information due to narrow RF during convolutions, which motivated us to investigate the impact of an improved arch}
Even though this architecture is quite simplistic, it appeared capable of achieving our degradation quantification objective during initial testing. Nonetheless, we identified a few possible shortcomings and improvements, which motivated us to improve upon this architecture and to evaluate the impact of choosing a capable architecture for DeepSAD on our use case.
The biggest deficiency we identified was the disproportionate aspect ratio of the convolutional layers' receptive field. We hypothesized that its badly chosen size and aspect ratio may hinder the capture of certain degradation patterns during the encoding step, leading to worse performance and generalization.
The receptive field of a neural network measures how much of the input affects each output value. It is often calculated for convolutional networks, which take two-dimensional images as input, and is useful for understanding the network's field of view. The RF size has been shown to be an important factor in a network's capability of extracting a feature or pattern: the RF has to be at least as large as the pattern it is meant to capture, but should also not be chosen too large, since this can lead to bad performance when capturing smaller, more fine-grained patterns. \todo[inline]{refer to graphic 1}
\todo[inline]{2 subgraphics, left overall neural field , right zoomed in n x n RF affecting single output neuron}
The receptive field of convolutional neural networks is often given as a single integer $n$, implying that a symmetric receptive field of size $n \times n$ affects each output neuron \todo[inline]{refer to graphic 2}. Formally, the RF can be calculated independently for each input dimension and does not have to be square.
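A per-axis RF can be computed with the standard recurrence $r_l = r_{l-1} + (k_l - 1)\,j_{l-1}$ and $j_l = j_{l-1}\,s_l$, where $k_l$ and $s_l$ are the kernel size and stride of layer $l$ and $j$ is the cumulative stride (the ``jump'' between neighbouring outputs). A minimal sketch, assuming stride 1 for the $3\times3$ convolutions and stride 2 for the $2\times2$ pooling layers:

```python
def receptive_field(layers):
    """Receptive field along one axis for a list of (kernel, stride) layers."""
    r, j = 1, 1
    for k, s in layers:
        r += (k - 1) * j  # each layer widens the RF by (k - 1) cumulative strides
        j *= s            # cumulative stride grows with each strided layer
    return r

# 3x3 conv, 2x2 pool, 3x3 conv, 2x2 pool (identical along both axes)
encoder = [(3, 1), (2, 2), (3, 1), (2, 2)]
rf = receptive_field(encoder)
print(rf)  # 10 -> a square 10 x 10 pixel receptive field
```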
For the LeNet-inspired encoder we calculated the RF size to be \todo[inline]{result and calculation here}. The problem we identified stems from the vast difference in resolution per axis in the input data. Due to the working mechanics of spinning lidar sensors, the point clouds they produce typically have a denser resolution along their x axis than along their y axis. This is the result of a fixed number of vertical channels (typically 32 to 128) rotating around the sensor's vertical axis, often producing over 1000 measurements per channel across the 360 degree field of view. We can therefore calculate the measurements per degree for both axes, which is tantamount to pixels per degree (ppd) for our spherical projection images. The horizontal axis, at A ppd, offers nearly 10 times the resolution of the vertical axis' B ppd \todo[inline]{show calculation and correct values of ppd}. This discrepancy in resolution per axis means that a square RF becomes disproportionate when viewed in terms of degrees, namely N by M degrees of RF.
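The ppd argument can be sketched numerically. The vertical field of view used below is an assumed placeholder value, not the actual sensor's (the values A and B above remain to be filled in):

```python
H_PIXELS, V_PIXELS = 2048, 32   # spherical projection resolution
H_FOV_DEG = 360.0               # full rotation of the spinning lidar
V_FOV_DEG = 45.0                # assumed vertical field of view (placeholder)

h_ppd = H_PIXELS / H_FOV_DEG    # horizontal pixels per degree
v_ppd = V_PIXELS / V_FOV_DEG    # vertical pixels per degree

# A square n x n pixel RF spans n / h_ppd degrees horizontally but
# n / v_ppd degrees vertically, i.e. it is `ratio` times taller in degrees.
ratio = h_ppd / v_ppd
print(round(h_ppd, 2), round(v_ppd, 2), round(ratio, 1))  # 5.69 0.71 8.0
```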
Since our understanding of the kinds of degradation present in lidar data is limited, we want to make sure the network is capable of capturing many types of patterns.
\todo[inline]{start by explaining lenet architecture, encoder and decoder split, encoder network is the one being trained during the main training step, together as autoencoder during pretraining, decoder of lenet pretty much mirrored architecture of encoder, after preprocessing left with image data (2d projections, grayscale = 1 channel) so input is 2048x32x1. convolutional layers with pooling afterwards (2 convolution + pooling) convolutions to multiple channels (8, 4?) each channel capable of capturing a different pattern/structure of input. fully connected layer before latent space, latent space size not fixed since its also a hyperparameter and depended on how well the normal vs anomalous data can be captured and differentiated in the dimensionality of the latent space}
\todo[inline]{batch normalization, relu? something....}