%\todo[inline]{start by explaining lenet architecture, encoder and decoder split, encoder network is the one being trained during the main training step, together as autoencoder during pretraining, decoder of lenet pretty much mirrored architecture of encoder, after preprocessing left with image data (2d projections, grayscale = 1 channel) so input is 2048x32x1. convolutional layers with pooling afterwards (2 convolution + pooling) convolutions to multiple channels (8, 4?) each channel capable of capturing a different pattern/structure of input. fully connected layer before latent space, latent space size not fixed since its also a hyperparameter and depended on how well the normal vs anomalous data can be captured and differentiated in the dimensionality of the latent space}
%\todo[inline]{batch normalization, relu? something....}
The efficient network architecture was designed to address the shortcomings of the LeNet-inspired model and to better align with the characteristics of spherical LiDAR projections. At a high level, the main design principles were (i) balancing the receptive field in angular space, (ii) improving efficiency for deployment on embedded hardware, and (iii) avoiding reconstruction artifacts in the decoder. The following paragraphs describe the encoder and decoder components in more detail.
\fig{setup_arch_ef_encoder}{diagrams/arch_ef_encoder}{UNFINISHED - Visualization of the efficient encoder architecture.}
\paragraph{Encoder.}
The efficient encoder (see figure~\ref{fig:setup_arch_ef_encoder}) follows the same general idea as the LeNet-inspired encoder, but incorporates several modifications to improve both performance and efficiency (a code sketch combining these elements follows the list):
\begin{itemize}
\item \textbf{Non-square convolution kernels.} Depthwise-separable convolutions with kernel size $3 \times 17$ are used instead of square kernels. Since the horizontal axis is sampled far more densely (2048~px over $360^{\circ}$) than the vertical axis (32~px), a wide kernel covers a comparable angular extent in both directions, yielding a more balanced receptive field and better alignment with the anisotropic LiDAR input.
\item \textbf{Circular padding along azimuth.} Only the horizontal axis is circularly padded to respect the wrap-around of $360^{\circ}$ LiDAR data, preventing artificial seams at the image boundaries.
\item \textbf{Depthwise-separable convolutions with channel shuffle.} This combination, inspired by MobileNet and ShuffleNet, reduces the number of parameters and computations while retaining representational capacity, making the network more suitable for embedded platforms.
\item \textbf{Aggressive horizontal pooling.} A $1 \times 4$ pooling operation is applied early in the network, reducing the over-sampled horizontal resolution from 2048~px to 512~px while keeping vertical detail intact.
\item \textbf{Max pooling.} Standard max pooling is used instead of average pooling, since it preserves sharp activations that are often indicative of localized degradation.
\item \textbf{Channel compression before latent mapping.} After feature extraction, a $1 \times 1$ convolution reduces the number of channels before flattening, which lowers the parameter count of the final fully connected layer without sacrificing feature richness.
\end{itemize}
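To make these choices concrete, the following is a minimal PyTorch sketch of a single encoder stage. The layer sizes, the group count, and the batch-normalization/ReLU placement are illustrative assumptions, not the exact implementation:

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    # ShuffleNet-style shuffle: interleave channels across groups so
    # that consecutive grouped convolutions can exchange information
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2)
    return x.contiguous().view(n, c, h, w)

class EncoderStage(nn.Module):
    # One stage: depthwise 3x17 convolution with circular padding
    # along the azimuth, grouped 1x1 convolution with channel
    # shuffle, and aggressive 1x4 max pooling.
    def __init__(self, in_ch, out_ch, groups=4):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, (3, 17), groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, groups=groups)
        self.bn = nn.BatchNorm2d(out_ch)  # assumed normalization
        self.groups = groups

    def forward(self, x):
        # circular padding on the width (azimuth) axis only, ...
        x = F.pad(x, (8, 8, 0, 0), mode="circular")
        # ... plain zero padding on the height (elevation) axis
        x = F.pad(x, (0, 0, 1, 1))
        x = F.relu(self.bn(self.pointwise(self.depthwise(x))))
        x = channel_shuffle(x, self.groups)
        return F.max_pool2d(x, (1, 4))  # width shrinks 4x, height kept

stage = EncoderStage(8, 16)
out = stage(torch.randn(1, 8, 32, 512))  # -> (1, 16, 32, 128)
\end{verbatim}

Padding the two axes in separate steps is what restricts the wrap-around to the azimuth only.
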
These design choices together result in a latent representation of dimension $d=512$ (tunable), with a receptive field of approximately $10 \times 52$ pixels, corresponding to $9.93^{\circ} \times 9.14^{\circ}$ in angular space \todo[inline]{insert RF figure}. This is substantially more balanced than the $15.88^{\circ} \times 2.81^{\circ}$ receptive field of the LeNet-inspired encoder.
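The azimuthal figure follows directly from the horizontal sampling of the projection (2048~px covering $360^{\circ}$):
\[
52~\text{px} \cdot \frac{360^{\circ}}{2048~\text{px}} \approx 9.14^{\circ},
\]
while the vertical extent depends on the sensor's beam spacing; the quoted $9.93^{\circ}$ over 10~px corresponds to roughly $0.99^{\circ}$ between adjacent rows.
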
\fig{setup_arch_ef_decoder}{diagrams/arch_ef_decoder}{UNFINISHED - Visualization of the efficient decoder architecture.}
\paragraph{Decoder.}
The efficient decoder (see figure~\ref{fig:setup_arch_ef_decoder}) mirrors the encoder’s structure but introduces changes to improve reconstruction stability (a sketch of one upsampling stage is given below):
\begin{itemize}
\item \textbf{Nearest-neighbor upsampling followed by convolution.} Instead of relying solely on transposed convolutions, each upsampling stage first enlarges the feature map using parameter-free nearest-neighbor interpolation, followed by a depthwise-separable convolution. This strategy reduces the risk of checkerboard artifacts while still allowing the network to learn fine detail.
\item \textbf{Asymmetric upsampling schedule.} Horizontal resolution is restored more aggressively (e.g., scale factor $1 \times 4$) to reflect the anisotropic downsampling performed in the encoder.
\item \textbf{Final convolution with circular padding.} The output is generated using a $3 \times 17$ convolution with circular padding along the azimuth, ensuring consistent treatment of the $360^{\circ}$ LiDAR input.
\end{itemize}
The resulting output has the same dimensionality as the input ($32 \times 2048 \times 1$), enabling the autoencoding objective.
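As with the encoder, a minimal PyTorch sketch of one upsampling stage illustrates the upsample-then-convolve strategy; the layer sizes and the activation are again illustrative assumptions:

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStage(nn.Module):
    # Parameter-free nearest-neighbor upsampling followed by a
    # depthwise-separable 3x17 convolution with circular padding
    # along the azimuth, avoiding the checkerboard artifacts that
    # transposed convolutions tend to produce.
    def __init__(self, in_ch, out_ch, scale=(1, 4)):
        super().__init__()
        self.scale = scale
        self.depthwise = nn.Conv2d(in_ch, in_ch, (3, 17), groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        # enlarge first, asymmetrically: width grows 4x, height is kept
        x = F.interpolate(x, scale_factor=self.scale, mode="nearest")
        x = F.pad(x, (8, 8, 0, 0), mode="circular")  # azimuth wrap-around
        x = F.pad(x, (0, 0, 1, 1))                   # zero-pad elevation
        return F.relu(self.pointwise(self.depthwise(x)))

stage = DecoderStage(16, 8)
out = stage(torch.randn(1, 16, 32, 128))  # -> (1, 8, 32, 512)
\end{verbatim}
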
\threadtodo{how was training/testing adapted (networks overview), inference, ae tuning}