wip
thesis/Main.tex
@@ -1006,23 +1006,45 @@ Since the neural network architecture trained in the deepsad method is not fixed
\newsubsubsectionNoTOC{Network architectures (LeNet variant, custom encoder) and how they suit the point‑cloud input}

\todo[inline]{STANDARDIZE ALL DIMENSIONS TO (CHANNEL, WIDTH, HEIGHT)}
The LeNet-inspired autoencoder can be split into an encoder network (figure~\ref{fig:setup_arch_lenet_encoder}) and a decoder network (figure~\ref{fig:setup_arch_lenet_decoder}) with a latent space in between the two parts. Such an arrangement is typical for autoencoder architectures, as discussed in section~\ref{sec:autoencoder}. The encoder network simultaneously serves as DeepSAD's main training architecture, which, once trained, is used to infer the degradation quantification in our use case.
%The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) consists of two convolution steps with pooling layers, and finally a dense layer which populates the latent space. \todo[inline]{lenet explanation from chatgpt?} The first convolutional layer uses a 3x3 kernel and outputs 8 channels, which depicts the number of features/structures/patterns the network can learn to extract from the input and results in an output dimensionality of 2048x32x8 which is reduced to 1024x16x8 by a 2x2 pooling layer. \todo[inline]{batch normalization, and something else like softmax or relu blah?} The second convolution reduces the 8 channels to 4 with another 3x3 kernel \todo[inline]{why? explain rationale} and is followed by another 2x2 pooling layer resulting in a 512x8x4 dimensionality, which is then flattened and input into a dense layer. The dense layer's output dimension is the chosen latent space dimensionality, which is as previously mentioned another tuneable hyperparameter.
%\fig{setup_arch_lenet_encoder}{diagrams/arch_lenet_encoder}{UNFINISHED - Visualization of the original LeNet-inspired encoder architecture.}
\fig{setup_arch_lenet_encoder}{diagrams/arch_lenet_encoder}{
Architecture of the LeNet-inspired encoder. The input is a LiDAR range image of size
$1\times 2048\times 32$ (channels $\times$ width $\times$ height). The first block (Conv1) applies a
$5\times 5$ convolution with 8 output channels, followed by batch normalization, LeakyReLU activation,
and $2\times 2$ max pooling, resulting in a feature map of size $8\times 1024\times 16$.
The second block (Conv2) applies another $5\times 5$ convolution with 4 output channels, again followed
by normalization, activation, and $2\times 2$ max pooling, producing $4\times 512\times 8$.
This feature map is flattened and passed through a fully connected (FC) layer of size
$4\cdot 512 \cdot 8 = 16384$, which maps to the latent space of dimensionality $d$, where $d$ is a tunable
hyperparameter ($32 \leq d \leq 1024$ in our experiments). The latent space serves as the compact
representation used by DeepSAD for anomaly detection.
}
The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) is a compact convolutional neural network that reduces image data into a lower-dimensional latent space. It consists of two stages of convolution, normalization, non-linear activation, and pooling, followed by a dense layer that defines the latent representation. Conceptually, the convolutional layers learn small filters that detect visual patterns in the input (such as edges or textures). Batch normalization keeps these learned signals numerically stable during training, while a LeakyReLU activation introduces non-linearity, allowing the network to capture more complex relationships. Pooling operations then downsample the feature maps, which reduces the spatial size of the data and emphasizes the most important features. Finally, a dense layer transforms the extracted feature maps into the latent space, which serves as the data's reduced-dimensionality representation.

Concretely, the first convolutional layer uses a $3\times 3$ kernel with 8 output channels, corresponding to 8 learnable filters. For input images of size $1\times 2048\times 32$, this produces an intermediate representation of shape $8\times 2048\times 32$, which is reduced to $8\times 1024\times 16$ by a $2\times 2$ pooling layer. The second convolutional layer again applies a $3\times 3$ kernel but outputs 4 channels, followed by another pooling step, resulting in a feature map of shape $4\times 512\times 8$. This feature map is flattened and passed into a fully connected layer. The dimensionality of the output of this layer corresponds to the latent space, whose size is a tunable hyperparameter chosen according to the needs of the application.
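To make the layer sequence concrete, the following is a minimal PyTorch sketch of such an encoder; the module names, the default latent size, and the \texttt{bias=False} choice are our assumptions, and tensors follow PyTorch's NCHW layout. A $5\times 5$ kernel with padding~2, as shown in the figure, would produce the same shapes.
\begin{verbatim}
import torch.nn as nn

class LeNetEncoder(nn.Module):
    """Sketch of the LeNet-inspired encoder (illustrative, not the exact thesis code)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.act = nn.LeakyReLU()
        self.pool = nn.MaxPool2d(2)                               # 2x2 max pooling
        self.conv1 = nn.Conv2d(1, 8, 3, padding=1, bias=False)    # 1 -> 8 channels
        self.bn1 = nn.BatchNorm2d(8)
        self.conv2 = nn.Conv2d(8, 4, 3, padding=1, bias=False)    # 8 -> 4 channels
        self.bn2 = nn.BatchNorm2d(4)
        self.fc = nn.Linear(4 * 512 * 8, latent_dim, bias=False)  # 16384 -> d

    def forward(self, x):                                  # x: (batch, 1, 32, 2048)
        x = self.pool(self.act(self.bn1(self.conv1(x))))   # -> (batch, 8, 16, 1024)
        x = self.pool(self.act(self.bn2(self.conv2(x))))   # -> (batch, 4, 8, 512)
        return self.fc(x.flatten(start_dim=1))             # -> (batch, latent_dim)
\end{verbatim}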
% Its decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) is a mirrored version of the encoder, with a dense layer after the latent space and two pairs of 2x2 upsampling and transpose convolution layers which use 4 and 8 input channels respectively with the second one reducing its output to one channel resulting in the 2048x32x1 output dimensionality, equal to the input's, which is required for the autoencoding objective to be possible.
\fig{setup_arch_lenet_decoder}{diagrams/arch_lenet_decoder}{
Architecture of the LeNet-inspired decoder. The input is a latent vector of dimension $d$,
where $d$ is the tunable latent space size ($32 \leq d \leq 1024$ in our experiments).
A fully connected (FC) layer first expands this vector into a feature map of size
$4\times 512\times 8$ (channels $\times$ width $\times$ height).
The first upsampling stage applies interpolation with scale factor 2, followed by a
transpose convolution with 8 output channels, batch normalization, and LeakyReLU activation,
yielding $8\times 1024\times 16$. The second stage again upsamples by factor 2 and applies
a transpose convolution, reducing the channels to 1. This produces the reconstructed output
of size $1\times 2048\times 32$, which matches the original input dimensionality required
for the autoencoding objective.
}
The decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) mirrors the encoder and reconstructs the input from its latent representation. A dense layer first expands the latent vector into a feature map of shape $4\times 512\times 8$, which is then upsampled and refined in two successive stages. Each stage consists of an interpolation step that doubles the spatial resolution, followed by a transpose convolution that learns how to add structural detail. The first stage operates on 4 channels, and the second on 8 channels, with the final transpose convolution reducing the output to a single channel. The result is a reconstructed output of size $1\times 2048\times 32$, matching the original input dimensionality required for the autoencoding objective.
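Under the same assumptions as for the encoder sketch, a matching decoder could be sketched as follows.
\begin{verbatim}
import torch.nn as nn

class LeNetDecoder(nn.Module):
    """Sketch of the LeNet-inspired decoder (illustrative, not the exact thesis code)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 4 * 512 * 8, bias=False)  # d -> 16384
        self.up = nn.Upsample(scale_factor=2, mode='nearest')     # interpolation step
        self.act = nn.LeakyReLU()
        self.deconv1 = nn.ConvTranspose2d(4, 8, 3, padding=1, bias=False)  # 4 -> 8 channels
        self.bn1 = nn.BatchNorm2d(8)
        self.deconv2 = nn.ConvTranspose2d(8, 1, 3, padding=1, bias=False)  # 8 -> 1 channel

    def forward(self, z):
        x = self.fc(z).view(-1, 4, 8, 512)                 # (batch, 4, 8, 512)
        x = self.act(self.bn1(self.deconv1(self.up(x))))   # (batch, 8, 16, 1024)
        return self.deconv2(self.up(x))                    # (batch, 1, 32, 2048)
\end{verbatim}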
%\todo[inline]{what problems and possible improvements did we find when investigating this architecture}
%\todo[inline]{starting point - receptive field, possible loss of information due to narrow RF during convolutions, which motivated us to investigate the impact of an improved arch}
@@ -1064,7 +1086,24 @@ To adjust for this, we decided to modify the network architecture and included f
\item \textbf{Channel compression before latent mapping.} After feature extraction, a $1 \times 1$ convolution reduces the number of channels before flattening, which lowers the parameter count of the final fully connected layer without sacrificing feature richness.
\end{itemize}
%\fig{setup_arch_ef_encoder}{diagrams/arch_ef_encoder}{UNFINISHED - Visualization of the efficient encoder architecture.}
\fig{setup_arch_ef_encoder}{diagrams/arch_ef_encoder}{
Architecture of the Efficient encoder. The input is a LiDAR range image of size
$1 \times 2048 \times 32$ (channels $\times$ width $\times$ height).
The first block (\textbf{Conv1}) applies a depthwise–separable $3 \times 17$ convolution
with circular padding along the azimuth, followed by batch normalization, LeakyReLU,
and an aggressive horizontal pooling step, producing an intermediate representation
of $16 \times 512 \times 32$.
The second block (\textbf{Conv2}) applies another depthwise–separable convolution
with channel shuffle, followed by two stages of $2 \times 2$ max pooling, yielding
$32 \times 128 \times 8$.
A $1 \times 1$ convolution (\textbf{Squeeze}) then reduces the channel dimension
to 8, producing $8 \times 128 \times 8$.
Finally, a fully connected layer (\textbf{FC}) flattens this feature map and projects
it into the latent space of size $d$, where $d$ is a tunable hyperparameter
($32 \leq d \leq 1024$ in our experiments).
}
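The two non-standard building blocks of this encoder, the circularly padded depthwise-separable convolution and the channel shuffle, can be sketched in PyTorch roughly as follows; the class and function names are ours, and the padding sizes assume ``same''-style convolutions.
\begin{verbatim}
import torch.nn as nn
import torch.nn.functional as F

class CircularDSConv(nn.Module):
    """Depthwise-separable convolution with circular padding along the azimuth (width) axis."""
    def __init__(self, in_ch, out_ch, kernel=(3, 17)):
        super().__init__()
        self.pad_h = kernel[0] // 2          # vertical padding (rings), zero-padded
        self.pad_w = kernel[1] // 2          # horizontal padding (azimuth), wrapped around
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):                    # x: (batch, channels, 32, 2048)
        x = F.pad(x, (self.pad_w, self.pad_w, 0, 0), mode='circular')  # 360 deg wrap-around
        x = F.pad(x, (0, 0, self.pad_h, self.pad_h))                   # zero-pad top/bottom
        return self.pointwise(self.depthwise(x))

def channel_shuffle(x, groups):
    """Interleave channels across groups (ShuffleNet-style) after a grouped convolution."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)
\end{verbatim}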
\paragraph{Decoder.}

@@ -1075,9 +1114,24 @@ The decoder (see figure~\ref{fig:setup_arch_ef_decoder}) mirrors the encoder’s

\item \textbf{Final convolution with circular padding.} The output is generated using a $(3 \times 17)$ convolution with circular padding along the azimuth similar to the new encoder, ensuring consistent treatment of the 360° LiDAR input.
\end{itemize}
%\fig{setup_arch_ef_decoder}{diagrams/arch_ef_decoder}{UNFINISHED - Visualization of the efficient decoder architecture.}
\fig{setup_arch_ef_decoder}{diagrams/arch_ef_decoder}{
Architecture of the Efficient decoder. The input is a latent vector of dimension $d$.
A fully connected layer first expands this into a feature map of size $8 \times 128 \times 8$,
followed by a $1 \times 1$ convolution (\textbf{Unsqueeze}) to increase the channel count to 32.
The following three blocks (\textbf{Deconv1–3}) each consist of nearest-neighbor upsampling
and a depthwise–separable convolution:
\textbf{Deconv1} doubles both axes ($32 \times 256 \times 16$),
\textbf{Deconv2} restores horizontal resolution more aggressively with a $1 \times 4$ upsampling
($16 \times 1024 \times 16$),
and \textbf{Deconv3} doubles both axes again to $8 \times 2048 \times 32$.
The final block (\textbf{Deconv4}) applies a $(3 \times 17)$ convolution with circular padding
along the azimuth, reducing the channels to 1 and producing the reconstructed output
of size $1 \times 2048 \times 32$, which matches the original input dimensionality.
}
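A single decoder upsampling block can be sketched by combining nearest-neighbor interpolation with the depthwise-separable convolution from the encoder sketch above; the per-axis scale factors match the figure, while the normalization and activation choices are our assumptions.
\begin{verbatim}
import torch.nn as nn

class DeconvBlock(nn.Module):
    """Decoder upsampling block sketch: nearest-neighbour upsampling + depthwise-separable conv."""
    def __init__(self, in_ch, out_ch, scale=(2, 2)):
        super().__init__()
        # scale = (vertical, horizontal); e.g. (1, 4) restores only the azimuth resolution
        self.up = nn.Upsample(scale_factor=scale, mode='nearest')
        self.conv = CircularDSConv(in_ch, out_ch)   # block from the encoder sketch above
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(self.up(x))))

# Deconv1-3 could then be chained as, e.g.:
# DeconvBlock(32, 32, (2, 2)), DeconvBlock(32, 16, (1, 4)), DeconvBlock(16, 8, (2, 2))
\end{verbatim}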
Even though both encoders were designed for the same input dimensionality of $1\times 2048 \times 32$, their computational requirements differ significantly. To quantify this, we compared the number of trainable parameters and the number of multiply–accumulate operations (MACs) for different latent space sizes used in our experiments.
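Trainable parameters can be counted directly in PyTorch, and a back-of-the-envelope look at the final dense layer already explains most of the gap between the two encoders; the helper below is a generic sketch, not our measurement script.
\begin{verbatim}
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters of a PyTorch module."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# The fully connected layer dominates the LeNet-inspired encoder: it maps the
# 4*512*8 = 16384 flattened features to d latent dimensions, i.e. 16384*d weights
# (and about as many multiply-accumulate operations for this layer alone).
for d in (32, 256, 1024):
    print(d, 4 * 512 * 8 * d)   # 524288, 4194304, 16777216
\end{verbatim}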
\begin{table}[h]
\centering
@@ -1151,16 +1205,45 @@ During training, the algorithm balances two competing objectives: capturing as m
We adapted the baseline implementations to our data loader and input format \todo[inline]{briefly describe file layout / preprocessing}, and added support for multiple evaluation targets per frame (two labels per data point), reporting both results per experiment. For OCSVM, the dimensionality reduction step is \emph{always} performed with the corresponding DeepSAD encoder and its autoencoder pretraining weights that match the evaluated setting (i.e., same latent size and backbone). Both baselines, like DeepSAD, output continuous anomaly scores. This allows us to evaluate them directly without committing to a fixed threshold.
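A minimal sketch of this baseline pipeline, assuming a pretrained DeepSAD \texttt{encoder} and standard scikit-learn, could look as follows; the function and variable names as well as the \texttt{nu} value are illustrative.
\begin{verbatim}
import numpy as np
import torch
from sklearn.svm import OneClassSVM

@torch.no_grad()
def encode(encoder, loader, device="cpu"):
    """Project all frames into the latent space with the pretrained DeepSAD encoder."""
    encoder.eval()
    feats = [encoder(batch[0].to(device)).cpu().numpy() for batch in loader]
    return np.concatenate(feats, axis=0)

# z_train = encode(encoder, train_loader)         # latent features of the training frames
# z_test  = encode(encoder, test_loader)
# ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(z_train)
# scores = -ocsvm.decision_function(z_test)       # continuous scores, higher = more anomalous
\end{verbatim}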
\section{Experiment Overview \& Computational Environment}
\threadtodo
{give overview of experiments and their motivations}
{training setup clear, but not what was trained/tested}
{explanation of what was searched for (ae latent space first), other hyperparams and why}
{all experiments known $\rightarrow$ how long do they take to train}
\newsubsubsectionNoTOC{Hardware specifications (GPU/CPU, memory), software versions, typical training/inference runtimes}
Our experimental setup consisted of two stages. First, we conducted a hyperparameter search over the latent space dimensionality by pretraining the autoencoders alone. For both the LeNet-inspired and the Efficient network, we evaluated latent space sizes of $32, 64, 128, 256, 384, 512, 768,$ and $1024$. Each autoencoder was trained for 50~epochs with a learning rate of $1\cdot 10^{-5}$, and results were averaged across 5-fold cross-validation. The goal of this stage was to identify the ``elbow point'' in reconstruction loss curves, which serves as a practical indicator of a sufficiently expressive, yet compact, representation.
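Schematically, this first stage corresponds to a sweep of the following form, where \texttt{build\_autoencoder}, \texttt{train\_autoencoder}, \texttt{reconstruction\_loss}, and \texttt{num\_frames} are hypothetical placeholders standing in for our training code and dataset size.
\begin{verbatim}
from sklearn.model_selection import KFold

latent_sizes = [32, 64, 128, 256, 384, 512, 768, 1024]
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

mean_losses = {}
for d in latent_sizes:
    fold_losses = []
    for train_idx, val_idx in kfold.split(range(num_frames)):    # num_frames: dataset size
        ae = build_autoencoder(latent_dim=d)                     # hypothetical model factory
        train_autoencoder(ae, train_idx, epochs=50, lr=1e-5)     # hypothetical trainer
        fold_losses.append(reconstruction_loss(ae, val_idx))     # hypothetical evaluation
    mean_losses[d] = sum(fold_losses) / len(fold_losses)
# Plotting mean_losses over d and picking the "elbow" guides the choice of latent size.
\end{verbatim}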
Second, we trained the full DeepSAD models on the same latent space sizes in order to investigate how autoencoder performance transfers to anomaly detection performance. Specifically, we aimed to answer whether poor autoencoder reconstructions necessarily imply degraded DeepSAD results, or whether the two stages behave differently. To disentangle these effects, both network architectures (LeNet-inspired and Efficient) were trained with identical configurations, allowing for a direct architectural comparison.

Furthermore, we investigated the effect of semi-supervised labeling. DeepSAD can incorporate labeled samples during training, and we wanted to quantify how this labeling affects anomaly detection performance. To this end, each configuration was trained under three labeling regimes (a short sketch of how these labels are assigned follows the list):
\begin{itemize}
\item \textbf{Unsupervised:} $(0,0)$ labeled samples of (normal, anomalous) data.
\item \textbf{Low supervision:} $(50,10)$ labeled samples.
\item \textbf{High supervision:} $(500,100)$ labeled samples.
\end{itemize}
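The labeling itself can be illustrated with a small helper that draws the requested number of labeled normal and anomalous frames and assigns semi-supervised targets following the DeepSAD convention ($+1$ known normal, $-1$ known anomalous, $0$ unlabeled); the function name and seed handling are our own.
\begin{verbatim}
import numpy as np

def make_semi_targets(y_true, n_normal, n_anomalous, seed=0):
    """Semi-supervised targets in the DeepSAD convention:
    +1 = labeled normal, -1 = labeled anomalous, 0 = unlabeled.
    y_true holds ground-truth flags (0 = normal, 1 = anomalous)."""
    rng = np.random.default_rng(seed)
    semi = np.zeros(len(y_true), dtype=int)
    semi[rng.choice(np.flatnonzero(y_true == 0), size=n_normal, replace=False)] = 1
    semi[rng.choice(np.flatnonzero(y_true == 1), size=n_anomalous, replace=False)] = -1
    return semi

# e.g. the "low supervision" regime: 50 labeled normal and 10 labeled anomalous frames
# semi = make_semi_targets(y_true, n_normal=50, n_anomalous=10)
\end{verbatim}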
All models were pre-trained for 50~epochs and then trained for 150~epochs with the same learning rate of $1\cdot 10^{-5}$ and evaluated with 5-fold cross-validation.
Table~\ref{tab:exp_grid} summarizes the full experiment matrix.
\begin{table}[h]
\centering
\caption{Experiment grid of all DeepSAD trainings. Each latent space size was tested for both network architectures and three levels of semi-supervised labeling.}
\begin{tabular}{c|c|c}
\toprule
\textbf{Latent sizes} & \textbf{Architectures} & \textbf{Labeling regimes (normal, anomalous)} \\
\midrule
$32, 64, 128, 256, 512, 768, 1024$ & LeNet-inspired, Efficient & (0,0), (50,10), (500,100) \\
\bottomrule
\end{tabular}
\label{tab:exp_grid}
\end{table}
% Combines: Experiment Matrix + Hardware & Runtimes
% Goals: clearly enumerate each experiment configuration and give practical runtime details
%\newsubsubsectionNoTOC{Table of experiment variants (architectures, hyperparameters, data splits)}
\threadtodo
{give overview about hardware setup and how long things take to train}
{we know what we trained but not how long that takes}