This commit is contained in:
Jan Kowalczyk
2025-09-27 16:34:52 +02:00
parent cfb77dccab
commit c270783225
6 changed files with 609 additions and 118 deletions


@@ -238,7 +238,10 @@ For remote controlled robots a human operator can make these decisions but many
In this thesis we aim to answer this question by assessing a deep learning-based anomaly detection method and its performance when quantifying the sensor data's degradation. The employed method is a semi-supervised anomaly detection algorithm which uses manually labeled training data to improve its performance over unsupervised methods. We show how much the introduction of these labeled samples improves the method's performance. The model's output is an anomaly score which quantifies the data reliability and can be used by algorithms that rely on the sensor data. These reliant algorithms may decide, for example, to slow down the robot to collect more data, choose alternative routes, signal for help, or rely more heavily on other sensors' input data.
\todo[inline]{discuss results (we showed X)}
%We showed that anomaly detection methods are suitable for the task at hand, enabling quantification of lidar sensor data degradation on sub-terranean data similar to the environments SAR missions take place in. We found that the semi-supervised learning-based method DeepSAD performed better than well-known anomaly detection baselines but found a lack of suitable training data and especially labeled evaluation data to hold back the quality of assessing the methods' expected performance under real-life conditions.
Our experiments demonstrate that anomaly detection methods are indeed applicable to this task, allowing lidar data degradation to be quantified on subterranean datasets representative of SAR environments. Among the tested approaches, the semi-supervised method DeepSAD consistently outperformed established baselines such as Isolation Forest and OC-SVM. At the same time, the lack of suitable training data—and in particular the scarcity of reliable evaluation labels—proved to be a major limitation, constraining the extent to which the expected real-world performance of these methods could be assessed.
%\todo[inline, color=green!40]{autonomous robots have many sensors for understanding the world around them, especially visual sensors (lidar, radar, ToF, ultrasound, optical cameras, infrared cameras), they use that data for navigation mapping, SLAM algorithms, and decision making. these are often deep learning algorithms, oftentimes only trained on good data}
%\todo[inline, color=green!40]{difficult environments for sensors to produce good data quality (earthquakes, rescue robots), produced data may be unreliable, we don't know how trustworthy that data is (no quantification, confidence), since all navigation and decision making is based on input data, this makes the whole pipeline untrustworthy/problematic}
@@ -273,15 +276,21 @@ The method we employ produces an analog score that reflects the confidence in th
\newsection{thesis_structure}{Structure of the Thesis}
The remainder of this thesis is organized as follows.
Chapter~\ref{chp:background} introduces the theoretical background and related work, covering anomaly detection methods, semi-supervised learning algorithms, autoencoders, and the fundamentals of LiDAR sensing.
Chapter~\ref{chp:deepsad} presents the DeepSAD algorithm in detail, including its optimization objective, network architecture, and hyperparameters.
In Chapter~\ref{chp:data_preprocessing}, we describe the dataset, the preprocessing pipeline, and our labeling strategies.
Chapter~\ref{chp:experimental_setup} outlines the experimental design, implementation details, and evaluation protocol.
The results are presented and discussed in Chapter~\ref{chp:results_discussion}, where we analyze the performance of DeepSAD compared to baseline methods.
Finally, Chapter~\ref{chp:conclusion_future_work} concludes the thesis by summarizing the main findings, highlighting limitations, and discussing open questions and directions for future work.
\threadtodo
{explain how structure will guide reader from zero knowledge to answer of research question}
{since reader knows what we want to show, an outlook over content is a nice transition}
{state structure of thesis and explain why specific background is necessary for next section}
{reader knows what to expect $\rightarrow$ necessary background info and related work}
\todo[inline]{brief overview of thesis structure}
\todo[inline, color=green!40]{in section x we discuss anomaly detection, semi-supervised learning since such an algorithm was used as the chosen method, we also discuss how lidar works and the data it produces. then in we discuss in detail the chosen method Deep SAD in section X, in section 4 we discuss the traing and evaluation data, in sec 5 we describe our setup for training and evaluation (whole pipeline). results are presented and discussed in section 6. section 7 contains a conclusion and discusses future work}
\newchapter{background}{Background and Related Work}
%\todo[inline, color=green!40]{in this section we will discuss necessary background knowledge for our chosen method and the sensor data we work with. related work exists mostly from autonomous driving which does not include subter data and mostly looks at precipitation as source of degradation, we modeled after one such paper and try to adapt the same method for the domain of rescue robots, this method is a semi-supervised deep learning approach to anomaly detection which we describe in more detail in sections 2.1 and 2.2. in the last subsection 2.3 we discuss lidar sensors and the data they produce}
@@ -409,7 +418,7 @@ In unsupervised learning, models work directly with raw data, without any ground
In reinforcement learning, the model—often called an agent—learns by interacting with an environment that provides feedback in the form of rewards or penalties. At each step, the agent observes the environment's state, selects an action, and an interpreter judges the action's outcome based on how the environment changed, providing a scalar reward or penalty that reflects the desirability of that outcome. The agent's objective is to adjust its decision-making strategy to maximize the cumulative reward over time, balancing exploration of new actions with exploitation of known high-reward behaviors. This trial-and-error approach is well suited to sequential decision problems in complex settings, such as autonomous navigation or robotic control, where each choice affects both the immediate state and future possibilities.
\todo[inline, color=green!40]{illustration reinforcement learning}
%\todo[inline, color=green!40]{illustration reinforcement learning}
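The following minimal sketch illustrates this interaction loop with a toy one-dimensional environment and a purely random agent; both are illustrative stand-ins and not components used in this thesis.
\begin{verbatim}
# Minimal sketch of the reinforcement-learning interaction loop described above.
# The environment and the (random) agent are toy stand-ins, not part of this work.
import random

class ToyEnvironment:
    """1-D corridor: the agent should move right to reach the goal state."""
    def __init__(self, length=5):
        self.length, self.state = length, 0
    def step(self, action):  # action: -1 (left) or +1 (right)
        self.state = max(0, min(self.length, self.state + action))
        reward = 1.0 if self.state == self.length else -0.01
        done = self.state == self.length
        return self.state, reward, done

env = ToyEnvironment()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, 1])     # pure exploration; a real agent learns a policy
    state, reward, done = env.step(action)
    total_reward += reward              # the agent's objective: maximize this over time
print(total_reward)
\end{verbatim}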
Semi-supervised learning algorithms form an intermediate category between supervised and unsupervised algorithms, in that they use a mixture of labeled and unlabeled data. Typically, vastly more unlabeled than labeled data is used when training such algorithms, due to the effort and expertise required to label large quantities of data correctly. Semi-supervised methods are often an effort to improve a machine learning algorithm belonging to either the supervised or the unsupervised category. Supervised methods such as classification tasks are enhanced by using large amounts of unlabeled data to augment the supervised training without additional labeling work. Alternatively, unsupervised methods like clustering algorithms may not only use unlabeled data but improve their performance by considering some hand-labeled data during training.
%Semi-Supervised learning algorithms are an inbetween category of supervised and unsupervised algorithms, in that they use a mixture of labeled and unlabeled data. Typically vastly more unlabeled data is used during training of such algorithms than labeled data, due to the effort and expertise required to label large quantities of data correctly. The type of task performed by semi-supervised methods can originate from either supervised learningor unsupervised learning domain. For classification tasks which are oftentimes achieved using supervised learning the additional unsupervised data is added during training with the hope to achieve a better outcome than when training only with the supervised portion of the data. In contrast for unsupervised learning use cases such as clustering algorithms, the addition of labeled samples can help guide the learning algorithm to improve performance over fully unsupervised training.
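The sketch below illustrates this idea with a generic objective that combines an unsupervised term over all samples with a supervised term over the few labeled ones; the model, both loss terms, and the weighting factor are placeholders and do not correspond to the method used in this thesis.
\begin{verbatim}
# Hedged sketch of a generic semi-supervised objective: an unsupervised term is
# applied to every sample, a supervised term only to the few labeled ones.
# Model, loss terms, and the weight eta are illustrative placeholders.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 1)             # stand-in for any feature extractor
x = torch.randn(64, 16)                    # mostly unlabeled batch
y = torch.zeros(64)                        # 0 = unlabeled
y[:4], y[4:6] = 1.0, -1.0                  # a handful of labeled normal / anomalous samples

z = model(x).squeeze(-1)
loss_unsup = (z ** 2).mean()               # placeholder unsupervised term on all samples
labeled = y != 0
loss_sup = F.binary_cross_entropy_with_logits(   # placeholder supervised term
    z[labeled], (y[labeled] > 0).float())
eta = 1.0                                  # weight of the supervised part
loss = loss_unsup + eta * loss_sup
loss.backward()
\end{verbatim}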
@@ -430,7 +439,7 @@ Autoencoders are a type of neural network architecture, whose main goal is learn
\fig{autoencoder_general}{figures/autoencoder_principle_placeholder.png}{PLACEHOLDER - An illustration of autoencoders' general architecture and reconstruction task.}
%\todo[inline, color=green!40]{explain figure}
\todo[inline, color=green!40]{Paragraph about Variational Autoencoders? generative models vs discriminative models, enables other common use cases such as generating new data by changing parameterized generative distribution in latent space - VAES are not really relevant, maybe leave them out and just mention them shortly, with the hint that they are important but too much to explain since they are not key knowledge for this thesis}
%\todo[inline, color=green!40]{Paragraph about Variational Autoencoders? generative models vs discriminative models, enables other common use cases such as generating new data by changing parameterized generative distribution in latent space - VAES are not really relevant, maybe leave them out and just mention them shortly, with the hint that they are important but too much to explain since they are not key knowledge for this thesis}
One key use case of autoencoders is to employ them as a dimensionality reduction technique. In that case, the latent space between the encoder and decoder is of a lower dimensionality than the input data itself. Due to the aforementioned reconstruction goal, the shared information between the input data and its latent space representation is maximized, which is known as following the infomax principle. After training such an autoencoder, it may be used to generate lower-dimensional representations of the given data type, enabling more performant computations which may have been infeasible on the original data. DeepSAD uses an autoencoder in a pre-training step to achieve this goal among others.
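The following sketch shows the basic structure of such a bottleneck autoencoder in PyTorch; the layer sizes are illustrative and do not correspond to the networks evaluated in this thesis.
\begin{verbatim}
# Minimal autoencoder sketch illustrating the bottleneck idea described above.
# Layer and input sizes are illustrative only.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),        # low-dimensional bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )
    def forward(self, x):
        z = self.encoder(x)                    # compressed latent representation
        return self.decoder(z), z

model = Autoencoder()
x = torch.rand(8, 784)                         # batch of flattened inputs
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)        # reconstruction objective
loss.backward()
\end{verbatim}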
@@ -713,7 +722,6 @@ Based on the previously discussed requirements and the challenges of obtaining r
%-------------------------------------------------
% Compact sensor overview (row numbers follow Fig.~\ref{fig:subter_platform})
%-------------------------------------------------
\todo[inline]{todo: check table for accuracy/errors}
\begin{table}[htbp]
\centering
\caption{Onboard sensors recorded in the \citetitle{subter} dataset. Numbers match the labels in Fig.~\ref{fig:subter_platform}; only the most salient details are shown for quick reference.}
@@ -868,7 +876,7 @@ To create this mapping, we leveraged the available measurement indices and chann
Figure~\ref{fig:data_projections} displays two examples of LiDAR point cloud projections to aid in the reader's understanding. Although the original point clouds were converted into grayscale images with a resolution of 2048×32 pixels, these raw images can be challenging to interpret. To enhance human readability, we applied the viridis colormap and vertically stretched the images so that each measurement occupies multiple pixels in height. The top projection is derived from a scan without artificial smoke—and therefore minimal degradation—while the lower projection comes from an experiment where artificial smoke introduced significant degradation.
\todo[inline, color=green!40]{add same projections as they are used in training? grayscale without vertical scaling}
%\todo[inline, color=green!40]{add same projections as they are used in training? grayscale without vertical scaling}
\fig{data_projections}{figures/data_2d_projections.png}{Two-dimensional projections of two point clouds, one from an experiment without degradation and one from an experiment with artificial smoke as degradation. To aid the reader's perception, the images are vertically stretched and a colormap has been applied to the pixels' reciprocal range values, while the actual training data is grayscale.}
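A minimal sketch of this projection step is given below: each return is placed into a 2048×32 grid by its measurement index and channel, and the pixel stores the reciprocal of the measured range. Variable names and the toy data are illustrative; the actual preprocessing code may differ in detail.
\begin{verbatim}
# Hedged sketch of the 2-D projection: returns are binned by measurement index
# (azimuth) and channel (elevation ring); pixels store the reciprocal range.
import numpy as np

def project_to_range_image(meas_idx, channel, rng, width=2048, height=32):
    """meas_idx: azimuth index per return, channel: ring index, rng: range in meters."""
    img = np.zeros((height, width), dtype=np.float32)        # empty pixels stay 0 (no return)
    valid = rng > 0
    img[channel[valid], meas_idx[valid]] = 1.0 / rng[valid]  # reciprocal range as intensity
    return img

# toy usage with random returns
n = 10_000
meas_idx = np.random.randint(0, 2048, n)
channel = np.random.randint(0, 32, n)
rng = np.random.uniform(0.5, 50.0, n)
image = project_to_range_image(meas_idx, channel, rng)
\end{verbatim}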
@@ -1239,53 +1247,44 @@ Our experimental setup consisted of two stages. First, we conducted a hyperparam
Second, we trained the full DeepSAD models on the same latent space sizes in order to investigate how autoencoder performance transfers to anomaly detection performance. Specifically, we aimed to answer whether poor autoencoder reconstructions necessarily imply degraded DeepSAD results, or whether the two stages behave differently. To disentangle these effects, both network architectures (LeNet-inspired and Efficient) were trained with identical configurations, allowing for a direct architectural comparison.
Furthermore, we investigated the effect of semi-supervised labeling. DeepSAD can incorporate labeled data during training, and we wanted to investigate the impact of labeling on anomaly detection performance. To this end, each configuration was trained under three labeling regimes:
% Furthermore, we investigated the effect of semi-supervised labeling. DeepSAD can incorporate labeled data during training, and we wanted to investigate the impact of labeling on anomaly detection performance. To this end, each configuration was trained under three labeling regimes:
% \begin{itemize}
% \item \textbf{Unsupervised:} $(0,0)$ labeled samples of (normal, anomalous) data.
% \item \textbf{Low supervision:} $(50,10)$ labeled samples.
% \item \textbf{High supervision:} $(500,100)$ labeled samples.
% \end{itemize}
Furthermore, we examined the effect of semi-supervised labeling on DeepSAD's performance. As summarized in Table~\ref{tab:labeling_regimes}, three labeling regimes were tested, ranging from fully unsupervised training to progressively larger amounts of supervision:
\begin{itemize}
\item \textbf{Unsupervised:} $(0,0)$ labeled samples of (normal, anomalous) data.
\item \textbf{Low supervision:} $(50,10)$ labeled samples.
\item \textbf{High supervision:} $(500,100)$ labeled samples.
\end{itemize}
The percentages reported in Table~\ref{tab:labeling_regimes} are relative to the training folds after 5-fold cross-validation. Here, the classes “normal,” “anomalous,” and “unknown” follow the same definition as in the experiment-based labeling scheme. In particular, the “unknown” category arises because for semi-supervised anomaly labels we only used the manually selected, unambiguous degradation intervals from smoke experiments. Frames outside of these intervals were treated as “unknown” rather than anomalous, so as to prevent mislabeled data from being used during training. This design choice ensured that the inclusion of labeled samples could not inadvertently reduce performance by introducing additional label noise.
\begin{table}[h]
\centering
\caption{Proportion of labeled samples in the training folds for each labeling regime.
Percentages are computed relative to the available training data after 5-fold splitting
(80\% of the dataset per fold). Unknown samples were never labeled.}
\renewcommand{\arraystretch}{1.15}
\begin{tabularx}{\linewidth}{lYYYY}
\toprule
\textbf{Regime} & \textbf{\% normals labeled} & \textbf{\% anomalies labeled} & \textbf{\% overall labeled} & \textbf{\% overall unlabeled} \\
\midrule
(0,0) & 0.00\% & 0.00\% & 0.00\% & 100.00\% \\
(50,10) & 0.40\% & 1.46\% & 0.42\% & 99.58\% \\
(500,100) & 3.96\% & 14.57\% & 4.25\% & 95.75\% \\
\bottomrule
\end{tabularx}
\label{tab:labeling_regimes}
\end{table}
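For illustration, the sketch below shows how such a labeling regime could be turned into per-frame training labels: a fixed number of unambiguous frames receives a label, while all remaining frames stay unlabeled. The frame counts and the selection logic are illustrative; the $+1/-1/0$ encoding follows the usual DeepSAD convention for normal, anomalous, and unlabeled samples.
\begin{verbatim}
# Hedged sketch: turn a labeling regime (n_normal, n_anomalous) into per-frame labels.
# Frame counts below are toy numbers, not the actual dataset statistics.
import numpy as np

def make_semi_labels(normal_idx, anomalous_idx, n_frames, n_normal, n_anomalous, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.zeros(n_frames, dtype=np.int8)                        # 0 = unlabeled / unknown
    labels[rng.choice(normal_idx, size=n_normal, replace=False)] = 1  # labeled normal frames
    labels[rng.choice(anomalous_idx, size=n_anomalous, replace=False)] = -1  # labeled anomalies
    return labels

# toy usage for the (50, 10) regime
normal_idx = np.arange(0, 12_000)          # frames from smoke-free experiments
anomalous_idx = np.arange(12_000, 12_700)  # manually selected degraded intervals
labels = make_semi_labels(normal_idx, anomalous_idx, 12_700, n_normal=50, n_anomalous=10)
\end{verbatim}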
All models were pre-trained for 50~epochs and then trained for 150~epochs with the same learning rate of $1\cdot 10^{-5}$ and evaluated with 5-fold cross-validation.
Table~\ref{tab:exp_grid} summarizes the full experiment matrix.
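The sketch below outlines this grid in Python: every combination of latent size, encoder architecture, and labeling regime is trained and scored with 5-fold cross-validation. The training and evaluation functions and the data array are placeholder stubs, not the actual pipeline.
\begin{verbatim}
# Hedged sketch of the experiment grid; train_deepsad and evaluate are stubs.
from itertools import product
import numpy as np
from sklearn.model_selection import KFold

def train_deepsad(x, latent_dim, arch, n_labeled, pretrain_epochs=50, train_epochs=150, lr=1e-5):
    return {"latent_dim": latent_dim, "arch": arch}      # stub: real code trains the network

def evaluate(model, x):
    return np.random.rand(len(x))                        # stub: real code returns anomaly scores

data = np.random.rand(200, 16)                           # stand-in for the projected lidar frames
grid = product([32, 64, 128, 256, 512, 768, 1024],       # latent sizes
               ["lenet", "efficient"],                    # encoder architectures
               [(0, 0), (50, 10), (500, 100)])            # labeling regimes (normal, anomalous)

for latent_dim, arch, regime in grid:
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(data):
        model = train_deepsad(data[train_idx], latent_dim, arch, regime)
        scores = evaluate(model, data[test_idx])          # scores feed the AP / PRC evaluation
\end{verbatim}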
% \begin{table}[h]
% \centering
% \caption{Experiment grid of all DeepSAD trainings. Each latent space size was tested for both network architectures and three levels of semi-supervised labeling.}
% \begin{tabular}{c|c|c}
% \toprule
% \textbf{Latent sizes} & \textbf{Architectures} & \textbf{Labeling regimes (normal, anomalous)} \\
% \midrule
% $32, 64, 128, 256, 512, 768, 1024$ & LeNet-inspired, Efficient & (0,0), (50,10), (500,100) \\
% \bottomrule
% \end{tabular}
% \label{tab:exp_grid}
% \end{table}
% \begin{table}[h]
% \centering
% \caption{Experiment grid of all DeepSAD trainings. Each latent space size was tested for both network architectures and three levels of semi-supervised labeling.}
% \renewcommand{\arraystretch}{1.2}
% \begin{tabularx}{\textwidth}{cXX}
% \hline
% \textbf{Latent sizes} & \textbf{Architectures} & \textbf{Labeling regimes (normal, anomalous)} \\
% \hline
% \begin{tabular}{@{}c@{}}
% 32 \\ 64 \\ 128 \\ 256 \\ 512 \\ 768 \\ 1024
% \end{tabular}
% &
% \begin{tabular}{@{}c@{}}
% LeNet-inspired \\ Efficient
% \end{tabular}
% &
% \begin{tabular}{@{}c@{}}
% (0,0) \\ (50,10) \\ (500,100)
% \end{tabular} \\
% \hline
% \end{tabularx}
% \label{tab:exp_grid}
% \end{table}
\begin{table}[h]
\centering
\caption{Parameter space for the DeepSAD grid search. Each latent size is tested for both architectures and all labeling regimes.}
@@ -1615,64 +1614,29 @@ The contrast between the two evaluation schemes indicates, on the one hand, that
\end{tabularx}
\end{table}
% \begin{table}[t]
% \centering
% \caption{AP means $\pm$ std across 5 folds for experiment-based evaluation only, grouped by labeling regime.}
% \label{tab:results_ap_with_std}
% \begin{tabularx}{\textwidth}{c*{4}{Y}}
% \toprule
% Latent Dim. & \rotheader{DeepSAD \\(LeNet)} & \rotheader{DeepSAD\\(Efficient)} & \rotheader{IsoForest} & \rotheader{OC-SVM} \\
% \midrule
% \multicolumn{5}{l}{\textbf{Labeling regime: }\(\mathbf{0/0}\)} \\
% \addlinespace[2pt]
% 32 & 0.664$\,\pm\,0.029$ & 0.650$\,\pm\,0.017$ & 0.217$\,\pm\,0.010$ & 0.315$\,\pm\,0.050$ \\
% 64 & 0.635$\,\pm\,0.018$ & 0.643$\,\pm\,0.016$ & 0.215$\,\pm\,0.008$ & 0.371$\,\pm\,0.076$ \\
% 128 & 0.642$\,\pm\,0.022$ & 0.642$\,\pm\,0.017$ & 0.218$\,\pm\,0.010$ & 0.486$\,\pm\,0.088$ \\
% 256 & 0.615$\,\pm\,0.022$ & 0.631$\,\pm\,0.015$ & 0.214$\,\pm\,0.010$ & 0.452$\,\pm\,0.064$ \\
% 512 & 0.613$\,\pm\,0.015$ & 0.635$\,\pm\,0.016$ & 0.216$\,\pm\,0.012$ & 0.397$\,\pm\,0.053$ \\
% 768 & 0.609$\,\pm\,0.036$ & 0.617$\,\pm\,0.016$ & 0.219$\,\pm\,0.008$ & 0.439$\,\pm\,0.093$ \\
% 1024 & 0.607$\,\pm\,0.018$ & 0.612$\,\pm\,0.018$ & 0.215$\,\pm\,0.003$ & 0.394$\,\pm\,0.049$ \\
% \midrule
% \multicolumn{5}{l}{\textbf{Labeling regime: }\(\mathbf{50/10}\)} \\
% \addlinespace[2pt]
% 32 & 0.569$\,\pm\,0.061$ & 0.582$\,\pm\,0.008$ & 0.217$\,\pm\,0.010$ & 0.315$\,\pm\,0.050$ \\
% 64 & 0.590$\,\pm\,0.032$ & 0.592$\,\pm\,0.017$ & 0.215$\,\pm\,0.008$ & 0.371$\,\pm\,0.076$ \\
% 128 & 0.566$\,\pm\,0.078$ & 0.588$\,\pm\,0.011$ & 0.218$\,\pm\,0.010$ & 0.486$\,\pm\,0.088$ \\
% 256 & 0.598$\,\pm\,0.027$ & 0.587$\,\pm\,0.015$ & 0.214$\,\pm\,0.010$ & 0.452$\,\pm\,0.064$ \\
% 512 & 0.550$\,\pm\,0.157$ & 0.587$\,\pm\,0.020$ & 0.216$\,\pm\,0.012$ & 0.397$\,\pm\,0.053$ \\
% 768 & 0.596$\,\pm\,0.014$ & 0.577$\,\pm\,0.017$ & 0.219$\,\pm\,0.008$ & 0.439$\,\pm\,0.093$ \\
% 1024 & 0.601$\,\pm\,0.012$ & 0.568$\,\pm\,0.013$ & 0.215$\,\pm\,0.003$ & 0.394$\,\pm\,0.049$ \\
% \midrule
% \multicolumn{5}{l}{\textbf{Labeling regime: }\(\mathbf{500/100}\)} \\
% \addlinespace[2pt]
% 32 & 0.625$\,\pm\,0.015$ & 0.621$\,\pm\,0.009$ & 0.217$\,\pm\,0.010$ & 0.315$\,\pm\,0.050$ \\
% 64 & 0.611$\,\pm\,0.014$ & 0.621$\,\pm\,0.018$ & 0.215$\,\pm\,0.008$ & 0.371$\,\pm\,0.076$ \\
% 128 & 0.607$\,\pm\,0.014$ & 0.615$\,\pm\,0.015$ & 0.218$\,\pm\,0.010$ & 0.486$\,\pm\,0.088$ \\
% 256 & 0.604$\,\pm\,0.022$ & 0.612$\,\pm\,0.017$ & 0.214$\,\pm\,0.010$ & 0.452$\,\pm\,0.064$ \\
% 512 & 0.578$\,\pm\,0.109$ & 0.608$\,\pm\,0.018$ & 0.216$\,\pm\,0.012$ & 0.397$\,\pm\,0.053$ \\
% 768 & 0.597$\,\pm\,0.014$ & 0.598$\,\pm\,0.017$ & 0.219$\,\pm\,0.008$ & 0.439$\,\pm\,0.093$ \\
% 1024 & 0.601$\,\pm\,0.013$ & 0.591$\,\pm\,0.016$ & 0.215$\,\pm\,0.003$ & 0.394$\,\pm\,0.049$ \\
% \bottomrule
% \end{tabularx}
% \end{table}
Representative precision--recall curves illustrate how methods differ in their operating regimes (Figure~\ref{fig:prc_representative}). DeepSAD shows a stable high-precision region up to about 0.5 recall, followed by a sharp drop once it is forced to classify borderline cases. OC-SVM declines gradually without ever reaching a strong plateau, while Isolation Forest detects only a few extreme anomalies before collapsing to near-random performance. These qualitative differences are masked in single-number metrics but are critical for interpreting how the methods would behave in deployment.
\fig{prc_representative}{figures/results_prc.png}{Representative precision--recall curves over all latent dimensionalities for semi-labeling regime 0/0 from experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OC-SVM declines more smoothly but exhibits high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
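For reference, the sketch below shows how such precision--recall curves and AP values can be computed from raw anomaly scores with scikit-learn; the scores and targets are random stand-ins for one fold's outputs.
\begin{verbatim}
# Hedged sketch of the metric computation; inputs are random stand-ins for fold outputs.
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

targets = np.random.randint(0, 2, 1000)          # 1 = degraded frame, 0 = normal frame
scores = np.random.rand(1000)                     # anomaly scores from one evaluation fold

precision, recall, thresholds = precision_recall_curve(targets, scores)
ap = average_precision_score(targets, scores)     # single-number summary of the PR curve
\end{verbatim}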
%\newsection{results_latent}{Effect of latent space dimensionality}
%Figure~\ref{fig:latent_dim_ap} plots AP versus latent dimension under the experiment-based evaluation. DeepSAD benefits from compact latent spaces (e.g., 32128), with diminishing or negative returns at larger codes. We argue that the most likely reason for the declining performance with increasing latent space size is due to the network learning
Figure~\ref{fig:latent_dim_ap} plots AP versus latent dimension under the experiment-based evaluation. DeepSAD benefits most from compact latent spaces (e.g., 32--128), with diminishing or even negative returns at larger code sizes. We argue that two interacting effects likely explain this trend. First, higher-dimensional latent spaces increase model capacity and reduce the implicit regularization provided by smaller bottlenecks, leading to overfitting. Second, as illustrated by the representative PRC curves in Figure~\ref{fig:prc_representative}, DeepSAD exhibits a steep decline in precision once recall exceeds roughly 0.5. We attribute this effect primarily to mislabeled or ambiguous samples in the experiment-based evaluation: once the model is forced to classify these borderline cases, precision inevitably drops. Importantly, while such a drop is visible across all latent dimensions, its sharpness increases with latent size. At small dimensions (e.g., 32), the decline is noticeable but somewhat gradual, whereas at 1024 it becomes nearly vertical. This suggests that larger latent spaces exacerbate the difficulty of distinguishing borderline anomalies from normal data, leading to more abrupt collapses in precision once the high-confidence region is exhausted.
\fig{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows inverse correlation between AP and latent space size.}
%\newsection{results_semi}{Effect of semi-supervised labeling regime}
Referring back to the results, Table~\ref{tab:results_ap} compares AP across labeling regimes (0/0, 50/10, 500/100). Surprisingly, the unsupervised regime (0/0) often performs best; adding labels does not consistently help, likely due to label noise and the scarcity and ambiguity of anomalous labels. The baselines (which do not use labels) are stable across regimes.
% Refering back to the results in table~\ref{tab:results_ap} compares AP across labeling regimes (0/0, 50/10, 500/100). Surprisingly, the unsupervised regime (0/0) often performs best; adding labels does not consistently help, likely due to label noise and the scarcity/ambiguity of anomalous labels. Baselines (which do not use labels) are stable across regimes.
%
% \todo[inline]{rework this discussion of semi-supervised labeling and how it affected our results}
\paragraph{Effect of semi-supervised labeling.}
As shown in Table~\ref{tab:results_ap}, the \emph{unsupervised} models reach the best AP values, while the lightly labeled regime \((50,10)\) performs the worst. With many labels \((500,100)\), performance improves again but usually stays a little below the unsupervised case. This behavior also appears in the \emph{hand-labeled} evaluation, where only clearly degraded frames are used. Therefore, the drop with light labeling cannot be explained by mislabeled evaluation data.
\todo[inline]{rework this discussion of semi-supervised labeling and how it affected our results}
The PRC curves (Figure~\ref{fig:prc_over_semi}) help explain this effect. With only \((50,10)\) labels, \textbf{DeepSAD--LeNet} shows a slow and continuous loss of precision before the usual sharp decline, and the variation across folds is very large. In contrast, \textbf{DeepSAD--Efficient} keeps a flat precision region until the sudden drop, and its results are more stable across folds. This suggests that using very few labels makes training unstable: depending on which samples are selected, the model may fit them too strongly and generalize poorly. The Efficient encoder is more robust to this effect, while LeNet is more sensitive.
%\fig{prc_over_semi}{figures/results_prc_over_semi.png}{Precision--recall curves at latent dimension~32 for all three labeling regimes (unsupervised $(0,0)$, lightly supervised $(50,10)$, heavily supervised $(500,100)$), shown separately for the LeNet-inspired (left) and Efficient (right) encoders. Each subplot also includes the baseline methods (Isolation Forest, OC-SVM) for reference. The curves highlight how semi-supervised labels influence DeepSAD: in the lightly labeled regime, LeNet exhibits a gradual precision decay and high variance across folds, whereas Efficient retains a flat high-precision region until the usual sharp drop. With many labels, both architectures return to behavior close to the unsupervised case, although performance remains slightly lower. These plots illustrate that a small amount of supervision can destabilize training, while larger labeled sets reduce this effect without clearly surpassing the unsupervised baseline.}
\fig{prc_over_semi}{figures/results_prc_over_semi.png}{Precision--recall curves at latent dimension~32 for all three labeling regimes (unsupervised, lightly supervised, heavily supervised), shown separately for the LeNet-inspired (left) and Efficient (right) encoders. Baseline methods are included for comparison. Latent dimension~32 is shown as it achieved the best overall AP and is representative of the typical PRC shapes across dimensions.}
With more labels \((500,100)\), the results become more stable again and the PRC curves look very similar to the unsupervised case, only slightly worse. One exception is a single outlier at latent dimension 512 for LeNet, where the curve again looks like the lightly labeled case. This is likely due to unlucky label sampling in that fold, combined with higher latent capacity amplifying the problem. Overall, we conclude that (i) a very small amount of labeled data can hurt performance instead of helping, (ii) many labels reduce this problem but still do not improve over unsupervised training, and (iii) the choice of encoder architecture strongly affects how robust the model is to these effects.
% --- Section: Autoencoder Pretraining Results ---
\newsection{results_inference}{Inference on Held-Out Experiments}
@@ -1734,7 +1698,7 @@ Based on the experiments presented in Chapter~\ref{sec:results_deepsad} and Chap
The main contributions of this thesis can be summarized as follows:
\begin{itemize}
\item \textbf{Empirical evaluation:} A systematic comparison of DeepSAD against Isolation Forest and OC-SVM for lidar degradation detection, demonstrating that DeepSAD consistently outperforms simpler baselines.
\item \textbf{Analysis of latent dimensionality:} An investigation of how representation size influences performance and stability under noisy labels, revealing that smaller latent spaces are more robust in this setting.
\item \textbf{Analysis of latent dimensionality:} An investigation of how representation size influences performance and stability under noisy labels, revealing that smaller latent spaces are more robust in this setting and that in spite of the high input dimensionality of whole point clouds produced by spinning lidar sensors, bottlenecks as small as 32 dimensions can achieve promising performance.
\item \textbf{Analysis of semi-supervised training labels:} An evaluation of different semi-supervised labeling regimes, showing that in our case purely unsupervised training yielded the best performance. Adding a small number of labels reduced performance, while a higher ratio of labels led to partial recovery. This pattern may indicate overfitting effects, although interpretation is complicated by the presence of mislabeled evaluation targets.
\item \textbf{Analysis of encoder architecture:} A comparison between a LeNet-inspired and an Efficient encoder showed that the choice of architecture has a decisive influence on DeepSAD's performance. The Efficient encoder outperformed the LeNet-inspired baseline not only during autoencoder pretraining but also in anomaly detection. While the exact magnitude of this improvement is difficult to quantify due to noisy evaluation targets, the results underline the importance of encoder design for representation quality in DeepSAD.
\item \textbf{Feasibility study:} An exploration of runtime, temporal inference plots, and downstream applicability, indicating that anomaly scores correlate with degradation trends and could provide a foundation for future quantification methods.