Compare commits

2 Commits

Author SHA1 Message Date
Jan Kowalczyk
6cd2c7fbef abstract lidar capitalization 2025-10-19 17:34:38 +02:00
Jan Kowalczyk
62c424cd54 grammarly done 2025-10-19 17:29:31 +02:00
3 changed files with 44 additions and 43 deletions

Binary file not shown.

@@ -91,7 +91,8 @@
}
\DeclareRobustCommand{\rev}[1]{\textcolor{red}{#1}}
%\DeclareRobustCommand{\rev}[1]{\textcolor{red}{#1}}
\DeclareRobustCommand{\rev}[1]{#1}
\DeclareRobustCommand{\mcah}[1]{}
% correct bad hyphenation
@@ -569,16 +570,16 @@ In the following sections, we detail our adaptations to this framework:
\item Experimental environment: the hardware and software stack used, with typical training and inference runtimes.
\end{itemize}
Together, these components define the full experimental pipeline, from data loading, preprocessing, method training to the evaluation and comparing of methods.
Together, these components define the full experimental pipeline, from data loading, preprocessing, method training, to the evaluation and comparison of methods.
\section{Framework \& Data Preparation}
DeepSAD's PyTorch implementation—our starting point—includes implementations for training on standardized datasets such as MNIST, CIFAR-10 and datasets from \citetitle{odds}~\cite{odds}. The framework can train and test DeepSAD as well as a number of baseline algorithms, namely SSAD, OCSVM, Isolation Forest, KDE and SemiDGM with the loaded data and evaluate their performance by calculating the Receiver Operating Characteristic (ROC) and its Area Under the Curve (AUC) for all given algorithms. We adapted this implementation which was originally developed for Python 3.7 to work with Python 3.12 and changed or added \rev{functionality. We allowed loading data from our of} chosen dataset, added DeepSAD models that work with the \rev{LiDAR} projections datatype, added more evaluation methods and an inference module.
DeepSAD's PyTorch implementation—our starting point—includes implementations for training on standardized datasets such as MNIST, CIFAR-10, and datasets from \citetitle{odds}~\cite{odds}. The framework can train and test DeepSAD as well as a number of baseline algorithms, namely SSAD, OCSVM, Isolation Forest, KDE, and SemiDGM, with the loaded data and evaluate their performance by calculating the Receiver Operating Characteristic (ROC) and its Area Under the Curve (AUC) for all given algorithms. We adapted this implementation, which was originally developed for Python 3.7, to work with Python 3.12 and changed or added \rev{functionality. We allowed loading data from our} chosen dataset, added DeepSAD models that work with the \rev{LiDAR} projections datatype, added more evaluation methods, and an inference module.
The raw SubTER dataset is provided as one ROS bag file per experiment, each containing a dense 3D point cloud from the Ouster OS1-32 \rev{LiDAR}. To streamline training and avoid repeated heavy computation, we project these point clouds offline into 2D “range images” as described in \rev{Section}~\ref{sec:preprocessing} and export them to files as NumPy arrays. Storing precomputed projections allows rapid data loading during training and evaluation. Many modern \rev{LiDARs} can be configured to output range images directly which would bypass the need for post-hoc projection. When available, such native range-image streams can further simplify preprocessing or even allow skipping this step completely.
The raw SubTER dataset is provided as one ROS bag file per experiment, each containing a dense 3D point cloud from the Ouster OS1-32 \rev{LiDAR}. To streamline training and avoid repeated heavy computation, we project these point clouds offline into 2D “range images” as described in \rev{Section}~\ref{sec:preprocessing} and export them to files as NumPy arrays. Storing precomputed projections allows rapid data loading during training and evaluation. Many modern \rev{LiDARs} can be configured to output range images directly, which would bypass the need for post-hoc projection. When available, such native range-image streams can further simplify preprocessing or even allow skipping this step completely.
We extended the DeepSAD framework's PyTorch \texttt{DataLoader} by implementing a custom \texttt{Dataset} class that ingests our precomputed NumPy range-image files and attaches appropriate evaluation labels. Each experiment's frames are stored as individual \texttt{.npy} files with the NumPy array shape \((\text{Number of Frames}, H, W)\), containing the point clouds' reciprocal range values. Our \texttt{Dataset} initializer scans a directory of these files, loads the NumPy arrays from file into memory, transforms them into PyTorch tensors and assigns evaluation and training labels accordingly.
We extended the DeepSAD framework's PyTorch \texttt{DataLoader} by implementing a custom \texttt{Dataset} class that ingests our precomputed NumPy range-image files and attaches appropriate evaluation labels. Each experiment's frames are stored as individual \texttt{.npy} files with the NumPy array shape \((\text{Number of Frames}, H, W)\), containing the point clouds' reciprocal range values. Our \texttt{Dataset} initializer scans a directory of these files, loads the NumPy arrays from file into memory, transforms them into PyTorch tensors, and assigns evaluation and training labels accordingly.
The first labeling scheme, called \emph{experiment-based labels}, assigns
\[
@@ -590,7 +591,7 @@ The first labeling scheme, called \emph{experiment-based labels}, assigns
\]
At load time, any file with “smoke” in its name is treated as anomalous (label \(-1\)), and all others (normal experiments) are labeled \(+1\).
To obtain a second source of ground truth, we also support \emph{manually-defined labels}. A companion JSON file specifies a start and end frame index for each of the four smoke experiments—defining the interval of unequivocal degradation. During loading the second label $y_{man}$ is assigned as follows:
To obtain a second source of ground truth, we also support \emph{manually-defined labels}. A companion JSON file specifies a start and end frame index for each of the four smoke experiments—defining the interval of unequivocal degradation. During loading, the second label $y_{man}$ is assigned as follows:
\[
y_{\mathrm{man}} =
@@ -601,19 +602,19 @@ To obtain a second source of ground truth, we also support \emph{manually-define
\end{cases}
\]
We pass instances of this \texttt{Dataset} to PyTorch's \texttt{DataLoader}, enabling batch sampling, shuffling, and multi-worker loading. The dataloader returns the preprocessed \rev{LiDAR} projection, both evaluation labels and a semi-supervised training label.
We pass instances of this \texttt{Dataset} to PyTorch's \texttt{DataLoader}, enabling batch sampling, shuffling, and multi-worker loading. The dataloader returns the preprocessed \rev{LiDAR} projection, both evaluation labels, and a semi-supervised training label.
To control the supervision of DeepSAD's training, our custom PyTorch \texttt{Dataset} accepts two integer parameters, \texttt{num\_labelled\_normal} and \texttt{num\_labelled\_anomalous}, which specify how many samples of each class should retain their labels during training. We begin with the manually-defined evaluation labels, to not use mislabeled anomalous frames for the semi-supervision. Then, we randomly un-label (set to 0) enough samples of each class until exactly \texttt{num\_labelled\_normal} normals and \texttt{num\_labelled\_anomalous} anomalies remain labeled.
To control the supervision of DeepSAD's training, our custom PyTorch \texttt{Dataset} accepts two integer parameters, \texttt{num\_labelled\_normal} and \texttt{num\_labelled\_anomalous}, which specify how many samples of each class should retain their labels during training. We begin with the manually-defined evaluation labels, so as not to use mislabeled anomalous frames for the semi-supervision. Then, we randomly un-label (set to 0) enough samples of each class until exactly \texttt{num\_labelled\_normal} normals and \texttt{num\_labelled\_anomalous} anomalies remain labeled.
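The un-labeling step described above can be illustrated with a minimal sketch. It assumes labels are encoded as $+1$ (normal), $-1$ (anomalous), and $0$ (unlabeled); the function name \texttt{subsample\_labels} and the \texttt{seed} argument are hypothetical and are not taken from the actual implementation.

```python
import random


def subsample_labels(labels, num_labelled_normal, num_labelled_anomalous, seed=0):
    """Keep exactly the requested number of +1 / -1 labels; set the rest to 0.

    Hypothetical helper for illustration only, not the thesis code.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    semi = list(labels)        # copy, so the evaluation labels stay untouched
    for cls, keep in ((+1, num_labelled_normal), (-1, num_labelled_anomalous)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        if keep > len(idx):
            raise ValueError(f"requested {keep} labels of class {cls}, only {len(idx)} available")
        # Randomly un-label the surplus samples of this class.
        for i in rng.sample(idx, len(idx) - keep):
            semi[i] = 0
    return semi


if __name__ == "__main__":
    manual_labels = [1] * 10 + [-1] * 5 + [0] * 3
    semi = subsample_labels(manual_labels, num_labelled_normal=4, num_labelled_anomalous=2)
    print(semi.count(1), semi.count(-1), semi.count(0))
```

Samples that are already unlabeled pass through unchanged, so only surplus labeled samples are affected.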
To obtain robust performance estimates on our relatively small dataset, we implement $k$-fold cross-validation. A single integer parameter, \texttt{num\_folds}, controls the number of splits. We use scikit-learn's \texttt{KFold} (from \texttt{sklearn.model\_selection}) with \texttt{shuffle=True} and a fixed random seed to partition each experiment's frames into \texttt{num\_folds} disjoint folds. Training then proceeds across $k$ rounds, each time training on $(k-1)/k$ of the data and evaluating on the remaining $1/k$. In our experiments, we set \texttt{num\_folds=5}, yielding an 80/20 train/evaluation split per fold.
For inference (i.e.\ model validation on held-out experiments), we provide a second \texttt{Dataset} class that loads a single experiment's NumPy file (no k-fold splitting), does not assign any labels to the frames nor does it shuffle frames, preserving temporal order. This setup enables seamless, frame-by-frame scoring of complete runs—crucial for analyzing degradation dynamics over an entire experiment.
For inference (i.e., model validation on held-out experiments), we provide a second \texttt{Dataset} class that loads a single experiment's NumPy file (no k-fold splitting), does not assign any labels to the frames, nor does it shuffle frames, preserving temporal order. This setup enables seamless, frame-by-frame scoring of complete runs—crucial for analyzing degradation dynamics over an entire experiment.
\section{Model Configuration}
Since the neural network architecture trained in the \rev{DeepSAD} method is not fixed as described in \rev{Section}~\ref{sec:algorithm_details} but rather chosen based on the input data, we also had to choose an autoencoder architecture befitting our preprocessed \rev{LiDAR} data projections. Since \rev{\cite{degradation_quantification_rain}} reported success in training DeepSAD on similar data we firstly adapted the network architecture utilized by them for our use case, which is based on the simple and well understood LeNet architecture~\cite{lenet}. Additionally we were interested in evaluating the importance and impact of a well-suited network architecture for DeepSAD's performance and therefore designed a second network architecture henceforth \rev{refered} to as "efficient architecture" to incorporate a few modern techniques, befitting our use case.
Since the neural network architecture trained in the \rev{DeepSAD} method is not fixed as described in \rev{Section}~\ref{sec:algorithm_details} but rather chosen based on the input data, we also had to choose an autoencoder architecture befitting our preprocessed \rev{LiDAR} data projections. Since \rev{\cite{degradation_quantification_rain}} reported success in training DeepSAD on similar data, we first adapted the network architecture utilized by them for our use case, which is based on the simple and well-understood LeNet architecture~\cite{lenet}. Additionally, we were interested in evaluating the importance and impact of a well-suited network architecture for DeepSAD's performance and therefore designed a second network architecture, henceforth \rev{referred} to as "efficient architecture", to incorporate a few modern techniques, befitting our use case.
The LeNet-inspired autoencoder can be split into an encoder network (\rev{Figure}~\ref{fig:setup_arch_lenet_encoder}) and a decoder network (\rev{Figure}~\ref{fig:setup_arch_lenet_decoder}) with a latent space \rev{in between} the two parts. Such an arrangement is typical for autoencoder architectures as we discussed in \rev{Section}~\ref{sec:autoencoder}. The encoder network is simultaneously DeepSAD's main training architecture which is used to infer the degradation quantification in our use case, once trained.
The LeNet-inspired autoencoder can be split into an encoder network (\rev{Figure}~\ref{fig:setup_arch_lenet_encoder}) and a decoder network (\rev{Figure}~\ref{fig:setup_arch_lenet_decoder}) with a latent space \rev{in between} the two parts. Such an arrangement is typical for autoencoder architectures, as we discussed in \rev{Section}~\ref{sec:autoencoder}. The encoder network is simultaneously DeepSAD's main training architecture, which is used to infer the degradation quantification in our use case, once trained.
\figc{setup_arch_lenet_encoder}{diagrams/arch_lenet_encoder}{
Architecture of the LeNet-inspired encoder. The input is a \rev{LiDAR} range image of size
@@ -637,7 +638,7 @@ The LeNet-inspired encoder network (see \rev{Figure}~\ref{fig:setup_arch_lenet_e
$4\times 512\times 8$ (channels $\times$ width $\times$ height).
The first upsampling stage applies interpolation with scale factor 2, followed by a
transpose convolution with 8 output channels, batch normalization, and LeakyReLU activation,
yielding $8\times 1024\times 16$. The second stage again upsamples by factor 2 and applies
yielding $8\times 1024\times 16$. The second stage again upsamples by a factor of 2 and applies
a transpose convolution, reducing the channels to 1. This produces the reconstructed output
of size $1\times 2048\times 32$, which matches the original input dimensionality required
for the autoencoding objective.
@@ -645,14 +646,14 @@ The LeNet-inspired encoder network (see \rev{Figure}~\ref{fig:setup_arch_lenet_e
The decoder network (see \rev{Figure}~\ref{fig:setup_arch_lenet_decoder}) mirrors the encoder and reconstructs the input from its latent representation. A dense layer first expands the latent vector into a feature map of shape $4\times 512\times 8$, which is then upsampled and refined in two successive stages. Each stage consists of an interpolation step that doubles the spatial resolution, followed by a transpose convolution that learns how to add structural detail. The first stage operates on 4 channels, and the second on 8 channels, with the final transpose convolution reducing the output to a single channel. The result is a reconstructed output of size $1\times 2048\times 32$, matching the original input dimensionality required for the autoencoding objective.
Even though the LeNet-inspired encoder proved capable of achieving our degradation quantification objective in initial experiments, we identified several shortcomings that motivated the design of a second, more efficient architecture. The most important issue concerns the shape of the CNN's receptive field (RF) which describes the region of the input that influences a single output activation. Its size and aspect ratio determine which structures the network can effectively capture: if the RF is too small, larger patterns cannot be detected, while an excessively large RF may hinder the network from learning to recognize fine details. For standard image data, the RF is often expressed as a symmetric $n \times n$ region, but in principle it can be computed independently per axis.
Even though the LeNet-inspired encoder proved capable of achieving our degradation quantification objective in initial experiments, we identified several shortcomings that motivated the design of a second, more efficient architecture. The most important issue concerns the shape of the CNN's receptive field (RF), which describes the region of the input that influences a single output activation. Its size and aspect ratio determine which structures the network can effectively capture: if the RF is too small, larger patterns cannot be detected, while an excessively large RF may hinder the network from learning to recognize fine details. For standard image data, the RF is often expressed as a symmetric $n \times n$ region, but in principle it can be computed independently per axis.
%\figc{setup_ef_concept}{figures/setup_ef_concept}{Receptive fields in a CNN. Each output activation aggregates information from a region of the input; stacking layers expands this region, while kernel size, stride, and padding control how quickly it grows and what shape it takes. (A) illustrates slower, fine-grained growth; (B) shows faster expansion, producing a larger—potentially anisotropic—receptive field and highlighting the trade-off between detail and context. Reproduced from~\cite{ef_concept_source}}{width=.6\textwidth}
The RF shape issue arises from the fact that spinning multi-beam \rev{LiDAR} sensors often produce point clouds posessing dense horizontal but limited vertical resolution. In our \rev{case, this} results in a pixel-per-degree resolution of approximately $5.69\,\sfrac{pixel}{deg}$ horizontally and $1.01\,\sfrac{pixel}{deg}$ vertically. Consequently, the LeNet-inspired encoder's calculated receptive field of $16 \times 16$ pixels translates to an angular size of $15.88^{\circ} \times 2.81^{\circ}$ (vertical $\times$ horizontal), which is highly rectangular in angular space. Such a mismatch risks limiting the network's ability to capture degradation patterns that extend differently across the two axes.
The RF shape issue arises from the fact that spinning multi-beam \rev{LiDAR} sensors often produce point clouds possessing dense horizontal but limited vertical resolution. In our \rev{case, this} results in a pixel-per-degree resolution of approximately $5.69\,\sfrac{pixel}{deg}$ horizontally and $1.01\,\sfrac{pixel}{deg}$ vertically. Consequently, the LeNet-inspired encoder's calculated receptive field of $16 \times 16$ pixels translates to an angular size of $15.88^{\circ} \times 2.81^{\circ}$ (vertical $\times$ horizontal), which is highly rectangular in angular space. Such a mismatch risks limiting the network's ability to capture degradation patterns that extend differently across the two axes.
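As a rough check of these angles, assume that the azimuth axis spans the full $360^{\circ}$ over the 2048-pixel image width and take the rounded vertical resolution of about $1.01$ pixels per degree. The angular extent of the receptive field then follows from dividing its pixel extent by the per-axis resolution:
\[
r_{\mathrm{h}} = \frac{2048\,\text{px}}{360^{\circ}} \approx 5.69\,\text{px}/^{\circ}, \qquad
\theta_{\mathrm{h}} = \frac{16\,\text{px}}{r_{\mathrm{h}}} \approx 2.81^{\circ}, \qquad
\theta_{\mathrm{v}} = \frac{16\,\text{px}}{1.01\,\text{px}/^{\circ}} \approx 15.8^{\circ}.
\]
The small difference between $15.8^{\circ}$ and the reported $15.88^{\circ}$ is presumably due to rounding of the vertical resolution.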
To adjust for this, we decided to modify the network architecture and included further modificatons to improve the method's performance. The encoder (see \rev{Figure}~\ref{fig:setup_arch_ef_encoder}) follows the same general idea as the LeNet-inspired encoder, but incorporates the following modificatons:
To adjust for this, we decided to modify the network architecture and included further modifications to improve the method's performance. The encoder (see \rev{Figure}~\ref{fig:setup_arch_ef_encoder}) follows the same general idea as the LeNet-inspired encoder, but incorporates the following modifications:
\begin{itemize}
\item \textbf{Non-square convolution kernels.} Depthwise-separable convolutions with kernel size $3 \times 17$ are used instead of square kernels, resulting in an RF of $10 \times 52$ pixels, corresponding to $9.93^{\circ} \times 9.14^{\circ}$, substantially more balanced than the LeNet-inspired network's RF.
\item \textbf{Circular padding along azimuth.} The horizontal axis is circularly padded to respect the wrap-around of $360^{\circ}$ \rev{LiDAR} data, preventing artificial seams at the image boundaries.
@@ -684,7 +685,7 @@ The decoder (see \rev{Figure}~\ref{fig:setup_arch_ef_decoder}) mirrors the encod
\begin{itemize}
\item \textbf{Nearest-neighbor upsampling followed by convolution.} Instead of relying solely on transposed convolutions, each upsampling stage first enlarges the feature map using parameter-free nearest-neighbor interpolation, followed by a depthwise-separable convolution. This strategy reduces the risk of checkerboard artifacts while still allowing the network to learn fine detail.
\item \textbf{Asymmetric upsampling schedule.} Horizontal resolution is restored more aggressively (e.g., scale factor $1 \times 4$) to reflect the anisotropic downsampling performed in the encoder.
\item \textbf{Final convolution with circular padding.} The output is generated using a $(3 \times 17)$ convolution with circular padding along the azimuth similar to the new encoder, ensuring consistent treatment of the 360° \rev{LiDAR} input.
\item \textbf{Final convolution with circular padding.} The output is generated using a $(3 \times 17)$ convolution with circular padding along the azimuth, similar to the new encoder, ensuring consistent treatment of the 360° \rev{LiDAR} input.
\end{itemize}
\fig{setup_arch_ef_decoder}{diagrams/arch_ef_decoder}{
@@ -703,7 +704,7 @@ The decoder (see \rev{Figure}~\ref{fig:setup_arch_ef_decoder}) mirrors the encod
}
To compare the computational efficiency of the two architectures we show the number of trainable parameters and the number of multiply--accumulate operations (MACs) for different latent space sizes used in our experiments in \rev{Table}~\ref{tab:params_lenet_vs_efficient}. Even though the efficient architecture employs more layers and channels which allows the network to learn to recognize more types of patterns when compared to the LeNet-inspired one, the encoders' MACs are quite similar. The more complex decoder design of the efficient network appears to contribute a lot more MACs, which leads to longer pretraining times which we report in \rev{Section}~\ref{sec:setup_experiments_environment}.
To compare the computational efficiency of the two architectures, we show the number of trainable parameters and the number of multiply--accumulate operations (MACs) for different latent space sizes used in our experiments in \rev{Table}~\ref{tab:params_lenet_vs_efficient}. Even though the efficient architecture employs more layers and channels, which allows the network to learn to recognize more types of patterns when compared to the LeNet-inspired one, the encoders' MACs are quite similar. The more complex decoder design of the efficient network appears to contribute a lot more MACs, which leads to longer pretraining times, which we report in \rev{Section}~\ref{sec:setup_experiments_environment}.
%& \multicolumn{4}{c}{\textbf{Encoders}} & \multicolumn{4}{c}{\rev{\textbf{Autoencoders (Encoder $+$ Decoder)}}} \\
@@ -750,9 +751,9 @@ To contextualize the performance of DeepSAD, we compare against two widely used
\paragraph{Isolation Forest} is an ensemble method for anomaly detection that builds on the principle that anomalies are easier to separate from the rest of the data. It constructs many binary decision trees, each by recursively splitting the data at randomly chosen features and thresholds. In this process, the “training” step consists of building the forest of trees: each tree captures different random partitions of the input space, and together they form a diverse set of perspectives on how easily individual samples can be isolated.
Once trained, the method assigns an anomaly score to new samples by measuring their average path length through the trees. Normal samples, being surrounded by other similar samples, typically require many recursive splits and thus end up deep in the trees. Anomalies, by contrast, stand out in one or more features, which means they can be separated much earlier and end up closer to the root. The shorter the average path length, the more anomalous the sample is considered. This makes Isolation Forest highly scalable and robust: training is efficient and the resulting model is fast to apply to new data. In our setup, we apply Isolation Forest directly to the \rev{LiDAR} input representation, providing a strong non-neural baseline for comparison against DeepSAD.
Once trained, the method assigns an anomaly score to new samples by measuring their average path length through the trees. Normal samples, being surrounded by other similar samples, typically require many recursive splits and thus end up deep in the trees. Anomalies, by contrast, stand out in one or more features, which means they can be separated much earlier and end up closer to the root. The shorter the average path length, the more anomalous the sample is considered. This makes Isolation Forest highly scalable and robust: training is efficient, and the resulting model is fast to apply to new data. In our setup, we apply Isolation Forest directly to the \rev{LiDAR} input representation, providing a strong non-neural baseline for comparison against DeepSAD.
\paragraph{OCSVM} takes a very different approach by learning a flexible boundary around normal samples. It assumes all training data to be normal, with the goal of enclosing the majority of these samples in such a way that new points lying outside this boundary can be identified as anomalies. The boundary itself is learned using the support vector machine framework. In essence, OCSVM looks for a hyperplane in some feature space that maximizes the separation between the bulk of the data and the origin. To make this possible even when the normal data has a complex, curved shape, OCSVM uses a kernel function such as the radial basis function (RBF). The kernel implicitly maps the input data into a higher-dimensional space, where the cluster of normal samples becomes easier to separate with a simple hyperplane. When this separation is mapped back to the original input space, it corresponds to a flexible, nonlinear boundary that can adapt to the structure of the data.
\paragraph{OCSVM} takes a very different approach by learning a flexible boundary around normal samples. It assumes all training data to be normal, with the goal of enclosing the majority of these samples in such a way that new points lying outside this boundary can be identified as anomalies. The boundary itself is learned using the support vector machine framework. In essence, OCSVM looks for a hyperplane in some feature space that maximizes the separation between the bulk of the data and the origin. To make this possible, even when the normal data has a complex, curved shape, OCSVM uses a kernel function such as the radial basis function (RBF). The kernel implicitly maps the input data into a higher-dimensional space, where the cluster of normal samples becomes easier to separate with a simple hyperplane. When this separation is mapped back to the original input space, it corresponds to a flexible, nonlinear boundary that can adapt to the structure of the data.
During training, the algorithm balances two competing objectives: capturing as many of the normal samples as possible inside the boundary, while keeping the region compact enough to exclude potential outliers. Once this boundary is established, applying OCSVM is straightforward — any new data point is checked against the learned boundary, with points inside considered normal and those outside flagged as anomalous.
@@ -785,9 +786,9 @@ In conclusion, the combination of unreliable thresholds and pronounced class imb
\newsection{setup_experiments_environment}{Experiment Overview \& Computational Environment}
Across all experiments we vary three factors: (i) latent space dimensionality, (ii) encoder architecture (LeNet-inspired vs. Efficient), and (iii) the amount of semi-supervision (labeling regime). To keep results comparable, we fix the remaining training hyperparameters: all autoencoders are pretrained for $E_A = 50$~epochs with ADAM as an optimzer at a starting learning rate of $L_A = 1\cdot 10^{-5}$; all DeepSAD models are then trained for $E_M = 150$~epochs with the same optimizer and starting learning rate ($L_M = 1\cdot 10^{-5}$). The DeepSAD label-weighting parameter is kept at $\eta = 1$ and the regularization rate at $\lambda = 1\cdot 10^{-6}$ for all runs. Every configuration is evaluated with 5-fold cross-validation, and we report fold means.
Across all experiments, we vary three factors: (i) latent space dimensionality, (ii) encoder architecture (LeNet-inspired vs. Efficient), and (iii) the amount of semi-supervision (labeling regime). To keep results comparable, we fix the remaining training hyperparameters: all autoencoders are pretrained for $E_A = 50$~epochs with ADAM as an optimizer at a starting learning rate of $L_A = 1\cdot 10^{-5}$; all DeepSAD models are then trained for $E_M = 150$~epochs with the same optimizer and starting learning rate ($L_M = 1\cdot 10^{-5}$). The DeepSAD label-weighting parameter is kept at $\eta = 1$ and the regularization rate at $\lambda = 1\cdot 10^{-6}$ for all runs. Every configuration is evaluated with 5-fold cross-validation, and we report fold means.
We first search over the latent bottleneck size by pretraining autoencoders only. For both encoder backbones, we evaluate latent sizes $32, 64, 128, 256, 512, 768,$ and $1024$. The goal is to identify compact yet expressive representations and to compare the autoencoding performance between the two network architectures LeNet-inspired and Efficient. Additionally, we are interested in finding possible correlations between the autoencoder performance and the DeepSAD anomaly detection performance.
We first search over the latent bottleneck size by pretraining autoencoders only. For both encoder backbones, we evaluate latent sizes $32, 64, 128, 256, 512, 768,$ and $1024$. The goal is to identify compact yet expressive representations and to compare the autoencoding performance between the two network architectures, LeNet-inspired and Efficient. Additionally, we are interested in finding possible correlations between the autoencoder performance and the DeepSAD anomaly detection performance.
Using the same latent sizes and backbones, we train full DeepSAD models initialized from the pretrained encoders. We study three supervision regimes, from unsupervised to strongly supervised (see Table~\ref{tab:labeling_regimes} for proportions within the training folds):
\begin{itemize}
@@ -795,7 +796,7 @@ Using the same latent sizes and backbones, we train full DeepSAD models initiali
\item \textbf{Low supervision:} $(50,10)$ labeled samples.
\item \textbf{High supervision:} $(500,100)$ labeled samples.
\end{itemize}
Percentages in Table~\ref{tab:labeling_regimes} are computed relative to the training split of each fold (80\% of the data) from the experiment-based labeling scheme. Importantly, for semi-supervised labels we \emph{only} use hand-selected, unambiguous smoke intervals from the manually-defined evaluation scheme, to avoid injecting mislabeled data into training.
Percentages in Table~\ref{tab:labeling_regimes} are computed relative to the training split of each fold (80\% of the data) from the experiment-based labeling scheme. Importantly, for semi-supervised labels, we \emph{only} use hand-selected, unambiguous smoke intervals from the manually-defined evaluation scheme to avoid injecting mislabeled data into training.
\begin{table}
\centering
@@ -901,7 +902,7 @@ Pretraining runtimes for the autoencoders are reported in Table~\ref{tab:ae_pret
\end{tabularx}
\end{table}
The full DeepSAD training times are shown in Table~\ref{tab:train_runtimes_compact}, alongside the two classical baselines Isolation Forest and OCSVM. Here the contrast between methods is clear: while DeepSAD requires on the order of 15--20 minutes of GPU training per configuration and fold, both baselines complete training in seconds on CPU. The OCSVM training can only be this fast due to the reduced input dimensionality from utilizing DeepSAD's pretraining encoder as a preprocessing step, although other dimensionality reduction methods may also be used which could require less computational resources for this step.
The full DeepSAD training times are shown in Table~\ref{tab:train_runtimes_compact}, alongside the two classical baselines, Isolation Forest and OCSVM. Here, the contrast between methods is clear: while DeepSAD requires on the order of 15--20 minutes of GPU training per configuration and fold, both baselines complete training in seconds on CPU. The OCSVM training can only be this fast due to the reduced input dimensionality from utilizing DeepSAD's pretraining encoder as a preprocessing step, although other dimensionality reduction methods may also be used, which could require less computational resources for this step.
\begin{table}
\centering
@@ -952,7 +953,7 @@ Together, these results provide a comprehensive overview of the computational re
\newchapter{results_discussion}{Results and Discussion}
The \rev{evaluation experiments which the setup in in Chapter~\ref{chp:experimental_setup} described,} are presented in this chapter. We begin in Section~\ref{sec:results_pretraining} with the pretraining stage, where the two autoencoder architectures were trained across multiple latent space dimensionalities. These results provide insight into the representational capacity of each architecture. In Section~\ref{sec:results_deepsad}, we turn to the main experiments: training DeepSAD models and benchmarking them against baseline algorithms (Isolation Forest and OCSVM). Finally, in Section~\ref{sec:results_inference}, we present inference results on \rev{data} that were held out during training. These plots illustrate how the algorithms behave when applied sequentially to unseen \rev{data}, offering a more practical perspective on their potential for real-world rescue robotics applications.
In this chapter, we present the \rev{evaluation experiments outlined in Chapter~\ref{chp:experimental_setup}}. We begin in Section~\ref{sec:results_pretraining} with the pretraining stage, where the two autoencoder architectures were trained across multiple latent space dimensionalities. These results provide insight into the representational capacity of each architecture. In Section~\ref{sec:results_deepsad}, we turn to the main experiments: training DeepSAD models and benchmarking them against baseline algorithms (Isolation Forest and OCSVM). Finally, in Section~\ref{sec:results_inference}, we present inference results on \rev{data} that were held out during training. These plots illustrate how the algorithms behave when applied sequentially to unseen \rev{data}, offering a more practical perspective on their potential for real-world rescue robotics applications.
% --- Section: Autoencoder Pretraining Results ---
\newsection{results_pretraining}{Autoencoder Pretraining Results}
@@ -995,7 +996,7 @@ Due to the challenges of ground truth quality, evaluation results must be interp
\item \textbf{Manually-defined labels:} A cleaner ground truth, containing only clearly degraded frames. This removes mislabeled intervals and allows nearly perfect separation. However, it also simplifies the task too much, because borderline cases are excluded.
\end{itemize}
Table~\ref{tab:results_ap} summarizes average precision (AP) across latent dimensions, labeling regimes, and methods. Under experiment-based evaluation, both DeepSAD variants consistently outperform the baselines, reaching AP values around 0.60--0.66 compared to 0.21 for \rev{the} Isolation Forest and 0.31--0.49 for OCSVM. Under manually-defined evaluation, DeepSAD achieves nearly perfect AP in all settings, while the baselines remain much lower. This contrast shows that the lower AP under experiment-based evaluation is not a weakness of DeepSAD itself, but a direct result of mislabeled samples in the evaluation data. The manually-defined scheme therefore confirms that DeepSAD separates clearly normal from clearly degraded frames very well, while also highlighting that label noise must be kept in mind when interpreting the experiment-based results.
Table~\ref{tab:results_ap} summarizes average precision (AP) across latent dimensions, labeling regimes, and methods. Under experiment-based evaluation, both DeepSAD variants consistently outperform the baselines, reaching AP values around 0.60--0.66 compared to 0.21 for \rev{the} Isolation Forest and 0.31--0.49 for OCSVM. Under manually-defined evaluation, DeepSAD achieves nearly perfect AP in all settings, while the baselines remain much lower. This contrast shows that the lower AP under experiment-based evaluation is not a weakness of DeepSAD itself, but a direct result of mislabeled samples in the evaluation data. Therefore, the manually-defined scheme confirms that DeepSAD separates clearly normal from clearly degraded frames very well, while also highlighting that label noise must be kept in mind when interpreting the experiment-based results.
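To make the effect of label noise on AP concrete, it helps to recall how the metric is computed. The following is a sketch under assumed notation rather than a statement of the exact evaluation code: $P_n$ and $R_n$ denote precision and recall at the $n$-th score threshold (ordered from strictest to most permissive, with $R_0 = 0$), and the non-interpolated form is an assumption, since the interpolation used in the experiments is not specified here.
\[
  \mathrm{AP} = \sum_{n} \left( R_n - R_{n-1} \right) P_n
\]
Under this definition, frames that are labeled anomalous but look clean are only recalled at permissive thresholds, where many normal frames are already flagged and $P_n$ is low. Their recall increments are therefore weighted by low precision, which pulls AP down even when clearly degraded frames are ranked correctly.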
\begin{table}
\centering
@@ -1050,18 +1051,18 @@ Taken together, the two evaluation schemes provide complementary insights. The e
\FloatBarrier
\paragraph{Effect of latent space dimensionality.}
During autoencoder pretraining we observed that reconstruction loss decreased monotonically with larger latent spaces, as expected: a bigger bottleneck allows the encoder--decoder to retain more information. If autoencoder performance were directly predictive of DeepSAD performance, we would therefore expect average precision to improve with larger latent dimensions. The actual results, however, show the opposite trend (Figure~\ref{fig:latent_dim_ap}): compact latent spaces (32--128) achieve the highest AP, while performance declines as the latent size grows. This inverse correlation is most clearly visible in the unsupervised case. Part of this effect can be attributed to evaluation label noise, which larger spaces amplify. More importantly, it shows that autoencoder performance does not translate directly into DeepSAD performance. Pretraining losses can still help compare different architectures for robustness, and performance but they cannot be used to tune the latent dimensionality: the dimensionality that minimizes reconstruction loss in pretraining is not necessarily the one that maximizes anomaly detection performance in DeepSAD.
During autoencoder pretraining, we observed that reconstruction loss decreased monotonically with larger latent spaces, as expected: a bigger bottleneck allows the encoder--decoder to retain more information. If autoencoder performance were directly predictive of DeepSAD performance, we would therefore expect average precision to improve with larger latent dimensions. The actual results, however, show the opposite trend (Figure~\ref{fig:latent_dim_ap}): compact latent spaces (32--128) achieve the highest AP, while performance declines as the latent size grows. This inverse correlation is most clearly visible in the unsupervised case. Part of this effect can be attributed to evaluation label noise, which larger spaces amplify. More importantly, it shows that autoencoder performance does not translate directly into DeepSAD performance. Pretraining losses can still help compare different architectures for robustness and performance, but they cannot be used to tune the latent dimensionality: the dimensionality that minimizes reconstruction loss in pretraining is not necessarily the one that maximizes anomaly detection performance in DeepSAD.
% \paragraph{Effect of latent space dimensionality.}
% Figure~\ref{fig:latent_dim_ap} shows how average precision changes with latent dimension under the experiment-based evaluation. The best performance is reached with compact latent spaces (32--128), while performance drops as the latent dimension grows. This can be explained by how the latent space controls the separation between normal and anomalous samples. Small bottlenecks act as a form of regularization, keeping the representation compact and making it easier to distinguish clear anomalies from normal frames. Larger latent spaces increase model capacity, but this extra flexibility also allows more overlap between normal frames and the mislabeled anomalies from the evaluation data. As a result, the model struggles more to keep the two groups apart.
%
% This effect is clearly visible in the precision--recall curves. For DeepSAD at all dimensionalities we observe high initial precision and a steep drop once the evaluation demands that mislabeled anomalies be included. However, the sharpness of this drop depends on the latent size: at 32 dimensions the fall is comparably more gradual, while at 1024 it is almost vertical. In practice, this means that higher-dimensional latent spaces amplify the label-noise problem and lead to sudden precision collapses once the clear anomalies have been detected. Compact latent spaces are therefore more robust under noisy evaluation conditions and appear to be the safer choice for real-world deployment.
\figc{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows inverse correlation between AP and latent space size.}{width=.7\textwidth}
\figc{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows an inverse correlation between AP and latent space size.}{width=.7\textwidth}
\FloatBarrier
\paragraph{Effect of semi-supervised labeling.}
Table~\ref{tab:results_ap} shows that the unsupervised regime \((0/0)\) achieves the best AP, while the lightly supervised regime \((50/10)\) performs worst. With many labels \((500/100)\), performance improves again but remains slightly below the unsupervised case. This pattern also appears under the manually-defined evaluation, which excludes mislabeled frames. The drop with light supervision therefore cannot be explained by noisy evaluation targets, but must stem from the training process itself.
Table~\ref{tab:results_ap} shows that the unsupervised regime \((0/0)\) achieves the best AP, while the lightly supervised regime \((50/10)\) performs worst. With many labels \((500/100)\), performance improves again but remains slightly below the unsupervised case. This pattern also appears under the manually-defined evaluation, which excludes mislabeled frames. Consequently, the drop with light supervision cannot be explained by noisy evaluation targets, but must stem from the training process itself.
The precision--recall curves in Figure~\ref{fig:prc_over_semi} show that the overall curve shapes are similar across regimes, but shifted relative to one another in line with the AP ordering \((0/0) > (500/100) > (50/10)\). We attribute these shifts to overfitting: when only a few anomalies are labeled, the model fits them too strongly, and if those examples differ too much from other anomalies, generalization suffers. This explains why lightly supervised training performs even worse than unsupervised training, which avoids this bias.
@@ -1069,7 +1070,7 @@ The precision--recall curves in Figure~\ref{fig:prc_over_semi} show that the ove
The LeNet variant illustrates this effect most clearly, showing unusually high variance across folds in the lightly supervised case. In several folds, precision drops atypically early, which supports the idea that the model has overfit to a poorly chosen subset of labeled anomalies. The Efficient variant is less affected and maintains more stable precision plateaus for nearly all latent dimensionalities, which suggests that it is more robust to such overfitting.
With many labels \((500/100)\), the results become more stable again and the PRC curves closely resemble the unsupervised case, only shifted slightly left. A larger and more diverse set of labeled anomalies reduces the risk of unlucky sampling and improves generalization, but it still cannot fully match the unsupervised regime, where no overfitting to a specific labeled subset occurs. The only exception is an outlier at latent dimension 512 for LeNet, where the curve again resembles the lightly supervised case, likely due to label sampling effects amplified by higher latent capacity.
With many labels \((500/100)\), the results become more stable again, and the PRC curves closely resemble the unsupervised case, only shifted slightly left. A larger and more diverse set of labeled anomalies reduces the risk of unlucky sampling and improves generalization, but it still cannot fully match the unsupervised regime, where no overfitting to a specific labeled subset occurs. The only exception is an outlier at latent dimension 512 for LeNet, where the curve again resembles the lightly supervised case, likely due to label sampling effects amplified by higher latent capacity.
In summary, three consistent patterns emerge: (i) a very small number of labels can hurt performance by causing overfitting to specific examples, (ii) many labels reduce this problem but still do not surpass unsupervised generalization, and (iii) encoder architecture strongly affects robustness, with \rev{the LeNet-inspired encoder} being more sensitive to unstable behavior than \rev{the Efficient encoder}.
@@ -1077,15 +1078,15 @@ In summary, three consistent patterns emerge: (i) a very small number of labels
\newsection{results_inference}{Inference on Held-Out Experiments}
In addition to the evaluation of PRC and AP obtained from $k$-fold cross-validation with varying hyperparameters, we also examine the behavior of the fully trained methods when applied to previously unseen, held-out experiments.
While the prior analysis provided valuable insights into the classification capabilities of the methods, it was limited by two factors: first, the binary ground-truth labels were of uneven quality due to aforementioned mislabeling of frames, and second, the binary formulation does not reflect our overarching goal of quantifying sensor degradation on a continuous scale.
While the prior analysis provided valuable insights into the classification capabilities of the methods, it was limited by two factors: first, the binary ground-truth labels were of uneven quality due to the aforementioned mislabeling of frames, and second, the binary formulation does not reflect our overarching goal of quantifying sensor degradation on a continuous scale.
To provide a more intuitive understanding of how the methods might perform in real-world applications, we therefore present results from running inference sequentially on entire experiments.
These frame-by-frame time-axis plots simulate online inference and illustrate how anomaly scores evolve as data is captured, thereby serving as a candidate metric for quantifying the degree of \rev{LiDAR} degradation during operation.
%\fig{results_inference_normal_vs_degraded}{figures/results_inference_normal_vs_degraded.png}{Comparison of anomaly detection methods with statistical indicators across clean (dashed) and degraded (solid) experiments. Each subplot shows one method (DeepSAD--LeNet, DeepSAD--Efficient, OCSVM, Isolation Forest). Red curves denote how strongly the anomaly score deviates from clean-experiment baseline; blue and green curves denote the percentage of missing \rev{LiDAR} points and near-sensor particle hits, respectively. Latent Space Dimensionality was 32 and semi-supervised labeling regime was 0 normal and 0 anomalous samples during training.}
\fig{results_inference_normal_vs_degraded}{figures/results_inference_normal_vs_degraded.png}{Comparison of inference on unseen experiment for clean (dashed) vs. degraded (solid) experiments. Each subplot compares one method to statistical indicators. Red curves show method's anomaly score deviation from its clean baseline; blue and green curves indicate the percentage of missing \rev{LiDAR} points and near-sensor particle hits, respectively. Latent dimension: 32; training regime: 0 normal, 0 anomalous samples.}
\fig{results_inference_normal_vs_degraded}{figures/results_inference_normal_vs_degraded.png}{Comparison of inference on unseen clean (dashed) vs. degraded (solid) experiments. Every subplot compares one method to statistical indicators. Red curves show each method's anomaly score deviation from its clean baseline; blue and green curves indicate the percentage of missing \rev{LiDAR} points and near-sensor particle hits, respectively. Latent dimension: 32; training regime: 0 normal, 0 anomalous samples.}
As discussed in Section~\ref{sec:setup_baselines_evaluation}, we apply $z$-score normalization to enable comparison of the different methods during inference. After normalization, the resulting time series were still highly noisy, which motivated the application of exponential moving average (EMA) smoothing. EMA was chosen because it is causal (does not rely on future data) and thus suitable for real-time inference. Although it introduces a small time delay, this delay is shorter than for other smoothing techniques such as running averages.
As discussed in Section~\ref{sec:setup_baselines_evaluation}, we apply $z$-score normalization to enable comparison of the different methods during inference. After normalization, the resulting time series were still highly noisy, which motivated the application of exponential moving average (EMA) smoothing. EMA was chosen because it is causal (does not rely on future data) and thus suitable for real-time inference. Although it introduces a small time delay, this delay is shorter than for other smoothing techniques, such as running averages.
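The two post-processing steps can be written compactly. The following is a sketch under assumed notation, not a description of the exact implementation: $s_t$ is the raw anomaly score of frame $t$, $\mu_{\mathrm{clean}}$ and $\sigma_{\mathrm{clean}}$ are the mean and standard deviation of the scores on clean data, and $\alpha \in (0, 1]$ is a smoothing factor whose value is not specified here. Initializing the filter with the first normalized score is likewise an assumption.
\[
  z_t = \frac{s_t - \mu_{\mathrm{clean}}}{\sigma_{\mathrm{clean}}},
  \qquad
  \tilde{z}_t = \alpha\, z_t + (1 - \alpha)\, \tilde{z}_{t-1},
  \qquad
  \tilde{z}_0 = z_0
\]
Because $\tilde{z}_t$ depends only on $z_0, \dots, z_t$, the filter is causal and can be updated frame by frame with constant memory. A smaller $\alpha$ smooths more strongly but responds more slowly to changes in the score.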
The plots in Figure~\ref{fig:results_inference_normal_vs_degraded} highlight important differences in how well the tested methods distinguish between normal and degraded sensor conditions. The plots show how strongly each method's scores deviate from its clean-data baseline and include statistical indicators (missing points and near-sensor particle hits) in blue and green.
@@ -1104,7 +1105,7 @@ This thesis set out to answer the research question stated in Chapter~\ref{chp:i
\begin{quote}
Can autonomous robots quantify the reliability of \rev{LiDAR} sensor data in hazardous environments to make more informed decisions?
\end{quote}
Our results indicate a qualified “yes.” Using anomaly detection (AD)—in particular DeepSAD—we can obtain scores that (i) separate clearly normal from clearly degraded scans and (ii) track degradation trends over time on held-out experiments (see Sections~\ref{sec:results_deepsad} and \ref{sec:results_inference}). At the same time, the absence of robust ground truth limits how confidently we can assess \emph{continuous} quantification quality and complicates cross-method comparisons. The remainder of this chapter summarizes what we contribute, what we learned, and what is still missing.
Our results indicate a qualified “yes.” Using anomaly detection (AD)—in particular DeepSAD—we can obtain scores that (i) separate clearly normal from clearly degraded scans and (ii) track degradation trends over time on held-out experiments (see Sections~\ref{sec:results_deepsad} and \ref{sec:results_inference}). At the same time, the absence of robust ground truth limits how confidently we can assess \emph{continuous} quantification quality and complicates cross-method comparisons. The remainder of this chapter summarizes what we contributed, what we learned, and what is still missing.
\paragraph{Main contributions.}
\begin{itemize}
@@ -1115,9 +1116,9 @@ Our results indicate a qualified “yes.” Using anomaly detection (AD)—in pa
\item \textbf{Semi-supervision insight.} In our data, \emph{unsupervised} DeepSAD performed best; \emph{light} labeling (50/10) performed worst; \emph{many} labels (500/100) partially recovered performance but did not surpass \rev{the unsupervised approach}. Evidence from \rev{precision--recall curve (PRC)} shapes and fold variance points to \emph{training-side overfitting to a small labeled set}, an effect that persists even under clean manually-defined evaluation (Table~\ref{tab:results_ap}, Figure~\ref{fig:prc_over_semi}).
\item \textbf{Encoder architecture matters.} The Efficient encoder \rev{specifically tailored to the application at hand} outperformed the LeNet-inspired variant in pretraining and downstream AD, indicating that representation quality substantially affects DeepSAD performance (Section~\ref{sec:results_pretraining}, Section~\ref{sec:results_deepsad}).
\item \textbf{Encoder architecture matters.} The Efficient encoder, \rev{specifically tailored to the application at hand,} outperformed the LeNet-inspired variant in pretraining and downstream AD, indicating that representation quality substantially affects DeepSAD performance (Section~\ref{sec:results_pretraining}, Section~\ref{sec:results_deepsad}).
\item \textbf{Temporal inference recipe.} For deployment-oriented analysis we propose $z$-score normalization based on clean data and causal EMA smoothing to obtain interpretable time-series anomaly scores on full experiments (Section~\ref{sec:results_inference}).
\item \textbf{Temporal inference recipe.} For deployment-oriented analysis, we propose $z$-score normalization based on clean data and causal EMA smoothing to obtain interpretable time-series anomaly scores on full experiments (Section~\ref{sec:results_inference}).
\end{itemize}
\paragraph{Practical recommendations.}
@@ -1133,9 +1134,9 @@ We now turn to the main limiting factor that emerged throughout this work: the l
\newsection{conclusion_data}{Missing Ground Truth as an Obstacle}
The most significant obstacle identified in this work is the absence of robust and comprehensive ground truth for \rev{LiDAR} degradation. As discussed in Chapter~\ref{chp:data_preprocessing}, it is not trivial to define what “degradation” precisely means in practice. Although error models for \rev{LiDAR} and theoretical descriptions of how airborne particles affect laser returns exist, these models typically quantify errors at the level of individual points (e.g., missing returns, spurious near-range hits). Such metrics, however, may not be sufficient to assess the impact of degraded data on downstream perception. For example, a point cloud with relatively few but highly localized errors—such as those caused by a dense smoke cloud—may cause a SLAM algorithm to misinterpret the region as a solid obstacle. In contrast, a point cloud with a greater number of dispersed errors might be easier to filter and thus cause little or no disruption in mapping. Consequently, the notion of “degradation” must extend beyond point-level error statistics to include how different error patterns propagate to downstream modules.
The most significant obstacle identified in this work is the absence of a robust and comprehensive ground truth for \rev{LiDAR} degradation. As discussed in Chapter~\ref{chp:data_preprocessing}, it is not trivial to define what “degradation” precisely means in practice. Although error models for \rev{LiDAR} and theoretical descriptions of how airborne particles affect laser returns exist, these models typically quantify errors at the level of individual points (e.g., missing returns, spurious near-range hits). Such metrics, however, may not be sufficient to assess the impact of degraded data on downstream perception. For example, a point cloud with relatively few but highly localized errors—such as those caused by a dense smoke cloud—may cause a SLAM algorithm to misinterpret the region as a solid obstacle. In contrast, a point cloud with a greater number of dispersed errors might be easier to filter and thus cause little or no disruption in mapping. Consequently, the notion of “degradation” must extend beyond point-level error statistics to include how different error patterns propagate to downstream modules.
To our knowledge, no public datasets with explicit ground truth for \rev{LiDAR} degradation exist. Even if such data were collected, for example with additional smoke sensors, it is unclear whether this would provide a usable ground truth. A smoke sensor measures only at a single point in space, while \rev{LiDAR} observes many points across the environment from a distance, so the two do not directly translate. In our dataset, we relied on the fact that clean and degraded experiments were clearly separated: data from degraded runs was collected only after artificial smoke had been released. However, the degree of degradation varied strongly within each run. Because the smoke originated from a single machine in the middle of the sensor platform's traversal path, early and late frames were often nearly as clear as those from clean experiments. This led to mislabeled frames at the run boundaries and limited the reliability of experiment-based evaluation. As shown in Section~\ref{sec:results_deepsad}, this effect capped achievable AP scores even for strong models. The underlying difficulty is not only label noise, but also the challenge of collecting labeled subsets that are representative of the full range of anomalies.
To our knowledge, no public datasets with explicit ground truth for \rev{LiDAR} degradation exist. Even if such data were collected, for example, with additional smoke sensors, it is unclear whether this would provide a usable ground truth. A smoke sensor measures only at a single point in space, while \rev{LiDAR} observes many points across the environment from a distance, so the two do not directly translate. In our dataset, we relied on the fact that clean and degraded experiments were clearly separated: data from degraded runs was collected only after artificial smoke had been released. However, the degree of degradation varied strongly within each run. Because the smoke originated from a single machine in the middle of the sensor platform's traversal path, early and late frames were often nearly as clear as those from clean experiments. This led to mislabeled frames at the run boundaries and limited the reliability of experiment-based evaluation. As shown in Section~\ref{sec:results_deepsad}, this effect capped achievable AP scores even for strong models. The underlying difficulty is not only label noise, but also the challenge of collecting labeled subsets that are representative of the full range of anomalies.
One promising direction is to evaluate degradation not directly on raw \rev{LiDAR} frames but via its downstream impact. For example, future work could assess degradation based on discrepancies between a previously mapped 3D environment and the output of a SLAM algorithm operating under degraded conditions. In such a setup, subjective labeling may still be required in special cases (e.g., dense smoke clouds treated as solid obstacles by SLAM), but it would anchor evaluation more closely to the ultimate users of the data.
@@ -1143,11 +1144,11 @@ Finally, the binary ground truth employed here is insufficient for the quantific
\newsection{conclusion_ad}{Insights into DeepSAD and AD for Degradation Quantification}
This work has shown that the DeepSAD principle is applicable to \rev{LiDAR} degradation in hazardous environments and yields promising detection performance as well as runtime feasibility (see Sections~\ref{sec:results_deepsad} and~\ref{sec:setup_experiments_environment}). Compared to simpler baselines such as Isolation Forest and OCSVM, DeepSAD achieved much stronger separation between clean and degraded data. While OCSVM showed smoother but weaker separation and Isolation Forest produced high false positives even in clean runs, both DeepSAD variants maintained large high-precision regions before collapsing under mislabeled evaluation targets.
This work has shown that the DeepSAD principle is applicable to \rev{LiDAR} degradation in hazardous environments and yields promising detection performance as well as runtime feasibility (see Sections~\ref{sec:results_deepsad} and~\ref{sec:setup_experiments_environment}). Compared to simpler baselines such as Isolation Forest and OCSVM, DeepSAD achieved much stronger separation between clean and degraded data. While OCSVM showed smoother but weaker separation, and Isolation Forest produced high false positives even in clean runs, both DeepSAD variants maintained large high-precision regions before collapsing under mislabeled evaluation targets.
However, the semi-supervised component of DeepSAD did not improve results in our setting. In fact, adding a small number of labels often reduced performance due to overfitting to narrow subsets of anomalies, while larger labeled sets stabilized training, they still did not surpass the unsupervised regime (see Section~\ref{sec:results_deepsad}). This suggests that without representative and diverse labeled anomalies, unsupervised training remains the safer choice.
However, the semi-supervised component of DeepSAD did not improve results in our setting. In fact, adding a small number of labels often reduced performance due to overfitting to narrow subsets of anomalies. While larger labeled sets stabilized training, they still did not surpass the unsupervised regime (see Section~\ref{sec:results_deepsad}). This suggests that without representative and diverse labeled anomalies, unsupervised training remains the safer choice.
We also observed that the choice of encoder architecture and latent dimensionality are critical. The Efficient encoder consistently outperformed the LeNet-inspired baseline, producing more stable precision--recall curves and stronger overall results. Similarly, compact latent spaces (32--128 dimensions) yielded the best performance and proved more robust under noisy evaluation conditions, while larger latent spaces amplified the impact of mislabeled samples and caused sharper precision collapses. These findings underline the importance of representation design for robust anomaly detection.
We also observed that the choices of encoder architecture and latent dimensionality are critical. The Efficient encoder consistently outperformed the LeNet-inspired baseline, producing more stable precision--recall curves and stronger overall results. Similarly, compact latent spaces (32--128 dimensions) yielded the best performance and proved more robust under noisy evaluation conditions, while larger latent spaces amplified the impact of mislabeled samples and caused sharper precision collapses. These findings underline the importance of representation design for robust anomaly detection.
Finally, inference experiments showed that DeepSAD's anomaly scores can track degradation trends over time when normalized and smoothed, suggesting potential for real-world quantification. Future work could explore per-sample weighting of semi-supervised targets, especially if analog ground truth becomes available, allowing DeepSAD to capture varying degrees of degradation as a graded rather than binary signal.

View File

@@ -1,9 +1,9 @@
\addcontentsline{toc}{chapter}{Abstract}
\begin{center}\Large\bfseries Abstract\end{center}\vspace*{1cm}\noindent
Autonomous robots are increasingly used in search and rescue (SAR) missions. In these missions, lidar sensors are often the most important source of environmental data. However, lidar data can degrade under hazardous conditions, especially when airborne particles such as smoke or dust are present. This degradation can lead to errors in mapping and navigation and may endanger both the robot and humans. Therefore, robots need a way to estimate the reliability of their lidar data, so \rev{that} they can make better-informed decisions.
Autonomous robots are increasingly used in search and rescue (SAR) missions. In these missions, LiDAR sensors are often the most important source of environmental data. However, LiDAR data can degrade under hazardous conditions, especially when airborne particles such as smoke or dust are present. This degradation can lead to errors in mapping and navigation and may endanger both the robot and humans. Therefore, robots need a way to estimate the reliability of their LiDAR data, so \rev{that} they can make better-informed decisions.
\bigskip
This thesis investigates whether anomaly detection methods can be used to quantify lidar data degradation \rev{caused by airborne particles such as smoke and dust}. We apply a semi-supervised deep learning approach called DeepSAD, which produces an anomaly score for each lidar scan, serving as a measure of data reliability.
This thesis investigates whether anomaly detection methods can be used to quantify LiDAR data degradation \rev{caused by airborne particles such as smoke and dust}. We apply a semi-supervised deep learning approach called DeepSAD, which produces an anomaly score for each LiDAR scan, serving as a measure of data reliability.
\bigskip
We evaluate this method against baseline methods on a subterranean dataset that includes lidar scans degraded by artificial smoke. Our results show that DeepSAD consistently outperforms the baselines and can clearly distinguish degraded from normal scans. At the same time, we find that the limited availability of labeled data and the lack of robust ground truth remain major challenges. Despite these limitations, our work demonstrates that anomaly detection methods are a promising tool for lidar degradation quantification in SAR scenarios.
We evaluate this method against baseline methods on a subterranean dataset that includes LiDAR scans degraded by artificial smoke. Our results show that DeepSAD consistently outperforms the baselines and can clearly distinguish degraded from normal scans. At the same time, we find that the limited availability of labeled data and the lack of robust ground truth remain major challenges. Despite these limitations, our work demonstrates that anomaly detection methods are a promising tool for LiDAR degradation quantification in SAR scenarios.