Compare commits
4 Commits: bd9171f68e...main

| Author | SHA1 | Date |
|---|---|---|
|  | 7b5accb6c5 |  |
|  | 8f983b890f |  |
|  | 6cd2c7fbef |  |
|  | 62c424cd54 |  |

BIN  thesis/Main.pdf
103  thesis/Main.tex
@@ -91,7 +91,8 @@
}
\DeclareRobustCommand{\rev}[1]{\textcolor{red}{#1}}
%\DeclareRobustCommand{\rev}[1]{\textcolor{red}{#1}}
\DeclareRobustCommand{\rev}[1]{#1}
\DeclareRobustCommand{\mcah}[1]{}
% correct bad hyphenation
@@ -252,7 +253,7 @@ Anomaly detection refers to the process of detecting unexpected patterns of data
Figure~\ref{fig:anomaly_detection_overview} depicts a simple but illustrative example of data that can be classified as either normal or anomalous and shows the problem that anomaly detection methods generally try to solve. A successful anomaly detection method would somehow learn to differentiate normal from anomalous data, for example, by learning the boundaries around the available normal data and classifying a new sample as either normal or anomalous based on its location inside or outside of those boundaries. Another possible approach could calculate a continuous value that correlates with the likelihood of a sample being anomalous, for example, by using the sample's distance from the center of the closest normal data cluster.
\figc{anomaly_detection_overview}{figures/anomaly_detection_overview}{An illustrative example of anomalous and normal data containing 2-dimensional data with clusters of normal data $N_1$ and $N_2$, as well as two single anomalies $o_1$ and $o_2$ and a cluster of anomalies $O_3$. Reproduced from~\cite{anomaly_detection_survey}\rev{.}}{width=0.5\textwidth}
\figc{anomaly_detection_overview}{figures/anomaly_detection_overview}{An illustrative example of anomalous and normal data containing 2-dimensional data with clusters of normal data $N_1$ and $N_2$, as well as two single anomalies $o_1$ and $o_2$ and a cluster of anomalies $O_3$. Reproduced from~\cite{anomaly_detection_survey}\rev{.}}{width=0.55\textwidth}
By their very nature, anomalies are rare occurrences and oftentimes unpredictable in nature, which makes it hard to define all possible anomalies in any system. It also makes it very challenging to create an algorithm that is capable of detecting anomalies that may have never occurred before and may not have been known to exist during the creation of the detection algorithm. There are many possible approaches to this problem, though they can be roughly grouped into six distinct categories based on the techniques used~\cite{anomaly_detection_survey}:
@@ -426,7 +427,7 @@ To ensure our chosen dataset meets the needs of reliable degradation quantificat
\begin{enumerate}
\item \textbf{Data Modalities:}\\
The dataset must include \rev{LiDAR} sensor data, since we decided to train and evaluate our method on what should be the most universally used sensor type in the given domain. To keep our method as generalized as possible, we chose to only require range-based point cloud data and \rev{opt out of} sensor-specific data such as intensity or reflectivity, though it may be of interest for future work. It is also desirable to have complementary visual data, such as camera images, for better context, manual verification, and understanding of the data.
The dataset must include \rev{LiDAR} sensor data, since we decided to train and evaluate our method on what should be the most universally used sensor type in the given domain. To keep our method as generalized as possible, we chose to only require range-based point cloud data and neglect sensor-specific data such as intensity or reflectivity, though it may be of interest for future work. It is also desirable to have complementary visual data, such as camera images, for better context, manual verification, and understanding of the data.
\item \textbf{Context \& Collection Method:}\\
To mirror the real-world conditions of autonomous rescue robots, the data should originate from locations such as subterranean environments (tunnels, caves, collapsed structures), which closely reflect what would be encountered during rescue missions. Ideally, it should be captured from a ground-based, self-driving robot platform in motion instead of aerial, handheld, or stationary collection, to ensure similar circumstances to the target domain.
@@ -513,7 +514,7 @@ In the anomalous experiments, the artificial smoke machine appears to have been
Regarding the dataset volume, the 10 normal experiments ranged from 88.7 to 363.1 seconds, with an average duration of 157.65 seconds. At a capture rate of 10 frames per second, these experiments yield 15,765 non-degraded point clouds. In contrast, the 4 anomalous experiments, including one stationary experiment lasting 11.7 seconds and another extending to 62.1 seconds, averaged 47.33 seconds, resulting in 1,893 degraded point clouds. In total, the dataset comprises 17,658 point clouds, with approximately 89.28\% classified as non-degraded (normal) and 10.72\% as degraded (anomalous). The distribution of experimental data is visualized in \rev{Figure}~\ref{fig:data_points_pie}.
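The frame counts quoted above follow directly from the average durations and the 10 Hz capture rate; a small sanity-check sketch in Python, using only the aggregate figures given in this paragraph:

```python
# Reproduce the quoted frame counts from the average durations and the 10 Hz capture rate.
FPS = 10

normal_frames = round(157.65 * 10 * FPS)         # 10 normal experiments  -> 15,765 point clouds
anomalous_frames = round(47.33 * 4 * FPS)        # 4 smoke experiments    -> 1,893 point clouds
total_frames = normal_frames + anomalous_frames  # 17,658

print(total_frames, f"{normal_frames / total_frames:.2%} normal",
      f"{anomalous_frames / total_frames:.2%} anomalous")
```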
\fig{data_points_pie}{figures/data_points_pie.png}{Pie chart visualizing the amount and distribution of normal and anomalous point clouds in \cite{subter}\rev{.}}
\fig{data_points_pie}{figures/data_points_pie.png}{Pie chart visualizing the amount and distribution of normal and anomalous LiDAR frames (i.e., point clouds) in \cite{subter}\rev{.}}
The artificial smoke introduces measurable changes that clearly separate the \textit{anomalous} runs from the \textit{normal} baseline. One change is a larger share of missing points per scan: smoke particles scatter or absorb the laser beam before it reaches a solid target, so the sensor reports an error instead of a distance. Figure~\ref{fig:data_missing_points} shows the resulting right–shift of the missing-point histogram, a known effect for \rev{LiDAR} sensors in aerosol-filled environments. Another demonstrative effect is the appearance of many spurious returns very close to the sensor; these near-field points arise when back-scatter from the aerosol itself is mistaken for a surface echo. The box plot in \rev{Figure}~\ref{fig:particles_near_sensor} confirms a pronounced increase in sub-50 cm hits under smoke, a range at which we do not expect any non-erroneous measurements. Both effects are consistent with the behaviour reported in \rev{\cite{when_the_dust_settles}}.
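Both statistics can be computed per scan with a few lines of NumPy. The sketch below assumes the point cloud is given as an (N, 3) array of XYZ returns with failed returns encoded as all-zero points; the dataset's actual encoding of missing measurements may differ.

```python
import numpy as np

def scan_degradation_stats(points: np.ndarray, near_threshold_m: float = 0.5) -> dict:
    """Per-scan statistics that separate smoke experiments from the normal baseline."""
    ranges = np.linalg.norm(points, axis=1)
    missing = ranges == 0.0                               # no echo received for this measurement
    near = (ranges > 0.0) & (ranges < near_threshold_m)   # spurious back-scatter close to the sensor
    return {"missing_ratio": float(missing.mean()), "near_sensor_hits": int(near.sum())}
```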
@@ -532,9 +533,9 @@ For this reason and to simplify the architecture, we converted the point clouds
To create this mapping, we leveraged the available measurement indices and channel information inherent in the dense point clouds, which are ordered from 0 to 65,535 in a horizontally ascending, channel-by-channel manner. For sparse point clouds without such indices, one would need to rely on the pitch and yaw angles relative to the sensor's origin to correctly map each point to its corresponding pixel, although this often leads to ambiguous mappings due to numerical errors in angle estimation.
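A minimal NumPy sketch of this index-based mapping, assuming the measurement index advances horizontally within each channel (our reading of the ordering described above):

```python
import numpy as np

H, W = 32, 2048   # channels x horizontal measurements per revolution

def project_to_range_image(points: np.ndarray, measurement_ids: np.ndarray) -> np.ndarray:
    """Map an ordered, dense point cloud onto a 2D image of reciprocal ranges."""
    ranges = np.linalg.norm(points, axis=1)
    rows = measurement_ids // W          # channel index
    cols = measurement_ids % W           # horizontal position within the channel
    image = np.zeros((H, W), dtype=np.float32)
    valid = ranges > 0.0                 # missing returns stay 0 in the projection
    image[rows[valid], cols[valid]] = 1.0 / ranges[valid]
    return image
```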
Figure~\ref{fig:data_projections} displays two examples of \rev{LiDAR} point cloud projections to aid in the reader’s understanding. Although the original point clouds were converted into grayscale images with a resolution of 2048×32 pixels, these raw images can be challenging to interpret. To enhance human readability, we applied the viridis colormap and vertically stretched the images so that each measurement occupies multiple pixels in height. The top projection is derived from a scan without artificial smoke—and therefore minimal degradation—while the lower projection comes from an experiment where artificial smoke introduced significant degradation.
Figure~\ref{fig:data_projections} displays two examples of \rev{LiDAR} point cloud projections to aid in the reader’s understanding. Although the original point clouds were converted into grayscale images with a resolution of 2048×32 pixels, these raw images can be challenging to interpret. To enhance human readability, we applied the viridis colormap and vertically stretched the images so that each measurement occupies multiple pixels in height. The projection in (a) is derived from a scan without artificial smoke—and therefore minimal degradation—while the projection in (b) comes from an experiment where artificial smoke introduced significant degradation.
\fig{data_projections}{figures/data_2d_projections.png}{Two-dimensional projections of two point clouds, one from an experiment without degradation and one from an experiment with artificial smoke as degradation. To aid the reader's perception, the images are vertically stretched, and a colormap has been applied to the pixels' reciprocal range values, while the actual training data is grayscale.}
\fig{data_projections}{figures/data_2d_projections.png}{Two-dimensional projections of two point clouds, (a) from an experiment without degradation and (b) from an experiment with artificial smoke as degradation. To aid the reader's perception, the images are vertically stretched, and a colormap has been applied to the pixels' reciprocal range values, while the actual training data is grayscale.}
The remaining challenge was labeling a large enough portion of the dataset in a reasonably accurate manner, whose difficulties and general approach we described in \rev{Section}~\ref{sec:data_req}. Since, to our knowledge, neither our chosen dataset nor any other publicly available one provides objective labels for \rev{LiDAR} data degradation in the SAR domain, we had to define our own labeling approach. With objective measures of degradation unavailable, we explored alternative labeling methods—such as using \rev{the statistical} properties like the number of missing measurements per point cloud or the higher incidence of erroneous measurements near the sensor we described in \rev{Section~\ref{sec:data_dataset}}. Ultimately, we were concerned that these statistical approaches might lead the method to simply mimic the statistical evaluation rather than to quantify degradation in a generalized and robust manner. After considering these options, we decided to label all point clouds from experiments with artificial smoke as anomalies, while point clouds from experiments without smoke were labeled as normal data. This labeling strategy—based on the presence or absence of smoke—is fundamentally an environmental indicator, independent of the intrinsic data properties recorded during the experiments.
@@ -552,7 +553,7 @@ Afraid that the incorrectly labeled data may negatively impact DeepSAD's semi-su
Under both evaluation schemes, all frames from normal experiments were marked as normal, since they appear to have produced high-quality data throughout. A visualization of how the two evaluation schemes measure up in terms of the number of samples per class can be seen in \rev{Figure}~\ref{fig:data_eval_labels}.
\fig{data_eval_labels}{figures/data_eval_labels.png}{Pie charts visualizing the number of normal and anomalous labels applied to the dataset per labeling scheme. A large part of the experiment-based anomalous labels had to be removed for the manually-defined scheme, since, subjectively, they were either clearly or possibly not degraded.}
\fig{data_eval_labels}{figures/data_eval_labels.png}{Pie charts visualizing the number of normal and anomalous labels applied to the dataset for (a) experiment-based labeling scheme and (b) manually-defined labeling scheme. A large part of the experiment-based anomalous labels had to be removed for the manually-defined scheme, since, subjectively, they were either clearly or possibly not degraded.}
By evaluating and comparing both approaches, we hope to demonstrate a more thorough performance investigation than with only one of the two \rev{labeling schemes}.
@@ -569,16 +570,16 @@ In the following sections, we detail our adaptations to this framework:
\item Experimental environment: the hardware and software stack used, with typical training and inference runtimes.
\end{itemize}
Together, these components define the full experimental pipeline, from data loading, preprocessing, method training to the evaluation and comparing of methods.
Together, these components define the full experimental pipeline, from data loading, preprocessing, method training, to the evaluation and comparison of methods.
\section{Framework \& Data Preparation}
DeepSAD's PyTorch implementation—our starting point—includes implementations for training on standardized datasets such as MNIST, CIFAR-10 and datasets from \citetitle{odds}~\cite{odds}. The framework can train and test DeepSAD as well as a number of baseline algorithms, namely SSAD, OCSVM, Isolation Forest, KDE and SemiDGM with the loaded data and evaluate their performance by calculating the Receiver Operating Characteristic (ROC) and its Area Under the Curve (AUC) for all given algorithms. We adapted this implementation which was originally developed for Python 3.7 to work with Python 3.12 and changed or added \rev{functionality. We allowed loading data from our of} chosen dataset, added DeepSAD models that work with the \rev{LiDAR} projections datatype, added more evaluation methods and an inference module.
DeepSAD's PyTorch implementation—our starting point—includes support for training on standardized datasets such as MNIST, CIFAR-10, and datasets from \citetitle{odds}~\cite{odds}. The framework can train and test DeepSAD as well as a number of baseline algorithms, namely SSAD, OCSVM, Isolation Forest, KDE, and SemiDGM, with the loaded data and evaluate their performance by calculating the Receiver Operating Characteristic (ROC) and its Area Under the Curve (AUC) for all given algorithms. We adapted this implementation, which was originally developed for Python 3.7, to work with Python 3.12 and changed or added \rev{functionality. We allowed loading data from our} chosen dataset, added DeepSAD models that work with the \rev{LiDAR} projection data type, added more evaluation methods, and implemented an inference module.
The raw SubTER dataset is provided as one ROS bag file per experiment, each containing a dense 3D point cloud from the Ouster OS1-32 \rev{LiDAR}. To streamline training and avoid repeated heavy computation, we project these point clouds offline into 2D “range images” as described in \rev{Section}~\ref{sec:preprocessing} and export them to files as NumPy arrays. Storing precomputed projections allows rapid data loading during training and evaluation. Many modern \rev{LiDARs} can be configured to output range images directly which would bypass the need for post-hoc projection. When available, such native range-image streams can further simplify preprocessing or even allow skipping this step completely.
The raw SubTER dataset is provided as one ROS bag file per experiment, each containing a dense 3D point cloud from the Ouster OS1-32 \rev{LiDAR}. To streamline training and avoid repeated heavy computation, we project these point clouds offline into 2D “range images” as described in \rev{Section}~\ref{sec:preprocessing} and export them to files as NumPy arrays. Storing precomputed projections allows rapid data loading during training and evaluation. Many modern \rev{LiDARs} can be configured to output range images directly, which would bypass the need for post-hoc projection. When available, such native range-image streams can further simplify preprocessing or even allow skipping this step completely.
We extended the DeepSAD framework’s PyTorch \texttt{DataLoader} by implementing a custom \texttt{Dataset} class that ingests our precomputed NumPy range-image files and attaches appropriate evaluation labels. Each experiment’s frames are stored as individual \texttt{.npy} files with the numpy array shape \((\text{Number of Frames}, H, W)\), containing the point clouds' reciprocal range values. Our \texttt{Dataset} initializer scans a directory of these files, loads the NumPy arrays from file into memory, transforms them into PyTorch tensors and assigns evaluation and training labels accordingly.
We extended the DeepSAD framework’s PyTorch \texttt{DataLoader} by implementing a custom \texttt{Dataset} class that ingests our precomputed NumPy range-image files and attaches appropriate evaluation labels. Each experiment’s frames are stored as individual \texttt{.npy} files with the numpy array shape \((\text{Number of Frames}, H, W)\), containing the point clouds' reciprocal range values. Our \texttt{Dataset} initializer scans a directory of these files, loads the NumPy arrays from file into memory, transforms them into PyTorch tensors, and assigns evaluation and training labels accordingly.
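A condensed sketch of such a \texttt{Dataset} class; the class name is illustrative, and only the experiment-based label is attached here, whereas our actual implementation also returns the manually-defined and semi-supervised labels:

```python
from pathlib import Path

import numpy as np
import torch
from torch.utils.data import Dataset


class RangeImageDataset(Dataset):
    """Loads precomputed (num_frames, H, W) range-image stacks, one .npy file per experiment."""

    def __init__(self, root: str):
        frames, labels = [], []
        for npy_file in sorted(Path(root).glob("*.npy")):
            data = np.load(npy_file)                        # (num_frames, H, W) reciprocal ranges
            label = -1 if "smoke" in npy_file.name else 1   # experiment-based evaluation label
            frames.append(torch.from_numpy(data).float())
            labels.extend([label] * data.shape[0])
        self.frames = torch.cat(frames, dim=0).unsqueeze(1)  # (total_frames, 1, H, W)
        self.labels = torch.tensor(labels)

    def __len__(self):
        return self.frames.shape[0]

    def __getitem__(self, idx):
        return self.frames[idx], self.labels[idx], idx
```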
The first labeling scheme, called \emph{experiment-based labels}, assigns
\[
@@ -590,7 +591,7 @@ The first labeling scheme, called \emph{experiment-based labels}, assigns
\]
At load time, any file with “smoke” in its name is treated as anomalous (label \(-1\)), and all others (normal experiments) are labeled \(+1\).
To obtain a second source of ground truth, we also support \emph{manually-defined labels}. A companion JSON file specifies a start and end frame index for each of the four smoke experiments—defining the interval of unequivocal degradation. During loading the second label $y_{man}$ is assigned as follows:
To obtain a second source of ground truth, we also support \emph{manually-defined labels}. A companion JSON file specifies a start and end frame index for each of the four smoke experiments—defining the interval of unequivocal degradation. During loading, the second label $y_{man}$ is assigned as follows:
\[
y_{\mathrm{man}} =
@@ -601,19 +602,19 @@ To obtain a second source of ground truth, we also support \emph{manually-define
\end{cases}
\]
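A sketch of how both label sources could be assigned at load time. The JSON layout (experiment name mapped to a [start, end] frame interval) and the treatment of out-of-interval smoke frames (encoded as 0, i.e., excluded from the manually-defined evaluation) are our assumptions about details not fully shown in this excerpt:

```python
import json

def assign_labels(experiment_name: str, num_frames: int, intervals_path: str):
    """Return per-frame (experiment-based, manually-defined) evaluation labels for one experiment."""
    is_smoke = "smoke" in experiment_name
    y_exp = [-1 if is_smoke else 1] * num_frames       # experiment-based: whole smoke run is anomalous

    if not is_smoke:
        return y_exp, [1] * num_frames                 # normal experiments are normal in both schemes

    with open(intervals_path) as f:
        intervals = json.load(f)                       # assumed layout: {"<experiment>": [start, end]}
    start, end = intervals[experiment_name]

    # Inside the hand-picked interval: clearly degraded; outside: ambiguous and excluded (0).
    y_man = [-1 if start <= i <= end else 0 for i in range(num_frames)]
    return y_exp, y_man
```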
We pass instances of this \texttt{Dataset} to PyTorch’s \texttt{DataLoader}, enabling batch sampling, shuffling, and multi-worker loading. The dataloader returns the preprocessed \rev{LiDAR} projection, both evaluation labels and a semi-supervised training label.
We pass instances of this \texttt{Dataset} to PyTorch’s \texttt{DataLoader}, enabling batch sampling, shuffling, and multi-worker loading. The dataloader returns the preprocessed \rev{LiDAR} projection, both evaluation labels, and a semi-supervised training label.
To control the supervision of DeepSAD's training, our custom PyTorch \texttt{Dataset} accepts two integer parameters, \texttt{num\_labelled\_normal} and \texttt{num\_labelled\_anomalous}, which specify how many samples of each class should retain their labels during training. We begin with the manually-defined evaluation labels, to not use mislabeled anomalous frames for the semi-supervision. Then, we randomly un-label (set to 0) enough samples of each class until exactly \texttt{num\_labelled\_normal} normals and \texttt{num\_labelled\_anomalous} anomalies remain labeled.
To control the supervision of DeepSAD's training, our custom PyTorch \texttt{Dataset} accepts two integer parameters, \texttt{num\_labelled\_normal} and \texttt{num\_labelled\_anomalous}, which specify how many samples of each class should retain their labels during training. We begin with the manually-defined evaluation labels, so as not to use mislabeled anomalous frames for the semi-supervision. Then, we randomly un-label (set to 0) enough samples of each class until exactly \texttt{num\_labelled\_normal} normals and \texttt{num\_labelled\_anomalous} anomalies remain labeled.
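The un-labeling step itself is straightforward; a sketch assuming DeepSAD's label convention of $+1$ (labeled normal), $-1$ (labeled anomalous), and $0$ (unlabeled):

```python
import numpy as np

def subsample_semi_labels(labels: np.ndarray,
                          num_labelled_normal: int,
                          num_labelled_anomalous: int,
                          seed: int = 0) -> np.ndarray:
    """Keep a fixed number of labeled samples per class; set all remaining labels to 0 (unlabeled)."""
    rng = np.random.default_rng(seed)
    semi = labels.copy()
    for cls, keep in ((1, num_labelled_normal), (-1, num_labelled_anomalous)):
        idx = np.flatnonzero(labels == cls)
        drop = rng.choice(idx, size=max(len(idx) - keep, 0), replace=False)
        semi[drop] = 0
    return semi
```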
To obtain robust performance estimates on our relatively small dataset, we implement $k$-fold cross-validation. A single integer parameter, \texttt{num\_folds}, controls the number of splits. We use scikit-learn’s \texttt{KFold} (from \texttt{sklearn.model\_selection}) with \texttt{shuffle=True} and a fixed random seed to partition each experiment’s frames into \texttt{num\_folds} disjoint folds. Training then proceeds across $k$ rounds, each time training on $(k-1)/k$ of the data and evaluating on the remaining $1/k$. In our experiments, we set \texttt{num\_folds=5}, yielding an 80/20 train/evaluation split per fold.
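A minimal sketch of the corresponding split generation with scikit-learn (the seed value 42 is a stand-in for the fixed seed we use):

```python
import numpy as np
from sklearn.model_selection import KFold

num_frames = 17658   # total number of frames in the dataset
num_folds = 5

kfold = KFold(n_splits=num_folds, shuffle=True, random_state=42)
for fold, (train_idx, eval_idx) in enumerate(kfold.split(np.arange(num_frames))):
    # each round trains on ~80% of the frames and evaluates on the held-out ~20%
    print(f"fold {fold}: {len(train_idx)} train / {len(eval_idx)} eval")
```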
For inference (i.e.\ model validation on held-out experiments), we provide a second \texttt{Dataset} class that loads a single experiment's NumPy file (no k-fold splitting), does not assign any labels to the frames nor does it shuffle frames, preserving temporal order. This setup enables seamless, frame-by-frame scoring of complete runs—crucial for analyzing degradation dynamics over an entire experiment.
For inference (i.e., model validation on held-out experiments), we provide a second \texttt{Dataset} class that loads a single experiment's NumPy file (no k-fold splitting), does not assign any labels to the frames, nor does it shuffle frames, preserving temporal order. This setup enables seamless, frame-by-frame scoring of complete runs—crucial for analyzing degradation dynamics over an entire experiment.
\section{Model Configuration}
Since the neural network architecture trained in the \rev{DeepSAD} method is not fixed as described in \rev{Section}~\ref{sec:algorithm_details} but rather chosen based on the input data, we also had to choose an autoencoder architecture befitting our preprocessed \rev{LiDAR} data projections. Since \rev{\cite{degradation_quantification_rain}} reported success in training DeepSAD on similar data we firstly adapted the network architecture utilized by them for our use case, which is based on the simple and well understood LeNet architecture~\cite{lenet}. Additionally we were interested in evaluating the importance and impact of a well-suited network architecture for DeepSAD's performance and therefore designed a second network architecture henceforth \rev{refered} to as "efficient architecture" to incorporate a few modern techniques, befitting our use case.
Since the neural network architecture trained in the \rev{DeepSAD} method is not fixed as described in \rev{Section}~\ref{sec:algorithm_details} but rather chosen based on the input data, we also had to choose an autoencoder architecture befitting our preprocessed \rev{LiDAR} data projections. Since \rev{\cite{degradation_quantification_rain}} reported success in training DeepSAD on similar data, we first adapted the network architecture utilized by them for our use case, which is based on the simple and well-understood LeNet architecture~\cite{lenet}. Additionally, we were interested in evaluating the importance and impact of a well-suited network architecture for DeepSAD's performance and therefore designed a second network architecture, henceforth \rev{referred} to as "efficient architecture", to incorporate a few modern techniques, befitting our use case.
The LeNet-inspired autoencoder can be split into an encoder network (\rev{Figure}~\ref{fig:setup_arch_lenet_encoder}) and a decoder network (\rev{Figure}~\ref{fig:setup_arch_lenet_decoder}) with a latent space \rev{in between} the two parts. Such an arrangement is typical for autoencoder architectures as we discussed in \rev{Section}~\ref{sec:autoencoder}. The encoder network is simultaneously DeepSAD's main training architecture which is used to infer the degradation quantification in our use case, once trained.
The LeNet-inspired autoencoder can be split into an encoder network (\rev{Figure}~\ref{fig:setup_arch_lenet_encoder}) and a decoder network (\rev{Figure}~\ref{fig:setup_arch_lenet_decoder}) with a latent space \rev{in between} the two parts. Such an arrangement is typical for autoencoder architectures, as we discussed in \rev{Section}~\ref{sec:autoencoder}. The encoder network is simultaneously DeepSAD's main training architecture, which is used to infer the degradation quantification in our use case, once trained.
\figc{setup_arch_lenet_encoder}{diagrams/arch_lenet_encoder}{
Architecture of the LeNet-inspired encoder. The input is a \rev{LiDAR} range image of size
@@ -637,7 +638,7 @@ The LeNet-inspired encoder network (see \rev{Figure}~\ref{fig:setup_arch_lenet_e
$4\times 512\times 8$ (channels $\times$ width $\times$ height).
The first upsampling stage applies interpolation with scale factor 2, followed by a
transpose convolution with 8 output channels, batch normalization, and LeakyReLU activation,
yielding $8\times 1024\times 16$. The second stage again upsamples by factor 2 and applies
yielding $8\times 1024\times 16$. The second stage again upsamples by a factor of 2 and applies
a transpose convolution, reducing the channels to 1. This produces the reconstructed output
of size $1\times 2048\times 32$, which matches the original input dimensionality required
for the autoencoding objective.
@@ -645,14 +646,14 @@ The LeNet-inspired encoder network (see \rev{Figure}~\ref{fig:setup_arch_lenet_e
The decoder network (see \rev{Figure}~\ref{fig:setup_arch_lenet_decoder}) mirrors the encoder and reconstructs the input from its latent representation. A dense layer first expands the latent vector into a feature map of shape $4\times 512\times 8$, which is then upsampled and refined in two successive stages. Each stage consists of an interpolation step that doubles the spatial resolution, followed by a transpose convolution that learns how to add structural detail. The first stage operates on 4 channels, and the second on 8 channels, with the final transpose convolution reducing the output to a single channel. The result is a reconstructed output of size $1\times 2048\times 32$, matching the original input dimensionality required for the autoencoding objective.
Even though the LeNet-inspired encoder proved capable of achieving our degradation quantification objective in initial experiments, we identified several shortcomings that motivated the design of a second, more efficient architecture. The most important issue concerns the shape of the CNN's receptive field (RF) which describes the region of the input that influences a single output activation. Its size and aspect ratio determine which structures the network can effectively capture: if the RF is too small, larger patterns cannot be detected, while an excessively large RF may hinder the network from learning to recognize fine details. For standard image data, the RF is often expressed as a symmetric $n \times n$ region, but in principle it can be computed independently per axis.
Even though the LeNet-inspired encoder proved capable of achieving our degradation quantification objective in initial experiments, we identified several shortcomings that motivated the design of a second, more efficient architecture. The most important issue concerns the shape of the CNN's receptive field (RF), which describes the region of the input that influences a single output activation. Its size and aspect ratio determine which structures the network can effectively capture: if the RF is too small, larger patterns cannot be detected, while an excessively large RF may hinder the network from learning to recognize fine details. For standard image data, the RF is often expressed as a symmetric $n \times n$ region, but in principle it can be computed independently per axis.
%\figc{setup_ef_concept}{figures/setup_ef_concept}{Receptive fields in a CNN. Each output activation aggregates information from a region of the input; stacking layers expands this region, while kernel size, stride, and padding control how quickly it grows and what shape it takes. (A) illustrates slower, fine-grained growth; (B) shows faster expansion, producing a larger—potentially anisotropic—receptive field and highlighting the trade-off between detail and context. Reproduced from~\cite{ef_concept_source}}{width=.6\textwidth}
The RF shape's issue arises from the fact that spinning multi-beam \rev{LiDAR} oftentimes produce point clouds posessing dense horizontal but limited vertical resolution. In our \rev{case, this} results in a pixel-per-degree resolution of approximately $5.69\,\sfrac{pixel}{deg}$ vertically and $1.01\,\sfrac{pixel}{deg}$ horizontally. Consequently, the LeNet-inspired encoder’s calculated receptive field of $16 \times 16$ pixels translates to an angular size of $15.88^{\circ} \times 2.81^{\circ}$, which is highly rectangular in angular space. Such a mismatch risks limiting the network’s ability to capture degradation patterns that extend differently across the two axes.
The RF shape issue arises from the fact that spinning multi-beam \rev{LiDARs} oftentimes produce point clouds possessing dense horizontal but limited vertical resolution. In our \rev{case, this} results in a pixel-per-degree resolution of approximately $1.01\,\sfrac{pixel}{deg}$ vertically and $5.69\,\sfrac{pixel}{deg}$ horizontally. Consequently, the LeNet-inspired encoder’s calculated receptive field of $16 \times 16$ pixels translates to an angular size of $15.88^{\circ} \times 2.81^{\circ}$ (vertical $\times$ horizontal), which is highly rectangular in angular space. Such a mismatch risks limiting the network’s ability to capture degradation patterns that extend differently across the two axes.
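The angular receptive-field sizes quoted here and for the efficient architecture below are simple unit conversions; a short sketch:

```python
# Convert a receptive field given in pixels into degrees, per axis.
PX_PER_DEG_VERTICAL = 1.01     # 32 beams over the sensor's vertical field of view
PX_PER_DEG_HORIZONTAL = 5.69   # 2048 measurements over 360 degrees of azimuth

def rf_degrees(rf_px_vertical: int, rf_px_horizontal: int) -> tuple[float, float]:
    return (rf_px_vertical / PX_PER_DEG_VERTICAL,
            rf_px_horizontal / PX_PER_DEG_HORIZONTAL)

print(rf_degrees(16, 16))   # LeNet-inspired encoder: roughly (15.8, 2.8) degrees
print(rf_degrees(10, 52))   # efficient encoder:      roughly (9.9, 9.1) degrees
```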
To adjust for this, we decided to modify the network architecture and included further modificatons to improve the method's performance. The encoder (see \rev{Figure}~\ref{fig:setup_arch_ef_encoder}) follows the same general idea as the LeNet-inspired encoder, but incorporates the following modificatons:
To adjust for this, we decided to modify the network architecture and included further modifications to improve the method's performance. The encoder (see \rev{Figure}~\ref{fig:setup_arch_ef_encoder}) follows the same general idea as the LeNet-inspired encoder, but incorporates the following modifications:
\begin{itemize}
\item \textbf{Non-square convolution kernels.} Depthwise-separable convolutions with kernel size $3 \times 17$ are used instead of square kernels, resulting in an RF of $10 \times 52$ pixels, corresponding to $9.93^{\circ} \times 9.14^{\circ}$, substantially more balanced than the LeNet-inspired network's RF.
\item \textbf{Circular padding along azimuth.} The horizontal axis is circularly padded to respect the wrap-around of $360^{\circ}$ \rev{LiDAR} data, preventing artificial seams at the image boundaries.
@@ -684,7 +685,7 @@ The decoder (see \rev{Figure}~\ref{fig:setup_arch_ef_decoder}) mirrors the encod
\begin{itemize}
\item \textbf{Nearest-neighbor upsampling followed by convolution.} Instead of relying solely on transposed convolutions, each upsampling stage first enlarges the feature map using parameter-free nearest-neighbor interpolation, followed by a depthwise-separable convolution. This strategy reduces the risk of checkerboard artifacts while still allowing the network to learn fine detail.
\item \textbf{Asymmetric upsampling schedule.} Horizontal resolution is restored more aggressively (e.g., scale factor $1 \times 4$) to reflect the anisotropic downsampling performed in the encoder.
\item \textbf{Final convolution with circular padding.} The output is generated using a $(3 \times 17)$ convolution with circular padding along the azimuth similar to the new encoder, ensuring consistent treatment of the 360° \rev{LiDAR} input.
\item \textbf{Final convolution with circular padding.} The output is generated using a $(3 \times 17)$ convolution with circular padding along the azimuth, similar to the new encoder, ensuring consistent treatment of the 360° \rev{LiDAR} input.
\end{itemize}
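The recurring building block behind several of the listed encoder and decoder modifications, a depthwise-separable convolution with a 3x17 kernel and circular padding along the azimuth, can be sketched in PyTorch as follows; the channel counts are placeholders rather than the exact values of our efficient architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AzimuthCircularSeparableConv(nn.Module):
    """Depthwise-separable 3x17 convolution with circular padding along the horizontal (azimuth) axis."""

    def __init__(self, in_ch: int, out_ch: int, kernel=(3, 17)):
        super().__init__()
        self.pad_h = kernel[0] // 2   # vertical (elevation) padding, zero-padded
        self.pad_w = kernel[1] // 2   # horizontal (azimuth) padding, circular 360-degree wrap-around
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel, groups=in_ch, padding=0)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Wrap the azimuth axis circularly, then zero-pad the elevation axis.
        x = F.pad(x, (self.pad_w, self.pad_w, 0, 0), mode="circular")
        x = F.pad(x, (0, 0, self.pad_h, self.pad_h), mode="constant", value=0.0)
        return self.pointwise(self.depthwise(x))


x = torch.randn(1, 1, 32, 2048)            # (batch, channels, elevation, azimuth)
y = AzimuthCircularSeparableConv(1, 8)(x)  # -> (1, 8, 32, 2048), spatial size preserved
print(y.shape)
```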
\fig{setup_arch_ef_decoder}{diagrams/arch_ef_decoder}{
@@ -703,7 +704,7 @@ The decoder (see \rev{Figure}~\ref{fig:setup_arch_ef_decoder}) mirrors the encod
}
To compare the computational efficiency of the two architectures we show the number of trainable parameters and the number of multiply–accumulate operations (MACs) for different latent space sizes used in our experiments in \rev{Table}~\ref{tab:params_lenet_vs_efficient}. Even though the efficient architecture employs more layers and channels which allows the network to learn to recognize more types of patterns when compared to the LeNet-inspired one, the encoders' MACs are quite similar. The more complex decoder design of the efficient network appears to contribute a lot more MACs, which leads to longer pretraining times which we report in \rev{Section}~\ref{sec:setup_experiments_environment}.
To compare the computational efficiency of the two architectures, we show the number of trainable parameters and the number of multiply–accumulate operations (MACs) for different latent space sizes used in our experiments in \rev{Table}~\ref{tab:params_lenet_vs_efficient}. Even though the efficient architecture employs more layers and channels, which allows the network to learn to recognize more types of patterns when compared to the LeNet-inspired one, the encoders' MACs are quite similar. The more complex decoder design of the efficient network appears to contribute a lot more MACs, which leads to longer pretraining times, which we report in \rev{Section}~\ref{sec:setup_experiments_environment}.
%& \multicolumn{4}{c}{\textbf{Encoders}} & \multicolumn{4}{c}{\rev{\textbf{Autoencoders (Encoder $+$ Decoder)}}} \\
@@ -750,9 +751,9 @@ To contextualize the performance of DeepSAD, we compare against two widely used
\paragraph{Isolation Forest} is an ensemble method for anomaly detection that builds on the principle that anomalies are easier to separate from the rest of the data. It constructs many binary decision trees, each by recursively splitting the data at randomly chosen features and thresholds. In this process, the “training” step consists of building the forest of trees: each tree captures different random partitions of the input space, and together they form a diverse set of perspectives on how easily individual samples can be isolated.
Once trained, the method assigns an anomaly score to new samples by measuring their average path length through the trees. Normal samples, being surrounded by other similar samples, typically require many recursive splits and thus end up deep in the trees. Anomalies, by contrast, stand out in one or more features, which means they can be separated much earlier and end up closer to the root. The shorter the average path length, the more anomalous the sample is considered. This makes Isolation Forest highly scalable and robust: training is efficient and the resulting model is fast to apply to new data. In our setup, we apply Isolation Forest directly to the \rev{LiDAR} input representation, providing a strong non-neural baseline for comparison against DeepSAD.
Once trained, the method assigns an anomaly score to new samples by measuring their average path length through the trees. Normal samples, being surrounded by other similar samples, typically require many recursive splits and thus end up deep in the trees. Anomalies, by contrast, stand out in one or more features, which means they can be separated much earlier and end up closer to the root. The shorter the average path length, the more anomalous the sample is considered. This makes Isolation Forest highly scalable and robust: training is efficient, and the resulting model is fast to apply to new data. In our setup, we apply Isolation Forest directly to the \rev{LiDAR} input representation, providing a strong non-neural baseline for comparison against DeepSAD.
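A sketch of this baseline with scikit-learn, fitting on flattened range images; the random data stands in for real projections and the hyperparameters are library defaults rather than our tuned values:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-ins for flattened range images of shape (num_frames, 32 * 2048).
X_train = np.random.rand(256, 32 * 2048).astype(np.float32)
X_test = np.random.rand(64, 32 * 2048).astype(np.float32)

iso = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# score_samples is higher for normal-looking data, so negate it to obtain an anomaly score.
anomaly_scores = -iso.score_samples(X_test)
```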
\paragraph{OCSVM} takes a very different approach by learning a flexible boundary around normal samples. It assumes all training data to be normal, with the goal of enclosing the majority of these samples in such a way that new points lying outside this boundary can be identified as anomalies. The boundary itself is learned using the support vector machine framework. In essence, OCSVM looks for a hyperplane in some feature space that maximizes the separation between the bulk of the data and the origin. To make this possible even when the normal data has a complex, curved shape, OCSVM uses a kernel function such as the radial basis function (RBF). The kernel implicitly maps the input data into a higher-dimensional space, where the cluster of normal samples becomes easier to separate with a simple hyperplane. When this separation is mapped back to the original input space, it corresponds to a flexible, nonlinear boundary that can adapt to the structure of the data.
\paragraph{OCSVM} takes a very different approach by learning a flexible boundary around normal samples. It assumes all training data to be normal, with the goal of enclosing the majority of these samples in such a way that new points lying outside this boundary can be identified as anomalies. The boundary itself is learned using the support vector machine framework. In essence, OCSVM looks for a hyperplane in some feature space that maximizes the separation between the bulk of the data and the origin. To make this possible, even when the normal data has a complex, curved shape, OCSVM uses a kernel function such as the radial basis function (RBF). The kernel implicitly maps the input data into a higher-dimensional space, where the cluster of normal samples becomes easier to separate with a simple hyperplane. When this separation is mapped back to the original input space, it corresponds to a flexible, nonlinear boundary that can adapt to the structure of the data.
During training, the algorithm balances two competing objectives: capturing as many of the normal samples as possible inside the boundary, while keeping the region compact enough to exclude potential outliers. Once this boundary is established, applying OCSVM is straightforward — any new data point is checked against the learned boundary, with points inside considered normal and those outside flagged as anomalous.
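The corresponding OCSVM baseline can be sketched analogously; in our pipeline it operates on the lower-dimensional encoder features (see the runtime discussion later in this chapter), and the gamma and nu values below are illustrative only:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Stand-ins for latent features of (assumed normal) training frames from the pretrained encoder.
Z_train = np.random.randn(256, 32).astype(np.float32)
Z_test = np.random.randn(64, 32).astype(np.float32)

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(Z_train)

# decision_function is positive inside the learned boundary (normal) and negative outside.
anomaly_scores = -ocsvm.decision_function(Z_test)
```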
@@ -774,9 +775,9 @@ To address this, we instead rely on Precision–Recall Curves (PRC)~\cite{prc},
\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}.
\]
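Both the curve and its AP summary are available directly from scikit-learn; a minimal sketch, assuming higher scores mean more anomalous and the anomalous class is encoded as 1:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])                     # 1 = anomalous (positive class)
scores = np.array([0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.8, 0.35])  # higher = more anomalous

precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)   # area under the PR curve
auc = roc_auc_score(y_true, scores)            # ROC AUC for comparison
print(f"AP = {ap:.3f}, ROC AUC = {auc:.3f}")
```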
In our evaluation, this distinction proved practically significant. As illustrated in Figure~\ref{fig:roc_vs_prc}, ROC AUC values for Isolation Forest and DeepSAD appear similarly strong (0.693 vs. 0.782), suggesting comparable performance. However, the PRC reveals a clear divergence: while DeepSAD maintains high precision across recall levels, Isolation Forest suffers a steep decline in precision as recall increases, due to a high number of false positives. The resulting Average Precision (AP)—the area under the PRC—is much lower for Isolation Forest (0.207 vs. 0.633), offering a more realistic account of its performance under imbalance.
In our evaluation, this distinction proved practically significant. As illustrated in Figure~\ref{fig:roc_vs_prc}, ROC AUC values in (a) appear similarly strong for both Isolation Forest and DeepSAD (0.693 vs. 0.782), suggesting comparable performance. However, the PRC in (b) reveals a clear divergence: while DeepSAD maintains high precision across recall levels, Isolation Forest suffers a steep decline in precision as recall increases, due to a high number of false positives. The resulting Average Precision (AP)—the area under the PRC—is much lower for Isolation Forest (0.207 vs. 0.633), offering a more realistic account of its performance under imbalance.
\figc{roc_vs_prc}{figures/setup_roc_vs_prc.png}{Comparison of ROC and PRC for the same evaluation run. ROC fails to reflect the poor performance of Isolation Forest, which misclassifies many normal samples as anomalous at lower thresholds. The PRC exposes this effect, resulting in a substantially lower AP for Isolation Forest than for DeepSAD.}{width=.9\textwidth}
\figc{roc_vs_prc}{figures/setup_roc_vs_prc.png}{Comparison of ROC (a) and PRC (b) for the same evaluation run. ROC fails to reflect the poor performance of Isolation Forest, which misclassifies many normal samples as anomalous at lower thresholds. The PRC exposes this effect, resulting in a substantially lower AP for Isolation Forest than for DeepSAD.}{width=.9\textwidth}
In addition to cross-validated performance comparisons, we also apply the trained models to previously unseen, temporally ordered experiments to simulate inference in realistic conditions. Since each method produces scores on a different scale—with different signs and ranges—raw scores are not directly comparable. To enable comparison, we compute a $z$-score~\cite{zscore} per frame, defined as the number of standard deviations a score deviates from the mean of the normal data. To perform the normalization, we compute the mean and standard deviation of anomaly scores on a clean reference experiment. These values are then used to normalize scores from degraded experiments, making it easy to see how much each method's output deviates from its own baseline under degradation. It also facilitates a unified view across methods, even though their outputs are otherwise heterogeneous. In this way, $z$-score normalization supports threshold-free interpretation and enables consistent model comparison during inference.
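A sketch of this normalization step; the reference statistics come from one smoke-free experiment per method:

```python
import numpy as np

def zscore_normalize(scores: np.ndarray, reference_scores: np.ndarray) -> np.ndarray:
    """Express anomaly scores as standard deviations above the mean score of a clean reference run."""
    return (scores - reference_scores.mean()) / reference_scores.std()

# reference_scores: per-frame scores of one smoke-free experiment (computed per method)
# zscore_normalize(degraded_scores, reference_scores) then makes the methods comparable at inference time
```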
@@ -785,9 +786,9 @@ In conclusion, the combination of unreliable thresholds and pronounced class imb
\newsection{setup_experiments_environment}{Experiment Overview \& Computational Environment}
Across all experiments we vary three factors: (i) latent space dimensionality, (ii) encoder architecture (LeNet-inspired vs. Efficient), and (iii) the amount of semi-supervision (labeling regime). To keep results comparable, we fix the remaining training hyperparameters: all autoencoders are pretrained for $E_A = 50$~epochs with ADAM as an optimzer at a starting learning rate of $L_A = 1\cdot 10^{-5}$; all DeepSAD models are then trained for $E_M = 150$~epochs with the same optimizer and starting learning rate ($L_M = 1\cdot 10^{-5}$). The DeepSAD label-weighting parameter is kept at $\eta = 1$ and the regularization rate at $\lambda = 1\cdot 10^{-6}$ for all runs. Every configuration is evaluated with 5-fold cross-validation, and we report fold means.
Across all experiments, we vary three factors: (i) latent space dimensionality, (ii) encoder architecture (LeNet-inspired vs. Efficient), and (iii) the amount of semi-supervision (labeling regime). To keep results comparable, we fix the remaining training hyperparameters: all autoencoders are pretrained for $E_A = 50$~epochs with ADAM as an optimizer at a starting learning rate of $L_A = 1\cdot 10^{-5}$; all DeepSAD models are then trained for $E_M = 150$~epochs with the same optimizer and starting learning rate ($L_M = 1\cdot 10^{-5}$). The DeepSAD label-weighting parameter is kept at $\eta = 1$ and the regularization rate at $\lambda = 1\cdot 10^{-6}$ for all runs. Every configuration is evaluated with 5-fold cross-validation, and we report fold means.
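For reference, the fixed settings condensed into one place; the mapping of the regularization rate to Adam's weight decay follows the reference implementation and is an assumption here:

```python
import torch

# Shared hyperparameters used for every run (symbols as in the text).
PRETRAIN_EPOCHS = 50        # E_A
TRAIN_EPOCHS = 150          # E_M
LEARNING_RATE = 1e-5        # L_A = L_M
ETA = 1.0                   # DeepSAD label-weighting parameter
WEIGHT_DECAY = 1e-6         # regularization rate lambda, applied as Adam weight decay (assumed)
NUM_FOLDS = 5

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    return torch.optim.Adam(model.parameters(), lr=LEARNING_RATE, weight_decay=WEIGHT_DECAY)
```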
We first search over the latent bottleneck size by pretraining autoencoders only. For both encoder backbones, we evaluate latent sizes $32, 64, 128, 256, 512, 768,$ and $1024$. The goal is to identify compact yet expressive representations and to compare the autoencoding performance between the two network architectures LeNet-inspired and Efficient. Additionally, we are interested in finding possible correlations between the autoencoder performance and the DeepSAD anomaly detection performance.
We first search over the latent bottleneck size by pretraining autoencoders only. For both encoder backbones, we evaluate latent sizes $32, 64, 128, 256, 512, 768,$ and $1024$. The goal is to identify compact yet expressive representations and to compare the autoencoding performance between the two network architectures, LeNet-inspired and Efficient. Additionally, we are interested in finding possible correlations between the autoencoder performance and the DeepSAD anomaly detection performance.
Using the same latent sizes and backbones, we train full DeepSAD models initialized from the pretrained encoders. We study three supervision regimes, from unsupervised to strongly supervised (see Table~\ref{tab:labeling_regimes} for proportions within the training folds):
\begin{itemize}
@@ -795,7 +796,7 @@ Using the same latent sizes and backbones, we train full DeepSAD models initiali
\item \textbf{Low supervision:} $(50,10)$ labeled samples.
\item \textbf{High supervision:} $(500,100)$ labeled samples.
\end{itemize}
Percentages in Table~\ref{tab:labeling_regimes} are computed relative to the training split of each fold (80\% of the data) from the experiment-based labeling scheme. Importantly, for semi-supervised labels we \emph{only} use hand-selected, unambiguous smoke intervals from the manually-defined evaluation scheme, to avoid injecting mislabeled data into training.
Percentages in Table~\ref{tab:labeling_regimes} are computed relative to the training split of each fold (80\% of the data) from the experiment-based labeling scheme. Importantly, for semi-supervised labels, we \emph{only} use hand-selected, unambiguous smoke intervals from the manually-defined evaluation scheme to avoid injecting mislabeled data into training.
\begin{table}
\centering
@@ -901,7 +902,7 @@ Pretraining runtimes for the autoencoders are reported in Table~\ref{tab:ae_pret
\end{tabularx}
\end{table}
The full DeepSAD training times are shown in Table~\ref{tab:train_runtimes_compact}, alongside the two classical baselines Isolation Forest and OCSVM. Here the contrast between methods is clear: while DeepSAD requires on the order of 15–20 minutes of GPU training per configuration and fold, both baselines complete training in seconds on CPU. The OCSVM training can only be this fast due to the reduced input dimensionality from utilizing DeepSAD's pretraining encoder as a preprocessing step, although other dimensionality reduction methods may also be used which could require less computational resources for this step.
The full DeepSAD training times are shown in Table~\ref{tab:train_runtimes_compact}, alongside the two classical baselines, Isolation Forest and OCSVM. Here, the contrast between methods is clear: while DeepSAD requires on the order of 15–20 minutes of GPU training per configuration and fold, both baselines complete training in seconds on CPU. The OCSVM training can only be this fast due to the reduced input dimensionality from utilizing DeepSAD's pretraining encoder as a preprocessing step, although other dimensionality reduction methods may also be used, which could require less computational resources for this step.
\begin{table}
\centering
@@ -952,7 +953,7 @@ Together, these results provide a comprehensive overview of the computational re
\newchapter{results_discussion}{Results and Discussion}
The \rev{evaluation experiments which the setup in in Chapter~\ref{chp:experimental_setup} described,} are presented in this chapter. We begin in Section~\ref{sec:results_pretraining} with the pretraining stage, where the two autoencoder architectures were trained across multiple latent space dimensionalities. These results provide insight into the representational capacity of each architecture. In Section~\ref{sec:results_deepsad}, we turn to the main experiments: training DeepSAD models and benchmarking them against baseline algorithms (Isolation Forest and OCSVM). Finally, in Section~\ref{sec:results_inference}, we present inference results on \rev{data} that were held out during training. These plots illustrate how the algorithms behave when applied sequentially to unseen \rev{data}, offering a more practical perspective on their potential for real-world rescue robotics applications.
In this chapter, we present the \rev{evaluation experiments, based on the experimental setup described in Chapter~\ref{chp:experimental_setup}}. We begin in Section~\ref{sec:results_pretraining} with the pretraining stage, where the two autoencoder architectures were trained across multiple latent space dimensionalities. These results provide insight into the representational capacity of each architecture. In Section~\ref{sec:results_deepsad}, we turn to the main experiments: training DeepSAD models and benchmarking them against baseline algorithms (Isolation Forest and OCSVM). Finally, in Section~\ref{sec:results_inference}, we present inference results on \rev{data} that were held out during training. These plots illustrate how the algorithms behave when applied sequentially to unseen \rev{data}, offering a more practical perspective on their potential for real-world rescue robotics applications.
% --- Section: Autoencoder Pretraining Results ---
\newsection{results_pretraining}{Autoencoder Pretraining Results}
@@ -995,7 +996,7 @@ Due to the challenges of ground truth quality, evaluation results must be interp
\item \textbf{Manually-defined labels:} A cleaner ground truth, containing only clearly degraded frames. This removes mislabeled intervals and allows nearly perfect separation. However, it also simplifies the task too much, because borderline cases are excluded.
\end{itemize}
Table~\ref{tab:results_ap} summarizes average precision (AP) across latent dimensions, labeling regimes, and methods. Under experiment-based evaluation, both DeepSAD variants consistently outperform the baselines, reaching AP values around 0.60–0.66 compared to 0.21 for \rev{the} Isolation Forest and 0.31–0.49 for OCSVM. Under manually-defined evaluation, DeepSAD achieves nearly perfect AP in all settings, while the baselines remain much lower. This contrast shows that the lower AP under experiment-based evaluation is not a weakness of DeepSAD itself, but a direct result of mislabeled samples in the evaluation data. The manually-defined scheme therefore confirms that DeepSAD separates clearly normal from clearly degraded frames very well, while also highlighting that label noise must be kept in mind when interpreting the experiment-based results.
Table~\ref{tab:results_ap} summarizes average precision (AP) across latent dimensions, labeling regimes, and methods. Under experiment-based evaluation, both DeepSAD variants consistently outperform the baselines, reaching AP values around 0.60–0.66 compared to 0.21 for \rev{the} Isolation Forest and 0.31–0.49 for OCSVM. Under manually-defined evaluation, DeepSAD achieves nearly perfect AP in all settings, while the baselines remain much lower. This contrast shows that the lower AP under experiment-based evaluation is not a weakness of DeepSAD itself, but a direct result of mislabeled samples in the evaluation data. Therefore, the manually-defined scheme confirms that DeepSAD separates clearly normal from clearly degraded frames very well, while also highlighting that label noise must be kept in mind when interpreting the experiment-based results.
\begin{table}
\centering
@@ -1045,31 +1046,31 @@ The precision--recall curves \rev{for experiment-based evaluation} (Figure~\ref{
Taken together, the two evaluation schemes provide complementary insights. The experiment-based labels offer a noisy but realistic setting that shows how methods cope with ambiguous data, while the manually-defined labels confirm that DeepSAD can achieve nearly perfect separation when the ground truth is clean. The combination of both evaluations makes clear that (i) DeepSAD is stronger than the baselines under both conditions, (ii) the apparent performance limits under experiment-based labels are mainly due to label noise, and (iii) interpreting results requires care, since performance drops in the curves often reflect mislabeled samples rather than model failures. At the same time, both schemes remain binary classifications and therefore cannot directly evaluate the central question of whether anomaly scores can serve as a continuous measure of degradation. For this reason, we extend the analysis in Section~\ref{sec:results_inference}, where inference on entire unseen experiments is used to provide a more intuitive demonstration of the methods’ potential for quantifying \rev{LiDAR} degradation in practice.
\fig{prc_representative}{figures/results_prc.png}{Representative precision–recall curves over all latent dimensionalities for semi-labeling regime 0/0 from experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OCSVM declines smoother but exhibits high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
\fig{prc_representative}{figures/results_prc.png}{Representative precision–recall curves (a)–(g) over all latent dimensionalities 32–1024 for the semi-labeling regime 0/0 from experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OCSVM declines more smoothly but exhibits a high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
\FloatBarrier
\paragraph{Effect of latent space dimensionality.}
During autoencoder pretraining we observed that reconstruction loss decreased monotonically with larger latent spaces, as expected: a bigger bottleneck allows the encoder–decoder to retain more information. If autoencoder performance were directly predictive of DeepSAD performance, we would therefore expect average precision to improve with larger latent dimensions. The actual results, however, show the opposite trend (Figure~\ref{fig:latent_dim_ap}): compact latent spaces (32–128) achieve the highest AP, while performance declines as the latent size grows. This inverse correlation is most clearly visible in the unsupervised case. Part of this effect can be attributed to evaluation label noise, which larger spaces amplify. More importantly, it shows that autoencoder performance does not translate directly into DeepSAD performance. Pretraining losses can still help compare different architectures for robustness, and performance but they cannot be used to tune the latent dimensionality: the dimensionality that minimizes reconstruction loss in pretraining is not necessarily the one that maximizes anomaly detection performance in DeepSAD.
During autoencoder pretraining, we observed that reconstruction loss decreased monotonically with larger latent spaces, as expected: a bigger bottleneck allows the encoder–decoder to retain more information. If autoencoder performance were directly predictive of DeepSAD performance, we would therefore expect average precision to improve with larger latent dimensions. The actual results, however, show the opposite trend (Figure~\ref{fig:latent_dim_ap}): compact latent spaces (32–128) achieve the highest AP, while performance declines as the latent size grows. This inverse correlation is most clearly visible in the unsupervised case. Part of this effect can be attributed to evaluation label noise, which larger spaces amplify. More importantly, it shows that autoencoder performance does not translate directly into DeepSAD performance. Pretraining losses can still help compare different architectures for robustness and performance, but they cannot be used to tune the latent dimensionality: the dimensionality that minimizes reconstruction loss in pretraining is not necessarily the one that maximizes anomaly detection performance in DeepSAD.
% \paragraph{Effect of latent space dimensionality.}
% Figure~\ref{fig:latent_dim_ap} shows how average precision changes with latent dimension under the experiment-based evaluation. The best performance is reached with compact latent spaces (32–128), while performance drops as the latent dimension grows. This can be explained by how the latent space controls the separation between normal and anomalous samples. Small bottlenecks act as a form of regularization, keeping the representation compact and making it easier to distinguish clear anomalies from normal frames. Larger latent spaces increase model capacity, but this extra flexibility also allows more overlap between normal frames and the mislabeled anomalies from the evaluation data. As a result, the model struggles more to keep the two groups apart.
%
% This effect is clearly visible in the precision--recall curves. For DeepSAD at all dimensionalities we observe high initial precision and a steep drop once the evaluation demands that mislabeled anomalies be included. However, the sharpness of this drop depends on the latent size: at 32 dimensions the fall is comparably more gradual, while at 1024 it is almost vertical. In practice, this means that higher-dimensional latent spaces amplify the label-noise problem and lead to sudden precision collapses once the clear anomalies have been detected. Compact latent spaces are therefore more robust under noisy evaluation conditions and appear to be the safer choice for real-world deployment.
\figc{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows inverse correlation between AP and latent space size.}{width=.7\textwidth}
\figc{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows an inverse correlation between AP and latent space size.}{width=.7\textwidth}
\FloatBarrier
\paragraph{Effect of semi-supervised labeling.}
Table~\ref{tab:results_ap} shows that the unsupervised regime \((0/0)\) achieves the best AP, while the lightly supervised regime \((50/10)\) performs worst. With many labels \((500/100)\), performance improves again but remains slightly below the unsupervised case. This pattern also appears under the manually-defined evaluation, which excludes mislabeled frames. The drop with light supervision therefore cannot be explained by noisy evaluation targets, but must stem from the training process itself.
Table~\ref{tab:results_ap} shows that the unsupervised regime \((0/0)\) achieves the best AP, while the lightly supervised regime \((50/10)\) performs worst. With many labels \((500/100)\), performance improves again but remains slightly below the unsupervised case. This pattern also appears under the manually-defined evaluation, which excludes mislabeled frames. Consequently, the drop with light supervision cannot be explained by noisy evaluation targets, but must stem from the training process itself.
The precision--recall curves in Figure~\ref{fig:prc_over_semi} show that the overall curve shapes are similar across regimes, but shifted relative to one another in line with the AP ordering \((0/0) > (500/100) > (50/10)\). We attribute these shifts to overfitting: when only a few anomalies are labeled, the model fits them too strongly, and if those examples differ too much from other anomalies, generalization suffers. This explains why lightly supervised training performs even worse than unsupervised training, which avoids this bias.
\figc{prc_over_semi}{figures/results_prc_over_semi.png}{\rev{PRCs} at latent dimension~32 for all three labeling regimes (unsupervised, lightly supervised, heavily supervised), shown separately for the LeNet-inspired (\rev{top}) and Efficient (\rev{bottom}) encoders. Baseline methods are included for comparison. Latent dimension~32 is shown as it achieved the best overall AP and is representative of the typical PRC shapes across dimensions.}{width=.7\textwidth}
\figc{prc_over_semi}{figures/results_prc_over_semi.png}{\rev{PRCs} from experiment-based evaluation for all three labeling regimes (unsupervised, lightly supervised, heavily supervised), shown separately for the LeNet-inspired (\rev{a}) and Efficient (\rev{b}) encoders. Baseline methods are included for comparison. Latent dimension~32 is shown as it achieved the best overall AP and is representative of the typical PRC shapes across dimensions.}{width=.7\textwidth}
The LeNet variant illustrates this effect most clearly, showing unusually high variance across folds in the lightly supervised case. In several folds, precision drops atypically early, which supports the idea that the model has overfit to a poorly chosen subset of labeled anomalies. The Efficient variant is less affected and maintains more stable precision plateaus, suggesting that it is more robust to such overfitting; we observe this consistently across nearly all latent dimensionalities.
With many labels \((500/100)\), the results become more stable again and the PRC curves closely resemble the unsupervised case, only shifted slightly left. A larger and more diverse set of labeled anomalies reduces the risk of unlucky sampling and improves generalization, but it still cannot fully match the unsupervised regime, where no overfitting to a specific labeled subset occurs. The only exception is an outlier at latent dimension 512 for LeNet, where the curve again resembles the lightly supervised case, likely due to label sampling effects amplified by higher latent capacity.
With many labels \((500/100)\), the results become more stable again, and the PRC curves closely resemble the unsupervised case, only shifted slightly left. A larger and more diverse set of labeled anomalies reduces the risk of unlucky sampling and improves generalization, but it still cannot fully match the unsupervised regime, where no overfitting to a specific labeled subset occurs. The only exception is an outlier at latent dimension 512 for LeNet, where the curve again resembles the lightly supervised case, likely due to label sampling effects amplified by higher latent capacity.
In summary, three consistent patterns emerge: (i) a very small number of labels can hurt performance by causing overfitting to specific examples, (ii) many labels reduce this problem but still do not surpass unsupervised generalization, and (iii) encoder architecture strongly affects robustness, with \rev{the LeNet-inspired encoder} being more sensitive to unstable behavior than \rev{the Efficient encoder}.
@@ -1077,19 +1078,19 @@ In summary, three consistent patterns emerge: (i) a very small number of labels
\newsection{results_inference}{Inference on Held-Out Experiments}
In addition to the evaluation of PRC and AP obtained from $k$-fold cross-validation with varying hyperparameters, we also examine the behavior of the fully trained methods when applied to previously unseen, held-out experiments.
While the prior analysis provided valuable insights into the classification capabilities of the methods, it was limited by two factors: first, the binary ground-truth labels were of uneven quality due to aforementioned mislabeling of frames, and second, the binary formulation does not reflect our overarching goal of quantifying sensor degradation on a continuous scale.
While the prior analysis provided valuable insights into the classification capabilities of the methods, it was limited by two factors: first, the binary ground-truth labels were of uneven quality due to the aforementioned mislabeling of frames, and second, the binary formulation does not reflect our overarching goal of quantifying sensor degradation on a continuous scale.
To provide a more intuitive understanding of how the methods might perform in real-world applications, we therefore present results from running inference sequentially on entire experiments.
These frame-by-frame time-axis plots simulate online inference and illustrate how anomaly scores evolve as data is captured, thereby serving as a candidate metric for quantifying the degree of \rev{LiDAR} degradation during operation.
%\fig{results_inference_normal_vs_degraded}{figures/results_inference_normal_vs_degraded.png}{Comparison of anomaly detection methods with statistical indicators across clean (dashed) and degraded (solid) experiments. Each subplot shows one method (DeepSAD--LeNet, DeepSAD--Efficient, OCSVM, Isolation Forest). Red curves denote how strongly the anomaly score deviates from clean-experiment baseline; blue and green curves denote the percentage of missing \rev{LiDAR} points and near-sensor particle hits, respectively. Latent Space Dimensionality was 32 and semi-supervised labeling regime was 0 normal and 0 anomalous samples during training.}
\fig{results_inference_normal_vs_degraded}{figures/results_inference_normal_vs_degraded.png}{Comparison of inference on unseen experiment for clean (dashed) vs. degraded (solid) experiments. Each subplot compares one method to statistical indicators. Red curves show method's anomaly score deviation from its clean baseline; blue and green curves indicate the percentage of missing \rev{LiDAR} points and near-sensor particle hits, respectively. Latent dimension: 32; training regime: 0 normal, 0 anomalous samples.}
\fig{results_inference_normal_vs_degraded}{figures/results_inference_normal_vs_degraded.png}{Comparison of inference on unseen data for clean (dashed) vs. degraded (solid) experiments. Each subplot, (a)–(d), compares one method's anomaly score deviation from its clean baseline in red to statistical indicators in blue and green, which indicate the percentage of missing \rev{LiDAR} points and near-sensor particle hits, respectively. Latent dimension: 32; training regime: 0 normal, 0 anomalous samples. Smoothed with EMA $\alpha=0.1$.}
As discussed in Section~\ref{sec:setup_baselines_evaluation}, we apply $z$-score normalization to enable comparison of the different methods during inference. After normalization, the resulting time series were still highly noisy, which motivated the application of exponential moving average (EMA) smoothing. EMA was chosen because it is causal (does not rely on future data) and thus suitable for real-time inference. Although it introduces a small time delay, this delay is shorter than for other smoothing techniques such as running averages.
As discussed in Section~\ref{sec:setup_baselines_evaluation}, we apply $z$-score normalization to enable comparison of the different methods during inference. After normalization, the resulting time series were still highly noisy, which motivated the application of exponential moving average (EMA) smoothing. EMA was chosen because it is causal (does not rely on future data) and thus suitable for real-time inference. Although it introduces a small time delay, this delay is shorter than for other smoothing techniques, such as running averages.
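A minimal sketch of this post-processing step is given below; the function and variable names are illustrative, and the clean-run statistics are assumed to be computed once from a clean reference experiment rather than taken from our actual pipeline.

# Minimal sketch: z-score normalization against a clean reference run,
# followed by causal EMA smoothing (alpha = 0.1); names are illustrative.
import numpy as np

def zscore_then_ema(scores, clean_scores, alpha=0.1):
    mu, sigma = clean_scores.mean(), clean_scores.std()
    z = (scores - mu) / sigma              # deviation from the clean-experiment baseline
    smoothed = np.empty_like(z, dtype=float)
    smoothed[0] = z[0]
    for t in range(1, len(z)):             # causal: only past and current frames are used
        smoothed[t] = alpha * z[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed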
The plots in Figure~\ref{fig:results_inference_normal_vs_degraded} highlight important differences in how well the tested methods distinguish between normal and degraded sensor conditions. They show how strongly each method's scores deviate from its clean-data baseline and include statistical indicators (missing points and near-sensor particle hits) in blue and green.
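Both indicators can be derived per frame from the raw range measurements alone. The sketch below is illustrative: it assumes missing returns are encoded as zero or non-finite values, which may differ from the dataset's actual encoding.

# Minimal sketch of the two statistical indicators per LiDAR frame:
# percentage of missing returns and of near-sensor hits (< 0.5 m).
# The encoding of missing returns (zero / non-finite) is an assumption.
import numpy as np

def frame_indicators(ranges, near_threshold_m=0.5):
    total = ranges.size
    valid = np.isfinite(ranges) & (ranges > 0.0)
    missing_pct = 100.0 * (total - valid.sum()) / total
    near_pct = 100.0 * (valid & (ranges < near_threshold_m)).sum() / total
    return missing_pct, near_pct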
Among the four approaches, the strongest separation is achieved by DeepSAD (Efficient), followed by DeepSAD (LeNet), then OCSVM. For Isolation Forest, the anomaly scores are already elevated in the clean experiment, which prevents reliable differentiation between normal and degraded runs and makes the method unsuitable in this context.
Among the four approaches, the strongest separation is achieved by DeepSAD Efficient (b), followed by DeepSAD LeNet (a), then OCSVM (c). For Isolation Forest (d), the anomaly scores are already elevated in the clean experiment, which prevents reliable differentiation between normal and degraded runs and makes the method unsuitable in this context.
When comparing the methods to the statistical indicators, some similarities in shape may suggest that the methods partly capture these statistics, although such interpretations should be made with caution.
The anomaly detection models are expected to have learned additional patterns that are not directly observable from simple statistics, and these may also contribute to their ability to separate degraded from clean data.
@@ -1104,7 +1105,7 @@ This thesis set out to answer the research question stated in Chapter~\ref{chp:i
\begin{quote}
Can autonomous robots quantify the reliability of \rev{LiDAR} sensor data in hazardous environments to make more informed decisions?
\end{quote}
Our results indicate a qualified “yes.” Using anomaly detection (AD)—in particular DeepSAD—we can obtain scores that (i) separate clearly normal from clearly degraded scans and (ii) track degradation trends over time on held-out experiments (see Sections~\ref{sec:results_deepsad} and \ref{sec:results_inference}). At the same time, the absence of robust ground truth limits how confidently we can assess \emph{continuous} quantification quality and complicates cross-method comparisons. The remainder of this chapter summarizes what we contribute, what we learned, and what is still missing.
Our results indicate a qualified “yes.” Using anomaly detection (AD)—in particular DeepSAD—we can obtain scores that (i) separate clearly normal from clearly degraded scans and (ii) track degradation trends over time on held-out experiments (see Sections~\ref{sec:results_deepsad} and \ref{sec:results_inference}). At the same time, the absence of robust ground truth limits how confidently we can assess \emph{continuous} quantification quality and complicates cross-method comparisons. The remainder of this chapter summarizes what we contributed, what we learned, and what is still missing.
\paragraph{Main contributions.}
\begin{itemize}
@@ -1115,9 +1116,9 @@ Our results indicate a qualified “yes.” Using anomaly detection (AD)—in pa
\item \textbf{Semi-supervision insight.} In our data, \emph{unsupervised} DeepSAD performed best; \emph{light} labeling (50/10) performed worst; \emph{many} labels (500/100) partially recovered performance but did not surpass \rev{the unsupervised approach}. Evidence from \rev{precision--recall curve (PRC)} shapes and fold variance points to \emph{training-side overfitting to a small labeled set}, an effect that persists even under clean manually-defined evaluation (Table~\ref{tab:results_ap}, Figure~\ref{fig:prc_over_semi}).
\item \textbf{Encoder architecture matters.} The Efficient encoder \rev{specifically tailored to the application at hand} outperformed the LeNet-inspired variant in pretraining and downstream AD, indicating that representation quality substantially affects DeepSAD performance (Section~\ref{sec:results_pretraining}, Section~\ref{sec:results_deepsad}).
\item \textbf{Encoder architecture matters.} The Efficient encoder, \rev{specifically tailored to the application at hand,} outperformed the LeNet-inspired variant in pretraining and downstream AD, indicating that representation quality substantially affects DeepSAD performance (Section~\ref{sec:results_pretraining}, Section~\ref{sec:results_deepsad}).
\item \textbf{Temporal inference recipe.} For deployment-oriented analysis we propose $z$-score normalization based on clean data and causal EMA smoothing to obtain interpretable time-series anomaly scores on full experiments (Section~\ref{sec:results_inference}).
\item \textbf{Temporal inference recipe.} For deployment-oriented analysis, we propose $z$-score normalization based on clean data and causal EMA smoothing to obtain interpretable time-series anomaly scores on full experiments (Section~\ref{sec:results_inference}).
\end{itemize}
\paragraph{Practical recommendations.}
@@ -1133,9 +1134,9 @@ We now turn to the main limiting factor that emerged throughout this work: the l
\newsection{conclusion_data}{Missing Ground Truth as an Obstacle}
The most significant obstacle identified in this work is the absence of robust and comprehensive ground truth for \rev{LiDAR} degradation. As discussed in Chapter~\ref{chp:data_preprocessing}, it is not trivial to define what “degradation” precisely means in practice. Although error models for \rev{LiDAR} and theoretical descriptions of how airborne particles affect laser returns exist, these models typically quantify errors at the level of individual points (e.g., missing returns, spurious near-range hits). Such metrics, however, may not be sufficient to assess the impact of degraded data on downstream perception. For example, a point cloud with relatively few but highly localized errors—such as those caused by a dense smoke cloud—may cause a SLAM algorithm to misinterpret the region as a solid obstacle. In contrast, a point cloud with a greater number of dispersed errors might be easier to filter and thus cause little or no disruption in mapping. Consequently, the notion of “degradation” must extend beyond point-level error statistics to include how different error patterns propagate to downstream modules.
The most significant obstacle identified in this work is the absence of a robust and comprehensive ground truth for \rev{LiDAR} degradation. As discussed in Chapter~\ref{chp:data_preprocessing}, it is not trivial to define what “degradation” precisely means in practice. Although error models for \rev{LiDAR} and theoretical descriptions of how airborne particles affect laser returns exist, these models typically quantify errors at the level of individual points (e.g., missing returns, spurious near-range hits). Such metrics, however, may not be sufficient to assess the impact of degraded data on downstream perception. For example, a point cloud with relatively few but highly localized errors—such as those caused by a dense smoke cloud—may cause a SLAM algorithm to misinterpret the region as a solid obstacle. In contrast, a point cloud with a greater number of dispersed errors might be easier to filter and thus cause little or no disruption in mapping. Consequently, the notion of “degradation” must extend beyond point-level error statistics to include how different error patterns propagate to downstream modules.
To our knowledge, no public datasets with explicit ground truth for \rev{LiDAR} degradation exist. Even if such data were collected, for example with additional smoke sensors, it is unclear whether this would provide a usable ground truth. A smoke sensor measures only at a single point in space, while \rev{LiDAR} observes many points across the environment from a distance, so the two do not directly translate. In our dataset, we relied on the fact that clean and degraded experiments were clearly separated: data from degraded runs was collected only after artificial smoke had been released. However, the degree of degradation varied strongly within each run. Because the smoke originated from a single machine in the middle of the sensor platform's traversal path, early and late frames were often nearly as clear as those from clean experiments. This led to mislabeled frames at the run boundaries and limited the reliability of experiment-based evaluation. As shown in Section~\ref{sec:results_deepsad}, this effect capped achievable AP scores even for strong models. The underlying difficulty is not only label noise, but also the challenge of collecting labeled subsets that are representative of the full range of anomalies.
To our knowledge, no public datasets with explicit ground truth for \rev{LiDAR} degradation exist. Even if such data were collected, for example, with additional smoke sensors, it is unclear whether this would provide a usable ground truth. A smoke sensor measures only at a single point in space, while \rev{LiDAR} observes many points across the environment from a distance, so the two do not directly translate. In our dataset, we relied on the fact that clean and degraded experiments were clearly separated: data from degraded runs was collected only after artificial smoke had been released. However, the degree of degradation varied strongly within each run. Because the smoke originated from a single machine in the middle of the sensor platform's traversal path, early and late frames were often nearly as clear as those from clean experiments. This led to mislabeled frames at the run boundaries and limited the reliability of experiment-based evaluation. As shown in Section~\ref{sec:results_deepsad}, this effect capped achievable AP scores even for strong models. The underlying difficulty is not only label noise, but also the challenge of collecting labeled subsets that are representative of the full range of anomalies.
One promising direction is to evaluate degradation not directly on raw \rev{LiDAR} frames but via its downstream impact. For example, future work could assess degradation based on discrepancies between a previously mapped 3D environment and the output of a SLAM algorithm operating under degraded conditions. In such a setup, subjective labeling may still be required in special cases (e.g., dense smoke clouds treated as solid obstacles by SLAM), but it would anchor evaluation more closely to the ultimate users of the data.
@@ -1143,11 +1144,11 @@ Finally, the binary ground truth employed here is insufficient for the quantific
\newsection{conclusion_ad}{Insights into DeepSAD and AD for Degradation Quantification}
This work has shown that the DeepSAD principle is applicable to \rev{LiDAR} degradation in hazardous environments and yields promising detection performance as well as runtime feasibility (see Sections~\ref{sec:results_deepsad} and~\ref{sec:setup_experiments_environment}). Compared to simpler baselines such as Isolation Forest and OCSVM, DeepSAD achieved much stronger separation between clean and degraded data. While OCSVM showed smoother but weaker separation and Isolation Forest produced high false positives even in clean runs, both DeepSAD variants maintained large high-precision regions before collapsing under mislabeled evaluation targets.
This work has shown that the DeepSAD principle is applicable to \rev{LiDAR} degradation in hazardous environments and yields promising detection performance as well as runtime feasibility (see Sections~\ref{sec:results_deepsad} and~\ref{sec:setup_experiments_environment}). Compared to simpler baselines such as Isolation Forest and OCSVM, DeepSAD achieved much stronger separation between clean and degraded data. While OCSVM showed smoother but weaker separation, and Isolation Forest produced high false positives even in clean runs, both DeepSAD variants maintained large high-precision regions before collapsing under mislabeled evaluation targets.
However, the semi-supervised component of DeepSAD did not improve results in our setting. In fact, adding a small number of labels often reduced performance due to overfitting to narrow subsets of anomalies, while larger labeled sets stabilized training, they still did not surpass the unsupervised regime (see Section~\ref{sec:results_deepsad}). This suggests that without representative and diverse labeled anomalies, unsupervised training remains the safer choice.
However, the semi-supervised component of DeepSAD did not improve results in our setting. In fact, adding a small number of labels often reduced performance due to overfitting to narrow subsets of anomalies. While larger labeled sets stabilized training, they still did not surpass the unsupervised regime (see Section~\ref{sec:results_deepsad}). This suggests that without representative and diverse labeled anomalies, unsupervised training remains the safer choice.
We also observed that the choice of encoder architecture and latent dimensionality are critical. The Efficient encoder consistently outperformed the LeNet-inspired baseline, producing more stable precision–recall curves and stronger overall results. Similarly, compact latent spaces (32–128 dimensions) yielded the best performance and proved more robust under noisy evaluation conditions, while larger latent spaces amplified the impact of mislabeled samples and caused sharper precision collapses. These findings underline the importance of representation design for robust anomaly detection.
We also observed that the choices of encoder architecture and latent dimensionality are critical. The Efficient encoder consistently outperformed the LeNet-inspired baseline, producing more stable precision–recall curves and stronger overall results. Similarly, compact latent spaces (32–128 dimensions) yielded the best performance and proved more robust under noisy evaluation conditions, while larger latent spaces amplified the impact of mislabeled samples and caused sharper precision collapses. These findings underline the importance of representation design for robust anomaly detection.
Finally, inference experiments showed that DeepSAD’s anomaly scores can track degradation trends over time when normalized and smoothed, suggesting potential for real-world quantification. Future work could explore per-sample weighting of semi-supervised targets, especially if analog ground truth becomes available, allowing DeepSAD to capture varying degrees of degradation as a graded rather than binary signal.
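To make this direction concrete, per-sample weights could enter the labeled term of a DeepSAD-style objective as a multiplicative factor. The sketch below is only an illustration of such a modification; the weighting scheme and all names are assumptions, not part of this work's implementation.

# Illustrative sketch: DeepSAD-style loss with optional per-sample weights on labeled terms,
# so graded (analog) degradation labels could be reflected during training.
import torch

def weighted_deepsad_loss(z, c, y, w=None, eta=1.0, eps=1e-6):
    # z: latent codes (N, d); c: hypersphere center (d,)
    # y: 0 = unlabeled, +1 = labeled normal, -1 = labeled anomalous
    # w: optional per-sample weights applied to the labeled samples (N,)
    dist = torch.sum((z - c) ** 2, dim=1)
    labeled = eta * (dist + eps) ** y.float()   # labeled term of the DeepSAD objective
    losses = torch.where(y == 0, dist, labeled)
    if w is not None:
        losses = torch.where(y == 0, losses, w * losses)
    return losses.mean()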
@@ -1,9 +1,9 @@
\addcontentsline{toc}{chapter}{Abstract}
\begin{center}\Large\bfseries Abstract\end{center}\vspace*{1cm}\noindent
Autonomous robots are increasingly used in search and rescue (SAR) missions. In these missions, lidar sensors are often the most important source of environmental data. However, lidar data can degrade under hazardous conditions, especially when airborne particles such as smoke or dust are present. This degradation can lead to errors in mapping and navigation and may endanger both the robot and humans. Therefore, robots need a way to estimate the reliability of their lidar data, so \rev{that} they can make better-informed decisions.
Autonomous robots are increasingly used in search and rescue (SAR) missions. In these missions, LiDAR sensors are often the most important source of environmental data. However, LiDAR data can degrade under hazardous conditions, especially when airborne particles such as smoke or dust are present. This degradation can lead to errors in mapping and navigation and may endanger both the robot and humans. Therefore, robots need a way to estimate the reliability of their LiDAR data, so that they can make better-informed decisions.
\bigskip
This thesis investigates whether anomaly detection methods can be used to quantify lidar data degradation \rev{caused by airborne particles such as smoke and dust}. We apply a semi-supervised deep learning approach called DeepSAD, which produces an anomaly score for each lidar scan, serving as a measure of data reliability.
This thesis investigates whether anomaly detection methods can be used to quantify LiDAR data degradation caused by airborne particles such as smoke and dust. We apply a semi-supervised deep learning approach called DeepSAD, which produces an anomaly score for each LiDAR scan, serving as a measure of data reliability.
\bigskip
We evaluate this method against baseline methods on a subterranean dataset that includes lidar scans degraded by artificial smoke. Our results show that DeepSAD consistently outperforms the baselines and can clearly distinguish degraded from normal scans. At the same time, we find that the limited availability of labeled data and the lack of robust ground truth remain major challenges. Despite these limitations, our work demonstrates that anomaly detection methods are a promising tool for lidar degradation quantification in SAR scenarios.
We evaluate this method against baseline methods on a subterranean dataset that includes LiDAR scans degraded by artificial smoke. Our results show that DeepSAD consistently outperforms the baselines and can clearly distinguish degraded from normal scans. At the same time, we find that the limited availability of labeled data and the lack of robust ground truth remain major challenges. Despite these limitations, our work demonstrates that anomaly detection methods are a promising tool for LiDAR degradation quantification in SAR scenarios.
@@ -1,6 +1,6 @@
|
||||
{ pkgs, ... }:
|
||||
let
|
||||
native_dependencies = with pkgs.python312Packages; [
|
||||
native_dependencies = with pkgs.python311Packages; [
|
||||
torch-bin
|
||||
torchvision-bin
|
||||
aggdraw # for visualtorch
|
||||
@@ -16,7 +16,7 @@ in
|
||||
packages = native_dependencies ++ tools;
|
||||
languages.python = {
|
||||
enable = true;
|
||||
package = pkgs.python312;
|
||||
package = pkgs.python311;
|
||||
uv = {
|
||||
enable = true;
|
||||
sync.enable = true;
|
||||
|
||||
@@ -12,7 +12,7 @@ import numpy as np
|
||||
import polars as pl
|
||||
|
||||
# CHANGE THIS IMPORT IF YOUR LOADER MODULE IS NAMED DIFFERENTLY
|
||||
from plot_scripts.load_results import load_pretraining_results_dataframe
|
||||
from load_results import load_pretraining_results_dataframe
|
||||
|
||||
# ----------------------------
|
||||
# Config
|
||||
@@ -78,8 +78,8 @@ def build_arch_curves_from_df(
|
||||
"overall": (dims, means, stds),
|
||||
} }
|
||||
"""
|
||||
if "split" not in df.columns:
|
||||
raise ValueError("Expected 'split' column in AE dataframe.")
|
||||
# if "split" not in df.columns:
|
||||
# raise ValueError("Expected 'split' column in AE dataframe.")
|
||||
if "scores" not in df.columns:
|
||||
raise ValueError("Expected 'scores' column in AE dataframe.")
|
||||
if "network" not in df.columns or "latent_dim" not in df.columns:
|
||||
@@ -88,7 +88,7 @@ def build_arch_curves_from_df(
|
||||
raise ValueError(f"Expected '{label_field}' column in AE dataframe.")
|
||||
|
||||
# Keep only test split
|
||||
df = df.filter(pl.col("split") == "test")
|
||||
# df = df.filter(pl.col("split") == "test")
|
||||
|
||||
groups: dict[tuple[str, int], dict[str, list[float]]] = {}
|
||||
|
||||
@@ -201,7 +201,7 @@ def plot_multi_loss_curve(arch_results, title, output_path, colors=None):
|
||||
|
||||
plt.xlabel("Latent Dimensionality")
|
||||
plt.ylabel("Test Loss")
|
||||
plt.title(title)
|
||||
# plt.title(title)
|
||||
plt.legend()
|
||||
plt.grid(True, alpha=0.3)
|
||||
plt.xticks(all_dims)
|
||||
|
||||
@@ -171,28 +171,28 @@ def plot_combined_timeline(
|
||||
range(num_bins), near_sensor_binned, color=color, linestyle="--", alpha=0.6
|
||||
)
|
||||
|
||||
# Add vertical lines for manually labeled frames if available
|
||||
if all_paths[i].with_suffix(".npy").name in manually_labeled_anomaly_frames:
|
||||
begin_frame, end_frame = manually_labeled_anomaly_frames[
|
||||
all_paths[i].with_suffix(".npy").name
|
||||
]
|
||||
# Convert frame numbers to normalized timeline positions
|
||||
begin_pos = (begin_frame / exp_len) * (num_bins - 1)
|
||||
end_pos = (end_frame / exp_len) * (num_bins - 1)
|
||||
# # Add vertical lines for manually labeled frames if available
|
||||
# if all_paths[i].with_suffix(".npy").name in manually_labeled_anomaly_frames:
|
||||
# begin_frame, end_frame = manually_labeled_anomaly_frames[
|
||||
# all_paths[i].with_suffix(".npy").name
|
||||
# ]
|
||||
# # Convert frame numbers to normalized timeline positions
|
||||
# begin_pos = (begin_frame / exp_len) * (num_bins - 1)
|
||||
# end_pos = (end_frame / exp_len) * (num_bins - 1)
|
||||
|
||||
# Add vertical lines with matching color and loose dotting
|
||||
ax1.axvline(
|
||||
x=begin_pos,
|
||||
color=color,
|
||||
linestyle=":",
|
||||
alpha=0.6,
|
||||
)
|
||||
ax1.axvline(
|
||||
x=end_pos,
|
||||
color=color,
|
||||
linestyle=":",
|
||||
alpha=0.6,
|
||||
)
|
||||
# # Add vertical lines with matching color and loose dotting
|
||||
# ax1.axvline(
|
||||
# x=begin_pos,
|
||||
# color=color,
|
||||
# linestyle=":",
|
||||
# alpha=0.6,
|
||||
# )
|
||||
# ax1.axvline(
|
||||
# x=end_pos,
|
||||
# color=color,
|
||||
# linestyle=":",
|
||||
# alpha=0.6,
|
||||
# )
|
||||
|
||||
# Customize axes
|
||||
ax1.set_xlabel("Normalized Timeline")
|
||||
@@ -202,7 +202,7 @@ def plot_combined_timeline(
|
||||
ax1.set_ylabel("Missing Points (%)")
|
||||
ax2.set_ylabel("Points with <0.5m Range (%)")
|
||||
|
||||
plt.title(title)
|
||||
# plt.title(title)
|
||||
|
||||
# Create legends without fixed positions
|
||||
# First get all lines and labels for experiments
|
||||
@@ -221,7 +221,8 @@ def plot_combined_timeline(
|
||||
)
|
||||
|
||||
# Create single legend in top right corner with consistent margins
|
||||
fig.legend(all_handles, all_labels, loc="upper right", borderaxespad=4.8)
|
||||
# fig.legend(all_handles, all_labels, loc="upper right", borderaxespad=2.8)
|
||||
fig.legend(all_handles, all_labels, bbox_to_anchor=(0.95, 0.99))
|
||||
|
||||
plt.grid(True, alpha=0.3)
|
||||
|
||||
|
||||
@@ -122,8 +122,8 @@ def plot_data_points_pie(normal_experiment_frames, anomaly_experiment_frames):
|
||||
|
||||
# prepare data for pie chart
|
||||
labels = [
|
||||
"Normal Lidar Frames\nNon-Degraded Pointclouds",
|
||||
"Anomalous Lidar Frames\nDegraded Pointclouds",
|
||||
"Normal Lidar Frames\nNon-Degraded Point Clouds",
|
||||
"Anomalous Lidar Frames\nDegraded Point Clouds",
|
||||
]
|
||||
sizes = [total_normal_frames, total_anomaly_frames]
|
||||
explode = (0.1, 0) # explode the normal slice
|
||||
@@ -150,9 +150,9 @@ def plot_data_points_pie(normal_experiment_frames, anomaly_experiment_frames):
|
||||
va="center",
|
||||
color="black",
|
||||
)
|
||||
plt.title(
|
||||
"Distribution of Normal and Anomalous\nPointclouds in all Experiments (Lidar Frames)"
|
||||
)
|
||||
# plt.title(
|
||||
# "Distribution of Normal and Anomalous\nPointclouds in all Experiments (Lidar Frames)"
|
||||
# )
|
||||
plt.tight_layout()
|
||||
|
||||
# save the plot
|
||||
|
||||
@@ -5,7 +5,6 @@ from pathlib import Path
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
from pointcloudset import Dataset
|
||||
|
||||
# define data path containing the bag files
|
||||
all_data_path = Path("/home/fedex/mt/data/subter")
|
||||
@@ -82,7 +81,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
|
||||
plt.figure(figsize=(10, 5))
|
||||
plt.hist(missing_points_normal, bins=100, alpha=0.5, label="Normal Experiments")
|
||||
plt.hist(missing_points_anomaly, bins=100, alpha=0.5, label="Anomaly Experiments")
|
||||
plt.title(title)
|
||||
# plt.title(title)
|
||||
plt.xlabel("Number of Missing Points")
|
||||
plt.ylabel("Number of Pointclouds")
|
||||
plt.legend()
|
||||
@@ -109,7 +108,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
|
||||
label="Anomaly Experiments",
|
||||
orientation="horizontal",
|
||||
)
|
||||
plt.title(title)
|
||||
# plt.title(title)
|
||||
plt.xlabel("Number of Pointclouds")
|
||||
plt.ylabel("Number of Missing Points")
|
||||
plt.legend()
|
||||
@@ -142,7 +141,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
|
||||
label="Anomaly Experiments",
|
||||
density=True,
|
||||
)
|
||||
plt.title(title)
|
||||
# plt.title(title)
|
||||
plt.xlabel("Number of Missing Points")
|
||||
plt.ylabel("Density")
|
||||
plt.legend()
|
||||
@@ -169,7 +168,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
|
||||
label="Anomaly Experiments (With Artifical Smoke)",
|
||||
density=True,
|
||||
)
|
||||
plt.title(title)
|
||||
# plt.title(title)
|
||||
plt.xlabel("Percentage of Missing Lidar Measurements")
|
||||
plt.ylabel("Density")
|
||||
# display the x axis as percentages
|
||||
@@ -210,7 +209,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
|
||||
alpha=0.5,
|
||||
label="Anomaly Experiments",
|
||||
)
|
||||
plt.title(title)
|
||||
# plt.title(title)
|
||||
plt.xlabel("Number of Missing Points")
|
||||
plt.ylabel("Normalized Density")
|
||||
plt.legend()
|
||||
|
||||
@@ -5,7 +5,6 @@ from pathlib import Path
|
||||
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
from pointcloudset import Dataset
|
||||
|
||||
# define data path containing the bag files
|
||||
all_data_path = Path("/home/fedex/mt/data/subter")
|
||||
@@ -164,7 +163,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
|
||||
plt.gca().set_yticklabels(
|
||||
["{:.0f}%".format(y * 100) for y in plt.gca().get_yticks()]
|
||||
)
|
||||
plt.title("Particles Closer than 0.5m to the Sensor")
|
||||
# plt.title("Particles Closer than 0.5m to the Sensor")
|
||||
plt.ylabel("Percentage of measurements closer than 0.5m")
|
||||
plt.tight_layout()
|
||||
plt.savefig(output_datetime_path / f"particles_near_sensor_boxplot_{rt}.png")
|
||||
@@ -186,7 +185,7 @@ def plot_data_points(normal_experiment_paths, anomaly_experiment_paths, title):
|
||||
plt.gca().set_yticklabels(
|
||||
["{:.0f}%".format(y * 100) for y in plt.gca().get_yticks()]
|
||||
)
|
||||
plt.title("Particles Closer than 0.5m to the Sensor")
|
||||
# plt.title("Particles Closer than 0.5m to the Sensor")
|
||||
plt.ylabel("Percentage of measurements closer than 0.5m")
|
||||
plt.ylim(0, 0.05)
|
||||
plt.tight_layout()
|
||||
|
||||
@@ -112,18 +112,27 @@ cmap = get_colormap_with_special_missing_color(
|
||||
args.colormap, args.missing_data_color, args.reverse_colormap
|
||||
)
|
||||
|
||||
# --- Create a figure with 2 vertical subplots ---
|
||||
# --- Create a figure with 2 vertical subplots and move titles to the left ---
|
||||
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(10, 5))
|
||||
for ax, frame, title in zip(
|
||||
# leave extra left margin for the left-side labels
|
||||
fig.subplots_adjust(left=0.14, hspace=0.05)
|
||||
|
||||
for ax, frame, label in zip(
|
||||
(ax1, ax2),
|
||||
(frame1, frame2),
|
||||
(
|
||||
"Projection of Lidar Frame without Degradation",
|
||||
"Projection of Lidar Frame with Degradation (Artifical Smoke)",
|
||||
),
|
||||
("(a)", "(b)"),
|
||||
):
|
||||
im = ax.imshow(frame, cmap=cmap, aspect="auto", vmin=global_vmin, vmax=global_vmax)
|
||||
ax.set_title(title)
|
||||
# place the "title" to the left, vertically centered relative to the axes
|
||||
ax.text(
|
||||
-0.02, # negative x places text left of the axes (in axes coordinates)
|
||||
0.5,
|
||||
label,
|
||||
transform=ax.transAxes,
|
||||
va="center",
|
||||
ha="right",
|
||||
fontsize=12,
|
||||
)
|
||||
ax.axis("off")
|
||||
|
||||
# Adjust layout to fit margins for a paper
|
||||
|
||||
@@ -260,11 +260,11 @@ def baseline_transform(clean: np.ndarray, other: np.ndarray, mode: str):
|
||||
|
||||
|
||||
def pick_method_series(gdf: pl.DataFrame, label: str) -> Optional[np.ndarray]:
|
||||
if label == "DeepSAD (LeNet)":
|
||||
if label == "DeepSAD LeNet":
|
||||
sel = gdf.filter(
|
||||
(pl.col("network") == "subter_LeNet") & (pl.col("model") == "deepsad")
|
||||
)
|
||||
elif label == "DeepSAD (efficient)":
|
||||
elif label == "DeepSAD Efficient":
|
||||
sel = gdf.filter(
|
||||
(pl.col("network") == "subter_efficient") & (pl.col("model") == "deepsad")
|
||||
)
|
||||
@@ -311,8 +311,8 @@ def compare_two_experiments_progress(
|
||||
include_stats: bool = True,
|
||||
):
|
||||
methods = [
|
||||
"DeepSAD (LeNet)",
|
||||
"DeepSAD (efficient)",
|
||||
"DeepSAD LeNet",
|
||||
"DeepSAD Efficient",
|
||||
"OCSVM",
|
||||
"Isolation Forest",
|
||||
]
|
||||
@@ -392,8 +392,8 @@ def compare_two_experiments_progress(
|
||||
axes = axes.ravel()
|
||||
|
||||
method_to_axidx = {
|
||||
"DeepSAD (LeNet)": 0,
|
||||
"DeepSAD (efficient)": 1,
|
||||
"DeepSAD LeNet": 0,
|
||||
"DeepSAD Efficient": 1,
|
||||
"OCSVM": 2,
|
||||
"Isolation Forest": 3,
|
||||
}
|
||||
@@ -404,6 +404,8 @@ def compare_two_experiments_progress(
|
||||
if not stats_available:
|
||||
print("[WARN] One or both stats missing. Subplots will include methods only.")
|
||||
|
||||
letters = ["a", "b", "c", "d"]
|
||||
|
||||
for label, axidx in method_to_axidx.items():
|
||||
ax = axes[axidx]
|
||||
yc = curves_clean.get(label)
|
||||
@@ -412,7 +414,7 @@ def compare_two_experiments_progress(
|
||||
ax.text(
|
||||
0.5, 0.5, "No data", ha="center", va="center", transform=ax.transAxes
|
||||
)
|
||||
ax.set_title(label)
|
||||
ax.set_title(f"({letters[axidx]}) {label}")
|
||||
ax.grid(True, alpha=0.3)
|
||||
continue
|
||||
|
||||
@@ -435,6 +437,7 @@ def compare_two_experiments_progress(
|
||||
)
|
||||
ax.set_ylabel(y_label)
|
||||
ax.set_title(label)
|
||||
ax.set_title(f"({letters[axidx]}) {label}")
|
||||
ax.grid(True, alpha=0.3)
|
||||
|
||||
# Right axis #1 (closest to plot): Missing points (%)
|
||||
@@ -550,11 +553,11 @@ def compare_two_experiments_progress(
|
||||
for ax in axes:
|
||||
ax.set_xlabel("Progress through experiment (%)")
|
||||
|
||||
fig.suptitle(
|
||||
f"AD Method vs Stats Inference — progress-normalized\n"
|
||||
f"Transform: z-score normalized to non-degraded experiment | EMA(α={EMA_ALPHA_METHODS})",
|
||||
fontsize=14,
|
||||
)
|
||||
# fig.suptitle(
|
||||
# f"AD Method vs Stats Inference — progress-normalized\n"
|
||||
# f"Transform: z-score normalized to non-degraded experiment | EMA(α={EMA_ALPHA_METHODS})",
|
||||
# fontsize=14,
|
||||
# )
|
||||
fig.tight_layout(rect=[0, 0, 1, 0.99])
|
||||
|
||||
out_name = (
|
||||
|
||||
@@ -161,7 +161,7 @@ def _ensure_dim_axes(fig_title: str):
|
||||
fig, axes = plt.subplots(
|
||||
nrows=4, ncols=2, figsize=(12, 16), constrained_layout=True
|
||||
)
|
||||
fig.suptitle(fig_title, fontsize=14)
|
||||
# fig.suptitle(fig_title, fontsize=14)
|
||||
axes = axes.ravel()
|
||||
return fig, axes
|
||||
|
||||
@@ -213,11 +213,13 @@ def plot_grid_from_df(
|
||||
legend_labels = []
|
||||
have_legend = False
|
||||
|
||||
letters = ["a", "b", "c", "d", "e", "f", "g", "h"]
|
||||
|
||||
for i, dim in enumerate(LATENT_DIMS):
|
||||
if i >= 7:
|
||||
break # last slot reserved for legend
|
||||
ax = axes[i]
|
||||
ax.set_title(f"Latent Dim. = {dim}")
|
||||
ax.set_title(f"({letters[i]}) Latent Dim. = {dim}")
|
||||
ax.grid(True, alpha=0.3)
|
||||
|
||||
if kind == "roc":
|
||||
|
||||
@@ -260,9 +260,9 @@ def make_figures_for_dim(
|
||||
fig_roc, axes = plt.subplots(
|
||||
nrows=2, ncols=1, figsize=(7, 10), constrained_layout=True
|
||||
)
|
||||
fig_roc.suptitle(
|
||||
f"ROC — {EVALS_LABELS[eval_type]} — Latent Dim.={latent_dim}", fontsize=14
|
||||
)
|
||||
# fig_roc.suptitle(
|
||||
# f"ROC — {EVALS_LABELS[eval_type]} — Latent Dim.={latent_dim}", fontsize=14
|
||||
# )
|
||||
|
||||
_plot_panel(
|
||||
axes[0],
|
||||
@@ -272,7 +272,7 @@ def make_figures_for_dim(
|
||||
latent_dim=latent_dim,
|
||||
kind="roc",
|
||||
)
|
||||
axes[0].set_title("DeepSAD (LeNet) + Baselines")
|
||||
axes[0].set_title("(a) DeepSAD (LeNet) + Baselines")
|
||||
|
||||
_plot_panel(
|
||||
axes[1],
|
||||
@@ -282,7 +282,7 @@ def make_figures_for_dim(
|
||||
latent_dim=latent_dim,
|
||||
kind="roc",
|
||||
)
|
||||
axes[1].set_title("DeepSAD (Efficient) + Baselines")
|
||||
axes[1].set_title("(b) DeepSAD (Efficient) + Baselines")
|
||||
|
||||
out_roc = out_dir / f"roc_{latent_dim}_{eval_type}.png"
|
||||
fig_roc.savefig(out_roc, dpi=150, bbox_inches="tight")
|
||||
@@ -292,9 +292,9 @@ def make_figures_for_dim(
|
||||
fig_prc, axes = plt.subplots(
|
||||
nrows=2, ncols=1, figsize=(7, 10), constrained_layout=True
|
||||
)
|
||||
fig_prc.suptitle(
|
||||
f"PRC — {EVALS_LABELS[eval_type]} — Latent Dim.={latent_dim}", fontsize=14
|
||||
)
|
||||
# fig_prc.suptitle(
|
||||
# f"PRC — {EVALS_LABELS[eval_type]} — Latent Dim.={latent_dim}", fontsize=14
|
||||
# )
|
||||
|
||||
_plot_panel(
|
||||
axes[0],
|
||||
@@ -304,7 +304,7 @@ def make_figures_for_dim(
|
||||
latent_dim=latent_dim,
|
||||
kind="prc",
|
||||
)
|
||||
axes[0].set_title("DeepSAD (LeNet) + Baselines")
|
||||
axes[0].set_title("(a)")
|
||||
|
||||
_plot_panel(
|
||||
axes[1],
|
||||
@@ -314,7 +314,7 @@ def make_figures_for_dim(
|
||||
latent_dim=latent_dim,
|
||||
kind="prc",
|
||||
)
|
||||
axes[1].set_title("DeepSAD (Efficient) + Baselines")
|
||||
axes[1].set_title("(b)")
|
||||
|
||||
out_prc = out_dir / f"prc_{latent_dim}_{eval_type}.png"
|
||||
fig_prc.savefig(out_prc, dpi=150, bbox_inches="tight")
|
||||
|
||||
@@ -6,6 +6,7 @@ readme = "README.md"
|
||||
requires-python = ">=3.11.9"
|
||||
dependencies = [
|
||||
"pandas>=2.3.2",
|
||||
"pointcloudset>=0.11.0",
|
||||
"polars>=1.33.0",
|
||||
"pyarrow>=21.0.0",
|
||||
"tabulate>=0.9.0",
|
||||
|
||||