Compare commits

...

2 Commits

Author SHA1 Message Date
Jan Kowalczyk
e00d1a33e3 reworked results chpt 2025-09-27 19:01:59 +02:00
Jan Kowalczyk
c270783225 wip 2025-09-27 16:34:52 +02:00
7 changed files with 631 additions and 156 deletions

Binary file not shown.


@@ -238,7 +238,10 @@ For remote controlled robots a human operator can make these decisions but many
In this thesis we aim to answer this question by assessing a deep learning-based anomaly detection method and its performance when quantifying the sensor data's degradation. The employed algorithm is a semi-supervised anomaly detection algorithm which uses manually labeled training data to improve its performance over unsupervised methods. We show how much the introduction of these labeled samples improves the method's performance. The model's output is an anomaly score which quantifies the data reliability and can be used by algorithms that rely on the sensor data. These reliant algorithms may decide, for example, to slow down the robot to collect more data, choose alternative routes, signal for help, or rely more heavily on other sensors' input data.
\todo[inline]{discuss results (we showed X)} %We showed that anomaly detection methods are suitable for the task at hand, enabling quantification of lidar sensor data degradation on sub-terranean data similar to the environments SAR missions take place in. We found that the semi-supervised learning-based method DeepSAD performed better than well-known anomaly detection baselines but found a lack of suitable training data and especially labeled evaluation data to hold back the quality of assessing the methods' expected performance under real-life conditions.
Our experiments demonstrate that anomaly detection methods are indeed applicable to this task, allowing lidar data degradation to be quantified on subterranean datasets representative of SAR environments. Among the tested approaches, the semi-supervised method DeepSAD consistently outperformed established baselines such as Isolation Forest and OC-SVM. At the same time, the lack of suitable training data—and in particular the scarcity of reliable evaluation labels—proved to be a major limitation, constraining the extent to which the expected real-world performance of these methods could be assessed.
%\todo[inline, color=green!40]{autonomous robots have many sensors for understanding the world around them, especially visual sensors (lidar, radar, ToF, ultrasound, optical cameras, infrared cameras), they use that data for navigation mapping, SLAM algorithms, and decision making. these are often deep learning algorithms, oftentimes only trained on good data}
%\todo[inline, color=green!40]{difficult environments for sensors to produce good data quality (earthquakes, rescue robots), produced data may be unreliable, we don't know how trustworthy that data is (no quantification, confidence), since all navigation and decision making is based on input data, this makes the whole pipeline untrustworthy/problematic}
@@ -273,15 +276,21 @@ The method we employ produces an analog score that reflects the confidence in th
\newsection{thesis_structure}{Structure of the Thesis}
The remainder of this thesis is organized as follows.
Chapter~\ref{chp:background} introduces the theoretical background and related work, covering anomaly detection methods, semi-supervised learning algorithms, autoencoders, and the fundamentals of LiDAR sensing.
Chapter~\ref{chp:deepsad} presents the DeepSAD algorithm in detail, including its optimization objective, network architecture, and hyperparameters.
In Chapter~\ref{chp:data_preprocessing}, we describe the dataset, the preprocessing pipeline, and our labeling strategies.
Chapter~\ref{chp:experimental_setup} outlines the experimental design, implementation details, and evaluation protocol.
The results are presented and discussed in Chapter~\ref{chp:results_discussion}, where we analyze the performance of DeepSAD compared to baseline methods.
Finally, Chapter~\ref{chp:conclusion_future_work} concludes the thesis by summarizing the main findings, highlighting limitations, and discussing open questions and directions for future work.
\threadtodo
{explain how structure will guide reader from zero knowledge to answer of research question}
{since reader knows what we want to show, an outlook over content is a nice transition}
{state structure of thesis and explain why specific background is necessary for next section}
{reader knows what to expect $\rightarrow$ necessary background info and related work}
\todo[inline]{brief overview of thesis structure}
\todo[inline, color=green!40]{in section x we discuss anomaly detection, semi-supervised learning since such an algorithm was used as the chosen method, we also discuss how lidar works and the data it produces. then in we discuss in detail the chosen method Deep SAD in section X, in section 4 we discuss the traing and evaluation data, in sec 5 we describe our setup for training and evaluation (whole pipeline). results are presented and discussed in section 6. section 7 contains a conclusion and discusses future work}
\newchapter{background}{Background and Related Work}
%\todo[inline, color=green!40]{in this section we will discuss necessary background knowledge for our chosen method and the sensor data we work with. related work exists mostly from autonomous driving which does not include subter data and mostly looks at precipitation as source of degradation, we modeled after one such paper and try to adapt the same method for the domain of rescue robots, this method is a semi-supervised deep learning approach to anomaly detection which we describe in more detail in sections 2.1 and 2.2. in the last subsection 2.3 we discuss lidar sensors and the data they produce}
@@ -409,7 +418,7 @@ In unsupervised learning, models work directly with raw data, without any ground
In reinforcement learning, the model—often called an agent—learns by interacting with an environment that provides feedback in the form of rewards or penalties. At each step, the agent observes the environment's state, selects an action, and an interpreter judges the action's outcome based on how the environment changed, providing a scalar reward or penalty that reflects the desirability of that outcome. The agent's objective is to adjust its decision-making strategy to maximize the cumulative reward over time, balancing exploration of new actions with exploitation of known high-reward behaviors. This trial-and-error approach is well suited to sequential decision problems in complex settings, such as autonomous navigation or robotic control, where each choice affects both the immediate state and future possibilities.
%\todo[inline, color=green!40]{illustration reinforcement learning}
Semi-supervised learning algorithms are an in-between category of supervised and unsupervised algorithms, in that they use a mixture of labeled and unlabeled data. Typically, vastly more unlabeled data than labeled data is used during training of such algorithms, due to the effort and expertise required to label large quantities of data correctly. Semi-supervised methods are oftentimes an effort to improve a machine learning algorithm belonging to either the supervised or the unsupervised category. Supervised methods such as classification tasks are enhanced by using large amounts of unlabeled data to augment the supervised training without additional labeling work. Alternatively, unsupervised methods like clustering algorithms may not only use unlabeled data but improve their performance by considering some hand-labeled data during training.
%Semi-Supervised learning algorithms are an inbetween category of supervised and unsupervised algorithms, in that they use a mixture of labeled and unlabeled data. Typically vastly more unlabeled data is used during training of such algorithms than labeled data, due to the effort and expertise required to label large quantities of data correctly. The type of task performed by semi-supervised methods can originate from either supervised learningor unsupervised learning domain. For classification tasks which are oftentimes achieved using supervised learning the additional unsupervised data is added during training with the hope to achieve a better outcome than when training only with the supervised portion of the data. In contrast for unsupervised learning use cases such as clustering algorithms, the addition of labeled samples can help guide the learning algorithm to improve performance over fully unsupervised training.
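To make this concrete, the following minimal Python sketch shows one common flavor of such a combination for a classification task: a standard cross-entropy loss on the few labeled samples plus an entropy-minimization term on the many unlabeled ones. It is purely illustrative; the function name, loss weighting, and tensor shapes are our own choices and do not correspond to a method used in this thesis.
\begin{verbatim}
import torch
import torch.nn.functional as F

def semi_supervised_loss(logits_labeled, targets, logits_unlabeled, weight=0.5):
    """Toy semi-supervised classification objective: cross-entropy on the few
    labeled samples plus an entropy-minimization term that encourages confident
    predictions on the many unlabeled samples."""
    supervised = F.cross_entropy(logits_labeled, targets)
    probs = F.softmax(logits_unlabeled, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    return supervised + weight * entropy

# Example with random logits: 10 labeled and 1000 unlabeled samples, 2 classes.
logits_l = torch.randn(10, 2)
labels_l = torch.randint(0, 2, (10,))
logits_u = torch.randn(1000, 2)
print(semi_supervised_loss(logits_l, labels_l, logits_u).item())
\end{verbatim}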
@@ -430,7 +439,7 @@ Autoencoders are a type of neural network architecture, whose main goal is learn
\fig{autoencoder_general}{figures/autoencoder_principle_placeholder.png}{PLACEHOLDER - An illustration of autoencoders' general architecture and reconstruction task.}
%\todo[inline, color=green!40]{explain figure}
%\todo[inline, color=green!40]{Paragraph about Variational Autoencoders? generative models vs discriminative models, enables other common use cases such as generating new data by changing parameterized generative distribution in latent space - VAES are not really relevant, maybe leave them out and just mention them shortly, with the hint that they are important but too much to explain since they are not key knowledge for this thesis}
One key use case of autoencoders is to employ them as a dimensionality reduction technique. In that case, the latent space between the encoder and decoder is of a lower dimensionality than the input data itself. Due to the aforementioned reconstruction goal, the shared information between the input data and its latent space representation is maximized, which is known as following the infomax principle. After training, such an autoencoder may be used to generate lower-dimensional representations of the given data type, enabling more performant computations which may have been infeasible on the original data. DeepSAD uses an autoencoder in a pre-training step to achieve this goal, among others.
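As a minimal sketch of this use case, the toy autoencoder below compresses a flattened input into a small latent code and is trained purely on reconstruction error. The layer sizes, the 32-dimensional bottleneck, and the flattened $2048 \times 32$ input are illustrative assumptions, not the architectures evaluated later in this thesis.
\begin{verbatim}
import torch
from torch import nn

class TinyAutoencoder(nn.Module):
    """Minimal fully connected autoencoder: the encoder compresses the input into
    a low-dimensional latent code, the decoder reconstructs the input from it."""
    def __init__(self, input_dim=2048 * 32, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)        # low-dimensional representation
        return self.decoder(z), z  # reconstruction and latent code

model = TinyAutoencoder()
x = torch.rand(4, 2048 * 32)                # e.g. four flattened range images
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)     # reconstruction objective
loss.backward()
\end{verbatim}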
@@ -713,7 +722,6 @@ Based on the previously discussed requirements and the challenges of obtaining r
%-------------------------------------------------
% Compact sensor overview (row numbers follow Fig.~\ref{fig:subter_platform})
%-------------------------------------------------
\todo[inline]{todo: check table for accuracy/errors}
\begin{table}[htbp]
\centering
\caption{Onboard sensors recorded in the \citetitle{subter} dataset. Numbers match the labels in Fig.~\ref{fig:subter_platform}; only the most salient details are shown for quick reference.}
@@ -868,7 +876,7 @@ To create this mapping, we leveraged the available measurement indices and chann
Figure~\ref{fig:data_projections} displays two examples of LiDAR point cloud projections to aid in the reader's understanding. Although the original point clouds were converted into grayscale images with a resolution of 2048×32 pixels, these raw images can be challenging to interpret. To enhance human readability, we applied the viridis colormap and vertically stretched the images so that each measurement occupies multiple pixels in height. The top projection is derived from a scan without artificial smoke—and therefore minimal degradation—while the lower projection comes from an experiment where artificial smoke introduced significant degradation.
%\todo[inline, color=green!40]{add same projections as they are used in training? grayscale without vertical scaling}
\fig{data_projections}{figures/data_2d_projections.png}{Two-dimensional projections of two point clouds, one from an experiment without degradation and one from an experiment with artificial smoke as degradation. To aid the reader's perception, the images are vertically stretched and a colormap has been applied to the pixels' reciprocal range values, while the actual training data is grayscale.}
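The sketch below illustrates, under simplifying assumptions, how such a projection could be computed from per-point range values and their channel and azimuth indices, storing reciprocal ranges so that close returns (e.g. smoke particles) appear bright. The function and variable names, the clipping constant, and the per-image normalization are hypothetical and do not claim to reproduce the exact preprocessing pipeline used here.
\begin{verbatim}
import numpy as np

def project_to_range_image(ranges, ring_idx, azimuth_idx,
                           width=2048, height=32, max_range=100.0):
    """Arrange LiDAR range measurements into a height x width grid using the
    sensor's channel (ring) and azimuth measurement indices, then store the
    reciprocal range so that close returns appear bright. Missing measurements
    stay at 0. Illustrative sketch only."""
    image = np.zeros((height, width), dtype=np.float32)
    valid = ranges > 0                                    # drop missing returns
    inv = 1.0 / np.clip(ranges[valid], 1e-3, max_range)   # reciprocal range
    image[ring_idx[valid], azimuth_idx[valid]] = inv / inv.max()  # scale to [0, 1]
    return image

# Example with random data in place of a real scan (32 channels x 2048 azimuths).
rng = np.random.default_rng(0)
ranges = rng.uniform(0.5, 60.0, size=32 * 2048)
rings = np.repeat(np.arange(32), 2048)
azimuths = np.tile(np.arange(2048), 32)
img = project_to_range_image(ranges, rings, azimuths)
print(img.shape)  # (32, 2048)
\end{verbatim}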
@@ -1239,53 +1247,44 @@ Our experimental setup consisted of two stages. First, we conducted a hyperparam
Second, we trained the full DeepSAD models on the same latent space sizes in order to investigate how autoencoder performance transfers to anomaly detection performance. Specifically, we aimed to answer whether poor autoencoder reconstructions necessarily imply degraded DeepSAD results, or whether the two stages behave differently. To disentangle these effects, both network architectures (LeNet-inspired and Efficient) were trained with identical configurations, allowing for a direct architectural comparison.
% Furthermore, we investigated the effect of semi-supervised labeling. DeepSAD can incorporate labeled data during training, and we wanted to investigate the impact of labeling on anomaly detection performance. To this end, each configuration was trained under three labeling regimes:
% \begin{itemize}
% \item \textbf{Unsupervised:} $(0,0)$ labeled samples of (normal, anomalous) data.
% \item \textbf{Low supervision:} $(50,10)$ labeled samples.
% \item \textbf{High supervision:} $(500,100)$ labeled samples.
% \end{itemize}
Furthermore, we examined the effect of semi-supervised labeling on DeepSAD's performance. As summarized in Table~\ref{tab:labeling_regimes}, three labeling regimes were tested, ranging from fully unsupervised training to progressively larger amounts of supervision:
\begin{itemize}
\item \textbf{Unsupervised:} $(0,0)$ labeled samples of (normal, anomalous) data.
\item \textbf{Low supervision:} $(50,10)$ labeled samples.
\item \textbf{High supervision:} $(500,100)$ labeled samples.
\end{itemize}
The percentages reported in Table~\ref{tab:labeling_regimes} are relative to the training folds after 5-fold cross-validation. Here, the classes “normal,” “anomalous,” and “unknown” follow the same definition as in the experiment-based labeling scheme. In particular, the “unknown” category arises because for semi-supervised anomaly labels we only used the manually selected, unambiguous degradation intervals from smoke experiments. Frames outside of these intervals were treated as “unknown” rather than anomalous, so as to prevent mislabeled data from being used during training. This design choice ensured that the inclusion of labeled samples could not inadvertently reduce performance by introducing additional label noise.
\begin{table}[h]
\centering
\caption{Proportion of labeled samples in the training folds for each labeling regime.
Percentages are computed relative to the available training data after 5-fold splitting
(80\% of the dataset per fold). Unknown samples were never labeled.}
\renewcommand{\arraystretch}{1.15}
\begin{tabularx}{\linewidth}{lYYYY}
\toprule
\textbf{Regime} & \textbf{\% normals labeled} & \textbf{\% anomalies labeled} & \textbf{\% overall labeled} & \textbf{\% overall unlabeled} \\
\midrule
(0,0) & 0.00\% & 0.00\% & 0.00\% & 100.00\% \\
(50,10) & 0.40\% & 1.46\% & 0.42\% & 99.58\% \\
(500,100) & 3.96\% & 14.57\% & 4.25\% & 95.75\% \\
\bottomrule
\end{tabularx}
\label{tab:labeling_regimes}
\end{table}
All models were pre-trained for 50~epochs and then trained for 150~epochs with the same learning rate of $1\cdot 10^{-5}$ and evaluated with 5-fold cross-validation.
Table~\ref{tab:exp_grid} summarizes the full experiment matrix.
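To make the scope of this matrix explicit, the short Python sketch below enumerates all combinations; the parameter values are taken from the text, while the dictionary layout and names are only illustrative.
\begin{verbatim}
from itertools import product

# Parameter grid as described in the text.
latent_dims = [32, 64, 128, 256, 512, 768, 1024]
architectures = ["LeNet-inspired", "Efficient"]
labeling_regimes = [(0, 0), (50, 10), (500, 100)]  # (labeled normals, anomalies)

shared_config = {"pretrain_epochs": 50, "train_epochs": 150,
                 "lr": 1e-5, "folds": 5}

runs = [
    {"latent_dim": d, "architecture": a, "labels": r, **shared_config}
    for d, a, r in product(latent_dims, architectures, labeling_regimes)
]
print(len(runs))                            # 42 configurations
print(len(runs) * shared_config["folds"])   # 210 trainings with 5-fold CV
\end{verbatim}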
% \begin{table}[h]
% \centering
% \caption{Experiment grid of all DeepSAD trainings. Each latent space size was tested for both network architectures and three levels of semi-supervised labeling.}
% \begin{tabular}{c|c|c}
% \toprule
% \textbf{Latent sizes} & \textbf{Architectures} & \textbf{Labeling regimes (normal, anomalous)} \\
% \midrule
% $32, 64, 128, 256, 512, 768, 1024$ & LeNet-inspired, Efficient & (0,0), (50,10), (500,100) \\
% \bottomrule
% \end{tabular}
% \label{tab:exp_grid}
% \end{table}
% \begin{table}[h]
% \centering
% \caption{Experiment grid of all DeepSAD trainings. Each latent space size was tested for both network architectures and three levels of semi-supervised labeling.}
% \renewcommand{\arraystretch}{1.2}
% \begin{tabularx}{\textwidth}{cXX}
% \hline
% \textbf{Latent sizes} & \textbf{Architectures} & \textbf{Labeling regimes (normal, anomalous)} \\
% \hline
% \begin{tabular}{@{}c@{}}
% 32 \\ 64 \\ 128 \\ 256 \\ 512 \\ 768 \\ 1024
% \end{tabular}
% &
% \begin{tabular}{@{}c@{}}
% LeNet-inspired \\ Efficient
% \end{tabular}
% &
% \begin{tabular}{@{}c@{}}
% (0,0) \\ (50,10) \\ (500,100)
% \end{tabular} \\
% \hline
% \end{tabularx}
% \label{tab:exp_grid}
% \end{table}
\begin{table}[h]
\centering
\caption{Parameter space for the DeepSAD grid search. Each latent size is tested for both architectures and all labeling regimes.}
@@ -1562,15 +1561,13 @@ Since only per-sample reconstruction losses were retained during pretraining, we
% --- Section: DeepSAD Training Results ---
\newsection{results_deepsad}{DeepSAD Detection Performance}
Due to the challenges of ground truth quality, evaluation results must be interpreted with care. As introduced earlier, we consider two complementary evaluation schemes:
\begin{itemize}
\item \textbf{Experiment-based labels:} An objective way to assign anomaly labels to all frames from degraded runs. However, this also marks many near-normal frames at the start and end of runs as anomalous. These knowingly false labels lower the maximum achievable AP, since even a perfect model cannot separate these mislabeled normals from true anomalies (see the synthetic sketch after this list).
\item \textbf{Hand-labeled labels:} A cleaner ground truth, containing only clearly degraded frames. This removes mislabeled intervals and allows nearly perfect separation. However, it also simplifies the task too much, because borderline cases are excluded.
\end{itemize}
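The following synthetic Python example, which uses made-up scores rather than thesis data, illustrates the effect described in the first item: even a detector that ranks every truly degraded frame above every normal frame cannot reach an AP of 1 once normal-looking frames are labeled anomalous.
\begin{verbatim}
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Synthetic scenario: 1000 normal frames, 200 truly degraded frames, plus 100
# near-normal frames from the start/end of degraded runs that the
# experiment-based scheme labels anomalous although they look normal.
scores_normal = rng.normal(0.0, 1.0, 1000)
scores_degraded = rng.normal(6.0, 1.0, 200)     # well separated from normals
scores_mislabeled = rng.normal(0.0, 1.0, 100)   # indistinguishable from normals
scores = np.concatenate([scores_normal, scores_degraded, scores_mislabeled])

# Experiment-based labels: all frames from degraded runs count as anomalous.
y_experiment = np.concatenate([np.zeros(1000), np.ones(200), np.ones(100)])
# Hand labels: only the clearly degraded frames count as anomalous.
y_hand = np.concatenate([np.zeros(1000), np.ones(200), np.zeros(100)])

print(average_precision_score(y_experiment, scores))  # noticeably below 1.0
print(average_precision_score(y_hand, scores))        # close to 1.0
\end{verbatim}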
Table~\ref{tab:results_ap} summarizes average precision (AP) across latent dimensions, labeling regimes, and methods. Under experiment-based evaluation, both DeepSAD variants consistently outperform the baselines, reaching AP values around 0.60--0.66 compared to 0.21 for Isolation Forest and 0.31--0.49 for OC-SVM. Under hand-labeled evaluation, DeepSAD achieves nearly perfect AP in all settings, while the baselines remain much lower. This contrast shows that the lower AP under experiment-based evaluation is not a weakness of DeepSAD itself, but a direct result of mislabeled samples in the evaluation data. The hand-labeled scheme therefore confirms that DeepSAD separates clearly normal from clearly degraded frames very well, while also highlighting that label noise must be kept in mind when interpreting the experiment-based results.
The contrast between the two evaluation schemes indicates, on the one hand, that the reduced AP in the experiment-based evaluation is largely due to mislabeled or ambiguous samples at the start and end of degraded runs. On the other hand, the perfect classification performance in the hand-labeled evaluation also reflects that only clearly degraded samples remain, meaning that borderline cases were removed entirely. This makes it impossible to assess how DeepSAD handles frames with weak or gradual degradation: the results show that it excels at separating clearly normal from clearly degraded samples, but they do not tell us whether it can reliably classify in-between cases where subjective judgment would otherwise be required. Consequently, both evaluation schemes are informative in complementary ways: experiment-based labels allow relative comparison under noisy, realistic conditions, while hand-labeled labels demonstrate the upper bound of performance when ambiguous samples are excluded.
\begin{table}[t]
\centering
@@ -1616,63 +1613,31 @@ The contrast between the two evaluation schemes indicates, on the one hand, that
\end{table}
The precision--recall curves (Figure~\ref{fig:prc_representative}) illustrate these effects more clearly. For DeepSAD, precision stays close to 1 until about 0.5 recall, after which it drops off sharply. This plateau corresponds to the fraction of truly degraded frames in the anomalous set. Once recall moves beyond this point, the evaluation demands that the model also “find” the mislabeled anomalies near the run boundaries. To do so, the decision threshold must be lowered so far that many true normal frames are also flagged, which causes precision to collapse. The baselines behave differently: OC-SVM shows a smooth but weaker decline without a strong high-precision plateau, while Isolation Forest detects only a few extreme anomalies before collapsing to near-random performance. These operational differences are hidden in a single AP number but are important for judging how the methods would behave in deployment.
% \begin{table}[t]
% \centering
% \caption{AP means $\pm$ std across 5 folds for experiment-based evaluation only, grouped by labeling regime.}
% \label{tab:results_ap_with_std}
% \begin{tabularx}{\textwidth}{c*{4}{Y}}
% \toprule
% Latent Dim. & \rotheader{DeepSAD \\(LeNet)} & \rotheader{DeepSAD\\(Efficient)} & \rotheader{IsoForest} & \rotheader{OC-SVM} \\
% \midrule
% \multicolumn{5}{l}{\textbf{Labeling regime: }\(\mathbf{0/0}\)} \\
% \addlinespace[2pt]
% 32 & 0.664$\,\pm\,0.029$ & 0.650$\,\pm\,0.017$ & 0.217$\,\pm\,0.010$ & 0.315$\,\pm\,0.050$ \\
% 64 & 0.635$\,\pm\,0.018$ & 0.643$\,\pm\,0.016$ & 0.215$\,\pm\,0.008$ & 0.371$\,\pm\,0.076$ \\
% 128 & 0.642$\,\pm\,0.022$ & 0.642$\,\pm\,0.017$ & 0.218$\,\pm\,0.010$ & 0.486$\,\pm\,0.088$ \\
% 256 & 0.615$\,\pm\,0.022$ & 0.631$\,\pm\,0.015$ & 0.214$\,\pm\,0.010$ & 0.452$\,\pm\,0.064$ \\
% 512 & 0.613$\,\pm\,0.015$ & 0.635$\,\pm\,0.016$ & 0.216$\,\pm\,0.012$ & 0.397$\,\pm\,0.053$ \\
% 768 & 0.609$\,\pm\,0.036$ & 0.617$\,\pm\,0.016$ & 0.219$\,\pm\,0.008$ & 0.439$\,\pm\,0.093$ \\
% 1024 & 0.607$\,\pm\,0.018$ & 0.612$\,\pm\,0.018$ & 0.215$\,\pm\,0.003$ & 0.394$\,\pm\,0.049$ \\
% \midrule
% \multicolumn{5}{l}{\textbf{Labeling regime: }\(\mathbf{50/10}\)} \\
% \addlinespace[2pt]
% 32 & 0.569$\,\pm\,0.061$ & 0.582$\,\pm\,0.008$ & 0.217$\,\pm\,0.010$ & 0.315$\,\pm\,0.050$ \\
% 64 & 0.590$\,\pm\,0.032$ & 0.592$\,\pm\,0.017$ & 0.215$\,\pm\,0.008$ & 0.371$\,\pm\,0.076$ \\
% 128 & 0.566$\,\pm\,0.078$ & 0.588$\,\pm\,0.011$ & 0.218$\,\pm\,0.010$ & 0.486$\,\pm\,0.088$ \\
% 256 & 0.598$\,\pm\,0.027$ & 0.587$\,\pm\,0.015$ & 0.214$\,\pm\,0.010$ & 0.452$\,\pm\,0.064$ \\
% 512 & 0.550$\,\pm\,0.157$ & 0.587$\,\pm\,0.020$ & 0.216$\,\pm\,0.012$ & 0.397$\,\pm\,0.053$ \\
% 768 & 0.596$\,\pm\,0.014$ & 0.577$\,\pm\,0.017$ & 0.219$\,\pm\,0.008$ & 0.439$\,\pm\,0.093$ \\
% 1024 & 0.601$\,\pm\,0.012$ & 0.568$\,\pm\,0.013$ & 0.215$\,\pm\,0.003$ & 0.394$\,\pm\,0.049$ \\
% \midrule
% \multicolumn{5}{l}{\textbf{Labeling regime: }\(\mathbf{500/100}\)} \\
% \addlinespace[2pt]
% 32 & 0.625$\,\pm\,0.015$ & 0.621$\,\pm\,0.009$ & 0.217$\,\pm\,0.010$ & 0.315$\,\pm\,0.050$ \\
% 64 & 0.611$\,\pm\,0.014$ & 0.621$\,\pm\,0.018$ & 0.215$\,\pm\,0.008$ & 0.371$\,\pm\,0.076$ \\
% 128 & 0.607$\,\pm\,0.014$ & 0.615$\,\pm\,0.015$ & 0.218$\,\pm\,0.010$ & 0.486$\,\pm\,0.088$ \\
% 256 & 0.604$\,\pm\,0.022$ & 0.612$\,\pm\,0.017$ & 0.214$\,\pm\,0.010$ & 0.452$\,\pm\,0.064$ \\
% 512 & 0.578$\,\pm\,0.109$ & 0.608$\,\pm\,0.018$ & 0.216$\,\pm\,0.012$ & 0.397$\,\pm\,0.053$ \\
% 768 & 0.597$\,\pm\,0.014$ & 0.598$\,\pm\,0.017$ & 0.219$\,\pm\,0.008$ & 0.439$\,\pm\,0.093$ \\
% 1024 & 0.601$\,\pm\,0.013$ & 0.591$\,\pm\,0.016$ & 0.215$\,\pm\,0.003$ & 0.394$\,\pm\,0.049$ \\
% \bottomrule
% \end{tabularx}
% \end{table}
Taken together, the two evaluation schemes provide complementary insights. The experiment-based labels offer a noisy but realistic setting that shows how methods cope with ambiguous data, while the hand-labeled labels confirm that DeepSAD can achieve nearly perfect separation when the ground truth is clean. The combination of both evaluations makes clear that (i) DeepSAD is stronger than the baselines under both conditions, (ii) the apparent performance limits under experiment-based labels are mainly due to label noise, and (iii) interpreting results requires care, since performance drops in the curves often reflect mislabeled samples rather than model failures. At the same time, both schemes remain binary classifications and therefore cannot directly evaluate the central question of whether anomaly scores can serve as a continuous measure of degradation. For this reason, we extend the analysis in Section~\ref{sec:results_inference}, where inference on entire unseen experiments is used to provide a more intuitive demonstration of the method's potential for quantifying LiDAR degradation in practice.
\fig{prc_representative}{figures/results_prc.png}{Representative precision--recall curves over all latent dimensionalities for semi-labeling regime 0/0 from experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OC-SVM declines more smoothly but exhibits high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
\paragraph{Effect of latent space dimensionality.}
Figure~\ref{fig:latent_dim_ap} shows how average precision changes with latent dimension under the experiment-based evaluation. The best performance is reached with compact latent spaces (32--128), while performance drops as the latent dimension grows. This can be explained by how the latent space controls the separation between normal and anomalous samples. Small bottlenecks act as a form of regularization, keeping the representation compact and making it easier to distinguish clear anomalies from normal frames. Larger latent spaces increase model capacity, but this extra flexibility also allows more overlap between normal frames and the mislabeled anomalies from the evaluation data. As a result, the model struggles more to keep the two groups apart.
This effect is clearly visible in the precision--recall curves. For DeepSAD at all dimensionalities we observe the high initial precision and the steep drop once the evaluation demands that mislabeled anomalies be included. However, the sharpness of this drop depends on the latent size: at 32 dimensions the fall is comparatively gradual, while at 1024 it is almost vertical. In practice, this means that higher-dimensional latent spaces amplify the label-noise problem and lead to sudden precision collapses once the clear anomalies have been detected. Compact latent spaces are therefore more robust under noisy evaluation conditions and appear to be the safer choice for real-world deployment.
%Figure~\ref{fig:latent_dim_ap} plots AP versus latent dimension under the experiment-based evaluation. DeepSAD benefits from compact latent spaces (e.g., 32128), with diminishing or negative returns at larger codes. We argue that the most likely reason for the declining performance with increasing latent space size is due to the network learning
Figure~\ref{fig:latent_dim_ap} plots AP versus latent dimension under the experiment-based evaluation. DeepSAD benefits most from compact latent spaces (e.g., 32--128), with diminishing or even negative returns at larger code sizes. We argue that two interacting effects likely explain this trend. First, higher-dimensional latent spaces increase model capacity and reduce the implicit regularization provided by smaller bottlenecks, leading to overfitting. Second, as illustrated by the representative PRC curves in Figure~\ref{fig:prc_representative}, DeepSAD exhibits a steep decline in precision once recall exceeds roughly 0.5. We attribute this effect primarily to mislabeled or ambiguous samples in the experiment-based evaluation: once the model is forced to classify these borderline cases, precision inevitably drops. Importantly, while such a drop is visible across all latent dimensions, its sharpness increases with latent size. At small dimensions (e.g., 32), the decline is noticeable but somewhat gradual, whereas at 1024 it becomes nearly vertical. This suggests that larger latent spaces exacerbate the difficulty of distinguishing borderline anomalies from normal data, leading to more abrupt collapses in precision once the high-confidence region is exhausted.
\fig{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows an inverse correlation between AP and latent space size.}
\paragraph{Effect of semi-supervised labeling.}
Table~\ref{tab:results_ap} shows that the unsupervised regime \((0/0)\) achieves the best AP, while the lightly supervised regime \((50/10)\) performs worst. With many labels \((500/100)\), performance improves again but remains slightly below the unsupervised case. This pattern also appears under the hand-labeled evaluation, which excludes mislabeled frames. The drop with light supervision therefore cannot be explained by noisy evaluation targets, but must stem from the training process itself.
The precision--recall curves in Figure~\ref{fig:prc_over_semi} show that the overall curve shapes are similar across regimes, but shifted relative to one another in line with the AP ordering \((0/0) > (500/100) > (50/10)\). We attribute these shifts to overfitting: when only a few anomalies are labeled, the model fits them too strongly, and if those examples differ too much from other anomalies, generalization suffers. This explains why lightly supervised training performs even worse than unsupervised training, which avoids this bias.
\fig{prc_over_semi}{figures/results_prc_over_semi.png}{Precision--recall curves at latent dimension~32 for all three labeling regimes (unsupervised, lightly supervised, heavily supervised), shown separately for the LeNet-inspired (left) and Efficient (right) encoders. Baseline methods are included for comparison. Latent dimension~32 is shown as it achieved the best overall AP and is representative of the typical PRC shapes across dimensions.}
The LeNet variant illustrates this effect most clearly, showing unusually high variance across folds in the lightly supervised case. In several folds, precision drops atypically early, which supports the idea that the model has overfit to a poorly chosen subset of labeled anomalies. The Efficient variant is less affected, maintaining more stable precision plateaus, which suggests it is more robust to such overfitting; we observe this behavior consistently for nearly all latent dimensionalities.
With many labels \((500/100)\), the results become more stable again and the PRC curves closely resemble the unsupervised case, only shifted slightly left. A larger and more diverse set of labeled anomalies reduces the risk of unlucky sampling and improves generalization, but it still cannot fully match the unsupervised regime, where no overfitting to a specific labeled subset occurs. The only exception is an outlier at latent dimension 512 for LeNet, where the curve again resembles the lightly supervised case, likely due to label sampling effects amplified by higher latent capacity.
In summary, three consistent patterns emerge: (i) a very small number of labels can hurt performance by causing overfitting to specific examples, (ii) many labels reduce this problem but still do not surpass unsupervised generalization, and (iii) encoder architecture strongly affects robustness, with LeNet being more sensitive to unstable behavior than Efficient.
% --- Section: Autoencoder Pretraining Results ---
\newsection{results_inference}{Inference on Held-Out Experiments}
@@ -1690,36 +1655,13 @@ Among the four approaches, the strongest separation is achieved by DeepSAD (Effi
For Isolation Forest, the anomaly scores are already elevated in the clean experiment, which prevents reliable differentiation between normal and degraded runs and makes the method unsuitable in this context.
It is important to note that the score axes are scaled individually per method, so comparisons should focus on relative separation rather than absolute values.
Because the raw anomaly scores produced by the different methods are on incomparable scales (depending, for example, on network architecture or latent space dimensionality), we first applied a $z$-score normalization.
The $z$-score is a standardized measure that rescales values in terms of their deviation from the mean relative to the standard deviation, making outputs from different models directly comparable in terms of how many standard deviations they deviate from normal behavior.
To allow comparison between the clean and degraded experiments, the mean and standard deviation were estimated exclusively from the clean experiment and then used to normalize the degraded scores as well.
This ensures that increases in the degraded runs are interpreted relative to the distribution of the clean baseline, whereas computing separate $z$-scores per experiment would only reveal deviations within each run individually and not enable a meaningful cross-experiment comparison.
It should be noted that the $z$-scores remain method-specific: while the relative separation between clean and degraded runs can be compared within a method, the absolute scales across different methods are not directly comparable, so readers should take note of the differing axis ranges for each subplot.
After normalization, the resulting time series were still highly noisy, which motivated the application of exponential moving average (EMA) smoothing.
EMA was chosen because it is causal (it does not rely on future data) and thus suitable for real-time inference.
Although it introduces a small time delay, this delay is shorter than for other smoothing techniques such as running averages.
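A minimal Python sketch of this normalization and smoothing procedure is given below; the function names, the smoothing factor, and the random stand-in scores are our own illustrative choices rather than the exact implementation used for the plots.
\begin{verbatim}
import numpy as np

def normalize_to_clean_baseline(clean_scores, degraded_scores):
    """z-score both runs using mean/std estimated ONLY from the clean experiment,
    so degraded scores are expressed as deviations from the clean baseline."""
    mu, sigma = clean_scores.mean(), clean_scores.std()
    return (clean_scores - mu) / sigma, (degraded_scores - mu) / sigma

def ema(x, alpha=0.1):
    """Causal exponential moving average: each output uses only past and current
    samples, so the filter remains usable for real-time inference."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = alpha * x[t] + (1 - alpha) * y[t - 1]
    return y

# Example with random scores standing in for per-frame anomaly scores.
rng = np.random.default_rng(0)
clean = rng.normal(1.0, 0.2, 500)
degraded = rng.normal(1.6, 0.4, 500)
clean_z, degraded_z = normalize_to_clean_baseline(clean, degraded)
smoothed = ema(degraded_z)
\end{verbatim}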
The red method curves can also be compared with the blue and green statistical indicators (missing points and near-sensor particle hits).
While some similarities in shape may suggest that the methods partly capture these statistics, such interpretations should be made with caution.
The anomaly detection models are expected to have learned additional patterns that are not directly observable from simple statistics, and these may also contribute to their ability to separate degraded from clean data.
% \newchapter{conclusion_future_work}{Conclusion and Future Work}
% \todo[inline]{overall we were successful in showing that AD can be used for degradation quantification but there is quite some future work before usable and especially obstacles in data availability and ground truth came up during thesis which show that more work in that direction is required for progress in this field}
% \newsection{conclusion_data}{Missing Ground-Truth as an Obstacle}
% \todo[inline]{biggest obstacle missing ground-truth, discuss what ground-truth is? we have a missing comprehensive understanding of what we mean when we say degradation so future work would be a better understanding but not only as theoretical which errors can occur but rather also how they can affect users of the data. complex models which not only assume disperesed smoke but also single origin of smoke which creates dense objects that may confuse algorithms. }
% \todo[inline]{we also discussed that objective ground truth would be hard to collect with typical smoke sensors not only due to the aforementioned question what degradation is and if the amount/density/particle size of smoke would even be enough to get a full picture of degradation for such use-cases but also due to the different nature of smoke-sensors which collect data only at the local point in space where the sensor is located and on the other hand lidar sensors which use lasers to collect data about the environment from a distnace, resulting in differing information which may not be all that useful as groundt truth}
% \todo[inline]{most likely user is SLAM whose mapping will most like also be used by other decision algorithms, instead of directly using lidar data. so maybe furure work could assign degradation based on difference between previously mapped ground truth of 3d model of world and of output from SLAM algorithms, maybe subjective labeling will be necessary especially where single points of origin result in smoke clouds which are seen as solid but morphing objects by slam algorithms.}
% \todo[inline]{since quantification is goal the binary kind of ground-truth is also a bit lacking in that it allows us to show classification performance of methods and subjective judgement of inference shows that it very well might work not only as classifier but as regression/quantification for inbetween cases, but nonetheless analog evaluation targets would strongly improve confidence in quantification performance instead of only classification performance.}
% \newsection{conclusion_ad}{Insights into DeepSAD and AD for Degradation Quantification}
% \todo[inline]{we've shown that deepsad principle works and is usable for this use-case even when looking at feasibility due to runtimes. we've also shown that generally AD may be used although more complex algorithms such as DeepSAD outperform simple methods like isoforest. interestingly in our case the semi-supervised nature of deepsad could not show improved performance, although due to our noisy evaluation data its hard to interpret the results in a meaningful way for this. we have shown that choosing the correct architecture for the encoder can make quite a difference not only in pre-training preformance but also in training of the AD method, although once again the noisy evaluation targets make interpretation of performance hard to say definitively how strongly this affects the outcome. an interersting future work in this case could be to evaluate different architectures for known datasets with good evaluation targets to show how important the choice of encoder architecture really is for DeepSAD and maybe related methods. another interesting avenue of research could be the introduction of per sample weighted semi-supervised targets, if analog ground-truth is available this could allow DeepSAD to better learn to quantify the anomalous nature of a sample instead of simply training it to classify samples.}
% \newsection{conclusion_open_questions}{Open Questions for Degradation Quantification}
% \todo[inline]{possible avenues we know of but did not look into are: difference between two temporally related frames as input to how much degradation there is, sensor fusion with other sensors (ultrasonic sensor may look through dense clouds which lidar cannot penetrate), tested architecture of DeepSAD use full 360 degree pointcloud as input which could be problematic if degradation occurs only in part of pointcloud (specific direction). possible solution is smaller input window (segment point cloud into sections depending on horizontal and or vertical angles) and calculate anomaly score/degradation quantification per seciton and therefore per direction of the sensor. this was tested in a small capacity and proved quite promising but we left it as future work in this thesis. this is also related to the problem that current solution only works for data from pointclouds with exact same resolution and could be used to enable the technique to work with multiple types of lidar, although unclear if different resolution per angle will work/affect performance of DeepSAD since this was not tested}
\newchapter{conclusion_future_work}{Conclusion and Future Work}
This thesis set out to answer the following research question, formulated in Chapter~\ref{chp:introduction}:
@@ -1734,20 +1676,12 @@ Based on the experiments presented in Chapter~\ref{sec:results_deepsad} and Chap
The main contributions of this thesis can be summarized as follows:
\begin{itemize}
\item \textbf{Empirical evaluation:} A systematic comparison of DeepSAD against Isolation Forest and OC-SVM for lidar degradation detection, demonstrating that DeepSAD consistently outperforms simpler baselines.
\item \textbf{Analysis of latent dimensionality:} An investigation of how representation size influences performance and stability under noisy labels, revealing that smaller latent spaces are more robust in this setting and that, in spite of the high input dimensionality of whole point clouds produced by spinning lidar sensors, bottlenecks as small as 32 dimensions can achieve promising performance.
\item \textbf{Analysis of semi-supervised training labels:} An evaluation of different semi-supervised labeling regimes, showing that in our case purely unsupervised training yielded the best performance. Adding a small number of labels reduced performance, while a higher ratio of labels led to partial recovery. This pattern may indicate overfitting effects, although interpretation is complicated by the presence of mislabeled evaluation targets.
\item \textbf{Analysis of encoder architecture:} A comparison between a LeNet-inspired and an Efficient encoder showed that the choice of architecture has a decisive influence on DeepSADs performance. The Efficient encoder outperformed the LeNet-inspired baseline not only during autoencoder pretraining but also in anomaly detection. While the exact magnitude of this improvement is difficult to quantify due to noisy evaluation targets, the results underline the importance of encoder design for representation quality in DeepSAD. \item \textbf{Analysis of encoder architecture:} A comparison between a LeNet-inspired and an Efficient encoder showed that the choice of architecture has a decisive influence on DeepSADs performance. The Efficient encoder outperformed the LeNet-inspired baseline not only during autoencoder pretraining but also in anomaly detection. While the exact magnitude of this improvement is difficult to quantify due to noisy evaluation targets, the results underline the importance of encoder design for representation quality in DeepSAD.
\item \textbf{Feasibility study:} An exploration of runtime, temporal inference plots, and downstream applicability, indicating that anomaly scores correlate with degradation trends and could provide a foundation for future quantification methods. \item \textbf{Feasibility study:} An exploration of runtime, temporal inference plots, and downstream applicability, indicating that anomaly scores correlate with degradation trends and could provide a foundation for future quantification methods.
\end{itemize} \end{itemize}
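Concretely, the labeling regimes examined above enter training through the Deep SAD objective which, following the original formulation by Ruff et al. and using that paper's notation rather than the one introduced in earlier chapters, roughly takes the form
\[
    \min_{\mathcal{W}} \;\; \frac{1}{n+m}\sum_{i=1}^{n} \left\| \phi(\mathbf{x}_i;\mathcal{W}) - \mathbf{c} \right\|^2
    \;+\; \frac{\eta}{n+m}\sum_{j=1}^{m} \left( \left\| \phi(\tilde{\mathbf{x}}_j;\mathcal{W}) - \mathbf{c} \right\|^2 \right)^{\tilde{y}_j}
    \;+\; \frac{\lambda}{2}\sum_{\ell} \left\| \mathbf{W}^{\ell} \right\|_F^2 ,
\]
where the $n$ unlabeled samples are pulled toward the hypersphere center $\mathbf{c}$, the $m$ labeled samples are pulled toward it for $\tilde{y}_j = +1$ (labeled normal) and pushed away from it for $\tilde{y}_j = -1$ (labeled anomalous), and $\eta$ and $\lambda$ weight the labeled term and the weight decay, respectively. The anomaly score of a sample is its squared distance $\left\| \phi(\mathbf{x};\mathcal{W}) - \mathbf{c} \right\|^2$ to the center, which is why the number and quality of labeled samples directly shape the learned latent geometry and, in turn, the results summarized above.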
% This thesis investigated the feasibility of using anomaly detection (AD) methods, and in particular DeepSAD, for quantifying lidar degradation in subterranean environments for use-cases such as autonomous rescue robots. The main findings can be summarized as follows:
% \begin{itemize}
% \item AD can in principle be applied to the problem of degradation quantification, with DeepSAD clearly outperforming simpler baselines such as Isolation Forest and OC-SVM (cf.~\ref{sec:results_deepsad}).
% \item Evaluation is severely constrained by the lack of reliable ground truth, which limits interpretation of quantitative results and hampers the assessment of degradation quantification even just for binary classification but especially beyond that.
% \item Despite these challenges, our results provide encouraging evidence that anomaly detection scores correlate with degradation trends (cf.~\ref{sec:results_inference}), motivating future work toward more reliable evaluation protocols and more expressive ground truth.
% \end{itemize}
\newsection{conclusion_data}{Missing Ground Truth as an Obstacle}
%The most significant obstacle identified in this work is the absence of robust and comprehensive ground truth for lidar degradation. As discussed in Chapter~\ref{chp:data_preprocessing}, it is not trivial to define what “degradation” precisely means in practice. While intuitive descriptions exist (e.g., dispersed smoke, dense clouds, localized plumes), these translate poorly into objective evaluation targets. Future work should therefore not only refine the conceptual understanding of degradation but also connect it to downstream users of the data (e.g., SLAM or mapping algorithms), where errors manifest differently.

View File

@@ -0,0 +1,273 @@
#!/usr/bin/env python3
from __future__ import annotations
import shutil
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Tuple
import matplotlib.pyplot as plt
import numpy as np
import polars as pl
from matplotlib.ticker import MaxNLocator
# =========================
# Config
# =========================
ROOT = Path("/home/fedex/mt/results/copy")
OUTPUT_DIR = Path("/home/fedex/mt/plots/results_ap_over_latent")
# Labeling regimes (shown as separate subplots)
SEMI_LABELING_REGIMES: list[tuple[int, int]] = [(0, 0), (50, 10), (500, 100)]
# Evaluations: separate figure per eval
EVALS: list[str] = ["exp_based", "manual_based"]
# X-axis (latent dims)
LATENT_DIMS: list[int] = [32, 64, 128, 256, 512, 768, 1024]
# Visual style
FIGSIZE = (8, 8) # one tall figure with 3 compact subplots
MARKERSIZE = 7
SCATTER_ALPHA = 0.95
LINEWIDTH = 2.0
TREND_LINEWIDTH = 2.2
BAND_ALPHA = 0.18
# Toggle: show ±1 std bands (k-fold variability)
SHOW_STD_BANDS = True # <<< set to False to hide the bands
# Colors for the two DeepSAD backbones
COLOR_LENET = "#1f77b4" # blue
COLOR_EFFICIENT = "#ff7f0e" # orange
# =========================
# Loader
# =========================
from load_results import load_results_dataframe
# =========================
# Helpers
# =========================
def _with_net_label(df: pl.DataFrame) -> pl.DataFrame:
return df.with_columns(
pl.when(
pl.col("network").cast(pl.Utf8).str.to_lowercase().str.contains("lenet")
)
.then(pl.lit("LeNet"))
.when(
pl.col("network").cast(pl.Utf8).str.to_lowercase().str.contains("efficient")
)
.then(pl.lit("Efficient"))
.otherwise(pl.col("network").cast(pl.Utf8))
.alias("net_label")
)
def _filter_deepsad(df: pl.DataFrame) -> pl.DataFrame:
return df.filter(
(pl.col("model") == "deepsad")
& (pl.col("eval").is_in(EVALS))
& (pl.col("latent_dim").is_in(LATENT_DIMS))
& (pl.col("net_label").is_in(["LeNet", "Efficient"]))
).select(
"eval",
"net_label",
"latent_dim",
"semi_normals",
"semi_anomalous",
"fold",
"ap",
)
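# Per-configuration AP summary: mean and standard deviation across the
# cross-validation folds (the std drives the optional ±1 std bands below).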
@dataclass(frozen=True)
class Agg:
mean: float
std: float
def aggregate_ap(df: pl.DataFrame) -> Dict[Tuple[str, str, int, int, int], Agg]:
out: Dict[Tuple[str, str, int, int, int], Agg] = {}
gb = (
df.group_by(
["eval", "net_label", "latent_dim", "semi_normals", "semi_anomalous"]
)
.agg(pl.col("ap").mean().alias("mean"), pl.col("ap").std().alias("std"))
.to_dicts()
)
for row in gb:
key = (
str(row["eval"]),
str(row["net_label"]),
int(row["latent_dim"]),
int(row["semi_normals"]),
int(row["semi_anomalous"]),
)
m = float(row["mean"]) if row["mean"] == row["mean"] else np.nan
s = float(row["std"]) if row["std"] == row["std"] else np.nan
out[key] = Agg(mean=m, std=s)
return out
def _lin_trend(xs: List[int], ys: List[float]) -> Tuple[np.ndarray, np.ndarray]:
if len(xs) < 2:
return np.array(xs, dtype=float), np.array(ys, dtype=float)
x = np.array(xs, dtype=float)
y = np.array(ys, dtype=float)
a, b = np.polyfit(x, y, 1)
x_fit = np.linspace(x.min(), x.max(), 200)
y_fit = a * x_fit + b
return x_fit, y_fit
def _dynamic_ylim(all_vals: List[float], all_errs: List[float]) -> Tuple[float, float]:
vals = np.array(all_vals, dtype=float)
errs = np.array(all_errs, dtype=float) if SHOW_STD_BANDS else np.zeros_like(vals)
valid = np.isfinite(vals)
if not np.any(valid):
return (0.0, 1.0)
v, e = vals[valid], errs[valid]
lo = np.min(v - e)
hi = np.max(v + e)
span = max(1e-3, hi - lo)
pad = 0.08 * span
y0 = max(0.0, lo - pad)
y1 = min(1.0, hi + pad)
if (y1 - y0) < 0.08:
mid = 0.5 * (y0 + y1)
y0 = max(0.0, mid - 0.04)
y1 = min(1.0, mid + 0.04)
return (float(y0), float(y1))
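# The tested latent dims are roughly log-spaced (32 to 1024); mapping them to
# evenly spaced x positions keeps the scatter readable and prevents the linear
# trend fit from being dominated by the largest dimensions.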
def _get_dim_mapping(dims: list[int]) -> dict[int, int]:
"""Map actual dimensions to evenly spaced positions (0, 1, 2, ...)"""
return {dim: i for i, dim in enumerate(dims)}
def plot_eval(ev: str, agg: Dict[Tuple[str, str, int, int, int], Agg], outdir: Path):
fig, axes = plt.subplots(
len(SEMI_LABELING_REGIMES),
1,
figsize=FIGSIZE,
constrained_layout=True,
sharex=True,
)
if len(SEMI_LABELING_REGIMES) == 1:
axes = [axes]
# Create dimension mapping
dim_mapping = _get_dim_mapping(LATENT_DIMS)
for ax, regime in zip(axes, SEMI_LABELING_REGIMES):
semi_n, semi_a = regime
data = {}
for net in ["LeNet", "Efficient"]:
xs, ys, es = [], [], []
for dim in LATENT_DIMS:
key = (ev, net, dim, semi_n, semi_a)
if key in agg:
xs.append(
dim_mapping[dim]
) # Use mapped position instead of actual dim
ys.append(agg[key].mean)
es.append(agg[key].std)
data[net] = (xs, ys, es)
for net, color in [("LeNet", COLOR_LENET), ("Efficient", COLOR_EFFICIENT)]:
xs, ys, es = data[net]
if not xs:
continue
# Set evenly spaced ticks with actual dimension labels
ax.set_xticks(list(dim_mapping.values()))
ax.set_xticklabels(LATENT_DIMS)
ax.yaxis.set_major_locator(MaxNLocator(nbins=5))
ax.scatter(
xs, ys, s=35, color=color, alpha=SCATTER_ALPHA, label=f"{net} (points)"
)
x_fit, y_fit = _lin_trend(xs, ys) # Now using mapped positions
ax.plot(
x_fit,
y_fit,
color=color,
linewidth=TREND_LINEWIDTH,
label=f"{net} (trend)",
)
if SHOW_STD_BANDS and es and np.any(np.isfinite(es)):
ylo = np.clip(np.array(ys) - np.array(es), 0.0, 1.0)
yhi = np.clip(np.array(ys) + np.array(es), 0.0, 1.0)
ax.fill_between(
xs, ylo, yhi, color=color, alpha=BAND_ALPHA, linewidth=0
)
all_vals, all_errs = [], []
for net in ["LeNet", "Efficient"]:
_, ys, es = data[net]
all_vals.extend(ys)
all_errs.extend(es)
y0, y1 = _dynamic_ylim(all_vals, all_errs)
ax.set_ylim(y0, y1)
ax.set_title(f"Labeling regime {semi_n}/{semi_a}", fontsize=11)
ax.grid(True, alpha=0.35)
axes[-1].set_xlabel("Latent dimension")
for ax in axes:
ax.set_ylabel("AP")
handles, labels = axes[0].get_legend_handles_labels()
fig.legend(handles, labels, ncol=2, loc="upper center", bbox_to_anchor=(0.75, 0.97))
fig.suptitle(f"AP vs. Latent Dimensionality — {ev.replace('_', ' ')}", y=1.05)
fname = f"ap_trends_{ev}.png"
fig.savefig(outdir / fname, dpi=150)
plt.close(fig)
def plot_all(agg: Dict[Tuple[str, str, int, int, int], Agg], outdir: Path):
outdir.mkdir(parents=True, exist_ok=True)
for ev in EVALS:
plot_eval(ev, agg, outdir)
def main():
df = load_results_dataframe(ROOT, allow_cache=True)
df = _with_net_label(df)
df = _filter_deepsad(df)
agg = aggregate_ap(df)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
archive_dir = OUTPUT_DIR / "archive"
archive_dir.mkdir(parents=True, exist_ok=True)
ts_dir = archive_dir / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
ts_dir.mkdir(parents=True, exist_ok=True)
plot_all(agg, ts_dir)
try:
script_path = Path(__file__)
shutil.copy2(script_path, ts_dir / script_path.name)
except Exception:
pass
latest = OUTPUT_DIR / "latest"
latest.mkdir(parents=True, exist_ok=True)
for f in latest.iterdir():
if f.is_file():
f.unlink()
for f in ts_dir.iterdir():
if f.is_file():
shutil.copy2(f, latest / f.name)
print(f"Saved plots to: {ts_dir}")
print(f"Also updated: {latest}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,260 @@
#!/usr/bin/env python3
from __future__ import annotations
import shutil
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Tuple
import matplotlib.pyplot as plt
import numpy as np
import polars as pl
from matplotlib.ticker import MaxNLocator
# =========================
# Config
# =========================
ROOT = Path("/home/fedex/mt/results/copy")
OUTPUT_DIR = Path("/home/fedex/mt/plots/results_ap_over_semi")
# Labeling regimes (shown as separate subplots)
SEMI_LABELING_REGIMES: list[tuple[int, int]] = [(0, 0), (50, 10), (500, 100)]
# Evaluations: separate figure per eval
EVALS: list[str] = ["exp_based", "manual_based"]
# X-axis (latent dims)
LATENT_DIMS: list[int] = [32, 64, 128, 256, 512, 768, 1024]
# Visual style
FIGSIZE = (8, 8) # one tall figure with 3 compact subplots
MARKERSIZE = 7
SCATTER_ALPHA = 0.95
LINEWIDTH = 2.0
TREND_LINEWIDTH = 2.2
BAND_ALPHA = 0.18
# Toggle: show ±1 std bands (k-fold variability)
SHOW_STD_BANDS = True # <<< set to False to hide the bands
# Colors for the two DeepSAD backbones
COLOR_LENET = "#1f77b4" # blue
COLOR_EFFICIENT = "#ff7f0e" # orange
# =========================
# Loader
# =========================
from load_results import load_results_dataframe
# =========================
# Helpers
# =========================
def _with_net_label(df: pl.DataFrame) -> pl.DataFrame:
return df.with_columns(
pl.when(
pl.col("network").cast(pl.Utf8).str.to_lowercase().str.contains("lenet")
)
.then(pl.lit("LeNet"))
.when(
pl.col("network").cast(pl.Utf8).str.to_lowercase().str.contains("efficient")
)
.then(pl.lit("Efficient"))
.otherwise(pl.col("network").cast(pl.Utf8))
.alias("net_label")
)
def _filter_deepsad(df: pl.DataFrame) -> pl.DataFrame:
return df.filter(
(pl.col("model") == "deepsad")
& (pl.col("eval").is_in(EVALS))
& (pl.col("latent_dim").is_in(LATENT_DIMS))
& (pl.col("net_label").is_in(["LeNet", "Efficient"]))
).select(
"eval",
"net_label",
"latent_dim",
"semi_normals",
"semi_anomalous",
"fold",
"ap",
)
@dataclass(frozen=True)
class Agg:
mean: float
std: float
def aggregate_ap(df: pl.DataFrame) -> Dict[Tuple[str, str, int, int, int], Agg]:
out: Dict[Tuple[str, str, int, int, int], Agg] = {}
gb = (
df.group_by(
["eval", "net_label", "latent_dim", "semi_normals", "semi_anomalous"]
)
.agg(pl.col("ap").mean().alias("mean"), pl.col("ap").std().alias("std"))
.to_dicts()
)
for row in gb:
key = (
str(row["eval"]),
str(row["net_label"]),
int(row["latent_dim"]),
int(row["semi_normals"]),
int(row["semi_anomalous"]),
)
m = float(row["mean"]) if row["mean"] == row["mean"] else np.nan
s = float(row["std"]) if row["std"] == row["std"] else np.nan
out[key] = Agg(mean=m, std=s)
return out
def _lin_trend(xs: List[int], ys: List[float]) -> Tuple[np.ndarray, np.ndarray]:
if len(xs) < 2:
return np.array(xs, dtype=float), np.array(ys, dtype=float)
x = np.array(xs, dtype=float)
y = np.array(ys, dtype=float)
a, b = np.polyfit(x, y, 1)
x_fit = np.linspace(x.min(), x.max(), 200)
y_fit = a * x_fit + b
return x_fit, y_fit
def _dynamic_ylim(all_vals: List[float], all_errs: List[float]) -> Tuple[float, float]:
vals = np.array(all_vals, dtype=float)
errs = np.array(all_errs, dtype=float) if SHOW_STD_BANDS else np.zeros_like(vals)
valid = np.isfinite(vals)
if not np.any(valid):
return (0.0, 1.0)
v, e = vals[valid], errs[valid]
lo = np.min(v - e)
hi = np.max(v + e)
span = max(1e-3, hi - lo)
pad = 0.08 * span
y0 = max(0.0, lo - pad)
y1 = min(1.0, hi + pad)
if (y1 - y0) < 0.08:
mid = 0.5 * (y0 + y1)
y0 = max(0.0, mid - 0.04)
y1 = min(1.0, mid + 0.04)
return (float(y0), float(y1))
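# Unlike the companion script above (which remaps latent dims to evenly spaced
# x positions), this variant plots the raw dims on the x-axis, so the
# log-spaced values cluster near the origin and the trend fit is weighted
# toward the larger dimensions.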
def plot_eval(ev: str, agg: Dict[Tuple[str, str, int, int, int], Agg], outdir: Path):
fig, axes = plt.subplots(
len(SEMI_LABELING_REGIMES),
1,
figsize=FIGSIZE,
constrained_layout=True,
sharex=True,
)
if len(SEMI_LABELING_REGIMES) == 1:
axes = [axes]
for ax, regime in zip(axes, SEMI_LABELING_REGIMES):
semi_n, semi_a = regime
data = {}
for net in ["LeNet", "Efficient"]:
xs, ys, es = [], [], []
for dim in LATENT_DIMS:
key = (ev, net, dim, semi_n, semi_a)
if key in agg:
xs.append(dim)
ys.append(agg[key].mean)
es.append(agg[key].std)
data[net] = (xs, ys, es)
for net, color in [("LeNet", COLOR_LENET), ("Efficient", COLOR_EFFICIENT)]:
xs, ys, es = data[net]
if not xs:
continue
ax.set_xticks(LATENT_DIMS)
            ax.yaxis.set_major_locator(MaxNLocator(nbins=5))  # cap the y-axis at ~5 tick intervals
ax.scatter(
xs, ys, s=35, color=color, alpha=SCATTER_ALPHA, label=f"{net} (points)"
)
x_fit, y_fit = _lin_trend(xs, ys)
ax.plot(
x_fit,
y_fit,
color=color,
linewidth=TREND_LINEWIDTH,
label=f"{net} (trend)",
)
if SHOW_STD_BANDS and es and np.any(np.isfinite(es)):
ylo = np.clip(np.array(ys) - np.array(es), 0.0, 1.0)
yhi = np.clip(np.array(ys) + np.array(es), 0.0, 1.0)
ax.fill_between(
xs, ylo, yhi, color=color, alpha=BAND_ALPHA, linewidth=0
)
all_vals, all_errs = [], []
for net in ["LeNet", "Efficient"]:
_, ys, es = data[net]
all_vals.extend(ys)
all_errs.extend(es)
y0, y1 = _dynamic_ylim(all_vals, all_errs)
ax.set_ylim(y0, y1)
ax.set_title(f"Labeling regime {semi_n}/{semi_a}", fontsize=11)
ax.grid(True, alpha=0.35)
axes[-1].set_xlabel("Latent dimension")
for ax in axes:
ax.set_ylabel("AP")
handles, labels = axes[0].get_legend_handles_labels()
fig.legend(handles, labels, ncol=2, loc="upper center", bbox_to_anchor=(0.75, 0.97))
fig.suptitle(f"AP vs. Latent Dimensionality — {ev.replace('_', ' ')}", y=1.05)
fname = f"ap_trends_{ev}.png"
fig.savefig(outdir / fname, dpi=150)
plt.close(fig)
def plot_all(agg: Dict[Tuple[str, str, int, int, int], Agg], outdir: Path):
outdir.mkdir(parents=True, exist_ok=True)
for ev in EVALS:
plot_eval(ev, agg, outdir)
def main():
df = load_results_dataframe(ROOT, allow_cache=True)
df = _with_net_label(df)
df = _filter_deepsad(df)
agg = aggregate_ap(df)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
archive_dir = OUTPUT_DIR / "archive"
archive_dir.mkdir(parents=True, exist_ok=True)
ts_dir = archive_dir / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
ts_dir.mkdir(parents=True, exist_ok=True)
plot_all(agg, ts_dir)
try:
script_path = Path(__file__)
shutil.copy2(script_path, ts_dir / script_path.name)
except Exception:
pass
latest = OUTPUT_DIR / "latest"
latest.mkdir(parents=True, exist_ok=True)
for f in latest.iterdir():
if f.is_file():
f.unlink()
for f in ts_dir.iterdir():
if f.is_file():
shutil.copy2(f, latest / f.name)
print(f"Saved plots to: {ts_dir}")
print(f"Also updated: {latest}")
if __name__ == "__main__":
main()

View File

@@ -8,11 +8,11 @@ from pathlib import Path
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
import numpy as np import numpy as np
import polars as pl import polars as pl
from matplotlib.lines import Line2D
from scipy.stats import sem, t
# CHANGE THIS IMPORT IF YOUR LOADER MODULE NAME IS DIFFERENT # CHANGE THIS IMPORT IF YOUR LOADER MODULE NAME IS DIFFERENT
from plot_scripts.load_results import load_results_dataframe from load_results import load_results_dataframe
from matplotlib.lines import Line2D
from scipy.stats import sem, t
# --------------------------------- # ---------------------------------
# Config # Config
@@ -23,6 +23,10 @@ OUTPUT_DIR = Path("/home/fedex/mt/plots/results_semi_labels_comparison")
LATENT_DIMS = [32, 64, 128, 256, 512, 768, 1024] LATENT_DIMS = [32, 64, 128, 256, 512, 768, 1024]
SEMI_REGIMES = [(0, 0), (50, 10), (500, 100)] SEMI_REGIMES = [(0, 0), (50, 10), (500, 100)]
EVALS = ["exp_based", "manual_based"] EVALS = ["exp_based", "manual_based"]
EVALS_LABELS = {
"exp_based": "Experiment-Based Labels",
"manual_based": "Manually-Labeled",
}
# Interp grids # Interp grids
ROC_GRID = np.linspace(0.0, 1.0, 200) ROC_GRID = np.linspace(0.0, 1.0, 200)
@@ -30,6 +34,10 @@ PRC_GRID = np.linspace(0.0, 1.0, 200)
# Baselines are duplicated across nets; use Efficient-only to avoid repetition # Baselines are duplicated across nets; use Efficient-only to avoid repetition
BASELINE_NET = "Efficient" BASELINE_NET = "Efficient"
BASELINE_LABELS = {
"isoforest": "Isolation Forest",
"ocsvm": "One-Class SVM",
}
# Colors/styles # Colors/styles
COLOR_BASELINES = { COLOR_BASELINES = {
@@ -147,12 +155,8 @@ def _select_rows(
return df.filter(pl.all_horizontal(exprs)) return df.filter(pl.all_horizontal(exprs))
def _auc_list(sub: pl.DataFrame) -> list[float]: def _auc_list(sub: pl.DataFrame, kind: str) -> list[float]:
return [x for x in sub.select("auc").to_series().to_list() if x is not None] return [x for x in sub.select(f"{kind}_auc").to_series().to_list() if x is not None]
def _ap_list(sub: pl.DataFrame) -> list[float]:
return [x for x in sub.select("ap").to_series().to_list() if x is not None]
def _plot_panel( def _plot_panel(
@@ -165,7 +169,7 @@ def _plot_panel(
kind: str, kind: str,
): ):
""" """
Plot one panel: DeepSAD (net_for_deepsad) with 3 regimes + baselines (from Efficient). Plot one panel: DeepSAD (net_for_deepsad) with 3 regimes + Baselines (from Efficient).
Legend entries include mean±CI of AUC/AP. Legend entries include mean±CI of AUC/AP.
""" """
ax.grid(True, alpha=0.3) ax.grid(True, alpha=0.3)
@@ -200,9 +204,9 @@ def _plot_panel(
continue continue
# Metric for legend # Metric for legend
metric_vals = _auc_list(sub_b) if kind == "roc" else _ap_list(sub_b) metric_vals = _auc_list(sub_b, kind)
m, ci = mean_ci(metric_vals) m, ci = mean_ci(metric_vals)
lab = f"{model} ({'AUC' if kind == 'roc' else 'AP'}={m:.3f}±{ci:.3f})" lab = f"{BASELINE_LABELS[model]}\n(AUC={m:.3f}±{ci:.3f})"
color = COLOR_BASELINES[model] color = COLOR_BASELINES[model]
h = ax.plot(grid, mean_y, lw=2, color=color, label=lab)[0] h = ax.plot(grid, mean_y, lw=2, color=color, label=lab)[0]
@@ -230,9 +234,9 @@ def _plot_panel(
if np.all(np.isnan(mean_y)): if np.all(np.isnan(mean_y)):
continue continue
metric_vals = _auc_list(sub_d) if kind == "roc" else _ap_list(sub_d) metric_vals = _auc_list(sub_d, kind)
m, ci = mean_ci(metric_vals) m, ci = mean_ci(metric_vals)
lab = f"DeepSAD {net_for_deepsad} semi {sn}/{sa} ({'AUC' if kind == 'roc' else 'AP'}={m:.3f}±{ci:.3f})" lab = f"DeepSAD {net_for_deepsad}{sn}/{sa}\n(AUC={m:.3f}±{ci:.3f})"
color = COLOR_REGIMES[regime] color = COLOR_REGIMES[regime]
ls = LINESTYLES[regime] ls = LINESTYLES[regime]
@@ -246,7 +250,7 @@ def _plot_panel(
ax.plot([0, 1], [0, 1], "k--", alpha=0.6, label="Chance") ax.plot([0, 1], [0, 1], "k--", alpha=0.6, label="Chance")
# Legend # Legend
ax.legend(loc="lower right", fontsize=9, frameon=True) ax.legend(loc="upper right", fontsize=9, frameon=True)
def make_figures_for_dim( def make_figures_for_dim(
@@ -254,9 +258,11 @@ def make_figures_for_dim(
): ):
# ROC: 2×1 # ROC: 2×1
fig_roc, axes = plt.subplots( fig_roc, axes = plt.subplots(
nrows=1, ncols=2, figsize=(14, 5), constrained_layout=True nrows=2, ncols=1, figsize=(7, 10), constrained_layout=True
)
fig_roc.suptitle(
f"ROC — {EVALS_LABELS[eval_type]} — Latent Dim.={latent_dim}", fontsize=14
) )
fig_roc.suptitle(f"ROC — {eval_type} — latent_dim={latent_dim}", fontsize=14)
_plot_panel( _plot_panel(
axes[0], axes[0],
@@ -266,7 +272,7 @@ def make_figures_for_dim(
latent_dim=latent_dim, latent_dim=latent_dim,
kind="roc", kind="roc",
) )
axes[0].set_title("DeepSAD (LeNet) + baselines") axes[0].set_title("DeepSAD (LeNet) + Baselines")
_plot_panel( _plot_panel(
axes[1], axes[1],
@@ -276,7 +282,7 @@ def make_figures_for_dim(
latent_dim=latent_dim, latent_dim=latent_dim,
kind="roc", kind="roc",
) )
axes[1].set_title("DeepSAD (Efficient) + baselines") axes[1].set_title("DeepSAD (Efficient) + Baselines")
out_roc = out_dir / f"roc_{latent_dim}_{eval_type}.png" out_roc = out_dir / f"roc_{latent_dim}_{eval_type}.png"
fig_roc.savefig(out_roc, dpi=150, bbox_inches="tight") fig_roc.savefig(out_roc, dpi=150, bbox_inches="tight")
@@ -284,9 +290,11 @@ def make_figures_for_dim(
# PRC: 2×1 # PRC: 2×1
fig_prc, axes = plt.subplots( fig_prc, axes = plt.subplots(
nrows=1, ncols=2, figsize=(14, 5), constrained_layout=True nrows=2, ncols=1, figsize=(7, 10), constrained_layout=True
)
fig_prc.suptitle(
f"PRC — {EVALS_LABELS[eval_type]} — Latent Dim.={latent_dim}", fontsize=14
) )
fig_prc.suptitle(f"PRC — {eval_type} — latent_dim={latent_dim}", fontsize=14)
_plot_panel( _plot_panel(
axes[0], axes[0],
@@ -296,7 +304,7 @@ def make_figures_for_dim(
latent_dim=latent_dim, latent_dim=latent_dim,
kind="prc", kind="prc",
) )
axes[0].set_title("DeepSAD (LeNet) + baselines") axes[0].set_title("DeepSAD (LeNet) + Baselines")
_plot_panel( _plot_panel(
axes[1], axes[1],
@@ -306,7 +314,7 @@ def make_figures_for_dim(
latent_dim=latent_dim, latent_dim=latent_dim,
kind="prc", kind="prc",
) )
axes[1].set_title("DeepSAD (Efficient) + baselines") axes[1].set_title("DeepSAD (Efficient) + Baselines")
out_prc = out_dir / f"prc_{latent_dim}_{eval_type}.png" out_prc = out_dir / f"prc_{latent_dim}_{eval_type}.png"
fig_prc.savefig(out_prc, dpi=150, bbox_inches="tight") fig_prc.savefig(out_prc, dpi=150, bbox_inches="tight")