wip
BIN	thesis/Main.pdf (binary file not shown)
329	thesis/Main.tex
@@ -943,7 +943,7 @@ Together, these components define the full experimental pipeline, from data prep
{codebase, github, dataloading, training, testing, baselines}
{codebase understood $\rightarrow$ how was it adapted}

-DeepSAD's PyTorch implementation includes standardized datasets such as MNIST and CIFAR-10 as well as datasets from \citetitle{odds}~\cite{odds}, together with suitable network architectures for the corresponding data types. The framework trains and tests DeepSAD and a number of baseline algorithms, namely SSAD, OCSVM, Isolation Forest, KDE, and SemiDGM, on the loaded data and evaluates their performance by computing the ROC area under the curve for all algorithms. We adapted this implementation, originally developed for Python 3.7, to work with Python 3.12, changed or added functionality for loading our chosen dataset, added DeepSAD models that work with the lidar-projection data type, added further evaluation methods, and implemented an inference module.
+DeepSAD's PyTorch implementation includes standardized datasets such as MNIST and CIFAR-10 as well as datasets from \citetitle{odds}~\cite{odds}, together with suitable network architectures for the corresponding data types. The framework trains and tests DeepSAD and a number of baseline algorithms, namely SSAD, OCSVM, Isolation Forest, KDE, and SemiDGM, on the loaded data and evaluates their performance by computing the Average Precision as well as the Precision-Recall curve for all algorithms. We adapted this implementation, originally developed for Python 3.7, to work with Python 3.12, changed or added functionality for loading our chosen dataset, added DeepSAD models that work with the lidar-projection data type, added further evaluation methods, and implemented an inference module.
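For reference, a minimal sketch of such a precision-recall evaluation as it could be done with scikit-learn; this is illustrative only and not the adapted framework's actual evaluation code (it assumes per-sample anomaly scores where higher means more anomalous):

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def evaluate_fold(labels: np.ndarray, scores: np.ndarray):
    """Return Average Precision and the PR curve for one evaluation fold.

    labels: 1 = anomalous, 0 = normal; scores: higher = more anomalous.
    """
    ap = average_precision_score(labels, scores)
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    return ap, (precision, recall, thresholds)

# Toy usage with made-up scores: the two anomalous samples receive the highest
# scores, so this toy fold is perfectly separable and AP = 1.0.
labels = np.array([0, 0, 0, 1, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.70])
ap, _ = evaluate_fold(labels, scores)
print(f"AP = {ap:.3f}")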
\newsubsubsectionNoTOC{SubTER dataset preprocessing, train/test splits, and label strategy}
@@ -1434,7 +1434,6 @@ Inference latency per sample is presented in Table~\ref{tab:inference_latency_co

Together, these results provide a comprehensive overview of the computational requirements of our experimental setup. They show that while our deep semi-supervised approach is significantly more demanding during training than classical baselines, it remains highly efficient at inference, which is the decisive factor for deployment in time-critical domains such as rescue robotics.

-\newchapter{results_discussion}{Results and Discussion}

\threadtodo
{Introduce the structure and scope of the results chapter}
@@ -1442,62 +1441,18 @@ Together, these results provide a comprehensive overview of the computational re
{State that we will first analyze autoencoder results, then anomaly detection performance, and finally inference experiments}
{Clear roadmap $\rightarrow$ prepares reader for detailed sections}

%The results from the experiments described in chapter~\ref{chp:experimental_setup} will be presented in this chapter as follows: the pretraining results from training the two autoencoder architectures for multiple latent space dimensionalities will be shown and discussed in section~\ref{sec:results_pretraining}, the results from training DeepSAD and comparing it with the baseline algorithms will be presented and discussed in section~\ref{sec:results_deepsad}, and lastly we will present some plots from running inference on experiments which were held out during training, to improve the reader's grasp of how the algorithms would perform and may be used in real-world applications.
|
% \newchapter{results_discussion}{Results and Discussion}
|
||||||
|
%
|
||||||
The experiments described in Chapter~\ref{chp:experimental_setup} are presented in this chapter. We begin in Section~\ref{sec:results_pretraining} with the pretraining stage, where the two autoencoder architectures were trained across multiple latent space dimensionalities. These results provide insight into the representational capacity of each architecture. In Section~\ref{sec:results_deepsad}, we turn to the main experiments: training DeepSAD models and benchmarking them against baseline algorithms (Isolation Forest and One-Class SVM). Finally, in Section~\ref{sec:results_inference}, we present inference results on experiments that were held out during training. These plots illustrate how the algorithms behave when applied sequentially to unseen traversals, offering a more practical perspective on their potential for real-world rescue robotics applications.
|
% The experiments described in Chapter~\ref{chp:experimental_setup} are presented in this chapter. We begin in Section~\ref{sec:results_pretraining} with the pretraining stage, where the two autoencoder architectures were trained across multiple latent space dimensionalities. These results provide insight into the representational capacity of each architecture. In Section~\ref{sec:results_deepsad}, we turn to the main experiments: training DeepSAD models and benchmarking them against baseline algorithms (Isolation Forest and One-Class SVM). Finally, in Section~\ref{sec:results_inference}, we present inference results on experiments that were held out during training. These plots illustrate how the algorithms behave when applied sequentially to unseen traversals, offering a more practical perspective on their potential for real-world rescue robotics applications.
|
||||||
|
%
|
||||||
% --- Section: Autoencoder Pretraining Results ---
|
|
||||||
\newsection{results_pretraining}{Autoencoder Pretraining Results}
|
|
||||||
|
|
||||||
The results of pretraining the two autoencoder architectures are summarized in Table~\ref{tab:pretraining_loss}. Reconstruction performance is reported as mean squared error (MSE), with trends visualized in Figure~\ref{fig:ae_loss_overall}. The results show that the modified Efficient architecture consistently outperforms the LeNet-inspired baseline across all latent space dimensionalities. The improvement is most pronounced at lower-dimensional bottlenecks (e.g., 32 or 64 dimensions) but remains observable up to 1024 dimensions, although the gap narrows.
|
|
||||||
|
|
||||||
\begin{table}[t]
|
|
||||||
\centering
|
|
||||||
\label{tab:pretraining_loss}
|
|
||||||
\begin{tabularx}{\textwidth}{c*{2}{Y}|*{2}{Y}}
|
|
||||||
\toprule
|
|
||||||
& \multicolumn{2}{c}{Overall loss} & \multicolumn{2}{c}{Anomaly loss} \\
|
|
||||||
\cmidrule(lr){2-3} \cmidrule(lr){4-5}
|
|
||||||
Latent Dim. & LeNet & Efficient & LeNet & Efficient \\
|
|
||||||
\midrule
|
|
||||||
32 & 0.0223 & \textbf{0.0136} & 0.0701 & \textbf{0.0554} \\
|
|
||||||
64 & 0.0168 & \textbf{0.0117} & 0.0613 & \textbf{0.0518} \\
|
|
||||||
128 & 0.0140 & \textbf{0.0110} & 0.0564 & \textbf{0.0506} \\
|
|
||||||
256 & 0.0121 & \textbf{0.0106} & 0.0529 & \textbf{0.0498} \\
|
|
||||||
512 & 0.0112 & \textbf{0.0103} & 0.0514 & \textbf{0.0491} \\
|
|
||||||
768 & 0.0109 & \textbf{0.0102} & 0.0505 & \textbf{0.0490} \\
|
|
||||||
1024 & 0.0106 & \textbf{0.0101} & 0.0500 & \textbf{0.0489} \\
|
|
||||||
\bottomrule
|
|
||||||
\end{tabularx}
|
|
||||||
\caption{Autoencoder pre-training MSE losses across latent dimensions. Left: overall loss; Right: anomaly-only loss. Cells show means across folds (no $\pm$std). Maximum observed standard deviation across all cells (not shown): 0.0067.}
|
|
||||||
\end{table}
|
|
||||||
|
|
||||||
\fig{ae_loss_overall}{figures/ae_elbow_test_loss_overall.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures.}
|
|
||||||
|
|
||||||
Because overall reconstruction loss might obscure how well encoders represent anomalous samples, we additionally evaluate reconstruction errors only on degraded samples from hand-labeled smoke segments (Figure~\ref{fig:ae_loss_degraded}). As expected, reconstruction losses are about 0.05 higher on these challenging samples than in the overall evaluation. However, the relative advantage of the Efficient architecture remains, suggesting that its improvements extend to anomalous inputs as well.
|
|
||||||
|
|
||||||
\fig{ae_loss_degraded}{figures/ae_elbow_test_loss_anomaly.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures, evaluated only on degraded data from hand-labeled smoke experiments.}
|
|
||||||
|
|
||||||
% It is important to note that absolute MSE values are difficult to interpret in isolation, as their magnitude depends on the data scaling and chosen reconstruction target. More detailed evaluations in terms of error in meters, relative error, or distance-binned metrics could provide richer insights into encoder quality. However, since the downstream anomaly detection results (Section~\ref{sec:results_deepsad}) do not reveal significant differences between pretraining regimes, such detailed pretraining evaluation was not pursued here. Instead, these metrics are left as promising directions for future work, particularly if pretraining were to play a larger role in final detection performance.
|
|
||||||
|
|
||||||
%In the following, we therefore focus on the main question of whether improved reconstruction performance during pretraining translates into measurable benefits for anomaly detection in DeepSAD.
|
|
||||||
It is important to note that absolute MSE values are difficult to interpret in isolation, as their magnitude depends on the data scaling and chosen reconstruction target. More detailed evaluations in terms of error in meters, relative error, or distance-binned metrics could provide richer insights into encoder quality. However, since the downstream anomaly detection results (Section~\ref{sec:results_deepsad}) do not reveal significant differences between pretraining regimes, such detailed pretraining evaluation was not pursued here. Instead, we restrict ourselves to reporting the reconstruction trends and leave more in-depth pretraining analysis as future work.
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
% % --- Section: Autoencoder Pretraining Results ---
|
% % --- Section: Autoencoder Pretraining Results ---
|
||||||
% \newsection{results_pretraining}{Autoencoder Pretraining Results}
|
% \newsection{results_pretraining}{Autoencoder Pretraining Results}
|
||||||
%
|
%
|
||||||
% \threadtodo
|
% The results of pretraining the two autoencoder architectures are summarized in Table~\ref{tab:pretraining_loss}. Reconstruction performance is reported as mean squared error (MSE), with trends visualized in Figure~\ref{fig:ae_loss_overall}. The results show that the modified Efficient architecture consistently outperforms the LeNet-inspired baseline across all latent space dimensionalities. The improvement is most pronounced at lower-dimensional bottlenecks (e.g., 32 or 64 dimensions) but remains observable up to 1024 dimensions, although the gap narrows.
|
||||||
% {Present autoencoder reconstruction performance across architectures and latent sizes}
|
|
||||||
% {Important because latent size and architecture determine representation quality, which may affect DeepSAD later}
|
|
||||||
% {Show reconstruction losses over latent dimensions, compare Efficient vs LeNet}
|
|
||||||
% {Understanding representation capacity $\rightarrow$ motivates analyzing if AE results transfer to DeepSAD}
|
|
||||||
%
|
|
||||||
% The results from pretraining the two autoencoder architectures are shown as MSE-Loss in table~\ref{tab:pretraining_loss} and demonstrate that the modifications to the original LeNet-inspired architecture improved the reconstruction performance as can be seen in figure~\ref{fig:ae_loss_overall} which shows that especially at lower latent space dimensionalities the modified architecture results in a highly improved loss when compared to the LeNet-inspired one and that the improvement persists all the way up to a 1024-dimensional latent space, though the difference becomes less pronounced.
|
|
||||||
%
|
%
|
||||||
% \begin{table}[t]
|
% \begin{table}[t]
|
||||||
% \centering
|
% \centering
|
||||||
|
% \caption{Autoencoder pre-training MSE losses across latent dimensions. Left: overall loss; Right: anomaly-only loss. Cells show means across folds (no $\pm$std). Maximum observed standard deviation across all cells (not shown): 0.0067.}
|
||||||
% \label{tab:pretraining_loss}
|
% \label{tab:pretraining_loss}
|
||||||
% \begin{tabularx}{\textwidth}{c*{2}{Y}|*{2}{Y}}
|
% \begin{tabularx}{\textwidth}{c*{2}{Y}|*{2}{Y}}
|
||||||
% \toprule
|
% \toprule
|
||||||
@@ -1514,36 +1469,121 @@ It is important to note that absolute MSE values are difficult to interpret in i
|
|||||||
% 1024 & 0.0106 & \textbf{0.0101} & 0.0500 & \textbf{0.0489} \\
|
% 1024 & 0.0106 & \textbf{0.0101} & 0.0500 & \textbf{0.0489} \\
|
||||||
% \bottomrule
|
% \bottomrule
|
||||||
% \end{tabularx}
|
% \end{tabularx}
|
||||||
% \caption{Autoencoder pre-training MSE losses across latent dimensions. Left: overall loss; Right: anomaly-only loss. Cells show means across folds (no $\pm$std). Maximum observed standard deviation across all cells (not shown): 0.0067.}
|
|
||||||
% \end{table}
|
% \end{table}
|
||||||
%
|
%
|
||||||
% \fig{ae_loss_overall}{figures/ae_elbow_test_loss_overall.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures.}
|
% \fig{ae_loss_overall}{figures/ae_elbow_test_loss_overall.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures.}
|
||||||
%
|
%
|
||||||
% \threadtodo
|
% Because overall reconstruction loss might obscure how well encoders represent anomalous samples, we additionally evaluate reconstruction errors only on degraded samples from hand-labeled smoke segments (Figure~\ref{fig:ae_loss_degraded}). As expected, reconstruction losses are about 0.05 higher on these challenging samples than in the overall evaluation. However, the relative advantage of the Efficient architecture remains, suggesting that its improvements extend to anomalous inputs as well.
|
||||||
% {Analyze anomaly reconstruction performance specifically}
|
|
||||||
% {Critical because degraded inputs may reconstruct differently, showing whether networks capture degradation structure}
|
|
||||||
% {Show reconstruction losses on anomalous-only data subset}
|
|
||||||
% {This analysis $\rightarrow$ motivates testing whether better AE reconstructions imply better anomaly detection}
|
|
||||||
%
|
%
|
||||||
% Since it could be argued, that the overall reconstruction loss is not a good metric for evaluating the encoders' capability of extracting the most important information of anomalous data, we also plot the reconstruction loss of the two architectures for only anomalous samples from the hand-labeled sections of the experiments containing artifical smoke in figure~\ref{fig:ae_loss_degraded}. These evaluations show that while the loss of degraded sample reconstruction is overall roughly 0.05 higher than the one from the overall evaluation (which included normal and anomalous samples for reconstruction evaluation) the same improvement per latent space dimesionality between the LeNet-inspired and the Efficient encoder can be observed for anomalous samples, which would indicate that the modified architecture is still an improvement when only looking at its degraded sample reconstruction performance.
|
% \fig{ae_loss_degraded}{figures/ae_elbow_test_loss_anomaly.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures, evaluated only on degraded data from hand-labeled smoke experiments.}
|
||||||
%
|
%
|
||||||
% \fig{ae_loss_degraded}{figures/ae_elbow_test_loss_anomaly.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures, evaluated only on degraded data from hand-labeled sections of experiments with artificial smoke.}
|
% Since only per-sample reconstruction losses were retained during pretraining, we report results in reciprocal-range MSE space. While more interpretable metrics in meters and distance-binned analyses would be desirable, the downstream anomaly detection performance did not differ starkly between encoders, so we did not pursue this additional evaluation. Future work could extend the pretraining analysis with physically interpretable metrics.
|
||||||
|
% \todo[inline]{could rerun AE inference, need custom dataloader to reload same eval samples and then save additional data which allows for RMSE in meters, etc}
|
||||||
%
|
%
|
||||||
% The reported MSE-loss is hard to judge in isolation but the overall better performance of the new architecture allows us to evaluate the importance of autoencoder performance for the anomaly detection performance of deepsad overall. To gauge the encoder performance' impact on the overall DeepSAD performance we next present the results of training DeepSAD not on a single but on all of the same latent space dimeionsionalities which we explored during the pre-training evaluation, which allows us to check if see similar differences between the architectures for the anomaly detection performance.
|
% % --- Section: DeepSAD Training Results ---
|
||||||
|
% \newsection{results_deepsad}{DeepSAD Detection Performance}
|
||||||
%
|
%
|
||||||
% --- Section: DeepSAD Training Results ---
|
% Due to the problems regarding ground truth of the training and evaluation data, it is important to discuss the evaluation results in detail, since there are quite a few subtleties that affect the results interpretation, which may not be apparent at a glance. Nonethless, we give an overview of the results in table~\ref{tab:results_ap} which depicts the average precision from all training runs, namely the two DeepSAD architectures compared to the two baselines for all latent dimensionalities and all labeling regimes respectively. In addition to the experiment-based evaluation results we also report the results from evaluations which only used the hand-labled anomalous data samples, to remove samples from degraded experiments as evaluation targets, that were blatanly mislabeled in the experiment-based evaluation or at least uncertain in their normal/anomalous classification.
|
||||||
\newsection{results_deepsad}{DeepSAD Detection Performance}
|
%
|
||||||
|
% \begin{table}[t]
|
||||||
|
% \centering
|
||||||
|
% \caption{AP means across 5 folds for both evaluations, grouped by labeling regime. Maximum observed standard deviation across all cells (not shown in table): 0.282.}
|
||||||
|
% \label{tab:results_ap}
|
||||||
|
% \begin{tabularx}{\textwidth}{c*{4}{Y}|*{4}{Y}}
|
||||||
|
% \toprule
|
||||||
|
% & \multicolumn{4}{c}{Experiment-based eval.} & \multicolumn{4}{c}{Handlabeled eval.} \\
|
||||||
|
% \cmidrule(lr){2-5} \cmidrule(lr){6-9}
|
||||||
|
% Latent Dim. & \rotheader{DeepSAD \\(LeNet)} & \rotheader{DeepSAD\\(Efficient)} & \rotheader{IsoForest} & \rotheader{OC-SVM} & \rotheader{DeepSAD\\(LeNet)} & \rotheader{DeepSAD\\(Efficient)} & \rotheader{IsoForest} & \rotheader{OC-SVM} \\
|
||||||
|
% \midrule
|
||||||
|
% \multicolumn{9}{l}{\textbf{Labeling regime: }\(\mathbf{0/0}\) \textit{(normal/anomalous samples labeled)}} \\
|
||||||
|
% \addlinespace[2pt]
|
||||||
|
% 32 & \textbf{0.664} & 0.650 & 0.217 & 0.315 & \textbf{1.000} & \textbf{1.000} & 0.241 & 0.426 \\
|
||||||
|
% 64 & 0.635 & \textbf{0.643} & 0.215 & 0.371 & \textbf{1.000} & \textbf{1.000} & 0.233 & 0.531 \\
|
||||||
|
% 128 & \textbf{0.642} & \textbf{0.642} & 0.218 & 0.486 & \textbf{1.000} & \textbf{1.000} & 0.241 & 0.729 \\
|
||||||
|
% 256 & 0.615 & \textbf{0.631} & 0.214 & 0.452 & 0.999 & \textbf{1.000} & 0.236 & 0.664 \\
|
||||||
|
% 512 & 0.613 & \textbf{0.635} & 0.216 & 0.397 & \textbf{1.000} & \textbf{1.000} & 0.241 & 0.550 \\
|
||||||
|
% 768 & 0.609 & \textbf{0.617} & 0.219 & 0.439 & 0.997 & \textbf{1.000} & 0.244 & 0.624 \\
|
||||||
|
% 1024 & 0.607 & \textbf{0.612} & 0.215 & 0.394 & 0.997 & \textbf{1.000} & 0.235 & 0.529 \\
|
||||||
|
% \midrule
|
||||||
|
% \multicolumn{9}{l}{\textbf{Labeling regime: }\(\mathbf{50/10}\) \textit{(normal/anomalous samples labeled)}} \\
|
||||||
|
% \addlinespace[2pt]
|
||||||
|
% 32 & 0.569 & \textbf{0.582} & 0.217 & 0.315 & 0.933 & \textbf{0.976} & 0.241 & 0.426 \\
|
||||||
|
% 64 & 0.590 & \textbf{0.592} & 0.215 & 0.371 & 0.970 & \textbf{0.986} & 0.233 & 0.531 \\
|
||||||
|
% 128 & 0.566 & \textbf{0.588} & 0.218 & 0.486 & 0.926 & \textbf{0.983} & 0.241 & 0.729 \\
|
||||||
|
% 256 & \textbf{0.598} & 0.587 & 0.214 & 0.452 & 0.978 & \textbf{0.984} & 0.236 & 0.664 \\
|
||||||
|
% 512 & 0.550 & \textbf{0.587} & 0.216 & 0.397 & 0.863 & \textbf{0.978} & 0.241 & 0.550 \\
|
||||||
|
% 768 & \textbf{0.596} & 0.577 & 0.219 & 0.439 & \textbf{0.992} & 0.974 & 0.244 & 0.624 \\
|
||||||
|
% 1024 & \textbf{0.601} & 0.568 & 0.215 & 0.394 & \textbf{0.990} & 0.966 & 0.235 & 0.529 \\
|
||||||
|
% \midrule
|
||||||
|
% \multicolumn{9}{l}{\textbf{Labeling regime: }\(\mathbf{500/100}\) \textit{(normal/anomalous samples labeled)}} \\
|
||||||
|
% \addlinespace[2pt]
|
||||||
|
% 32 & \textbf{0.625} & 0.621 & 0.217 & 0.315 & \textbf{0.999} & 0.997 & 0.241 & 0.426 \\
|
||||||
|
% 64 & 0.611 & \textbf{0.621} & 0.215 & 0.371 & 0.996 & \textbf{0.998} & 0.233 & 0.531 \\
|
||||||
|
% 128 & 0.607 & \textbf{0.615} & 0.218 & 0.486 & 0.996 & \textbf{0.998} & 0.241 & 0.729 \\
|
||||||
|
% 256 & 0.604 & \textbf{0.612} & 0.214 & 0.452 & 0.984 & \textbf{0.998} & 0.236 & 0.664 \\
|
||||||
|
% 512 & 0.578 & \textbf{0.608} & 0.216 & 0.397 & 0.916 & \textbf{0.998} & 0.241 & 0.550 \\
|
||||||
|
% 768 & 0.597 & \textbf{0.598} & 0.219 & 0.439 & 0.994 & \textbf{0.995} & 0.244 & 0.624 \\
|
||||||
|
% 1024 & \textbf{0.601} & 0.591 & 0.215 & 0.394 & 0.990 & \textbf{0.993} & 0.235 & 0.529 \\
|
||||||
|
% \bottomrule
|
||||||
|
% \end{tabularx}
|
||||||
|
% \end{table}
|
||||||
|
%
|
||||||
|
\newchapter{results_discussion}{Results and Discussion}
|
||||||
|
|
||||||
\threadtodo
|
The experiments described in Chapter~\ref{chp:experimental_setup} are presented in this chapter. We begin in Section~\ref{sec:results_pretraining} with the pretraining stage, where the two autoencoder architectures were trained across multiple latent space dimensionalities. These results provide insight into the representational capacity of each architecture. In Section~\ref{sec:results_deepsad}, we turn to the main experiments: training DeepSAD models and benchmarking them against baseline algorithms (Isolation Forest and One-Class SVM). Finally, in Section~\ref{sec:results_inference}, we present inference results on experiments that were held out during training. These plots illustrate how the algorithms behave when applied sequentially to unseen traversals, offering a more practical perspective on their potential for real-world rescue robotics applications.
|
||||||
{Introduce DeepSAD anomaly detection results compared to baselines}
|
|
||||||
{Core part of evaluation: shows if DeepSAD provides benefit beyond standard methods}
|
% --- Section: Autoencoder Pretraining Results ---
|
||||||
{Explain ROC/PRC as evaluation metrics, show curves for all latent sizes, unsupervised case}
|
\newsection{results_pretraining}{Autoencoder Pretraining Results}
|
||||||
{Results here $\rightarrow$ baseline comparison and semi-supervised effects}
|
|
||||||
|
The results of pretraining the two autoencoder architectures are summarized in Table~\ref{tab:pretraining_loss}. Reconstruction performance is reported as mean squared error (MSE), with trends visualized in Figure~\ref{fig:ae_loss_overall}. The results show that the modified Efficient architecture consistently outperforms the LeNet-inspired baseline across all latent space dimensionalities. The improvement is most pronounced at lower-dimensional bottlenecks (e.g., 32 or 64 dimensions) but remains observable up to 1024 dimensions, although the gap narrows.
|
||||||
|
|
||||||
\begin{table}[t]
|
\begin{table}[t]
|
||||||
\centering
|
\centering
|
||||||
\setlength{\tabcolsep}{4pt}
|
\caption{Autoencoder pre-training MSE losses across latent dimensions. Left: overall loss; Right: anomaly-only loss. Cells show means across folds (no $\pm$std). Maximum observed standard deviation across all cells (not shown): 0.0067.}
|
||||||
\renewcommand{\arraystretch}{1.2}
|
\label{tab:pretraining_loss}
|
||||||
|
\begin{tabularx}{\textwidth}{c*{2}{Y}|*{2}{Y}}
|
||||||
|
\toprule
|
||||||
|
& \multicolumn{2}{c}{Overall loss} & \multicolumn{2}{c}{Anomaly loss} \\
|
||||||
|
\cmidrule(lr){2-3} \cmidrule(lr){4-5}
|
||||||
|
Latent Dim. & LeNet & Efficient & LeNet & Efficient \\
|
||||||
|
\midrule
|
||||||
|
32 & 0.0223 & \textbf{0.0136} & 0.0701 & \textbf{0.0554} \\
|
||||||
|
64 & 0.0168 & \textbf{0.0117} & 0.0613 & \textbf{0.0518} \\
|
||||||
|
128 & 0.0140 & \textbf{0.0110} & 0.0564 & \textbf{0.0506} \\
|
||||||
|
256 & 0.0121 & \textbf{0.0106} & 0.0529 & \textbf{0.0498} \\
|
||||||
|
512 & 0.0112 & \textbf{0.0103} & 0.0514 & \textbf{0.0491} \\
|
||||||
|
768 & 0.0109 & \textbf{0.0102} & 0.0505 & \textbf{0.0490} \\
|
||||||
|
1024 & 0.0106 & \textbf{0.0101} & 0.0500 & \textbf{0.0489} \\
|
||||||
|
\bottomrule
|
||||||
|
\end{tabularx}
|
||||||
|
\end{table}
|
||||||
|
|
||||||
|
\fig{ae_loss_overall}{figures/ae_elbow_test_loss_overall.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures.}
|
||||||
|
|
||||||
|
Because overall reconstruction loss might obscure how well encoders represent anomalous samples, we additionally evaluate reconstruction errors only on degraded samples from hand-labeled smoke segments (Figure~\ref{fig:ae_loss_degraded}). As expected, reconstruction losses are about 0.05 higher on these challenging samples than in the overall evaluation. However, the relative advantage of the Efficient architecture remains, suggesting that its improvements extend to anomalous inputs as well.
|
||||||
|
|
||||||
|
\fig{ae_loss_degraded}{figures/ae_elbow_test_loss_anomaly.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures, evaluated only on degraded data from hand-labeled smoke experiments.}
|
||||||
|
|
||||||
|
Since only per-sample reconstruction losses were retained during pretraining, we report results in reciprocal-range MSE space. While more interpretable metrics in meters and distance-binned analyses would be desirable, the downstream anomaly detection performance did not differ starkly between encoders, so we did not pursue this additional evaluation. Future work could extend the pretraining analysis with physically interpretable metrics.
|
||||||
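For a rough physical reading of these reciprocal-range errors (assuming the projections store inverse range $x = 1/d$ up to the normalization constant used in preprocessing, and noting that such a constant only rescales the relation), a reconstruction error $\Delta x$ corresponds to a metric range error of approximately
\[
|\Delta d| \;\approx\; \frac{|\Delta x|}{x^{2}} \;=\; d^{2}\,|\Delta x|,
\]
so identical MSE values imply considerably larger metric errors at long range than close to the sensor.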
|
|
||||||
|
% --- Section: DeepSAD Training Results ---
|
||||||
|
\newsection{results_deepsad}{DeepSAD Detection Performance}
|
||||||
|
|
||||||
|
Due to the challenges of ground truth quality, evaluation results must be interpreted with care. Two complementary evaluation schemes were introduced earlier:
\begin{itemize}
\item \textbf{Experiment-based labels}, which provide an objective way to assign anomaly labels to entire degraded runs. However, this inevitably marks many near-normal frames at the start and end of these runs as anomalous. These knowingly ``false'' labels cap the achievable average precision, since even a perfect detector will rank such near-normal frames low; a short numerical illustration follows this list.
\item \textbf{Hand-labeled labels}, which include only clearly degraded frames. This removes the mislabeled intervals and allows nearly perfect classification. While this evaluation is useful to show that performance losses in the experiment-based scheme stem from label noise, it would be uninformative in isolation because the targets become too easily distinguishable.
\end{itemize}
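The following toy simulation (hypothetical numbers, not our data) illustrates this cap: if one fifth of the frames labeled anomalous in a degraded run are in fact near-normal, even an oracle detector that assigns high scores only to the truly degraded frames cannot reach an AP of 1.0 under experiment-based labels.

import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Hypothetical degraded run: 100 frames labeled anomalous, of which the first and
# last frames (20 in total) are actually near-normal; plus 400 frames from clean runs.
labels = np.concatenate([np.ones(100), np.zeros(400)])

# An "oracle" detector: high scores only for the 80 truly degraded frames,
# low scores for the 20 mislabeled frames and for all clean frames.
scores = np.concatenate([
    rng.uniform(0.8, 1.0, 80),   # truly degraded, labeled anomalous
    rng.uniform(0.0, 0.2, 20),   # near-normal, but labeled anomalous
    rng.uniform(0.0, 0.2, 400),  # clean frames, labeled normal
])

print(average_precision_score(labels, scores))  # noticeably below 1.0 despite an oracle detector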

\subsection{Overall results}
Table~\ref{tab:results_ap} gives an overview of average precision (AP) across all latent dimensions, labeling regimes, and methods. Under experiment-based labels, both DeepSAD variants consistently outperform the baselines, achieving AP values around 0.60–0.66 compared to 0.21 for IsoForest and 0.31–0.49 for OC-SVM. This demonstrates that even with noisy evaluation data, DeepSAD provides substantially stronger discriminative ability. Under hand-labeled evaluation, DeepSAD reaches nearly perfect AP across all settings, while the baselines remain much lower (IsoForest around 0.23–0.24, OC-SVM between 0.4 and 0.7).
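For reference, the AP values reported here summarize the precision-recall curve as a threshold-wise weighted mean of precisions (the step-wise definition used, e.g., by scikit-learn's \texttt{average\_precision\_score}):
\[
\mathrm{AP} = \sum_{n} \left(R_{n} - R_{n-1}\right) P_{n},
\]
where $P_{n}$ and $R_{n}$ denote precision and recall at the $n$-th score threshold.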
|
|
||||||
|
The contrast between the two evaluation schemes indicates, on the one hand, that the reduced AP in the experiment-based evaluation is largely due to mislabeled or ambiguous samples at the start and end of degraded runs. On the other hand, the perfect classification performance in the hand-labeled evaluation also reflects that only clearly degraded samples remain, meaning that borderline cases were removed entirely. This makes it impossible to assess how DeepSAD handles frames with weak or gradual degradation: the results show that it excels at separating clearly normal from clearly degraded samples, but they do not tell us whether it can reliably classify in-between cases where subjective judgment would otherwise be required. Consequently, both evaluation schemes are informative in complementary ways: experiment-based labels allow relative comparison under noisy, realistic conditions, while hand-labeled labels demonstrate the upper bound of performance when ambiguous samples are excluded.
|
||||||
|
|
||||||
|
\begin{table}[t]
|
||||||
|
\centering
|
||||||
|
\caption{AP means across 5 folds for both evaluations, grouped by labeling regime. Mean observed standard deviation per method: DeepSAD (LeNet) 0.015; DeepSAD (Efficient) 0.012; IsoForest 0.011; OC-SVM 0.091.}
|
||||||
|
\label{tab:results_ap}
|
||||||
\begin{tabularx}{\textwidth}{c*{4}{Y}|*{4}{Y}}
\toprule
& \multicolumn{4}{c}{Experiment-based eval.} & \multicolumn{4}{c}{Hand-labeled eval.} \\
@@ -1552,104 +1592,93 @@ It is important to note that absolute MSE values are difficult to interpret in i
|
|||||||
\midrule
|
\midrule
|
||||||
\multicolumn{9}{l}{\textbf{Labeling regime: }\(\mathbf{0/0}\) \textit{(normal/anomalous samples labeled)}} \\
|
\multicolumn{9}{l}{\textbf{Labeling regime: }\(\mathbf{0/0}\) \textit{(normal/anomalous samples labeled)}} \\
|
||||||
\addlinespace[2pt]
|
\addlinespace[2pt]
|
||||||
32 & \textbf{0.801} & 0.791 & 0.717 & 0.752 & \textbf{1.000} & \textbf{1.000} & 0.921 & 0.917 \\
|
32 & \textbf{0.664} & 0.650 & 0.217 & 0.315 & \textbf{1.000} & \textbf{1.000} & 0.241 & 0.426 \\
|
||||||
64 & 0.776 & \textbf{0.786} & 0.718 & 0.742 & \textbf{1.000} & \textbf{1.000} & 0.917 & 0.931 \\
|
64 & 0.635 & \textbf{0.643} & 0.215 & 0.371 & \textbf{1.000} & \textbf{1.000} & 0.233 & 0.531 \\
|
||||||
128 & \textbf{0.784} & \textbf{0.784} & 0.719 & 0.775 & \textbf{1.000} & \textbf{1.000} & 0.921 & 0.967 \\
|
128 & \textbf{0.642} & \textbf{0.642} & 0.218 & 0.486 & \textbf{1.000} & \textbf{1.000} & 0.241 & 0.729 \\
|
||||||
256 & 0.762 & 0.772 & 0.712 & \textbf{0.793} & \textbf{1.000} & \textbf{1.000} & 0.918 & 0.966 \\
|
256 & 0.615 & \textbf{0.631} & 0.214 & 0.452 & 0.999 & \textbf{1.000} & 0.236 & 0.664 \\
|
||||||
512 & 0.759 & 0.784 & 0.712 & \textbf{0.804} & \textbf{1.000} & \textbf{1.000} & 0.920 & 0.949 \\
|
512 & 0.613 & \textbf{0.635} & 0.216 & 0.397 & \textbf{1.000} & \textbf{1.000} & 0.241 & 0.550 \\
|
||||||
768 & 0.749 & 0.754 & 0.713 & \textbf{0.812} & \textbf{1.000} & \textbf{1.000} & 0.923 & 0.960 \\
|
768 & 0.609 & \textbf{0.617} & 0.219 & 0.439 & 0.997 & \textbf{1.000} & 0.244 & 0.624 \\
|
||||||
1024 & 0.757 & 0.750 & 0.716 & \textbf{0.821} & \textbf{1.000} & \textbf{1.000} & 0.919 & 0.956 \\
|
1024 & 0.607 & \textbf{0.612} & 0.215 & 0.394 & 0.997 & \textbf{1.000} & 0.235 & 0.529 \\
|
||||||
\midrule
|
\midrule
|
||||||
\multicolumn{9}{l}{\textbf{Labeling regime: }\(\mathbf{50/10}\) \textit{(normal/anomalous samples labeled)}} \\
|
\multicolumn{9}{l}{\textbf{Labeling regime: }\(\mathbf{50/10}\) \textit{(normal/anomalous samples labeled)}} \\
|
||||||
\addlinespace[2pt]
|
\addlinespace[2pt]
|
||||||
32 & 0.741 & 0.747 & 0.717 & \textbf{0.752} & 0.990 & \textbf{0.998} & 0.921 & 0.917 \\
|
32 & 0.569 & \textbf{0.582} & 0.217 & 0.315 & 0.933 & \textbf{0.976} & 0.241 & 0.426 \\
|
||||||
64 & \textbf{0.757} & 0.750 & 0.718 & 0.742 & 0.998 & \textbf{0.999} & 0.917 & 0.931 \\
|
64 & 0.590 & \textbf{0.592} & 0.215 & 0.371 & 0.970 & \textbf{0.986} & 0.233 & 0.531 \\
|
||||||
128 & 0.746 & 0.751 & 0.719 & \textbf{0.775} & 0.991 & \textbf{0.999} & 0.921 & 0.967 \\
|
128 & 0.566 & \textbf{0.588} & 0.218 & 0.486 & 0.926 & \textbf{0.983} & 0.241 & 0.729 \\
|
||||||
256 & 0.746 & 0.750 & 0.712 & \textbf{0.793} & \textbf{0.999} & \textbf{0.999} & 0.918 & 0.966 \\
|
256 & \textbf{0.598} & 0.587 & 0.214 & 0.452 & 0.978 & \textbf{0.984} & 0.236 & 0.664 \\
|
||||||
512 & 0.760 & 0.763 & 0.712 & \textbf{0.804} & 0.972 & \textbf{0.999} & 0.920 & 0.949 \\
|
512 & 0.550 & \textbf{0.587} & 0.216 & 0.397 & 0.863 & \textbf{0.978} & 0.241 & 0.550 \\
|
||||||
768 & 0.749 & 0.747 & 0.713 & \textbf{0.812} & \textbf{1.000} & 0.998 & 0.923 & 0.960 \\
|
768 & \textbf{0.596} & 0.577 & 0.219 & 0.439 & \textbf{0.992} & 0.974 & 0.244 & 0.624 \\
|
||||||
1024 & 0.748 & 0.732 & 0.716 & \textbf{0.821} & \textbf{0.999} & 0.998 & 0.919 & 0.956 \\
|
1024 & \textbf{0.601} & 0.568 & 0.215 & 0.394 & \textbf{0.990} & 0.966 & 0.235 & 0.529 \\
|
||||||
\midrule
|
\midrule
|
||||||
\multicolumn{9}{l}{\textbf{Labeling regime: }\(\mathbf{500/100}\) \textit{(normal/anomalous samples labeled)}} \\
|
\multicolumn{9}{l}{\textbf{Labeling regime: }\(\mathbf{500/100}\) \textit{(normal/anomalous samples labeled)}} \\
|
||||||
\addlinespace[2pt]
|
\addlinespace[2pt]
|
||||||
32 & 0.765 & \textbf{0.775} & 0.717 & 0.752 & \textbf{1.000} & \textbf{1.000} & 0.921 & 0.917 \\
|
32 & \textbf{0.625} & 0.621 & 0.217 & 0.315 & \textbf{0.999} & 0.997 & 0.241 & 0.426 \\
|
||||||
64 & 0.754 & \textbf{0.773} & 0.718 & 0.742 & \textbf{1.000} & \textbf{1.000} & 0.917 & 0.931 \\
|
64 & 0.611 & \textbf{0.621} & 0.215 & 0.371 & 0.996 & \textbf{0.998} & 0.233 & 0.531 \\
|
||||||
128 & 0.758 & 0.769 & 0.719 & \textbf{0.775} & \textbf{1.000} & \textbf{1.000} & 0.921 & 0.967 \\
|
128 & 0.607 & \textbf{0.615} & 0.218 & 0.486 & 0.996 & \textbf{0.998} & 0.241 & 0.729 \\
|
||||||
256 & 0.749 & 0.768 & 0.712 & \textbf{0.793} & 0.999 & \textbf{1.000} & 0.918 & 0.966 \\
|
256 & 0.604 & \textbf{0.612} & 0.214 & 0.452 & 0.984 & \textbf{0.998} & 0.236 & 0.664 \\
|
||||||
512 & 0.766 & 0.770 & 0.712 & \textbf{0.804} & 0.989 & \textbf{1.000} & 0.920 & 0.949 \\
|
512 & 0.578 & \textbf{0.608} & 0.216 & 0.397 & 0.916 & \textbf{0.998} & 0.241 & 0.550 \\
|
||||||
768 & 0.746 & 0.750 & 0.713 & \textbf{0.812} & \textbf{1.000} & \textbf{1.000} & 0.923 & 0.960 \\
|
768 & 0.597 & \textbf{0.598} & 0.219 & 0.439 & 0.994 & \textbf{0.995} & 0.244 & 0.624 \\
|
||||||
1024 & 0.743 & 0.739 & 0.716 & \textbf{0.821} & \textbf{1.000} & \textbf{1.000} & 0.919 & 0.956 \\
|
1024 & \textbf{0.601} & 0.591 & 0.215 & 0.394 & 0.990 & \textbf{0.993} & 0.235 & 0.529 \\
|
||||||
\bottomrule
|
\bottomrule
|
||||||
\end{tabularx}
|
\end{tabularx}
|
||||||
\caption{ROC AUC means across 5 folds for both evaluations, grouped by labeling regime. Maximum observed standard deviation across all cells (not shown in table): 0.06.}
|
|
||||||
\end{table}
|
\end{table}
|
||||||
|
|
||||||
%\fig{roc_prc_unsup}{figures/roc_prc_unsup.png}{ROC and PRC curves for DeepSAD, Isolation Forest, and OCSVM (unsupervised, all latent dimensions).}
|
|
||||||
|
|
||||||
\threadtodo
|
% \begin{table}[t]
|
||||||
{Interpret unsupervised results across architectures and baselines}
|
% \centering
|
||||||
{Important to establish the baseline performance levels}
|
% \caption{AP means $\pm$ std across 5 folds for experiment-based evaluation only, grouped by labeling regime.}
|
||||||
{Compare AUCs: Isolation Forest weakest, OCSVM moderate (uses encoder), DeepSAD best}
|
% \label{tab:results_ap_with_std}
|
||||||
{Sets expectation for whether supervision improves or harms performance}
|
% \begin{tabularx}{\textwidth}{c*{4}{Y}}
|
||||||
|
% \toprule
|
||||||
|
% Latent Dim. & \rotheader{DeepSAD \\(LeNet)} & \rotheader{DeepSAD\\(Efficient)} & \rotheader{IsoForest} & \rotheader{OC-SVM} \\
|
||||||
|
% \midrule
|
||||||
|
% \multicolumn{5}{l}{\textbf{Labeling regime: }\(\mathbf{0/0}\)} \\
|
||||||
|
% \addlinespace[2pt]
|
||||||
|
% 32 & 0.664$\,\pm\,0.029$ & 0.650$\,\pm\,0.017$ & 0.217$\,\pm\,0.010$ & 0.315$\,\pm\,0.050$ \\
|
||||||
|
% 64 & 0.635$\,\pm\,0.018$ & 0.643$\,\pm\,0.016$ & 0.215$\,\pm\,0.008$ & 0.371$\,\pm\,0.076$ \\
|
||||||
|
% 128 & 0.642$\,\pm\,0.022$ & 0.642$\,\pm\,0.017$ & 0.218$\,\pm\,0.010$ & 0.486$\,\pm\,0.088$ \\
|
||||||
|
% 256 & 0.615$\,\pm\,0.022$ & 0.631$\,\pm\,0.015$ & 0.214$\,\pm\,0.010$ & 0.452$\,\pm\,0.064$ \\
|
||||||
|
% 512 & 0.613$\,\pm\,0.015$ & 0.635$\,\pm\,0.016$ & 0.216$\,\pm\,0.012$ & 0.397$\,\pm\,0.053$ \\
|
||||||
|
% 768 & 0.609$\,\pm\,0.036$ & 0.617$\,\pm\,0.016$ & 0.219$\,\pm\,0.008$ & 0.439$\,\pm\,0.093$ \\
|
||||||
|
% 1024 & 0.607$\,\pm\,0.018$ & 0.612$\,\pm\,0.018$ & 0.215$\,\pm\,0.003$ & 0.394$\,\pm\,0.049$ \\
|
||||||
|
% \midrule
|
||||||
|
% \multicolumn{5}{l}{\textbf{Labeling regime: }\(\mathbf{50/10}\)} \\
|
||||||
|
% \addlinespace[2pt]
|
||||||
|
% 32 & 0.569$\,\pm\,0.061$ & 0.582$\,\pm\,0.008$ & 0.217$\,\pm\,0.010$ & 0.315$\,\pm\,0.050$ \\
|
||||||
|
% 64 & 0.590$\,\pm\,0.032$ & 0.592$\,\pm\,0.017$ & 0.215$\,\pm\,0.008$ & 0.371$\,\pm\,0.076$ \\
|
||||||
|
% 128 & 0.566$\,\pm\,0.078$ & 0.588$\,\pm\,0.011$ & 0.218$\,\pm\,0.010$ & 0.486$\,\pm\,0.088$ \\
|
||||||
|
% 256 & 0.598$\,\pm\,0.027$ & 0.587$\,\pm\,0.015$ & 0.214$\,\pm\,0.010$ & 0.452$\,\pm\,0.064$ \\
|
||||||
|
% 512 & 0.550$\,\pm\,0.157$ & 0.587$\,\pm\,0.020$ & 0.216$\,\pm\,0.012$ & 0.397$\,\pm\,0.053$ \\
|
||||||
|
% 768 & 0.596$\,\pm\,0.014$ & 0.577$\,\pm\,0.017$ & 0.219$\,\pm\,0.008$ & 0.439$\,\pm\,0.093$ \\
|
||||||
|
% 1024 & 0.601$\,\pm\,0.012$ & 0.568$\,\pm\,0.013$ & 0.215$\,\pm\,0.003$ & 0.394$\,\pm\,0.049$ \\
|
||||||
|
% \midrule
|
||||||
|
% \multicolumn{5}{l}{\textbf{Labeling regime: }\(\mathbf{500/100}\)} \\
|
||||||
|
% \addlinespace[2pt]
|
||||||
|
% 32 & 0.625$\,\pm\,0.015$ & 0.621$\,\pm\,0.009$ & 0.217$\,\pm\,0.010$ & 0.315$\,\pm\,0.050$ \\
|
||||||
|
% 64 & 0.611$\,\pm\,0.014$ & 0.621$\,\pm\,0.018$ & 0.215$\,\pm\,0.008$ & 0.371$\,\pm\,0.076$ \\
|
||||||
|
% 128 & 0.607$\,\pm\,0.014$ & 0.615$\,\pm\,0.015$ & 0.218$\,\pm\,0.010$ & 0.486$\,\pm\,0.088$ \\
|
||||||
|
% 256 & 0.604$\,\pm\,0.022$ & 0.612$\,\pm\,0.017$ & 0.214$\,\pm\,0.010$ & 0.452$\,\pm\,0.064$ \\
|
||||||
|
% 512 & 0.578$\,\pm\,0.109$ & 0.608$\,\pm\,0.018$ & 0.216$\,\pm\,0.012$ & 0.397$\,\pm\,0.053$ \\
|
||||||
|
% 768 & 0.597$\,\pm\,0.014$ & 0.598$\,\pm\,0.017$ & 0.219$\,\pm\,0.008$ & 0.439$\,\pm\,0.093$ \\
|
||||||
|
% 1024 & 0.601$\,\pm\,0.013$ & 0.591$\,\pm\,0.016$ & 0.215$\,\pm\,0.003$ & 0.394$\,\pm\,0.049$ \\
|
||||||
|
% \bottomrule
|
||||||
|
% \end{tabularx}
|
||||||
|
% \end{table}
|
||||||
|
|
||||||
\threadtodo
|
Representative precision–recall curves illustrate how methods differ in their operating regimes (Figure~\ref{fig:prc_representative}). DeepSAD shows a stable high-precision region up to about 0.5 recall, followed by a sharp drop once it is forced to classify borderline cases. OC-SVM declines gradually without ever reaching a strong plateau, while Isolation Forest detects only a few extreme anomalies before collapsing to near-random performance. These qualitative differences are masked in single-number metrics but are critical for interpreting how the methods would behave in deployment.
|
||||||
{Present semi-supervised regimes and their effects}
|
|
||||||
{Semi-supervision is central to DeepSAD; must show how labels change outcomes}
|
|
||||||
{Show ROC/PRC plots for selected latent sizes under different labeling regimes}
|
|
||||||
{This leads $\rightarrow$ analysis of why few labels harmed but many labels improved}
|
|
||||||
|
|
||||||
%\fig{roc_prc_semi}{figures/roc_prc_semi.png}{ROC and PRC curves for selected latent sizes under different semi-supervised regimes.}
|
\fig{prc_representative}{figures/results_prc.png}{Representative precision–recall curves over all latent dimensionalities for the 0/0 semi-labeling regime under experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OC-SVM declines more smoothly but exhibits high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
|
||||||
|
|
||||||
\threadtodo
|
\subsection{Effect of latent space dimensionality}
|
||||||
{Discuss surprising supervision dynamics}
|
Figure~\ref{fig:latent_dim_ap} plots AP versus latent dimension under the experiment-based evaluation. DeepSAD benefits from compact latent spaces (e.g., 32–128), with diminishing or negative returns at larger codes. Baseline methods are largely flat across dimensions, reflecting their reliance on fixed embeddings. (Hand-labeled results saturate and are shown in the appendix.)
|
||||||
{Reader expects supervision to always help; but results show nuance}
|
|
||||||
{Interpret why few labels overfit, many labels help, unsupervised sometimes best}
|
|
||||||
{This discussion $\rightarrow$ motivates looking at model behavior over time via inference}
|
|
||||||
|
|
||||||
% --- Section: Inference Experiments ---
|
\fig{latent_dim_ap}{figures/latent_dim_ap.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD benefits from smaller codes; baselines remain flat.}
|
||||||
\newsection{results_inference}{Inference on Held-Out Experiments}
|
|
||||||
|
|
||||||
\threadtodo
|
\subsection{Effect of semi-supervised labeling regime}
|
||||||
{Introduce inference evaluation on unseen experiments}
|
Figure~\ref{fig:labeling_regime_ap} compares AP across labeling regimes (0/0, 50/10, 500/100). Surprisingly, the unsupervised regime (0/0) often performs best; adding labels does not consistently help, likely due to label noise and the scarcity/ambiguity of anomalous labels. Baselines (which do not use labels) are stable across regimes.
|
||||||
{This tests real-world usefulness: continuous scan-level degradation quantification}
|
|
||||||
{Explain setup: EMA-smoothed z-scores compared against heuristic degradation indicators}
|
|
||||||
{From static metrics $\rightarrow$ to temporal behavior analysis}
|
|
||||||
|
|
||||||
%\fig{inference_indicators}{figures/inference_indicators.png}{Example inference traces: EMA-smoothed anomaly scores compared to missing-point percentage and near-sensor returns.}
|
\fig{labeling_regime_ap}{figures/labeling_regime_ap.png}{AP across semi-supervised labeling regimes. Unsupervised training often performs best; added labels do not yield consistent gains under noisy conditions.}
|
||||||
|
|
||||||
\threadtodo
|
|
||||||
{Analyze correlation of anomaly scores with degradation indicators}
|
|
||||||
{Important because it shows methods behave as intended even without perfect ground truth}
|
|
||||||
{Discuss qualitative similarity, emphasize scores as degradation proxies}
|
|
||||||
{Sets stage $\rightarrow$ for clean vs degraded comparison}
|
|
||||||
|
|
||||||
\threadtodo
|
|
||||||
{Compare anomaly score dynamics between clean and degraded experiments}
|
|
||||||
{Tests whether scores separate normal vs degraded traversals reliably}
|
|
||||||
{Show normalized z-score plots using clean-experiment parameters}
|
|
||||||
{Final confirmation $\rightarrow$ methods are meaningful for degradation quantification}
|
|
||||||
|
|
||||||
%\fig{inference_clean_vs_smoke}{figures/inference_clean_vs_smoke.png}{Normalized anomaly scores for a clean vs degraded experiment. Clear amplitude separation is visible.}
|
|
||||||
|
|
||||||
% \todo[inline]{introductory paragraph results}
|
|
||||||
% \todo[inline]{autoencoder results, compare lenet to efficient, shows that efficient is better and especially at lower latent dims, interesting to see in future exps if autencoder results appear to transfer to deepsad training results, therefore not a single latent dim in later exps, but rather all so it can be compared. also interesting to see if efficient better than lenet since reconstruction loss is better for efficient}
|
|
||||||
%
|
|
||||||
% \todo[inline]{we already have results graphs loss over latent dims with both lenet and effficient arch in plot, we also have overall plot as well as one for evaluation only with degraded data (anomalies) to see how good the networks are in reconstructing anomalies, not only normal data, plots enough or table with results necessary?}
|
|
||||||
%
|
|
||||||
% \todo[inline]{transition to main training results, should we show ROC/PRC comparisons of methods first or should we first show inference as score over time for one (during training left out) experiment?}
|
|
||||||
%
|
|
||||||
% \todo[inline]{main training compare roc/prc of 7 latent dimensionalities with 0 normal 0 anomalous semi regime, plot with 7 subplots, both deepsad better than baselines in both labeling regimes (experiment based and subjective hand-labeled evaluations). as expected isoforest worst since its simplest, ocsvm better since it profits from pre-trained encoder which should be good at dim reduction while maximizing retained information, efficient and lenet have similar results, although efficient has less variance between folds which could either mean its more effective at finding patterns (due to maybe more channels, better receptive field, etc) or it could mean it overfits more readily to data? not sure tbh and I don't think we can interpret from these limited evaluations, but better evaluation not possible without good ground truth}
|
|
||||||
%
|
|
||||||
% \todo[inline]{main training compare roc/prc of semi-regimes from 2 or 3 latent dimensionalities, show that unsupervised was best, then heavily semi-supervised then a few labeled samples in last position, why was this? maybe the few labeled examples create overfit already and lot of them improve overfit but are worse than generalized unsupervised version?}
|
|
||||||
%
|
|
||||||
% \todo[inline]{inference results showing the general workings of the methods on two experiments (one degraded, one normal - so no smoke) which were left out during training of these methods. inference plots of which 2 kinds exist: one that compares the smoothed z-score of the methods (to reduce noise in plots with EMA, which is not dependent on future data, so could be used in realtime and reacts way faster than moving averages and z-score is used since the analog output values from the different methods have different signs and magnitudes) with two statistical values we discussed in data section, namely missing percentage of points per lidar scan and erroneous near-sensor returns which have to be early returns per scan. these show that all methods have comparative qualities to these statistics, although the should not be taken as a ground truth, just as an indicator showing that generally the intended use case appears to be fulfilled by all methods (which was to interpret the anomaly score as a degradtaion quantification of each individual scan)}
|
|
||||||
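A minimal sketch of this smoothing, assuming per-scan anomaly scores are standardized with z-score parameters estimated on a clean reference run and then smoothed with a past-only exponential moving average (the function name and smoothing factor are illustrative, not the inference module's actual API):

import numpy as np

def ema_zscore(scores: np.ndarray, ref_mean: float, ref_std: float, alpha: float = 0.1) -> np.ndarray:
    """Standardize raw per-scan anomaly scores with clean-run statistics and EMA-smooth them.

    The EMA uses only past samples, so it could run online during a traversal.
    """
    z = (scores - ref_mean) / ref_std
    smoothed = np.empty_like(z)
    running = z[0]
    for i, value in enumerate(z):
        running = alpha * value + (1.0 - alpha) * running
        smoothed[i] = running
    return smoothed

# Hypothetical usage: normalize both runs with the clean run's statistics so that the
# amplitude difference of the degraded run remains visible in the plotted traces.
clean_scores = np.random.default_rng(1).normal(0.0, 1.0, 500)  # stand-in for clean-run scores
smoke_scores = clean_scores + 4.0                               # stand-in for degraded-run scores
mu, sigma = clean_scores.mean(), clean_scores.std()
clean_trace = ema_zscore(clean_scores, mu, sigma)
smoke_trace = ema_zscore(smoke_scores, mu, sigma)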
% \todo[inline]{the second kind of inference plots shows the difference between scores produced on normal (non-degraded) experiment data vs scores produced on anomalous (degraded) data by normalizing the timeline of two experiments of which one contains no smoke and one has been degraded with artificial smoke. this has been achieved by using the z-score parameters of the clean data scores on both the clean experiment scores and the degraded experiment scores to show that there is a large difference between the amplitudes of these methods' scores for the two types of experiments}
|
|
||||||
%
|
|
||||||
% \todo[inline]{anything else for results or simply transition to conclusion and future work?}
|
|
||||||
%
|
|
||||||
%
|
|
||||||
% \newsection{hyperparameter_analysis}{Hyperparameter Analysis}
|
|
||||||
% \todo[inline]{result for different amounts of labeled data}
|
|
||||||
|
|
||||||
\newchapter{conclusion_future_work}{Conclusion and Future Work}
|
\newchapter{conclusion_future_work}{Conclusion and Future Work}
|
||||||
\newsection{conclusion}{Conclusion}
|
\newsection{conclusion}{Conclusion}
|
||||||
|
|||||||
@@ -26,7 +26,8 @@ SCHEMA_STATIC = {
|
|||||||
"eval": pl.Utf8, # "exp_based" | "manual_based"
|
"eval": pl.Utf8, # "exp_based" | "manual_based"
|
||||||
"fold": pl.Int32,
|
"fold": pl.Int32,
|
||||||
# metrics
|
# metrics
|
||||||
"auc": pl.Float64,
|
"roc_auc": pl.Float64, # <-- renamed from 'auc'
|
||||||
|
"prc_auc": pl.Float64, # <-- new
|
||||||
"ap": pl.Float64,
|
"ap": pl.Float64,
|
||||||
# per-sample scores: list of (idx, label, score)
|
# per-sample scores: list of (idx, label, score)
|
||||||
"scores": pl.List(
|
"scores": pl.List(
|
||||||
@@ -114,6 +115,43 @@ SCHEMA_INFERENCE = {
|
|||||||
# ------------------------------------------------------------
|
# ------------------------------------------------------------
|
||||||
# Helpers: curve/scores normalizers (tuples/ndarrays -> dict/list)
|
# Helpers: curve/scores normalizers (tuples/ndarrays -> dict/list)
|
||||||
# ------------------------------------------------------------
|
# ------------------------------------------------------------
|
||||||
|
|
||||||
|
|
||||||
|
def compute_prc_auc_from_curve(prc_curve: dict | None) -> float | None:
|
||||||
|
"""
|
||||||
|
Compute AUC of the Precision-Recall curve via trapezoidal rule.
|
||||||
|
Expects prc_curve = {"precision": [...], "recall": [...], "thr": [...] (optional)}.
|
||||||
|
Robust to NaNs, unsorted recall, and missing endpoints; returns np.nan if empty.
|
||||||
|
"""
|
||||||
|
if not prc_curve:
|
||||||
|
return np.nan
|
||||||
|
precision = np.asarray(prc_curve.get("precision", []), dtype=float)
|
||||||
|
recall = np.asarray(prc_curve.get("recall", []), dtype=float)
|
||||||
|
if precision.size == 0 or recall.size == 0:
|
||||||
|
return np.nan
|
||||||
|
|
||||||
|
mask = ~(np.isnan(precision) | np.isnan(recall))
|
||||||
|
precision, recall = precision[mask], recall[mask]
|
||||||
|
if recall.size == 0:
|
||||||
|
return np.nan
|
||||||
|
|
||||||
|
# Sort by recall, clip to [0,1]
|
||||||
|
order = np.argsort(recall)
|
||||||
|
recall = np.clip(recall[order], 0.0, 1.0)
|
||||||
|
precision = np.clip(precision[order], 0.0, 1.0)
|
||||||
|
|
||||||
|
# Ensure curve spans [0,1] in recall (hold precision constant at ends)
|
||||||
|
if recall[0] > 0.0:
|
||||||
|
recall = np.insert(recall, 0, 0.0)
|
||||||
|
precision = np.insert(precision, 0, precision[0])
|
||||||
|
if recall[-1] < 1.0:
|
||||||
|
recall = np.append(recall, 1.0)
|
||||||
|
precision = np.append(precision, precision[-1])
|
||||||
|
|
||||||
|
# Trapezoidal AUC
|
||||||
|
return float(np.trapezoid(precision, recall))
|
||||||
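# Usage example (hypothetical curve, not taken from any results file):
#
#   curve = {"precision": [1.0, 0.8, 0.6], "recall": [0.1, 0.5, 0.9]}
#   compute_prc_auc_from_curve(curve)  # ~0.80 after padding recall to span [0, 1]
#
# Note: np.trapezoid requires NumPy >= 2.0; on older NumPy the equivalent call is np.trapz.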
|
|
||||||
|
|
||||||
def _tolist(x):
|
def _tolist(x):
|
||||||
if x is None:
|
if x is None:
|
||||||
return None
|
return None
|
||||||
@@ -357,23 +395,28 @@ def rows_from_ocsvm_default(data: dict, evals: List[str]) -> Dict[str, dict]:
|
|||||||
# Build the Polars DataFrame
|
# Build the Polars DataFrame
|
||||||
# ------------------------------------------------------------
|
# ------------------------------------------------------------
|
||||||
def load_results_dataframe(root: Path, allow_cache: bool = True) -> pl.DataFrame:
|
def load_results_dataframe(root: Path, allow_cache: bool = True) -> pl.DataFrame:
|
||||||
"""
|
|
||||||
Walks experiment subdirs under `root`. For each (model, fold) it adds rows:
|
|
||||||
Columns (SCHEMA_STATIC):
|
|
||||||
network, latent_dim, semi_normals, semi_anomalous,
|
|
||||||
model, eval, fold,
|
|
||||||
auc, ap, scores{sample_idx,orig_label,score},
|
|
||||||
roc_curve{fpr,tpr,thr}, prc_curve{precision,recall,thr},
|
|
||||||
sample_indices, sample_labels, valid_mask,
|
|
||||||
train_time, test_time,
|
|
||||||
folder, k_fold_num
|
|
||||||
"""
|
|
||||||
if allow_cache:
|
if allow_cache:
|
||||||
cache = root / "results_cache.parquet"
|
cache = root / "results_cache.parquet"
|
||||||
if cache.exists():
|
if cache.exists():
|
||||||
try:
|
try:
|
||||||
df = pl.read_parquet(cache)
|
df = pl.read_parquet(cache)
|
||||||
print(f"[info] loaded cached results frame from {cache}")
|
print(f"[info] loaded cached results frame from {cache}")
|
||||||
# Backward-compat: old caches may have 'auc' but no 'roc_auc'/'prc_auc'
if "roc_auc" not in df.columns and "auc" in df.columns:
    df = df.rename({"auc": "roc_auc"})
if "prc_auc" not in df.columns and "prc_curve" in df.columns:
    df = df.with_columns(
        pl.struct(
            pl.col("prc_curve").struct.field("precision"),
            pl.col("prc_curve").struct.field("recall"),
        )
        # Polars passes each struct element to map_elements as a dict keyed by
        # field name, so index by name rather than by position.
        .map_elements(
            lambda s: compute_prc_auc_from_curve(
                {"precision": s["precision"], "recall": s["recall"]}
            )
        )
        .alias("prc_auc")
    )
return df
|
return df
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
print(f"[warn] failed to load cache {cache}: {e}")
|
print(f"[warn] failed to load cache {cache}: {e}")
|
||||||
@@ -408,15 +451,17 @@ def load_results_dataframe(root: Path, allow_cache: bool = True) -> pl.DataFrame
                continue

            if model == "deepsad":
-                per_eval = rows_from_deepsad(data, EVALS)  # eval -> dict
+                per_eval = rows_from_deepsad(data, EVALS)
            elif model == "isoforest":
-                per_eval = rows_from_isoforest(data, EVALS)  # eval -> dict
+                per_eval = rows_from_isoforest(data, EVALS)
            elif model == "ocsvm":
-                per_eval = rows_from_ocsvm_default(data, EVALS)  # eval -> dict
+                per_eval = rows_from_ocsvm_default(data, EVALS)
            else:
                per_eval = {}

            for ev, vals in per_eval.items():
+                # compute prc_auc now (fast), rename auc->roc_auc
+                prc_auc_val = compute_prc_auc_from_curve(vals.get("prc"))
                rows.append(
                    {
                        "network": network,
@@ -426,7 +471,8 @@ def load_results_dataframe(root: Path, allow_cache: bool = True) -> pl.DataFrame
                        "model": model,
                        "eval": ev,
                        "fold": fold,
-                        "auc": vals["auc"],
+                        "roc_auc": vals["auc"],  # renamed
+                        "prc_auc": prc_auc_val,  # new
                        "ap": vals["ap"],
                        "scores": vals["scores"],
                        "roc_curve": vals["roc"],
@@ -442,20 +488,19 @@ def load_results_dataframe(root: Path, allow_cache: bool = True) -> pl.DataFrame
                    }
                )

-    # If empty, return a typed empty frame
    if not rows:
+        # Return a typed empty frame (new schema)
        return pl.DataFrame(schema=SCHEMA_STATIC)

    df = pl.DataFrame(rows, schema=SCHEMA_STATIC)

-    # Cast to efficient dtypes (categoricals etc.) – no extra sanitation
+    # Cast to efficient dtypes (categoricals etc.)
    df = df.with_columns(
        pl.col("network", "model", "eval").cast(pl.Categorical),
        pl.col(
            "latent_dim", "semi_normals", "semi_anomalous", "fold", "k_fold_num"
        ).cast(pl.Int32),
-        pl.col("auc", "ap", "train_time", "test_time").cast(pl.Float64),
-        # NOTE: no cast on 'scores' here; it's already List(Struct) per schema.
+        pl.col("roc_auc", "prc_auc", "ap", "train_time", "test_time").cast(pl.Float64),
    )

    if allow_cache:
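With the rename to roc_auc and the new prc_auc column in place, the loaded frame can be summarized directly in Polars. A small usage sketch (the results directory and the exact aggregation are assumptions based on the schema above):

from pathlib import Path

import polars as pl

from load_results import load_results_dataframe

# Hypothetical results directory; point this at the actual experiment output root.
df = load_results_dataframe(Path("results"), allow_cache=True)

# Per-fold metrics are now available as 'roc_auc', 'prc_auc' and 'ap';
# e.g. mean and std of AP per model and evaluation across folds:
summary = (
    df.group_by(["model", "eval"])
    .agg(pl.col("ap").mean().alias("mean_ap"), pl.col("ap").std().alias("std_ap"))
    .sort("model", "eval")
)
print(summary)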
@@ -7,10 +7,10 @@ from pathlib import Path
 import matplotlib.pyplot as plt
 import numpy as np
 import polars as pl
-from matplotlib.lines import Line2D

 # CHANGE THIS IMPORT IF YOUR LOADER MODULE IS NAMED DIFFERENTLY
-from plot_scripts.load_results import load_results_dataframe
+from load_results import load_results_dataframe
+from matplotlib.lines import Line2D

 # ----------------------------
 # Config
@@ -26,6 +26,10 @@ SEMI_ANOMALOUS = 10

 # Which evaluation columns to plot
 EVALS = ["exp_based", "manual_based"]
+EVALS_LABELS = {
+    "exp_based": "Experiment-Label-Based",
+    "manual_based": "Manually-Labeled",
+}

 # Latent dimensions to show as 7 subplots
 LATENT_DIMS = [32, 64, 128, 256, 512, 768, 1024]
@@ -188,7 +192,7 @@ def plot_grid_from_df(
     Create a 2x4 grid of subplots, one per latent dim; 8th panel holds legend.
     kind: 'roc' or 'prc'
     """
-    fig_title = f"{kind.upper()} — {eval_type} (semi = {semi_normals}/{semi_anomalous})"
+    fig_title = f"{kind.upper()} — {EVALS_LABELS[eval_type]} (Semi-Labeling Regime = {semi_normals}/{semi_anomalous})"
     fig, axes = _ensure_dim_axes(fig_title)

     # plotting order & colors
@@ -213,7 +217,7 @@
         if i >= 7:
             break  # last slot reserved for legend
         ax = axes[i]
-        ax.set_title(f"latent_dim = {dim}")
+        ax.set_title(f"Latent Dim. = {dim}")
         ax.grid(True, alpha=0.3)

         if kind == "roc":
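plot_grid_from_df relies on _ensure_dim_axes, which lies outside the changed hunks. A minimal sketch of what such a grid helper could look like (the figure size is an assumption; only the seven-panels-plus-legend layout is taken from the docstring above):

import matplotlib.pyplot as plt


def _ensure_dim_axes(fig_title: str):
    # Hypothetical sketch: seven panels for the latent dimensions,
    # with the eighth axis left empty to hold the shared legend.
    fig, axes = plt.subplots(2, 4, figsize=(16, 8))
    fig.suptitle(fig_title)
    axes = axes.ravel()
    axes[7].axis("off")  # last slot reserved for the legend
    return fig, axes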
@@ -5,6 +5,7 @@ from dataclasses import dataclass
 from datetime import datetime
 from pathlib import Path

+import numpy as np
 import polars as pl

 # CHANGE THIS IMPORT IF YOUR LOADER MODULE IS NAMED DIFFERENTLY
@@ -41,6 +42,17 @@ DECIMALS = 3  # cells look like 1.000 or 0.928 (3 decimals)
 # ----------------------------
 # Helpers
 # ----------------------------


+def _fmt_mean_std(mean: float | None, std: float | None) -> str:
+    """Format mean ± std with 3 decimals (leading zero), or '--' if missing."""
+    if mean is None or not (mean == mean):  # NaN check
+        return "--"
+    if std is None or not (std == std):
+        return f"{mean:.3f}"
+    return f"{mean:.3f}$\\,\\pm\\,{std:.3f}$"
+
+
 def _with_net_label(df: pl.DataFrame) -> pl.DataFrame:
     """Add a canonical 'net_label' column like the plotting script (LeNet/Efficient/fallback)."""
     return df.with_columns(
@@ -68,7 +80,7 @@ def _filter_base(df: pl.DataFrame) -> pl.DataFrame:
         "net_label",
         "latent_dim",
         "fold",
-        "auc",
+        "ap",
         "eval",
         "semi_normals",
         "semi_anomalous",
@@ -84,7 +96,7 @@ class Cell:
 def _compute_cells(df: pl.DataFrame) -> dict[tuple[str, int, str, str, int, int], Cell]:
     """
     Compute per-(eval, latent_dim, model, net_label, semi_normals, semi_anomalous)
-    mean/std for AUC across folds.
+    mean/std for AP across folds.
     """
     if df.is_empty():
         return {}
@@ -107,9 +119,7 @@ def _compute_cells(df: pl.DataFrame) -> dict[tuple[str, int, str, str, int, int]
                 "semi_anomalous",
             ]
         )
-        .agg(
-            pl.col("auc").mean().alias("mean_auc"), pl.col("auc").std().alias("std_auc")
-        )
+        .agg(pl.col("ap").mean().alias("mean_ap"), pl.col("ap").std().alias("std_ap"))
         .to_dicts()
     )

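The new _fmt_mean_std helper produces the LaTeX cell strings used by the tables below; a few example calls for reference (the module name in the import is hypothetical):

from ap_tables import _fmt_mean_std  # hypothetical module name for the table script

print(_fmt_mean_std(0.928, 0.031))        # 0.928$\,\pm\,0.031$
print(_fmt_mean_std(1.0, None))           # 1.000
print(_fmt_mean_std(float("nan"), 0.02))  # --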
@@ -123,10 +133,96 @@ def _compute_cells(df: pl.DataFrame) -> dict[tuple[str, int, str, str, int, int]
             int(row["semi_normals"]),
             int(row["semi_anomalous"]),
         )
-        out[key] = Cell(mean=row.get("mean_auc"), std=row.get("std_auc"))
+        out[key] = Cell(mean=row.get("mean_ap"), std=row.get("std_ap"))
     return out


+def method_label(model: str, net_label: str) -> str:
+    """Map (model, net_label) to the four method names used in headers/caption."""
+    if model == "deepsad" and net_label == "LeNet":
+        return "DeepSAD (LeNet)"
+    if model == "deepsad" and net_label == "Efficient":
+        return "DeepSAD (Efficient)"
+    if model == "isoforest":
+        return "IsoForest"
+    if model == "ocsvm":
+        return "OC-SVM"
+    # ignore anything else (e.g., other backbones)
+    return ""
+
+
+def per_method_median_std_from_cells(
+    cells: dict[tuple[str, int, str, str, int, int], Cell],
+) -> dict[str, float]:
+    """Compute the median std across all cells, per method."""
+    stds_by_method: dict[str, list[float]] = {
+        "DeepSAD (LeNet)": [],
+        "DeepSAD (Efficient)": [],
+        "IsoForest": [],
+        "OC-SVM": [],
+    }
+
+    for key, cell in cells.items():
+        (ev, dim, model, net, semi_n, semi_a) = key
+        name = method_label(model, net)
+        if name and (cell.std is not None) and (cell.std == cell.std):  # not NaN
+            stds_by_method[name].append(cell.std)
+
+    return {
+        name: float(np.median(vals)) if vals else float("nan")
+        for name, vals in stds_by_method.items()
+    }
+
+
+def per_method_max_std_from_cells(
+    cells: dict[tuple[str, int, str, str, int, int], Cell],
+) -> tuple[dict[str, float], dict[str, tuple]]:
+    """
+    Scan the aggregated 'cells' and return:
+    - max_std_by_method: dict {"DeepSAD (LeNet)": 0.037, ...}
+    - argmax_key_by_method: which cell (eval, dim, model, net, semi_n, semi_a) produced that max
+    Only considers the four methods shown in the table.
+    """
+    max_std_by_method: dict[str, float] = {
+        "DeepSAD (LeNet)": float("nan"),
+        "DeepSAD (Efficient)": float("nan"),
+        "IsoForest": float("nan"),
+        "OC-SVM": float("nan"),
+    }
+    argmax_key_by_method: dict[str, tuple] = {}
+
+    for key, cell in cells.items():
+        (ev, dim, model, net, semi_n, semi_a) = key
+        name = method_label(model, net)
+        if name == "" or cell.std is None or not (cell.std == cell.std):  # empty/NaN
+            continue
+        cur = max_std_by_method.get(name, float("nan"))
+        if (cur != cur) or (cell.std > cur):  # handle NaN initial
+            max_std_by_method[name] = cell.std
+            argmax_key_by_method[name] = key
+
+    # Replace remaining NaNs with 0.0 for nice formatting
+    for k, v in list(max_std_by_method.items()):
+        if not (v == v):  # NaN
+            max_std_by_method[k] = 0.0
+
+    return max_std_by_method, argmax_key_by_method
+
+
+def _fmt_val(val: float | None) -> str:
+    """
+    Format value as:
+    - '--' if None/NaN
+    - '1.0' if exactly 1 (within 1e-9)
+    - '.xx' otherwise (2 decimals, no leading 0)
+    """
+    if val is None or not (val == val):  # None or NaN
+        return "--"
+    if abs(val - 1.0) < 1e-9:
+        return "1.0"
+    return f"{val:.2f}".lstrip("0")
+
+
 def _fmt_mean(mean: float | None) -> str:
     return "--" if (mean is None or not (mean == mean)) else f"{mean:.{DECIMALS}f}"

@@ -150,6 +246,61 @@ def _bold_best_mask_display(values: list[float | None], decimals: int) -> list[b
     return [(v is not None and v == maxv) for v in rounded]


+def _build_exp_based_table(
+    cells: dict[tuple[str, int, str, str, int, int], Cell],
+    *,
+    semi_labeling_regimes: list[tuple[int, int]],
+) -> str:
+    """
+    Build LaTeX table with mean ± std values for experiment-based evaluation only.
+    """
+    header_cols = [
+        r"\rotheader{DeepSAD\\(LeNet)}",
+        r"\rotheader{DeepSAD\\(Efficient)}",
+        r"\rotheader{IsoForest}",
+        r"\rotheader{OC-SVM}",
+    ]
+
+    lines: list[str] = []
+    lines.append(r"\begin{table}[t]")
+    lines.append(r"\centering")
+    lines.append(r"\setlength{\tabcolsep}{4pt}")
+    lines.append(r"\renewcommand{\arraystretch}{1.2}")
+    lines.append(r"\begin{tabularx}{\textwidth}{c*{4}{Y}}")
+    lines.append(r"\toprule")
+    lines.append(r"Latent Dim. & " + " & ".join(header_cols) + r" \\")
+    lines.append(r"\midrule")
+
+    for idx, (semi_n, semi_a) in enumerate(semi_labeling_regimes):
+        # regime label row
+        lines.append(
+            rf"\multicolumn{{5}}{{l}}{{\textbf{{Labeling regime: }}\(\mathbf{{{semi_n}/{semi_a}}}\)}} \\"
+        )
+        lines.append(r"\addlinespace[2pt]")
+
+        for dim in LATENT_DIMS:
+            row_vals = []
+            for model, net in METHOD_COLUMNS:
+                key = ("exp_based", dim, model, net, semi_n, semi_a)
+                cell = cells.get(key, Cell(None, None))
+                row_vals.append(_fmt_mean_std(cell.mean, cell.std))
+
+            lines.append(f"{dim} & " + " & ".join(row_vals) + r" \\")
+
+        if idx < len(semi_labeling_regimes) - 1:
+            lines.append(r"\midrule")
+
+    lines.append(r"\bottomrule")
+    lines.append(r"\end{tabularx}")
+    lines.append(
+        r"\caption{AP means $\pm$ std across 5 folds for experiment-based evaluation only, grouped by labeling regime.}"
+    )
+    lines.append(r"\end{table}")
+
+    return "\n".join(lines)
+
+
 def _build_single_table(
     cells: dict[tuple[str, int, str, str, int, int], Cell],
     *,
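The new method_label and per_method_median_std_from_cells helpers aggregate the per-cell standard deviations that end up in the table captions. A toy illustration with hand-made values (the Cell contents and the 50/10 regime are made up; the import path is hypothetical):

from ap_tables import Cell, method_label, per_method_median_std_from_cells  # hypothetical module name

# Keys follow (eval, latent_dim, model, net_label, semi_normals, semi_anomalous).
toy_cells = {
    ("exp_based", 64, "deepsad", "LeNet", 50, 10): Cell(mean=0.93, std=0.02),
    ("exp_based", 128, "deepsad", "LeNet", 50, 10): Cell(mean=0.94, std=0.04),
    ("exp_based", 64, "isoforest", "", 50, 10): Cell(mean=0.71, std=0.05),
}

print(method_label("deepsad", "LeNet"))             # DeepSAD (LeNet)
print(per_method_median_std_from_cells(toy_cells))  # DeepSAD (LeNet): 0.03, IsoForest: 0.05, others NaN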
@@ -224,6 +375,12 @@ def _build_single_table(
             cell = cells.get(key, Cell(None, None))
             means_left.append(cell.mean)
             cell_strs_left.append(_fmt_mean(cell.mean))
+            # mean_str = _fmt_val(cell.mean)
+            # std_str = _fmt_val(cell.std)
+            # if mean_str == "--":
+            #     cell_strs_left.append("--")
+            # else:
+            #     cell_strs_left.append(f"{mean_str} $\\textpm$ {std_str}")
             push_std(cell.std)

         # Right group: manual_based
@@ -233,6 +390,12 @@ def _build_single_table(
             cell = cells.get(key, Cell(None, None))
             means_right.append(cell.mean)
             cell_strs_right.append(_fmt_mean(cell.mean))
+            # mean_str = _fmt_val(cell.mean)
+            # std_str = _fmt_val(cell.std)
+            # if mean_str == "--":
+            #     cell_strs_right.append("--")
+            # else:
+            #     cell_strs_right.append(f"{mean_str} $\\textpm$ {std_str}")
             push_std(cell.std)

         # Bolding per group based on displayed precision
@@ -264,11 +427,23 @@ def _build_single_table(
     lines.append(r"\bottomrule")
     lines.append(r"\end{tabularx}")

-    # Caption with max std (not shown in table)
-    max_std_str = "n/a" if max_std is None else f"{max_std:.{DECIMALS}f}"
+    # Compute per-method max std across everything included in the table
+    # max_std_by_method, argmax_key = per_method_max_std_from_cells(cells)
+    median_std_by_method = per_method_median_std_from_cells(cells)
+
+    # Optional: print where each max came from (helps verify)
+    for name, v in median_std_by_method.items():
+        print(f"[max-std] {name}: {v:.3f}")
+
+    cap_parts = []
+    for name in ["DeepSAD (LeNet)", "DeepSAD (Efficient)", "IsoForest", "OC-SVM"]:
+        v = median_std_by_method.get(name, 0.0)
+        cap_parts.append(f"{name} {v:.3f}")
+    cap_str = "; ".join(cap_parts)
+
     lines.append(
-        rf"\caption{{AUC means across 5 folds for both evaluations, grouped by labeling regime. "
-        rf"Maximum observed standard deviation across all cells (not shown in table): {max_std_str}.}}"
+        rf"\caption{{AP means across 5 folds for both evaluations, grouped by labeling regime. "
+        rf"Maximum observed standard deviation per method (not shown in table): {cap_str}.}}"
     )
     lines.append(r"\end{table}")

@@ -296,10 +471,17 @@ def main():
         cells, semi_labeling_regimes=SEMI_LABELING_REGIMES
     )

-    out_name = "auc_table_all_evals_all_regimes.tex"
+    out_name = "ap_table_all_evals_all_regimes.tex"
     out_path = ts_dir / out_name
     out_path.write_text(tex, encoding="utf-8")

+    # Build experiment-based table with mean ± std
+    tex_exp = _build_exp_based_table(cells, semi_labeling_regimes=SEMI_LABELING_REGIMES)
+
+    out_name_exp = "ap_table_exp_based_mean_std.tex"
+    out_path_exp = ts_dir / out_name_exp
+    out_path_exp.write_text(tex_exp, encoding="utf-8")
+
     # Copy this script to preserve the code used for the outputs
     script_path = Path(__file__)
     shutil.copy2(script_path, ts_dir / script_path.name)