thesis/Main.pdf (binary file, not shown)
@@ -389,8 +389,7 @@ From equation~\ref{eq:deepsvdd_optimization_objective} it is easy to understand
The first term of equation~\ref{eq:deepsad_optimization_objective} stays mostly the same, differing only in that the $m$ introduced labeled data samples are included in its normalization. The second term is newly introduced to incorporate the labeled data samples with a strength controlled by the hyperparameter $\eta$, either minimizing or maximizing the distance between a sample's latent representation and $\mathbf{c}$ depending on that sample's label $\tilde{y}$. The standard L2 regularization is kept identical to Deep SVDD's optimization objective. It can also be observed that in the case of $m = 0$ labeled samples, DeepSAD falls back to Deep SVDD's optimization objective and can therefore be used in a completely unsupervised fashion as well.
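As a concrete illustration of how this objective is evaluated for a mini-batch, the following PyTorch-style sketch computes the first two terms; the L2 weight regularization is assumed to be handled by the optimizer's weight decay, and the variable names are illustrative rather than taken from the reference implementation.

\begin{verbatim}
import torch

def deepsad_batch_loss(z, c, semi_targets, eta=1.0, eps=1e-6):
    # z:            latent representations phi(x; W), shape (batch, latent_dim)
    # c:            hypersphere center, shape (latent_dim,)
    # semi_targets: 0 = unlabeled, +1 = labeled normal, -1 = labeled anomalous
    dist = torch.sum((z - c) ** 2, dim=1)
    # Unlabeled samples contribute their squared distance to c (Deep SVDD term);
    # labeled samples are pulled towards (+1) or pushed away from (-1) the
    # center, weighted by eta. With no labeled samples this reduces to Deep SVDD.
    losses = torch.where(
        semi_targets == 0,
        dist,
        eta * ((dist + eps) ** semi_targets.float()),
    )
    return torch.mean(losses)
\end{verbatim}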
\paragraph{Hyperparameters}
DeepSAD relies on several tuneable hyperparameters that influence different stages of the algorithm. The most relevant ones are summarized and discussed below.
\begin{itemize}
@@ -567,7 +566,6 @@ In the following sections, we detail our adaptations to this framework:
Together, these components define the full experimental pipeline, from data loading and preprocessing through method training to the evaluation and comparison of methods.
\section{Framework \& Data Preparation}
%\newsubsubsectionNoTOC{DeepSAD PyTorch codebase and our adaptations}
DeepSAD's PyTorch implementation—our starting point—includes implementations for training on standardized datasets such as MNIST, CIFAR-10 and datasets from \citetitle{odds}~\cite{odds}. The framework can train and test DeepSAD as well as a number of baseline algorithms, namely SSAD, OCSVM, Isolation Forest, KDE and SemiDGM, on the loaded data, and evaluates their performance by calculating the Receiver Operating Characteristic (ROC) and its Area Under the Curve (AUC) for each algorithm. We adapted this implementation, originally developed for Python 3.7, to work with Python 3.12, changed and added functionality for loading our chosen dataset, added DeepSAD models that work with the lidar projection data type, and extended the framework with additional evaluation methods and an inference module.
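To make the kind of model adaptation concrete, the sketch below shows a minimal convolutional encoder that maps a single-channel $32 \times 2048$ lidar projection to a latent vector; it is only an illustrative stand-in with assumed layer sizes, not the LeNet-inspired or Efficient architecture evaluated later.

\begin{verbatim}
import torch
import torch.nn as nn

class RangeImageEncoder(nn.Module):
    """Illustrative encoder for a 1 x 32 x 2048 lidar projection."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1),   # -> 8 x 16 x 1024
            nn.ReLU(inplace=True),
            nn.Conv2d(8, 16, 3, stride=2, padding=1),  # -> 16 x 8 x 512
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), # -> 32 x 4 x 256
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, 8)),              # -> 32 x 1 x 8
        )
        self.fc = nn.Linear(32 * 8, latent_dim)

    def forward(self, x):
        return self.fc(self.features(x).flatten(start_dim=1))
\end{verbatim}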
@@ -728,7 +726,7 @@ To compare the computational efficiency of the two architectures we show the num
\FloatBarrier
\paragraph{Baseline methods (Isolation Forest, OCSVM)}
To contextualize the performance of DeepSAD, we compare against two widely used baselines: Isolation Forest and OCSVM. Both are included in the original DeepSAD codebase and the associated paper, and they represent well-understood but conceptually different families of anomaly detection methods. In our setting, the raw input dimensionality ($2048 \times 32$ per frame) is too high for a direct OCSVM fit, so we reuse the DeepSAD autoencoder’s \emph{encoder} as a learned dimensionality reduction (to the same latent size as DeepSAD), allowing OCSVM training on this latent space. Together, these two baselines cover complementary perspectives: tree-based partitioning of the raw input (Isolation Forest) and kernel-based boundary learning on the reduced representation (OCSVM), providing a broad and well-established basis for comparison.
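A rough sketch of how these two baselines can be set up on our data is given below; the function and variable names are illustrative and do not mirror the codebase's API, and a pretrained encoder is assumed for the OCSVM branch.

\begin{verbatim}
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# X_train: flattened lidar frames, shape (n_frames, 32 * 2048) (illustrative)
# encode:  pretrained autoencoder encoder mapping frames to the latent space

def fit_baselines(X_train, encode):
    # Isolation Forest partitions the raw input space directly.
    iso = IsolationForest(n_estimators=100, random_state=0).fit(X_train)
    # OCSVM is fit on the learned latent representation instead of raw frames.
    ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(encode(X_train))
    return iso, ocsvm

def anomaly_scores(iso, ocsvm, X, encode):
    # Flip signs so that higher values mean "more anomalous" for both methods.
    return -iso.score_samples(X), -ocsvm.decision_function(encode(X))
\end{verbatim}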
@@ -956,17 +954,17 @@ Due to the challenges of ground truth quality, evaluation results must be interp
\item \textbf{Manually-defined labels:} A cleaner ground truth, containing only clearly degraded frames. This removes mislabeled intervals and allows nearly perfect separation. However, it also simplifies the task too much, because borderline cases are excluded.
\end{itemize}
Table~\ref{tab:results_ap} summarizes average precision (AP) across latent dimensions, labeling regimes, and methods. Under experiment-based evaluation, both DeepSAD variants consistently outperform the baselines, reaching AP values around 0.60–0.66 compared to 0.21 for Isolation Forest and 0.31–0.49 for OCSVM. Under manually-defined evaluation, DeepSAD achieves nearly perfect AP in all settings, while the baselines remain much lower. This contrast shows that the lower AP under experiment-based evaluation is not a weakness of DeepSAD itself, but a direct result of mislabeled samples in the evaluation data. The manually-defined scheme therefore confirms that DeepSAD separates clearly normal from clearly degraded frames very well, while also highlighting that label noise must be kept in mind when interpreting the experiment-based results.
\begin{table}[t]
\centering
\caption{AP means across 5 folds for both evaluations, grouped by labeling regime. Mean observed standard deviation per method: DeepSAD (LeNet) 0.015; DeepSAD (Efficient) 0.012; IsoForest 0.011; OCSVM 0.091.}
\label{tab:results_ap}
\begin{tabularx}{\textwidth}{c*{4}{Y}|*{4}{Y}}
\toprule
& \multicolumn{4}{c}{Experiment-based eval.} & \multicolumn{4}{c}{Manually-defined eval.} \\
\cmidrule(lr){2-5} \cmidrule(lr){6-9}
Latent Dim. & \rotheader{DeepSAD \\(LeNet)} & \rotheader{DeepSAD\\(Efficient)} & \rotheader{IsoForest} & \rotheader{OCSVM} & \rotheader{DeepSAD\\(LeNet)} & \rotheader{DeepSAD\\(Efficient)} & \rotheader{IsoForest} & \rotheader{OCSVM} \\
\midrule
\multicolumn{9}{l}{\textbf{Labeling regime: }\(\mathbf{0/0}\) \textit{(normal/anomalous samples labeled)}} \\
\addlinespace[2pt]
@@ -1002,11 +1000,13 @@ Table~\ref{tab:results_ap} summarizes average precision (AP) across latent dimen
\end{table}
The precision--recall curves (Figure~\ref{fig:prc_representative}) illustrate these effects more clearly. For DeepSAD, precision stays close to 1 until about 0.5 recall, after which it drops off sharply. This plateau corresponds to the fraction of truly degraded frames in the anomalous set. Once recall moves beyond this point, the evaluation demands that the model also “find” the mislabeled anomalies near the run boundaries. To do so, the decision threshold must be lowered so far that many normal frames are also flagged, which causes precision to collapse. The baselines behave differently: OCSVM shows a smooth but weaker decline without a strong high-precision plateau, while Isolation Forest collapses to near-random performance. These operational differences are hidden in a single AP number but are important for judging how the methods would behave in deployment.
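The metric computation itself is standard; a minimal scikit-learn sketch (assuming binary frame labels and per-frame anomaly scores as inputs) could look as follows.

\begin{verbatim}
from sklearn.metrics import average_precision_score, precision_recall_curve

def pr_evaluation(y_true, scores):
    # y_true: 1 = degraded frame, 0 = normal; scores: higher = more anomalous.
    # AP summarizes the PR curve in a single number; under strong class
    # imbalance it is more informative than ROC/AUC.
    ap = average_precision_score(y_true, scores)
    precision, recall, _ = precision_recall_curve(y_true, scores)
    return ap, precision, recall

# Toy usage with made-up labels and scores:
ap, precision, recall = pr_evaluation([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
\end{verbatim}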
Taken together, the two evaluation schemes provide complementary insights. The experiment-based labels offer a noisy but realistic setting that shows how methods cope with ambiguous data, while the manually-defined labels confirm that DeepSAD can achieve nearly perfect separation when the ground truth is clean. The combination of both evaluations makes clear that (i) DeepSAD is stronger than the baselines under both conditions, (ii) the apparent performance limits under experiment-based labels are mainly due to label noise, and (iii) interpreting results requires care, since performance drops in the curves often reflect mislabeled samples rather than model failures. At the same time, both schemes remain binary classifications and therefore cannot directly evaluate the central question of whether anomaly scores can serve as a continuous measure of degradation. For this reason, we extend the analysis in Section~\ref{sec:results_inference}, where inference on entire unseen experiments is used to provide a more intuitive demonstration of the methods’ potential for quantifying lidar degradation in practice.
\fig{prc_representative}{figures/results_prc.png}{Representative precision–recall curves over all latent dimensionalities for semi-labeling regime 0/0 from experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OCSVM declines more smoothly but exhibits high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
\FloatBarrier
\paragraph{Effect of latent space dimensionality.}
During autoencoder pretraining we observed that reconstruction loss decreased monotonically with larger latent spaces, as expected: a bigger bottleneck allows the encoder–decoder to retain more information. If autoencoder performance were directly predictive of DeepSAD performance, we would therefore expect average precision to improve with larger latent dimensions. The actual results, however, show the opposite trend (Figure~\ref{fig:latent_dim_ap}): compact latent spaces (32–128) achieve the highest AP, while performance declines as the latent size grows. This inverse correlation is most clearly visible in the unsupervised case. Part of this effect can be attributed to evaluation label noise, which larger spaces amplify. More importantly, it shows that autoencoder performance does not translate directly into DeepSAD performance. Pretraining losses can still help compare different architectures for robustness and performance, but they cannot be used to tune the latent dimensionality: the dimensionality that minimizes reconstruction loss in pretraining is not necessarily the one that maximizes anomaly detection performance in DeepSAD.
@@ -1018,6 +1018,7 @@ During autoencoder pretraining we observed that reconstruction loss decreased mo
\figc{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows inverse correlation between AP and latent space size.}{width=.7\textwidth}
\FloatBarrier
\paragraph{Effect of semi-supervised labeling.}
Table~\ref{tab:results_ap} shows that the unsupervised regime \((0/0)\) achieves the best AP, while the lightly supervised regime \((50/10)\) performs worst. With many labels \((500/100)\), performance improves again but remains slightly below the unsupervised case. This pattern also appears under the manually-defined evaluation, which excludes mislabeled frames. The drop with light supervision therefore cannot be explained by noisy evaluation targets, but must stem from the training process itself.
@@ -1042,16 +1043,17 @@ These frame-by-frame time-axis plots simulate online inference and illustrate ho
\fig{results_inference_normal_vs_degraded}{figures/results_inference_normal_vs_degraded.png}{Comparison of anomaly detection methods with statistical indicators across clean (dashed) and degraded (solid) experiments. Each subplot shows one method (DeepSAD--LeNet, DeepSAD--Efficient, OCSVM, Isolation Forest). Red curves denote method anomaly scores normalized to the clean experiment; blue and green curves denote the percentage of missing lidar points and near-sensor particle hits, respectively. Clear separation between clean and degraded runs is observed for the DeepSAD variants and, to a lesser degree, for OCSVM, while Isolation Forest produces high scores even in the clean experiment. The latent space dimensionality was 32 and the semi-supervised labeling regime during training was 0/0 (no labeled normal or anomalous samples).}
The plots in Figure~\ref{fig:results_inference_normal_vs_degraded} highlight important differences in how well the tested methods distinguish between normal and degraded sensor conditions.
Among the four approaches, the strongest separation is achieved by DeepSAD (Efficient), followed by DeepSAD (LeNet), then OCSVM.
For Isolation Forest, the anomaly scores are already elevated in the clean experiment, which prevents reliable differentiation between normal and degraded runs and makes the method unsuitable in this context.
Because anomaly scores are on incomparable scales, we apply $z$-score normalization based on the clean experiment. This allows deviations in degraded runs to be measured relative to the clean baseline, enabling direct comparison across methods. To allow comparison between the clean and degraded experiments, the mean and standard deviation were estimated exclusively from the clean experiment and then used to normalize the degraded scores as well. After normalization, the resulting time series were still highly noisy, which motivated the application of exponential moving average (EMA) smoothing. EMA was chosen because it is causal (does not rely on future data) and thus suitable for real-time inference. Although it introduces a small time delay, this delay is shorter than for other smoothing techniques such as running averages.
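A minimal NumPy sketch of this normalization and smoothing step is shown below; it assumes one anomaly-score array per experiment, and the smoothing factor is an illustrative choice rather than the value used in our pipeline.

\begin{verbatim}
import numpy as np

def normalize_and_smooth(scores_clean, scores_degraded, alpha=0.1):
    # z-score both runs using statistics from the clean run only, so that
    # deviations in the degraded run are measured against the clean baseline.
    mu, sigma = scores_clean.mean(), scores_clean.std()
    z_clean = (scores_clean - mu) / sigma
    z_degraded = (scores_degraded - mu) / sigma

    def ema(x):
        # Causal exponential moving average: uses only past and current samples.
        out = np.empty_like(x, dtype=float)
        out[0] = x[0]
        for t in range(1, len(x)):
            out[t] = alpha * x[t] + (1 - alpha) * out[t - 1]
        return out

    return ema(z_clean), ema(z_degraded)
\end{verbatim}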
The red method curves can also be compared with the blue and green statistical indicators (missing points and near-sensor particle hits).
While some similarities in shape may suggest that the methods partly capture these statistics, such interpretations should be made with caution.
The anomaly detection models are expected to have learned additional patterns that are not directly observable from simple statistics, and these may also contribute to their ability to separate degraded from clean data.
% -------------------------------
% Conclusion & Future Work (intro)
% -------------------------------
@@ -1065,11 +1067,8 @@ Our results indicate a qualified “yes.” Using anomaly detection (AD)—in pa
\paragraph{Main contributions.}
\begin{itemize}
\item \textbf{Empirical comparison for lidar degradation.} A systematic evaluation of DeepSAD against Isolation Forest and OCSVM across latent sizes and labeling regimes, showing that DeepSAD consistently outperforms the baselines under both evaluation schemes (Section~\ref{sec:results_deepsad}).
\item \textbf{Two-track evaluation protocol.} We frame and use two complementary label sets: (i) \emph{experiment-based} labels (objective but noisy at run boundaries), and (ii) \emph{manually-defined} intervals (clean but simplified). This pairing clarifies what each scheme can—and cannot—tell us about real performance (Section~\ref{sec:results_deepsad}).
% \item \textbf{Latent dimensionality insight.} Compact bottlenecks (32–128) are more robust under noisy labels and yield the best AP; larger latent spaces amplify precision collapses beyond the high-precision plateau (Figure~\ref{fig:latent_dim_ap}). High-dimensional input data apparently can be compressed quite strongly, which may lead to improved performance and better generalization.
\item \textbf{Latent dimensionality insight.}
Autoencoder pretraining loss decreases with larger latent spaces, but DeepSAD performance shows the opposite trend: compact bottlenecks (32–128) achieve the highest AP. This contrast demonstrates that pretraining performance does not directly predict DeepSAD performance—latent dimensionality cannot be tuned via autoencoder loss alone, even though it remains useful for comparing architectures.
@@ -1082,7 +1081,7 @@ Our results indicate a qualified “yes.” Using anomaly detection (AD)—in pa
\paragraph{Practical recommendations.}
For settings similar to ours, we recommend:
(i) use PRC and AP for model selection and reporting, since ROC and its AUC can give overly optimistic results under strong class imbalance;
(ii) prefer compact latent spaces (e.g., 32–128) and determine the smallest dimensionality that still preserves task-relevant information;
(iii) evaluate multiple encoder architectures, as design choices strongly affect performance and robustness;
(iv) avoid very small labeled sets, which can cause overfitting to narrow anomaly exemplars. If labels are used, collect many and diverse examples—though unsupervised training may still generalize best.
@@ -1103,9 +1102,9 @@ Finally, the binary ground truth employed here is insufficient for the quantific
\newsection{conclusion_ad}{Insights into DeepSAD and AD for Degradation Quantification}
This work has shown that the DeepSAD principle is applicable to lidar degradation in hazardous environments and yields promising detection performance as well as runtime feasibility (see Sections~\ref{sec:results_deepsad} and~\ref{sec:setup_experiments_environment}). Compared to simpler baselines such as Isolation Forest and OCSVM, DeepSAD achieved much stronger separation between clean and degraded data. While OCSVM showed smoother but weaker separation and Isolation Forest produced a high rate of false positives even in clean runs, both DeepSAD variants maintained large high-precision regions before collapsing under mislabeled evaluation targets.
However, the semi-supervised component of DeepSAD did not improve results in our setting. In fact, adding a small number of labels often reduced performance due to overfitting to narrow subsets of anomalies; larger labeled sets stabilized training but still did not surpass the unsupervised regime (see Section~\ref{sec:results_deepsad}). This suggests that without representative and diverse labeled anomalies, unsupervised training remains the safer choice.
We also observed that the choice of encoder architecture and latent dimensionality are critical. The Efficient encoder consistently outperformed the LeNet-inspired baseline, producing more stable precision–recall curves and stronger overall results. Similarly, compact latent spaces (32–128 dimensions) yielded the best performance and proved more robust under noisy evaluation conditions, while larger latent spaces amplified the impact of mislabeled samples and caused sharper precision collapses. These findings underline the importance of representation design for robust anomaly detection.
@@ -1120,7 +1119,7 @@ Several promising avenues remain open for future exploration:
\item \textbf{Lidar intensity:} Lidars typically record an intensity value per point, indicating the strength of the reflected optical signal, which could be incorporated to improve degradation quantification.
\item \textbf{Sensor fusion:} Combining lidar with complementary sensors (e.g., ultrasonic sensors that penetrate dense clouds) could mitigate blind spots inherent to single-sensor evaluation.
\item \textbf{Input segmentation:} The DeepSAD architecture tested here processed full 360° lidar scans. This may obscure localized degradations. Segmenting point clouds into angular sectors and computing anomaly scores per sector could provide more fine-grained quantification. Preliminary tests in this direction were promising, but were not pursued further in this thesis.
\item \textbf{Cross-sensor generalization:} Current experiments assume identical sensor resolution. Extending the method to work across different lidar types, including those with varying angular resolutions, remains an open question and would enhance applicability in heterogeneous robotic fleets and allow the incorporation of more datasets during training.
\end{itemize}
In summary, while this thesis demonstrates the feasibility of using anomaly detection for lidar degradation quantification, significant challenges remain. Chief among them are the definition and collection of ground truth, the development of analog evaluation targets, and architectural adaptations for more complex real-world scenarios. Addressing these challenges will be critical for moving from proof-of-concept to practical deployment in rescue robotics and beyond.
@@ -1,3 +1,7 @@
\addcontentsline{toc}{chapter}{Abstract (English)}
\begin{center}\Large\bfseries Abstract (English)\end{center}\vspace*{1cm}\noindent
Autonomous robots are increasingly used in search and rescue (SAR) missions. In these missions, lidar sensors are often the most important source of environmental data. However, lidar data can degrade under hazardous conditions, especially when airborne particles such as smoke or dust are present. This degradation can lead to errors in mapping and navigation and may endanger both the robot and human rescuers. Robots therefore need a way to estimate the reliability of their lidar data.
This thesis investigates whether anomaly detection methods can be used to quantify lidar data degradation. We apply a semi-supervised deep learning approach called DeepSAD. The method produces an anomaly score for each lidar scan, which can serve as a measure of data reliability.
We evaluate this method against baseline methods on a subterranean dataset that includes lidar scans degraded by artificial smoke. Our results show that DeepSAD consistently outperforms the baselines and can clearly distinguish degraded from normal scans. At the same time, we find that the limited availability of labeled data and the lack of robust ground truth remain major challenges. Despite these limitations, our work demonstrates that anomaly detection methods are a promising tool for lidar degradation quantification in SAR scenarios.