wip
@@ -984,7 +984,7 @@ The results of pretraining the two autoencoder architectures are summarized in T
\figc{ae_loss_overall}{figures/ae_elbow_test_loss_overall.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures.}{width=.9\textwidth}
Because overall reconstruction loss might obscure how well encoders represent anomalous samples, we additionally evaluate reconstruction errors only on degraded samples from hand-labeled smoke segments (Figure~\ref{fig:ae_loss_degraded}). As expected, reconstruction losses are higher on these challenging samples than in the overall evaluation. However, the relative advantage of the Efficient architecture remains, suggesting that its improvements extend to anomalous inputs as well.
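As a minimal sketch of this degraded-only evaluation, the per-sample reconstruction losses can be computed once and then averaged over the full test set and over the hand-labeled smoke frames separately (PyTorch pseudocode; the names \texttt{model}, \texttt{frames}, and \texttt{is\_degraded} are illustrative placeholders, not identifiers from our implementation):
\begin{verbatim}
import torch
import torch.nn.functional as F

@torch.no_grad()
def reconstruction_losses(model, frames):
    """Per-sample MSE reconstruction loss for a batch of input frames."""
    recon = model(frames)  # autoencoder forward pass
    return F.mse_loss(recon, frames, reduction="none").flatten(1).mean(dim=1)

losses = reconstruction_losses(model, frames)
overall_loss = losses.mean()                 # evaluation over all test frames
degraded_loss = losses[is_degraded].mean()   # hand-labeled smoke frames only
\end{verbatim}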
\figc{ae_loss_degraded}{figures/ae_elbow_test_loss_anomaly.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures, evaluated only on degraded data from hand-labeled smoke experiments.}{width=.9\textwidth}
@@ -1045,16 +1045,19 @@ Table~\ref{tab:results_ap} summarizes average precision (AP) across latent dimen
\end{table}
The precision--recall curves (Figure~\ref{fig:prc_representative}) illustrate these effects more clearly. For DeepSAD, precision stays close to 1 until about 0.5 recall, after which it drops off sharply. This plateau corresponds to the fraction of truly degraded frames in the anomalous set. Once recall moves beyond this point, the evaluation demands that the model also “find” the mislabeled anomalies near the run boundaries. To do so, the decision threshold must be lowered so far that many normal frames are also flagged, which causes precision to collapse. The baselines behave differently: OC-SVM shows a smooth but weaker decline without a strong high-precision plateau, while Isolation Forest collapses to near-random performance. These operational differences are hidden in a single AP number but are important for judging how the methods would behave in deployment.
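The curves and AP values themselves follow directly from the per-frame anomaly scores; a small sketch using scikit-learn (the arrays \texttt{scores} and \texttt{labels} are placeholders for one method's scores and the experiment-based evaluation labels):
\begin{verbatim}
from sklearn.metrics import precision_recall_curve, average_precision_score

# scores: higher = more anomalous; labels: 1 = anomalous frame, 0 = normal frame
precision, recall, _ = precision_recall_curve(labels, scores)
ap = average_precision_score(labels, scores)

# Extent of the high-precision operating region, e.g. the recall that is still
# reachable while precision stays at or above 0.95.
recall_at_p95 = recall[precision >= 0.95].max()
\end{verbatim}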
Taken together, the two evaluation schemes provide complementary insights. The experiment-based labels offer a noisy but realistic setting that shows how methods cope with ambiguous data, while the hand-labeled intervals confirm that DeepSAD can achieve nearly perfect separation when the ground truth is clean. The combination of both evaluations makes clear that (i) DeepSAD is stronger than the baselines under both conditions, (ii) the apparent performance limits under experiment-based labels are mainly due to label noise, and (iii) interpreting results requires care, since performance drops in the curves often reflect mislabeled samples rather than model failures. At the same time, both schemes remain binary classification tasks and therefore cannot directly evaluate the central question of whether anomaly scores can serve as a continuous measure of degradation. For this reason, we extend the analysis in Section~\ref{sec:results_inference}, where inference on entire unseen experiments is used to provide a more intuitive demonstration of the methods’ potential for quantifying LiDAR degradation in practice.
\fig{prc_representative}{figures/results_prc.png}{Representative precision--recall curves over all latent dimensionalities for semi-labeling regime 0/0 from experiment-based evaluation labels. DeepSAD maintains a large high-precision operating region before collapsing; OC-SVM declines more smoothly but exhibits high standard deviation between folds; IsoForest collapses quickly and remains flat. DeepSAD's fall-off is at least partly due to known mislabeled evaluation targets.}
\paragraph{Effect of latent space dimensionality.}
Figure~\ref{fig:latent_dim_ap} shows how average precision changes with latent dimension under the experiment-based evaluation. The best performance is reached with compact latent spaces (32–128), while performance drops as the latent dimension grows. This can be explained by how the latent space controls the separation between normal and anomalous samples. Small bottlenecks act as a form of regularization, keeping the representation compact and making it easier to distinguish clear anomalies from normal frames. Larger latent spaces increase model capacity, but this extra flexibility also allows more overlap between normal frames and the mislabeled anomalies from the evaluation data. As a result, the model struggles more to keep the two groups apart.
During autoencoder pretraining we observed that reconstruction loss decreased monotonically with larger latent spaces, as expected: a bigger bottleneck allows the encoder–decoder to retain more information. If autoencoder performance were directly predictive of DeepSAD performance, we would therefore expect average precision to improve with larger latent dimensions. The actual results, however, show the opposite trend (Figure~\ref{fig:latent_dim_ap}): compact latent spaces (32–128) achieve the highest AP, while performance declines as the latent size grows. This inverse correlation is most clearly visible in the unsupervised case. Part of this effect can be attributed to evaluation label noise, which larger spaces amplify. More importantly, it shows that autoencoder performance does not translate directly into DeepSAD performance. Pretraining losses can still help compare different architectures in terms of robustness and performance, but they cannot be used to tune the latent dimensionality: the dimensionality that minimizes reconstruction loss in pretraining is not necessarily the one that maximizes anomaly detection performance in DeepSAD.
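This inverse relationship can be checked directly by rank-correlating the two quantities across the evaluated latent dimensions; a sketch under the assumption that per-dimension results are available as the hypothetical dictionaries \texttt{ae\_loss} and \texttt{deepsad\_ap}:
\begin{verbatim}
from scipy.stats import spearmanr

latent_dims = sorted(ae_loss)  # e.g. [32, 64, 128, 256, 512, 1024]
rho, p_value = spearmanr([ae_loss[d] for d in latent_dims],
                         [deepsad_ap[d] for d in latent_dims])
# A strongly negative rho means lower reconstruction loss (larger bottleneck)
# coincides with lower anomaly-detection AP, i.e. the inverse trend above.
\end{verbatim}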
This effect is clearly visible in the precision--recall curves. For DeepSAD at all dimensionalities we observe high initial precision and a steep drop once the evaluation demands that mislabeled anomalies be included. However, the sharpness of this drop depends on the latent size: at 32 dimensions the fall is comparatively gradual, while at 1024 it is almost vertical. In practice, this means that higher-dimensional latent spaces amplify the label-noise problem and lead to sudden precision collapses once the clear anomalies have been detected. Compact latent spaces are therefore more robust under noisy evaluation conditions and appear to be the safer choice for real-world deployment.
\figc{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows inverse correlation between AP and latent space size.}{width=.7\textwidth}
@@ -1085,7 +1088,6 @@ These frame-by-frame time-axis plots simulate online inference and illustrate ho
The plots in Fig.~\ref{fig:results_inference_normal_vs_degraded} highlight important differences in how well the tested methods distinguish between normal and degraded sensor conditions.
Among the four approaches, the strongest separation is achieved by DeepSAD (Efficient), followed by DeepSAD (LeNet), then OC-SVM.
For Isolation Forest, the anomaly scores are already elevated in the clean experiment, which prevents reliable differentiation between normal and degraded runs and makes the method unsuitable in this context.
It is important to note that the score axes are scaled individually per method, so comparisons should focus on relative separation rather than absolute values.
Because the raw anomaly scores of the different methods lie on incomparable scales, we apply $z$-score normalization based on the clean experiment. The mean and standard deviation are estimated exclusively from the clean experiment and then used to normalize the degraded scores as well, so that increases in the degraded runs are measured relative to the distribution of the clean baseline; computing separate $z$-scores per experiment would only reveal deviations within each run individually and would not enable a meaningful cross-experiment comparison. Note that the $z$-scores remain method-specific: the relative separation between clean and degraded runs can be compared within a method, but the absolute scales of different methods are not directly comparable, so readers should take note of the differing axis ranges of each subplot. After normalization, the resulting time series were still highly noisy, which motivated the application of exponential moving average (EMA) smoothing. EMA was chosen because it is causal (it does not rely on future data) and thus suitable for real-time inference. Although it introduces a small time delay, this delay is shorter than for other smoothing techniques such as running averages.
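A minimal sketch of this normalization and smoothing step, assuming per-frame score arrays \texttt{scores\_clean} and \texttt{scores\_degraded} for one method (the names and the smoothing factor are illustrative, not taken from our implementation):
\begin{verbatim}
import numpy as np

def ema(x, alpha=0.05):
    """Causal exponential moving average: each value uses only past samples."""
    out = np.empty_like(x, dtype=float)
    out[0] = x[0]
    for t in range(1, len(x)):
        out[t] = alpha * x[t] + (1.0 - alpha) * out[t - 1]
    return out

# Baseline statistics come from the clean experiment only and are reused
# to normalize the degraded run, so both are on the clean-run scale.
mu, sigma = scores_clean.mean(), scores_clean.std()
z_clean = ema((scores_clean - mu) / sigma)
z_degraded = ema((scores_degraded - mu) / sigma)
\end{verbatim}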
@@ -1110,7 +1112,9 @@ Our results indicate a qualified “yes.” Using anomaly detection (AD)—in pa
\item \textbf{Two-track evaluation protocol.} We frame and use two complementary label sets: (i) \emph{experiment-based} labels (objective but noisy at run boundaries), and (ii) \emph{hand-labeled} intervals (clean but simplified). This pairing clarifies what each scheme can—and cannot—tell us about real performance (Section~\ref{sec:results_deepsad}).
\item \textbf{Latent dimensionality insight.}
Autoencoder pretraining loss decreases with larger latent spaces, but DeepSAD performance shows the opposite trend: compact bottlenecks (32–128) achieve the highest AP. This contrast demonstrates that pretraining performance does not directly predict DeepSAD performance—latent dimensionality cannot be tuned via autoencoder loss alone, even though it remains useful for comparing architectures.
\item \textbf{Semi-supervision insight.} In our data, \emph{unsupervised} DeepSAD performed best; \emph{light} labeling (50/10) performed worst; \emph{many} labels (500/100) partially recovered performance but did not surpass unsupervised. Evidence from PRC shapes and fold variance points to \emph{training-side overfitting to a small labeled set}, an effect that persists even under clean hand-labeled evaluation (Table~\ref{tab:results_ap}, Figure~\ref{fig:prc_over_semi}).