Jan Kowalczyk
2025-09-11 14:50:16 +02:00
parent 35766b9028
commit e4b298cf06
6 changed files with 362 additions and 26 deletions

@@ -1137,29 +1137,36 @@ The decoder (see figure~\ref{fig:setup_arch_ef_decoder}) mirrors the encoders
}
Even though both encoders were designed for the same input dimensionality of $1\times 2048 \times 32$, their computational requirements differ significantly. To quantify this, we compared the number of trainable parameters and the number of multiply-accumulate operations (MACs) for different latent space sizes used in our experiments.
%Even though both encoders were designed for the same input dimensionality of $1\times 2048 \times 32$, their computational requirements differ significantly. To quantify this, we compared the number of trainable parameters and the number of multiply-accumulate operations (MACs) for different latent space sizes used in our experiments.
To compare the computational efficiency of the two architectures, Table~\ref{tab:params_lenet_vs_efficient} lists the number of trainable parameters and of multiply-accumulate operations (MACs) for the latent space sizes used in our experiments. Even though the efficient architecture employs more layers and channels, which lets the network recognize a greater variety of patterns than the LeNet-inspired one, the two encoders require a similar number of MACs. The more complex decoder design of the efficient network, however, contributes substantially more MACs, which leads to the longer pretraining times reported in Section~\ref{sec:setup_experiments_environment}.
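Such counts can also be reproduced programmatically. The following is a minimal sketch using the \texttt{thop} package, assuming the encoder is exposed as a PyTorch \texttt{nn.Module}; the class name \texttt{SubTerLeNetEncoder} and its constructor signature are placeholders, not the actual names in the DeepSAD codebase:
\begin{verbatim}
import torch
from thop import profile  # pip install thop

# Placeholder for the actual encoder class in the DeepSAD codebase.
encoder = SubTerLeNetEncoder(latent_dim=128)

# One dummy frame with the input dimensionality used throughout this work.
dummy = torch.randn(1, 1, 2048, 32)

# thop returns (MACs, trainable parameters) for a single forward pass.
macs, params = profile(encoder, inputs=(dummy,))
print(f"params: {params / 1e6:.2f}M, MACs: {macs / 1e6:.2f}M")
\end{verbatim}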
\begin{table}[h]
\begin{table}[!ht]
\centering
\renewcommand{\arraystretch}{1.15}
\begin{tabularx}{\linewidth}{crrrrrrrr}
\hline
& \multicolumn{4}{c}{\textbf{Encoders}} & \multicolumn{4}{c}{\textbf{Autoencoders}} \\
\cline{2-9}
& \multicolumn{2}{c}{\textbf{LeNet}} & \multicolumn{2}{c}{\textbf{Efficient}} & \multicolumn{2}{c}{\textbf{LeNet}} & \multicolumn{2}{c}{\textbf{Efficient}} \\
\cline{2-9}
\textbf{Latent $z$} & \textbf{Params} & \textbf{MACs} & \textbf{Params} & \textbf{MACs} & \textbf{Params} & \textbf{MACs} & \textbf{Params} & \textbf{MACs} \\
\hline
32 & 0.53M & 27.92M & 0.26M & 29.82M & 1.05M & 54.95M & 0.53M & 168.49M \\
64 & 1.05M & 28.44M & 0.53M & 30.08M & 2.10M & 56.00M & 1.06M & 169.02M \\
128 & 2.10M & 29.49M & 1.05M & 30.61M & 4.20M & 58.10M & 2.11M & 170.07M \\
256 & 4.20M & 31.59M & 2.10M & 31.65M & 8.39M & 62.29M & 4.20M & 172.16M \\
512 & 8.39M & 35.78M & 4.20M & 33.75M & 16.78M & 70.68M & 8.40M & 176.36M \\
768 & 12.58M & 39.98M & 6.29M & 35.85M & 25.17M & 79.07M & 12.59M & 180.55M \\
1024 & 16.78M & 44.17M & 8.39M & 37.95M & 33.56M & 87.46M & 16.79M & 184.75M \\
\hline
\end{tabularx}
\caption{Comparison of parameter counts and MACs for the SubTer\_LeNet and SubTer\_Efficient encoders and full autoencoders across different latent space sizes.}
\begin{tabular}{c|cc|cc}
\toprule
\multirow{2}{*}{Latent dim} & \multicolumn{2}{c|}{SubTer\_LeNet} & \multicolumn{2}{c}{SubTer\_Efficient} \\
& Params & MACs & Params & MACs \\
\midrule
32 & 8.40M & 17.41G & 1.17M & 2.54G \\
64 & 16.38M & 17.41G & 1.22M & 2.54G \\
128 & 32.35M & 17.41G & 1.33M & 2.54G \\
256 & 64.30M & 17.41G & 1.55M & 2.54G \\
512 & 128.19M & 17.41G & 1.99M & 2.54G \\
768 & 192.07M & 17.41G & 2.43M & 2.54G \\
1024 & 255.96M & 17.41G & 2.87M & 2.54G \\
\bottomrule
\end{tabular}
\label{tab:lenet_vs_efficient}
\label{tab:params_lenet_vs_efficient}
\end{table}
\todo[inline]{rework table and calculate with actual scripts and network archs in deepsad codebase}
%\todo[inline]{rework table and calculate with actual scripts and network archs in deepsad codebase}
As can be seen, the Efficient encoder requires roughly half as many parameters as the LeNet-inspired one at a comparable number of operations, while maintaining comparable representational capacity. The key reasons are the use of depthwise separable convolutions, aggressive pooling along the densely sampled horizontal axis, and a channel-squeezing strategy before the fully connected layer. Interestingly, the Efficient network also processes more intermediate channels (up to 32, compared to only 8 in the LeNet variant), which increases its ability to capture a richer set of patterns despite the reduced parameter count. This combination of efficiency and representational power makes the Efficient encoder a more suitable backbone for our anomaly detection task.
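To make the parameter savings of depthwise separable convolutions concrete, the following is a generic PyTorch sketch of the pattern; it illustrates the idea rather than the exact layer configuration of the Efficient encoder:
\begin{verbatim}
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    # For c_in = c_out = 32 and a 3x3 kernel this needs
    # 32*3*3 + 32*32 = 1312 weights, versus 32*32*3*3 = 9216
    # for a standard convolution (biases ignored).
    def __init__(self, c_in, c_out, kernel_size=3, padding=1):
        super().__init__()
        # Depthwise: one spatial filter per input channel (groups=c_in).
        self.depthwise = nn.Conv2d(c_in, c_in, kernel_size,
                                   padding=padding, groups=c_in)
        # Pointwise: 1x1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
\end{verbatim}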
@@ -1211,7 +1218,7 @@ During training, the algorithm balances two competing objectives: capturing as m
We adapted the baseline implementations to our data loader and input format \todo[inline]{briefly describe file layout / preprocessing}, and added support for multiple evaluation targets per frame (two labels per data point), reporting results for both labels in each experiment. For OCSVM, the dimensionality reduction step is \emph{always} performed with the DeepSAD encoder and autoencoder pretraining weights that match the evaluated setting (i.e., the same latent size and backbone). Both baselines, like DeepSAD, output continuous anomaly scores, which allows us to evaluate them directly without committing to a fixed threshold.
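As an illustration of how the OCSVM baseline operates on the encoder latents, the following sketch uses scikit-learn; the latent arrays and all hyperparameters shown here are illustrative assumptions, not the settings used in our experiments:
\begin{verbatim}
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder latent codes of shape (n_samples, latent_dim), as produced
# by the matching pretrained DeepSAD encoder.
Z_train = np.random.randn(1000, 128)
Z_test = np.random.randn(200, 128)

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
ocsvm.fit(Z_train)

# Continuous anomaly scores (higher = more anomalous); no fixed
# threshold is committed to at this stage.
scores = -ocsvm.decision_function(Z_test)
\end{verbatim}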
\section{Experiment Overview \& Computational Environment}
\newsection{setup_experiments_environment}{Experiment Overview \& Computational Environment}
\threadtodo
{\textit{"What should the reader know after reading this section?"}}
@@ -1239,27 +1246,74 @@ Furthermore, we investigated the effect of semi-supervised labeling. DeepSAD can
All models were pre-trained for 50~epochs and then trained for 150~epochs, using the same learning rate of $1\cdot 10^{-5}$ in both phases, and evaluated with 5-fold cross-validation.
Table~\ref{tab:exp_grid} summarizes the full experiment matrix.
\begin{table}[h]
\centering
\caption{Experiment grid of all DeepSAD trainings. Each latent space size was tested for both network architectures and three levels of semi-supervised labeling.}
\begin{tabular}{c|c|c}
\caption{Parameter space for the DeepSAD grid search. Each latent size is tested for both architectures and all labeling regimes.}
\renewcommand{\arraystretch}{1.15}
\begin{tabularx}{\linewidth}{lYYY}
\toprule
\textbf{Latent sizes} & \textbf{Architectures} & \textbf{Labeling regimes (normal, anomalous)} \\
& \textbf{Latent sizes} & \textbf{Architectures} & \textbf{Labeling regimes (normal, anomalous)} \\
\midrule
$32, 64, 128, 256, 512, 768, 1024$ & LeNet-inspired, Efficient & (0,0), (50,10), (500,100) \\
\textbf{Levels} &
\makecell[c]{32 \\64\\128\\256\\512\\768\\1024} &
\makecell[c]{LeNet-inspired \\[-2pt]\rule{0.65\linewidth}{0.4pt}\\[-2pt]Efficient} &
\makecell[c]{(0,0) \\(50,10)\\(500,100)} \\
\addlinespace[2pt]
\textbf{Count} & 7 & 2 & 3 \\
\midrule
\textbf{Total} & \multicolumn{3}{c}{\(7 \times 2 \times 3 = \mathbf{42}\) combinations} \\
\bottomrule
\end{tabular}
\end{tabularx}
\label{tab:exp_grid}
\end{table}
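The grid in Table~\ref{tab:exp_grid} can be enumerated directly; the following sketch illustrates the count, with identifier names chosen for illustration rather than taken from the experiment scripts:
\begin{verbatim}
from itertools import product

latent_sizes = [32, 64, 128, 256, 512, 768, 1024]
architectures = ["lenet", "efficient"]      # LeNet-inspired, Efficient
labeling = [(0, 0), (50, 10), (500, 100)]   # (normal, anomalous) labels

grid = list(product(latent_sizes, architectures, labeling))
assert len(grid) == 7 * 2 * 3 == 42

for latent, arch, (n_normal, n_anomalous) in grid:
    # Each configuration: 50 pretraining epochs, 150 training epochs,
    # learning rate 1e-5, evaluated with 5-fold cross-validation.
    ...
\end{verbatim}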
\threadtodo
{give overview about hardware setup and how long things take to train}
{we know what we trained but not how long that takes}
{table of hardware and of how long different trainings took}
{experiment setup understood $\rightarrow$ what were the experiments' results}
Having outlined the full grid of experiments in Table~\ref{tab:exp_grid}, we next describe the computational environment in which they were conducted. The hardware and software stack used throughout all experiments is summarized in Table~\ref{tab:system_setup}.
All of these experiments were run in the same computational environment; its hardware and software stack is summarized in Table~\ref{tab:system_setup}.
\begin{table}[p]
\centering
@@ -1302,7 +1356,7 @@ Having outlined the full grid of experiments in Table~\ref{tab:exp_grid}, we nex
\end{tabularx}
\end{table}
Pretraining runtimes for the autoencoders are reported in Table~\ref{tab:ae_pretrain_runtimes}. These values are averaged across folds and labeling regimes, since the pretraining step itself does not make use of labels. \todo[inline]{why is efficient taking longer with less params and MACs?}
Pretraining runtimes for the autoencoders are reported in Table~\ref{tab:ae_pretrain_runtimes}. These values are averaged across folds and labeling regimes, since the pretraining step itself does not make use of labels. %\todo[inline]{why is efficient taking longer with less params and MACs?}
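The averaging itself is straightforward; as a sketch, assuming runtimes were logged to a hypothetical CSV with one row per (architecture, latent size, fold, labeling regime), where the file and column names are placeholders:
\begin{verbatim}
import pandas as pd

runs = pd.read_csv("ae_pretrain_runtimes.csv")  # placeholder file name

# Pretraining ignores labels, so folds and labeling regimes are treated
# as repeated measurements and averaged out per configuration.
table = (runs.groupby(["architecture", "latent_dim"])["runtime_s"]
             .mean()
             .unstack("architecture"))
print(table)
\end{verbatim}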
\begin{table}
\centering