diff --git a/thesis/Main.tex b/thesis/Main.tex
index 854f8a0..e39ef6a 100755
--- a/thesis/Main.tex
+++ b/thesis/Main.tex
@@ -121,6 +121,8 @@
 %\textbf{CONFIDENTIAL}
 }}
+\DeclareMathAlphabet\mathbfcal{OMS}{cmsy}{b}{n}
+\newcommand*\wc{{\mkern 2mu\cdot\mkern 2mu}}
 \input{./base/opt_macros}
@@ -274,10 +276,47 @@ As a pre-training step, an autoencoder is trained and its encoder weights are us
 The introduction of the aforementioned term in Deep SAD's objective allows it to learn in a semi-supervised way, though it can operate in a fully unsupervised mode—effectively reverting to its predecessor, Deep SVDD~\cite{deepsvdd}—when no labeled data are available. However, it also allows for the incorporation of labeled samples during training. This additional supervision helps the model better position known normal samples near the hypersphere center and push known anomalies farther away, thereby enhancing its ability to differentiate between normal and anomalous data.
-
+%The results of the pre-training are used twofold. Firstly the encoders' weights at the end of pre-training can be used to initialize Deep SAD's weights for the main training step which aligns with the aforementioned Infomax principle, since we can assume the autoencoder maximized the shared information between the input and the latent space represenation. Secondly an initial forward-pass is run on the encoder network for all available training data samples and the results' mean position in the latent space is used to define the hypersphere center $\mathbf{c}$ which according to~\cite{deepsad} allows for faster convergence during the main training step than randomly chosen centroids. An alternative method of initializing the hypersphere center could be to use only labeled normal examples for the forward-pass, so not to pollute $\mathbf{c}$'s position with anomalous samples, which would only be possible if sufficient labeled normal samples are available. From this point on, the hypersphere center $\mathbf{c}$ stays fixed and does not change, which is necessary since it being a free optimization variable could lead to a trivial hypersphere collapse solution if the network was trained fully unsupervised.
+
+ \todo[inline]{maybe pseudocode algorithm block?}
+
+The pre-training results are used in two key ways. First, the encoder weights obtained from the autoencoder pre-training initialize Deep SAD's network for the main training phase. This aligns with the Infomax principle, as the autoencoder is assumed to maximize the mutual information between the input and its latent representation. Second, an initial forward pass through the encoder is performed on all training samples, and the mean of these latent representations is set as the hypersphere center $\mathbf{c}$. According to \cite{deepsad}, this initialization leads to faster convergence during the main training phase than a randomly selected centroid. An alternative would be to compute $\mathbf{c}$ from the labeled normal examples only, which would prevent the center from being influenced by anomalous samples but requires a sufficient number of labeled normal samples. Once defined, the hypersphere center $\mathbf{c}$ remains fixed, since treating it as a free optimization variable could, in the fully unsupervised case, lead to a hypersphere collapse: a trivial solution in which $\phi$ converges to the constant function $\phi \equiv \mathbf{c}$.
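+
+The following listing sketches this pre-training and center-initialization procedure. It assumes a PyTorch implementation and uses a small fully connected autoencoder purely as a placeholder; the architecture, the dimensions, and names such as \texttt{train\_loader} and \texttt{latent\_dim} are illustrative assumptions and not prescribed by~\cite{deepsad}.
+
+% Note: assumes \usepackage{listings} (with its Python language definition) in the preamble.
+\begin{lstlisting}[language=Python, caption={Illustrative sketch of autoencoder pre-training and hypersphere center initialization.}]
+import torch
+import torch.nn as nn
+
+latent_dim = 32  # dimensionality d of the latent space Z (placeholder)
+
+# Placeholder autoencoder; the encoder later serves as phi(.; W).
+encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, latent_dim))
+decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 784))
+
+def pretrain(train_loader, epochs=10, lr=1e-3):
+    """Unsupervised pre-training: reconstruct the input, no labels required."""
+    params = list(encoder.parameters()) + list(decoder.parameters())
+    opt = torch.optim.Adam(params, lr=lr)
+    mse = nn.MSELoss()
+    for _ in range(epochs):
+        for x, _ in train_loader:   # assumes (input, label) batches; labels unused
+            opt.zero_grad()
+            loss = mse(decoder(encoder(x)), x)
+            loss.backward()
+            opt.step()
+
+def init_center(train_loader):
+    """Fix the hypersphere center c as the mean latent representation."""
+    with torch.no_grad():
+        z = torch.cat([encoder(x) for x, _ in train_loader])
+    return z.mean(dim=0)            # c stays fixed from this point on
+\end{lstlisting}
+
+In practice, \texttt{pretrain} would be run once, followed by \texttt{init\_center} on the same data, after which the main training step starts from the obtained encoder weights and $\mathbf{c}$.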
+
+In the main training step, the encoder part of the autoencoder architecture is trained via backpropagation with stochastic gradient descent (SGD), minimizing Deep SAD's optimization objective given in \ref{eq:deepsad_optimization_objective}. The training uses unlabeled data as well as labeled data with binary labels, i.e., normal or anomalous.
+
+Once the network has been trained, new data can be fed through a forward pass, and the geometric distance of the resulting latent representation to the hypersphere center $\mathbf{c}$ can be interpreted as an anomaly score: the higher the score, the more likely the input is an anomalous sample. For practical use the score typically has to be processed further; for example, a simple threshold can be applied to classify new samples, or more elaborate analyses can build on the score.
+
 \newsection{algorithm_details}{Algorithm Details and Hyperparameters}
- \todo[inline]{backpropagation optimization formula, hyperaparameters explanation}
+ %\todo[inline]{backpropagation optimization formula, hyperaparameters explanation}
+
+ Since Deep SAD is heavily based on its predecessor Deep SVDD, it is helpful to first understand Deep SVDD's optimization objective. For an input space $\mathcal{X} \subseteq \mathbb{R}^D$, an output space $\mathcal{Z} \subseteq \mathbb{R}^d$, and a neural network $\phi(\wc; \mathcal{W}) : \mathcal{X} \to \mathcal{Z}$, where $\mathcal{W} = \{\mathbf{W}^1, \dots, \mathbf{W}^L\}$ denotes the network's weights across its $L$ layers, let $n$ be the number of unlabeled training samples $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$ and $\mathbf{c}$ the center of the hypersphere in the latent space. Deep SVDD trains the network to cluster normal data closely together in the latent space through the optimization objective in~\ref{eq:deepsvdd_optimization_objective}.
+
+ \begin{equation}
+ \label{eq:deepsvdd_optimization_objective}
+ \min_{\mathcal{W}} \quad
+ \frac{1}{n} \sum_{i=1}^{n}\|\phi(\mathbf{x}_{i};\mathcal{W})-\mathbf{c}\|^{2}
+ +\frac{\lambda}{2}\sum_{\ell=1}^{L}\|\mathbf{W}^{\ell}\|_{F}^{2}.
+ \end{equation}
+
+ As can be seen from \ref{eq:deepsvdd_optimization_objective}, Deep SVDD is an unsupervised method that does not rely on labeled data to differentiate between normal and anomalous data. The first term describes the shrinking of the data-enclosing hypersphere around the given center $\mathbf{c}$: for each data sample $\mathbf{x}_i$, the geometric distance between its latent representation $\phi(\mathbf{x}_i; \mathcal{W})$ and $\mathbf{c}$ is minimized, averaged over the $n$ samples. The second term is standard L2 regularization with hyperparameter $\lambda > 0$, where $\|\wc\|_F$ denotes the Frobenius norm, and prevents overfitting.
+
+ From this, Deep SAD's optimization objective in \ref{eq:deepsad_optimization_objective} follows naturally. It additionally assumes $m$ labeled samples $\{(\tilde{\mathbf{x}}_1, \tilde{y}_1), \dots, (\tilde{\mathbf{x}}_m, \tilde{y}_m)\} \in \mathcal{X} \times \mathcal{Y}$ with $\mathcal{Y} = \{-1,+1\}$, where $\tilde{y} = +1$ denotes normal and $\tilde{y} = -1$ anomalous samples, and introduces a new hyperparameter $\eta > 0$ that balances how strongly labeled and unlabeled samples contribute to the training.
+
+ \begin{equation}
+ \label{eq:deepsad_optimization_objective}
+ \min_{\mathcal{W}} \quad
+ \frac{1}{n+m} \sum_{i=1}^{n}\|\phi(\mathbf{x}_{i};\mathcal{W})-\mathbf{c}\|^{2}
+ +\frac{\eta}{n+m}\sum_{j=1}^{m}\left(\|\phi(\tilde{\mathbf{x}}_{j};\mathcal{W})-\mathbf{c}\|^{2}\right)^{\tilde{y}_{j}}
+ +\frac{\lambda}{2}\sum_{\ell=1}^{L}\|\mathbf{W}^{\ell}\|_{F}^{2}.
+ \end{equation}
+
+ The first term of \ref{eq:deepsad_optimization_objective} remains largely unchanged; it merely accounts for the $m$ labeled samples in its normalization factor. The second term is newly introduced to incorporate the labeled samples, weighted by the hyperparameter $\eta$: depending on a sample's label $\tilde{y}_j$, the distance between its latent representation and $\mathbf{c}$ is either minimized ($\tilde{y}_j = +1$) or maximized ($\tilde{y}_j = -1$), since the exponent $\tilde{y}_j = -1$ turns the squared distance into its reciprocal. The third term is the same standard L2 regularization as in Deep SVDD. Note that for $m = 0$ labeled samples, Deep SAD reduces to Deep SVDD's objective and can therefore also be used in a completely unsupervised fashion.
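+
+The objective translates almost directly into a per-batch loss. The sketch below again assumes PyTorch; \texttt{encoder} and \texttt{c} refer to the pre-trained network and the fixed center from the previous listing, and the label convention $y \in \{0, +1, -1\}$ for unlabeled, normal, and anomalous samples is an implementation choice rather than part of the original notation.
+
+\begin{lstlisting}[language=Python, caption={Illustrative sketch of the Deep SAD loss for one mini-batch and of anomaly scoring.}]
+import torch
+
+def deepsad_loss(z, y, c, eta=1.0, eps=1e-6):
+    """z: latent batch, y: 0 = unlabeled, +1 = normal, -1 = anomalous."""
+    dist2 = torch.sum((z - c) ** 2, dim=1)    # ||phi(x; W) - c||^2
+    # Unlabeled samples contribute their squared distance; labeled samples
+    # contribute eta * (squared distance)^y, so y = -1 pushes known anomalies
+    # away from the center (eps guards against division by zero).
+    losses = torch.where(y == 0, dist2, eta * (dist2 + eps) ** y.float())
+    return losses.mean()                      # the mean plays the role of 1/(n+m)
+
+# The lambda-weighted L2 term is usually realized through the optimizer, e.g.:
+# opt = torch.optim.Adam(encoder.parameters(), lr=1e-4, weight_decay=1e-6)
+
+def anomaly_score(x, c):
+    """Squared distance of a new sample's latent representation to c."""
+    with torch.no_grad():
+        return torch.sum((encoder(x) - c) ** 2, dim=1)
+
+# Simple thresholding, e.g. with a threshold chosen on validation data:
+# is_anomalous = anomaly_score(x_new, c) > threshold
+\end{lstlisting}
+
+With $\eta > 1$ the labeled samples are emphasized, with $\eta < 1$ the unlabeled ones; if no labeled samples are present, the labeled branch of \texttt{torch.where} never applies, mirroring the fall-back to Deep SVDD discussed above.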
+
+\newsubsubsectionNoTOC{Hyperparameters} The neural network architecture of Deep SAD is not fixed but depends on the type of data the algorithm is meant to operate on. This is a consequence of the way it employs an autoencoder for pre-training and reuses the encoder part of that network for the main training step. An autoencoder architecture suitable for the specific application therefore has to be adopted, which at the same time offers flexibility in choosing an architecture that fits the application's requirements. For this reason, the specific architecture employed may itself be considered a hyperparameter of the Deep SAD algorithm. During the pre-training step, as is typical for autoencoders, no labels are necessary, since the optimization objective of an autoencoder is simply to reproduce its input, as the architecture's name indicates. Besides the architecture, the weighting terms $\eta$ and $\lambda$ from \ref{eq:deepsad_optimization_objective} are the principal numerical hyperparameters, controlling the influence of the labeled samples and the strength of the weight regularization, respectively.
@@ -289,13 +328,6 @@ The introduction of the aforementioned term in Deep SAD's objective allows it to
 \todo[inline, color=green!40]{in formula X we see the optimization target of the algorithm. explain in one paragraph the variables in the optimization formula}
 \todo[inline, color=green!40]{explain the three terms (unlabeled, labeled, regularization)}
- \begin{equation}
- \min_{\mathcal{W}} \quad
- \frac{1}{n+m} \sum_{i=1}^{n}\|\phi(\mathbf{x}_{i};\mathcal{W})-\mathbf{c}\|^{2}
- +\frac{\eta}{n+m}\sum_{j=1}^{m}\left(\|\phi(\tilde{\mathbf{x}}_{j};\mathcal{W})-\mathbf{c}\|^{2}\right)^{\tilde{y}_{j}}
- +\frac{\lambda}{2}\sum_{\ell=1}^{L}\|\mathbf{W}^{\ell}\|_{F}^{2}.
- \end{equation}
-
 \newsection{advantages_limitations}{Advantages and Limitations}
 \todo[inline]{semi supervised, learns normality by amount of data (no labeling/ground truth required), very few labels for better training to specific situation}