some rewording, autoencoder start

This commit is contained in:
Jan Kowalczyk
2025-04-30 16:23:31 +02:00
parent e6cdcf23f5
commit b2386910ec


@@ -351,19 +351,17 @@ As already shortly mentioned at the beginning of this section, anomaly detection
{explain what ML is, how the different approaches work, why to use semi-supervised}
{autoencoder special case (un-/self-supervised) used in DeepSAD $\rightarrow$ explain autoencoder}
Machine learning refers to algorithms capable of learning from existing data to perform tasks on previously unseen data without being explicitly programmed to do so~\cite{machine_learning_first_definition}. They are often categorized by the underlying technique employed, by the type of task they are trained to achieve, or by the feedback provided to the algorithm during training. For the latter, the most prominent categories are supervised learning, unsupervised learning and reinforcement learning.
For supervised learning, each data sample is augmented with a label depicting the ideal output the algorithm should produce for the given input. During the learning step these algorithms compare their generated output with the one provided by an expert, calculate the error between them and minimize this error to improve performance. Such labels are typically either a categorical or a continuous target, which are most commonly used for classification and regression tasks respectively.
Unsupervised learning algorithms use raw data without a target label that could guide the learning process. These types of algorithms are often utilized to identify underlying patterns in data which may be hard to discover using classical data analysis, for example due to large data size or high data complexity. Cluster analysis is one common use case, in which data is grouped into clusters such that data from one cluster resembles other data from the same cluster more closely than data from other clusters, according to some predesignated criteria. Another important use case is dimensionality reduction, which transforms high-dimensional data into a lower-dimensional subspace while retaining meaningful information of the original data.
A more interactive approach to learning is taken by reinforcement learning, which provides the algorithm with an environment and an interpreter of the environment's state. During training the algorithm explores new possible actions and their impact on the provided environment. The interpreter then rewards or punishes the algorithm based on the outcome of its actions. To improve its capability, the algorithm tries to maximize the rewards received from the interpreter while retaining some randomness to enable the exploration of different actions and their outcomes. Reinforcement learning is usually used in cases where an algorithm has to make sequences of decisions in complex environments, e.g., autonomous driving tasks.
Semi-supervised learning algorithms are an in-between category of supervised and unsupervised algorithms, in that they use a mixture of labeled and unlabeled data. Typically, vastly more unlabeled than labeled data is used during training of such algorithms, oftentimes due to the effort and expertise required to correctly label large quantities of data for supervised training methods. The target tasks of semi-supervised methods can come from both the domains of supervised and unsupervised algorithms. For classification tasks, which are typically achieved using supervised learning, the additional unlabeled data is added during training with the hope of achieving a better outcome than when training only with the labeled portion of the data. In contrast, for typical unsupervised learning tasks such as clustering, the addition of labeled samples can help guide the learning algorithm to improve performance over fully unsupervised training.
Anomaly detection tasks can be formulated as both supervised or unsupervised problems, depending on the underlying technique utilized. While supervised anomaly detection methods exist, their suitability depends on the availability of labeled training data and on a reasonable proportion between normal and anomalous data. Both requirements can be challenging to meet due to the often labour-intensive task of labeling data and the anomalies' intrinsic property of occurring rarely compared to normal data. DeepSAD is a semi-supervised method which extends its unsupervised predecessor Deep SVDD to include some labeled samples during training, intended to improve the algorithm's performance. Both DeepSAD and Deep SVDD include the training of an autoencoder as a pre-training step, whose architecture is of special interest when discussing the categorization into supervised and unsupervised training, due to its unique properties.
%\todo[inline, color=green!40]{our chosen method Deep SAD is a semi-supervised deep learning method whose workings will be discussed in more detail in secion X}
\todo[inline, color=green!40]{DeepSAD is semi-supervised, autoencoder in pre-training is interesting case since its un-/self-supervised. explained in more detail next}
\newsection{autoencoder}{Autoencoder}
@@ -373,6 +371,8 @@ Semi-Supervised learning algorithms are -as the name implies- an inbetween categ
{explain basic idea, unfixed architecture, infomax, mention usecases, dimension reduction}
{dimensionality reduction, useful for high dim data $\rightarrow$ pointclouds from lidar}
Autoencoders are a type of neural network architecture whose main goal is to learn to encode input data into a representative state from which the same input can be reconstructed, hence the name. They typically consist of two parts, an encoder and a decoder. The encoder learns to extract the most significant features from the input and convert them into another representation. The reconstruction goal ensures that the most prominent features of the input are retained during the encoding phase, since reconstruction would be impossible without the relevant information. The decoder learns to reconstruct the original input from its encoded representation by minimizing the error between the original input data and the autoencoder's output. This optimization goal creates uncertainty when categorizing autoencoders as an unsupervised or supervised method, since they do not require labels in addition to the input data, while still having an optimal target available in the input data itself. Both of these properties fit certain definitions of supervised and unsupervised learning, though autoencoders are mostly labeled as unsupervised in the literature and are sometimes proposed to be a special case of self-supervised learning.
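To make the encoder--decoder structure and the reconstruction objective more concrete, the following minimal sketch shows a small fully-connected autoencoder in PyTorch. The layer sizes, the mean-squared reconstruction error and all names are illustrative assumptions and not the architecture used later in this work.
\begin{verbatim}
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        # encoder: compress the input into a low-dimensional latent code
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        # decoder: reconstruct the input from the latent code
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, in_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.rand(64, 784)              # one batch of (unlabeled) training data
optimizer.zero_grad()
reconstruction = model(x)
loss = criterion(reconstruction, x)  # the input itself serves as the target
loss.backward()
optimizer.step()
\end{verbatim}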
\todo[inline]{autoencoder explanation}
\todo[inline, color=green!40]{autoencoders are a neural network architecture archetype (words) whose training target is to reproduce the input data itself - hence the name. the architecture is most commonly a mirrored one consisting of an encoder which transforms input data into a hyperspace represantation in a latent space and a decoder which transforms the latent space into the same data format as the input data (phrasing), this method typically results in the encoder learning to extract the most robust and critical information of the data and the (todo maybe something about the decoder + citation for both). it is used in many domains translations, LLMs, something with images (search example + citations)}
\todo[inline, color=green!40]{typical encoder decoder mirrored figure}
@@ -428,9 +428,9 @@ In this chapter, we explore the method \emph{Deep Semi-Supervised Anomaly Detect
{how clustering AD generally works, how it does in DeepSAD}
{since the reader knows the general idea $\rightarrow$ what is the step-by-step?}
Deep SAD's overall mechanics are similar to clustering-based anomaly detection methods, which according to~\cite{anomaly_detection_survey} typically follow a two-step approach. First, a clustering algorithm groups data points around a centroid; then, the distances of individual data points from this centroid are calculated and used as an anomaly score. In Deep SAD, these concepts are implemented by employing a neural network, which is jointly trained to map input data onto a latent space and to minimize the volume of a data-encompassing hypersphere, whose center is the aforementioned centroid. A data point's geometric distance in the latent space to the hypersphere center is used as the anomaly score, where a larger distance between data and centroid corresponds to a higher probability of a sample being anomalous. This is achieved by shrinking the data-encompassing hypersphere during training proportionally to all training data, which requires that significantly more normal than anomalous data is present. The outcome of this approach is that normal data gets clustered more closely around the centroid, while anomalies appear further away from it, as can be seen in the toy example depicted in figure~\ref{fig:deep_svdd_transformation}.
%Deep SAD is an anomaly detection algorithm that belongs to the category of clustering-based methods, which according to~\cite{anomaly_detection_survey} typically follow a two-step approach. First, a clustering algorithm groups data points around a centroid; then, the distances of individual data points from this centroid are calculated and used as an anomaly score. In addition to that, DeepSAD also utilizes a spectral component by mapping the input data onto a lower-dimensional space, which enables it to detect anomalies in high-dimensional complex data types. In Deep SAD, these concepts are implemented by employing a neural network, which is jointly trained to map data into a latent space and to minimize the volume of a data-encompassing hypersphere whose center is the aforementioned centroid. The geometric distance in the latent space to the hypersphere center is used as the anomaly score, where a larger distance between data and centroid corresponds to a higher probability of a sample being anomalous. This is achieved by shrinking the data-encompassing hypersphere during training, proportionally to all training data, which requires that significantly more normal than anomalous data is present. The outcome of this approach is that normal data gets clustered more closely around the centroid, while anomalies appear further away from it as can be seen in the toy example depicted in figure~\ref{fig:deep_svdd_transformation}.
\fig{deep_svdd_transformation}{figures/deep_svdd_transformation}{DeepSAD teaches a neural network to transform data into a latent space and minimize the volume of a data-encompassing hypersphere centered around a predetermined centroid $\textbf{c}$. \\Reproduced from~\cite{deepsvdd}.}
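The anomaly score itself is simply this latent-space distance. A minimal sketch of how such a score could be computed is shown below; it assumes a PyTorch-style implementation with a trained encoder network and a fixed hypersphere center, both of which are hypothetical placeholders here.
\begin{verbatim}
import torch

def anomaly_score(encoder, x, c):
    # squared Euclidean distance of each sample's latent
    # representation to the fixed hypersphere center c
    with torch.no_grad():
        z = encoder(x)
    return torch.sum((z - c) ** 2, dim=1)
\end{verbatim}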
@@ -440,7 +440,7 @@ Deep SAD is an anomaly detection algorithm that belongs to the category of clust
{pre-training is autoencoder, self-supervised, dimensionality reduction}
{pre-training done $\rightarrow$ how are the pre-training results used?}
Before DeepSAD's training can begin, a pre-training step is required, during which an autoencoder is trained on all available input data. One of DeepSAD's goals is to map input data onto a lower-dimensional latent space in which the separation between normal and anomalous data can be achieved. To this end DeepSAD and its predecessor Deep SVDD make use of the autoencoder's reconstruction goal, whose successful training provides confidence in the encoder architecture's suitability for extracting the input data's most prominent information into the latent space between the encoder and decoder. DeepSAD goes on to use just the encoder as its main network architecture, discarding the decoder at this step, since reconstruction of the input is no longer necessary.
%The results of the pre-training are used twofold. Firstly the encoder's weights at the end of pre-training can be used to initialize Deep SAD's weights for the main training step, which aligns with the aforementioned Infomax principle, since we can assume the autoencoder maximized the shared information between the input and the latent space representation. Secondly an initial forward-pass is run on the encoder network for all available training data samples and the results' mean position in the latent space is used to define the hypersphere center $\mathbf{c}$, which according to~\cite{deepsad} allows for faster convergence during the main training step than randomly chosen centroids. An alternative method of initializing the hypersphere center could be to use only labeled normal examples for the forward-pass, so as not to pollute $\mathbf{c}$'s position with anomalous samples, which would only be possible if sufficient labeled normal samples are available. From this point on, the hypersphere center $\mathbf{c}$ stays fixed and does not change, which is necessary since it being a free optimization variable could lead to a trivial hypersphere collapse solution if the network was trained fully unsupervised.
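A hedged sketch of how the pre-training results could be reused is given below, assuming a PyTorch-style implementation: the decoder is discarded and, following~\cite{deepsad}, the mean latent representation of an initial forward pass serves as the hypersphere center $\mathbf{c}$. The names autoencoder and train_loader are hypothetical placeholders for the pre-trained network and the training data.
\begin{verbatim}
import torch

# `autoencoder` and `train_loader` are hypothetical placeholders:
# a pre-trained autoencoder and an iterator over the training data.
encoder = autoencoder.encoder   # the decoder is discarded at this point

# initial forward pass: the mean latent representation of the
# training data defines the hypersphere center c
with torch.no_grad():
    latents = torch.cat([encoder(x) for x, _ in train_loader], dim=0)
    c = latents.mean(dim=0)
# c stays fixed from here on; the encoder weights serve as the
# initialization of Deep SAD's network for the main training step
\end{verbatim}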
@@ -488,10 +488,10 @@ As can be seen from \ref{eq:deepsvdd_optimization_objective}, Deep SVDD is an un
\citeauthor{deepsad} argue in \cite{deepsad} that the pre-training step employing an autoencoder—originally introduced in \cite{deepsvdd}—not only allows a geometric interpretation of the method as minimum volume estimation, i.e., the shrinking of the data-encompassing hypersphere, but also a probabilistic one as entropy minimization over the latent distribution. The autoencoding objective during pre-training implicitly maximizes the mutual information between the data and its latent representation, aligning the approach with the Infomax principle while encouraging a latent space with minimal entropy. This insight enabled \citeauthor{deepsad} to introduce an additional term in DeepSAD's objective, beyond that of its predecessor Deep SVDD~\cite{deepsvdd}, which incorporates labeled data to better capture the characteristics of normal and anomalous data. They demonstrate that DeepSAD's objective effectively models the latent distribution of normal data as having low entropy, while that of anomalous data is characterized by higher entropy. In this framework, anomalies are interpreted as being generated from an infinite mixture of distributions that differ from the normal data distribution.
The introduction of the aforementioned term in Deep SAD's objective allows it to learn in a semi-supervised way, though it can operate in a fully unsupervised mode—effectively reverting to its predecessor, Deep SVDD~\cite{deepsvdd}—when no labeled data are available. When labeled samples are present, this additional supervision helps the model better position known normal samples near the hypersphere center and push known anomalies farther away, thereby enhancing its ability to differentiate between normal and anomalous data.
From \ref{eq:deepsvdd_optimization_objective} it is easy to understand Deep SAD's optimization objective seen in \ref{eq:deepsad_optimization_objective}, which additionally defines $m$ labeled data samples $\{(\mathbf{\tilde{x}}_1, \tilde{y}_1), \dots, (\mathbf{\tilde{x}}_m, \tilde{y}_m)\} \in \mathcal{X} \times \mathcal{Y}$ with $\mathcal{Y} = \{-1,+1\}$, for which $\tilde{y} = +1$ denotes normal and $\tilde{y} = -1$ anomalous samples, as well as a new hyperparameter $\eta > 0$ which can be used to balance the strength with which labeled and unlabeled samples contribute to the training.
\begin{equation}
\label{eq:deepsad_optimization_objective}
@@ -501,7 +501,7 @@ From this it is easy to understand Deep SAD's optimization objective seen in \re
+\frac{\lambda}{2}\sum_{\ell=1}^{L}\|\mathbf{W}^{\ell}\|_{F}^{2}.
\end{equation}
The first term of \ref{eq:deepsad_optimization_objective} stays mostly the same, differing only in its consideration of the introduced $m$ labeled data samples in its proportionality factor. The second term is newly introduced to incorporate the labeled data samples with the strength of hyperparameter $\eta$, by either minimizing or maximizing the distance between a sample's latent representation and $\mathbf{c}$, depending on the sample's label $\tilde{y}$. The third term is kept identical to Deep SVDD's as standard L2 regularization. It can also be observed that in the case of $m = 0$ labeled samples, Deep SAD falls back to the same optimization objective as Deep SVDD and can therefore be used in a completely unsupervised fashion as well.
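To illustrate how the two data-dependent terms of \ref{eq:deepsad_optimization_objective} could be computed in practice, the following minimal sketch assumes a PyTorch-style encoder network and delegates the L2 regularization term to the optimizer's weight decay; the function and variable names are hypothetical and not taken from the reference implementation.
\begin{verbatim}
import torch

def deep_sad_loss(encoder, x_unlabeled, x_labeled, y_labeled, c, eta):
    # unlabeled term: squared distances of all unlabeled samples to
    # the fixed hypersphere center c
    n = x_unlabeled.shape[0]
    m = x_labeled.shape[0]
    dist_u = torch.sum((encoder(x_unlabeled) - c) ** 2, dim=1)
    loss = dist_u.sum() / (n + m)

    if m > 0:
        # labeled term: the squared distance is raised to the power of
        # the label, so y = +1 (normal) pulls samples towards c while
        # y = -1 (anomalous) pushes them away; the small constant
        # guards against division by zero
        dist_l = torch.sum((encoder(x_labeled) - c) ** 2, dim=1) + 1e-6
        loss = loss + eta * (dist_l ** y_labeled.float()).sum() / (n + m)
    return loss
\end{verbatim}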
\newsubsubsectionNoTOC{Hyperparameters}