background chapter work
@@ -263,9 +263,12 @@ In this thesis we used an anomaly detection method, namely Deep Semi-Supervised
Chapter~\ref{chp:deepsad} describes DeepSAD in more detail and shows that it is a clustering-based approach with a spectral pre-processing component: it first uses a neural network to reduce the input's dimensionality while simultaneously clustering normal data closely around a given centroid, and it then assigns an anomaly score by calculating the geometric distance between a single data sample and the aforementioned cluster centroid in the lower-dimensional subspace. Since our data is high-dimensional, it makes sense to use a spectral method to reduce the data's dimensionality, and an approach that yields an analog value rather than a binary classification suits our use-case, since we want to quantify the data degradation, not merely classify it.
%\todo[inline, color=green!40]{data availability leading into semi-supervised learning algorithms}
As briefly mentioned at the beginning of this section, anomaly detection methods are often challenged by the limited availability of anomalous data, owing to the very nature of anomalies as rare occurrences. Frequently, the intended use-case is even to find unknown anomalies in a given dataset which have not yet been identified. In addition, it can be challenging to classify anomalies correctly for complex data, since the very definition of an anomaly depends on many factors such as the type of data, the intended use-case, or even how the data evolves over time. For these reasons, most anomaly detection approaches limit their reliance on anomalous data during training, and many of them do not differentiate between normal and anomalous data at all. DeepSAD is a semi-supervised method, characterized by using a mixture of labeled and unlabeled data.
% strategies of anomaly detection algorithms according to x include classification, nearest neighbor, clustering, spectral, information theoretic, statistical
@@ -288,7 +291,24 @@ As shortly already mentioned at the beginning of this section, anomaly detection
\fi
\newsection{semi_supervised}{Semi-Supervised Learning Algorithms}
\todo[inline]{Quick overview of the Deep SAD method}
Machine learning refers to types of algorithms capable of learning from existing data in order to perform a task on previously unseen data without being explicitly programmed to do so~\cite{machine_learning_first_definition}. They are often categorized by the underlying technique employed, by the type of task they are trained to achieve, or by the feedback provided to the algorithm during training. The last categorization typically distinguishes supervised learning, unsupervised learning and reinforcement learning.
In supervised learning, each data sample is augmented with a label representing the ideal output of the algorithm for the given sample. During the learning step, these algorithms can compare their generated output with the one provided by an expert in order to improve their performance. Such labels are typically either categorical targets or continuous targets, which are most commonly used for classification and regression tasks, respectively.
Unsupervised learning algorithms use raw data without any target label during the learning process. These types of algorithms are often used to identify underlying patterns in data which may be hard to discover using classical data analysis due to large data size or high data complexity. Common use-cases include clustering tasks, which partition the data into two or more clusters that can be differentiated from each other according to some predesignated criteria, and dimensionality reduction tasks, which transform high-dimensional data into a lower-dimensional subspace while retaining meaningful information of the original data.
The third category, reinforcement learning, takes a more interactive approach to learning: the algorithm is provided with an environment and an interpreter of the environment's state, which it can use during learning to explore new actions and their impact on the environment's state. The interpreter then provides a reward or punishment to the algorithm based on the outcome of its actions. To improve its capability, the algorithm tries to maximize the rewards received from the interpreter while still introducing some randomness to explore different strategies. Reinforcement learning is usually applied where an algorithm has to make sequences of decisions in complex environments, such as autonomous driving tasks.
Semi-supervised learning algorithms are, as the name implies, an in-between category of supervised and unsupervised algorithms in that they use a mixture of labeled and unlabeled data. Typically, vastly more unlabeled than labeled data is used during training of such algorithms, often because of the effort and expertise required to label large quantities of data correctly for supervised training methods. The target tasks of semi-supervised methods can come from both the supervised and the unsupervised domain. For classification tasks, which are typically solved with supervised learning, additional unlabeled data is included during training in the hope of achieving a better outcome than training on the labeled portion of the data alone. Conversely, for typically unsupervised tasks such as clustering, the addition of labeled samples can help guide the learning algorithm and improve performance over fully unsupervised training.
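
To make this split concrete, the following sketch shows a hypothetical semi-supervised training set in which only a small fraction of the samples carries an expert label; all names and sizes are illustrative and not taken from any dataset used in this thesis.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1000 samples with 32 features each.
x = rng.normal(size=(1000, 32))

# Only 50 samples carry an expert label (+1 = known normal,
# -1 = known anomalous); the remaining samples are unlabeled (0).
labels = np.zeros(1000, dtype=int)
labeled_idx = rng.choice(1000, size=50, replace=False)
labels[labeled_idx] = rng.choice([+1, -1], size=50)

x_unlabeled = x[labels == 0]    # used without any target
x_labeled = x[labels != 0]      # used together with their labels
y_labeled = labels[labels != 0]
\end{verbatim}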
\todo[inline]{learning based methods categorized into supervised, unsupervised and semi-supervised, based on whether for all, none or some of the data labels are provided hinting at the correct output the method should generate for these samples. historically, machine learning started with unsupervised learning and then progressed to supervised learning. unsupervised learning was oftentimes used to look for emergent patterns in data but supervised deep learning was then successful in classification and regression tasks and it became clear how important dataset size was, for this reason supervised methods were challenging to train since it required a lot of expert labeling. unsupervised methods were explored for these use-cases as well and oftentimes adapted to become semi-supervised by providing partially labeled datasets which were easier to produce but more performant when trained compared to unsupervised methods}
%Broadly speaking all learning based methods can be categorized into three categories, based on whether or not the training data contains additional information about the expected output of the finalized algorithm for each given data sample. These additional information are typically called labels, which are assigned to each individual training sample, prior to training by an expert in the field. For example in the case of image classification each training image may be labeled by a human expert with one of the target class labels. The algorithm is then supposed to learn to differentiate images from each other during the training and classify them into one of the trained target classes. Methods that require such a target label for each training sample are called supervised methods, owing to the fact that an expert guides the algorithm or helps it learn the expected outputs by providing it with samples and their corresponding correct solutions, like teaching a student by providing the correct solution to a stated problem.
%Alternatively, there are algorithms that learn only using the training data itself without any labels produced by experts for the data. The algorithms are therefore called unsupervised methods and are oftentimes used to find emergent patterns in the data itself without the need for prior knowledge of these patterns by an expert. Unsupervised methods
\todo[inline, color=green!40]{deep learning based (neural network with hidden layers), neural networks which get trained using backpropagation, to learn to solve a novel task by defining some target}
\todo[inline, color=green!40]{data labels decide training setting (supervised, non-supervised, semi-supervised incl explanation), supervised often classification based, but not possible if no labels available, un-supervised has no well-defined target, often used to find common hidden factors in data (distribution). semi-supervised more like a sub method of unsupervised which additionally uses little (often handlabelled) data to improve method performance}
\todo[inline, color=green!40]{include figure unsupervised, semi-supervised, supervised}
@@ -324,7 +344,7 @@ In this chapter, we explore the method \emph{Deep Semi-Supervised Anomaly Detect
%Deep SAD is a typical clustering based anomaly detection technique which is described in \cite{anomaly_detection_survey} to generally have a two step approach to anomaly detection. First a clustering algorithm is used to cluster data closely together around a centroid and secondly the distances from data to that centroid is calculated and interpreted as an anomaly score. This general idea can also be found in the definition of the Deep SAD algorithm, which uses the encoder part of an autoencoder architecture which is trained to cluster data around a centroid in the latent space of its output. The datas geometric distance to that centroid in the latent space is defined as an anomaly score. Deep SAD is a semi-supervised training based method which can work completely unsupervised (no labeled data available) in which case it falls back to its predecessor method Deep SVDD but additionally allows the introduction of labeleld data samples during training to more accurately map known normal samples near the centroid and known anomalous samples further away from it.
Deep SAD is an anomaly detection algorithm that belongs to the category of clustering-based methods, which according to~\cite{anomaly_detection_survey} typically follow a two-step approach. First, a clustering algorithm groups data points around a centroid; then, the distances of individual data points from this centroid are calculated and used as an anomaly score. In Deep SAD, this concept is implemented by employing a neural network, which is jointly trained to map data into a latent space and to minimize the volume of a data-encompassing hypersphere whose center is the aforementioned centroid. The geometric distance in the latent space to the hypersphere center is used as the anomaly score, where a higher score corresponds to a higher probability of a sample being anomalous, because normal samples cluster more closely around the hypersphere center than anomalies. This general working principle is depicted in figure~\ref{fig:deep_svdd_transformation}.
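
Expressed as code, this scoring rule amounts to a single distance computation in the latent space. The following minimal PyTorch sketch assumes an already trained encoder network \texttt{phi} and a fixed center \texttt{c}; both names are placeholders and not part of any published implementation.

\begin{verbatim}
import torch

def anomaly_score(phi: torch.nn.Module, c: torch.Tensor,
                  x: torch.Tensor) -> torch.Tensor:
    """Anomaly score = distance of each latent representation to c."""
    z = phi(x)                             # map inputs into the latent space
    return torch.sum((z - c) ** 2, dim=1)  # squared Euclidean distance
\end{verbatim}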
\fig{deep_svdd_transformation}{figures/deep_svdd_transformation}{DeepSAD trains a neural network to transform data into a latent space and to minimize the volume of a data-encompassing hypersphere centered around a predetermined centroid $\textbf{c}$. \\Reproduced from~\cite{deepsvdd}.}
@@ -340,7 +360,7 @@ The introduction of the aforementioned term in Deep SAD's objective allows it to
\todo[inline]{maybe pseudocode algorithm block?}
The pre-training results are used in two key ways. First, the encoder weights obtained from the autoencoder pre-training initialize DeepSAD’s network for the main training phase. This aligns with the Infomax principle, as the autoencoder is assumed to maximize the mutual information between the input and its latent representation. Second, we perform an initial forward pass through the encoder on all training samples, and the mean of these latent representations is set as the hypersphere center $\mathbf{c}$. According to \cite{deepsad}, this initialization leads to faster convergence during the main training phase compared to using a randomly selected centroid. An alternative would be to compute $\mathbf{c}$ using only the labeled normal examples, which would prevent the center from being influenced by anomalous samples; however, this requires a sufficient number of labeled normal samples. Once defined, the hypersphere center $\mathbf{c}$ remains fixed, as allowing it to be optimized freely could, in the unsupervised case, lead to a hypersphere collapse: a trivial solution where the network learns to map all inputs directly onto the centroid $\mathbf{c}$.
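
As an illustration, the center initialization described above could look as follows in PyTorch. This is a minimal sketch assuming a pre-trained encoder and a data loader that yields \texttt{(inputs, ...)} batches; the small \texttt{eps} adjustment at the end is an additional safeguard against near-zero coordinates that is not described in the text above.

\begin{verbatim}
import torch

@torch.no_grad()
def init_center(encoder: torch.nn.Module, loader,
                eps: float = 0.1) -> torch.Tensor:
    """Set c to the mean latent representation of all training samples."""
    latents = []
    for x, *_ in loader:            # labels, if present, are ignored here
        latents.append(encoder(x))
    c = torch.cat(latents).mean(dim=0)
    # Avoid center coordinates too close to zero, which the network could
    # otherwise match trivially with zero weights.
    c[(c.abs() < eps) & (c < 0)] = -eps
    c[(c.abs() < eps) & (c >= 0)] = eps
    return c
\end{verbatim}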
In the main training step, the encoder part of the autoencoder architecture is trained using SGD-based backpropagation to minimize Deep SAD's optimization objective shown in \ref{eq:deepsad_optimization_objective}. The training uses unlabeled data as well as labeled data with binary labels, i.e., normal and anomalous.
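
Sketched in PyTorch, the main training phase is an ordinary mini-batch SGD loop. Here \texttt{encoder}, \texttt{loader} and \texttt{objective} are placeholders for the pre-trained encoder, the training data and a function computing the data terms of the objective in \ref{eq:deepsad_optimization_objective} (a possible implementation is sketched further below); the optimizer's weight-decay argument realizes the L2 regularization term.

\begin{verbatim}
import torch

def train(encoder, loader, objective, c, epochs=50, lr=1e-3, lam=1e-6):
    # weight_decay corresponds to the L2 regularization term of the objective.
    opt = torch.optim.SGD(encoder.parameters(), lr=lr, weight_decay=lam)
    for _ in range(epochs):
        for x, y in loader:          # y: +1 / -1 for labeled, 0 for unlabeled
            z = encoder(x)
            loss = objective(z[y == 0], z[y != 0], y[y != 0], c)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder
\end{verbatim}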
@@ -372,21 +392,22 @@ From this it is easy to understand Deep SAD's optimization objective seen in \re
+\frac{\lambda}{2}\sum_{\ell=1}^{L}\|\mathbf{W}^{\ell}\|_{F}^{2}.
\end{equation}
The first term of \ref{eq:deepsad_optimization_objective} stays mostly the same, only accounting for the $m$ introduced labeled data samples in its proportionality. The second term is newly introduced to incorporate the labeled data samples with the strength of the hyperparameter $\eta$: depending on each data sample's label $\tilde{y}$, the distance from the sample's latent representation to $\mathbf{c}$ is either minimized or maximized. The third term is kept identical to Deep SVDD as standard L2 regularization. It can also be observed that in the case of $m = 0$ labeled samples, Deep SAD falls back to the optimization objective of Deep SVDD and can therefore be used in a completely unsupervised fashion as well.
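
The two data-dependent terms translate directly into code. The following PyTorch sketch is an illustrative implementation under the assumption that labels are encoded as $\tilde{y} = +1$ for normal and $\tilde{y} = -1$ for anomalous samples and that the L2 weight-decay term is handled by the optimizer (e.g. via its \texttt{weight\_decay} argument); it is not taken verbatim from the original implementation.

\begin{verbatim}
import torch

def deepsad_loss(z_unlabeled, z_labeled, y_labeled, c,
                 eta=1.0, eps=1e-6):
    """Unlabeled and labeled terms of the Deep SAD objective."""
    n, m = len(z_unlabeled), len(z_labeled)

    # First term: pull all (mostly normal) unlabeled samples towards c.
    dist_u = torch.sum((z_unlabeled - c) ** 2, dim=1)
    loss = dist_u.sum() / (n + m)

    # Second term: distances raised to the power of the label, so known
    # normal samples (+1) are pulled towards c while known anomalies (-1)
    # are pushed away (their inverse distance is minimized).
    if m > 0:
        dist_l = torch.sum((z_labeled - c) ** 2, dim=1)
        loss = loss + eta * ((dist_l + eps) ** y_labeled.float()).sum() / (n + m)

    return loss
\end{verbatim}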
\newsubsubsectionNoTOC{Hyperparameters}
The neural network architecture of DeepSAD is not fixed but rather depends on the type of data the algorithm is supposed to operate on. This is due to the way it employs an autoencoder for pre-training and the encoder part of that network for its main training step. This makes the adaptation of an autoencoder architecture suitable to the specific application necessary, but it also allows for flexibility in choosing a fitting architecture depending on the application's requirements. For this reason, the specific architecture employed may be considered a hyperparameter of the Deep SAD algorithm. During the pre-training step, as is typical for autoencoders, no labels are necessary, since the optimization objective of autoencoders is generally to reproduce the input, as indicated by the architecture's name.
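
To illustrate this flexibility, the sketch below shows a small, hypothetical fully connected autoencoder for flat feature vectors; the layer sizes and the latent dimensionality are arbitrary choices and would have to be adapted to the data at hand, exactly as discussed above.

\begin{verbatim}
import torch
from torch import nn

class SmallAutoencoder(nn.Module):
    """Toy autoencoder; only its encoder is reused by Deep SAD later."""

    def __init__(self, in_dim=32, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, in_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Pre-training needs no labels: the input itself is the target,
# e.g. loss = nn.MSELoss()(model(x), x)
\end{verbatim}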
\todo[inline]{Talk about choosing the correct architecture (give example receptive fields for image data from object detection?)}
%\todo[inline, color=green!40]{Core idea of the algorithm is to learn a transformation to map input data into a latent space where normal data clusters close together and anomalous data gets mapped further away. to achieve this the methods first includes a pretraining step of an auto-encoder to extract the most relevant information, second it fixes a hypersphere center in the auto-encoders latent space as a target point for normal data and third it traings the network to map normal data closer to that hypersphere center. Fourth The resulting network can map new data into this latent space and interpret its distance from the hypersphere center as an anomaly score which is larger the more anomalous the datapoint is}
%\todo[inline, color=green!40]{explanation pre-training step: architecture of the autoencoder is dependent on the input data shape, but any data shape is generally permissible. for the autoencoder we do not need any labels since the optimization target is always the input itself. the latent space dimensionality can be chosen based on the input datas complexity (search citations). generally a higher dimensional latent space has more learning capacity but tends to overfit more easily (find cite). the pre-training step is used to find weights for the encoder which genereally extract robust and critical data from the input because TODO read deepsad paper (cite deepsad). as training data typically all data (normal and anomalous) is used during this step.}
%\todo[inline, color=green!40]{explanation hypersphere center step: an additional positive ramification of the pretraining is that the mean of all pre-training's latent spaces can be used as the hypersphere target around which normal data is supposed to cluster. this is advantageous because it allows the main training to converge faster than choosing a random point in the latent space as hypersphere center. from this point onward the center C is fixed for the main training and inference and does not change anymore.}
%\todo[inline, color=green!40]{explanation training step: during the main training step the method starts with the pre-trained weights of the encoder but removes the decoder from the architecture since it optimizes the output in the latent space and does not need to reproduce the input data format. it does so by minimizing the geometric distance of each input data's latent space represenation to the previously defined hypersphere center c. Due to normal data being more common in the inputs this results in normal data clustering closely to C and anormal data being pushed away from it. additionally during this step the labeled data is used to more correctly map normal and anormal data}
%\todo[inline, color=green!40]{explanation inference step: with the trained network we can transform new input data into the latent space and calculate its distance from the hypersphere center which will be smaller the more confident the network is in the data being normal and larger the more likely the data is anomalous. This output score is an analog value dependent on multiple factors like the latent space dimensionality, encoder architecture and ??? and has to be interpreted further to be used (for example thresholding)}
%\todo[inline, color=green!40]{in formula X we see the optimization target of the algorithm. explain in one paragraph the variables in the optimization formula}
%\todo[inline, color=green!40]{explain the three terms (unlabeled, labeled, regularization)}
\newsection{advantages_limitations}{Advantages and Limitations}
\todo[inline]{semi supervised, learns normality by amount of data (no labeling/ground truth required), very few labels for better training to specific situation}