reworked general deepsad algo description

This commit is contained in:
Jan Kowalczyk
2025-04-29 10:12:27 +02:00
parent 1d5d0b3a99
commit e6cdcf23f5

View File

@@ -366,6 +366,13 @@ Semi-Supervised learning algorithms are -as the name implies- an inbetween categ
\todo[inline, color=green!40]{DeepSAD is semi-supervised, autoencoder in pre-training is interesting case since its un-/self-supervised. explained in more detail next} \todo[inline, color=green!40]{DeepSAD is semi-supervised, autoencoder in pre-training is interesting case since its un-/self-supervised. explained in more detail next}
\newsection{autoencoder}{Autoencoder} \newsection{autoencoder}{Autoencoder}
\threadtodo
{Explain how autoencoders work and what they are used for}
{autoencoder used in deepSAD}
{explain basic idea, unfixed architecture, infomax, mention usecases, dimension reduction}
{dimensionality reduction, useful for high dim data $\rightarrow$ pointclouds from lidar}
\todo[inline]{autoencoder explanation} \todo[inline]{autoencoder explanation}
\todo[inline, color=green!40]{autoencoders are a neural network architecture archetype (words) whose training target is to reproduce the input data itself - hence the name. the architecture is most commonly a mirrored one consisting of an encoder which transforms input data into a hyperspace represantation in a latent space and a decoder which transforms the latent space into the same data format as the input data (phrasing), this method typically results in the encoder learning to extract the most robust and critical information of the data and the (todo maybe something about the decoder + citation for both). it is used in many domains translations, LLMs, something with images (search example + citations)} \todo[inline, color=green!40]{autoencoders are a neural network architecture archetype (words) whose training target is to reproduce the input data itself - hence the name. the architecture is most commonly a mirrored one consisting of an encoder which transforms input data into a hyperspace represantation in a latent space and a decoder which transforms the latent space into the same data format as the input data (phrasing), this method typically results in the encoder learning to extract the most robust and critical information of the data and the (todo maybe something about the decoder + citation for both). it is used in many domains translations, LLMs, something with images (search example + citations)}
\todo[inline, color=green!40]{typical encoder decoder mirrored figure} \todo[inline, color=green!40]{typical encoder decoder mirrored figure}
@@ -373,17 +380,38 @@ Semi-Supervised learning algorithms are -as the name implies- an inbetween categ
\todo[inline, color=green!40]{our chosen method Deep SAD uses an autoencoder to translate input data into a latent space, in which it can more easily differentiate between normal and anomalous data} \todo[inline, color=green!40]{our chosen method Deep SAD uses an autoencoder to translate input data into a latent space, in which it can more easily differentiate between normal and anomalous data}
\newsection{lidar_related_work}{Lidar - Light Detection and Ranging} \newsection{lidar_related_work}{Lidar - Light Detection and Ranging}
\threadtodo
{Explain how lidars work and what data they produce}
{understand why data is degraded, and how data looks}
{explain how radar/lidar works, usecases, output = pointclouds, what errors}
{lidar used in automotive $\rightarrow$ related work - rain degradation}
\todo[inline]{related work in lidar} \todo[inline]{related work in lidar}
\todo[inline, color=green!40]{the older more commonly known radar works by sending out an electromagnetic wave in the radiofrequency and detecting the time it takes to return (if it returns at all) signalling a reflective object in the path of the radiowave. lidar works on the same principle but sends out a lightray produced by a laser (citation needed) and measuring the time it takes for the ray to return to the sensor. since the speed of light is constant in air the system can calculate the distance between the sensor and the measured point. modern lidar systems send out multiple, often millions of measurement rays per second which results in a three dimensional point cloud, constructed from the information in which direction the ray was cast and the distance that was measured} \todo[inline, color=green!40]{the older more commonly known radar works by sending out an electromagnetic wave in the radiofrequency and detecting the time it takes to return (if it returns at all) signalling a reflective object in the path of the radiowave. lidar works on the same principle but sends out a lightray produced by a laser (citation needed) and measuring the time it takes for the ray to return to the sensor. since the speed of light is constant in air the system can calculate the distance between the sensor and the measured point. modern lidar systems send out multiple, often millions of measurement rays per second which results in a three dimensional point cloud, constructed from the information in which direction the ray was cast and the distance that was measured}
\todo[inline, color=green!40]{lidar is used in most domains reliant on accurate 3d representations of the world like autonomous driving, robot navigation, (+ maybe quickly look up two other domains), its main advantage is high measurement accuracy, precision (use correct term), and high resolution (possible due to single point measurements instead of cones like radar, ToF, Ultrasonic) which enables more detailed mappings of the environment} \todo[inline, color=green!40]{lidar is used in most domains reliant on accurate 3d representations of the world like autonomous driving, robot navigation, (+ maybe quickly look up two other domains), its main advantage is high measurement accuracy, precision (use correct term), and high resolution (possible due to single point measurements instead of cones like radar, ToF, Ultrasonic) which enables more detailed mappings of the environment}
\todo[inline, color=green!40]{due to point precision, lidar is sensitive to noise/degradation of airborne particles, which may produce early returns, deflections, errrors of light rays, this results in noise in the 3d point cloud and possibly missing data of the measurement behind the aerosol particle.} \todo[inline, color=green!40]{due to point precision, lidar is sensitive to noise/degradation of airborne particles, which may produce early returns, deflections, errrors of light rays, this results in noise in the 3d point cloud and possibly missing data of the measurement behind the aerosol particle.}
\todo[inline, color=green!40]{because of the given advantages of lidar it is most commonly used nowadays on robot platforms for environment mapping and navigiation - so we chose to demonstrate our method based on degraded data collected by a lidar sensor as discussed in more dtail in section (data section)} \todo[inline, color=green!40]{because of the given advantages of lidar it is most commonly used nowadays on robot platforms for environment mapping and navigiation - so we chose to demonstrate our method based on degraded data collected by a lidar sensor as discussed in more dtail in section (data section)}
\newsection{related_work}{Related Work}
\threadtodo
{What other research has been done on this topic}
{reader knows all background, what is starting point of research}
{talk about rain degradation paper from automotive, cleaning pointclouds?}
{Rain paper successful with DeepSAD $\rightarrow$ what is DeepSAD}
\newchapter{deepsad}{Deep SAD: Semi-Supervised Anomaly Detection} \newchapter{deepsad}{Deep SAD: Semi-Supervised Anomaly Detection}
%In this chapter we explore the method \emph{Deep Semi-Supervised Anomaly Detection}~\cite{deepsad} which we employed during our experiments to quanitfy the degradation of lidar scans caused by artifically introduced water vapor from a theater smoke machine. The same approach of modeling a degradation quantification problem as an anomaly detection task was succesfully used in \cite{degradation_quantification_rain} to quantify the degradation caused to lidar scans by bad weather conditions such as rain, fog and snow for autonomous driving tasks. Deep SAD is characterized by it being a deep-learning approach to anomaly detection which enables it to learn more complex anomalous data patterns than more classic statistical approaches and its capability of employing hand-labeled data samples-both normal and anomalous-during its training step to better teach the model to differentiate between know anomalies and normal data than if only an unsupervised approach was used which basically just learns the most common patterns in the implicitely more common normal data and to differentiate anything from that. %In this chapter we explore the method \emph{Deep Semi-Supervised Anomaly Detection}~\cite{deepsad} which we employed during our experiments to quanitfy the degradation of lidar scans caused by artifically introduced water vapor from a theater smoke machine. The same approach of modeling a degradation quantification problem as an anomaly detection task was succesfully used in \cite{degradation_quantification_rain} to quantify the degradation caused to lidar scans by bad weather conditions such as rain, fog and snow for autonomous driving tasks. Deep SAD is characterized by it being a deep-learning approach to anomaly detection which enables it to learn more complex anomalous data patterns than more classic statistical approaches and its capability of employing hand-labeled data samples-both normal and anomalous-during its training step to better teach the model to differentiate between know anomalies and normal data than if only an unsupervised approach was used which basically just learns the most common patterns in the implicitely more common normal data and to differentiate anything from that.
\threadtodo
{Introduce DeepSAD, how and why do we use it}
{let reader know why they need to know about Deepsad in detail}
{explain use-case, similar use-case worked, allude to core features}
{interest/curiosity created $\rightarrow$ wants to learn about DeepSAD}
In this chapter, we explore the method \emph{Deep Semi-Supervised Anomaly Detection} (Deep SAD)~\cite{deepsad}, which we employ to quantify the degradation of LiDAR scans caused by airborne particles in the form of artificially introduced water vapor from a theater smoke machine. A similar approach—modeling degradation quantification as an anomaly detection task—was successfully applied in \cite{degradation_quantification_rain} to assess the impact of adverse weather conditions on LiDAR data for autonomous driving applications. Deep SAD leverages deep learning to capture complex anomalous patterns that classical statistical methods might miss. Furthermore, by incorporating a limited amount of hand-labeled data (both normal and anomalous), it can more effectively differentiate between known anomalies and normal data compared to purely unsupervised methods, which typically learn only the most prevalent patterns in the dataset~\cite{deepsad}. In this chapter, we explore the method \emph{Deep Semi-Supervised Anomaly Detection} (Deep SAD)~\cite{deepsad}, which we employ to quantify the degradation of LiDAR scans caused by airborne particles in the form of artificially introduced water vapor from a theater smoke machine. A similar approach—modeling degradation quantification as an anomaly detection task—was successfully applied in \cite{degradation_quantification_rain} to assess the impact of adverse weather conditions on LiDAR data for autonomous driving applications. Deep SAD leverages deep learning to capture complex anomalous patterns that classical statistical methods might miss. Furthermore, by incorporating a limited amount of hand-labeled data (both normal and anomalous), it can more effectively differentiate between known anomalies and normal data compared to purely unsupervised methods, which typically learn only the most prevalent patterns in the dataset~\cite{deepsad}.
@@ -394,34 +422,60 @@ In this chapter, we explore the method \emph{Deep Semi-Supervised Anomaly Detect
%Deep SAD is a typical clustering based anomaly detection technique which is described in \cite{anomaly_detection_survey} to generally have a two step approach to anomaly detection. First a clustering algorithm is used to cluster data closely together around a centroid and secondly the distances from data to that centroid is calculated and interpreted as an anomaly score. This general idea can also be found in the definition of the Deep SAD algorithm, which uses the encoder part of an autoencoder architecture which is trained to cluster data around a centroid in the latent space of its output. The datas geometric distance to that centroid in the latent space is defined as an anomaly score. Deep SAD is a semi-supervised training based method which can work completely unsupervised (no labeled data available) in which case it falls back to its predecessor method Deep SVDD but additionally allows the introduction of labeleld data samples during training to more accurately map known normal samples near the centroid and known anomalous samples further away from it. %Deep SAD is a typical clustering based anomaly detection technique which is described in \cite{anomaly_detection_survey} to generally have a two step approach to anomaly detection. First a clustering algorithm is used to cluster data closely together around a centroid and secondly the distances from data to that centroid is calculated and interpreted as an anomaly score. This general idea can also be found in the definition of the Deep SAD algorithm, which uses the encoder part of an autoencoder architecture which is trained to cluster data around a centroid in the latent space of its output. The datas geometric distance to that centroid in the latent space is defined as an anomaly score. Deep SAD is a semi-supervised training based method which can work completely unsupervised (no labeled data available) in which case it falls back to its predecessor method Deep SVDD but additionally allows the introduction of labeleld data samples during training to more accurately map known normal samples near the centroid and known anomalous samples further away from it.
Deep SAD is an anomaly detection algorithm that belongs to the category of clustering-based methods, which according to~\cite{anomaly_detection_survey} typically follow a two-step approach. First, a clustering algorithm groups data points around a centroid; then, the distances of individual data points from this centroid are calculated and used as an anomaly score. In Deep SAD, this concept is implemented by employing a neural network, which is jointly trained to map data into a latent space and to minimize the volume of an data-encompassing hypersphere whose center is the aforementioned centroid. The geometric distance in the latent space to the hypersphere center is used as the anomaly score, where a higher score corresponds to a higher probability of a sample being anomalous according to the method, due to normal samples clustering more closely around the hypersphere center than anomalies. This general working principle is depicted in figure~\ref{fig:deep_svdd_transformation}. \threadtodo
{give general overview about how it works}
{overview helps reader understand method, then go into detail}
{how clustering AD generally works, how it does in DeepSAD}
{since the reader knows the general idea $\rightarrow$ what is the step-by-step?}
\todo[inline]{remove categorization? its a bit complicated since it looks like clustering, also has spectral and can maybe even be interpreted to be information theoretic?}
Deep SAD is an anomaly detection algorithm that belongs to the category of clustering-based methods, which according to~\cite{anomaly_detection_survey} typically follow a two-step approach. First, a clustering algorithm groups data points around a centroid; then, the distances of individual data points from this centroid are calculated and used as an anomaly score. In addition to that, DeepSAD also utilizes a spectral component by mapping the input data onto a lower-dimensional space, which enables it to detect anomalies in high-dimensional complex data types. In Deep SAD, these concepts are implemented by employing a neural network, which is jointly trained to map data into a latent space and to minimize the volume of an data-encompassing hypersphere whose center is the aforementioned centroid. The geometric distance in the latent space to the hypersphere center is used as the anomaly score, where a larger distance between data and centroid corresponds to a higher probability of a sample being anomalous. This is achieved by shrinking the data-encompassing hypersphere during training, proportionally to all training data, of which is required that there is significantly more normal than anomalous data present. The outcome of this approach is that normal data gets clustered more closely around the centroid, while anomalies appear further away from it as can be seen in the toy example depicted in figure~\ref{fig:deep_svdd_transformation}.
\fig{deep_svdd_transformation}{figures/deep_svdd_transformation}{DeepSAD teaches a neural network to transform data into a latent space and minimize the volume of an data-encompassing hypersphere centered around a predetermined centroid $\textbf{c}$. \\Reproduced from~\cite{deepsvdd}.} \fig{deep_svdd_transformation}{figures/deep_svdd_transformation}{DeepSAD teaches a neural network to transform data into a latent space and minimize the volume of an data-encompassing hypersphere centered around a predetermined centroid $\textbf{c}$. \\Reproduced from~\cite{deepsvdd}.}
\threadtodo
{first step is pre-training, why does it use an autoencoder?}
{go chronologically over how the algorithm works. starting at pre-training}
{pre-training is autoencoder, self-supervised, dimensionality reduction}
{pre-training done $\rightarrow$ how are the pre-training results used?}
%As a pre-training step an autoencoder architecture is trained and its weights are used to initialize its encoder part before training of the method itself begins. \citeauthor{deepsad} argue in~\cite{deepsad} that this pre-training step which was already present in~\cite{deepsvdd}, allows them to not only interpret the method in geometric terms as minimum volume estimation but also in probalistic terms as entropy minimization over the latent distribution, since the autoencoding objective implicitely maximizes the mutual information between the data and its latent space represenation. This insight-that the method follows the Infomax principle with the additional objective of the latent distribution having mininmal entropy-allowed \citeauthor{deepsad} to introduce an additional term in Deep SAD's - over Deep SVDD's objective, which encorporates labeled data to better model the nature of normal and anomalous data. They show that Deep SAD's objective can be interpreted as normal data's distribution in the latent space being modeled to have low entropy and anomalous data's distribution in that latent space being modeled as having high entropy, which they argue captures the nature of the difference between normal and anomalous data by interpreting anomalies ``as being generated from an infinite mixture of distributions that are different from normal data distribution''~\cite{deepsad}. DeepSAD requires a pre-training step, during which an autoencoder is trained on all available training data. One of DeepSAD's goals is to map input data onto a lower dimensional latent space, in which the separation between normal and anomalous data can be achieved. To this end DeepSAD and its predecessor Deep SVDD make use of the autoencoder's reconstruction goal, whose successful training ensures confidence in that the encoder architecture is suitable for extracting the input datas' most prominent information to the latent space inbetween the encoder and decoder. DeepSAD goes on to use just the encoder as its main network architecture, discarding the decoder at this step, since reconstruction of the input is unnecessary.
As a pre-training step, an autoencoder is trained and its encoder weights are used to initialize the model before beginning the main training phase. \citeauthor{deepsad} argue in \cite{deepsad} that this pre-training step—originally introduced in \cite{deepsvdd}—not only allows a geometric interpretation of the method as minimum volume estimation but also a probabilistic one as entropy minimization over the latent distribution. The autoencoding objective implicitly maximizes the mutual information between the data and its latent representation, aligning the approach with the Infomax principle while encouraging a latent space with minimal entropy. This insight enabled \citeauthor{deepsad} to introduce an additional term in DeepSADs objective, beyond that of its predecessor Deep SVDD~\cite{deepsvdd}, which incorporates labeled data to better capture the characteristics of normal and anomalous data. They demonstrate that DeepSADs objective effectively models the latent distribution of normal data as having low entropy, while that of anomalous data is characterized by higher entropy. In this framework, anomalies are interpreted as being generated from an infinite mixture of distributions that differ from the normal data distribution.
The introduction of the aforementioned term in Deep SAD's objective allows it to learn in a semi-supervised way, though it can operate in a fully unsupervised mode—effectively reverting to its predecessor, Deep SVDD~\cite{deepsvdd}—when no labeled data are available. However, it also allows for the incorporation of labeled samples during training. This additional supervision helps the model better position known normal samples near the hypersphere center and push known anomalies farther away, thereby enhancing its ability to differentiate between normal and anomalous data.
%The results of the pre-training are used twofold. Firstly the encoders' weights at the end of pre-training can be used to initialize Deep SAD's weights for the main training step which aligns with the aforementioned Infomax principle, since we can assume the autoencoder maximized the shared information between the input and the latent space represenation. Secondly an initial forward-pass is run on the encoder network for all available training data samples and the results' mean position in the latent space is used to define the hypersphere center $\mathbf{c}$ which according to~\cite{deepsad} allows for faster convergence during the main training step than randomly chosen centroids. An alternative method of initializing the hypersphere center could be to use only labeled normal examples for the forward-pass, so not to pollute $\mathbf{c}$'s position with anomalous samples, which would only be possible if sufficient labeled normal samples are available. From this point on, the hypersphere center $\mathbf{c}$ stays fixed and does not change, which is necessary since it being a free optimization variable could lead to a trivial hypersphere collapse solution if the network was trained fully unsupervised. %The results of the pre-training are used twofold. Firstly the encoders' weights at the end of pre-training can be used to initialize Deep SAD's weights for the main training step which aligns with the aforementioned Infomax principle, since we can assume the autoencoder maximized the shared information between the input and the latent space represenation. Secondly an initial forward-pass is run on the encoder network for all available training data samples and the results' mean position in the latent space is used to define the hypersphere center $\mathbf{c}$ which according to~\cite{deepsad} allows for faster convergence during the main training step than randomly chosen centroids. An alternative method of initializing the hypersphere center could be to use only labeled normal examples for the forward-pass, so not to pollute $\mathbf{c}$'s position with anomalous samples, which would only be possible if sufficient labeled normal samples are available. From this point on, the hypersphere center $\mathbf{c}$ stays fixed and does not change, which is necessary since it being a free optimization variable could lead to a trivial hypersphere collapse solution if the network was trained fully unsupervised.
\todo[inline]{maybe pseydocode algorithm block?} \threadtodo
{what is pre-training output used for, how is centroid calculated and why}
{reader knows about pre-training, what are next steps and how is it used}
{pre-training weights used to init main network, c is mean of forward pass, collapse}
{network built and initialized, centroid fixed $\rightarrow$ start main training}
The pre-training results are used in two more key ways. First, the encoder weights obtained from the autoencoder pre-training initialize DeepSADs network for the main training phase. Second, we perform an initial forward pass through the encoder on all training samples, and the mean of these latent representations is set as the hypersphere center, $\mathbf{c}$. According to \cite{deepsad}, this initialization method leads to faster convergence during the main training phase compared to using a randomly selected centroid. An alternative would be to compute $\mathbf{c}$ using only the labeled normal examples, which would prevent the center from being influenced by anomalous samples; however, this requires a sufficient number of labeled normal samples. Once defined, the hypersphere center $\mathbf{c}$ remains fixed, as allowing it to be optimized freely could in the unsupervised case lead to a hypersphere collapse-a trivial solution where the network learns to map all inputs directly onto the centroid $\mathbf{c}$.
The pre-training results are used in two key ways. First, the encoder weights obtained from the autoencoder pre-training initialize DeepSADs network for the main training phase. This aligns with the Infomax principle, as the autoencoder is assumed to maximize the mutual information between the input and its latent representation. Second, we perform an initial forward pass through the encoder on all training samples, and the mean of these latent representations is set as the hypersphere center, $\mathbf{c}$. According to \cite{deepsad}, this initialization method leads to faster convergence during the main training phase compared to using a randomly selected centroid. An alternative would be to compute $\mathbf{c}$ using only the labeled normal examples, which would prevent the center from being influenced by anomalous samples; however, this requires a sufficient number of labeled normal samples. Once defined, the hypersphere center $\mathbf{c}$ remains fixed, as allowing it to be optimized freely could in the unsupervised case lead to a hypersphere collapse-a trivial solution where the network learns to map all inputs directly onto the centroid $\mathbf{c}$. \threadtodo
{how does the main training work, what data is used, what is the optimization target}
{main training is next step since all preconditions are met}
{main training is SGD backpropagation, minimizing volume, un-/labeled data used}
{network is trained $\rightarrow$ how does one use it for AD?}
In the main training step, the encoder part of the autoencoder architecture is trained using SGD backpropagation with the goal of Deep SAD's optimization objective as seen in \ref{eq:deepsad_optimization_objective}. The training takes unlabeled data as well as labeled data with binary classification i.e., normal and anomalous. In the main training step, DeepSAD's network is trained using SGD backpropagation. The unlabeled training data is used with the goal to minimize an data-encompassing hypersphere. Since one of the pre-conditions of training was the significant prevelance of normal data over anomalies in the training set, normal samples collectively cluster more tightly around the centroid, while the rarer anomalous samples do not contribute as significantly to the optimization, resulting in them staying further from the hypersphere center. The labeled data includes binary class labels signifying their status as either normal or anomalous samples. Labeled anomalies are pushed away from the center by defining their optimization target as maximizing the distance between them and $\mathbf{c}$. Labeled normal samples are treated similar to unlabeled samples with the difference that DeepSAD includes a hyperparameter capable of controling the proportion with which labeled and unlabeled data contribute to the overall optimization. The resulting network has learned to map normal data samples closer to $\mathbf{c}$ in the latent space and anomalies further away.
Once the network has been successfully trained, new data can be fed through a forward pass and the resulting latent representation's geometric distance to the hypersphere center $\mathbf{c}$ can be interpreted as an anomaly score for which holds that a higher score represents a higher likelihood of the input data being an anomalous sample. The anomaly score itself has to be processed further for effective use, for example a simple thresholding can be used to classify new samples or more complex analysis can be done to utilize the score. \todo[inline]{maybe pseudocode algorithm block?}
\threadtodo
{how to use the trained network?}
{since we finished training, we need to know how to utilize it}
{forward pass, calculate distance from c = anomaly score, analog, unknown magnitude}
{General knowledge of the algorithm achieved $\rightarrow$ go into more detail}
To infer if a previously unknown data sample is normal or anomalous, the sample is fed in a forward-pass through the fully trained network. During inference, the centroid $\mathbf{c}$ needs to be known, to calculate the geometric distance of the samples latent representation to $\mathbf{c}$. This distance is tantamount to an anomaly score, which correlates with the likelihood of the sample being anomalous. Due to differences in input data type, training success and latent space dimensionality, the anomaly score's magnitude has to be judged on an individual basis for each trained network. This means, scores produced by one network that signify normal data, may very well clearly indicate an anomaly for another network. The geometric distance between two points in space is a scalar analog value, therefore post-processing of the score is necessary to achieve a binary classification of normal and anomalous if desired.
\newsection{algorithm_details}{Algorithm Details and Hyperparameters} \newsection{algorithm_details}{Algorithm Details and Hyperparameters}
%\todo[inline]{backpropagation optimization formula, hyperaparameters explanation} %\todo[inline]{backpropagation optimization formula, hyperaparameters explanation}
Since Deep SAD is heavily based on its predecessor Deep SVDD it is helpful to first understand Deep SVDD's optimization objective, so we start with explaining it here. For input space $\mathcal{X} \subseteq \mathbb{R}^D$, output space $\mathcal{Z} \subseteq \mathbb{R}^d$ and a neural network $\phi(\wc; \mathcal{W}) : \mathcal{X} \to \mathcal{Z}$ where $\mathcal{W}$ depicts the neural networks' weights with $L$ layers $\{\mathbf{W}_1, \dots, \mathbf{W}_L\}$, $n$ the number of unlabeled training samples $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, $\mathbf{c}$ the center of the hypersphere in the latent space, Deep SVDD teaches the neural network to cluster normal data closely together in the latent space by defining its optimization objective as seen in~\ref{eq:deepsvdd_optimization_objective}. %As a pre-training step an autoencoder architecture is trained and its weights are used to initialize its encoder part before training of the method itself begins. \citeauthor{deepsad} argue in~\cite{deepsad} that this pre-training step which was already present in~\cite{deepsvdd}, allows them to not only interpret the method in geometric terms as minimum volume estimation but also in probalistic terms as entropy minimization over the latent distribution, since the autoencoding objective implicitely maximizes the mutual information between the data and its latent space represenation. This insight-that the method follows the Infomax principle with the additional objective of the latent distribution having mininmal entropy-allowed \citeauthor{deepsad} to introduce an additional term in Deep SAD's - over Deep SVDD's objective, which encorporates labeled data to better model the nature of normal and anomalous data. They show that Deep SAD's objective can be interpreted as normal data's distribution in the latent space being modeled to have low entropy and anomalous data's distribution in that latent space being modeled as having high entropy, which they argue captures the nature of the difference between normal and anomalous data by interpreting anomalies ``as being generated from an infinite mixture of distributions that are different from normal data distribution''~\cite{deepsad}.
Since Deep SAD is heavily based on its predecessor Deep SVDD it is helpful to first understand Deep SVDD's optimization objective, so we start with explaining it here. For input space $\mathcal{X} \subseteq \mathbb{R}^D$, output space $\mathcal{Z} \subseteq \mathbb{R}^d$ and a neural network $\phi(\wc; \mathcal{W}) : \mathcal{X} \to \mathcal{Z}$ where $\mathcal{W}$ depicts the neural networks' weights with $L$ layers $\{\mathbf{W}_1, \dots, \mathbf{W}_L\}$, $n$ the number of unlabeled training samples $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, $\mathbf{c}$ the center of the hypersphere in the latent space, Deep SVDD teaches the neural network to cluster normal data closely together in the latent space by defining its optimization objective as seen in~\ref{eq:deepsvdd_optimization_objective}.
\begin{equation} \begin{equation}
\label{eq:deepsvdd_optimization_objective} \label{eq:deepsvdd_optimization_objective}
@@ -432,6 +486,11 @@ Since Deep SAD is heavily based on its predecessor Deep SVDD it is helpful to fi
As can be seen from \ref{eq:deepsvdd_optimization_objective}, Deep SVDD is an unsupervised method which does not rely on labeled data to train the network to differentiate between normal and anomalous data. The first term of the optimization objective depicts the shrinking of the data-encompassing hypersphere around the given center $\mathbf{c}$. For each data sample $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, its geometric distance to $\mathbf{c}$ in the latent space produced by the neural network $\phi(\wc; \mathcal{W})$ is minimized proportionally to the amount of data samples $n$. The second term is a standard L2 regularization term which prevents overfitting with hyperparameter $\lambda > 0$ and $\|\wc\|_F$ denoting the Frobenius norm. As can be seen from \ref{eq:deepsvdd_optimization_objective}, Deep SVDD is an unsupervised method which does not rely on labeled data to train the network to differentiate between normal and anomalous data. The first term of the optimization objective depicts the shrinking of the data-encompassing hypersphere around the given center $\mathbf{c}$. For each data sample $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, its geometric distance to $\mathbf{c}$ in the latent space produced by the neural network $\phi(\wc; \mathcal{W})$ is minimized proportionally to the amount of data samples $n$. The second term is a standard L2 regularization term which prevents overfitting with hyperparameter $\lambda > 0$ and $\|\wc\|_F$ denoting the Frobenius norm.
\citeauthor{deepsad} argue in \cite{deepsad} that the pre-training step employing an autoencoder—originally introduced in \cite{deepsvdd}—not only allows a geometric interpretation of the method as minimum volume estimation i.e., the shrinking of the data encompassing hypersphere but also a probabilistic one as entropy minimization over the latent distribution. The autoencoding objective during pre-training implicitly maximizes the mutual information between the data and its latent representation, aligning the approach with the Infomax principle while encouraging a latent space with minimal entropy. This insight enabled \citeauthor{deepsad} to introduce an additional term in DeepSADs objective, beyond that of its predecessor Deep SVDD~\cite{deepsvdd}, which incorporates labeled data to better capture the characteristics of normal and anomalous data. They demonstrate that DeepSADs objective effectively models the latent distribution of normal data as having low entropy, while that of anomalous data is characterized by higher entropy. In this framework, anomalies are interpreted as being generated from an infinite mixture of distributions that differ from the normal data distribution.
The introduction of the aforementioned term in Deep SAD's objective allows it to learn in a semi-supervised way, though it can operate in a fully unsupervised mode—effectively reverting to its predecessor, Deep SVDD~\cite{deepsvdd}—when no labeled data are available. However, it also allows for the incorporation of labeled samples during training. This additional supervision helps the model better position known normal samples near the hypersphere center and push known anomalies farther away, thereby enhancing its ability to differentiate between normal and anomalous data.
From this it is easy to understand Deep SAD's optimization objective seen in \ref{eq:deepsad_optimization_objective} which additionally defines $m$ number of labeled data samples $\{(\mathbf{\tilde{x}}_1, \tilde{y}_1), \dots, (\mathbf{\tilde{x}}_m, \tilde{y}_1)\} \in \mathcal{X} \times \mathcal{Y}$ and $\mathcal{Y} = \{-1,+1\}$ for which $\tilde{y} = +1$ denotes normal and $\tilde{y} = -1$ anomalous samples as well as a new hyperparameter $\eta > 0$ which can be used to balance the strength with which labeled and unlabeled samples contribute to the training. From this it is easy to understand Deep SAD's optimization objective seen in \ref{eq:deepsad_optimization_objective} which additionally defines $m$ number of labeled data samples $\{(\mathbf{\tilde{x}}_1, \tilde{y}_1), \dots, (\mathbf{\tilde{x}}_m, \tilde{y}_1)\} \in \mathcal{X} \times \mathcal{Y}$ and $\mathcal{Y} = \{-1,+1\}$ for which $\tilde{y} = +1$ denotes normal and $\tilde{y} = -1$ anomalous samples as well as a new hyperparameter $\eta > 0$ which can be used to balance the strength with which labeled and unlabeled samples contribute to the training.
\begin{equation} \begin{equation}