\newsection{algorithm_description}{Algorithm Description}
DeepSAD's overall mechanics are similar to clustering-based anomaly detection methods, which, according to \rev{\cite{anomaly_detection_survey}}, typically follow a two-step approach. First, a clustering algorithm groups data points around a centroid; then, the distances of individual data points from this centroid are calculated and used as anomaly scores. DeepSAD implements these concepts with a neural network that is jointly trained to map input data onto a latent space and to minimize the volume of a data-encompassing hypersphere, whose center is the aforementioned centroid. A sample's geometric distance to the hypersphere center in the latent space is used as its anomaly score, where a larger distance corresponds to a higher probability of the sample being anomalous. This is achieved by shrinking the data-encompassing hypersphere during training in proportion to all training data, which requires that the training set contain significantly more normal than anomalous data. As a result, normal data gets clustered closely around the centroid, while anomalies end up further away from it, as can be seen in the toy example depicted in \rev{Figure}~\ref{fig:deep_svdd_transformation}.
\fig{deep_svdd_transformation}{figures/deep_svdd_transformation}{DeepSAD teaches a neural network to transform data into a latent space and minimize the volume of a data-encompassing hypersphere centered around a predetermined centroid $\mathbf{c}$. \\Reproduced from~\cite{deep_svdd}.}
Before DeepSAD's training can begin, a pretraining step is required, during which an autoencoder is trained on all available input data. One of DeepSAD's goals is to map input data onto a lower-dimensional latent space in which normal and anomalous data can be separated. To this end, DeepSAD and its predecessor Deep SVDD make use of the autoencoder's reconstruction objective: successful pretraining provides confidence that the encoder architecture is able to compress the input data's most prominent information into the latent space \rev{between} the encoder and decoder. DeepSAD then uses just the encoder as its main network architecture and discards the decoder at this step, since reconstruction of the input is no longer needed.
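The pretraining step can be sketched in PyTorch as follows; the architecture, dimensions, epoch count, and randomly generated stand-in training data are purely illustrative assumptions and do not reflect the reference implementation or the setup used later in this work.
\begin{verbatim}
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

D, d = 32, 4                                   # illustrative input and latent dimensions
X_train = torch.randn(1024, D)                 # stand-in for the unlabeled training data

encoder = nn.Sequential(nn.Linear(D, 16), nn.ReLU(), nn.Linear(16, d))
decoder = nn.Sequential(nn.Linear(d, 16), nn.ReLU(), nn.Linear(16, D))
autoencoder = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3, weight_decay=1e-6)
loader = DataLoader(TensorDataset(X_train), batch_size=128, shuffle=True)

for epoch in range(10):                        # E_A pretraining epochs
    for (x,) in loader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(autoencoder(x), x)   # reconstruction objective
        loss.backward()
        optimizer.step()
# Only the trained encoder is reused afterwards; the decoder is discarded.
\end{verbatim}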
The pretraining results are used in two more key ways. First, the encoder weights obtained from the autoencoder pretraining initialize DeepSAD's network for the main training phase. Second, we perform an initial forward pass through the encoder on all training samples, and the mean of these latent representations is set as the hypersphere center $\mathbf{c}$. According to \citeauthor{deepsad}, this initialization leads to faster convergence during the main training phase compared to using a randomly selected centroid. An alternative would be to compute $\mathbf{c}$ using only the labeled normal examples, which would prevent the center from being influenced by anomalous samples; however, this requires a sufficient number of labeled normal samples. Once defined, the hypersphere center $\mathbf{c}$ remains fixed, as allowing it to be optimized freely could, in the unsupervised case, lead to a hypersphere collapse: a trivial solution where the network learns to map all inputs directly onto the centroid $\mathbf{c}$.
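Continuing the illustrative snippet above, the hypersphere center $\mathbf{c}$ can then be obtained as the mean latent representation of the training data under the pretrained encoder:
\begin{verbatim}
# One forward pass over the training data through the pretrained encoder.
with torch.no_grad():
    latents = torch.cat([encoder(x) for (x,) in loader])  # shape: (n_samples, d)
    c = latents.mean(dim=0)                                # hypersphere center, kept fixed
\end{verbatim}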
In the main training step, DeepSAD's network is trained with backpropagation using stochastic gradient descent (SGD). The unlabeled training data is used to minimize the volume of the data-encompassing hypersphere. Since one of the preconditions of training is a significant prevalence of normal data over anomalies in the training set, normal samples collectively cluster tightly around the centroid, while the rarer anomalous samples contribute less to the optimization and therefore tend to remain further from the hypersphere center. The labeled data carries binary class labels marking each sample as either normal or anomalous. Labeled anomalies are pushed away from the center by defining their optimization target as maximizing their distance to $\mathbf{c}$. Labeled normal samples are treated similarly to unlabeled samples, with the difference that DeepSAD includes a hyperparameter controlling the proportion with which labeled and unlabeled data contribute to the overall optimization. The resulting network has learned to map normal data samples closer to $\mathbf{c}$ in the latent space and anomalies further away.
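A rough sketch of the main training phase, again building on the previous snippets, is given below; the labeled samples, the value of the weighting hyperparameter, and the epoch count are invented for illustration, and the loss computation anticipates (and slightly simplifies the normalization of) the objective formalized in the next section.
\begin{verbatim}
model = encoder                                  # DeepSAD reuses the pretrained encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-6)
eta = 1.0                                        # weighting of labeled vs. unlabeled samples

# Invented labeled samples: y = +1 marks normal, y = -1 anomalous data.
X_lab = torch.randn(64, D)
y_lab = torch.cat([torch.ones(56), -torch.ones(8)])

for epoch in range(50):                          # E_M main training epochs
    for (x,) in loader:                          # unlabeled batch
        optimizer.zero_grad()
        dist_u = ((model(x) - c) ** 2).sum(dim=1)       # squared distances to c
        dist_l = ((model(X_lab) - c) ** 2).sum(dim=1)
        # The unlabeled term shrinks the hypersphere; the labeled term pulls normal
        # samples towards c and pushes labeled anomalies away (exponent y = -1).
        loss = dist_u.mean() + eta * ((dist_l + 1e-6) ** y_lab).mean()
        loss.backward()
        optimizer.step()
\end{verbatim}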
\fig{deepsad_procedure}{diagrams/deepsad_procedure/deepsad_procedure}{Overview of the DeepSAD workflow. Training starts with unlabeled data and optional labeled samples, which are used to pretrain an autoencoder, compute the hypersphere center, and then perform main training with adjustable weighting of labeled versus unlabeled data. During inference, new samples are encoded and their distance to the hypersphere center is used as an anomaly score, with larger distances indicating stronger anomalies.}
To infer whether a previously unknown data sample is normal or anomalous, the sample is fed in a forward pass through the fully trained network. During inference, the centroid $\mathbf{c}$ needs to be known in order to calculate the geometric distance between the sample's latent representation and $\mathbf{c}$. This distance \rev{serves as} an anomaly score, which correlates with the likelihood of the sample being anomalous. Due to differences in input data type, training success, and latent space dimensionality, the magnitude of the anomaly score has to be judged individually for each trained network: a score that signifies normal data for one network may clearly indicate an anomaly for another. Since the geometric distance between two points is a continuous scalar value, post-processing of the score, for example thresholding, is necessary if a binary classification into normal and anomalous is desired.
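A corresponding inference sketch, reusing the trained encoder and the fixed center $\mathbf{c}$ from the previous snippets, is shown below; the unseen samples and the percentile-based threshold are illustrative stand-ins, since a suitable threshold has to be chosen per trained network as described above.
\begin{verbatim}
def anomaly_score(phi, x, c):
    """Anomaly score: geometric distance between the latent representation and c."""
    with torch.no_grad():
        return ((phi(x) - c) ** 2).sum(dim=1).sqrt()

X_new = torch.randn(5, D)                        # stand-in for previously unseen samples
scores = anomaly_score(encoder, X_new, c)

# Optional binarization: derive a threshold from scores on (mostly normal) training
# or validation data, e.g. a high percentile; the concrete value is dataset-specific.
threshold = anomaly_score(encoder, X_train, c).quantile(0.95)
is_anomalous = scores > threshold
\end{verbatim}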
DeepSAD's full training and inference procedure is visualized in \rev{Figure}~\ref{fig:deepsad_procedure}, which gives a comprehensive overview of the dataflows, tunable hyperparameters, and individual steps involved.
\newsection{algorithm_details}{Algorithm Details and Hyperparameters}
Since DeepSAD is heavily based on its predecessor \rev{Deep SVDD}~\cite{deep_svdd}, it is helpful to first understand Deep SVDD's optimization objective, so we start by explaining it here. For input space $\mathcal{X} \subseteq \mathbb{R}^D$, output space $\mathcal{Z} \subseteq \mathbb{R}^d$, a neural network $\phi(\wc; \mathcal{W}) : \mathcal{X} \to \mathcal{Z}$, where $\mathcal{W}$ denotes the neural network's weights over $L$ layers $\{\mathbf{W}_1, \dots, \mathbf{W}_L\}$, $n$ unlabeled training samples $\{\mathbf{x}_1, \dots, \mathbf{x}_n\}$, and $\mathbf{c}$ the center of the hypersphere in the latent space, Deep SVDD teaches the neural network to cluster normal data closely together in the latent space by defining its optimization objective as \rev{follows.}
\begin{equation}
\label{eq:deepsvdd_optimization_objective}
\min_{\mathcal{W}}\;\frac{1}{n}\sum_{i=1}^{n}\left\|\phi(\mathbf{x}_i;\mathcal{W})-\mathbf{c}\right\|^{2}
+\frac{\lambda}{2}\sum_{\ell=1}^{L}\|\mathbf{W}^{\ell}\|_{F}^{2}.
\end{equation}
Deep SVDD is an unsupervised method that does not rely on labeled data to train the network to differentiate between normal and anomalous data. The first term of its optimization objective describes the shrinking of the data-encompassing hypersphere around the given center $\mathbf{c}$: for each data sample $\mathbf{x}_i$, its geometric distance to $\mathbf{c}$ in the latent space produced by the neural network $\phi(\wc; \mathcal{W})$ is minimized, averaged over the $n$ data samples. The second term is a standard L2 regularization term controlled by the hyperparameter $\lambda > 0$, with $\|\wc\|_F$ denoting the Frobenius norm; it counteracts overfitting.
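For reference, the empirical part of Equation~\ref{eq:deepsvdd_optimization_objective} translates almost literally into a loss function in the style of the earlier sketches; here, phi stands for any encoder network, and the L2 regularization term is assumed to be delegated to the optimizer rather than added explicitly.
\begin{verbatim}
def deep_svdd_loss(phi, x, c):
    """First term of the Deep SVDD objective: mean squared distance to the center c.

    The L2 regularization term is assumed to be handled via the optimizer's
    weight_decay argument instead of being added here explicitly.
    """
    z = phi(x)                                   # latent representations phi(x_i; W)
    return ((z - c) ** 2).sum(dim=1).mean()      # (1/n) * sum_i ||phi(x_i) - c||^2
\end{verbatim}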
\citeauthor{deepsad} argue that the pretraining step employing an autoencoder—originally introduced in Deep SVDD—not only allows a geometric interpretation of the method as minimum volume estimation, i.e., the shrinking of the data-encompassing hypersphere, but also a probabilistic one as entropy minimization over the latent distribution. The autoencoding objective during pretraining implicitly maximizes the mutual information between the data and its latent representation, aligning the approach with the Infomax principle while encouraging a latent space with minimal entropy. This insight enabled \citeauthor{deepsad} to introduce an additional term in DeepSAD's objective, beyond that of its predecessor Deep SVDD, which incorporates labeled data to better capture the characteristics of normal and anomalous data. They demonstrate that DeepSAD's objective effectively models the latent distribution of normal data as having low entropy, while that of anomalous data is characterized by higher entropy. In this framework, anomalies are interpreted as being generated from an infinite mixture of distributions that differ from the normal data distribution. The introduction of this term in DeepSAD's objective allows it to learn in a semi-supervised way, which helps the model better position known normal samples near the hypersphere center and push known anomalies farther away, thereby enhancing its ability to differentiate between normal and anomalous data.

From \rev{Equation}~\ref{eq:deepsvdd_optimization_objective}, it is easy to understand DeepSAD's optimization objective, seen in \rev{Equation}~\ref{eq:deepsad_optimization_objective}, which additionally \rev{uses} $m$ labeled data samples $\{(\mathbf{\tilde{x}}_1, \tilde{y}_1), \dots, (\mathbf{\tilde{x}}_m, \tilde{y}_m)\} \in \mathcal{X} \times \mathcal{Y}$ with $\mathcal{Y} = \{-1,+1\}$, where $\tilde{y} = +1$ denotes normal and $\tilde{y} = -1$ anomalous samples, as well as a new hyperparameter $\eta > 0$, which balances the strength with which labeled and unlabeled samples contribute to the training.
\rev{The objective is}
\begin{equation}
\label{eq:deepsad_optimization_objective}
\min_{\mathcal{W}}\;\frac{1}{n+m}\sum_{i=1}^{n}\left\|\phi(\mathbf{x}_i;\mathcal{W})-\mathbf{c}\right\|^{2}
+\frac{\eta}{n+m}\sum_{j=1}^{m}\left(\left\|\phi(\mathbf{\tilde{x}}_j;\mathcal{W})-\mathbf{c}\right\|^{2}\right)^{\tilde{y}_j}
+\frac{\lambda}{2}\sum_{\ell=1}^{L}\|\mathbf{W}^{\ell}\|_{F}^{2}.
\end{equation}
The first term of \rev{Equation}~\ref{eq:deepsad_optimization_objective} stays \rev{almost} the same, differing only in that the $m$ labeled data samples are now included in the normalization. The second term is newly introduced to incorporate the labeled data samples, weighted by the hyperparameter $\eta$, by either minimizing or maximizing the distance between each sample's latent representation and $\mathbf{c}$, depending on that sample's label $\tilde{y}$. The standard L2 regularization is kept identical to Deep SVDD's optimization objective. It can also be observed that in the case of $m = 0$ labeled samples, DeepSAD falls back to Deep SVDD's optimization objective and can therefore be used in a completely unsupervised fashion as well.
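Under the same assumptions as the earlier sketches, Equation~\ref{eq:deepsad_optimization_objective} can be transcribed as the following loss function, in which the label $\tilde{y}_j \in \{-1,+1\}$ enters as the exponent of the squared distance; the small constant eps merely guards against division by zero for labeled anomalies and is an implementation detail, not part of the formal objective.
\begin{verbatim}
def deepsad_loss(phi, x_unlabeled, x_labeled, y_labeled, c, eta=1.0, eps=1e-6):
    """Empirical DeepSAD objective; weight decay is again left to the optimizer."""
    dist_u = ((phi(x_unlabeled) - c) ** 2).sum(dim=1)   # ||phi(x_i) - c||^2
    dist_l = ((phi(x_labeled) - c) ** 2).sum(dim=1)     # ||phi(x~_j) - c||^2
    n, m = len(dist_u), len(dist_l)

    unlabeled_term = dist_u.sum() / (n + m)                              # first term
    labeled_term = eta / (n + m) * ((dist_l + eps) ** y_labeled).sum()   # second term
    return unlabeled_term + labeled_term
\end{verbatim}
With $m = 0$ labeled samples, the second term vanishes and the function reduces to the Deep SVDD loss sketched earlier.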
\paragraph{Hyperparameters}
DeepSAD relies on several tunable hyperparameters that influence different stages of the algorithm. The most relevant ones are summarized and discussed below; an illustrative example configuration is sketched after the list.
\begin{itemize}
\item \textbf{Network architecture $\phi$} \\
The encoder architecture determines the representational capacity of the model. Because DeepSAD builds on a pretraining autoencoder, the architecture must be expressive enough to reconstruct input data during pretraining, but also compact enough to support separation of normal and anomalous samples in the latent space. The choice of architecture is therefore data-dependent: convolutional encoders are often used for images, while fully connected or other encoder types may be more suitable for other data modalities. The architecture directly constrains which patterns the network can learn and thus strongly shapes the latent space structure.
\item \textbf{Latent space dimensionality $d$} \\
The size of the latent bottleneck is a critical parameter. If $d$ is too small, the network cannot encode all relevant information, leading to information loss and weak representations. If $d$ is too large, the network risks overfitting by encoding irrelevant detail, while also increasing computational cost. These insights stem from the autoencoder literature \cite{deep_learning_book}, but it is unclear whether they apply directly to DeepSAD: here, the autoencoder serves only for pretraining, and the encoder is subsequently fine-tuned with a different objective. Thus, the optimal choice of $d$ may not coincide with the value that would be ideal for autoencoder reconstruction alone.
\item \textbf{Label weighting $\eta$} \\
The parameter $\eta$ controls the relative contribution of labeled versus unlabeled data in the DeepSAD objective. With $\eta = 1$, both groups contribute equally (normalized by their sample counts). Larger values of $\eta$ emphasize the labeled data, pulling labeled \rev{normal data} closer to the center and pushing labeled anomalies further away. Smaller values emphasize the unlabeled data, effectively reducing the influence of labels. Its impact depends not only on its numerical value but also on the quantity and quality of available labels.
\item \textbf{Learning rates $L_A$ and $L_M$} \\
Two learning rates are defined: $L_A$ for the autoencoder pretraining and $L_M$ for the main DeepSAD training. The learning rate sets the step size used during gradient descent updates and thereby controls the stability and speed of training. If it is too high, the optimization may diverge or oscillate; if too low, convergence becomes excessively slow and may get stuck in poor local minima. Optimizers with adaptive learning rates, such as Adam, may be applied to mitigate the impact of poor choices.
\item \textbf{Number of epochs $E_A$ and $E_M$} \\
The number of training epochs specifies how many full passes over the dataset are made in pretraining ($E_A$) and in the main DeepSAD training ($E_M$). More epochs allow the model to fit more closely to the training data, but also increase the risk of overfitting to noise or mislabeled samples. In practice, the effective number of epochs depends on dataset size, network architecture, and whether early stopping is applied.
\item \textbf{Regularization rate $\lambda$} \\
The strength of the L2 weight regularization, where $\lambda$ has to be larger than 0 for regularization to take effect. A higher value decreases the risk of overfitting at the cost of constraining the model's capacity to fit the training data.
\end{itemize}
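For orientation, an illustrative configuration covering the hyperparameters above could look as follows; every value is a placeholder chosen for the example and does not represent a tuned setting from the literature or from this work.
\begin{verbatim}
# Purely illustrative hyperparameter choices, mirroring the list above.
config = {
    "latent_dim": 4,          # d: size of the latent bottleneck
    "eta": 1.0,               # weighting of labeled vs. unlabeled samples
    "lr_pretrain": 1e-3,      # L_A: learning rate for autoencoder pretraining
    "lr_main": 1e-4,          # L_M: learning rate for the main DeepSAD training
    "epochs_pretrain": 10,    # E_A: pretraining epochs
    "epochs_main": 50,        # E_M: main training epochs
    "weight_decay": 1e-6,     # lambda: L2 regularization strength
}
\end{verbatim}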
\newchapter{data_preprocessing}{Data and Preprocessing}