wip overall small changes to figures

This commit is contained in:
Jan Kowalczyk
2025-09-28 14:35:10 +02:00
parent f36477ed9b
commit e34a374adc
21 changed files with 148 additions and 265 deletions

Binary file not shown.


@@ -87,21 +87,6 @@
\addlinespace[2pt]%
}
\DeclareRobustCommand{\threadtodo}[4]{%
% \todo[inline,
\todo[disable,
backgroundcolor=red!20,
bordercolor=red!50,
textcolor=black!80,
size=\small,
caption={Common Thread Note}]{%
\textbf{Goal:} #1 \newline
\textbf{Context:} #2 \newline
\textbf{Method:} #3 \newline
\textbf{Transition:} #4
}%
}
\DeclareRobustCommand{\sensorcell}[2]{%
\makecell[l]{#1 \\ \emph{#2}}
}
@@ -212,17 +197,6 @@
% mainmatter (=content)
\newchapter{introduction}{Introduction}
\threadtodo
{\textit{"What should the reader know after reading this section?"}}
{\textit{"Why is that of interest to the reader at this point?"}}
{\textit{"How am I achieving the stated goal?"}}
{\textit{"How does this lead to the next question or section?"}}
\threadtodo
{Create interest in topic, introduce main goal of thesis, summarize results}
{Reader only has abstract as context, need to create interest at beginning}
{emotional rescue missions, explain why data may be bad, state core research question}
{what has and hasn't been done $\rightarrow$ Scope of Research}
Autonomous robots have gained more and more prevalence in search and rescue (SAR) missions because they do not endanger another human being while still being able to fulfil the difficult tasks of navigating hazardous environments such as collapsed structures, identifying and locating victims, and assessing the environment's safety for human rescue teams. To understand the environment, robots employ multiple sensor systems such as lidar, radar, ToF, ultrasound, optical cameras or infrared cameras, of which lidar is the most prominently used due to its accuracy. The robots use the sensors' data to map their environments, navigate their surroundings and make decisions such as which paths to prioritize. Many of the algorithms underlying these capabilities are deep learning-based and are trained on large amounts of data whose characteristics are learned by the models.
@@ -239,11 +213,6 @@ Our experiments demonstrate that anomaly detection methods are indeed applicable
\newsection{scope_research}{Scope of Research}
\threadtodo
{clearly state what has and hasn't been researched + explanation why}
{from intro its clear what thesis wants to achieve, now we explain how we do that}
{state limit on data domain, sensors, output of method + reasoning for decisions}
{clear what we want to achieve $\rightarrow$ how is thesis structured to show this work}
In this thesis, we focus our research on the unique challenges faced by autonomous rescue robots, specifically the degradation of sensor data caused by airborne particles. While degradation in sensor data can also arise from adverse weather, material properties, or dynamic elements such as moving leaves, these factors are considered less relevant to the rescue scenarios targeted by our study and are therefore excluded. Although our method is versatile enough to quantify various types of degradation, our evaluation is limited to degradation from airborne particles, as this is the most prevalent issue in the operational environments of autonomous rescue robots.
@@ -262,19 +231,9 @@ The results are presented and discussed in Chapter~\ref{chp:results_discussion},
Finally, Chapter~\ref{chp:conclusion_future_work} concludes the thesis by summarizing the main findings, highlighting limitations, and discussing open questions and directions for future work.
\threadtodo
{explain how structure will guide reader from zero knowledge to answer of research question}
{since reader knows what we want to show, an outlook over content is a nice transition}
{state structure of thesis and explain why specific background is necessary for next section}
{reader knows what to expect $\rightarrow$ necessary background info and related work}
\newchapter{background}{Background and Related Work}
\threadtodo
{explain which background knowledge is necessary and why + mention related work}
{reader learns what he needs to know and which related work exists}
{state which background subsections will follow + why and mention related work}
{necessity of knowledge and order of subsections are explained $\rightarrow$ essential background}
This thesis tackles a broad, interdisciplinary challenge at the intersection of robotics, computer vision, and data science. In this chapter, we introduce the background of anomaly detection, the framework in which we formulate our degradation quantification problem. Anomaly detection has its roots in statistical analysis and has been successfully applied in various domains. Recently, the incorporation of learning-based techniques, particularly deep learning, has enabled more efficient and effective analysis of large datasets.
@@ -285,23 +244,12 @@ LiDAR sensors function by projecting lasers in multiple directions near-simultan
\newsection{anomaly_detection}{Anomaly Detection}
\threadtodo
{explain AD in general, allude to DeepSAD which was core method here}
{problem is formulated as AD problem, so reader needs to understand AD}
{give overview of AD goals, categories and challenges. explain we use DeepSAD}
{lots of data but few anomalies + hard to label $\rightarrow$ semi-supervised learning}
Anomaly detection refers to the process of detecting unexpected patterns in data: outliers which deviate significantly from the majority of the data, which is implicitly defined as normal by its prevalence. In classic statistical analysis such techniques have been studied as early as the 19th century~\cite{anomaly_detection_history}. Since then, a multitude of methods and use-cases for them have been proposed and studied. Examples of applications include healthcare, where computer vision algorithms are used to detect anomalies in medical images for diagnostics and early detection of diseases~\cite{anomaly_detection_medical}, detection of fraud in decentralized financial systems based on blockchain technology~\cite{anomaly_detection_defi}, as well as fault detection in industrial machinery using acoustic sound data~\cite{anomaly_detection_manufacturing}.
Figure~\ref{fig:anomaly_detection_overview} depicts a simple but illustrative example of data which can be classified as either normal or anomalous and shows the problem anomaly detection methods generally try to solve. A successful anomaly detection method would learn to differentiate normal from anomalous data, for example by learning the boundaries around the available normal data and classifying a sample as either normal or anomalous based on its location inside or outside of those boundaries. Another possible approach could calculate an analog value which correlates with the likelihood of a sample being anomalous, for example by using the sample's distance from the closest normal data cluster's center.
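To make the second approach concrete, the following is a minimal sketch (illustrative only, not part of the evaluated method) that scores samples by their distance to the nearest center of clusters fitted on normal data, mirroring the clusters $N_1$ and $N_2$ in the figure; the data and cluster count are made up for this example.
\begin{verbatim}
# Illustrative sketch: distance to the nearest "normal" cluster center as score.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal = np.vstack([rng.normal([0, 0], 0.3, (200, 2)),   # cluster N_1
                    rng.normal([4, 4], 0.3, (200, 2))])  # cluster N_2
queries = np.array([[0.1, 0.2],   # lies inside N_1 -> low score
                    [2.0, 2.0]])  # far from both clusters -> high score

centers = KMeans(n_clusters=2, n_init=10, random_state=0).fit(normal).cluster_centers_
scores = np.min(np.linalg.norm(queries[:, None, :] - centers[None, :, :], axis=-1), axis=1)
print(scores)  # analog anomaly scores; a threshold would yield a binary decision
\end{verbatim}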
\figc{anomaly_detection_overview}{figures/anomaly_detection_overview}{An illustrative example of anomalous and normal data containing 2-dimensional data with clusters of normal data $N_1$ and $N_2$ as well as two single anomalies $o_1$ and $o_2$ and a cluster of anomalies $O_3$. Reproduced from~\cite{anomaly_detection_survey}}{width=0.5\textwidth}
\begin{figure}
\begin{center}
\includegraphics[width=0.5\textwidth]{figures/anomaly_detection_overview}
\end{center}
\caption{An illustrative example of anomalous and normal data containing 2-dimensional data with clusters of normal data $N_1$ and $N_2$ as well as two single anomalies $o_1$ and $o_2$ and a cluster of anomalies $O_3$. Reproduced from~\cite{anomaly_detection_survey}}\label{fig:anomaly_detection_overview}
\end{figure}
By their very nature, anomalies are rare and often unpredictable occurrences, which makes it hard to define all possible anomalies in any system. It is also very challenging to create an algorithm capable of detecting anomalies which may never have occurred before and may not have been known to exist when the detection algorithm was created. There are many possible approaches to this problem, though they can be roughly grouped into six distinct categories based on the techniques used~\cite{anomaly_detection_survey}:
@@ -326,11 +274,6 @@ As already shortly mentioned at the beginning of this section, anomaly detection
\newsection{semi_supervised}{Semi-Supervised Learning Algorithms}
\threadtodo
{Give machine learning overview, focus on semi-supervised (what \& why)}
{used method is semi-supervised ML algorithm, reader needs to understand}
{explain what ML is, how the different approaches work, why to use semi-supervised}
{autoencoder special case (un-/self-supervised) used in DeepSAD $\rightarrow$ explain autoencoder}
Machine learning refers to algorithms capable of learning patterns from existing data to perform tasks on previously unseen data, without being explicitly programmed to do so~\cite{machine_learning_first_definition}. Central to many approaches is the definition of an objective function that measures how well the model is performing. The model's parameters are then adjusted to optimize this objective. By leveraging these data-driven methods, machine learning can handle complex tasks across a wide range of domains.
@@ -348,12 +291,7 @@ Aside from the underlying technique, one can also categorize machine learning al
In supervised learning, each input sample is paired with a “ground-truth” label representing the desired output. During training, the model makes a prediction and a loss function quantifies the difference between the prediction and the true label. The learning algorithm then adjusts its parameters to minimize this loss, improving its performance over time. Labels are typically categorical (used for classification tasks, such as distinguishing “cat” from “dog”) or continuous (used for regression tasks, like predicting a temperature or distance). Figure~\ref{fig:ml_learning_schema_concept}~\emph{b)} illustrates this principle with a classification example, where labelled data is used to learn a boundary between two classes.
\figc{ml_learning_schema_concept}{figures/ml_learning_schema_concept.png}{Conceptual illustration of unsupervised (a) and supervised (b) learning. In (a), the inputs are two-dimensional data without labels, and the algorithm groups them into clusters without external guidance. In (b), the inputs have class labels (colors), which serve as training signals for learning a boundary between the two classes. Reproduced from~\cite{ml_supervised_unsupervised_figure_source}.}{width=0.6\textwidth}
\begin{figure}
\centering
\includegraphics[width=0.6\textwidth]{figures/ml_learning_schema_concept.png}
\caption{Conceptual illustration of unsupervised (a) and supervised (b) learning. In (a), the inputs are two-dimensional data without labels, and the algorithm groups them into clusters without external guidance. In (b), the inputs have class labels (colors), which serve as training signals for learning a boundary between the two classes. Reproduced from~\cite{ml_supervised_unsupervised_figure_source}.}
\label{fig:ml_learning_schema_concept}
\end{figure}
In unsupervised learning, models work directly with raw data, without any ground-truth labels to guide the learning process. Instead, they optimize an objective that reflects the discovery of useful structure—whether that is grouping similar data points together or finding a compact representation of the data. For example, cluster analysis partitions the dataset into groups so that points within the same cluster are more similar to each other (according to a chosen similarity metric) than to points in other clusters, as can be seen in the toy example \emph{a)} in figure~\ref{fig:ml_learning_schema_concept}. Dimensionality reduction methods, on the other hand, project high-dimensional data into a lower-dimensional space, optimizing for minimal loss of the original data's meaningful information.
@@ -365,11 +303,6 @@ Machine learning based anomaly detection methods can utilize techniques from all
\newsection{autoencoder}{Autoencoder}
\threadtodo
{Explain how autoencoders work and what they are used for}
{autoencoder used in deepSAD}
{explain basic idea, unfixed architecture, infomax, mention usecases, dimension reduction}
{dimensionality reduction, useful for high dim data $\rightarrow$ pointclouds from lidar}
Autoencoders are a type of neural network architecture whose main goal is learning to encode input data into a representative state from which the same input can be reconstructed, hence the name. They typically consist of two functions, an encoder and a decoder, with a latent space in between them, as depicted in the toy example in figure~\ref{fig:autoencoder_general}. The encoder learns to extract the most significant features from the input and to convert them into the input's latent space representation. The reconstruction goal ensures that the most prominent features of the input are retained during the encoding phase, due to the inherent inability to reconstruct the input if too much relevant information is missing. The decoder simultaneously learns to reconstruct the original input from its encoded latent space representation by minimizing the error between the input sample and the autoencoder's output. This optimization goal creates uncertainty when categorizing autoencoders as an unsupervised method, although the literature commonly defines them as such. While they do not require any labeling of the input data, their optimization target can still calculate the error between the output and the optimal target, which is typically not available for unsupervised methods. For this reason, they are sometimes proposed to be a case of self-supervised learning, a type of machine learning where the data itself can be used to generate a supervisory signal without the need for a domain expert to provide one.
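To make the encoder--decoder structure concrete, the following is a minimal PyTorch sketch of an autoencoder trained with a reconstruction loss; the layer sizes, input dimensionality and data are placeholders and do not correspond to the architecture used later in this thesis.
\begin{verbatim}
# Minimal autoencoder sketch (illustrative only, not the thesis architecture).
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        # encoder: compress the input into its latent space representation
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # decoder: reconstruct the input from the latent representation
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, x):
        z = self.encoder(x)        # latent space representation
        return self.decoder(z)     # reconstruction of the input

model = TinyAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)            # placeholder batch of flattened inputs
for _ in range(10):                # a few illustrative training steps
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)   # reconstruction error
    loss.backward()
    optimizer.step()
\end{verbatim}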
@@ -382,17 +315,12 @@ Autoencoders have been shown to be useful in the anomaly detection domain by ass
\newsection{lidar_related_work}{Lidar - Light Detection and Ranging}
\threadtodo
{Explain how lidars work and what data they produce}
{understand why data is degraded, and how data looks}
{explain how radar/lidar works, usecases, output = pointclouds, what errors}
{rain degradation paper used deepsad $\rightarrow$ explained in detail in next chapter}
LiDAR (Light Detection and Ranging) measures distance by emitting short laser pulses and timing how long they take to return, an approach many may be familiar with from the more commonly known radar technology, which uses radio-frequency pulses and measures their return time to gauge an object's range. Unlike radar, however, LiDAR operates at much shorter wavelengths and can fire millions of pulses per second, achieving millimeter-level precision and dense, high-resolution 3D point clouds. This fine granularity makes LiDAR ideal for applications such as detailed obstacle mapping, surface reconstruction, and autonomous navigation in complex environments.
Because the speed of light in air is effectively constant, multiplying half the roundtrip time by that speed gives the distance between the lidar sensor and the reflecting object, as visualized in figure~\ref{fig:lidar_working_principle}. Modern spinning multibeam LiDAR systems emit millions of these pulses every second. Each pulse is sent at a known combination of horizontal and vertical angles, creating a regular grid of measurements: for example, 32 vertical channels swept through 360° horizontally at a fixed angular spacing. While newer solid-state designs (flash, MEMS, phased-array) are emerging, spinning multi-beam LiDAR remains the most commonly seen type in autonomous vehicles and robotics because of its proven range, reliability, and mature manufacturing base.
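As a worked example with illustrative numbers: a pulse that returns after $\Delta t = 66.7\,\mathrm{ns}$ corresponds to a range of
\[
d = \frac{c \cdot \Delta t}{2} \approx \frac{3\times 10^{8}\,\mathrm{m/s} \cdot 66.7\times 10^{-9}\,\mathrm{s}}{2} \approx 10\,\mathrm{m}.
\]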
\fig{lidar_working_principle}{figures/bg_lidar_principle.png}{An illustration of lidar sensors' working principle. Reproduced from~\cite{bg_lidar_figure_source}}
\figc{lidar_working_principle}{figures/bg_lidar_principle.png}{An illustration of lidar sensors' working principle. Reproduced from~\cite{bg_lidar_figure_source}}{width=.8\textwidth}
Each time a lidar emits and receives a laser pulse, it can use the ray's direction and the calculated distance to produce a single three-dimensional point. By collecting millions of such points each second, the sensor constructs a “point cloud”—a dense set of 3D coordinates relative to the LiDAR's own position. In addition to X, Y, and Z, many LiDARs also record the intensity or reflectivity of each return, providing extra information about the surface properties of the object hit by the pulse.
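The conversion from a single return (range plus known emission angles) to a 3D point is a standard spherical-to-Cartesian transform; the short sketch below illustrates the idea, though the axis convention is an assumption and real sensors document their own coordinate frame.
\begin{verbatim}
# Illustrative conversion of one lidar return to a 3-D point in the sensor frame.
# The axis convention here is an assumption for the example.
import math

def point_from_return(range_m, azimuth_rad, elevation_rad):
    x = range_m * math.cos(elevation_rad) * math.cos(azimuth_rad)
    y = range_m * math.cos(elevation_rad) * math.sin(azimuth_rad)
    z = range_m * math.sin(elevation_rad)
    return x, y, z

print(point_from_return(10.0, math.radians(45.0), math.radians(5.0)))
\end{verbatim}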
@@ -408,58 +336,28 @@ Our method does not aim to remove the noise or degraded points in the lidar data
\newchapter{deepsad}{Deep SAD: Semi-Supervised Anomaly Detection}
\threadtodo
{Introduce DeepSAD, how and why do we use it}
{let reader know why they need to know about Deepsad in detail}
{explain use-case, similar use-case worked, allude to core features}
{interest/curiosity created $\rightarrow$ wants to learn about DeepSAD}
In this chapter, we explore the method \citetitle{deepsad}~(Deep SAD)~\cite{deepsad}, which we employ to quantify the degradation of LiDAR scans caused by airborne particles in the form of artificially introduced water vapor from a theater smoke machine. A similar approach—modeling degradation quantification as an anomaly detection task—was successfully applied in \citetitle{degradation_quantification_rain}~\cite{degradation_quantification_rain} to assess the impact of adverse weather conditions on LiDAR data for autonomous driving applications. Deep SAD leverages deep learning to capture complex anomalous patterns that classical statistical methods might miss. Furthermore, by incorporating a limited amount of hand-labeled data (both normal and anomalous), it can more effectively differentiate between known anomalies and normal data compared to purely unsupervised methods, which typically learn only the most prevalent patterns in the dataset~\cite{deepsad}.
\newsection{algorithm_description}{Algorithm Description}
\threadtodo
{give general overview about how it works}
{overview helps reader understand method, then go into detail}
{how clustering AD generally works, how it does in DeepSAD}
{since the reader knows the general idea $\rightarrow$ what is the step-by-step?}
Deep SAD's overall mechanics are similar to clustering-based anomaly detection methods, which according to \citetitle{anomaly_detection_survey}~\cite{anomaly_detection_survey} typically follow a two-step approach. First, a clustering algorithm groups data points around a centroid; then, the distances of individual data points from this centroid are calculated and used as an anomaly score. In Deep SAD, these concepts are implemented by employing a neural network, which is jointly trained to map input data onto a latent space and to minimize the volume of a data-encompassing hypersphere, whose center is the aforementioned centroid. A sample's geometric distance in the latent space to the hypersphere center is used as its anomaly score, where a larger distance between sample and centroid corresponds to a higher probability of the sample being anomalous. This is achieved by shrinking the data-encompassing hypersphere during training, proportionally to all training data, which requires that significantly more normal than anomalous data is present. The outcome of this approach is that normal data gets clustered more closely around the centroid, while anomalies end up further away from it, as can be seen in the toy example depicted in figure~\ref{fig:deep_svdd_transformation}.
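Writing the trained network as $\phi(\,\cdot\,;\mathcal{W})$ and the hypersphere center as $\mathbf{c}$, this distance-based anomaly score can be restated from~\cite{deepsad} as
\[
s(\mathbf{x}) = \bigl\lVert \phi(\mathbf{x};\mathcal{W}) - \mathbf{c} \bigr\rVert^{2},
\]
where larger scores correspond to samples mapped further from the center; using the squared or unsquared distance changes only the scale, not the ranking.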
\fig{deep_svdd_transformation}{figures/deep_svdd_transformation}{DeepSAD teaches a neural network to transform data into a latent space and minimize the volume of a data-encompassing hypersphere centered around a predetermined centroid $\textbf{c}$. \\Reproduced from~\cite{deepsvdd}.}
\threadtodo
{first step is pre-training, why does it use an autoencoder?}
{go chronologically over how the algorithm works. starting at pre-training}
{pre-training is autoencoder, self-supervised, dimensionality reduction}
{pre-training done $\rightarrow$ how are the pre-training results used?}
Before DeepSAD's training can begin, a pre-training step is required, during which an autoencoder is trained on all available input data. One of DeepSAD's goals is to map input data onto a lower-dimensional latent space, in which the separation between normal and anomalous data can be achieved. To this end, DeepSAD and its predecessor Deep SVDD make use of the autoencoder's reconstruction goal, whose successful training ensures confidence in the encoder architecture's suitability for extracting the input data's most prominent information into the latent space in between the encoder and decoder. DeepSAD goes on to use just the encoder as its main network architecture, discarding the decoder at this step, since reconstruction of the input is unnecessary.
\threadtodo
{what is pre-training output used for, how is centroid calculated and why}
{reader knows about pre-training, what are next steps and how is it used}
{pre-training weights used to init main network, c is mean of forward pass, collapse}
{network built and initialized, centroid fixed $\rightarrow$ start main training}
The pre-training results are used in two more key ways. First, the encoder weights obtained from the autoencoder pre-training initialize DeepSAD's network for the main training phase. Second, we perform an initial forward pass through the encoder on all training samples, and the mean of these latent representations is set as the hypersphere center, $\mathbf{c}$. According to \citeauthor{deepsad}, this initialization method leads to faster convergence during the main training phase compared to using a randomly selected centroid. An alternative would be to compute $\mathbf{c}$ using only the labeled normal examples, which would prevent the center from being influenced by anomalous samples; however, this requires a sufficient number of labeled normal samples. Once defined, the hypersphere center $\mathbf{c}$ remains fixed, as allowing it to be optimized freely could, in the unsupervised case, lead to a hypersphere collapse: a trivial solution where the network learns to map all inputs directly onto the centroid $\mathbf{c}$.
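A minimal sketch of this initialization, assuming a pre-trained encoder and a training dataloader are available (all names here are placeholders, and the small safeguard against near-zero coordinates follows what some reference implementations do):
\begin{verbatim}
# Sketch: fix the hypersphere center c as the mean latent representation of the
# training data. `encoder` and `train_loader` are placeholders.
import torch

@torch.no_grad()
def init_center(encoder, train_loader, eps=0.1):
    latent_sum, n = None, 0
    for x, *_ in train_loader:     # any labels in the batch are ignored here
        z = encoder(x)             # forward pass through the pre-trained encoder
        latent_sum = z.sum(dim=0) if latent_sum is None else latent_sum + z.sum(dim=0)
        n += z.shape[0]
    c = latent_sum / n             # mean of all latent representations
    # push coordinates that are almost zero away from zero, since exactly-zero
    # center coordinates make the trivial collapse solution easier to reach
    c[(c.abs() < eps) & (c < 0)] = -eps
    c[(c.abs() < eps) & (c >= 0)] = eps
    return c
\end{verbatim}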
\threadtodo
{how does the main training work, what data is used, what is the optimization target}
{main training is next step since all preconditions are met}
{main training is SGD backpropagation, minimizing volume, un-/labeled data used}
{network is trained $\rightarrow$ how does one use it for AD?}
In the main training step, DeepSAD's network is trained using SGD backpropagation. The unlabeled training data is used with the goal of minimizing a data-encompassing hypersphere. Since one of the preconditions of training was the significant prevalence of normal data over anomalies in the training set, normal samples collectively cluster more tightly around the centroid, while the rarer anomalous samples do not contribute as significantly to the optimization, resulting in them staying further from the hypersphere center. The labeled data includes binary class labels signifying their status as either normal or anomalous samples. Labeled anomalies are pushed away from the center by defining their optimization target as maximizing the distance between them and $\mathbf{c}$. Labeled normal samples are treated similarly to unlabeled samples, with the difference that DeepSAD includes a hyperparameter controlling the proportion with which labeled and unlabeled data contribute to the overall optimization. The resulting network has learned to map normal data samples closer to $\mathbf{c}$ in the latent space and anomalies further away.
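For reference, the training objective from~\cite{deepsad} combines these ideas as follows, with $n$ unlabeled samples $\mathbf{x}_i$, $m$ labeled samples $(\tilde{\mathbf{x}}_j, \tilde{y}_j)$ where $\tilde{y}_j = +1$ marks labeled normal and $\tilde{y}_j = -1$ labeled anomalous data, $\eta$ the weighting hyperparameter mentioned above, and a standard weight decay term with hyperparameter $\lambda$:
\[
\min_{\mathcal{W}} \; \frac{1}{n+m}\sum_{i=1}^{n} \bigl\lVert \phi(\mathbf{x}_i;\mathcal{W}) - \mathbf{c} \bigr\rVert^{2}
+ \frac{\eta}{n+m}\sum_{j=1}^{m} \Bigl( \bigl\lVert \phi(\tilde{\mathbf{x}}_j;\mathcal{W}) - \mathbf{c} \bigr\rVert^{2} \Bigr)^{\tilde{y}_j}
+ \frac{\lambda}{2}\sum_{\ell=1}^{L} \bigl\lVert \mathbf{W}^{\ell} \bigr\rVert_{F}^{2}.
\]
Raising the squared distance to the power $\tilde{y}_j$ pulls labeled normal samples toward $\mathbf{c}$, while for labeled anomalies the inverse of the distance is minimized, pushing them away from the center.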
\fig{deepsad_procedure}{diagrams/deepsad_procedure/deepsad_procedure}{(WORK IN PROGRESS) Depiction of DeepSAD's training procedure, including data flows and tweakable hyperparameters.}
\threadtodo
{how to use the trained network?}
{since we finished training, we need to know how to utilize it}
{forward pass, calculate distance from c = anomaly score, analog, unknown magnitude}
{General knowledge of the algorithm achieved $\rightarrow$ go into more detail}
To infer whether a previously unknown data sample is normal or anomalous, the sample is fed in a forward pass through the fully trained network. During inference, the centroid $\mathbf{c}$ needs to be known in order to calculate the geometric distance of the sample's latent representation to $\mathbf{c}$. This distance is tantamount to an anomaly score, which correlates with the likelihood of the sample being anomalous. Due to differences in input data type, training success and latent space dimensionality, the anomaly score's magnitude has to be judged on an individual basis for each trained network. This means that scores which signify normal data for one network may very well clearly indicate an anomaly for another network. The geometric distance between two points in space is a scalar analog value, therefore post-processing of the score is necessary if a binary classification into normal and anomalous is desired.
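A minimal inference sketch under the same placeholder names as above: the score is the distance of the sample's latent representation to $\mathbf{c}$, and an optional, network-specific threshold turns the analog score into a binary decision.
\begin{verbatim}
# Sketch: anomaly score of new samples = distance of their latent codes to c.
# `encoder`, `c` and `threshold` are placeholders; a threshold has to be chosen
# per trained network, e.g. from scores observed on held-out data.
import torch

@torch.no_grad()
def anomaly_score(encoder, c, x):
    z = encoder(x)                      # forward pass to the latent space
    return torch.norm(z - c, dim=1)     # one analog score per sample

def is_anomalous(score, threshold):
    return score > threshold            # optional binary post-processing
\end{verbatim}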
@@ -508,11 +406,6 @@ The neural network architecture of DeepSAD is not fixed but rather dependent on
\newchapter{data_preprocessing}{Data and Preprocessing}
\threadtodo
{Introduce data chapter, what will be covered here, incite interest}
{all background covered, deepsad explained, data natural next step}
{emotional why data scarce, lot of data necessary, what will be covered}
{what will we talk about next $\rightarrow$ requirements}
Situations such as earthquakes, structural failures, and other emergencies that require rescue robots are fortunately rare. When these operations do occur, the primary focus is on the rapid and safe rescue of survivors rather than on data collection. Consequently, there is a scarcity of publicly available data from such scenarios. To improve any method, however, a large, diverse, and high-quality dataset is essential for comprehensive evaluation. This challenge is further compounded in our work, as we evaluate a training-based approach that imposes even higher demands on the data, especially requiring a large number of diverse training samples, making it difficult to find a suitable dataset.
@@ -520,11 +413,6 @@ In this chapter, we outline the specific requirements we established for the dat
\newsection{data_req}{Data Requirements and Challenges}
\threadtodo
{list requirements we had for data}
{what were our requirements for choosing a dataset}
{list from basic to more complex with explanations}
{ground truth for evaluation $\rightarrow$ ground truth/labeling challenges}
To ensure our chosen dataset meets the needs of reliable degradation quantification in subterranean rescue scenarios, we imposed the following requirements:
@@ -547,11 +435,6 @@ To ensure our chosen dataset meets the needs of reliable degradation quantificat
\end{enumerate}
\threadtodo
{What are the challenges for correctly labeling the data}
{we alluded to labels being challenging before this}
{difficult to define degradation, difficult to capture, objective leads to puppet}
{with all requirements and challenges know $\rightarrow$ what dataset did we choose}
Quantitative benchmarking of degradation quantification requires a degradation label for every scan. Ideally that label would be a continuous degradation score, although a binary label would still enable meaningful comparison. As the rest of this section shows, producing any reliable label is already challenging and assigning meaningful analog scores may not be feasible at all. Compounding the problem, no public search-and-rescue (SAR) LiDAR dataset offers such ground truth as far as we know. To understand the challenges around labeling lidar data degradation, we will look at what constitutes degradation in this context.
@@ -563,11 +446,6 @@ To mitigate the aforementioned risks we adopt a human-centric, binary labelling
\newsection{data_dataset}{Chosen Dataset}
\threadtodo
{give a comprehensive overview about chosen dataset}
{all requirements/challenges are clear, now reader wants to know about subter dataset}
{overview, domain, sensors, lidar, experiments, volume, statistics}
{statistics about degradation, not full picture $\rightarrow$ how did we preprocess and label}
Based on the previously discussed requirements and the challenges of obtaining reliable labels, we selected the \citetitle{subter} dataset~\cite{subter} for training and evaluation. This dataset comprises multimodal sensor data collected from a robotic platform navigating tunnels and rooms in a subterranean environment, an underground tunnel in Luleå, Sweden. Notably, some experiments incorporated an artificial smoke machine to simulate heavy degradation from aerosol particles, making the dataset particularly well-suited to our use case. A Pioneer 3-AT2 robotic platform, which can be seen in figure~\ref{fig:subter_platform_photo}, was used to mount a multitude of sensors that are described in table~\ref{tab:subter-sensors} and whose mounting locations are depicted in figure~\ref{fig:subter_platform_sketch}.
@@ -638,11 +516,6 @@ Taken together, the percentage of missing points and the proportion of near-sens
\newsection{preprocessing}{Preprocessing Steps and Labeling}
\threadtodo
{explain preprocessing and rationale}
{raw dataset has been explained, how did we change the data before use}
{projection because easier autoencoder, how does projection work, index-based}
{preprocessing known $\rightarrow$ how did we label the data for use}
As described in Section~\ref{sec:algorithm_description}, the method under evaluation is data type agnostic and can be adapted to work with any kind of data by choosing a suitable autoencoder architecture. In our case, the input data are point clouds produced by a lidar sensor. Each point cloud contains up to 65,536 points, with each point represented by its \emph{X}, \emph{Y}, and \emph{Z} coordinates. To tailor the Deep SAD architecture to this specific data type, we would need to design an autoencoder suitable for processing three-dimensional point clouds. Although autoencoders can be developed for various data types, \citetitle{autoencoder_survey}~\cite{autoencoder_survey} noted that over 60\% of recent research on autoencoders focuses on two-dimensional image classification and reconstruction. Consequently, there is a more established understanding of autoencoder architectures for images compared to those for three-dimensional point clouds.
@@ -654,11 +527,6 @@ Figure~\ref{fig:data_projections} displays two examples of LiDAR point cloud pro
\fig{data_projections}{figures/data_2d_projections.png}{Two-dimensional projections of two point clouds, one from an experiment without degradation and one from an experiment with artificial smoke as degradation. To aid the reader's perception, the images are vertically stretched and a colormap has been applied to the pixels' reciprocal range values, while the actual training data is grayscale.}
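The sketch below illustrates an index-based projection of this kind, assuming a structured scan with 32 rings and 2048 measurements per rotation; it simplifies the actual preprocessing (for example, it ignores per-column azimuth offsets taken from the sensor metadata) and all array names are placeholders.
\begin{verbatim}
# Sketch: index-based projection of a structured lidar scan into a 32 x 2048
# range image. `ranges`, `ring` and `col` are placeholder per-point arrays.
import numpy as np

H, W = 32, 2048                                 # vertical channels x points per rotation
ranges = np.random.uniform(0.5, 50.0, H * W)    # placeholder range per return (meters)
ring = np.repeat(np.arange(H), W)               # placeholder vertical channel index
col = np.tile(np.arange(W), H)                  # placeholder position within a rotation

image = np.zeros((H, W), dtype=np.float32)      # zero marks missing returns
image[ring, col] = ranges                       # pure indexing, no trigonometry needed
# e.g. store reciprocal ranges so that close returns appear bright, as in the figure
reciprocal = np.where(image > 0, 1.0 / image, 0.0)
\end{verbatim}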
\threadtodo
{explain labeling techniques and rationales}
{raw dataset has been explained, how did we label the data before use}
{experiment-based labeling, problems, manual labeling for traing, both for evaluation}
{method and labeled preprocessed data known $\rightarrow$ explain experimental setup}
The remaining challenge was labeling a large enough portion of the dataset in a reasonably accurate manner; the associated difficulties and our general approach were described in section~\ref{sec:data_req}. Since, to our knowledge, neither our chosen dataset nor any other publicly available dataset provides objective labels for LiDAR data degradation in the SAR domain, we had to define our own labeling approach. With objective measures of degradation unavailable, we explored alternative labeling methods—such as using the data's statistical properties, like the number of missing measurements per point cloud or the higher incidence of erroneous measurements near the sensor described in section~\ref{sec:data_req}. Ultimately, we were concerned that these statistical approaches might lead the method to simply mimic the statistical evaluation rather than to quantify degradation in a generalized and robust manner. After considering these options, we decided to label all point clouds from experiments with artificial smoke as anomalies, while point clouds from experiments without smoke were labeled as normal data. This labeling strategy—based on the presence or absence of smoke—is fundamentally an environmental indicator, independent of the intrinsic data properties recorded during the experiments.
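A sketch of this experiment-level labelling rule is given below; the metadata field and the $+1$/$-1$ convention for labeled normal and anomalous samples are assumptions for illustration.
\begin{verbatim}
# Sketch: derive a degradation label from the experiment a scan belongs to.
# `has_smoke` is hypothetical metadata; +1 = labeled normal, -1 = labeled anomaly.
def label_scan(experiment_has_smoke: bool) -> int:
    return -1 if experiment_has_smoke else +1

experiments = [{"name": "tunnel_run_01", "has_smoke": False},   # placeholder metadata
               {"name": "smoke_run_03", "has_smoke": True}]
labels = {e["name"]: label_scan(e["has_smoke"]) for e in experiments}
print(labels)   # {'tunnel_run_01': 1, 'smoke_run_03': -1}
\end{verbatim}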
@@ -677,11 +545,6 @@ By evaluating and comparing both approaches, we hope to demonstrate a more thoro
\newchapter{experimental_setup}{Experimental Setup}
\threadtodo
{introduce experimental setup, give overview of what will be covered}
{motivation, bg, method and data is know and understood, how was it used}
{codebase, hardware description overview of training setup, details of deepsad setup}
{overview of chapter given $\rightarrow$ give sequential setup overview}
We built our experiments on the official DeepSAD PyTorch implementation and evaluation framework, available at \url{https://github.com/lukasruff/Deep-SAD-PyTorch}. This codebase provides routines for loading standard datasets, training DeepSAD and several baseline models, and evaluating their performance.
@@ -699,21 +562,11 @@ Together, these components define the full experimental pipeline, from data prep
\section{Framework \& Data Preparation}
\newsubsubsectionNoTOC{DeepSAD PyTorch codebase and our adaptations}
\threadtodo
{Explain deepsad codebase as starting point}
{what is the starting point?}
{codebase, github, dataloading, training, testing, baselines}
{codebase understood $\rightarrow$ how was it adapted}
DeepSAD's PyTorch implementation includes standardized datasets such as MNIST, CIFAR-10 and datasets from \citetitle{odds}~\cite{odds}, as well as suitable network architectures for the corresponding data types. The framework can train and test DeepSAD as well as a number of baseline algorithms, namely SSAD, OCSVM, Isolation Forest, KDE and SemiDGM, on the loaded data and evaluate their performance by calculating the Average Precision as well as the Precision-Recall curve for all given algorithms. We adapted this implementation, which was originally developed for Python 3.7, to work with Python 3.12, changed or added functionality for loading our chosen dataset, added DeepSAD models that work with the lidar projection data type, added more evaluation methods, and implemented an inference module.
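As an example of the kind of evaluation mentioned above, Average Precision and the Precision-Recall curve can be computed from analog anomaly scores and binary ground-truth labels with scikit-learn; the scores and labels below are made up.
\begin{verbatim}
# Sketch: evaluate analog anomaly scores against binary ground-truth labels.
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = [0, 0, 0, 1, 1, 0, 1]                  # 1 = anomalous, 0 = normal (made up)
scores = [0.1, 0.2, 0.35, 0.8, 0.7, 0.3, 0.55]  # made-up anomaly scores

ap = average_precision_score(y_true, scores)
precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(f"Average Precision: {ap:.3f}")
\end{verbatim}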
\newsubsubsectionNoTOC{SubTER dataset preprocessing, train/test splits, and label strategy} \newsubsubsectionNoTOC{SubTER dataset preprocessing, train/test splits, and label strategy}
\threadtodo
{explain how dataloading was adapted}
{loading data first point step to new training}
{preprocessed numpy (script), load, labels/meta, split, k-fold}
{k-fold $\rightarrow$ also adapted in training/testing}
The raw SubTER dataset is provided as one ROS bag file per experiment, each containing a stream of dense 3D point clouds from the Ouster OS1-32 LiDAR. To streamline training and avoid repeated heavy computation, we project these point clouds offline into 2D ``range images'' and save them as NumPy arrays. We apply a spherical projection that maps each LiDAR measurement to a pixel in a 2D image of size Height $\times$ Width, where Height is the number of vertical channels (32) and Width is the number of measurements per rotation (2048). Instead of computing per-point azimuth and elevation angles at runtime, we exploit the sensor's metadata:
@@ -774,7 +627,7 @@ Since the neural network architecture trained in the deepsad method is not fixed
The LeNet-inspired autoencoder can be split into an encoder network (figure~\ref{fig:setup_arch_lenet_encoder}) and a decoder network (figure~\ref{fig:setup_arch_lenet_decoder}) with a latent space between the two parts. Such an arrangement is typical for autoencoder architectures, as we discussed in section~\ref{sec:autoencoder}. The encoder network simultaneously serves as DeepSAD's main training architecture, which, once trained, is used to infer the degradation quantification in our use case.
\fig{setup_arch_lenet_encoder}{diagrams/arch_lenet_encoder}{ \figc{setup_arch_lenet_encoder}{diagrams/arch_lenet_encoder}{
Architecture of the LeNet-inspired encoder. The input is a LiDAR range image of size Architecture of the LeNet-inspired encoder. The input is a LiDAR range image of size
$1\times 2048\times 32$ (channels $\times$ width $\times$ height). The first block (Conv1) applies a $1\times 2048\times 32$ (channels $\times$ width $\times$ height). The first block (Conv1) applies a
$5\times 5$ convolution with 8 output channels, followed by batch normalization, LeakyReLU activation, $5\times 5$ convolution with 8 output channels, followed by batch normalization, LeakyReLU activation,
@@ -785,11 +638,11 @@ The LeNet-inspired autoencoder can be split into an encoder network (figure~\ref
$4\cdot 512 \cdot 8 = 16384$, which maps to the latent space of dimensionality $d$, where $d$ is a tunable $4\cdot 512 \cdot 8 = 16384$, which maps to the latent space of dimensionality $d$, where $d$ is a tunable
hyperparameter ($32 \leq d \leq 1024$ in our experiments). The latent space serves as the compact hyperparameter ($32 \leq d \leq 1024$ in our experiments). The latent space serves as the compact
representation used by DeepSAD for anomaly detection. representation used by DeepSAD for anomaly detection.
} }{width=.8\textwidth}
The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder}) is a compact convolutional neural network that reduces image data into a lower-dimensional latent space. It consists of two stages of convolution, normalization, non-linear activation, and pooling, followed by a dense layer that defines the latent representation. Conceptually, the convolutional layers learn small filters that detect visual patterns in the input (such as edges or textures). Batch normalization ensures that these learned signals remain numerically stable during training, while a LeakyReLU activation introduces non-linearity, allowing the network to capture more complex relationships. Pooling operations then downsample the feature maps, which reduces the spatial size of the data and emphasizes the most important features. Finally, a dense layer transforms the extracted feature maps into the latent vector, which serves as the data's representation in the reduced-dimensionality latent space.
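A compact PyTorch sketch consistent with this description; padding, pooling parameters and bias settings beyond what the figure caption states are assumptions:

\begin{verbatim}
import torch
import torch.nn as nn

class LeNetLikeEncoder(nn.Module):
    """Sketch of the LeNet-inspired encoder: two conv blocks, then FC to the latent space."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5, padding=2, bias=False),  # Conv1: 1 -> 8 channels
            nn.BatchNorm2d(8),
            nn.LeakyReLU(),
            nn.MaxPool2d(2),                                        # 2048x32 -> 1024x16
            nn.Conv2d(8, 4, kernel_size=5, padding=2, bias=False),  # Conv2: 8 -> 4 channels
            nn.BatchNorm2d(4),
            nn.LeakyReLU(),
            nn.MaxPool2d(2),                                        # 1024x16 -> 512x8
        )
        self.fc = nn.Linear(4 * 512 * 8, latent_dim, bias=False)    # 16384 -> d

    def forward(self, x):
        x = self.features(x)          # (N, 4, 512, 8)
        return self.fc(x.flatten(1))  # (N, latent_dim)

z = LeNetLikeEncoder(32)(torch.zeros(2, 1, 2048, 32))
print(z.shape)  # torch.Size([2, 32])
\end{verbatim}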
\fig{setup_arch_lenet_decoder}{diagrams/arch_lenet_decoder}{ \figc{setup_arch_lenet_decoder}{diagrams/arch_lenet_decoder}{
Architecture of the LeNet-inspired decoder. The input is a latent vector of dimension $d$, Architecture of the LeNet-inspired decoder. The input is a latent vector of dimension $d$,
where $d$ is the tunable latent space size ($32 \leq d \leq 1024$ in our experiments). where $d$ is the tunable latent space size ($32 \leq d \leq 1024$ in our experiments).
A fully connected (FC) layer first expands this vector into a feature map of size A fully connected (FC) layer first expands this vector into a feature map of size
@@ -800,13 +653,13 @@ The LeNet-inspired encoder network (see figure~\ref{fig:setup_arch_lenet_encoder
a transpose convolution, reducing the channels to 1. This produces the reconstructed output a transpose convolution, reducing the channels to 1. This produces the reconstructed output
of size $1\times 2048\times 32$, which matches the original input dimensionality required of size $1\times 2048\times 32$, which matches the original input dimensionality required
for the autoencoding objective. for the autoencoding objective.
} }{width=.8\textwidth}
The decoder network (see figure~\ref{fig:setup_arch_lenet_decoder}) mirrors the encoder and reconstructs the input from its latent representation. A dense layer first expands the latent vector into a feature map of shape $4\times 512\times 8$, which is then upsampled and refined in two successive stages. Each stage consists of an interpolation step that doubles the spatial resolution, followed by a transpose convolution that learns how to add structural detail. The first stage operates on 4 channels, and the second on 8 channels, with the final transpose convolution reducing the output to a single channel. The result is a reconstructed output of size $1\times 2048\times 32$, matching the original input dimensionality required for the autoencoding objective.
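A matching decoder sketch under the same caveats; the interpolation mode and transpose-convolution parameters are assumptions:

\begin{verbatim}
import torch
import torch.nn as nn

class LeNetLikeDecoder(nn.Module):
    """Sketch of the LeNet-inspired decoder: FC expansion, then two upsample + deconv stages."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 4 * 512 * 8, bias=False)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.deconv1 = nn.ConvTranspose2d(4, 8, kernel_size=5, padding=2, bias=False)
        self.deconv2 = nn.ConvTranspose2d(8, 1, kernel_size=5, padding=2, bias=False)

    def forward(self, z):
        x = self.fc(z).view(-1, 4, 512, 8)  # (N, 4, 512, 8)
        x = self.deconv1(self.up(x))        # (N, 8, 1024, 16)
        x = self.deconv2(self.up(x))        # (N, 1, 2048, 32)
        return x

out = LeNetLikeDecoder(32)(torch.zeros(2, 32))
print(out.shape)  # torch.Size([2, 1, 2048, 32])
\end{verbatim}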
Even though the LeNet-inspired encoder proved capable of achieving our degradation quantification objective in initial experiments, we identified several shortcomings that motivated the design of a second, more efficient architecture. The most important issue concerns the shape of the CNN's receptive field (RF), which describes the region of the input that influences a single output activation. Its size and aspect ratio determine which structures the network can effectively capture: if the RF is too small, larger patterns cannot be detected, while an excessively large RF may hinder the network from learning to recognize fine details. For standard image data, the RF is often expressed as a symmetric $n \times n$ region, but in principle it can be computed independently per axis.
\fig{setup_ef_concept}{figures/setup_ef_concept}{Receptive fields in a CNN. Each output activation aggregates information from a region of the input; stacking layers expands this region, while kernel size, stride, and padding control how quickly it grows and what shape it takes. (A) illustrates slower, fine-grained growth; (B) shows faster expansion, producing a larger—potentially anisotropic—receptive field and highlighting the trade-off between detail and context. Reproduced from~\cite{ef_concept_source}} \figc{setup_ef_concept}{figures/setup_ef_concept}{Receptive fields in a CNN. Each output activation aggregates information from a region of the input; stacking layers expands this region, while kernel size, stride, and padding control how quickly it grows and what shape it takes. (A) illustrates slower, fine-grained growth; (B) shows faster expansion, producing a larger—potentially anisotropic—receptive field and highlighting the trade-off between detail and context. Reproduced from~\cite{ef_concept_source}}{width=.6\textwidth}
The issue with the RF's shape arises from the fact that spinning multi-beam LiDARs often produce point clouds with dense horizontal but limited vertical resolution. In our case, this results in a pixel-per-degree resolution of approximately $1.01\,\sfrac{pixel}{deg}$ vertically and $5.69\,\sfrac{pixel}{deg}$ horizontally. Consequently, the LeNet-inspired encoder's calculated receptive field of $16 \times 16$ pixels translates to an angular size of $15.88^{\circ} \times 2.81^{\circ}$ (vertical $\times$ horizontal), which is highly rectangular in angular space. Such a mismatch risks limiting the network's ability to capture degradation patterns that extend differently across the two axes.
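A small sketch of this per-axis computation, using the standard receptive-field recurrence $r \leftarrow r + (k-1)\,j$, $j \leftarrow j \cdot s$ over (kernel, stride) pairs; the layer list reflects the encoder blocks described above (pooling assumed to be $2\times 2$ with stride 2), and the pixel-per-degree values are those stated in the text:

\begin{verbatim}
def receptive_field(layers):
    """Per-axis receptive field for a chain of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# LeNet-inspired encoder, identical along both axes: conv(5,1), pool(2,2), conv(5,1), pool(2,2)
layers = [(5, 1), (2, 2), (5, 1), (2, 2)]
rf_pixels = receptive_field(layers)        # 16 pixels per axis

px_per_deg_horizontal = 2048 / 360.0       # ~5.69 pixel/deg (2048 measurements over 360 deg)
px_per_deg_vertical = 1.01                 # ~1.01 pixel/deg (sensor-dependent)
print(rf_pixels,
      rf_pixels / px_per_deg_vertical,     # ~15.9 deg vertically
      rf_pixels / px_per_deg_horizontal)   # ~2.8 deg horizontally
\end{verbatim}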
@@ -889,34 +742,7 @@ To compare the computational efficiency of the two architectures we show the num
\label{tab:params_lenet_vs_efficient} \label{tab:params_lenet_vs_efficient}
\end{table} \end{table}
\threadtodo
{how was training/testing adapted (networks overview), inference, ae tuning}
{data has been loaded, how is it processed}
{networks defined, training/testing k-fold, more metrics, inference + ae tuning implemented}
{training procesure known $\rightarrow$ what methods were evaluated}
\threadtodo
{custom arch necessary, first lenet then second arch to evaluate importance of arch}
{training process understood, but what networks were actually trained}
{custom arch, lenet from paper and simple, receptive field problem, arch really important?}
{motivation behind archs given $\rightarrow$ what do they look like}
\threadtodo
{show and explain both archs}
{we know why we need them but what do they look like}
{visualization of archs, explain LeNet and why other arch was chosen that way}
{both archs known $\rightarrow$ what about the other inputs/hyperparameters}
\threadtodo
{give overview of hyperparameters}
{deepsad arch known, other hyperparameters?}
{LR, eta, epochs, latent space size (hyper param search), semi labels}
{everything that goes into training known $\rightarrow$ what experiments were actually done?}
\newsubsubsectionNoTOC{Baseline methods (Isolation Forest, one-class SVM) and feature extraction via the encoder} \newsubsubsectionNoTOC{Baseline methods (Isolation Forest, one-class SVM) and feature extraction via the encoder}
\threadtodo
{what methods were evaluated}
{we know what testing/training was implemented for deepsad, but what is it compared to}
{isoforest, ocsvm adapted, for ocsvm only dim reduced feasible (ae from deepsad)}
{compared methods known $\rightarrow$ what methods were used}
To contextualize the performance of DeepSAD, we compare against two widely used baselines: Isolation Forest and One-Class SVM (OCSVM). Both are included in the original DeepSAD codebase and the associated paper, and they represent well-understood but conceptually different families of anomaly detection. In our setting, the raw input dimensionality ($2048 \times 32$ per frame) is too high for a direct OCSVM fit, so we reuse the DeepSAD autoencoder's \emph{encoder} as a learned dimensionality reduction (to the same latent size as DeepSAD). This choice is motivated by practicality (compute) and inductive bias: the encoder captures non-linear, domain-specific structure of the LiDAR range images, which linear methods like PCA may miss. Together, these two baselines cover complementary perspectives: tree-based partitioning (Isolation Forest) and kernel-based boundary learning (OCSVM), providing a broad and well-established basis for comparison.
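The following sketch shows how such baselines can be fit on encoder features with scikit-learn; the latent vectors are random placeholders and the hyperparameters are illustrative rather than the values used in our experiments:

\begin{verbatim}
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Placeholders for encoder outputs: latent vectors of training and test frames.
z_train = rng.normal(size=(500, 32))
z_test = rng.normal(size=(100, 32))

# Isolation Forest: tree-based partitioning. score_samples is higher for normal
# points, so we negate it to obtain an anomaly score.
iso = IsolationForest(n_estimators=100, random_state=0).fit(z_train)
iso_scores = -iso.score_samples(z_test)

# One-Class SVM: kernel-based boundary learning on the same latent features.
ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(z_train)
ocsvm_scores = -ocsvm.decision_function(z_test)
\end{verbatim}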
@@ -935,17 +761,7 @@ We adapted the baseline implementations to our data loader and input format, and
\newsection{setup_experiments_environment}{Experiment Overview \& Computational Environment} \newsection{setup_experiments_environment}{Experiment Overview \& Computational Environment}
\threadtodo
{\textit{"What should the reader know after reading this section?"}}
{\textit{"Why is that of interest to the reader at this point?"}}
{\textit{"How am I achieving the stated goal?"}}
{\textit{"How does this lead to the next question or section?"}}
\threadtodo
{give overview of experiments and their motivations}
{training setup clear, but not what was trained/tested}
{explanation of what was searched for (ae latent space first), other hyperparams and why}
{all experiments known $\rightarrow$ how long do they take to train}
Our experimental setup consisted of two stages. First, we conducted a hyperparameter search over the latent space dimensionality by pretraining the autoencoders alone. For both the LeNet-inspired and the Efficient network, we evaluated latent space sizes of $32, 64, 128, 256, 384, 512, 768,$ and $1024$. Each autoencoder was trained for 50~epochs with a learning rate of $1\cdot 10^{-5}$, and results were averaged across 5-fold cross-validation. The goal of this stage was to identify the ``elbow point'' in reconstruction loss curves, which serves as a practical indicator of a sufficiently expressive, yet compact, representation.
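As a sketch, this sweep can be pictured as a grid loop over latent sizes with 5-fold cross-validation; the pretraining and loss routines below are hypothetical stand-ins for the framework's actual functions:

\begin{verbatim}
import numpy as np
from sklearn.model_selection import KFold

# Stand-ins for the framework's pretraining and evaluation routines (hypothetical).
def pretrain_autoencoder(train_idx, latent_dim, epochs=50, lr=1e-5):
    return {"latent_dim": latent_dim}        # would return a trained autoencoder

def reconstruction_loss(model, val_idx):
    return 1.0 / model["latent_dim"]         # would return the mean validation loss

latent_dims = [32, 64, 128, 256, 384, 512, 768, 1024]
frames = np.arange(1000)                     # placeholder for the preprocessed frames

results = {}
for d in latent_dims:
    fold_losses = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(frames):
        model = pretrain_autoencoder(train_idx, latent_dim=d)
        fold_losses.append(reconstruction_loss(model, val_idx))
    results[d] = float(np.mean(fold_losses))  # averaged over the 5 folds

print(results)  # inspected for the "elbow point" across latent dimensions
\end{verbatim}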
@@ -1006,11 +822,6 @@ Table~\ref{tab:exp_grid} summarizes the full experiment matrix.
\threadtodo
{give overview about hardware setup and how long things take to train}
{we know what we trained but not how long that takes}
{table of hardware and of how long different trainings took}
{experiment setup understood $\rightarrow$ what were the experiments' results}
These experiments were run on a computational environment whose hardware and software stack we summarize in table~\ref{tab:system_setup}.
@@ -1124,11 +935,6 @@ Inference latency per sample is presented in Table~\ref{tab:inference_latency_co
Together, these results provide a comprehensive overview of the computational requirements of our experimental setup. They show that while our deep semi-supervised approach is significantly more demanding during training than classical baselines, it remains highly efficient at inference, which is the decisive factor for deployment in time-critical domains such as rescue robotics.
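A minimal sketch of how per-sample inference latency can be measured for a trained scoring network; the model here is a trivial stand-in, and the warm-up and run counts are arbitrary choices:

\begin{verbatim}
import time
import torch

# Illustrative latency measurement for a single-frame forward pass.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(2048 * 32, 32)).eval()
frame = torch.zeros(1, 1, 2048, 32)

with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(frame)
    runs = 100
    start = time.perf_counter()
    for _ in range(runs):
        model(frame)
    latency_ms = (time.perf_counter() - start) / runs * 1e3

print(f"mean per-sample latency: {latency_ms:.2f} ms")
\end{verbatim}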
\threadtodo
{Introduce the structure and scope of the results chapter}
{The reader knows the experiments from the previous chapter, but not the outcomes}
{State that we will first analyze autoencoder results, then anomaly detection performance, and finally inference experiments}
{Clear roadmap $\rightarrow$ prepares reader for detailed sections}
\newchapter{results_discussion}{Results and Discussion} \newchapter{results_discussion}{Results and Discussion}
@@ -1160,11 +966,11 @@ The results of pretraining the two autoencoder architectures are summarized in T
\end{tabularx} \end{tabularx}
\end{table} \end{table}
\fig{ae_loss_overall}{figures/ae_elbow_test_loss_overall.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures.} \figc{ae_loss_overall}{figures/ae_elbow_test_loss_overall.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures.}{width=.9\textwidth}
Because overall reconstruction loss might obscure how well encoders represent anomalous samples, we additionally evaluate reconstruction errors only on degraded samples from hand-labeled smoke segments (Figure~\ref{fig:ae_loss_degraded}). As expected, reconstruction losses are about 0.05 higher on these challenging samples than in the overall evaluation. However, the relative advantage of the Efficient architecture remains, suggesting that its improvements extend to anomalous inputs as well.
\fig{ae_loss_degraded}{figures/ae_elbow_test_loss_anomaly.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures, evaluated only on degraded data from hand-labeled smoke experiments.} \figc{ae_loss_degraded}{figures/ae_elbow_test_loss_anomaly.png}{Reconstruction loss across latent dimensions for LeNet-inspired and Efficient architectures, evaluated only on degraded data from hand-labeled smoke experiments.}{width=.9\textwidth}
Since only per-sample reconstruction losses were retained during pretraining, we report results in reciprocal-range MSE space. While more interpretable metrics in meters and distance-binned analyses would be desirable, the downstream anomaly detection performance did not differ starkly between encoders, so we did not pursue this additional evaluation. Future work could extend the pretraining analysis with physically interpretable metrics.
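For reference, a small sketch of the relationship between metric range and the reciprocal-range space assumed here; the exact normalization and valid-range handling used in the codebase may differ:

\begin{verbatim}
import numpy as np

# Illustrative comparison of reciprocal-range MSE and a metric alternative.
ranges_m = np.array([1.5, 4.0, 12.0, 40.0])       # ground-truth ranges in meters
recon_m = np.array([1.6, 3.7, 13.5, 35.0])        # reconstructed ranges in meters

inv_true = 1.0 / ranges_m                         # reciprocal-range representation
inv_recon = 1.0 / recon_m

mse_reciprocal = np.mean((inv_true - inv_recon) ** 2)
mae_meters = np.mean(np.abs(ranges_m - recon_m))  # physically interpretable alternative

print(mse_reciprocal, mae_meters)
\end{verbatim}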
@@ -1234,14 +1040,14 @@ Figure~\ref{fig:latent_dim_ap} shows how average precision changes with latent d
This effect is clearly visible in the precision--recall curves. For DeepSAD at all dimensionalities, we observe high initial precision and a steep drop once the evaluation demands that mislabeled anomalies be included. However, the sharpness of this drop depends on the latent size: at 32 dimensions the fall is comparatively gradual, while at 1024 it is almost vertical. In practice, this means that higher-dimensional latent spaces amplify the label-noise problem and lead to sudden precision collapses once the clear anomalies have been detected. Compact latent spaces are therefore more robust under noisy evaluation conditions and appear to be the safer choice for real-world deployment.
\fig{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows inverse correlation between AP and latent space size.} \figc{latent_dim_ap}{figures/results_ap_over_latent.png}{AP as a function of latent dimension (experiment-based evaluation). DeepSAD shows inverse correlation between AP and latent space size.}{width=.7\textwidth}
\paragraph{Effect of semi-supervised labeling.} \paragraph{Effect of semi-supervised labeling.}
Table~\ref{tab:results_ap} shows that the unsupervised regime \((0/0)\) achieves the best AP, while the lightly supervised regime \((50/10)\) performs worst. With many labels \((500/100)\), performance improves again but remains slightly below the unsupervised case. This pattern also appears under the hand-labeled evaluation, which excludes mislabeled frames. The drop with light supervision therefore cannot be explained by noisy evaluation targets, but must stem from the training process itself.
The precision--recall curves in Figure~\ref{fig:prc_over_semi} show that the overall curve shapes are similar across regimes, but shifted relative to one another in line with the AP ordering \((0/0) > (500/100) > (50/10)\). We attribute these shifts to overfitting: when only a few anomalies are labeled, the model fits them too strongly, and if those examples differ too much from other anomalies, generalization suffers. This explains why lightly supervised training performs even worse than unsupervised training, which avoids this bias.
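For completeness, AP and precision--recall curves of anomaly scores can be computed as in the following sketch, here with random placeholder labels and scores rather than our actual model outputs:

\begin{verbatim}
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.default_rng(0)
# Placeholders: 1 = anomalous (smoke), 0 = normal; scores are the model's anomaly scores.
y_true = rng.integers(0, 2, size=200)
scores = y_true * 0.5 + rng.normal(scale=0.5, size=200)

ap = average_precision_score(y_true, scores)                  # summary of the PR curve
precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(f"AP = {ap:.3f}")
\end{verbatim}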
\fig{prc_over_semi}{figures/results_prc_over_semi.png}{Precision--recall curves at latent dimension~32 for all three labeling regimes (unsupervised, lightly supervised, heavily supervised), shown separately for the LeNet-inspired (left) and Efficient (right) encoders. Baseline methods are included for comparison. Latent dimension~32 is shown as it achieved the best overall AP and is representative of the typical PRC shapes across dimensions.} \figc{prc_over_semi}{figures/results_prc_over_semi.png}{Precision--recall curves at latent dimension~32 for all three labeling regimes (unsupervised, lightly supervised, heavily supervised), shown separately for the LeNet-inspired (left) and Efficient (right) encoders. Baseline methods are included for comparison. Latent dimension~32 is shown as it achieved the best overall AP and is representative of the typical PRC shapes across dimensions.}{width=.7\textwidth}
The LeNet variant illustrates this effect most clearly, showing unusually high variance across folds in the lightly supervised case. In several folds, precision drops atypically early, which supports the idea that the model has overfit to a poorly chosen subset of labeled anomalies. The Efficient variant is less affected and maintains more stable precision plateaus, suggesting that it is more robust to such overfitting; we observe this consistently across nearly all latent dimensionalities.
@@ -1341,8 +1147,6 @@ Several promising avenues remain open for future exploration:
In summary, while this thesis demonstrates the feasibility of using anomaly detection for lidar degradation quantification, significant challenges remain. Chief among them are the definition and collection of ground truth, the development of analog evaluation targets, and architectural adaptations for more complex real-world scenarios. Addressing these challenges will be critical for moving from proof-of-concept to practical deployment in rescue robotics and beyond.
% end mainmatter % end mainmatter
% ************************************************************************************************** % **************************************************************************************************

View File

@@ -147,7 +147,7 @@
% standard % standard
\newcommand{\fig}[3]{\begin{figure}\centering\includegraphics[width=\textwidth]{#2}\caption{#3}\label{fig:#1}\end{figure}}% \newcommand{\fig}[3]{\begin{figure}\centering\includegraphics[width=\textwidth]{#2}\caption{#3}\label{fig:#1}\end{figure}}%
% with controllable parameters % with controllable parameters
\newcommand{\figc}[4]{\begin{figure}\centering\includegraphics[#1]{#2}\caption{#3}\label{fig:#4}\end{figure}}% \newcommand{\figc}[4]{\begin{figure}\centering\includegraphics[#4]{#2}\caption{#3}\label{fig:#1}\end{figure}}%
% two subfigures % two subfigures
\newcommand{\twofig}[6]{\begin{figure}\centering% \newcommand{\twofig}[6]{\begin{figure}\centering%
\subfigure[#2]{\includegraphics[width=0.495\textwidth]{#1}}% \subfigure[#2]{\includegraphics[width=0.495\textwidth]{#1}}%

Binary file not shown.

Binary file not shown.

Binary file not shown.

View File

@@ -30,7 +30,8 @@ arch = [
height=H8 * 1.6, height=H8 * 1.6,
depth=D1, depth=D1,
width=W1, width=W1,
caption=f"Latent Space", caption="Latent Space",
captionshift=0,
), ),
# to_connection("fc1", "latent"), # to_connection("fc1", "latent"),
# --------------------------- DECODER --------------------------- # --------------------------- DECODER ---------------------------
@@ -39,19 +40,20 @@ arch = [
"fc3", "fc3",
n_filer="{{8×128×8}}", n_filer="{{8×128×8}}",
zlabeloffset=0.5, zlabeloffset=0.5,
offset="(2,0,0)", offset="(2,-.5,0)",
to="(latent-east)", to="(latent-east)",
height=H1, height=H1,
depth=D512, depth=D512,
width=W1, width=W1,
caption=f"FC", caption=f"FC",
captionshift=20,
), ),
to_Conv( to_Conv(
"unsqueeze", "unsqueeze",
s_filer="{{128×8}}", s_filer="{{128×8}}",
zlabeloffset=0.4, zlabeloffset=0.4,
n_filer=32, n_filer=32,
offset="(2,0,0)", offset="(1.4,0,0)",
to="(fc3-east)", to="(fc3-east)",
height=H8, height=H8,
depth=D128, depth=D128,
@@ -62,7 +64,7 @@ arch = [
# Reshape to 4×8×512 # Reshape to 4×8×512
to_UnPool( to_UnPool(
"up1", "up1",
offset="(2,0,0)", offset="(1.2,0,0)",
n_filer=32, n_filer=32,
to="(unsqueeze-east)", to="(unsqueeze-east)",
height=H16, height=H16,
@@ -101,7 +103,8 @@ arch = [
height=H16, height=H16,
depth=D1024, depth=D1024,
width=W32, width=W32,
caption="", caption="Deconv2",
captionshift=20,
), ),
to_Conv( to_Conv(
"dwdeconv3", "dwdeconv3",
@@ -112,7 +115,7 @@ arch = [
height=H16, height=H16,
depth=D1024, depth=D1024,
width=W1, width=W1,
caption="Deconv2", caption="",
), ),
to_Conv( to_Conv(
"dwdeconv4", "dwdeconv4",
@@ -134,7 +137,8 @@ arch = [
height=H32, height=H32,
depth=D2048, depth=D2048,
width=W16, width=W16,
caption="", caption="Deconv3",
captionshift=10,
), ),
to_Conv( to_Conv(
"dwdeconv5", "dwdeconv5",
@@ -145,7 +149,7 @@ arch = [
height=H32, height=H32,
depth=D2048, depth=D2048,
width=W1, width=W1,
caption="Deconv3", caption="",
), ),
to_Conv( to_Conv(
"dwdeconv6", "dwdeconv6",
@@ -164,7 +168,7 @@ arch = [
s_filer="{{2048×32}}", s_filer="{{2048×32}}",
zlabeloffset=0.15, zlabeloffset=0.15,
n_filer=1, n_filer=1,
offset="(2,0,0)", offset="(1.5,0,0)",
to="(dwdeconv6-east)", to="(dwdeconv6-east)",
height=H32, height=H32,
depth=D2048, depth=D2048,
@@ -178,12 +182,13 @@ arch = [
s_filer="{{2048×32}}", s_filer="{{2048×32}}",
zlabeloffset=0.15, zlabeloffset=0.15,
n_filer=1, n_filer=1,
offset="(2,0,0)", offset="(1.5,0,0)",
to="(outconv-east)", to="(outconv-east)",
height=H32, height=H32,
depth=D2048, depth=D2048,
width=W1, width=W1,
caption="Output", caption="Output",
captionshift=5,
), ),
# to_connection("deconv2", "out"), # to_connection("deconv2", "out"),
to_end(), to_end(),

View File

@@ -28,6 +28,7 @@
{Box={ {Box={
name=latent, name=latent,
caption=Latent Space, caption=Latent Space,
captionshift=0,
xlabel={{, }}, xlabel={{, }},
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel=latent dim, zlabel=latent dim,
@@ -39,10 +40,11 @@
}; };
\pic[shift={(2,0,0)}] at (latent-east) \pic[shift={(2,-.5,0)}] at (latent-east)
{Box={ {Box={
name=fc3, name=fc3,
caption=FC, caption=FC,
captionshift=20,
xlabel={{" ","dummy"}}, xlabel={{" ","dummy"}},
zlabeloffset=0.5, zlabeloffset=0.5,
zlabel={{8×128×8}}, zlabel={{8×128×8}},
@@ -55,10 +57,11 @@
}; };
\pic[shift={(2,0,0)}] at (fc3-east) \pic[shift={(1.4,0,0)}] at (fc3-east)
{Box={ {Box={
name=unsqueeze, name=unsqueeze,
caption=Unsqueeze, caption=Unsqueeze,
captionshift=0,
xlabel={{32, }}, xlabel={{32, }},
zlabeloffset=0.4, zlabeloffset=0.4,
zlabel={{128×8}}, zlabel={{128×8}},
@@ -70,10 +73,11 @@
}; };
\pic[shift={ (2,0,0) }] at (unsqueeze-east) \pic[shift={ (1.2,0,0) }] at (unsqueeze-east)
{Box={ {Box={
name=up1, name=up1,
caption=, caption=,
captionshift=0,
fill=\UnpoolColor, fill=\UnpoolColor,
opacity=0.5, opacity=0.5,
xlabel={{32, }}, xlabel={{32, }},
@@ -88,6 +92,7 @@
{Box={ {Box={
name=dwdeconv1, name=dwdeconv1,
caption=Deconv1, caption=Deconv1,
captionshift=0,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel=, zlabel=,
@@ -103,6 +108,7 @@
{Box={ {Box={
name=dwdeconv2, name=dwdeconv2,
caption=, caption=,
captionshift=0,
xlabel={{32, }}, xlabel={{32, }},
zlabeloffset=0.4, zlabeloffset=0.4,
zlabel={{256×16}}, zlabel={{256×16}},
@@ -117,7 +123,8 @@
\pic[shift={ (2,0,0) }] at (dwdeconv2-east) \pic[shift={ (2,0,0) }] at (dwdeconv2-east)
{Box={ {Box={
name=up2, name=up2,
caption=, caption=Deconv2,
captionshift=20,
fill=\UnpoolColor, fill=\UnpoolColor,
opacity=0.5, opacity=0.5,
xlabel={{32, }}, xlabel={{32, }},
@@ -131,7 +138,8 @@
\pic[shift={(0,0,0)}] at (up2-east) \pic[shift={(0,0,0)}] at (up2-east)
{Box={ {Box={
name=dwdeconv3, name=dwdeconv3,
caption=Deconv2, caption=,
captionshift=0,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel=, zlabel=,
@@ -147,6 +155,7 @@
{Box={ {Box={
name=dwdeconv4, name=dwdeconv4,
caption=, caption=,
captionshift=0,
xlabel={{16, }}, xlabel={{16, }},
zlabeloffset=0.17, zlabeloffset=0.17,
zlabel={{1024×16}}, zlabel={{1024×16}},
@@ -161,7 +170,8 @@
\pic[shift={ (2,0,0) }] at (dwdeconv4-east) \pic[shift={ (2,0,0) }] at (dwdeconv4-east)
{Box={ {Box={
name=up3, name=up3,
caption=, caption=Deconv3,
captionshift=10,
fill=\UnpoolColor, fill=\UnpoolColor,
opacity=0.5, opacity=0.5,
xlabel={{16, }}, xlabel={{16, }},
@@ -175,7 +185,8 @@
\pic[shift={(0,0,0)}] at (up3-east) \pic[shift={(0,0,0)}] at (up3-east)
{Box={ {Box={
name=dwdeconv5, name=dwdeconv5,
caption=Deconv3, caption=,
captionshift=0,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel=, zlabel=,
@@ -191,6 +202,7 @@
{Box={ {Box={
name=dwdeconv6, name=dwdeconv6,
caption=, caption=,
captionshift=0,
xlabel={{8, }}, xlabel={{8, }},
zlabeloffset=0.15, zlabeloffset=0.15,
zlabel={{2048×32}}, zlabel={{2048×32}},
@@ -202,10 +214,11 @@
}; };
\pic[shift={(2,0,0)}] at (dwdeconv6-east) \pic[shift={(1.5,0,0)}] at (dwdeconv6-east)
{Box={ {Box={
name=outconv, name=outconv,
caption=Deconv4, caption=Deconv4,
captionshift=0,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.15, zlabeloffset=0.15,
zlabel={{2048×32}}, zlabel={{2048×32}},
@@ -217,10 +230,11 @@
}; };
\pic[shift={(2,0,0)}] at (outconv-east) \pic[shift={(1.5,0,0)}] at (outconv-east)
{Box={ {Box={
name=out, name=out,
caption=Output, caption=Output,
captionshift=5,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.15, zlabeloffset=0.15,
zlabel={{2048×32}}, zlabel={{2048×32}},

Binary file not shown.

View File

@@ -125,7 +125,7 @@ arch = [
n_filer=8, n_filer=8,
zlabeloffset=0.45, zlabeloffset=0.45,
s_filer="{{128×8}}", s_filer="{{128×8}}",
offset="(2,0,0)", offset="(1,0,0)",
to="(pool3-east)", to="(pool3-east)",
height=H8, height=H8,
depth=D128, depth=D128,
@@ -137,12 +137,13 @@ arch = [
"fc1", "fc1",
n_filer="{{8×128×8}}", n_filer="{{8×128×8}}",
zlabeloffset=0.5, zlabeloffset=0.5,
offset="(2,0,0)", offset="(2,-.5,0)",
to="(squeeze-east)", to="(squeeze-east)",
height=H1, height=H1,
depth=D512, depth=D512,
width=W1, width=W1,
caption=f"FC", caption="FC",
captionshift=0,
), ),
# to_connection("pool2", "fc1"), # to_connection("pool2", "fc1"),
# --------------------------- LATENT --------------------------- # --------------------------- LATENT ---------------------------
@@ -150,7 +151,7 @@ arch = [
"latent", "latent",
n_filer="", n_filer="",
s_filer="latent dim", s_filer="latent dim",
offset="(2,0,0)", offset="(1.3,0.5,0)",
to="(fc1-east)", to="(fc1-east)",
height=H8 * 1.6, height=H8 * 1.6,
depth=D1, depth=D1,

View File

@@ -28,6 +28,7 @@
{Box={ {Box={
name=input, name=input,
caption=Input, caption=Input,
captionshift=0,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.2, zlabeloffset=0.2,
zlabel={{2048×32}}, zlabel={{2048×32}},
@@ -43,6 +44,7 @@
{Box={ {Box={
name=dwconv1, name=dwconv1,
caption=, caption=,
captionshift=0,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel=, zlabel=,
@@ -58,6 +60,7 @@
{Box={ {Box={
name=dwconv2, name=dwconv2,
caption=Conv1, caption=Conv1,
captionshift=0,
xlabel={{16, }}, xlabel={{16, }},
zlabeloffset=0.15, zlabeloffset=0.15,
zlabel={{2048×32}}, zlabel={{2048×32}},
@@ -76,6 +79,7 @@
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel={{512×32}}, zlabel={{512×32}},
caption=, caption=,
captionshift=0,
fill=\PoolColor, fill=\PoolColor,
opacity=0.5, opacity=0.5,
height=26, height=26,
@@ -89,6 +93,7 @@
{Box={ {Box={
name=dwconv3, name=dwconv3,
caption=, caption=,
captionshift=0,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel=, zlabel=,
@@ -104,6 +109,7 @@
{Box={ {Box={
name=dwconv4, name=dwconv4,
caption=Conv2, caption=Conv2,
captionshift=0,
xlabel={{32, }}, xlabel={{32, }},
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel={{512×32}}, zlabel={{512×32}},
@@ -122,6 +128,7 @@
zlabeloffset=0.45, zlabeloffset=0.45,
zlabel={{256×16}}, zlabel={{256×16}},
caption=, caption=,
captionshift=0,
fill=\PoolColor, fill=\PoolColor,
opacity=0.5, opacity=0.5,
height=18, height=18,
@@ -138,6 +145,7 @@
zlabeloffset=0.45, zlabeloffset=0.45,
zlabel={{128×8}}, zlabel={{128×8}},
caption=, caption=,
captionshift=0,
fill=\PoolColor, fill=\PoolColor,
opacity=0.5, opacity=0.5,
height=12, height=12,
@@ -147,10 +155,11 @@
}; };
\pic[shift={(2,0,0)}] at (pool3-east) \pic[shift={(1,0,0)}] at (pool3-east)
{Box={ {Box={
name=squeeze, name=squeeze,
caption=Squeeze, caption=Squeeze,
captionshift=0,
xlabel={{8, }}, xlabel={{8, }},
zlabeloffset=0.45, zlabeloffset=0.45,
zlabel={{128×8}}, zlabel={{128×8}},
@@ -162,10 +171,11 @@
}; };
\pic[shift={(2,0,0)}] at (squeeze-east) \pic[shift={(2,-.5,0)}] at (squeeze-east)
{Box={ {Box={
name=fc1, name=fc1,
caption=FC, caption=FC,
captionshift=0,
xlabel={{" ","dummy"}}, xlabel={{" ","dummy"}},
zlabeloffset=0.5, zlabeloffset=0.5,
zlabel={{8×128×8}}, zlabel={{8×128×8}},
@@ -178,10 +188,11 @@
}; };
\pic[shift={(2,0,0)}] at (fc1-east) \pic[shift={(1.3,0.5,0)}] at (fc1-east)
{Box={ {Box={
name=latent, name=latent,
caption=Latent Space, caption=Latent Space,
captionshift=0,
xlabel={{, }}, xlabel={{, }},
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel=latent dim, zlabel=latent dim,

Binary file not shown.

View File

@@ -39,19 +39,20 @@ arch = [
"fc3", "fc3",
n_filer="{{4×512×8}}", n_filer="{{4×512×8}}",
zlabeloffset=0.35, zlabeloffset=0.35,
offset="(2,0,0)", offset="(2,-.5,0)",
to="(latent-east)", to="(latent-east)",
height=1.3, height=1.3,
depth=D512, depth=D512,
width=W1, width=W1,
caption=f"FC", caption=f"FC",
captionshift=20,
), ),
# to_connection("latent", "fc3"), # to_connection("latent", "fc3"),
# Reshape to 4×8×512 # Reshape to 4×8×512
to_UnPool( to_UnPool(
"up1", "up1",
n_filer=4, n_filer=4,
offset="(2,0,0)", offset="(2.5,0,0)",
to="(fc3-east)", to="(fc3-east)",
height=H16, height=H16,
depth=D1024, depth=D1024,
@@ -82,7 +83,8 @@ arch = [
height=H32, height=H32,
depth=D2048, depth=D2048,
width=W8, width=W8,
caption="", caption="Deconv2",
captionshift=10,
), ),
# to_connection("deconv1", "up2"), # to_connection("deconv1", "up2"),
# DeConv2 (5×5, same): 8->1, 32×2048 # DeConv2 (5×5, same): 8->1, 32×2048
@@ -96,7 +98,7 @@ arch = [
height=H32, height=H32,
depth=D2048, depth=D2048,
width=W1, width=W1,
caption="Deconv2", caption="",
), ),
# to_connection("up2", "deconv2"), # to_connection("up2", "deconv2"),
# Output # Output
@@ -111,6 +113,7 @@ arch = [
depth=D2048, depth=D2048,
width=1.0, width=1.0,
caption="Output", caption="Output",
captionshift=5,
), ),
# to_connection("deconv2", "out"), # to_connection("deconv2", "out"),
to_end(), to_end(),

View File

@@ -28,6 +28,7 @@
{Box={ {Box={
name=latent, name=latent,
caption=Latent Space, caption=Latent Space,
captionshift=0,
xlabel={{, }}, xlabel={{, }},
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel=latent dim, zlabel=latent dim,
@@ -39,10 +40,11 @@
}; };
\pic[shift={(2,0,0)}] at (latent-east) \pic[shift={(2,-.5,0)}] at (latent-east)
{Box={ {Box={
name=fc3, name=fc3,
caption=FC, caption=FC,
captionshift=20,
xlabel={{" ","dummy"}}, xlabel={{" ","dummy"}},
zlabeloffset=0.35, zlabeloffset=0.35,
zlabel={{4×512×8}}, zlabel={{4×512×8}},
@@ -55,10 +57,11 @@
}; };
\pic[shift={ (2,0,0) }] at (fc3-east) \pic[shift={ (2.5,0,0) }] at (fc3-east)
{Box={ {Box={
name=up1, name=up1,
caption=, caption=,
captionshift=0,
fill=\UnpoolColor, fill=\UnpoolColor,
opacity=0.5, opacity=0.5,
xlabel={{4, }}, xlabel={{4, }},
@@ -73,6 +76,7 @@
{Box={ {Box={
name=deconv1, name=deconv1,
caption=Deconv1, caption=Deconv1,
captionshift=0,
xlabel={{8, }}, xlabel={{8, }},
zlabeloffset=0.2, zlabeloffset=0.2,
zlabel={{1024×16}}, zlabel={{1024×16}},
@@ -87,7 +91,8 @@
\pic[shift={ (2,0,0) }] at (deconv1-east) \pic[shift={ (2,0,0) }] at (deconv1-east)
{Box={ {Box={
name=up2, name=up2,
caption=, caption=Deconv2,
captionshift=10,
fill=\UnpoolColor, fill=\UnpoolColor,
opacity=0.5, opacity=0.5,
xlabel={{8, }}, xlabel={{8, }},
@@ -101,7 +106,8 @@
\pic[shift={(0,0,0)}] at (up2-east) \pic[shift={(0,0,0)}] at (up2-east)
{Box={ {Box={
name=deconv2, name=deconv2,
caption=Deconv2, caption=,
captionshift=0,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.15, zlabeloffset=0.15,
zlabel={{2048×32}}, zlabel={{2048×32}},
@@ -117,6 +123,7 @@
{Box={ {Box={
name=out, name=out,
caption=Output, caption=Output,
captionshift=5,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.15, zlabeloffset=0.15,
zlabel={{2048×32}}, zlabel={{2048×32}},

Binary file not shown.

View File

@@ -91,13 +91,14 @@ arch = [
to_fc( to_fc(
"fc1", "fc1",
n_filer="{{4×512×8}}", n_filer="{{4×512×8}}",
offset="(2,0,0)", offset="(2,-.5,0)",
zlabeloffset=0.5, zlabeloffset=0.5,
to="(pool2-east)", to="(pool2-east)",
height=1.3, height=1.3,
depth=D512, depth=D512,
width=W1, width=W1,
caption=f"FC", caption=f"FC",
captionshift=20,
), ),
# to_connection("pool2", "fc1"), # to_connection("pool2", "fc1"),
# --------------------------- LATENT --------------------------- # --------------------------- LATENT ---------------------------

View File

@@ -28,6 +28,7 @@
{Box={ {Box={
name=input, name=input,
caption=Input, caption=Input,
captionshift=0,
xlabel={{1, }}, xlabel={{1, }},
zlabeloffset=0.15, zlabeloffset=0.15,
zlabel={{2048×32}}, zlabel={{2048×32}},
@@ -43,6 +44,7 @@
{Box={ {Box={
name=conv1, name=conv1,
caption=Conv1, caption=Conv1,
captionshift=0,
xlabel={{8, }}, xlabel={{8, }},
zlabeloffset=0.15, zlabeloffset=0.15,
zlabel={{2048×32}}, zlabel={{2048×32}},
@@ -61,6 +63,7 @@
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel={{1024×16}}, zlabel={{1024×16}},
caption=, caption=,
captionshift=0,
fill=\PoolColor, fill=\PoolColor,
opacity=0.5, opacity=0.5,
height=18, height=18,
@@ -74,6 +77,7 @@
{Box={ {Box={
name=conv2, name=conv2,
caption=Conv2, caption=Conv2,
captionshift=0,
xlabel={{4, }}, xlabel={{4, }},
zlabeloffset=0.4, zlabeloffset=0.4,
zlabel={{1024×16\hspace{2.5em}512×8}}, zlabel={{1024×16\hspace{2.5em}512×8}},
@@ -92,6 +96,7 @@
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel={{}}, zlabel={{}},
caption=, caption=,
captionshift=0,
fill=\PoolColor, fill=\PoolColor,
opacity=0.5, opacity=0.5,
height=12, height=12,
@@ -101,10 +106,11 @@
}; };
\pic[shift={(2,0,0)}] at (pool2-east) \pic[shift={(2,-.5,0)}] at (pool2-east)
{Box={ {Box={
name=fc1, name=fc1,
caption=FC, caption=FC,
captionshift=20,
xlabel={{" ","dummy"}}, xlabel={{" ","dummy"}},
zlabeloffset=0.5, zlabeloffset=0.5,
zlabel={{4×512×8}}, zlabel={{4×512×8}},
@@ -121,6 +127,7 @@
{Box={ {Box={
name=latent, name=latent,
caption=Latent Space, caption=Latent Space,
captionshift=0,
xlabel={{, }}, xlabel={{, }},
zlabeloffset=0.3, zlabeloffset=0.3,
zlabel=latent dim, zlabel=latent dim,

View File

@@ -57,8 +57,12 @@
\path (b1) edge ["\ylabel",midway] (a1); %height label \path (b1) edge ["\ylabel",midway] (a1); %height label
\tikzstyle{captionlabel}=[text width=15*\LastEastx/\scale,text centered] % \tikzstyle{captionlabel}=[text width=15*\LastEastx/\scale,text centered,xshift=\captionshift pt]
\path (\LastEastx/2,-\y/2,+\z/2) + (0,-25pt) coordinate (cap) % \path (\LastEastx/2,-\y/2,+\z/2) + (0,-25pt) coordinate (cap)
% edge ["\textcolor{black}{ \bf \caption}"',captionlabel](cap) ; %Block caption/pic object label
% Place caption: shift the coordinate by captionshift (NEW)
\path (\LastEastx/2,-\y/2,+\z/2) + (\captionshift pt,-25pt) coordinate (cap)
edge ["\textcolor{black}{ \bf \caption}"',captionlabel](cap) ; %Block caption/pic object label edge ["\textcolor{black}{ \bf \caption}"',captionlabel](cap) ; %Block caption/pic object label
%Define nodes to be used outside on the pic object %Define nodes to be used outside on the pic object
@@ -103,6 +107,7 @@ ylabel/.store in=\ylabel,
zlabel/.store in=\zlabel, zlabel/.store in=\zlabel,
zlabeloffset/.store in=\zlabeloffset, zlabeloffset/.store in=\zlabeloffset,
caption/.store in=\caption, caption/.store in=\caption,
captionshift/.store in=\captionshift,
name/.store in=\name, name/.store in=\name,
fill/.store in=\fill, fill/.store in=\fill,
opacity/.store in=\opacity, opacity/.store in=\opacity,
@@ -117,5 +122,6 @@ ylabel=,
zlabel=, zlabel=,
zlabeloffset=0.3, zlabeloffset=0.3,
caption=, caption=,
captionshift=0,
name=, name=,
} }

View File

@@ -75,6 +75,7 @@ def to_Conv(
height=40, height=40,
depth=40, depth=40,
caption=" ", caption=" ",
captionshift=0,
): ):
return ( return (
r""" r"""
@@ -90,6 +91,9 @@ def to_Conv(
caption=""" caption="""
+ caption + caption
+ r""", + r""",
captionshift="""
+ str(captionshift)
+ """,
xlabel={{""" xlabel={{"""
+ str(n_filer) + str(n_filer)
+ """, }}, + """, }},
@@ -182,6 +186,7 @@ def to_Pool(
depth=32, depth=32,
opacity=0.5, opacity=0.5,
caption=" ", caption=" ",
captionshift=0,
): ):
return ( return (
r""" r"""
@@ -206,6 +211,9 @@ def to_Pool(
caption=""" caption="""
+ caption + caption
+ r""", + r""",
captionshift="""
+ str(captionshift)
+ """,
fill=\PoolColor, fill=\PoolColor,
opacity=""" opacity="""
+ str(opacity) + str(opacity)
@@ -236,6 +244,7 @@ def to_UnPool(
depth=32, depth=32,
opacity=0.5, opacity=0.5,
caption=" ", caption=" ",
captionshift=0,
): ):
return ( return (
r""" r"""
@@ -251,6 +260,9 @@ def to_UnPool(
caption=""" caption="""
+ caption + caption
+ r""", + r""",
captionshift="""
+ str(captionshift)
+ r""",
fill=\UnpoolColor, fill=\UnpoolColor,
opacity=""" opacity="""
+ str(opacity) + str(opacity)
@@ -335,6 +347,7 @@ def to_ConvSoftMax(
height=40, height=40,
depth=40, depth=40,
caption=" ", caption=" ",
captionshift=0,
): ):
return ( return (
r""" r"""
@@ -350,6 +363,9 @@ def to_ConvSoftMax(
caption=""" caption="""
+ caption + caption
+ """, + """,
captionshift="""
+ str(captionshift)
+ """,
zlabel=""" zlabel="""
+ str(s_filer) + str(s_filer)
+ """, + """,
@@ -380,6 +396,7 @@ def to_SoftMax(
depth=25, depth=25,
opacity=0.8, opacity=0.8,
caption=" ", caption=" ",
captionshift=0,
z_label_offset=0, z_label_offset=0,
): ):
return ( return (
@@ -396,6 +413,9 @@ def to_SoftMax(
caption=""" caption="""
+ caption + caption
+ """, + """,
captionshift="""
+ str(captionshift)
+ """,
xlabel={{" ","dummy"}}, xlabel={{" ","dummy"}},
zlabel=""" zlabel="""
+ str(s_filer) + str(s_filer)
@@ -455,6 +475,7 @@ def to_fc(
height=2, height=2,
depth=10, depth=10,
caption=" ", caption=" ",
captionshift=0,
# titlepos=0, # titlepos=0,
): ):
return ( return (
@@ -471,6 +492,9 @@ def to_fc(
caption=""" caption="""
+ caption + caption
+ """, + """,
captionshift="""
+ str(captionshift)
+ """,
xlabel={{" ","dummy"}}, xlabel={{" ","dummy"}},
zlabeloffset=""" zlabeloffset="""
+ str(zlabeloffset) + str(zlabeloffset)