From d76af9340632128dc3a8b05011b6cf8d53fb0ccb Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Tue, 20 May 2025 17:05:47 -0400
Subject: updates

---
 chapters/beyond-dsp.tex | 223 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)

(limited to 'chapters/beyond-dsp.tex')

diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex
index fd4537c..b94221f 100644
--- a/chapters/beyond-dsp.tex
+++ b/chapters/beyond-dsp.tex
@@ -1283,6 +1283,7 @@ possible to leverage
 problem-specific details within this interface to get
 better asymptotic performance.
 
 \subsection{Concurrency Control}
+\label{ssec:dyn-concurrency}
 
 \section{Evaluation}
 
@@ -1548,6 +1549,31 @@ characteristics,
 \text{Delete:} \quad &\Theta\left(\log_s n \right)
 \end{align*}
 
+For testing, we considered a dynamized VPTree using $N_B = 1400$, $s = 8$,
+the tiering layout policy, and tagged deletes. Because $k$-NN is a
+standard DDSP, we compare with the Bentley-Saxe Method (\textbf{BSM})\footnote{
+    There is one deviation from pure BSM in our implementation. We use
+    the same delete tagging scheme as the rest of our framework, meaning
+    that the hash tables for record lookup are embedded alongside each
+    block, rather than having a single global table. This means that
+    the lookup of the shard containing the record to be deleted runs
+    in $\Theta(\log_2 n)$ time, rather than $\Theta(1)$ time. However,
+    once the block has been identified, our approach allows the record
+    to be deleted in $\Theta(1)$ time, rather than requiring an
+    inefficient point-lookup directly on the VPTree.
+} and a dynamic data structure for the same search problem called an
+M-Tree~\cite{mtree,mtree-impl} (\textbf{MTree}), which is an example of a
+so-called ``ball tree'': a structure that partitions high-dimensional
+space using nodes representing spheres, which are merged and split to
+maintain balance in a manner not unlike a B+Tree.
+We also consider a static instance of a VPTree built over the same set
+of records (\textbf{VPTree}). We used L2 distance as our metric, which
+is defined for vectors of $d$ dimensions as
+\begin{equation*}
+dist(r, s) = \sqrt{\sum_{i=0}^{d-1} \left(r_i - s_i\right)^2}
+\end{equation*}
+and ran the queries with $k=1000$ relative to a randomly selected point
+in the dataset.
 
 \begin{figure*}
 \subfloat[Update Throughput]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn-insert} \label{fig:knn-insert}}
@@ -1559,9 +1585,69 @@ characteristics,
 \label{fig:knn-eval}
 \end{figure*}
 
+The results of this benchmarking are reported in
+Figure~\ref{fig:knn-eval}. The VPTree is shown to \emph{vastly}
+out-perform the dynamic data structure in query performance in
+Figure~\ref{fig:knn-query}; note that the y-axis of this figure is
+log-scaled. Interestingly, query performance is not severely degraded
+relative to the static baseline regardless of the dynamization scheme
+used, with \textbf{BSM-VPTree} performing slightly \emph{better} than
+our framework on queries. The reason for this is shown in
+Figure~\ref{fig:knn-insert}, where our framework outperforms the
+Bentley-Saxe method in insertion performance. These results are
+attributable to our selection of framework configuration parameters,
+which are biased towards better insertion performance. Both dynamized
+structures also outperform the dynamic baseline. Finally, as is becoming
+a trend, Figure~\ref{fig:knn-space} shows that the storage requirements
+of the static data structures, dynamized or not, are significantly lower
+than those of the M-Tree. The M-Tree, like a B+Tree, requires leaving
+empty slots in its nodes to support insertion, and this results in a
+large amount of wasted space.
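For concreteness, the distance computation used in these benchmarks can be sketched in C++. The function below is an illustrative stand-alone implementation of the L2 metric defined above; it is not code from the framework itself, and the `std::vector`-based record representation is an assumption for the sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// L2 (Euclidean) distance between two d-dimensional vectors, matching
// the definition used for the k-NN queries above:
//   dist(r, s) = sqrt( sum_{i=0}^{d-1} (r_i - s_i)^2 )
double l2_distance(const std::vector<double> &r, const std::vector<double> &s) {
    double sum = 0.0;
    for (std::size_t i = 0; i < r.size(); i++) {
        double diff = r[i] - s[i];
        sum += diff * diff;  // accumulate squared per-dimension differences
    }
    return std::sqrt(sum);
}
```

Any metric obeying the triangle inequality could be substituted here, which is what makes VPTree-style metric indexing applicable beyond Euclidean vectors.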
+
+As a final note, metric indexing is an area where dynamized static
+structures have already been shown to work well, and our results here
+are in line with those of Naidan and Hetland, who applied BSM directly
+to metric data structures, including the VPTree, in their own work and
+showed similar performance advantages~\cite{naidan14}.
+
 \subsection{Range Scan}
 
+Next, we consider applying our dynamization framework to learned
+indices for single-dimensional range scans. A learned index is a sorted
+data structure that attempts to index data by directly modeling a
+function mapping each key to its offset within a storage array. The
+result of a lookup against the index is an estimated location, along
+with a strict error bound, within which the record is guaranteed to be
+located. We apply our framework to create dynamized versions of two
+static learned indices, Triespline~\cite{plex} (\textbf{DE-TS}) and
+PGM~\cite{pgm} (\textbf{DE-PGM}), and compare with a standard
+Bentley-Saxe dynamization of Triespline (\textbf{BSM-TS}). Our dynamic
+baselines are ALEX~\cite{alex}, which is a dynamic learned index based
+on a B+Tree-like structure, and PGM (\textbf{PGM}), which itself
+provides a dynamic version based on Bentley-Saxe dynamization (which is
+why we have not included a BSM version of PGM in our testing).
+
+For our dynamized versions of Triespline and PGM, we configure the
+framework with $N_B = 12000$, $s=8$, and the tiering layout policy. We
+consider range count queries, which traverse the range and return the
+number of records in it rather than returning the set of records
+itself. This is to overcome differences in the query interfaces of our
+baselines, some of which make extra copies of the records; we consider
+traversing the range and counting to be a fairer comparison. Range
+counts are true invertible search problems, and so we use tombstone
+deletes. The query process itself performs no preprocessing.
+Each local query uses the index to identify the first record in the
+query range and then traverses the range, counting the number of
+records and tombstones encountered. These counts are then combined by
+adding up the total record count from all shards, subtracting the total
+tombstone count, and returning the final count. No repeats are
+necessary. The buffer query simply scans the unsorted array and
+performs the same counting. We examine range count queries with a fixed
+selectivity of $\sigma = 0.1\%$.
 
 \begin{figure*}
 \centering
 \subfloat[Update Throughput]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-rq-insert} \label{fig:rq-insert}}
@@ -1573,8 +1659,64 @@ characteristics,
 \label{fig:eval-learned-index}
 \end{figure*}
 
+The results of our evaluation are shown in
+Figure~\ref{fig:eval-learned-index}. Figure~\ref{fig:rq-insert} shows
+the insertion performance. DE-TS is the best in all cases, and the pure
+BSM version of Triespline is the worst by a substantial margin. Of
+particular interest in this chart is the inconsistent performance of
+ALEX, which does quite well on the \texttt{books} dataset and poorly on
+the others. It is worth noting that getting ALEX to run \emph{at all}
+in some cases required a lot of trial and error and tuning, as its
+performance is highly distribution-dependent. Our dynamized version of
+PGM consistently out-performed the built-in dynamic support of the same
+structure. One shouldn't read \emph{too} much into this result, as PGM
+itself supports some performance tuning and can be adjusted to balance
+insertion against query performance. We ran it with the authors'
+suggested default values, but in principle it may be possible to tune
+it to match our framework's performance here. The important take-away
+from this test is that our generalized framework can easily trade blows
+with a custom, integrated solution.
+
+The query performance results in Figure~\ref{fig:rq-query} are a bit
+less interesting.
+All solutions perform similarly, with ALEX again showing itself to be
+fairly distribution-dependent in its performance: it performs the best
+out of all of the structures on the \texttt{books} dataset by a
+reasonable margin, but falls in line with the others on the remaining
+datasets. The standout result here is the dynamic PGM, which performs
+horrendously compared to all of the other structures. The same caveat
+from the previous paragraph applies here: PGM can be configured for
+better performance. But it's notable that our framework-dynamized PGM
+is able to beat PGM slightly in insertion performance without seeing
+the same massive degradation in query performance that PGM's native
+update support exhibits in its own update-optimized
+configuration.\footnote{
+    It's also worth noting that PGM implements tombstone deletes by
+    inserting a record with a key matching the record to be deleted and
+    a particular ``tombstone'' value, rather than using a header. This
+    means that it cannot support duplicate keys when deletes are used,
+    unlike our approach. It also means that its records are smaller,
+    which should improve query performance, but we're able to beat it
+    even including the header. PGM is the reason we excluded the
+    \texttt{wiki} dataset from SOSD, as it has duplicate key values.
+} Finally, Figure~\ref{fig:rq-space} shows the storage requirements for
+these data structures. All of the dynamic options require significantly
+more space than the static Triespline, but ALEX requires the most by a
+very large margin. This is in keeping with the previous experiments,
+which all included similarly B+Tree-like structures that required
+significant additional storage space, compared to static structures, as
+part of their update support.
+
 \subsection{String Search}
 
+As a final example of a search problem, we consider exact string
+matching using the fast succinct trie~\cite{zhang18}.
+While updatable tries aren't terribly
+unusual~\cite{m-bonsai,dynamic-trie}, succinct data structures, which
+attempt to approach the information-theoretic lower bound on the size
+of their binary representation of the data, are usually static, because
+implementing updates while maintaining these compact representations is
+difficult~\cite{dynamic-trie}. There are specialized approaches for
+dynamizing such structures~\cite{dynamize-succinct}, but in this section
+we consider the effectiveness of our generalized framework for them.
+
 \begin{figure*}
 \centering
 \subfloat[Update Throughput]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-insert} \label{fig:fst-insert}}
@@ -1582,17 +1724,75 @@ characteristics,
 \subfloat[Index Overhead]{\includegraphics[width=.32\textwidth, trim=5mm 2mm 0 0]{img/fig-bs-fst-space} \label{fig:fst-size}}
 %\vspace{-3mm}
 \caption{FST Evaluation}
+ \label{fig:fst-eval}
 %\vspace{-5mm}
 \end{figure*}
 
+Our shard type is a direct wrapper around an implementation of the fast
+succinct trie~\cite{fst-impl}. We store the strings in off-record
+storage, and the record type itself contains a pointer to the string in
+storage. Queries use no pre-processing, and the local queries directly
+search for a matching string. We use the framework's early-abort feature
+to stop as soon as the first result is found, and combine simply checks
+whether this record is a tombstone or not. If it is a tombstone, then
+the lookup is considered to have not found the search string. Otherwise,
+the record is returned.
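As a rough illustration of this lookup procedure, the sketch below searches blocks from newest to oldest with an early abort, and treats a tombstone hit as a failed lookup. The `Record` type, the nested-vector block representation, and the linear scan standing in for an FST probe are all hypothetical stand-ins, not the framework's actual interface.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical record: a string key plus a tombstone flag, standing in
// for the framework's header-based tombstone marking.
struct Record {
    std::string key;
    bool tombstone;
};

// Exact-match lookup with early abort: blocks are searched newest-first,
// and the first record matching the key decides the outcome. If that
// record is a tombstone, the key is treated as not present.
std::optional<Record> lookup(const std::vector<std::vector<Record>> &blocks,
                             const std::string &key) {
    for (const auto &block : blocks) {      // newest block first
        for (const auto &rec : block) {     // stand-in for an FST probe
            if (rec.key == key) {
                if (rec.tombstone)
                    return std::nullopt;    // deleted: lookup fails
                return rec;                 // early abort on first hit
            }
        }
    }
    return std::nullopt;                    // not present in any block
}
```

Because blocks are visited newest-first, a tombstone always shadows the older record it deletes.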
+This results in a dynamized structure with the following asymptotic
+costs,
+
 \begin{align*}
 \text{Insert:} \quad &\Theta\left(\log_s n\right) \\
 \text{Query:} \quad &\Theta\left(N_B + \log n \log_s n\right ) \\
 \text{Delete:} \quad &\Theta\left(\log_s n \right)
 \end{align*}
 
+We compare our dynamized succinct trie (\textbf{DE-FST}), configured
+with $N_B = 1200$, $s = 8$, the tiering layout policy, and tombstone
+deletes, with a standard Bentley-Saxe dynamization (\textbf{BSM-FST}),
+as well as a single static instance of the structure (\textbf{FST}).
+
+The results are shown in Figure~\ref{fig:fst-eval}. As with range scans,
+the Bentley-Saxe method shows horrible insertion performance relative to
+our framework in Figure~\ref{fig:fst-insert}. Note that the significant
+observed difference in update throughput between the two datasets is
+largely attributable to their relative sizes: the \texttt{usra} set is
+far larger than \texttt{english}. Figure~\ref{fig:fst-query} shows that
+our write-optimized framework configuration is slightly out-performed in
+query latency by the standard Bentley-Saxe dynamization, and that both
+dynamized structures are quite a bit slower than the static structure
+for queries. Finally, the storage costs for the data structures are
+shown in Figure~\ref{fig:fst-size}. For the \texttt{english} dataset,
+the extra storage cost from decomposing the structure is quite
+significant, but for the \texttt{usra} set the sizes are quite
+comparable. It is not unexpected that dynamization adds storage cost for
+succinct (or any compressed) data structures, because splitting the
+records across multiple data structures reduces the ability of each
+structure to compress redundant data.
+
 \subsection{Concurrency}
 
+We also tested the preliminary concurrency support described in
+Section~\ref{ssec:dyn-concurrency}, using IRS as our test case, with our
+dynamization configured with $N_B = 1200$, $s=8$, and the tiering layout
+policy.
+Note that IRS only supports tagging, as it is not invertible even under
+the IDSP model, and our current concurrency implementation only supports
+deletes via tombstones, so we eschewed deletes entirely for this test.
+
+In this benchmark, we used a single thread to insert records into the
+structure at a constant rate, while we deployed a variable number of
+additional threads that continuously issued sampling queries against the
+structure. We used an AGG B+Tree as our baseline. Note that, to
+accurately maintain the aggregate weight counts as records are inserted,
+it is necessary for each operation to obtain a lock on the root node of
+the tree~\cite{zhao22}. This makes this situation a good use case for
+the automatic concurrency support provided by our framework.
+Figure~\ref{fig:irs-concurrency} shows the results of this benchmark for
+various numbers of concurrent query threads. As can be seen, our
+framework sustains a stable update throughput with up to 32 query
+threads, whereas the AGG B+Tree suffers from contention on the mutex and
+sees its performance degrade as the number of threads increases.
+
 \begin{figure}
 \centering
 %\vspace{-2mm}
@@ -1604,3 +1804,26 @@ characteristics,
 \end{figure}
 
 \section{Conclusion}
+
+In this chapter, we sought to develop a set of tools for generalizing
+some of the results from our study of sampling data structures in
+Chapter~\ref{chap:sampling} so that they apply to a broader set of data
+structures. This resulted in the development of two new classes of
+search problem: extended decomposable search problems and iterative
+deletion decomposable search problems. The former class allows a
+pre-processing step to be used to generate individualized local queries
+for each block in a decomposed structure, and the latter allows the
+query process to be repeated as necessary, with possible modifications
+to the local queries each time, to build up the result set iteratively.
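To make the repetition mechanism concrete, the toy sketch below retrieves the first k live records from a set of blocks with tagged deletes, repeating rounds of per-block local queries until the result set is full; the per-block cursors play the role of the local-query modifications carried between rounds. All names and types here are illustrative inventions for this sketch, not the framework's interface.

```cpp
#include <cstddef>
#include <vector>

// Toy block: integer records, some tagged as deleted.
struct Block {
    std::vector<int> records;
    std::vector<bool> deleted;
};

// Iterative query: collect up to k live records. Each round asks every
// block for its next live record; tagged deletes can cause a shortfall,
// in which case the loop repeats with advanced cursors (the modified
// local queries) until k records are found or the blocks are exhausted.
std::vector<int> take_k(const std::vector<Block> &blocks, std::size_t k) {
    std::vector<int> result;
    std::vector<std::size_t> cursor(blocks.size(), 0);  // per-block state

    while (result.size() < k) {
        bool progress = false;
        for (std::size_t b = 0; b < blocks.size() && result.size() < k; b++) {
            const Block &blk = blocks[b];
            // Local query: emit the next record, skipping tagged deletes.
            while (cursor[b] < blk.records.size()) {
                std::size_t i = cursor[b]++;
                progress = true;
                if (!blk.deleted[i]) {
                    result.push_back(blk.records[i]);
                    break;
                }
            }
        }
        if (!progress) break;  // all blocks exhausted; no repeat possible
    }
    return result;
}
```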
+We then implemented a C++ framework for automatically dynamizing static
+data structures for search problems falling into either of these
+classes, which included an LSM-tree-inspired design space and support
+for concurrency.
+
+We used this framework to produce dynamized structures for a wide
+variety of search problems, and compared the results to existing dynamic
+baselines, as well as to the original Bentley-Saxe method, where
+applicable. The results show that our framework is capable of creating
+dynamic structures that are competitive with, or superior to,
+custom-built dynamic structures, and that it also has clear performance
+advantages over the classical Bentley-Saxe method.
--
cgit v1.2.3