\chapter{Generalizing the Framework}
\label{chap:framework}

The previous chapter demonstrated the possible utility of designing
indexes based upon the dynamic extension of static data structures.
However, the presented strategy falls short of a general framework, as
it is specific to sampling problems. In this chapter, the techniques of
that work will be discussed in more general terms, to arrive at a more
broadly applicable solution. A general framework is proposed, which
places only two requirements on supported data structures,

\begin{itemize}
    \item Extended Decomposability
    \item Record Identity
\end{itemize}

In this chapter, first these two properties are defined. Then, a
general dynamic extension framework is described which can be applied
to any data structure supporting these properties. Finally, an
experimental evaluation is presented that demonstrates the viability of
this framework.

\section{Extended Decomposability}

Chapter~\ref{chap:sampling} demonstrated how non-DSPs can be
efficiently addressed using Bentley-Saxe, so long as the query
interface is modified to accommodate their needs. For independent
sampling problems, this involved a two-pass approach, where some
pre-processing work was performed against each shard and used to
construct a shard alias structure. This structure was then used to
determine how many samples to draw from each shard.

To generalize this approach, a new class of decomposability is
proposed, called \emph{extended decomposability}. At present, its
definition is tied tightly to the query interface, rather than to a
formal mathematical definition. In extended decomposability, rather
than treating a search problem as a monolith, the query algorithm is
decomposed into multiple components. This allows for communication
between shards as part of the query process. Additionally, rather than
using a binary merge operator, extended decomposability uses a variadic
function that merges all of the result sets in one pass, reducing the
cost due to merging by a logarithmic factor without introducing any new
restrictions.

The basic interface that must be supported by an extended-decomposable
search problem (eDSP) is,
\begin{itemize}
    \item $\mathbftt{local\_preproc}(\mathcal{I}_i, \mathcal{Q}) \to
    \mathscr{S}_i$ \\
    Pre-processes each partition $\mathcal{D}_i$ using index
    $\mathcal{I}_i$ to produce preliminary information about the
    query result on this partition, encoded as an object
    $\mathscr{S}_i$.

    \item $\mathbftt{distribute\_query}(\mathscr{S}_1, \ldots,
    \mathscr{S}_m, \mathcal{Q}) \to \mathcal{Q}_1, \ldots,
    \mathcal{Q}_m$ \\
    Processes the list of preliminary information objects
    $\mathscr{S}_i$ and emits a list of local queries
    $\mathcal{Q}_i$ to run independently on each partition.

    \item $\mathbftt{local\_query}(\mathcal{I}_i, \mathcal{Q}_i)
    \to \mathcal{R}_i$ \\
    Executes the local query $\mathcal{Q}_i$ over partition
    $\mathcal{D}_i$ using index $\mathcal{I}_i$ and returns a
    partial result $\mathcal{R}_i$.

    \item $\mathbftt{merge}(\mathcal{R}_1, \ldots, \mathcal{R}_m) \to
    \mathcal{R}$ \\
    Merges the partial results to produce the final answer.
\end{itemize}
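To make this interface concrete, the following is a minimal sketch of
how an eDSP might be expressed as a C++20 concept, in the spirit of the
implementation discussed later in this chapter. The type and member
names here are illustrative assumptions, not the framework's actual
API.
\begin{verbatim}
#include <concepts>
#include <vector>

// A sketch (with assumed names) of the four eDSP operations as a
// C++20 concept. Q::LocalState plays the role of the preliminary-
// information object S_i, Q::LocalQuery the per-partition query Q_i,
// and Q::Result the partial result R_i.
template <typename Q, typename Index>
concept ExtendedDecomposable = requires(
    const Index &index, const typename Q::Parameters &params,
    std::vector<typename Q::LocalState> &states,
    const typename Q::LocalQuery &local,
    std::vector<typename Q::Result> &partials)
{
    // local_preproc: per-partition pre-processing
    { Q::local_preproc(index, params) }
        -> std::same_as<typename Q::LocalState>;

    // distribute_query: turn the collected states into local queries
    { Q::distribute_query(states, params) }
        -> std::same_as<std::vector<typename Q::LocalQuery>>;

    // local_query: answer one local query against one partition
    { Q::local_query(index, local) }
        -> std::same_as<typename Q::Result>;

    // merge: variadic merge of all partial results in a single pass
    { Q::merge(partials) }
        -> std::same_as<typename Q::Result>;
};
\end{verbatim}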
The pseudocode for the query algorithm using this interface is,
\begin{algorithm}
    \DontPrintSemicolon
    \SetKwProg{Proc}{procedure}{ BEGIN}{END}
    \SetKwProg{For}{for}{ DO}{DONE}

    \Proc{\mathbftt{QUERY}($D[]$, $\mathscr{Q}$)} {
        \For{$i \in [0, |D|)$} {
            $S[i] := \mathbftt{local\_preproc}(D[i], \mathscr{Q})$
        } \;

        $Q := \mathbftt{distribute\_query}(S, \mathscr{Q})$ \; \;

        \For{$i \in [0, |D|)$} {
            $R[i] := \mathbftt{local\_query}(D[i], Q[i])$
        } \;

        $OUT := \mathbftt{merge}(R)$ \;

        \Return {$OUT$} \;
    }
\end{algorithm}

In this system, each query can report a partial result with
$\mathbftt{local\_preproc}$, which can be used by
$\mathbftt{distribute\_query}$ to adjust the per-partition query
parameters, allowing for direct communication of state between
partitions. Queries which do not need this functionality can simply
return empty $\mathscr{S}_i$ objects from $\mathbftt{local\_preproc}$.

\subsection{Query Complexity}

Before describing how to use this new interface and definition to
support more efficient queries than standard decomposability, a more
general expression for the cost of querying such a structure should be
derived. Recall that Bentley-Saxe, when applied to a
$C(n)$-decomposable problem, has the following query cost,
\begin{equation}
    \label{eq3:Bentley-Saxe}
    O\left(\log n \cdot \left( Q_s(n) + C(n)\right)\right)
\end{equation}
where $Q_s(n)$ is the cost of the query against one partition, and
$C(n)$ is the cost of the merge operator.

Let $Q_s(n)$ represent the cost of $\mathbftt{local\_query}$ and
$C(n)$ the cost of $\mathbftt{merge}$ in the extended decomposability
case. Additionally, let $P(n)$ be the cost of
$\mathbftt{local\_preproc}$ and $\mathcal{D}(n)$ be the cost of
$\mathbftt{distribute\_query}$, and recall that $|D| = \log n$ for the
Bentley-Saxe method. In this case, the cost of a query is
\begin{equation}
    O \left( \log n \cdot P(n) + \mathcal{D}(n)
        + \log n \cdot Q_s(n) + C(n) \right)
\end{equation}

Superficially, this looks to be strictly worse than the Bentley-Saxe
case in Equation~\ref{eq3:Bentley-Saxe}. However, the important thing
to understand is that for $C(n)$-decomposable queries, $P(n) \in O(1)$
and $\mathcal{D}(n) \in O(1)$, as these steps are unneeded. Thus, for
normal decomposable queries, the cost actually reduces to,
\begin{equation}
    O \left( \log n \cdot Q_s(n) + C(n) \right)
\end{equation}
which is actually \emph{better} than Bentley-Saxe. Meanwhile, the
ability to perform state-sharing between queries can facilitate better
solutions than would otherwise be possible.

In light of this new approach, consider the two examples of
non-decomposable search problems from Section~\ref{ssec:decomp-limits}.

\subsection{k-Nearest Neighbor}
\label{ssec:knn}
The KNN problem is $C(n)$-decomposable, and
Section~\ref{sssec-decomp-limits-knn} arrived at a Bentley-Saxe based
solution to this problem based on VPTree, with a query cost of
\begin{equation}
    O \left( k \log^2 n + k \log n \log k \right)
\end{equation}
by running KNN on each partition, and then merging the result sets with
a heap.

Applying the interface of extended decomposability to this problem
allows for some optimizations. Pre-processing is not necessary here,
but the variadic merge function can be leveraged to get an
asymptotically better solution. Simply dropping the existing algorithm
into this interface will result in a merge algorithm with cost,
\begin{equation}
    C(n) \in O \left( k \log n \left( \log k + \log\log n\right)\right)
\end{equation}
which results in a total query cost that is slightly \emph{worse} than
the original,
\begin{equation}
    O \left( k \log^2 n + k \log n \left(\log k + \log\log n\right) \right)
\end{equation}

The problem is that the number of records considered in a given merge
has grown from $O(k)$ in the binary merge case to $O(k \log n)$ in the
variadic merge. However, because the merge function now has access to
all of the data at once, the algorithm can be modified slightly for
better efficiency by only pushing $\log n$ elements into the heap at a
time. This trick only works if the $R_i$s are in sorted order relative
to $f(x, q)$; however, this condition is satisfied by the result sets
returned by KNN against a VPTree. Thus, for each $R_i$, the first
element in sorted order can be inserted into the heap, tagged with a
reference to the $R_i$ it was taken from. Then, when the heap is
popped, the next element from the associated $R_i$ can be inserted.
This allows the heap's size to be maintained at no larger than
$O(\log n)$, and limits the algorithm to no more than $k$ pop
operations and $\log n + k - 1$ pushes.

This algorithm reduces the cost of KNN on this structure to,
\begin{equation}
    O(k \log^2 n + \log n)
\end{equation}
which is strictly better than the original.
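This merge procedure can be sketched in a few lines of C++. The sketch
below assumes that each partial result set is sorted in ascending order
of distance to the query point, as is the case for the result sets
returned by KNN against a VPTree; the names and types are illustrative.
\begin{verbatim}
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <utility>
#include <vector>

// A sketch of the variadic KNN merge. Each partial result set is
// assumed to be sorted in ascending order of distance to the query
// point. The heap holds one candidate per result set, tagged with the
// set it came from, so it never grows beyond the number of shards.
template <typename Record>
std::vector<Record> knn_merge(
    const std::vector<std::vector<std::pair<double, Record>>> &partials,
    size_t k)
{
    // heap entry: (distance, source set, index within that set)
    using Entry = std::tuple<double, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<>> heap;

    // seed the heap with the first (closest) element of each set
    for (size_t i = 0; i < partials.size(); i++) {
        if (!partials[i].empty()) {
            heap.emplace(partials[i][0].first, i, 0);
        }
    }

    std::vector<Record> result;
    result.reserve(k);
    while (result.size() < k && !heap.empty()) {
        auto [dist, set, idx] = heap.top();
        heap.pop();
        result.push_back(partials[set][idx].second);

        // replace the popped entry with the next element of its set
        if (idx + 1 < partials[set].size()) {
            heap.emplace(partials[set][idx + 1].first, set, idx + 1);
        }
    }
    return result;
}
\end{verbatim}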
\subsection{Independent Range Sampling}

The eDSP abstraction also provides sufficient features to implement
IRS, using the same basic approach as was used in the previous chapter.
Unlike KNN, IRS will take advantage of the extended query interface.
Recall from Chapter~\ref{chap:sampling} that the approach used for
answering sampling queries (ignoring the buffer, for now) was,

\begin{enumerate}
    \item Query each shard to establish the weight that should be
    assigned to it when determining sample sizes.
    \item Build an alias structure over those weights.
    \item For each sample, reference the alias structure to determine
    which shard to sample from, and then draw the sample.
\end{enumerate}

This approach can be mapped easily onto the eDSP interface as follows,
\begin{itemize}
    \item[\texttt{local\_preproc}] Determine and return the total
    weight of candidate records for sampling in the shard.
    \item[\texttt{distribute\_query}] Using the shard weights,
    construct an alias structure associating each shard with its total
    weight. Then, query this alias structure $k$ times. For shard $i$,
    the local query $\mathscr{Q}_i$ will have its sample size assigned
    based on how many times $i$ is returned during the alias querying
    (see the sketch after this list).
    \item[\texttt{local\_query}] Process the local query using the
    underlying data structure's normal sampling procedure.
    \item[\texttt{merge}] Union all of the partial results together.
\end{itemize}
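As an illustration of the \texttt{distribute\_query} step, the
following sketch performs the sample-size assignment, with
\texttt{std::discrete\_distribution} standing in for the alias
structure (both provide weighted selection of a shard per sample). The
names are assumptions for illustration.
\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

// A sketch of IRS sample-size assignment: given the total candidate
// weight reported by each shard's local_preproc, draw k shard choices
// in proportion to weight and count how many samples each shard owes.
// std::discrete_distribution stands in for the alias structure here.
std::vector<size_t> assign_sample_sizes(
    const std::vector<double> &shard_weights, size_t k, std::mt19937 &rng)
{
    std::discrete_distribution<size_t> dist(shard_weights.begin(),
                                            shard_weights.end());

    std::vector<size_t> sample_sizes(shard_weights.size(), 0);
    for (size_t i = 0; i < k; i++) {
        sample_sizes[dist(rng)]++;  // one more sample owed by this shard
    }

    return sample_sizes;
}
\end{verbatim}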
This division of the query maps closely onto the cost function,
\begin{equation}
    O\left(W(n) + P(n) + kS(n)\right)
\end{equation}
used in Chapter~\ref{chap:sampling}, where the $W(n) + P(n)$
pre-processing cost is associated with the cost of
\texttt{local\_preproc} and the $kS(n)$ sampling cost is associated
with \texttt{local\_query}. The \texttt{distribute\_query} operation
will require $O(\log n)$ time to construct the shard alias structure,
and $O(k)$ time to query it. Accounting then for the fact that
\texttt{local\_preproc} will be called once per shard ($\log n$ times),
and that a total of $k$ records will be sampled at a cost of $S(n)$
each, this results in a total query cost of,
\begin{equation}
    O\left(\left[W(n) + P(n)\right]\log n + k S(n)\right)
\end{equation}
which matches the cost in Equation~\ref{eq:sample-cost}.

\section{Record Identity}

Another important consideration for the framework is support for
deletes, which are important in the context of database systems. The
sampling extension framework supported two techniques for the deletion
of records: tombstone-based deletes and tagging-based deletes. In both
cases, the solution required that the shard support point lookups,
either for checking tombstones or for finding the record to mark it as
deleted. Implicit in this is an important property of the underlying
data structure which was taken for granted in that work, but which will
be made explicit here: record identity.

Delete support requires that each record within the index be uniquely
identifiable, and linkable directly to a location in storage. This
property is called \emph{record identity}. In the context of database
indexes, it isn't a particularly contentious requirement. Indexes are
already designed to provide a mapping directly to a record in storage,
which (at least in the context of an RDBMS) must have a unique
identifier attached. However, in more general contexts, this
requirement will place some restrictions on the applicability of the
framework.

For example, approximate data structures or summaries, such as Bloom
filters~\cite{bloom70} or count-min sketches~\cite{countmin-sketch},
don't necessarily store the underlying records at all. In principle,
some summaries \emph{could} be supported by normal Bentley-Saxe, as
there exist mergeable summaries~\cite{mergeable-summaries}. But because
these data structures violate the record identity property, they would
not support deletes (either in the framework, or in Bentley-Saxe). The
framework considers deletes to be a first-class citizen, and this is
formalized by requiring record identity as a property that supported
data structures must have.
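As a concrete illustration of what record identity enables, the
following is a sketch of a header-wrapped record of the sort the
framework might store. The exact layout here is an assumption for
illustration; the real header format is an implementation detail.
\begin{verbatim}
#include <cstdint>

// A sketch of a framework-wrapped record. Because each record is
// uniquely identifiable (here, by its key/value pair) and resides at
// a stable storage location, a delete can either set the `deleted`
// bit in place (tagging) or insert a matching record with the
// `tombstone` bit set (tombstones), cancelled during reconstruction.
template <typename Key, typename Value>
struct WrappedRecord {
    Key key;
    Value value;
    uint32_t deleted   : 1;   // set by tagged deletes
    uint32_t tombstone : 1;   // set on tombstone records
    uint32_t unused    : 30;

    // identity comparison: same logical record, ignoring header bits
    bool matches(const WrappedRecord &other) const {
        return key == other.key && value == other.value;
    }
};
\end{verbatim}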
\section{The General Framework}

Based on these properties, and the work described in
Chapter~\ref{chap:sampling}, a dynamic extension framework has been
devised with broad support for data structures. It is implemented in
C++20, using templates and concepts to define the necessary interfaces.
A user of this framework needs to provide a definition for their data
structure with a prescribed interface (called a \texttt{shard}), and a
definition for their query following an interface based on the above
definition of an eDSP. These two classes can then be used as template
parameters to automatically create a dynamic index, which exposes
methods for inserting and deleting records, as well as executing
queries.

\subsection{Framework Design}

\Paragraph{Structure.} The overall design of the general framework
itself is not substantially different from the sampling framework
discussed in Chapter~\ref{chap:sampling}. It consists of a mutable
buffer and a set of levels containing data structures with
geometrically increasing capacities. The \emph{mutable buffer} is a
small unsorted record array of fixed capacity that buffers incoming
inserts. As the mutable buffer is kept sufficiently small (e.g.,
fitting in the L2 CPU cache), the cost of querying it without any
auxiliary structures can be minimized, while still allowing better
insertion performance than Bentley-Saxe, which requires rebuilding an
index structure on each insertion. The use of an unsorted buffer is
necessary to ensure that the framework doesn't require an existing
dynamic version of the index structure being extended, which would
defeat the purpose of the entire exercise.

The majority of the data within the structure is stored in a sequence
of \emph{levels} with geometrically increasing record capacity, such
that the capacity of level $i$ is $s^{i+1}$, where $s$ is a
configurable parameter called the \emph{scale factor}. Unlike
Bentley-Saxe, these levels are permitted to be partially full, which
allows significantly more flexibility in terms of how reconstruction is
performed. This also opens up the possibility of allowing each level to
allocate its record capacity across multiple data structures (named
\emph{shards}) rather than just one. This decision is called the
\emph{layout policy}, with the use of a single structure being called
\emph{leveling}, and multiple structures being called \emph{tiering}.

\begin{figure}
\centering
\subfloat[Leveling]{\includegraphics[width=.5\textwidth]{img/leveling} \label{fig:leveling}}
\subfloat[Tiering]{\includegraphics[width=.5\textwidth]{img/tiering} \label{fig:tiering}}
    \caption{\textbf{An overview of the general structure of the
    dynamic extension framework} using leveling
    (Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering})
    layout policies. The pictured extension has a scale factor of 3,
    with $L_0$ being at capacity, and $L_1$ being at one third
    capacity. Each shard is shown as a dotted box, wrapping its
    associated dataset ($D_i$), data structure ($I_i$), and auxiliary
    structures ($A_i$).}
\label{fig:framework}
\end{figure}

\Paragraph{Shards.} The basic building block of the dynamic extension
is called a shard, defined as $\mathcal{S}_i = (\mathcal{D}_i,
\mathcal{I}_i, A_i)$, which consists of a partition of the data
$\mathcal{D}_i$, an instance of the static index structure being
extended $\mathcal{I}_i$, and an optional auxiliary structure $A_i$. To
ensure the viability of level reconstruction, the extended data
structure should at least support a construction method
$\mathtt{build}(\mathcal{D})$ that can build a new static index from a
set of records $\mathcal{D}$ from scratch. This set of records may come
from the mutable buffer, or from a union of the underlying data of
multiple other shards. It is also beneficial for $\mathcal{I}_i$ to
support efficient point-lookups, which can search for a record's
storage location by its identifier (given by the record identity
requirement of the framework). The shard can also be customized to
provide any necessary features for supporting the index being extended.
For example, auxiliary data structures like Bloom filters or hash
tables can be added to improve point-lookup performance, or additional
specialized functions can be provided for use by query implementations.

From an implementation standpoint, the shard object provides a shim
between the data structure and the framework itself. At minimum, it
must support the following interface,
\begin{itemize}
    \item $\mathbftt{construct}(B) \to S$ \\
    Construct a new shard from the contents of the mutable buffer, $B$.

    \item $\mathbftt{construct}(S_0, \ldots, S_n) \to S$ \\
    Construct a new shard from the records contained within a list of
    already existing shards.

    \item $\mathbftt{point\_lookup}(r) \to *r$ \\
    Search for a record, $r$, by identity and return a reference to its
    location in storage.
\end{itemize}
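In the C++20 implementation, requirements of this sort are naturally
expressed as concepts. The sketch below shows the shape such a
requirement might take, with assumed names; the full interface actually
required of a shard is described later in this chapter.
\begin{verbatim}
#include <concepts>
#include <vector>

// An illustrative sketch (assumed names) of the minimal shard
// requirements as a C++20 concept: construction from a flushed
// buffer, construction from existing shards, and point lookup by
// record identity.
template <typename S, typename Buffer, typename Record>
concept MinimalShard = requires(const Buffer &buffer,
                                const std::vector<S> &shards,
                                const Record &rec, const S &shard)
{
    { S(buffer) };                  // construct(B) -> S
    { S(shards) };                  // construct(S_0, ..., S_n) -> S
    { shard.point_lookup(rec) }     // point_lookup(r) -> *r
        -> std::same_as<const Record*>;
};
\end{verbatim}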
\Paragraph{Insertion \& deletion.} The framework supports inserting new
records and deleting records already in the index. These two operations
also allow for updates to existing records, by first deleting the old
version and then inserting a new one. These operations are added by the
framework automatically, and require only a small shim or minor
adjustments to the code of the data structure being extended within the
implementation of the shard object.

Insertions are performed by first wrapping the record to be inserted
with a framework header, and then appending it to the end of the
mutable buffer. If the mutable buffer is full, it is flushed to create
a new shard, which is combined into the first level of the structure.
The level reconstruction process is layout policy dependent. In the
case of leveling, the underlying data of the source shard and the
target shard are combined, resulting in a new shard that replaces the
target shard in the target level. When using tiering, the newly created
shard is simply placed into the target level. If the target level is
full, the framework first triggers a merge of the target level, which
creates another shard in the next level, and then places the newly
created shard into the now-empty target level. Note that each time a
new shard is created, the framework must invoke $\mathtt{build}$ to
construct a new index from scratch for this shard.

The framework supports deletes using two approaches: either by
inserting a special tombstone record, or by performing a lookup for the
record to be deleted and setting a bit in its header. This decision is
called the \emph{delete policy}, with the former being called
\emph{tombstone delete} and the latter \emph{tagged delete}. The
framework will automatically filter deleted records from query results
before returning them to the user, either by checking for the delete
tag, or by performing a lookup of each record for an associated
tombstone. The number of deleted records within the framework can be
bounded by canceling tombstones and their associated records when they
meet during reconstruction, or by dropping all tagged records when a
shard is reconstructed. The framework also supports aggressive
reconstruction (called \emph{compaction}) to precisely bound the number
of deleted records within the index. This is useful for certain search
problems, as was seen with sampling queries in
Chapter~\ref{chap:sampling}, but is not generally necessary to bound
query cost.
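A simplified sketch of this insert path, under the tiering layout
policy, appears below. The control flow mirrors the description above,
but the names and structure are assumptions, and header wrapping and
delete handling are omitted.
\begin{verbatim}
#include <cstddef>
#include <utility>
#include <vector>

// A simplified, self-contained sketch (assumed names) of the insert
// path under the tiering layout policy. `S` is a shard type providing
// the two constructors from the minimal shard interface.
template <typename S, typename Record,
          size_t BufferCap = 12000, size_t ScaleFactor = 8>
class DynamicExtension {
public:
    void insert(const Record &rec) {
        if (buffer.size() == BufferCap) {
            flush_buffer();                // buffer full: flush first
        }
        buffer.push_back(rec);             // header wrapping omitted
    }

private:
    void flush_buffer() {
        S new_shard(buffer);               // build a shard from the buffer
        buffer.clear();
        make_room(0);                      // ensure level 0 has a free slot
        levels[0].push_back(std::move(new_shard));
    }

    // Ensure level i can accept one more shard. Under tiering, a full
    // level is merged into a single shard that is placed in the next
    // level, after recursively making room there.
    void make_room(size_t i) {
        if (i == levels.size()) {
            levels.emplace_back();
            return;
        }
        if (levels[i].size() == ScaleFactor) {
            make_room(i + 1);
            S merged(levels[i]);           // rebuild from the level's shards
            levels[i + 1].push_back(std::move(merged));
            levels[i].clear();
        }
    }

    std::vector<Record> buffer;
    std::vector<std::vector<S>> levels;    // each holds <= ScaleFactor shards
};
\end{verbatim}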
\Paragraph{Design space.} The framework described in this section has a
large design space. In fact, much of the design space has similar knobs
to the well-known LSM-tree~\cite{dayan17}, albeit in a different
environment: the framework targets in-memory static index structures
for general extended-decomposable queries without efficient index
merging support, whereas the LSM-tree targets external range indexes
that can be efficiently merged.

The framework's design trades off among auxiliary memory usage, read
performance, and write performance. The two most significant decisions
are the choice of layout and delete policy. A tiering layout policy
reduces write amplification compared to leveling, requiring each record
to only be written once per level, but increases the number of shards
within the structure, which can hurt query performance. As for delete
policy, the use of tombstones turns deletes into insertions, which are
typically faster. However, depending upon the nature of the query being
executed, the delocalization of the presence information for a record
may result in one extra point lookup for each record in the result set
of a query, vastly reducing read performance. In these cases, tagging
may make more sense. This results in each delete turning into a slower
point-lookup, but always allows for constant-time visibility checks of
records. The other two major parameters, scale factor and buffer size,
can be used to tune the performance once the policies have been
selected. Generally speaking, larger scale factors result in fewer
shards, but can increase write amplification under leveling. Large
buffer sizes can adversely affect query performance when an unsorted
buffer is used, while allowing higher update throughput. Because the
overall design of the framework remains largely unchanged, the design
space exploration of Section~\ref{ssec:ds-exp} remains relevant here.

\subsection{The Shard Interface}

The shard object serves as a ``shim'' between a data structure and the
extension framework, providing a set of mandatory functions which are
used by the framework code to facilitate reconstruction and deleting
records. The data structure being extended can be provided by a
different library and included as an attribute via
composition/aggregation, or can be directly implemented within the
shard class. Additionally, shards can contain any auxiliary structures,
such as Bloom filters or hash tables, necessary to support the required
interface.

The required interface for a shard object is as follows,
\begin{verbatim}
    new(MutableBuffer) -> Shard
    new(Shard[]) -> Shard
    point_lookup(Record, Boolean) -> Record
    get_data() -> Record[]
    get_record_count() -> Int
    get_tombstone_count() -> Int
    get_memory_usage() -> Int
    get_aux_memory_usage() -> Int
\end{verbatim}

The first two functions are constructors, necessary to build a new
Shard from either an array of other shards (for a reconstruction), or
from a mutable buffer (for a buffer flush).\footnote{
    This is the interface as it currently stands in the existing
    implementation, but it is subject to change. In particular, we are
    considering changing the shard reconstruction procedure to allow
    for only one necessary constructor, with a more general interface.
    As we look to concurrency, being able to construct shards from
    arbitrary combinations of shards and buffers will become
    convenient, for example.
}
The \texttt{point\_lookup} operation is necessary for delete support,
and is used either to locate a record for deletion when tagging is
used, or to search for a tombstone associated with a record when
tombstones are used. The boolean argument communicates to the shard
whether the lookup is intended to locate a tombstone or a record. It is
meant to allow the shard to control whether a point lookup checks a
filter before searching, but could also be used for other purposes. The
\texttt{get\_data} function exposes a pointer to the beginning of the
array of records contained within the shard. It imposes no restriction
on the order of these records, but does require that all records can be
accessed sequentially from this pointer, and that the order of records
does not change. The rest of the functions are accessors for various
shard metadata. The record and tombstone counts are used by the
framework for reconstruction purposes.\footnote{The record count
includes tombstones as well, so the true record count on a level is
$\text{reccnt} - \text{tscnt}$.} The memory usage statistics are, at
present, only exposed directly to the user and have no effect on the
framework's behavior. In the future, these may be used for concurrency
control and task scheduling purposes.

Beyond these, a shard can expose any additional functions that are
necessary for its associated query classes. For example, a shard
intended to be used for range queries might expose upper and lower
bound functions, or a shard used for nearest neighbor search might
expose a nearest-neighbor function.
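As an example of the shape such a shim takes, below is an abbreviated
sketch of a shard over a sorted record array, of the general kind used
in the evaluation later in this chapter. It is illustrative only: the
learned index, memory accounting, delete handling, and construction
from other shards are elided, and the names are assumptions.
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// An abbreviated sketch of a shard shim over a sorted record array.
// The optional lower_bound/upper_bound accessors are the sort relied
// upon by the provided range query and IRS query classes.
template <typename Record>
class SortedArrayShard {
public:
    SortedArrayShard(std::vector<Record> records)
        : data(std::move(records)) {
        std::sort(data.begin(), data.end());
        // a static learned index over `data` would be built here
    }

    const Record *point_lookup(const Record &rec, bool /*tombstone*/) const {
        auto it = std::lower_bound(data.begin(), data.end(), rec);
        return (it != data.end() && *it == rec) ? &*it : nullptr;
    }

    const Record *get_data() const { return data.data(); }
    size_t get_record_count() const { return data.size(); }
    size_t get_tombstone_count() const { return 0; }  // deletes elided

    // optional interface used by the range query and IRS query classes
    size_t lower_bound(const Record &rec) const {
        return std::lower_bound(data.begin(), data.end(), rec) - data.begin();
    }
    size_t upper_bound(const Record &rec) const {
        return std::upper_bound(data.begin(), data.end(), rec) - data.begin();
    }

private:
    std::vector<Record> data;  // sorted, gap-free record array
};
\end{verbatim}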
\subsection{The Query Interface}
\label{ssec:fw-query-int}

The required interface for a query in the framework is a bit more
complicated than the interface defined for an eDSP, because the
framework needs to query the mutable buffer as well as the shards. As a
result, there is some slight duplication of functions, with specialized
query and pre-processing routines for both shards and buffers.
Specifically, a query must define the following functions,
\begin{verbatim}
    get_query_state(QueryParameters, Shard) -> ShardState
    get_buffer_query_state(QueryParameters, Buffer) -> BufferState

    process_query_states(QueryParameters, ShardStateList,
                         BufferStateList) -> LocalQueryList

    query(LocalQuery, Shard) -> ResultList
    buffer_query(LocalQuery, Buffer) -> ResultList

    merge(ResultList) -> FinalResult

    delete_query_state(ShardState)
    delete_buffer_query_state(BufferState)

    bool EARLY_ABORT
    bool SKIP_DELETE_FILTER
\end{verbatim}

The \texttt{get\_query\_state} and \texttt{get\_buffer\_query\_state}
functions map to the \texttt{local\_preproc} operation of the eDSP
definition for shards and buffers respectively.
\texttt{process\_query\_states} serves the function of
\texttt{distribute\_query}. Note that this function takes a list of
buffer states; although the proposed framework above contains only a
single buffer, future support for concurrency will require multiple
buffers, and so the interface is set up with support for this. The
\texttt{query} and \texttt{buffer\_query} functions execute the local
query against the shard or buffer and return the intermediate results,
which are merged using \texttt{merge} into a final result set. The
\texttt{EARLY\_ABORT} parameter can be set to \texttt{true} to force
the framework to return as soon as the first result is found, rather
than querying the entire structure, and \texttt{SKIP\_DELETE\_FILTER}
disables the framework's automatic delete filtering, allowing deletes
to be handled manually within the \texttt{merge} function by the
developer. These flags exist to allow for optimizations for certain
types of query.
For example, point-lookups can take advantage of \texttt{EARLY\_ABORT}
to stop as soon as a match is found, and \texttt{SKIP\_DELETE\_FILTER}
can be used for more efficient tombstone delete handling in range
queries, where tombstones for results will always be in the
\texttt{ResultList}s going into \texttt{merge}.

The framework itself answers queries by simply calling these routines
in a prescribed order,
\begin{verbatim}
query(QueryArguments qa) BEGIN
    FOR i < BufferCount DO
        BufferStates[i] = get_buffer_query_state(qa, Buffers[i])
    DONE

    FOR i < ShardCount DO
        ShardStates[i] = get_query_state(qa, Shards[i])
    DONE

    LocalQueries = process_query_states(qa, ShardStates, BufferStates)

    FOR i < BufferCount DO
        temp = buffer_query(LocalQueries[i], Buffers[i])
        IF NOT SKIP_DELETE_FILTER THEN
            temp = filter_deletes(temp)
        END
        Results[i] = temp

        IF EARLY_ABORT AND Results[i].size() > 0 THEN
            delete_states(ShardStates, BufferStates)
            RETURN merge(Results)
        END
    DONE

    FOR i < ShardCount DO
        temp = query(LocalQueries[BufferCount + i], Shards[i])
        IF NOT SKIP_DELETE_FILTER THEN
            temp = filter_deletes(temp)
        END
        Results[BufferCount + i] = temp

        IF EARLY_ABORT AND Results[BufferCount + i].size() > 0 THEN
            delete_states(ShardStates, BufferStates)
            RETURN merge(Results)
        END
    DONE

    delete_states(ShardStates, BufferStates)
    RETURN merge(Results)
END
\end{verbatim}

\subsubsection{Standardized Queries}

Provided with the framework are several ``standardized'' query classes,
including point lookup, range query, and IRS. These queries can be
freely applied to any shard class that implements the necessary
optional interfaces. For example, the provided IRS and range query
classes both require the shard to implement a \texttt{lower\_bound} and
\texttt{upper\_bound} function that returns an index. They then use
this index to access the record array exposed via \texttt{get\_data}.
This is convenient, because it helps to separate the search problem
from the data structure, and moves towards presenting these two objects
as orthogonal.

In the next section the framework is evaluated by producing a number of
indexes for three different search problems. Specifically, the
framework is applied to a pair of learned indexes, as well as an
ISAM-tree. All three of these shards provide the bound interface
described above, meaning that the same range query class can be used
for all of them. It also means that the learned indexes automatically
have support for IRS. And, of course, they can all be used with the
provided point-lookup query, which simply uses the required
\texttt{point\_lookup} function of the shard.

At present, the framework only supports associating a single query
class with an index. However, this is simply a limitation of the
implementation. In the future, approaches will be considered for
associating arbitrary query classes, to allow truly multi-purpose
indexes to be constructed. This is not to say that every data structure
will necessarily be efficient at answering every type of query that
could be answered using its interface; but in a database system, being
able to repurpose an existing index to accelerate a wide range of query
types would certainly seem worth considering.
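To make this separation concrete, a range-count query class in the
style of this interface might be sketched as follows, usable against
any shard exposing the bound functions (such as the sorted-array sketch
given earlier). The structure is an illustrative assumption, not the
framework's shipped query class.
\begin{verbatim}
#include <cstddef>
#include <vector>

// A sketch of a range-count query class in the style of the interface
// above (names and structure are illustrative assumptions). It works
// against any shard exposing lower_bound/upper_bound.
template <typename Shard, typename Record>
struct RangeCountQuery {
    struct Parameters { Record low; Record high; };
    struct LocalQuery { Record low; Record high; };
    struct Result { size_t records = 0; size_t tombstones = 0; };

    // no pre-processing is needed: local queries mirror the parameters
    static LocalQuery distribute(const Parameters &p) {
        return {p.low, p.high};
    }

    static Result query(const LocalQuery &q, const Shard &shard) {
        // count via the bound functions; a full implementation would
        // also count the tombstones falling inside the range
        Result r;
        r.records = shard.upper_bound(q.high) - shard.lower_bound(q.low);
        return r;
    }

    static size_t merge(const std::vector<Result> &partials) {
        size_t records = 0, tombstones = 0;
        for (const auto &r : partials) {
            records += r.records;
            tombstones += r.tombstones;
        }
        return records - tombstones;  // tombstones cancel their records
    }

    static constexpr bool EARLY_ABORT = false;
    static constexpr bool SKIP_DELETE_FILTER = true;
};
\end{verbatim}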
\section{Framework Evaluation}

The framework was evaluated using three different types of search
problem: range-count, high-dimensional k-nearest neighbor, and
independent range sampling. In all three cases, an extended static data
structure was compared with dynamic alternatives for the same search
problem, to demonstrate the framework's competitiveness.

\subsection{Methodology}

All tests were performed using Ubuntu 22.04 LTS on a dual-socket Intel
Xeon Gold 6242R server with 384 GiB of installed memory and 40 physical
cores. Benchmark code was compiled using \texttt{gcc} version 11.3.0 at
the \texttt{-O3} optimization level.

\subsection{Range Queries}

The first test evaluates the performance of the framework in the
context of range queries against learned indexes. In
Chapter~\ref{chap:intro}, the lengthy development cycle of this sort of
data structure was discussed, and so learned indexes were selected as
an evaluation candidate to demonstrate how this framework could allow
such lengthy development lifecycles to be largely bypassed.

Specifically, the framework is used to produce dynamic learned indexes
based on TrieSpline~\cite{plex} (DE-TS) and the static version of
PGM~\cite{pgm} (DE-PGM). These are both single-pass construction static
learned indexes, and thus well suited for use within this framework
compared to more complex structures like RMI~\cite{RMI}, which have
more expensive construction algorithms. The two framework-extended data
structures are compared with dynamic learned indexes, namely
ALEX~\cite{ALEX} and the dynamic version of PGM~\cite{pgm}. PGM
provides an interesting comparison, as its native dynamic version was
implemented using a slightly modified version of the Bentley-Saxe
method.

When performing range queries over large data sets, the copying of
query results can introduce significant overhead. Because the four
tested structures have different data copy behaviors, a range count
query was used for testing, rather than a pure range query. This search
problem exposes the searching performance of the data structures while
controlling for different data copy behaviors, and so should provide
more directly comparable results.

Range count queries were executed with a selectivity of $0.01\%$
against three datasets from the SOSD benchmark~\cite{sosd-datasets}:
\texttt{book}, \texttt{fb}, and \texttt{osm}. These each contain 200
million 64-bit keys following a variety of distributions, which were
paired with uniquely generated 64-bit values. There is a fourth dataset
in SOSD, \texttt{wiki}, which was excluded from testing because it
contains duplicate keys, which are not supported by dynamic
PGM.\footnote{The dynamic version of PGM supports deletes using
tombstones, but doesn't wrap records with a header to accomplish this.
Instead, it reserves one possible value to represent a tombstone.
Records are deleted by inserting a record having the same key and this
reserved value. This means that duplicate keys, even if they have
different values, are unsupported, as two records with the same key
will be treated as a delete by the index.~\cite{pgm}}

The shard implementations for DE-PGM and DE-TS required about 300 lines
of C++ code each, and no modification to the data structures
themselves. For both data structures, the framework was configured with
a buffer of 12,000 records, a scale factor of 8, the tombstone delete
policy, and tiering. Each shard stored $\mathcal{D}_i$ as a sorted
array of records, used an instance of the learned index for
$\mathcal{I}_i$, and had no auxiliary structures. The local query
routine used the learned index to locate the first key in the query
range, and then iterated over the sorted array until the end of the
range was reached, counting the number of records and tombstones
encountered. The mutable buffer query performed the same counting using
a full scan of the buffer. No local pre-processing was needed, and the
merge operation simply summed the record and tombstone counts and
returned their difference.
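Assuming shard and query classes along the lines of the sketches given
earlier in this chapter, instantiating such an extended index reduces
to supplying template parameters. A hypothetical usage, reusing those
sketched names (not the framework's actual API), might look like the
following.
\begin{verbatim}
#include <cstdint>
#include <utility>

// Hypothetical usage (assumed names, reusing the earlier sketches):
// a shard shim plus configuration parameters yields a dynamic index.
using Record = std::pair<uint64_t, uint64_t>;  // 64-bit key and value

using Index = DynamicExtension<SortedArrayShard<Record>, Record,
                               12000,  // mutable buffer capacity
                               8>;     // scale factor

int main() {
    Index index;
    for (uint64_t i = 0; i < 100000; i++) {
        index.insert({i, i});  // flushes and reconstructions happen
                               // internally as the buffer fills
    }
}
\end{verbatim}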
\begin{figure*}[t]
    \centering
    \subfloat[Update Throughput]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-insert} \label{fig:rq-insert}}
    \subfloat[Query Latency]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-query} \label{fig:rq-query}} \\
    \subfloat[Index Sizes]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0 ]{img/fig-bs-rq-space} \label{fig:idx-space}}
    \caption{Range Count Evaluation}
    \label{fig:results1}
\end{figure*}

Figure~\ref{fig:rq-insert} shows the update throughput of all
competitors. ALEX performs the worst in all cases, and PGM performs the
best, with the extended indexes falling in the middle. It is not
unexpected that PGM performs better than the framework, because the
Bentley-Saxe extension in PGM is custom-built, and thus has a tighter
integration than a general framework would allow. However, even with
this advantage, DE-PGM still reaches up to 85\% of PGM's insertion
throughput. Additionally, Figure~\ref{fig:rq-query} shows that PGM pays
a large cost in query latency for its advantage in insertion, with the
framework-extended indexes significantly outperforming it. Further,
DE-TS even outperforms ALEX in query latency in some cases. Finally,
Figure~\ref{fig:idx-space} shows the storage cost of the indexes,
without counting the space necessary to store the records themselves.
The storage cost of a learned index is fairly variable, as it is
largely a function of the distribution of the data, but in all cases
the extended learned indexes, which build compact data arrays without
gaps, occupy three orders of magnitude less storage than ALEX, which
requires leaving gaps in its data arrays.

\subsection{High-Dimensional k-Nearest Neighbor}
The next test evaluates the framework for the extension of
high-dimensional metric indexes for the k-nearest neighbor search
problem. An M-tree~\cite{mtree} was used as the dynamic
baseline,\footnote{
    Specifically, the M-tree implementation tested can be found at
    \url{https://github.com/dbrumbaugh/M-Tree} and is a fork of a
    structure written originally by Eduardo D'Avila, modified to
    compile under C++20. The tree uses a random selection algorithm
    for ball splitting.
} and a VPTree~\cite{vptree} as the static structure. The framework was
used to extend the VPTree to produce the dynamic version, DE-VPTree. An
M-tree is a tree that partitions records based on high-dimensional
spheres, and supports updates by splitting and merging these
partitions. A VPTree is a binary tree that is produced by recursively
selecting a point, called the vantage point, and partitioning records
based on their distance from that point. This results in a
difficult-to-modify structure that can be constructed in $O(n \log n)$
time and can answer KNN queries in $O(k \log n)$ time.

DE-VPTree used a buffer of 12,000 records, a scale factor of 6,
tiering, and delete tagging. The query was implemented without a
pre-processing step, using the standard VPTree algorithm for KNN
queries against each shard.
All $k$ records were determined for each shard, and then the merge
operation used a heap to merge the result sets together and return the
$k$ nearest neighbors from the $k \log n$ intermediate results. This is
a type of query that pays a non-constant merge cost of $O(k \log k)$,
even with the framework's expanded query interface. In effect, the KNN
query must be answered twice: once for each shard to get the
intermediate result sets, and then a second time within the merge
operation to select the KNN from those result sets.

\begin{figure}
    \centering
    \includegraphics[width=.75\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn}
    \caption{KNN Index Evaluation}
    \label{fig:knn}
\end{figure}
Euclidean distance was used as the metric for both structures, and
$k=1000$ was used for all queries. The reference point for each query
was selected randomly from points within the dataset. Tests were run
using the Spanish Billion Words dataset~\cite{sbw}, consisting of
300-dimensional vectors. The results are shown in Figure~\ref{fig:knn}.
In this case, the static nature of the VPTree allows it to dominate the
M-tree in query latency, and the simpler reconstruction procedure
provides a significant insertion performance improvement as well.

\subsection{Independent Range Sampling}
Finally, the framework was tested using one-dimensional IRS queries. As
before, a static ISAM-tree was used as the data structure to be
extended; however, the sampling query was implemented using the query
interface from Section~\ref{ssec:fw-query-int}. The pre-processing step
identifies the first and last record falling into the range to be
sampled from, and determines the total weight of this range, for each
shard. Then, in the local query generation step, these weights are used
to construct an alias structure, which is used to assign sample sizes
to each shard based on weight, to avoid introducing skew into the
results. After this, the query routine generates random numbers between
the established bounds to sample records, and the merge operation
appends the individual result sets together. This procedure requires
only a pair of tree traversals per shard, regardless of how many
samples are taken.

\begin{figure}
    \centering
    \subfloat[Query Latency]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-query} \label{fig:irs-query}}
    \subfloat[Update Throughput]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-insert} \label{fig:irs-insert}}
    \caption{IRS Index Evaluation}
    \label{fig:results2}
\end{figure}

The extended ISAM structure (DE-IRS) was compared to a B$^+$-tree with
aggregate weight tags on internal nodes (AGG B$^+$-tree) for sampling
and insertion performance, and to a single instance of the static
ISAM-tree (ISAM), which does not support updates. DE-IRS was configured
with a buffer size of 12,000 records, a scale factor of 6, tiering, and
delete tagging. The IRS queries had a selectivity of $0.1\%$ with a
sample size of $k=1000$. Testing was performed using the same datasets
as were used for range queries.

Figure~\ref{fig:irs-query} shows the significant latency advantage that
the dynamically extended ISAM-tree enjoys compared to a B$^+$-tree.
DE-IRS is up to 23 times faster than the B$^+$-tree at answering
sampling queries, and only about 3 times slower than the fully static
solution.
In this case, the extra query cost caused by needing to query multiple
structures is more than balanced by the query efficiency of each of
those structures, relative to tree sampling. Interestingly, the
framework also results in better update performance compared to the
B$^+$-tree, as shown in Figure~\ref{fig:irs-insert}. This is likely
because the ISAM shards can be efficiently constructed using a
combination of sorted-merge operations and bulk-loading, avoiding the
expensive structural modification operations that are necessary for
maintaining a B$^+$-tree.

\subsection{Discussion}

The results demonstrate not only that the framework's update support is
competitive with custom-built dynamic data structures, but that the
framework is even able to, in many cases, retain some of the query
performance advantage of the extended static data structure. This is
particularly evident in the k-nearest neighbor and independent range
sampling tests, where the static version of the structure was directly
tested as well. These tests demonstrate one of the advantages of static
data structures: they are able to maintain much tighter inter-record
relationships than dynamic ones, because update support typically
requires relaxing these relationships. While the framework introduces
the overhead of querying multiple structures and merging their results,
it is clear from the results that this overhead is generally less than
the overhead incurred by the update support techniques used in the
dynamic structures. The only case where the framework did not win in
query performance was in competition with ALEX, where the resulting
query latencies were comparable.

It is also evident that the update support provided by the framework is
on par with, if not superior to, that provided by the dynamic
baselines, at least in terms of throughput. The framework will
certainly suffer from larger tail latency spikes, which weren't
measured in this round of testing, due to the larger scale of its
reconstructions, but the amortization of these costs over a large
number of inserts allows it to maintain a respectable level of
throughput. In fact, the only case where the framework loses in
insertion throughput is against the dynamic PGM. However, an
examination of the query latency reveals that the standard
configuration of the Bentley-Saxe variant used by PGM is highly tuned
for insertion performance: the query latencies against this data
structure are far worse than those of any other learned index tested.
Even this result, then, shouldn't be taken as a ``clear'' defeat of the
framework's implementation.

Overall, it is clear from this evaluation that the dynamic extension
framework is a promising alternative to manual index redesign for
accommodating updates. In almost all cases, the framework-extended
static data structures provided superior insertion throughput, and
query latencies that either matched or exceeded those of the dynamic
baselines. Additionally, though it is hard to quantify, the code
complexity of the framework-extended data structures was much lower,
with the shard implementations requiring only a small amount of
relatively straightforward code to interface with pre-existing static
data structures, or with the necessary data structure implementations
themselves being simpler.
\section{Conclusion}

In this chapter, a generalized version of the framework originally
proposed in Chapter~\ref{chap:sampling} was presented. This framework
is based on two key properties: extended decomposability and record
identity. It is capable of extending any data structure and search
problem supporting these two properties with support for inserts and
deletes. An evaluation of this framework was performed by extending
several static data structures, and comparing the resulting structures'
performance against dynamic baselines capable of answering the same
type of search problem. The extended structures generally performed as
well as, if not better than, their dynamic baselines in query
performance, insert performance, or both. This demonstrates the
capability of this framework to produce viable indexes in a variety of
contexts. However, the framework is not yet complete. In the next
chapter, the work required to bring this framework to completion will
be described.