\chapter{Generalizing the Framework}
\label{chap:framework}

The previous chapter demonstrated the possible utility of designing
indexes based upon the dynamic extension of static data structures.
However, the presented strategy falls short of a general framework, as
it is specific to sampling problems. In this chapter, the techniques of
that work will be discussed in more general terms, to arrive at a more
broadly applicable solution. A general framework is proposed, which
places only two requirements on supported data structures,

\begin{itemize}
    \item Extended Decomposability
    \item Record Identity
\end{itemize}

In this chapter, first these two properties are defined. Then, a
general dynamic extension framework is described which can be applied
to any data structure supporting these properties. Finally, an
experimental evaluation is presented that demonstrates the viability of
this framework.

\section{Extended Decomposability}

Chapter~\ref{chap:sampling} demonstrated how non-DSPs can be
efficiently addressed using Bentley-Saxe, so long as the query
interface is modified to accommodate their needs. For independent
sampling problems, this involved a two-pass approach, where some
pre-processing work was performed against each shard and used to
construct a shard alias structure. This structure was then used to
determine how many samples to draw from each shard.

To generalize this approach, a new class of decomposability is
proposed, called \emph{extended decomposability}. At present, its
definition is tied tightly to the query interface, rather than to a
formal mathematical definition. In extended decomposability, rather
than treating a search problem as a monolith, the query algorithm is
decomposed into multiple components. This allows for communication
between shards as part of the query process. Additionally, rather than
using a binary merge operator, extended decomposability uses a variadic
function that merges all of the result sets in one pass, reducing the
cost due to merging by a logarithmic factor without introducing any new
restrictions.

The basic interface that must be supported by an extended-decomposable
search problem (eDSP) is,
\begin{itemize}
    \item $\mathbftt{local\_preproc}(\mathcal{I}_i, \mathcal{Q}) \to
    \mathscr{S}_i$ \\
    Pre-processes each partition $\mathcal{D}_i$ using index
    $\mathcal{I}_i$ to produce preliminary information about the
    query result on this partition, encoded as an object
    $\mathscr{S}_i$.

    \item $\mathbftt{distribute\_query}(\mathscr{S}_1, \ldots,
    \mathscr{S}_m, \mathcal{Q}) \to \mathcal{Q}_1, \ldots,
    \mathcal{Q}_m$ \\
    Processes the list of preliminary information objects
    $\mathscr{S}_i$ and emits a list of local queries
    $\mathcal{Q}_i$ to run independently on each partition.

    \item $\mathbftt{local\_query}(\mathcal{I}_i, \mathcal{Q}_i)
    \to \mathcal{R}_i$ \\
    Executes the local query $\mathcal{Q}_i$ over partition
    $\mathcal{D}_i$ using index $\mathcal{I}_i$ and returns a
    partial result $\mathcal{R}_i$.

    \item $\mathbftt{merge}(\mathcal{R}_1, \ldots, \mathcal{R}_m) \to
    \mathcal{R}$ \\
    Merges the partial results to produce the final answer.
\end{itemize}
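To make this interface concrete, the following is a minimal sketch of
how an eDSP might be expressed as a C++20 concept, in the spirit of the
implementation discussed later in this chapter. The type and member
names here are illustrative assumptions, not the framework's actual
API.
\begin{verbatim}
#include <concepts>
#include <vector>

// A sketch (with assumed names) of the four eDSP operations as a
// C++20 concept. Q::LocalState plays the role of the preliminary-
// information object S_i, Q::LocalQuery the per-partition query Q_i,
// and Q::Result the partial result R_i.
template <typename Q, typename Index>
concept ExtendedDecomposable = requires(
    const Index &index, const typename Q::Parameters &params,
    std::vector<typename Q::LocalState> &states,
    const typename Q::LocalQuery &local,
    std::vector<typename Q::Result> &partials)
{
    // local_preproc: per-partition pre-processing
    { Q::local_preproc(index, params) }
        -> std::same_as<typename Q::LocalState>;

    // distribute_query: turn the collected states into local queries
    { Q::distribute_query(states, params) }
        -> std::same_as<std::vector<typename Q::LocalQuery>>;

    // local_query: answer one local query against one partition
    { Q::local_query(index, local) }
        -> std::same_as<typename Q::Result>;

    // merge: variadic merge of all partial results in a single pass
    { Q::merge(partials) }
        -> std::same_as<typename Q::Result>;
};
\end{verbatim}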
The pseudocode for the query algorithm using this interface is,
\begin{algorithm}
    \DontPrintSemicolon
    \SetKwProg{Proc}{procedure}{ BEGIN}{END}
    \SetKwProg{For}{for}{ DO}{DONE}

    \Proc{\mathbftt{QUERY}($D[]$, $\mathscr{Q}$)} {
        \For{$i \in [0, |D|)$} {
            $S[i] := \mathbftt{local\_preproc}(D[i], \mathscr{Q})$
        } \;

        $Q := \mathbftt{distribute\_query}(S, \mathscr{Q})$ \; \;

        \For{$i \in [0, |D|)$} {
            $R[i] := \mathbftt{local\_query}(D[i], Q[i])$
        } \;

        $OUT := \mathbftt{merge}(R)$ \;

        \Return {$OUT$} \;
    }
\end{algorithm}

In this system, each query can report a partial result with
$\mathbftt{local\_preproc}$, which can be used by
$\mathbftt{distribute\_query}$ to adjust the per-partition query
parameters, allowing for direct communication of state between
partitions. Queries which do not need this functionality can simply
return empty $\mathscr{S}_i$ objects from $\mathbftt{local\_preproc}$.

\subsection{Query Complexity}

Before describing how to use this new interface and definition to
support more efficient queries than standard decomposability, a more
general expression for the cost of querying such a structure should be
derived. Recall that Bentley-Saxe, when applied to a
$C(n)$-decomposable problem, has the following query cost,
\begin{equation}
    \label{eq3:Bentley-Saxe}
    O\left(\log n \cdot \left( Q_s(n) + C(n)\right)\right)
\end{equation}
where $Q_s(n)$ is the cost of the query against one partition, and
$C(n)$ is the cost of the merge operator.

Let $Q_s(n)$ represent the cost of $\mathbftt{local\_query}$ and
$C(n)$ the cost of $\mathbftt{merge}$ in the extended decomposability
case. Additionally, let $P(n)$ be the cost of
$\mathbftt{local\_preproc}$ and $\mathcal{D}(n)$ be the cost of
$\mathbftt{distribute\_query}$, and recall that $|D| = \log n$ for the
Bentley-Saxe method. In this case, the cost of a query is
\begin{equation}
    O \left( \log n \cdot P(n) + \mathcal{D}(n)
        + \log n \cdot Q_s(n) + C(n) \right)
\end{equation}

Superficially, this looks to be strictly worse than the Bentley-Saxe
case in Equation~\ref{eq3:Bentley-Saxe}. However, the important thing
to understand is that for $C(n)$-decomposable queries, $P(n) \in O(1)$
and $\mathcal{D}(n) \in O(1)$, as these steps are unneeded. Thus, for
normal decomposable queries, the cost actually reduces to,
\begin{equation}
    O \left( \log n \cdot Q_s(n) + C(n) \right)
\end{equation}
which is actually \emph{better} than Bentley-Saxe. Meanwhile, the
ability to perform state-sharing between queries can facilitate better
solutions than would otherwise be possible.

In light of this new approach, consider the two examples of
non-decomposable search problems from Section~\ref{ssec:decomp-limits}.

\subsection{k-Nearest Neighbor}
\label{ssec:knn}
The KNN problem is $C(n)$-decomposable, and
Section~\ref{sssec-decomp-limits-knn} arrived at a Bentley-Saxe based
solution to this problem based on VPTree, with a query cost of
\begin{equation}
    O \left( k \log^2 n + k \log n \log k \right)
\end{equation}
by running KNN on each partition, and then merging the result sets with
a heap.

Applying the interface of extended decomposability to this problem
allows for some optimizations. Pre-processing is not necessary here,
but the variadic merge function can be leveraged to get an
asymptotically better solution. Simply dropping the existing algorithm
into this interface will result in a merge algorithm with cost,
\begin{equation}
    C(n) \in O \left( k \log n \left( \log k + \log\log n\right)\right)
\end{equation}
which results in a total query cost that is slightly \emph{worse} than
the original,
\begin{equation}
    O \left( k \log^2 n + k \log n \left(\log k + \log\log n\right) \right)
\end{equation}

The problem is that the number of records considered in a given merge
has grown from $O(k)$ in the binary merge case to $O(k \log n)$ in the
variadic merge. However, because the merge function now has access to
all of the data at once, the algorithm can be modified slightly for
better efficiency by only pushing $\log n$ elements into the heap at a
time. This trick only works if the $R_i$s are in sorted order relative
to $f(x, q)$; however, this condition is satisfied by the result sets
returned by KNN against a VPTree. Thus, for each $R_i$, the first
element in sorted order can be inserted into the heap, tagged with a
reference to the $R_i$ it was taken from. Then, when the heap is
popped, the next element from the associated $R_i$ can be inserted.
This allows the heap's size to be maintained at no larger than
$O(\log n)$, and limits the algorithm to no more than $k$ pop
operations and $\log n + k - 1$ pushes.

This algorithm reduces the cost of KNN on this structure to,
\begin{equation}
    O(k \log^2 n + \log n)
\end{equation}
which is strictly better than the original.
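This merge procedure can be sketched in a few lines of C++. The sketch
below assumes that each partial result set is sorted in ascending order
of distance to the query point, as is the case for the result sets
returned by KNN against a VPTree; the names and types are illustrative.
\begin{verbatim}
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <utility>
#include <vector>

// A sketch of the variadic KNN merge. Each partial result set is
// assumed to be sorted in ascending order of distance to the query
// point. The heap holds one candidate per result set, tagged with the
// set it came from, so it never grows beyond the number of shards.
template <typename Record>
std::vector<Record> knn_merge(
    const std::vector<std::vector<std::pair<double, Record>>> &partials,
    size_t k)
{
    // heap entry: (distance, source set, index within that set)
    using Entry = std::tuple<double, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<>> heap;

    // seed the heap with the first (closest) element of each set
    for (size_t i = 0; i < partials.size(); i++) {
        if (!partials[i].empty()) {
            heap.emplace(partials[i][0].first, i, 0);
        }
    }

    std::vector<Record> result;
    result.reserve(k);
    while (result.size() < k && !heap.empty()) {
        auto [dist, set, idx] = heap.top();
        heap.pop();
        result.push_back(partials[set][idx].second);

        // replace the popped entry with the next element of its set
        if (idx + 1 < partials[set].size()) {
            heap.emplace(partials[set][idx + 1].first, set, idx + 1);
        }
    }
    return result;
}
\end{verbatim}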
\subsection{Independent Range Sampling}

The eDSP abstraction also provides sufficient features to implement
IRS, using the same basic approach as was used in the previous chapter.
Unlike KNN, IRS will take advantage of the extended query interface.
Recall from Chapter~\ref{chap:sampling} that the approach used for
answering sampling queries (ignoring the buffer, for now) was,

\begin{enumerate}
    \item Query each shard to establish the weight that should be
    assigned to it when determining sample sizes.
    \item Build an alias structure over those weights.
    \item For each sample, reference the alias structure to determine
    which shard to sample from, and then draw the sample.
\end{enumerate}

This approach can be mapped easily onto the eDSP interface as follows,
\begin{itemize}
    \item[\texttt{local\_preproc}] Determine and return the total
    weight of candidate records for sampling in the shard.
    \item[\texttt{distribute\_query}] Using the shard weights,
    construct an alias structure associating each shard with its total
    weight. Then, query this alias structure $k$ times. For shard $i$,
    the local query $\mathscr{Q}_i$ will have its sample size assigned
    based on how many times $i$ is returned during the alias querying
    (see the sketch after this list).
    \item[\texttt{local\_query}] Process the local query using the
    underlying data structure's normal sampling procedure.
    \item[\texttt{merge}] Union all of the partial results together.
\end{itemize}
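As an illustration of the \texttt{distribute\_query} step, the
following sketch performs the sample-size assignment, with
\texttt{std::discrete\_distribution} standing in for the alias
structure (both provide weighted selection of a shard per sample). The
names are assumptions for illustration.
\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

// A sketch of IRS sample-size assignment: given the total candidate
// weight reported by each shard's local_preproc, draw k shard choices
// in proportion to weight and count how many samples each shard owes.
// std::discrete_distribution stands in for the alias structure here.
std::vector<size_t> assign_sample_sizes(
    const std::vector<double> &shard_weights, size_t k, std::mt19937 &rng)
{
    std::discrete_distribution<size_t> dist(shard_weights.begin(),
                                            shard_weights.end());

    std::vector<size_t> sample_sizes(shard_weights.size(), 0);
    for (size_t i = 0; i < k; i++) {
        sample_sizes[dist(rng)]++;  // one more sample owed by this shard
    }

    return sample_sizes;
}
\end{verbatim}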
This division of the query maps closely onto the cost function,
\begin{equation}
    O\left(W(n) + P(n) + kS(n)\right)
\end{equation}
used in Chapter~\ref{chap:sampling}, where the $W(n) + P(n)$
pre-processing cost is associated with the cost of
\texttt{local\_preproc} and the $kS(n)$ sampling cost is associated
with \texttt{local\_query}. The \texttt{distribute\_query} operation
will require $O(\log n)$ time to construct the shard alias structure,
and $O(k)$ time to query it. Accounting then for the fact that
\texttt{local\_preproc} will be called once per shard ($\log n$ times),
and that a total of $k$ records will be sampled at a cost of $S(n)$
each, this results in a total query cost of,
\begin{equation}
    O\left(\left[W(n) + P(n)\right]\log n + k S(n)\right)
\end{equation}
which matches the cost in Equation~\ref{eq:sample-cost}.

\section{Record Identity}

Another important consideration for the framework is support for
deletes, which are important in the context of database systems. The
sampling extension framework supported two techniques for the deletion
of records: tombstone-based deletes and tagging-based deletes. In both
cases, the solution required that the shard support point lookups,
either for checking tombstones or for finding the record to mark it as
deleted. Implicit in this is an important property of the underlying
data structure which was taken for granted in that work, but which will
be made explicit here: record identity.

Delete support requires that each record within the index be uniquely
identifiable, and linkable directly to a location in storage. This
property is called \emph{record identity}. In the context of database
indexes, it isn't a particularly contentious requirement. Indexes are
already designed to provide a mapping directly to a record in storage,
which (at least in the context of an RDBMS) must have a unique
identifier attached. However, in more general contexts, this
requirement will place some restrictions on the applicability of the
framework.

For example, approximate data structures or summaries, such as Bloom
filters~\cite{bloom70} or count-min sketches~\cite{countmin-sketch},
don't necessarily store the underlying records at all. In principle,
some summaries \emph{could} be supported by normal Bentley-Saxe, as
there exist mergeable summaries~\cite{mergeable-summaries}. But because
these data structures violate the record identity property, they would
not support deletes (either in the framework, or in Bentley-Saxe). The
framework considers deletes to be a first-class citizen, and this is
formalized by requiring record identity as a property that supported
data structures must have.
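As a concrete illustration of what record identity enables, the
following is a sketch of a header-wrapped record of the sort the
framework might store. The exact layout here is an assumption for
illustration; the real header format is an implementation detail.
\begin{verbatim}
#include <cstdint>

// A sketch of a framework-wrapped record. Because each record is
// uniquely identifiable (here, by its key/value pair) and resides at
// a stable storage location, a delete can either set the `deleted`
// bit in place (tagging) or insert a matching record with the
// `tombstone` bit set (tombstones), cancelled during reconstruction.
template <typename Key, typename Value>
struct WrappedRecord {
    Key key;
    Value value;
    uint32_t deleted   : 1;   // set by tagged deletes
    uint32_t tombstone : 1;   // set on tombstone records
    uint32_t unused    : 30;

    // identity comparison: same logical record, ignoring header bits
    bool matches(const WrappedRecord &other) const {
        return key == other.key && value == other.value;
    }
};
\end{verbatim}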
\section{The General Framework}

Based on these properties, and the work described in
Chapter~\ref{chap:sampling}, a dynamic extension framework has been
devised with broad support for data structures. It is implemented in
C++20, using templates and concepts to define the necessary interfaces.
A user of this framework needs to provide a definition for their data
structure with a prescribed interface (called a \texttt{shard}), and a
definition for their query following an interface based on the above
definition of an eDSP. These two classes can then be used as template
parameters to automatically create a dynamic index, which exposes
methods for inserting and deleting records, as well as executing
queries.

\subsection{Framework Design}

\Paragraph{Structure.} The overall design of the general framework
itself is not substantially different from the sampling framework
discussed in Chapter~\ref{chap:sampling}. It consists of a mutable
buffer and a set of levels containing data structures with
geometrically increasing capacities. The \emph{mutable buffer} is a
small unsorted record array of fixed capacity that buffers incoming
inserts. As the mutable buffer is kept sufficiently small (e.g.,
fitting in the L2 CPU cache), the cost of querying it without any
auxiliary structures can be minimized, while still allowing better
insertion performance than Bentley-Saxe, which requires rebuilding an
index structure on each insertion. The use of an unsorted buffer is
necessary to ensure that the framework doesn't require an existing
dynamic version of the index structure being extended, which would
defeat the purpose of the entire exercise.

The majority of the data within the structure is stored in a sequence
of \emph{levels} with geometrically increasing record capacity, such
that the capacity of level $i$ is $s^{i+1}$, where $s$ is a
configurable parameter called the \emph{scale factor}. Unlike
Bentley-Saxe, these levels are permitted to be partially full, which
allows significantly more flexibility in terms of how reconstruction is
performed. This also opens up the possibility of allowing each level to
allocate its record capacity across multiple data structures (named
\emph{shards}) rather than just one. This decision is called the
\emph{layout policy}, with the use of a single structure being called
\emph{leveling}, and multiple structures being called \emph{tiering}.

\begin{figure}
\centering
\subfloat[Leveling]{\includegraphics[width=.5\textwidth]{img/leveling} \label{fig:leveling}}
\subfloat[Tiering]{\includegraphics[width=.5\textwidth]{img/tiering} \label{fig:tiering}}
    \caption{\textbf{An overview of the general structure of the
    dynamic extension framework} using leveling
    (Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering})
    layout policies. The pictured extension has a scale factor of 3,
    with $L_0$ being at capacity, and $L_1$ being at one third
    capacity. Each shard is shown as a dotted box, wrapping its
    associated dataset ($D_i$), data structure ($I_i$), and auxiliary
    structures ($A_i$).}
\label{fig:framework}
\end{figure}

\Paragraph{Shards.} The basic building block of the dynamic extension
is called a shard, defined as $\mathcal{S}_i = (\mathcal{D}_i,
\mathcal{I}_i, A_i)$, which consists of a partition of the data
$\mathcal{D}_i$, an instance of the static index structure being
extended $\mathcal{I}_i$, and an optional auxiliary structure $A_i$. To
ensure the viability of level reconstruction, the extended data
structure should at least support a construction method
$\mathtt{build}(\mathcal{D})$ that can build a new static index from a
set of records $\mathcal{D}$ from scratch. This set of records may come
from the mutable buffer, or from a union of the underlying data of
multiple other shards. It is also beneficial for $\mathcal{I}_i$ to
support efficient point-lookups, which can search for a record's
storage location by its identifier (given by the record identity
requirement of the framework). The shard can also be customized to
provide any necessary features for supporting the index being extended.
For example, auxiliary data structures like Bloom filters or hash
tables can be added to improve point-lookup performance, or additional
specialized functions can be provided for use by query implementations.

From an implementation standpoint, the shard object provides a shim
between the data structure and the framework itself. At minimum, it
must support the following interface,
\begin{itemize}
    \item $\mathbftt{construct}(B) \to S$ \\
    Construct a new shard from the contents of the mutable buffer, $B$.

    \item $\mathbftt{construct}(S_0, \ldots, S_n) \to S$ \\
    Construct a new shard from the records contained within a list of
    already existing shards.

    \item $\mathbftt{point\_lookup}(r) \to *r$ \\
    Search for a record, $r$, by identity and return a reference to its
    location in storage.
\end{itemize}
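In the C++20 implementation, requirements of this sort are naturally
expressed as concepts. The sketch below shows the shape such a
requirement might take, with assumed names; the full interface actually
required of a shard is described later in this chapter.
\begin{verbatim}
#include <concepts>
#include <vector>

// An illustrative sketch (assumed names) of the minimal shard
// requirements as a C++20 concept: construction from a flushed
// buffer, construction from existing shards, and point lookup by
// record identity.
template <typename S, typename Buffer, typename Record>
concept MinimalShard = requires(const Buffer &buffer,
                                const std::vector<S> &shards,
                                const Record &rec, const S &shard)
{
    { S(buffer) };                  // construct(B) -> S
    { S(shards) };                  // construct(S_0, ..., S_n) -> S
    { shard.point_lookup(rec) }     // point_lookup(r) -> *r
        -> std::same_as<const Record*>;
};
\end{verbatim}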
\Paragraph{Insertion \& deletion.} The framework supports inserting new
records and deleting records already in the index. These two operations
also allow for updates to existing records, by first deleting the old
version and then inserting a new one. These operations are added by the
framework automatically, and require only a small shim or minor
adjustments to the code of the data structure being extended within the
implementation of the shard object.

Insertions are performed by first wrapping the record to be inserted
with a framework header, and then appending it to the end of the
mutable buffer. If the mutable buffer is full, it is flushed to create
a new shard, which is combined into the first level of the structure.
The level reconstruction process is layout policy dependent. In the
case of leveling, the underlying data of the source shard and the
target shard are combined, resulting in a new shard that replaces the
target shard in the target level. When using tiering, the newly created
shard is simply placed into the target level. If the target level is
full, the framework first triggers a merge of the target level, which
creates another shard in the next level, and then places the newly
created shard into the now-empty target level. Note that each time a
new shard is created, the framework must invoke $\mathtt{build}$ to
construct a new index from scratch for this shard.

The framework supports deletes using two approaches: either by
inserting a special tombstone record, or by performing a lookup for the
record to be deleted and setting a bit in its header. This decision is
called the \emph{delete policy}, with the former being called
\emph{tombstone delete} and the latter \emph{tagged delete}. The
framework will automatically filter deleted records from query results
before returning them to the user, either by checking for the delete
tag, or by performing a lookup of each record for an associated
tombstone. The number of deleted records within the framework can be
bounded by canceling tombstones and their associated records when they
meet during reconstruction, or by dropping all tagged records when a
shard is reconstructed. The framework also supports aggressive
reconstruction (called \emph{compaction}) to precisely bound the number
of deleted records within the index. This is useful for certain search
problems, as was seen with sampling queries in
Chapter~\ref{chap:sampling}, but is not generally necessary to bound
query cost.
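A simplified sketch of this insert path, under the tiering layout
policy, appears below. The control flow mirrors the description above,
but the names and structure are assumptions, and header wrapping and
delete handling are omitted.
\begin{verbatim}
#include <cstddef>
#include <utility>
#include <vector>

// A simplified, self-contained sketch (assumed names) of the insert
// path under the tiering layout policy. `S` is a shard type providing
// the two constructors from the minimal shard interface.
template <typename S, typename Record,
          size_t BufferCap = 12000, size_t ScaleFactor = 8>
class DynamicExtension {
public:
    void insert(const Record &rec) {
        if (buffer.size() == BufferCap) {
            flush_buffer();                // buffer full: flush first
        }
        buffer.push_back(rec);             // header wrapping omitted
    }

private:
    void flush_buffer() {
        S new_shard(buffer);               // build a shard from the buffer
        buffer.clear();
        make_room(0);                      // ensure level 0 has a free slot
        levels[0].push_back(std::move(new_shard));
    }

    // Ensure level i can accept one more shard. Under tiering, a full
    // level is merged into a single shard that is placed in the next
    // level, after recursively making room there.
    void make_room(size_t i) {
        if (i == levels.size()) {
            levels.emplace_back();
            return;
        }
        if (levels[i].size() == ScaleFactor) {
            make_room(i + 1);
            S merged(levels[i]);           // rebuild from the level's shards
            levels[i + 1].push_back(std::move(merged));
            levels[i].clear();
        }
    }

    std::vector<Record> buffer;
    std::vector<std::vector<S>> levels;    // each holds <= ScaleFactor shards
};
\end{verbatim}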
\Paragraph{Design space.} The framework described in this section has a
large design space. In fact, much of the design space has similar knobs
to the well-known LSM-tree~\cite{dayan17}, albeit in a different
environment: the framework targets in-memory static index structures
for general extended-decomposable queries without efficient index
merging support, whereas the LSM-tree targets external range indexes
that can be efficiently merged.

The framework's design trades off among auxiliary memory usage, read
performance, and write performance. The two most significant decisions
are the choice of layout and delete policy. A tiering layout policy
reduces write amplification compared to leveling, requiring each record
to only be written once per level, but increases the number of shards
within the structure, which can hurt query performance. As for delete
policy, the use of tombstones turns deletes into insertions, which are
typically faster. However, depending upon the nature of the query being
executed, the delocalization of the presence information for a record
may result in one extra point lookup for each record in the result set
of a query, vastly reducing read performance. In these cases, tagging
may make more sense. This results in each delete turning into a slower
point-lookup, but always allows for constant-time visibility checks of
records. The other two major parameters, scale factor and buffer size,
can be used to tune the performance once the policies have been
selected. Generally speaking, larger scale factors result in fewer
shards, but can increase write amplification under leveling. Large
buffer sizes can adversely affect query performance when an unsorted
buffer is used, while allowing higher update throughput. Because the
overall design of the framework remains largely unchanged, the design
space exploration of Section~\ref{ssec:ds-exp} remains relevant here.

\subsection{The Shard Interface}

The shard object serves as a ``shim'' between a data structure and the
extension framework, providing a set of mandatory functions which are
used by the framework code to facilitate reconstruction and deleting
records. The data structure being extended can be provided by a
different library and included as an attribute via
composition/aggregation, or can be directly implemented within the
shard class. Additionally, shards can contain any auxiliary structures,
such as Bloom filters or hash tables, necessary to support the required
interface.

The required interface for a shard object is as follows,
\begin{verbatim}
    new(MutableBuffer) -> Shard
    new(Shard[]) -> Shard
    point_lookup(Record, Boolean) -> Record
    get_data() -> Record[]
    get_record_count() -> Int
    get_tombstone_count() -> Int
    get_memory_usage() -> Int
    get_aux_memory_usage() -> Int
\end{verbatim}

The first two functions are constructors, necessary to build a new
Shard from either an array of other shards (for a reconstruction), or
from a mutable buffer (for a buffer flush).\footnote{
    This is the interface as it currently stands in the existing
    implementation, but it is subject to change. In particular, we are
    considering changing the shard reconstruction procedure to allow
    for only one necessary constructor, with a more general interface.
    As we look to concurrency, being able to construct shards from
    arbitrary combinations of shards and buffers will become
    convenient, for example.
}
The \texttt{point\_lookup} operation is necessary for delete support,
and is used either to locate a record for deletion when tagging is
used, or to search for a tombstone associated with a record when
tombstones are used. The boolean argument communicates to the shard
whether the lookup is intended to locate a tombstone or a record. It is
meant to allow the shard to control whether a point lookup checks a
filter before searching, but could also be used for other purposes. The
\texttt{get\_data} function exposes a pointer to the beginning of the
array of records contained within the shard. It imposes no restriction
on the order of these records, but does require that all records can be
accessed sequentially from this pointer, and that the order of records
does not change. The rest of the functions are accessors for various
shard metadata. The record and tombstone counts are used by the
framework for reconstruction purposes.\footnote{The record count
includes tombstones as well, so the true record count on a level is
$\text{reccnt} - \text{tscnt}$.} The memory usage statistics are, at
present, only exposed directly to the user and have no effect on the
framework's behavior. In the future, these may be used for concurrency
control and task scheduling purposes.

Beyond these, a shard can expose any additional functions that are
necessary for its associated query classes. For example, a shard
intended to be used for range queries might expose upper and lower
bound functions, or a shard used for nearest neighbor search might
expose a nearest-neighbor function.
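As an example of the shape such a shim takes, below is an abbreviated
sketch of a shard over a sorted record array, of the general kind used
in the evaluation later in this chapter. It is illustrative only: the
learned index, memory accounting, delete handling, and construction
from other shards are elided, and the names are assumptions.
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// An abbreviated sketch of a shard shim over a sorted record array.
// The optional lower_bound/upper_bound accessors are the sort relied
// upon by the provided range query and IRS query classes.
template <typename Record>
class SortedArrayShard {
public:
    SortedArrayShard(std::vector<Record> records)
        : data(std::move(records)) {
        std::sort(data.begin(), data.end());
        // a static learned index over `data` would be built here
    }

    const Record *point_lookup(const Record &rec, bool /*tombstone*/) const {
        auto it = std::lower_bound(data.begin(), data.end(), rec);
        return (it != data.end() && *it == rec) ? &*it : nullptr;
    }

    const Record *get_data() const { return data.data(); }
    size_t get_record_count() const { return data.size(); }
    size_t get_tombstone_count() const { return 0; }  // deletes elided

    // optional interface used by the range query and IRS query classes
    size_t lower_bound(const Record &rec) const {
        return std::lower_bound(data.begin(), data.end(), rec) - data.begin();
    }
    size_t upper_bound(const Record &rec) const {
        return std::upper_bound(data.begin(), data.end(), rec) - data.begin();
    }

private:
    std::vector<Record> data;  // sorted, gap-free record array
};
\end{verbatim}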
\subsection{The Query Interface}
\label{ssec:fw-query-int}

The required interface for a query in the framework is a bit more
complicated than the interface defined for an eDSP, because the
framework needs to query the mutable buffer as well as the shards. As a
result, there is some slight duplication of functions, with specialized
query and pre-processing routines for both shards and buffers.
Specifically, a query must define the following functions,
\begin{verbatim}
    get_query_state(QueryParameters, Shard) -> ShardState
    get_buffer_query_state(QueryParameters, Buffer) -> BufferState

    process_query_states(QueryParameters, ShardStateList,
                         BufferStateList) -> LocalQueryList

    query(LocalQuery, Shard) -> ResultList
    buffer_query(LocalQuery, Buffer) -> ResultList

    merge(ResultList) -> FinalResult

    delete_query_state(ShardState)
    delete_buffer_query_state(BufferState)

    bool EARLY_ABORT
    bool SKIP_DELETE_FILTER
\end{verbatim}

The \texttt{get\_query\_state} and \texttt{get\_buffer\_query\_state}
functions map to the \texttt{local\_preproc} operation of the eDSP
definition for shards and buffers respectively.
\texttt{process\_query\_states} serves the function of
\texttt{distribute\_query}. Note that this function takes a list of
buffer states; although the proposed framework above contains only a
single buffer, future support for concurrency will require multiple
buffers, and so the interface is set up with support for this. The
\texttt{query} and \texttt{buffer\_query} functions execute the local
query against the shard or buffer and return the intermediate results,
which are merged using \texttt{merge} into a final result set. The
\texttt{EARLY\_ABORT} parameter can be set to \texttt{true} to force
the framework to return as soon as the first result is found, rather
than querying the entire structure, and \texttt{SKIP\_DELETE\_FILTER}
disables the framework's automatic delete filtering, allowing deletes
to be handled manually within the \texttt{merge} function by the
developer. These flags exist to allow for optimizations for certain
types of query.
For example, point-lookups can take advantage of \texttt{EARLY\_ABORT}
to stop as soon as a match is found, and \texttt{SKIP\_DELETE\_FILTER}
can be used for more efficient tombstone delete handling in range
queries, where tombstones for results will always be in the
\texttt{ResultList}s going into \texttt{merge}.

The framework itself answers queries by simply calling these routines
in a prescribed order,
\begin{verbatim}
query(QueryArguments qa) BEGIN
    FOR i < BufferCount DO
        BufferStates[i] = get_buffer_query_state(qa, Buffers[i])
    DONE

    FOR i < ShardCount DO
        ShardStates[i] = get_query_state(qa, Shards[i])
    DONE

    LocalQueries = process_query_states(qa, ShardStates, BufferStates)

    FOR i < BufferCount DO
        temp = buffer_query(LocalQueries[i], Buffers[i])
        IF NOT SKIP_DELETE_FILTER THEN
            temp = filter_deletes(temp)
        END
        Results[i] = temp

        IF EARLY_ABORT AND Results[i].size() > 0 THEN
            delete_states(ShardStates, BufferStates)
            RETURN merge(Results)
        END
    DONE

    FOR i < ShardCount DO
        temp = query(LocalQueries[BufferCount + i], Shards[i])
        IF NOT SKIP_DELETE_FILTER THEN
            temp = filter_deletes(temp)
        END
        Results[BufferCount + i] = temp

        IF EARLY_ABORT AND Results[BufferCount + i].size() > 0 THEN
            delete_states(ShardStates, BufferStates)
            RETURN merge(Results)
        END
    DONE

    delete_states(ShardStates, BufferStates)
    RETURN merge(Results)
END
\end{verbatim}

\subsubsection{Standardized Queries}

Provided with the framework are several ``standardized'' query classes,
including point lookup, range query, and IRS. These queries can be
freely applied to any shard class that implements the necessary
optional interfaces. For example, the provided IRS and range query
classes both require the shard to implement a \texttt{lower\_bound} and
\texttt{upper\_bound} function that returns an index. They then use
this index to access the record array exposed via \texttt{get\_data}.
This is convenient, because it helps to separate the search problem
from the data structure, and moves towards presenting these two objects
as orthogonal.

In the next section the framework is evaluated by producing a number of
indexes for three different search problems. Specifically, the
framework is applied to a pair of learned indexes, as well as an
ISAM-tree. All three of these shards provide the bound interface
described above, meaning that the same range query class can be used
for all of them. It also means that the learned indexes automatically
have support for IRS. And, of course, they can all be used with the
provided point-lookup query, which simply uses the required
\texttt{point\_lookup} function of the shard.

At present, the framework only supports associating a single query
class with an index. However, this is simply a limitation of the
implementation. In the future, approaches will be considered for
associating arbitrary query classes, to allow truly multi-purpose
indexes to be constructed. This is not to say that every data structure
will necessarily be efficient at answering every type of query that
could be answered using its interface; but in a database system, being
able to repurpose an existing index to accelerate a wide range of query
types would certainly seem worth considering.
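To make this separation concrete, a range-count query class in the
style of this interface might be sketched as follows, usable against
any shard exposing the bound functions (such as the sorted-array sketch
given earlier). The structure is an illustrative assumption, not the
framework's shipped query class.
\begin{verbatim}
#include <cstddef>
#include <vector>

// A sketch of a range-count query class in the style of the interface
// above (names and structure are illustrative assumptions). It works
// against any shard exposing lower_bound/upper_bound.
template <typename Shard, typename Record>
struct RangeCountQuery {
    struct Parameters { Record low; Record high; };
    struct LocalQuery { Record low; Record high; };
    struct Result { size_t records = 0; size_t tombstones = 0; };

    // no pre-processing is needed: local queries mirror the parameters
    static LocalQuery distribute(const Parameters &p) {
        return {p.low, p.high};
    }

    static Result query(const LocalQuery &q, const Shard &shard) {
        // count via the bound functions; a full implementation would
        // also count the tombstones falling inside the range
        Result r;
        r.records = shard.upper_bound(q.high) - shard.lower_bound(q.low);
        return r;
    }

    static size_t merge(const std::vector<Result> &partials) {
        size_t records = 0, tombstones = 0;
        for (const auto &r : partials) {
            records += r.records;
            tombstones += r.tombstones;
        }
        return records - tombstones;  // tombstones cancel their records
    }

    static constexpr bool EARLY_ABORT = false;
    static constexpr bool SKIP_DELETE_FILTER = true;
};
\end{verbatim}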
\section{Framework Evaluation}

The framework was evaluated using three different types of search
problem: range-count, high-dimensional k-nearest neighbor, and
independent range sampling. In all three cases, an extended static data
structure was compared with dynamic alternatives for the same search
problem, to demonstrate the framework's competitiveness.

\subsection{Methodology}

All tests were performed using Ubuntu 22.04 LTS on a dual-socket Intel
Xeon Gold 6242R server with 384 GiB of installed memory and 40 physical
cores. Benchmark code was compiled using \texttt{gcc} version 11.3.0 at
the \texttt{-O3} optimization level.

\subsection{Range Queries}

The first test evaluates the performance of the framework in the
context of range queries against learned indexes. In
Chapter~\ref{chap:intro}, the lengthy development cycle of this sort of
data structure was discussed, and so learned indexes were selected as
an evaluation candidate to demonstrate how this framework could allow
such lengthy development lifecycles to be largely bypassed.

Specifically, the framework is used to produce dynamic learned indexes
based on TrieSpline~\cite{plex} (DE-TS) and the static version of
PGM~\cite{pgm} (DE-PGM). These are both single-pass construction static
learned indexes, and thus well suited for use within this framework
compared to more complex structures like RMI~\cite{RMI}, which have
more expensive construction algorithms. The two framework-extended data
structures are compared with dynamic learned indexes, namely
ALEX~\cite{ALEX} and the dynamic version of PGM~\cite{pgm}. PGM
provides an interesting comparison, as its native dynamic version was
implemented using a slightly modified version of the Bentley-Saxe
method.

When performing range queries over large data sets, the copying of
query results can introduce significant overhead. Because the four
tested structures have different data copy behaviors, a range count
query was used for testing, rather than a pure range query. This search
problem exposes the searching performance of the data structures while
controlling for different data copy behaviors, and so should provide
more directly comparable results.

Range count queries were executed with a selectivity of $0.01\%$
against three datasets from the SOSD benchmark~\cite{sosd-datasets}:
\texttt{book}, \texttt{fb}, and \texttt{osm}. These each contain 200
million 64-bit keys following a variety of distributions, which were
paired with uniquely generated 64-bit values. There is a fourth dataset
in SOSD, \texttt{wiki}, which was excluded from testing because it
contains duplicate keys, which are not supported by dynamic
PGM.\footnote{The dynamic version of PGM supports deletes using
tombstones, but doesn't wrap records with a header to accomplish this.
Instead, it reserves one possible value to represent a tombstone.
Records are deleted by inserting a record having the same key and this
reserved value. This means that duplicate keys, even if they have
different values, are unsupported, as two records with the same key
will be treated as a delete by the index.~\cite{pgm}}

The shard implementations for DE-PGM and DE-TS required about 300 lines
of C++ code each, and no modification to the data structures
themselves. For both data structures, the framework was configured with
a buffer of 12,000 records, a scale factor of 8, the tombstone delete
policy, and tiering. Each shard stored $\mathcal{D}_i$ as a sorted
array of records, used an instance of the learned index for
$\mathcal{I}_i$, and had no auxiliary structures. The local query
routine used the learned index to locate the first key in the query
range, and then iterated over the sorted array until the end of the
range was reached, counting the number of records and tombstones
encountered. The mutable buffer query performed the same counting using
a full scan of the buffer. No local pre-processing was needed, and the
merge operation simply summed the record and tombstone counts and
returned their difference.
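Assuming shard and query classes along the lines of the sketches given
earlier in this chapter, instantiating such an extended index reduces
to supplying template parameters. A hypothetical usage, reusing those
sketched names (not the framework's actual API), might look like the
following.
\begin{verbatim}
#include <cstdint>
#include <utility>

// Hypothetical usage (assumed names, reusing the earlier sketches):
// a shard shim plus configuration parameters yields a dynamic index.
using Record = std::pair<uint64_t, uint64_t>;  // 64-bit key and value

using Index = DynamicExtension<SortedArrayShard<Record>, Record,
                               12000,  // mutable buffer capacity
                               8>;     // scale factor

int main() {
    Index index;
    for (uint64_t i = 0; i < 100000; i++) {
        index.insert({i, i});  // flushes and reconstructions happen
                               // internally as the buffer fills
    }
}
\end{verbatim}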
\begin{figure*}[t]
    \centering
    \subfloat[Update Throughput]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-insert} \label{fig:rq-insert}}
    \subfloat[Query Latency]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-query} \label{fig:rq-query}} \\
    \subfloat[Index Sizes]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0 ]{img/fig-bs-rq-space} \label{fig:idx-space}}
    \caption{Range Count Evaluation}
    \label{fig:results1}
\end{figure*}

Figure~\ref{fig:rq-insert} shows the update throughput of all
competitors. ALEX performs the worst in all cases, and PGM performs the
best, with the extended indexes falling in the middle. It is not
unexpected that PGM performs better than the framework, because the
Bentley-Saxe extension in PGM is custom-built, and thus has a tighter
integration than a general framework would allow. However, even with
this advantage, DE-PGM still reaches up to 85\% of PGM's insertion
throughput. Additionally, Figure~\ref{fig:rq-query} shows that PGM pays
a large cost in query latency for its advantage in insertion, with the
framework-extended indexes significantly outperforming it. Further,
DE-TS even outperforms ALEX in query latency in some cases. Finally,
Figure~\ref{fig:idx-space} shows the storage cost of the indexes,
without counting the space necessary to store the records themselves.
The storage cost of a learned index is fairly variable, as it is
largely a function of the distribution of the data, but in all cases
the extended learned indexes, which build compact data arrays without
gaps, occupy three orders of magnitude less storage than ALEX, which
requires leaving gaps in its data arrays.

\subsection{High-Dimensional k-Nearest Neighbor}
The next test evaluates the framework for the extension of
high-dimensional metric indexes for the k-nearest neighbor search
problem. An M-tree~\cite{mtree} was used as the dynamic
baseline,\footnote{
    Specifically, the M-tree implementation tested can be found at
    \url{https://github.com/dbrumbaugh/M-Tree} and is a fork of a
    structure written originally by Eduardo D'Avila, modified to
    compile under C++20. The tree uses a random selection algorithm
    for ball splitting.
} and a VPTree~\cite{vptree} as the static structure. The framework was
used to extend the VPTree to produce the dynamic version, DE-VPTree. An
M-tree is a tree that partitions records based on high-dimensional
spheres, and supports updates by splitting and merging these
partitions. A VPTree is a binary tree that is produced by recursively
selecting a point, called the vantage point, and partitioning records
based on their distance from that point. This results in a
difficult-to-modify structure that can be constructed in $O(n \log n)$
time and can answer KNN queries in $O(k \log n)$ time.

DE-VPTree used a buffer of 12,000 records, a scale factor of 6,
tiering, and delete tagging. The query was implemented without a
pre-processing step, using the standard VPTree algorithm for KNN
queries against each shard.
All $k$ records were determined for each shard, and then the merge
operation used a heap to merge the result sets together and return the
$k$ nearest neighbors from the $k \log n$ intermediate results. This is
a type of query that pays a non-constant merge cost of $O(k \log k)$,
even with the framework's expanded query interface. In effect, the KNN
query must be answered twice: once for each shard to get the
intermediate result sets, and then a second time within the merge
operation to select the KNN from those result sets.

\begin{figure}
    \centering
    \includegraphics[width=.75\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn}
    \caption{KNN Index Evaluation}
    \label{fig:knn}
\end{figure}
Euclidean distance was used as the metric for both structures, and
$k=1000$ was used for all queries. The reference point for each query
was selected randomly from points within the dataset. Tests were run
using the Spanish Billion Words dataset~\cite{sbw}, consisting of
300-dimensional vectors. The results are shown in Figure~\ref{fig:knn}.
In this case, the static nature of the VPTree allows it to dominate the
M-tree in query latency, and the simpler reconstruction procedure
provides a significant insertion performance improvement as well.

\subsection{Independent Range Sampling}
Finally, the framework was tested using one-dimensional IRS queries. As
before, a static ISAM-tree was used as the data structure to be
extended; however, the sampling query was implemented using the query
interface from Section~\ref{ssec:fw-query-int}. The pre-processing step
identifies the first and last record falling into the range to be
sampled from, and determines the total weight of this range, for each
shard. Then, in the local query generation step, these weights are used
to construct an alias structure, which is used to assign sample sizes
to each shard based on weight, to avoid introducing skew into the
results. After this, the query routine generates random numbers between
the established bounds to sample records, and the merge operation
appends the individual result sets together. This procedure requires
only a pair of tree traversals per shard, regardless of how many
samples are taken.

\begin{figure}
    \centering
    \subfloat[Query Latency]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-query} \label{fig:irs-query}}
    \subfloat[Update Throughput]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-insert} \label{fig:irs-insert}}
    \caption{IRS Index Evaluation}
    \label{fig:results2}
\end{figure}

The extended ISAM structure (DE-IRS) was compared to a B$^+$-tree with
aggregate weight tags on internal nodes (AGG B$^+$-tree) for sampling
and insertion performance, and to a single instance of the static
ISAM-tree (ISAM), which does not support updates. DE-IRS was configured
with a buffer size of 12,000 records, a scale factor of 6, tiering, and
delete tagging. The IRS queries had a selectivity of $0.1\%$ with a
sample size of $k=1000$. Testing was performed using the same datasets
as were used for range queries.

Figure~\ref{fig:irs-query} shows the significant latency advantage that
the dynamically extended ISAM-tree enjoys compared to a B$^+$-tree.
DE-IRS is up to 23 times faster than the B$^+$-tree at answering
sampling queries, and only about 3 times slower than the fully static
solution.
In this case, the extra query cost caused by needing to query multiple
structures is more than balanced by the query efficiency of each of
those structures, relative to tree sampling. Interestingly, the
framework also results in better update performance compared to the
B$^+$-tree, as shown in Figure~\ref{fig:irs-insert}. This is likely
because the ISAM shards can be efficiently constructed using a
combination of sorted-merge operations and bulk-loading, avoiding the
expensive structural modification operations that are necessary for
maintaining a B$^+$-tree.

\subsection{Discussion}

The results demonstrate not only that the framework's update support is
competitive with custom-built dynamic data structures, but that the
framework is even able to, in many cases, retain some of the query
performance advantage of the extended static data structure. This is
particularly evident in the k-nearest neighbor and independent range
sampling tests, where the static version of the structure was directly
tested as well. These tests demonstrate one of the advantages of static
data structures: they are able to maintain much tighter inter-record
relationships than dynamic ones, because update support typically
requires relaxing these relationships. While the framework introduces
the overhead of querying multiple structures and merging their results,
it is clear from the results that this overhead is generally less than
the overhead incurred by the update support techniques used in the
dynamic structures. The only case where the framework did not win in
query performance was in competition with ALEX, where the resulting
query latencies were comparable.

It is also evident that the update support provided by the framework is
on par with, if not superior to, that provided by the dynamic
baselines, at least in terms of throughput. The framework will
certainly suffer from larger tail latency spikes, which weren't
measured in this round of testing, due to the larger scale of its
reconstructions, but the amortization of these costs over a large
number of inserts allows it to maintain a respectable level of
throughput. In fact, the only case where the framework loses in
insertion throughput is against the dynamic PGM. However, an
examination of the query latency reveals that the standard
configuration of the Bentley-Saxe variant used by PGM is highly tuned
for insertion performance: the query latencies against this data
structure are far worse than those of any other learned index tested.
Even this result, then, shouldn't be taken as a ``clear'' defeat of the
framework's implementation.

Overall, it is clear from this evaluation that the dynamic extension
framework is a promising alternative to manual index redesign for
accommodating updates. In almost all cases, the framework-extended
static data structures provided superior insertion throughput, and
query latencies that either matched or exceeded those of the dynamic
baselines. Additionally, though it is hard to quantify, the code
complexity of the framework-extended data structures was much lower,
with the shard implementations requiring only a small amount of
relatively straightforward code to interface with pre-existing static
data structures, or with the necessary data structure implementations
themselves being simpler.
\section{Conclusion}

In this chapter, a generalized version of the framework originally
proposed in Chapter~\ref{chap:sampling} was presented. This framework
is based on two key properties: extended decomposability and record
identity. It is capable of extending any data structure and search
problem supporting these two properties with support for inserts and
deletes. An evaluation of this framework was performed by extending
several static data structures, and comparing the resulting structures'
performance against dynamic baselines capable of answering the same
type of search problem. The extended structures generally performed as
well as, if not better than, their dynamic baselines in query
performance, insert performance, or both. This demonstrates the
capability of this framework to produce viable indexes in a variety of
contexts. However, the framework is not yet complete. In the next
chapter, the work required to bring this framework to completion will
be described.