\chapter{Generalizing the Framework}
\begin{center}
\emph{The following chapter is an adaptation of work completed in collaboration with Dr. Dong Xie and Dr. Zhuoyue Zhao and published in PVLDB Volume 17, Issue 11 (July 2024) under the title ``Towards Systematic Index Dynamization''. }
\hrule
\end{center}
\label{chap:framework}
The previous chapter demonstrated the possible utility of designing indexes based upon the dynamic extension of static data structures. However, the presented strategy falls short of a general framework, as it is specific to sampling problems. In this chapter, the techniques of that work will be discussed in more general terms, to arrive at a more broadly applicable solution. A general framework is proposed, which places only two requirements on supported data structures,
\begin{itemize}
\item Extended Decomposability
\item Record Identity
\end{itemize}
In this chapter, these two properties are first defined. Then, a general dynamic extension framework is described which can be applied to any data structure supporting these properties. Finally, an experimental evaluation is presented that demonstrates the viability of this framework.
\section{Extended Decomposability}
Chapter~\ref{chap:sampling} demonstrated how non-DSPs can be efficiently addressed using Bentley-Saxe, so long as the query interface is modified to accommodate their needs. For independent sampling problems, this involved a two-pass approach, in which some pre-processing work was performed against each shard and used to construct a shard alias structure. This structure was then used to determine how many samples to draw from each shard. To generalize this approach, a new class of decomposability is proposed, called \emph{extended decomposability}. At present, it is defined in terms of a query interface, rather than as a formal mathematical property. In extended decomposability, rather than treating the search problem as a monolith, the query algorithm is decomposed into multiple components. This allows for communication between shards as part of the query process. Additionally, rather than using a binary merge operator, extended decomposability uses a variadic function that merges all of the result sets in one pass, reducing the cost due to merging by a logarithmic factor without introducing any new restrictions. The basic interface that must be supported by an extended-decomposable search problem (eDSP) is,
\begin{itemize}
\item $\mathbftt{local\_preproc}(\mathcal{I}_i, \mathcal{Q}) \to \mathscr{S}_i$ \\ Pre-processes each partition $\mathcal{D}_i$ using index $\mathcal{I}_i$ to produce preliminary information about the query result on this partition, encoded as an object $\mathscr{S}_i$.
\item $\mathbftt{distribute\_query}(\mathscr{S}_1, \ldots, \mathscr{S}_m, \mathcal{Q}) \to \mathcal{Q}_1, \ldots, \mathcal{Q}_m$\\ Processes the list of preliminary information objects $\mathscr{S}_i$ and emits a list of local queries $\mathcal{Q}_i$ to run independently on each partition.
\item $\mathbftt{local\_query}(\mathcal{I}_i, \mathcal{Q}_i) \to \mathcal{R}_i$ \\ Executes the local query $\mathcal{Q}_i$ over partition $\mathcal{D}_i$ using index $\mathcal{I}_i$ and returns a partial result $\mathcal{R}_i$.
\item $\mathbftt{merge}(\mathcal{R}_1, \ldots, \mathcal{R}_m) \to \mathcal{R}$ \\ Merges the partial results to produce the final answer.
\end{itemize}
The pseudocode for the query algorithm using this interface is,
\begin{algorithm}
\DontPrintSemicolon
\SetKwProg{Proc}{procedure}{ BEGIN}{END}
\SetKwProg{For}{for}{ DO}{DONE}
\Proc{\mathbftt{QUERY}($D[]$, $\mathscr{Q}$)} {
\For{$i \in [0, |D|)$} {
$S[i] := \mathbftt{local\_preproc}(D[i], \mathscr{Q})$
}
\;
$ Q := \mathbftt{distribute\_query}(S, \mathscr{Q}) $ \;
\;
\For{$i \in [0, |D|)$} {
$R[i] := \mathbftt{local\_query}(D[i], Q[i])$
}
\;
$OUT := \mathbftt{merge}(R)$ \;
\Return {$OUT$} \;
}
\end{algorithm}
In this system, each partition can report preliminary information through \mathbftt{local\_preproc}, which can be used by \mathbftt{distribute\_query} to adjust the per-partition query parameters, allowing for direct communication of state between partitions. Queries which do not need this functionality can simply return empty $\mathscr{S}_i$ objects from \mathbftt{local\_preproc}.
\subsection{Query Complexity}
Before describing how to use this new interface and definition to support more efficient queries than standard decomposability, a more general expression for the cost of querying such a structure should be derived. Recall that Bentley-Saxe, when applied to a $C(n)$-decomposable problem, has the following query cost,
\begin{equation}
\label{eq3:Bentley-Saxe}
O\left(\log n \cdot \left( Q_s(n) + C(n)\right)\right)
\end{equation}
where $Q_s(n)$ is the cost of the query against one partition, and $C(n)$ is the cost of the merge operator. Let $Q_s(n)$ represent the cost of \mathbftt{local\_query} and $C(n)$ the cost of \mathbftt{merge} in the extended decomposability case. Additionally, let $P(n)$ be the cost of $\mathbftt{local\_preproc}$ and $\mathcal{D}(n)$ be the cost of \mathbftt{distribute\_query}. Finally, recall that $|D| = \log n$ for the Bentley-Saxe method. In this case, the cost of a query is
\begin{equation}
O \left( \log n \cdot P(n) + \mathcal{D}(n) + \log n \cdot Q_s(n) + C(n) \right)
\end{equation}
Superficially, this looks to be strictly worse than the Bentley-Saxe case in Equation~\ref{eq3:Bentley-Saxe}. However, the important thing to understand is that for $C(n)$-decomposable queries, $P(n) \in O(1)$ and $\mathcal{D}(n) \in O(1)$, as these steps are unneeded. Thus, for normal decomposable queries, the cost reduces to,
\begin{equation}
O \left( \log n \cdot Q_s(n) + C(n) \right)
\end{equation}
which is actually \emph{better} than Bentley-Saxe. Meanwhile, the ability to share state between partitions can facilitate better solutions than would otherwise be possible. In light of this new approach, consider the two examples of non-decomposable search problems from Section~\ref{ssec:decomp-limits}.
\subsection{k-Nearest Neighbor}
\label{ssec:knn}
The KNN problem is $C(n)$-decomposable, and Section~\ref{sssec-decomp-limits-knn} arrived at a Bentley-Saxe solution to this problem built on the VPTree, with a query cost of
\begin{equation}
O \left( k \log^2 n + k \log n \log k \right)
\end{equation}
by running KNN on each partition, and then merging the result sets with a heap. Applying the interface of extended decomposability to this problem allows for some optimizations. Pre-processing is not necessary here, but the variadic merge function can be leveraged to get an asymptotically better solution.
Simply dropping the existing algorithm into this interface will result in a merge algorithm with cost,
\begin{equation}
C(n) \in O \left( k \log n \left( \log k + \log\log n\right)\right)
\end{equation}
which results in a total query cost that is slightly \emph{worse} than the original,
\begin{equation}
O \left( k \log^2 n + k \log n \left(\log k + \log\log n\right) \right)
\end{equation}
The problem is that the number of records considered in a given merge has grown from $O(k)$ in the binary merge case to $O(\log n \cdot k)$ in the variadic merge. However, because the merge function now has access to all of the data at once, the algorithm can be modified slightly for better efficiency by only pushing $\log n$ elements into the heap at a time. This trick only works if the $R_i$s are in sorted order relative to $f(x, q)$; however, this condition is satisfied by the result sets returned by KNN against a VPTree. Thus, for each $R_i$, the first element in sorted order can be inserted into the heap, tagged with a reference to the $R_i$ it was taken from. Then, when the heap is popped, the next element from the associated $R_i$ can be inserted. This allows the heap's size to be maintained at no larger than $O(\log n)$, and limits the algorithm to no more than $k$ pop operations and $\log n + k - 1$ pushes. This algorithm reduces the cost of KNN on this structure to,
\begin{equation}
O(k \log^2 n + \log n)
\end{equation}
which is strictly better than the original.
\subsection{Independent Range Sampling}
The eDSP abstraction also provides sufficient features to implement IRS, using the same basic approach as was used in the previous chapter. Unlike KNN, IRS will take advantage of the extended query interface. Recall from Chapter~\ref{chap:sampling} that the approach used for answering sampling queries (ignoring the buffer, for now) was,
\begin{enumerate}
\item Query each shard to establish the weight that should be assigned to it when distributing the sample size.
\item Build an alias structure over those weights.
\item For each sample, reference the alias structure to determine which shard to sample from, and then draw the sample.
\end{enumerate}
This approach can be mapped easily onto the eDSP interface as follows,
\begin{itemize}
\item[\texttt{local\_preproc}] Determine and return the total weight of candidate records for sampling in the shard.
\item[\texttt{distribute\_query}] Using the shard weights, construct an alias structure associating each shard with its total weight. Then, query this alias structure $k$ times. For shard $i$, the local query $\mathcal{Q}_i$ will have its sample size assigned based on how many times $i$ is returned during the alias querying.
\item[\texttt{local\_query}] Process the local query using the underlying data structure's normal sampling procedure.
\item[\texttt{merge}] Union all of the partial results together.
\end{itemize}
This division of the query maps closely onto the cost function,
\begin{equation}
O\left(W(n) + P(n) + kS(n)\right)
\end{equation}
used in Chapter~\ref{chap:sampling}, where the $W(n) + P(n)$ pre-processing cost is associated with the cost of \texttt{local\_preproc} and the $kS(n)$ sampling cost is associated with $\texttt{local\_query}$. The \texttt{distribute\_query} operation will require $O(\log n)$ time to construct the shard alias structure, and $O(k)$ time to query it.
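As an illustration of this step, the following sketch shows how \texttt{distribute\_query} might assign per-shard sample sizes from the pre-processed weights. The \texttt{ShardState} and \texttt{LocalQuery} types are hypothetical stand-ins used only for this example, and \texttt{std::discrete\_distribution} is used in place of a hand-built alias structure; the logic is otherwise the same: $k$ weighted draws over the shards, each adding one sample to the chosen shard's local query.
\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

/* Hypothetical preliminary-information and local-query types for IRS. */
struct ShardState { double weight; };       /* total weight of in-range records */
struct LocalQuery { size_t sample_size; };  /* samples assigned to this shard   */

std::vector<LocalQuery> distribute_query(const std::vector<ShardState> &states,
                                         size_t k, std::mt19937 &rng) {
    /* std::discrete_distribution plays the role of the shard alias structure:
       it draws a shard index with probability proportional to its weight. */
    std::vector<double> weights;
    for (const auto &s : states) weights.push_back(s.weight);
    std::discrete_distribution<size_t> shard_dist(weights.begin(), weights.end());

    /* Draw k times and tally how many samples each shard must produce. */
    std::vector<LocalQuery> local_queries(states.size(), LocalQuery{0});
    for (size_t i = 0; i < k; i++) {
        local_queries[shard_dist(rng)].sample_size++;
    }
    return local_queries;
}
\end{verbatim}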
Accounting then for the fact that \texttt{local\_preproc} will be called once per shard ($\log n$ times), and a total of $k$ records will be sampled at a cost of $S(n)$ each, this results in a total query cost of,
\begin{equation}
O\left(\left[W(n) + P(n)\right]\log n + k S(n)\right)
\end{equation}
which matches the cost in Equation~\ref{eq:sample-cost}.
\section{Record Identity}
Another important consideration for the framework is support for deletes, which are a common requirement in the context of database systems. The sampling extension framework supported two techniques for the deletion of records: tombstone-based deletes and tagging-based deletes. In both cases, the solution required that the shard support point lookups, either for checking tombstones or for finding the record to mark it as deleted. Implicit in this is an important property of the underlying data structure which was taken for granted in that work, but which will be made explicit here: record identity. Delete support requires that each record within the index be uniquely identifiable, and linkable directly to a location in storage. This property is called \emph{record identity}. In the context of database indexes, it isn't a particularly contentious requirement. Indexes are already designed to provide a mapping directly to a record in storage, which (at least in the context of an RDBMS) must have a unique identifier attached. However, in more general contexts, this requirement will place some restrictions on the applicability of the framework. For example, approximate data structures or summaries, such as Bloom filters~\cite{bloom70} or count-min sketches~\cite{countmin-sketch}, do not necessarily store the underlying records at all. In principle, some summaries \emph{could} be supported by normal Bentley-Saxe, as there exist mergeable summaries~\cite{mergeable-summaries}. But because these data structures violate the record identity property, they would not support deletes (either in this framework or in Bentley-Saxe). The framework treats deletes as first-class operations, and this is formalized by requiring record identity of all supported data structures.
\section{The General Framework}
Based on these properties, and the work described in Chapter~\ref{chap:sampling}, a dynamic extension framework has been devised with broad support for data structures. It is implemented in C++20, using templates and concepts to define the necessary interfaces. A user of this framework needs to provide a definition for their data structure with a prescribed interface (called a \texttt{shard}), and a definition for their query following an interface based on the above definition of an eDSP. These two classes can then be used as template parameters to automatically create a dynamic index, which exposes methods for inserting and deleting records, as well as executing queries.
\subsection{Framework Design}
\Paragraph{Structure.} The overall design of the general framework itself is not substantially different from the sampling framework discussed in Chapter~\ref{chap:sampling}. It consists of a mutable buffer and a set of levels containing data structures with geometrically increasing capacities. The \emph{mutable buffer} is a small unsorted record array of fixed capacity that buffers incoming inserts. As the mutable buffer is kept sufficiently small (e.g.
fits in the L2 CPU cache), the cost of querying it without any auxiliary structures can be minimized, while still allowing better insertion performance than Bentley-Saxe, which requires rebuilding an index structure on every insertion. The use of an unsorted buffer is necessary to ensure that the framework doesn't require an existing dynamic version of the index structure being extended, which would defeat the purpose of the entire exercise. The majority of the data within the structure is stored in a sequence of \emph{levels} with geometrically increasing record capacity, such that the capacity of level $i$ is $s^{i+1}$, where $s$ is a configurable parameter called the \emph{scale factor}. Unlike Bentley-Saxe, these levels are permitted to be partially full, which allows significantly more flexibility in terms of how reconstruction is performed. This also opens up the possibility of allowing each level to allocate its record capacity across multiple data structures (named \emph{shards}) rather than just one. This decision is called the \emph{layout policy}, with the use of a single structure being called \emph{leveling}, and multiple structures being called \emph{tiering}.
\begin{figure}
\centering
\subfloat[Leveling]{\includegraphics[width=.5\textwidth]{img/leveling} \label{fig:leveling}}
\subfloat[Tiering]{\includegraphics[width=.5\textwidth]{img/tiering} \label{fig:tiering}}
\caption{\textbf{An overview of the general structure of the dynamic extension framework} using leveling (Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering}) layout policies. The pictured extension has a scale factor of 3, with $L_0$ being at capacity, and $L_1$ being at one third capacity. Each shard is shown as a dotted box, wrapping its associated dataset ($D_i$), data structure ($I_i$), and auxiliary structures $(A_i)$. }
\label{fig:framework}
\end{figure}
\Paragraph{Shards.} The basic building block of the dynamic extension is called a shard, defined as $\mathcal{S}_i = (\mathcal{D}_i, \mathcal{I}_i, A_i)$, which consists of a partition of the data $\mathcal{D}_i$, an instance of the static index structure being extended $\mathcal{I}_i$, and an optional auxiliary structure $A_i$. To ensure the viability of level reconstruction, the extended data structure should at least support a construction method $\mathtt{build}(\mathcal{D})$ that can build a new static index from a set of records $\mathcal{D}$ from scratch. This set of records may come from the mutable buffer, or from a union of the underlying data of multiple other shards. It is also beneficial for $\mathcal{I}_i$ to support efficient point lookups, which can search for a record's storage location by its identifier (given by the record identity requirement of the framework). The shard can also be customized to provide any necessary features for supporting the index being extended. For example, auxiliary data structures like Bloom filters or hash tables can be added to improve point-lookup performance, or additional, specialized search functions can be provided for use by query implementations. From an implementation standpoint, the shard object provides a shim between the data structure and the framework itself. At minimum, it must support the following interface,
\begin{itemize}
\item $\mathbftt{construct}(B) \to S$ \\ Construct a new shard from the contents of the mutable buffer, $B$.
\item $\mathbftt{construct}(S_0, \ldots, S_n) \to S$ \\ Construct a new shard from the records contained within a list of already existing shards.
\item $\mathbftt{point\_lookup}(r) \to *r$ \\ Search for a record, $r$, by identity and return a reference to its location in storage.
\end{itemize}
\Paragraph{Insertion \& deletion.} The framework supports inserting new records and deleting records already in the index. These two operations also allow for updates to existing records, by first deleting the old version and then inserting a new one. These operations are added automatically by the framework, and require at most a small shim within the shard implementation, or minor adjustments to the code of the data structure being extended. Insertions are performed by first wrapping the record to be inserted with a framework header, and then appending it to the end of the mutable buffer. If the mutable buffer is full, it is flushed to create a new shard, which is combined into the first level of the structure. The level reconstruction process is layout-policy dependent. In the case of leveling, the underlying data of the source shard and the target shard are combined, resulting in a new shard that replaces the target shard in the target level. When using tiering, the newly created shard is simply placed into the target level. If the target level is full, the framework first triggers a merge on the target level, which combines its shards into a new shard on the next level, and then places the incoming shard into the now-empty target level. Note that each time a new shard is created, the framework must invoke $\mathtt{build}$ to construct a new index from scratch for this shard. The framework supports deletes using two approaches: either by inserting a special tombstone record, or by performing a lookup for the record to be deleted and setting a bit in its header. This decision is called the \emph{delete policy}, with the former being called \emph{tombstone delete} and the latter \emph{tagged delete}. The framework will automatically filter deleted records from query results before returning them to the user, either by checking for the delete tag, or by performing a lookup of each record for an associated tombstone. The number of deleted records within the framework can be bounded by canceling tombstones and their associated records when they meet during reconstruction, or by dropping all tagged records when a shard is reconstructed. The framework also supports aggressive reconstruction (called \emph{compaction}) to precisely bound the number of deleted records within the index. This is useful for certain search problems, as was seen with sampling queries in Chapter~\ref{chap:sampling}, but is not generally necessary to bound query cost.
\Paragraph{Design space.} The framework described in this section has a large design space. In fact, much of this design space has knobs similar to those of the well-known LSM-tree~\cite{dayan17}, albeit in a different environment: the framework targets in-memory static index structures for general extended-decomposable queries without efficient index merging support, whereas the LSM-tree targets external range indexes that can be efficiently merged. The framework's design trades off among auxiliary memory usage, read performance, and write performance. The two most significant decisions are the choice of layout policy and delete policy.
A tiering layout policy reduces write amplification compared to leveling, requiring each record to be written only once per level, but increases the number of shards within the structure, which can hurt query performance. As for delete policy, the use of tombstones turns deletes into insertions, which are typically faster. However, depending upon the nature of the query being executed, the delocalization of the presence information for a record may result in one extra point lookup for each record in the result set of a query, significantly reducing read performance. In these cases, tagging may make more sense: each delete becomes a slower point lookup, but visibility checks of records are always constant-time. The other two major parameters, scale factor and buffer size, can be used to tune the performance once the policies have been selected. Generally speaking, larger scale factors result in fewer shards, but can increase write amplification under leveling. Large buffer sizes can adversely affect query performance when an unsorted buffer is used, while allowing higher update throughput. Because the overall design of the framework remains largely unchanged, the design space exploration of Section~\ref{ssec:ds-exp} remains relevant here.
\subsection{The Shard Interface}
The shard object serves as a ``shim'' between a data structure and the extension framework, providing a set of mandatory functions which are used by the framework code to facilitate reconstruction and the deletion of records. The data structure being extended can be provided by a different library and included as an attribute via composition/aggregation, or can be implemented directly within the shard class. Additionally, shards can contain any auxiliary structures, such as Bloom filters or hash tables, necessary to support the required interface. The required interface for a shard object is as follows,
\begin{verbatim}
new(MutableBuffer) -> Shard
new(Shard[]) -> Shard
point_lookup(Record, Boolean) -> Record
get_data() -> Record
get_record_count() -> Int
get_tombstone_count() -> Int
get_memory_usage() -> Int
get_aux_memory_usage() -> Int
\end{verbatim}
The first two functions are constructors, necessary to build a new Shard from either an array of other shards (for a reconstruction), or from a mutable buffer (for a buffer flush).\footnote{ This is the interface as it currently stands in the existing implementation, but is subject to change. In particular, we are considering changing the shard reconstruction procedure to allow for only one necessary constructor, with a more general interface. As we look to concurrency, being able to construct shards from arbitrary combinations of shards and buffers will become convenient, for example. } The \texttt{point\_lookup} operation is necessary for delete support, and is used either to locate a record to be tagged as deleted when tagging is used, or to search for a tombstone associated with a record when tombstones are used. The boolean argument communicates to the shard whether the lookup is intended to locate a tombstone or a record. It is primarily meant to allow the shard to control whether a point lookup checks a filter before searching, but could also be used for other purposes.
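For example, a shard that maintains a Bloom filter over its tombstones might implement \texttt{point\_lookup} along the following lines. This is only an illustrative sketch: the \texttt{Record} type, the member names, and the helper structures are all hypothetical and are not part of the framework's actual interface.
\begin{verbatim}
#include <cstdint>

/* Hypothetical building blocks, for illustration only. */
struct Record      { uint64_t key; uint64_t value; };
struct BloomFilter { bool lookup(uint64_t key) const; };          /* tombstone filter  */
struct StaticIndex { const Record *find(uint64_t key) const; };   /* wrapped structure */

struct ExampleShard {
    BloomFilter m_tombstone_bf;
    StaticIndex m_index;

    /* The boolean indicates whether the caller is searching for a tombstone.
       Since the Bloom filter here tracks tombstones only, the filter check is
       skipped when looking up a plain record. */
    const Record *point_lookup(const Record &rec, bool is_tombstone_lookup) const {
        if (is_tombstone_lookup && !m_tombstone_bf.lookup(rec.key)) {
            return nullptr;  /* no matching tombstone can exist in this shard */
        }
        return m_index.find(rec.key);  /* exact search in the wrapped structure */
    }
};
\end{verbatim}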
The \texttt{get\_data} function exposes a pointer to the beginning of the array of records contained within the shard. It imposes no restriction on the order of these records, but does require that all records can be accessed sequentially from this pointer, and that the order of records does not change. The rest of the functions are accessors for various shard metadata. The record and tombstone counts are used by the framework for reconstruction purposes.\footnote{The record count includes tombstones as well, so the true record count on a level is $\text{reccnt} - \text{tscnt}$.} The memory usage statistics are, at present, only exposed directly to the user and have no effect on the framework's behavior. In the future, these may be used for concurrency control and task scheduling purposes. Beyond these, a shard can expose any additional functions that are necessary for its associated query classes. For example, a shard intended to be used for range queries might expose upper and lower bound functions, or a shard used for nearest neighbor search might expose a nearest-neighbor function.
\subsection{The Query Interface}
\label{ssec:fw-query-int}
The required interface for a query in the framework is a bit more complicated than the interface defined for an eDSP, because the framework needs to query the mutable buffer as well as the shards. As a result, there is some slight duplication of functions, with specialized query and pre-processing routines for both shards and buffers. Specifically, a query must define the following functions,
\begin{verbatim}
get_query_state(QueryParameters, Shard) -> ShardState
get_buffer_query_state(QueryParameters, Buffer) -> BufferState
process_query_states(QueryParameters, ShardStateList, BufferStateList) -> LocalQueryList
query(LocalQuery, Shard) -> ResultList
buffer_query(LocalQuery, Buffer) -> ResultList
merge(ResultList) -> FinalResult
delete_query_state(ShardState)
delete_buffer_query_state(BufferState)

bool EARLY_ABORT;
bool SKIP_DELETE_FILTER;
\end{verbatim}
The \texttt{get\_query\_state} and \texttt{get\_buffer\_query\_state} functions map to the \texttt{local\_preproc} operation of the eDSP definition, for shards and buffers respectively. \texttt{process\_query\_states} serves the function of \texttt{distribute\_query}. Note that this function takes a list of buffer states; although the proposed framework above contains only a single buffer, future support for concurrency will require multiple buffers, and so the interface is set up with support for this. The \texttt{query} and \texttt{buffer\_query} functions execute the local query against the shard or buffer and return the intermediate results, which are merged using \texttt{merge} into a final result set. The \texttt{EARLY\_ABORT} flag can be set to \texttt{true} to force the framework to return as soon as the first result is found, rather than querying the entire structure, and the \texttt{SKIP\_DELETE\_FILTER} flag disables the framework's automatic delete filtering, allowing deletes to be handled manually within the \texttt{merge} function by the developer. These flags exist to allow for optimizations for certain types of query. For example, point lookups can take advantage of \texttt{EARLY\_ABORT} to stop as soon as a match is found, and \texttt{SKIP\_DELETE\_FILTER} can be used for more efficient tombstone delete handling in range queries, where tombstones for any results will always appear in the \texttt{ResultList}s passed to \texttt{merge}.
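As an illustration of this interface, the following is a minimal sketch of a range-count query class. The shard and buffer member functions it relies upon (\texttt{lower\_bound}, \texttt{upper\_bound}, \texttt{get\_data}, and \texttt{key}/\texttt{is\_tombstone} accessors on the wrapped records) are assumed for the purposes of the example, and the signatures are simplified relative to the actual implementation.
\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <vector>

struct RangeCountParameters { uint64_t low; uint64_t high; };
struct PartialCount { size_t records; size_t tombstones; };

struct RangeCountQuery {
    /* Range counts are decomposable, so no pre-processing state is needed. */
    struct ShardState {};
    struct BufferState {};
    using LocalQuery = RangeCountParameters;

    static constexpr bool EARLY_ABORT = false;
    static constexpr bool SKIP_DELETE_FILTER = true;  /* deletes handled in merge */

    template <typename Shard>
    static ShardState get_query_state(const RangeCountParameters &, const Shard &) {
        return {};
    }

    template <typename Buffer>
    static BufferState get_buffer_query_state(const RangeCountParameters &, const Buffer &) {
        return {};
    }

    static std::vector<LocalQuery>
    process_query_states(const RangeCountParameters &q,
                         const std::vector<ShardState> &shard_states,
                         const std::vector<BufferState> &buffer_states) {
        /* Every partition is asked the same question. */
        return std::vector<LocalQuery>(shard_states.size() + buffer_states.size(), q);
    }

    template <typename Shard>
    static PartialCount query(const LocalQuery &q, const Shard &shard) {
        /* lower_bound/upper_bound are assumed to return indexes into the
           sorted array exposed by get_data(). */
        PartialCount res{0, 0};
        const auto *recs = shard.get_data();
        size_t stop = shard.upper_bound(q.high);
        for (size_t i = shard.lower_bound(q.low); i < stop; i++) {
            if (recs[i].is_tombstone()) res.tombstones++;
            else                        res.records++;
        }
        return res;
    }

    template <typename Buffer>
    static PartialCount buffer_query(const LocalQuery &q, const Buffer &buffer) {
        PartialCount res{0, 0};
        for (const auto &rec : buffer) {  /* unsorted buffer: full scan */
            if (rec.key() < q.low || rec.key() > q.high) continue;
            if (rec.is_tombstone()) res.tombstones++;
            else                    res.records++;
        }
        return res;
    }

    static size_t merge(const std::vector<PartialCount> &partials) {
        size_t records = 0, tombstones = 0;
        for (const auto &p : partials) {
            records    += p.records;
            tombstones += p.tombstones;
        }
        return records - tombstones;  /* each tombstone cancels one record */
    }
};
\end{verbatim}
Because every tombstone that falls within the range also appears in some partial count, \texttt{SKIP\_DELETE\_FILTER} is enabled and the cancellation is handled entirely within \texttt{merge}, mirroring the range-query optimization described above.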
The framework itself answers queries by simply calling these routines in a prescribed order,
\begin{verbatim}
query(QueryArguments qa)
BEGIN
    FOR i < BufferCount DO
        BufferStates[i] = get_buffer_query_state(qa, Buffers[i])
    DONE

    FOR i < ShardCount DO
        ShardStates[i] = get_query_state(qa, Shards[i])
    DONE

    process_query_states(qa, ShardStates, BufferStates)

    FOR i < BufferCount DO
        temp = buffer_query(BufferStates[i], Buffers[i])
        IF NOT SKIP_DELETE_FILTER THEN
            temp = filter_deletes(temp)
        END
        Results[i] = temp

        IF EARLY_ABORT AND Results[i].size() > 0 THEN
            delete_states(ShardStates, BufferStates)
            return merge(Results)
        END
    DONE

    FOR i < ShardCount DO
        temp = query(ShardStates[i], Shards[i])
        IF NOT SKIP_DELETE_FILTER THEN
            temp = filter_deletes(temp)
        END
        Results[i + BufferCount] = temp

        IF EARLY_ABORT AND Results[i + BufferCount].size() > 0 THEN
            delete_states(ShardStates, BufferStates)
            return merge(Results)
        END
    DONE

    delete_states(ShardStates, BufferStates)
    return merge(Results)
END
\end{verbatim}
\subsubsection{Standardized Queries}
Provided with the framework are several ``standardized'' query classes, including point lookup, range query, and IRS. These queries can be freely applied to any shard class that implements the necessary optional interfaces. For example, the provided IRS and range query classes both require the shard to implement \texttt{lower\_bound} and \texttt{upper\_bound} functions that return an index. They then use these indexes to access the record array exposed via \texttt{get\_data}. This is convenient, because it helps to separate the search problem from the data structure, moving towards presenting the two as orthogonal concerns. In the next section, the framework is evaluated by producing a number of indexes for three different search problems. Specifically, the framework is applied to a pair of learned indexes, as well as an ISAM-tree. All three of these shards provide the bound interface described above, meaning that the same range query class can be used for all of them. It also means that the learned indexes automatically gain support for IRS. And, of course, all of them can also be used with the provided point-lookup query, which simply uses the required \texttt{point\_lookup} function of the shard. At present, the framework only supports associating a single query class with an index. However, this is simply a limitation of the implementation. In the future, approaches will be considered for associating arbitrary query classes, to allow truly multi-purpose indexes to be constructed. This is not to say that every data structure will necessarily be efficient at answering every type of query that could be answered using its interface, but in a database system, being able to repurpose an existing index to accelerate a wide range of query types would certainly seem worth considering.
\section{Framework Evaluation}
The framework was evaluated using three different types of search problem: range-count, high-dimensional k-nearest neighbor, and independent range sampling. In all three cases, an extended static data structure was compared with dynamic alternatives for the same search problem to demonstrate the framework's competitiveness.
\subsection{Methodology}
All tests were performed using Ubuntu 22.04 LTS on a dual-socket Intel Xeon Gold 6242R server with 384 GiB of installed memory and 40 physical cores. Benchmark code was compiled using \texttt{gcc} version 11.3.0 at the \texttt{-O3} optimization level.
\subsection{Range Queries}
A first test evaluates the performance of the framework in the context of range queries against learned indexes. In Chapter~\ref{chap:intro}, the lengthy development cycle of this sort of data structure was discussed, and learned indexes were selected as an evaluation candidate to demonstrate how this framework could allow such development lifecycles to be largely bypassed. Specifically, the framework is used to produce dynamic learned indexes based on TrieSpline~\cite{plex} (DE-TS) and the static version of PGM~\cite{pgm} (DE-PGM). These are both single-pass-construction static learned indexes, and thus well suited for use within this framework compared to more complex structures like RMI~\cite{RMI}, which have more expensive construction algorithms. The two framework-extended data structures are compared with dynamic learned indexes, namely ALEX~\cite{ALEX} and the dynamic version of PGM~\cite{pgm}. PGM provides an interesting comparison, as its native dynamic version was implemented using a slightly modified version of the Bentley-Saxe method. When performing range queries over large data sets, the copying of query results can introduce significant overhead. Because the four tested structures have different data copy behaviors, a range count query was used for testing, rather than a pure range query. This search problem exposes the search performance of the data structures while controlling for different data copy behaviors, and so should provide more directly comparable results. Range count queries were executed with a selectivity of $0.01\%$ against three datasets from the SOSD benchmark~\cite{sosd-datasets}: \texttt{book}, \texttt{fb}, and \texttt{osm}, each of which contains 200 million 64-bit keys following a different distribution, paired with uniquely generated 64-bit values. There is a fourth dataset in SOSD, \texttt{wiki}, which was excluded from testing because it contains duplicate keys, which are not supported by dynamic PGM.\footnote{The dynamic version of PGM supports deletes using tombstones, but doesn't wrap records with a header to accomplish this. Instead, it reserves one possible value to represent a tombstone. Records are deleted by inserting a record with the same key and this reserved value. This means that duplicate keys, even if they have different values, are unsupported, as two records with the same key will be treated as a delete by the index.~\cite{pgm} } The shard implementations for DE-PGM and DE-TS required about 300 lines of C++ code each, and no modification to the data structures themselves. For both data structures, the framework was configured with a buffer of 12,000 records, a scale factor of 8, the tombstone delete policy, and tiering. Each shard stored $D_i$ as a sorted array of records, used an instance of the learned index for $\mathcal{I}_i$, and had no auxiliary structures. The local query routine used the learned index to locate the first key in the query range and then iterated over the sorted array until the end of the range was reached, counting the number of records and tombstones encountered. The mutable buffer query performed the same counting using a full scan. No local pre-processing was needed, and the merge operation simply summed the record and tombstone counts and returned their difference.
\begin{figure*}[t]
\centering
\subfloat[Update Throughput]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-insert} \label{fig:rq-insert}}
\subfloat[Query Latency]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-query} \label{fig:rq-query}} \\
\subfloat[Index Sizes]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0 ]{img/fig-bs-rq-space} \label{fig:idx-space}}
\caption{Range Count Evaluation}
\label{fig:results1}
\end{figure*}
Figure~\ref{fig:rq-insert} shows the update throughput of all competitors. ALEX performs the worst in all cases, and PGM performs the best, with the extended indexes falling in the middle. It is not unexpected that PGM performs better than the framework, because the Bentley-Saxe extension in PGM is custom-built, and thus has a tighter integration than a general framework would allow. However, even with this advantage, DE-PGM still reaches up to 85\% of PGM's insertion throughput. Additionally, Figure~\ref{fig:rq-query} shows that PGM pays a large cost in query latency for its advantage in insertion, with the framework-extended indexes significantly outperforming it. Further, DE-TS even outperforms ALEX in query latency in some cases. Finally, Figure~\ref{fig:idx-space} shows the storage cost of the indexes, without counting the space necessary to store the records themselves. The storage cost of a learned index is fairly variable, as it is largely a function of the distribution of the data, but in all cases the extended learned indexes, which build compact data arrays without gaps, occupy three orders of magnitude less storage than ALEX, which requires leaving gaps in its data arrays.
\subsection{High-Dimensional k-Nearest Neighbor}
The next test evaluates the framework for the extension of high-dimensional metric indexes for the k-nearest neighbor search problem. An M-tree~\cite{mtree} was used as the dynamic baseline,\footnote{ Specifically, the M-tree implementation tested can be found at \url{https://github.com/dbrumbaugh/M-Tree} and is a fork of a structure written originally by Eduardo D'Avila, modified to compile under C++20. The tree uses a random selection algorithm for ball splitting. } and a VPTree~\cite{vptree} as the static structure. The framework was used to extend the VPTree to produce the dynamic version, DE-VPTree. An M-tree is a tree that partitions records based on high-dimensional spheres and supports updates by splitting and merging these partitions. A VPTree is a binary tree that is produced by recursively selecting a point, called the vantage point, and partitioning records based on their distance from that point. This results in a difficult-to-modify structure that can be constructed in $O(n \log n)$ time and can answer KNN queries in $O(k \log n)$ time. DE-VPTree used a buffer of 12,000 records, a scale factor of 6, tiering, and delete tagging. The query was implemented without a pre-processing step, using the standard VPTree algorithm for KNN queries against each shard. All $k$ records were determined for each shard, and then the merge operation used a heap to merge the result sets together and return the $k$ nearest neighbors from the $k\log(n)$ intermediate results. This is a type of query that pays a non-constant merge cost of $O(k \log k)$, even with the framework's expanded query interface. In effect, the KNN query must be answered twice: once for each shard to get the intermediate result sets, and then a second time within the merge operation to select the $k$ nearest neighbors from those result sets.
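A minimal sketch of a heap-based merge step of this kind is shown below, following the bounded-heap approach from Section~\ref{ssec:knn}: each partial result list is assumed to be sorted by distance to the query point, only the head of each list resides in the heap at any time, and raw distances stand in for full records for the sake of brevity.
\begin{verbatim}
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

/* Merge sorted per-shard KNN result lists, returning the k smallest distances
   overall. Each heap entry records which partial list it came from and its
   offset, so the heap never holds more than one entry per shard. */
std::vector<double> knn_merge(const std::vector<std::vector<double>> &partials,
                              size_t k) {
    using Entry = std::pair<double, std::pair<size_t, size_t>>;  /* (dist, (list, offset)) */
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (size_t i = 0; i < partials.size(); i++) {
        if (!partials[i].empty()) {
            heap.push({partials[i][0], {i, 0}});  /* seed with the head of each list */
        }
    }

    std::vector<double> result;
    while (result.size() < k && !heap.empty()) {
        auto [dist, loc] = heap.top();
        heap.pop();
        result.push_back(dist);

        auto [list, offset] = loc;
        if (offset + 1 < partials[list].size()) {
            heap.push({partials[list][offset + 1], {list, offset + 1}});
        }
    }
    return result;
}
\end{verbatim}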
\begin{figure}
\centering
\includegraphics[width=.75\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn}
\caption{KNN Index Evaluation}
\label{fig:knn}
\end{figure}
Euclidean distance was used as the metric for both structures, and $k=1000$ was used for all queries. The reference point for each query was selected randomly from points within the dataset. Tests were run using the Spanish Billion Words dataset~\cite{sbw}, of 300-dimensional vectors. The results are shown in Figure~\ref{fig:knn}. In this case, the static nature of the VPTree allows it to dominate the M-tree in query latency, and the simpler reconstruction procedure shows a significant insertion performance improvement as well.
\subsection{Independent Range Sampling}
Finally, the framework was tested using one-dimensional IRS queries. As before, a static ISAM-tree was used as the data structure to be extended; however, the sampling query was implemented using the query interface from Section~\ref{ssec:fw-query-int}. The pre-processing step identifies, for each shard, the first and last record falling into the range to be sampled from, and determines the shard's total weight based on this range. Then, in the local query generation step, these weights are used to construct an alias structure, which is used to assign sample sizes to each shard in proportion to its weight, to avoid introducing skew into the results. After this, the local query routine generates random numbers between the established bounds to sample records, and the merge operation appends the individual result sets together. Because the shards are static, this procedure requires only a pair of tree traversals per shard, regardless of how many samples are taken.
\begin{figure}
\centering
\subfloat[Query Latency]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-query} \label{fig:irs-query}}
\subfloat[Update Throughput]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-insert} \label{fig:irs-insert}}
\caption{IRS Index Evaluation}
\label{fig:results2}
\end{figure}
The extended ISAM structure (DE-IRS) was compared to a B$^+$-Tree with aggregate weight tags on internal nodes (AGG B+Tree) for sampling and insertion performance, and to a single instance of the static ISAM-tree (ISAM), which does not support updates. DE-IRS was configured with a buffer size of 12,000 records, a scale factor of 6, tiering, and delete tagging. The IRS queries had a selectivity of $0.1\%$ and a sample size of $k=1000$. Testing was performed using the same datasets as were used for range queries. Figure~\ref{fig:irs-query} shows the significant latency advantage that the dynamically extended ISAM-tree enjoys compared to the B$^+$-Tree. DE-IRS is up to 23 times faster than the B$^+$-Tree at answering sampling queries, and only about 3 times slower than the fully static solution. In this case, the extra query cost caused by needing to query multiple structures is more than balanced by the query efficiency of each of those structures, relative to tree sampling. Interestingly, the framework also results in better update performance than the B$^+$-Tree, as shown in Figure~\ref{fig:irs-insert}. This is likely because the ISAM shards can be constructed efficiently using a combination of sorted-merge operations and bulk loading, avoiding the expensive structural modification operations that are necessary for maintaining a B$^+$-Tree.
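To make the per-shard sampling step described at the beginning of this subsection concrete, a minimal sketch is shown below. The \texttt{Record} type and the explicit random number generator are assumptions made for the example; the actual implementation operates over the wrapped record array exposed by the shard.
\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct Record { uint64_t key; uint64_t value; };  /* hypothetical record type */

/* Per-shard IRS local query: the pre-processing step has already located the
   index range [start, stop) of records falling within the key range (assumed
   non-empty here), so each sample is simply a uniformly random index into the
   shard's sorted record array. */
std::vector<Record> irs_local_query(const Record *data, size_t start, size_t stop,
                                    size_t sample_size, std::mt19937 &rng) {
    std::uniform_int_distribution<size_t> dist(start, stop - 1);
    std::vector<Record> samples;
    samples.reserve(sample_size);
    for (size_t i = 0; i < sample_size; i++) {
        samples.push_back(data[dist(rng)]);
    }
    return samples;
}
\end{verbatim}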
\subsection{Discussion}
The results demonstrate not only that the framework's update support is competitive with custom-built dynamic data structures, but that the framework is even able to, in many cases, retain some of the query performance advantage of its extended static data structure. This is particularly evident in the k-nearest neighbor and independent range sampling tests, where the static version of the structure was directly tested as well. These tests demonstrate one of the advantages of static data structures: they are able to maintain much tighter inter-record relationships than dynamic ones, because update support typically requires relaxing these relationships to make updates cheaper. While the framework introduces the overhead of querying multiple structures and merging their results together, it is clear from the results that this overhead is generally less than the overhead incurred by the update support techniques used in the dynamic structures. The only case where the framework was defeated in query performance was in competition with ALEX, and even there the resulting query latencies were comparable. It is also evident that the update support provided by the framework is on par with, if not superior to, that provided by the dynamic baselines, at least in terms of throughput. The framework will certainly suffer from larger tail latency spikes, which weren't measured in this round of testing, due to the larger scale of its reconstructions, but the amortization of these costs over a large number of inserts allows a respectable level of throughput to be maintained. In fact, the only case where the framework loses in insertion throughput is against the dynamic PGM. However, an examination of the query latency reveals that this is likely because the standard configuration of the Bentley-Saxe variant used by PGM is highly tuned for insertion performance, as the query latencies against this data structure are far worse than those of any other learned index tested, so even this result shouldn't be taken as a ``clear'' defeat of the framework's implementation. Overall, it is clear from this evaluation that the dynamic extension framework is a promising alternative to manual index redesign for accommodating updates. In almost all cases, the framework-extended static data structures provided superior insertion throughput, and query latencies that either matched or exceeded those of the dynamic baselines. Additionally, though it is hard to quantify, the code complexity of the framework-extended data structures was much lower, with the shard implementations requiring only a small amount of relatively straightforward code to interface with pre-existing static data structures, or with the necessary data structure implementations themselves being simpler.
\section{Conclusion}
In this chapter, a generalized version of the framework originally proposed in Chapter~\ref{chap:sampling} was presented. This framework is based on two key properties: extended decomposability and record identity. It is capable of adding support for inserts and deletes to any data structure and search problem satisfying these two properties. An evaluation of this framework was performed by extending several static data structures, and comparing the resulting structures' performance against dynamic baselines capable of answering the same type of search problem.
The extended structures generally performed as well as, if not better than, their dynamic baselines in query performance, insert performance, or both. This demonstrates the capability of this framework to produce viable indexes in a variety of contexts. However, the framework is not yet complete. In the next chapter, the work required to bring this framework to completion will be described.