authorDouglas Rumbaugh <dbr4@psu.edu>2025-05-14 18:18:05 -0400
committerDouglas Rumbaugh <dbr4@psu.edu>2025-05-14 18:18:05 -0400
commit5d6e1d8bfeba9ab7970948b81ff13d7b963948a1 (patch)
tree1d3eb0689be0a7fab9dfd4dafe2f3a5ee6fde821 /chapters/dynamization.tex
parent40bff24fc2e2da57f382e4f49a5ffb7c826bbcfb (diff)
updates
Diffstat (limited to 'chapters/dynamization.tex')
-rw-r--r--chapters/dynamization.tex1027
1 files changed, 1027 insertions, 0 deletions
diff --git a/chapters/dynamization.tex b/chapters/dynamization.tex
new file mode 100644
index 0000000..edd3014
--- /dev/null
+++ b/chapters/dynamization.tex
@@ -0,0 +1,1027 @@
+\chapter{Classical Dynamization Techniques}
+\label{chap:background}
+
+This chapter introduces important background information and
+existing work in the area of data structure dynamization. We will
+first discuss the concept of a search problem, which is central to
+dynamization techniques. While one might imagine that restrictions on
+dynamization would be functions of the data structure to be dynamized,
+in practice the requirements placed on the data structure are quite mild;
+it is the properties required of the search problem that the data
+structure is used to address that present the central difficulty in
+applying dynamization techniques to a given area. After this, database
+indices will be discussed briefly, as they are the primary use of data
+structures within the database context that is of interest to our work.
+Following this, we will discuss existing theoretical results in the area
+of data structure dynamization, which will serve as the building blocks
+for our techniques in subsequent chapters. The chapter will conclude with
+a discussion of some of the limitations of these existing techniques.
+
+\section{Queries and Search Problems}
+\label{sec:dsp}
+
+Data access lies at the core of most database systems. We want to ask
+questions of the data, and ideally get the answer efficiently. We
+will refer to the different types of questions that can be asked as
+\emph{search problems}. We will be using this term in a similar way as
+the word \emph{query}\footnote{
+    The term query is often abused and used to
+    refer to several related, but slightly different, things. In the
+    vernacular, a query can refer to either a) a general type of search
+    problem (as in ``range query''), b) a specific instance of a search
+ problem, or c) a program written in a query language.
+}
+is often used within the database systems literature: to refer to a
+general class of questions. For example, we could consider range scans,
+point-lookups, nearest neighbor searches, predicate filtering, random
+sampling, etc., to each be a general search problem. Formally, for the
+purposes of this work, a search problem is defined as follows,
+
+\begin{definition}[Search Problem]
+ Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function
+ $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched,
+ $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the
+answer domain.\footnote{
+ It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an
+example, a \texttt{COUNT} aggregation might map a set of strings onto
+ an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need
+not be a universal constraint.
+}
+\end{definition}
+
+We will use the term \emph{query} to mean a specific instance of a search
+problem,
+
+\begin{definition}[Query]
+ Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and
+ a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific
+ instance of the search problem, $F(\mathcal{D}, q)$.
+\end{definition}
+
+As an example of using these definitions, a \emph{membership test}
+or \emph{range scan} would be considered search problems, and a range
+scan over the interval $[10, 99]$ would be a query. We draw this
+distinction because, as will become clear when we discuss our work in
+later chapters, it is useful to have separate, unambiguous terms for
+these two concepts.
+
+\subsection{Decomposable Search Problems}
+
+Dynamization techniques require partitioning one data structure
+into several smaller ones. As a result, these techniques can only
+be applied in situations where the search problem in question can
+be answered from this set of smaller data structures, producing the same
+answer as would have been obtained had all of the data been used to
+construct a single, large structure. This requirement is formalized in
+the definition of a class of problems called \emph{decomposable search
+problems (DSP)}. This class was first defined by Bentley and Saxe in
+their work on dynamization, and we will adopt their definition,
+
+\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
+ \label{def:dsp}
+ A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and
+ only if there exists a constant-time computable, associative, and
+ commutative binary operator $\mergeop$ such that,
+ \begin{equation*}
+ F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
+ \end{equation*}
+ for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
+\end{definition}
+
+The requirement for $\mergeop$ to be constant-time was used by Bentley and
+Saxe to prove specific performance bounds for answering queries from a
+decomposed data structure. However, it is not strictly \emph{necessary},
+and later work by Overmars lifted this constraint and considered a more
+general class of search problems called \emph{$C(n)$-decomposable search
+problems},
+
+\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars-cn-decomp}]
+ A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable
+ if and only if there exists an $O(C(n))$-time computable, associative,
+ and commutative binary operator $\mergeop$ such that,
+ \begin{equation*}
+ F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
+ \end{equation*}
+ for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
+\end{definition}
+
+To demonstrate that a search problem is decomposable, it is necessary to
+exhibit a merge operator, $\mergeop$, with the required
+properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B,
+q)$. With these two results, induction demonstrates that the problem is
+decomposable even in cases with more than two partial results.
+
+As an example, consider range counts,
+\begin{definition}[Range Count]
+ \label{def:range-count}
+ Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
+ $ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns
+ the cardinality, $|d \cap q|$.
+\end{definition}
+
+\begin{theorem}
+\label{ther:decomp-range-count}
+Range Count is a decomposable search problem.
+\end{theorem}
+
+\begin{proof}
+Let $\mergeop$ be addition ($+$). Applying this to
+Definition~\ref{def:dsp}, gives
+\begin{align*}
+ |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
+\end{align*}
+which holds because intersection distributes over union, and because $A$
+and $B$ are disjoint, the sets $A \cap q$ and $B \cap q$ are also
+disjoint, so their cardinalities sum. Addition is an associative and commutative
+operator that can be calculated in $\Theta(1)$ time. Therefore, range counts
+are DSPs.
+\end{proof}
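+
+To make this concrete, the following Python sketch answers a range count
+over a decomposed data set by evaluating each partition independently and
+merging the partial results with $\mergeop = +$; the partitions are
+represented as plain lists purely for illustration,
+\begin{verbatim}
+# A minimal sketch of range count as a DSP: each partition is counted
+# independently and the partial results are merged with the
+# constant-time operator +.
+def range_count(data, lo, hi):
+    return sum(lo <= x <= hi for x in data)
+
+partitions = [[1, 5, 9], [2, 3, 8], [4, 6, 7]]
+merged = sum(range_count(p, 3, 7) for p in partitions)
+full = range_count([x for p in partitions for x in p], 3, 7)
+print(merged, full)  # 5 5 -- the decomposed answer matches the full one
+\end{verbatim}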
+
+Because the codomain of a DSP is not restricted, more complex output
+structures can be used to allow for problems that are not directly
+decomposable to be converted to DSPs, possibly with some minor
+post-processing. For example, calculating the arithmetic mean of a set
+of numbers can be formulated as a DSP,
+\begin{theorem}
+The calculation of the arithmetic mean of a set of numbers is a DSP.
+\end{theorem}
+\begin{proof}
+ Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$,
+ where $\mathcal{D}\subset\mathbb{R}$ and is a multi-set. The output tuple
+contains the sum of the values within the input set, and the
+cardinality of the input set. For two disjoint partitions of the data,
+$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let
+$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$.
+
+Because $D_1$ and $D_2$ are disjoint, the sum and cardinality of $D_1 \cup
+D_2$ are exactly $s_1 + s_2$ and $c_1 + c_2$. Applying
+Definition~\ref{def:dsp} gives,
+\begin{align*}
+    A(D_1 \cup D_2) = (s_1 + s_2, c_1 + c_2) = A(D_1)\mergeop A(D_2)
+\end{align*}
+where $\mergeop$ is associative, commutative, and computable in
+$\Theta(1)$ time. From the resulting tuple $(s, c)$, the average can be
+determined in constant time by
+taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
+of numbers is a DSP.
+\end{proof}
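+
+As a brief illustration of this formulation, the following Python sketch
+computes the mean over a partitioned data set by merging $(s, c)$ pairs;
+the partitioning and values are arbitrary examples,
+\begin{verbatim}
+# A minimal sketch of the arithmetic mean as a DSP: each partition's
+# local result is a (sum, count) pair, the merge operator adds the pairs
+# component-wise, and the mean is recovered by a final division.
+def local_result(partition):
+    return (sum(partition), len(partition))
+
+def merge(r1, r2):
+    return (r1[0] + r2[0], r1[1] + r2[1])
+
+partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
+s, c = 0.0, 0
+for p in partitions:
+    s, c = merge((s, c), local_result(p))
+print(s / c)  # 3.5, identical to the mean over the full data set
+\end{verbatim}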
+
+\section{Dynamization for Decomposable Search Problems}
+
+Because data in a database is regularly updated, data structures
+intended to be used as an index must support updates (inserts, in-place
+modification, and deletes). Not all potentially useful data structures
+support updates, and so a general strategy for adding update support
+would increase the number of data structures that could be used as
+database indices. We refer to a data structure with update support as
+\emph{dynamic}, and one without update support as \emph{static}.\footnote{
+ The term static is distinct from immutable. Static refers to the
+ layout of records within the data structure, whereas immutable
+ refers to the data stored within those records. This distinction
+ will become relevant when we discuss different techniques for adding
+ delete support to data structures. The data structures used are
+ always static, but not necessarily immutable, because the records may
+ contain header information (like visibility) that is updated in place.
+}
+
+This section discusses \emph{dynamization}, the construction of a
+dynamic data structure based on an existing static one. When certain
+conditions are satisfied by the data structure and its associated
+search problem, this process can be done automatically, and with
+provable asymptotic bounds on amortized insertion performance, as well
+as worst-case query performance. This is in contrast to the manual
+design of dynamic data structures, which involves techniques based on
+partially rebuilding small portions of a single data structure (called
+\emph{local reconstruction})~\cite{overmars83}. This is a high-cost
+intervention that requires significant effort on the part of the data
+structure designer, whereas conventional dynamization can be performed
+with little-to-no modification of the underlying data structure at all.
+
+It is worth noting that there are a variety of techniques
+discussed in the literature for dynamizing structures with specific
+properties, or under very specific sets of circumstances. Examples
+include frameworks for adding update support to succinct data
+structures~\cite{dynamize-succinct} or taking advantage of batching
+of insert and query operations~\cite{batched-decomposable}. This
+section discusses techniques that are more general, and don't require
+workload-specific assumptions.
+
+We will first discuss the necessary data structure requirements, and
+then examine several classical dynamization techniques. The section
+will conclude with a discussion of delete support within the context
+of these techniques. For more detail than is included in this chapter,
+Overmars wrote a book providing a comprehensive survey of techniques for
+creating dynamic data structures, including not only the dynamization
+techniques discussed here, but also local reconstruction based
+techniques and more~\cite{overmars83}.\footnote{
+ Sadly, this book isn't readily available in
+ digital format as of the time of writing.
+}
+
+
+\subsection{Global Reconstruction}
+
+The most fundamental dynamization technique is that of \emph{global
+reconstruction}. While not particularly useful on its own, global
+reconstruction serves as the basis for the techniques to follow, and so
+we will begin our discussion of dynamization with it.
+
+Consider a class of data structure, $\mathcal{I}$, capable of answering a
+search problem, $F$. Insertion via global reconstruction is
+possible if $\mathcal{I}$ supports the following two operations,
+\begin{align*}
+\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
+\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
+\end{align*}
+where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$
+of the data structure over a set of records $d \subseteq \mathcal{D}$
+in $B(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d
+\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in
+$\Theta(1)$ time,\footnote{
+ There isn't any practical reason why $\mathtt{unbuild}$ must run
+ in constant time, but this is the assumption made in \cite{saxe79}
+ and in subsequent work based on it, and so we will follow the same
+ definition here.
+} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.
+
+Given this structure, an insert of record $r \in \mathcal{D}$ into a
+data structure $\mathscr{I} \in \mathcal{I}$ can be defined by,
+\begin{align*}
+\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) \cup \{r\})
+\end{align*}
+
+It goes without saying that this operation is sub-optimal, as the
+insertion cost is $\Theta(B(n))$, and $B(n) \in \Omega(n)$ for
+most data structures. However, this global reconstruction strategy can
+be used as a primitive for more sophisticated techniques that can provide
+reasonable performance.
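+
+The following Python sketch illustrates this procedure. The
+\texttt{SortedArray} class and its sort-based \texttt{build} are
+stand-ins chosen for illustration; any structure providing the
+\texttt{build} and \texttt{unbuild} operations described above would do,
+\begin{verbatim}
+# A minimal sketch of insertion via global reconstruction. Every insert
+# discards the old structure and rebuilds it from scratch, for a cost
+# of Theta(B(n)) per insert.
+class SortedArray:
+    def __init__(self, records):
+        self.records = sorted(records)   # build cost B(n)
+
+    def unbuild(self):
+        return self.records             # constant time: the backing list
+
+def insert_via_global_reconstruction(structure, record):
+    return SortedArray(structure.unbuild() + [record])
+
+s = SortedArray([5, 1, 9])
+s = insert_via_global_reconstruction(s, 4)
+print(s.records)  # [1, 4, 5, 9]
+\end{verbatim}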
+
+\subsection{Amortized Global Reconstruction}
+\label{ssec:agr}
+
+The problem with global reconstruction is that each insert must rebuild
+the entire data structure, involving all of its records. This results
+in a worst-case insert cost of $\Theta(B(n))$. However, opportunities
+for improving this scheme can present themselves when considering the
+\emph{amortized} insertion cost.
+
+Consider the cost accrued by the dynamized structure under global
+reconstruction over the lifetime of the structure. Each insert rewrites
+all of the existing records, so each of the $n$ inserts triggers a
+reconstruction costing at worst $\Theta(B(n))$. We can amortize this
+total cost over the $n$ records
+inserted to get an amortized insertion cost for global reconstruction of,
+
+\begin{equation*}
+I_a(n) = \frac{B(n) \cdot n}{n} = B(n)
+\end{equation*}
+
+This doesn't improve things as is; however, it does suggest two
+opportunities for improvement. If we could either reduce the size of
+the reconstructions, or the number of times a record is reconstructed,
+then we could reduce the amortized insertion cost.
+
+The key insight, first discussed by Bentley and Saxe, is that
+both of these goals can be accomplished by \emph{decomposing} the
+data structure into multiple, smaller structures, each built from
+a disjoint partition of the data. As long as the search problem
+being considered is decomposable, queries can be answered from
+this structure with bounded worst-case overhead, and the amortized
+insertion cost can be improved~\cite{saxe79}. Significant theoretical
+work exists in evaluating different strategies for decomposing the
+data structure~\cite{saxe79, overmars81, overmars83} and for leveraging
+specific efficiencies of the data structures being considered to improve
+these reconstructions~\cite{merge-dsp}.
+
+There are two general decomposition techniques that emerged from
+this work. The earliest of these, the logarithmic method, is often
+called the Bentley-Saxe method in modern literature and is the most
+commonly discussed technique today. The Bentley-Saxe method has been
+directly applied in a few instances in the literature, such as to
+metric indexing structures~\cite{naidan14} and spatial structures~\cite{bkdtree},
+and has also been used in a modified form for genetic sequence search
+structures~\cite{almodaresi23} and graphs~\cite{lsmgraph}, to cite a few
+examples.
+
+A later technique, the equal block method, was also developed. It is
+generally not as effective as the Bentley-Saxe method, and as a result we
+have not identified any specific applications of it outside of the
+theoretical literature. However, we will discuss it as well in the
+interest of completeness, and because it lends itself well to
+demonstrating certain properties of decomposition-based dynamization
+techniques.
+
+\subsection{Equal Block Method}
+\label{ssec:ebm}
+
+Though chronologically later, the equal block method is theoretically a
+bit simpler, and so we will begin our discussion of decomposition-based
+techniques for dynamizing decomposable search problems with it. There
+have been several proposed variations of this concept~\cite{maurer79,
+maurer80}, but we will focus on the most developed form, as described by
+Overmars and van Leeuwen~\cite{overmars-art-of-dyn, overmars83}. The core
+concept of the equal block method is to decompose the data structure
+into several smaller data structures, called blocks, over partitions
+of the data. This decomposition is performed such that each block is of
+roughly equal size.
+
+Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves
+some decomposable search problem, $F$, and is built over a set of records
+$d \subseteq \mathcal{D}$. This structure can be decomposed into $s$ blocks,
+$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$ each built over
+partitions of $d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a specific value
+makes little sense when the number of records changes, and so it is taken
+to be governed by a smooth, monotonically increasing function $f(n)$ such
+that, at any point, the following two constraints are obeyed.
+\begin{align}
+ f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\
+ \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2}
+\end{align}
+where $|\mathscr{I}_j|$ is the number of records in the block,
+$|\text{unbuild}(\mathscr{I}_j)|$.
+
+A new record is inserted by finding the smallest block and rebuilding it
+using the new record. If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$,
+then an insert is done by,
+\begin{equation*}
+\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\})
+\end{equation*}
+Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
+ Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
+ violated by deletes. We're omitting deletes from the discussion at
+ this point, but will circle back to them in Section~\ref{sec:deletes}.
+} In this case, the constraints are enforced by ``re-configuring'' the
+structure. $s$ is updated to be exactly $f(n)$, all of the existing
+blocks are unbuilt, and then the records are redistributed evenly into
+$s$ blocks.
+
+A query with parameters $q$ is answered by this structure by individually
+querying the blocks, and merging the local results together with $\mergeop$,
+\begin{equation*}
+F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q)
+\end{equation*}
+where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to
+answering the query over $d$ using the data structure $\mathscr{I}$.
+
+This technique provides better amortized performance bounds than global
+reconstruction, at the possible cost of worse query performance for
+sub-linear queries. We'll omit the details of the proof of performance
+for brevity and streamline some of the original notation (full details
+can be found in~\cite{overmars83}), but this technique ultimately
+results in a data structure with the following performance characteristics,
+\begin{align*}
+\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n} + B\left(\frac{n}{f(n)}\right)\right) \\
+\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \\
+\end{align*}
+where $B(n)$ is the cost of statically building $\mathcal{I}$, and
+$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$.
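+
+The following Python sketch illustrates the method's mechanics. The
+choices of $f(n) = \sqrt{n}$ and a sort-based \texttt{build} are
+assumptions made purely for illustration, and the query shown is the
+range count problem from earlier in the chapter,
+\begin{verbatim}
+# A minimal sketch of the equal block method with f(n) = sqrt(n).
+import math
+
+def build(records):
+    return sorted(records)
+
+class EqualBlockDynamization:
+    def __init__(self):
+        self.blocks = [[]]
+
+    def f(self, n):
+        return max(1, math.isqrt(n))
+
+    def insert(self, r):
+        # Rebuild the smallest block with the new record included.
+        k = min(range(len(self.blocks)),
+                key=lambda j: len(self.blocks[j]))
+        self.blocks[k] = build(self.blocks[k] + [r])
+        # If the block count falls outside [f(n/2), f(2n)], reconfigure:
+        # set s = f(n) and redistribute the records evenly.
+        n = sum(len(b) for b in self.blocks)
+        s = len(self.blocks)
+        if not (self.f(n // 2) <= s <= self.f(2 * n)):
+            records = [x for b in self.blocks for x in b]
+            s = self.f(n)
+            self.blocks = [build(records[i::s]) for i in range(s)]
+
+    def range_count(self, lo, hi):
+        # Query each block independently and merge the results with +.
+        return sum(sum(lo <= x <= hi for x in b) for b in self.blocks)
+
+d = EqualBlockDynamization()
+for r in range(100):
+    d.insert(r)
+print(d.range_count(10, 19))  # 10
+\end{verbatim}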
+
+%TODO: example?
+
+
+\subsection{The Bentley-Saxe Method}
+\label{ssec:bsm}
+
+The original, and most frequently used, dynamization technique is the
+Bentley-Saxe Method (BSM), also called the logarithmic method in older
+literature. Rather than breaking the data structure into equally sized
+blocks, BSM decomposes the structure into logarithmically many blocks
+of exponentially increasing size. More specifically, the data structure
+is decomposed into $h = \lceil \log_2 (n+1) \rceil$ blocks, $\mathscr{I}_0,
+\mathscr{I}_1, \ldots, \mathscr{I}_{h-1}$. A given block $\mathscr{I}_i$
+will be either empty, or contain exactly $2^i$ records within it.
+
+The procedure for inserting a record, $r \in \mathcal{D}$, into
+a BSM dynamization is as follows. If the block $\mathscr{I}_0$
+is empty, then $\mathscr{I}_0 = \text{build}(\{r\})$. If it is not
+empty, then there will exist a maximal sequence of non-empty blocks
+$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq
+0$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case,
+$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i
+\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through
+$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the
+end of the structure as needed.
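+
+The following Python sketch implements this insertion procedure; the
+sort-based \texttt{build} is, again, only an illustrative stand-in for
+the generic operation defined earlier,
+\begin{verbatim}
+# A minimal sketch of Bentley-Saxe insertion. blocks[i] is either None
+# (empty) or a structure over exactly 2^i records.
+def build(records):
+    return sorted(records)
+
+class BentleySaxe:
+    def __init__(self):
+        self.blocks = []
+
+    def insert(self, r):
+        carry = [r]
+        for i in range(len(self.blocks)):
+            if self.blocks[i] is None:
+                # First empty block found: rebuild it from the carried
+                # records and stop, like adding 1 to a binary counter.
+                self.blocks[i] = build(carry)
+                return
+            carry.extend(self.blocks[i])   # unbuild the full block
+            self.blocks[i] = None
+        self.blocks.append(build(carry))   # overflow: add a new block
+
+bsm = BentleySaxe()
+for r in range(12):
+    bsm.insert(r)
+# With 12 = 0b1100 records, blocks 2 and 3 are full; 0 and 1 are empty.
+print([len(b) if b else 0 for b in bsm.blocks])  # [0, 0, 4, 8]
+\end{verbatim}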
+
+%FIXME: switch the x's to r's for consistency
+\begin{figure}
+\centering
+\includegraphics[width=.8\textwidth]{diag/bsm.pdf}
+\caption{An illustration of inserts into the Bentley-Saxe Method}
+\label{fig:bsm-example}
+\end{figure}
+
+Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The
+dynamization is built over a set of records $x_1, x_2, \ldots,
+x_{10}$ initially, with eight records in $\mathscr{I}_3$ and two in
+$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly
+into $\mathscr{I}_0$. For the next insert following this, $x_{12}$, the
+first empty block is $\mathscr{I}_2$, and so the insert is performed by
+doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup
+\text{unbuild}(\mathscr{I}_0) \cup \text{unbuild}(\mathscr{I}_1)\right)$
+and then emptying $\mathscr{I}_0$ and $\mathscr{I}_1$.
+
+This technique is called a \emph{binary decomposition} of the data
+structure. Considering a BSM dynamization of a structure containing $n$
+records, labeling each block with a $0$ if it is empty and a $1$ if it
+is full will result in the binary representation of $n$. For example,
+the final state of the structure in Figure~\ref{fig:bsm-example} contains
+$12$ records, and the labeling procedure will result in $0\text{b}1100$,
+which is $12$ in binary. Inserts affect this representation of the
+structure in the same way that incrementing the binary number by $1$ does.
+
+By applying BSM to a data structure, a dynamized structure can be created
+with the following performance characteristics,
+\begin{align*}
+\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n}\cdot \log_2 n\right) \\
+\text{Worst Case Insertion Cost:}&\quad \Theta\left(B(n)\right) \\
+\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right) \\
+\end{align*}
+This is a particularly attractive result because, for example, a data
+structure having $B(n) \in \Theta(n)$ will have an amortized insertion
+cost of $\Theta(\log_2 n)$, which is quite reasonable. The trade-off for this
+is an extra logarithmic multiple attached to the query complexity. It is
+also worth noting that the worst-case insertion cost remains the same
+as global reconstruction, but this case arises only very rarely. If
+you consider the binary decomposition representation, the worst-case
+behavior is triggered each time the existing number overflows, and a
+new digit must be added.
+
+As a final note about the query performance of this structure, because
+the overhead due to querying the blocks is logarithmic, under certain
+circumstances this cost can be absorbed, resulting in no effect on the
+asymptotic worst-case query performance. As an example, consider a linear
+scan of the data running in $\Theta(n)$ time. In this case, every record
+must be considered, and so there isn't any performance penalty\footnote{
+ From an asymptotic perspective. There will still be measurable performance
+ effects from caching, etc., even in this case.
+} to breaking the records out into multiple chunks and scanning them
+individually. For formally, for any query running in $\mathscr{Q}(n) \in
+\Omega\left(n^\epsilon\right)$ time where $\epsilon > 0$, the worst-case
+cost of answering a decomposable search problem from a BSM dynamization
+is $\Theta\left(\mathscr{Q}(n)\right)$.~\cite{saxe79}
+
+\subsection{Merge Decomposable Search Problems}
+
+\subsection{Delete Support}
+
+Classical dynamization techniques have also been developed with
+support for deleting records. In general, the same technique of global
+reconstruction that was used for inserting records can also be used to
+delete them. Given a record $r \in \mathcal{D}$ and a data structure
+$\mathscr{I} \in \mathcal{I}$ such that $r \in \mathscr{I}$, $r$ can be
+deleted from the structure in $\Theta(B(n))$ time as follows,
+\begin{equation*}
+\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) \setminus \{r\})
+\end{equation*}
+However, supporting deletes within the dynamization schemes discussed
+above is more complicated. The core problem is that inserts affect the
+dynamized structure in a deterministic way, and as a result the
+partitioning scheme can be leveraged to reason about its
+performance. Deletes, however, do not work like this.
+
+\begin{figure}
+\caption{A Bentley-Saxe dynamization for the integers on the
+interval $[1, 100]$.}
+\label{fig:bsm-delete-example}
+\end{figure}
+
+For example, consider a Bentley-Saxe dynamization that contains all
+integers on the interval $[1, 100]$, inserted in that order, shown in
+Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the
+records from this structure, one at a time, using global reconstruction.
+This presents several problems,
+\begin{itemize}
+ \item For each record, we need to identify which block it is in before
+ we can delete it.
+ \item The cost of performing a delete is a function of which block the
+ record is in, which is a question of distribution and not easily
+ controlled.
+ \item As records are deleted, the structure will potentially violate
+ the invariants of the decomposition scheme used, which will
+ require additional work to fix.
+\end{itemize}
+
+To resolve these difficulties, two very different approaches have been
+proposed for supporting deletes, each of which relies on certain properties
+of the search problem and data structure. These are the use of a ghost
+structure and weak deletes.
+
+\subsubsection{Ghost Structure for Invertible Search Problems}
+
+The first proposed mechanism for supporting deletes was discussed
+alongside the Bentley-Saxe method in Bentley and Saxe's original
+paper. This technique applies to a class of search problems called
+\emph{invertible} (also called \emph{decomposable counting problems}
+in later literature~\cite{overmars83}). Invertible search problems
+are decomposable, and also support an ``inverse'' merge operator, $\Delta$,
+that is able to remove records from the result set. More formally,
+\begin{definition}[Invertible Search Problem~\cite{saxe79}]
+\label{def:invert}
+A decomposable search problem, $F$, is invertible if and only if there
+exists a constant-time computable operator, $\Delta$, such that
+\begin{equation*}
+F(A \setminus B, q) = F(A, q)~\Delta~F(B, q)
+\end{equation*}
+for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $B \subseteq A$.
+\end{definition}
+
+Given a search problem with this property, it is possible to perform
+deletes by creating a secondary ``ghost'' structure. When a record
+is to be deleted, it is inserted into this structure. Then, when the
+dynamization is queried, this ghost structure is queried as well as the
+main one. The results from the ghost structure can be removed from the
+result set using the inverse merge operator. This simulates the result
+that would have been obtained had the records been physically removed
+from the main structure.
+
+Two examples of invertible search problems are set membership
+and range count. Range count was formally defined in
+Definition~\ref{def:range-count}.
+
+\begin{theorem}
+Range count is an invertible search problem.
+\end{theorem}
+
+\begin{proof}
+To prove that range count is an invertible search problem, it must be
+decomposable and have a $\Delta$ operator. That it is a DSP has already
+been proven in Theorem~\ref{ther:decomp-range-count}.
+
+Let $\Delta$ be subtraction $(-)$. Applying this to Definition~\ref{def:invert}
+gives,
+\begin{equation*}
+|(A \setminus B) \cap q | = |(A \cap q) \setminus (B \cap q)| = |(A \cap q)| - |(B \cap q)|
+\end{equation*}
+which is true because intersection distributes over set difference, and
+because $B \subseteq A$ implies $B \cap q \subseteq A \cap q$, allowing
+the cardinalities to be subtracted. Subtraction is computable in constant
+time, therefore
+range count is an invertible search problem using subtraction as $\Delta$.
+\end{proof}
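+
+The following Python sketch illustrates the ghost-structure mechanism for
+range count, with the blocks of both structures represented as sorted
+lists purely for illustration,
+\begin{verbatim}
+# A minimal sketch of the ghost-structure mechanism for range count.
+# Deletes are inserted into a secondary set of blocks, and the inverse
+# operator (here, subtraction) removes their contribution at query time.
+import bisect
+
+def range_count(blocks, lo, hi):
+    # Decomposable query: count each block independently, merge with +.
+    return sum(bisect.bisect_right(b, hi) - bisect.bisect_left(b, lo)
+               for b in blocks)
+
+main_blocks = [[1, 4, 7], [2, 3, 5, 6, 8, 9, 10]]   # records inserted
+ghost_blocks = [[3], [7, 8]]                        # records deleted
+
+# F(A \ B, q) = F(A, q) Delta F(B, q), with Delta being subtraction.
+result = range_count(main_blocks, 2, 8) - range_count(ghost_blocks, 2, 8)
+print(result)  # 4: records {2, 4, 5, 6} remain in [2, 8] after deletes
+\end{verbatim}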
+
+The set membership search problem is defined as follows,
+\begin{definition}[Set Membership]
+\label{def:set-membership}
+Consider a set of elements $d \subseteq \mathcal{D}$ from some domain,
+and a single element $r \in \mathcal{D}$. A test of set membership is a
+search problem of the form $F: (\mathcal{PS}(\mathcal{D}), \mathcal{D})
+\to \mathbb{B}$ such that $F(d, r) = r \in d$, which maps to $0$ if $r
+\not\in d$ and $1$ if $r \in d$.
+\end{definition}
+
+\begin{theorem}
+Set membership is an invertible search problem.
+\end{theorem}
+
+\begin{proof}
+To prove that set membership is invertible, it is necessary to establish
+that it is a decomposable search problem, and that a $\Delta$ operator
+exists. We'll begin with the former.
+\begin{lemma}
+ \label{lem:set-memb-dsp}
+ Set membership is a decomposable search problem.
+\end{lemma}
+\begin{proof}
+Let $\mergeop$ be the logical disjunction ($\lor$). This yields,
+\begin{align*}
+F(A \cup B, r) &= F(A, r) \lor F(B, r) \\
+r \in (A \cup B) &= (r \in A) \lor (r \in B)
+\end{align*}
+which is true, following directly from the definition of union. The
+logical disjunction is an associative, commutative operator that can
+be calculated in $\Theta(1)$ time. Therefore, set membership is a
+decomposable search problem.
+\end{proof}
+
+For the inverse merge operator, $\Delta$, it is necessary that $F(A,
+r)~\Delta~F(B, r)$ be true if and only if $r \in A$ and $r \not\in
+B$. Thus, it can be directly implemented as $F(A, r)~\Delta~F(B, r) =
+F(A, r) \land \neg F(B, r)$, which is computable in constant time when
+the operands are already known.
+
+Thus, we have shown that set membership is a decomposable search problem,
+and that a constant time $\Delta$ operator exists. Therefore, it is an
+invertible search problem.
+\end{proof}
+
+For search problems such as these, this technique allows for deletes to be
+supported with the same cost as an insert. Unfortunately, it suffers from
+write amplification because each deleted record is recorded twice: once in
+the main structure, and once in the ghost structure. This means that $n$
+is, in effect, the total number of inserts and deletes. This can lead
+to some serious problems. For example, if every record in a structure
+of $n$ records is deleted, the net result will be an ``empty'' dynamized
+data structure containing $2n$ physical records within it. To circumvent
+this problem, Bentley and Saxe proposed setting a maximum
+threshold for the size of the ghost structure relative to the main one,
+and performing a complete re-partitioning of the data once this threshold
+is reached: removing all deleted records from the main structure,
+emptying the ghost structure, and rebuilding the blocks with the records
+that remain according to the invariants of the technique.
+
+\subsubsection{Weak Deletes for Deletion Decomposable Search Problems}
+
+Another approach for supporting deletes was proposed later, by Overmars
+and van Leeuwen, for a class of search problem called \emph{deletion
+decomposable}. These are decomposable search problems for which the
+underlying data structure supports a delete operation. More formally,
+
+\begin{definition}[Deletion Decomposable Search Problem~\cite{merge-dsp}]
+ A decomposable search problem, $F$, and its data structure,
+ $\mathcal{I}$, is deletion decomposable if and only if, for some
+ instance $\mathscr{I} \in \mathcal{I}$, containing $n$ records,
+ there exists a deletion routine $\mathtt{delete}(\mathscr{I},
+ r)$ that removes some $r \in \mathcal{D}$ in time $D(n)$ without
+ increasing the query time, deletion time, or storage requirement,
+ for $\mathscr{I}$.
+\end{definition}
+
+Superficially, this doesn't appear very useful. If the underlying data
+structure already supports deletes, there isn't much reason to use a
+dynamization technique to add deletes to it. However, one point worth
+mentioning is that it is possible, in many cases, to easily \emph{add}
+delete support to a static structure. If it is possible to locate a
+record and somehow mark it as deleted, without removing it from the
+structure, and then efficiently ignore these records while querying,
+then the given structure and its search problem can be said to be
+deletion decomposable. This technique for deleting records is called
+\emph{weak deletes}.
+
+\begin{definition}[Weak Deletes~\cite{overmars81}]
+\label{def:weak-delete}
+A data structure is said to support weak deletes if it provides a
+routine, \texttt{delete}, that guarantees that after $\alpha \cdot n$
+deletions, where $\alpha < 1$, the query cost is bounded by $k_\alpha
+\mathscr{Q}(n)$ for some constant $k_\alpha$ dependent only upon $\alpha$,
+where $\mathscr{Q}(n)$ is the cost of answering the query against a
+structure upon which no weak deletes were performed.\footnote{
+ This paper also provides a similar definition for weak updates,
+ but these aren't of interest to us in this work, and so the above
+ definition was adapted from the original with the weak update
+ constraints removed.
+} The results of the query of a block containing weakly deleted records
+should be the same as the results would be against a block with those
+records removed.
+\end{definition}
+
+As an example of a deletion decomposable search problem, consider the set
+membership problem considered above (Definition~\ref{def:set-membership})
+where $\mathcal{I}$, the data structure used to answer queries of the
+search problem, is a hash map.\footnote{
+ While most hash maps are already dynamic, and so wouldn't need
+ dynamization to be applied, there do exist static ones too. For example,
+ the hash map being considered could be implemented using perfect
+ hashing~\cite{perfect-hashing}, which has many static implementations.
+}
+
+\begin{theorem}
+ The set membership problem, answered using a static hash map, is
+ deletion decomposable.
+\end{theorem}
+
+\begin{proof}
+We've already shown in Lemma~\ref{lem:set-memb-dsp} that set membership
+is a decomposable search problem. For it to be deletion decomposable,
+we must demonstrate that the hash map, $\mathcal{I}$, supports deleting
+records without hurting its query performance, delete performance, or
+storage requirements. Assume that an instance $\mathscr{I} \in
+\mathcal{I}$ having $|\mathscr{I}| = n$ can answer queries in
+$\mathscr{Q}(n) \in \Theta(1)$ time and requires $\Omega(n)$ storage.
+
+Such a structure can support weak deletes. Each record within the
+structure has a single bit attached to it, indicating whether it has
+been deleted or not. These bits will require $\Theta(n)$ storage and
+be initialized to 0 when the structure is constructed. A delete can
+be performed by querying the structure for the record to be deleted in
+$\Theta(1)$ time, and setting the bit to 1 if the record is found. This
+operation has $D(n) \in \Theta(1)$ cost.
+
+\begin{lemma}
+\label{lem:weak-deletes}
+The delete procedure as described above satisfies the requirements of
+Definition~\ref{def:weak-delete} for weak deletes.
+\end{lemma}
+\begin{proof}
+Per Definition~\ref{def:weak-delete}, there must exist some constant
+dependent only on $\alpha$, $k_\alpha$, such that after $\alpha \cdot
+n$ deletes against $\mathscr{I}$ with $\alpha < 1$, the query cost is
+bounded by $k_\alpha \mathscr{Q}(n)$.
+
+In this case, $\mathscr{Q}(n) \in \Theta(1)$, and therefore our final
+query cost must be bounded by $\Theta(k_\alpha)$. When a query is
+executed against $\mathscr{I}$, there are three possible cases,
+\begin{enumerate}
+\item The record being searched for does not exist in $\mathscr{I}$. In
+this case, the query result is 0.
+\item The record being searched for does exist in $\mathscr{I}$ and has
+a delete bit value of 0. In this case, the query result is 1.
+\item The record being searched for does exist in $\mathscr{I}$ and has
+a delete bit value of 1 (i.e., it has been deleted). In this case, the
+query result is 0.
+\end{enumerate}
+In all three cases, the addition of deletes requires at most constant
+extra work. Therefore, set membership over a static hash map
+using our proposed deletion mechanism satisfies the requirements for
+weak deletes, with $k_\alpha = 1$.
+\end{proof}
+
+Finally, we note that the cost of such a weak delete is $D(n) \in
+\Theta(\mathscr{Q}(n)) = \Theta(1)$ and, by Lemma~\ref{lem:weak-deletes},
+neither the query cost nor the delete cost is asymptotically harmed by
+the presence of deleted records.
+
+Thus, we've shown that set membership using a static hash map is a
+decomposable search problem, that the storage cost remains $\Omega(n)$, and that the
+query and delete costs are unaffected by the presence of deletes using the
+proposed mechanism. All of the requirements of deletion decomposability
+are satisfied, therefore set membership using a static hash map is a
+deletion decomposable search problem.
+\end{proof}
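+
+The following Python sketch illustrates such a weak delete mechanism. A
+dictionary stands in for the static hash map (a real instance might use
+perfect hashing); the record layout is never modified, only the attached
+deletion bits,
+\begin{verbatim}
+# A minimal sketch of weak deletes for set membership: each record
+# carries a deleted flag, set by delete() and checked by contains().
+class StaticMembership:
+    def __init__(self, records):
+        self._slots = {r: False for r in records}   # False = not deleted
+
+    def contains(self, r):
+        # Deleted records are ignored at query time.
+        return r in self._slots and not self._slots[r]
+
+    def delete(self, r):
+        # Locate the record and mark it; the layout is left untouched.
+        if r in self._slots:
+            self._slots[r] = True
+
+s = StaticMembership([2, 3, 5, 7, 11])
+s.delete(5)
+print(s.contains(5), s.contains(7))  # False True
+\end{verbatim}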
+
+For such problems, deletes can be supported by first identifying the
+block in the dynamization containing the record to be deleted, and
+then calling $\mathtt{delete}$ on it. In order to allow this block to
+be easily located, it is possible to maintain a hash table over all
+of the records, alongside the dynamization, which maps each record
+onto the block containing it. This table must be kept up to date as
+reconstructions occur, but this can be done at no extra asymptotic cost
+for any data structure having $B(n) \in \Omega(n)$, as it requires only
+linear time. This allows for deletes to be performed in $\mathscr{D}(n)
+\in \Theta(D(n))$ time.
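+
+A sketch of this bookkeeping, using a Python dictionary as the auxiliary
+hash table and dictionaries of deletion flags as stand-ins for the
+blocks themselves, is shown below,
+\begin{verbatim}
+# A minimal sketch of locating the owning block before issuing a weak
+# delete. block_of maps each record to its block index and must be
+# rebuilt (in linear time) whenever a reconstruction moves records.
+blocks = [{"a": False}, {"b": False, "c": False}]  # record -> deleted flag
+block_of = {r: i for i, b in enumerate(blocks) for r in b}
+
+def delete(r):
+    i = block_of.get(r)
+    if i is not None:
+        blocks[i][r] = True   # weak delete inside the owning block
+
+delete("c")
+print(blocks)  # [{'a': False}, {'b': False, 'c': True}]
+\end{verbatim}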
+
+The presence of deleted records within the structure does introduce a
+new problem, however. Over time, the number of records in each block will
+drift away from the requirements imposed by the dynamization technique. It
+will eventually become necessary to re-partition the records to restore
+these invariants, which are necessary for bounding the number of blocks,
+and thereby the query performance. The particular invariant maintenance
+rules depend upon the decomposition scheme used.
+
+\Paragraph{Bentley-Saxe Method.} When creating a BSM dynamization for
+a deletion decomposable search problem, the $i$th block where $i \geq 2$\footnote{
+ Block $i=0$ will only ever have one record, so no special maintenance must be
+ done for it. A delete will simply empty it completely.
+},
+in the absence of deletes, will contain $2^{i-1} + 1$ records. When a
+delete occurs in block $i$, no special action is taken until the number
+of records in that block falls below $2^{i-2}$. Once this threshold is
+reached, a reconstruction can be performed to restore the appropriate
+record counts in each block.~\cite{merge-dsp}
+
+\Paragraph{Equal Block Method.} For the equal block method, there are
+two cases in which a delete may cause a block to fail to obey the method's
+size invariants,
+\begin{enumerate}
+    \item If enough records are deleted, it is possible for the number
+    of blocks to exceed $f(2n)$, violating Constraint~\ref{ebm-c1}.
+    \item The deletion of records may cause the maximum allowable size of
+    each block to shrink, causing some blocks to exceed the maximum
+    capacity of $\nicefrac{2n}{s}$. This is a violation of
+    Constraint~\ref{ebm-c2}.
+\end{enumerate}
+In both cases, it should be noted that $n$ is decreased as records are
+deleted. Should either of these cases emerge as a result of a delete,
+the entire structure must be reconfigured to ensure that its invariants
+are maintained. This reconfiguration follows the same procedure as when
+an insert results in a violation: $s$ is updated to be exactly $f(n)$, all
+existing blocks are unbuilt, and then the records are evenly redistributed
+into the $s$ blocks.~\cite{overmars-art-of-dyn}
+
+
+\subsection{Worst-Case Optimal Techniques}
+
+
+\section{Limitations of Classical Dynamization Techniques}
+\label{sec:bsm-limits}
+
+While fairly general, these dynamization techniques have a number of
+limitations that prevent them from being directly usable as a general
+solution to the problem of creating database indices. Because of the
+requirement that the query being answered be decomposable, many search
+problems cannot be addressed, or at least not efficiently addressed, by
+decomposition-based dynamization. The techniques also do nothing to reduce
+the worst-case insertion cost, resulting in extremely poor tail latency
+performance relative to hand-built dynamic structures. Finally, these
+approaches do not do a good job of exposing the underlying configuration
+space to the user, meaning that the user can exert only limited control
+over the performance of the dynamized data structure. This section will discuss
+these limitations, and the rest of the document will be dedicated to
+proposing solutions to them.
+
+\subsection{Limits of Decomposability}
+\label{ssec:decomp-limits}
+Unfortunately, the DSP abstraction used as the basis of classical
+dynamization techniques has a few significant limitations that restrict
+their applicability,
+
+\begin{itemize}
+ \item The query must be broadcast identically to each block and cannot
+ be adjusted based on the state of the other blocks.
+
+ \item The query process is done in one pass--it cannot be repeated.
+
+ \item The result merge operation must be $O(1)$ to maintain good query
+ performance.
+
+ \item The result merge operation must be commutative and associative,
+ and is called repeatedly to merge pairs of results.
+\end{itemize}
+
+These requirements restrict the types of queries that can be supported by
+the method efficiently. For example, k-nearest neighbor and independent
+range sampling are not decomposable.
+
+\subsubsection{k-Nearest Neighbor}
+\label{sssec-decomp-limits-knn}
+The k-nearest neighbor (k-NN) problem is a generalization of the nearest
+neighbor problem, which seeks to return the closest point within the
+dataset to a given query point. More formally, this can be defined as,
+\begin{definition}[Nearest Neighbor]
+
+    Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$
+    be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
+    between two points within $D$. The nearest neighbor problem, $NN(D,
+    q)$, returns some point $d^* \in D$ satisfying $f(d^*, q) = \min_{d
+    \in D} f(d, q)$ for a given query point, $q \in \mathbb{R}^d$.
+
+\end{definition}
+
+In practice, it is common to require $f(x, y)$ be a metric,\footnote
+{
+ Contrary to its vernacular usage as a synonym for ``distance'', a
+ metric is more formally defined as a valid distance function over
+ a metric space. Metric spaces require their distance functions to
+ have the following properties,
+ \begin{itemize}
+ \item The distance between a point and itself is always 0.
+ \item All distances between non-equal points must be positive.
+ \item For all points, $x, y \in D$, it is true that
+ $f(x, y) = f(y, x)$.
+ \item For any three points $x, y, z \in D$ it is true that
+ $f(x, z) \leq f(x, y) + f(y, z)$.
+ \end{itemize}
+
+ These distances also must have the interpretation that $f(x, y) <
+ f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This
+ is the opposite of the definition of similarity, and so some minor
+ manipulations are usually required to make similarity measures work
+ in metric-based indexes. \cite{intro-analysis}
+}
+and this will be done in the examples of indices for addressing
+this problem in this work, but it is not a fundamental aspect of the problem
+formulation. The nearest neighbor problem itself is decomposable,
+with a simple merge function that selects, of any two partial results,
+the one with the smaller value of $f(x, q)$~\cite{saxe79}.
+
+The k-nearest neighbor problem generalizes nearest-neighbor to return
+the $k$ nearest elements,
+\begin{definition}[k-Nearest Neighbor]
+
+ Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$
+ be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
+ between two points within $D$. The k-nearest neighbor problem,
+ $KNN(D, q, k)$ seeks to identify a set $R\subset D$ with $|R| = k$
+ such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$.
+
+\end{definition}
+
+This can be thought of as solving the nearest-neighbor problem $k$ times,
+each time removing the returned result from $D$ prior to solving the
+problem again. Unlike the single nearest-neighbor case (which can be
+thought of as k-NN with $k=1$), this problem is \emph{not} decomposable.
+
+\begin{theorem}
+ k-NN is not a decomposable search problem.
+\end{theorem}
+
+\begin{proof}
+To prove this, consider the query $KNN(D, q, k)$ against some partitioned
+dataset $D = D_0 \cup D_1 \ldots \cup D_\ell$. If k-NN is decomposable,
+then there must exist some constant-time, commutative, and associative
+binary operator $\mergeop$, such that $R = \bigmergeop_{0 \leq i \leq \ell}
+R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q,
+k)$. Consider the evaluation of the merge operator against two arbitrary
+result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
+|R_j| = k$, and that the contents of $R$ must be the $k$ records from
+$R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
+problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$
+time. Therefore, k-NN is not a decomposable search problem.
+\end{proof}
+
+With that said, it is clear that there isn't any fundamental restriction
+preventing the merging of the result sets; it is only the constant-time
+requirement on the merge operator that cannot be satisfied. It is possible
+to merge the result sets in non-constant time, and so it is the case
+that k-NN is $C(n)$-decomposable. Unfortunately, this classification
+brings with it a reduction in query performance as a result of the way
+result merges are performed.
+
+As a concrete example of these costs, consider using the Bentley-Saxe
+method to extend the VPTree~\cite{vptree}. The VPTree is a static,
+metric index capable of answering k-NN queries in $O(k
+\log n)$ time. One possible merge algorithm for k-NN would be to push all
+of the elements in the two arguments onto a min-heap, and then pop off
+the first $k$. In this case, the cost of the merge operation would be
+$C(k) = k \log k$. Were $k$ assumed to be constant, then the operation
+could be considered to be constant-time. But given that $k$ is only
+bounded in size above by $n$, this isn't a safe assumption to make in
+general. Evaluating the total query cost for the extended structure,
+this would yield,
+
+\begin{equation}
+    KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
+\end{equation}
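+
+The following Python sketch shows such a heap-based merge of two partial
+k-NN results; the one-dimensional points and distance function are
+chosen only to keep the example small,
+\begin{verbatim}
+# A minimal sketch of a C(n)-decomposable merge for k-NN: each block
+# returns its own k nearest neighbors, and the merge keeps the k closest
+# points of the union using a heap, costing O(k log k) per merge.
+import heapq
+
+def knn_merge(r1, r2, q, k, dist):
+    return heapq.nsmallest(k, r1 + r2, key=lambda p: dist(p, q))
+
+dist = lambda p, q: abs(p - q)
+partial_1 = [4, 7, 9]    # k = 3 nearest to q within block 1
+partial_2 = [5, 6, 12]   # k = 3 nearest to q within block 2
+print(knn_merge(partial_1, partial_2, q=5, k=3, dist=dist))  # [5, 4, 6]
+\end{verbatim}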
+
+The reason for this large increase in cost is the repeated application
+of the merge operator. The Bentley-Saxe method requires applying the
+merge operator in a binary fashion to each partial result, multiplying
+its cost by a factor of $\log n$. Thus, the constant-time requirement
+of standard decomposability is necessary to keep the cost of the merge
+operator from appearing within the complexity bound of the entire
+operation in the general case.\footnote {
+ There is a special case, noted by Overmars, where the total cost is
+ $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n))
+ \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the
+ case where the cost of the query and merge operation are sufficiently
+ large to consume the logarithmic factor, and so it doesn't represent
+ a special case with better performance.
+}
+If we could revise the result merging operation to remove this duplicated
+cost, we could greatly reduce the cost of supporting $C(n)$-decomposable
+queries.
+
+\subsubsection{Independent Range Sampling}
+\label{ssec:background-irs}
+
+Another problem that is not decomposable is independent sampling. There
+are a variety of problems falling under this umbrella, including weighted
+set sampling, simple random sampling, and weighted independent range
+sampling, but we will focus on independent range sampling here.
+
+\begin{definition}[Independent Range Sampling~\cite{tao22}]
+ Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
+ interval $q = [x, y]$ and an integer $k$, an independent range
+ sampling query returns $k$ independent samples from $D \cap q$
+ with each point having equal probability of being sampled.
+\end{definition}
+
+This problem immediately encounters a category error when considering
+whether it is decomposable: the result set is randomized, whereas
+the conditions for decomposability are defined in terms of an exact
+matching of records in result sets. To work around this, a slight abuse
+of definition is in order: assume that the equality conditions within
+the DSP definition can be interpreted to mean ``the contents in the two
+sets are drawn from the same distribution''. This enables the category
+of DSP to apply to this type of problem.
+
+Even with this abuse, however, IRS cannot generally be considered
+decomposable; it is at best $C(n)$-decomposable. The reason for this is
+that matching the distribution requires drawing the appropriate number
+of samples from each partition of the data. Even in the special
+case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples
+from each partition that must appear in the result set cannot be known
+in advance due to differences in the selectivity of the predicate across
+the partitions.
+
+\begin{example}[IRS Sampling Difficulties]
+
+    Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}, D_1 =
+    \{1, 1, 1, 1, 3\}, D_2 = \{4, 4, 4, 4, 4\}$ using bag semantics and
+    an IRS query over the interval $[3, 4]$ with $k=12$. Because all three
+    partitions have the same size, it seems sensible to evenly distribute
+    the samples across them ($4$ samples from each partition). Applying
+    the query predicate to the partitions results in the following,
+    $d_0 = \{3, 4\}, d_1 = \{3\}, d_2 = \{4, 4, 4, 4, 4\}$.
+
+    In expectation, then, the first result set will contain $R_0 = \{3,
+    3, 4, 4\}$ as it has a 50\% chance of sampling a $3$ and the same
+    probability of a $4$. The second and third result sets can only
+    be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging these
+    together, we'd find that the probability distribution of the sample
+    would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform
+    the same sampling operation over the full dataset (not partitioned),
+    the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
+
+\end{example}
+
+The problem is that the number of samples drawn from each partition needs to be
+weighted based on the number of elements satisfying the query predicate in that
+partition. In the above example, by drawing $4$ samples from $D_1$, more weight
+is given to $3$ than exists within the base dataset. This can be worked around
+by sampling a full $k$ records from each partition, returning both the sample
+and the number of records satisfying the predicate as that partition's query
+result, and then performing another pass of IRS as the merge operator, but this
+is the same approach as was used for k-NN above. This leaves IRS firmly in the
+$C(n)$-decomposable camp. If it were possible to pre-calculate the number of
+samples to draw from each partition, then a constant-time merge operation could
+be used.
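+
+The following Python sketch illustrates this $C(n)$-decomposable
+workaround on the example above: each partition reports $k$ local
+samples along with the number of records matching the predicate, and
+the merge re-samples from the partial results in proportion to those
+counts. To keep the sketch short, the merge re-samples from the $k$
+local samples rather than performing a true second sampling pass,
+\begin{verbatim}
+# A minimal sketch of a C(n)-decomposable merge for IRS: local results
+# are (samples, match_count) pairs, and the merge draws each output
+# sample from one side with probability proportional to its match count.
+import random
+
+def local_irs(partition, lo, hi, k):
+    matching = [x for x in partition if lo <= x <= hi]
+    samples = [random.choice(matching) for _ in range(k)] if matching else []
+    return (samples, len(matching))
+
+def merge(r1, r2, k):
+    (s1, c1), (s2, c2) = r1, r2
+    total = c1 + c2
+    if total == 0:
+        return ([], 0)
+    merged = [random.choice(s1) if random.random() < c1 / total
+              else random.choice(s2) for _ in range(k)]
+    return (merged, total)
+
+parts = [[1, 2, 3, 4, 5], [1, 1, 1, 1, 3], [4, 4, 4, 4, 4]]
+result = ([], 0)
+for p in parts:
+    result = merge(result, local_irs(p, 3, 4, k=12), k=12)
+print(result[0])  # roughly 25% threes and 75% fours, matching the full data
+\end{verbatim}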
+
+\subsection{Configurability}
+
+\subsection{Insertion Tail Latency}
+
+
+\section{Conclusion}
+This chapter discussed the necessary background information pertaining to
+queries and search problems, indexes, and techniques for dynamic extension. It
+described the potential for using custom indexes for accelerating particular
+kinds of queries, as well as the challenges associated with constructing these
+indexes. The remainder of this document will seek to address these challenges
+through modification and extension of the Bentley-Saxe method, describing work
+that has already been completed, as well as the additional work that must be
+done to realize this vision.