author     Douglas Rumbaugh <dbr4@psu.edu>  2025-05-14 18:18:05 -0400
committer  Douglas Rumbaugh <dbr4@psu.edu>  2025-05-14 18:18:05 -0400
commit     5d6e1d8bfeba9ab7970948b81ff13d7b963948a1 (patch)
tree       1d3eb0689be0a7fab9dfd4dafe2f3a5ee6fde821 /chapters
parent     40bff24fc2e2da57f382e4f49a5ffb7c826bbcfb (diff)
download   dissertation-5d6e1d8bfeba9ab7970948b81ff13d7b963948a1.tar.gz
updates
Diffstat (limited to 'chapters')
-rw-r--r--  chapters/background.tex    1027
-rw-r--r--  chapters/dynamization.tex  1027
2 files changed, 1027 insertions, 1027 deletions
diff --git a/chapters/background.tex b/chapters/background.tex
index 8ad92a8..ef30685 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -1,167 +1,4 @@
\chapter{Background}
-\label{chap:background}
-
-This chapter will introduce important background information and
-existing work in the area of data structure dynamization. We will
-first discuss the concept of a search problem, which is central to
-dynamization techniques. While one might imagine that the restrictions on
-dynamization would be functions of the data structure to be dynamized, in
-practice the requirements placed on the data structure are quite mild;
-it is the properties required of the search problem that the structure
-is used to answer which present the central difficulty in applying
-dynamization techniques in a given area. After this, database
-indices will be discussed briefly, as they are the primary use of data
-structures within the database context that is of interest to our work.
-Following this, existing theoretical results in the area of data structure
-dynamization will be discussed, which will serve as the building blocks
-for our techniques in subsequent chapters. The chapter will conclude with
-a discussion of some of the limitations of these existing techniques.
-
-\section{Queries and Search Problems}
-\label{sec:dsp}
-
-Data access lies at the core of most database systems. We want to ask
-questions of the data, and ideally get the answer efficiently. We
-will refer to the different types of question that can be asked as
-\emph{search problems}. We will be using this term in a similar way as
-the word \emph{query}\footnote{
-    The term query is often abused and used to
-    refer to several related, but slightly different things. In the
-    vernacular, a query can refer to either a) a general type of search
-    problem (as in ``range query''), b) a specific instance of a search
- problem, or c) a program written in a query language.
-}
-is often used within the database systems literature: to refer to a
-general class of questions. For example, we could consider range scans,
-point-lookups, nearest neighbor searches, predicate filtering, random
-sampling, etc., to each be a general search problem. Formally, for the
-purposes of this work, a search problem is defined as follows,
-
-\begin{definition}[Search Problem]
- Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function
- $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched,
- $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the
-answer domain.\footnote{
- It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an
-example, a \texttt{COUNT} aggregation might map a set of strings onto
- an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need
-not be a universal constraint.
-}
-\end{definition}
-
-We will use the term \emph{query} to mean a specific instance of a search
-problem,
-
-\begin{definition}[Query]
- Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and
- a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific
- instance of the search problem, $F(\mathcal{D}, q)$.
-\end{definition}
-
-As an example of using these definitions, a \emph{membership test}
-or \emph{range scan} would be considered search problems, and a range
-scan over the interval $[10, 99]$ would be a query. We've drawn this
-distinction because, as will become clear when we discuss our work in
-later chapters, it is useful to have separate, unambiguous
-terms for these two concepts.
-
-\subsection{Decomposable Search Problems}
-
-Dynamization techniques require the partitioning of one data structure
-into several, smaller ones. As a result, these techniques can only
-be applied in situations where the search problem in question can
-be answered from this set of smaller data structures, producing the same
-answer as would have been obtained had all of the data been used to
-construct a single, large structure. This requirement is formalized in
-the definition of a class of problems called \emph{decomposable search
-problems (DSP)}. This class was first defined by Bentley and Saxe in
-their work on dynamization, and we will adopt their definition,
-
-\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
- \label{def:dsp}
- A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and
- only if there exists a constant-time computable, associative, and
- commutative binary operator $\mergeop$ such that,
- \begin{equation*}
- F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
- \end{equation*}
- for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
-\end{definition}
-
-The requirement for $\mergeop$ to be constant-time was used by Bentley and
-Saxe to prove specific performance bounds for answering queries from a
-decomposed data structure. However, it is not strictly \emph{necessary},
-and later work by Overmars lifted this constraint and considered a more
-general class of search problems called \emph{$C(n)$-decomposable search
-problems},
-
-\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars-cn-decomp}]
- A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable
- if and only if there exists an $O(C(n))$-time computable, associative,
- and commutative binary operator $\mergeop$ such that,
- \begin{equation*}
- F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
- \end{equation*}
- for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
-\end{definition}
-
-To demonstrate that a search problem is decomposable, it is necessary to
-show the existence of the merge operator, $\mergeop$, with the necessary
-properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B,
-q)$. With these two results, induction demonstrates that the problem is
-decomposable even in cases with more than two partial results.
-
-As an example, consider range counts,
-\begin{definition}[Range Count]
- \label{def:range-count}
- Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
- $ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns
- the cardinality, $|d \cap q|$.
-\end{definition}
-
-\begin{theorem}
-\label{ther:decomp-range-count}
-Range Count is a decomposable search problem.
-\end{theorem}
-
-\begin{proof}
-Let $\mergeop$ be addition ($+$). Applying this to
-Definition~\ref{def:dsp} gives
-\begin{align*}
- |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
-\end{align*}
-which is true by the distributive property of union and
-intersection. Addition is an associative and commutative
-operator that can be calculated in $\Theta(1)$ time. Therefore, range counts
-are DSPs.
-\end{proof}
-
-Because the codomain of a DSP is not restricted, more complex output
-structures can be used to allow for problems that are not directly
-decomposable to be converted to DSPs, possibly with some minor
-post-processing. For example, calculating the arithmetic mean of a set
-of numbers can be formulated as a DSP,
-\begin{theorem}
-The calculation of the arithmetic mean of a set of numbers is a DSP.
-\end{theorem}
-\begin{proof}
- Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$,
- where $\mathcal{D}\subset\mathbb{R}$ and is a multi-set. The output tuple
-contains the sum of the values within the input set, and the
-cardinality of the input set. For two disjoint partitions of the data,
-$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let
-$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$.
-
-Applying Definition~\ref{def:dsp} gives
-\begin{align*}
-   A(D_1 \cup D_2) &= A(D_1)\mergeop A(D_2) \\
-   (s, c) &= (s_1 + s_2, c_1 + c_2)
-\end{align*}
-where $s$ and $c$ are the sum and cardinality of $D_1 \cup D_2$. Because
-the sum and cardinality of a disjoint union are the sums and cardinalities
-of its parts, $s = s_1 + s_2$ and $c = c_1 + c_2$, and the equality holds.
-From this result, the average can be determined in constant time by
-taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
-of numbers is a DSP.
-\end{proof}
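-
-To make this construction concrete, the following sketch (in Python,
-with partition contents chosen purely for illustration) computes the
-mean over partitioned data by merging $(s, c)$ pairs and performing the
-division as a final post-processing step.
-
-\begin{verbatim}
-# Sketch: arithmetic mean as a DSP using (sum, count) partial results.
-# The partition contents below are illustrative only.
-
-def local_result(partition):
-    # F applied to a single partition: its (sum, count) tuple.
-    return (sum(partition), len(partition))
-
-def merge(r1, r2):
-    # The constant-time merge operator.
-    return (r1[0] + r2[0], r1[1] + r2[1])
-
-partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
-s, c = (0.0, 0)
-for p in partitions:
-    s, c = merge((s, c), local_result(p))
-
-print(s / c)  # post-processing: 21.0 / 6 = 3.5
-\end{verbatim}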
-
\section{Database Indexes}
@@ -241,7 +78,6 @@ and the log-structured merge (LSM) tree~\cite{oneil96} is also often
used within the context of key-value stores~\cite{rocksdb}. Some databases
implement unordered indices using hash tables~\cite{mysql-btree-hash}.
-
\subsection{The Generalized Index}
The previous section discussed the classical definition of index
@@ -359,866 +195,3 @@ that has been designed specifically for answering particular types of queries
%done surrounding the use of arbitrary indexes in queries in the past,
%such as~\cite{byods-datalog}. This problem is considered out-of-scope
%for the proposed work, but will be considered in the future.
-
-\section{Classical Dynamization Techniques}
-
-Because data in a database is regularly updated, data structures
-intended to be used as an index must support updates (inserts, in-place
-modification, and deletes). Not all potentially useful data structures
-support updates, and so a general strategy for adding update support
-would increase the number of data structures that could be used as
-database indices. We refer to a data structure with update support as
-\emph{dynamic}, and one without update support as \emph{static}.\footnote{
- The term static is distinct from immutable. Static refers to the
- layout of records within the data structure, whereas immutable
- refers to the data stored within those records. This distinction
- will become relevant when we discuss different techniques for adding
- delete support to data structures. The data structures used are
- always static, but not necessarily immutable, because the records may
- contain header information (like visibility) that is updated in place.
-}
-
-This section discusses \emph{dynamization}, the construction of a
-dynamic data structure based on an existing static one. When certain
-conditions are satisfied by the data structure and its associated
-search problem, this process can be done automatically, and with
-provable asymptotic bounds on amortized insertion performance, as well
-as worst-case query performance. This is in contrast to the manual
-design of dynamic data structures, which involves techniques based on
-partially rebuilding small portions of a single data structure (called
-\emph{local reconstruction})~\cite{overmars83}. This is a high-cost
-intervention that requires significant effort on the part of the data
-structure designer, whereas conventional dynamization can be performed
-with little-to-no modification of the underlying data structure at all.
-
-It is worth noting that there are a variety of techniques
-discussed in the literature for dynamizing structures with specific
-properties, or under very specific sets of circumstances. Examples
-include frameworks for adding update support to succinct data
-structures~\cite{dynamize-succinct} or taking advantage of batching
-of insert and query operations~\cite{batched-decomposable}. This
-section discusses techniques that are more general, and don't require
-workload-specific assumptions.
-
-We will first discuss the necessary data structure requirements, and
-then examine several classical dynamization techniques. The section
-will conclude with a discussion of delete support within the context
-of these techniques. For more detail than is included in this chapter,
-Overmars wrote a book providing a comprehensive survey of techniques for
-creating dynamic data structures, including not only the dynamization
-techniques discussed here, but also local reconstruction based
-techniques and more~\cite{overmars83}.\footnote{
- Sadly, this book isn't readily available in
- digital format as of the time of writing.
-}
-
-
-\subsection{Global Reconstruction}
-
-The most fundamental dynamization technique is that of \emph{global
-reconstruction}. While not particularly useful on its own, global
-reconstruction serves as the basis for the techniques to follow, and so
-we will begin our discussion of dynamization with it.
-
-Consider a class of data structure, $\mathcal{I}$, capable of answering a
-search problem, $F$. Insertion via global reconstruction is
-possible if $\mathcal{I}$ supports the following two operations,
-\begin{align*}
-\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
-\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
-\end{align*}
-where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$
-of the data structure over a set of records $d \subseteq \mathcal{D}$
-in $B(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d
-\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in
-$\Theta(1)$ time,\footnote{
- There isn't any practical reason why $\mathtt{unbuild}$ must run
- in constant time, but this is the assumption made in \cite{saxe79}
- and in subsequent work based on it, and so we will follow the same
- definition here.
-} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.
-
-Given this structure, an insert of record $r \in \mathcal{D}$ into a
-data structure $\mathscr{I} \in \mathcal{I}$ can be defined by,
-\begin{align*}
-\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) \cup \{r\})
-\end{align*}
-
-It goes without saying that this operation is sub-optimal, as the
-insertion cost is $\Theta(B(n))$, and $B(n) \in \Omega(n)$ at best for
-most data structures. However, this global reconstruction strategy can
-be used as a primitive for more sophisticated techniques that can provide
-reasonable performance.
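-
-As a minimal illustration of this primitive, the following Python sketch
-wraps a hypothetical static structure exposing only \texttt{build} and
-\texttt{unbuild} (here, a simple sorted array standing in for any static
-structure) in an insert-by-global-reconstruction routine.
-
-\begin{verbatim}
-# Sketch: insertion via global reconstruction over a static structure.
-# SortedArray is a hypothetical stand-in for any static structure
-# providing build() and unbuild().
-
-class SortedArray:
-    def __init__(self, records):
-        self.records = sorted(records)   # the B(n) build step
-
-    def unbuild(self):
-        return list(self.records)        # recover the record set
-
-def insert(structure, record):
-    # Rebuild the entire structure with the new record included.
-    return SortedArray(structure.unbuild() + [record])
-
-s = SortedArray([5, 2, 9])
-s = insert(s, 7)
-print(s.records)                         # [2, 5, 7, 9]
-\end{verbatim}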
-
-\subsection{Amortized Global Reconstruction}
-\label{ssec:agr}
-
-The problem with global reconstruction is that each insert must rebuild
-the entire data structure, involving all of its records. This results
-in a worst-case insert cost of $\Theta(B(n))$. However, opportunities
-for improving this scheme can present themselves when considering the
-\emph{amortized} insertion cost.
-
-Consider the cost accrued by the dynamized structure under global
-reconstruction over the lifetime of the structure. Each insert will result
-in all of the existing records being rewritten, so at worst each record
-will be involved in $\Theta(n)$ reconstructions, each reconstruction
-having $\Theta(B(n))$ cost. We can amortize this cost over the $n$ records
-inserted to get an amortized insertion cost for global reconstruction of,
-
-\begin{equation*}
-I_a(n) = \frac{B(n) \cdot n}{n} = B(n)
-\end{equation*}
-
-This doesn't improve things as is; however, it does present two
-opportunities for improvement. If we could either reduce the size of
-the reconstructions, or the number of times a record is reconstructed,
-then we could reduce the amortized insertion cost.
-
-The key insight, first discussed by Bentley and Saxe, is that
-both of these goals can be accomplished by \emph{decomposing} the
-data structure into multiple, smaller structures, each built from
-a disjoint partition of the data. As long as the search problem
-being considered is decomposable, queries can be answered from
-this structure with bounded worst-case overhead, and the amortized
-insertion cost can be improved~\cite{saxe79}. Significant theoretical
-work exists in evaluating different strategies for decomposing the
-data structure~\cite{saxe79, overmars81, overmars83} and for leveraging
-specific efficiencies of the data structures being considered to improve
-these reconstructions~\cite{merge-dsp}.
-
-There are two general decomposition techniques that emerged from
-this work. The earliest of these is the logarithmic method, often
-called the Bentley-Saxe method in modern literature, and is the most
-commonly discussed technique today. The Bentley-Saxe method has been
-directly applied in a few instances in the literature, such as to
-metric indexing structures~\cite{naidan14} and spatial structures~\cite{bkdtree},
-and has also been used in a modified form for genetic sequence search
-structures~\cite{almodaresi23} and graphs~\cite{lsmgraph}, to cite a few
-examples.
-
-A later technique, the equal block method, was also developed. It is
-generally not as effective as the Bentley-Saxe method, and as a result we
-have not identified any specific applications of this technique outside
-of the theoretical literature. However, we will discuss it as well in
-the interest of completeness, and because it does lend itself well to
-demonstrating certain properties of decomposition-based dynamization
-techniques.
-
-\subsection{Equal Block Method}
-\label{ssec:ebm}
-
-Though chronologically later, the equal block method is theoretically a
-bit simpler, and so we will begin our discussion of decomposition-based
-technique for dynamization of decomposable search problems with it. There
-have been several proposed variations of this concept~\cite{maurer79,
-maurer80}, but we will focus on the most developed form as described by
-Overmars and van Leeuwen~\cite{overmars-art-of-dyn, overmars83}. The core
-concept of the equal block method is to decompose the data structure
-into several smaller data structures, called blocks, over partitions
-of the data. This decomposition is performed such that each block is of
-roughly equal size.
-
-Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves
-some decomposable search problem, $F$, and is built over a set of records
-$d \subseteq \mathcal{D}$. This structure can be decomposed into $s$ blocks,
-$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$ each built over
-partitions of $d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a specific value
-makes little sense when the number of records changes, and so it is taken
-to be governed by a smooth, monotonically increasing function $f(n)$ such
-that, at any point, the following two constraints are obeyed.
-\begin{align}
- f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\
- \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2}
-\end{align}
-where $|\mathscr{I}_j|$ is the number of records in the block,
-$|\text{unbuild}(\mathscr{I}_j)|$.
-
-A new record is inserted by finding the smallest block and rebuilding it
-using the new record. If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$,
-then an insert is done by,
-\begin{equation*}
-\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\})
-\end{equation*}
-Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
- Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
- violated by deletes. We're omitting deletes from the discussion at
- this point, but will circle back to them in Section~\ref{sec:deletes}.
-} In this case, the constraints are enforced by ``re-configuring'' the
-structure. $s$ is updated to be exactly $f(n)$, all of the existing
-blocks are unbuilt, and then the records are redistributed evenly into
-$s$ blocks.
-
-A query with parameters $q$ is answered by this structure by individually
-querying the blocks, and merging the local results together with $\mergeop$,
-\begin{equation*}
-F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q)
-\end{equation*}
-where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to
-answering the query over $d$ using the data structure $\mathscr{I}$.
-
-This technique provides better amortized performance bounds than global
-reconstruction, at the possible cost of worse query performance for
-sub-linear queries. We'll omit the details of the proof of performance
-for brevity and streamline some of the original notation (full details
-can be found in~\cite{overmars83}), but this technique ultimately
-results in a data structure with the following performance characteristics,
-\begin{align*}
-\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n} + B\left(\frac{n}{f(n)}\right)\right) \\
-\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \\
-\end{align*}
-where $B(n)$ is the cost of statically building $\mathcal{I}$, and
-$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$.
-
-%TODO: example?
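-
-As an illustrative example, the Python sketch below implements the equal
-block method over a stand-in static structure (a sorted list), using the
-hypothetical choice $f(n) = \lceil \sqrt{n} \rceil$; both the block
-implementation and the choice of $f$ are assumptions made for the sake
-of the sketch rather than requirements of the method.
-
-\begin{verbatim}
-# Sketch: the equal block method with f(n) = ceil(sqrt(n)).
-import math
-
-def build(records):
-    return sorted(records)        # stand-in for the static build
-
-def unbuild(block):
-    return list(block)
-
-def f(n):
-    return max(1, math.ceil(math.sqrt(n)))
-
-class EqualBlocks:
-    def __init__(self):
-        self.blocks = [build([])]
-
-    def insert(self, r):
-        # Rebuild the smallest block with the new record included.
-        k = min(range(len(self.blocks)), key=lambda i: len(self.blocks[i]))
-        self.blocks[k] = build(unbuild(self.blocks[k]) + [r])
-        n, s = sum(len(b) for b in self.blocks), len(self.blocks)
-        # Re-configure if s drifts outside [f(n/2), f(2n)].
-        if not (f(n // 2) <= s <= f(2 * n)):
-            records = [x for b in self.blocks for x in unbuild(b)]
-            s = f(n)
-            self.blocks = [build(records[i::s]) for i in range(s)]
-
-    def range_count(self, lo, hi):
-        # Query each block and merge the partial results with addition.
-        return sum(sum(lo <= x <= hi for x in b) for b in self.blocks)
-\end{verbatim}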
-
-
-\subsection{The Bentley-Saxe Method~\cite{saxe79}}
-\label{ssec:bsm}
-
-%FIXME: switch this section (and maybe the previous?) over to being
-% indexed at 0 instead of 1
-
-The original, and most frequently used, dynamization technique is the
-Bentley-Saxe Method (BSM), also called the logarithmic method in older
-literature. Rather than breaking the data structure into equally sized
-blocks, BSM decomposes the structure into logarithmically many blocks
-of exponentially increasing size. More specifically, the data structure
-is decomposed into $h = \lceil \log_2 (n+1) \rceil$ blocks, $\mathscr{I}_0,
-\mathscr{I}_1, \ldots, \mathscr{I}_{h-1}$. A given block $\mathscr{I}_i$
-will be either empty, or contain exactly $2^i$ records within it.
-
-The procedure for inserting a record, $r \in \mathcal{D}$, into
-a BSM dynamization is as follows. If the block $\mathscr{I}_0$
-is empty, then $\mathscr{I}_0 = \text{build}(\{r\})$. If it is not
-empty, then there will exist a maximal sequence of non-empty blocks
-$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq
-0$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case,
-$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i
-\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through
-$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the
-end of the structure as needed.
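-
-A minimal Python sketch of this insertion procedure is given below,
-again using a sorted list as a hypothetical stand-in for the static
-structure; the carry-based loop mirrors the binary-counter behavior
-described below.
-
-\begin{verbatim}
-# Sketch: Bentley-Saxe insertion. blocks[i] is either None (empty) or a
-# static block containing exactly 2^i records.
-
-def build(records):
-    return sorted(records)
-
-def unbuild(block):
-    return list(block)
-
-class BentleySaxe:
-    def __init__(self):
-        self.blocks = []
-
-    def insert(self, r):
-        carry = [r]
-        for i in range(len(self.blocks)):
-            if self.blocks[i] is None:
-                self.blocks[i] = build(carry)  # first empty level absorbs the carry
-                return
-            carry += unbuild(self.blocks[i])   # accumulate the run of full levels
-            self.blocks[i] = None
-        self.blocks.append(build(carry))       # all levels full: add a new one
-\end{verbatim}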
-
-%FIXME: switch the x's to r's for consistency
-\begin{figure}
-\centering
-\includegraphics[width=.8\textwidth]{diag/bsm.pdf}
-\caption{An illustration of inserts into the Bentley-Saxe Method}
-\label{fig:bsm-example}
-\end{figure}
-
-Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The
-dynamization is built over a set of records $x_1, x_2, \ldots,
-x_{10}$ initially, with eight records in $\mathscr{I}_3$ and two in
-$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly
-into $\mathscr{I}_0$. For the next insert following this, $x_{12}$, the
-first empty block is $\mathscr{I}_2$, and so the insert is performed by
-doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup
-\text{unbuild}(\mathscr{I}_0) \cup \text{unbuild}(\mathscr{I}_1)\right)$
-and then emptying $\mathscr{I}_0$ and $\mathscr{I}_1$.
-
-This technique is called a \emph{binary decomposition} of the data
-structure. Considering a BSM dynamization of a structure containing $n$
-records, labeling each block with a $0$ if it is empty and a $1$ if it
-is full will result in the binary representation of $n$. For example,
-the final state of the structure in Figure~\ref{fig:bsm-example} contains
-$12$ records, and the labeling procedure will result in $0\text{b}1100$,
-which is $12$ in binary. Inserts affect this representation of the
-structure in the same way that incrementing the binary number by $1$ does.
-
-By applying BSM to a data structure, a dynamized structure can be created
-with the following performance characteristics,
-\begin{align*}
-\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n}\cdot \log_2 n\right) \\
-\text{Worst Case Insertion Cost:}&\quad \Theta\left(B(n)\right) \\
-\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right) \\
-\end{align*}
-This is a particularly attractive result because, for example, a data
-structure having $B(n) \in \Theta(n)$ will have an amortized insertion
-cost of $\Theta(\log_2 n)$, which is quite reasonable. The trade-off for this
-is an extra logarithmic multiple attached to the query complexity. It is
-also worth noting that the worst-case insertion cost remains the same
-as global reconstruction, but this case arises only very rarely. If
-you consider the binary decomposition representation, the worst-case
-behavior is triggered each time the existing number overflows, and a
-new digit must be added.
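-
-As a rough sketch of where the amortized bound comes from (included here
-for intuition, and assuming $B(n)/n$ is non-decreasing): over a sequence
-of $n$ inserts, the block at level $i$ is rebuilt roughly $n/2^i$ times
-at a cost of $B(2^i)$ per rebuild, giving
-\begin{equation*}
-\sum_{i=0}^{\lceil \log_2 n \rceil} \frac{n}{2^i} B(2^i)
-    = n \sum_{i=0}^{\lceil \log_2 n \rceil} \frac{B(2^i)}{2^i}
-    \leq n \left(\lceil \log_2 n \rceil + 1\right) \frac{B(n)}{n}
-\end{equation*}
-total work, or $O\left(\frac{B(n)}{n} \log_2 n\right)$ amortized cost
-per insert.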
-
-As a final note about the query performance of this structure, because
-the overhead due to querying the blocks is logarithmic, under certain
-circumstances this cost can be absorbed, resulting in no effect on the
-asymptotic worst-case query performance. As an example, consider a linear
-scan of the data running in $\Theta(n)$ time. In this case, every record
-must be considered, and so there isn't any performance penalty\footnote{
- From an asymptotic perspective. There will still be measurable performance
- effects from caching, etc., even in this case.
-} to breaking the records out into multiple chunks and scanning them
-individually. More formally, for any query running in $\mathscr{Q}(n) \in
-\Omega\left(n^\epsilon\right)$ time where $\epsilon > 0$, the worst-case
-cost of answering a decomposable search problem from a BSM dynamization
-is $\Theta\left(\mathscr{Q}(n)\right)$.~\cite{saxe79}
-
-\subsection{Merge Decomposable Search Problems}
-
-\subsection{Delete Support}
-
-Classical dynamization techniques have also been developed with
-support for deleting records. In general, the same technique of global
-reconstruction that was used for inserting records can also be used to
-delete them. Given a record $r \in \mathcal{D}$ and a data structure
-$\mathscr{I} \in \mathcal{I}$ such that $r \in \mathscr{I}$, $r$ can be
-deleted from the structure in $\Theta(B(n))$ time as follows,
-\begin{equation*}
-\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) - \{r\})
-\end{equation*}
-However, supporting deletes within the dynamization schemes discussed
-above is more complicated. The core problem is that inserts affect the
-dynamized structure in a deterministic way, and as a result certain
-partitioning schemes can be leveraged to reason about the
-performance. But deletes do not work like this.
-
-\begin{figure}
-\caption{A Bentley-Saxe dynamization for the integers on the
-interval $[1, 100]$.}
-\label{fig:bsm-delete-example}
-\end{figure}
-
-For example, consider a Bentley-Saxe dynamization that contains all
-integers on the interval $[1, 100]$, inserted in that order, shown in
-Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the
-records from this structure, one at a time, using global reconstruction.
-This presents several problems,
-\begin{itemize}
- \item For each record, we need to identify which block it is in before
- we can delete it.
- \item The cost of performing a delete is a function of which block the
- record is in, which is a question of distribution and not easily
- controlled.
- \item As records are deleted, the structure will potentially violate
- the invariants of the decomposition scheme used, which will
- require additional work to fix.
-\end{itemize}
-
-To resolve these difficulties, two very different approaches have been
-proposed for supporting deletes, each of which rely on certain properties
-of the search problem and data structure. These are the use of a ghost
-structure and weak deletes.
-
-\subsubsection{Ghost Structure for Invertible Search Problems}
-
-The first proposed mechanism for supporting deletes was discussed
-alongside the Bentley-Saxe method in Bentley and Saxe's original
-paper. This technique applies to a class of search problems called
-\emph{invertible} (also called \emph{decomposable counting problems}
-in later literature~\cite{overmars83}). Invertible search problems
-are decomposable, and also support an ``inverse'' merge operator, $\Delta$,
-that is able to remove records from the result set. More formally,
-\begin{definition}[Invertible Search Problem~\cite{saxe79}]
-\label{def:invert}
-A decomposable search problem, $F$ is invertible if and only if there
-exists a constant time computable operator, $\Delta$, such that
-\begin{equation*}
-F(A / B, q) = F(A, q)~\Delta~F(B, q)
-\end{equation*}
-for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $B \subseteq A$.
-\end{definition}
-
-Given a search problem with this property, it is possible to perform
-deletes by creating a secondary ``ghost'' structure. When a record
-is to be deleted, it is inserted into this structure. Then, when the
-dynamization is queried, this ghost structure is queried as well as the
-main one. The results from the ghost structure can be removed from the
-result set using the inverse merge operator. This simulates the result
-that would have been obtained had the records been physically removed
-from the main structure.
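-
-The sketch below illustrates this mechanism in Python for the range
-count problem, where $\Delta$ is subtraction; a single main block is
-used for simplicity, though in a full dynamization the same logic
-applies across all blocks.
-
-\begin{verbatim}
-# Sketch: deletes via a ghost structure for range count.
-
-def build(records):
-    return sorted(records)
-
-class GhostDynamization:
-    def __init__(self, records):
-        self.main = build(records)
-        self.ghost = build([])
-
-    def delete(self, r):
-        # A delete is an insert into the ghost structure.
-        self.ghost = build(self.ghost + [r])
-
-    def range_count(self, lo, hi):
-        count = lambda block: sum(lo <= x <= hi for x in block)
-        # Delta (subtraction) removes the deleted records' contribution.
-        return count(self.main) - count(self.ghost)
-
-d = GhostDynamization([1, 3, 5, 7, 9])
-d.delete(5)
-print(d.range_count(3, 7))   # 2: only 3 and 7 remain in [3, 7]
-\end{verbatim}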
-
-Two examples of invertible search problems are set membership
-and range count. Range count was formally defined in
-Definition~\ref{def:range-count}.
-
-\begin{theorem}
-Range count is an invertible search problem.
-\end{theorem}
-
-\begin{proof}
-To prove that range count is an invertible search problem, it must be
-decomposable and have a $\Delta$ operator. That it is a DSP has already
-been proven in Theorem~\ref{ther:decomp-range-count}.
-
-Let $\Delta$ be subtraction $(-)$. Applying this to Definition~\ref{def:invert}
-gives,
-\begin{equation*}
-|(A / B) \cap q | = |(A \cap q) / (B \cap q)| = |(A \cap q)| - |(B \cap q)|
-\end{equation*}
-which is true because intersection distributes over set difference and,
-since $B \subseteq A$, $B \cap q \subseteq A \cap q$. Subtraction is
-computable in constant time, therefore
-range count is an invertible search problem using subtraction as $\Delta$.
-\end{proof}
-
-The set membership search problem is defined as follows,
-\begin{definition}[Set Membership]
-\label{def:set-membership}
-Consider a set of elements $d \subseteq \mathcal{D}$ from some domain,
-and a single element $r \in \mathcal{D}$. A test of set membership is a
-search problem of the form $F: (\mathcal{PS}(\mathcal{D}), \mathcal{D})
-\to \mathbb{B}$ such that $F(d, r) = r \in d$, which maps to $0$ if $r
-\not\in d$ and $1$ if $r \in d$.
-\end{definition}
-
-\begin{theorem}
-Set membership is an invertible search problem.
-\end{theorem}
-
-\begin{proof}
-To prove that set membership is invertible, it is necessary to establish
-that it is a decomposable search problem, and that a $\Delta$ operator
-exists. We'll begin with the former.
-\begin{lemma}
- \label{lem:set-memb-dsp}
- Set membership is a decomposable search problem.
-\end{lemma}
-\begin{proof}
-Let $\mergeop$ be the logical disjunction ($\lor$). This yields,
-\begin{align*}
-F(A \cup B, r) &= F(A, r) \lor F(B, r) \\
-r \in (A \cup B) &= (r \in A) \lor (r \in B)
-\end{align*}
-which is true, following directly from the definition of union. The
-logical disjunction is an associative, commutative operator that can
-be calculated in $\Theta(1)$ time. Therefore, set membership is a
-decomposable search problem.
-\end{proof}
-
-For the inverse merge operator, $\Delta$, it is necessary that $F(A,
-r) ~\Delta~F(B, r)$ be true \emph{only} if $r \in A$ and $r \not\in
-B$. Thus, it could be directly implemented as $F(A, r)~\Delta~F(B, r) =
-F(A, r) \land \neg F(B, r)$, which is constant time if
-the operands are already known.
-
-Thus, we have shown that set membership is a decomposable search problem,
-and that a constant time $\Delta$ operator exists. Therefore, it is an
-invertible search problem.
-\end{proof}
-
-For search problems such as these, this technique allows for deletes to be
-supported with the same cost as an insert. Unfortunately, it suffers from
-write amplification because each deleted record is recorded twice--once in
-the main structure, and once in the ghost structure. This means that $n$
-is, in effect, the total number of records and deletes. This can lead
-to serious problems; for example, if every record in a structure
-of $n$ records is deleted, the net result will be an ``empty'' dynamized
-data structure containing $2n$ physical records within it. To circumvent
-this problem, Bentley and Saxe proposed a mechanism of setting a maximum
-threshold for the size of the ghost structure relative to the main one,
-and performing a complete re-partitioning of the data once this threshold
-is reached, removing all deleted records from the main structure,
-emptying the ghost structure, and rebuilding blocks with the records
-that remain according to the invariants of the technique.
-
-\subsubsection{Weak Deletes for Deletion Decomposable Search Problems}
-
-Another approach for supporting deletes was proposed later, by Overmars
-and van Leeuwen, for a class of search problem called \emph{deletion
-decomposable}. These are decomposable search problems for which the
-underlying data structure supports a delete operation. More formally,
-
-\begin{definition}[Deletion Decomposable Search Problem~\cite{merge-dsp}]
- A decomposable search problem, $F$, and its data structure,
- $\mathcal{I}$, is deletion decomposable if and only if, for some
- instance $\mathscr{I} \in \mathcal{I}$, containing $n$ records,
- there exists a deletion routine $\mathtt{delete}(\mathscr{I},
- r)$ that removes some $r \in \mathcal{D}$ in time $D(n)$ without
- increasing the query time, deletion time, or storage requirement,
- for $\mathscr{I}$.
-\end{definition}
-
-Superficially, this doesn't appear very useful. If the underlying data
-structure already supports deletes, there isn't much reason to use a
-dynamization technique to add deletes to it. However, one point worth
-mentioning is that it is possible, in many cases, to easily \emph{add}
-delete support to a static structure. If it is possible to locate a
-record and somehow mark it as deleted, without removing it from the
-structure, and then efficiently ignore these records while querying,
-then the given structure and its search problem can be said to be
-deletion decomposable. This technique for deleting records is called
-\emph{weak deletes}.
-
-\begin{definition}[Weak Deletes~\cite{overmars81}]
-\label{def:weak-delete}
-A data structure is said to support weak deletes if it provides a
-routine, \texttt{delete}, that guarantees that after $\alpha \cdot n$
-deletions, where $\alpha < 1$, the query cost is bounded by $k_\alpha
-\mathscr{Q}(n)$ for some constant $k_\alpha$ dependent only upon $\alpha$,
-where $\mathscr{Q}(n)$ is the cost of answering the query against a
-structure upon which no weak deletes were performed.\footnote{
- This paper also provides a similar definition for weak updates,
- but these aren't of interest to us in this work, and so the above
- definition was adapted from the original with the weak update
- constraints removed.
-} The results of the query of a block containing weakly deleted records
-should be the same as the results would be against a block with those
-records removed.
-\end{definition}
-
-As an example of a deletion decomposable search problem, consider the set
-membership problem defined above (Definition~\ref{def:set-membership})
-where $\mathcal{I}$, the data structure used to answer queries of the
-search problem, is a hash map.\footnote{
- While most hash maps are already dynamic, and so wouldn't need
- dynamization to be applied, there do exist static ones too. For example,
- the hash map being considered could be implemented using perfect
- hashing~\cite{perfect-hashing}, which has many static implementations.
-}
-
-\begin{theorem}
- The set membership problem, answered using a static hash map, is
- deletion decomposable.
-\end{theorem}
-
-\begin{proof}
-We've already shown in Lemma~\ref{lem:set-memb-dsp} that set membership
-is a decomposable search problem. For it to be deletion decomposable,
-we must demonstrate that the hash map, $\mathcal{I}$, supports deleting
-records without hurting its query performance, delete performance, or
-storage requirements. Assume that an instance $\mathscr{I} \in
-\mathcal{I}$ having $|\mathscr{I}| = n$ can answer queries in
-$\mathscr{Q}(n) \in \Theta(1)$ time and requires $\Omega(n)$ storage.
-
-Such a structure can support weak deletes. Each record within the
-structure has a single bit attached to it, indicating whether it has
-been deleted or not. These bits will require $\Theta(n)$ storage and
-be initialized to 0 when the structure is constructed. A delete can
-be performed by querying the structure for the record to be deleted in
-$\Theta(1)$ time, and setting the bit to 1 if the record is found. This
-operation has $D(n) \in \Theta(1)$ cost.
-
-\begin{lemma}
-\label{lem:weak-deletes}
-The delete procedure as described above satisfies the requirements of
-Definition~\ref{def:weak-delete} for weak deletes.
-\end{lemma}
-\begin{proof}
-Per Definition~\ref{def:weak-delete}, there must exist some constant
-dependent only on $\alpha$, $k_\alpha$, such that after $\alpha \cdot
-n$ deletes against $\mathscr{I}$ with $\alpha < 1$, the query cost is
-bounded by $k_\alpha \mathscr{Q}(n)$.
-
-In this case, $\mathscr{Q}(n) \in \Theta(1)$, and therefore our final
-query cost must be bounded by $\Theta(k_\alpha)$. When a query is
-executed against $\mathscr{I}$, there are three possible cases,
-\begin{enumerate}
-\item The record being searched for does not exist in $\mathscr{I}$. In
-this case, the query result is 0.
-\item The record being searched for does exist in $\mathscr{I}$ and has
-a delete bit value of 0. In this case, the query result is 1.
-\item The record being searched for does exist in $\mathscr{I}$ and has
-a delete bit value of 1 (i.e., it has been deleted). In this case, the
-query result is 0.
-\end{enumerate}
-In all three cases, the addition of deletes requires only $\Theta(1)$
-extra work at most. Therefore, set membership over a static hash map
-using our proposed deletion mechanism satisfies the requirements for
-weak deletes, with $k_\alpha = 1$.
-\end{proof}
-
-Finally, we note that the cost of one of these weak deletes is $D(n)
-= \mathscr{Q}(n)$. By Lemma~\ref{lem:weak-deletes}, the delete cost is
-not asymptotically harmed by deleting records.
-
-Thus, we've shown that set membership using a static hash map is a
-decomposable search problem, the storage cost remains $\Omega(n)$ and the
-query and delete costs are unaffected by the presence of deletes using the
-proposed mechanism. All of the requirements of deletion decomposability
-are satisfied, therefore set membership using a static hash map is a
-deletion decomposable search problem.
-\end{proof}
-
-For such problems, deletes can be supported by first identifying the
-block in the dynamization containing the record to be deleted, and
-then calling $\mathtt{delete}$ on it. In order to allow this block to
-be easily located, it is possible to maintain a hash table over all
-of the records, alongside the dynamization, which maps each record
-onto the block containing it. This table must be kept up to date as
-reconstructions occur, but this can be done at no extra asymptotic cost
-for any data structures having $B(n) \in \Omega(n)$, as it requires only
-linear time. This allows for deletes to be performed in $\mathscr{D}(n)
-\in \Theta(D(n))$ time.
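-
-A simplified Python sketch of this arrangement is shown below; the
-per-record flag and linear scans stand in for a real deletion
-decomposable structure and are assumptions made purely for illustration.
-
-\begin{verbatim}
-# Sketch: weak deletes located via an auxiliary record-to-block table.
-# Blocks are lists of [record, deleted] pairs; a real deletion
-# decomposable structure would supply its own delete() routine.
-
-class WeakDeleteDynamization:
-    def __init__(self, blocks):
-        self.blocks = [[[r, False] for r in b] for b in blocks]
-        self.locator = {}                     # record -> block index
-        for i, b in enumerate(self.blocks):
-            for entry in b:
-                self.locator[entry[0]] = i
-
-    def delete(self, r):
-        i = self.locator.pop(r, None)
-        if i is None:
-            return False
-        for entry in self.blocks[i]:
-            if entry[0] == r:
-                entry[1] = True               # weak delete: mark, don't remove
-                return True
-        return False
-
-    def contains(self, r):
-        # Set membership, ignoring weakly deleted records.
-        return any(e[0] == r and not e[1]
-                   for b in self.blocks for e in b)
-\end{verbatim}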
-
-The presence of deleted records within the structure does introduce a
-new problem, however. Over time, the number of records in each block will
-drift away from the requirements imposed by the dynamization technique. It
-will eventually become necessary to re-partition the records to restore
-these invariants, which are necessary for bounding the number of blocks,
-and thereby the query performance. The particular invariant maintenance
-rules depend upon the decomposition scheme used.
-
-\Paragraph{Bentley-Saxe Method.} When creating a BSM dynamization for
-a deletion decomposable search problem, the $i$th block where $i \geq 2$\footnote{
- Block $i=0$ will only ever have one record, so no special maintenance must be
- done for it. A delete will simply empty it completely.
-},
-in the absence of deletes, will contain $2^{i-1} + 1$ records. When a
-delete occurs in block $i$, no special action is taken until the number
-of records in that block falls below $2^{i-2}$. Once this threshold is
-reached, a reconstruction can be performed to restore the appropriate
-record counts in each block.~\cite{merge-dsp}
-
-\Paragraph{Equal Block Method.} For the equal block method, there are
-two cases in which a delete may cause a block to fail to obey the method's
-size invariants,
-\begin{enumerate}
- \item If enough records are deleted, it is possible for the number
- of blocks to exceed $f(2n)$, violating Invariant~\ref{ebm-c1}.
- \item The deletion of records may cause the maximum size of each
- block to shrink, causing some blocks to exceed the maximum capacity
- of $\nicefrac{2n}{s}$. This is a violation of Invariant~\ref{ebm-c2}.
-\end{enumerate}
-In both cases, it should be noted that $n$ is decreased as records are
-deleted. Should either of these cases emerge as a result of a delete,
-the entire structure must be reconfigured to ensure that its invariants
-are maintained. This reconfiguration follows the same procedure as when
-an insert results in a violation: $s$ is updated to be exactly $f(n)$, all
-existing blocks are unbuilt, and then the records are evenly redistributed
-into the $s$ blocks.~\cite{overmars-art-of-dyn}
-
-
-\subsection{Worst-Case Optimal Techniques}
-
-
-\section{Limitations of Classical Dynamization Techniques}
-\label{sec:bsm-limits}
-
-While fairly general, these dynamization techniques have a number of
-limitations that prevent them from being directly usable as a general
-solution to the problem of creating database indices. Because of the
-requirement that the query being answered be decomposable, many search
-problems cannot be addressed--or at least efficiently addressed--by
-decomposition-based dynamization. The techniques also do nothing to reduce
-the worst-case insertion cost, resulting in extremely poor tail latency
-performance relative to hand-built dynamic structures. Finally, these
-approaches do not do a good job of exposing the underlying configuration
-space to the user, meaning that the user can exert only limited control over the
-performance of the dynamized data structure. This section will discuss
-these limitations, and the rest of the document will be dedicated to
-proposing solutions to them.
-
-\subsection{Limits of Decomposability}
-\label{ssec:decomp-limits}
-Unfortunately, the DSP abstraction used as the basis of classical
-dynamization techniques has a few significant limitations that restrict
-their applicability,
-
-\begin{itemize}
- \item The query must be broadcast identically to each block and cannot
- be adjusted based on the state of the other blocks.
-
- \item The query process is done in one pass--it cannot be repeated.
-
- \item The result merge operation must be $O(1)$ to maintain good query
- performance.
-
- \item The result merge operation must be commutative and associative,
- and is called repeatedly to merge pairs of results.
-\end{itemize}
-
-These requirements restrict the types of queries that can be supported by
-the method efficiently. For example, k-nearest neighbor and independent
-range sampling are not decomposable.
-
-\subsubsection{k-Nearest Neighbor}
-\label{sssec-decomp-limits-knn}
-The k-nearest neighbor (k-NN) problem is a generalization of the nearest
-neighbor problem, which seeks to return the closest point within the
-dataset to a given query point. More formally, this can be defined as,
-\begin{definition}[Nearest Neighbor]
-
- Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$
- be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
- between two points within $D$. The nearest neighbor problem, $NN(D,
-    q)$, returns some $p \in D$ such that $f(p, q) = \min_{x \in D} f(x, q)$
-    for some query point, $q \in \mathbb{R}^d$.
-
-\end{definition}
-
-In practice, it is common to require $f(x, y)$ be a metric,\footnote
-{
- Contrary to its vernacular usage as a synonym for ``distance'', a
- metric is more formally defined as a valid distance function over
- a metric space. Metric spaces require their distance functions to
- have the following properties,
- \begin{itemize}
- \item The distance between a point and itself is always 0.
- \item All distances between non-equal points must be positive.
- \item For all points, $x, y \in D$, it is true that
- $f(x, y) = f(y, x)$.
- \item For any three points $x, y, z \in D$ it is true that
- $f(x, z) \leq f(x, y) + f(y, z)$.
- \end{itemize}
-
- These distances also must have the interpretation that $f(x, y) <
- f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This
- is the opposite of the definition of similarity, and so some minor
- manipulations are usually required to make similarity measures work
- in metric-based indexes. \cite{intro-analysis}
-}
-and we will do so in the examples of indices for this problem presented
-in this work, but it is not a fundamental aspect of the problem
-formulation. The nearest neighbor problem itself is decomposable,
-with a simple merge function that accepts the result with the smallest
-value of $f(x, q)$ for any two inputs~\cite{saxe79}.
-
-The k-nearest neighbor problem generalizes nearest-neighbor to return
-the $k$ nearest elements,
-\begin{definition}[k-Nearest Neighbor]
-
- Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$
- be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
- between two points within $D$. The k-nearest neighbor problem,
- $KNN(D, q, k)$ seeks to identify a set $R\subset D$ with $|R| = k$
- such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$.
-
-\end{definition}
-
-This can be thought of as solving the nearest-neighbor problem $k$ times,
-each time removing the returned result from $D$ prior to solving the
-problem again. Unlike the single nearest-neighbor case (which can be
-thought of as k-NN with $k=1$), this problem is \emph{not} decomposable.
-
-\begin{theorem}
- k-NN is not a decomposable search problem.
-\end{theorem}
-
-\begin{proof}
-To prove this, consider the query $KNN(D, q, k)$ against some partitioned
-dataset $D = D_0 \cup D_1 \ldots \cup D_\ell$. If k-NN is decomposable,
-then there must exist some constant-time, commutative, and associative
-binary operator $\mergeop$, such that $R = \mergeop_{0 \leq i \leq \ell}
-R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q,
-k)$. Consider the evaluation of the merge operator against two arbitrary
-result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
-|R_j| = k$, and that the contents of $R$ must be the $k$ records from
-$R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
-problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$
-time. Therefore, k-NN is not a decomposable search problem.
-\end{proof}
-
-With that said, it is clear that there isn't any fundamental restriction
-preventing the merging of the result sets; it is only the constant-time
-requirement on the merge operator that cannot be satisfied. It is possible
-to merge the result sets in non-constant time, and so it is the case
-that k-NN is $C(n)$-decomposable. Unfortunately, this classification
-brings with it a reduction in query performance as a result of the way
-result merges are performed.
-
-As a concrete example of these costs, consider using the Bentley-Saxe
-method to extend the VPTree~\cite{vptree}. The VPTree is a static,
-metric index capable of answering k-NN queries in $KNN(D, q, k) \in O(k
-\log n)$. One possible merge algorithm for k-NN would be to push all
-of the elements in the two arguments onto a min-heap, and then pop off
-the first $k$. In this case, the cost of the merge operation would be
-$C(k) = k \log k$. Were $k$ assumed to be constant, then the operation
-could be considered to be constant-time. But given that $k$ is only
-bounded in size above by $n$, this isn't a safe assumption to make in
-general. Evaluating the total query cost for the extended structure,
-this would yield,
-
-\begin{equation}
-   KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
-\end{equation}
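-
-The heap-based merge described above can be sketched in Python as
-follows; the distances shown are hypothetical values used only to
-demonstrate the operator.
-
-\begin{verbatim}
-# Sketch: the O(k log k) merge operator for k-NN partial results.
-# Each partial result is a list of (distance, point) pairs.
-import heapq
-
-def knn_merge(r1, r2, k):
-    # Solves KNN(r1 U r2, q, k) given the per-block partial results.
-    heap = list(r1) + list(r2)
-    heapq.heapify(heap)                                   # O(k)
-    return [heapq.heappop(heap) for _ in range(min(k, len(heap)))]
-
-r1 = [(0.5, 'a'), (1.2, 'b'), (3.0, 'c')]
-r2 = [(0.7, 'd'), (2.5, 'e'), (4.1, 'f')]
-print(knn_merge(r1, r2, 3))   # [(0.5, 'a'), (0.7, 'd'), (1.2, 'b')]
-\end{verbatim}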
-
-The reason for this large increase in cost is the repeated application
-of the merge operator. The Bentley-Saxe method requires applying the
-merge operator in a binary fashion to each partial result, multiplying
-its cost by a factor of $\log n$. Thus, the constant-time requirement
-of standard decomposability is necessary to keep the cost of the merge
-operator from appearing within the complexity bound of the entire
-operation in the general case.\footnote {
- There is a special case, noted by Overmars, where the total cost is
- $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n))
- \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the
- case where the cost of the query and merge operation are sufficiently
- large to consume the logarithmic factor, and so it doesn't represent
- a special case with better performance.
-}
-If we could revise the result merging operation to remove this duplicated
-cost, we could greatly reduce the cost of supporting $C(n)$-decomposable
-queries.
-
-\subsubsection{Independent Range Sampling}
-\label{ssec:background-irs}
-
-Another problem that is not decomposable is independent sampling. There
-are a variety of problems falling under this umbrella, including weighted
-set sampling, simple random sampling, and weighted independent range
-sampling, but we will focus on independent range sampling here.
-
-\begin{definition}[Independent Range Sampling~\cite{tao22}]
- Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
- interval $q = [x, y]$ and an integer $k$, an independent range
- sampling query returns $k$ independent samples from $D \cap q$
- with each point having equal probability of being sampled.
-\end{definition}
-
-This problem immediately encounters a category error when considering
-whether it is decomposable: the result set is randomized, whereas
-the conditions for decomposability are defined in terms of an exact
-matching of records in result sets. To work around this, a slight abuse
-of definition is in order: assume that the equality conditions within
-the DSP definition can be interpreted to mean ``the contents in the two
-sets are drawn from the same distribution''. This enables the category
-of DSP to apply to this type of problem.
-
-Even with this abuse, however, IRS cannot generally be considered
-decomposable; it is at best $C(n)$-decomposable. The reason for this is
-that matching the distribution requires drawing the appropriate number
-of samples from each partition of the data. Even in the special
-case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples
-from each partition that must appear in the result set cannot be known
-in advance due to differences in the selectivity of the predicate across
-the partitions.
-
-\begin{example}[IRS Sampling Difficulties]
-
- Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}, D_1 =
- \{1, 1, 1, 1, 3\}, D_2 = \{4, 4, 4, 4, 4\}$ using bag semantics and
- an IRS query over the interval $[3, 4]$ with $k=12$. Because all three
- partitions have the same size, it seems sensible to evenly distribute
- the samples across them ($4$ samples from each partition). Applying
- the query predicate to the partitions results in the following,
-    $d_0 = \{3, 4\}, d_1 = \{3\}, d_2 = \{4, 4, 4, 4, 4\}$.
-
- In expectation, then, the first result set will contain $R_0 = \{3,
- 3, 4, 4\}$ as it has a 50\% chance of sampling a $3$ and the same
- probability of a $4$. The second and third result sets can only
-    be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging these
-    together, we'd find that the probability distribution of the sample
-    would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform
- the same sampling operation over the full dataset (not partitioned),
- the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
-
-\end{example}
-
-The problem is that the number of samples drawn from each partition needs to be
-weighted based on the number of elements satisfying the query predicate in that
-partition. In the above example, by drawing $4$ samples from $D_1$, more weight
-is given to $3$ than exists within the base dataset. This can be worked around
-by sampling a full $k$ records from each partition, returning both the sample
-and the number of records satisfying the predicate as that partition's query
-result, and then performing another pass of IRS as the merge operator, but this
-is the same approach as was used for k-NN above. This leaves IRS firmly in the
-$C(n)$-decomposable camp. If it were possible to pre-calculate the number of
-samples to draw from each partition, then a constant-time merge operation could
-be used.
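-
-To illustrate this last point, the Python sketch below (reusing the data
-from the example above) pre-computes the per-partition weights $|D_i
-\cap q|$ and uses them to decide, for each of the $k$ draws, which
-partition to sample from; this recovers the distribution that would be
-obtained by sampling the unpartitioned dataset.
-
-\begin{verbatim}
-# Sketch: IRS over partitioned data with selectivity-weighted draws.
-import random
-
-def irs_partitioned(partitions, lo, hi, k):
-    matching = [[x for x in p if lo <= x <= hi] for p in partitions]
-    weights = [len(m) for m in matching]   # |D_i intersect q|
-    sample = []
-    for _ in range(k):
-        # Pick a partition with probability proportional to its weight,
-        # then sample uniformly (with replacement) within it.
-        m = random.choices(matching, weights=weights, k=1)[0]
-        sample.append(random.choice(m))
-    return sample
-
-partitions = [[1, 2, 3, 4, 5], [1, 1, 1, 1, 3], [4, 4, 4, 4, 4]]
-print(irs_partitioned(partitions, 3, 4, 12))  # expect ~3 threes, ~9 fours
-\end{verbatim}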
-
-\subsection{Insertion Tail Latency}
-
-\subsection{Configurability}
-
-\section{Conclusion}
-This chapter discussed the necessary background information pertaining to
-queries and search problems, indexes, and techniques for dynamic extension. It
-described the potential for using custom indexes for accelerating particular
-kinds of queries, as well as the challenges associated with constructing these
-indexes. The remainder of this document will seek to address these challenges
-through modification and extension of the Bentley-Saxe method, describing work
-that has already been completed, as well as the additional work that must be
-done to realize this vision.
diff --git a/chapters/dynamization.tex b/chapters/dynamization.tex
new file mode 100644
index 0000000..edd3014
--- /dev/null
+++ b/chapters/dynamization.tex
@@ -0,0 +1,1027 @@
+\chapter{Classical Dynamization Techniques}
+\label{chap:background}
+
+This chapter will introduce important background information and
+existing work in the area of data structure dynamization. We will
+first discuss the concept of a search problem, which is central to
+dynamization techniques. While one might imagine that the restrictions on
+dynamization would be functions of the data structure to be dynamized, in
+practice the requirements placed on the data structure are quite mild;
+it is the properties required of the search problem that the structure
+is used to answer which present the central difficulty in applying
+dynamization techniques in a given area. After this, database
+indices will be discussed briefly, as they are the primary use of data
+structures within the database context that is of interest to our work.
+Following this, existing theoretical results in the area of data structure
+dynamization will be discussed, which will serve as the building blocks
+for our techniques in subsequent chapters. The chapter will conclude with
+a discussion of some of the limitations of these existing techniques.
+
+\section{Queries and Search Problems}
+\label{sec:dsp}
+
+Data access lies at the core of most database systems. We want to ask
+questions of the data, and ideally get the answer efficiently. We
+will refer to the different types of question that can be asked as
+\emph{search problems}. We will be using this term in a similar way as
+the word \emph{query}\footnote{
+    The term query is often abused and used to
+    refer to several related, but slightly different things. In the
+    vernacular, a query can refer to either a) a general type of search
+    problem (as in ``range query''), b) a specific instance of a search
+ problem, or c) a program written in a query language.
+}
+is often used within the database systems literature: to refer to a
+general class of questions. For example, we could consider range scans,
+point-lookups, nearest neighbor searches, predicate filtering, random
+sampling, etc., to each be a general search problem. Formally, for the
+purposes of this work, a search problem is defined as follows,
+
+\begin{definition}[Search Problem]
+ Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function
+ $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched,
+ $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the
+answer domain.\footnote{
+ It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an
+example, a \texttt{COUNT} aggregation might map a set of strings onto
+ an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need
+not be a universal constraint.
+}
+\end{definition}
+
+We will use the term \emph{query} to mean a specific instance of a search
+problem,
+
+\begin{definition}[Query]
+ Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and
+ a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific
+ instance of the search problem, $F(\mathcal{D}, q)$.
+\end{definition}
+
+As an example of using these definitions, a \emph{membership test}
+or \emph{range scan} would be considered search problems, and a range
+scan over the interval $[10, 99]$ would be a query. We've drawn this
+distinction because, as will become clear in the discussion of
+our work in later chapters, it is useful to have separate, unambiguous
+terms for these two concepts.
+
+\subsection{Decomposable Search Problems}
+
+Dynamization techniques require the partitioning of one data structure
+into several, smaller ones. As a result, these techniques can only
+be applied in situations where the search problem to be answered can
+be answered from this set of smaller data structures, with the same
+answer as would have been obtained had all of the data been used to
+construct a single, large structure. This requirement is formalized in
+the definition of a class of problems called \emph{decomposable search
+problems (DSP)}. This class was first defined by Bentley and Saxe in
+their work on dynamization, and we will adopt their definition,
+
+\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
+ \label{def:dsp}
+ A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and
+ only if there exists a constant-time computable, associative, and
+ commutative binary operator $\mergeop$ such that,
+ \begin{equation*}
+ F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
+ \end{equation*}
+ for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
+\end{definition}
+
+The requirement for $\mergeop$ to be constant-time was used by Bentley and
+Saxe to prove specific performance bounds for answering queries from a
+decomposed data structure. However, it is not strictly \emph{necessary},
+and later work by Overmars lifted this constraint and considered a more
+general class of search problems called \emph{$C(n)$-decomposable search
+problems},
+
+\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars-cn-decomp}]
+ A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable
+ if and only if there exists an $O(C(n))$-time computable, associative,
+ and commutative binary operator $\mergeop$ such that,
+ \begin{equation*}
+ F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
+ \end{equation*}
+ for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
+\end{definition}
+
+To demonstrate that a search problem is decomposable, it is necessary to
+show the existence of the merge operator, $\mergeop$, with the necessary
+properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B,
+q)$. With these two results, induction demonstrates that the problem is
+decomposable even in cases with more than two partial results.
+
+As an example, consider the range count problem,
+\begin{definition}[Range Count]
+ \label{def:range-count}
+ Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
+ $ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns
+ the cardinality, $|d \cap q|$.
+\end{definition}
+
+\begin{theorem}
+\label{ther:decomp-range-count}
+Range Count is a decomposable search problem.
+\end{theorem}
+
+\begin{proof}
+Let $\mergeop$ be addition ($+$). Applying this to
+Definition~\ref{def:dsp} gives
+\begin{align*}
+ |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
+\end{align*}
+which is true by the distributive property of union and
+intersection. Addition is an associative and commutative
+operator that can be calculated in $\Theta(1)$ time. Therefore, range counts
+are DSPs.
+\end{proof}
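+
+To make the merge operation concrete, the following minimal Python
+sketch (the function names and list-based partitions are illustrative
+choices for this example, not part of any established interface) answers
+a range count over disjoint partitions and combines the partial results
+with $\mergeop = +$.
+
+\begin{verbatim}
+# Minimal sketch: answering a range count over a decomposed data set.
+# The merge operator for range count is ordinary addition.
+
+def range_count(data, lo, hi):
+    """Answer the range count query [lo, hi] over one partition."""
+    return sum(1 for x in data if lo <= x <= hi)
+
+def decomposed_range_count(partitions, lo, hi):
+    """Combine per-partition results with the merge operator (+)."""
+    result = 0
+    for part in partitions:
+        result = result + range_count(part, lo, hi)
+    return result
+
+# The decomposed answer matches the answer over the unioned partitions.
+partitions = [[1, 5, 12], [7, 9], [3, 15, 20]]
+assert decomposed_range_count(partitions, 4, 14) == \
+       range_count([x for p in partitions for x in p], 4, 14)
+\end{verbatim}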
+
+Because the codomain of a DSP is not restricted, more complex output
+structures can be used to allow for problems that are not directly
+decomposable to be converted to DSPs, possibly with some minor
+post-processing. For example, calculating the arithmetic mean of a set
+of numbers can be formulated as a DSP,
+\begin{theorem}
+The calculation of the arithmetic mean of a set of numbers is a DSP.
+\end{theorem}
+\begin{proof}
+ Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$,
+ where $\mathcal{D}\subset\mathbb{R}$ is a multi-set of real numbers. The
+ output tuple contains the sum of the values within the input set and the
+cardinality of the input set. For two disjoint partitions of the data,
+$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let
+$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$.
+
+Applying Definition~\ref{def:dsp} gives
+\begin{align*}
+ A(D_1 \cup D_2) &= A(D_1)\mergeop A(D_2) \\
+ (s_1 + s_2, c_1 + c_2) &= (s_1 + s_2, c_1 + c_2) = (s, c)
+\end{align*}
+From this result, the average can be determined in constant time by
+taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
+of numbers is a DSP.
+\end{proof}
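+
+The same construction is straightforward to express in code. The sketch
+below (again with illustrative names) carries the $(s, c)$ tuple through
+the merge operator and performs the division as a constant-time
+post-processing step.
+
+\begin{verbatim}
+# Sketch: the arithmetic mean as a DSP with an augmented result type.
+# Each partition returns (sum, count); the merge operator adds pairwise.
+
+def partial_mean(data):
+    return (sum(data), len(data))
+
+def merge(r1, r2):
+    return (r1[0] + r2[0], r1[1] + r2[1])
+
+def mean_over_partitions(partitions):
+    s, c = (0, 0)
+    for part in partitions:
+        s, c = merge((s, c), partial_mean(part))
+    return s / c   # constant-time post-processing
+
+assert mean_over_partitions([[1, 2], [3, 4, 5]]) == 3.0
+\end{verbatim}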
+
+\section{Dynamization for Decomposable Search Problems}
+
+Because data in a database is regularly updated, data structures
+intended to be used as an index must support updates (inserts, in-place
+modification, and deletes). Not all potentially useful data structures
+support updates, and so a general strategy for adding update support
+would increase the number of data structures that could be used as
+database indices. We refer to a data structure with update support as
+\emph{dynamic}, and one without update support as \emph{static}.\footnote{
+ The term static is distinct from immutable. Static refers to the
+ layout of records within the data structure, whereas immutable
+ refers to the data stored within those records. This distinction
+ will become relevant when we discuss different techniques for adding
+ delete support to data structures. The data structures used are
+ always static, but not necessarily immutable, because the records may
+ contain header information (like visibility) that is updated in place.
+}
+
+This section discusses \emph{dynamization}, the construction of a
+dynamic data structure based on an existing static one. When certain
+conditions are satisfied by the data structure and its associated
+search problem, this process can be done automatically, and with
+provable asymptotic bounds on amortized insertion performance, as well
+as worst case query performance. This is in contrast to the manual
+design of dynamic data structures, which involve techniques based on
+partially rebuilding small portions of a single data structure (called
+\emph{local reconstruction})~\cite{overmars83}. This is a very high cost
+intervention that requires significant effort on the part of the data
+structure designer, whereas conventional dynamization can be performed
+with little-to-no modification of the underlying data structure at all.
+
+It is worth noting that there are a variety of techniques
+discussed in the literature for dynamizing structures with specific
+properties, or under very specific sets of circumstances. Examples
+include frameworks for adding update support to succinct data
+structures~\cite{dynamize-succinct} or taking advantage of batching
+of insert and query operations~\cite{batched-decomposable}. This
+section discusses techniques that are more general, and don't require
+workload-specific assumptions.
+
+We will first discuss the necessary data structure requirements, and
+then examine several classical dynamization techniques. The section
+will conclude with a discussion of delete support within the context
+of these techniques. For more detail than is included in this chapter,
+Overmars wrote a book providing a comprehensive survey of techniques for
+creating dynamic data structures, including not only the dynamization
+techniques discussed here, but also local reconstruction based
+techniques and more~\cite{overmars83}.\footnote{
+ Sadly, this book isn't readily available in
+ digital format as of the time of writing.
+}
+
+
+\subsection{Global Reconstruction}
+
+The most fundamental dynamization technique is that of \emph{global
+reconstruction}. While not particularly useful on its own, global
+reconstruction serves as the basis for the techniques to follow, and so
+we will begin our discussion of dynamization with it.
+
+Consider a class of data structure, $\mathcal{I}$, capable of answering a
+search problem, $F$. Insertion via global reconstruction is
+possible if $\mathcal{I}$ supports the following two operations,
+\begin{align*}
+\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
+\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
+\end{align*}
+where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$
+of the data structure over a set of records $d \subseteq \mathcal{D}$
+in $B(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d
+\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in
+$\Theta(1)$ time,\footnote{
+ There isn't any practical reason why $\mathtt{unbuild}$ must run
+ in constant time, but this is the assumption made in \cite{saxe79}
+ and in subsequent work based on it, and so we will follow the same
+ definition here.
+} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.
+
+Given this structure, an insert of record $r \in \mathcal{D}$ into a
+data structure $\mathscr{I} \in \mathcal{I}$ can be defined by,
+\begin{align*}
+\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) \cup \{r\})
+\end{align*}
+
+It goes without saying that this operation is sub-optimal, as the
+insertion cost is $\Theta(B(n))$, and $B(n) \in \Omega(n)$ at best for
+most data structures. However, this global reconstruction strategy can
+be used as a primitive for more sophisticated techniques that can provide
+reasonable performance.
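+
+As a concrete illustration, the sketch below renders insertion via
+global reconstruction in Python against the
+$\mathtt{build}$/$\mathtt{unbuild}$ interface described above. The
+\texttt{SortedArray} structure is a stand-in chosen for the example,
+not a structure drawn from the literature.
+
+\begin{verbatim}
+# Sketch: insertion via global reconstruction against a static structure
+# exposing only build (the constructor) and unbuild.
+from bisect import bisect_left, bisect_right
+
+class SortedArray:
+    """Stand-in static structure: a sorted array with binary search."""
+    def __init__(self, records):              # build()
+        self._data = sorted(records)
+
+    def unbuild(self):                         # recover the record set
+        return list(self._data)
+
+    def range_count(self, lo, hi):             # example query
+        return bisect_right(self._data, hi) - bisect_left(self._data, lo)
+
+def insert_by_global_reconstruction(structure, record):
+    # Rebuild the entire structure over the old records plus the new one.
+    return SortedArray(structure.unbuild() + [record])
+
+s = SortedArray([4, 8, 15])
+s = insert_by_global_reconstruction(s, 16)
+assert s.range_count(5, 20) == 3
+\end{verbatim}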
+
+\subsection{Amortized Global Reconstruction}
+\label{ssec:agr}
+
+The problem with global reconstruction is that each insert must rebuild
+the entire data structure, involving all of its records. This results
+in a worst-case insert cost of $\Theta(B(n))$. However, opportunities
+for improving this scheme can present themselves when considering the
+\emph{amortized} insertion cost.
+
+Consider the cost accrued by the dynamized structure under global
+reconstruction over the lifetime of the structure. Each insert will result
+in all of the existing records being rewritten, so at worst each record
+will be involved in $\Theta(n)$ reconstructions, each reconstruction
+having $\Theta(B(n))$ cost. We can amortize this cost over the $n$ records
+inserted to get an amortized insertion cost for global reconstruction of,
+
+\begin{equation*}
+I_a(n) = \frac{B(n) \cdot n}{n} = B(n)
+\end{equation*}
+
+This doesn't improve things as is, however it does present two
+opportunities for improvement. If we could either reduce the size of
+the reconstructions, or the number of times a record is reconstructed,
+then we could reduce the amortized insertion cost.
+
+The key insight, first discussed by Bentley and Saxe, is that
+both of these goals can be accomplished by \emph{decomposing} the
+data structure into multiple, smaller structures, each built from
+a disjoint partition of the data. As long as the search problem
+being considered is decomposable, queries can be answered from
+this structure with bounded worst-case overhead, and the amortized
+insertion cost can be improved~\cite{saxe79}. Significant theoretical
+work exists in evaluating different strategies for decomposing the
+data structure~\cite{saxe79, overmars81, overmars83} and for leveraging
+specific efficiencies of the data structures being considered to improve
+these reconstructions~\cite{merge-dsp}.
+
+There are two general decomposition techniques that emerged from
+this work. The earliest of these is the logarithmic method, often
+called the Bentley-Saxe method in modern literature, and is the most
+commonly discussed technique today. The Bentley-Saxe method has been
+directly applied in a few instances in the literature, such as to
+metric indexing structures~\cite{naidan14} and spatial structures~\cite{bkdtree},
+and has also been used in a modified form for genetic sequence search
+structures~\cite{almodaresi23} and graphs~\cite{lsmgraph}, to cite a few
+examples.
+
+A later technique, the equal block method, was also developed. It is
+generally not as effective as the Bentley-Saxe method, and as a result we
+have not identified any specific applications of this technique outside
+of the theoretical literature, however we will discuss it as well in
+the interest of completeness, and because it does lend itself well to
+demonstrating certain properties of decomposition-based dynamization
+techniques.
+
+\subsection{Equal Block Method}
+\label{ssec:ebm}
+
+Though chronologically later, the equal block method is theoretically a
+bit simpler, and so we will begin our discussion of decomposition-based
+technique for dynamization of decomposable search problems with it. There
+have been several proposed variations of this concept~\cite{maurer79,
+maurer80}, but we will focus on the most developed form as described by
+Overmars and van Leeuwen~\cite{overmars-art-of-dyn, overmars83}. The core
+concept of the equal block method is to decompose the data structure
+into several smaller data structures, called blocks, over partitions
+of the data. This decomposition is performed such that each block is of
+roughly equal size.
+
+Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves
+some decomposable search problem, $F$ and is built over a set of records
+$d \in \mathcal{D}$. This structure can be decomposed into $s$ blocks,
+$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$ each built over
+partitions of $d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a specific value
+makes little sense when the number of records changes, and so it is taken
+to be governed by a smooth, monotonically increasing function $f(n)$ such
+that, at any point, the following two constraints are obeyed.
+\begin{align}
+ f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\
+ \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2}
+\end{align}
+where $|\mathscr{I}_j|$ is the number of records in the block,
+$|\text{unbuild}(\mathscr{I}_j)|$.
+
+A new record is inserted by finding the smallest block and rebuilding it
+using the new record. If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$,
+then an insert is done by,
+\begin{equation*}
+\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\})
+\end{equation*}
+Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
+ Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
+ violated by deletes. We're omitting deletes from the discussion at
+ this point, but will circle back to them in Section~\ref{sec:deletes}.
+} In this case, the constraints are enforced by ``re-configuring'' the
+structure. $s$ is updated to be exactly $f(n)$, all of the existing
+blocks are unbuilt, and then the records are redistributed evenly into
+$s$ blocks.
+
+A query with parameters $q$ is answered by this structure by individually
+querying the blocks, and merging the local results together with $\mergeop$,
+\begin{equation*}
+F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q)
+\end{equation*}
+where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to
+answering the query over $d$ using the data structure $\mathscr{I}$.
+
+This technique provides better amortized performance bounds than global
+reconstruction, at the possible cost of worse query performance for
+sub-linear queries. We'll omit the details of the proof of performance
+for brevity and streamline some of the original notation (full details
+can be found in~\cite{overmars83}), but this technique ultimately
+results in a data structure with the following performance characteristics,
+\begin{align*}
+\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n} + B\left(\frac{n}{f(n)}\right)\right) \\
+\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \\
+\end{align*}
+where $B(n)$ is the cost of statically building $\mathcal{I}$, and
+$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$.
+
+%TODO: example?
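+
+To make these mechanics concrete, the following Python sketch
+(illustrative only; the choice of $f(n) = \lceil \sqrt{n}\, \rceil$ and
+the use of sorted lists as stand-ins for arbitrary static blocks are
+assumptions made for this example) inserts into the smallest block via
+reconstruction, re-configures the structure when
+Constraint~\ref{ebm-c1} is violated, and answers range count queries by
+merging per-block results with addition.
+
+\begin{verbatim}
+# Sketch of the equal block method over a simple static structure
+# (a sorted list standing in for any structure with build/unbuild).
+import math
+from bisect import bisect_left, bisect_right
+
+def f(n):                                    # example choice of f(n)
+    return max(1, math.ceil(math.sqrt(n)))
+
+class EqualBlockDynamization:
+    def __init__(self):
+        self.blocks = [[]]                   # each block is a sorted list
+
+    def insert(self, record):
+        n = sum(len(b) for b in self.blocks) + 1
+        k = min(range(len(self.blocks)), key=lambda j: len(self.blocks[j]))
+        self.blocks[k] = sorted(self.blocks[k] + [record])  # rebuild block k
+        if not (f(n // 2) <= len(self.blocks) <= f(2 * n)):
+            self._reconfigure(n)             # constraint violated
+
+    def _reconfigure(self, n):
+        records = sorted(r for b in self.blocks for r in b)
+        s = f(n)
+        size = math.ceil(len(records) / s)
+        self.blocks = [records[i:i + size]
+                       for i in range(0, len(records), size)]
+
+    def range_count(self, lo, hi):
+        # Decomposable query: merge per-block counts with addition.
+        return sum(bisect_right(b, hi) - bisect_left(b, lo)
+                   for b in self.blocks)
+
+ebm = EqualBlockDynamization()
+for x in range(1, 21):
+    ebm.insert(x)
+assert ebm.range_count(5, 14) == 10
+\end{verbatim}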
+
+
+\subsection{The Bentley-Saxe Method}
+\label{ssec:bsm}
+
+The original, and most frequently used, dynamization technique is the
+Bentley-Saxe Method (BSM), also called the logarithmic method in older
+literature. Rather than breaking the data structure into equally sized
+blocks, BSM decomposes the structure into logarithmically many blocks
+of exponentially increasing size. More specifically, the data structure
+is decomposed into $h = \lfloor \log_2 n \rfloor + 1$ blocks, $\mathscr{I}_0,
+\mathscr{I}_1, \ldots, \mathscr{I}_{h-1}$. A given block $\mathscr{I}_i$
+will be either empty, or contain exactly $2^i$ records within it.
+
+The procedure for inserting a record, $r \in \mathcal{D}$, into
+a BSM dynamization is as follows. If the block $\mathscr{I}_0$
+is empty, then $\mathscr{I}_0 = \text{build}{\{r\}}$. If it is not
+empty, then there will exist a maximal sequence of non-empty blocks
+$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq
+0$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case,
+$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i
+\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through
+$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the
+end of the structure as needed.
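+
+The insertion procedure can be summarized in a short Python sketch
+(illustrative only; sorted lists stand in for arbitrary static blocks,
+and range count is used as the decomposable search problem). The
+carry-propagation loop makes the analogy to binary counting, discussed
+below, explicit.
+
+\begin{verbatim}
+# Sketch of Bentley-Saxe insertion: block i is either empty or holds
+# exactly 2^i records, and an insert behaves like a binary increment.
+from bisect import bisect_left, bisect_right
+
+class BentleySaxeDynamization:
+    def __init__(self):
+        self.blocks = []                  # blocks[i] is None or a sorted list
+
+    def insert(self, record):
+        carry = [record]                  # records pushed to the next level
+        for i in range(len(self.blocks)):
+            if self.blocks[i] is None:
+                self.blocks[i] = sorted(carry)   # rebuild at level i
+                return
+            carry += self.blocks[i]       # unbuild level i into the carry
+            self.blocks[i] = None
+        self.blocks.append(sorted(carry)) # add a new, larger level
+
+    def range_count(self, lo, hi):
+        # Decomposable query: merge per-block results with addition.
+        return sum(bisect_right(b, hi) - bisect_left(b, lo)
+                   for b in self.blocks if b is not None)
+
+bsm = BentleySaxeDynamization()
+for x in range(1, 13):
+    bsm.insert(x)
+# Twelve records decompose as 0b1100: blocks 2 and 3 full, 0 and 1 empty.
+assert [b is not None for b in bsm.blocks] == [False, False, True, True]
+assert bsm.range_count(3, 10) == 8
+\end{verbatim}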
+
+%FIXME: switch the x's to r's for consistency
+\begin{figure}
+\centering
+\includegraphics[width=.8\textwidth]{diag/bsm.pdf}
+\caption{An illustration of inserts into the Bentley-Saxe Method}
+\label{fig:bsm-example}
+\end{figure}
+
+Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The
+dynamization is built over a set of records $x_1, x_2, \ldots,
+x_{10}$ initially, with eight records in $\mathscr{I}_3$ and two in
+$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly
+into $\mathscr{I}_0$. For the next insert following this, $x_{12}$, the
+first empty block is $\mathscr{I}_2$, and so the insert is performed by
+doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup
+\text{unbuild}(\mathscr{I}_0) \cup \text{unbuild}(\mathscr{I}_1)\right)$
+and then emptying $\mathscr{I}_0$ and $\mathscr{I}_1$.
+
+This technique is called a \emph{binary decomposition} of the data
+structure. Considering a BSM dynamization of a structure containing $n$
+records, labeling each block with a $0$ if it is empty and a $1$ if it
+is full will result in the binary representation of $n$. For example,
+the final state of the structure in Figure~\ref{fig:bsm-example} contains
+$12$ records, and the labeling procedure will result in $0\text{b}1100$,
+which is $12$ in binary. Inserts affect this representation of the
+structure in the same way that incrementing the binary number by $1$ does.
+
+By applying BSM to a data structure, a dynamized structure can be created
+with the following performance characteristics,
+\begin{align*}
+\text{Amortized Insertion Cost:}&\quad \Theta\left(\left(\frac{B(n)}{n}\cdot \log_2 n\right)\right) \\
+\text{Worst Case Insertion Cost:}&\quad \Theta\left(B(n)\right) \\
+\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right) \\
+\end{align*}
+This is a particularly attractive result because, for example, a data
+structure having $B(n) \in \Theta(n)$ will have an amortized insertion
+cost of $\Theta(\log_2 n)$, which is quite reasonable. The trade-off for this
+is an extra logarithmic multiple attached to the query complexity. It is
+also worth noting that the worst-case insertion cost remains the same
+as global reconstruction, but this case arises only very rarely. In
+terms of the binary decomposition representation, the worst-case
+behavior is triggered each time the counter overflows and a new digit
+must be added.
+
+As a final note about the query performance of this structure, because
+the overhead due to querying the blocks is logarithmic, under certain
+circumstances this cost can be absorbed, resulting in no effect on the
+asymptotic worst-case query performance. As an example, consider a linear
+scan of the data running in $\Theta(n)$ time. In this case, every record
+must be considered, and so there isn't any performance penalty\footnote{
+ From an asymptotic perspective. There will still be measurable performance
+ effects from caching, etc., even in this case.
+} to breaking the records out into multiple chunks and scanning them
+individually. More formally, for any query running in $\mathscr{Q}(n) \in
+\Omega\left(n^\epsilon\right)$ time where $\epsilon > 0$, the worst-case
+cost of answering a decomposable search problem from a BSM dynamization
+is $\Theta\left(\mathscr{Q}(n)\right)$.~\cite{saxe79}
+
+\subsection{Merge Decomposable Search Problems}
+
+\subsection{Delete Support}
+
+Classical dynamization techniques have also been developed with
+support for deleting records. In general, the same technique of global
+reconstruction that was used for inserting records can also be used to
+delete them. Given a record $r \in \mathcal{D}$ and a data structure
+$\mathscr{I} \in \mathcal{I}$ such that $r \in \mathscr{I}$, $r$ can be
+deleted from the structure in $\Theta(B(n))$ time as follows,
+\begin{equation*}
+\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) - \{r\})
+\end{equation*}
+However, supporting deletes within the dynamization schemes discussed
+above is more complicated. The core problem is that inserts affect the
+dynamized structure in a deterministic way, and as a result certain
+partitioning schemes can be leveraged to reason about the
+performance. But, deletes do not work like this.
+
+\begin{figure}
+\caption{A Bentley-Saxe dynamization for the integers on the
+interval $[1, 100]$.}
+\label{fig:bsm-delete-example}
+\end{figure}
+
+For example, consider a Bentley-Saxe dynamization that contains all
+integers on the interval $[1, 100]$, inserted in that order, shown in
+Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the
+records from this structure, one at a time, using global reconstruction.
+This presents several problems,
+\begin{itemize}
+ \item For each record, we need to identify which block it is in before
+ we can delete it.
+ \item The cost of performing a delete is a function of which block the
+ record is in, which is a question of distribution and not easily
+ controlled.
+ \item As records are deleted, the structure will potentially violate
+ the invariants of the decomposition scheme used, which will
+ require additional work to fix.
+\end{itemize}
+
+To resolve these difficulties, two very different approaches have been
+proposed for supporting deletes, each of which rely on certain properties
+of the search problem and data structure. These are the use of a ghost
+structure and weak deletes.
+
+\subsubsection{Ghost Structure for Invertible Search Problems}
+
+The first proposed mechanism for supporting deletes was discussed
+alongside the Bentley-Saxe method in Bentley and Saxe's original
+paper. This technique applies to a class of search problems called
+\emph{invertible} (also called \emph{decomposable counting problems}
+in later literature~\cite{overmars83}). Invertible search problems
+are decomposable, and also support an ``inverse'' merge operator, $\Delta$,
+that is able to remove records from the result set. More formally,
+\begin{definition}[Invertible Search Problem~\cite{saxe79}]
+\label{def:invert}
+A decomposable search problem, $F$ is invertible if and only if there
+exists a constant time computable operator, $\Delta$, such that
+\begin{equation*}
+F(A \setminus B, q) = F(A, q)~\Delta~F(B, q)
+\end{equation*}
+for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
+\end{definition}
+
+Given a search problem with this property, it is possible to perform
+deletes by creating a secondary ``ghost'' structure. When a record
+is to be deleted, it is inserted into this structure. Then, when the
+dynamization is queried, this ghost structure is queried as well as the
+main one. The results from the ghost structure can be removed from the
+result set using the inverse merge operator. This simulates the result
+that would have been obtained had the records been physically removed
+from the main structure.
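+
+A minimal sketch of this mechanism is given below for the range count
+problem (illustrative only; plain sorted lists stand in for the main
+and ghost structures, each of which would in practice be a full
+Bentley-Saxe dynamization in its own right). A delete is an insert into
+the ghost structure, and a query removes the ghost contribution using
+$\Delta$, which for range count is subtraction.
+
+\begin{verbatim}
+# Sketch of the ghost structure mechanism for an invertible search
+# problem (range count, with subtraction as the inverse operator).
+from bisect import bisect_left, bisect_right, insort
+
+class GhostStructureDynamization:
+    def __init__(self):
+        self.main = []     # stands in for the primary dynamized structure
+        self.ghost = []    # records that have been logically deleted
+
+    def insert(self, record):
+        insort(self.main, record)
+
+    def delete(self, record):
+        insort(self.ghost, record)   # a delete inserts into the ghost
+
+    def range_count(self, lo, hi):
+        count = lambda d: bisect_right(d, hi) - bisect_left(d, lo)
+        # Remove the ghost contribution with the inverse operator.
+        return count(self.main) - count(self.ghost)
+
+d = GhostStructureDynamization()
+for x in range(1, 11):
+    d.insert(x)
+d.delete(5)
+d.delete(7)
+assert d.range_count(1, 10) == 8
+\end{verbatim}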
+
+Two examples of invertible search problems are set membership
+and range count. Range count was formally defined in
+Definition~\ref{def:range-count}.
+
+\begin{theorem}
+Range count is an invertible search problem.
+\end{theorem}
+
+\begin{proof}
+To prove that range count is an invertible search problem, it must be
+decomposable and have a $\Delta$ operator. That it is a DSP has already
+been proven in Theorem~\ref{ther:decomp-range-count}.
+
+Let $\Delta$ be subtraction $(-)$. Applying this to Definition~\ref{def:invert}
+gives,
+\begin{equation*}
+|(A \setminus B) \cap q| = |(A \cap q) \setminus (B \cap q)| = |(A \cap q)| - |(B \cap q)|
+\end{equation*}
+which is true by the distributive property of set difference and
+intersection. Subtraction is computable in constant time, therefore
+range count is an invertible search problem using subtraction as $\Delta$.
+\end{proof}
+
+The set membership search problem is defined as follows,
+\begin{definition}[Set Membership]
+\label{def:set-membership}
+Consider a set of elements $d \subseteq \mathcal{D}$ from some domain,
+and a single element $r \in \mathcal{D}$. A test of set membership is a
+search problem of the form $F: (\mathcal{PS}(\mathcal{D}), \mathcal{D})
+\to \mathbb{B}$ such that $F(d, r) = r \in d$, which maps to $0$ if $r
+\not\in d$ and $1$ if $r \in d$.
+\end{definition}
+
+\begin{theorem}
+Set membership is an invertible search problem.
+\end{theorem}
+
+\begin{proof}
+To prove that set membership is invertible, it is necessary to establish
+that it is a decomposable search problem, and that a $\Delta$ operator
+exists. We'll begin with the former.
+\begin{lemma}
+ \label{lem:set-memb-dsp}
+ Set membership is a decomposable search problem.
+\end{lemma}
+\begin{proof}
+Let $\mergeop$ be the logical disjunction ($\lor$). This yields,
+\begin{align*}
+F(A \cup B, r) &= F(A, r) \lor F(B, r) \\
+r \in (A \cup B) &= (r \in A) \lor (r \in B)
+\end{align*}
+which is true, following directly from the definition of union. The
+logical disjunction is an associative, commutative operator that can
+be calculated in $\Theta(1)$ time. Therefore, set membership is a
+decomposable search problem.
+\end{proof}
+
+For the inverse merge operator, $\Delta$, it is necessary that $F(A,
+r) ~\Delta~F(B, r)$ be true \emph{only} if $r \in A$ and $r \not\in
+B$. Thus, it could be directly implemented as $F(A, r)~\Delta~F(B, r) =
+F(A, r) \land \neg F(B, r)$, which is constant time if
+the operands are already known.
+
+Thus, we have shown that set membership is a decomposable search problem,
+and that a constant time $\Delta$ operator exists. Therefore, it is an
+invertible search problem.
+\end{proof}
+
+For search problems such as these, this technique allows for deletes to be
+supported with the same cost as an insert. Unfortunately, it suffers from
+write amplification because each deleted record is recorded twice: once in
+the main structure, and once in the ghost structure. This means that $n$
+is, in effect, the total number of inserts and deletes. This can lead
+to some serious problems. For example, if every record in a structure
+of $n$ records is deleted, the net result will be an ``empty'' dynamized
+data structure containing $2n$ physical records within it. To circumvent
+this problem, Bentley and Saxe proposed a mechanism of setting a maximum
+threshold for the size of the ghost structure relative to the main one,
+and performing a complete re-partitioning of the data once this threshold
+is reached, removing all deleted records from the main structure,
+emptying the ghost structure, and rebuilding blocks with the records
+that remain according to the invariants of the technique.
+
+\subsubsection{Weak Deletes for Deletion Decomposable Search Problems}
+
+Another approach for supporting deletes was proposed later, by Overmars
+and van Leeuwen, for a class of search problem called \emph{deletion
+decomposable}. These are decomposable search problems for which the
+underlying data structure supports a delete operation. More formally,
+
+\begin{definition}[Deletion Decomposable Search Problem~\cite{merge-dsp}]
+ A decomposable search problem, $F$, and its data structure,
+ $\mathcal{I}$, is deletion decomposable if and only if, for some
+ instance $\mathscr{I} \in \mathcal{I}$, containing $n$ records,
+ there exists a deletion routine $\mathtt{delete}(\mathscr{I},
+ r)$ that removes some $r \in \mathcal{D}$ in time $D(n)$ without
+ increasing the query time, deletion time, or storage requirement,
+ for $\mathscr{I}$.
+\end{definition}
+
+Superficially, this doesn't appear very useful. If the underlying data
+structure already supports deletes, there isn't much reason to use a
+dynamization technique to add deletes to it. However, one point worth
+mentioning is that it is possible, in many cases, to easily \emph{add}
+delete support to a static structure. If it is possible to locate a
+record and somehow mark it as deleted, without removing it from the
+structure, and then efficiently ignore these records while querying,
+then the given structure and its search problem can be said to be
+deletion decomposable. This technique for deleting records is called
+\emph{weak deletes}.
+
+\begin{definition}[Weak Deletes~\cite{overmars81}]
+\label{def:weak-delete}
+A data structure is said to support weak deletes if it provides a
+routine, \texttt{delete}, that guarantees that after $\alpha \cdot n$
+deletions, where $\alpha < 1$, the query cost is bounded by $k_\alpha
+\mathscr{Q}(n)$ for some constant $k_\alpha$ dependent only upon $\alpha$,
+where $\mathscr{Q}(n)$ is the cost of answering the query against a
+structure upon which no weak deletes were performed.\footnote{
+ This paper also provides a similar definition for weak updates,
+ but these aren't of interest to us in this work, and so the above
+ definition was adapted from the original with the weak update
+ constraints removed.
+} The results of the query of a block containing weakly deleted records
+should be the same as the results would be against a block with those
+records removed.
+\end{definition}
+
+As an example of a deletion decomposable search problem, consider the set
+membership problem considered above (Definition~\ref{def:set-membership})
+where $\mathcal{I}$, the data structure used to answer queries of the
+search problem, is a hash map.\footnote{
+ While most hash maps are already dynamic, and so wouldn't need
+ dynamization to be applied, there do exist static ones too. For example,
+ the hash map being considered could be implemented using perfect
+ hashing~\cite{perfect-hashing}, which has many static implementations.
+}
+
+\begin{theorem}
+ The set membership problem, answered using a static hash map, is
+ deletion decomposable.
+\end{theorem}
+
+\begin{proof}
+We've already shown in Lemma~\ref{lem:set-memb-dsp} that set membership
+is a decomposable search problem. For it to be deletion decomposable,
+we must demonstrate that the hash map, $\mathcal{I}$, supports deleting
+records without hurting its query performance, delete performance, or
+storage requirements. Assume that an instance $\mathscr{I} \in
+\mathcal{I}$ having $|\mathscr{I}| = n$ can answer queries in
+$\mathscr{Q}(n) \in \Theta(1)$ time and requires $\Omega(n)$ storage.
+
+Such a structure can support weak deletes. Each record within the
+structure has a single bit attached to it, indicating whether it has
+been deleted or not. These bits will require $\Theta(n)$ storage and
+be initialized to 0 when the structure is constructed. A delete can
+be performed by querying the structure for the record to be deleted in
+$\Theta(1)$ time, and setting the bit to 1 if the record is found. This
+operation has $D(n) \in \Theta(1)$ cost.
+
+\begin{lemma}
+\label{lem:weak-deletes}
+The delete procedure as described above satisfies the requirements of
+Definition~\ref{def:weak-delete} for weak deletes.
+\end{lemma}
+\begin{proof}
+Per Definition~\ref{def:weak-delete}, there must exist some constant
+dependent only on $\alpha$, $k_\alpha$, such that after $\alpha \cdot
+n$ deletes against $\mathscr{I}$ with $\alpha < 1$, the query cost is
+bounded by $k_\alpha \mathscr{Q}(n)$.
+
+In this case, $\mathscr{Q}(n) \in \Theta(1)$, and therefore our final
+query cost must be bounded by $\Theta(k_\alpha)$. When a query is
+executed against $\mathscr{I}$, there are three possible cases,
+\begin{enumerate}
+\item The record being searched for does not exist in $\mathscr{I}$. In
+this case, the query result is 0.
+\item The record being searched for does exist in $\mathscr{I}$ and has
+a delete bit value of 0. In this case, the query result is 1.
+\item The record being searched for does exist in $\mathscr{I}$ and has
+a delete bit value of 1 (i.e., it has been deleted). In this case, the
+query result is 0.
+\end{enumerate}
+In all three cases, the addition of deletes requires only $\Theta(1)$
+extra work at most. Therefore, set membership over a static hash map
+using our proposed deletion mechanism satisfies the requirements for
+weak deletes, with $k_\alpha = 1$.
+\end{proof}
+
+Finally, we note that the cost of one of these weak deletes is $D(n)
+= \mathscr{Q}(n)$. By Lemma~\ref{lem:weak-deletes}, the delete cost is
+not asymptotically harmed by deleting records.
+
+Thus, we've shown that set membership using a static hash map is a
+decomposable search problem, the storage cost remains $\Omega(n)$ and the
+query and delete costs are unaffected by the presence of deletes using the
+proposed mechanism. All of the requirements of deletion decomposability
+are satisfied, therefore set membership using a static hash map is a
+deletion decomposable search problem.
+\end{proof}
+
+For such problems, deletes can be supported by first identifying the
+block in the dynamization containing the record to be deleted, and
+then calling $\mathtt{delete}$ on it. In order to allow this block to
+be easily located, it is possible to maintain a hash table over all
+of the records, alongside the dynamization, which maps each record
+onto the block containing it. This table must be kept up to date as
+reconstructions occur, but this can be done at no extra asymptotic costs
+for any data structures having $B(n) \in \Omega(n)$, as it requires only
+linear time. This allows for deletes to be performed in $\mathscr{D}(n)
+\in \Theta(D(n))$ time.
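+
+The sketch below illustrates this arrangement for the set membership
+problem (hypothetical class names; a Python dictionary plays the role
+of both the per-record deletion bits and the record-to-block hash
+table, and the blocks themselves are simple stand-ins for a static
+structure).
+
+\begin{verbatim}
+# Sketch of weak deletes: each block marks deleted records with a bit,
+# and a hash table maps each record to the block containing it.
+
+class Block:
+    """A static block storing records with a deletion bit per record."""
+    def __init__(self, records):
+        self.records = {r: False for r in records}   # record -> deleted?
+
+    def weak_delete(self, record):
+        if record in self.records:
+            self.records[record] = True
+
+    def contains(self, record):
+        # A weakly-deleted record is reported as absent.
+        return self.records.get(record, True) is False
+
+class WeakDeleteDynamization:
+    def __init__(self, partitions):
+        self.blocks = [Block(p) for p in partitions]
+        # Maintained alongside the blocks, updated on reconstruction.
+        self.locator = {r: b for b in self.blocks for r in b.records}
+
+    def delete(self, record):
+        if record in self.locator:
+            self.locator[record].weak_delete(record)
+
+    def member(self, record):
+        # Decomposable query: merge per-block results with logical OR.
+        return any(b.contains(record) for b in self.blocks)
+
+d = WeakDeleteDynamization([[1, 2, 3], [4, 5, 6]])
+d.delete(5)
+assert d.member(4) and not d.member(5) and not d.member(99)
+\end{verbatim}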
+
+The presence of deleted records within the structure does introduce a
+new problem, however. Over time, the number of records in each block will
+drift away from the requirements imposed by the dynamization technique. It
+will eventually become necessary to re-partition the records to restore
+these invariants, which are necessary for bounding the number of blocks,
+and thereby the query performance. The particular invariant maintenance
+rules depend upon the decomposition scheme used.
+
+\Paragraph{Bentley-Saxe Method.} When creating a BSM dynamization for
+a deletion decomposable search problem, the $i$th block where $i \geq 2$\footnote{
+ Block $i=0$ will only ever have one record, so no special maintenance must be
+ done for it. A delete will simply empty it completely.
+},
+in the absence of deletes, will contain $2^{i-1} + 1$ records. When a
+delete occurs in block $i$, no special action is taken until the number
+of records in that block falls below $2^{i-2}$. Once this threshold is
+reached, a reconstruction can be performed to restore the appropriate
+record counts in each block.~\cite{merge-dsp}
+
+\Paragraph{Equal Block Method.} For the equal block method, there are
+two cases in which a delete may cause a block to fail to obey the method's
+size invariants,
+\begin{enumerate}
+ \item If enough records are deleted, it is possible for the number
+ of blocks to exceed $f(2n)$, violating Invariant~\ref{ebm-c1}.
+ \item The deletion of records may cause the maximum size of each
+ block to shrink, causing some blocks to exceed the maximum capacity
+ of $\nicefrac{2n}{s}$. This is a violation of Invariant~\ref{ebm-c2}.
+\end{enumerate}
+In both cases, it should be noted that $n$ is decreased as records are
+deleted. Should either of these cases emerge as a result of a delete,
+the entire structure must be reconfigured to ensure that its invariants
+are maintained. This reconfiguration follows the same procedure as when
+an insert results in a violation: $s$ is updated to be exactly $f(n)$, all
+existing blocks are unbuilt, and then the records are evenly redistributed
+into the $s$ blocks.~\cite{overmars-art-of-dyn}
+
+
+\subsection{Worst-Case Optimal Techniques}
+
+
+\section{Limitations of Classical Dynamization Techniques}
+\label{sec:bsm-limits}
+
+While fairly general, these dynamization techniques have a number of
+limitations that prevent them from being directly usable as a general
+solution to the problem of creating database indices. Because of the
+requirement that the query being answered be decomposable, many search
+problems cannot be addressed, or at least not addressed efficiently, by
+decomposition-based dynamization. The techniques also do nothing to reduce
+the worst-case insertion cost, resulting in extremely poor tail latency
+performance relative to hand-built dynamic structures. Finally, these
+approaches do not do a good job of exposing the underlying configuration
+space to the user, meaning that the user can exert limited control on the
+performance of the dynamized data structure. This section will discuss
+these limitations, and the rest of the document will be dedicated to
+proposing solutions to them.
+
+\subsection{Limits of Decomposability}
+\label{ssec:decomp-limits}
+Unfortunately, the DSP abstraction used as the basis of classical
+dynamization techniques has a few significant limitations that restrict
+their applicability,
+
+\begin{itemize}
+ \item The query must be broadcast identically to each block and cannot
+ be adjusted based on the state of the other blocks.
+
+ \item The query process is done in one pass--it cannot be repeated.
+
+ \item The result merge operation must be $O(1)$ to maintain good query
+ performance.
+
+ \item The result merge operation must be commutative and associative,
+ and is called repeatedly to merge pairs of results.
+\end{itemize}
+
+These requirements restrict the types of queries that can be supported by
+the method efficiently. For example, k-nearest neighbor and independent
+range sampling are not decomposable.
+
+\subsubsection{k-Nearest Neighbor}
+\label{sssec-decomp-limits-knn}
+The k-nearest neighbor (k-NN) problem is a generalization of the nearest
+neighbor problem, which seeks to return the closest point within the
+dataset to a given query point. More formally, this can be defined as,
+\begin{definition}[Nearest Neighbor]
+
+ Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$
+ be some function $f: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^+$
+ representing the distance between two points. The nearest neighbor
+ problem, $NN(D, q)$, returns some point $d^* \in D$ such that
+ $f(d^*, q) = \min_{d \in D} f(d, q)$ for a given query point, $q \in \mathbb{R}^d$.
+
+\end{definition}
+
+In practice, it is common to require $f(x, y)$ be a metric,\footnote
+{
+ Contrary to its vernacular usage as a synonym for ``distance'', a
+ metric is more formally defined as a valid distance function over
+ a metric space. Metric spaces require their distance functions to
+ have the following properties,
+ \begin{itemize}
+ \item The distance between a point and itself is always 0.
+ \item All distances between non-equal points must be positive.
+ \item For all points, $x, y \in D$, it is true that
+ $f(x, y) = f(y, x)$.
+ \item For any three points $x, y, z \in D$ it is true that
+ $f(x, z) \leq f(x, y) + f(y, z)$.
+ \end{itemize}
+
+ These distances also must have the interpretation that $f(x, y) <
+ f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This
+ is the opposite of the definition of similarity, and so some minor
+ manipulations are usually required to make similarity measures work
+ in metric-based indexes. \cite{intro-analysis}
+}
+and this will be done in the examples of indices for addressing
+this problem in this work, but it is not a fundamental aspect of the problem
+formulation. The nearest neighbor problem itself is decomposable,
+with a simple merge function that accepts the result with the smallest
+value of $f(x, q)$ for any two inputs~\cite{saxe79}.
+
+The k-nearest neighbor problem generalizes nearest-neighbor to return
+the $k$ nearest elements,
+\begin{definition}[k-Nearest Neighbor]
+
+ Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$
+ be some function $f: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^+$
+ representing the distance between two points. The k-nearest neighbor
+ problem, $KNN(D, q, k)$, seeks to identify a set $R\subseteq D$ with $|R| = k$
+ such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$.
+
+\end{definition}
+
+This can be thought of as solving the nearest-neighbor problem $k$ times,
+each time removing the returned result from $D$ prior to solving the
+problem again. Unlike the single nearest-neighbor case (which can be
+thought of as k-NN with $k=1$), this problem is \emph{not} decomposable.
+
+\begin{theorem}
+ k-NN is not a decomposable search problem.
+\end{theorem}
+
+\begin{proof}
+To prove this, consider the query $KNN(D, q, k)$ against some partitioned
+dataset $D = D_0 \cup D_1 \ldots \cup D_\ell$. If k-NN is decomposable,
+then there must exist some constant-time, commutative, and associative
+binary operator $\mergeop$, such that $R = \bigmergeop_{0 \leq i \leq \ell}
+R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q,
+k)$. Consider the evaluation of the merge operator against two arbitrary
+result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
+|R_j| = k$, and that the contents of $R$ must be the $k$ records from
+$R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
+problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$
+time. Therefore, k-NN is not a decomposable search problem.
+\end{proof}
+
+With that said, it is clear that there isn't any fundamental restriction
+preventing the merging of the result sets; it is only the case that an
+arbitrary performance requirement wouldn't be satisfied. It is possible
+to merge the result sets in non-constant time, and so it is the case
+that k-NN is $C(n)$-decomposable. Unfortunately, this classification
+brings with it a reduction in query performance as a result of the way
+result merges are performed.
+
+As a concrete example of these costs, consider using the Bentley-Saxe
+method to extend the VPTree~\cite{vptree}. The VPTree is a static,
+metric index capable of answering k-NN queries in $KNN(D, q, k) \in O(k
+\log n)$. One possible merge algorithm for k-NN would be to push all
+of the elements in the two arguments onto a min-heap, and then pop off
+the first $k$. In this case, the cost of the merge operation would be
+$C(k) = k \log k$. Were $k$ assumed to be constant, then the operation
+could be considered to be constant-time. But given that $k$ is only
+bounded in size above by $n$, this isn't a safe assumption to make in
+general. Evaluating the total query cost for the extended structure,
+this would yield,
+
+\begin{equation}
+ KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
+\end{equation}
+
+The reason for this large increase in cost is the repeated application
+of the merge operator. The Bentley-Saxe method requires applying the
+merge operator in a binary fashion to each partial result, multiplying
+its cost by a factor of $\log n$. Thus, the constant-time requirement
+of standard decomposability is necessary to keep the cost of the merge
+operator from appearing within the complexity bound of the entire
+operation in the general case.\footnote {
+ There is a special case, noted by Overmars, where the total cost is
+ $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n))
+ \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the
+ case where the cost of the query and merge operation are sufficiently
+ large to consume the logarithmic factor, and so it doesn't represent
+ a special case with better performance.
+}
+If we could revise the result merging operation to remove this duplicated
+cost, we could greatly reduce the cost of supporting $C(n)$-decomposable
+queries.
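+
+For illustration, the heap-based merge operator described above can be
+sketched as follows (the function name and the one-dimensional distance
+function are assumptions made for this example). Its cost grows with
+$k$, which is precisely why it fails the constant-time requirement of
+Definition~\ref{def:dsp}.
+
+\begin{verbatim}
+# Sketch of a non-constant-time merge operator for k-NN: combine two
+# partial result sets by keeping the k results nearest to the query.
+import heapq
+
+def knn_merge(r1, r2, q, k, dist):
+    # Push both partial results onto a heap keyed by distance to q,
+    # then pop the k nearest: roughly k log k work, not constant time.
+    heap = [(dist(x, q), x) for x in r1 + r2]
+    heapq.heapify(heap)
+    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]
+
+dist = lambda x, q: abs(x - q)
+r1 = [3, 9, 12]       # partial k-NN results from two partitions
+r2 = [5, 6, 20]
+assert knn_merge(r1, r2, q=7, k=3, dist=dist) == [6, 5, 9]
+\end{verbatim}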
+
+\subsubsection{Independent Range Sampling}
+\label{ssec:background-irs}
+
+Another problem that is not decomposable is independent sampling. There
+are a variety of problems falling under this umbrella, including weighted
+set sampling, simple random sampling, and weighted independent range
+sampling, but we will focus on independent range sampling here.
+
+\begin{definition}[Independent Range Sampling~\cite{tao22}]
+ Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
+ interval $q = [x, y]$ and an integer $k$, an independent range
+ sampling query returns $k$ independent samples from $D \cap q$
+ with each point having equal probability of being sampled.
+\end{definition}
+
+This problem immediately encounters a category error when considering
+whether it is decomposable: the result set is randomized, whereas
+the conditions for decomposability are defined in terms of an exact
+matching of records in result sets. To work around this, a slight abuse
+of definition is in order: assume that the equality conditions within
+the DSP definition can be interpreted to mean ``the contents in the two
+sets are drawn from the same distribution''. This enables the category
+of DSP to apply to this type of problem.
+
+Even with this abuse, however, IRS cannot generally be considered
+decomposable; it is at best $C(n)$-decomposable. The reason for this is
+that matching the distribution requires drawing the appropriate number
+of samples from each partition of the data. Even in the special
+case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples
+from each partition that must appear in the result set cannot be known
+in advance due to differences in the selectivity of the predicate across
+the partitions.
+
+\begin{example}[IRS Sampling Difficulties]
+
+ Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}, D_1 =
+ \{1, 1, 1, 1, 3\}, D_2 = \{4, 4, 4, 4, 4\}$ using bag semantics and
+ an IRS query over the interval $[3, 4]$ with $k=12$. Because all three
+ partitions have the same size, it seems sensible to evenly distribute
+ the samples across them ($4$ samples from each partition). Applying
+ the query predicate to the partitions results in the following,
+ $d_0 = \{3, 4\}, d_1 = \{3\}, d_2 = \{4, 4, 4, 4, 4\}$.
+
+ In expectation, then, the first result set will contain $R_0 = \{3,
+ 3, 4, 4\}$ as it has a 50\% chance of sampling a $3$ and the same
+ probability of a $4$. The second and third result sets can only
+ be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging these
+ together, we'd find that the probability distribution of the sample
+ would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform
+ the same sampling operation over the full dataset (not partitioned),
+ the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
+
+\end{example}
+
+The problem is that the number of samples drawn from each partition needs to be
+weighted based on the number of elements satisfying the query predicate in that
+partition. In the above example, by drawing $4$ samples from $D_1$, more weight
+is given to $3$ than exists within the base dataset. This can be worked around
+by sampling a full $k$ records from each partition, returning both the sample
+and the number of records satisfying the predicate as that partition's query
+result, and then performing another pass of IRS as the merge operator, but this
+is the same approach as was used for k-NN above. This leaves IRS firmly in the
+$C(n)$-decomposable camp. If it were possible to pre-calculate the number of
+samples to draw from each partition, then a constant-time merge operation could
+be used.
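+
+The sketch below illustrates the idea (it is not a full dynamization;
+\texttt{random.choices} is simply used to weight partition selection by
+the pre-calculated number of predicate-matching records in each
+partition, which is exactly the information that the merge-based
+approach lacks in advance).
+
+\begin{verbatim}
+# Sketch: independent range sampling over a partitioned data set.
+# Knowing how many records satisfy the predicate in each partition
+# allows samples to be allocated so the distribution is preserved.
+import random
+
+def irs_over_partitions(partitions, lo, hi, k, rng=random):
+    matching = [[x for x in p if lo <= x <= hi] for p in partitions]
+    weights = [len(m) for m in matching]   # per-partition match counts
+    sample = []
+    for _ in range(k):
+        # Pick a partition in proportion to its matching count, then
+        # sample uniformly from within that partition.
+        (m,) = rng.choices(matching, weights=weights, k=1)
+        sample.append(rng.choice(m))
+    return sample
+
+partitions = [[1, 2, 3, 4, 5], [1, 1, 1, 1, 3], [4, 4, 4, 4, 4]]
+s = irs_over_partitions(partitions, 3, 4, k=12)
+# Each sampled value is 3 with probability 1/4 and 4 with probability
+# 3/4, matching the distribution over the unpartitioned data set.
+\end{verbatim}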
+
+\subsection{Configurability}
+
+\subsection{Insertion Tail Latency}
+
+
+\section{Conclusion}
+This chapter discussed the necessary background information pertaining to
+queries and search problems, indexes, and techniques for dynamic extension. It
+described the potential for using custom indexes for accelerating particular
+kinds of queries, as well as the challenges associated with constructing these
+indexes. The remainder of this document will seek to address these challenges
+through modification and extension of the Bentley-Saxe method, describing work
+that has already been completed, as well as the additional work that must be
+done to realize this vision.