authorDouglas Rumbaugh <dbr4@psu.edu>2025-05-14 18:18:05 -0400
committerDouglas Rumbaugh <dbr4@psu.edu>2025-05-14 18:18:05 -0400
commit5d6e1d8bfeba9ab7970948b81ff13d7b963948a1 (patch)
tree1d3eb0689be0a7fab9dfd4dafe2f3a5ee6fde821 /chapters/background.tex
parent40bff24fc2e2da57f382e4f49a5ffb7c826bbcfb (diff)
downloaddissertation-5d6e1d8bfeba9ab7970948b81ff13d7b963948a1.tar.gz
updates
Diffstat (limited to 'chapters/background.tex')
-rw-r--r--chapters/background.tex1027
1 files changed, 0 insertions, 1027 deletions
diff --git a/chapters/background.tex b/chapters/background.tex
index 8ad92a8..ef30685 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -1,167 +1,4 @@
\chapter{Background}
-\label{chap:background}
-
-This chapter will introduce important background information and
-existing work in the area of data structure dynamization. We will
-first discuss the concept of a search problem, which is central to
-dynamization techniques. While one might imagine that the restrictions
-on dynamization stem from the data structure being dynamized, in
-practice the requirements placed on the data structure are quite mild;
-it is the properties required of the search problem that the structure
-is used to answer which present the central difficulty in applying
-dynamization techniques to a given area. After this, database indices
-will be discussed briefly, as they are the primary use of data
-structures within the database context that is of interest to our work.
-Following this, existing theoretical results in the area of data
-structure dynamization will be discussed, which will serve as the
-building blocks for our techniques in subsequent chapters. The chapter
-will conclude with a discussion of some of the limitations of these
-existing techniques.
-
-\section{Queries and Search Problems}
-\label{sec:dsp}
-
-Data access lies at the core of most database systems. We want to ask
-questions of the data, and ideally get the answer efficiently. We
-will refer to the different types of question that can be asked as
-\emph{search problems}. We will be using this term in a similar way as
-the word \emph{query}\footnote{
-	The term query is often overloaded, referring to several related,
-	but slightly different, things. In the vernacular, a query can refer
-	to a) a general type of search problem (as in ``range query''), b) a
-	specific instance of a search problem, or c) a program written in a
-	query language.
-}
-is often used within the database systems literature: to refer to a
-general class of questions. For example, we could consider range scans,
-point-lookups, nearest neighbor searches, predicate filtering, random
-sampling, etc., to each be a general search problem. Formally, for the
-purposes of this work, a search problem is defined as follows,
-
-\begin{definition}[Search Problem]
- Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function
- $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched,
- $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the
-answer domain.\footnote{
- It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an
-example, a \texttt{COUNT} aggregation might map a set of strings onto
- an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need
-not be a universal constraint.
-}
-\end{definition}
-
-We will use the term \emph{query} to mean a specific instance of a search
-problem,
-
-\begin{definition}[Query]
- Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and
- a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific
- instance of the search problem, $F(\mathcal{D}, q)$.
-\end{definition}
-
-As an example of using these definitions, a \emph{membership test}
-or \emph{range scan} would be considered search problems, and a range
-scan over the interval $[10, 99]$ would be a query. We draw this
-distinction because, as will become clear in later chapters, it is
-useful to have separate, unambiguous terms for these two concepts.
-
-\subsection{Decomposable Search Problems}
-
-Dynamization techniques require partitioning one data structure
-into several smaller ones. As a result, these techniques can only be
-applied in situations where the search problem can be answered from
-this set of smaller data structures, yielding the same answer as would
-have been obtained had all of the data been used to construct a single,
-large structure. This requirement is formalized in
-the definition of a class of problems called \emph{decomposable search
-problems (DSP)}. This class was first defined by Bentley and Saxe in
-their work on dynamization, and we will adopt their definition,
-
-\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
- \label{def:dsp}
- A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and
- only if there exists a constant-time computable, associative, and
- commutative binary operator $\mergeop$ such that,
- \begin{equation*}
- F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
- \end{equation*}
- for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
-\end{definition}
-
-The requirement for $\mergeop$ to be constant-time was used by Bentley and
-Saxe to prove specific performance bounds for answering queries from a
-decomposed data structure. However, it is not strictly \emph{necessary},
-and later work by Overmars lifted this constraint and considered a more
-general class of search problems called \emph{$C(n)$-decomposable search
-problems},
-
-\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars-cn-decomp}]
- A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable
- if and only if there exists an $O(C(n))$-time computable, associative,
- and commutative binary operator $\mergeop$ such that,
- \begin{equation*}
- F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, q)
- \end{equation*}
- for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
-\end{definition}
-
-To demonstrate that a search problem is decomposable, it is necessary to
-show the existence of the merge operator, $\mergeop$, with the necessary
-properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B,
-q)$. With these two results, induction demonstrates that the problem is
-decomposable even in cases with more than two partial results.
-
-As an example, consider range counts,
-\begin{definition}[Range Count]
- \label{def:range-count}
- Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
- $ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns
- the cardinality, $|d \cap q|$.
-\end{definition}
-
-\begin{theorem}
-\label{ther:decomp-range-count}
-Range Count is a decomposable search problem.
-\end{theorem}
-
-\begin{proof}
-Let $\mergeop$ be addition ($+$). Applying this to
-Definition~\ref{def:dsp}, gives
-\begin{align*}
- |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
-\end{align*}
-which holds because intersection distributes over union and, since $A$
-and $B$ are disjoint, so are $A \cap q$ and $B \cap q$, meaning their
-cardinalities sum. Addition is an associative and commutative operator
-that can be calculated in $\Theta(1)$ time. Therefore, range counts
-are DSPs.
-\end{proof}
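-
-To make the decomposition concrete, the following Python sketch
-(illustrative only; the function and variable names are ours) answers a
-range count over several disjoint partitions and combines the partial
-results with $\mergeop = +$, matching the result of a count over the
-full data set.
-
-\begin{verbatim}
-# Illustrative sketch: range count as a decomposable search problem.
-def range_count(data, q):
-    lo, hi = q
-    return sum(1 for x in data if lo <= x <= hi)
-
-partitions = [[1, 5, 9], [12, 47], [63, 88, 99]]  # disjoint partitions of d
-q = (10, 99)
-
-# Merge the partial results with the constant-time operator (+).
-merged = sum(range_count(p, q) for p in partitions)
-full = range_count([x for p in partitions for x in p], q)
-assert merged == full == 5
-\end{verbatim}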
-
-Because the codomain of a DSP is not restricted, more complex output
-structures can be used to allow for problems that are not directly
-decomposable to be converted to DSPs, possibly with some minor
-post-processing. For example, calculating the arithmetic mean of a set
-of numbers can be formulated as a DSP,
-\begin{theorem}
-The calculation of the arithmetic mean of a set of numbers is a DSP.
-\end{theorem}
-\begin{proof}
-	Consider the search problem $A:\mathcal{D} \to \mathbb{R} \times \mathbb{Z}$,
-	where $\mathcal{D}\subset\mathbb{R}$ is a multi-set. The output tuple
-contains the sum of the values within the input set, and the
-cardinality of the input set. For two disjoint partitions of the data,
-$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$, and define
-$A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$.
-
-Because both the sum and the cardinality distribute over disjoint
-unions, applying Definition~\ref{def:dsp} gives
-\begin{align*}
-	A(D_1 \cup D_2) = (s_1 + s_2, c_1 + c_2) = A(D_1) \mergeop A(D_2) = (s, c)
-\end{align*}
-and $\mergeop$ is clearly associative, commutative, and computable in
-constant time.
-From this result, the average can be determined in constant time by
-taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
-of numbers is a DSP.
-\end{proof}
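-
-The same construction is easy to express in code. The Python sketch
-below (again illustrative, with names of our own choosing) carries
-$(s, c)$ pairs as partial results and performs the division only as a
-final post-processing step.
-
-\begin{verbatim}
-# Illustrative sketch: the mean expressed as a DSP via (sum, count) pairs.
-def partial_mean(data):
-    return (sum(data), len(data))      # A(d): (sum, cardinality), not the mean
-
-def merge(r1, r2):
-    # Constant-time, associative, commutative merge operator.
-    return (r1[0] + r2[0], r1[1] + r2[1])
-
-partitions = [[2.0, 4.0], [6.0], [8.0, 10.0]]
-s, c = (0.0, 0)
-for p in partitions:
-    s, c = merge((s, c), partial_mean(p))
-assert s / c == 6.0                    # post-processing: the mean of 2,4,...,10
-\end{verbatim}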
-
\section{Database Indexes}
@@ -241,7 +78,6 @@ and the log-structured merge (LSM) tree~\cite{oneil96} is also often
used within the context of key-value stores~\cite{rocksdb}. Some databases
implement unordered indices using hash tables~\cite{mysql-btree-hash}.
-
\subsection{The Generalized Index}
The previous section discussed the classical definition of index
@@ -359,866 +195,3 @@ that has been designed specifically for answering particular types of queries
%done surrounding the use of arbitrary indexes in queries in the past,
%such as~\cite{byods-datalog}. This problem is considered out-of-scope
%for the proposed work, but will be considered in the future.
-
-\section{Classical Dynamization Techniques}
-
-Because data in a database is regularly updated, data structures
-intended to be used as an index must support updates (inserts, in-place
-modification, and deletes). Not all potentially useful data structures
-support updates, and so a general strategy for adding update support
-would increase the number of data structures that could be used as
-database indices. We refer to a data structure with update support as
-\emph{dynamic}, and one without update support as \emph{static}.\footnote{
- The term static is distinct from immutable. Static refers to the
- layout of records within the data structure, whereas immutable
- refers to the data stored within those records. This distinction
- will become relevant when we discuss different techniques for adding
- delete support to data structures. The data structures used are
- always static, but not necessarily immutable, because the records may
- contain header information (like visibility) that is updated in place.
-}
-
-This section discusses \emph{dynamization}, the construction of a
-dynamic data structure based on an existing static one. When certain
-conditions are satisfied by the data structure and its associated
-search problem, this process can be done automatically, and with
-provable asymptotic bounds on amortized insertion performance, as well
-as worst case query performance. This is in contrast to the manual
-design of dynamic data structures, which involves techniques based on
-partially rebuilding small portions of a single data structure (called
-\emph{local reconstruction})~\cite{overmars83}. This is a very high-cost
-intervention that requires significant effort on the part of the data
-structure designer, whereas conventional dynamization can be performed
-with little-to-no modification of the underlying data structure at all.
-
-It is worth noting that there are a variety of techniques
-discussed in the literature for dynamizing structures with specific
-properties, or under very specific sets of circumstances. Examples
-include frameworks for adding update support to succinct data
-structures~\cite{dynamize-succinct} or taking advantage of batching
-of insert and query operations~\cite{batched-decomposable}. This
-section discusses techniques that are more general, and don't require
-workload-specific assumptions.
-
-We will first discuss the necessary data structure requirements, and
-then examine several classical dynamization techniques. The section
-will conclude with a discussion of delete support within the context
-of these techniques. For more detail than is included in this chapter,
-Overmars wrote a book providing a comprehensive survey of techniques for
-creating dynamic data structures, including not only the dynamization
-techniques discussed here, but also local reconstruction based
-techniques and more~\cite{overmars83}.\footnote{
- Sadly, this book isn't readily available in
- digital format as of the time of writing.
-}
-
-
-\subsection{Global Reconstruction}
-
-The most fundamental dynamization technique is that of \emph{global
-reconstruction}. While not particularly useful on its own, global
-reconstruction serves as the basis for the techniques to follow, and so
-we will begin our discussion of dynamization with it.
-
-Consider a class of data structures, $\mathcal{I}$, capable of answering a
-search problem, $F$. Insertion via global reconstruction is
-possible if $\mathcal{I}$ supports the following two operations,
-\begin{align*}
-\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
-\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
-\end{align*}
-where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$
-of the data structure over a set of records $d \subseteq \mathcal{D}$
-in $B(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d
-\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in
-$\Theta(1)$ time,\footnote{
- There isn't any practical reason why $\mathtt{unbuild}$ must run
- in constant time, but this is the assumption made in \cite{saxe79}
- and in subsequent work based on it, and so we will follow the same
- definition here.
-} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.
-
-Given this structure, an insert of record $r \in \mathcal{D}$ into a
-data structure $\mathscr{I} \in \mathcal{I}$ can be defined by,
-\begin{align*}
-\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) \cup \{r\})
-\end{align*}
-
-It goes without saying that this operation is sub-optimal, as the
-insertion cost is $\Theta(B(n))$, and $B(n) \in \Omega(n)$ at best for
-most data structures. However, this global reconstruction strategy can
-be used as a primitive for more sophisticated techniques that can provide
-reasonable performance.
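-
-As a minimal illustration of the idea, the following Python sketch (the
-names are ours) wraps a sorted array, standing in for an arbitrary
-static structure with $B(n) \in \Theta(n \log n)$, behind the
-$\mathtt{build}$/$\mathtt{unbuild}$ interface described above.
-
-\begin{verbatim}
-# Illustrative sketch: insertion by global reconstruction.
-class SortedArray:
-    def __init__(self, records):
-        self.records = sorted(records)    # "static" build: O(n log n)
-
-def build(records):
-    return SortedArray(records)
-
-def unbuild(struct):
-    return list(struct.records)           # recover the underlying records
-
-def insert(struct, r):
-    # Every insert rebuilds the entire structure from scratch.
-    return build(unbuild(struct) + [r])
-
-s = build([5, 1, 9])
-s = insert(s, 4)
-assert s.records == [1, 4, 5, 9]
-\end{verbatim}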
-
-\subsection{Amortized Global Reconstruction}
-\label{ssec:agr}
-
-The problem with global reconstruction is that each insert must rebuild
-the entire data structure, involving all of its records. This results
-in a worst-case insert cost of $\Theta(B(n))$. However, opportunities
-for improving this scheme can present themselves when considering the
-\emph{amortized} insertion cost.
-
-Consider the cost accrued by the dynamized structure under global
-reconstruction over the lifetime of the structure. Each insert rebuilds
-the entire structure at a cost of $\Theta(B(n))$, so a sequence of $n$
-inserts costs $\Theta(n \cdot B(n))$ in total. We can amortize this cost
-over the $n$ records inserted to get an amortized insertion cost for
-global reconstruction of,
-
-\begin{equation*}
-I_a(n) = \frac{B(n) \cdot n}{n} = B(n)
-\end{equation*}
-
-This doesn't improve things as is, however it does present two
-opportunities for improvement. If we could either reduce the size of
-the reconstructions, or the number of times a record is reconstructed,
-then we could reduce the amortized insertion cost.
-
-The key insight, first discussed by Bentley and Saxe, is that
-both of these goals can be accomplished by \emph{decomposing} the
-data structure into multiple, smaller structures, each built from
-a disjoint partition of the data. As long as the search problem
-being considered is decomposable, queries can be answered from
-this structure with bounded worst-case overhead, and the amortized
-insertion cost can be improved~\cite{saxe79}. Significant theoretical
-work exists in evaluating different strategies for decomposing the
-data structure~\cite{saxe79, overmars81, overmars83} and for leveraging
-specific efficiencies of the data structures being considered to improve
-these reconstructions~\cite{merge-dsp}.
-
-There are two general decomposition techniques that emerged from
-this work. The earliest of these is the logarithmic method, often
-called the Bentley-Saxe method in modern literature, and is the most
-commonly discussed technique today. The Bentley-Saxe method has been
-directly applied in a few instances in the literature, such as to
-metric indexing structures~\cite{naidan14} and spatial structures~\cite{bkdtree},
-and has also been used in a modified form for genetic sequence search
-structures~\cite{almodaresi23} and graphs~\cite{lsmgraph}, to cite a few
-examples.
-
-A later technique, the equal block method, was also developed. It is
-generally not as effective as the Bentley-Saxe method, and as a result
-we have not identified any specific applications of it outside of the
-theoretical literature. However, we will discuss it as well in the
-interest of completeness, and because it lends itself well to
-demonstrating certain properties of decomposition-based dynamization
-techniques.
-
-\subsection{Equal Block Method}
-\label{ssec:ebm}
-
-Though chronologically later, the equal block method is theoretically a
-bit simpler, and so we will begin our discussion of decomposition-based
-techniques for the dynamization of decomposable search problems with it.
-There have been several proposed variations of this concept~\cite{maurer79,
-maurer80}, but we will focus on the most developed form as described by
-Overmars and van Leeuwen~\cite{overmars-art-of-dyn, overmars83}. The core
-concept of the equal block method is to decompose the data structure
-into several smaller data structures, called blocks, over partitions
-of the data. This decomposition is performed such that each block is of
-roughly equal size.
-
-Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves
-some decomposable search problem, $F$, and is built over a set of records
-$d \subseteq \mathcal{D}$. This structure can be decomposed into $s$ blocks,
-$\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_s$ each built over
-partitions of $d$, $d_1, d_2, \ldots, d_s$. Fixing $s$ to a specific value
-makes little sense when the number of records changes, and so it is taken
-to be governed by a smooth, monotonically increasing function $f(n)$ such
-that, at any point, the following two constraints are obeyed.
-\begin{align}
- f\left(\frac{n}{2}\right) \leq s \leq f(2n) \label{ebm-c1}\\
- \forall_{1 \leq j \leq s} \quad | \mathscr{I}_j | \leq \frac{2n}{s} \label{ebm-c2}
-\end{align}
-where $|\mathscr{I}_j|$ is the number of records in the block,
-$|\text{unbuild}(\mathscr{I}_j)|$.
-
-A new record is inserted by finding the smallest block and rebuilding it
-using the new record. If $k = \argmin_{1 \leq j \leq s}(|\mathscr{I}_j|)$,
-then an insert is done by,
-\begin{equation*}
-\mathscr{I}_k^\prime = \text{build}(\text{unbuild}(\mathscr{I}_k) \cup \{r\})
-\end{equation*}
-Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\footnote{
- Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
- violated by deletes. We're omitting deletes from the discussion at
- this point, but will circle back to them in Section~\ref{sec:deletes}.
-} In this case, the constraints are enforced by ``re-configuring'' the
-structure. $s$ is updated to be exactly $f(n)$, all of the existing
-blocks are unbuilt, and then the records are redistributed evenly into
-$s$ blocks.
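-
-The insertion and reconfiguration logic can be sketched as follows (an
-illustrative Python implementation only; plain lists stand in for the
-static blocks, and $f(n) = \lceil \sqrt{n}\,\rceil$ is an arbitrary
-choice of governing function).
-
-\begin{verbatim}
-import math
-
-# Illustrative equal block method sketch.
-def f(n):
-    return max(1, math.ceil(math.sqrt(n)))
-
-class EqualBlocks:
-    def __init__(self):
-        self.blocks = [[]]
-        self.n = 0
-
-    def reconfigure(self):
-        # "Unbuild" every block and redistribute the records evenly
-        # across exactly s = f(n) blocks.
-        records = [r for b in self.blocks for r in b]
-        s = f(self.n)
-        size = math.ceil(self.n / s)
-        self.blocks = [records[i * size:(i + 1) * size] for i in range(s)]
-
-    def insert(self, r):
-        self.n += 1
-        # Rebuild the smallest block with the new record included.
-        min(self.blocks, key=len).append(r)
-        s = len(self.blocks)
-        if not (f(self.n / 2) <= s <= f(2 * self.n)):
-            self.reconfigure()
-\end{verbatim}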
-
-A query with parameters $q$ is answered by this structure by individually
-querying the blocks, and merging the local results together with $\mergeop$,
-\begin{equation*}
-F(\mathscr{I}, q) = \bigmergeop_{j=1}^{s}F(\mathscr{I}_j, q)
-\end{equation*}
-where $F(\mathscr{I}, q)$ is a slight abuse of notation, referring to
-answering the query over $d$ using the data structure $\mathscr{I}$.
-
-This technique provides better amortized performance bounds than global
-reconstruction, at the possible cost of worse query performance for
-sub-linear queries. We'll omit the details of the proof of performance
-for brevity and streamline some of the original notation (full details
-can be found in~\cite{overmars83}), but this technique ultimately
-results in a data structure with the following performance characteristics,
-\begin{align*}
-\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n} + B\left(\frac{n}{f(n)}\right)\right) \\
-\text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \\
-\end{align*}
-where $B(n)$ is the cost of statically building $\mathcal{I}$, and
-$\mathscr{Q}(n)$ is the cost of answering $F$ using $\mathcal{I}$.
-
-%TODO: example?
-
-
-\subsection{The Bentley-Saxe Method~\cite{saxe79}}
-\label{ssec:bsm}
-
-The original, and most frequently used, dynamization technique is the
-Bentley-Saxe Method (BSM), also called the logarithmic method in older
-literature. Rather than breaking the data structure into equally sized
-blocks, BSM decomposes the structure into logarithmically many blocks
-of exponentially increasing size. More specifically, the data structure
-is decomposed into $h = \lceil \log_2 n \rceil$ blocks, $\mathscr{I}_0,
-\mathscr{I}_1, \ldots, \mathscr{I}_{h-1}$. A given block $\mathscr{I}_i$
-will either be empty, or contain exactly $2^i$ records within it.
-
-The procedure for inserting a record, $r \in \mathcal{D}$, into
-a BSM dynamization is as follows. If the block $\mathscr{I}_0$
-is empty, then $\mathscr{I}_0 = \text{build}(\{r\})$. If it is not
-empty, then there will exist a maximal sequence of non-empty blocks
-$\mathscr{I}_0, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq
-0$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case,
-$\mathscr{I}_{i+1}$ is set to $\text{build}(\{r\} \cup \bigcup_{l=0}^i
-\text{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_0$ through
-$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the
-end of the structure as needed.
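-
-This procedure closely mirrors incrementing a binary counter, as the
-following Python sketch illustrates (illustrative only; sorted lists
-stand in for an arbitrary static structure behind $\mathtt{build}$ and
-$\mathtt{unbuild}$).
-
-\begin{verbatim}
-# Illustrative Bentley-Saxe insertion sketch: blocks[i] is either None
-# (empty) or a "static structure" holding exactly 2^i records.
-def build(records):
-    return sorted(records)
-
-def unbuild(block):
-    return list(block)
-
-def bsm_insert(blocks, r):
-    carried = [r]
-    for i in range(len(blocks)):
-        if blocks[i] is None:
-            blocks[i] = build(carried)   # first empty block absorbs the carry
-            return
-        carried += unbuild(blocks[i])    # gather records from full blocks...
-        blocks[i] = None                 # ...and empty them
-    blocks.append(build(carried))        # all blocks full: add a new level
-
-blocks = []
-for r in range(1, 13):                   # insert 12 records
-    bsm_insert(blocks, r)
-# 12 = 0b1100: blocks 0 and 1 are empty, blocks 2 and 3 are full.
-assert [b is not None for b in blocks] == [False, False, True, True]
-\end{verbatim}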
-
-%FIXME: switch the x's to r's for consistency
-\begin{figure}
-\centering
-\includegraphics[width=.8\textwidth]{diag/bsm.pdf}
-\caption{An illustration of inserts into the Bentley-Saxe Method}
-\label{fig:bsm-example}
-\end{figure}
-
-Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The
-dynamization is built over a set of records $x_1, x_2, \ldots,
-x_{10}$ initially, with eight records in $\mathscr{I}_3$ and two in
-$\mathscr{I}_1$. The first new record, $x_{11}$, is inserted directly
-into $\mathscr{I}_0$. For the next insert following this, $x_{12}$, the
-first empty block is $\mathscr{I}_2$, and so the insert is performed by
-doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup
-\text{unbuild}(\mathscr{I}_0) \cup \text{unbuild}(\mathscr{I}_1)\right)$
-and then emptying $\mathscr{I}_0$ and $\mathscr{I}_1$.
-
-This technique is called a \emph{binary decomposition} of the data
-structure. Considering a BSM dynamization of a structure containing $n$
-records, labeling each block with a $0$ if it is empty and a $1$ if it
-is full will result in the binary representation of $n$. For example,
-the final state of the structure in Figure~\ref{fig:bsm-example} contains
-$12$ records, and the labeling procedure will result in $0\text{b}1100$,
-which is $12$ in binary. Inserts affect this representation of the
-structure in the same way that incrementing the binary number by $1$ does.
-
-By applying BSM to a data structure, a dynamized structure can be created
-with the following performance characteristics,
-\begin{align*}
-\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{B(n)}{n}\cdot \log_2 n\right) \\
-\text{Worst Case Insertion Cost:}&\quad \Theta\left(B(n)\right) \\
-\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right) \\
-\end{align*}
-This is a particularly attractive result because, for example, a data
-structure having $B(n) \in \Theta(n)$ will have an amortized insertion
-cost of $\Theta(\log_2 n)$, which is quite reasonable. The trade-off for this
-is an extra logarithmic factor attached to the query complexity. It is
-also worth noting that the worst-case insertion cost remains the same
-as global reconstruction, but this case arises only very rarely. If
-you consider the binary decomposition representation, the worst-case
-behavior is triggered each time the existing number overflows, and a
-new digit must be added.
-
-As a final note about the query performance of this structure, because
-the overhead due to querying the blocks is logarithmic, under certain
-circumstances this cost can be absorbed, resulting in no effect on the
-asymptotic worst-case query performance. As an example, consider a linear
-scan of the data running in $\Theta(n)$ time. In this case, every record
-must be considered, and so there isn't any performance penalty\footnote{
- From an asymptotic perspective. There will still be measurable performance
- effects from caching, etc., even in this case.
-} to breaking the records out into multiple chunks and scanning them
-individually. More formally, for any query running in $\mathscr{Q}(n) \in
-\Omega\left(n^\epsilon\right)$ time where $\epsilon > 0$, the worst-case
-cost of answering a decomposable search problem from a BSM dynamization
-is $\Theta\left(\mathscr{Q}(n)\right)$.~\cite{saxe79}
-
-\subsection{Merge Decomposable Search Problems}
-
-\subsection{Delete Support}
-
-Classical dynamization techniques have also been developed with
-support for deleting records. In general, the same technique of global
-reconstruction that was used for inserting records can also be used to
-delete them. Given a record $r \in \mathcal{D}$ and a data structure
-$\mathscr{I} \in \mathcal{I}$ such that $r \in \mathscr{I}$, $r$ can be
-deleted from the structure in $\Theta(B(n))$ time as follows,
-\begin{equation*}
-\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) - \{r\})
-\end{equation*}
-However, supporting deletes within the dynamization schemes discussed
-above is more complicated. The core problem is that inserts affect the
-dynamized structure in a deterministic way, and as a result certain
-partitioning schemes can be leveraged to reason about the
-performance. But, deletes do not work like this.
-
-\begin{figure}
-\caption{A Bentley-Saxe dynamization for the integers on the
-interval $[1, 100]$.}
-\label{fig:bsm-delete-example}
-\end{figure}
-
-For example, consider a Bentley-Saxe dynamization that contains all
-integers on the interval $[1, 100]$, inserted in that order, shown in
-Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the
-records from this structure, one at a time, using global reconstruction.
-This presents several problems,
-\begin{itemize}
- \item For each record, we need to identify which block it is in before
- we can delete it.
- \item The cost of performing a delete is a function of which block the
- record is in, which is a question of distribution and not easily
- controlled.
- \item As records are deleted, the structure will potentially violate
- the invariants of the decomposition scheme used, which will
- require additional work to fix.
-\end{itemize}
-
-To resolve these difficulties, two very different approaches have been
-proposed for supporting deletes, each of which relies on certain properties
-of the search problem and data structure. These are the use of a ghost
-structure and weak deletes.
-
-\subsubsection{Ghost Structure for Invertible Search Problems}
-
-The first proposed mechanism for supporting deletes was discussed
-alongside the Bentley-Saxe method in Bentley and Saxe's original
-paper. This technique applies to a class of search problems called
-\emph{invertible} (also called \emph{decomposable counting problems}
-in later literature~\cite{overmars83}). Invertible search problems
-are decomposable, and also support an ``inverse'' merge operator, $\Delta$,
-that is able to remove records from the result set. More formally,
-\begin{definition}[Invertible Search Problem~\cite{saxe79}]
-\label{def:invert}
-A decomposable search problem, $F$ is invertible if and only if there
-exists a constant time computable operator, $\Delta$, such that
-\begin{equation*}
-F(A / B, q) = F(A, q)~\Delta~F(B, q)
-\end{equation*}
-for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $B \subseteq A$.
-\end{definition}
-
-Given a search problem with this property, it is possible to perform
-deletes by creating a secondary ``ghost'' structure. When a record
-is to be deleted, it is inserted into this structure. Then, when the
-dynamization is queried, this ghost structure is queried as well as the
-main one. The results from the ghost structure can be removed from the
-result set using the inverse merge operator. This simulates the result
-that would have been obtained had the records been physically removed
-from the main structure.
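-
-For range count, taking $\Delta$ to be subtraction, the scheme can be
-sketched as follows (an illustrative Python sketch; in a real
-dynamization the main and ghost structures would themselves be
-decomposed into blocks, but they are shown here as flat lists).
-
-\begin{verbatim}
-# Illustrative "ghost structure" sketch for an invertible search problem.
-def range_count(data, q):
-    lo, hi = q
-    return sum(1 for x in data if lo <= x <= hi)
-
-main = [3, 7, 12, 47, 63]    # every record ever inserted
-ghost = []                   # every record ever deleted
-
-def delete(r):
-    ghost.append(r)          # a delete is just an insert into the ghost
-
-def query(q):
-    # Answer both structures and cancel deleted records with delta (-).
-    return range_count(main, q) - range_count(ghost, q)
-
-delete(12)
-assert query((0, 100)) == 4  # 12 is no longer counted, though still in main
-\end{verbatim}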
-
-Two examples of invertible search problems are set membership
-and range count. Range count was formally defined in
-Definition~\ref{def:range-count}.
-
-\begin{theorem}
-Range count is an invertible search problem.
-\end{theorem}
-
-\begin{proof}
-To prove that range count is an invertible search problem, it must be
-decomposable and have a $\Delta$ operator. That it is a DSP has already
-been proven in Theorem~\ref{ther:decomp-range-count}.
-
-Let $\Delta$ be subtraction $(-)$. Applying this to Definition~\ref{def:invert}
-gives,
-\begin{equation*}
-|(A / B) \cap q | = |(A \cap q) / (B \cap q)| = |(A \cap q)| - |(B \cap q)|
-\end{equation*}
-which is true because intersection distributes over set difference, and
-$B \subseteq A$ ensures $(B \cap q) \subseteq (A \cap q)$, so the
-cardinalities may be subtracted. Subtraction is computable in constant
-time, therefore range count is an invertible search problem using
-subtraction as $\Delta$.
-\end{proof}
-
-The set membership search problem is defined as follows,
-\begin{definition}[Set Membership]
-\label{def:set-membership}
-Consider a set of elements $d \subseteq \mathcal{D}$ from some domain,
-and a single element $r \in \mathcal{D}$. A test of set membership is a
-search problem of the form $F: (\mathcal{PS}(\mathcal{D}), \mathcal{D})
-\to \mathbb{B}$ such that $F(d, r) = r \in d$, which maps to $0$ if $r
-\not\in d$ and $1$ if $r \in d$.
-\end{definition}
-
-\begin{theorem}
-Set membership is an invertible search problem.
-\end{theorem}
-
-\begin{proof}
-To prove that set membership is invertible, it is necessary to establish
-that it is a decomposable search problem, and that a $\Delta$ operator
-exists. We'll begin with the former.
-\begin{lemma}
- \label{lem:set-memb-dsp}
- Set membership is a decomposable search problem.
-\end{lemma}
-\begin{proof}
-Let $\mergeop$ be the logical disjunction ($\lor$). This yields,
-\begin{align*}
-F(A \cup B, r) &= F(A, r) \lor F(B, r) \\
-r \in (A \cup B) &= (r \in A) \lor (r \in B)
-\end{align*}
-which is true, following directly from the definition of union. The
-logical disjunction is an associative, commutative operator that can
-be calculated in $\Theta(1)$ time. Therefore, set membership is a
-decomposable search problem.
-\end{proof}
-
-For the inverse merge operator, $\Delta$, it is necessary that $F(A,
-r) ~\Delta~F(B, r)$ be true \emph{only} if $r \in A$ and $r \not\in
-B$. Thus, it could be directly implemented as $F(A, r)~\Delta~F(B, r) =
-F(A, r) \land \neg F(B, r)$, which is constant time if
-the operands are already known.
-
-Thus, we have shown that set membership is a decomposable search problem,
-and that a constant time $\Delta$ operator exists. Therefore, it is an
-invertible search problem.
-\end{proof}
-
-For search problems such as these, this technique allows for deletes to be
-supported with the same cost as an insert. Unfortunately, it suffers from
-write amplification because each deleted record is recorded twice--one in
-the main structure, and once in the ghost structure. This means that $n$
-is, in effect, the total number of records and deletes. This can lead
-to some serious problems, for example if every record in a structure
-of $n$ records is deleted, the net result will be an "empty" dynamized
-data structure containing $2n$ physical records within it. To circumvent
-this problem, Bentley and Saxe proposed a mechanism of setting a maximum
-threshold for the size of the ghost structure relative to the main one,
-and performing a complete re-partitioning of the data once this threshold
-is reached, removing all deleted records from the main structure,
-emptying the ghost structure, and rebuilding blocks with the records
-that remain according to the invariants of the technique.
-
-\subsubsection{Weak Deletes for Deletion Decomposable Search Problems}
-
-Another approach for supporting deletes was proposed later, by Overmars
-and van Leeuwen, for a class of search problem called \emph{deletion
-decomposable}. These are decomposable search problems for which the
-underlying data structure supports a delete operation. More formally,
-
-\begin{definition}[Deletion Decomposable Search Problem~\cite{merge-dsp}]
- A decomposable search problem, $F$, and its data structure,
- $\mathcal{I}$, is deletion decomposable if and only if, for some
- instance $\mathscr{I} \in \mathcal{I}$, containing $n$ records,
- there exists a deletion routine $\mathtt{delete}(\mathscr{I},
- r)$ that removes some $r \in \mathcal{D}$ in time $D(n)$ without
- increasing the query time, deletion time, or storage requirement,
- for $\mathscr{I}$.
-\end{definition}
-
-Superficially, this doesn't appear very useful. If the underlying data
-structure already supports deletes, there isn't much reason to use a
-dynamization technique to add deletes to it. However, one point worth
-mentioning is that it is possible, in many cases, to easily \emph{add}
-delete support to a static structure. If it is possible to locate a
-record and somehow mark it as deleted, without removing it from the
-structure, and then efficiently ignore these records while querying,
-then the given structure and its search problem can be said to be
-deletion decomposable. This technique for deleting records is called
-\emph{weak deletes}.
-
-\begin{definition}[Weak Deletes~\cite{overmars81}]
-\label{def:weak-delete}
-A data structure is said to support weak deletes if it provides a
-routine, \texttt{delete}, that guarantees that after $\alpha \cdot n$
-deletions, where $\alpha < 1$, the query cost is bounded by $k_\alpha
-\mathscr{Q}(n)$ for some constant $k_\alpha$ dependent only upon $\alpha$,
-where $\mathscr{Q}(n)$ is the cost of answering the query against a
-structure upon which no weak deletes were performed.\footnote{
- This paper also provides a similar definition for weak updates,
- but these aren't of interest to us in this work, and so the above
- definition was adapted from the original with the weak update
- constraints removed.
-} The results of the query of a block containing weakly deleted records
-should be the same as the results would be against a block with those
-records removed.
-\end{definition}
-
-As an example of a deletion decomposable search problem, consider the set
-membership problem considered above (Definition~\ref{def:set-membership})
-where $\mathcal{I}$, the data structure used to answer queries of the
-search problem, is a hash map.\footnote{
- While most hash maps are already dynamic, and so wouldn't need
- dynamization to be applied, there do exist static ones too. For example,
- the hash map being considered could be implemented using perfect
- hashing~\cite{perfect-hashing}, which has many static implementations.
-}
-
-\begin{theorem}
- The set membership problem, answered using a static hash map, is
- deletion decomposable.
-\end{theorem}
-
-\begin{proof}
-We've already shown in Lemma~\ref{lem:set-memb-dsp} that set membership
-is a decomposable search problem. For it to be deletion decomposable,
-we must demonstrate that the hash map, $\mathcal{I}$, supports deleting
-records without hurting its query performance, delete performance, or
-storage requirements. Assume that an instance $\mathscr{I} \in
-\mathcal{I}$ having $|\mathscr{I}| = n$ can answer queries in
-$\mathscr{Q}(n) \in \Theta(1)$ time and requires $\Omega(n)$ storage.
-
-Such a structure can support weak deletes. Each record within the
-structure has a single bit attached to it, indicating whether it has
-been deleted or not. These bits will require $\Theta(n)$ storage and
-be initialized to 0 when the structure is constructed. A delete can
-be performed by querying the structure for the record to be deleted in
-$\Theta(1)$ time, and setting the bit to 1 if the record is found. This
-operation has $D(n) \in \Theta(1)$ cost.
-
-\begin{lemma}
-\label{lem:weak-deletes}
-The delete procedure as described above satisfies the requirements of
-Definition~\ref{def:weak-delete} for weak deletes.
-\end{lemma}
-\begin{proof}
-Per Definition~\ref{def:weak-delete}, there must exist some constant
-dependent only on $\alpha$, $k_\alpha$, such that after $\alpha \cdot
-n$ deletes against $\mathscr{I}$ with $\alpha < 1$, the query cost is
-bounded by $k_\alpha \mathscr{Q}(n)$.
-
-In this case, $\mathscr{Q}(n) \in \Theta(1)$, and therefore our final
-query cost must be bounded by $\Theta(k_\alpha)$. When a query is
-executed against $\mathscr{I}$, there are three possible cases,
-\begin{enumerate}
-\item The record being searched for does not exist in $\mathscr{I}$. In
-this case, the query result is 0.
-\item The record being searched for does exist in $\mathscr{I}$ and has
-a delete bit value of 0. In this case, the query result is 1.
-\item The record being searched for does exist in $\mathscr{I}$ and has
-a delete bit value of 1 (i.e., it has been deleted). In this case, the
-query result is 0.
-\end{enumerate}
-In all three cases, the addition of deletes requires only $\Theta(1)$
-extra work at most. Therefore, set membership over a static hash map
-using our proposed deletion mechanism satisfies the requirements for
-weak deletes, with $k_\alpha = 1$.
-\end{proof}
-
-Finally, we note that the cost of one of these weak deletes is $D(n)
-= \mathscr{Q}(n) \in \Theta(1)$. By Lemma~\ref{lem:weak-deletes}, the
-query cost, and therefore the delete cost as well, is not asymptotically
-harmed by the presence of deleted records.
-
-Thus, we've shown that set membership using a static hash map is a
-decomposable search problem, the storage cost remains $\Omega(n)$ and the
-query and delete costs are unaffected by the presence of deletes using the
-proposed mechanism. All of the requirements of deletion decomposability
-are satisfied, therefore set membership using a static hash map is a
-deletion decomposable search problem.
-\end{proof}
-
-For such problems, deletes can be supported by first identifying the
-block in the dynamization containing the record to be deleted, and
-then calling $\mathtt{delete}$ on it. In order to allow this block to
-be easily located, it is possible to maintain a hash table over all
-of the records, alongside the dynamization, which maps each record
-onto the block containing it. This table must be kept up to date as
-reconstructions occur, but this can be done at no extra asymptotic costs
-for any data structures having $B(n) \in \Omega(n)$, as it requires only
-linear time. This allows for deletes to be performed in $\mathscr{D}(n)
-\in \Theta(D(n))$ time.
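-
-This bookkeeping can be sketched as follows (an illustrative Python
-sketch; the block is reduced to a mapping from records to tombstone
-bits, and \texttt{locator} is the auxiliary hash table described above).
-
-\begin{verbatim}
-# Illustrative weak-delete sketch for set membership.
-class Block:
-    def __init__(self, records):
-        self.deleted = {r: False for r in records}  # record -> tombstone bit
-
-    def weak_delete(self, r):
-        if r in self.deleted:
-            self.deleted[r] = True                  # D(n) = Q(n) work
-
-    def contains(self, r):
-        # Membership query that ignores weakly deleted records.
-        return r in self.deleted and not self.deleted[r]
-
-blocks = [Block([1, 2]), Block([3, 4, 5, 6])]
-locator = {r: b for b in blocks for r in b.deleted}  # rebuilt on reconstruction
-
-def delete(r):
-    locator[r].weak_delete(r)                        # O(1) block lookup
-
-delete(4)
-assert not blocks[1].contains(4) and blocks[1].contains(5)
-\end{verbatim}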
-
-The presence of deleted records within the structure does introduce a
-new problem, however. Over time, the number of records in each block will
-drift away from the requirements imposed by the dynamization technique. It
-will eventually become necessary to re-partition the records to restore
-these invariants, which are necessary for bounding the number of blocks,
-and thereby the query performance. The particular invariant maintenance
-rules depend upon the decomposition scheme used.
-
-\Paragraph{Bentley-Saxe Method.} When creating a BSM dynamization for
-a deletion decomposable search problem, the $i$th block where $i \geq 2$\footnote{
- Block $i=0$ will only ever have one record, so no special maintenance must be
- done for it. A delete will simply empty it completely.
-},
-in the absence of deletes, will contain $2^{i-1} + 1$ records. When a
-delete occurs in block $i$, no special action is taken until the number
-of records in that block falls below $2^{i-2}$. Once this threshold is
-reached, a reconstruction can be performed to restore the appropriate
-record counts in each block.~\cite{merge-dsp}
-
-\Paragraph{Equal Block Method.} For the equal block method, there are
-two cases in which a delete may cause a block to fail to obey the method's
-size invariants,
-\begin{enumerate}
- \item If enough records are deleted, it is possible for the number
- of blocks to exceed $f(2n)$, violating Invariant~\ref{ebm-c1}.
- \item The deletion of records may cause the maximum size of each
- block to shrink, causing some blocks to exceed the maximum capacity
- of $\nicefrac{2n}{s}$. This is a violation of Invariant~\ref{ebm-c2}.
-\end{enumerate}
-In both cases, it should be noted that $n$ is decreased as records are
-deleted. Should either of these cases emerge as a result of a delete,
-the entire structure must be reconfigured to ensure that its invariants
-are maintained. This reconfiguration follows the same procedure as when
-an insert results in a violation: $s$ is updated to be exactly $f(n)$, all
-existing blocks are unbuilt, and then the records are evenly redistributed
-into the $s$ blocks.~\cite{overmars-art-of-dyn}
-
-
-\subsection{Worst-Case Optimal Techniques}
-
-
-\section{Limitations of Classical Dynamization Techniques}
-\label{sec:bsm-limits}
-
-While fairly general, these dynamization techniques have a number of
-limitations that prevent them from being directly usable as a general
-solution to the problem of creating database indices. Because of the
-requirement that the query being answered be decomposable, many search
-problems cannot be addressed--or at least not efficiently addressed--by
-decomposition-based dynamization. The techniques also do nothing to reduce
-the worst-case insertion cost, resulting in extremely poor tail latency
-performance relative to hand-built dynamic structures. Finally, these
-approaches do not do a good job of exposing the underlying configuration
-space to the user, meaning that the user can exert only limited control over the
-performance of the dynamized data structure. This section will discuss
-these limitations, and the rest of the document will be dedicated to
-proposing solutions to them.
-
-\subsection{Limits of Decomposability}
-\label{ssec:decomp-limits}
-Unfortunately, the DSP abstraction used as the basis of classical
-dynamization techniques has a few significant limitations that restrict
-their applicability,
-
-\begin{itemize}
- \item The query must be broadcast identically to each block and cannot
- be adjusted based on the state of the other blocks.
-
- \item The query process is done in one pass--it cannot be repeated.
-
- \item The result merge operation must be $O(1)$ to maintain good query
- performance.
-
- \item The result merge operation must be commutative and associative,
- and is called repeatedly to merge pairs of results.
-\end{itemize}
-
-These requirements restrict the types of queries that can be supported by
-the method efficiently. For example, k-nearest neighbor and independent
-range sampling are not decomposable.
-
-\subsubsection{k-Nearest Neighbor}
-\label{sssec-decomp-limits-knn}
-The k-nearest neighbor (k-NN) problem is a generalization of the nearest
-neighbor problem, which seeks to return the closest point within the
-dataset to a given query point. More formally, this can be defined as,
-\begin{definition}[Nearest Neighbor]
-
-	Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$
-	be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
-	between two points within $D$. Given a query point $q \in \mathbb{R}^d$,
-	the nearest neighbor problem, $NN(D, q)$, returns a point $p \in D$
-	such that $f(p, q) = \min_{d \in D} f(d, q)$.
-
-\end{definition}
-
-In practice, it is common to require $f(x, y)$ be a metric,\footnote
-{
- Contrary to its vernacular usage as a synonym for ``distance'', a
- metric is more formally defined as a valid distance function over
- a metric space. Metric spaces require their distance functions to
- have the following properties,
- \begin{itemize}
- \item The distance between a point and itself is always 0.
- \item All distances between non-equal points must be positive.
- \item For all points, $x, y \in D$, it is true that
- $f(x, y) = f(y, x)$.
- \item For any three points $x, y, z \in D$ it is true that
- $f(x, z) \leq f(x, y) + f(y, z)$.
- \end{itemize}
-
- These distances also must have the interpretation that $f(x, y) <
- f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This
- is the opposite of the definition of similarity, and so some minor
- manipulations are usually required to make similarity measures work
- in metric-based indexes. \cite{intro-analysis}
-}
-and this will be done in the examples of indices for addressing
-this problem in this work, but it is not a fundamental aspect of the problem
-formulation. The nearest neighbor problem itself is decomposable,
-with a simple merge function that accepts the result with the smallest
-value of $f(x, q)$ for any two inputs~\cite{saxe79}.
-
-The k-nearest neighbor problem generalizes nearest-neighbor to return
-the $k$ nearest elements,
-\begin{definition}[k-Nearest Neighbor]
-
- Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$
- be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
- between two points within $D$. The k-nearest neighbor problem,
- $KNN(D, q, k)$ seeks to identify a set $R\subset D$ with $|R| = k$
- such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$.
-
-\end{definition}
-
-This can be thought of as solving the nearest-neighbor problem $k$ times,
-each time removing the returned result from $D$ prior to solving the
-problem again. Unlike the single nearest-neighbor case (which can be
-thought of as k-NN with $k=1$), this problem is \emph{not} decomposable.
-
-\begin{theorem}
- k-NN is not a decomposable search problem.
-\end{theorem}
-
-\begin{proof}
-To prove this, consider the query $KNN(D, q, k)$ against some partitioned
-dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If k-NN is decomposable,
-then there must exist some constant-time, commutative, and associative
-binary operator $\mergeop$, such that $R = \mergeop_{0 \leq i \leq \ell}
-R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q,
-k)$. Consider the evaluation of the merge operator against two arbitrary
-result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
-|R_j| = k$, and that the contents of $R$ must be the $k$ records from
-$R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
-problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$
-time. Therefore, k-NN is not a decomposable search problem.
-\end{proof}
-
-With that said, it is clear that there isn't any fundamental restriction
-preventing the merging of the result sets; it is only the case that an
-arbitrary performance requirement wouldn't be satisfied. It is possible
-to merge the result sets in non-constant time, and so it is the case
-that k-NN is $C(n)$-decomposable. Unfortunately, this classification
-brings with it a reduction in query performance as a result of the way
-result merges are performed.
-
-As a concrete example of these costs, consider using the Bentley-Saxe
-method to extend the VPTree~\cite{vptree}. The VPTree is a static,
-metric index capable of answering k-NN queries in $KNN(D, q, k) \in O(k
-\log n)$. One possible merge algorithm for k-NN would be to push all
-of the elements in the two arguments onto a min-heap, and then pop off
-the first $k$. In this case, the cost of the merge operation would be
-$C(k) = k \log k$. Were $k$ assumed to be constant, then the operation
-could be considered to be constant-time. But given that $k$ is only
-bounded in size above by $n$, this isn't a safe assumption to make in
-general. Evaluating the total query cost for the extended structure,
-this would yield,
-
-\begin{equation}
-	KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
-\end{equation}
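-
-The merge operator described above can be sketched as follows (an
-illustrative Python sketch; partial results are taken to be lists of
-$(\text{distance}, \text{point})$ pairs of size $k$, and the names are
-ours).
-
-\begin{verbatim}
-import heapq
-
-# Illustrative C(n)-decomposable merge for k-NN: combine two partial
-# result sets of size k by keeping the k globally nearest points.
-def knn_merge(r1, r2, k):
-    heap = list(r1) + list(r2)   # (distance, point) pairs
-    heapq.heapify(heap)                                              # O(k)
-    return [heapq.heappop(heap) for _ in range(min(k, len(heap)))]   # O(k log k)
-
-r1 = [(0.5, 'a'), (2.0, 'b'), (3.5, 'c')]
-r2 = [(1.0, 'd'), (1.5, 'e'), (4.0, 'f')]
-assert [p for _, p in knn_merge(r1, r2, 3)] == ['a', 'd', 'e']
-\end{verbatim}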
-
-The reason for this large increase in cost is the repeated application
-of the merge operator. The Bentley-Saxe method requires applying the
-merge operator in a binary fashion to each partial result, multiplying
-its cost by a factor of $\log n$. Thus, the constant-time requirement
-of standard decomposability is necessary to keep the cost of the merge
-operator from appearing within the complexity bound of the entire
-operation in the general case.\footnote {
- There is a special case, noted by Overmars, where the total cost is
- $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n))
- \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the
- case where the cost of the query and merge operation are sufficiently
- large to consume the logarithmic factor, and so it doesn't represent
- a special case with better performance.
-}
-If we could revise the result merging operation to remove this duplicated
-cost, we could greatly reduce the cost of supporting $C(n)$-decomposable
-queries.
-
-\subsubsection{Independent Range Sampling}
-\label{ssec:background-irs}
-
-Another problem that is not decomposable is independent sampling. There
-are a variety of problems falling under this umbrella, including weighted
-set sampling, simple random sampling, and weighted independent range
-sampling, but we will focus on independent range sampling here.
-
-\begin{definition}[Independent Range Sampling~\cite{tao22}]
- Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
- interval $q = [x, y]$ and an integer $k$, an independent range
- sampling query returns $k$ independent samples from $D \cap q$
- with each point having equal probability of being sampled.
-\end{definition}
-
-This problem immediately encounters a category error when considering
-whether it is decomposable: the result set is randomized, whereas
-the conditions for decomposability are defined in terms of an exact
-matching of records in result sets. To work around this, a slight abuse
-of definition is in order: assume that the equality conditions within
-the DSP definition can be interpreted to mean ``the contents in the two
-sets are drawn from the same distribution''. This enables the category
-of DSP to apply to this type of problem.
-
-Even with this abuse, however, IRS cannot generally be considered
-decomposable; it is at best $C(n)$-decomposable. The reason for this is
-that matching the distribution requires drawing the appropriate number
-of samples from each partition of the data. Even in the special
-case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples
-from each partition that must appear in the result set cannot be known
-in advance due to differences in the selectivity of the predicate across
-the partitions.
-
-\begin{example}[IRS Sampling Difficulties]
-
- Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}, D_1 =
- \{1, 1, 1, 1, 3\}, D_2 = \{4, 4, 4, 4, 4\}$ using bag semantics and
- an IRS query over the interval $[3, 4]$ with $k=12$. Because all three
- partitions have the same size, it seems sensible to evenly distribute
- the samples across them ($4$ samples from each partition). Applying
- the query predicate to the partitions results in the following,
- $d_0 = \{3, 4\}, d_1 = \{3 \}, d_2 = \{4, 4, 4, 4\}$.
-
-	In expectation, then, the first result set will contain $R_0 = \{3,
-	3, 4, 4\}$ as it has a 50\% chance of sampling a $3$ and the same
-	probability of a $4$. The second and third result sets can only
-	be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging these
-	together, we'd find that the probability distribution of the sample
-	would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform
- the same sampling operation over the full dataset (not partitioned),
- the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
-
-\end{example}
-
-The problem is that the number of samples drawn from each partition needs to be
-weighted based on the number of elements satisfying the query predicate in that
-partition. In the above example, by drawing $4$ samples from $D_1$, more weight
-is given to $3$ than exists within the base dataset. This can be worked around
-by sampling a full $k$ records from each partition, returning both the sample
-and the number of records satisfying the predicate as that partition's query
-result, and then performing another pass of IRS as the merge operator, but this
-is the same approach as was used for k-NN above. This leaves IRS firmly in the
-$C(n)$-decomposable camp. If it were possible to pre-calculate the number of
-samples to draw from each partition, then a constant-time merge operation could
-be used.
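-
-The following Python sketch illustrates the idea (illustrative only; it
-assumes the per-partition counts of predicate-matching records can be
-computed up front, which is precisely what the standard decomposed
-query interface does not provide).
-
-\begin{verbatim}
-import random
-
-# Illustrative IRS sketch: weight the number of samples taken from each
-# partition by the number of records matching the predicate.
-def matching(partition, q):
-    lo, hi = q
-    return [x for x in partition if lo <= x <= hi]
-
-def irs(partitions, q, k):
-    matches = [matching(p, q) for p in partitions]
-    # Pre-computed weights: partition i is chosen with probability
-    # proportional to the number of matching records it contains.
-    idx = random.choices(range(len(matches)),
-                         weights=[len(m) for m in matches], k=k)
-    return [random.choice(matches[i]) for i in idx]
-
-partitions = [[1, 2, 3, 4, 5], [1, 1, 1, 1, 3], [4, 4, 4, 4, 4]]
-sample = irs(partitions, (3, 4), k=12)  # ~25% 3s and ~75% 4s in expectation
-\end{verbatim}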
-
-\subsection{Insertion Tail Latency}
-
-\subsection{Configurability}
-
-\section{Conclusion}
-This chapter discussed the necessary background information pertaining to
-queries and search problems, indexes, and techniques for dynamic extension. It
-described the potential for using custom indexes for accelerating particular
-kinds of queries, as well as the challenges associated with constructing these
-indexes. The remainder of this document will seek to address these challenges
-through modification and extension of the Bentley-Saxe method, describing work
-that has already been completed, as well as the additional work that must be
-done to realize this vision.