\chapter{Background}
\label{chap:background}

This chapter will introduce important background information and existing work in the area of data structure dynamization. We will first discuss the concept of a search problem, which is central to dynamization techniques. While one might imagine that the restrictions on dynamization would be functions of the data structure being dynamized, in practice the requirements placed on the data structure itself are quite mild; it is the required properties of the search problem that the structure is used to answer that present the central difficulty in applying dynamization techniques to a given area. After this, database indices will be discussed briefly; they are the primary application of data structures within the database context that is of interest to our work. Following this, existing theoretical results in the area of data structure dynamization will be discussed, which will serve as the building blocks for our techniques in subsequent chapters. The chapter will conclude with a discussion of some of the limitations of these existing techniques.

\section{Queries and Search Problems}
\label{sec:dsp}

Data access lies at the core of most database systems. We want to ask questions of the data and, ideally, get the answers efficiently. We will refer to the different types of questions that can be asked as \emph{search problems}. We will be using this term in a similar way to how the word \emph{query}\footnote{ The term query is often abused and used to refer to several related, but slightly different, things. In the vernacular, a query can refer to a) a general type of search problem (as in ``range query''), b) a specific instance of a search problem, or c) a program written in a query language. } is often used within the database systems literature: to refer to a general class of questions. For example, we could consider range scans, point-lookups, nearest neighbor searches, predicate filtering, random sampling, etc., to each be a general search problem. Formally, for the purposes of this work, a search problem is defined as follows,

\begin{definition}[Search Problem]
Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched, $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the answer domain.\footnote{ It is important to note that it is not required that $\mathcal{R} \subseteq \mathcal{D}$. As an example, a \texttt{COUNT} aggregation might map a set of strings onto an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need not be a universal constraint. }
\end{definition}

We will use the term \emph{query} to mean a specific instance of a search problem,

\begin{definition}[Query]
Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$, and a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific instance of the search problem, $F(\mathcal{D}, q)$.
\end{definition}

As an example of these definitions, a \emph{membership test} or \emph{range scan} would be considered a search problem, and a range scan over the interval $[10, 99]$ would be a query. We've drawn this distinction because, as we'll see in the discussion of our work in later chapters, it is useful to have separate, unambiguous terms for these two concepts.
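To make this distinction concrete, the following minimal sketch (in Python, with illustrative names of our own choosing, such as \texttt{range\_count} and \texttt{Interval}) expresses a range count in the form of the definitions above: the function itself corresponds to the search problem $F$, while a particular invocation of it corresponds to a query.

\begin{verbatim}
# A minimal sketch of the search problem abstraction, using range
# count as the search problem F : (D, Q) -> R. All names here are
# illustrative only.
from dataclasses import dataclass

@dataclass
class Interval:
    low: float
    high: float

def range_count(data: list[float], q: Interval) -> int:
    """F(D, q): the number of points of D falling in [q.low, q.high]."""
    return sum(1 for x in data if q.low <= x <= q.high)

# A query is a specific instance of the search problem, e.g. the
# range count over the interval [10, 99]:
#   range_count(D, Interval(10, 99))
\end{verbatim}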
\subsection{Decomposable Search Problems}

Dynamization techniques require partitioning one data structure into several smaller ones. As a result, these techniques can only be applied when the search problem in question can be answered from this set of smaller data structures, producing the same answer as would have been obtained had all of the data been used to construct a single, large structure. This requirement is formalized in the definition of a class of problems called \emph{decomposable search problems (DSP)}. This class was first defined by Bentley and Saxe in their work on dynamization, and we will adopt their definition,

\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
\label{def:dsp}
A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and only if there exists a constant-time computable, associative, and commutative binary operator $\square$ such that,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
\end{equation*}
\end{definition}

The requirement that $\square$ be constant-time was used by Bentley and Saxe to prove specific performance bounds for answering queries from a decomposed data structure. However, it is not strictly \emph{necessary}, and later work by Overmars lifted this constraint and considered a more general class of search problems called \emph{$C(n)$-decomposable search problems},

\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}]
A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable if and only if there exists an $O(C(n))$-time computable, associative, and commutative binary operator $\square$ such that,
\begin{equation*}
F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
\end{equation*}
\end{definition}

To demonstrate that a search problem is decomposable, it is necessary to show the existence of a merge operator, $\square$, with the required properties, and to show that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two results, induction demonstrates that the problem is decomposable even in cases with more than two partial results. As an example, consider range counts,

\begin{definition}[Range Count]
Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval $q = [x, y]$, with $x,y \in \mathbb{R}$, a range count returns the cardinality $|d \cap q|$.
\end{definition}

\begin{theorem}
Range Count is a decomposable search problem.
\end{theorem}

\begin{proof}
Let $\square$ be addition ($+$). Applying this to Definition~\ref{def:dsp} gives
\begin{align*}
|(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
\end{align*}
which holds because intersection distributes over union, $(A \cup B) \cap q = (A \cap q) \cup (B \cap q)$, and because $A$ and $B$ are disjoint partitions, so the cardinality of their union is the sum of their cardinalities. Addition is an associative and commutative operator that can be calculated in $O(1)$ time. Therefore, range counts are DSPs.
\end{proof}

Because the codomain of a DSP is not restricted, more complex output structures can be used to convert problems that are not directly decomposable into DSPs, possibly with some minor post-processing. For example, calculating the arithmetic mean of a set of numbers can be formulated as a DSP,

\begin{theorem}
The calculation of the arithmetic mean of a set of numbers is a DSP.
\end{theorem}

\begin{proof}
Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$, where $\mathcal{D}\subset\mathbb{R}$ is a multi-set. The output tuple contains the sum of the values within the input set and the cardinality of the input set.
For two disjoint partitions of the data, $D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let $A(D_1) \square A(D_2) = (s_1 + s_2, c_1 + c_2)$. Applying Definition~\ref{def:dsp} gives
\begin{align*}
A(D_1 \cup D_2) &= A(D_1)~\square~A(D_2) \\
(s, c) &= (s_1 + s_2, c_1 + c_2)
\end{align*}
which holds because the sum $s$ and cardinality $c$ of a disjoint union are the sums of the partitions' sums and cardinalities. From the resulting tuple $(s, c)$, the average can be determined in constant time by taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set of numbers is a DSP.
\end{proof}

\section{Database Indexes}
\label{sec:indexes}

Within a database system, search problems are expressed using some high-level language (or mapped directly to commands, for simpler systems like key-value stores), which is processed by the database system to produce a result. Within many database systems, the most basic access primitive is a table scan, which sequentially examines each record within the data set. In many situations, however, the same query could be answered in less time using a more sophisticated data access scheme, and databases support a limited number of such schemes through the use of specialized data structures called \emph{indices} (or indexes). Indices can be built over a set of attributes in a table and provide faster access for particular search problems.

The term \emph{index} is often abused within the database community to refer to a range of closely related, but distinct, conceptual categories.\footnote{ The word index can be used to refer to a structure mapping record information to the set of records matching that information, as a general synonym for ``data structure'', to data structures used specifically in query processing, etc. } This ambiguity is rarely problematic, as the subtle differences between these categories are not often significant, and context clarifies the intended meaning in situations where they are. However, this work explicitly operates at the interface of two of these categories, and so it is important to disambiguate between them.

\subsection{The Classical Index}

A database index is a specialized data structure that provides a means to efficiently locate records satisfying specific criteria. This enables more efficient query processing for supported search problems. A classical index can be modeled as a function, mapping a set of attribute values, called a key, $\mathcal{K}$, to a set of record identifiers, $\mathcal{R}$. The codomain of an index can be either the set of record identifiers, a set containing sets of record identifiers, or the set of physical records, depending upon the configuration of the index~\cite{cowbook}. For our purposes here, we'll focus on the first of these, but the use of other codomains wouldn't have any material effect on our discussion. We will use the following definition of a ``classical'' database index,

\begin{definition}[Classical Index~\cite{cowbook}]
Consider a set of database records, $\mathcal{D}$. An index over these records, $\mathcal{I}_\mathcal{D}$, is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{K}, \mathcal{D}) \to \mathcal{R}$, where $\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, called a \emph{key}.
\end{definition}

In order to facilitate this mapping, indexes are built using data structures. The selection of data structure has implications for the performance of the index and the types of search problems it can be used to accelerate. Broadly speaking, classical indices can be divided into two categories: ordered and unordered.
Ordered indices allow for iteration over a set of record identifiers in a particular sorted order of keys, and for the efficient location of a specific key value in that order. These indices can be used to accelerate range scans and point-lookups. Unordered indices are specialized for point-lookups on a particular key value, and do not support iterating over records in sorted order~\cite{cowbook, mysql-btree-hash}.

Only a small set of data structures is commonly used for creating classical indexes. For ordered indices, the most commonly used data structure is the B-tree~\cite{ubiq-btree},\footnote{ By \emph{B-tree} here, we are referring not only to the original B-tree data structure, but to a wide range of related structures derived from it. Examples include the B$^+$-tree, B$^\epsilon$-tree, etc. } and the log-structured merge (LSM) tree~\cite{oneil96} is also often used within the context of key-value stores~\cite{rocksdb}. Some databases implement unordered indices using hash tables~\cite{mysql-btree-hash}.

\subsection{The Generalized Index}

The previous subsection discussed the classical definition of an index, as might be found in a database systems textbook. However, this definition is limited by its association specifically with mapping key fields to records. For the purposes of this work, a broader definition of index will be considered,

\begin{definition}[Generalized Index]
Consider a set of database records, $\mathcal{D}$, and a search problem with query parameters drawn from $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$, is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{Q}, \mathcal{D}) \to \mathcal{R}$.
\end{definition}

A classical index is a special case of a generalized index, in which the search problem is a point-lookup or range scan based on a set of record attributes. There are a number of generalized indexes that appear in some database systems. For example, some specialized databases or database extensions support indexes based on the R-tree\footnote{ Like the B-tree, R-tree here is used as a signifier for a general class of related data structures.} for spatial data~\cite{postgis-doc, ubiq-rtree}, or hierarchical navigable small world graphs for similarity search~\cite{pinecone-db}, among others. These systems are typically either add-on modules, or specialized standalone databases designed specifically for answering particular types of queries (such as spatial queries, similarity search, string matching, etc.).
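Viewed this way, a generalized index is simply a data structure that exposes a query routine for its associated search problem. The following minimal sketch (in Python, with hypothetical class and method names; no particular system's API is implied) illustrates this interface, with a classical ordered index arising as the special case in which the search problem is a key range scan.

\begin{verbatim}
# A sketch of the generalized index abstraction, I_D : (Q, D) -> R.
# All class and method names are hypothetical.
from abc import ABC, abstractmethod
from typing import Any

class GeneralizedIndex(ABC):
    """Built over a fixed record set D; answers one search problem."""
    @abstractmethod
    def query(self, parameters: Any) -> list:
        ...

class OrderedKeyIndex(GeneralizedIndex):
    """A classical index: the search problem is a key range scan."""
    def __init__(self, records: list[tuple]):
        # The key is assumed to be the first field of each record.
        self.records = sorted(records)

    def query(self, parameters: tuple) -> list:
        low, high = parameters
        return [r for r in self.records if low <= r[0] <= high]
\end{verbatim}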
\section{Classical Dynamization Techniques}

Because data in a database is regularly updated, data structures intended to be used as an index must support updates (inserts, in-place modification, and deletes). Not all potentially useful data structures support updates, and so a general strategy for adding update support would increase the number of data structures that could be used as database indices. We refer to a data structure with update support as \emph{dynamic}, and one without update support as \emph{static}.\footnote{ The term static is distinct from immutable. Static refers to the layout of records within the data structure, whereas immutable refers to the data stored within those records. This distinction will become relevant when we discuss different techniques for adding delete support to data structures. The data structures used are always static, but not necessarily immutable, because the records may contain header information (like visibility) that is updated in place. }

This section discusses \emph{dynamization}, the construction of a dynamic data structure based on an existing static one. When certain conditions are satisfied by the data structure and its associated search problem, this process can be done automatically, with provable asymptotic bounds on amortized insertion performance as well as worst-case query performance. We will first discuss the necessary data structure requirements, and then examine several classical dynamization techniques. The section will conclude with a discussion of delete support within the context of these techniques.

\subsection{Global Reconstruction}

The most fundamental dynamization technique is that of \emph{global reconstruction}. While not particularly useful on its own, global reconstruction serves as the basis for the techniques to follow, and so we will begin our discussion of dynamization with it. Consider a class of data structures, $\mathcal{I}$, capable of answering a search problem, $\mathcal{Q}$. Insertion via global reconstruction is possible if $\mathcal{I}$ supports the following two operations,
\begin{align*}
\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
\end{align*}
where $\mathtt{build}$ constructs an instance $\mathscr{i}\in\mathcal{I}$ of the data structure over a set of records $d \subseteq \mathcal{D}$ in $C(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d \subseteq \mathcal{D}$ used to construct $\mathscr{i} \in \mathcal{I}$ in $\Theta(1)$ time,\footnote{ There isn't any practical reason why $\mathtt{unbuild}$ must run in constant time, but this is the assumption made in \cite{saxe79} and in subsequent work based on it, and so we will follow the same definition here. } such that $\mathscr{i} = \mathtt{build}(\mathtt{unbuild}(\mathscr{i}))$. An insert of a record $r$ is then performed by rebuilding the entire structure over the updated record set, $\mathtt{build}(\mathtt{unbuild}(\mathscr{i}) \cup \{r\})$, at a cost of $C(n)$ per insert.

\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method}
\label{ssec:bsm}

Another approach to supporting updates is to amortize the cost of global reconstruction over multiple updates. This approach can take three forms,
\begin{enumerate}
\item Pairing a dynamic data structure (called a buffer or memtable) with an instance of the structure being extended. Updates are written to the buffer, and when the buffer is full, its records are merged with those in the static structure and the structure is rebuilt.
This approach is used by one version of the originally proposed LSM-tree~\cite{oneil96}. Technically, this technique was proposed in that work for the purpose of converting random writes into sequential ones (all structures involved are dynamic), but it can be used for dynamization as well.
\item Creating multiple, smaller data structures, each containing a partition of the records from the dataset, and reconstructing individual structures to accommodate new inserts in a systematic manner. This technique is the basis of the Bentley-Saxe method~\cite{saxe79}.
\item Using both of the above techniques at once. This is the approach used by modern incarnations of the LSM-tree~\cite{rocksdb}.
\end{enumerate}
In all three cases, it is necessary for the search problem associated with the index to be a DSP, as answering it will require querying multiple structures (the buffer and/or one or more instances of the data structure) and merging the results together to obtain a final result. This section will focus exclusively on the Bentley-Saxe method, as it is the basis for the proposed methodology.

When dividing records across multiple structures, there is a clear trade-off between read performance and write performance. Keeping the individual structures small reduces the cost of reconstruction, and thereby increases update performance. However, this also means that more structures will be required to accommodate the same number of records, when compared to a scheme that allows the structures to be larger. As each structure must be queried independently, this will lead to worse query performance. The reverse is also true: fewer, larger structures will have better query performance and worse update performance, with the extreme limit of this being a single structure that is fully rebuilt on each insert.

The key insight of the Bentley-Saxe method~\cite{saxe79} is that a good balance can be struck by using geometrically increasing structure sizes. In Bentley-Saxe, the sub-structures are ``stacked'', with the base level having a capacity of a single record and each subsequent level doubling in capacity. When an update is performed, the first empty level is located and a reconstruction is triggered, merging the structures of all levels below this empty one, along with the new record. The merit of this approach is that it ensures that ``most'' reconstructions involve the smaller data structures towards the bottom of the sequence, while most of the records reside in large, infrequently updated structures towards the top. This balances the read and write implications of structure size, while also bounding the number of structures required to represent $n$ records by $O(\log n)$ in the worst case.

Given a structure and DSP with $P(n)$ construction cost and $Q_S(n)$ query cost, the Bentley-Saxe method will produce a dynamic data structure with,
\begin{align}
\text{Query Cost} \qquad & O\left(Q_S(n) \cdot \log n\right) \\
\text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right)
\end{align}
In the case of a $C(n)$-decomposable problem, the query cost grows to
\begin{equation}
O\left(\left(Q_S(n) + C(n)\right) \cdot \log n\right)
\end{equation}

While the Bentley-Saxe method manages to maintain good performance in terms of \emph{amortized} insertion cost, it has poor worst-case insertion performance. If the entire structure is full, it must grow by another level, requiring a full reconstruction involving every record within the structure.
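To make the mechanics of the method concrete, the following simplified sketch (in Python; the names \texttt{build}, \texttt{unbuild}, \texttt{query}, and \texttt{merge} are placeholders for whatever the wrapped static structure and DSP provide, not part of any existing implementation) maintains the geometric level structure described above: level $i$ holds either nothing or a structure built over exactly $2^i$ records, an insert rebuilds the first empty level from everything beneath it, and a query applies the merge operator $\square$ across the per-level results.

\begin{verbatim}
# A simplified sketch of the Bentley-Saxe decomposition. It assumes a
# static structure with a build function (records -> structure), an
# unbuild() method on the structure, a local query routine, and a
# constant-time, associative, commutative result merge operator.
class BentleySaxe:
    def __init__(self, build, query, merge, identity):
        self.build = build        # build(records) -> static structure
        self.query = query        # query(structure, q) -> partial result
        self.merge = merge        # the DSP merge operator (box)
        self.identity = identity  # identity element for merge
        self.levels = []          # level i: None, or a structure
                                  # over exactly 2^i records

    def insert(self, record):
        pending = [record]
        for i, struct in enumerate(self.levels):
            if struct is None:
                # First empty level: rebuild it from the new record
                # plus the records of every lower level.
                self.levels[i] = self.build(pending)
                return
            pending.extend(struct.unbuild())
            self.levels[i] = None
        # All levels were full: the structure grows by one level,
        # triggering the worst-case full reconstruction.
        self.levels.append(self.build(pending))

    def answer(self, q):
        result = self.identity
        for struct in self.levels:
            if struct is not None:
                result = self.merge(result, self.query(struct, q))
        return result
\end{verbatim}

Instantiating this skeleton with, for example, a sorted array as the static structure, a binary-search-based range count as the local query, and addition (with identity $0$) as the merge operator yields a dynamized range count structure exhibiting the bounds given above.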
A slight adjustment to the technique, due to Overmars and van Leeuwen~\cite{overmars81}, allows the worst-case insertion cost to be bounded by $O\left(\frac{P(n)}{n} \log n\right)$; however, it does so by dividing each reconstruction into small pieces, one of which is executed each time a new update occurs. This bounds the worst-case performance, but it sacrifices expected-case performance and adds significant complexity to the method. This technique is not used much in practice.\footnote{ We've yet to find any example of it used in a journal article or conference paper. }

\section{Limitations of the Bentley-Saxe Method}
\label{sec:bsm-limits}

While fairly general, the Bentley-Saxe method has a number of limitations. Because of the way in which it merges query results together, the number of search problems to which it can be efficiently applied is limited. Additionally, the method does not expose any trade-off space to configure the structure: it is one-size-fits-all.

\subsection{Limits of Decomposability}
\label{ssec:decomp-limits}

Unfortunately, the DSP abstraction used as the basis of the Bentley-Saxe method has a few significant limitations that must be overcome before it can be used for the purposes of this work. At a high level, these limitations are as follows,
\begin{itemize}
\item Each local query must be oblivious to the state of every partition aside from the one it is directly running against. Further, Bentley-Saxe provides no facility for accessing cross-block state or performing multiple query passes against each partition.
\item The result merge operation must be $O(1)$ to maintain good query performance.
\item The result merge operation must be commutative and associative, and is called repeatedly to merge pairs of results.
\end{itemize}
These requirements restrict the types of queries that can be efficiently supported by the method. For example, k-nearest neighbor and independent range sampling are not decomposable.

\subsubsection{k-Nearest Neighbor}
\label{sssec-decomp-limits-knn}

The k-nearest neighbor (KNN) problem is a generalization of the nearest neighbor problem, which seeks to return the closest point within the dataset to a given query point. More formally, this can be defined as,

\begin{definition}[Nearest Neighbor]
Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$ be some function $f: D^2 \to \mathbb{R}^+$ representing the distance between two points within $D$. The nearest neighbor problem, $NN(D, q)$, returns some $d \in D$ attaining $\min_{d \in D} \{f(d, q)\}$ for a given query point, $q \in \mathbb{R}^d$.
\end{definition}

In practice, it is common to require that $f(x, y)$ be a metric,\footnote{ Contrary to its vernacular usage as a synonym for ``distance'', a metric is more formally defined as a valid distance function over a metric space. Metric spaces require their distance functions to have the following properties,
\begin{itemize}
\item The distance between a point and itself is always 0.
\item All distances between non-equal points must be positive.
\item For all points $x, y \in D$, it is true that $f(x, y) = f(y, x)$.
\item For any three points $x, y, z \in D$, it is true that $f(x, z) \leq f(x, y) + f(y, z)$.
\end{itemize}
These distances also must have the interpretation that $f(x, y) < f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$.
This is the opposite of the definition of similarity, and so some minor manipulations are usually required to make similarity measures work in metric-based indexes~\cite{intro-analysis}. } and this will be done in the examples of indexes addressing this problem in this work, but it is not a fundamental aspect of the problem formulation.

The nearest neighbor problem itself is decomposable, with a simple merge function that returns whichever of its two inputs has the smaller value of $f(x, q)$~\cite{saxe79}. The k-nearest neighbor problem generalizes nearest-neighbor to return the $k$ nearest elements,

\begin{definition}[k-Nearest Neighbor]
Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$ be some function $f: D^2 \to \mathbb{R}^+$ representing the distance between two points within $D$. The k-nearest neighbor problem, $KNN(D, q, k)$, seeks to identify a set $R\subset D$ with $|R| = k$ such that $\forall d \in D \setminus R, r \in R$, $f(d, q) \geq f(r, q)$.
\end{definition}

This can be thought of as solving the nearest-neighbor problem $k$ times, each time removing the returned result from $D$ prior to solving the problem again. Unlike the single nearest-neighbor case (which can be thought of as KNN with $k=1$), this problem is \emph{not} decomposable.

\begin{theorem}
KNN is not a decomposable search problem.
\end{theorem}

\begin{proof}
To prove this, consider the query $KNN(D, q, k)$ against some partitioned dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If KNN is decomposable, then there must exist some constant-time, commutative, and associative binary operator $\square$ such that $R = \square_{0 \leq i \leq \ell} R_i$, where $R_i$ is the result of evaluating the query $KNN(D_i, q, k)$.

Consider the evaluation of the merge operator against two arbitrary result sets, $R = R_i \square R_j$. It is clear that $|R| = |R_i| = |R_j| = k$, and that the contents of $R$ must be the $k$ records from $R_i \cup R_j$ that are nearest to $q$. Thus, $\square$ must solve the problem $KNN(R_i \cup R_j, q, k)$. However, KNN cannot be solved in $O(1)$ time. Therefore, KNN is not a decomposable search problem.
\end{proof}

With that said, it is clear that there isn't any fundamental restriction preventing the merging of the result sets; it is only that the constant-time performance requirement cannot be satisfied. It is possible to merge the result sets in non-constant time, and so KNN is $C(n)$-decomposable. Unfortunately, this classification brings with it a reduction in query performance as a result of the way result merges are performed in Bentley-Saxe.

As a concrete example of these costs, consider using Bentley-Saxe to extend the VPTree~\cite{vptree}. The VPTree is a static, metric index capable of answering KNN queries in $O(k \log n)$ time. One possible merge algorithm for KNN would be to push all of the elements of the two arguments onto a min-heap, and then pop off the first $k$. In this case, the cost of the merge operation would be $C(k) \in O(k \log k)$. Were $k$ assumed to be constant, the operation could be considered constant-time. But given that $k$ is bounded above only by $n$, this isn't a safe assumption to make in general. Evaluating the total query cost for the extended structure would yield,
\begin{equation}
KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
\end{equation}
The reason for this large increase in cost is the repeated application of the merge operator.
The Bentley-Saxe method requires applying the merge operator in a binary fashion to each partial result, multiplying its cost by a factor of $\log n$. Thus, the constant-time requirement of standard decomposability is necessary to keep the cost of the merge operator from appearing within the complexity bound of the entire operation in the general case.\footnote{ There is a special case, noted by Overmars, where the total cost is $O(Q_S(n) + C(n))$, without the logarithmic term, when $(Q_S(n) + C(n)) \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the case where the costs of the query and merge operations are already large enough to absorb the logarithmic factor, and so it doesn't represent a special case with better performance. } If the result merging operation could be revised to remove this duplicated cost, the cost of supporting $C(n)$-decomposable queries could be greatly reduced.

\subsubsection{Independent Range Sampling}

Another problem that is not decomposable is independent sampling. There are a variety of problems falling under this umbrella, including weighted set sampling, simple random sampling, and weighted independent range sampling, but this section will focus on independent range sampling (IRS).

\begin{definition}[Independent Range Sampling~\cite{tao22}]
Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query interval $q = [x, y]$ and an integer $k$, an independent range sampling query returns $k$ independent samples from $D \cap q$, with each point having equal probability of being sampled.
\end{definition}

This problem immediately encounters a category error when considering whether it is decomposable: the result set is randomized, whereas the conditions for decomposability are defined in terms of an exact matching of records in result sets. To work around this, a slight abuse of definition is in order: assume that the equality condition within the DSP definition can be interpreted to mean ``the contents of the two sets are drawn from the same distribution''. This enables the category of DSP to apply to this type of problem. More formally,

\begin{definition}[Decomposable Sampling Problem]
A sampling problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and only if there exists a constant-time computable, associative, and commutative binary operator $\square$ such that,
\begin{equation*}
F(A \cup B, q) \sim F(A, q)~ \square ~F(B, q)
\end{equation*}
\end{definition}

Even with this abuse, however, IRS cannot generally be considered decomposable; it is at best $C(n)$-decomposable. The reason for this is that matching the distribution requires drawing the appropriate number of samples from each partition of the data. Even in the special case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples from each partition that must appear in the result set cannot be known in advance, due to differences in the selectivity of the query predicate across the partitions.

\begin{example}[IRS Sampling Difficulties]
Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}$, $D_1 = \{1, 1, 1, 1, 3\}$, and $D_2 = \{4, 4, 4, 4, 4\}$, using bag semantics, and an IRS query over the interval $[3, 4]$ with $k=12$. Because all three partitions have the same size, it seems sensible to evenly distribute the samples across them ($4$ samples from each partition). Applying the query predicate to the partitions results in the following: $d_0 = \{3, 4\}$, $d_1 = \{3\}$, $d_2 = \{4, 4, 4, 4\}$.
In expectation, then, the first result set will contain $R_0 = \{3, 3, 4, 4\}$, as each sample has a 50\% chance of being a $3$ and the same probability of being a $4$. The second and third result sets can only be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$, respectively. Merging these together, we'd find that the probability distribution of the sample would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform the same sampling operation over the full dataset (not partitioned), the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
\end{example}

The problem is that the number of samples drawn from each partition needs to be weighted based on the number of elements satisfying the query predicate in that partition. In the above example, by drawing $4$ samples from $D_1$, more weight is given to $3$ than exists within the base dataset. This can be worked around by sampling a full $k$ records from each partition, returning both the sample and the number of records satisfying the predicate as that partition's query result, and then performing another pass of IRS as the merge operator; but this is the same approach as was used for KNN above, and it leaves IRS firmly in the $C(n)$-decomposable camp. If it were possible to pre-calculate the number of samples to draw from each partition, then a constant-time merge operation could be used.

\section{Conclusion}

This chapter discussed the necessary background information pertaining to queries and search problems, indexes, and techniques for dynamic extension. It described the potential of custom indexes for accelerating particular kinds of queries, as well as the challenges associated with constructing such indexes. The remainder of this document will seek to address these challenges through modification and extension of the Bentley-Saxe method, describing work that has already been completed, as well as the additional work that must be done to realize this vision.