| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-04-27 17:36:57 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-04-27 17:36:57 -0400 |
| commit | 5e4ad2777acc4c2420514e39fb98b7cf2e200996 (patch) | |
| tree | 276c075048e85426436db8babf0ca1f37e9fdba2 /chapters/background.tex | |
| download | dissertation-5e4ad2777acc4c2420514e39fb98b7cf2e200996.tar.gz | |
Initial commit
Diffstat (limited to 'chapters/background.tex')
| -rw-r--r-- | chapters/background.tex | 746 |
1 files changed, 746 insertions, 0 deletions
diff --git a/chapters/background.tex b/chapters/background.tex new file mode 100644 index 0000000..75e2b59 --- /dev/null +++ b/chapters/background.tex @@ -0,0 +1,746 @@ +\chapter{Background} +\label{chap:background} + +This chapter will introduce important background information and +existing work in the area of data structure dynamization. We will +first discuss the concept of a search problem, which is central to +dynamization techniques. While one might imagine that the restrictions on +dynamization would be functions of the data structure to be dynamized, +in practice the requirements placed on the data structure are quite mild; +it is the necessary properties of the search problem that the data +structure is used to address that present the central difficulty in +applying dynamization techniques to a given area. After this, database +indices will be discussed briefly. Indices are the primary use of data +structures within the database context that is of interest to our work. +Following this, existing theoretical results in the area of data structure +dynamization will be discussed, which will serve as the building blocks +for our techniques in subsequent chapters. The chapter will conclude with +a discussion of some of the limitations of these existing techniques. + +\section{Queries and Search Problems} +\label{sec:dsp} + +Data access lies at the core of most database systems. We want to ask +questions of the data, and ideally get the answers efficiently. We +will refer to the different types of question that can be asked as +\emph{search problems}. We will be using this term in a similar way as +the word \emph{query}\footnote{ The term query is often abused and used to refer to several related, but slightly different, things. In the vernacular, a query can refer to either a) a general type of search problem (as in ``range query''), b) a specific instance of a search problem, or c) a program written in a query language. } +is often used within the database systems literature: to refer to a +general class of questions. For example, we could consider range scans, +point-lookups, nearest neighbor searches, predicate filtering, random +sampling, etc., to each be a general search problem. Formally, for the +purposes of this work, a search problem is defined as follows, +\begin{definition}[Search Problem] Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched, $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the answer domain.\footnote{ It is important to note that it is not required that $\mathcal{R} \subseteq \mathcal{D}$. As an example, a \texttt{COUNT} aggregation might map a set of strings onto an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need not be a universal constraint. } \end{definition} + +We will use the term \emph{query} to mean a specific instance of a search +problem, + +\begin{definition}[Query] Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$, and a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific instance of the search problem, $F(\mathcal{D}, q)$.
+\end{definition} + +As an example of using these definitions, a \emph{membership test} +or \emph{range scan} would be considered search problems, and a range +scan over the interval $[10, 99]$ would be a query. We've drawn this +distinction because, as we'll see as we enter into the discussion of +our work in later chapters, it is useful to have separate, unambiguous +terms for these two concepts. + +\subsection{Decomposable Search Problems} + +Dynamization techniques require the partitioning of one data structure +into several, smaller ones. As a result, these techniques can only +be applied in situations where the search problem can be answered from +this set of smaller data structures with the same answer as would have +been obtained had all of the data been used to construct a single, large +structure. This requirement is formalized in +the definition of a class of problems called \emph{decomposable search +problems (DSP)}. This class was first defined by Bentley and Saxe in +their work on dynamization, and we will adopt their definition, + +\begin{definition}[Decomposable Search Problem~\cite{saxe79}] \label{def:dsp} A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and only if there exists a constant-time computable, associative, and commutative binary operator $\square$ such that, \begin{equation*} F(A \cup B, q) = F(A, q)~ \square ~F(B, q) \end{equation*} \end{definition} + +The requirement for $\square$ to be constant-time was used by Bentley and +Saxe to prove specific performance bounds for answering queries from a +decomposed data structure. However, it is not strictly \emph{necessary}, +and later work by Overmars lifted this constraint and considered a more +general class of search problems called \emph{$C(n)$-decomposable search +problems}, + +\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}] A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable if and only if there exists an $O(C(n))$-time computable, associative, and commutative binary operator $\square$ such that, \begin{equation*} F(A \cup B, q) = F(A, q)~ \square ~F(B, q) \end{equation*} \end{definition} + +To demonstrate that a search problem is decomposable, it is necessary to +show the existence of a merge operator, $\square$, with the necessary +properties, and to show that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two results, induction demonstrates that the problem is +decomposable even in cases with more than two partial results. + +As an example, consider range counts, +\begin{definition}[Range Count] Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval, $q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns the cardinality, $|d \cap q|$. \end{definition} + +\begin{theorem} Range Count is a decomposable search problem. \end{theorem} + +\begin{proof} Let $\square$ be addition ($+$). Applying this to Definition~\ref{def:dsp} gives \begin{align*} |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)| \end{align*} which follows from the distributivity of intersection over union, together with the fact that $A$ and $B$ are disjoint partitions of the data, so no record is counted twice. Addition is an associative and commutative operator that can be calculated in $O(1)$ time. Therefore, range counts are DSPs. +\end{proof}
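+
+To make the decomposed evaluation concrete, the following sketch answers a range count over a collection of disjoint partitions by evaluating the query against each partition independently and combining the partial results with $\square = +$. This is illustrative C++ only; the \texttt{Partition} type, the function names, and the use of sorted vectors are assumptions made for the example rather than part of any particular system,
+
+\begin{verbatim}
+#include <vector>
+#include <algorithm>
+#include <cstddef>
+
+// One partition of the data: a sorted vector of points in R.
+using Partition = std::vector<double>;
+
+// Local query: the cardinality |d intersect [x, y]| for one partition d.
+size_t range_count(const Partition &d, double x, double y) {
+    auto lo = std::lower_bound(d.begin(), d.end(), x);
+    auto hi = std::upper_bound(d.begin(), d.end(), y);
+    return hi - lo;
+}
+
+// Decomposed query: run the local query over every partition and
+// merge the partial results with the binary operator (here, +).
+size_t decomposed_range_count(const std::vector<Partition> &parts,
+                              double x, double y) {
+    size_t result = 0;
+    for (const auto &d : parts) {
+        result = result + range_count(d, x, y);   // the merge operator
+    }
+    return result;
+}
+\end{verbatim}
+
+The same pattern applies to any DSP: only the local query and the merge operator change.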
+ +Because the codomain of a DSP is not restricted, more complex output +structures can be used to allow problems that are not directly +decomposable to be converted to DSPs, possibly with some minor +post-processing. For example, calculating the arithmetic mean of a set +of numbers can be formulated as a DSP, +\begin{theorem} +The calculation of the arithmetic mean of a set of numbers is a DSP. +\end{theorem} +\begin{proof} + Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$, where $\mathcal{D}$ is a multi-set of elements drawn from $\mathbb{R}$. The output tuple +contains the sum of the values within the input set, and the +cardinality of the input set. For two disjoint partitions of the data, +$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let +$A(D_1) \square A(D_2) = (s_1 + s_2, c_1 + c_2)$. + +Applying Definition~\ref{def:dsp} gives +\begin{align*} + A(D_1 \cup D_2) &= A(D_1)~\square~A(D_2) \\ + (s, c) &= (s_1 + s_2, c_1 + c_2) +\end{align*} +where $(s, c)$, the sum and cardinality of $D_1 \cup D_2$, is exactly the tuple produced by the merge operator. From this result, the average can be determined in constant time by +taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set +of numbers is a DSP. +\end{proof} + + + +\section{Database Indexes} +\label{sec:indexes} + +Within a database system, search problems are expressed using +some high level language (or mapped directly to commands, for +simpler systems like key-value stores), which is processed by +the database system to produce a result. Within many database +systems, the most basic access primitive is a table scan, which +sequentially examines each record within the data set. There are many +situations in which the same query could be answered in less time using +a more sophisticated data access scheme, however, and databases support +a limited number of such schemes through the use of specialized data +structures called \emph{indices} (or indexes). Indices can be built over +a set of attributes in a table and provide faster access for particular +search problems. + +The term \emph{index} is often abused within the database community +to refer to a range of closely related, but distinct, conceptual +categories.\footnote{ The word index can be used to refer to a structure mapping record information to the set of records matching that information, as a general synonym for ``data structure'', to data structures used specifically in query processing, etc. } +This ambiguity is rarely problematic, as the subtle differences between +these categories are not often significant, and context clarifies the +intended meaning in situations where they are. However, this work +explicitly operates at the interface of two of these categories, and so +it is important to disambiguate between them. + +\subsection{The Classical Index} + +A database index is a specialized data structure that provides a means +to efficiently locate records that satisfy specific criteria. This +enables more efficient query processing for supported search problems. A +classical index can be modeled as a function, mapping a set of attribute +values, called a key, $\mathcal{K}$, to a set of record identifiers, +$\mathcal{R}$. The codomain of an index can be either the set of +record identifiers, a set containing sets of record identifiers, or +the set of physical records, depending upon the configuration of the +index.~\cite{cowbook} For our purposes here, we'll focus on the first of +these, but the use of other codomains wouldn't have any material effect +on our discussion.
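+
+To ground this description before stating the definition formally, the following sketch models a classical ordered index as a mapping from key values to record identifiers, supporting point-lookups and ordered iteration for range scans. It is illustrative C++ only; the \texttt{Key}, \texttt{RecordID}, and \texttt{OrderedIndex} names, and the use of an in-memory \texttt{std::map}, are assumptions made for the example and not a description of any particular system,
+
+\begin{verbatim}
+#include <map>
+#include <vector>
+#include <cstdint>
+
+using Key = int64_t;        // key attribute(s), collapsed to a single value
+using RecordID = uint64_t;  // identifier used to locate the physical record
+
+// A classical ordered index: maps each key value to the identifiers of
+// the records having that key.
+class OrderedIndex {
+public:
+    void insert(Key k, RecordID rid) { map_[k].push_back(rid); }
+
+    // Point-lookup: all records with key exactly k.
+    std::vector<RecordID> lookup(Key k) const {
+        auto it = map_.find(k);
+        return (it == map_.end()) ? std::vector<RecordID>{} : it->second;
+    }
+
+    // Range scan: all records with keys in [lo, hi], in key order.
+    std::vector<RecordID> range_scan(Key lo, Key hi) const {
+        std::vector<RecordID> out;
+        for (auto it = map_.lower_bound(lo);
+             it != map_.end() && it->first <= hi; ++it) {
+            out.insert(out.end(), it->second.begin(), it->second.end());
+        }
+        return out;
+    }
+
+private:
+    std::map<Key, std::vector<RecordID>> map_;
+};
+\end{verbatim}
+
+An unordered index would expose only the point-lookup operation, and would typically be backed by a hash table rather than an ordered structure.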
+ +We will use the following definition of a ``classical'' database index, + +\begin{definition}[Classical Index~\cite{cowbook}] +Consider a set of database records, $\mathcal{D}$. An index over +these records, $\mathcal{I}_\mathcal{D}$, is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{K}, \mathcal{D}) \to \mathcal{R}$, where +$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, +called a \emph{key}. +\end{definition} + +In order to facilitate this mapping, indexes are built using data +structures. The selection of data structure has implications for the +performance of the index and for the types of search problem it can be +used to accelerate. Broadly speaking, classical indices can be divided +into two categories: ordered and unordered. Ordered indices allow for +the iteration over a set of record identifiers in a particular sorted +order of keys, and the efficient location of a specific key value in +that order. These indices can be used to accelerate range scans and +point-lookups. Unordered indices are specialized for point-lookups on a +particular key value, and do not support iterating over records in some +order.~\cite{cowbook, mysql-btree-hash} + +Only a small set of data structures is commonly used for +creating classical indexes. For ordered indices, the most commonly used +data structure is the B-tree~\cite{ubiq-btree},\footnote{ By \emph{B-tree} here, we are referring not to the B-tree data structure alone, but to a wide range of related structures derived from the B-tree. Examples include the B$^+$-tree, B$^\epsilon$-tree, etc. } +and the log-structured merge (LSM) tree~\cite{oneil96} is also often +used within the context of key-value stores~\cite{rocksdb}. Some databases +implement unordered indices using hash tables~\cite{mysql-btree-hash}. + + +\subsection{The Generalized Index} + +The previous section discussed the classical definition of an index +as might be found in a database systems textbook. However, this +definition is limited by its association specifically with mapping +key fields to records. For the purposes of this work, a broader +definition of index will be considered, + +\begin{definition}[Generalized Index] +Consider a set of database records, $\mathcal{D}$, and a search +problem, $\mathcal{Q}$. +A generalized index, $\mathcal{I}_\mathcal{D}$, +is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{Q}, \mathcal{D}) \to +\mathcal{R}$. +\end{definition} + +A classical index is a special case of a generalized index, with $\mathcal{Q}$ +being a point-lookup or range scan based on a set of record attributes. + +There are a number of generalized indexes that appear in some database systems. +For example, some specialized databases or database extensions have support for +indexes based on the R-tree\footnote{ As with the B-tree, R-tree here is used as a signifier for a general class of related data structures.} for spatial +databases~\cite{postgis-doc, ubiq-rtree} or hierarchical navigable small world +graphs for similarity search~\cite{pinecone-db}, among others. These systems +are typically either an add-on module, or a specialized standalone database +that has been designed specifically for answering particular types of queries +(such as spatial queries, similarity search, string matching, etc.). + +%\subsection{Indexes in Query Processing} + +%A database management system utilizes indexes to accelerate certain +%types of query. Queries are expressed to the system in some high +%level language, such as SQL or Datalog.
These are generalized +%languages capable of expressing a wide range of possible queries. +%The DBMS is then responsible for converting these queries into a +%set of primitive data access procedures that are supported by the +%underlying storage engine. There are a variety of techniques for +%this, including mapping directly to a tree of relational algebra +%operators and interpreting that tree, query compilation, etc. But, +%ultimately, this internal query representation is limited by the routines +%supported by the storage engine.~\cite{cowbook} + +%As an example, consider the following SQL query (representing a +%2-dimensional k-nearest neighbor problem)\footnote{There are more efficient +%ways of answering this query, but I'm aiming for simplicity here +%to demonstrate my point}, +% +%\begin{verbatim} +%SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A +% WHERE A.property = filtering_criterion +% ORDER BY d +% LIMIT 5; +%\end{verbatim} +% +%This query will be translated into a logical query plan (a sequence +%of relational algebra operators) by the query planner, which could +%result in a plan like this, +% +%\begin{verbatim} +%query plan here +%\end{verbatim} +% +%With this logical query plan, the DBMS will next need to determine +%which supported operations it can use to most efficiently answer +%this query. For example, the selection operation (A) could be +%physically manifested as a table scan, or could be answered using +%an index scan if there is an ordered index over \texttt{A.property}. +%The query optimizer will make this decision based on its estimate +%of the selectivity of the predicate. This may result in one of the +%following physical query plans +% +%\begin{verbatim} +%physical query plan +%\end{verbatim} +% +%In either case, however, the space of possible physical plans is +%limited by the available access methods: either a sorted scan on +%an attribute (index) or an unsorted scan (table scan). The database +%must filter for all elements matching the filtering criterion, +%calculate the distances between all of these points and the query, +%and then sort the results to get the final answer. Additionally, +%note that the sort operation in the plan is a pipeline-breaker. If +%this plan were to appear as a sub-tree in a larger query plan, the +%overall plan would need to wait for the full evaluation of this +%sub-query before it could proceed, as sorting requires the full +%result set. +% +%Imagine a world where a new index was available to the DBMS: a +%nearest neighbor index. This index would allow the iteration over +%records in sorted order, relative to some predefined metric and a +%query point. If such an index existed over \texttt{(A.x, A.y)} using +%\texttt{dist}, then a third physical plan would be available to the DBMS, +% +%\begin{verbatim} +%\end{verbatim} +% +%This plan pulls records in order of their distance to \texttt{Q} +%directly, using an index, and then filters them, avoiding the +%pipeline breaking sort operation. While it's not obvious in this +%case that this new plan is superior (this would depend upon the +%selectivity of the predicate), it is a third option. It becomes +%increasingly superior as the selectivity of the predicate grows, +%and is clearly superior in the case where the predicate has unit +%selectivity (requiring only the consideration of $5$ records total). 
+% +%This use of query-specific indexing schemes presents a query +%optimization challenge: how does the database know when a particular +%specialized index can be used for a given query, and how can +%specialized indexes broadcast their capabilities to the query optimizer +%in a general fashion? This work is focused on the problem of enabling +%the existence of such indexes, rather than facilitating their use; +%however these are important questions that must be considered in +%future work for this solution to be viable. There has been work +%done surrounding the use of arbitrary indexes in queries in the past, +%such as~\cite{byods-datalog}. This problem is considered out-of-scope +%for the proposed work, but will be considered in the future. + +\section{Classical Dynamization Techniques} + +Because data in a database is regularly updated, data structures +intended to be used as an index must support updates (inserts, in-place +modification, and deletes). Not all potentially useful data structures +support updates, and so a general strategy for adding update support +would increase the number of data structures that could be used as +database indices. We refer to a data structure with update support as +\emph{dynamic}, and one without update support as \emph{static}.\footnote{ + + The term static is distinct from immutable. Static refers to the + layout of records within the data structure, whereas immutable + refers to the data stored within those records. This distinction + will become relevant when we discuss different techniques for adding + delete support to data structures. The data structures used are + always static, but not necessarily immutable, because the records may + contain header information (like visibility) that is updated in place. +} + +This section discusses \emph{dynamization}, the construction of a dynamic +data structure based on an existing static one. When certain conditions +are satisfied by the data structure and its associated search problem, +this process can be done automatically, and with provable asymptotic +bounds on amortized insertion performance, as well as worst case query +performance. We will first discuss the necessary data structure +requirements, and then examine several classical dynamization techniques. +The section will conclude with a discussion of delete support within the +context of these techniques. + +\subsection{Global Reconstruction} + +The most fundamental dynamization technique is that of \emph{global +reconstruction}. While not particularly useful on its own, global +reconstruction serves as the basis for the techniques to follow, and so +we will begin our discussion of dynamization with it. + +Consider a class of data structure, $\mathcal{I}$, capable of answering a +search problem, $\mathcal{Q}$. 
Insertion via global reconstruction is +possible if $\mathcal{I}$ supports the following two operations, +\begin{align*} +\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\ +\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D}) +\end{align*} +where $\mathtt{build}$ constructs an instance $\mathscr{i}\in\mathcal{I}$ +of the data structure over a set of records $d \subseteq \mathcal{D}$ +in $C(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d +\subseteq \mathcal{D}$ used to construct $\mathscr{i} \in \mathcal{I}$ in +$\Theta(1)$ time,\footnote{ There isn't any practical reason why $\mathtt{unbuild}$ must run in constant time, but this is the assumption made in~\cite{saxe79} and in subsequent work based on it, and so we will follow the same definition here. } such that $\mathscr{i} = \mathtt{build}(\mathtt{unbuild}(\mathscr{i}))$. + + + + + + +\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method} +\label{ssec:bsm} + +Another approach to supporting updates is to amortize the cost of +global reconstruction over multiple updates. This approach can +take three forms, +\begin{enumerate} + + \item Pairing a dynamic data structure (called a buffer or memtable) with an instance of the structure being extended. Updates are written to the buffer, and when the buffer is full its records are merged with those in the static structure, and the structure is rebuilt. This approach is used by one version of the originally proposed LSM-tree~\cite{oneil96}. Technically, this technique was proposed in that work for the purpose of converting random writes into sequential ones (all structures involved are dynamic), but it can be used for dynamization as well. + + \item Creating multiple, smaller data structures, each containing a partition of the records from the dataset, and reconstructing individual structures to accommodate new inserts in a systematic manner. This technique is the basis of the Bentley-Saxe method~\cite{saxe79}. + + \item Using both of the above techniques at once. This is the approach used by modern incarnations of the LSM-tree~\cite{rocksdb}. + +\end{enumerate} + +In all three cases, it is necessary for the search problem associated +with the index to be a DSP, as answering it will require querying +multiple structures (the buffer and/or one or more instances of the +data structure) and merging the results together to get a final +result. This section will focus exclusively on the Bentley-Saxe +method, as it is the basis for the proposed methodology. + +When dividing records across multiple structures, there is a clear +trade-off between read performance and write performance. Keeping +the individual structures small reduces the cost of reconstruction, +and thereby increases update performance. However, this also means +that more structures will be required to accommodate the same number +of records, when compared to a scheme that allows the structures +to be larger. As each structure must be queried independently, this +will lead to worse query performance. The reverse is also true: +fewer, larger structures will have better query performance and +worse update performance, with the extreme limit of this being a +single structure that is fully rebuilt on each insert. + +The key insight of the Bentley-Saxe method~\cite{saxe79} is that a +good balance can be struck by using a geometrically increasing +structure size. In Bentley-Saxe, the sub-structures are ``stacked'', +with the base level having a capacity of a single record, and +each subsequent level doubling in capacity. When an update is +performed, the first empty level is located and a reconstruction +is triggered, merging the structures of all levels below this empty +one, along with the new record.
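+
+The following sketch shows this insertion procedure. It is an illustrative C++ sketch only, assuming the $\mathtt{build}$ operation described above; the \texttt{Record}, \texttt{Structure}, and \texttt{bsm\_insert} names are placeholders introduced for the example, not part of the original method's presentation,
+
+\begin{verbatim}
+#include <vector>
+#include <optional>
+#include <cstddef>
+
+using Record = double;                     // placeholder record type
+struct Structure {                         // one static structure instance
+    std::vector<Record> records;           // unbuild() is trivial here
+};
+
+// build: construct a static structure over a set of records.
+Structure build(std::vector<Record> recs) {
+    return Structure{std::move(recs)};
+}
+
+// levels[i] is either empty or a structure over exactly 2^i records.
+std::vector<std::optional<Structure>> levels;
+
+void bsm_insert(Record rec) {
+    std::vector<Record> recs{rec};
+
+    // Locate the first empty level, unbuilding every full level
+    // beneath it and collecting their records.
+    size_t i = 0;
+    while (i < levels.size() && levels[i].has_value()) {
+        auto &r = levels[i]->records;
+        recs.insert(recs.end(), r.begin(), r.end());
+        levels[i].reset();                 // level i is now empty
+        ++i;
+    }
+    if (i == levels.size()) {
+        levels.emplace_back();             // grow by one level
+    }
+
+    // Levels 0..i-1 held 2^i - 1 records; with the new record this
+    // is exactly the capacity of level i.
+    levels[i] = build(std::move(recs));
+}
+\end{verbatim}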
The merit of this approach is +that it ensures that ``most'' reconstructions involve the smaller +data structures towards the bottom of the sequence, while most of +the records reside in large, infrequently updated structures towards +the top. This balances the read and write implications of +structure size, while also allowing the number of structures required +to represent $n$ records to be worst-case bounded by $O(\log n)$. + +Given a structure and DSP with $P(n)$ construction cost and $Q_S(n)$ +query cost, the Bentley-Saxe method will produce a dynamic data +structure with, + +\begin{align} + \text{Query Cost} \qquad & O\left(Q_S(n) \cdot \log n\right) \\ + \text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right) +\end{align} + +In the case of a $C(n)$-decomposable problem, the query cost grows to +\begin{equation} + O\left(\left(Q_S(n) + C(n)\right) \cdot \log n\right) +\end{equation} + + +While the Bentley-Saxe method manages to maintain good performance in +terms of \emph{amortized} insertion cost, it has poor worst-case performance. If the +entire structure is full, it must grow by another level, requiring +a full reconstruction involving every record within the structure. +A slight adjustment to the technique, due to Overmars and van +Leeuwen~\cite{overmars81}, allows the worst-case insertion cost to be bounded by +$O\left(\frac{P(n)}{n} \log n\right)$; however, it does so by dividing +each reconstruction into small pieces, one of which is executed +each time a new update occurs. This has the effect of bounding the +worst-case performance, but does so by sacrificing the expected-case +performance, and it adds significant complexity to the method. This +technique is not used much in practice.\footnote{ We've yet to find any example of it used in a journal article or conference paper. } + +\section{Limitations of the Bentley-Saxe Method} +\label{sec:bsm-limits} + +While fairly general, the Bentley-Saxe method has a number of limitations. Because +of the way in which it merges query results together, the number of search problems +to which it can be efficiently applied is limited. Additionally, the method does not +expose any trade-off space to configure the structure: it is one-size-fits-all. + +\subsection{Limits of Decomposability} +\label{ssec:decomp-limits} +Unfortunately, the DSP abstraction used as the basis of the Bentley-Saxe +method has a few significant limitations that must be overcome +before it can be used for the purposes of this work. At a high level, these limitations +are as follows, + +\begin{itemize} + \item Each local query must be oblivious to the state of every partition, aside from the one it is directly running against. Further, Bentley-Saxe provides no facility for accessing cross-block state or performing multiple query passes against each partition. + + \item The result merge operation must be $O(1)$ to maintain good query performance. + + \item The result merge operation must be commutative and associative, and is called repeatedly to merge pairs of results. +\end{itemize} + +These requirements restrict the types of queries that can be supported by +the method efficiently. For example, k-nearest neighbor and independent +range sampling are not decomposable.
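+
+To make the role of these restrictions concrete, the following sketch shows how a query is answered against a decomposed structure: each partition is queried in isolation, and the partial results are combined by repeated pairwise application of the merge operator. With $O(\log n)$ partitions, the merge operator is invoked $O(\log n)$ times, which is why a non-constant merge cost is multiplied into the overall query bound. This is illustrative C++ only; the \texttt{decomposed\_query} name and its parameters are placeholders introduced for the example,
+
+\begin{verbatim}
+#include <vector>
+#include <optional>
+
+template <typename Structure, typename Result,
+          typename LocalQuery, typename Merge>
+Result decomposed_query(
+    const std::vector<std::optional<Structure>> &levels,
+    const LocalQuery &local_query,  // evaluated against a single level
+    const Merge &merge,             // the binary merge operator
+    Result identity)                // result of querying an empty level
+{
+    Result result = identity;
+    for (const auto &level : levels) {
+        if (!level.has_value()) continue;
+        // The local query sees only this level; it has no access to the
+        // other levels or to any cross-level state.
+        Result partial = local_query(*level);
+        // Pairwise merge: applied once per non-empty level.
+        result = merge(result, partial);
+    }
+    return result;
+}
+\end{verbatim}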
+\subsubsection{k-Nearest Neighbor} +\label{sssec-decomp-limits-knn} +The k-nearest neighbor (KNN) problem is a generalization of the nearest +neighbor problem, which seeks to return the closest point within the +dataset to a given query point. More formally, this can be defined as, +\begin{definition}[Nearest Neighbor] + Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$ be some function $f: D^2 \to \mathbb{R}^+$ representing the distance between two points within $D$. The nearest neighbor problem, $NN(D, q)$, returns some $d^* \in D$ such that $f(d^*, q) = \min_{d \in D} f(d, q)$ for a given query point, $q \in \mathbb{R}^d$. +\end{definition} + +In practice, it is common to require that $f(x, y)$ be a metric,\footnote{ Contrary to its vernacular usage as a synonym for ``distance'', a metric is more formally defined as a valid distance function over a metric space. Metric spaces require their distance functions to have the following properties, \begin{itemize} \item The distance between a point and itself is always 0. \item All distances between non-equal points must be positive. \item For all points, $x, y \in D$, it is true that $f(x, y) = f(y, x)$. \item For any three points $x, y, z \in D$ it is true that $f(x, z) \leq f(x, y) + f(y, z)$. \end{itemize} These distances also must have the interpretation that $f(x, y) < f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This is the opposite of the definition of similarity, and so some minor manipulations are usually required to make similarity measures work in metric-based indexes.~\cite{intro-analysis} } +and we will do so in the examples of indexes addressing +this problem in this work, but it is not a fundamental aspect of the problem +formulation. The nearest neighbor problem itself is decomposable, with +a simple merge function that returns whichever of its two inputs has the smallest value +of $f(x, q)$~\cite{saxe79}. + +The k-nearest neighbor problem generalizes nearest-neighbor to return +the $k$ nearest elements, +\begin{definition}[k-Nearest Neighbor] + Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$ be some function $f: D^2 \to \mathbb{R}^+$ representing the distance between two points within $D$. The k-nearest neighbor problem, $KNN(D, q, k)$, seeks to identify a set $R\subset D$ with $|R| = k$ such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$. +\end{definition} + +This can be thought of as solving the nearest-neighbor problem $k$ times, +each time removing the returned result from $D$ prior to solving the +problem again. Unlike the single nearest-neighbor case (which can be +thought of as KNN with $k=1$), this problem is \emph{not} decomposable. + +\begin{theorem} KNN is not a decomposable search problem. \end{theorem} + +\begin{proof} +To prove this, consider the query $KNN(D, q, k)$ against some partitioned +dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If KNN is decomposable, +then there must exist some constant-time, commutative, and associative +binary operator $\square$ such that $R = \square_{0 \leq i \leq \ell} R_i$, where $R_i$ is the result of evaluating the query $KNN(D_i, q, k)$. Consider the evaluation of the merge operator against two arbitrary result sets, $R = R_i \square R_j$.
It is clear that $|R| = |R_i| = +|R_j| = k$, and that the contents of $R$ must be the $k$ records from +$R_i \cup R_j$ that are nearest to $q$. Thus, $\square$ must solve the +problem $KNN(R_i \cup R_j, q, k)$. However, KNN cannot be solved in $O(1)$ +time. Therefore, KNN is not a decomposable search problem. +\end{proof} + +With that said, there isn't any fundamental restriction +preventing the merging of the result sets; it is only that the constant-time +performance requirement cannot be satisfied. It is possible +to merge the result sets in non-constant time, and so KNN is $C(n)$-decomposable. Unfortunately, this classification brings with +it a reduction in query performance as a result of the way result merges are +performed in Bentley-Saxe. + +As a concrete example of these costs, consider using Bentley-Saxe to +extend the VPTree~\cite{vptree}. The VPTree is a static, metric index capable of +answering KNN queries with cost $KNN(D, q, k) \in O(k \log n)$. One possible +merge algorithm for KNN would be to push all of the elements in the two +arguments onto a min-heap, and then pop off the first $k$. In this case, +the cost of the merge operation would be $C(k) = k \log k$. Were $k$ assumed +to be constant, the operation could be considered constant-time. +But given that $k$ is only bounded above +by $n$, this isn't a safe assumption to make in general. Evaluating the +total query cost for the extended structure yields, + +\begin{equation} + KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right) +\end{equation} + +The reason for this large increase in cost is the repeated application +of the merge operator. The Bentley-Saxe method requires applying the +merge operator in a binary fashion to each partial result, multiplying +its cost by a factor of $\log n$. Thus, the constant-time requirement +of standard decomposability is necessary to keep the cost of the merge +operator from appearing within the complexity bound of the entire +operation in the general case.\footnote{ There is a special case, noted by Overmars, where the total cost is $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n)) \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the case where the cost of the query and merge operation are sufficiently large to consume the logarithmic factor, and so it doesn't represent a special case with better performance. } +If the result merging operation could be revised to remove this +repeated cost, the cost of supporting $C(n)$-decomposable queries +could be greatly reduced. + +\subsubsection{Independent Range Sampling} + +Another problem that is not decomposable is independent sampling. There +are a variety of problems falling under this umbrella, including weighted +set sampling, simple random sampling, and weighted independent range +sampling, but this section will focus on independent range sampling (IRS). + +\begin{definition}[Independent Range Sampling~\cite{tao22}] Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query interval $q = [x, y]$ and an integer $k$, an independent range sampling query returns $k$ independent samples from $D \cap q$ with each point having equal probability of being sampled. +\end{definition}
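+
+For reference, IRS is straightforward to answer over a single sorted array: locate the boundaries of $D \cap q$ with binary search, then draw $k$ uniform indices from that range. The following is an illustrative C++ sketch under that assumption; the function name and the use of \texttt{std::mt19937} are choices made for the example, not taken from any particular system,
+
+\begin{verbatim}
+#include <vector>
+#include <algorithm>
+#include <random>
+#include <cstddef>
+
+// Draw k independent, uniform samples from data intersect [x, y],
+// where data is sorted in ascending order.
+std::vector<double> irs(const std::vector<double> &data, double x, double y,
+                        size_t k, std::mt19937 &rng) {
+    auto lo = std::lower_bound(data.begin(), data.end(), x);
+    auto hi = std::upper_bound(data.begin(), data.end(), y);
+    std::vector<double> sample;
+    if (lo == hi) return sample;   // no records satisfy the predicate
+
+    std::uniform_int_distribution<size_t> dist(0, (hi - lo) - 1);
+    for (size_t i = 0; i < k; i++) {
+        sample.push_back(*(lo + dist(rng)));
+    }
+    return sample;
+}
+\end{verbatim}
+
+The difficulty described below arises when the data is split across partitions: the number of samples that must be drawn from each partition depends on how many of its records fall within $[x, y]$, which cannot be determined by examining any single partition in isolation.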
+ +This problem immediately encounters a category error when considering +whether it is decomposable: the result set is randomized, whereas the +conditions for decomposability are defined in terms of an exact matching +of records in result sets. To work around this, a slight abuse of definition +is in order: +assume that the equality conditions within the DSP definition can +be interpreted to mean ``the contents of the two sets are drawn from the +same distribution''. This enables the category of DSP to apply to this type +of problem. More formally, +\begin{definition}[Decomposable Sampling Problem] + A sampling problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and only if there exists a constant-time computable, associative, and commutative binary operator $\square$ such that, \begin{equation*} F(A \cup B, q) \sim F(A, q)~ \square ~F(B, q) \end{equation*} +\end{definition} + +Even with this abuse, however, IRS cannot generally be considered decomposable; +it is at best $C(n)$-decomposable. The reason for this is that matching the +distribution requires drawing the appropriate number of samples from each +partition of the data. Even in the special case that $|D_0| = |D_1| = \ldots = +|D_\ell|$, the number of samples from each partition that must appear in the +result set cannot be known in advance due to differences in the selectivity +of the predicate across the partitions. + +\begin{example}[IRS Sampling Difficulties] + Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}$, $D_1 = \{1, 1, 1, 1, 3\}$, and $D_2 = \{4, 4, 4, 4, 4\}$, using bag semantics, and an IRS query over the interval $[3, 4]$ with $k=12$. Because all three partitions have the same size, it seems sensible to evenly distribute the samples across them ($4$ samples from each partition). Applying the query predicate to the partitions results in the following: $d_0 = \{3, 4\}$, $d_1 = \{3\}$, $d_2 = \{4, 4, 4, 4, 4\}$. + In expectation, then, the first result set will contain $R_0 = \{3, 3, 4, 4\}$, as it has a 50\% chance of sampling a $3$ and the same probability of a $4$. The second and third result sets can only be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging these together, we'd find that the probability distribution of the sample would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform the same sampling operation over the full dataset (not partitioned), the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$. +\end{example} + +The problem is that the number of samples drawn from each partition needs to be +weighted based on the number of elements satisfying the query predicate in that +partition. In the above example, by drawing $4$ samples from $D_1$, more weight +is given to $3$ than exists within the base dataset. This can be worked around +by sampling a full $k$ records from each partition, returning both the sample +and the number of records satisfying the predicate as that partition's query +result, and then performing another pass of IRS as the merge operator, but this +is the same approach as was used for KNN above. This leaves IRS firmly in the +$C(n)$-decomposable camp. If it were possible to pre-calculate the number of +samples to draw from each partition, then a constant-time merge operation could +be used. + +\section{Conclusion} +This chapter discussed the necessary background information pertaining to +queries and search problems, indexes, and techniques for dynamic extension.
It +described the potential for using custom indexes for accelerating particular +kinds of queries, as well as the challenges associated with constructing these +indexes. The remainder of this document will seek to address these challenges +through modification and extension of the Bentley-Saxe method, describing work +that has already been completed, as well as the additional work that must be +done to realize this vision.