| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-04-27 17:36:57 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-04-27 17:36:57 -0400 |
| commit | 5e4ad2777acc4c2420514e39fb98b7cf2e200996 (patch) | |
| tree | 276c075048e85426436db8babf0ca1f37e9fdba2 /chapters/background.tex | |
| download | dissertation-5e4ad2777acc4c2420514e39fb98b7cf2e200996.tar.gz | |
Initial commit
Diffstat (limited to 'chapters/background.tex')
| -rw-r--r-- | chapters/background.tex | 746 |
1 files changed, 746 insertions, 0 deletions
diff --git a/chapters/background.tex b/chapters/background.tex new file mode 100644 index 0000000..75e2b59 --- /dev/null +++ b/chapters/background.tex @@ -0,0 +1,746 @@ +\chapter{Background} +\label{chap:background} + +This chapter will introduce important background information and +existing work in the area of data structure dynamization. We will +first discuss the concept of a search problem, which is central to +dynamization techniques. While one might imagine that the restrictions on +dynamization would be functions of the data structure to be dynamized, +in practice the requirements placed on the data structure are quite mild; +it is the necessary properties of the search problem that the data +structure is used to address that present the central difficulty in +applying dynamization techniques to a given area. After this, database +indices will be discussed briefly. Indices are the primary use of data +structures within the database context that is of interest to our work. +Following this, existing theoretical results in the area of data structure +dynamization will be discussed, which will serve as the building blocks +for our techniques in subsequent chapters. The chapter will conclude with +a discussion of some of the limitations of these existing techniques. + +\section{Queries and Search Problems} +\label{sec:dsp} + +Data access lies at the core of most database systems. We want to ask +questions of the data, and ideally get the answers efficiently. We +will refer to the different types of question that can be asked as +\emph{search problems}. We will be using this term in a similar way as +the word \emph{query}\footnote{ The term query is often abused and used to refer to several related, but slightly different, things. In the vernacular, a query can refer to either a) a general type of search problem (as in ``range query''), b) a specific instance of a search problem, or c) a program written in a query language. } +is often used within the database systems literature: to refer to a +general class of questions. For example, we could consider range scans, +point-lookups, nearest neighbor searches, predicate filtering, random +sampling, etc., to each be a general search problem. Formally, for the +purposes of this work, a search problem is defined as follows, +\begin{definition}[Search Problem] Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched, $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the answer domain.\footnote{ It is important to note that it is not required that $\mathcal{R} \subseteq \mathcal{D}$. As an example, a \texttt{COUNT} aggregation might map a set of strings onto an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need not be a universal constraint. } \end{definition} + +We will use the term \emph{query} to mean a specific instance of a search +problem, + +\begin{definition}[Query] Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$, and a specific set of query parameters $q \in \mathcal{Q}$, a query is a specific instance of the search problem, $F(\mathcal{D}, q)$.
+\end{definition} + +As an example of using these definitions, a \emph{membership test} +or \emph{range scan} would be considered search problems, and a range +scan over the interval $[10, 99]$ would be a query. We've drawn this +distinction because, as we'll see as we enter into the discussion of +our work in later chapters, it is useful to have separate, unambiguous +terms for these two concepts. + +\subsection{Decomposable Search Problems} + +Dynamization techniques require the partitioning of one data structure +into several, smaller ones. As a result, these techniques can only +be applied in situations where the search problem can be answered from +this set of smaller data structures with the same answer as would have +been obtained had all of the data been used to construct a single, large +structure. This requirement is formalized in +the definition of a class of problems called \emph{decomposable search +problems (DSP)}. This class was first defined by Bentley and Saxe in +their work on dynamization, and we will adopt their definition, + +\begin{definition}[Decomposable Search Problem~\cite{saxe79}] \label{def:dsp} A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and only if there exists a constant-time computable, associative, and commutative binary operator $\square$ such that, \begin{equation*} F(A \cup B, q) = F(A, q)~ \square ~F(B, q) \end{equation*} \end{definition} + +The requirement for $\square$ to be constant-time was used by Bentley and +Saxe to prove specific performance bounds for answering queries from a +decomposed data structure. However, it is not strictly \emph{necessary}, +and later work by Overmars lifted this constraint and considered a more +general class of search problems called \emph{$C(n)$-decomposable search +problems}, + +\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}] A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable if and only if there exists an $O(C(n))$-time computable, associative, and commutative binary operator $\square$ such that, \begin{equation*} F(A \cup B, q) = F(A, q)~ \square ~F(B, q) \end{equation*} \end{definition} + +To demonstrate that a search problem is decomposable, it is necessary to +show the existence of a merge operator, $\square$, with the necessary +properties, and to show that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two results, induction demonstrates that the problem is +decomposable even in cases with more than two partial results. + +As an example, consider range counts, +\begin{definition}[Range Count] Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval, $q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns the cardinality, $|d \cap q|$. \end{definition} + +\begin{theorem} Range Count is a decomposable search problem. \end{theorem} + +\begin{proof} Let $\square$ be addition ($+$). Applying this to Definition~\ref{def:dsp} gives \begin{align*} |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)| \end{align*} which follows from the distributivity of intersection over union, together with the fact that $A$ and $B$ are disjoint partitions of the data, so no record is counted twice. Addition is an associative and commutative operator that can be calculated in $O(1)$ time. Therefore, range counts are DSPs. +\end{proof}
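+
+To make the decomposed evaluation concrete, the following sketch answers a range count over a collection of disjoint partitions by evaluating the query against each partition independently and combining the partial results with $\square = +$. This is illustrative C++ only; the \texttt{Partition} type, the function names, and the use of sorted vectors are assumptions made for the example rather than part of any particular system,
+
+\begin{verbatim}
+#include <vector>
+#include <algorithm>
+#include <cstddef>
+
+// One partition of the data: a sorted vector of points in R.
+using Partition = std::vector<double>;
+
+// Local query: the cardinality |d intersect [x, y]| for one partition d.
+size_t range_count(const Partition &d, double x, double y) {
+    auto lo = std::lower_bound(d.begin(), d.end(), x);
+    auto hi = std::upper_bound(d.begin(), d.end(), y);
+    return hi - lo;
+}
+
+// Decomposed query: run the local query over every partition and
+// merge the partial results with the binary operator (here, +).
+size_t decomposed_range_count(const std::vector<Partition> &parts,
+                              double x, double y) {
+    size_t result = 0;
+    for (const auto &d : parts) {
+        result = result + range_count(d, x, y);   // the merge operator
+    }
+    return result;
+}
+\end{verbatim}
+
+The same pattern applies to any DSP: only the local query and the merge operator change.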
+ +Because the codomain of a DSP is not restricted, more complex output +structures can be used to allow problems that are not directly +decomposable to be converted to DSPs, possibly with some minor +post-processing. For example, calculating the arithmetic mean of a set +of numbers can be formulated as a DSP, +\begin{theorem} +The calculation of the arithmetic mean of a set of numbers is a DSP. +\end{theorem} +\begin{proof} + Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$, where $\mathcal{D}$ is a multi-set of elements drawn from $\mathbb{R}$. The output tuple +contains the sum of the values within the input set, and the +cardinality of the input set. For two disjoint partitions of the data, +$D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$. Let +$A(D_1) \square A(D_2) = (s_1 + s_2, c_1 + c_2)$. + +Applying Definition~\ref{def:dsp} gives +\begin{align*} + A(D_1 \cup D_2) &= A(D_1)~\square~A(D_2) \\ + (s, c) &= (s_1 + s_2, c_1 + c_2) +\end{align*} +where $(s, c)$, the sum and cardinality of $D_1 \cup D_2$, is exactly the tuple produced by the merge operator. From this result, the average can be determined in constant time by +taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set +of numbers is a DSP. +\end{proof} + + + +\section{Database Indexes} +\label{sec:indexes} + +Within a database system, search problems are expressed using +some high level language (or mapped directly to commands, for +simpler systems like key-value stores), which is processed by +the database system to produce a result. Within many database +systems, the most basic access primitive is a table scan, which +sequentially examines each record within the data set. There are many +situations in which the same query could be answered in less time using +a more sophisticated data access scheme, however, and databases support +a limited number of such schemes through the use of specialized data +structures called \emph{indices} (or indexes). Indices can be built over +a set of attributes in a table and provide faster access for particular +search problems. + +The term \emph{index} is often abused within the database community +to refer to a range of closely related, but distinct, conceptual +categories.\footnote{ The word index can be used to refer to a structure mapping record information to the set of records matching that information, as a general synonym for ``data structure'', to data structures used specifically in query processing, etc. } +This ambiguity is rarely problematic, as the subtle differences between +these categories are not often significant, and context clarifies the +intended meaning in situations where they are. However, this work +explicitly operates at the interface of two of these categories, and so +it is important to disambiguate between them. + +\subsection{The Classical Index} + +A database index is a specialized data structure that provides a means +to efficiently locate records that satisfy specific criteria. This +enables more efficient query processing for supported search problems. A +classical index can be modeled as a function, mapping a set of attribute +values, called a key, $\mathcal{K}$, to a set of record identifiers, +$\mathcal{R}$. The codomain of an index can be either the set of +record identifiers, a set containing sets of record identifiers, or +the set of physical records, depending upon the configuration of the +index.~\cite{cowbook} For our purposes here, we'll focus on the first of +these, but the use of other codomains wouldn't have any material effect +on our discussion.
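+
+To ground this description before stating the definition formally, the following sketch models a classical ordered index as a mapping from key values to record identifiers, supporting point-lookups and ordered iteration for range scans. It is illustrative C++ only; the \texttt{Key}, \texttt{RecordID}, and \texttt{OrderedIndex} names, and the use of an in-memory \texttt{std::map}, are assumptions made for the example and not a description of any particular system,
+
+\begin{verbatim}
+#include <map>
+#include <vector>
+#include <cstdint>
+
+using Key = int64_t;        // key attribute(s), collapsed to a single value
+using RecordID = uint64_t;  // identifier used to locate the physical record
+
+// A classical ordered index: maps each key value to the identifiers of
+// the records having that key.
+class OrderedIndex {
+public:
+    void insert(Key k, RecordID rid) { map_[k].push_back(rid); }
+
+    // Point-lookup: all records with key exactly k.
+    std::vector<RecordID> lookup(Key k) const {
+        auto it = map_.find(k);
+        return (it == map_.end()) ? std::vector<RecordID>{} : it->second;
+    }
+
+    // Range scan: all records with keys in [lo, hi], in key order.
+    std::vector<RecordID> range_scan(Key lo, Key hi) const {
+        std::vector<RecordID> out;
+        for (auto it = map_.lower_bound(lo);
+             it != map_.end() && it->first <= hi; ++it) {
+            out.insert(out.end(), it->second.begin(), it->second.end());
+        }
+        return out;
+    }
+
+private:
+    std::map<Key, std::vector<RecordID>> map_;
+};
+\end{verbatim}
+
+An unordered index would expose only the point-lookup operation, and would typically be backed by a hash table rather than an ordered structure.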
+ +We will use the following definition of a ``classical'' database index, + +\begin{definition}[Classical Index~\cite{cowbook}] +Consider a set of database records, $\mathcal{D}$. An index over +these records, $\mathcal{I}_\mathcal{D}$, is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{K}, \mathcal{D}) \to \mathcal{R}$, where +$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$, +called a \emph{key}. +\end{definition} + +In order to facilitate this mapping, indexes are built using data +structures. The selection of data structure has implications for the +performance of the index and for the types of search problem it can be +used to accelerate. Broadly speaking, classical indices can be divided +into two categories: ordered and unordered. Ordered indices allow for +the iteration over a set of record identifiers in a particular sorted +order of keys, and the efficient location of a specific key value in +that order. These indices can be used to accelerate range scans and +point-lookups. Unordered indices are specialized for point-lookups on a +particular key value, and do not support iterating over records in some +order.~\cite{cowbook, mysql-btree-hash} + +Only a small set of data structures is commonly used for +creating classical indexes. For ordered indices, the most commonly used +data structure is the B-tree~\cite{ubiq-btree},\footnote{ By \emph{B-tree} here, we are referring not to the B-tree data structure alone, but to a wide range of related structures derived from the B-tree. Examples include the B$^+$-tree, B$^\epsilon$-tree, etc. } +and the log-structured merge (LSM) tree~\cite{oneil96} is also often +used within the context of key-value stores~\cite{rocksdb}. Some databases +implement unordered indices using hash tables~\cite{mysql-btree-hash}. + + +\subsection{The Generalized Index} + +The previous section discussed the classical definition of an index +as might be found in a database systems textbook. However, this +definition is limited by its association specifically with mapping +key fields to records. For the purposes of this work, a broader +definition of index will be considered, + +\begin{definition}[Generalized Index] +Consider a set of database records, $\mathcal{D}$, and a search +problem, $\mathcal{Q}$. +A generalized index, $\mathcal{I}_\mathcal{D}$, +is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{Q}, \mathcal{D}) \to +\mathcal{R}$. +\end{definition} + +A classical index is a special case of a generalized index, with $\mathcal{Q}$ +being a point-lookup or range scan based on a set of record attributes. + +There are a number of generalized indexes that appear in some database systems. +For example, some specialized databases or database extensions have support for +indexes based on the R-tree\footnote{ As with the B-tree, R-tree here is used as a signifier for a general class of related data structures.} for spatial +databases~\cite{postgis-doc, ubiq-rtree} or hierarchical navigable small world +graphs for similarity search~\cite{pinecone-db}, among others. These systems +are typically either an add-on module, or a specialized standalone database +that has been designed specifically for answering particular types of queries +(such as spatial queries, similarity search, string matching, etc.). + +%\subsection{Indexes in Query Processing} + +%A database management system utilizes indexes to accelerate certain +%types of query. Queries are expressed to the system in some high +%level language, such as SQL or Datalog.
These are generalized +%languages capable of expressing a wide range of possible queries. +%The DBMS is then responsible for converting these queries into a +%set of primitive data access procedures that are supported by the +%underlying storage engine. There are a variety of techniques for +%this, including mapping directly to a tree of relational algebra +%operators and interpreting that tree, query compilation, etc. But, +%ultimately, this internal query representation is limited by the routines +%supported by the storage engine.~\cite{cowbook} + +%As an example, consider the following SQL query (representing a +%2-dimensional k-nearest neighbor problem)\footnote{There are more efficient +%ways of answering this query, but I'm aiming for simplicity here +%to demonstrate my point}, +% +%\begin{verbatim} +%SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A +% WHERE A.property = filtering_criterion +% ORDER BY d +% LIMIT 5; +%\end{verbatim} +% +%This query will be translated into a logical query plan (a sequence +%of relational algebra operators) by the query planner, which could +%result in a plan like this, +% +%\begin{verbatim} +%query plan here +%\end{verbatim} +% +%With this logical query plan, the DBMS will next need to determine +%which supported operations it can use to most efficiently answer +%this query. For example, the selection operation (A) could be +%physically manifested as a table scan, or could be answered using +%an index scan if there is an ordered index over \texttt{A.property}. +%The query optimizer will make this decision based on its estimate +%of the selectivity of the predicate. This may result in one of the +%following physical query plans +% +%\begin{verbatim} +%physical query plan +%\end{verbatim} +% +%In either case, however, the space of possible physical plans is +%limited by the available access methods: either a sorted scan on +%an attribute (index) or an unsorted scan (table scan). The database +%must filter for all elements matching the filtering criterion, +%calculate the distances between all of these points and the query, +%and then sort the results to get the final answer. Additionally, +%note that the sort operation in the plan is a pipeline-breaker. If +%this plan were to appear as a sub-tree in a larger query plan, the +%overall plan would need to wait for the full evaluation of this +%sub-query before it could proceed, as sorting requires the full +%result set. +% +%Imagine a world where a new index was available to the DBMS: a +%nearest neighbor index. This index would allow the iteration over +%records in sorted order, relative to some predefined metric and a +%query point. If such an index existed over \texttt{(A.x, A.y)} using +%\texttt{dist}, then a third physical plan would be available to the DBMS, +% +%\begin{verbatim} +%\end{verbatim} +% +%This plan pulls records in order of their distance to \texttt{Q} +%directly, using an index, and then filters them, avoiding the +%pipeline breaking sort operation. While it's not obvious in this +%case that this new plan is superior (this would depend upon the +%selectivity of the predicate), it is a third option. It becomes +%increasingly superior as the selectivity of the predicate grows, +%and is clearly superior in the case where the predicate has unit +%selectivity (requiring only the consideration of $5$ records total). 
+% +%This use of query-specific indexing schemes presents a query +%optimization challenge: how does the database know when a particular +%specialized index can be used for a given query, and how can +%specialized indexes broadcast their capabilities to the query optimizer +%in a general fashion? This work is focused on the problem of enabling +%the existence of such indexes, rather than facilitating their use; +%however these are important questions that must be considered in +%future work for this solution to be viable. There has been work +%done surrounding the use of arbitrary indexes in queries in the past, +%such as~\cite{byods-datalog}. This problem is considered out-of-scope +%for the proposed work, but will be considered in the future. + +\section{Classical Dynamization Techniques} + +Because data in a database is regularly updated, data structures +intended to be used as an index must support updates (inserts, in-place +modification, and deletes). Not all potentially useful data structures +support updates, and so a general strategy for adding update support +would increase the number of data structures that could be used as +database indices. We refer to a data structure with update support as +\emph{dynamic}, and one without update support as \emph{static}.\footnote{ + + The term static is distinct from immutable. Static refers to the + layout of records within the data structure, whereas immutable + refers to the data stored within those records. This distinction + will become relevant when we discuss different techniques for adding + delete support to data structures. The data structures used are + always static, but not necessarily immutable, because the records may + contain header information (like visibility) that is updated in place. +} + +This section discusses \emph{dynamization}, the construction of a dynamic +data structure based on an existing static one. When certain conditions +are satisfied by the data structure and its associated search problem, +this process can be done automatically, and with provable asymptotic +bounds on amortized insertion performance, as well as worst case query +performance. We will first discuss the necessary data structure +requirements, and then examine several classical dynamization techniques. +The section will conclude with a discussion of delete support within the +context of these techniques. + +\subsection{Global Reconstruction} + +The most fundamental dynamization technique is that of \emph{global +reconstruction}. While not particularly useful on its own, global +reconstruction serves as the basis for the techniques to follow, and so +we will begin our discussion of dynamization with it. + +Consider a class of data structure, $\mathcal{I}$, capable of answering a +search problem, $\mathcal{Q}$. 
Insertion via global reconstruction is +possible if $\mathcal{I}$ supports the following two operations, +\begin{align*} +\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\ +\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D}) +\end{align*} +where $\mathtt{build}$ constructs an instance $\mathscr{i}\in\mathcal{I}$ +of the data structure over a set of records $d \subseteq \mathcal{D}$ +in $C(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d +\subseteq \mathcal{D}$ used to construct $\mathscr{i} \in \mathcal{I}$ in +$\Theta(1)$ time,\footnote{ There isn't any practical reason why $\mathtt{unbuild}$ must run in constant time, but this is the assumption made in~\cite{saxe79} and in subsequent work based on it, and so we will follow the same definition here. } such that $\mathscr{i} = \mathtt{build}(\mathtt{unbuild}(\mathscr{i}))$. + + + + + + +\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method} +\label{ssec:bsm} + +Another approach to supporting updates is to amortize the cost of +global reconstruction over multiple updates. This approach can +take three forms, +\begin{enumerate} + + \item Pairing a dynamic data structure (called a buffer or memtable) with an instance of the structure being extended. Updates are written to the buffer, and when the buffer is full its records are merged with those in the static structure, and the structure is rebuilt. This approach is used by one version of the originally proposed LSM-tree~\cite{oneil96}. Technically, this technique was proposed in that work for the purpose of converting random writes into sequential ones (all structures involved are dynamic), but it can be used for dynamization as well. + + \item Creating multiple, smaller data structures, each containing a partition of the records from the dataset, and reconstructing individual structures to accommodate new inserts in a systematic manner. This technique is the basis of the Bentley-Saxe method~\cite{saxe79}. + + \item Using both of the above techniques at once. This is the approach used by modern incarnations of the LSM-tree~\cite{rocksdb}. + +\end{enumerate} + +In all three cases, it is necessary for the search problem associated +with the index to be a DSP, as answering it will require querying +multiple structures (the buffer and/or one or more instances of the +data structure) and merging the results together to get a final +result. This section will focus exclusively on the Bentley-Saxe +method, as it is the basis for the proposed methodology. + +When dividing records across multiple structures, there is a clear +trade-off between read performance and write performance. Keeping +the individual structures small reduces the cost of reconstruction, +and thereby increases update performance. However, this also means +that more structures will be required to accommodate the same number +of records, when compared to a scheme that allows the structures +to be larger. As each structure must be queried independently, this +will lead to worse query performance. The reverse is also true: +fewer, larger structures will have better query performance and +worse update performance, with the extreme limit of this being a +single structure that is fully rebuilt on each insert. + +The key insight of the Bentley-Saxe method~\cite{saxe79} is that a +good balance can be struck by using a geometrically increasing +structure size. In Bentley-Saxe, the sub-structures are ``stacked'', +with the base level having a capacity of a single record, and +each subsequent level doubling in capacity. When an update is +performed, the first empty level is located and a reconstruction +is triggered, merging the structures of all levels below this empty +one, along with the new record.
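+
+The following sketch shows this insertion procedure. It is an illustrative C++ sketch only, assuming the $\mathtt{build}$ operation described above; the \texttt{Record}, \texttt{Structure}, and \texttt{bsm\_insert} names are placeholders introduced for the example, not part of the original method's presentation,
+
+\begin{verbatim}
+#include <vector>
+#include <optional>
+#include <cstddef>
+
+using Record = double;                     // placeholder record type
+struct Structure {                         // one static structure instance
+    std::vector<Record> records;           // unbuild() is trivial here
+};
+
+// build: construct a static structure over a set of records.
+Structure build(std::vector<Record> recs) {
+    return Structure{std::move(recs)};
+}
+
+// levels[i] is either empty or a structure over exactly 2^i records.
+std::vector<std::optional<Structure>> levels;
+
+void bsm_insert(Record rec) {
+    std::vector<Record> recs{rec};
+
+    // Locate the first empty level, unbuilding every full level
+    // beneath it and collecting their records.
+    size_t i = 0;
+    while (i < levels.size() && levels[i].has_value()) {
+        auto &r = levels[i]->records;
+        recs.insert(recs.end(), r.begin(), r.end());
+        levels[i].reset();                 // level i is now empty
+        ++i;
+    }
+    if (i == levels.size()) {
+        levels.emplace_back();             // grow by one level
+    }
+
+    // Levels 0..i-1 held 2^i - 1 records; with the new record this
+    // is exactly the capacity of level i.
+    levels[i] = build(std::move(recs));
+}
+\end{verbatim}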
The merit of this approach is +that it ensures that ``most'' reconstructions involve the smaller +data structures towards the bottom of the sequence, while most of +the records reside in large, infrequently updated structures towards +the top. This balances the read and write implications of +structure size, while also allowing the number of structures required +to represent $n$ records to be worst-case bounded by $O(\log n)$. + +Given a structure and DSP with $P(n)$ construction cost and $Q_S(n)$ +query cost, the Bentley-Saxe method will produce a dynamic data +structure with, + +\begin{align} + \text{Query Cost} \qquad & O\left(Q_S(n) \cdot \log n\right) \\ + \text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right) +\end{align} + +In the case of a $C(n)$-decomposable problem, the query cost grows to +\begin{equation} + O\left(\left(Q_S(n) + C(n)\right) \cdot \log n\right) +\end{equation} + + +While the Bentley-Saxe method manages to maintain good performance in +terms of \emph{amortized} insertion cost, it has poor worst-case performance. If the +entire structure is full, it must grow by another level, requiring +a full reconstruction involving every record within the structure. +A slight adjustment to the technique, due to Overmars and van +Leeuwen~\cite{overmars81}, allows the worst-case insertion cost to be bounded by +$O\left(\frac{P(n)}{n} \log n\right)$; however, it does so by dividing +each reconstruction into small pieces, one of which is executed +each time a new update occurs. This has the effect of bounding the +worst-case performance, but does so by sacrificing the expected-case +performance, and it adds significant complexity to the method. This +technique is not used much in practice.\footnote{ We've yet to find any example of it used in a journal article or conference paper. } + +\section{Limitations of the Bentley-Saxe Method} +\label{sec:bsm-limits} + +While fairly general, the Bentley-Saxe method has a number of limitations. Because +of the way in which it merges query results together, the number of search problems +to which it can be efficiently applied is limited. Additionally, the method does not +expose any trade-off space to configure the structure: it is one-size-fits-all. + +\subsection{Limits of Decomposability} +\label{ssec:decomp-limits} +Unfortunately, the DSP abstraction used as the basis of the Bentley-Saxe +method has a few significant limitations that must be overcome +before it can be used for the purposes of this work. At a high level, these limitations +are as follows, + +\begin{itemize} + \item Each local query must be oblivious to the state of every partition, aside from the one it is directly running against. Further, Bentley-Saxe provides no facility for accessing cross-block state or performing multiple query passes against each partition. + + \item The result merge operation must be $O(1)$ to maintain good query performance. + + \item The result merge operation must be commutative and associative, and is called repeatedly to merge pairs of results. +\end{itemize} + +These requirements restrict the types of queries that can be supported by +the method efficiently. For example, k-nearest neighbor and independent +range sampling are not decomposable.
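+
+To make the role of these restrictions concrete, the following sketch shows how a query is answered against a decomposed structure: each partition is queried in isolation, and the partial results are combined by repeated pairwise application of the merge operator. With $O(\log n)$ partitions, the merge operator is invoked $O(\log n)$ times, which is why a non-constant merge cost is multiplied into the overall query bound. This is illustrative C++ only; the \texttt{decomposed\_query} name and its parameters are placeholders introduced for the example,
+
+\begin{verbatim}
+#include <vector>
+#include <optional>
+
+template <typename Structure, typename Result,
+          typename LocalQuery, typename Merge>
+Result decomposed_query(
+    const std::vector<std::optional<Structure>> &levels,
+    const LocalQuery &local_query,  // evaluated against a single level
+    const Merge &merge,             // the binary merge operator
+    Result identity)                // result of querying an empty level
+{
+    Result result = identity;
+    for (const auto &level : levels) {
+        if (!level.has_value()) continue;
+        // The local query sees only this level; it has no access to the
+        // other levels or to any cross-level state.
+        Result partial = local_query(*level);
+        // Pairwise merge: applied once per non-empty level.
+        result = merge(result, partial);
+    }
+    return result;
+}
+\end{verbatim}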
+\subsubsection{k-Nearest Neighbor} +\label{sssec-decomp-limits-knn} +The k-nearest neighbor (KNN) problem is a generalization of the nearest +neighbor problem, which seeks to return the closest point within the +dataset to a given query point. More formally, this can be defined as, +\begin{definition}[Nearest Neighbor] + Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$ be some function $f: D^2 \to \mathbb{R}^+$ representing the distance between two points within $D$. The nearest neighbor problem, $NN(D, q)$, returns some $d^* \in D$ such that $f(d^*, q) = \min_{d \in D} f(d, q)$ for a given query point, $q \in \mathbb{R}^d$. +\end{definition} + +In practice, it is common to require that $f(x, y)$ be a metric,\footnote{ Contrary to its vernacular usage as a synonym for ``distance'', a metric is more formally defined as a valid distance function over a metric space. Metric spaces require their distance functions to have the following properties, \begin{itemize} \item The distance between a point and itself is always 0. \item All distances between non-equal points must be positive. \item For all points, $x, y \in D$, it is true that $f(x, y) = f(y, x)$. \item For any three points $x, y, z \in D$ it is true that $f(x, z) \leq f(x, y) + f(y, z)$. \end{itemize} These distances also must have the interpretation that $f(x, y) < f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This is the opposite of the definition of similarity, and so some minor manipulations are usually required to make similarity measures work in metric-based indexes.~\cite{intro-analysis} } +and we will do so in the examples of indexes addressing +this problem in this work, but it is not a fundamental aspect of the problem +formulation. The nearest neighbor problem itself is decomposable, with +a simple merge function that returns whichever of its two inputs has the smallest value +of $f(x, q)$~\cite{saxe79}. + +The k-nearest neighbor problem generalizes nearest-neighbor to return +the $k$ nearest elements, +\begin{definition}[k-Nearest Neighbor] + Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$ be some function $f: D^2 \to \mathbb{R}^+$ representing the distance between two points within $D$. The k-nearest neighbor problem, $KNN(D, q, k)$, seeks to identify a set $R\subset D$ with $|R| = k$ such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$. +\end{definition} + +This can be thought of as solving the nearest-neighbor problem $k$ times, +each time removing the returned result from $D$ prior to solving the +problem again. Unlike the single nearest-neighbor case (which can be +thought of as KNN with $k=1$), this problem is \emph{not} decomposable. + +\begin{theorem} KNN is not a decomposable search problem. \end{theorem} + +\begin{proof} +To prove this, consider the query $KNN(D, q, k)$ against some partitioned +dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If KNN is decomposable, +then there must exist some constant-time, commutative, and associative +binary operator $\square$ such that $R = \square_{0 \leq i \leq \ell} R_i$, where $R_i$ is the result of evaluating the query $KNN(D_i, q, k)$. Consider the evaluation of the merge operator against two arbitrary result sets, $R = R_i \square R_j$.
It is clear that $|R| = |R_i| = +|R_j| = k$, and that the contents of $R$ must be the $k$ records from +$R_i \cup R_j$ that are nearest to $q$. Thus, $\square$ must solve the +problem $KNN(R_i \cup R_j, q, k)$. However, KNN cannot be solved in $O(1)$ +time. Therefore, KNN is not a decomposable search problem. +\end{proof} + +With that said, there isn't any fundamental restriction +preventing the merging of the result sets; it is only that the constant-time +performance requirement cannot be satisfied. It is possible +to merge the result sets in non-constant time, and so KNN is $C(n)$-decomposable. Unfortunately, this classification brings with +it a reduction in query performance as a result of the way result merges are +performed in Bentley-Saxe. + +As a concrete example of these costs, consider using Bentley-Saxe to +extend the VPTree~\cite{vptree}. The VPTree is a static, metric index capable of +answering KNN queries with cost $KNN(D, q, k) \in O(k \log n)$. One possible +merge algorithm for KNN would be to push all of the elements in the two +arguments onto a min-heap, and then pop off the first $k$. In this case, +the cost of the merge operation would be $C(k) = k \log k$. Were $k$ assumed +to be constant, the operation could be considered constant-time. +But given that $k$ is only bounded above +by $n$, this isn't a safe assumption to make in general. Evaluating the +total query cost for the extended structure yields, + +\begin{equation} + KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right) +\end{equation} + +The reason for this large increase in cost is the repeated application +of the merge operator. The Bentley-Saxe method requires applying the +merge operator in a binary fashion to each partial result, multiplying +its cost by a factor of $\log n$. Thus, the constant-time requirement +of standard decomposability is necessary to keep the cost of the merge +operator from appearing within the complexity bound of the entire +operation in the general case.\footnote{ There is a special case, noted by Overmars, where the total cost is $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n)) \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the case where the cost of the query and merge operation are sufficiently large to consume the logarithmic factor, and so it doesn't represent a special case with better performance. } +If the result merging operation could be revised to remove this +repeated cost, the cost of supporting $C(n)$-decomposable queries +could be greatly reduced. + +\subsubsection{Independent Range Sampling} + +Another problem that is not decomposable is independent sampling. There +are a variety of problems falling under this umbrella, including weighted +set sampling, simple random sampling, and weighted independent range +sampling, but this section will focus on independent range sampling (IRS). + +\begin{definition}[Independent Range Sampling~\cite{tao22}] Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query interval $q = [x, y]$ and an integer $k$, an independent range sampling query returns $k$ independent samples from $D \cap q$ with each point having equal probability of being sampled. +\end{definition}
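+
+For reference, IRS is straightforward to answer over a single sorted array: locate the boundaries of $D \cap q$ with binary search, then draw $k$ uniform indices from that range. The following is an illustrative C++ sketch under that assumption; the function name and the use of \texttt{std::mt19937} are choices made for the example, not taken from any particular system,
+
+\begin{verbatim}
+#include <vector>
+#include <algorithm>
+#include <random>
+#include <cstddef>
+
+// Draw k independent, uniform samples from data intersect [x, y],
+// where data is sorted in ascending order.
+std::vector<double> irs(const std::vector<double> &data, double x, double y,
+                        size_t k, std::mt19937 &rng) {
+    auto lo = std::lower_bound(data.begin(), data.end(), x);
+    auto hi = std::upper_bound(data.begin(), data.end(), y);
+    std::vector<double> sample;
+    if (lo == hi) return sample;   // no records satisfy the predicate
+
+    std::uniform_int_distribution<size_t> dist(0, (hi - lo) - 1);
+    for (size_t i = 0; i < k; i++) {
+        sample.push_back(*(lo + dist(rng)));
+    }
+    return sample;
+}
+\end{verbatim}
+
+The difficulty described below arises when the data is split across partitions: the number of samples that must be drawn from each partition depends on how many of its records fall within $[x, y]$, which cannot be determined by examining any single partition in isolation.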
+ +This problem immediately encounters a category error when considering +whether it is decomposable: the result set is randomized, whereas the +conditions for decomposability are defined in terms of an exact matching +of records in result sets. To work around this, a slight abuse of definition +is in order: +assume that the equality conditions within the DSP definition can +be interpreted to mean ``the contents of the two sets are drawn from the +same distribution''. This enables the category of DSP to apply to this type +of problem. More formally, +\begin{definition}[Decomposable Sampling Problem] + A sampling problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is decomposable if and only if there exists a constant-time computable, associative, and commutative binary operator $\square$ such that, \begin{equation*} F(A \cup B, q) \sim F(A, q)~ \square ~F(B, q) \end{equation*} +\end{definition} + +Even with this abuse, however, IRS cannot generally be considered decomposable; +it is at best $C(n)$-decomposable. The reason for this is that matching the +distribution requires drawing the appropriate number of samples from each +partition of the data. Even in the special case that $|D_0| = |D_1| = \ldots = +|D_\ell|$, the number of samples from each partition that must appear in the +result set cannot be known in advance due to differences in the selectivity +of the predicate across the partitions. + +\begin{example}[IRS Sampling Difficulties] + Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}$, $D_1 = \{1, 1, 1, 1, 3\}$, and $D_2 = \{4, 4, 4, 4, 4\}$, using bag semantics, and an IRS query over the interval $[3, 4]$ with $k=12$. Because all three partitions have the same size, it seems sensible to evenly distribute the samples across them ($4$ samples from each partition). Applying the query predicate to the partitions results in the following: $d_0 = \{3, 4\}$, $d_1 = \{3\}$, $d_2 = \{4, 4, 4, 4, 4\}$. + In expectation, then, the first result set will contain $R_0 = \{3, 3, 4, 4\}$, as it has a 50\% chance of sampling a $3$ and the same probability of a $4$. The second and third result sets can only be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging these together, we'd find that the probability distribution of the sample would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform the same sampling operation over the full dataset (not partitioned), the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$. +\end{example} + +The problem is that the number of samples drawn from each partition needs to be +weighted based on the number of elements satisfying the query predicate in that +partition. In the above example, by drawing $4$ samples from $D_1$, more weight +is given to $3$ than exists within the base dataset. This can be worked around +by sampling a full $k$ records from each partition, returning both the sample +and the number of records satisfying the predicate as that partition's query +result, and then performing another pass of IRS as the merge operator, but this +is the same approach as was used for KNN above. This leaves IRS firmly in the +$C(n)$-decomposable camp. If it were possible to pre-calculate the number of +samples to draw from each partition, then a constant-time merge operation could +be used. + +\section{Conclusion} +This chapter discussed the necessary background information pertaining to +queries and search problems, indexes, and techniques for dynamic extension.
It +described the potential for using custom indexes for accelerating particular +kinds of queries, as well as the challenges associated with constructing these +indexes. The remainder of this document will seek to address these challenges +through modification and extension of the Bentley-Saxe method, describing work +that has already been completed, as well as the additional work that must be +done to realize this vision.