| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-02 16:03:40 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-02 16:03:40 -0400 |
| commit | 873fd659e45e80fe9e229d3d85b3c4c99fb2c121 (patch) | |
| tree | 139575dc844870b5470be7eeec9b97a46f20d70d /chapters/background.tex | |
| parent | b30145b6a54480d3f051be3ff3f8f222f5116f87 (diff) | |
| download | dissertation-873fd659e45e80fe9e229d3d85b3c4c99fb2c121.tar.gz | |
Updates
Diffstat (limited to 'chapters/background.tex')
| -rw-r--r-- | chapters/background.tex | 218 |
1 files changed, 141 insertions, 77 deletions
diff --git a/chapters/background.tex b/chapters/background.tex
index 332dbb6..9950b39 100644
--- a/chapters/background.tex
+++ b/chapters/background.tex
@@ -14,7 +14,7 @@ indices will be discussed briefly.
 Indices are the primary use of data structures within the database
 context that is of interest to our work. Following this, existing
 theoretical results in the area of data structure dynamization will be
 discussed, which will serve as the building blocks
-for our techniques in subsquent chapters. The chapter will conclude with
+for our techniques in subsequent chapters. The chapter will conclude with
 a discussion of some of the limitations of these existing techniques.

 \section{Queries and Search Problems}
@@ -62,7 +62,7 @@ As an example of using these definitions, a \emph{membership test}
 or \emph{range scan} would be considered search problems, and a range
 scan over the interval $[10, 99]$ would be a query. We've drawn this
 distinction because, as we'll see as we enter into the discussion of
-our work in later chapters, it is useful to have seperate, unambiguous
+our work in later chapters, it is useful to have separate, unambiguous
 terms for these two concepts.

 \subsection{Decomposable Search Problems}
@@ -144,7 +144,7 @@ The calculation of the arithmetic mean of a set of numbers is a DSP.
 Consider the search problem $A:\mathcal{D} \to (\mathbb{R}, \mathbb{Z})$,
 where $\mathcal{D}\subset\mathbb{R}$ and is a multi-set. The output
 tuple contains the sum of the values within the input set, and the
-cardinality of the input set. For two disjoint paritions of the data,
+cardinality of the input set. For two disjoint partitions of the data,
 $D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and $A(D_2) = (s_2, c_2)$.
 Let $A(D_1) \mergeop A(D_2) = (s_1 + s_2, c_1 + c_2)$.
@@ -416,7 +416,7 @@ $\Theta(1)$ time,\footnote{
     There isn't any practical reason why $\mathtt{unbuild}$ must run
     in constant time, but this is the assumption made in \cite{saxe79}
     and in subsequent work based on it, and so we will follow the same
-    defininition here.
+    definition here.
 } such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.

 Given this structure, an insert of record $r \in \mathcal{D}$ into a
@@ -440,7 +440,7 @@ in a worst-case insert cost of $\Theta(C(n))$.
 However, opportunities for improving this scheme can present themselves
 when considering the \emph{amortized} insertion cost.

-Consider the cost acrrued by the dynamized structure under global
+Consider the cost accrued by the dynamized structure under global
 reconstruction over the lifetime of the structure. Each insert will
 result in all of the existing records being rewritten, so at worst each
 record will be involved in $\Theta(n)$ reconstructions, each reconstruction
@@ -473,7 +473,7 @@ work.
 The earliest of these is the logarithmic method, often called the
 Bentley-Saxe method in modern literature, and is the most commonly
 discussed technique today. A later technique, the equal block method,
 was also examined. It is generally not as effective as the Bentley-Saxe
-method, but it has some useful properties for explainatory purposes and
+method, but it has some useful properties for explanatory purposes and
 so will be discussed here as well.

 \subsection{Equal Block Method~\cite[pp.~96-100]{overmars83}}
@@ -512,7 +512,7 @@ Following an insert, it is possible that Constraint~\ref{ebm-c1} is violated.\fo
 Constraint~\ref{ebm-c2} cannot be violated by inserts, but may be
 violated by deletes. We're omitting deletes from the discussion at this
 point, but will circle back to them in Section~\ref{sec:deletes}.
-} In this case, the constraints are enforced by "reconfiguring" the
+} In this case, the constraints are enforced by "re-configuring" the
 structure. $s$ is updated to be exactly $f(n)$, all of the existing
 blocks are unbuilt, and then the records are redistributed evenly into
 $s$ blocks.
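To make the insert and re-configuration steps of the equal block method concrete, here is a minimal Python sketch. It is not code from the dissertation: a sorted list stands in for the static structure, $f(n) = \lfloor\sqrt{n}\rfloor$ is an illustrative choice, and Constraints~\ref{ebm-c1} and \ref{ebm-c2} are collapsed into a single simplified block-count check; `build`, `unbuild`, and `EqualBlockDynamization` are hypothetical names.

```python
import bisect
from math import isqrt

def build(records):
    # Hypothetical build step: a sorted list stands in for any static
    # structure that can answer the query of interest.
    return sorted(records)

def unbuild(structure):
    # Recover the records stored in a block.
    return list(structure)

class EqualBlockDynamization:
    """Sketch of the equal block method: keep s ~ f(n) blocks of roughly
    n / f(n) records each, and "re-configure" (rebuild everything into
    exactly s = f(n) equal blocks) when the simplified constraint below
    is violated."""

    def __init__(self, f=isqrt):          # f(n) = floor(sqrt(n)), illustrative only
        self.f = f
        self.blocks = []

    def __len__(self):
        return sum(len(b) for b in self.blocks)

    def insert(self, rec):
        if not self.blocks:
            self.blocks.append(build([rec]))
        else:
            # Insert into the smallest block by locally reconstructing it.
            smallest = min(range(len(self.blocks)), key=lambda i: len(self.blocks[i]))
            self.blocks[smallest] = build(unbuild(self.blocks[smallest]) + [rec])
        # Simplified stand-in for the constraint check: the block count
        # should track f(n); when it drifts, re-configure.
        if len(self.blocks) != max(1, self.f(len(self))):
            self.reconfigure()

    def reconfigure(self):
        # s is set to exactly f(n); every block is unbuilt and the records
        # are redistributed evenly into s freshly built blocks.
        records = [r for b in self.blocks for r in unbuild(b)]
        s = max(1, self.f(len(records)))
        self.blocks = [build(records[i::s]) for i in range(s)]

    def query(self, key):
        # A decomposable query (membership) is answered against every block
        # and the partial results are merged with a constant-time OR.
        for b in self.blocks:
            i = bisect.bisect_left(b, key)
            if i < len(b) and b[i] == key:
                return True
        return False
```

With $f(n) = \sqrt{n}$ this reflects the trade-off the text describes next: reconstructions touch only about $n/f(n)$ records at a time, at the price of answering every query against $f(n)$ blocks.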
@@ -530,7 +530,7 @@ reconstruction, at the possible cost of increased query
 performance for sub-linear queries. We'll omit the details of the proof
 of performance for brevity and streamline some of the original notation
 (full details can be found in~\cite{overmars83}), but this technique ultimately
-results in a data structure with the following performance characterstics,
+results in a data structure with the following performance characteristics,
 \begin{align*}
 \text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{C(n)}{n} + C\left(\frac{n}{f(n)}\right)\right) \\
 \text{Worst-case Query Cost:}& \quad \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right) \\
@@ -585,40 +585,100 @@ doing $\mathscr{I}_2 = \text{build}\left(\{x_{12}\} \cup
 \text{unbuild}(\mathscr{I}_1) \cup \text{unbuild}(\mathscr{I}_2)\right)$
 and then emptying $\mathscr{I}_1$ and $\mathscr{I}_2$.

+This technique is called a \emph{binary decomposition} of the data
+structure. Considering a BSM dynamization of a structure containing $n$
+records, labeling each block with a $0$ if it is empty and a $1$ if it
+is full will result in the binary representation of $n$. For example,
+the final state of the structure in Figure~\ref{fig:bsm-example} contains
+$12$ records, and the labeling procedure will result in $0\text{b}1100$,
+which is $12$ in binary. Inserts affect this representation of the
+structure in the same way that incrementing the binary number by $1$ does.
+
+By applying BSM to a data structure, a dynamized structure can be created
+with the following performance characteristics,
+\begin{align*}
+\text{Amortized Insertion Cost:}&\quad \Theta\left(\frac{C(n)}{n}\cdot \log_2 n\right) \\
+\text{Worst-case Insertion Cost:}&\quad \Theta\left(C(n)\right) \\
+\text{Worst-case Query Cost:}& \quad \Theta\left(\log_2 n\cdot \mathscr{Q}\left(n\right)\right) \\
+\end{align*}
+This is a particularly attractive result because, for example, a data
+structure having $C(n) \in \Theta(n)$ will have an amortized insertion
+cost of $\Theta(\log_2 n)$, which is quite reasonable. The cost of this
+improvement is an extra logarithmic factor attached to the query
+complexity. It is also worth noting that the worst-case insertion cost
+remains the same as under global reconstruction, but this case arises
+only very rarely. In terms of the binary decomposition, the worst-case
+behavior is triggered each time the binary representation overflows and
+a new digit must be added.
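As an illustration of the binary decomposition added above, here is a small Python sketch (not the dissertation's implementation): block $i$ is either empty or holds exactly $2^i$ records, and an insert behaves like incrementing a binary counter. A sorted list again stands in for the static structure, and the names are hypothetical.

```python
import bisect

def build(records):
    # Hypothetical build step; a sorted list stands in for any static structure.
    return sorted(records)

def unbuild(structure):
    return list(structure)

class BentleySaxeDynamization:
    """Sketch of the logarithmic (Bentley-Saxe) method: blocks[i] is either
    None or a structure holding exactly 2^i records, so labeling blocks
    empty/full spells out the binary representation of n."""

    def __init__(self):
        self.blocks = []

    def insert(self, rec):
        # Gather the new record plus the contents of every full block in the
        # prefix -- the same carry propagation as incrementing a binary counter.
        carry = [rec]
        i = 0
        while i < len(self.blocks) and self.blocks[i] is not None:
            carry.extend(unbuild(self.blocks[i]))
            self.blocks[i] = None
            i += 1
        if i == len(self.blocks):
            self.blocks.append(None)    # the overflow case: a new "digit" is added
        self.blocks[i] = build(carry)   # len(carry) == 2**i

    def query(self, key):
        # A decomposable query (membership here) runs against every non-empty
        # block; the partial results are merged with a constant-time OR.
        for b in self.blocks:
            if b is not None:
                i = bisect.bisect_left(b, key)
                if i < len(b) and b[i] == key:
                    return True
        return False
```

After twelve inserts, the occupied blocks in this sketch are exactly the positions where $0\text{b}1100$ has a one, matching the labeling argument above.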
+\subsection{Delete Support}
+
+Classical dynamization techniques have also been developed with
+support for deleting records. In general, the same technique of global
+reconstruction that was used for inserting records can also be used to
+delete them. Given a record $r \in \mathcal{D}$ and a data structure
+$\mathscr{I} \in \mathcal{I}$ such that $r \in \mathscr{I}$, $r$ can be
+deleted from the structure in $C(n)$ time as follows,
+\begin{equation*}
+\mathscr{I}^\prime = \text{build}(\text{unbuild}(\mathscr{I}) - \{r\})
+\end{equation*}
+However, supporting deletes within the dynamization schemes discussed
+above is more complicated. The core problem is that inserts affect the
+dynamized structure in a deterministic way, and as a result certain
+partitioning schemes can be leveraged to reason about the performance.
+Deletes do not work this way.
+\begin{figure}
+\caption{A Bentley-Saxe dynamization for the integers on the
+interval $[1, 100]$.}
+\label{fig:bsm-delete-example}
+\end{figure}
-
-
-
-
-
-
-
-
+For example, consider a Bentley-Saxe dynamization that contains all
+integers on the interval $[1, 100]$, inserted in that order, shown in
+Figure~\ref{fig:bsm-delete-example}. We would like to delete all of the
+records from this structure, one at a time, using global reconstruction.
+This presents several problems (illustrated in the sketch below),
+\begin{itemize}
+    \item For each record, we need to identify which block it is in before
+          we can delete it.
+    \item The cost of performing a delete is a function of which block the
+          record is in, which is a question of distribution and not easily
+          controlled.
+    \item As records are deleted, the structure will potentially violate
+          the invariants of the decomposition scheme used, which will
+          require additional work to fix.
+\end{itemize}
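The three problems listed above show up directly in a sketch of what a delete via global reconstruction would have to do against a binary decomposition. This continues the hypothetical representation from the earlier sketch (blocks[i] is either None or a sorted list of exactly 2^i records) and is purely illustrative.

```python
def delete_via_reconstruction(blocks, rec):
    # Problem 1: nothing in the decomposition says where `rec` lives, so in
    # the worst case every block must be searched just to locate it.
    for i, block in enumerate(blocks):
        if block is not None and rec in block:
            # Problem 2: the rebuild costs C(2^i) -- it depends entirely on
            # which block the record happens to occupy.
            remaining = list(block)
            remaining.remove(rec)
            blocks[i] = sorted(remaining) if remaining else None
            # Problem 3: the block now holds 2^i - 1 records, violating the
            # invariant that block i is either empty or holds exactly 2^i
            # records, so additional re-partitioning work would be needed.
            return True
    return False    # record not present
```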
 \section{Limitations of Classical Dynamization Techniques}
 \label{sec:bsm-limits}

-While fairly general, the Bentley-Saxe method has a number of
-limitations. Because of the way in which it merges query results together,
-the number of search problems to which it can be efficiently applied is
-limited. Additionally, the method does not expose any trade-off space
-to configure the structure: it is one-size fits all.
+While fairly general, these dynamization techniques have a number of
+limitations that prevent them from being directly usable as a general
+solution to the problem of creating database indices. Because of the
+requirement that the query being answered be decomposable, many search
+problems cannot be addressed--or at least not addressed efficiently--by
+decomposition-based dynamization. The techniques also do nothing to reduce
+the worst-case insertion cost, resulting in extremely poor tail latency
+performance relative to hand-built dynamic structures. Finally, these
+approaches do not do a good job of exposing the underlying configuration
+space to the user, meaning that the user can exert only limited control
+over the performance of the dynamized data structure. This section will
+discuss these limitations, and the rest of the document will be dedicated
+to proposing solutions to them.

 \subsection{Limits of Decomposability}
 \label{ssec:decomp-limits}

-Unfortunately, the DSP abstraction used as the basis of the Bentley-Saxe
-method has a few significant limitations that must first be overcome,
-before it can be used for the purposes of this work. At a high level, these limitations
-are as follows,
+Unfortunately, the DSP abstraction used as the basis of classical
+dynamization techniques has a few significant limitations that restrict
+their applicability,
 \begin{itemize}
-    \item Each local query must be oblivious to the state of every partition,
-          aside from the one it is directly running against. Further,
-          Bentley-Saxe provides no facility for accessing cross-block state
-          or performing multiple query passes against each partition.
+    \item The query must be broadcast identically to each block and cannot
+          be adjusted based on the state of the other blocks.
+
+    \item The query process is done in one pass--it cannot be repeated.
     \item The result merge operation must be $O(1)$ to maintain good
           query performance.
@@ -633,7 +693,7 @@ range sampling are not decomposable.
 \subsubsection{k-Nearest Neighbor}
 \label{sssec-decomp-limits-knn}

-The k-nearest neighbor (KNN) problem is a generalization of the nearest
+The k-nearest neighbor (k-NN) problem is a generalization of the nearest
 neighbor problem, which seeks to return the closest point within the
 dataset to a given query point. More formally, this can be defined as,
 \begin{definition}[Nearest Neighbor]
@@ -667,11 +727,11 @@ In practice, it is common to require $f(x, y)$ be a metric,\footnote
     manipulations are usually required to make similarity measures work
     in metric-based indexes. \cite{intro-analysis}
 }
-and this will be done in the examples of indexes for addressing
+and this will be done in the examples of indices for addressing
 this problem in this work, but it is not a fundamental aspect of the problem
-formulation. The nearest neighbor problem itself is decomposable, with
-a simple merge function that accepts the result with the smallest value
-of $f(x, q)$ for any two inputs\cite{saxe79}.
+formulation. The nearest neighbor problem itself is decomposable,
+with a simple merge function that accepts the result with the smallest
+value of $f(x, q)$ for any two inputs\cite{saxe79}.

 The k-nearest neighbor problem generalizes nearest-neighbor to return
 the $k$ nearest elements,
@@ -688,15 +748,15 @@ the $k$ nearest elements,
 This can be thought of as solving the nearest-neighbor problem $k$
 times, each time removing the returned result from $D$ prior to solving
 the problem again. Unlike the single nearest-neighbor case (which can be
-thought of as KNN with $k=1$), this problem is \emph{not} decomposable.
+thought of as k-NN with $k=1$), this problem is \emph{not} decomposable.

 \begin{theorem}
-    KNN is not a decomposable search problem.
+    k-NN is not a decomposable search problem.
 \end{theorem}

 \begin{proof}
 To prove this, consider the query $KNN(D, q, k)$ against some partitioned
-dataset $D = D_0 \cup D_1 \ldots \cup D_\ell$. If KNN is decomposable,
+dataset $D = D_0 \cup D_1 \cup \ldots \cup D_\ell$. If k-NN is decomposable,
 then there must exist some constant-time, commutative, and associative
 binary operator $\mergeop$, such that $R = \mergeop_{0 \leq i \leq l}
 R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q,
@@ -704,32 +764,32 @@ k)$.
 Consider the evaluation of the merge operator against two arbitrary
 result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| =
 |R_j| = k$, and that the contents of $R$ must be the $k$ records from
 $R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the
-problem $KNN(R_i \cup R_j, q, k)$. However, KNN cannot be solved in $O(1)$
-time. Therefore, KNN is not a decomposable search problem.
+problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$
+time. Therefore, k-NN is not a decomposable search problem.
 \end{proof}

 With that said, it is clear that there isn't any fundamental restriction
-preventing the merging of the result sets;
-it is only the case that an
+preventing the merging of the result sets; it is only the case that an
 arbitrary performance requirement wouldn't be satisfied. It is possible
-to merge the result sets in non-constant time, and so it is the case that
-KNN is $C(n)$-decomposable. Unfortunately, this classification brings with
-it a reduction in query performance as a result of the way result merges are
-performed in Bentley-Saxe.
-
-As a concrete example of these costs, consider using Bentley-Saxe to
-extend the VPTree~\cite{vptree}. The VPTree is a static, metric index capable of
-answering KNN queries in $KNN(D, q, k) \in O(k \log n)$. One possible
-merge algorithm for KNN would be to push all of the elements in the two
-arguments onto a min-heap, and then pop off the first $k$. In this case,
-the cost of the merge operation would be $C(k) = k \log k$. Were $k$ assumed
-to be constant, then the operation could be considered to be constant-time.
-But given that $k$ is only bounded in size above
-by $n$, this isn't a safe assumption to make in general. Evaluating the
-total query cost for the extended structure, this would yield,
+to merge the result sets in non-constant time, and so it is the case
+that k-NN is $C(n)$-decomposable. Unfortunately, this classification
+brings with it a reduction in query performance as a result of the way
+result merges are performed.
+
+As a concrete example of these costs, consider using the Bentley-Saxe
+method to extend the VPTree~\cite{vptree}. The VPTree is a static,
+metric index capable of answering k-NN queries in $KNN(D, q, k) \in O(k
+\log n)$. One possible merge algorithm for k-NN would be to push all
+of the elements in the two arguments onto a min-heap, and then pop off
+the first $k$. In this case, the cost of the merge operation would be
+$C(k) = k \log k$. Were $k$ assumed to be constant, then the operation
+could be considered to be constant-time. But given that $k$ is only
+bounded in size above by $n$, this isn't a safe assumption to make in
+general. Evaluating the total query cost for the extended structure,
+this would yield,
 \begin{equation}
-    KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
+    \text{k-NN}(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right)
 \end{equation}

 The reason for this large increase in cost is the repeated application
@@ -746,16 +806,16 @@ operation in the general case.\footnote {
     large to consume the logarithmic factor, and so it doesn't represent
     a special case with better performance.
 }
-If the result merging operation could be revised to remove this
-duplicated cost, the cost of supporting $C(n)$-decomposable queries
-could be greatly reduced.
+If we could revise the result merging operation to remove this duplicated
+cost, we could greatly reduce the cost of supporting $C(n)$-decomposable
+queries.
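The min-heap merge described above is easy to state in code. The sketch below uses hypothetical names and an arbitrary user-supplied distance function; it makes the $C(k) = k \log k$ cost of a single merge visible, and a decomposition-based dynamization applies that operator once per block, i.e. $O(\log n)$ times.

```python
import heapq
from functools import reduce

def knn_merge(r_i, r_j, q, k, dist):
    # Push both partial result sets (each of size at most k) onto a min-heap
    # keyed by distance to the query point, then pop the k nearest back off.
    # Heap construction is O(k); the k pops cost O(k log k) in total.
    heap = [(dist(x, q), n, x) for n, x in enumerate(list(r_i) + list(r_j))]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(min(k, len(heap)))]

def knn_result(per_block_results, q, k, dist):
    # Folding the per-block results together, as the dynamization must,
    # repeats the non-constant merge once per block.
    return reduce(lambda a, b: knn_merge(a, b, q, k, dist), per_block_results, [])
```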
 \subsubsection{Independent Range Sampling}

 Another problem that is not decomposable is independent sampling. There
 are a variety of problems falling under this umbrella, including weighted
 set sampling, simple random sampling, and weighted independent range
-sampling, but this section will focus on independent range sampling.
+sampling, but we will focus on independent range sampling here.

 \begin{definition}[Independent Range Sampling~\cite{tao22}]
 Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
@@ -765,14 +825,13 @@ sampling, but this section will focus on independent range sampling.
 \end{definition}

 This problem immediately encounters a category error when considering
-whether it is decomposable: the result set is randomized, whereas the
-conditions for decomposability are defined in terms of an exact matching
-of records in result sets. To work around this, a slight abuse of definition
-is in order:
-assume that the equality conditions within the DSP definition can
-be interpreted to mean ``the contents in the two sets are drawn from the
-same distribution''. This enables the category of DSP to apply to this type
-of problem. More formally,
+whether it is decomposable: the result set is randomized, whereas
+the conditions for decomposability are defined in terms of an exact
+matching of records in result sets. To work around this, a slight abuse
+of definition is in order: assume that the equality conditions within
+the DSP definition can be interpreted to mean ``the contents in the two
+sets are drawn from the same distribution''. This enables the category
+of DSP to apply to this type of problem. More formally,
 \begin{definition}[Decomposable Sampling Problem]
     A sampling problem $F: (D, Q) \to R$, $F$ is decomposable if and
     only if there exists a constant-time computable, associative, and
     commutative binary operator $\mergeop$ such that,
     \begin{equation*}
         F(A \cup B, q) = F(A, q) \mergeop F(B, q)
     \end{equation*}
 \end{definition}

-Even with this abuse, however, IRS cannot generally be considered decomposable;
-it is at best $C(n)$-decomposable. The reason for this is that matching the
-distribution requires drawing the appropriate number of samples from each each
-partition of the data. Even in the special case that $|D_0| = |D_1| = \ldots =
-|D_\ell|$, the number of samples from each partition that must appear in the
-result set cannot be known in advance due to differences in the selectivity
-of the predicate across the partitions.
+Even with this abuse, however, IRS cannot generally be considered
+decomposable; it is at best $C(n)$-decomposable. The reason for this is
+that matching the distribution requires drawing the appropriate number
+of samples from each partition of the data. Even in the special
+case that $|D_0| = |D_1| = \ldots = |D_\ell|$, the number of samples
+from each partition that must appear in the result set cannot be known
+in advance due to differences in the selectivity of the predicate across
+the partitions.

 \begin{example}[IRS Sampling Difficulties]
@@ -818,11 +878,15 @@ is given to $3$ than exists within the base dataset.
 This can be worked around by sampling a full $k$ records from each
 partition, returning both the sample and the number of records satisfying
 the predicate as that partition's query result, and then performing
 another pass of IRS as the merge operator, but this
-is the same approach as was used for KNN above. This leaves IRS firmly in the
+is the same approach as was used for k-NN above. This leaves IRS firmly in the
 $C(n)$-decomposable camp. If it were possible to pre-calculate the number
 of samples to draw from each partition, then a constant-time merge
 operation could be used.
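One way to realize the work-around just described (each block returns a full $k$ samples plus its count of qualifying records, and the merge re-samples from those, weighted by the counts) is sketched below. The function names and the sorted-list block representation are illustrative assumptions, not the dissertation's implementation.

```python
import random

def local_irs(block, lo, hi, k, rng=random):
    # Per-block partial result: k samples drawn (with replacement) from the
    # records of `block` falling in [lo, hi], plus the number of such records.
    # A linear scan stands in for whatever sampling index the block really uses.
    qualifying = [x for x in block if lo <= x <= hi]
    sample = [rng.choice(qualifying) for _ in range(k)] if qualifying else []
    return sample, len(qualifying)

def irs_merge(partial_results, k, rng=random):
    # Non-constant-time merge: since the number of samples each block should
    # contribute cannot be known in advance, a second sampling pass is run over
    # the partial results, choosing a block with probability proportional to
    # its qualifying-record count and then one of that block's samples.
    samples = [s for s, c in partial_results if c > 0]
    counts = [c for s, c in partial_results if c > 0]
    if not samples:
        return []
    picks = rng.choices(range(len(samples)), weights=counts, k=k)
    return [rng.choice(samples[i]) for i in picks]
```

Because each final sample picks a block in proportion to its qualifying count and then a uniform sample from that block, the merged result is distributed as if drawn from the union of the blocks, at the cost of a merge that is not constant time.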
+\subsection{Insertion Tail Latency}
+
+\subsection{Configurability}
+
 \section{Conclusion}

 This chapter discussed the necessary background information pertaining to
 queries and search problems, indexes, and techniques for dynamic extension. It