 chapters/beyond-dsp.tex   |   2
 chapters/dynamization.tex | 297
 chapters/introduction.tex |  79
3 files changed, 194 insertions, 184 deletions
diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex index 5655b8c..9e6adf0 100644 --- a/chapters/beyond-dsp.tex +++ b/chapters/beyond-dsp.tex @@ -1764,7 +1764,7 @@ in the dataset. \subfloat[Query Latency]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn-query} \label{fig:knn-query}} \subfloat[Index Overhead]{\includegraphics[width=.32\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn-space} \label{fig:knn-space}} %\vspace{-3mm} - \caption{k-NN Index Evaluation} + \caption{$k$-NN Index Evaluation} %\vspace{-3mm} \label{fig:knn-eval} \end{figure*} diff --git a/chapters/dynamization.tex b/chapters/dynamization.tex index 085ce65..738a436 100644 --- a/chapters/dynamization.tex +++ b/chapters/dynamization.tex @@ -115,16 +115,21 @@ data structures must support the following three operations, \item $\mathbftt{query}: \left(\mathcal{I}, \mathcal{Q}\right) \to \mathcal{R}$ \\ $\mathbftt{query}(\mathscr{I}, q)$ answers the query $F(\mathscr{I}, q)$ and returns the result. This operation runs - in $\mathscr{Q}_S(n)$ time in the worst-case and \emph{cannot alter - the state of $\mathscr{I}$}. + in $\mathscr{Q}_S(n)$ time in the worst-case and cannot alter + the state of $\mathscr{I}$. -\item $\mathbftt{build}:\left(\mathcal{PS}(\mathcal{D})\right) \to \mathcal{I}$ \\ +\item $\mathbftt{build}:\mathcal{PS}(\mathcal{D}) \to \mathcal{I}$ \\ $\mathbftt{build}(d)$ constructs a new instance of $\mathcal{I}$ using the records in set $d$. This operation runs in $B(n)$ time in - the worst case. - -\item $\mathbftt{unbuild}\left(\mathcal{I}\right) \to \mathcal{PS}(\mathcal{D})$ \\ - $\mathbftt{unbuild}(\mathscr{I})$ recovers the set of records, $d$ + the worst case.\footnote{ + We use the notation $\mathcal{PS}(\mathcal{D})$ to indicate the + power set of $\mathcal{D}$, i.e. the set containing all possible + subsets of $\mathcal{D}$. Thus, $d \in \mathcal{PS}(\mathcal{D}) + \iff d \subseteq \mathcal{D}$. + } + +\item $\mathbftt{unbuild}: \mathcal{I} \to \mathcal{PS}(\mathcal{D})$ \\ + $\mathbftt{unbuild}(\mathscr{I})$ recovers the set of records, $d$, used to construct $\mathscr{I}$. The literature on dynamization generally assumes that this operation runs in $\Theta(1)$ time~\cite{saxe79}, and we will adopt the same assumption in our @@ -133,38 +138,41 @@ data structures must support the following three operations, \end{definition} -Note that the term static is distinct from immutable. Static refers -to the layout of records within the data structure, whereas immutable -refers to the data stored within those records. This distinction will -become relevant when we discuss different techniques for adding delete -support to data structures. The data structures used are always static, -but not necessarily immutable, because the records may contain header -information (like visibility) that is updated in place. +Note that the property of being static is distinct from that of being +immutable. Static refers to the layout of records within the data +structure, whereas immutable refers to the data stored within those +records. This distinction will become relevant when we discuss different +techniques for adding delete support to data structures. The data +structures used are always static, but not necessarily immutable, +because the records may contain header information (like visibility) +that is updated in place. 
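To make the three-operation interface defined above concrete, the following minimal Python sketch (illustrative only, not drawn from the thesis) models a static structure for 1-dimensional range counting. The names SortedArray, build, query, and unbuild are hypothetical and chosen only to mirror the definition.

    # Illustrative sketch: a "static" structure in the sense of the definition
    # above -- a sorted array answering 1-D range-count queries.
    import bisect
    from typing import List, Tuple

    class SortedArray:
        def __init__(self, keys: Tuple[int, ...]):
            self._keys = keys  # layout is fixed at build time and never changes

        @staticmethod
        def build(d: List[int]) -> "SortedArray":
            # B(n): construct from a record set, here by sorting in O(n log n)
            return SortedArray(tuple(sorted(d)))

        def query(self, q: Tuple[int, int]) -> int:
            # Q_S(n): answer F(I, q) without altering the structure's state
            lo, hi = q
            return bisect.bisect_right(self._keys, hi) - bisect.bisect_left(self._keys, lo)

        def unbuild(self) -> List[int]:
            # Recovers the record set; the literature assumes Theta(1), which
            # returning the stored sequence directly would satisfy.
            return list(self._keys)

For instance, SortedArray.build([5, 1, 3]).query((1, 3)) evaluates to 2, and unbuild() recovers {1, 3, 5}.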
\begin{definition}[Half-dynamic Data Structure~\cite{overmars-art-of-dyn}] \label{def:half-dynamic-ds} A half-dynamic data structure requires the three operations of a static -data structure, as well as the ability to efficiently insert new data into -a structure built over an existing data set, $d$. +data structure, as well as the ability to efficiently insert new data +into a structure built over an existing data set. \begin{itemize} \item $\mathbftt{insert}: \left(\mathcal{I}, \mathcal{D}\right) \to \mathcal{I}$ \\ $\mathbftt{insert}(\mathscr{I}, r)$ returns a data structure, $\mathscr{I}^\prime$, such that $\mathbftt{query}(\mathscr{I}^\prime, - q) = F(d \cup r, q)$, for some $r \in \mathcal{D}$. This operation - runs in $I(n)$ time in the worst-case. + q) = F(\mathbftt{unbuild}(\mathscr{I}) \cup \{r\}, q)$, for some + $r \in \mathcal{D}$. This operation runs in $I(n)$ time in the + worst-case. \end{itemize} \end{definition} The important aspect of insertion in this model is that the effect of -the new record on the query result is observed, not necessarily that -the result is a structure exactly identical to the one that would be -obtained by building a new structure over $d \cup r$. Also, though the -formalism used implies a functional operation where the original data -structure is unmodified, this is not actually a requirement. $\mathscr{I}$ -could be sightly modified in place, and returned as $\mathscr{I}^\prime$, -as is conventionally done with native dynamic data structures. +the new record on the query result is observed, not necessarily that the +result is a structure exactly identical to the one that would be obtained +by building a new structure over $\mathbftt{unbuild}(\mathscr{I}) \cup +\{r\}$. Also, though the formalism used implies a functional operation +where the original data structure is unmodified, this is not actually +a requirement. $\mathscr{I}$ could be slightly modified in place, and +returned as $\mathscr{I}^\prime$, as is conventionally done with native +dynamic data structures. \begin{definition}[Full-dynamic Data Structure~\cite{overmars-art-of-dyn}] \label{def:full-dynamic-ds} @@ -175,7 +183,7 @@ has support for deleting records from the dataset. \item $\mathbftt{delete}: \left(\mathcal{I}, \mathcal{D}\right) \to \mathcal{I}$ \\ $\mathbftt{delete}(\mathscr{I}, r)$ returns a data structure, $\mathscr{I}^\prime$, such that $\mathbftt{query}(\mathscr{I}^\prime, - q) = F(d - r, q)$, for some $r \in \mathcal{D}$. This operation + q) = F(\mathbftt{unbuild}(\mathscr{I}) - \{r\}, q)$, for some $r \in \mathcal{D}$. This operation runs in $D(n)$ time in the worst-case. \end{itemize} @@ -199,7 +207,7 @@ data structures cannot be statically queried--the act of querying them mutates their state. This is the case for structures like heaps, stacks, and queues, for example. -\section{Decomposition-based Dynamization} +\section{Dynamization Basics} \emph{Dynamization} is the process of transforming a static data structure into a dynamic one. When certain conditions are satisfied by the data @@ -226,10 +234,10 @@ section discusses techniques that are more general, and don't require workload-specific assumptions.
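Before turning to decomposition, it may help to see the naive baseline in code. The sketch below is a hypothetical illustration (reusing the SortedArray from the previous sketch) of insert and delete implemented purely by global reconstruction, matching the definitions above: every update pays the full B(n) rebuild cost, which is exactly what the techniques discussed next aim to reduce.

    # Hypothetical sketch: dynamization by global reconstruction alone.
    def insert(structure: SortedArray, r: int) -> SortedArray:
        # query(insert(I, r), q) = F(unbuild(I) union {r}, q); cost I(n) = B(n)
        return SortedArray.build(structure.unbuild() + [r])

    def delete(structure: SortedArray, r: int) -> SortedArray:
        # cost D(n) = B(n); assumes r is actually present in the structure
        d = structure.unbuild()
        d.remove(r)
        return SortedArray.build(d)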
For more detail than is included in this section, Overmars wrote a book providing a comprehensive survey of techniques for creating dynamic data structures, including not only the -dynamization techniques discussed here, but also local reconstruction -based techniques and more~\cite{overmars83}.\footnote{ - Sadly, this book isn't readily available in - digital format as of the time of writing. +techniques discussed here, but also local reconstruction based techniques +and more~\cite{overmars83}.\footnote{ + Sadly, this book isn't readily available in digital format as of + the time of writing. } @@ -271,30 +279,34 @@ requiring $B(n)$ time, it will require $B(\sqrt{n})$ time.} \end{figure} The problem with global reconstruction is that each insert or delete -must rebuild the entire data structure, involving all of its records. The -key insight, first discussed by Bentley and Saxe~\cite{saxe79}, is that -the cost associated with global reconstruction can be reduced by be -accomplished by \emph{decomposing} the data structure into multiple, -smaller structures, each built from a disjoint partition of the data. -These smaller structures are called \emph{blocks}. It is possible to -devise decomposition schemes that result in asymptotic improvements -of insertion performance when compared to global reconstruction alone. +must rebuild the entire data structure. The key insight that enables +dynamization based on global reconstruction, first discussed by +Bentley and Saxe~\cite{saxe79}, is that the cost associated with global +reconstruction can be reduced by \emph{decomposing} the data structure +into multiple, smaller structures, called \emph{blocks}, each built from +a disjoint partition of the data. The process by which the structure is +broken into blocks is called a decomposition method, and various methods +have been proposed that result in asymptotic improvements of insertion +performance when compared to global reconstruction alone. \begin{example}[Data Structure Decomposition] -Consider a data structure that can be constructed in $B(n) \in \Theta -(n \log n)$ time with $|\mathscr{I}| = n$. Inserting a new record into -this structure using global reconstruction will require $I(n) \in \Theta -(n \log n)$ time. However, if the data structure is decomposed into -blocks, such that each block contains $\Theta(\sqrt{n)})$ records, as shown -in Figure~\ref{fig:bg-decomp}, then only a single block must be reconstructed -to accommodate the insert, requiring $I(n) \in \Theta(\sqrt{n} \log \sqrt{n})$ time. +Consider a data structure that can be constructed in $B(n) \in \Theta (n +\log n)$ time with $|\mathscr{I}| = n$. Inserting a new record into this +structure using global reconstruction will require $I(n) \in \Theta (n +\log n)$ time. However, if the data structure is decomposed into blocks, +such that each block contains $\Theta(\sqrt{n})$ records, as shown in +Figure~\ref{fig:bg-decomp}, then only a single block must be reconstructed +to accommodate the insert, requiring $I(n) \in \Theta(\sqrt{n} \log +\sqrt{n})$ time. If this structure contains $m = \frac{n}{\sqrt{n}}$ +blocks, we represent it with the notation $\mathscr{I} = \{\mathscr{I}_1, +\ldots, \mathscr{I}_m\}$, where $\mathscr{I}_i$ is the $i$th block. \end{example} Much of the existing work on dynamization has considered different -approaches to decomposing data structures, and the effects that these -approaches have on insertion and query performance.
However, before we can -discuss these approaches, we must first address the problem of answering -search problems over these decomposed structures. +decomposition methods for static data structures, and the effects that +these methods have on insertion and query performance. However, before +we can discuss these approaches, we must first address the problem of +answering search problems over these decomposed structures. \subsection{Decomposable Search Problems} @@ -335,12 +347,12 @@ search problems}, for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$. \end{definition} -\Paragraph{Examples.} To demonstrate that a search problem is -decomposable, it is necessary to show the existence of the merge operator, -$\mergeop$, with the necessary properties, and to show that $F(A \cup -B, q) = F(A, q)~ \mergeop ~F(B, q)$. With these two results, induction -demonstrates that the problem is decomposable even in cases with more -than two partial results. +\subsubsection{Examples} +To demonstrate that a search problem is decomposable, it is necessary to +prove the existence of the merge operator, $\mergeop$, with the necessary +properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B, +q)$. With these two results, induction demonstrates that the problem is +decomposable even in cases with more than two partial results. As an example, consider the range counting problem, which seeks to identify the number of elements in a set of 1-dimensional points that @@ -395,13 +407,14 @@ taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set of numbers is a DSP. \end{proof} -\Paragraph{Answering Queries for DSPs.} Queries for a decomposable -search problem can be answered over a decomposed structure by -individually querying each block, and then merging the results together -using $\mergeop$. In many cases, this process will introduce some -overhead in the query cost. Given a decomposed data structure $\mathscr{I} -= \{\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_m\}$, -a query for a $C(n)$-decomposable search problem can be answered using, +\subsubsection{Answering Queries for DSPs.} +Queries for a decomposable search problem can be answered over a +decomposed structure by individually querying each block, and then merging +the results together using $\mergeop$. In many cases, this process +will introduce some overhead in the query cost. Given a decomposed +data structure $\mathscr{I} = \{\mathscr{I}_1, \mathscr{I}_2, \ldots, +\mathscr{I}_m\}$, a query for a $C(n)$-decomposable search problem can +be answered using, \begin{equation*} \mathbftt{query}\left(\mathscr{I}, q\right) \triangleq \bigmergeop_{i=1}^{m} F(\mathscr{I}_i, q) \end{equation*} @@ -420,11 +433,11 @@ better. Under certain circumstances, the costs of querying multiple blocks can be absorbed, resulting in no worst-case overhead, at least asymptotically. As an example, consider a linear scan of the data running in $\Theta(n)$ time. In this case, every record must be considered, -and so there isn't any performance penalty\footnote{ +and so there isn't any performance penalty to breaking the records into +multiple blocks and scanning them individually.\footnote{ From an asymptotic perspective. There will still be measurable performance effects from caching, etc., even in this case. -} to breaking the records out into multiple chunks and scanning them -individually. 
More formally, for any query running in $\mathscr{Q}_S(n) \in +} More formally, for any query running in $\mathscr{Q}_S(n) \in \Omega\left(n^\epsilon\right)$ time where $\epsilon > 0$, the worst-case cost of answering a decomposable search problem from a decomposed structure is $\Theta\left(\mathscr{Q}_S(n)\right)$.~\cite{saxe79} @@ -445,37 +458,38 @@ half-dynamic data structures, and the next section will discuss similar considerations for full-dynamic structures. Of the decomposition techniques, we will focus on the three most important -from a practical standpoint.\footnote{ - There are, in effect, two main methods for decomposition: +methods.\footnote{ + There are two main classes of method for decomposition: decomposing based on some counting scheme (logarithmic and $k$-binomial)~\cite{saxe79} or decomposing into equally sized blocks (equal block method)~\cite{overmars-art-of-dyn}. Other, more complex, methods do exist, but they are largely compositions of these two - simpler ones. These composed decompositions (heh) are of largely - theoretical interest, as they are sufficiently complex to be of - questionable practical utility.~\cite{overmars83} -} The earliest of these is the logarithmic method, often called the -Bentley-Saxe method in modern literature, and is the most commonly -discussed technique today. The logarithmic method has been directly -applied in a few instances in the literature, such as to metric indexing -structures~\cite{naidan14} and spatial structures~\cite{bkdtree}, + simpler ones. These decompositions are of largely theoretical + interest, as they are sufficiently complex to be of questionable + practical utility.~\cite{overmars83} +} The earliest of these is the logarithmic method~\cite{saxe79}, often +called the Bentley-Saxe method in modern literature, and is the most +commonly discussed technique today. The logarithmic method has been +directly applied in a few instances in the literature, such as to metric +indexing structures~\cite{naidan14} and spatial structures~\cite{bkdtree}, and has also been used in a modified form for genetic sequence search structures~\cite{almodaresi23} and graphs~\cite{lsmgraph}, to cite a few examples. Bentley and Saxe also proposed a second approach, the $k$-binomial method, that slightly alters the exact decomposition approach -used by the logarithmic method to allow for flexibility in whether the -performance of inserts or queries should be favored. A later technique, -the equal block method, was also developed, which also seeks to introduce -a mechanism for performance tuning. Of the three, the logarithmic method -is the most generally effective, and we have not identified any specific -applications of either $k$-binomial decomposition or the equal block method -outside of the theoretical literature. +used by the logarithmic method to allow for flexibility in whether +the performance of inserts or queries should be favored~\cite{saxe79}. +A later technique, the equal block method~\cite{overmars-art-of-dyn}, +was also developed, which also seeks to introduce a mechanism for +performance tuning. Of the three, the logarithmic method is the most +generally effective, and we have not identified any specific applications +of either $k$-binomial decomposition or the equal block method outside +of the theoretical literature. 
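As a rough sketch of how these pieces fit together (again illustrative only, reusing the hypothetical SortedArray from the earlier sketches), the following Python fragment answers a decomposable range-count query over a decomposed structure by folding the per-block results with the merge operator (here simply addition), and implements the insertion rule of the logarithmic method described in the subsection that follows.

    # Hypothetical sketch: a decomposed structure as a list of blocks, where
    # None marks an empty block.
    from functools import reduce
    from typing import List, Optional, Tuple

    def query_decomposed(blocks: List[Optional[SortedArray]], q: Tuple[int, int]) -> int:
        # query(I, q) = merge of F(I_i, q) over all non-empty blocks; for
        # range counting the merge operator is +.
        return reduce(lambda a, b: a + b,
                      (blk.query(q) for blk in blocks if blk is not None), 0)

    def insert_logarithmic(blocks: List[Optional[SortedArray]], r: int) -> None:
        # Logarithmic (Bentley-Saxe) insert: rebuild the first empty block from
        # the new record plus the contents of all smaller, full blocks, then
        # empty those blocks. In this 0-indexed sketch, block i holds 2^i records.
        pending = [r]
        for i, blk in enumerate(blocks):
            if blk is None:
                blocks[i] = SortedArray.build(pending)
                return
            pending.extend(blk.unbuild())
            blocks[i] = None
        blocks.append(SortedArray.build(pending))  # all blocks full: add a level

Amortizing the occasional large rebuilds over the inserts that trigger them is what yields the logarithmic method's amortized insertion bound quoted later in this section.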
\subsection{The Logarithmic Method} \label{ssec:bsm} The original, and most frequently used, decomposition technique is the -logarithmic method, also called Bentley-Saxe method (BSM) in more recent +logarithmic method, also called Bentley-Saxe method in more recent literature. This technique decomposes the structure into logarithmically many blocks of exponentially increasing size. More specifically, the data structure is decomposed into $h = \lceil \log_2 n \rceil$ blocks, @@ -483,16 +497,16 @@ $\mathscr{I}_1, \mathscr{I}_2, \ldots, \mathscr{I}_h$. A given block $\mathscr{I}_i$ will be either empty, or contain exactly $2^i$ records within it. -The procedure for inserting a record, $r \in \mathcal{D}$, into -a logarithmic decomposition is as follows. If the block $\mathscr{I}_1$ -is empty, then $\mathscr{I}_1 = \mathbftt{build}{\{r\}}$. If it is not -empty, then there will exist a maximal sequence of non-empty blocks -$\mathscr{I}_1, \mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq +The procedure for inserting a record, $r \in \mathcal{D}$, into a +logarithmic decomposition is as follows. If the block $\mathscr{I}_1$ +is empty, then $\mathscr{I}_1 = \mathbftt{build}(\{r\})$. If it is +not empty, then there will exist a maximal sequence of non-empty +blocks $\mathscr{I}_1, \ldots, \mathscr{I}_i$ for some $i \geq 1$, terminated by an empty block $\mathscr{I}_{i+1}$. In this case, $\mathscr{I}_{i+1}$ is set to $\mathbftt{build}(\{r\} \cup \bigcup_{l=1}^i \mathbftt{unbuild}(\mathscr{I}_l))$ and blocks $\mathscr{I}_1$ through -$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to the -end of the structure as needed. +$\mathscr{I}_i$ are emptied. New empty blocks can be freely added to +the end of the structure as needed. %FIXME: switch the x's to r's for consistency \begin{figure} @@ -509,18 +523,20 @@ and $2$ to be merged, along with the $r_{12}$, to create the new block. \label{fig:bsm-example} \end{figure} +\begin{example}[Insertion into a Logarithmic Decomposition] Figure~\ref{fig:bsm-example} demonstrates this insertion procedure. The -dynamization is built over a set of records $x_1, x_2, \ldots, -x_{10}$ initially, with eight records in $\mathscr{I}_4$ and two in -$\mathscr{I}_2$. The first new record, $x_{11}$, is inserted directly -into $\mathscr{I}_1$. For the next insert following this, $x_{12}$, the +dynamization is built over a set of records $r_1, r_2, \ldots, +r_{10}$ initially, with eight records in $\mathscr{I}_4$ and two in +$\mathscr{I}_2$. The first new record, $r_{11}$, is inserted directly +into $\mathscr{I}_1$. For the next insert following this, $r_{12}$, the first empty block is $\mathscr{I}_3$, and so the insert is performed by -doing $\mathscr{I}_3 = \text{build}\left(\{x_{12}\} \cup +doing $\mathscr{I}_3 = \text{build}\left(\{r_{12}\} \cup \text{unbuild}(\mathscr{I}_2) \cup \text{unbuild}(\mathscr{I}_3)\right)$ and then emptying $\mathscr{I}_2$ and $\mathscr{I}_3$. +\end{example} -This technique is called a \emph{binary decomposition} of the data -structure. Considering a logarithmic decomposition of a structure +This technique is also called a \emph{binary decomposition} of the +data structure. Considering a logarithmic decomposition of a structure containing $n$ records, labeling each block with a $0$ if it is empty and a $1$ if it is full will result in the binary representation of $n$. For example, the final state of the structure in Figure~\ref{fig:bsm-example} @@ -529,8 +545,8 @@ in $0\text{b}1100$, which is $12$ in binary. 
Inserts affect this representation of the structure in the same way that incrementing the binary number by $1$ does. -By applying this method to a data structure, a dynamized structure can -be created with the following performance characteristics, +By applying this method to a static data structure, a half-dynamic +structure can be created with the following performance characteristics, \begin{align*} \text{Amortized Insertion Cost:}&\quad I_A(n) \in \Theta\left(\frac{B(n)}{n}\cdot \log_2 n\right) \\ \text{Worst Case Insertion Cost:}&\quad I(n) \in \Theta\left(B(n)\right) \\ @@ -568,12 +584,11 @@ entire structure is compacted into a single block. One of the significant limitations of the logarithmic method is that it is incredibly rigid. In our earlier discussion of decomposition we noted that there exists a clear trade-off between insert and query performance -for half-dynamic structures mediate by the number of blocks into which -the structure is decomposed. However, the logarithmic method does not -allow any navigation of this trade-off. In their original paper on the -topic, Bentley and Saxe proposed a different decomposition scheme that -does expose this trade-off, however, which they called the $k$-binomial -transform.~\cite{saxe79} +for half-dynamic structures mediated by the number of blocks into which the +structure is decomposed. However, the logarithmic method does not allow +any navigation of this trade-off. In their original paper on the topic, +Bentley and Saxe proposed a different decomposition scheme that does +expose this trade-off, called the $k$-binomial transform.~\cite{saxe79} In this transform, rather than decomposing the data structure based on powers of two, the structure is decomposed based on a sum of $k$ binomial @@ -772,17 +787,17 @@ which such a data structure exists is called a \emph{merge decomposable search problem} (MDSP)~\cite{merge-dsp}. Note that in~\cite{merge-dsp}, Overmars considers a \emph{very} specific -definition where the data structure is built in two stages. An initial -sorting phase, requiring $O(n \log n)$ time, and then a construction -phase requiring $O(n)$ time. Overmars's proposed mechanism for leveraging -this property is to include with each block a linked list storing the -records in sorted order (presumably to account for structures where the -records must be sorted, but aren't necessarily kept that way). During -reconstructions, these sorted lists can first be merged, and then the -data structure built from the resulting merged list. Using this approach, -even accounting for the merging of the list, he is able to prove that -the amortized insertion cost is less than would have been the case paying -the $O( n \log n)$ cost for each reconstruction.~\cite{merge-dsp} +definition where the data structure is built in two stages: an initial +sorting phase, requiring $O(n \log n)$ time, and then a construction phase +requiring $O(n)$ time. Overmars's proposed mechanism for leveraging this +property attaches a linked list to each block, which stores the records +in sorted order (to account for structures where the records must be +sorted, but aren't necessarily kept that way). During reconstructions, +these sorted lists can first be merged, and then the data structure built +from the resulting merged list.
Using this approach, even accounting +for the merging of the list, he is able to prove that the amortized +insertion cost is less than would have been the case paying the $O( +n \log n)$ cost for each reconstruction.~\cite{merge-dsp} While Overmars's definition for MDSP does capture a large number of mergeable data structures (including all of the mergeable structures @@ -793,12 +808,12 @@ built from an unsorted set of records. More formally, \begin{definition}[Merge Decomposable Search Problem~\cite{merge-dsp}] \label{def:mdsp} A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ - is decomposable if and only if there exists a solution to the + is merge decomposable if and only if there exists a solution to the search problem (i.e., a data structure) that is static, and also supports the operation, \begin{itemize} \item $\mathbftt{merge}: \mathcal{I}^k \to \mathcal{I}$ \\ - $\mathbftt{merge}(\mathscr{I}_1, \ldots \mathscr{I}_k)$ returns a + $\mathbftt{merge}(\mathscr{I}_1, \ldots, \mathscr{I}_k)$ returns a static data structure, $\mathcal{I}^\prime$, constructed from the input data structures, with cost $B_M(n, k) \leq B(n)$, such that for any set of search parameters $q$, @@ -812,8 +827,8 @@ The value of $k$ can be upper-bounded by the decomposition technique used. For example, in the logarithmic method there will be $\log n$ structures to merge in the worst case, and so to gain benefit from the merge routine, the merging of $\log n$ structures must be less expensive -than building the new structure using the standard $\mathtt{unbuild}$ -and $\mathtt{build}$ mechanism. Note that the availability of an efficient merge +than building the new structure using the standard $\mathbftt{unbuild}$ +and $\mathbftt{build}$ mechanism. The availability of an efficient merge operation isn't helpful in the equal block method, which doesn't perform data structure merges.\footnote{ In the equal block method, all reconstructions are due to either @@ -860,8 +875,8 @@ additionally appear in a new structure as well. When inserting into this structure, the algorithm first examines every level, $i$. If both $Older_{i-1}$ and $Oldest_{i-1}$ are full, then the algorithm will execute $\frac{B(2^i)}{2^i}$ steps of the algorithm -to construct $New_i$ from $\text{unbuild}(Older_{i-1}) \cup -\text{unbuild}(Oldest_{i-1})$. Once enough inserts have been performed +to construct $New_i$ from $\mathbftt{unbuild}(Older_{i-1}) \cup +\mathbftt{unbuild}(Oldest_{i-1})$. Once enough inserts have been performed to completely build some block, $New_i$, the source blocks for the reconstruction, $Oldest_{i-1}$ and $Older_{i-1}$ are deleted, $Old_{i-1}$ becomes $Oldest_{i-1}$, and $New_i$ is assigned to the oldest empty block @@ -880,18 +895,16 @@ worst-case bound drops to $I(n) \in \Theta\left(\frac{B(n)}{n}\right)$. \label{ssec:dyn-deletes} Full-dynamic structures are those with support for deleting records, -as well as inserting. As it turns out, supporting deletes efficiently -is significantly more challenging than inserts, but there are some -results in the theoretical literature for efficient delete support in -restricted cases. - -While, as discussed earlier, it is in principle possible to support -deletes using global reconstruction, with the operation defined as +as well as inserting. As it turns out, supporting deletes efficiently is +significantly more challenging than inserts, but there are some results +in the theoretical literature for efficient delete support in restricted +cases. 
In principle it is possible to support deletes using +global reconstruction, with the operation defined as, \begin{equation*} \mathbftt{delete}(\mathscr{I}, r) \triangleq \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}) - \{r\}) \end{equation*} -the extension of this procedure to a decomposed data structure is less -than trivial. Unlike inserts, where the record can (in principle) be +However, the extension of this procedure to a decomposed data structure is +less than trivial. Unlike inserts, where the record can (in principle) be placed into whatever block we like, deletes must be applied specifically to the block containing the record. As a result, there must be a means to locate the block containing a specified record before it can be deleted. @@ -940,7 +953,7 @@ exists a constant time computable operator, $\Delta$, such that \begin{equation*} F(A - B, q) = F(A, q)~\Delta~F(B, q) \end{equation*} -for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$. +for all $A, B \in \mathcal{PS}(\mathcal{D})$. \end{definition} Given a search problem with this property, it is possible to emulate @@ -1305,15 +1318,15 @@ the $k$ nearest elements, This can be thought of as solving the nearest-neighbor problem $k$ times, each time removing the returned result from $D$ prior to solving the problem again. Unlike the single nearest-neighbor case (which can be -thought of as k-NN with $k=1$), this problem is \emph{not} decomposable. +thought of as $k$-NN with $k=1$), this problem is \emph{not} decomposable. \begin{theorem} - k-NN is not a decomposable search problem. + $k$-NN is not a decomposable search problem. \end{theorem} \begin{proof} To prove this, consider the query $KNN(D, q, k)$ against some partitioned -dataset $D = D_1 \cup D_2 \ldots \cup D_\ell$. If k-NN is decomposable, +dataset $D = D_1 \cup D_2 \cup \ldots \cup D_\ell$. If $k$-NN is decomposable, then there must exist some constant-time, commutative, and associative binary operator $\mergeop$, such that $R = \mergeop_{1 \leq i \leq l} R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q, @@ -1321,22 +1334,22 @@ k)$. Consider the evaluation of the merge operator against two arbitrary result sets, $R = R_i \mergeop R_j$. It is clear that $|R| = |R_i| = |R_j| = k$, and that the contents of $R$ must be the $k$ records from $R_i \cup R_j$ that are nearest to $q$. Thus, $\mergeop$ must solve the -problem $KNN(R_i \cup R_j, q, k)$. However, k-NN cannot be solved in $O(1)$ -time. Therefore, k-NN is not a decomposable search problem. +problem $KNN(R_i \cup R_j, q, k)$. However, $k$-NN cannot be solved in $O(1)$ +time. Therefore, $k$-NN is not a decomposable search problem. \end{proof} With that said, it is clear that there isn't any fundamental restriction preventing the merging of the result sets; it is only the case that an arbitrary performance requirement wouldn't be satisfied. It is possible to merge the result sets in non-constant time, and so it is the case -that k-NN is $C(n)$-decomposable. Unfortunately, this classification +that $k$-NN is $C(n)$-decomposable. Unfortunately, this classification brings with it a reduction in query performance as a result of the way result merges are performed. As a concrete example of these costs, consider using the logarithmic method to extend the VPTree~\cite{vptree}. The VPTree is a static, -metric index capable of answering k-NN queries in $KNN(D, q, k) \in O(k -\log n)$.
One possible merge algorithm for k-NN would be to push all +metric index capable of answering $k$-NN queries in $KNN(D, q, k) \in O(k +\log n)$. One possible merge algorithm for $k$-NN would be to push all of the elements in the two arguments onto a min-heap, and then pop off the first $k$. In this case, the cost of the merge operation would be $C(k) = k \log k$. Were $k$ assumed to be constant, then the operation @@ -1346,7 +1359,7 @@ general. Evaluating the total query cost for the extended structure, this would yield, \begin{equation} - k-NN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right) + KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right) \end{equation} The reason for this large increase in cost is the repeated application diff --git a/chapters/introduction.tex b/chapters/introduction.tex index 8a45bd0..101c36f 100644 --- a/chapters/introduction.tex +++ b/chapters/introduction.tex @@ -1,8 +1,6 @@ \chapter{Introduction} \label{chap:intro} -\section{Motivation} - Modern relational database management systems (RDBMS) are founded upon a set-based representation of data~\cite{codd70}. This model is very flexible and can be used to represent data of a wide variety of @@ -16,11 +14,10 @@ structures called indices, which can be used to accelerate particular types of query. To take full advantage of these structures, databases feature sophisticated query planning and optimization systems that can identify opportunities to utilize these indices~\cite{cowbook}. This -approach works well for particular types of queries for which an index -has been designed and integrated into the database. Unfortunately, many -RDBMS only support a very limited set of indices for accelerating single -dimensional range queries and point-lookups~\cite{mysql-btree-hash, -cowbook}. +approach works well for particular types of queries for which an index has +been designed and integrated into the database. Many RDBMS only support +a very limited set of indices for accelerating single dimensional range +queries and point-lookups~\cite{mysql-btree-hash, cowbook}. This situation is unfortunate, because one of the major challenges currently facing data systems is the processing of complex analytical @@ -54,28 +51,27 @@ of extending an existing or novel data structure with support for all of these functions is a major barrier to their use. As a current example that demonstrates this problem, consider the recent -development of learned indices. These are a broad class of data structure -that use various techniques to approximate a function mapping a key onto -its location in storage. Theoretically, this model allows for better -space efficiency of the index, as well as improved lookup performance. -This concept was first proposed by Kraska et al. in 2017, when they -published a paper on the first learned index, RMI~\cite{RMI}. This index -succeeding in showing that a learned model can be both faster and smaller -than a conventional range index, but the proposed solution did not support -updates. The first (non-concurrently) updatable learned index, ALEX, took -a year and a half to appear~\cite{alex}. Over the course of the subsequent +development of learned indices. Learned indices are data structures +that use various techniques to approximate a function mapping +a key onto its location in storage. The concept was first proposed +by Kraska et al. in 2017, when they published a paper on the first +learned index, RMI~\cite{RMI}.
This index succeeded in showing that +a learned model can be both faster and smaller than a conventional +range index, but the proposed solution did not support updates. The +first (non-concurrently) updatable learned index, ALEX, took a year +and a half to appear~\cite{alex}. Over the course of the subsequent three years, several learned indexes were proposed with concurrency -support~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512} but a -recent performance study~\cite{10.14778/3551793.3551848} showed that these -were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352}, -a traditional index. This same study did however demonstrate that a new -design, ALEX+, was able to outperform ART-OLC under certain circumstances, -but even with this result learned indexes are not generally considered -production ready, because they suffer from significant performance -regressions under certain workloads, and are highly sensitive to the -distribution of keys~\cite{10.14778/3551793.3551848,alex-aca}. Despite the -demonstrable advantages of the technique and over half a decade of -development, learned indexes still have not reached a generally usable +support~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512} +but a recent performance study~\cite{10.14778/3551793.3551848} +showed that these were still generally inferior to traditional +indexing techniques~\cite{10.1145/2933349.2933352}. While this +study demonstrated that a new design, ALEX+, was able to outperform +traditional indices under certain circumstances, it also showed that +learned indices suffer from significant performance regressions +under certain workloads, and are highly sensitive to the key +distribution~\cite{10.14778/3551793.3551848,alex-aca}. Despite the +demonstrable advantages of the technique and nearly a decade of +development, learned indices still have not reached a generally usable state. It would not be an exaggeration to say that there are dozens of novel data @@ -91,7 +87,7 @@ to database practitioners, and the capabilities of database systems could be greatly enhanced. It is our goal with this work to make a significant step in this direction. -\section{Existing Attempts} +\section{Existing Work} At present, there are several lines of work targeted at reducing the development burden associated with creating specialized indices. We @@ -100,7 +96,7 @@ classify them into three broad categories, \begin{itemize} \item \textbf{Automatic Index Composition.} This line of work seeks to automatically compose an instance-optimized data structure for indexing -static data by examining the workload and combining a collection of basic +data by examining the workload and combining a collection of basic primitive structures to optimize performance. \item \textbf{Generalized Index Templates.} This line of work seeks @@ -125,14 +121,14 @@ will be extensively discussed in Chapter~\ref{chap:background}. Automatic index composition has been considered in a variety of papers~\cite{periodic-table,ds-alchemy,fluid-ds,gene,cosine}, each considering differing sets of data structure primitives and different -techniques for composing the structure. The general principle across all -incarnations of the technique is to consider a (usually static) set of -data, and a workload consisting of single-dimensional range queries and -point lookups.
The system then analyzes the workload, either statically -or in real time, selects specific primitive structures optimized for -certain operations (e.g., hash table-like structures for point lookups, -sorted runs for range scans), and applies them to different regions -of the data, in an attempt to maximize the overall performance of the +techniques for composing the structure. The general principle across +all incarnations of the technique is to consider a (usually static) +set of data, and a workload consisting of single-dimensional range +queries and point lookups. The system then analyzes the workload, +either statically or in real time, selects specific primitive structures +optimized for certain operations (e.g., hash table-like structures for +point lookups, sorted runs for range scans), and applies them to different +regions of the data in order to maximize the overall performance of the workload. Although some work in this area suggests generalization to more complex data types, such as multi-dimensional data~\cite{fluid-ds}, this line is broadly focused on creating instance-optimal indices for @@ -171,8 +167,8 @@ we will consider dynamization,\footnote{ all refer to the same process. } the automatic extension of an existing static data structure with support for inserts and deletes. The most general of these techniques -are based on amortized global reconstruction~\cite{overmars83}, -an approach that divides a single data structure up into smaller +are based on data structure \emph{decomposition}~\cite{overmars83}. This +is an approach that divides a single data structure up into smaller structures, called blocks, built over disjoint partitions of the data. Inserts and deletes can then be supported by selectively rebuilding these blocks. The most commonly used version of this @@ -214,7 +210,8 @@ Specifically, the proposed work will address the following points, \begin{enumerate} \item The proposal of a theoretical framework for analyzing queries and data structures that extends existing theoretical - approaches and allows for more data structures to be dynamized. + approaches and allows for more data structures to be + systematically dynamized. \item The design of a system based upon this theoretical framework for automatically dynamizing static data structures in a performant and configurable manner. |