\chapter{Background}

This chapter will introduce important background information that
will be used throughout the remainder of the document. We'll first
define precisely what is meant by a query, and consider some special
classes of query that will become relevant in our discussion of dynamic
extension. We'll then consider the difference between a static and a
dynamic structure, and techniques for converting static structures into
dynamic ones in a variety of circumstances.

\section{Database Indexes}

The term \emph{index} is often abused within the database community
to refer to a range of closely related, but distinct, conceptual
categories\footnote{
The word index can be used to refer to a structure mapping record
information to the set of records matching that information, as a
general synonym for ``data structure'', to data structures used
specifically in query processing, etc.
}.
This ambiguity is rarely problematic, as the subtle differences
between these categories are not often significant, and context
clarifies the intended meaning in situations where they are.
However, this work explicitly operates at the interface of two of
these categories, and so it is important to disambiguate between
them. As a result, we will be using the word index to refer to a
very specific structure.

\subsection{The Traditional Index}
A database index is a specialized structure which provides a means
to efficiently locate records that satisfy specific criteria, enabling
more efficient processing of the queries it supports. A
traditional database index can be modeled as a function, mapping a
set of attribute values, called a key, $\mathcal{K}$, to a set of
record identifiers, $\mathcal{R}$. Technically, the codomain of an
index can be either a record identifier, a set of record identifiers,
or the physical record itself, depending upon the configuration of
the index. For the purposes of this work, the focus will be on the
first of these, but in principle any of the three index types could
be used with little material difference to the discussion.

Formally speaking, we will use the following definition of a traditional
database index,
\begin{definition}[Traditional Index]
Consider a set of database records, $\mathcal{D}$. An index over
these records, $\mathcal{I}_\mathcal{D}$, is a map of the form
$F:(\mathcal{I}_\mathcal{D}, \mathcal{K}) \to \mathcal{R}$, where
$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$,
called a \emph{key}.
\end{definition}

In order to facilitate this mapping, indexes are built using data
structures. The specific data structure used has particular
implications for the performance of the index, and the situations
in which the index is effective. Broadly speaking, traditional
database indexes can be categorized in two ways: ordered indexes
and unordered indexes. The former allows for iteration over the
set of record identifiers in some sorted order, starting at the
returned record. The latter allows for point-lookups only.
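To make this distinction concrete, consider the following minimal
Python sketch of the two categories (all class and method names here
are illustrative, not drawn from any particular system): the unordered
index supports only point lookups, while the ordered index additionally
supports iteration in key order from a starting point,

\begin{verbatim}
import bisect
from collections import defaultdict

class UnorderedIndex:
    """Hash-based index: point lookups only."""
    def __init__(self):
        self.map = defaultdict(set)   # key -> set of record ids

    def insert(self, key, rid):
        self.map[key].add(rid)

    def lookup(self, key):
        return self.map.get(key, set())

class OrderedIndex:
    """Sorted-array index: point lookups plus in-order iteration."""
    def __init__(self):
        self.keys = []   # sorted keys
        self.rids = []   # record ids, parallel to keys

    def insert(self, key, rid):
        i = bisect.bisect_left(self.keys, key)
        self.keys.insert(i, key)
        self.rids.insert(i, rid)

    def scan_from(self, key):
        """Yield record ids in key order, starting at key."""
        i = bisect.bisect_left(self.keys, key)
        while i < len(self.keys):
            yield self.rids[i]
            i += 1
\end{verbatim}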
There is a very small set of data structures that are usually used
for creating database indexes. The most common range index in RDBMSs
is the B-tree\footnote{By \emph{B-tree} here, I am referring not
to the B-tree data structure itself, but to a wide range of related
structures derived from the B-tree. Examples include the B$^+$-tree,
B$^\epsilon$-tree, etc.} based index, and key-value stores commonly
use indices built on the LSM-tree. Some databases support unordered
indexes using hash tables. Beyond these, some specialized databases or
database extensions have support for indexes based on other structures,
such as the R-tree\footnote{
Like the B-tree, R-tree here is used as a signifier for a general class
of related data structures.} for spatial databases or approximate small
world graph models for similarity search.

\subsection{The Generalized Index}

The previous section discussed the traditional definition of index
as might be found in a database systems textbook. However, this
definition is limited by its association specifically with mapping
key fields to records. For the purposes of this work, I will be
considering a slightly broader definition of index,

\begin{definition}[Generalized Index]
Consider a set of database records, $\mathcal{D}$, and a search
problem, $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$,
is a map of the form $F:(\mathcal{I}_\mathcal{D}, \mathcal{Q}) \to
\mathcal{R}$.
\end{definition}

\emph{Search problems} are the topic of the next section, but in
brief a search problem represents a general class of query, such
as a range scan, point lookup, k-nearest neighbor, etc. A traditional
index is a special case of a generalized index, with $\mathcal{Q}$
being a point-lookup or range query based on a set of record
attributes.
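As an illustration of a generalized index that is not a traditional
key-to-record map, the following Python sketch answers a k-nearest
neighbor search problem over a set of two-dimensional points. The
brute-force scan and all names are illustrative only, standing in
for a purpose-built nearest neighbor structure,

\begin{verbatim}
import heapq

class KNNIndex:
    """A generalized index: maps a k-NN search problem over
    (x, y) points to record identifiers, rather than mapping
    a key to records."""
    def __init__(self, points):
        self.points = list(points)   # list of (rid, x, y)

    def query(self, qx, qy, k):
        """The search problem: return the rids of the k points
        nearest to (qx, qy) under squared Euclidean distance."""
        return [rid for _, rid in heapq.nsmallest(
            k, (((x - qx) ** 2 + (y - qy) ** 2, rid)
                for rid, x, y in self.points))]
\end{verbatim}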
\subsection{Indices in Query Processing}

A database management system utilizes indices to accelerate certain
types of query. Queries are expressed to the system in some high-level
language, such as SQL or Datalog. These are generalized
languages capable of expressing a wide range of possible queries.
The DBMS is then responsible for converting these queries into a
set of primitive data access procedures that are supported by the
underlying storage engine. There are a variety of techniques for
this, including mapping directly to a tree of relational algebra
operators and interpreting that tree, query compilation, etc. But,
ultimately, the expressiveness of this internal query representation
is limited by the routines supported by the storage engine.

As an example, consider the following SQL query (representing a
2-dimensional k-nearest neighbor)\footnote{There are more efficient
ways of answering this query, but I'm aiming for simplicity here
to demonstrate my point},

\begin{verbatim}
SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A
  WHERE A.property = filtering_criterion
  ORDER BY d
  LIMIT 5;
\end{verbatim}

This query will be translated into a logical query plan (a sequence
of relational algebra operators) by the query planner, which could
result in a plan like this,

\begin{verbatim}
query plan here
\end{verbatim}

With this logical query plan, the DBMS will next need to determine
which supported operations it can use to most efficiently answer
this query. For example, the selection operation (A) could be
physically manifested as a table scan, or could be answered using
an index scan if there is an ordered index over \texttt{A.property}.
The query optimizer will make this decision based on its estimate
of the selectivity of the predicate. This may result in one of the
following physical query plans,

\begin{verbatim}
physical query plan
\end{verbatim}

In either case, however, the space of possible physical plans is
limited by the available access methods: either a sorted scan on
an attribute (index) or an unsorted scan (table scan). The database
must filter for all elements matching the filtering criterion,
calculate the distances between all of these points and the query,
and then sort the results to get the final answer. Additionally,
note that the sort operation in the plan is a pipeline-breaker. If
this plan were to appear as a subtree in a larger query plan, the
overall plan would need to wait for the full evaluation of this
sub-query before it could proceed, as sorting requires the full
result set.

Imagine a world where a new index was available to our DBMS: a
nearest neighbor index. This index would allow iteration over
records in sorted order, relative to some predefined metric and a
query point. If such an index existed over \texttt{(A.x, A.y)} using
\texttt{dist}, then a third physical plan would be available to the
DBMS,

\begin{verbatim}
\end{verbatim}

This plan pulls records in order of their distance to \texttt{Q}
directly, using an index, and then filters them, avoiding the
pipeline-breaking sort operation. While it's not obvious in this
case that this new plan is superior (this would depend a lot on the
selectivity of the predicate), it is a third option. It becomes
increasingly superior as the selectivity of the predicate grows,
and is clearly superior in the case where the predicate has unit
selectivity (requiring only the consideration of $5$ records total).
The construction of this special index will be considered in
Section~\ref{ssec:knn}.

This use of query-specific indexing schemes also presents a query
planning challenge: how does the database know when a particular
specialized index can be used for a given query, and how can
specialized indexes broadcast their capabilities to the query planner
in a general fashion? This work is focused on the problem of enabling
the existence of such indexes, rather than facilitating their use;
however, these are important questions that must be considered in
future work for this solution to be viable. There has been work
done surrounding the use of arbitrary indexes in queries in the past,
such as~\cite{byods-datalog}. This problem is considered out-of-scope
for the proposed work, but will be considered in the future.

\section{Queries and Search Problems}

In our discussion of generalized indexes, we encountered \emph{search
problems}. This term is used within the literature
on data structures in a manner similar to how the database community
sometimes uses the term query\footnote{
As with the term index, the term query is often abused and used to
refer to several related, but slightly different, things. In the
vernacular, a query can refer to either a) a general type of search
problem (as in ``range query''), b) a specific instance of a search
problem, or c) a program written in a query language.
}, to refer to a general
class of questions asked of data. Examples include range queries,
point-lookups, nearest neighbor queries, predicate filtering, random
sampling, etc.
Formally, for the purposes of this work, we will define
a search problem as follows,
\begin{definition}[Search Problem]
Given three multisets, $D$, $R$, and $Q$, a search problem is a function
$F: (D, Q) \to R$, where $D$ represents the domain of data to be searched,
$Q$ represents the domain of query parameters, and $R$ represents the
answer domain.\footnote{
It is important to note that it is not required that $R \subseteq D$. As an
example, a \texttt{COUNT} aggregation might map a set of strings onto
an integer. Most common queries do satisfy $R \subseteq D$, but this need
not be a universal constraint.
}
\end{definition}

And we will use the word \emph{query} to refer to a specific instance
of a search problem, except when used as part of the generally
accepted name of a search problem (i.e., range query).

\begin{definition}[Query]
Given three multisets, $D$, $R$, and $Q$, a search problem $F$, and
a specific set of query parameters $q \in Q$, a query is a specific
instance of the search problem, $F(D, q)$.
\end{definition}

As an example of using these definitions, a \emph{membership test}
or \emph{range query} would be considered a search problem, and a
range query over the interval $[10, 99]$ would be a query.

\subsection{Decomposable Search Problems}

An important subset of search problems is that of decomposable
search problems (DSPs). This class was first defined by Saxe and
Bentley as follows,

\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
  \label{def:dsp}
  Given a search problem $F: (D, Q) \to R$, $F$ is decomposable if and
  only if there exists a constant-time computable, associative, and
  commutative binary operator $\square$ such that,
  \begin{equation*}
    F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
  \end{equation*}
\end{definition}

The constant-time requirement was used to prove bounds on the costs of
evaluating DSPs over data broken across multiple partitions. Further work
by Overmars lifted this constraint and considered a more general class
of DSP,
\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}]
  Given a search problem $F: (D, Q) \to R$, $F$ is $C(n)$-decomposable
  if and only if there exists an $O(C(n))$-time computable, associative,
  and commutative binary operator $\square$ such that,
  \begin{equation*}
    F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
  \end{equation*}
\end{definition}

Decomposability is an important property because it allows for
search problems to be answered over partitioned datasets. The details
of this will be discussed in Section~\ref{ssec:bentley-saxe} in the
context of creating dynamic data structures. Many common types of
search problems appearing in databases are decomposable, such as
range queries or predicate filtering.

To demonstrate that a search problem is decomposable, it is necessary
to show the existence of the merge operator, $\square$, and to show
that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two
results, simple induction demonstrates that the problem is decomposable
even in cases with more than two partial results.
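Before turning to concrete examples, the following minimal Python
sketch (the function names are my own) shows how decomposability is
exploited operationally: the query is answered independently against
each partition, and the partial results are combined with $\square$,

\begin{verbatim}
from functools import reduce

def answer_dsp(partitions, query, search, merge):
    """Evaluate a decomposable search problem over a
    partitioned dataset. Associativity and commutativity of
    `merge` mean the result is independent of partition order."""
    return reduce(merge, (search(p, query) for p in partitions))

# Example: a membership test, merged with logical OR.
def member(data, key):
    return key in data

parts = [{3, 9, 27}, {2, 4, 8}, {5, 25}]
print(answer_dsp(parts, 8, member, lambda a, b: a or b))  # True
\end{verbatim}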
As an example, consider range queries,
\begin{definition}[Range Query]
Let $D$ be a set of $n$ points in $\mathbb{R}$. Given an interval
$q = [x, y],\quad x, y \in \mathbb{R}$, a range query returns all
points in $D \cap q$.
\end{definition}

\begin{theorem}
Range queries are a DSP.
\end{theorem}

\begin{proof}
Let $\square$ be the set union operator ($\cup$). Applying this to
Definition~\ref{def:dsp}, we have
\begin{equation*}
  (A \cup B) \cap q = (A \cap q) \cup (B \cap q)
\end{equation*}
which is true by the distributivity of set intersection over union.
Assuming an implementation allowing for an $O(1)$
set union operation, range queries are DSPs.
\end{proof}

Because the codomain of a DSP is not restricted, more complex output
structures can be used to allow problems that are not directly
decomposable to be converted to DSPs, possibly with some minor
post-processing. For example, the calculation of the mean of a set
of numbers can be constructed as a DSP using the following technique,
\begin{theorem}
The calculation of the average of a set of numbers is a DSP.
\end{theorem}
\begin{proof}
Define the search problem as $A: D \to (\mathbb{R}, \mathbb{Z})$,
where $D \subset \mathbb{R}$ and is a multiset. The output tuple
contains the sum of the values within the input set and the
cardinality of the input set. Let $A(D_1) = (s_1, c_1)$ and
$A(D_2) = (s_2, c_2)$. Then, define $A(D_1)~\square~A(D_2) =
(s_1 + s_2, c_1 + c_2)$.

Applying Definition~\ref{def:dsp}, we have
\begin{align*}
  A(D_1 \cup D_2) &= A(D_1)~\square~A(D_2) \\
                  &= (s_1 + s_2, c_1 + c_2)
\end{align*}
which holds because both the sum and the cardinality of a multiset
are additive over multiset union. From this result, the average can
be determined in constant time by taking $\nicefrac{s}{c}$. Therefore,
calculating the average of a set of numbers is a DSP.
\end{proof}
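This construction translates directly into code. The sketch below
(names again my own) expresses the mean as a DSP using $(s, c)$
partial results; it composes with the \texttt{answer\_dsp} helper
shown earlier, with the division deferred to a constant-time
post-processing step,

\begin{verbatim}
def avg_search(data, _q=None):
    """Partial result for the mean: (sum, count)."""
    return (sum(data), len(data))

def avg_merge(a, b):
    """The merge operator: add sums and counts componentwise."""
    return (a[0] + b[0], a[1] + b[1])

parts = [[2.0, 4.0], [6.0], [8.0, 10.0]]
s, c = avg_merge(avg_merge(avg_search(parts[0]),
                           avg_search(parts[1])),
                 avg_search(parts[2]))
print(s / c)  # 6.0 -- constant-time post-processing
\end{verbatim}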
\section{Dynamic Extension Techniques}

Because data in a database is regularly updated, data structures
intended to be used as an index must support updates (inserts,
in-place modification, and deletes) to their data. In principle,
any data structure can support updates to its underlying data through
global reconstruction: adjusting the record set and then rebuilding
the entire structure. Ignoring this trivial (and highly inefficient)
approach, a data structure with support for updates is called
\emph{dynamic}, and one without support for updates is called
\emph{static}. In this section, we discuss approaches for modifying
a static data structure to grant it support for updates, a process
called \emph{dynamic extension} or \emph{dynamization}. A theoretical
survey of this topic can be found in~\cite{overmars83}, but that
work doesn't cover several techniques that are used in practice.
As such, much of this section constitutes our own analysis, tying
together threads from a variety of sources.

\subsection{Local Reconstruction}

One way of viewing updates to a data structure is as reconstructing
all or part of the structure. To minimize the cost of the update,
it is ideal to minimize the size of the reconstruction that accompanies
an update, either by careful structuring of the data to ensure
minimal disruption to surrounding records by an update, or by
deferring the reconstructions and amortizing their costs over as
many updates as possible.

While minimizing the size of a reconstruction seems the most obvious,
and best, approach, it is limited in its applicability. The more
related ``nearby'' records in the structure are, the more records
will be affected by a change. Records can be related in terms of
some ordering of their values, which we'll term a \emph{spatial
ordering}, or in terms of their order of insertion into the structure,
which we'll term a \emph{temporal ordering}. Note that these terms
don't imply anything about the nature of the data, and instead
relate to the principles used by the data structure to arrange it.

Arrays provide the extreme version of both of these ordering
principles. In an unsorted array, in which records are appended to
the end of the array, there is no spatial ordering dependence between
records. This means that any insert or update will require no local
reconstruction, aside from the record being directly affected.\footnote{
A delete can also be performed without any structural adjustments
in a variety of ways. Reorganization of the array as a result of
deleted records serves an efficiency purpose, but isn't required
for the correctness of the structure.} However, the order of
records in the array \emph{does} express a strong temporal dependency:
the index of a record in the array provides the exact insertion
order.

A sorted array provides exactly the opposite situation. The position
of a record in the array reflects an exact spatial ordering of
records with respect to their sorting function. This means that an
update or insert will require reordering a large number of records
(potentially all of them, in the worst case). Because of the stronger
spatial dependence of records in the structure, an update will
require a larger-scale reconstruction. Additionally, there is no
temporal component to the ordering of the records: inserting a set
of records into a sorted array will produce the same final structure
irrespective of insertion order.

It's worth noting that the spatial dependency discussed here, as
it relates to reconstruction costs, is based on the physical layout
of the records and not their logical ordering. To exemplify
this, a sorted singly-linked list can maintain the same logical
order of records as a sorted array, but limits the spatial dependence
between records to each record's preceding node. This means that an
insert into this structure will require only a single node update,
regardless of where in the structure the insert occurs.

The amount of spatial dependence in a structure directly reflects
a trade-off between read and write performance. In the above example,
performing a lookup for a given record in a sorted array requires
asymptotically fewer comparisons in the worst case than in an unsorted
array, because the spatial dependencies can be exploited for an
accelerated search (binary vs. linear search). Interestingly, this
remains the case for lookups against a sorted array vs. a sorted
linked list. Even though both structures have the same logical order
of records, the limited spatial dependencies between nodes in a linked
list force the lookup to perform a scan anyway.

A balanced binary tree sits between these two extremes. Like a
linked list, individual nodes have very few connections. However,
the nodes are arranged in such a way that a connection existing
between two nodes implies further information about the ordering
of the children of those nodes. In this light, rebalancing of the
tree can be seen as maintaining a certain degree of spatial dependence
between the nodes in the tree, ensuring that it is balanced between
the two children of each node. A very general summary of tree
rebalancing techniques can be found in~\cite{overmars83}.
Using an AVL tree~\cite{avl} as a specific example, each insert into
the tree involves adding the new node and updating its parent (like
you'd see in a simple linked list), followed by some larger-scale
local reconstruction in the form of tree rotations, to maintain the
balance factor invariant. This means that insertion requires more
reconstruction effort than the single pointer update in the linked
list case, but results in much more efficient searches (which, as it
turns out, makes insertion more efficient in general too, even with
the overhead, because finding the insertion point is much faster).

\subsection{Amortized Local Reconstruction}

In addition to controlling update cost by arranging the structure
so as to reduce the amount of reconstruction necessary to maintain
the desired level of spatial dependence, update costs can also be
reduced by amortizing the local reconstruction cost over multiple
updates. This is often done in one of two ways: leaving gaps or
adding overflow buckets. These gaps and buckets allow the data
structure to sustain a buffer of insertion capacity before a
reconstruction is triggered.

A classic example of the gap approach is found in the
B$^+$-tree~\cite{b+tree} commonly used in RDBMS indexes, as well
as in open addressing for hash tables. In a B$^+$-tree, each node has
a fixed size, which must be at least half-utilized (aside from the
root node). The empty spaces within these nodes are gaps, which can
be cheaply filled with new records on insert. Only when a node has
been filled must a local reconstruction (called a structural
modification operation, or SMO, for B-trees) occur to redistribute
the data into multiple nodes and replenish the supply of gaps. This
approach is particularly well suited to data structures in contexts
where the natural unit of storage is larger than a record, as in
disk-based (with 4KiB pages) or cache-optimized (with 64B cache
lines) structures. This gap-based approach was also used to create
ALEX, an updatable learned index~\cite{ALEX}.

The gap approach has a number of disadvantages. It results in a
somewhat sparse structure, thereby wasting storage. For example, a
B$^+$-tree requires all nodes other than the root to be at least
half full, meaning that in the worst case up to half of the space
occupied by the structure could be taken up by gaps. Additionally,
this scheme results in some inserts being more expensive than others:
most new records will occupy an available gap, but some will trigger
more expensive SMOs. In particular, it has been observed with
B$^+$-trees that this can lead to ``waves of misery''~\cite{wavesofmisery}:
the gaps in many nodes fill at about the same time, leading to
periodic clusters of high-cost merge operations.

Overflow buckets are seen in ISAM-tree based indexes~\cite{myisam},
as well as in hash tables with closed addressing. In this approach,
the parts of the structure into which records would be inserted (leaf
nodes in ISAM, directory entries in closed-addressing hashing) have
a pointer to an overflow location, where newly inserted records can
be placed. This allows the structure, in theory, to sustain an
unlimited number of insertions. However, read performance degrades,
because the more overflow capacity is utilized, the less the records
in the structure are ordered according to the data structure's
definition. Thus, a reconstruction is periodically necessary to
distribute the overflow records into the structure itself.
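The following Python sketch illustrates the overflow-bucket idea in
miniature (the structure and the rebuild threshold are illustrative,
not taken from any particular system): inserts land in an unsorted
overflow list, reads degrade to a linear scan over that list, and a
periodic rebuild folds the overflow back into the sorted region,

\begin{verbatim}
import bisect

class OverflowNode:
    """Amortized local reconstruction via an overflow bucket."""
    def __init__(self, records, overflow_limit=4):
        self.main = sorted(records)  # sorted, search-optimized
        self.overflow = []           # unsorted overflow bucket
        self.limit = overflow_limit

    def insert(self, key):
        self.overflow.append(key)    # cheap: no reorganization
        if len(self.overflow) > self.limit:
            # Amortized reconstruction: fold overflow into main.
            self.main = sorted(self.main + self.overflow)
            self.overflow = []

    def contains(self, key):
        i = bisect.bisect_left(self.main, key)  # binary search
        if i < len(self.main) and self.main[i] == key:
            return True
        return key in self.overflow             # linear scan
\end{verbatim}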
\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method}

Another approach to supporting updates is to amortize the cost of
global reconstruction over multiple updates. This approach can
take three forms,
\begin{enumerate}

  \item Pairing a dynamic data structure (called a buffer or
  memtable) with an instance of the structure being extended.
  Updates are written to the buffer, and when the buffer is
  full its records are merged with those in the static
  structure, and the structure is rebuilt. This approach is
  used by one version of the originally proposed
  LSM-tree~\cite{oneil93}. Technically, this technique was
  proposed in that work for the purpose of converting random
  writes into sequential ones (all structures involved are
  dynamic), but it can be used for dynamization as well.

  \item Creating multiple, smaller data structures, each
  containing a partition of the records from the dataset, and
  reconstructing individual structures to accommodate new
  inserts in a systematic manner. This technique is the basis
  of the Bentley-Saxe method~\cite{saxe79}.

  \item Using both of the above techniques at once. This is
  the approach used by modern incarnations of the
  LSM-tree~\cite{rocksdb}.

\end{enumerate}

In all three cases, it is necessary for the search problem associated
with the index to be a DSP, as answering it will require querying
multiple structures (the buffer and/or one or more instances of the
data structure) and merging the results together to get a final
result. This section will focus exclusively on the Bentley-Saxe
method, as it is the basis for our proposed methodology.

When dividing records across multiple structures, there is a clear
trade-off between read performance and write performance. Keeping
the individual structures small reduces the cost of reconstruction,
and thereby increases update performance. However, this also means
that more structures will be required to accommodate the same number
of records, when compared to a scheme that allows the structures
to be larger. As each structure must be queried independently, this
will lead to worse query performance. The reverse is also true:
fewer, larger structures will have better query performance and
worse update performance, with the extreme limit of this being a
single structure that is fully rebuilt on each insert.

\begin{figure}
  \caption{Inserting a new record using the Bentley-Saxe method.}
  \label{fig:bsm-example}
\end{figure}

The key insight of the Bentley-Saxe method~\cite{saxe79} is that a
good balance can be struck by using geometrically increasing
structure sizes. In Bentley-Saxe, the sub-structures are ``stacked'',
with the bottom level having a capacity of a single record, and
each subsequent level doubling in capacity. When an update is
performed, the first empty level is located and a reconstruction
is triggered, merging the structures of all levels below this empty
one, along with the new record. An example of this process is shown
in Figure~\ref{fig:bsm-example}. The merit of this approach is
that it ensures that ``most'' reconstructions involve the smaller
data structures towards the bottom of the sequence, while most of
the records reside in large, infrequently updated structures towards
the top. This balances the read and write implications of
structure size, while also allowing the number of structures required
to represent $n$ records to be worst-case bounded by $O(\log n)$.
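The following is a minimal Python sketch of the method, under my own
naming, dynamizing a static sorted array for range queries. Inserts
behave like binary addition with a carry: full levels are absorbed
into a single reconstruction, and queries merge per-level results
with the DSP operator (here, set union),

\begin{verbatim}
from bisect import bisect_left, bisect_right

class SortedArray:
    """The static structure: built once, never updated in place."""
    def __init__(self, records):
        self.data = sorted(records)

    def range_query(self, lo, hi):
        return set(self.data[bisect_left(self.data, lo):
                             bisect_right(self.data, hi)])

class BentleySaxe:
    """levels[i] is None or a structure holding 2^i records."""
    def __init__(self):
        self.levels = []

    def insert(self, rec):
        carry = [rec]
        # Merge full levels, like binary addition with a carry.
        for i, s in enumerate(self.levels):
            if s is None:
                self.levels[i] = SortedArray(carry)
                return
            carry += s.data        # absorb this level into the carry
            self.levels[i] = None
        self.levels.append(SortedArray(carry))  # grow by one level

    def range_query(self, lo, hi):
        # The DSP merge operator (set union) combines the
        # partial results from each level.
        out = set()
        for s in self.levels:
            if s is not None:
                out |= s.range_query(lo, hi)
        return out
\end{verbatim}

Each record participates in at most $O(\log n)$ such rebuilds, which
is the source of the amortized insertion bound given below.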
Given a structure with an $O(P(n))$ construction cost and $O(Q_S(n))$
query cost, the Bentley-Saxe method will produce a dynamic data
structure with,

\begin{align}
  \text{Query Cost} \qquad & O\left(Q_S(n) \cdot \log n\right) \\
  \text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right)
\end{align}

However, the method has poor worst-case insertion cost: if the
entire structure is full, it must grow by another level, requiring
a full reconstruction involving every record within the structure.
A slight adjustment to the technique, due to Overmars and van Leeuwen
\cite{}, allows the worst-case insertion cost to be bounded by
$O\left(\frac{P(n)}{n} \log n\right)$; however, it does so by dividing
each reconstruction into small pieces, one of which is executed
each time a new update occurs. This has the effect of bounding the
worst-case performance, but does so by sacrificing the expected-case
performance, and adds a lot of complexity to the method. This
technique is not used much in practice.\footnote{
  I've yet to find any example of it used in a journal article
  or conference paper.
}

\subsection{Limitations of the Bentley-Saxe Method}