diff --git a/chapters/background.tex.bak b/chapters/background.tex.bak
new file mode 100644
index 0000000..d57b370
--- /dev/null
+++ b/chapters/background.tex.bak
@@ -0,0 +1,574 @@
+\chapter{Background}
+
+This chapter will introduce important background information that
+will be used throughout the remainder of the document. We'll first
+define precisely what is meant by a query, and consider some special
+classes of query that will become relevant in our discussion of dynamic
+extension. We'll then consider the difference between a static and a
+dynamic structure, and techniques for converting static structures into
+dynamic ones in a variety of circumstances.
+
+\section{Database Indexes}
+
+The term \emph{index} is often abused within the database community
+to refer to a range of closely related, but distinct, conceptual
+categories\footnote{
+The word index can be used to refer to a structure mapping record
+information to the set of records matching that information, as a
+general synonym for ``data structure'', to data structures used
+specifically in query processing, etc.
+}.
+This ambiguity is rarely problematic, as the subtle differences
+between these categories are not often significant, and context
+clarifies the intended meaning in situations where they are.
+However, this work explicitly operates at the interface of two of
+these categories, and so it is important to disambiguate between
+them. As a result, we will be using the word index to
+refer to a very specific structure.
+
+\subsection{The Traditional Index}
+A database index is a specialized structure which provides a means
+to efficiently locate records that satisfy specific criteria,
+enabling more efficient processing of the queries it supports. A
+traditional database index can be modeled as a function, mapping a
+set of attribute values, called a key, $\mathcal{K}$, to a set of
+record identifiers, $\mathcal{R}$. Technically, the codomain of an
+index can be either a record identifier, a set of record identifiers,
+or the physical record itself, depending upon the configuration of
+the index. For the purposes of this work, the focus will be on the
+first of these, but in principle any of the three index types could
+be used with little material difference to the discussion.
+
+Formally speaking, we will use the following definition of a traditional
+database index,
+\begin{definition}[Traditional Index]
+Consider a set of database records, $\mathcal{D}$. An index over
+these records, $\mathcal{I}_\mathcal{D}$ is a map of the form
+$F:(\mathcal{I}_\mathcal{D}, \mathcal{K}) \to \mathcal{R}$, where
+$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$,
+called a \emph{key}.
+\end{definition}
+
+In order to facilitate this mapping, indexes are built using data
+structures. The specific data structure used has particular
+implications for the performance of the index and the situations
+in which the index is effective. Broadly speaking, traditional
+database indexes can be categorized into two types: ordered indexes
+and unordered indexes. The former allows for iteration
+over the set of record identifiers in some sorted order, starting
+at the returned record. The latter allows for point-lookups only.
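+
+To make the distinction concrete, the following sketch contrasts the
+two categories. It is an illustrative Python model (the names and
+structures are ours, not drawn from any particular system): the
+unordered index supports only point-lookups, while the ordered index
+can additionally iterate over records in key order starting from a
+looked-up key.
+
+\begin{verbatim}
+import bisect
+from collections import defaultdict
+
+class UnorderedIndex:
+    """Hash-based index: point-lookups only."""
+    def __init__(self):
+        self._map = defaultdict(set)     # key -> set of record ids
+
+    def insert(self, key, rid):
+        self._map[key].add(rid)
+
+    def lookup(self, key):
+        return self._map.get(key, set())
+
+class OrderedIndex:
+    """Sorted-array index: point-lookups plus in-order iteration."""
+    def __init__(self, entries):
+        self._entries = sorted(entries)  # list of (key, rid) pairs
+
+    def lookup(self, key):
+        i = bisect.bisect_left(self._entries, (key,))
+        return {r for k, r in self._entries[i:] if k == key}
+
+    def scan_from(self, key):
+        # yield (key, rid) pairs in sorted order, starting at `key`
+        i = bisect.bisect_left(self._entries, (key,))
+        for k, rid in self._entries[i:]:
+            yield k, rid
+\end{verbatim}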
+
+There is a small set of data structures that are typically used
+for creating database indexes. The most common range index in RDBMSs
+is the B-tree\footnote{ By \emph{B-tree} here, I am referring not
+to the B-tree data structure itself, but to a wide range of related structures
+derived from the B-tree. Examples include the B$^+$-tree,
+B$^\epsilon$-tree, etc. } based index, and key-value stores commonly
+use indices built on the LSM-tree. Some databases support unordered
+indexes using hashtables. Beyond these, some specialized databases or
+database extensions have support for indexes based on other structures,
+such as the R-tree\footnote{
+Like the B-tree, R-tree here is used as a signifier for a general class
+of related data structures.} for spatial databases or approximate small
+world graph models for similarity search.
+
+\subsection{The Generalized Index}
+
+The previous section discussed the traditional definition of index
+as might be found in a database systems textbook. However, this
+definition is limited by its association specifically with mapping
+key fields to records. For the purposes of this work, I will be
+considering a slightly broader definition of index,
+
+\begin{definition}[Generalized Index]
+Consider a set of database records, $\mathcal{D}$, and a search
+problem, $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$,
+is a map of the form $F:(\mathcal{I}_\mathcal{D}, \mathcal{Q}) \to
+\mathcal{R}$.
+\end{definition}
+
+\emph{Search problems} are the topic of the next section, but in
+brief a search problem represents a general class of query, such
+as range scan, point lookup, k-nearest neighbor, etc. A traditional
+index is a special case of a generalized index, in which $\mathcal{Q}$
+is a point-lookup or range query based on a set of record
+attributes.
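+
+As an illustration of this broader definition, the sketch below (our
+own Python mock-up, with hypothetical names) shows a generalized index
+whose associated search problem is k-nearest neighbor rather than a
+key lookup.
+
+\begin{verbatim}
+from abc import ABC, abstractmethod
+
+class GeneralizedIndex(ABC):
+    @abstractmethod
+    def query(self, parameters):
+        """Evaluate F(I_D, Q) for one set of query parameters."""
+
+class KNNIndex(GeneralizedIndex):
+    """Example: k-nearest neighbor as the search problem."""
+    def __init__(self, points):
+        self._points = list(points)   # (x, y, record_id) tuples
+
+    def query(self, parameters):
+        (qx, qy), k = parameters
+        by_dist = sorted(self._points,
+                         key=lambda p: (p[0]-qx)**2 + (p[1]-qy)**2)
+        return [rid for _, _, rid in by_dist[:k]]
+\end{verbatim}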
+
+\subsection{Indices in Query Processing}
+
+A database management system utilizes indices to accelerate certain
+types of query. Queries are expressed to the system in some high
+level language, such as SQL or Datalog. These are generalized
+languages capable of expressing a wide range of possible queries.
+The DBMS is then responsible for converting these queries into a
+set of primitive data access procedures that are supported by the
+underlying storage engine. There are a variety of techniques for
+this, including mapping directly to a tree of relational algebra
+operators and interpreting that tree, query compilation, etc. But,
+ultimately, the expressiveness of this internal query representation
+is limited by the routines supported by the storage engine.
+
+As an example, consider the following SQL query (representing a
+2-dimensional k-nearest neighbor query)\footnote{There are more efficient
+ways of answering this query, but I'm aiming for simplicity here
+to demonstrate my point},
+
+\begin{verbatim}
+SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A
+ WHERE A.property = filtering_criterion
+ ORDER BY d
+ LIMIT 5;
+\end{verbatim}
+
+This query will be translated into a logical query plan (a sequence
+of relational algebra operators) by the query planner, which could
+result in a plan like this,
+
+\begin{verbatim}
+LIMIT(5)
+  SORT(d ASC)
+    PROJECT(d = dist(A.x, A.y, Qx, Qy), A.key)
+      SELECT(A.property = filtering_criterion)
+        A
+\end{verbatim}
+
+With this logical query plan, the DBMS will next need to determine
+which supported operations it can use to most efficiently answer
+this query. For example, the selection operation (A) could be
+physically manifested as a table scan, or could be answered using
+an index scan if there is an ordered index over \texttt{A.property}.
+The query optimizer will make this decision based on its estimate
+of the selectivity of the predicate. This may result in one of the
+following physical query plans,
+
+\begin{verbatim}
+Plan 1 (table scan):
+  LIMIT(5)
+    SORT(d ASC)
+      PROJECT(d = dist(A.x, A.y, Qx, Qy), A.key)
+        FILTER(A.property = filtering_criterion)
+          TABLE SCAN(A)
+
+Plan 2 (index scan on A.property):
+  LIMIT(5)
+    SORT(d ASC)
+      PROJECT(d = dist(A.x, A.y, Qx, Qy), A.key)
+        INDEX SCAN(A.property = filtering_criterion)
+\end{verbatim}
+
+In either case, however, the space of possible physical plans is
+limited by the available access methods: either a sorted scan on
+an attribute (index) or an unsorted scan (table scan). The database
+must filter for all elements matching the filtering criterion,
+calculate the distances between all of these points and the query,
+and then sort the results to get the final answer. Additionally,
+note that the sort operation in the plan is a pipeline-breaker. If
+this plan were to appear as a subtree in a larger query plan, the
+overall plan would need to wait for the full evaluation of this
+sub-query before it could proceed, as sorting requires the full
+result set.
+
+Imagine a world where a new index was available to our DBMS: a
+nearest neighbor index. This index would allow the iteration over
+records in sorted order, relative to some predefined metric and a
+query point. If such an index existed over \texttt{(A.x, A.y)} using
+\texttt{dist}, then a third physical plan would be available to the DBMS,
+
+\begin{verbatim}
+Plan 3 (k-NN index scan):
+  LIMIT(5)
+    PROJECT(d = dist(A.x, A.y, Qx, Qy), A.key)
+      FILTER(A.property = filtering_criterion)
+        KNN INDEX SCAN(A, center = (Qx, Qy), ordered by dist)
+\end{verbatim}
+
+This plan pulls records in order of their distance to \texttt{Q}
+directly, using an index, and then filters them, avoiding the
+pipeline breaking sort operation. While it's not obvious in this
+case that this new plan is superior (this would depend a lot on the
+selectivity of the predicate), it is a third option. It becomes
+increasingly superior as the selectivity of the predicate grows,
+and is clearly superior in the case where the predicate has unit
+selectivity (requiring only the consideration of $5$ records total).
+The construction of this special index will be considered in
+Section~\ref{ssec:knn}.
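+
+For intuition, the index-backed plan corresponds to the following
+sketch (Python pseudocode of our own; \texttt{iter\_by\_distance} is a
+hypothetical access method assumed to yield records nearest-first):
+the plan streams records in distance order, filters them, and stops
+after five matches, with no pipeline-breaking sort.
+
+\begin{verbatim}
+def knn_plan(knn_index, qx, qy, predicate, k=5):
+    results = []
+    for rec in knn_index.iter_by_distance(qx, qy):  # nearest-first stream
+        if predicate(rec):                          # A.property filter
+            results.append(rec)
+            if len(results) == k:                   # LIMIT 5: stop early
+                break
+    return results
+\end{verbatim}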
+
+This use of query-specific indexing schemes also presents a query
+planning challenge: how does the database know when a particular
+specialized index can be used for a given query, and how can
+specialized indexes broadcast their capabilities to the query planner
+in a general fashion? This work is focused on the problem of enabling
+the existence of such indexes, rather than facilitating their use;
+however, these are important questions that must be addressed in
+future work for this solution to be viable. There has been past work
+on the use of arbitrary indexes in queries,
+such as~\cite{byods-datalog}, but this problem is considered
+out-of-scope for the proposed work.
+
+\section{Queries and Search Problems}
+
+In our discussion of generalized indexes, we encountered \emph{search
+problems}. This term is used within the data structures literature
+in a manner similar to how the database community
+sometimes uses the term query\footnote{
+As with the term index, the term query is often abused and used to
+refer to several related, but slightly different, things. In the vernacular,
+a query can refer to either a) a general type of search problem (as in ``range query''),
+b) a specific instance of a search problem, or c) a program written in a query language.
+}, to refer to a general
+class of questions asked of data. Examples include range queries,
+point-lookups, nearest neighbor queries, predicate filtering, random
+sampling, etc. Formally, for the purposes of this work, we will define
+a search problem as follows,
+\begin{definition}[Search Problem]
+Given three multisets, $D$, $R$, and $Q$, a search problem is a function
+$F: (D, Q) \to R$, where $D$ represents the domain of data to be searched,
+$Q$ represents the domain of query parameters, and $R$ represents the
+answer domain.
+\footnote{
+It is important to note that it is not required for $R \subseteq D$. As an
+example, a \texttt{COUNT} aggregation might map a set of strings onto
+an integer. Most common queries do satisfy $R \subseteq D$, but this need
+not be a universal constraint.
+}
+\end{definition}
+
+And we will use the word \emph{query} to refer to a specific instance
+of a search problem, except when used as part of the generally
+accepted name of a search problem (i.e., range query).
+
+\begin{definition}[Query]
+Given three multisets, $D$, $R$, and $Q$, a search problem $F$, and
+a specific set of query parameters $q \in Q$, a query is a specific
+instance of the search problem, $F(D, q)$.
+\end{definition}
+
+As an example of using these definitions, a \emph{membership test}
+or \emph{range query} would be considered search problems, and a
+range query over the interval $[10, 99]$ would be a query.
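+
+The distinction can also be phrased in code. In the following sketch
+(illustrative Python, not part of any formal definition),
+\texttt{range\_query} is a search problem, and binding it to the
+parameters $[10, 99]$ yields a single query that can then be applied
+to any data set.
+
+\begin{verbatim}
+from functools import partial
+
+def range_query(data, interval):      # the search problem F(D, q)
+    lo, hi = interval
+    return [x for x in data if lo <= x <= hi]
+
+# a query: the search problem with q = [10, 99] fixed
+query_10_99 = partial(range_query, interval=(10, 99))
+print(query_10_99([5, 17, 42, 120]))  # -> [17, 42]
+\end{verbatim}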
+
+\subsection{Decomposable Search Problems}
+
+An important subset of search problems is that of decomposable
+search problems (DSPs). This class was first defined by Saxe and
+Bentley as follows,
+
+\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
+ \label{def:dsp}
+ Given a search problem $F: (D, Q) \to R$, $F$ is decomposable if and
+  only if there exists a constant-time computable, associative, and
+ commutative binary operator $\square$ such that,
+ \begin{equation*}
+ F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+ \end{equation*}
+\end{definition}
+
+The constant-time requirement was used to prove bounds on the costs of
+evaluating DSPs over data broken across multiple partitions. Further work
+by Overmars lifted this constraint and considered a more general class
+of DSP,
+\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}]
+ Given a search problem $F: (D, Q) \to R$, $F$ is $C(n)$-decomposable
+ if and only if there exists an $O(C(n))$-time computable, associative,
+ and commutative binary operator $\square$ such that,
+ \begin{equation*}
+ F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+ \end{equation*}
+\end{definition}
+
+Decomposability is an important property because it allows for
+search problems to be answered over partitioned datasets. The details
+of this will be discussed in Section~\ref{ssec:bentley-saxe} in the
+context of creating dynamic data structures. Many common types of
+search problems appearing in databases are decomposable, such as
+range queries or predicate filtering.
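+
+The practical payoff is that a DSP can be evaluated over a partitioned
+data set by querying each partition independently and folding the
+partial results together with $\square$. The following sketch (our own
+Python illustration) shows this pattern, using a range query with set
+union as the merge operator.
+
+\begin{verbatim}
+from functools import reduce
+
+def answer_dsp(partitions, search_problem, merge, q):
+    partials = [search_problem(part, q) for part in partitions]
+    return reduce(merge, partials)
+
+parts = [[1, 8, 30], [12, 55], [70, 99, 4]]
+in_range = lambda d, q: {x for x in d if q[0] <= x <= q[1]}
+union = lambda a, b: a | b
+print(answer_dsp(parts, in_range, union, (10, 99)))
+# -> {12, 30, 55, 70, 99} (set order may vary)
+\end{verbatim}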
+
+To demonstrate that a search problem is decomposable, it is necessary
+to show the existence of the merge operator, $\square$, and to show
+that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two
+results, simple induction demonstrates that the problem is decomposable
+even in cases with more than two partial results.
+
+As an example, consider range queries,
+\begin{definition}[Range Query]
+Let $D$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
+$q = [x, y],\quad x,y \in \mathbb{R}$, a range query returns all points in
+$D \cap q$.
+\end{definition}
+
+\begin{theorem}
+Range Queries are a DSP.
+\end{theorem}
+
+\begin{proof}
+Let $\square$ be the set union operator ($\cup$). Applying this to
+Definition~\ref{def:dsp}, we have
+\begin{align*}
+ (A \cup B) \cap q = (A \cap q) \cup (B \cap q)
+\end{align*}
+which is true by the distributivity of set intersection over union.
+Assuming an implementation allowing for an $O(1)$
+set union operation, range queries are DSPs.
+\end{proof}
+
+Because the codomain of a DSP is not restricted, more complex output
+structures can be used to allow for problems that are not directly
+decomposable to be converted to DSPs, possibly with some minor
+post-processing. For example, the calculation of the mean of a set
+of numbers can be constructed as a DSP using the following technique,
+\begin{theorem}
+The calculation of the average of a set of numbers is a DSP.
+\end{theorem}
+\begin{proof}
+Define the search problem as $A:D \to \mathbb{R} \times \mathbb{Z}$,
+where $D\subset\mathbb{R}$ is a multiset. The output tuple
+contains the sum of the values within the input set and the
+cardinality of the input set. Let $A(D_1) = (s_1, c_1)$ and
+$A(D_2) = (s_2, c_2)$. Then, define $A(D_1)\square A(D_2) = (s_1 +
+s_2, c_1 + c_2)$.
+
+Applying Definition~\ref{def:dsp}, we have
+\begin{align*}
+    A(D_1 \cup D_2) = (s_1 + s_2, c_1 + c_2) = A(D_1)~\square~A(D_2) = (s, c)
+\end{align*}
+From this result, the average can be determined in constant time by
+taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
+of numbers is a DSP.
+\end{proof}
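+
+The construction above translates directly into code. In this small
+sketch (ours, for illustration), the search problem returns a
+(sum, count) pair, the merge operator adds the pairs component-wise,
+and the average is recovered with constant-time post-processing.
+
+\begin{verbatim}
+def avg_dsp(data, _q=None):        # query parameters are unused here
+    return (sum(data), len(data))
+
+def merge(a, b):                   # the merge operator
+    return (a[0] + b[0], a[1] + b[1])
+
+s, c = merge(avg_dsp([2.0, 4.0]), avg_dsp([6.0]))
+print(s / c)                       # -> 4.0
+\end{verbatim}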
+
+\section{Dynamic Extension Techniques}
+
+Because data in a database is regularly updated, data structures
+intended to be used as an index must support updates (inserts,
+in-place modification, and deletes) to their data. In principle,
+any data structure can support updates to its underlying data through
+global reconstruction: adjusting the record set and then rebuilding
+the entire structure. Ignoring this trivial (and highly inefficient)
+approach, a data structure with support for updates is called
+\emph{dynamic}, and one without support for updates is called
+\emph{static}. In this section, we discuss approaches for modifying
+a static data structure to grant it support for updates, a process
+called \emph{dynamic extension} or \emph{dynamization}. A theoretical
+survey of this topic can be found in~\cite{overmars83}, but this
+work doesn't cover several techniques that are used in practice.
+As such, much of this section constitutes our own analysis, tying
+together threads from a variety of sources.
+
+\subsection{Local Reconstruction}
+
+One way of viewing updates to a data structure is as reconstructing
+all or part of the structure. To minimize the cost of the update,
+it is ideal to minimize the size of the reconstruction that accompanies
+an update, either by careful structuring of the data to ensure
+minimal disruption to surrounding records by an update, or by
+deferring the reconstructions and amortizing their costs over as
+many updates as possible.
+
+While minimizing the size of a reconstruction seems the most obvious,
+and best, approach, it is limited in its applicability. The more
+related ``nearby'' records in the structure are, the more records
+will be affected by a change. Records can be related in terms of
+some ordering of their values, which we'll term a \emph{spatial
+ordering}, or in terms of their order of insertion to the structure,
+which we'll term a \emph{temporal ordering}. Note that these terms
+don't imply anything about the nature of the data, and instead
+relate to the principles used by the data structure to arrange them.
+
+Arrays provide the extreme version of both of these ordering
+principles. In an unsorted array, in which records are appended to
+the end of the array, there is no spatial ordering dependence between
+records. This means that any insert or update will require no local
+reconstruction, aside from the record being directly affected.\footnote{
+A delete can also be performed without any structural adjustments
+in a variety of ways. Reorganization of the array as a result of
+deleted records serves an efficiency purpose, but isn't required
+for the correctness of the structure. } However, the order of
+records in the array \emph{does} express a strong temporal dependency:
+the index of a record in the array provides the exact insertion
+order.
+
+A sorted array provides exactly the opposite situation. The order
+of a record in the array reflects an exact spatial ordering of
+records with respect to their sorting function. This means that an
+update or insert will require reordering a large number of records
+(potentially all of them, in the worst case). Because of the stronger
+spatial dependence of records in the structure, an update will
+require a larger-scale reconstruction. Additionally, there is no
+temporal component to the ordering of the records: inserting a set
+of records into a sorted array will produce the same final structure
+irrespective of insertion order.
+
+It's worth noting that the spatial dependency discussed here, as
+it relates to reconstruction costs, is based on the physical layout
+of the records and not their logical ordering. To exemplify
+this, a sorted singly-linked list can maintain the same logical
+order of records as a sorted array, but limits the spatial dependence
+between records to each record's preceding node. This means that an
+insert into this structure will require only a single node update,
+regardless of where in the structure the insert occurs.
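+
+The following toy comparison (illustrative Python) makes the physical
+distinction explicit: inserting into a sorted array shifts every later
+record, while inserting into a sorted singly-linked list rewrites only
+a single predecessor pointer, despite both maintaining the same
+logical order.
+
+\begin{verbatim}
+import bisect
+
+def sorted_array_insert(arr, value):
+    bisect.insort(arr, value)       # shifts all later records: O(n) moves
+
+class Node:
+    def __init__(self, value, nxt=None):
+        self.value, self.nxt = value, nxt
+
+def sorted_list_insert(head, value):
+    # O(n) search, but only one pointer update as "reconstruction"
+    if head is None or value < head.value:
+        return Node(value, head)
+    cur = head
+    while cur.nxt is not None and cur.nxt.value < value:
+        cur = cur.nxt
+    cur.nxt = Node(value, cur.nxt)
+    return head
+\end{verbatim}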
+
+The amount of spatial dependence in a structure directly reflects
+a trade-off between read and write performance. In the above example,
+performing a lookup for a given record in a sorted array requires
+asymptotically fewer comparisons in the worst case than in an unsorted
+array, because the spatial dependencies can be exploited for an
+accelerated search (binary vs. linear search). Interestingly, this
+remains the case for lookups against a sorted array vs. a sorted
+linked list. Even though both structures have the same logical order
+of records, the limited spatial dependencies between nodes in a linked
+list force the lookup to perform a scan anyway.
+
+A balanced binary tree sits between these two extremes. Like a
+linked list, individual nodes have very few connections. However,
+the nodes are arranged in such a way that a connection existing
+between two nodes implies further information about the ordering
+of the children of those nodes. In this light, rebalancing of the tree
+can be seen as maintaining a certain degree of spatial dependence
+between the nodes in the tree, ensuring that it is balanced between
+the two children of each node. A very general summary of tree
+rebalancing techniques can be found in~\cite{overmars83}. Using an
+AVL tree~\cite{avl} as a specific example, each insert in the tree
+involves adding the new node and updating its parent (like you'd
+see in a simple linked list), followed by some larger scale local
+reconstruction in the form of tree rotations, to maintain the balance
+factor invariant. This means that insertion requires more reconstruction
+effort than the single pointer update in the linked list case, but
+results in much more efficient searches (which, as it turns out,
+makes insertion more efficient in general too, even with the overhead,
+because finding the insertion point is much faster).
+
+\subsection{Amortized Local Reconstruction}
+
+In addition to controlling update cost by arranging the structure so
+as to reduce the amount of reconstruction necessary to maintain the
+desired level of spatial dependence, update costs can also be reduced
+by amortizing the local reconstruction cost over multiple updates.
+This is often done in one of two ways: leaving gaps or adding
+overflow buckets. These gaps and buckets allow the data structure
+to absorb some number of insertions before
+a reconstruction is triggered.
+
+A classic example of the gap approach is found in the
+B$^+$-tree~\cite{b+tree} commonly used in RDBMS indexes, as well
+as open addressing for hash tables. In a B$^+$-tree, each node has
+a fixed size, which must be at least half-utilized (aside from the
+root node). The empty spaces within these nodes are gaps, which can
+be cheaply filled with new records on insert. Only when a node has
+been filled must a local reconstruction (called a structural
+modification operation for B-trees) occur to redistribute the data
+into multiple nodes and replenish the supply of gaps. This approach
+is particularly well suited to data structures in contexts where
+the natural unit of storage is larger than a record, as in disk-based
+(with 4KiB pages) or cache-optimized (with 64B cachelines) structures.
+This gap-based approach was also used to create ALEX, an updatable
+learned index~\cite{ALEX}.
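+
+A minimal sketch of the gap technique (ours, with an arbitrary node
+capacity) is shown below: inserts that find a free slot are cheap,
+and only a full node triggers a local reconstruction in the form of a
+split that redistributes its records and replenishes the gaps.
+
+\begin{verbatim}
+import bisect
+
+NODE_CAPACITY = 4
+
+def node_insert(node, key):
+    if len(node) < NODE_CAPACITY:   # a gap is available: cheap insert
+        bisect.insort(node, key)
+        return node, None
+    bisect.insort(node, key)        # node overflows: split (an SMO)
+    mid = len(node) // 2
+    return node[:mid], node[mid:]   # two half-full nodes, gaps restored
+\end{verbatim}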
+
+The gap approach has a number of disadvantages. It results in a
+somewhat sparse structure, thereby wasting storage. For example, a
+B$^+$-tree requires all nodes other than the root to be at least
+half full, meaning that in the worst case up to half of the space required
+by the structure could be taken up by gaps. Additionally, this
+scheme results in some inserts being more expensive than others:
+most new records will occupy an available gap, but some will trigger
+more expensive SMOs. In particular, it has been observed with
+B$^+$-trees that this can lead to ``waves of misery''~\cite{wavesofmisery}:
+the gaps in many nodes fill at about the same time, leading to
+periodic clusters of high-cost structural modification operations.
+
+Overflow buckets are seen in ISAM-tree based indexes~\cite{myisam},
+as well as hash tables with closed addressing. In this approach,
+parts of the structure into which records would be inserted (leaf
+nodes of ISAM, directory entries in CA hashing) have a pointer to
+an overflow location, where newly inserted records can be placed.
+This allows the structure, theoretically, to sustain an unlimited
+number of insertions. However, read performance degrades: the
+more overflow capacity is utilized, the less the records in the
+structure conform to the data structure's intended ordering.
+Thus, periodically a reconstruction is necessary to distribute the
+overflow records into the structure itself.
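+
+The sketch below (our own illustration) captures the overflow-bucket
+pattern: inserts land in an unordered overflow list, reads must
+additionally scan that list, and once the overflow grows past a
+threshold a reconstruction folds it back into the sorted portion.
+
+\begin{verbatim}
+import bisect
+
+class OverflowIndex:
+    def __init__(self, keys, reorg_threshold=16):
+        self.sorted_keys = sorted(keys)
+        self.overflow = []                 # unsorted overflow bucket
+        self.reorg_threshold = reorg_threshold
+
+    def insert(self, key):
+        self.overflow.append(key)          # O(1), but degrades reads
+        if len(self.overflow) >= self.reorg_threshold:
+            self.sorted_keys = sorted(self.sorted_keys + self.overflow)
+            self.overflow = []             # reconstruction restores order
+
+    def contains(self, key):
+        i = bisect.bisect_left(self.sorted_keys, key)
+        in_sorted = i < len(self.sorted_keys) and self.sorted_keys[i] == key
+        return in_sorted or key in self.overflow   # overflow scanned too
+\end{verbatim}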
+
+\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method}
+
+Another approach to support updates is to amortize the cost of
+global reconstruction over multiple updates. This approach can
+take three forms,
+\begin{enumerate}
+
+    \item Pairing a dynamic data structure (called a buffer or
+    memtable) with an instance of the structure being extended.
+    Updates are written to the buffer, and when the buffer is
+    full its records are merged with those in the static
+    structure, and the structure is rebuilt. This approach is
+    used by one version of the originally proposed
+    LSM-tree~\cite{oneil93}. Technically, this technique was proposed
+    in that work for the purpose of converting random writes
+    into sequential ones (all structures involved are dynamic),
+    but it can be used for dynamization as well. A sketch of this
+    buffering scheme is given after this list.
+
+    \item Creating multiple, smaller data structures, each
+    containing a partition of the records from the dataset, and
+    reconstructing individual structures to accommodate new
+    inserts in a systematic manner. This technique is the basis
+    of the Bentley-Saxe method~\cite{saxe79}.
+
+ \item Using both of the above techniques at once. This is
+ the approach used by modern incarnations of the
+ LSM~tree~\cite{rocksdb}.
+
+\end{enumerate}
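+
+The buffering scheme from the first item can be sketched as follows
+(our own illustration; it assumes the static structure exposes a
+\texttt{build} constructor, a \texttt{query} method for its search
+problem, and an \texttt{all\_records} accessor used for rebuilds).
+
+\begin{verbatim}
+class BufferedDynamization:
+    def __init__(self, build, buffer_capacity=1024):
+        self._build = build                # records -> static structure
+        self._capacity = buffer_capacity
+        self._buffer = []                  # dynamic buffer ("memtable")
+        self._static = build([])
+
+    def insert(self, record):
+        self._buffer.append(record)
+        if len(self._buffer) >= self._capacity:
+            merged = self._static.all_records() + self._buffer
+            self._static = self._build(merged)   # global reconstruction
+            self._buffer = []
+
+    def query(self, q, brute_force, merge):
+        # requires decomposability: merge the buffer-side result
+        # (computed directly) with the static structure's result
+        return merge(brute_force(self._buffer, q), self._static.query(q))
+\end{verbatim}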
+
+In all three cases, it is necessary for the search problem associated
+with the index to be a DSP, as answering it will require querying
+multiple structures (the buffer and/or one or more instances of the
+data structure) and merging the results together to get a final
+result. This section will focus exclusively on the Bentley-Saxe
+method, as it is the basis for our proposed methodology.
+
+When dividing records across multiple structures, there is a clear
+trade-off between read performance and write performance. Keeping
+the individual structures small reduces the cost of reconstructing,
+and thereby increases update performance. However, this also means
+that more structures will be required to accommodate the same number
+of records, when compared to a scheme that allows the structures
+to be larger. As each structure must be queried independently, this
+will lead to worse query performance. The reverse is also true:
+fewer, larger structures will have better query performance and
+worse update performance, with the extreme limit of this being a
+single structure that is fully rebuilt on each insert.
+
+\begin{figure}
+ \caption{Inserting a new record using the Bentley-Saxe method.}
+ \label{fig:bsm-example}
+\end{figure}
+
+The key insight of the Bentley-Saxe method~\cite{saxe79} is that a
+good balance can be struck by using a geometrically increasing
+structure size. In Bentley-Saxe, the sub-structures are ``stacked'',
+with the bottom level having a capacity of a single record, and
+each subsequent level doubling in capacity. When an update is
+performed, the first empty level is located and a reconstruction
+is triggered, merging the structures of all levels below this empty
+one, along with the new record. An example of this process is shown
+in Figure~\ref{fig:bsm-example}. The merit of this approach is
+that it ensures that ``most'' reconstructions involve the smaller
+data structures towards the bottom of the sequence, while most of
+the records reside in large, infrequently updated, structures towards
+the top. This balances between the read and write implications of
+structure size, while also allowing the number of structures required
+to represent $n$ records to be worst-case bounded by $O(\log n)$.
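+
+The insertion procedure just described can be summarized by the
+following sketch (our own pseudocode-style Python; \texttt{build}
+constructs the static structure from a list of records and
+\texttt{records} extracts them again, both assumed to be provided by
+the structure being dynamized). Level $i$ holds either an empty slot
+or a structure containing $2^i$ records.
+
+\begin{verbatim}
+def bsm_insert(levels, new_record, build, records):
+    pending = [new_record]
+    for i, level in enumerate(levels):
+        if level is None:
+            levels[i] = build(pending)   # first empty level: rebuild here
+            return levels
+        pending.extend(records(level))   # absorb full level into rebuild
+        levels[i] = None
+    levels.append(build(pending))        # all levels full: structure grows
+    return levels
+\end{verbatim}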
+
+Given a structure with an $O(P(n))$ construction cost and $O(Q_S(n))$
+query cost, the Bentley-Saxe Method will produce a dynamic data
+structure with,
+
+\begin{align}
+    \text{Query Cost} \qquad & O\left(Q_S(n) \cdot \log n\right) \\
+ \text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right)
+\end{align}
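+
+For intuition, these bounds follow from the standard amortization
+argument (sketched here in our own words, assuming $\nicefrac{P(n)}{n}$
+is non-decreasing): rebuilding a level of $m$ records costs $O(P(m))$,
+which charges $O(\nicefrac{P(m)}{m}) \leq O(\nicefrac{P(n)}{n})$ to each
+record involved, and a record can participate in at most $O(\log n)$
+reconstructions before reaching the topmost level. Likewise, a query
+must be evaluated against each of the $O(\log n)$ levels at a cost of
+at most $O(Q_S(n))$ each, plus the cost of combining the partial
+results with $\square$.
+\begin{align*}
+    \text{Amortized Insert Cost} =
+        \underbrace{O\left(\frac{P(n)}{n}\right)}_{\text{per reconstruction}}
+        \cdot \underbrace{O\left(\log n\right)}_{\text{reconstructions per record}}
+        = O\left(\frac{P(n)}{n} \log n\right)
+\end{align*}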
+
+However, the method has poor worst-case insertion cost: if the
+entire structure is full, it must grow by another level, requiring
+a full reconstruction involving every record within the structure.
+A slight adjustment to the technique, due to Overmars and van Leeuwen
+\cite{}, allows for the worst-case insertion cost to be bounded by
+$O\left(\frac{P(n)}{n} \log n\right)$, however it does so by dividing
+each reconstruction into small pieces, one of which is executed
+each time a new update occurs. This has the effect of bounding the
+worst-case performance, but does so by sacrificing the expected
+case performance, and adds a lot of complexity to the method. This
+technique is not used much in practice.\footnote{
+ I've yet to find any example of it used in a journal article
+ or conference paper.
+}
+
+
+
+
+
+\subsection{Limitations of the Bentley-Saxe Method}
+
+
+
+
+