Diffstat (limited to 'chapters')
-rw-r--r--  chapters/beyond-dsp.tex    |   2
-rw-r--r--  chapters/dynamization.tex  | 601
-rw-r--r--  chapters/introduction.tex  |  63
3 files changed, 401 insertions, 265 deletions
diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex
index af84b43..74afdd2 100644
--- a/chapters/beyond-dsp.tex
+++ b/chapters/beyond-dsp.tex
@@ -365,7 +365,7 @@ interface, with the same performance as their specialized implementations.
\begin{algorithm}
\caption{Answering an Iterative Deletion Decomposable Search Problem}
- \label{alg:dyn-query}
+ \label{alg:dyn-idsp-query}
\KwIn{$q$: query parameters, $\mathscr{I}_1 \ldots \mathscr{I}_m$: blocks}
\KwOut{$R$: query results}
diff --git a/chapters/dynamization.tex b/chapters/dynamization.tex
index bbd5534..c48a781 100644
--- a/chapters/dynamization.tex
+++ b/chapters/dynamization.tex
@@ -1,21 +1,28 @@
\chapter{Classical Dynamization Techniques}
\label{chap:background}
-This chapter will introduce important background information and
-existing work in the area of data structure dynamization. We will
-first discuss the concept of a search problem, which is central to
-dynamization techniques. While one might imagine that restrictions on
-dynamization would be functions of the data structure to be dynamized,
+This chapter will introduce important background information and existing
+work in the area of data structure dynamization. We will first discuss the
+core concepts of search problems and data structures, which are central
+to dynamization techniques. While one might imagine that restrictions
+on dynamization would be functions of the data structure to be dynamized,
in practice the requirements placed on the data structure are quite mild.
Instead, the central difficulties to applying dynamization lie in the
necessary properties of the search problem that the data structure is
-intended to solve. Following this, existing theoretical results in the
+intended to solve. Following this, existing theoretical results in the
area of data structure dynamization will be discussed, which will serve
as the building blocks for our techniques in subsequent chapters. The
chapter will conclude with a discussion of some of the limitations of
these existing techniques.
-\section{Queries and Search Problems}
+\section{Background}
+
+Before discussing dynamization itself, we must first establish a few
+important definitions. In this section, we'll discuss some relevant
+background information on search problems and data structures, which will
+form the foundation of our discussion of dynamization.
+
+\subsection{Queries and Search Problems}
\label{sec:dsp}
Data access lies at the core of most database systems. We want to ask
@@ -38,8 +45,8 @@ purposes of this work, a search problem is defined as follows,
\begin{definition}[Search Problem]
Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem is a function
$F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where $\mathcal{D}$ represents the domain of data to be searched,
- $\mathcal{Q}$ represents the domain of query parameters, and $\mathcal{R}$ represents the
-answer domain.\footnote{
+ $\mathcal{Q}$ represents the domain of search parameters, and $\mathcal{R}$ represents the
+domain of possible answers.\footnote{
It is important to note that it is not required for $\mathcal{R} \subseteq \mathcal{D}$. As an
example, a \texttt{COUNT} aggregation might map a set of strings onto
an integer. Most common queries do satisfy $\mathcal{R} \subseteq \mathcal{D}$, but this need
@@ -48,7 +55,7 @@ not be a universal constraint.
\end{definition}
We will use the term \emph{query} to mean a specific instance of a search
-problem,
+problem, with a fixed set of search parameters,
\begin{definition}[Query]
Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and $\mathcal{Q}$, a search problem $F$ and
@@ -56,24 +63,241 @@ problem,
instance of the search problem, $F(\mathcal{D}, q)$.
\end{definition}
-As an example of using these definitions, a \emph{membership test}
-or \emph{range scan} would be considered search problems, and a range
+As an example of these definitions, a \emph{membership test} or a
+\emph{range scan} would be considered a search problem, and a range
scan over the interval $[10, 99]$ would be a query. We've drawn this
distinction because, as we'll see as we enter into the discussion of
our work in later chapters, it is useful to have separate, unambiguous
terms for these two concepts.
+\subsection{Data Structures}
+
+Answering a search problem over an unordered set of data is not terribly
+efficient, and so usually the data is organized into a particular
+layout to facilitate more efficient answers. Such layouts are called
+\emph{data structures}. Examples of data structures include B-trees,
+hash tables, sorted arrays, etc. A data structure can be thought of as
+a \emph{solution} to a search problem~\cite{saxe79}.
+
+The symbol $\mathcal{I}$ indicates a type of data structure, and the symbol
+$\mathscr{I} \in \mathcal{I}$ represents an instance of that data structure
+built over a particular set of data, $d \subseteq \mathcal{D}$. We will use
+two abuses of notation pertaining to data structures throughout this
+work. First, $F(\mathscr{I}, q)$ will be used to indicate a query with
+search parameters $q$ over the data set $d$, answered using the data
+structure $\mathscr{I}$. Second, we will use $|\mathscr{I}|$ to indicate
+the number of records within $\mathscr{I}$ (which is equivalent to $|d|$,
+where $d$ is the set of records that $\mathscr{I}$ has been built over).
+
+We broadly classify data structures into three types, based upon
+the operations supported by the structure: static, half-dynamic, and
+full-dynamic. Static data structures do not support updates, half-dynamic
+structures support inserts, and full-dynamic structures support both
+inserts and deletes.
+Note that we will use the unqualified term \emph{dynamic} to refer to
+both half-dynamic and full-dynamic structures when the distinction isn't
+relevant. Additionally, the term \emph{native dynamic} will be used to
+indicate a data structure that has been custom-built with support for
+inserts and/or deletes without the need for dynamization. These categories
+are not all-inclusive, as there are a number of data structures which
+do not fit the classification, but such structures are outside of the
+scope of this work.
+
+\subsubsection{Static Data Structures}
+
+A static data structure does not support updates of any kind, but can
+be constructed from a data set and answer queries. Additionally, we
+require that the static data structure provide the ability to reproduce
+the set of records that was used to construct it. Specifically, static
+data structures must support the following three operations,
+
+\begin{itemize}
+\item $\mathbftt{query}: \left(\mathcal{I}, \mathcal{Q}\right) \to \mathcal{R}$ \\
+ $\mathbftt{query}(\mathscr{I}, q)$ answers the query
+ $F(\mathscr{I}, q)$ and returns the result. This operation runs
+    in $\mathscr{Q}_S(n)$ time in the worst case and \emph{cannot alter
+ the state of $\mathscr{I}$}.
+
+ In principle, a single data structure may be a solution to multiple
+ search problems. For example, B+trees can efficiently answer
+    point lookups, set membership tests, and range scans. For our purposes,
+ however, we will be considering solutions to individual search
+ problems.
+
+\item $\mathbftt{build}:\left(\mathcal{PS}(\mathcal{D})\right) \to \mathcal{I}$ \\
+ $\mathbftt{build}(d)$ constructs a new instance of $\mathcal{I}$
+ using the records in set $d$. This operation runs in $B(n)$ time in
+ the worst case.
+
+\item $\mathbftt{unbuild}: \left(\mathcal{I}\right) \to \mathcal{PS}(\mathcal{D})$ \\
+    $\mathbftt{unbuild}(\mathscr{I})$ recovers the set of records, $d$,
+    used to construct $\mathscr{I}$. The literature on dynamization
+ generally assumes that this operation runs in $\Theta(1)$
+ time~\cite{saxe79}, and we will adopt the same assumption in our
+ analysis.
+\end{itemize}
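+
+As a concrete illustration of this interface (a sketch only; any structure
+satisfying these operations will do), consider a sorted array of records:
+$\mathbftt{build}$ sorts its input, giving $B(n) \in \Theta(n \log n)$;
+$\mathbftt{query}$ for a point lookup is a binary search, giving
+$\mathscr{Q}_S(n) \in \Theta(\log n)$; and $\mathbftt{unbuild}$ simply
+exposes the underlying array of records in $\Theta(1)$ time.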
+
+Note that the term static is distinct from immutable. Static refers
+to the layout of records within the data structure, whereas immutable
+refers to the data stored within those records. This distinction will
+become relevant when we discuss different techniques for adding delete
+support to data structures. The data structures used are always static,
+but not necessarily immutable, because the records may contain header
+information (like visibility) that is updated in place.
+
+\subsubsection{Half-dynamic Data Structures}
+
+A half-dynamic data structure supports the three operations of a static
+data structure, as well as the ability to efficiently insert new data into
+a structure built over an existing data set, $d$.
+
+\begin{itemize}
+\item $\mathbftt{insert}: \left(\mathcal{I}, \mathcal{D}\right) \to \mathcal{I}$ \\
+ $\mathbftt{insert}(\mathscr{I}, r)$ returns a data structure,
+ $\mathscr{I}^\prime$, such that $\mathbftt{query}(\mathscr{I}^\prime,
+    q) = F(d \cup \{r\}, q)$, where $r \in \mathcal{D}$. This operation
+    runs in $I(n)$ time in the worst case.
+\end{itemize}
+
+Note that the important aspect of insertion in this model is that the
+effect of the new record on the query result is observed, not necessarily
+that the result is a structure exactly identical to the one that would
+be obtained by building a new structure over $d \cup \{r\}$. Also, though
+the formalism used implies a functional operation where the original data
+structure is unmodified, this is not actually a requirement. $\mathscr{I}$
+could be slightly modified in place, and returned as $\mathscr{I}^\prime$,
+as is conventionally done with native dynamic data structures.
+
+\subsubsection{Full-dynamic Data Structures}
+A full-dynamic data structure is a half-dynamic structure that also
+has support for deleting records from the dataset.
+
+\begin{itemize}
+\item $\mathbftt{delete}: \left(\mathcal{I}, \mathcal{D}\right) \to \mathcal{I}$ \\
+ $\mathbftt{delete}(\mathscr{I}, r)$ returns a data structure, $\mathscr{I}^\prime$,
+ such that $\mathbftt{query}(\mathscr{I}^\prime,
+    q) = F(d - \{r\}, q)$, where $r \in \mathcal{D}$. This operation
+    runs in $D(n)$ time in the worst case.
+\end{itemize}
+
+As with insertion, the important aspect of deletion is that the effect
+of $r$ on the results of queries answered using $\mathscr{I}^\prime$
+has been removed, not necessarily that the record is physically
+removed from the structure. A full-dynamic data structure also
+supports in-place modification of an existing record. In order to
+update a record $r$ to $r^\prime$, it is sufficient to perform
+$\mathbftt{insert}\left(\mathbftt{delete}\left(\mathscr{I}, r\right),
+r^\prime\right)$.
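+
+One consequence of this definition worth making explicit (the symbol
+$U(n)$ is introduced here only for illustration and is not used elsewhere)
+is that the cost of such an update is bounded by the costs of its two
+component operations,
+\begin{equation*}
+U(n) \in O\left(D(n) + I(n)\right).
+\end{equation*}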
+
+\subsubsection{Other Data Structures}
+There are data structures that do not fit into this classification
+scheme. For example, approximate structures like Bloom
+filters~\cite{bloom70} or various sketches and summaries, do not retain
+full information about the records that have been used to construct them.
+Such structures cannot support \texttt{unbuild} as a result. Some other
+data structures cannot be statically queried; the act of querying them
+mutates their state. This is the case for structures like heaps, stacks,
+and queues, for example.
+
+\section{Decomposition-based Dynamization}
+
+\emph{Dynamization} is the process of transforming a static data structure
+into a dynamic one. When certain conditions are satisfied by the data
+structure and its associated search problem, this process can be done
+automatically, and with provable asymptotic bounds on amortized insertion
+performance, as well as worst-case query performance. This automatic
+approach is in contrast with the design of a native dynamic data
+structure, which involves altering the data structure itself to natively
+support updates. This process usually involves implementing techniques
+that partially rebuild small portions of the structure to accommodate new
+records, which is called \emph{local reconstruction}~\cite{overmars83}.
+This is a very manual intervention that requires significant effort on the
+part of the data structure designer, whereas conventional dynamization
+can be performed with little-to-no modification of the underlying data
+structure at all.
+
+It is worth noting that there are a variety of techniques
+discussed in the literature for dynamizing structures with specific
+properties, or under very specific sets of circumstances. Examples
+include frameworks for adding update support to succinct data
+structures~\cite{dynamize-succinct} or taking advantage of batching
+of insert and query operations~\cite{batched-decomposable}. This
+section discusses techniques that are more general, and don't require
+workload-specific assumptions. For more detail than is included in
+this section, Overmars wrote a book providing a comprehensive survey of
+techniques for creating dynamic data structures, including not only the
+dynamization techniques discussed here, but also local reconstruction
+based techniques and more~\cite{overmars83}.\footnote{
+ Sadly, this book isn't readily available in
+ digital format as of the time of writing.
+}
+
+
+\subsection{Global Reconstruction}
+
+The most fundamental dynamization technique is that of \emph{global
+reconstruction}. While not particularly useful on its own, global
+reconstruction serves as the basis for the techniques to follow, and so
+we will begin our discussion of dynamization with it.
+
+Consider some search problem, $F$, for which we have a static solution,
+$\mathcal{I}$. Given the operations supported by static structures, it
+is possible to insert a new record, $r \in \mathcal{D}$, into an instance
+$\mathscr{I} \in \mathcal{I}$ as follows,
+\begin{equation*}
+\mathbftt{insert}(\mathscr{I}, r) \triangleq \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}) \cup \{r\})
+\end{equation*}
+Likewise, a record can be deleted using,
+\begin{equation*}
+\mathbftt{delete}(\mathscr{I}, r) \triangleq \mathbftt{build}(\mathbftt{unbuild}(\mathscr{I}) - \{r\})
+\end{equation*}
+
+It goes without saying that these operations are sub-optimal, as the
+insertion and deletion costs are both $\Theta(B(n))$, and $B(n)
+\in \Omega(n)$ at best for most data structures. However, this global
+reconstruction strategy can be used as a primitive for more sophisticated
+techniques that can provide reasonable performance.
+
+\begin{figure}
+
+\caption{\textbf{Data Structure Decomposition.} }
+\label{fig:bg-decomp}
+\end{figure}
+
+The problem with global reconstruction is that each insert or delete
+must rebuild the entire data structure, involving all of its records. The
+key insight, first discussed by Bentley and Saxe~\cite{saxe79}, is that
+the cost associated with global reconstruction can be reduced by
+\emph{decomposing} the data structure into multiple,
+smaller structures, each built from a disjoint partition of the data.
+These smaller structures are called \emph{blocks}. It is possible to
+devise decomposition schemes that result in asymptotic improvements
+of insertion performance when compared to global reconstruction alone.
+
+\begin{example}[Data Structure Decomposition]
+Consider the sorted array data structure in Figure~\ref{fig:bg-decomp-1},
+where $|\mathscr{I}| = n$ and $B(n) \in \Theta(n \log n)$. Inserting a
+new record into $\mathscr{I}$ will, then, require $\Theta(n \log n)$ time.
+However, if the data structure is decomposed into blocks such that each
+block has $\Theta(\sqrt{n})$ records, as shown in Figure~\ref{fig:bg-decomp-2},
+an insert only needs to rebuild one of the blocks, and thus the worst-case
+cost of an insert becomes $\Theta(\sqrt{n} \log \sqrt{n})$.
+\end{example}
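+
+For a rough sense of scale (an illustrative calculation only): with $n =
+10^6$ records, a full rebuild performs on the order of $n \log_2 n \approx
+2 \times 10^7$ operations, whereas rebuilding a single block of $\sqrt{n}$
+records performs only about $\sqrt{n} \log_2 \sqrt{n} \approx 10^4$, a
+reduction of roughly three orders of magnitude.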
+
+Much of the existing work on dynamization has considered different
+approaches to decomposing data structures, and the effects that these
+approaches have on insertion and query performance. However, before we can
+discuss these approaches, we must first address the problem of answering
+search problems over these decomposed structures.
+
\subsection{Decomposable Search Problems}
-The dynamization techniques we will be considering require decomposing
-one data structure into several, smaller ones, called blocks, each built
-over a disjoint partition of the data. As a result, these techniques
-can only be applied in situations where the search problem can be
-answered from this set of decomposed blocks. The answer to the search
+Not all search problems can be correctly answered over a decomposed data
+structure, and this problem introduces one of the major limitations
+of traditional dynamization techniques: The answer to the search
problem from the decomposition should be the same as would have been
obtained had all of the data been stored in a single data structure. This
requirement is formalized in the definition of a class of problems called
-\emph{decomposable search problems (DSP)}. This class was first defined
+\emph{decomposable search problems} (DSP). This class was first defined
by Bentley and Saxe in their work on dynamization, and we will adopt
their definition,
@@ -88,12 +312,12 @@ their definition,
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}
-The requirement for $\mergeop$ to be constant-time was used by Bentley and
-Saxe to prove specific performance bounds for answering queries from a
+The requirement for $\mergeop$ to be constant-time was used by Bentley
+and Saxe to prove specific performance bounds for answering queries from a
decomposed data structure. However, it is not strictly \emph{necessary},
-and later work by Overmars lifted this constraint and considered a more
-general class of search problems called \emph{$C(n)$-decomposable search
-problems},
+and later work by Overmars lifted this constraint and considered a
+more general class of search problems called \emph{$C(n)$-decomposable
+search problems},
\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars-cn-decomp}]
A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$ is $C(n)$-decomposable
@@ -105,11 +329,12 @@ problems},
for all $A, B \in \mathcal{PS}(\mathcal{D})$ where $A \cap B = \emptyset$.
\end{definition}
-To demonstrate that a search problem is decomposable, it is necessary to
-show the existence of the merge operator, $\mergeop$, with the necessary
-properties, and to show that $F(A \cup B, q) = F(A, q)~ \mergeop ~F(B,
-q)$. With these two results, induction demonstrates that the problem is
-decomposable even in cases with more than two partial results.
+\Paragraph{Examples.} To demonstrate that a search problem is
+decomposable, it is necessary to show the existence of the merge operator,
+$\mergeop$, with the necessary properties, and to show that $F(A \cup
+B, q) = F(A, q)~ \mergeop ~F(B, q)$. With these two results, induction
+demonstrates that the problem is decomposable even in cases with more
+than two partial results.
As an example, consider range scans,
\begin{definition}[Range Count]
@@ -162,166 +387,74 @@ taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
of numbers is a DSP.
\end{proof}
-\section{Dynamization for Decomposable Search Problems}
-
-Because data in a database is regularly updated, data structures
-intended to be used as an index must support updates (inserts, in-place
-modification, and deletes). Not all potentially useful data structures
-support updates, and so a general strategy for adding update support
-would increase the number of data structures that could be used as
-database indices. We refer to a data structure with update support as
-\emph{dynamic}, and one without update support as \emph{static}.\footnote{
- The term static is distinct from immutable. Static refers to the
- layout of records within the data structure, whereas immutable
- refers to the data stored within those records. This distinction
- will become relevant when we discuss different techniques for adding
- delete support to data structures. The data structures used are
- always static, but not necessarily immutable, because the records may
- contain header information (like visibility) that is updated in place.
-}
-
-This section discusses \emph{dynamization}, the construction of a dynamic
-data structure based on an existing static one. When certain conditions
-are satisfied by the data structure and its associated search problem,
-this process can be done automatically, and with provable asymptotic
-bounds on amortized insertion performance, as well as worst case
-query performance. This automatic approach is in constrast with the
-manual design of a dynamic data structure, which involves altering
-the data structure itself to natively support updates. This process
-usually involves implementing techniques that partially rebuild small
-portions of the structure to accomodate new records, which is called
-\emph{local reconstruction}~\cite{overmars83}. This is a very high cost
-intervention that requires significant effort on the part of the data
-structure designer, whereas conventional dynamization can be performed
-with little-to-no modification of the underlying data structure at all.
-
-It is worth noting that there are a variety of techniques
-discussed in the literature for dynamizing structures with specific
-properties, or under very specific sets of circumstances. Examples
-include frameworks for adding update support succinct data
-structures~\cite{dynamize-succinct} or taking advantage of batching
-of insert and query operations~\cite{batched-decomposable}. This
-section discusses techniques that are more general, and don't require
-workload-specific assumptions.
-
-We will first discuss the necessary data structure requirements, and
-then examine several classical dynamization techniques. The section
-will conclude with a discussion of delete support within the context
-of these techniques. For more detail than is included in this chapter,
-Overmars wrote a book providing a comprehensive survey of techniques for
-creating dynamic data structures, including not only the dynamization
-techniques discussed here, but also local reconstruction based
-techniques and more~\cite{overmars83}.\footnote{
- Sadly, this book isn't readily available in
- digital format as of the time of writing.
-}
-
-
-\subsection{Global Reconstruction}
-
-The most fundamental dynamization technique is that of \emph{global
-reconstruction}. While not particularly useful on its own, global
-reconstruction serves as the basis for the techniques to follow, and so
-we will begin our discussion of dynamization with it.
-
-Consider a class of data structure, $\mathcal{I}$, capable of answering a
-search problem, $\mathcal{Q}$. Insertion via global reconstruction is
-possible if $\mathcal{I}$ supports the following two operations,
-\begin{align*}
-\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
-\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
-\end{align*}
-where $\mathtt{build}$ constructs an instance $\mathscr{I}\in\mathcal{I}$
-over the data structure over a set of records $d \subseteq \mathcal{D}$
-in $B(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d
-\subseteq \mathcal{D}$ used to construct $\mathscr{I} \in \mathcal{I}$ in
-$\Theta(1)$ time,\footnote{
- There isn't any practical reason why $\mathtt{unbuild}$ must run
- in constant time, but this is the assumption made in \cite{saxe79}
- and in subsequent work based on it, and so we will follow the same
- definition here.
-} such that $\mathscr{I} = \mathtt{build}(\mathtt{unbuild}(\mathscr{I}))$.
-
-Given this structure, an insert of record $r \in \mathcal{D}$ into a
-data structure $\mathscr{I} \in \mathcal{I}$ can be defined by,
-\begin{align*}
-\mathscr{I}_{i}^\prime = \text{build}(\text{unbuild}(\mathscr{I}_i) \cup \{r\})
-\end{align*}
-
-It goes without saying that this operation is sub-optimal, as the
-insertion cost is $\Theta(B(n))$, and $B(n) \in \Omega(n)$ at best for
-most data structures. However, this global reconstruction strategy can
-be used as a primitive for more sophisticated techniques that can provide
-reasonable performance.
-
-\subsection{Amortized Global Reconstruction}
-\label{ssec:agr}
-
-The problem with global reconstruction is that each insert must rebuild
-the entire data structure, involving all of its records. This results
-in a worst-case insert cost of $\Theta(B(n))$. However, opportunities
-for improving this scheme can present themselves when considering the
-\emph{amortized} insertion cost.
-
-Consider the cost accrued by the dynamized structure under global
-reconstruction over the lifetime of the structure. Each insert will result
-in all of the existing records being rewritten, so at worst each record
-will be involved in $\Theta(n)$ reconstructions, each reconstruction
-having $\Theta(B(n))$ cost. We can amortize this cost over the $n$ records
-inserted to get an amortized insertion cost for global reconstruction of,
-
+\Paragraph{Answering Queries for DSPs.} Queries for a decomposable
+search problem can be answered over a decomposed structure by
+individually querying each block, and then merging the results together
+using $\mergeop$. In many cases, this process will introduce some
+overhead in the query cost. Given a decomposed data structure $\mathscr{I}
+= \{\mathscr{I}_1, \ldots, \mathscr{I}_m\}$,
+a query for a $C(n)$-decomposable search problem can be answered using,
\begin{equation*}
-I_a(n) = \frac{B(n) \cdot n}{n} = B(n)
+\mathbftt{query}\left(\mathscr{I}, q\right) \triangleq \bigmergeop_{i=1}^{m} F(\mathscr{I}_i, q)
\end{equation*}
-
-This doesn't improve things as is, however it does present two
-opportunities for improvement. If we could either reduce the size of
-the reconstructions, or the number of times a record is reconstructed,
-then we could reduce the amortized insertion cost.
-
-The key insight, first discussed by Bentley and Saxe, is that
-both of these goals can be accomplished by \emph{decomposing} the
-data structure into multiple, smaller structures, each built from
-a disjoint partition of the data. As long as the search problem
-being considered is decomposable, queries can be answered from
-this structure with bounded worst-case overhead, and the amortized
-insertion cost can be improved~\cite{saxe79}. Significant theoretical
-work exists in evaluating different strategies for decomposing the
-data structure~\cite{saxe79, overmars81, overmars83} and for leveraging
-specific efficiencies of the data structures being considered to improve
-these reconstructions~\cite{merge-dsp}.
-
-There are two general decomposition techniques that emerged from
-this work. The earliest of these is the logarithmic method, often
-called the Bentley-Saxe method in modern literature, and is the most
-commonly discussed technique today. The Bentley-Saxe method has been
-directly applied in a few instances in the literature, such as to
-metric indexing structures~\cite{naidan14} and spatial structures~\cite{bkdtree},
+which requires $O\left(m \cdot \left(\mathscr{Q}_S(n_\text{max}) +
+C(n)\right)\right)$ time, where $m$ is the number of blocks and
+$n_\text{max}$ is the size of the largest block. Note that $C(n)$ is
+multiplied by $m$ in this expression; this is a large part of the reason
+why $C(n)$-decomposability is less desirable than standard
+decomposability, where $C(n) \in \Theta(1)$ and thus falls out of the
+cost function.
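+
+As a brief worked example (a sketch, using the range count problem from
+above and assuming each block is stored as a sorted array): a range count
+within a single sorted array requires two binary searches, taking
+$\Theta(\log n)$ time. Under the $\Theta(\sqrt{n})$-block decomposition
+from the earlier example, the decomposed query instead computes a count
+in each block and sums the partial counts using $\mergeop = +$, for a
+total cost of $\Theta(\sqrt{n} \log n)$. This overhead, relative to
+$\Theta(\log n)$ for a single structure, is the price paid for the
+improved insertion cost.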
+
+\section{Decomposition Methods}
+
+The previous discussion reveals the basic tension that exists within
+decomposition-based techniques: using fewer, larger blocks improves
+query performance at the cost of insertion performance, while using
+more, smaller blocks improves insertion performance at the cost of
+having to query a larger number of blocks.
+The literature on decomposition-based dynamization techniques discusses
+different approaches for performing the decomposition to balance these two
+competing interests, as well as various additional properties of search
+problems and structures that can be leveraged for better performance. In
+this section, we will discuss these topics in the context of creating
+half-dynamic data structures, and the next section will discuss similar
+considerations for full-dynamic structures.
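+
+This tension can be made concrete with a rough sketch (assuming, for
+simplicity, $m$ equally sized blocks and the reconstruction of a single
+block per insert): the worst-case insertion cost is on the order of
+$B\left(\nicefrac{n}{m}\right)$, while answering a decomposable query
+requires examining every block, costing on the order of $m \cdot
+\mathscr{Q}_S\left(\nicefrac{n}{m}\right)$. Increasing $m$ shrinks the
+former while growing the latter, and the decomposition schemes below
+represent different choices of where to sit along this trade-off.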
+
+Of the decomposition techniques, we will focus on the two most important
+from a practical standpoint.\footnote{
+ There are, in effect, two main methods for decomposition. Other,
+ more complex, methods exist that consist of various compositions
+    of the two simpler ones. These more complex methods are largely of
+    theoretical interest, as their complexity makes them of questionable
+    utility in practice~\cite{overmars83}.
+} The earliest of these is the logarithmic method, often called the
+Bentley-Saxe method in modern literature, and is the most commonly
+discussed technique today. The Bentley-Saxe method has been directly
+applied in a few instances in the literature, such as to metric indexing
+structures~\cite{naidan14} and spatial structures~\cite{bkdtree},
and has also been used in a modified form for genetic sequence search
-structures~\cite{almodaresi23} and graphs~\cite{lsmgraph}, to cite a few
-examples.
+structures~\cite{almodaresi23} and graphs~\cite{lsmgraph}, to cite
+a few examples. A later technique, the equal block method, was also
+developed. It is generally not as effective as the Bentley-Saxe method,
+and we have not identified any specific applications of this technique
+outside of the theoretical literature. However, we will discuss it as well
+because it is simple and lends itself well to demonstrating certain
+useful properties of decomposition-based dynamization techniques that
+we will take advantage of later.
-A later technique, the equal block method, was also developed. It is
-generally not as effective as the Bentley-Saxe method, and as a result we
-have not identified any specific applications of this technique outside
-of the theoretical literature, however we will discuss it as well in
-the interest of completeness, and because it does lend itself well to
-demonstrating certain properties of decomposition-based dynamization
-techniques.
\subsection{Equal Block Method}
\label{ssec:ebm}
Though chronologically later, the equal block method is theoretically a
bit simpler, and so we will begin our discussion of decomposition-based
-technique for dynamization of decomposable search problems with it. There
-have been several proposed variations of this concept~\cite{maurer79,
-maurer80}, but we will focus on the most developed form as described by
-Overmars and von Leeuwen~\cite{overmars-art-of-dyn, overmars83}. The core
-concept of the equal block method is to decompose the data structure
-into several smaller data structures, called blocks, over partitions
-of the data. This decomposition is performed such that each block is of
-roughly equal size.
+techniques for the dynamization of decomposable search problems
+with it. There have been several proposed variations of this
+concept~\cite{maurer79, maurer80}, but we will focus on the most developed
+form as described by Overmars and van Leeuwen~\cite{overmars-art-of-dyn,
+overmars83}. The core concept of the equal block method is to decompose
+the data structure into a specified number of blocks, such that each
+block is of roughly equal size.
Consider a data structure $\mathscr{I} \in \mathcal{I}$ that solves
some decomposable search problem, $F$ and is built over a set of records
@@ -463,8 +596,9 @@ cost of answering a decomposable search problem from a BSM dynamization
is $\Theta\left(\mathscr{Q}(n)\right)$.~\cite{saxe79}
+\subsection{Optimizations}
-\subsection{Merge Decomposable Search Problems}
+\subsubsection{Merge Decomposable Search Problems}
When a reconstruction is performed using these techniques, the inputs to
that reconstruction are not random collections of records, but rather
@@ -523,7 +657,59 @@ the above definition.\footnote{
useful as an optimization.
}
-\subsection{Delete Support}
+\subsubsection{Worst-Case Optimal Techniques}
+\label{ssec:bsm-worst-optimal}
+
+Dynamization based upon amortized global reconstruction has a
+significant gap between its \emph{amortized} insertion performance
+and its \emph{worst-case} insertion performance. When using the
+Bentley-Saxe method, the logarithmic decomposition ensures that the
+majority of inserts involve rebuilding only small data structures,
+and thus are relatively fast. However, the worst-case insertion cost is
+still $\Theta(B(n))$, no better than unamortized global reconstruction,
+because the worst-case insert requires a reconstruction using all of
+the records in the structure.
+
+Overmars and van Leeuwen~\cite{overmars81, overmars83} proposed an
+alteration to the Bentley-Saxe method that is capable of bringing the
+worst-case insertion cost in line with amortized, $I(n) \in \Theta
+\left(\frac{B(n)}{n} \log n\right)$. To accomplish this, they introduce
+a structure that is capable of spreading the work of reconstructions
+out across multiple inserts. Their structure consists of $\log_2 n$
+levels, like the Bentley-Saxe method, but each level contains four data
+structures, rather than one: $Oldest_i$, $Older_i$, $Old_i$, and
+$New_i$.\footnote{
+ We are here adopting nomenclature used by Erickson in his lecture
+ notes on the topic~\cite{erickson-bsm-notes}, which is a bit clearer
+ than the more mathematical notation in the original source material.
+} The $Old$, $Older$, $Oldest$ structures represent completely built
+versions of the data structure on each level, and will be either full
+($2^i$ records) or empty. If $Oldest$ is empty, then so is $Older$,
+and if $Older$ is empty, then so is $Old$. The fourth structure,
+$New$, represents a partially built structure on the level. A record
+in the structure will be present in exactly one old structure, and may
+additionally appear in a new structure.
+
+When inserting into this structure, the algorithm first examines every
+level, $i$. If both $Older_{i-1}$ and $Oldest_{i-1}$ are full, then the
+algorithm will execute $\frac{B(2^i)}{2^i}$ steps of the construction
+of $New_i$ from $\text{unbuild}(Older_{i-1}) \cup
+\text{unbuild}(Oldest_{i-1})$. Once enough inserts have been performed
+to completely build some block, $New_i$, the source blocks for the
+reconstruction, $Oldest_{i-1}$ and $Older_{i-1}$, are deleted; $Old_{i-1}$
+becomes $Oldest_{i-1}$, and $New_i$ is assigned to the oldest empty block
+on level $i$.
+
+This approach means that, in the worst case, partial reconstructions will
+be executed on every level in the structure, resulting in
+\begin{equation*}
+ I(n) \in \Theta\left(\sum_{i=0}^{\log_2 n-1} \frac{B(2^i)}{2^i}\right) \in \Theta\left(\log_2 n \frac{B(n)}{n}\right)
+\end{equation*}
+time. Additionally, if $B(n) \in \Omega(n^{1 + \epsilon})$ for $\epsilon
+> 0$, then the bottom level dominates the reconstruction cost, and the
+worst-case bound drops to $I(n) \in \Theta(\frac{B(n)}{n})$.
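+
+To ground this bound with a couple of illustrative instantiations: for a
+structure with $B(n) \in \Theta(n \log n)$, such as a sorted array, the
+condition $B(n) \in \Omega(n^{1+\epsilon})$ does not hold and the
+worst-case insertion cost is $\Theta(\log^2 n)$, whereas for a structure
+with $B(n) \in \Theta(n^2)$ the bottom level dominates and the worst-case
+cost is $\Theta(n)$.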
+
+\section{Delete Support}
\label{ssec:dyn-deletes}
Classical dynamization techniques have also been developed with
@@ -568,7 +754,7 @@ proposed for supporting deletes, each of which rely on certain properties
of the search problem and data structure. These are the use of a ghost
structure and weak deletes.
-\subsubsection{Ghost Structure for Invertible Search Problems}
+\subsection{Ghost Structure for Invertible Search Problems}
The first proposed mechanism for supporting deletes was discussed
alongside the Bentley-Saxe method in Bentley and Saxe's original
@@ -680,7 +866,7 @@ completely. Then all of the blocks can be rebuilt from the remaining
records, partitioning them according to the strict binary decomposition
of the Bentley-Saxe method.
-\subsubsection{Weak Deletes for Deletion Decomposable Search Problems}
+\subsection{Weak Deletes for Deletion Decomposable Search Problems}
Another approach for supporting deletes was proposed later, by Overmars
and van Leeuwen, for a class of search problem called \emph{deletion
@@ -847,57 +1033,6 @@ an insert results in a violation: $s$ is updated to be exactly $f(n)$, all
existing blocks are unbuilt, and then the records are evenly redistributed
into the $s$ blocks.~\cite{overmars-art-of-dyn}
-\subsection{Worst-Case Optimal Techniques}
-\label{ssec:bsm-worst-optimal}
-
-Dynamization based upon amortized global reconstruction has a
-significant gap between its \emph{amortized} insertion performance,
-and its \emph{worst-case} insertion performance. When using the
-Bentley-Saxe method, the logarithmic decomposition ensures that the
-majority of inserts involve rebuilding only small data structures,
-and thus are relatively fast. However, the worst-case insertion cost is
-still $\Theta(B(n))$, no better than unamortized global reconstruction,
-because the worst-case insert requires a reconstruction using all of
-the records in the structure.
-
-Overmars and van Leeuwen~\cite{overmars81, overmars83} proposed an
-alteration to the Bentley-Saxe method that is capable of bringing the
-worst-case insertion cost in line with amortized, $I(n) \in \Theta
-\left(\frac{B(n)}{n} \log n\right)$. To accomplish this, they introduce
-a structure that is capable of spreading the work of reconstructions
-out across multiple inserts. Their structure consists of $\log_2 n$
-levels, like the Bentley-Saxe method, but each level contains four data
-structures, rather than one, called $Oldest_i$, $Older_i$, $Old_i$, $New_i$
-respectively.\footnote{
- We are here adopting nomenclature used by Erickson in his lecture
- notes on the topic~\cite{erickson-bsm-notes}, which is a bit clearer
- than the more mathematical notation in the original source material.
-} The $Old$, $Older$, $Oldest$ structures represent completely built
-versions of the data structure on each level, and will be either full
-($2^i$ records) or empty. If $Oldest$ is empty, then so is $Older$,
-and if $Older$ is empty, then so is $Old$. The fourth structure,
-$New$, represents a partially built structure on the level. A record
-in the structure will be present in exactly one old structure, and may
-additionally appear in a new structure as well.
-
-When inserting into this structure, the algorithm first examines every
-level, $i$. If both $Older_{i-1}$ and $Oldest_{i-1}$ are full, then the
-algorithm will execute $\frac{B(2^i)}{2^i}$ steps of the algorithm
-to construct $New_i$ from $\text{unbuild}(Older_{i-1}) \cup
-\text{unbuild}(Oldest_{i-1})$. Once enough inserts have been performed
-to completely build some block, $New_i$, the source blocks for the
-reconstruction, $Oldest_{i-1}$ and $Older_{i-1}$ are deleted, $Old_{i-1}$
-becomes $Oldest_{i-1}$, and $New_i$ is assigned to the oldest empty block
-on level $i$.
-
-This approach means that, in the worst case, partial reconstructions will
-be executed on every level in the structure, resulting in
-\begin{equation*}
- I(n) \in \Theta\left(\sum_{i=0}^{\log_2 n-1} \frac{B(2^i)}{2^i}\right) \in \Theta\left(\log_2 n \frac{B(n)}{n}\right)
-\end{equation*}
-time. Additionally, if $B(n) \in \Omega(n^{1 + \epsilon})$ for $\epsilon
-> 0$, then the bottom level dominates the reconstruction cost, and the
-worst-case bound drops to $I(n) \in \Theta(\frac{B(n)}{n})$.
\section{Limitations of Classical Dynamization Techniques}
\label{sec:bsm-limits}
diff --git a/chapters/introduction.tex b/chapters/introduction.tex
index 6b6904a..8a45bd0 100644
--- a/chapters/introduction.tex
+++ b/chapters/introduction.tex
@@ -123,25 +123,26 @@ lines can be found in Chapter~\ref{chap:related-work}, and the third
will be extensively discussed in Chapter~\ref{chap:background}.
Automatic index composition has been considered in a variety of
-papers~\cite{periodic-table,ds-alchemy,fluid-ds,gene,cosine}, each considering
-differing sets of data structure primitives and different techniques for
-composing the structure. The general principle across all incarnations
-of the technique is to consider a (usually static) set of data, and a
-workload consisting of single-dimensional range queries and point lookups.
-The system then analyzes the workload, either statically or in real time,
-selects specific primitive structures optimized for certain operations
-(e.g., hash table-like structures for point lookups, sorted runs for range
-scans), and applies them to different regions of the data, in an attempt
-to maximize the overall performance of the workload. Although some work
-in this area suggests generalization to more complex data types, such
-as multi-dimensional data~\cite{fluid-ds}, this line is broadly focused
-on creating instance-optimal indices for workloads that databases are
-already well equipped to handle. While this task is quite important, it
-is not precisely the work that we are trying to accomplish here. And,
-because the techniques are limited to specified sets of structural
-primitives, it isn't clear that the approach can be usefully extended
-to support \emph{arbitrary} query and data types. We thus consider this
-line to be largely orthogonal to ours.
+papers~\cite{periodic-table,ds-alchemy,fluid-ds,gene,cosine}, each
+considering differing sets of data structure primitives and different
+techniques for composing the structure. The general principle across all
+incarnations of the technique is to consider a (usually static) set of
+data, and a workload consisting of single-dimensional range queries and
+point lookups. The system then analyzes the workload, either statically
+or in real time, selects specific primitive structures optimized for
+certain operations (e.g., hash table-like structures for point lookups,
+sorted runs for range scans), and applies them to different regions
+of the data, in an attempt to maximize the overall performance of the
+workload. Although some work in this area suggests generalization to
+more complex data types, such as multi-dimensional data~\cite{fluid-ds},
+this line is broadly focused on creating instance-optimal indices for
+workloads that databases are already well equipped to handle. While this
+task is quite important, it is not precisely the work that we are trying
+to accomplish here. And, because the techniques are limited to specified
+sets of structural primitives, it isn't clear that the approach can
+be usefully extended to support \emph{arbitrary} query and data types
+without reintroducing the very problem we are trying to address. We thus
+consider this line to be largely orthogonal to ours.
The second approach, generalized index templates, \emph{does} attempt
to address the problem of expanding indexing support of databases to
@@ -178,17 +179,17 @@ rebuilding these blocks. The most commonly used version of this
approach is the Bentley-Saxe method~\cite{saxe79}, which has been
individually applied to several specific data structures in past
work~\cite{almodaresi23,pgm,naidan14,xie21,bkdtree}. Dynamization
-of this sort is not a fully general solution though; it places
-a number of restrictions on the data structures and queries that
-it can support. These limitations will be discussed at length in
-Chapter~\ref{chap:background}, but briefly they include: (1) restrictions
-on query types that can be supported, as well as even stricter constraints
-on when deletes are supported, (2) a lack of useful performance configuration,
-and (3) sub-optimal performance characteristics, particularly in terms of
-insertion tail latencies.
+of this sort is not a fully general solution though; it places a
+number of restrictions on the data structures and queries that
+it can support. These limitations will be discussed at length
+in Chapter~\ref{chap:background}, but briefly they include: (1)
+restrictions on query types that can be supported, as well as even
+stricter constraints on when deletes are supported, (2) a lack of
+useful performance configuration, and (3) sub-optimal performance
+characteristics, particularly in terms of insertion tail latencies.
Of the three approaches, we believe the latter to be the most promising
-from the prospective of easing the development of novel indices
+from the perspective of easing the development of novel indices
for specialized queries and data types. While dynamization does have
limitations, they are less onerous than the other two approaches. This
is because dynamization is unburdened by specific selections of primitive
@@ -239,6 +240,6 @@ two chapters, and formally considers the design space and trade-offs
within it. In Chapter~\ref{chap:tail-latency}, we consider the problem
of insertion tail latency, and extend our framework with support for
techniques to mitigate this problem. Chapter~\ref{chap:related-work}
-contains a more detailed discussion of works related to our own and the
-ways in which are approaches differ, and finally Chapter~\ref{chap:conclusion}
-concludes the work.
+contains a more detailed discussion of works related to our
+own and the ways in which our approaches differ, and finally
+Chapter~\ref{chap:conclusion} concludes the work.