Diffstat (limited to 'chapters/beyond-dsp.tex')
-rw-r--r--  chapters/beyond-dsp.tex  130
1 file changed, 72 insertions, 58 deletions
diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex
index 26f733c..af84b43 100644
--- a/chapters/beyond-dsp.tex
+++ b/chapters/beyond-dsp.tex
@@ -13,18 +13,22 @@
\section{Introduction}
-In the previous chapter, we discussed how several of the limitations of
-dynamization could be overcome by proposing a systematic dynamization
-approach for sampling data structures. In doing so, we introduced
-a multi-stage query mechanism to overcome the non-decomposability of
-these queries, provided two mechanisms for supporting deletes along with
-specialized processing to integrate these with the query mechanism, and
-introduced some performance tuning capability inspired by the design space
-of modern LSM Trees. While promising, these results are highly specialized
-and remain useful only within the context of sampling queries. In this
-chapter, we develop new generalized query abstractions based on these
-specific results, and discuss a fully implemented framework based upon
-these abstractions.
+In the previous chapter, we considered the problem of answering sampling
+queries over a decomposed data structure. Because such problems are
+not decomposable, they are not efficiently solvable using traditional
+dynamization techniques. However, by introducing a number of additional
+mechanisms, we were able to produce a specialized system that used a
+multi-stage query process to overcome the non-decomposability of the
+queries and provide efficient and correct answers. We additionally
+introduced two different mechanisms for supporting deletes, which were
+tightly integrated with both the query and reconstruction processes.
+Finally, we introduced a configuration space inspired by LSM trees
+to allow for tuning between query and insert performance. While this
+result is promising, it is not a general solution, as the mechanisms are
+highly specialized to sampling queries. In this chapter, we generalize
+these mechanisms into new query abstractions based upon our specific
+results, and then discuss a fully implemented framework built on these
+abstractions.
More specifically, in this chapter we propose \emph{extended
decomposability} and \emph{iterative deletion decomposability} as two
@@ -32,10 +36,17 @@ new, broader classes of search problem which are strict supersets of
decomposability and deletion decomposability respectively, providing a
more powerful interface to allow the efficient implementation of a larger
set of search problems over a dynamized structure. We then implement
-a C++ library based upon these abstractions which is capable of adding
+a C++ library based upon these abstractions that is capable of adding
support for inserts, deletes, and concurrency to static data structures
automatically, and use it to provide dynamizations for independent range
-sampling, range queries with learned indices, string search with succinct
+sampling,\footnote{
+ Our generalized framework's support for sampling is not \emph{exactly}
+ equivalent to that of the previous chapter. In particular, our
+ generalized framework does not support tombstone-based deletes
+ for sampling problems. It also lacks some mutable buffer-related
+ optimizations for weighted sampling. These features were sacrificed
+ in the name of generality.
+} range queries with learned indices, string search with succinct
tries, and high dimensional vector search with metric indices. In each
case we compare our dynamized implementation with existing dynamic
structures, and standard Bentley-Saxe dynamizations, where possible.
@@ -352,8 +363,7 @@ interface, with the same performance as their specialized implementations.
\subsection{Iterative Deletion Decomposability}
\label{ssec:dyn-idsp}
-
-\begin{algorithm}[t]
+\begin{algorithm}
\caption{Answering an Iterative Deletion Decomposable Search Problem}
\label{alg:dyn-query}
\KwIn{$q$: query parameters, $\mathscr{I}_1 \ldots \mathscr{I}_m$: blocks}
@@ -378,49 +388,51 @@ interface, with the same performance as their specialized implementations.
\end{algorithm}
-We next turn out attention to support for deletes. Efficient delete
+We next turn our attention to support for deletes. Efficient delete
support in Bentley-Saxe dynamization is provably impossible~\cite{saxe79},
-but, as discussed in Section~\ref{ssec:dyn-deletes} it is possible
-to support them in restricted situations, where either the search
-problem is invertible (Definition~\ref{def:invert}) or the data
-structure and search problem combined are deletion decomposable
-(Definition~\ref{def:background-ddsp}). In Chapter~\ref{chap:sampling},
-we considered a set of search problems which did \emph{not} satisfy
-any of these properties, and instead built a customized solution for
-deletes that required tight integration with the query process in order
-to function. While such a solution was acceptable for the goals of that
-chapter, it is not sufficient for our goal in this chapter of producing
-a generalized system.
-
-Additionally, of the two types of problem that can support deletes, the
-invertible case is preferable. This is because the amount of work necessary
-to support deletes for invertible search problems is very small. The data
+but, as discussed in Section~\ref{ssec:dyn-deletes}, it is possible to
+support them in restricted situations. Efficient delete mechanisms exist
+when the search problem is invertible (Definition~\ref{def:invert})
+or when the data structure and search problem combined are
+deletion decomposable (Definition~\ref{def:background-ddsp}).
+In Chapter~\ref{chap:sampling}, we considered a set of search problems
+which did \emph{not} satisfy either of these properties, and instead
+built a customized solution for deletes that required tight integration
+with the query process. While such a solution was acceptable for the
+goals of that chapter, it is insufficient for this chapter's goal of
+producing a generalized system.
+
+Of the two types of problem that can support deletes, the invertible
+case is preferable. This is because the amount of work necessary to
+support deletes for invertible search problems is very small. The data
structure requires no modification (such as to implement weak deletes),
-and the query requires no modification (to ignore the weak deletes) aside
-from the addition of the $\Delta$ operator. This is appealing from a
-framework design standpoint. Thus, it would also be worth it to consider
+and the query requires no modification (to ignore the weak deletes)
+aside from the addition of the $\Delta$ operator. This is appealing from
+a framework design standpoint. Thus, it would also be worth considering
approaches for expanding the range of search problems that can be answered
using the ghost structure mechanism supported by invertible problems.
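To make this concrete, the following sketch illustrates the ghost
structure mechanism for an invertible search problem, using range count
as the example: deleted records are inserted into ghost blocks, and the
$\Delta$ operator simply subtracts their local results at combine
time. The types and names here are purely illustrative and are not part
of our framework's interface.
\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <vector>

// A static block supporting range count over sorted keys.
struct SortedBlock {
    std::vector<int> keys;   // sorted

    std::size_t range_count(int lo, int hi) const {
        auto b = std::lower_bound(keys.begin(), keys.end(), lo);
        auto e = std::upper_bound(keys.begin(), keys.end(), hi);
        return static_cast<std::size_t>(e - b);
    }
};

// Range count is invertible: every ghost record corresponds to a deleted
// record in some block, so subtracting the ghost results (the Delta
// operator) cancels the deleted records at combine time.
std::size_t range_count(const std::vector<SortedBlock>& blocks,
                        const std::vector<SortedBlock>& ghost_blocks,
                        int lo, int hi) {
    std::size_t result = 0;
    for (const auto& b : blocks)       result += b.range_count(lo, hi);
    for (const auto& g : ghost_blocks) result -= g.range_count(lo, hi);
    return result;
}
\end{verbatim}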
A significant limitation of invertible problems is that the result set
-size is not able to be controlled. We do not know how many records in our
-local results have been deleted until we reach the combine operation and
-they begin to cancel out, at which point we lack a mechanism to go back
-and retrieve more records. This presents difficulties for addressing
-important search problems such as top-$k$, $k$-NN, and sampling. In
+size cannot be known until after the query has been answered. We do
+not know how many records in the local results have been deleted until
+we reach the combine operation and they begin to cancel out. Once this
+point has been reached, we lack a mechanism to return to the structure
+and retrieve more records. This presents difficulties for addressing
+important search problems such as top-$k$, $k$-NN, and sampling,
+where the required result set size is a user-specified parameter. In
principle, these queries could be supported by repeating the query with
larger-and-larger $k$ values until the desired number of records is
returned, but in the eDSP model this requires throwing away a lot of
-useful work, as the state of the query must be rebuilt each time.
+useful work, because the state of the query must be rebuilt each time.
We can resolve this problem by moving the decision to repeat the query
into the query interface itself, allowing retries \emph{before} the
result set is returned to the user and the local meta-information objects
-discarded. This allows us to preserve this pre-processing work, and repeat
+discarded. This allows us to preserve pre-processing results, and repeat
the local query process as many times as is necessary to achieve our
-desired number of records. From this observation, we propose another new
-class of search problem: \emph{iterative deletion decomposable} (IDSP). The
-IDSP definition expands eDSP with a fifth operation,
+desired number of records. From this observation, we propose another
+new class of search problem: \emph{iterative deletion decomposable}
+(IDSP). The IDSP definition expands eDSP with a fifth operation,
\begin{itemize}
\item $\mathbftt{repeat}(q, R, q_1, \ldots, q_m) \to
@@ -433,11 +445,14 @@ IDSP definition expands eDSP with a fifth operation,
If this routine returns true, it must also modify the local queries as
necessary to account for the work that remains to be completed (e.g.,
-update the number of records to retrieve). Then, the query process resumes
-from the execution of the local queries. If it returns false, then the
-result is simply returned to the user. If the number of repetitions of
-the query is bounded by $R(n)$, then the following provides an upper
-bound on the worst-case query complexity of an IDSP,
+update the number of records to retrieve). Then, the query process
+resumes from the execution of the local queries. If it returns false,
+then the result is simply returned to the user. The full IDSP query
+algorithm is shown in Algorithm~\ref{alg:dyn-idsp-query}.
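To illustrate the resulting control flow, the following C++ sketch shows
the general shape of the IDSP query procedure. The operation names stand
in for the eDSP operations and the new $\mathbftt{repeat}$ operation and
are not the exact interface of our library; the important point is that
pre-processing and query distribution are performed only once, while the
local queries and the combine step may be repeated.
\begin{verbatim}
#include <cstddef>
#include <functional>
#include <vector>

// Sketch of the IDSP query loop. The operation objects are assumed to
// capture the top-level query parameters. Pre-processing and query
// distribution run once; only the local queries, the combine step, and
// the repeat decision are executed on every pass.
template <typename Block, typename State, typename LocalQ,
          typename LocalR, typename Result>
Result answer_idsp(
    const std::vector<Block>& blocks,
    std::function<State(const Block&)> local_preproc,
    std::function<std::vector<LocalQ>(std::vector<State>&)> distribute,
    std::function<LocalR(const Block&, const LocalQ&)> local_query,
    std::function<Result(const std::vector<LocalR>&)> combine,
    std::function<bool(const Result&, std::vector<LocalQ>&)> repeat) {

    std::vector<State> states;
    for (const auto& b : blocks)
        states.push_back(local_preproc(b));
    std::vector<LocalQ> local_queries = distribute(states);

    Result result;
    do {
        std::vector<LocalR> local_results;
        for (std::size_t i = 0; i < blocks.size(); i++)
            local_results.push_back(local_query(blocks[i],
                                                local_queries[i]));
        result = combine(local_results);
        // repeat() may adjust the local queries (e.g., increase a
        // per-block k to replace cancelled records) and request
        // another pass before the result is returned to the user.
    } while (repeat(result, local_queries));

    return result;
}
\end{verbatim}
Because the per-block states and local queries persist across passes,
each additional iteration pays only for the local queries and the
combine step, which is reflected in the $R(n)$ factor of the bound
below.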
+
+If the number of repetitions of the query is bounded by $R(n)$, then the
+following provides an upper bound on the worst-case query complexity of
+an IDSP,
\begin{equation*}
O\left(\log_2 n \cdot P(n) + D(n) + R(n) \left(\log_2 n \cdot Q_s(n) +
@@ -454,11 +469,10 @@ records. This can be done, for example, using the full-reconstruction
techniques in the literature~\cite{saxe79, merge-dsp, overmars83}
or through proactively performing reconstructions, such as with the
mechanism discussed in Section~\ref{sssec:sampling-rejection-bound},
-depending on the particulars of how deletes are implemented. The
-full IDSP query algorithm is shown in Figure~\ref{alg:dyn-idsp-query}
-
+depending on the particulars of how deletes are implemented.
+% \subsubsection{IDSP for $k$-NN}
As an example of how IDSP can facilitate delete support for search
problems, let's consider $k$-NN. This problem can be $C(n)$-deletion
@@ -614,7 +628,7 @@ a search problem falls into a particular classification in the general
taxonomy doesn't imply any particular information about where in the
deletion taxonomy that same problem might also fall.
-\begin{figure}[t]
+\begin{figure}
\subfloat[General Taxonomy]{\includegraphics[width=.49\linewidth]{diag/taxonomy}
\label{fig:taxonomy-main}}
\subfloat[Deletion Taxonomy]{\includegraphics[width=.49\linewidth]{diag/deletes} \label{fig:taxonomy-deletes}}
@@ -1084,7 +1098,7 @@ in Algorithm~\ref{alg:dyn-insert}.
$\texttt{buffer.append}(r)$\;
\Return
}
- $\texttt{buffer\_shard} \gets \texttt{build\_shard}(buffer)$ \;
+ $\texttt{buffer\_shard} \gets \texttt{build\_shard}(\texttt{buffer})$ \;
\BlankLine
$\texttt{idx} \gets 0$\;
\For{$i \gets 0 \ldots \texttt{n\_levels}$}{
@@ -1320,7 +1334,7 @@ get better asymptotic performance.
\label{ssec:dyn-concurrency}
The decomposition-based dynamization scheme we are considering in this
-work lends itself to a very straightfoward concurrency control scheme,
+work lends itself to a very straightforward concurrency control scheme,
because it is founded upon static data structures. As a result, concurrent
writes only need to be managed within the mutable buffer. Beyond this,
reconstructions within the levels of the structure only reorganize
@@ -1354,11 +1368,11 @@ assign jobs to the thread pool.
\subsubsection{The Mutable Buffer}
Our mutable buffer is an unsorted array to which new records are appended.
-This makes concurrent writes very straightfoward to support using a simple
+This makes concurrent writes very straightforward to support using a simple
fetch-and-add instruction on the tail pointer of the buffer. When a write
is issued, a fetch-and-add is executed against the tail pointer. This
effectively reserves a slot at the end of the array for the new record
-to be written into, as each thread will recieve a unique index from this
+to be written into, as each thread will receive a unique index from this
operation. Then, the record can be directly assigned to that index. If
the buffer is full, then a reconstruction is scheduled (if one isn't
already running) and a failure is immediately returned to the user.
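A minimal sketch of this append path is shown below. The names are
illustrative rather than taken from our implementation, and the
additional synchronization needed to make newly written records visible
to concurrent readers (for example, a separately maintained
visible-record count) is elided.
\begin{verbatim}
#include <atomic>
#include <cstddef>
#include <vector>

template <typename Record>
class MutableBuffer {
public:
    explicit MutableBuffer(std::size_t capacity)
        : data_(capacity), tail_(0), capacity_(capacity) {}

    // Returns false if the buffer is full; the caller then schedules a
    // reconstruction (if one is not already running) and reports the
    // failure to the user.
    bool append(const Record& rec) {
        // The fetch-and-add hands each thread a unique slot index.
        std::size_t idx = tail_.fetch_add(1, std::memory_order_relaxed);
        if (idx >= capacity_)
            return false;
        data_[idx] = rec;  // no other thread writes to this slot
        return true;
    }

private:
    std::vector<Record> data_;
    std::atomic<std::size_t> tail_;
    std::size_t capacity_;
};
\end{verbatim}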