author     Douglas Rumbaugh <dbr4@psu.edu>  2025-05-30 21:31:31 -0400
committer  Douglas Rumbaugh <dbr4@psu.edu>  2025-05-30 21:31:31 -0400
commit     3df3d11f71073419ea05fd66bc77c0d9474ca4ce (patch)
tree       216a977bcee6f7a8b220dd7fbe48843d39878cd4 /chapters/tail-latency.tex
parent     6bbc26424eae2d8069de716e7c685a4188d923b9 (diff)
download   dissertation-3df3d11f71073419ea05fd66bc77c0d9474ca4ce.tar.gz
Updates
Diffstat (limited to 'chapters/tail-latency.tex')
-rw-r--r--  chapters/tail-latency.tex  179
1 file changed, 152 insertions(+), 27 deletions(-)
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index 5935737..c3cf5b7 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -285,8 +285,21 @@ this discussion, they are
\end{enumerate}
In this section, we consider how these restrictions can be overcome given
our dynamization framework, and propose a strategy that achieves the
-same worst-case insertion time as the worst-case optimized techniques,
-given a few assumptions about available resources.
+same worst-case insertion time as the worst-case optimized theoretical
+techniques, given a few assumptions about available resources, by taking
+advantage of parallelism and proactive scheduling of reconstructions.
+
+We restrict ourselves in this chapter to the tiering layout policy,
+because it has several properties that are useful to our goals. In
+tiering, the input shards to a reconstruction are drawn from a single
+level, unlike in leveling, where the shards come from two levels, and
+BSM, where the shards may come from \emph{all} of the levels. This
+restriction maximizes the available parallelism, which we will use to
+improve tail latency; it also greatly simplifies synchronization and
+provides the largest possible window over which to amortize the cost
+of a reconstruction. The techniques we describe in this chapter also
+work with leveling, albeit less effectively, and do not work \emph{at
+all} with BSM.
First, a comment on nomenclature. We define the term \emph{last level},
$i = \ell$, to mean the level in the dynamized structure with the
@@ -309,9 +322,24 @@ of blocks in the structure remains bounded by $\Theta(\log n)$, we will
throttle the insertion rate so that it is balanced with the amount of time
needed to complete reconstructions.
+
\begin{figure}
-\caption{Several "states" of tiering, leading up to the worst-case
-reconstruction.}
+\centering
+\includegraphics[width=\textwidth]{diag/tail-latency/last-level-recon.pdf}
+\caption{\textbf{Worst-case Reconstruction.} Under the tiering layout
+policy, the worst-case reconstruction occurs when every level in the
+structure has been filled (middle portion of the figure). A
+reconstruction must then be performed on each level to merge it into a
+single shard and place it on the level below, leaving the structure
+with one shard per level once the records from the buffer have been
+added to L0 (right portion of the figure). The cost of this operation
+is dominated by the reconstruction of the last level. That
+reconstruction, however, can be performed well in advance, as it
+requires only the blocks on the last level, which is full $\Theta(n)$
+inserts before the worst-case reconstruction is triggered (left
+portion of the figure). This provides us with the opportunity to
+initiate the reconstruction early. \emph{Note: the block sizes in this
+diagram are not scaled to their record counts; each level has an
+increasing number of records per block.}}
\label{fig:tl-tiering}
\end{figure}
@@ -506,8 +534,11 @@ rather than blocking, as a means of controlling the shard count within
the structure. However, there are a number of practical problems to be
solved before this idea can be used in a real system. In this section,
we discuss these problems, and our approaches to solving them to produce
-a dynamization framework based upon the technique.
-
+a dynamization framework based upon the technique. Note that this
+system is based on the same high-level architecture as we described in
+Section~\ref{ssec:dyn-concurrency}. To avoid redundancy, we will focus
+on how this system differs, without fully recapitulating the content of
+that earlier section.
\subsection{Parallel Reconstruction Architecture}
@@ -529,7 +560,6 @@ shards. However, the process of managing, creating, and installing
versions is much more complex, to allow more than two versions to exist
at the same time under certain circumstances.
-
\subsubsection{Structure Versioning}
The internal structure of the dynamization consists of a sequence of
@@ -621,6 +651,21 @@ atomic pointer assignment. All versions are reference counted using
once all threads containing a reference to the version have terminated,
so no special memory management is necessary during version installation.
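+
+As a rough sketch of this scheme (all type and function names here are
+illustrative, not the framework's actual API), versions can be held in
+reference-counted pointers and installed with a single atomic store:
+\begin{verbatim}
+#include <cstddef>
+#include <memory>
+#include <vector>
+
+struct Shard;  // an immutable block of records
+
+struct Version {
+    std::size_t id;  // version number
+    std::vector<std::vector<std::shared_ptr<Shard>>> levels;
+};
+
+std::shared_ptr<Version> active_version;  // the active version, V_a
+
+// Install a new version with one atomic pointer assignment. Threads
+// still holding a reference to an older version keep it alive; the
+// old version is freed when its last reference is dropped.
+void install(std::shared_ptr<Version> next) {
+    std::atomic_store(&active_version, std::move(next));
+}
+
+// Take a stable reference to the structure for a read or a
+// reconstruction.
+std::shared_ptr<Version> acquire() {
+    return std::atomic_load(&active_version);
+}
+\end{verbatim}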
+\begin{figure}
+\centering
+\includegraphics[width=\textwidth]{diag/tail-latency/dropped-shard.pdf}
+\caption{\textbf{Shard Reconciliation Problem.} Because maintenance
+reconstructions do not obtain their version number until after they
+have completed, it is possible for the internal structure of the
+dynamization to change between when the reconstruction is scheduled
+and when it completes. In this example, a maintenance reconstruction
+is scheduled based on V1 of the structure. Before it can finish, V2 is
+created as a result of a buffer flush. As a result, the maintenance
+reconstruction's resulting structure is assigned V3, but when V3 is
+installed, the shard produced by the flush in V2 is lost. We must
+devise a means of preventing this from happening.}
+\label{fig:tl-dropped-shard}
+\end{figure}
+
\Paragraph{Maintenance Version Reconciliation.} Waiting until the
moment of installation to assign a version number to maintenance
reconstructions avoids stalling buffer flushes; however, it introduces
@@ -630,7 +675,7 @@ may not still be the active version at the time the reconstruction is
installed, $\mathcal{V}_{a^\prime}$. This means that the version of
the structure produced by the reconstruction, $\mathcal{V}_r$, will not
reflect any updates to the structure that were performed in version ids on
-the interval $(a, a^\prime]$. Figure~\ref{fig:tl-version-reconcilliation}
+the interval $(a, a^\prime]$. Figure~\ref{fig:tl-dropped-shard}
shows an example of the sort of problem that can arise.
One possible approach is to simply merge the versions together,
@@ -676,7 +721,7 @@ can be introduced into the framework to allow reconstructions to lock
entire levels. A reconstruction can only be scheduled if it is able to
acquire the lock on the level that it is using as the \emph{source}
for its shards. Note that there is no synchronization problem with a
-concurrent reconstruction on level $\mathscr{L}_i-1$ appending a shard
+concurrent reconstruction on level $\mathscr{L}_{i-1}$ appending a shard
to $\mathscr{L}_i$. This will not violate any ordering properties or
result in any duplication of records. Thus, each reconstruction only
needs to lock a single level.
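+
+A minimal sketch of this locking discipline, with hypothetical names,
+might look as follows; only the \emph{source} level is locked, and the
+lock is held until the reconstruction's result has been installed:
+\begin{verbatim}
+#include <cstddef>
+#include <mutex>
+#include <vector>
+
+constexpr std::size_t MAX_LEVELS = 64;            // assumed upper bound
+std::vector<std::mutex> level_locks(MAX_LEVELS);  // one lock per level
+
+// A reconstruction may only be scheduled if it acquires the lock on
+// its source level. Appending its output shard to the level below
+// needs no lock, as discussed above.
+bool try_schedule_reconstruction(std::size_t source_level) {
+    if (!level_locks[source_level].try_lock()) {
+        return false;  // another reconstruction owns this level
+    }
+    // ... enqueue the reconstruction; the lock is released once its
+    // resulting version has been installed ...
+    return true;
+}
+\end{verbatim}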
@@ -696,7 +741,7 @@ process begins. The thread running the reconstruction waits for its turn
to install, and \emph{then} makes a copy of $\mathcal{V}_{a^\prime}$. To
this copy, the newly created shards are added, and any necessary deletes
are performed. Because the shards to be deleted are currently referenced
-in, at minimum, the reference to $\mathcal{V}(a)$ maintained by the
+in, at minimum, the reference to $\mathcal{V}_a$ maintained by the
reconstruction thread, pointer equality can be used to identify the
shards and the ABA problem avoided. Then, once all the updates are complete,
the new version can be installed.
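+
+Reusing the hypothetical types from the earlier versioning sketch,
+this installation procedure might be sketched as follows; the input
+shards are identified by pointer equality against the snapshot the
+reconstruction was scheduled from:
+\begin{verbatim}
+#include <algorithm>
+#include <cstddef>
+#include <memory>
+
+void install_reconstruction(
+    const std::shared_ptr<Version> &snapshot,    // V_a, held since scheduling
+    const std::shared_ptr<Version> &now_active,  // V_a' at install time
+    std::size_t level,                           // source level of the job
+    std::shared_ptr<Shard> merged)               // the newly built shard
+{
+    auto next = std::make_shared<Version>(*now_active);  // copy V_a'
+
+    // Remove exactly the input shards: shared_ptr equality compares
+    // pointers, so the ABA problem is avoided.
+    auto &shards = next->levels[level];
+    for (const auto &old : snapshot->levels[level]) {
+        shards.erase(std::remove(shards.begin(), shards.end(), old),
+                     shards.end());
+    }
+
+    // Append the result to the target level (growing the structure
+    // by a level if necessary, not shown), then install atomically.
+    next->levels[level + 1].push_back(std::move(merged));
+    install(next);
+}
+\end{verbatim}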
@@ -741,13 +786,12 @@ records. At this point, any further writes would either clobber records
in the old version, or exceed the user-specified buffer capacity, and so
any inserts must block until a flush has been completed.
-Flushes are triggered based on a user-configurable set point,
-$N_F \leq N_B$. When $\mathtt{tail} - \mathtt{head} = N_F$, a
-flush operation is scheduled (more on the details of this process in
-Section~\ref{ssec:tl-flush}). The location of \texttt{tail} is recorded
-as part of the flush, but records can continue to be inserted until one
-of the blocking conditions in the previous paragraph is reached. When the
-flush has completed, a new shard is created containing the records between
+Flushes are triggered based on a user-configurable set point, $N_F \leq
+N_B$. When $\mathtt{tail} - \mathtt{head} = N_F$, a flush operation
+is scheduled. The location of \texttt{tail} is recorded as part of
+the flush, but records can continue to be inserted until one of the
+blocking conditions in the previous paragraph is reached. When the flush
+has completed, a new shard is created containing the records between
\texttt{head} and the value of \texttt{tail} at the time the flush began.
The buffer version can then be advanced by setting \texttt{old head}
to \texttt{head} and setting \texttt{head} to \texttt{tail}. All of the
@@ -778,20 +822,101 @@ for the buffer, which requires $N_B$ space per available version.
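+
+To make these buffer mechanics concrete, the following sketch (with
+illustrative names; \texttt{schedule\_flush} is assumed to hand the
+flush off to the scheduler) shows the set-point trigger and the
+advancing of the buffer version:
+\begin{verbatim}
+#include <atomic>
+#include <cstddef>
+
+struct BufferControl {
+    std::atomic<std::size_t> head{0};      // start of live records
+    std::atomic<std::size_t> old_head{0};  // head of prior version
+    std::atomic<std::size_t> tail{0};      // next write position
+    std::size_t flush_at;                  // the set point, N_F
+    std::size_t capacity;                  // total capacity, N_B
+
+    void schedule_flush(std::size_t recorded_tail);  // assumed
+
+    // Called on the inserting thread after each append: trigger a
+    // flush at the set point; inserts continue past it until one of
+    // the blocking conditions is reached.
+    void maybe_schedule_flush() {
+        if (tail.load() - head.load() == flush_at)
+            schedule_flush(tail.load());   // records tail
+    }
+
+    // Called when the flush completes: advance the buffer version.
+    void advance_version(std::size_t flushed_tail) {
+        old_head.store(head.load());
+        head.store(flushed_tail);
+    }
+};
+\end{verbatim}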
\subsection{Concurrent Queries}
-\subsubsection{Query Pre-emption}
-
-Because our implementation only supports a finite number of versions of
-the mutable buffer at any point in time, and insertions will stall after
-this finite number is reached, it is possible for queries to introduce
-additional insertion latency. Queries hold a reference to the version of
-the structure the are using, which includes holding on to a buffer head
-pointer. If a query is particularly long running, or otherwise stalled,
-it is possible that the query will block insertions by holding onto this
-head pointer.
+Queries are answered based upon the active version of the structure
+at the moment the query begins to execute. When the query routine of
+the dynamization is called, a query is scheduled. Once a thread becomes
+available, the query will begin to execute. At the start of execution, the
+query thread takes a reference to $\mathcal{V}_a$ as well as the current
+\texttt{head} and \texttt{tail} of the buffer. Both $\mathcal{V}_a$
+and \texttt{head} are reference counted, and will be retained for the
+duration of the query. Once the query has finished processing, it will
+return the result to the user via an \texttt{std::promise} and release
+its references to the active version and buffer.
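+
+A sketch of this query lifecycle, assuming hypothetical
+\texttt{scheduler\_enqueue}, \texttt{buffer\_snapshot}, and
+\texttt{run\_query} helpers along with the \texttt{acquire} routine
+from the earlier versioning sketch, might look as follows:
+\begin{verbatim}
+#include <cstddef>
+#include <functional>
+#include <future>
+#include <memory>
+#include <utility>
+
+struct QueryArgs {};    // user query parameters (illustrative)
+struct QueryResult {};  // user result type (illustrative)
+
+// Assumed framework helpers, defined elsewhere.
+void scheduler_enqueue(std::function<void()> task);
+std::pair<std::size_t, std::size_t> buffer_snapshot();
+QueryResult run_query(const QueryArgs &args, const Version &version,
+                      std::size_t head, std::size_t tail);
+
+std::future<QueryResult> schedule_query(std::shared_ptr<QueryArgs> args) {
+    auto promise = std::make_shared<std::promise<QueryResult>>();
+    auto result = promise->get_future();
+
+    scheduler_enqueue([args, promise]() {
+        auto version = acquire();               // pin V_a
+        auto [head, tail] = buffer_snapshot();  // pin the buffer head
+        promise->set_value(run_query(*args, *version, head, tail));
+    });  // references released when the task completes
+
+    return result;
+}
+\end{verbatim}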
+
+\subsubsection{Query Preemption}
+
+Because our implementation only supports two active head pointers in
+the mutable buffer, queries can lead to insertion stalls. If a
+long-running query holds a reference to \texttt{old head}, then an
+active buffer flush of the old version will be blocked by that query.
+If this blocking goes on for long enough, the buffer may fill up and
+the system will begin to reject inserts.
+
+One possible solution to this problem is to process the
+\texttt{buffer\_query} first and then discard the reference to
+\texttt{old head}, allowing the buffer flush to proceed. However,
+this does not work for iterative deletion decomposable search
+problems, which may require re-processing the buffer query
+arbitrarily many times. Instead, we implement a simple preemption
+mechanism to defeat long-running queries. The framework tracks how
+long a buffer flush has been stalled by queries holding references to
+\texttt{old head}. Once this stalling passes a user-defined
+threshold, a preemption flag is set, ordering the queries in question
+to restart themselves. This is implemented entirely within the
+framework, which checks the flag between calls to user code, so no
+changes to user queries are required to support it. If a query sees
+that the flag is set, it releases its references to \texttt{old head}
+and the structure version, and automatically places itself back in
+the scheduling queue to be retried against newer versions of the
+structure and buffer.
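+
+A sketch of this mechanism (names are illustrative), checking the
+flag between framework-driven steps of user code, might look as
+follows:
+\begin{verbatim}
+#include <atomic>
+
+// Set when a buffer flush has been stalled past the user-defined
+// threshold; cleared once the flush proceeds.
+std::atomic<bool> preempt_flag{false};
+
+struct QueryState;             // per-query framework state (illustrative)
+bool query_step(QueryState&);  // assumed: runs one step of user code,
+                               // returning true while work remains
+
+// Returns true if the query completed, false if it was preempted and
+// must release its references and reschedule itself. A maximum
+// preemption count (not shown) lets a query ignore further requests.
+bool run_query_steps(QueryState &state) {
+    while (query_step(state)) {
+        if (preempt_flag.load(std::memory_order_relaxed))
+            return false;
+    }
+    return true;
+}
+\end{verbatim}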
+
+Note that, if misconfigured, this mechanism can entirely prevent
+certain long-running queries from being answered. If the preemption
+threshold is set lower than the expected run time of a valid query,
+that query may loop forever when the system is under sufficient
+insertion pressure. To help avoid this, another parameter specifies a
+maximum preemption count, after which a query will ignore further
+requests for preemption.
\subsection{Insertion Stall Mechanism}
+The results of Theorems~\ref{theo:worst-case-optimal}
+and~\ref{theo:par-worst-case-optimal} are based upon enforcing a rate
+limit on incoming inserts, to ensure that there is sufficient time for
+reconstructions to complete. In practice, calculating and precisely
+applying the correct stall time is quite difficult, owing to the
+vagaries of working with a real system. While it would ultimately be
+ideal to have a cost model that estimates the stall time on the fly,
+based on the cost of building the data structure, the number of
+records involved in reconstructions, the number of available threads,
+the available memory bandwidth, and so on, for the purposes of this
+prototype we have settled on a simple system that demonstrates the
+robustness of the technique.
+
+Recall the basic insert process within our system. Inserts bypass the
+scheduling system and communicate directly with the buffer, on the same
+client thread that called the function, to maximize insertion performance
+and eliminate as much concurrency control overhead as possible. The
+insert routine is synchronous, and returns a boolean indicating whether
+the insert has succeeded or not. The insert can fail if the buffer is full,
+in which case the user is expected to delay for a moment and retry. Once
+space has been cleared in the buffer, the insert will succeed.
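+
+From the client's perspective, the retry loop is simple; a sketch
+(with placeholder \texttt{Dynamization} and \texttt{Record} types)
+follows:
+\begin{verbatim}
+#include <chrono>
+#include <thread>
+
+// insert() returns false when the buffer is full; back off briefly
+// and retry until space is cleared. Names are placeholders.
+template <typename Dynamization, typename Record>
+void insert_with_retry(Dynamization &dyn, const Record &rec) {
+    while (!dyn.insert(rec)) {
+        std::this_thread::sleep_for(std::chrono::microseconds(100));
+    }
+}
+\end{verbatim}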
+
+We can leverage this same rejection mechanism as a form of rate
+limiting by probabilistically rejecting inserts. To this end, we
+introduce a configurable parameter indicating the probability that an
+insert will succeed. For each insert, we use Bernoulli sampling based
+upon this probability to determine whether the insert should be
+attempted or not. If the insert is rejected, the attempt is aborted
+and the function returns a failure. The user can then delay and try
+again.
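+
+A sketch of this gate, as it might appear at the top of the insert
+routine (the parameter name is illustrative), follows:
+\begin{verbatim}
+#include <atomic>
+#include <random>
+
+// The configurable insert-success probability; updatable on demand
+// using atomics.
+std::atomic<double> insert_probability{1.0};
+
+// Bernoulli sample: returns false to reject the insert, in which
+// case the insert routine reports failure and the caller retries.
+bool admit_insert() {
+    thread_local std::mt19937 rng{std::random_device{}()};
+    std::bernoulli_distribution accept(
+        insert_probability.load(std::memory_order_relaxed));
+    return accept(rng);
+}
+\end{verbatim}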
+
+This approach was selected because it has a few specific advantages.
+First, it is based on a single parameter that can be readily updated
+on demand using atomics. Our current prototype uses a single, fixed
+value for the probability, but ultimately it should be dynamically
+tuned to approximate the $\delta$ value from
+Theorem~\ref{theo:worst-case-optimal} as closely as possible. Second,
+randomly sampling the inserts to stall ensures fairness. There is a
+limit to how small a stall can meaningfully be applied, particularly
+if we want to avoid busy waiting; this approach lets us approximate
+very small stalls by attaching larger, more practical delays to
+randomly sampled inserts. Finally, the approach is simple and requires
+no significant changes to the user interface, while (as we will see in
+Section~\ref{sec:tl-eval}) directly exposing the design space
+associated with the parameter to the user.
+
\section{Evaluation}
+\label{sec:tl-eval}
\subsection{}