 chapters/tail-latency.tex | 373 ++++++++++++------
 1 file changed, 322 insertions(+), 51 deletions(-)
diff --git a/chapters/tail-latency.tex b/chapters/tail-latency.tex
index 38e8f27..5935737 100644
--- a/chapters/tail-latency.tex
+++ b/chapters/tail-latency.tex
@@ -120,26 +120,31 @@ blocks in the decomposition. Placing a bound on this number is necessary
to bound the worst-case query cost, and is done using reconstructions
to either merge (in the case of the Bentley-Saxe method) or re-partition
(in the case of the equal block method) the blocks. Performing less
-frequent reconstructions reduces the amount of work associated with
-inserts, at the cost of allowing more blocks to accumulate and thereby
-hurting query performance.
+frequent (or smaller) reconstructions reduces the amount of work
+associated with inserts, at the cost of allowing more blocks to accumulate
+and thereby hurting query performance.
This trade-off between insertion and query performance by way of block
count is most directly visible in the equal block method described
-in Section~\ref{ssec:ebm}. As a reminder, this technique provides the
+in Section~\ref{ssec:ebm}. This technique provides the
following worst-case insertion and query bounds,
\begin{align*}
I(n) &\in \Theta\left(\frac{n}{f(n)}\right) \\
\mathscr{Q}(n) &\in \Theta\left(f(n) \cdot \mathscr{Q}\left(\frac{n}{f(n)}\right)\right)
\end{align*}
-where $f(n)$ is the number of blocks.
+where $f(n)$ is the number of blocks. This worst-case result ignores
+re-partitioning costs, which may be necessary for certain selections
+of $f(n)$. We omit them here because we are about to examine a case
+of the equal block method where no re-partitioning is necessary. When
+re-partitioning is used, the worst-case cost rises to the now familiar
+$I(n) \in \Theta(B(n))$ result.
\begin{figure}
\centering
-\subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/design-space/isam-insert-dist.pdf} \label{fig:tl-ebm-tradeoff}}
-\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/design-space/vptree-insert-dist.pdf} \label{fig:tl-ebm-tail-latency}} \\
+\subfloat[Insertion vs. Query Trade-off]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-count-sweep.pdf}\label{fig:tl-ebm-tradeoff}}
+\subfloat[Insertion Latency Distribution]{\includegraphics[width=.5\textwidth]{img/tail-latency/ebm-count-sweep.pdf} \label{fig:tl-ebm-tail-latency}} \\
-\caption{The equal block method with varying values of $f(n)$.}
+\caption{The equal block method with $f(n) = C$ for varying values of $C$. \textbf{Plots not yet populated}}
\label{fig:tl-ebm}
\end{figure}
@@ -248,16 +253,17 @@ the next section, we'll discuss a technique based on this idea.
\section{Relaxed Reconstruction}
-There is theoretical work in this area, which we discussed in
-Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach for
-controlling the worst-case insertion cost is to break the largest
-reconstructions up into small sequences of operations, that can then be
-attached to each insert, spreading the total workload out and ensuring
-each insert takes a consistent amount of time. Theoretically, the total
-throughput should remain about the same when doing this, but rather
-than having a bursty latency distribution with many fast inserts, and
-a small number of incredibly slow ones, distribution should be far more
-uniform.
+There does exist theoretical work on throttling the insertion
+rate of a Bentley-Saxe dynamization to control the worst-case
+insertion cost~\cite{overmars81}, which we discussed in
+Section~\ref{ssec:bsm-worst-optimal}. The gist of this approach is to
+break the largest reconstructions up into small sequences of operations
+that can then be attached to each insert, spreading the total workload out
+and ensuring each insert takes a consistent amount of time. Theoretically,
+the total throughput should remain about the same when doing this, but
+rather than having a bursty latency distribution with many fast inserts
+and a small number of incredibly slow ones, the distribution should be
+far more uniform.
Unfortunately, this technique has a number of limitations that we
discussed in Section~\ref{ssec:bsm-tail-latency-problem}. Notably for
@@ -283,24 +289,25 @@ same worst-case insertion time as the worst-case optimized techniques,
given a few assumptions about available resources.
First, a comment on nomenclature. We define the term \emph{last level},
-$i = \ell$, to mean the level in the dynamized structure with the largest
-index value (and thereby the most records) and \emph{first level}
-to mean the level with index $i=0$. Any level with $0 < i < \ell$ is
-called an \emph{internal level}. A reconstruction on a level involves the
-compaction of all blocks on that level into one, larger, block, that is
-then appended to the level below. Relative to some level at index $i$,
-the \emph{next level} is the level at index $i + 1$.
-
-At a very high level, our proposed approach as follows. We will fully
-detach reconstructions from buffer flushes. When the buffer fills,
-it will immediately flush and a new shard will be placed in the first
-level. Reconstructions will be performed in the background to maintain the
-internal structure according to the tiering policy. When a level contains
-$s$ shards, a reconstruction will immediately be triggered to merge these
-shards and push the result down to the next level. To ensure that the
-number of shards in the structure remains bounded by $\Theta(\log n)$,
-we will throttle the insertion rate so that it is balanced with amount
-of time needed to complete reconstructions.
+$i = \ell$, to mean the level in the dynamized structure with the
+largest index value (and thereby the most records) and \emph{first
+level} to mean the level with index $i=0$. Any level with $0 < i <
+\ell$ is called an \emph{internal level}. A reconstruction on level $i$
+combines all blocks on that level into one larger block, which is
+then appended to level $i+1$. Relative to some level at index $i$, the
+\emph{next level} is the level at index $i + 1$, and the
+\emph{previous level} is at index $i-1$.
+
+Our proposed approach is as follows. We will fully detach reconstructions
+from buffer flushes. When the buffer fills, it will immediately flush
+and a new block will be placed in the first level. Reconstructions
+will be performed in the background to maintain the internal structure
+according to the tiering policy. When a level contains $s$ blocks,
+a reconstruction will immediately be triggered to merge these blocks
+and push the result down to the next level. To ensure that the number
+of blocks in the structure remains bounded by $\Theta(\log n)$, we will
+throttle the insertion rate so that it is balanced with the amount of
+time needed to complete reconstructions.
\begin{figure}
\caption{Several ``states'' of tiering, leading up to the worst-case
@@ -326,7 +333,7 @@ Given a buffered, dynamized structure utilizing the tiering layout policy,
and at least $2$ parallel threads of execution, it is possible to maintain
a worst-case insertion cost of
\begin{equation}
-I(n) \in \Theta\left(\frac{B(n)}{n} \log n\right)
+I(n) \in O\left(\frac{B(n)}{n} \log n\right)
\end{equation}
\end{theorem}
\begin{proof}
@@ -352,9 +359,9 @@ at least $\frac{B(n)}{n}$ time to fully cover the cost of the last level
reconstruction. However, this is not sufficient to guarantee the bound, as
other reconstructions will also occur within the structure. At the point
at which the last level reconstruction can be scheduled, there will be
-exactly $1$ shard on each level. Thus, each level will potentially also
+exactly $1$ block on each level. Thus, each level will potentially also
have an ongoing reconstruction that must be covered by inserting more
-stall time, to ensure that no level in the structure exceeds $s$ shards.
+stall time, to ensure that no level in the structure exceeds $s$ blocks.
There are $\log n$ levels in total, and so in the worst case we will need
to introduce extra stall time to account for a reconstruction on each
level,
@@ -363,7 +370,7 @@ I(n) \in \Theta(1 + \delta_0 + \delta_1 + \ldots \delta_{\log n - 1})
\end{equation*}
All of these internal reconstructions will be strictly less than the
size of the last-level reconstruction, and so we can bound them all
-above by $\frac{B(n)}{n}$ time.
+above by $O(\frac{B(n)}{n})$ time.
Given this, and assuming that the smallest (i.e., most pressing)
reconstruction is prioritized on the background thread, we find that
@@ -376,12 +383,12 @@ This approach results in an equivalent worst-case insertion latency
bound to~\cite{overmars81}, but manages to resolve both of the issues
cited above. By leveraging two parallel threads, instead of trying to
manually multiplex a single thread, this approach requires \emph{no}
-modification to the user's shard code to function. And, by leveraging
+modification to the user's block code to function. And, by leveraging
the fact that reconstructions under tiering are strictly local to a
single level, we can avoid needing to add any complicated additional
-structures to manage partially building shards as new records are added.
+structures to manage partially building blocks as new records are added.
-\subsection{Reducing Stall with Parallelism}
+\subsection{Reducing Stall with Additional Parallelism}
The result in Theorem~\ref{theo:worst-case-optimal} assumes that there
are two available threads of parallel execution, which allows for the
@@ -391,10 +398,11 @@ threads are available.
The major limitation on Theorem~\ref{theo:worst-case-optimal}'s worst-case
bound is that it is insufficient to cover only the cost of the last level
-reconstruction to maintain the bound on the shard count. From the moment
+reconstruction to maintain the bound on the block count. From the moment
that the last level has filled, and this reconstruction can begin, every
-level within the structure will sustain another $s - 1$ reconstructions
-before it is necessary to have completed the last level reconstruction.
+level within the structure must sustain another $s - 1$ reconstructions
+before it is necessary to have completed the last level reconstruction,
+in order to maintain the $\Theta(\log n)$ bound on the number of blocks.
Consider a parallel implementation that, contrary to
Theorem~\ref{theo:worst-case-optimal}, only stalls enough to cover
@@ -402,8 +410,8 @@ the last level reconstruction, and blocks all other reconstructions
until it has been completed. Such an approach would result in $\delta
= \frac{B(n)}{n}$ stall and complete the last level reconstruction
after $\Theta(n)$ inserts. During this time, $\Theta(\frac{n}{N_B})$
-shards would accumulate in L0, ultimately resulting in a bound of
-$\Theta(n)$ shards in the structure, rather than the $\Theta(\log
+blocks would accumulate in L0, ultimately resulting in a bound of
+$\Theta(n)$ blocks in the structure, rather than the $\Theta(\log
n)$ bound we are trying to maintain. This is the reason why
Theorem~\ref{theo:worst-case-optimal} must account for stalls on every
level, and assumes that the smallest (and therefore most pressing)
@@ -482,7 +490,7 @@ we find that,
I(n) \in O \left(\frac{B(n)}{n}\right)
\end{equation*}
is the worst-case insertion cost, while ensuring that all reconstructions
-are done in time to maintain the shard bound given $\log n$ parallel threads.
+are done in time to maintain the block bound given $\log n$ parallel threads.
\end{proof}
@@ -503,6 +511,271 @@ a dynamization framework based upon the technique.
\subsection{Parallel Reconstruction Architecture}
+The existing concurrency implementation described in
+Section~\ref{ssec:dyn-concurrency} is insufficient for the purposes of
+constructing a framework supporting the parallel reconstruction scheme
+described in the previous section. In particular, it is limited to
+only two active versions of the structure at at time, with one ongoing
+reconstruction. Additionally, it does not consider buffer flushes as
+distinct events from reconstructions. In this section, we will discuss
+the modifications made to the concurrency support within our framework
+to support parallel reconstructions.
+
+Much like the simpler scheme in Section~\ref{ssec:dyn-concurrency},
+our concurrency framework will be based on multi-versioning. Each
+\emph{version} consists of three pieces of information: a buffer
+head pointer, buffer tail pointer, and a collection of levels and
+shards. However, the process of managing, creating, and installing
+versions is much more complex, to allow more than two versions to exist
+at the same time under certain circumstances.
+
+
+\subsubsection{Structure Versioning}
+
+The internal structure of the dynamization consists of a sequence of
+levels containing immutable shards, as well as a snapshot of the state
+of the mutable buffer. This section pertains specifically to the internal
+structure; the mutable buffer handles its own versioning separate from
+this and will be discussed in the next section.
+
+\begin{figure}
+\centering
+\subfloat[Buffer Flush]{\includegraphics[width=.5\textwidth]{diag/tail-latency/flush.pdf}\label{fig:tl-flush}}
+\subfloat[Maintenance Reconstruction]{\includegraphics[width=.5\textwidth]{diag/tail-latency/maint.pdf}\label{fig:tl-maint}}
+\caption{\textbf{Structure Version Transitions.} The dynamized structure
+can transition to a new version via two operations, flushing the buffer
+into the first level or performing a maintenance reconstruction to
+merge shards on some level and append the result onto the next one. In
+each case, \texttt{V2} contains a shallow copy of \texttt{V1}'s
+light grey shards, with the dark grey shards being newly created
+and the white shards being deleted. The buffer flush operation in
+Figure~\ref{fig:tl-flush} simply creates a new shard from the buffer
+and places it in \texttt{L0} to create \texttt{V2}. The maintenance
+reconstruction in Figure~\ref{fig:tl-maint} is slightly more complex,
+creating a new shard in \texttt{L2} using the two shards in \texttt{V1}'s
+\texttt{L1}, and then removing the shards in \texttt{V2}'s \texttt{L1}.
+}
+\label{fig:tl-flush-maint}
+\end{figure}
+
+The internal structure of the dynamized data structure (ignoring the
+buffer) can be thought of as a list of immutable levels, $\mathcal{V}
+= \{\mathscr{L}_0, \ldots, \mathscr{L}_h\}$, where each level
+contains immutable shards, $\mathscr{L}_i = \{\mathscr{I}_0, \ldots,
+\mathscr{I}_m\}$. Buffer flushes and reconstructions can be thought of
+as functions, which accept a version as input and produce a new version
+as output. Namely,
+\begin{align*}
+ \mathcal{V}_{i+1} &= \mathbftt{flush}(\mathcal{V}_i, \mathcal{B}) \\
+ \mathcal{V}_{i+1} &= \mathbftt{maint}(\mathcal{V}_i, \mathscr{L}_x, j)
+\end{align*}
+where the subscript represents the \texttt{version\_id} and is a strictly
+increasing number assigned to each version. The $\mathbftt{flush}$
+operation builds a new shard using records from the buffer, $\mathcal{B}$,
+and creates a new version identical to $\mathcal{V}_i$, except with
+the new shard appended to $\mathscr{L}_0$. $\mathbftt{maint}$ performs
+a maintenance reconstruction by building a new shard from all of the
+shards in level $\mathscr{L}_x$ and creating a new version identical
+to $\mathcal{V}_i$, except that the new shard is appended to level
+$\mathscr{L}_j$ and the shards of $\mathscr{L}_x$ are removed in the
+new version. These two operations are shown in
+Figure~\ref{fig:tl-flush-maint}.
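These two operations can be modeled as pure functions over versions (a hypothetical Python sketch; the names mirror the notation above, not the framework's actual API, and shards are modeled as frozensets of records):

```python
from copy import deepcopy

# A version is a list of levels; each level is a list of shards.
# Shards are modeled as frozensets of records so merges are order-free.

def flush(version, buffer):
    """Build a shard from the buffer and append it to level 0."""
    new = deepcopy(version)
    new[0].append(frozenset(buffer))
    return new

def maint(version, x, j):
    """Merge all shards on level x into one shard appended to level j."""
    new = deepcopy(version)
    merged = frozenset().union(*version[x])
    while len(new) <= j:
        new.append([])
    new[j].append(merged)
    new[x] = []
    return new

v0 = [[], []]
v1 = flush(v0, {1, 2})
v2 = flush(v1, {3, 4})
v3 = maint(v2, 0, 1)

assert v3[0] == [] and v3[1] == [frozenset({1, 2, 3, 4})]
# Input versions are never mutated: each operation yields a new version.
assert v2[0] == [frozenset({1, 2}), frozenset({3, 4})]
```

The key property the sketch demonstrates is that versions are immutable values: each operation leaves its input intact and yields a successor.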
+
+At any point in time, the framework will have \emph{one} active version,
+$\mathcal{V}_a$, as well as the next unassigned version number, $v_m
+> a$. New version ids are obtained by performing an atomic fetch-and-add
+on $v_m$, and versions will become active in the exact order of their
+assigned version numbers. We use the term \emph{installing} a version,
+$\mathcal{V}_x$ to refer to setting $\mathcal{V}_a \gets \mathcal{V}_x$.
+
+\Paragraph{Version Number Assignment.} It is the intention of this
+framework to prioritize buffer flushes, meaning that the versions
+resulting from a buffer flush should become active as rapidly as
+possible. It is undesirable to have some version, $\mathcal{V}_f$,
+resulting from a buffer flush, attempting to install while there is
+a version $\mathcal{V}_r$ associated with an in-process maintenance
+reconstruction such that $a < r < f$. In this case, the flush must wait
+for the maintenance reconstruction to finalize before it can itself be
+installed. To avoid this problem, we assign version numbers differently
+based upon whether the new version is created by a flush or a maintenance
+reconstruction.
+
+\begin{itemize}
+ \item \textbf{Flush.} When a buffer flush is scheduled, it is
+ immediately assigned the next available version number at the
+ time of scheduling.
+
+ \item \textbf{Maintenance Reconstruction.} Maintenance reconstructions
+ are \emph{not} assigned a version number immediately. Instead, they
+ are assigned a version number \emph{after} all of the reconstruction
+ work is performed, during their installation process.
+\end{itemize}
+
+\Paragraph{Version Installation.} Once a given flush or maintenance
+reconstruction has completed and has been assigned a version
+number, $i$, the version will attempt to install itself. The
+thread running the operation will wait until $a = i - 1$, and
+then it will update $\mathcal{V}_a \gets \mathcal{V}_i$ using an
+atomic pointer assignment. All versions are reference counted using
+\texttt{std::shared\_ptr}, and so will be automatically deleted
+once all threads containing a reference to the version have terminated,
+so no special memory management is necessary during version installation.
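The ordering discipline implied by fetch-and-add id assignment and in-order installation can be sketched as follows (a hypothetical Python model using a condition variable; the real framework uses C++ atomics and std::shared_ptr rather than Python locks):

```python
import itertools
import threading

next_version = itertools.count(1)   # v_m: the fetch-and-add counter
active = 0                          # a: id of the currently active version
cv = threading.Condition()
install_order = []

def install(version_id):
    """Wait until a == version_id - 1, then make this version active."""
    global active
    with cv:
        cv.wait_for(lambda: active == version_id - 1)
        active = version_id
        install_order.append(version_id)
        cv.notify_all()

# A flush claims its id at scheduling time...
flush_id = next(next_version)       # id 1

# ...while a maintenance reconstruction finishes its work first and
# only claims an id at install time, so it cannot delay a later flush.
def maintenance():
    maint_id = next(next_version)   # claimed at install time: id 2
    install(maint_id)

t = threading.Thread(target=maintenance)
t.start()
install(flush_id)                   # installs at once: active == 0
t.join()                            # maintenance installs after the flush

assert install_order == [1, 2]
assert active == 2
```

Because the maintenance thread cannot claim an id smaller than the flush's, the flush never waits on it; this is the point of the two assignment rules above.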
+
+\Paragraph{Maintenance Version Reconciliation.} Waiting until the
+moment of installation to assign a version number to maintenance
+reconstructions avoids stalling buffer flushes; however, it introduces
+additional complexity in the installation process. This is because the
+active version at the time the reconstruction was scheduled,
+$\mathcal{V}_a$, may not still be the active version at the time the
+reconstruction is installed, $\mathcal{V}_{a^\prime}$. This means that the version of
+the structure produced by the reconstruction, $\mathcal{V}_r$, will not
+reflect any updates to the structure that were performed in version ids on
+the interval $(a, a^\prime]$. Figure~\ref{fig:tl-version-reconcilliation}
+shows an example of the sort of problem that can arise.
+
+One possible approach is to simply merge the versions together,
+adding all of the shards that are in $\mathcal{V}_{a^\prime}$ but
+not in $\mathcal{V}_r$ prior to installation. Sadly, this approach is
+insufficient because it can lead to three possible problems,
+
+\begin{enumerate}
+ \item If shards used in the maintenance reconstruction to
+ produce $\mathcal{V}_r$ were \emph{also} used as part of a
+ different maintenance reconstruction resulting in a version
+ $\mathcal{V}_o$ with $o < r$, then \textbf{records will be
+ duplicated} by the merge.
+
+ \item If another reconstruction produced a version $\mathcal{V}_o$
+ with $o < r$, and $\mathcal{V}_o$ added a new shard to the same
+ level that $\mathcal{V}_r$ did, it is possible that the
+ temporal ordering properties of the shards on the level
+ may be violated. Recall that supporting tombstone-based
+ deletes requires that shards be strictly ordered within each
+        level by their age to ensure correct tombstone cancellation
+        (Section~\ref{sssec:dyn-deletes}).
+
+ \item The shards that were deleted from $\mathcal{V}_r$ after the
+ reconstruction will still be present in $\mathcal{V}_{a^\prime}$
+ and so may be reintroduced into the new version, again leading
+ to duplication of records. It is non-trivial to identify these
+ shards during the merge to skip over them, because the shards
+ don't have a unique identifier other than their pointers, and
+ using the pointers for this check can lead to the ABA problem
+        under the reference-counting-based memory management scheme the
+ framework is built on.
+
+\end{enumerate}
+
+The first two of these issues are simple synchronization problems
+and can be solved with locking. A maintenance reconstruction
+operates on some level $\mathscr{L}_i$, merging and then deleting shards
+from that level and placing the result in $\mathscr{L}_{i+1}$. In
+order for either of these problems to occur, multiple concurrent
+reconstructions must be operating on $\mathscr{L}_i$. Thus, a lock manager
+can be introduced into the framework to allow reconstructions to lock
+entire levels. A reconstruction can only be scheduled if it is able to
+acquire the lock on the level that it is using as the \emph{source}
+for its shards. Note that there is no synchronization problem with a
+concurrent reconstruction on level $\mathscr{L}_{i-1}$ appending a shard
+to $\mathscr{L}_i$. This will not violate any ordering properties or
+result in any duplication of records. Thus, each reconstruction only
+needs to lock a single level.
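This per-level locking rule can be sketched with simple try-locks (an illustrative Python model, not the framework's actual lock manager):

```python
import threading

# One lock per level; a reconstruction must lock its SOURCE level only.
level_locks = [threading.Lock() for _ in range(8)]

def try_schedule_reconstruction(source_level):
    """Return True iff a reconstruction sourcing this level may proceed."""
    return level_locks[source_level].acquire(blocking=False)

def finish_reconstruction(source_level):
    level_locks[source_level].release()

# Two reconstructions with the same source level cannot run concurrently...
assert try_schedule_reconstruction(1) is True
assert try_schedule_reconstruction(1) is False
# ...but a reconstruction sourcing level 0 (which merely appends INTO
# level 1) needs no lock on level 1, so it is scheduled concurrently.
assert try_schedule_reconstruction(0) is True

finish_reconstruction(0)
finish_reconstruction(1)
assert try_schedule_reconstruction(1) is True
finish_reconstruction(1)
```

Locking only the source level is what keeps the scheme cheap: appends into a locked level's successor are always safe, so no lock ordering protocol is needed.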
+
+The final problem is a bit trickier to address, but is fundamentally
+an implementation detail. Our approach for resolving it is to
+change the way that maintenance reconstructions produce a version
+in the first place. Rather than taking a copy of $\mathcal{V}_a$,
+manipulating it to perform the reconstruction, and then reconciling
+it with $\mathcal{V}_{a^\prime}$ when it is installed, we delay
+\emph{all} structural updates to the version until installation. When a
+reconstruction is scheduled, a reference to $\mathcal{V}_a$ is taken,
+instead of a copy. Then, any new shards are built based on the contents
+of $\mathcal{V}_a$, but no updates to the structure are made. Once all
+of the shard reconstructions are complete, the version installation
+process begins. The thread running the reconstruction waits for its turn
+to install, and \emph{then} makes a copy of $\mathcal{V}_{a^\prime}$. To
+this copy, the newly created shards are added, and any necessary deletes
+are performed. Because the shards to be deleted are still referenced
+by, at minimum, the reference to $\mathcal{V}_a$ held by the
+reconstruction thread, pointer equality can be used to identify these
+shards and the ABA problem avoided. Then, once all the updates are complete,
+the new version can be installed.
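The install-time construction with identity-based deletes can be sketched as follows (hypothetical Python; object identity via `is` stands in for pointer equality, and all names are illustrative):

```python
# Sketch of install-time version construction with identity-based deletes.
# The reconstruction thread holds a reference to the version it read
# from, so the shards it consumed are recognized by object identity,
# sidestepping the ABA problem.

class Shard:
    def __init__(self, records):
        self.records = records

def install_reconstruction(active_version, consumed, new_shard, target_level):
    """Copy the *current* active version, drop consumed shards by
    identity, and append the newly built shard to the target level."""
    new_version = [
        [s for s in level if not any(s is c for c in consumed)]
        for level in active_version
    ]
    new_version[target_level].append(new_shard)
    return new_version

a, b = Shard({1}), Shard({2})
c = Shard({3})                    # added by a concurrent flush
v_current = [[c], [a, b], []]     # V_{a'}: contains c, which V_a did not
merged = Shard(a.records | b.records)

v_new = install_reconstruction(v_current, consumed=[a, b],
                               new_shard=merged, target_level=2)

assert v_new[0] == [c]            # the concurrent flush's shard survives
assert v_new[1] == []             # consumed shards removed by identity
assert v_new[2][0].records == {1, 2}
```

Note that the copy is taken from the version active at install time, not the one the reconstruction read from, so concurrent flushes are reflected automatically.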
+
+This process does push a fair amount of work to the moment of install,
+between when a version id is claimed by the reconstruction thread and
+when that version id becomes active. During this time, any buffer flushes
+will be blocked. However, relative to the work associated with actually
+performing the reconstructions, the overhead of these metadata operations
+is fairly minor, and so it doesn't have a significant effect on buffer
+flush performance.
+
+
+\subsubsection{Mutable Buffer}
+
+\begin{figure}
+\centering
+\subfloat[Buffer Initial State]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-1.pdf}\label{fig:tl-buffer1}}
+\subfloat[Buffer Following an Insert]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-2.pdf}\label{fig:tl-buffer2}}
+\subfloat[Buffer Version Transition]{\includegraphics[width=.3\textwidth]{diag/tail-latency/conc-buffer-3.pdf}\label{fig:tl-buffer3}}
+\caption{\textbf{Versioning process for the mutable buffer.} A schematic
+view of the mutable buffer demonstrating the three pointers representing
+its state, and how they are adjusted as inserts occur. Dark grey slots
+represent the currently active version, light grey slots the old version,
+and white slots are available space.}
+\label{fig:tl-buffer}
+\end{figure}
+
+Next, we'll address concurrent access and versioning of the mutable
+buffer. In our system, the mutable buffer consists of a large ring buffer
+with a head and tail pointer, as shown in Figure~\ref{fig:tl-buffer}. In
+order to support versioning, the buffer actually uses two head pointers,
+one called \texttt{head} and one called \texttt{old head}, along
+with a single \texttt{tail} pointer. Records are inserted into the
+buffer by atomically incrementing \texttt{tail} and then placing the
+record into the slot. For records that cannot be atomically assigned,
+a visibility bit can be used to ensure that concurrent readers don't
+access a partially written value. \texttt{tail} can be incremented
+until it matches \texttt{old head}, or until the current version of
+the buffer (between \texttt{head} and \texttt{tail}) contains $N_B$
+records. At this point, any further writes would either clobber records
+in the old version, or exceed the user-specified buffer capacity, and so
+any inserts must block until a flush has been completed.
+
+Flushes are triggered based on a user-configurable set point,
+$N_F \leq N_B$. When $\mathtt{tail} - \mathtt{head} = N_F$, a
+flush operation is scheduled (more on the details of this process in
+Section~\ref{ssec:tl-flush}). The location of \texttt{tail} is recorded
+as part of the flush, but records can continue to be inserted until one
+of the blocking conditions in the previous paragraph is reached. When the
+flush has completed, a new shard is created containing the records between
+\texttt{head} and the value of \texttt{tail} at the time the flush began.
+The buffer version can then be advanced by setting \texttt{old head}
+to \texttt{head} and setting \texttt{head} to \texttt{tail}. All of the
+records associated with the old version are freed, and the records that
+were just flushed now become part of the old version.
+
+The reason for this scheme is to allow threads accessing an older
+version of the dynamized structure to still see a current view of all
+of the records. These threads will have a reference to a dynamized
+structure containing none of the records in the buffer, as well as
+the old head. Because the older version of the buffer always directly
+precedes the newer, all of the buffered records are visible to this
+older version. However, threads accessing the more current version of
+the buffer will \emph{not} see the records contained between \texttt{old
+head} and \texttt{head}, as these records will have been flushed into
+the structure and are visible to the thread there. If this thread could
+still see records in the older version of the buffer, then it would see
+these records twice, which is incorrect.
+
+One consequence of this technique is that a buffer flush cannot complete
+until all threads referencing \texttt{old head} have completed. To ensure
+that this is the case, the two head pointers are reference counted, and
+a flush will stall until all references to \texttt{old head} have been
+removed. In principle, this problem could be reduced by allowing for more
+than two heads, but it becomes difficult to atomically transition between
+versions in that case, and it would also increase the storage requirements
+for the buffer, which requires $N_B$ space per available version.
+
\subsection{Concurrent Queries}
\subsubsection{Query Pre-emption}
@@ -516,8 +789,6 @@ pointer. If a query is particularly long running, or otherwise stalled,
it is possible that the query will block insertions by holding onto this
head pointer.
-
-
\subsection{Insertion Stall Mechanism}
\section{Evaluation}