From 5e4ad2777acc4c2420514e39fb98b7cf2e200996 Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Sun, 27 Apr 2025 17:36:57 -0400
Subject: Initial commit

---
 chapters/future-work.tex | 174 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 chapters/future-work.tex

diff --git a/chapters/future-work.tex b/chapters/future-work.tex
new file mode 100644
index 0000000..d4ddd52
--- /dev/null
+++ b/chapters/future-work.tex
@@ -0,0 +1,174 @@

\chapter{Proposed Work}
\label{chap:proposed}

The previous two chapters described work that has already been completed;
however, a number of tasks remain to be done as part of this project.
Update support is only one of the important features that an index
requires of its data structure. In this chapter, the remaining research
problems are discussed briefly, to lay out a set of criteria for project
completion.

\section{Concurrency Support}

Database management systems are designed to hide the latency of IO
operations, and one of the techniques they use to do so is a high degree
of concurrency. As a result, any data structure used to build a database
index must also support concurrent updates and queries. The sampling
extension framework described in Chapter~\ref{chap:sampling} had basic
concurrency support, but work is ongoing to integrate a superior system
into the framework of Chapter~\ref{chap:framework}.

Because the framework is based on the Bentley-Saxe method, it has a
number of properties that simplify concurrency management. With the
exception of the buffer, the vast majority of the data resides in static
data structures. When using tombstones, these static structures become
fully immutable. This turns concurrency control into a resource
management problem, and suggests a simple multi-version concurrency
control scheme. Each version of the structure, defined as the state
between two reconstructions, is tagged with an epoch number. A query,
then, will read only a single epoch, which is preserved in storage until
all queries accessing it have terminated. Because the mutable buffer is
append-only, a consistent view of it can be obtained by recording the
tail of the log at the start of query execution. Thus, a fixed snapshot
of the index can be represented as a two-tuple containing the epoch
number and the buffer tail index.

The major limitation of the Chapter~\ref{chap:sampling} system was its
handling of buffer expansion. While the mutable buffer itself is an
unsorted array, and thus supports concurrent inserts using a simple
fetch-and-add operation, the real hurdle to insert performance is
managing reconstruction. During a reconstruction, the buffer is full and
cannot accept any new inserts. Because active queries may be using the
buffer, it cannot be immediately flushed, and so inserts are blocked. It
is therefore necessary to use multiple buffers to sustain insertions:
when a buffer fills, a background thread performs the reconstruction,
and a new buffer is added so that inserts can continue while that
reconstruction occurs. The solution used in Chapter~\ref{chap:sampling}
was limited by its restriction to only two buffers (and, as a result, a
maximum of two active epochs at any point in time). Any sustained
insertion workload would quickly fill the pair of buffers and then be
forced to block until one of them could be emptied. This emptying was
contingent on \emph{both} all queries using the buffer finishing
\emph{and} the reconstruction using that buffer finishing. As a result,
the block on inserts could be long (multiple seconds, or even minutes
for particularly large reconstructions) and indeterminate (a given index
could be involved in a very long-running query, and the buffer would
remain blocked until that query completed).
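To make this scheme concrete, the following is a minimal C++ sketch of
the epoch-pinning idea described above. All of the names
(\texttt{Epoch}, \texttt{Snapshot}, \texttt{VersionManager}) are
illustrative and do not correspond to the framework's actual interfaces;
reference-counted pointers stand in for whatever reclamation mechanism
is ultimately used to preserve an epoch until its last query terminates.

\begin{verbatim}
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

struct Shard { /* an immutable structure produced by reconstruction */ };

// One version of the index: the set of shards that exists between two
// reconstructions, tagged with an epoch number.
struct Epoch {
    std::size_t number = 0;
    std::vector<std::shared_ptr<Shard>> shards;
};

// A fixed snapshot for one query: the pinned epoch plus the buffer tail
// index observed when the query started.
struct Snapshot {
    std::shared_ptr<Epoch> epoch;  // keeps the epoch's shards alive
    std::size_t buffer_tail;
};

class VersionManager {
public:
    // Query start: capture the current epoch and the append-only
    // buffer's tail index.
    Snapshot open_snapshot(std::size_t buffer_tail) {
        std::lock_guard<std::mutex> g(lock_);
        return Snapshot{current_, buffer_tail};
    }

    // Reconstruction finish: install a new epoch. An old epoch is
    // reclaimed automatically once the last Snapshot referencing it
    // is destroyed.
    void install(std::shared_ptr<Epoch> next) {
        std::lock_guard<std::mutex> g(lock_);
        current_ = std::move(next);
    }

private:
    std::mutex lock_;
    std::shared_ptr<Epoch> current_ = std::make_shared<Epoch>();
};
\end{verbatim}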
A more effective concurrency solution would therefore need to support
dynamically adding mutable buffers as needed, allowing insertion
throughput to be maintained so long as memory for additional buffer
space is available.\footnote{For the in-memory indexes considered thus
far, it isn't clear that running out of memory for buffers is a
recoverable error in all cases. The system requires the same amount of
memory to store a record in a shard as it does in the buffer
(technically more, considering index overhead). In the case of an
external storage system, the calculus would, of course, be different.}
It would also ensure that a long-running query could only block
insertion if there were insufficient memory to create a new buffer or to
run a reconstruction. However, as the number of buffered records grows,
there is the potential for query performance to suffer, which leads to
another important aspect of an effective concurrency control scheme.

\subsection{Tail Latency Control}

The concurrency control scheme discussed thus far maintains insertion
throughput by allowing an unbounded portion of the new data to remain
buffered in an unsorted fashion. Over time, this buffered data is moved
into data structures in the background as the system performs merges
(which are taken off of the critical path for most operations). While
this approach allows for fast inserts, it has the potential to damage
query performance: the more buffered data there is, the more a query
must fall back on its inefficient scan-based buffer path, as opposed to
using the data structures themselves.

Unfortunately, reconstructions can be quite lengthy (recall that the
worst-case scenario involves rebuilding a static structure over all of
the records; this is, thankfully, rare). This implies that it may be
necessary, in some circumstances, to throttle insertions in order to
maintain acceptable query performance. Additionally, it may be worth
preemptively performing large reconstructions during periods of low
utilization, similar to systems like Silk, which is designed to mitigate
tail latency spikes in LSM-tree based systems~\cite{balmau19}.

It is also possible that large reconstructions will degrade query
performance through their use of system resources. Reconstructions can
consume a large amount of memory bandwidth, which must be shared with
queries. The effect of parallel reconstruction on query performance will
need to be assessed, and strategies for mitigating it, be they
scheduling-based or resource-throttling, considered if necessary.
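One simple form that insertion throttling could take is a backpressure
check on the insert path, sketched below in C++. The class, the soft
limit on buffered records, and the fixed per-insert delay are
assumptions made purely for illustration; the actual policy, and how the
buffered-record count would be obtained, remain open questions.

\begin{verbatim}
#include <chrono>
#include <cstddef>
#include <thread>

// A possible backpressure policy: inserts proceed freely while the
// total number of buffered (unsorted) records is below a soft limit,
// and are briefly delayed once that limit is exceeded, giving
// background reconstructions time to drain buffers before query
// performance degrades further.
class InsertThrottle {
public:
    InsertThrottle(std::size_t soft_limit, std::chrono::microseconds delay)
        : soft_limit_(soft_limit), delay_(delay) {}

    // Called on the insert path, before appending to the active buffer.
    void admit(std::size_t buffered_records) const {
        if (buffered_records > soft_limit_) {
            // Pause briefly rather than block outright: a short, bounded
            // delay per insert trades write throughput for query latency.
            std::this_thread::sleep_for(delay_);
        }
    }

private:
    std::size_t soft_limit_;
    std::chrono::microseconds delay_;
};
\end{verbatim}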
\section{Fine-Grained Online Performance Tuning}

The framework has a large number of configurable parameters, and
introducing concurrency control will add even more. The parameter sweeps
in Section~\ref{ssec:ds-exp} show that there are trade-offs between read
and write performance across this space. Unfortunately, the current
framework applies these configuration parameters globally, and does not
allow them to be changed after the index is constructed. It seems
apparent that better performance might be obtained by adjusting this
approach.

First, there is nothing preventing these parameters from being
configured on a per-level basis: different layout policies on different
levels (for example, tiering on higher levels and leveling on lower
ones), different scale factors, and so on. More index-specific tuning,
such as controlling the memory budget for auxiliary structures, could
also be considered.

This fine-grained tuning will open up an even broader design space,
which has the benefit of improving the configurability of the system,
but the disadvantage of making configuration more difficult.
Additionally, it does nothing to address the problem of workload drift:
a configuration may be optimal now, but will it remain effective in the
future as the read/write mix of the workload changes? Both of these
challenges can be addressed using dynamic tuning.

The idea is that the framework could be augmented to track workload and
performance statistics. Based on these numbers, the framework could
decide, during reconstruction, to adjust the configuration of one or
more levels in an online fashion: to lean more towards read or write
performance, or to dial back memory budgets as the system's memory usage
increases. Buffer-related parameters could also be adjusted in real
time; if insertion throughput is high, it might be worthwhile to
temporarily increase the buffer size rather than spawn multiple smaller
buffers.

Such a system would provide more consistent performance in the face of
changing workloads, and would also make the framework easier to use by
removing the burden of configuration from the user.
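As a rough illustration of what per-level, online tuning might look
like, the following C++ sketch pairs a hypothetical per-level
configuration structure with a retuning hook that could be invoked when
a level is reconstructed. The knobs, the tracked statistics, and the
threshold are placeholders; selecting them well is precisely the
research problem described above.

\begin{verbatim}
#include <cstddef>

// Hypothetical per-level knobs; the current framework applies these
// globally.
enum class LayoutPolicy { Leveling, Tiering };

struct LevelConfig {
    LayoutPolicy policy = LayoutPolicy::Leveling;
    std::size_t scale_factor = 8;       // growth relative to the level above
    std::size_t aux_memory_budget = 0;  // bytes for auxiliary structures
};

// Workload statistics the framework could track at runtime.
struct WorkloadStats {
    double inserts_per_second = 0.0;
    double queries_per_second = 0.0;
};

// Invoked when a level is reconstructed anyway: nudge that level's
// configuration toward writes (tiering) or reads (leveling) based on
// the observed workload mix. The 2x threshold is arbitrary.
inline LevelConfig retune(LevelConfig cfg, const WorkloadStats &w) {
    bool write_heavy = w.inserts_per_second > 2.0 * w.queries_per_second;
    cfg.policy = write_heavy ? LayoutPolicy::Tiering : LayoutPolicy::Leveling;
    return cfg;
}
\end{verbatim}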
\section{Alternative Data Partitioning Schemes}

One problem with Bentley-Saxe or LSM-tree derived systems is temporary
spikes in memory usage. When performing a reconstruction, the system
needs enough storage to hold the shards involved in the reconstruction,
as well as the newly constructed shard. This is made worse in the face
of multi-version concurrency, where multiple older versions of shards
may be retained in memory at once. It is well known that, in the worst
case, such a system may temporarily require double its current memory
usage~\cite{dayan22}.

One approach to addressing this problem in LSM-tree based systems is to
adjust the compaction granularity~\cite{dayan22}. In the terminology of
this framework, the idea is to further sub-divide each shard into
smaller chunks, partitioned by key. When a reconstruction is triggered,
rather than reconstructing an entire shard, these smaller partitions can
be used instead: one of the partitions in the source shard is selected
and then merged with the partitions in the next level down that have
overlapping key ranges. The amount of memory required for a
reconstruction (as well as its time cost) can then be controlled by
adjusting the size of these partitions.

Unfortunately, while this approach works very well for LSM-tree based
systems, which store one-dimensional data in sorted arrays, it
encounters problems in the context of a general index. It is not clear
how to effectively partition multi-dimensional data in the same way.
Additionally, in the general case, each partition would need to contain
its own instance of the index, because the framework supports data
structures that do not themselves support effective partitioning in the
way that a simple sorted array does. These challenges will need to be
overcome to devise effective, general partitioning schemes that address
the problems of reconstruction size and memory usage.
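For the one-dimensional case, the partition-granularity reconstruction
described above can be sketched roughly as follows. The
\texttt{Partition} type, the integer keys, and the overlap test are
assumptions made for the sake of illustration; a real partition would
also carry its own instance of the underlying data structure, which is
exactly where the generalization becomes difficult.

\begin{verbatim}
#include <cstdint>
#include <vector>

// A key-bounded chunk of a shard. In the general case it would also
// hold its own (small) instance of the underlying data structure.
struct Partition {
    std::int64_t min_key;
    std::int64_t max_key;
};

// A level's shard, stored as partitions sorted by key range.
using Level = std::vector<Partition>;

// Partition-granularity reconstruction: instead of merging whole
// shards, select one source partition and merge it only with the
// partitions of the next level whose key ranges overlap it. The size
// of that set, plus the source partition, bounds the memory and time
// cost of the reconstruction.
std::vector<const Partition *> overlapping(const Partition &src,
                                           const Level &next_level) {
    std::vector<const Partition *> out;
    for (const Partition &p : next_level) {
        if (p.min_key <= src.max_key && src.min_key <= p.max_key) {
            out.push_back(&p);
        }
    }
    return out;
}
\end{verbatim}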