\chapter{Proposed Work}
\label{chap:proposed}

The previous two chapters described work that has already been completed; however, a number of tasks remain to be done as part of this project. Update support is only one of the important features that an index requires of its data structure. This chapter briefly discusses the remaining research problems, to lay out a set of criteria for project completion.

\section{Concurrency Support}

Database management systems are designed to hide the latency of IO operations, and one of the techniques they use to do so is a high degree of concurrency. As a result, any data structure used to build a database index must also support concurrent updates and queries. The sampling extension framework described in Chapter~\ref{chap:sampling} had basic concurrency support, but work is ongoing to integrate a superior system into the framework of Chapter~\ref{chap:framework}.

Because the framework is based on the Bentley-Saxe method, it has a number of properties that simplify concurrency management. With the exception of the buffer, the vast majority of the data resides in static data structures. When using tombstones, these static structures become fully immutable. This turns concurrency control into a resource management problem, and suggests a simple multi-version concurrency control scheme. Each version of the structure, defined as the state between two reconstructions, is tagged with an epoch number. A query, then, will read only a single epoch, which is preserved in storage until all queries accessing it have terminated. Because the mutable buffer is append-only, a consistent view of it can be obtained by recording the tail of the buffer at the start of query execution. Thus, a fixed snapshot of the index can be represented as a two-tuple containing the epoch number and the buffer tail index.

The major limitation of the Chapter~\ref{chap:sampling} system was its handling of buffer expansion.
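As a rough illustration, the epoch-based snapshot scheme described above could be sketched as follows. This is only a sketch under simplifying assumptions: all names are hypothetical, at most two epochs are retained (as in the Chapter~\ref{chap:sampling} system), and the race between reading the current epoch and registering as a reader is ignored.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch: a fixed snapshot is the pair (epoch, buffer tail).
struct Snapshot {
    size_t epoch;        // version of the static structure levels
    size_t buffer_tail;  // number of buffered records visible to the query
};

class VersionControl {
public:
    // Called at the start of query execution: pin the current epoch so that
    // its shards are retained until the query terminates. (A production
    // implementation must close the race between the load and the
    // registration; this sketch ignores it.)
    Snapshot acquire(size_t buffer_tail) {
        size_t e = current_epoch_.load();
        active_readers_[e % 2].fetch_add(1);
        return {e, buffer_tail};
    }

    // Called when the query terminates; once the reader count for an old
    // epoch reaches zero, its shards can be reclaimed.
    void release(const Snapshot &s) {
        active_readers_[s.epoch % 2].fetch_sub(1);
    }

    // Called after a reconstruction installs a new set of shards.
    void advance_epoch() { current_epoch_.fetch_add(1); }

    size_t readers(size_t epoch) const {
        return active_readers_[epoch % 2].load();
    }

private:
    std::atomic<size_t> current_epoch_{0};
    // Two slots suffice only because this sketch, like the Chapter 4
    // system, permits at most two active epochs at a time.
    std::atomic<size_t> active_readers_[2]{};
};
```

Note that the two-slot reader count is exactly the restriction that the improved scheme aims to lift: with dynamically allocated buffers, an arbitrary number of epochs may need to be tracked.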
While the mutable buffer itself is an unsorted array, and thus supports concurrent inserts using a simple fetch-and-add operation, the real hurdle to insert performance is managing reconstruction. During a reconstruction, the buffer is full and cannot support any new inserts. Because active queries may be using the buffer, it cannot be immediately flushed, and so inserts are blocked. Because of this, it is necessary to use multiple buffers to sustain insertions. When a buffer fills, a background thread performs the reconstruction, and a new buffer is added so that inserts can continue while the reconstruction is in progress.

The solution used in Chapter~\ref{chap:sampling} was limited by its restriction to only two buffers (and, as a result, a maximum of two active epochs at any point in time). Any sustained insertion workload would quickly fill the pair of buffers and then be forced to block until one of the buffers could be emptied. Emptying a buffer was contingent \emph{both} on all queries using the buffer finishing \emph{and} on the reconstruction using that buffer finishing. As a result, the length of the block on inserts could be long (multiple seconds, or even minutes, for particularly large reconstructions) and indeterminate (a given index could be involved in a very long-running query, and the buffer would remain blocked until that query completed).

Thus, a more effective concurrency solution would need to support dynamically adding mutable buffers as needed to maintain insertion throughput. This would allow insertion throughput to be maintained so long as memory for more buffer space is available.\footnote{For the in-memory indexes considered thus far, it isn't clear that running out of memory for buffers is a recoverable error in all cases. The system requires the same amount of memory for storing records (technically more, considering index overhead) in a shard as it does in the buffer. In the case of an external storage system, the calculus would be different, of course.} It would also ensure that a long-running query could only block insertion if there were insufficient memory to create a new buffer or to run a reconstruction. However, as the number of buffered records grows, there is the potential for query performance to suffer, which leads to another important aspect of an effective concurrency control scheme.

\subsection{Tail Latency Control}

The concurrency control scheme discussed thus far maintains insertion throughput by allowing an unbounded portion of the new data to remain buffered in an unsorted fashion. Over time, this buffered data is moved into data structures in the background, as the system performs merges (which are moved off of the critical path for most operations). While this approach allows for fast inserts, it has the potential to damage query performance: the more buffered data there is, the more a query must fall back on its inefficient scan-based buffer path, as opposed to using the data structures. Unfortunately, reconstructions can be quite lengthy (recall that the worst-case scenario involves rebuilding a static structure over all of the records; this is, thankfully, quite rare). This implies that it may be necessary in certain circumstances to throttle insertions to maintain acceptable levels of query performance. Additionally, it may be worth preemptively performing large reconstructions during periods of low utilization, similar to systems like SILK, which is designed to mitigate tail latency spikes in LSM-tree based systems~\cite{balmau19}.

Large reconstructions may also have a negative effect on query performance through system resource utilization: reconstructions can consume a large amount of memory bandwidth, which must be shared with queries.
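Returning to the insert path, the multi-buffer scheme might be sketched as follows. This is a minimal illustration, not the framework's implementation: all names are hypothetical, synchronization of buffer growth itself is elided, and a simple cap on total buffered records stands in for the insertion throttling discussed above (a real system would block or trigger reconstruction rather than reject the insert).

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <memory>
#include <vector>

struct Record { int key; int value; };

// An unsorted, append-only buffer: concurrent inserts claim slots with a
// single fetch-and-add on the tail index.
class Buffer {
public:
    explicit Buffer(size_t capacity) : data_(capacity), tail_(0) {}

    // Returns false if the buffer is already full. The tail may run past
    // the capacity under contention; size() clamps it.
    bool append(const Record &r) {
        size_t slot = tail_.fetch_add(1);
        if (slot >= data_.size()) return false;
        data_[slot] = r;
        return true;
    }

    size_t size() const { return std::min(tail_.load(), data_.size()); }

private:
    std::vector<Record> data_;
    std::atomic<size_t> tail_;
};

class BufferSet {
public:
    BufferSet(size_t buffer_capacity, size_t max_buffered)
        : cap_(buffer_capacity), max_buffered_(max_buffered) {
        buffers_.push_back(std::make_unique<Buffer>(cap_));
    }

    bool insert(const Record &r) {
        // Stand-in for insertion throttling: refuse once too much data
        // is buffered, to bound the query-side scan cost.
        if (total_buffered() >= max_buffered_) return false;
        if (buffers_.back()->append(r)) return true;
        // Buffer full: hand it to a background reconstruction (elided
        // here) and continue inserting into a freshly allocated buffer.
        buffers_.push_back(std::make_unique<Buffer>(cap_));
        return buffers_.back()->append(r);
    }

    size_t total_buffered() const {
        size_t n = 0;
        for (const auto &b : buffers_) n += b->size();
        return n;
    }

private:
    size_t cap_;
    size_t max_buffered_;
    std::vector<std::unique_ptr<Buffer>> buffers_;
};
```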
The effects of parallel reconstruction on query performance will need to be assessed, and strategies for mitigating this effect, whether scheduling-based or resource-throttling, considered if necessary.

\section{Fine-Grained Online Performance Tuning}

The framework has a large number of configurable parameters, and introducing concurrency control will add even more. The parameter sweeps in Section~\ref{ssec:ds-exp} show that there are trade-offs between read and write performance across this space. Unfortunately, the current framework applies these configuration parameters globally, and does not allow them to be changed after the index is constructed. It seems apparent that better performance might be obtained by adjusting this approach.

First, there is nothing preventing these parameters from being configured on a per-level basis. Different levels could use different layout policies (for example, tiering on higher levels and leveling on lower ones), different scale factors, and so on. More index-specific tuning, such as controlling the memory budget for auxiliary structures, could also be considered. This fine-grained tuning will open up an even broader design space, which has the benefit of improving the configurability of the system, but the disadvantage of making configuration more difficult. Additionally, it does nothing to address the problem of workload drift: a configuration may be optimal now, but will it remain effective in the future as the read/write mix of the workload changes?

Both of these challenges can be addressed using dynamic tuning. The idea is that the framework could be augmented with workload and performance statistics tracking. Based on these statistics, the framework could decide, during reconstruction, to adjust the configuration of one or more levels in an online fashion: leaning more towards read or write performance, or dialing back memory budgets as the system's memory usage increases.
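As an illustration of what such per-level, statistics-driven tuning might look like, the following sketch adjusts the layout policy of the upper levels based on the observed read/write mix. All names, thresholds, and the specific tuning rule are hypothetical; the point is only that per-level configuration makes this kind of online adjustment possible.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-level configuration: each level carries its own
// layout policy and scale factor.
enum class Layout { Leveling, Tiering };

struct LevelConfig {
    Layout layout;
    size_t scale_factor;
};

struct WorkloadStats {
    size_t reads = 0;
    size_t writes = 0;
};

// Invoked during reconstruction: if the workload is write-heavy, switch
// the upper levels to tiering (cheaper reconstructions); if read-heavy,
// switch them to leveling (fewer shards for each query to visit). Lower
// levels are left alone in this sketch.
void retune(std::vector<LevelConfig> &levels, const WorkloadStats &w) {
    double total = static_cast<double>(w.reads + w.writes);
    if (total == 0) return;
    double write_frac = static_cast<double>(w.writes) / total;
    for (size_t i = 0; i < levels.size() / 2; i++) {
        levels[i].layout =
            (write_frac > 0.5) ? Layout::Tiering : Layout::Leveling;
    }
}
```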
Additionally, buffer-related parameters could be adjusted in real time as well. If insertion throughput is high, it might be worthwhile to temporarily increase the buffer size, rather than spawning multiple smaller buffers. A system like this would allow for more consistent performance in the face of changing workloads, and would also increase the ease of use of the framework by removing the burden of configuration from the user.

\section{Alternative Data Partitioning Schemes}

One problem with Bentley-Saxe and LSM-tree derived systems is temporary spikes in memory usage. When performing a reconstruction, the system needs enough storage for both the shards involved in the reconstruction and the newly constructed shard. This is made worse in the presence of multi-version concurrency, where multiple older versions of shards may be retained in memory at once. It is well known that, in the worst case, such a system may temporarily require double its current memory usage~\cite{dayan22}.

One approach to addressing this problem in LSM-tree based systems is to adjust the compaction granularity~\cite{dayan22}. In the terminology of this framework, the idea is to further subdivide each shard into smaller chunks, partitioned by key. When a reconstruction is triggered, rather than reconstructing an entire shard, these smaller partitions can be used instead: one of the partitions in the source shard is selected, and then merged with the partitions in the next level down that have overlapping key ranges. The amount of memory required for reconstruction (as well as its time cost) can then be controlled by adjusting these partitions. Unfortunately, while this approach works very well for LSM-tree based systems, which store one-dimensional data in sorted arrays, it encounters problems in the context of a general index. It isn't clear how to effectively partition multi-dimensional data in the same way.
Additionally, in the general case, each partition would need to contain its own instance of the index, as the framework supports data structures that do not themselves admit effective partitioning in the way that a simple sorted array does. These challenges will need to be overcome to devise effective, general data partitioning schemes that address the problems of reconstruction size and memory usage.
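For the one-dimensional case, the partition-selection step described above can be sketched briefly. All names here are hypothetical: a reconstruction picks one source partition and merges it only with the partitions on the target level whose key ranges overlap it, leaving the rest of that level untouched and thereby bounding both reconstruction time and transient memory usage.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical key-range partition of a shard.
struct Partition {
    int min_key;
    int max_key;
    size_t record_count;
};

bool overlaps(const Partition &a, const Partition &b) {
    return a.min_key <= b.max_key && b.min_key <= a.max_key;
}

// Select the indices of the partitions on the target level that must
// participate in a merge with `src`; everything else on that level is
// left in place.
std::vector<size_t> merge_set(const Partition &src,
                              const std::vector<Partition> &target_level) {
    std::vector<size_t> out;
    for (size_t i = 0; i < target_level.size(); i++) {
        if (overlaps(src, target_level[i])) out.push_back(i);
    }
    return out;
}
```

It is exactly this step that fails to generalize: for multi-dimensional data there is no single key order in which "overlapping" partitions are well defined and cheap to locate, which is the core of the open problem.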