\chapter{Related Work} \label{chap:related-work} While we have already discussed, at length, the most directly relevant background work in the area of dynamization in Chapter~\ref{chap:background}, there are a number of other lines of work that are related, either directly or superficially, to our ultimate goal of automating some, or all, of the process of constructing a database index. In this chapter, we will discuss some of the most notable of these works. \section{Existing Applications of Dynamization} We will begin with the most directly relevant topics: other papers which use dynamization to construct specific dynamic data structures. There are a few examples of works which introduce a new data structure and simply apply dynamization to it for the purpose of adding update support. For example, both the PGM learned index~\cite{pgm} and the KDS tree~\cite{xie21} propose static structures, and then apply dynamization techniques to support updates. However, in this section, we will focus on works in which dynamization is a major focus of the paper, not simply an incidental tool. One of the older applications of the Bentley-Saxe method is in the creation of a data structure called the Bkd-tree~\cite{bkd-tree}. This structure is a search tree for multi-dimensional data, based on the kd-tree~\cite{kd-tree}, that has been designed for use in external storage. While it was not the first external kd-tree, existing implementations struggled with support for updates, which were typically inefficient and resulted in node structures that utilized space poorly (i.e., many nodes were mostly empty, resulting in less efficient searches). To resolve these problems, the authors used a statically constructed external structure,\footnote{To clarify, the K-D-B-tree supports updates, but poorly. For the Bkd-tree, the authors exclusively used the K-D-B-tree's static bulk-loading, which is highly efficient. 
} the K-D-B-tree, and combined it with the logarithmic method to create a full-dynamic structure (the K-D-B-tree supports deletes natively, so the problem is deletion decomposable). The major contribution of this paper, per the authors, was not necessarily the structure itself, but the demonstration of the viability of the logarithmic method, as they showed extensively that their dynamized static structure was able to outperform native dynamic implementations on external storage. A more recent application of the logarithmic method is to the Mantis structure for large-scale DNA sequence search~\cite{mantis-dyn}. Mantis~\cite{mantis} is one of the fastest and most space-efficient structures for sequence search, but is static. To create a half-dynamic version of Mantis, the authors first design an algorithm to efficiently merge multiple Mantis structures together (turning their problem into an MDSP, though they don't use this terminology). They then apply a modified version of the logarithmic method, including LSM-tree inspired features like leveling, tiering, and a scale factor. The resulting structure was shown to perform quite well compared to other existing solutions. Another notable work considering dynamization techniques examines applying the logarithmic method to produce full-dynamic versions of various metric indexing structures~\cite{naidan14}. In this paper, the logarithmic method is directly applied to two static metric indexing structures, VPTree and SSS-tree, to create full-dynamic versions (using weak deletes). These dynamized structures are compared to two dynamic baselines, DSA-tree and EGNAT, for both multi-dimensional range scans and $k$-NN searches. The paper contains extensive benchmarks which demonstrate that the dynamized versions perform quite well, even beating the dynamic baselines in query performance under certain circumstances. 
It's worth noting that we also tested a dynamized VPTree in Chapter~\ref{chap:framework} for $k$-NN, and obtained results in line with theirs. Finally, LSMGraph is a recently proposed system which applies dynamization techniques\footnote{ The authors make a point of saying that they are \emph{not} applying dynamization, but instead embedding their structure inside of an LSM-tree, noting the challenges associated with applying dynamization directly to graphs. While the specific techniques they are using are not directly taken from any of the dynamization techniques we discussed in Chapter~\ref{chap:background}, we nonetheless consider this work to be an example of dynamization, at least in principle, because they decompose a static structure into smaller blocks and handle inserts by rebuilding these blocks systematically.} to the compressed sparse row (CSR) matrix representation of graphs to produce a dynamic, external graph storage system~\cite{lsmgraph}. This is a particularly interesting example, because graphs and graph algorithms are \emph{not} remotely decomposable. Adjacent vertices in the graph may be spread across many levels, and this means that graph algorithms cannot be decomposed, as traversals must access adjacent vertices, regardless of which block they are contained within. To resolve this problem, the authors discard the general query model and build a tightly integrated system which uses an index to map each vertex to the block containing it, and implement a vertex-adjacency aware reconstruction process which helps ensure that adjacent vertices are compacted into the same blocks during reconstruction. \section{LSM Tree} The Log-structured Merge-tree (LSM tree)~\cite{oneil96} is a data structure proposed by O'Neil \emph{et al.} in the mid-90s that is designed to optimize for write throughput in external indexing contexts. 
While O'Neil \emph{et al.} never cite any of the dynamization work we have considered in this work, the structure they proposed is eerily similar to a decomposed data structure, and subsequent developments have driven the structure in a direction that looks incredibly similar to Bentley and Saxe's logarithmic method. In fact, several of the examples of dynamization used in the previous section (as well as this work) either borrow concepts from modern LSM trees, or go so far as to use the term ``LSM'' as a synonym for what we call dynamization in this work. However, the work on LSM trees is distinct from dynamization, at least general dynamization, because it leans heavily on very specific aspects of the search problems (point lookup and single-dimensional range search) and data structure in ways that don't generalize well. In this section, we'll discuss a few of the relevant works on LSM trees and attempt to differentiate them from dynamization. \subsection{The Structure} The modern LSM tree is a single-dimensional range data structure that is commonly used in key-value stores such as RocksDB. It consists of a small, dynamic in-memory structure called a memtable, and a sequence of static, external structures on disk of geometrically increasing size. These structures are organized into levels, which can contain either one structure or several, with the former strategy being called leveling and the latter tiering. The individual structures are often simple sorted arrays (with some attached metadata) called runs, which can be further decomposed into smaller files called sorted string tables (SSTs). Records are initially inserted into the memtable. When the memtable fills, it is flushed and its records are merged into the top level of the structure, with reconstructions proceeding according to various merge policies to make room as necessary. 
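As a minimal sketch of this flush-and-merge cycle, consider the following simplified in-memory model using a tiering merge policy. The class and method names here are our own, and real systems such as RocksDB are of course far more sophisticated (external storage, SST files, concurrency, tombstones, and so on):

```python
class LSMTree:
    def __init__(self, memtable_capacity=4, scale_factor=2):
        self.memtable = {}                # small, dynamic in-memory structure
        self.memtable_capacity = memtable_capacity
        self.scale_factor = scale_factor
        self.levels = []                  # each level: a list of runs (sorted (key, value) lists)

    def insert(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_capacity:
            self._flush()

    def _flush(self):
        # sort the memtable's records into a static run and push it into level 0
        run = sorted(self.memtable.items())
        self.memtable = {}
        self._push_run(0, run)

    def _push_run(self, level, run):
        if level == len(self.levels):
            self.levels.append([])
        self.levels[level].append(run)
        # tiering: once a level accumulates scale_factor runs, merge them
        # into a single run and push the result down to the next level
        if len(self.levels[level]) >= self.scale_factor:
            merged = {}
            for r in self.levels[level]:  # oldest run first
                merged.update(r)          # newer runs overwrite older keys
            self.levels[level] = []
            self._push_run(level + 1, sorted(merged.items()))

    def lookup(self, key):
        # a point lookup searches newest to oldest: memtable, then each run
        if key in self.memtable:
            return self.memtable[key]
        for level in self.levels:
            for run in reversed(level):   # newest run in the level first
                for k, v in run:
                    if k == key:
                        return v
        return None
```

The geometric growth of the levels falls out of the merge cascade: each level's runs are twice (more generally, `scale_factor` times) as large as the previous level's.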
LSM trees typically support point lookup and range queries, and answer these by searching all of the runs in the structure and merging the results together. To accelerate point lookups, Bloom filters~\cite{bloom70} are often built over the records in each run to allow skipping of some of the runs that don't contain the key being searched for. Deletes are typically handled using tombstones. \subsection{Design Space} The bulk of work on LSM trees that is of interest to us focuses on the associated design space and performance tuning of the structure. There have been a very large number of papers discussing different ways of decomposing the structure, performing reconstructions, and allocating resources to filters and other auxiliary structures, with goals ranging from reducing resource usage to enabling performance tuning. We'll summarize a few of these works here. One major line of work in LSM trees involves optimizing the allocation of Bloom filter memory to the sorted runs within the structure. Bloom filters are commonly used in LSM trees to accelerate point lookups, because these queries must examine each run, from top to bottom, until a matching key is found. Bloom filters can be used to improve performance by allowing some runs to be skipped over in this searching process. Because LSM trees are an external data structure, the savings from doing this can be quite large. There are a number of works in this area~\cite{dayan18-1, zhu21}, but we will highlight Monkey~\cite{dayan17} specifically. Monkey is a system that optimizes the allocation of Bloom filter memory across the levels of an LSM tree, based on the observation that the worst-case lookup cost (i.e., the cost of a point-lookup on a key that doesn't exist within the LSM tree) is directly proportional to the sum of the false positive rates across all levels in the tree. Thus, memory can be allocated to filters in a way that minimizes this sum. 
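Stated in our own notation (not the paper's), this observation can be sketched as follows, using the standard Bloom filter false-positive approximation, where $p_i$ is the false positive rate of the filter on level $i$, $n_i$ the number of keys on that level, $m_i$ the filter bits allocated to it, and $M$ the total filter memory budget:

```latex
% Expected wasted run probes for a zero-result point lookup
% (leveling, one run per level)
\[
  \mathbb{E}[\text{wasted probes}] \;=\; \sum_{i=1}^{L} p_i,
  \qquad p_i \approx e^{-(m_i / n_i)\ln^2 2}
\]
% Monkey's allocation problem: minimize the sum of false positive
% rates subject to the total memory budget
\[
  \min_{m_1, \dots, m_L} \; \sum_{i=1}^{L} e^{-(m_i / n_i)\ln^2 2}
  \quad \text{subject to} \quad \sum_{i=1}^{L} m_i = M
\]
```

Because the larger levels hold exponentially more keys, this optimization shifts bits-per-key toward the smaller levels, rather than allocating a uniform bits-per-key ratio across the whole tree.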
These works could be useful in the context of dynamization for problems which allow similar optimizations, such as a dynamized structure for point lookups using Bloom filters, or possibly range scans using range filters such as SuRF~\cite{zhang18} or Rosetta~\cite{siqiang20}, but aren't directly applicable to the general problem of dynamization. Other work in LSM trees considers different merge policies. Dostoevsky~\cite{dayan18} introduces two new merge policies, and the so-called Wacky Continuum~\cite{dayan19} introduces a general design space that includes Dostoevsky's policies, the traditional LSM policies, and a new policy called an LSM bush. As it encompasses all of these, we'll exclusively summarize the Wacky Continuum here. Wacky defines a merge policy based on four parameters: the capping ratio ($C$), growth exponential ($X$), merge greed ($K$), and largest-level merge greed ($Z$). The merge greed parameters are used to define the merge threshold, which effectively allows merge policies that sit between leveling and tiering. Each level contains an ``active'' run, into which new records will be merged. Once this run contains a specified fraction of the level's total capacity (determined by the merge greed parameters), a new run will be added to the level and made active. Leveling can be simulated in this model by setting this merge greed parameter to 1, so that 100\% of the level's capacity is allocated to a single run. Tiering is simulated by setting this parameter such that each active run can only sustain a single set of records, and a new run is created each time records are merged into the level. The Wacky Continuum allows configuring the merge greed of the last level independently from the inner levels, to allow it to support lazy leveling, a policy where the largest level in the LSM tree contains a single run, but the inner levels are tiered. 
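As a toy illustration of how merge greed interpolates between these two policies, suppose (our own simplification, not the paper's full model) that a level seals its active run once the run holds a merge-greed fraction of the level's capacity; the number of runs the level sustains then follows directly:

```python
import math

def runs_per_level(merge_greed):
    # the active run is sealed once it reaches merge_greed * capacity,
    # so a level sustains about ceil(1 / merge_greed) runs
    # (a simplification for illustration only)
    return math.ceil(1.0 / merge_greed)

# merge_greed = 1.0: the whole level is a single run (leveling)
# merge_greed = 1/T for size ratio T: T runs per level (tiering)
```

Under this reading, lazy leveling corresponds to setting the inner-level greed $K$ near $1/T$ while fixing the largest-level greed $Z$ at 1.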
The other design parameters are simpler: the capping ratio allows the size ratio of the last level to be varied independently of the inner levels, and the growth exponential allows the size ratio between adjacent levels to grow as the levels get larger. This work also introduces an optimization system for determining good values for all of these parameters for a given workload. It's worth taking a moment to address why we did not consider the Wacky Continuum design space in our attempts to introduce a design space into dynamization. It appears that these concepts would be useful to us, given that we imported the basic leveling and tiering concepts from LSM trees. However, we believe that this particular set of design parameters is not broadly useful outside of the LSM context. The reason for this is shown within the experimental section of the Wacky paper itself. For workloads involving range reads, the standard leveling/tiering designs show perfectly reasonable (and sometimes even superior) performance trade-offs. In large part, the Wacky Continuum work is an extension of the authors' earlier work on Monkey, as it is most effective at improving trade-offs for point-lookup performance, which are strongly influenced by Bloom filters. The range scan results are the ones most closely related to the general dynamization case we have been considering in this work, where filters cannot be assumed and filter memory allocation isn't an important consideration. And, in that case, the new merge policies available within the Wacky Continuum didn't provide enough advantage to be considered here. Another aspect of the LSM tree design space is compaction granularity, which was studied in Spooky~\cite{dayan22}. Because LSM trees are built over sorted runs of data, it is possible to perform partial merges between levels, moving only a small portion of the data from one level to another. 
This can improve write performance by making reconstructions smaller, and also help reduce the transient storage requirements of the structure, but comes at the cost of additional write amplification due to files on the same level being re-written more frequently. The storage benefit comes from the fact that LSM trees require extra working storage to perform reconstructions, and making the reconstructions smaller reduces this space requirement. Spooky is a method for determining effective approaches to performing partial reconstructions. It works by partitioning the largest level into equally sized files, and then dynamically partitioning the other levels in the structure based on the key ranges of the last level's files. This approach is shown to provide a good balance between write-amplification and reconstruction storage requirements. This concept of compaction granularity is another example of an LSM tree specific design element that doesn't generalize well. It could be used in the context of dynamization, but only for single-dimensional sorted data, and so we have not considered it as part of our general techniques. \subsection{Tail Latency Control} LSM trees are susceptible to similar insertion tail latency problems as dynamization. While tail latency can be controlled by using partial reconstructions, such as in Spooky~\cite{dayan22} (though this isn't a focus of that work), there does exist some work specifically on controlling tail latency. One notable result in this area is SILK~\cite{balmau19,silk-plus}, which focuses on reducing tail latency using intelligent scheduling of reconstruction operations. After performing an experimental evaluation of various LSM tree based key-value systems, the authors determine three main principles for designing their tail latency control system: \begin{enumerate} \item I/O bandwidth should be opportunistically allocated to reconstructions. \item Reconstructions at smaller levels in the tree should be prioritized. 
\item Reconstructions at larger levels in the tree should be preemptable by those on lower levels. \end{enumerate} The resulting system prioritizes buffer flushes, which are given dedicated threads and priority access to I/O bandwidth. The next highest priority operations are reconstructions involving levels $0$ and $1$, which must be completed to allow flushes to proceed. These reconstructions are able to preempt any other running compaction if there is not an available thread when one is scheduled. All other reconstructions run with lower priority, and may need to be wholly discarded if a high priority reconstruction invalidates them. The system also includes sophisticated I/O bandwidth controls, as this is a constrained resource in external contexts. Some of the core concepts underlying the SILK system inspired the tail latency control system we have proposed in Chapter~\ref{chap:tail-latency}, but our system is quite distinct from it. SILK leverages some consequences of the LSM tree design space in ways that our system cannot rely upon. For example, SILK uses a two-version buffer (like we do), but is able to allocate enough I/O bandwidth to ensure that one of the buffer versions can be flushed before the other one fills up. Given the constraints of our dynamization system, with an unsorted buffer, this was not possible to do. Additionally, SILK uses partial compactions to reduce the size of reconstructions. These factors let SILK maintain the LSM tree structure without having to resort to insertion throttling, as we do in our system. \section{Automated Index Composition} At the beginning of this work, we described what we see as the three major lines of work that are attempting to partially automate index design. The first of these we discussed was something we called automated index composition. 
This is an approach to index design which uses a series of data structure primitives to compose an index over a set of data that has been optimized for a particular workload. All of these works focus explicitly on single-dimensional data with range scans and point lookups, and only Cosine supports inserting new records. Two closely related papers in this line are the so-called Data Alchemist~\cite{ds-alchemy} and the Periodic Table of Data Structures~\cite{periodic-table}. Both of these works consider automatically designing indexes for single-dimensional range data, capable of addressing point lookups and range scans. The Periodic Table of Data Structures proposes a wide design space for data structures based on creating individual ``nodes'' over the data, each of which has different design decisions applied to it, which are termed ``first principles''. They consider first principles to be any design decision that is irreducible to other decisions, such as the type of partitioning used (range, radix, etc.), whether the data is stored in a row-based or columnar format, and so on. From this model, an index can be designed by iteratively describing the first principles of each node. Given these first principles, and the set of nodes, access algorithms can be automatically devised. The Data Alchemist extends this model of data structures with learned cost models and machine learning based systems for automatically composing an optimal data structure for a given workload. Both of these papers discuss their core ideas, but neither includes an evaluation of a working system. The same authors have two further works in this area that go into more detail on cost models~\cite{data-calc} and explore the design continuum further~\cite{ds-continuum}. GENE~\cite{gene} advances the same basic line of research as the previously mentioned works, but actually includes a functioning, end-to-end index generation system. 
GENE decomposes data structures into a few specific primitives based upon search algorithm and data layout. Specifically, it supports scanning, binary search, interpolation search, exponential search, hashing with closed addressing, and linear regression modeling as search algorithms. The supported data layout parameters include column vs. row orientation, sorted vs. unsorted ordering, compression, and function mapping (which obviates the need to make other layout decisions, and is intended to be used with the linear regression search algorithm). GENE then automatically designs an index based upon these options for a given workload by applying a genetic algorithm. Cosine~\cite{cosine} is another similar system, which has been designed for cloud-based deployments and accounts for cloud SLAs and (monetary) budgeting concerns. It includes sophisticated cost modeling systems that allow it to dynamically adapt the structure of the index, shifting elements of it between an LSM-like structure, a B-tree-like structure, and an LSH-like (log-structured hash) structure, as well as adjusting various configuration parameters associated with each of these primitive components. It is particularly notable for being the only work discussed in this section that supports updates. Finally, fluid data structures~\cite{fluid-ds} represent a more formal work in this area, based upon immutable data primitives. The core idea of this work is that the physical representation of the data can be mutated while maintaining its logical ordering, under certain circumstances. This allows regions of the data structure to shift dynamically to optimize for particular types of search. To accomplish this, the authors define a formal grammar, which they call a compositional organizational grammar (COG), as a description of a data structure, along with the transformation rules that can be applied to a data structure based on this language while ensuring logical equivalence. 
A runtime system, then, can automatically apply these transformations to the structure. This is the only work in this section that discusses generalizing its techniques beyond single-dimensional data. \section{Generalized Index Templates}