 chapters/related-works.tex | 269
 references/references.bib  |  20
2 files changed, 282 insertions, 7 deletions
diff --git a/chapters/related-works.tex b/chapters/related-works.tex
index 7a42003..e828593 100644
--- a/chapters/related-works.tex
+++ b/chapters/related-works.tex
@@ -1,23 +1,278 @@
\chapter{Related Work}
\label{chap:related-work}

-\section{Implementations of Bentley-Saxe}

While we have already discussed the most directly relevant background
work on dynamization at length in Chapter~\ref{chap:background}, there
are a number of other lines of work that are related, either directly or
superficially, to our ultimate goal of automating some, or all, of the
process of constructing a database index. In this chapter, we will
discuss some of the most notable of these works.

-\subsection{Mantis}

\section{Existing Applications of Dynamization}

-\subsection{Metric Indexing Structures}

We will begin with the most directly relevant topic: other papers that
use dynamization to construct specific dynamic data structures. There
are a few works which introduce a new data structure and simply apply
dynamization to it for the purpose of adding update support. For
example, both the PGM learned index~\cite{pgm} and the KDS
tree~\cite{xie21} propose static structures, and then apply dynamization
techniques to support updates. In this section, however, we will focus
on works in which dynamization is a major focus of the paper, not simply
an incidental tool.

-\subsection{LSMGraph}

One of the older applications of the Bentley-Saxe method is in the
creation of a data structure called the Bkd-tree~\cite{bkd-tree}. This
structure is a search tree for multi-dimensional data, based on the
kd-tree~\cite{kd-tree}, that has been designed for use in external
storage.
While it was not the first external kd-tree, existing implementations
struggled to support updates: update operations were typically
inefficient and resulted in node structures that poorly utilized space
(i.e., many nodes were mostly empty, resulting in less efficient
searches). To resolve these problems, the authors took a statically
constructed external structure, the K-D-B-tree,\footnote{To clarify, the
K-D-B-tree does support updates, but poorly. For the Bkd-tree, the
authors exclusively used the K-D-B-tree's static bulk-loading, which is
highly efficient.} and combined it with the logarithmic method to create
a full-dynamic structure (the K-D-B-tree supports deletes natively, so
the problem is deletion decomposable). The major contribution of this
paper, per the authors, was not the structure itself, but the
demonstration of the viability of the logarithmic method: they showed
extensively that their dynamized static structure was able to outperform
natively dynamic implementations on external storage.

-\subsection{PGM}

A more recent application of the logarithmic method is to the Mantis
structure for large-scale DNA sequence search~\cite{mantis-dyn}.
Mantis~\cite{mantis} is one of the fastest and most space-efficient
structures for sequence search, but it is static. To create a
half-dynamic version of Mantis, the authors first design an algorithm to
efficiently merge multiple Mantis structures together (turning their
problem into an MDSP, though they don't use this terminology). They then
apply a modified version of the logarithmic method, including
LSM-tree-inspired features like leveling, tiering, and a scale factor.
The resulting structure was shown to perform quite well compared to
other existing solutions.
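Both of these systems rest on the binary-decomposition scheme of the
logarithmic method. The following is a minimal sketch of that scheme,
assuming a hypothetical mergeable static structure called `StaticIndex`;
it is an illustrative stand-in, not code from either paper:

```python
class StaticIndex:
    """A trivially 'static' structure: built once, queried, never mutated."""
    def __init__(self, records):
        self.records = sorted(records)

    def query(self, key):
        # stand-in for a real search over the static structure
        return key in self.records

    @staticmethod
    def merge(indexes):
        # MDSP-style merge: combine several static structures into one
        merged = []
        for idx in indexes:
            merged.extend(idx.records)
        return StaticIndex(merged)


class LogarithmicMethod:
    """Binary decomposition: level i holds 0 or 1 structure of size 2^i.
    An insert behaves like binary addition, carrying merged structures
    up the levels until an empty slot is found."""
    def __init__(self):
        self.levels = []  # levels[i] is None or a StaticIndex of size 2^i

    def insert(self, record):
        carry = StaticIndex([record])
        for i in range(len(self.levels)):
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            # slot occupied: merge and carry the result up one level
            carry = StaticIndex.merge([carry, self.levels[i]])
            self.levels[i] = None
        self.levels.append(carry)

    def query(self, key):
        # decomposable search problem: query every block, combine with OR
        return any(l.query(key) for l in self.levels if l is not None)
```

After ten inserts the occupied levels mirror the binary representation
of ten (blocks of size 2 and 8). Mantis's dynamization modifies this
basic scheme with a scale factor and leveling/tiering policies, but the
carry-style reconstruction logic is the same in spirit.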
-\subsection{BKD Tree}

Another notable work examines applying the logarithmic method to produce
full-dynamic versions of various metric indexing
structures~\cite{naidan14}. In this paper, the logarithmic method is
directly applied to two static metric indexing structures, VPTree and
SSS-tree, to create full-dynamic versions (using weak deletes). These
dynamized structures are compared to two dynamic baselines, DSA-tree and
EGNAT, for both multi-dimensional range scans and $k$-NN searches. The
paper contains extensive benchmarks demonstrating that the dynamized
versions perform quite well, even beating the dynamic baselines in query
performance under certain circumstances. It's worth noting that we also
tested a dynamized VPTree for $k$-NN in Chapter~\ref{chap:framework},
and obtained results in line with theirs.

Finally, LSMGraph is a recently proposed system which applies
dynamization techniques\footnote{The authors make a point of saying that
they are \emph{not} applying dynamization, but are instead embedding
their structure inside of an LSM tree, noting the challenges associated
with applying dynamization directly to graphs. While the specific
techniques they use are not directly taken from any of the dynamization
techniques we discussed in Chapter~\ref{chap:background}, we nonetheless
consider this work to be an example of dynamization, at least in
principle, because they decompose a static structure into smaller blocks
and handle inserts by rebuilding these blocks systematically.} to the
compressed sparse row (CSR) matrix representation of graphs to produce a
dynamic, external graph storage system~\cite{lsmgraph}. This is a
particularly interesting example, because graphs and graph algorithms
are \emph{not} remotely decomposable.
Adjacent vertices in the graph may be spread across many levels, which
means that graph algorithms cannot be decomposed: traversals must access
adjacent vertices regardless of which block they are contained within.
To resolve this problem, the authors discard the general query model and
build a tightly integrated system, which uses an index to map each
vertex to the block containing it, and implement a
vertex-adjacency-aware reconstruction process that helps ensure adjacent
vertices are compacted into the same blocks during reconstruction.

\section{LSM Tree}

The Log-structured Merge-tree (LSM tree)~\cite{oneil96} is a data
structure proposed by O'Neil \emph{et al.} in the mid-90s, designed to
optimize write throughput in external indexing contexts. While O'Neil
\emph{et al.} never cite any of the dynamization work we have
considered, the structure they proposed is strikingly similar to a
decomposed data structure, and subsequent developments have driven it in
a direction that looks remarkably similar to Bentley and Saxe's
logarithmic method. In fact, several of the examples of dynamization
discussed in the previous section (as well as this work) either borrow
concepts from modern LSM trees, or go so far as to use the term ``LSM''
as a synonym for what we call dynamization. However, the work on LSM
trees is distinct from dynamization, at least from general dynamization,
because it leans heavily on very specific aspects of the search problems
(point lookup and single-dimensional range search) and the data
structure in ways that don't generalize well. In this section, we'll
discuss a few of the relevant works on LSM trees and attempt to
differentiate them from dynamization.

\subsection{The Structure}

The modern LSM tree is a single-dimensional range data structure that is
commonly used in key-value stores such as RocksDB.
It consists of a small, dynamic in-memory structure called a memtable,
and a sequence of static, external structures on disk of geometrically
increasing size. These structures are organized into levels, which can
contain either one structure or several, with the former strategy called
leveling and the latter tiering. The individual structures are often
simple sorted arrays (with some attached metadata) called runs, which
can be further decomposed into smaller files called sorted string tables
(SSTs). Records are initially inserted into the memtable. When the
memtable fills, it is flushed and its records are merged into the first
(smallest) level of the structure, with reconstructions proceeding
according to various merge policies to make room as necessary. LSM trees
typically support point lookups and range queries, and answer these by
searching all of the runs in the structure and merging the results
together. To accelerate point lookups, Bloom filters~\cite{bloom70} are
often built over the records in each run, allowing some of the runs that
don't contain the key being searched for to be skipped. Deletes are
typically handled using tombstones.

\subsection{Design Space}

-\subsection{SILK}

The bulk of the work on LSM trees that is of interest to us focuses on
the structure's design space and performance tuning. A very large number
of papers discuss different ways of decomposing the structure,
performing reconstructions, and allocating resources to filters and
other auxiliary structures, with the goals of optimizing resource usage
and enabling performance tuning. We'll summarize a few of these works
here.

One major line of work involves optimizing the allocation of Bloom
filter memory to the sorted runs within the structure. Bloom filters are
commonly used in LSM trees to accelerate point lookups, because these
queries must examine each run, from top to bottom, until a matching key
is found.
Bloom filters improve performance by allowing some runs to be skipped in
this search process. Because LSM trees are external data structures, the
savings from doing this can be quite large. There are a number of works
in this area~\cite{dayan18-1, zhu21}, but we will highlight
Monkey~\cite{dayan17} specifically. Monkey is a system that optimizes
the allocation of Bloom filter memory across the levels of an LSM tree,
based on the observation that the worst-case lookup cost (i.e., the cost
of a point lookup on a key that doesn't exist within the LSM tree) is
directly proportional to the sum of the false positive rates across all
levels in the tree. Thus, memory can be allocated to filters in a way
that minimizes this sum. These works could be useful in the context of
dynamization for problems which admit similar optimizations, such as a
dynamized structure for point lookups using Bloom filters, or possibly
range scans using range filters such as SuRF~\cite{zhang18} or
Rosetta~\cite{siqiang20}, but they aren't directly applicable to the
general problem of dynamization.

Other work on LSM trees considers different merge policies.
Dostoevsky~\cite{dayan18} introduces two new merge policies, and the
so-called Wacky Continuum~\cite{dayan19} introduces a general design
space that includes Dostoevsky's policies, the traditional LSM policies,
and a new policy called an LSM bush. As it encompasses all of these,
we'll summarize only the Wacky Continuum here. Wacky defines a merge
policy based on four parameters: the capping ratio ($C$), growth
exponential ($X$), merge greed ($K$), and largest-level merge greed
($Z$). The merge greed parameters define the merge threshold, which
effectively allows merge policies that sit between leveling and tiering.
Each level contains an ``active'' run, into which new records will be
merged.
Once this run contains a specified fraction of the level's total
capacity (determined by the merge greed parameters), a new run will be
added to the level and made active. Leveling can be simulated in this
model by setting the merge greed parameter to 1, so that 100\% of the
level's capacity is allocated to a single run. Tiering is simulated by
setting this parameter such that each active run can only hold a single
set of merged records, so a new run is created each time records are
merged into the level. The Wacky Continuum allows the merge greed of the
last level to be configured independently from the inner levels,
allowing it to support lazy leveling, a policy where the largest level
in the LSM tree contains a single run but the inner levels are tiered.
The other design parameters are simpler: the capping ratio allows the
size ratio of the last level to be varied independently of the inner
levels, and the growth exponential allows the size ratio between
adjacent levels to grow as the levels get larger. This work also
introduces an optimization system for determining good values for all of
these parameters for a given workload.

It's worth taking a moment to address why we did not consider the Wacky
Continuum design space in our attempts to introduce a design space into
dynamization. It appears that these concepts would be useful to us,
given that we imported the basic leveling and tiering concepts from LSM
trees. However, we believe that this particular set of design parameters
is not broadly useful outside of the LSM context. The reason for this is
shown in the experimental section of the Wacky paper itself. For
workloads involving range reads, the standard leveling/tiering designs
show perfectly reasonable (and sometimes even superior) performance
trade-offs.
In large part, the Wacky Continuum work is an extension of the authors'
earlier work on Monkey, as it is most effective at improving trade-offs
for point-lookup performance, which is strongly influenced by Bloom
filters. The range scan results are the ones most closely related to the
general dynamization case we have been considering in this work, where
filters cannot be assumed and filter memory allocation isn't an
important consideration. In that case, the new merge policies available
within the Wacky Continuum didn't provide enough advantage to be
considered here.

Another aspect of the LSM tree design space is compaction granularity,
which was studied in Spooky~\cite{dayan22}. Because LSM trees are built
over sorted runs of data, it is possible to perform partial merges
between levels--moving only a small portion of the data from one level
to another. This can improve write performance by making reconstructions
smaller, and can also reduce the transient storage requirements of the
structure, but comes at the cost of additional write amplification, as
files on the same level are re-written more frequently. The storage
benefit comes from the fact that LSM trees require extra working storage
to perform reconstructions, and making reconstructions smaller reduces
this space requirement. Spooky is a method for determining effective
approaches to performing partial reconstructions. It works by
partitioning the largest level into equally sized files, and then
dynamically partitioning the other levels in the structure based on the
key ranges of the last-level files. This approach is shown to provide a
good balance between write amplification and reconstruction storage
requirements. This concept of compaction granularity is another example
of an LSM tree specific design element that doesn't generalize well.
It could be used in the context of dynamization, but only for
single-dimensional sorted data, and so we have not considered it as part
of our general techniques.

\subsection{Tail Latency Control}

LSM trees are susceptible to insertion tail latency problems similar to
those of dynamized structures. While tail latency can be controlled
using partial reconstructions, as in Spooky~\cite{dayan22} (though this
isn't a focus of that work), there is some work specifically targeting
tail latency. One notable result in this area is
SILK~\cite{balmau19,silk-plus}, which focuses on reducing tail latency
through intelligent scheduling of reconstruction operations.

After performing an experimental evaluation of various LSM tree based
key-value systems, the authors determine three main principles for
designing their tail latency control system:
\begin{enumerate}
    \item I/O bandwidth should be opportunistically allocated to
          reconstructions.
    \item Reconstructions at smaller levels in the tree should be
          prioritized.
    \item Reconstructions at larger levels in the tree should be
          preemptable by those at smaller levels.
\end{enumerate}
The resulting system prioritizes buffer flushes, which are given
dedicated threads and priority access to I/O bandwidth. The next highest
priority operations are reconstructions involving levels $0$ and $1$,
which must be completed to allow flushes to proceed. These
reconstructions are able to preempt any other running compaction if
there is not an available thread when one is scheduled. All other
reconstructions run at lower priority, and may need to be wholly
discarded if a higher-priority reconstruction invalidates them. The
system also includes sophisticated I/O bandwidth controls, as this is a
constrained resource in external contexts.
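The priority and preemption behavior described above can be illustrated
with a small sketch. This is our own illustrative model of the
scheduling discipline, not SILK's actual implementation; the class and
job names are hypothetical:

```python
import heapq

# Priority tiers, lower value = higher priority: flushes first, then
# small-level compactions, then large-level compactions.
FLUSH, SMALL_COMPACTION, LARGE_COMPACTION = 0, 1, 2

class Scheduler:
    """Toy SILK-style scheduler: high-priority work may preempt a
    running large-level compaction (principle 3 above)."""
    def __init__(self, threads):
        self.threads = threads
        self.running = []   # (priority, job) pairs currently executing
        self.pending = []   # min-heap of (priority, seq, job)
        self.seq = 0        # tie-breaker so heap entries stay comparable

    def _enqueue(self, priority, job):
        heapq.heappush(self.pending, (priority, self.seq, job))
        self.seq += 1

    def submit(self, priority, job):
        if len(self.running) < self.threads:
            self.running.append((priority, job))
            return
        # all threads busy: preempt the lowest-priority running job,
        # but only if it is a large-level compaction (principle 3)
        victim = max(self.running, key=lambda r: r[0])
        if victim[0] == LARGE_COMPACTION and priority < victim[0]:
            self.running.remove(victim)
            self._enqueue(victim[0], victim[1])  # re-queue preempted job
            self.running.append((priority, job))
        else:
            self._enqueue(priority, job)

    def finish(self, job):
        # job completed: free its thread, dispatch highest-priority waiter
        self.running = [r for r in self.running if r[1] != job]
        if self.pending:
            prio, _, nxt = heapq.heappop(self.pending)
            self.running.append((prio, nxt))
```

With a single worker thread, submitting a flush while a large-level
compaction is running preempts the compaction, and the compaction only
resumes once all higher-priority work has drained.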
Some of the core concepts underlying the SILK system inspired the tail
latency control system we proposed in Chapter~\ref{chap:tail-latency},
but our system is quite distinct from it. SILK leverages certain
consequences of the LSM tree design space that our system cannot rely
upon. For example, SILK uses a two-version buffer (as we do), but is
able to allocate enough I/O bandwidth to ensure that one buffer version
can be flushed before the other fills up. Given the constraints of our
dynamization system, with an unsorted buffer, this was not possible.
Additionally, SILK uses partial compactions to reduce the size of
reconstructions. These factors let SILK maintain the LSM tree structure
without resorting to insertion throttling, as we do in our system.

\section{GiST and GIN}

diff --git a/references/references.bib b/references/references.bib
index 9011fd1..297f67b 100644
--- a/references/references.bib
+++ b/references/references.bib
@@ -165,6 +165,26 @@
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{silk-plus,
author = {Balmau, Oana and Dinu, Florin and Zwaenepoel, Willy and Gupta, Karan and Chandhiramoorthi, Ravishankar and Didona, Diego},
title = {SILK+ Preventing Latency Spikes in Log-Structured Merge Key-Value Stores Running Heterogeneous Workloads},
year = {2020},
issue_date = {November 2018},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {36},
number = {4},
issn = {0734-2071},
url = {https://doi.org/10.1145/3380905},
doi = {10.1145/3380905},
abstract = {Log-Structured Merge Key-Value stores (LSM KVs) are designed to offer good write performance, by capturing client writes in memory, and only later flushing them to storage. Writes are later compacted into a tree-like data structure on disk to improve read performance and to reduce storage space use.
It has been widely documented that compactions severely hamper throughput. Various optimizations have successfully dealt with this problem. These techniques include, among others, rate-limiting flushes and compactions, selecting among compactions for maximum effect, and limiting compactions to the highest level by so-called fragmented LSMs. In this article, we focus on latencies rather than throughput. We first document the fact that LSM KVs exhibit high tail latencies. The techniques that have been proposed for optimizing throughput do not address this issue, and, in fact, in some cases, exacerbate it. The root cause of these high tail latencies is interference between client writes, flushes, and compactions. Another major cause for tail latency is the heterogeneous nature of the workloads in terms of operation mix and item sizes whereby a few more computationally heavy requests slow down the vast majority of smaller requests. We introduce the notion of an Input/Output (I/O) bandwidth scheduler for an LSM-based KV store to reduce tail latency caused by interference of flushing and compactions and by workload heterogeneity. We explore three techniques as part of this I/O scheduler: (1) opportunistically allocating more bandwidth to internal operations during periods of low load, (2) prioritizing flushes and compactions at the lower levels of the tree, and (3) separating client requests by size and by data access path. SILK+ is a new open-source LSM KV that incorporates this notion of an I/O scheduler.},
journal = {ACM Trans. Comput. Syst.},
month = may,
articleno = {12},
numpages = {27},
keywords = {I/O scheduling, log-structured merge key-value stores, tail latency}
}

@inproceedings{afshani17,
author = {Peyman Afshani and Zhewei Wei},