From 5e4ad2777acc4c2420514e39fb98b7cf2e200996 Mon Sep 17 00:00:00 2001 From: Douglas Rumbaugh Date: Sun, 27 Apr 2025 17:36:57 -0400 Subject: Initial commit --- chapters/abstract.tex | 42 ++ chapters/acknowledgments.tex | 2 + chapters/background.tex | 746 ++++++++++++++++++++++++++ chapters/background.tex.bak | 574 ++++++++++++++++++++ chapters/beyond-bsm.tex | 3 + chapters/beyond-dsp.tex | 863 ++++++++++++++++++++++++++++++ chapters/chapter1-old.tex | 256 +++++++++ chapters/chapter1.tex.bak | 204 +++++++ chapters/conclusion.tex | 43 ++ chapters/dynamic-extension-sampling.tex | 22 + chapters/future-work.tex | 174 ++++++ chapters/introduction.tex | 95 ++++ chapters/sigmod23/abstract.tex | 29 + chapters/sigmod23/background.tex | 182 +++++++ chapters/sigmod23/conclusion.tex | 17 + chapters/sigmod23/examples.tex | 143 +++++ chapters/sigmod23/exp-baseline.tex | 98 ++++ chapters/sigmod23/exp-extensions.tex | 40 ++ chapters/sigmod23/exp-parameter-space.tex | 105 ++++ chapters/sigmod23/experiment.tex | 48 ++ chapters/sigmod23/extensions.tex | 57 ++ chapters/sigmod23/framework.tex | 573 ++++++++++++++++++++ chapters/sigmod23/introduction.tex | 20 + chapters/sigmod23/relatedwork.tex | 33 ++ chapters/vita.tex | 0 25 files changed, 4369 insertions(+) create mode 100644 chapters/abstract.tex create mode 100644 chapters/acknowledgments.tex create mode 100644 chapters/background.tex create mode 100644 chapters/background.tex.bak create mode 100644 chapters/beyond-bsm.tex create mode 100644 chapters/beyond-dsp.tex create mode 100644 chapters/chapter1-old.tex create mode 100644 chapters/chapter1.tex.bak create mode 100644 chapters/conclusion.tex create mode 100644 chapters/dynamic-extension-sampling.tex create mode 100644 chapters/future-work.tex create mode 100644 chapters/introduction.tex create mode 100644 chapters/sigmod23/abstract.tex create mode 100644 chapters/sigmod23/background.tex create mode 100644 chapters/sigmod23/conclusion.tex create mode 100644 chapters/sigmod23/examples.tex create mode 100644 chapters/sigmod23/exp-baseline.tex create mode 100644 chapters/sigmod23/exp-extensions.tex create mode 100644 chapters/sigmod23/exp-parameter-space.tex create mode 100644 chapters/sigmod23/experiment.tex create mode 100644 chapters/sigmod23/extensions.tex create mode 100644 chapters/sigmod23/framework.tex create mode 100644 chapters/sigmod23/introduction.tex create mode 100644 chapters/sigmod23/relatedwork.tex create mode 100644 chapters/vita.tex (limited to 'chapters') diff --git a/chapters/abstract.tex b/chapters/abstract.tex new file mode 100644 index 0000000..5ddfd37 --- /dev/null +++ b/chapters/abstract.tex @@ -0,0 +1,42 @@ +Modern data systems must cope with a wider variety of data than ever +before, and as a result we've seen the proliferation of a large number of +highly specialized data management systems, such as vector and graph +databases. These systems are built upon specialized data structures for +a particular query, or class of queries, and as a result have a very +specific range of efficacy. Beyond this, they are difficult to develop +because of the requirements that they place upon the data structures at +their core, including requiring support for concurrent updates. As a +result, a large number of potentially useful data structures are excluded +from use in such systems, or at the very least require a large amount of +development time to be made useful. + +This work seeks to address this difficulty by introducing a framework for +automatic data structure dynamization. 
Given a static data structure and
+an associated query that satisfy certain requirements, this work
+enables support for updates to be added automatically, with
+minimal modification to the data structure itself. It is based on a
+body of theoretical work on dynamization, often called the
+``Bentley-Saxe Method'', which partitions data into a number of small
+data structures, and periodically rebuilds these as records are
+inserted or deleted, in a manner that maintains asymptotic bounds on
+worst case query time, as well as amortized insertion time. These
+techniques, as they currently exist, are limited in usefulness: they
+exhibit poor performance in practice and lack support for concurrency.
+However, they serve as a solid theoretical base upon which a novel
+system can be built to address these concerns.
+
+To develop this framework, sampling queries (which are not well served
+by existing dynamic data structures) are first considered. The results
+of this analysis are then generalized to produce a framework for
+single-threaded dynamization that is applicable to a large number
+of possible data structures and query types, and this general
+framework is evaluated across a number of data structures and query
+types. These dynamized static structures are shown to equal or exceed
+the performance of existing specialized dynamic structures in both
+update and query performance.
+
+Finally, this general framework is expanded with support for concurrent
+operations (inserts and queries), and the use of scheduling and
+parallelism is studied to provide worst-case insertion guarantees,
+as well as a rich trade-off space between query and insertion performance.
+
diff --git a/chapters/acknowledgments.tex b/chapters/acknowledgments.tex
new file mode 100644
index 0000000..c6e25fd
--- /dev/null
+++ b/chapters/acknowledgments.tex
@@ -0,0 +1,2 @@
+And again here--no header, just text.
+
diff --git a/chapters/background.tex b/chapters/background.tex
new file mode 100644
index 0000000..75e2b59
--- /dev/null
+++ b/chapters/background.tex
@@ -0,0 +1,746 @@
+\chapter{Background}
+\label{chap:background}
+
+This chapter will introduce important background information and
+existing work in the area of data structure dynamization. We will
+first discuss the concept of a search problem, which is central to
+dynamization techniques. While one might imagine that the restrictions
+on dynamization would be functions of the data structure to be
+dynamized, in practice the requirements placed on the data structure
+are quite mild; it is the necessary properties of the search problem
+that the data structure is used to address that present the central
+difficulty in applying dynamization techniques in a given area. After
+this, database indices will be discussed briefly. Indices are the
+primary application of data structures within the database context
+that is of interest to our work. Following this, existing theoretical
+results in the area of data structure dynamization will be discussed,
+which will serve as the building blocks for our techniques in
+subsequent chapters. The chapter will conclude with a discussion of
+some of the limitations of these existing techniques.
+
+\section{Queries and Search Problems}
+\label{sec:dsp}
+
+Data access lies at the core of most database systems. We want to ask
+questions of the data, and ideally get the answer efficiently. We
+will refer to the different types of question that can be asked as
+\emph{search problems}.
+We will be using this term in a similar way to how
+the word \emph{query}\footnote{
+    The term query is often abused and used to
+    refer to several related, but slightly different things. In the
+    vernacular, a query can refer to either a) a general type of search
+    problem (as in ``range query''), b) a specific instance of a search
+    problem, or c) a program written in a query language.
+}
+is often used within the database systems literature: to refer to a
+general class of questions. For example, we could consider range scans,
+point-lookups, nearest neighbor searches, predicate filtering, random
+sampling, etc., to each be a general search problem. Formally, for the
+purposes of this work, a search problem is defined as follows,
+
+\begin{definition}[Search Problem]
+    Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and
+    $\mathcal{Q}$, a search problem is a function
+    $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$, where
+    $\mathcal{D}$ represents the domain of data to be searched,
+    $\mathcal{Q}$ represents the domain of query parameters, and
+    $\mathcal{R}$ represents the answer domain.\footnote{
+    It is important to note that it is not required for
+    $\mathcal{R} \subseteq \mathcal{D}$. As an example, a
+    \texttt{COUNT} aggregation might map a set of strings onto an
+    integer. Most common queries do satisfy
+    $\mathcal{R} \subseteq \mathcal{D}$, but this need not be a
+    universal constraint.}
+\end{definition}
+
+We will use the term \emph{query} to mean a specific instance of a search
+problem,
+
+\begin{definition}[Query]
+    Given three multi-sets, $\mathcal{D}$, $\mathcal{R}$, and
+    $\mathcal{Q}$, a search problem $F$, and a specific set of query
+    parameters $q \in \mathcal{Q}$, a query is a specific instance of
+    the search problem, $F(\mathcal{D}, q)$.
+\end{definition}
+
+As an example of using these definitions, a \emph{membership test}
+or \emph{range scan} would be considered search problems, and a range
+scan over the interval $[10, 99]$ would be a query. We draw this
+distinction because, as will become clear in later chapters, it is
+useful to have separate, unambiguous terms for these two concepts.
+
+\subsection{Decomposable Search Problems}
+
+Dynamization techniques require partitioning one data structure
+into several, smaller ones. As a result, these techniques can only
+be applied when the search problem in question can be answered from
+this set of smaller data structures, yielding the same answer as
+would have been obtained had all of the data been used to construct
+a single, large structure. This requirement is formalized in the
+definition of a class of problems called \emph{decomposable search
+problems (DSP)}. This class was first defined by Bentley and Saxe in
+their work on dynamization, and we will adopt their definition,
+
+\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
+    \label{def:dsp}
+    A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$
+    is decomposable if and only if there exists a constant-time
+    computable, associative, and commutative binary operator
+    $\square$ such that,
+    \begin{equation*}
+        F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+    \end{equation*}
+\end{definition}
+
+The requirement for $\square$ to be constant-time was used by Bentley and
+Saxe to prove specific performance bounds for answering queries from a
+decomposed data structure.
+However, it is not strictly \emph{necessary},
+and later work by Overmars lifted this constraint and considered a more
+general class of search problems called \emph{$C(n)$-decomposable search
+problems},
+
+\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}]
+    A search problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$
+    is $C(n)$-decomposable if and only if there exists an
+    $O(C(n))$-time computable, associative, and commutative binary
+    operator $\square$ such that,
+    \begin{equation*}
+        F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+    \end{equation*}
+\end{definition}
+
+To demonstrate that a search problem is decomposable, it is necessary to
+show the existence of a merge operator, $\square$, with the necessary
+properties, and to show that $F(A \cup B, q) = F(A, q)~ \square ~F(B,
+q)$. With these two results, induction demonstrates that the problem is
+decomposable even in cases with more than two partial results.
+
+As an example, consider range counts,
+\begin{definition}[Range Count]
+    Let $d$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
+    $ q = [x, y],\quad x,y \in \mathbb{R}$, a range count returns
+    the cardinality, $|d \cap q|$.
+\end{definition}
+
+\begin{theorem}
+Range Count is a decomposable search problem.
+\end{theorem}
+
+\begin{proof}
+Let $\square$ be addition ($+$). Applying this to
+Definition~\ref{def:dsp}, gives
+\begin{align*}
+    |(A \cup B) \cap q| = |(A \cap q)| + |(B \cap q)|
+\end{align*}
+which is true by the distributive property of intersection over union,
+together with the fact that the partitions $A$ and $B$ are disjoint
+(and so no record is counted twice). Addition is an associative and
+commutative operator that can be calculated in $O(1)$ time. Therefore,
+range counts are DSPs.
+\end{proof}
+
+Because the codomain of a DSP is not restricted, more complex output
+structures can be used to allow for problems that are not directly
+decomposable to be converted to DSPs, possibly with some minor
+post-processing. For example, calculating the arithmetic mean of a set
+of numbers can be formulated as a DSP,
+\begin{theorem}
+The calculation of the arithmetic mean of a set of numbers is a DSP.
+\end{theorem}
+\begin{proof}
+    Consider the search problem $A:\mathcal{D} \to (\mathbb{R},
+    \mathbb{Z})$ (which takes no query parameters), where
+    $\mathcal{D}\subset\mathbb{R}$ and is a multi-set. The output
+    tuple contains the sum of the values within the input set, and
+    the cardinality of the input set. For two disjoint partitions of
+    the data, $D_1$ and $D_2$, let $A(D_1) = (s_1, c_1)$ and
+    $A(D_2) = (s_2, c_2)$. Let
+    $A(D_1) \square A(D_2) = (s_1 + s_2, c_1 + c_2)$.
+
+Applying Definition~\ref{def:dsp}, gives
+\begin{align*}
+    A(D_1 \cup D_2) &= A(D_1)\square A(D_2) \\
+    (s_1 + s_2, c_1 + c_2) &= (s_1 + s_2, c_1 + c_2) = (s, c)
+\end{align*}
+From this result, the average can be determined in constant time by
+taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
+of numbers is a DSP.
+\end{proof}
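+
+To make the mechanics of the merge operator concrete, the following
+C++ sketch expresses the arithmetic mean computation as a DSP: a
+local query produces a $(s, c)$ pair for each partition, the partial
+results are combined with a constant-time merge, and the mean is
+recovered in a final post-processing step. The type and function
+names here are illustrative only, and are not drawn from any
+particular system.
+
+\begin{verbatim}
+#include <cstddef>
+#include <vector>
+
+// Partial result for the "mean" search problem: a (sum, count) pair.
+struct MeanResult {
+    double sum = 0.0;
+    std::size_t count = 0;
+};
+
+// Local query: evaluate the search problem against one partition.
+MeanResult local_mean(const std::vector<double> &partition) {
+    MeanResult r;
+    for (double v : partition) {
+        r.sum += v;
+        r.count += 1;
+    }
+    return r;
+}
+
+// The merge operator: constant time, commutative, and associative.
+MeanResult merge(const MeanResult &a, const MeanResult &b) {
+    return {a.sum + b.sum, a.count + b.count};
+}
+
+// Post-processing: recover the mean from the merged partial result.
+double finalize(const MeanResult &r) {
+    return r.sum / static_cast<double>(r.count);
+}
+\end{verbatim}
+
+Answering the query over a partitioned dataset then amounts to folding
+\texttt{merge} over the local results and applying \texttt{finalize}
+to the total.
+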
+\section{Database Indexes}
+\label{sec:indexes}
+
+Within a database system, search problems are expressed using
+some high level language (or mapped directly to commands, for
+simpler systems like key-value stores), which is processed by
+the database system to produce a result. Within many database
+systems, the most basic access primitive is a table scan, which
+sequentially examines each record within the data set. There are,
+however, many situations in which the same query could be answered in
+less time using a more sophisticated data access scheme, and databases
+support a limited number of such schemes through the use of specialized
+data structures called \emph{indices} (or indexes). Indices can be
+built over a set of attributes in a table and provide faster access
+for particular search problems.
+
+The term \emph{index} is often abused within the database community
+to refer to a range of closely related, but distinct, conceptual
+categories.\footnote{
+The word index can be used to refer to a structure mapping record
+information to the set of records matching that information, as a
+general synonym for ``data structure'', to data structures used
+specifically in query processing, etc.
+}
+This ambiguity is rarely problematic, as the subtle differences between
+these categories are not often significant, and context clarifies the
+intended meaning in situations where they are. However, this work
+explicitly operates at the interface of two of these categories, and so
+it is important to disambiguate between them.
+
+\subsection{The Classical Index}
+
+A database index is a specialized data structure that provides a means
+to efficiently locate records that satisfy specific criteria. This
+enables more efficient query processing for supported search problems. A
+classical index can be modeled as a function, mapping a set of attribute
+values, called a key, $\mathcal{K}$, to a set of record identifiers,
+$\mathcal{R}$. The codomain of an index can be either the set of
+record identifiers, a set containing sets of record identifiers, or
+the set of physical records, depending upon the configuration of the
+index.~\cite{cowbook} For our purposes here, we'll focus on the first of
+these, but the use of other codomains wouldn't have any material effect
+on our discussion.
+
+We will use the following definition of a ``classical'' database index,
+
+\begin{definition}[Classical Index~\cite{cowbook}]
+Consider a set of database records, $\mathcal{D}$. An index over
+these records, $\mathcal{I}_\mathcal{D}$ is a map of the form
+    $\mathcal{I}_\mathcal{D}:(\mathcal{K}, \mathcal{D}) \to \mathcal{R}$, where
+$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$,
+called a \emph{key}.
+\end{definition}
+
+In order to facilitate this mapping, indexes are built using data
+structures. The selection of data structure has implications for the
+performance of the index, and the types of search problem it can be
+used to accelerate. Broadly speaking, classical indices can be divided
+into two categories: ordered and unordered. Ordered indices allow for
+iteration over a set of record identifiers in a particular sorted
+order of keys, and the efficient location of a specific key value in
+that order. These indices can be used to accelerate range scans and
+point-lookups. Unordered indices are specialized for point-lookups on a
+particular key value, and do not support iterating over records in some
+order.~\cite{cowbook, mysql-btree-hash}
+
+There is a very small set of data structures that are usually used for
+creating classical indexes. For ordered indices, the most commonly used
+data structure is the B-tree~\cite{ubiq-btree},\footnote{
+    By \emph{B-tree} here, we are referring not to the B-tree data
+    structure alone, but to a wide range of related structures derived
+    from the B-tree. Examples include the B$^+$-tree,
+    B$^\epsilon$-tree, etc.
+}
+and the log-structured merge (LSM) tree~\cite{oneil96} is also often
+used within the context of key-value stores~\cite{rocksdb}. Some databases
+implement unordered indices using hash tables~\cite{mysql-btree-hash}.
+
+\subsection{The Generalized Index}
+
+The previous section discussed the classical definition of index
+as might be found in a database systems textbook. However, this
+definition is limited by its association specifically with mapping
+key fields to records. For the purposes of this work, a broader
+definition of index will be considered,
+
+\begin{definition}[Generalized Index]
+Consider a set of database records, $\mathcal{D}$, and search
+problem, $\mathcal{Q}$.
+A generalized index, $\mathcal{I}_\mathcal{D}$
+is a map of the form $\mathcal{I}_\mathcal{D}:(\mathcal{Q}, \mathcal{D}) \to
+\mathcal{R}$.
+\end{definition}
+
+A classical index is a special case of a generalized index, with $\mathcal{Q}$
+being a point-lookup or range scan based on a set of record attributes.
+
+There are a number of generalized indexes that appear in some database systems.
+For example, some specialized databases or database extensions have support for
+indexes based on the R-tree\footnote{ Like the B-tree, R-tree here is used as a
+signifier for a general class of related data structures.} for spatial
+databases~\cite{postgis-doc, ubiq-rtree} or hierarchical navigable small world
+graphs for similarity search~\cite{pinecone-db}, among others. These systems
+are typically either an add-on module, or a specialized standalone database
+that has been designed specifically for answering particular types of queries
+(such as spatial queries, similarity search, string matching, etc.).
+
+%\subsection{Indexes in Query Processing}
+
+%A database management system utilizes indexes to accelerate certain
+%types of query. Queries are expressed to the system in some high
+%level language, such as SQL or Datalog. These are generalized
+%languages capable of expressing a wide range of possible queries.
+%The DBMS is then responsible for converting these queries into a
+%set of primitive data access procedures that are supported by the
+%underlying storage engine. There are a variety of techniques for
+%this, including mapping directly to a tree of relational algebra
+%operators and interpreting that tree, query compilation, etc. But,
+%ultimately, this internal query representation is limited by the routines
+%supported by the storage engine.~\cite{cowbook}
+
+%As an example, consider the following SQL query (representing a
+%2-dimensional k-nearest neighbor problem)\footnote{There are more efficient
+%ways of answering this query, but I'm aiming for simplicity here
+%to demonstrate my point},
+%
+%\begin{verbatim}
+%SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A
+%    WHERE A.property = filtering_criterion
+%    ORDER BY d
+%    LIMIT 5;
+%\end{verbatim}
+%
+%This query will be translated into a logical query plan (a sequence
+%of relational algebra operators) by the query planner, which could
+%result in a plan like this,
+%
+%\begin{verbatim}
+%query plan here
+%\end{verbatim}
+%
+%With this logical query plan, the DBMS will next need to determine
+%which supported operations it can use to most efficiently answer
+%this query. For example, the selection operation (A) could be
+%physically manifested as a table scan, or could be answered using
+%an index scan if there is an ordered index over \texttt{A.property}.
+%The query optimizer will make this decision based on its estimate +%of the selectivity of the predicate. This may result in one of the +%following physical query plans +% +%\begin{verbatim} +%physical query plan +%\end{verbatim} +% +%In either case, however, the space of possible physical plans is +%limited by the available access methods: either a sorted scan on +%an attribute (index) or an unsorted scan (table scan). The database +%must filter for all elements matching the filtering criterion, +%calculate the distances between all of these points and the query, +%and then sort the results to get the final answer. Additionally, +%note that the sort operation in the plan is a pipeline-breaker. If +%this plan were to appear as a sub-tree in a larger query plan, the +%overall plan would need to wait for the full evaluation of this +%sub-query before it could proceed, as sorting requires the full +%result set. +% +%Imagine a world where a new index was available to the DBMS: a +%nearest neighbor index. This index would allow the iteration over +%records in sorted order, relative to some predefined metric and a +%query point. If such an index existed over \texttt{(A.x, A.y)} using +%\texttt{dist}, then a third physical plan would be available to the DBMS, +% +%\begin{verbatim} +%\end{verbatim} +% +%This plan pulls records in order of their distance to \texttt{Q} +%directly, using an index, and then filters them, avoiding the +%pipeline breaking sort operation. While it's not obvious in this +%case that this new plan is superior (this would depend upon the +%selectivity of the predicate), it is a third option. It becomes +%increasingly superior as the selectivity of the predicate grows, +%and is clearly superior in the case where the predicate has unit +%selectivity (requiring only the consideration of $5$ records total). +% +%This use of query-specific indexing schemes presents a query +%optimization challenge: how does the database know when a particular +%specialized index can be used for a given query, and how can +%specialized indexes broadcast their capabilities to the query optimizer +%in a general fashion? This work is focused on the problem of enabling +%the existence of such indexes, rather than facilitating their use; +%however these are important questions that must be considered in +%future work for this solution to be viable. There has been work +%done surrounding the use of arbitrary indexes in queries in the past, +%such as~\cite{byods-datalog}. This problem is considered out-of-scope +%for the proposed work, but will be considered in the future. + +\section{Classical Dynamization Techniques} + +Because data in a database is regularly updated, data structures +intended to be used as an index must support updates (inserts, in-place +modification, and deletes). Not all potentially useful data structures +support updates, and so a general strategy for adding update support +would increase the number of data structures that could be used as +database indices. We refer to a data structure with update support as +\emph{dynamic}, and one without update support as \emph{static}.\footnote{ + + The term static is distinct from immutable. Static refers to the + layout of records within the data structure, whereas immutable + refers to the data stored within those records. This distinction + will become relevant when we discuss different techniques for adding + delete support to data structures. 
The data structures used are
+    always static, but not necessarily immutable, because the records may
+    contain header information (like visibility) that is updated in place.
+}
+
+This section discusses \emph{dynamization}, the construction of a dynamic
+data structure based on an existing static one. When certain conditions
+are satisfied by the data structure and its associated search problem,
+this process can be done automatically, and with provable asymptotic
+bounds on amortized insertion performance, as well as worst case query
+performance. We will first discuss the necessary data structure
+requirements, and then examine several classical dynamization techniques.
+The section will conclude with a discussion of delete support within the
+context of these techniques.
+
+\subsection{Global Reconstruction}
+
+The most fundamental dynamization technique is that of \emph{global
+reconstruction}. While not particularly useful on its own, global
+reconstruction serves as the basis for the techniques to follow, and so
+we will begin our discussion of dynamization with it.
+
+Consider a class of data structure, $\mathcal{I}$, capable of answering a
+search problem, $\mathcal{Q}$. Insertion via global reconstruction is
+possible if $\mathcal{I}$ supports the following two operations,
+\begin{align*}
+\mathtt{build} : \mathcal{PS}(\mathcal{D})& \to \mathcal{I} \\
+\mathtt{unbuild} : \mathcal{I}& \to \mathcal{PS}(\mathcal{D})
+\end{align*}
+where $\mathtt{build}$ constructs an instance $\mathscr{i}\in\mathcal{I}$
+of the data structure over a set of records $d \subseteq \mathcal{D}$
+in $C(|d|)$ time, and $\mathtt{unbuild}$ returns the set of records $d
+\subseteq \mathcal{D}$ used to construct $\mathscr{i} \in \mathcal{I}$ in
+$\Theta(1)$ time,\footnote{
+    There isn't any practical reason why $\mathtt{unbuild}$ must run
+    in constant time, but this is the assumption made in \cite{saxe79}
+    and in subsequent work based on it, and so we will follow the same
+    definition here.
+} such that $\mathscr{i} = \mathtt{build}(\mathtt{unbuild}(\mathscr{i}))$.
+Given these two operations, an update can be performed by unbuilding the
+structure, modifying the resulting record set, and building a new
+structure from it, at a cost of $O(C(n))$ per update.
+
+\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method}
+\label{ssec:bsm}
+
+Another approach to support updates is to amortize the cost of
+global reconstruction over multiple updates. This approach can
+take three forms,
+\begin{enumerate}
+
+    \item Pairing a dynamic data structure (called a buffer or
+    memtable) with an instance of the structure being extended.
+    Updates are written to the buffer, and when the buffer is
+    full its records are merged with those in the static
+    structure, and the structure is rebuilt. This approach is
+    used by one version of the originally proposed
+    LSM-tree~\cite{oneil96}. Technically, this technique was
+    proposed in that work for the purpose of converting random
+    writes into sequential ones (all structures involved are
+    dynamic), but it can be used for dynamization as well. A
+    minimal sketch of this buffering approach appears after
+    this list.
+
+    \item Creating multiple, smaller data structures each
+    containing a partition of the records from the dataset, and
+    reconstructing individual structures to accommodate new
+    inserts in a systematic manner. This technique is the basis
+    of the Bentley-Saxe method~\cite{saxe79}.
+
+    \item Using both of the above techniques at once. This is
+    the approach used by modern incarnations of the
+    LSM-tree~\cite{rocksdb}.
+
+\end{enumerate}
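+
+The following C++ sketch illustrates the first approach: a dynamic
+insertion buffer in front of a static structure, flushed via global
+reconstruction once it fills. The \texttt{DS} interface (a static
+\texttt{build} from a vector of records, and an \texttt{unbuild} back
+to records) is assumed to mirror the operations defined in the
+previous subsection; this is an illustrative sketch, not the
+implementation of any particular system.
+
+\begin{verbatim}
+#include <cstddef>
+#include <vector>
+
+// DS is assumed to expose: static DS build(const std::vector<R>&)
+// and std::vector<R> unbuild(), and to be default-constructible
+// as an empty structure.
+template <typename R, typename DS>
+class BufferedIndex {
+public:
+    explicit BufferedIndex(std::size_t buffer_cap) : cap(buffer_cap) {}
+
+    void insert(const R &rec) {
+        buffer.push_back(rec);
+        if (buffer.size() >= cap) {
+            flush();
+        }
+    }
+
+private:
+    void flush() {
+        // Merge the buffered records with those of the static
+        // structure and rebuild it (global reconstruction).
+        std::vector<R> all = ds.unbuild();
+        all.insert(all.end(), buffer.begin(), buffer.end());
+        ds = DS::build(all);
+        buffer.clear();
+    }
+
+    std::size_t cap;
+    std::vector<R> buffer;
+    DS ds;
+};
+\end{verbatim}
+
+Note that a query against such a wrapper must be answered from both
+the buffer and the static structure, with the partial results combined
+using $\square$; this is one reason the associated search problem must
+be decomposable.
+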
+
+In all three cases, it is necessary for the search problem associated
+with the index to be a DSP, as answering it will require querying
+multiple structures (the buffer and/or one or more instances of the
+data structure) and merging the results together to get a final
+result. This section will focus exclusively on the Bentley-Saxe
+method, as it is the basis for the proposed methodology.
+
+When dividing records across multiple structures, there is a clear
+trade-off between read performance and write performance. Keeping
+the individual structures small reduces the cost of reconstruction,
+and thereby increases update performance. However, this also means
+that more structures will be required to accommodate the same number
+of records, when compared to a scheme that allows the structures
+to be larger. As each structure must be queried independently, this
+will lead to worse query performance. The reverse is also true:
+fewer, larger structures will have better query performance and
+worse update performance, with the extreme limit of this being a
+single structure that is fully rebuilt on each insert.
+
+The key insight of the Bentley-Saxe method~\cite{saxe79} is that a
+good balance can be struck by using a geometrically increasing
+structure size. In Bentley-Saxe, the sub-structures are ``stacked'',
+with the base level having a capacity of a single record, and
+each subsequent level doubling in capacity. When an update is
+performed, the first empty level is located and a reconstruction
+is triggered, merging the structures of all levels below this empty
+one, along with the new record. The merit of this approach is
+that it ensures that ``most'' reconstructions involve the smaller
+data structures towards the bottom of the sequence, while most of
+the records reside in large, infrequently updated, structures towards
+the top. This balances the read and write implications of
+structure size, while also allowing the number of structures required
+to represent $n$ records to be worst-case bounded by $O(\log n)$.
+
+Given a structure and DSP with $P(n)$ construction cost and $Q_S(n)$
+query cost, the Bentley-Saxe Method will produce a dynamic data
+structure with,
+
+\begin{align}
+    \text{Query Cost} \qquad & O\left(Q_S(n) \cdot \log n\right) \\
+    \text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right)
+\end{align}
+
+In the case of a $C(n)$-decomposable problem, the query cost grows to
+\begin{equation}
+    O\left((Q_S(n) + C(n)) \cdot \log n\right)
+\end{equation}
+
+While the Bentley-Saxe method manages to maintain good performance in
+terms of \emph{amortized} insertion cost, it has poor worst-case
+performance. If the entire structure is full, it must grow by another
+level, requiring a full reconstruction involving every record within
+the structure. A slight adjustment to the technique, due to Overmars
+and van Leeuwen~\cite{overmars81}, allows the worst-case insertion
+cost to be bounded by $O\left(\frac{P(n)}{n} \log n\right)$; however,
+it does so by dividing each reconstruction into small pieces, one of
+which is executed each time a new update occurs. This has the effect
+of bounding the worst-case performance, but does so by sacrificing the
+expected case performance, and adds a lot of complexity to the method.
+This technique is not used much in practice.\footnote{
+    We've yet to find any example of it used in a journal article
+    or conference paper.
+}
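+
+Concretely, the basic Bentley-Saxe insertion procedure described above
+(prior to any worst-case optimization) can be sketched as follows,
+reusing the assumed \texttt{build}/\texttt{unbuild} interface from the
+earlier sketch; the representation of levels and all error handling
+are simplified for illustration.
+
+\begin{verbatim}
+#include <cstddef>
+#include <memory>
+#include <vector>
+
+// Level i holds either nothing, or a static structure built over
+// exactly 2^i records.
+template <typename R, typename DS>
+class BentleySaxe {
+public:
+    void insert(const R &rec) {
+        std::vector<R> accum{rec};
+        std::size_t i = 0;
+        // Locate the first empty level, unbuilding every full level
+        // below it and accumulating their records.
+        while (i < levels.size() && levels[i]) {
+            std::vector<R> recs = levels[i]->unbuild();
+            accum.insert(accum.end(), recs.begin(), recs.end());
+            levels[i].reset();
+            i++;
+        }
+        if (i == levels.size()) {
+            levels.emplace_back();  // grow by one level
+        }
+        // Rebuild a single structure of size 2^i at the empty level.
+        levels[i] = std::make_unique<DS>(DS::build(accum));
+    }
+
+private:
+    std::vector<std::unique_ptr<DS>> levels;
+};
+\end{verbatim}
+
+A query is answered by evaluating the local query against each
+non-empty level and folding the partial results together with
+$\square$.
+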
+
+\section{Limitations of the Bentley-Saxe Method}
+\label{sec:bsm-limits}
+
+While fairly general, the Bentley-Saxe method has a number of limitations. Because
+of the way in which it merges query results together, the number of search problems
+to which it can be efficiently applied is limited. Additionally, the method does not
+expose any trade-off space to configure the structure: it is one-size-fits-all.
+
+\subsection{Limits of Decomposability}
+\label{ssec:decomp-limits}
+Unfortunately, the DSP abstraction used as the basis of the Bentley-Saxe
+method has a few significant limitations that must first be overcome
+before it can be used for the purposes of this work. At a high level,
+these limitations are as follows,
+
+\begin{itemize}
+    \item Each local query must be oblivious to the state of every partition,
+          aside from the one it is directly running against. Further,
+          Bentley-Saxe provides no facility for accessing cross-block state
+          or performing multiple query passes against each partition.
+
+    \item The result merge operation must be $O(1)$ to maintain good query
+          performance.
+
+    \item The result merge operation must be commutative and associative,
+          and is called repeatedly to merge pairs of results.
+\end{itemize}
+
+These requirements restrict the types of queries that can be supported by
+the method efficiently. For example, k-nearest neighbor and independent
+range sampling are not decomposable.
+
+\subsubsection{k-Nearest Neighbor}
+\label{sssec-decomp-limits-knn}
+The k-nearest neighbor (KNN) problem is a generalization of the nearest
+neighbor problem, which seeks to return the closest point within the
+dataset to a given query point. More formally, this can be defined as,
+\begin{definition}[Nearest Neighbor]
+
+    Let $D$ be a set of $n>0$ points in $\mathbb{R}^d$ and $f(x, y)$
+    be some function $f: D^2 \to \mathbb{R}^+$ representing the distance
+    between two points within $D$. The nearest neighbor problem, $NN(D,
+    q)$ returns some $d \in D$ having $\min_{d \in D} \{f(d, q)\}$
+    for some query point, $q \in \mathbb{R}^d$.
+
+\end{definition}
+
+In practice, it is common to require $f(x, y)$ be a metric,\footnote{
+    Contrary to its vernacular usage as a synonym for ``distance'', a
+    metric is more formally defined as a valid distance function over
+    a metric space. Metric spaces require their distance functions to
+    have the following properties,
+    \begin{itemize}
+        \item The distance between a point and itself is always 0.
+        \item All distances between non-equal points must be positive.
+        \item For all points, $x, y \in D$, it is true that
+              $f(x, y) = f(y, x)$.
+        \item For any three points $x, y, z \in D$ it is true that
+              $f(x, z) \leq f(x, y) + f(y, z)$.
+    \end{itemize}
+
+    These distances also must have the interpretation that $f(x, y) <
+    f(x, z)$ means that $y$ is ``closer'' to $x$ than $z$ is to $x$. This
+    is the opposite of the definition of similarity, and so some minor
+    manipulations are usually required to make similarity measures work
+    in metric-based indexes.~\cite{intro-analysis}
+}
+and this will be done in the examples of indexes for addressing
+this problem in this work, but it is not a fundamental aspect of the problem
+formulation. The nearest neighbor problem itself is decomposable, with
+a simple merge function that returns whichever of its two inputs has
+the smaller value of $f(x, q)$~\cite{saxe79}.
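+
+For instance, assuming Euclidean distance over two-dimensional points,
+this constant-time merge operator can be written as in the following
+sketch; the \texttt{Point} type and \texttt{dist} function are
+illustrative assumptions.
+
+\begin{verbatim}
+#include <cmath>
+
+struct Point { double x, y; };
+
+// Illustrative distance metric: Euclidean distance in the plane.
+double dist(const Point &a, const Point &b) {
+    return std::hypot(a.x - b.x, a.y - b.y);
+}
+
+// Constant-time merge for nearest neighbor: keep whichever partial
+// result is closer to the query point q.
+Point nn_merge(const Point &a, const Point &b, const Point &q) {
+    return dist(a, q) <= dist(b, q) ? a : b;
+}
+\end{verbatim}
+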
+ +The k-nearest neighbor problem generalizes nearest-neighbor to return +the $k$ nearest elements, +\begin{definition}[k-Nearest Neighbor] + + Let $D$ be a set of $n \geq k$ points in $\mathbb{R}^d$ and $f(x, y)$ + be some function $f: D^2 \to \mathbb{R}^+$ representing the distance + between two points within $D$. The k-nearest neighbor problem, + $KNN(D, q, k)$ seeks to identify a set $R\subset D$ with $|R| = k$ + such that $\forall d \in D - R, r \in R, f(d, q) \geq f(r, q)$. + +\end{definition} + +This can be thought of as solving the nearest-neighbor problem $k$ times, +each time removing the returned result from $D$ prior to solving the +problem again. Unlike the single nearest-neighbor case (which can be +thought of as KNN with $k=1$), this problem is \emph{not} decomposable. + +\begin{theorem} + KNN is not a decomposable search problem. +\end{theorem} + +\begin{proof} +To prove this, consider the query $KNN(D, q, k)$ against some partitioned +dataset $D = D_0 \cup D_1 \ldots \cup D_\ell$. If KNN is decomposable, +then there must exist some constant-time, commutative, and associative +binary operator $\square$, such that $R = \square_{0 \leq i \leq l} +R_i$ where $R_i$ is the result of evaluating the query $KNN(D_i, q, +k)$. Consider the evaluation of the merge operator against two arbitrary +result sets, $R = R_i \square R_j$. It is clear that $|R| = |R_i| = +|R_j| = k$, and that the contents of $R$ must be the $k$ records from +$R_i \cup R_j$ that are nearest to $q$. Thus, $\square$ must solve the +problem $KNN(R_i \cup R_j, q, k)$. However, KNN cannot be solved in $O(1)$ +time. Therefore, KNN is not a decomposable search problem. +\end{proof} + +With that said, it is clear that there isn't any fundamental restriction +preventing the merging of the result sets; +it is only the case that an +arbitrary performance requirement wouldn't be satisfied. It is possible +to merge the result sets in non-constant time, and so it is the case that +KNN is $C(n)$-decomposable. Unfortunately, this classification brings with +it a reduction in query performance as a result of the way result merges are +performed in Bentley-Saxe. + +As a concrete example of these costs, consider using Bentley-Saxe to +extend the VPTree~\cite{vptree}. The VPTree is a static, metric index capable of +answering KNN queries in $KNN(D, q, k) \in O(k \log n)$. One possible +merge algorithm for KNN would be to push all of the elements in the two +arguments onto a min-heap, and then pop off the first $k$. In this case, +the cost of the merge operation would be $C(k) = k \log k$. Were $k$ assumed +to be constant, then the operation could be considered to be constant-time. +But given that $k$ is only bounded in size above +by $n$, this isn't a safe assumption to make in general. Evaluating the +total query cost for the extended structure, this would yield, + +\begin{equation} + KNN(D, q, k) \in O\left(k\log n \left(\log n + \log k\right) \right) +\end{equation} + +The reason for this large increase in cost is the repeated application +of the merge operator. The Bentley-Saxe method requires applying the +merge operator in a binary fashion to each partial result, multiplying +its cost by a factor of $\log n$. 
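+
+For reference, the heap-based merge just described might be sketched
+as follows; its $O(k \log k)$ running time is what prevents KNN from
+meeting the constant-time merge requirement. The \texttt{Point} type
+and \texttt{dist} function are again illustrative assumptions.
+
+\begin{verbatim}
+#include <cmath>
+#include <cstddef>
+#include <queue>
+#include <vector>
+
+struct Point { double x, y; };
+
+double dist(const Point &a, const Point &b) {
+    return std::hypot(a.x - b.x, a.y - b.y);
+}
+
+// C(n)-time merge for KNN: combine two partial result sets and keep
+// the k candidates nearest to the query point q.
+std::vector<Point> knn_merge(const std::vector<Point> &a,
+                             const std::vector<Point> &b,
+                             const Point &q, std::size_t k) {
+    auto cmp = [&q](const Point &l, const Point &r) {
+        return dist(l, q) > dist(r, q);  // min-heap on distance
+    };
+    std::priority_queue<Point, std::vector<Point>, decltype(cmp)>
+        heap(cmp);
+    for (const Point &p : a) heap.push(p);
+    for (const Point &p : b) heap.push(p);
+
+    std::vector<Point> result;
+    while (!heap.empty() && result.size() < k) {
+        result.push_back(heap.top());
+        heap.pop();
+    }
+    return result;
+}
+\end{verbatim}
+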
+Thus, the constant-time requirement
+of standard decomposability is necessary to keep the cost of the merge
+operator from appearing within the complexity bound of the entire
+operation in the general case.\footnote{
+    There is a special case, noted by Overmars, where the total cost is
+    $O(Q(n) + C(n))$, without the logarithmic term, when $(Q(n) + C(n))
+    \in \Omega(n^\epsilon)$ for some $\epsilon >0$. This accounts for the
+    case where the cost of the query and merge operation are sufficiently
+    large to consume the logarithmic factor, and so it doesn't represent
+    a special case with better performance.
+}
+If the result merging operation could be revised to remove this
+duplicated cost, the cost of supporting $C(n)$-decomposable queries
+could be greatly reduced.
+
+\subsubsection{Independent Range Sampling}
+
+Another problem that is not decomposable is independent sampling. There
+are a variety of problems falling under this umbrella, including weighted
+set sampling, simple random sampling, and weighted independent range
+sampling, but this section will focus on independent range sampling.
+
+\begin{definition}[Independent Range Sampling~\cite{tao22}]
+    Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
+    interval $q = [x, y]$ and an integer $k$, an independent range
+    sampling query returns $k$ independent samples from $D \cap q$
+    with each point having equal probability of being sampled.
+\end{definition}
+
+This problem immediately encounters a category error when considering
+whether it is decomposable: the result set is randomized, whereas the
+conditions for decomposability are defined in terms of an exact matching
+of records in result sets. To work around this, a slight abuse of definition
+is in order:
+assume that the equality conditions within the DSP definition can
+be interpreted to mean ``the contents of the two sets are drawn from the
+same distribution''. This enables the category of DSP to apply to this type
+of problem. More formally,
+\begin{definition}[Decomposable Sampling Problem]
+    A sampling problem $F: (\mathcal{D}, \mathcal{Q}) \to \mathcal{R}$
+    is decomposable if and only if there exists a constant-time
+    computable, associative, and commutative binary operator $\square$
+    such that,
+    \begin{equation*}
+        F(A \cup B, q) \sim F(A, q)~ \square ~F(B, q)
+    \end{equation*}
+\end{definition}
+
+Even with this abuse, however, IRS cannot generally be considered decomposable;
+it is at best $C(n)$-decomposable. The reason for this is that matching the
+distribution requires drawing the appropriate number of samples from each
+partition of the data. Even in the special case that $|D_0| = |D_1| = \ldots =
+|D_\ell|$, the number of samples from each partition that must appear in the
+result set cannot be known in advance due to differences in the selectivity
+of the predicate across the partitions.
+
+\begin{example}[IRS Sampling Difficulties]
+
+    Consider three partitions of data, $D_0 = \{1, 2, 3, 4, 5\}, D_1 =
+    \{1, 1, 1, 1, 3\}, D_2 = \{4, 4, 4, 4, 4\}$ using bag semantics and
+    an IRS query over the interval $[3, 4]$ with $k=12$. Because all three
+    partitions have the same size, it seems sensible to evenly distribute
+    the samples across them ($4$ samples from each partition). Applying
+    the query predicate to the partitions results in the following,
+    $d_0 = \{3, 4\}, d_1 = \{3 \}, d_2 = \{4, 4, 4, 4\}$.
+
+    In expectation, then, the first result set will contain $R_0 = \{3,
+    3, 4, 4\}$, as it has a 50\% chance of sampling a $3$ and the same
+    probability of sampling a $4$.
The second and third result sets can only
+    be $\{3, 3, 3, 3\}$ and $\{4, 4, 4, 4\}$ respectively. Merging these
+    together, we'd find that the probability distribution of the sample
+    would be $p(3) = 0.5$ and $p(4) = 0.5$. However, were we to perform
+    the same sampling operation over the full dataset (not partitioned),
+    the distribution would be $p(3) = 0.25$ and $p(4) = 0.75$.
+
+\end{example}
+
+The problem is that the number of samples drawn from each partition needs to be
+weighted based on the number of elements satisfying the query predicate in that
+partition. In the above example, by drawing $4$ samples from $D_1$, more weight
+is given to $3$ than exists within the base dataset. This can be worked around
+by sampling a full $k$ records from each partition, returning both the sample
+and the number of records satisfying the predicate as that partition's query
+result, and then performing another pass of IRS as the merge operator, but this
+is the same approach as was used for KNN above. This leaves IRS firmly in the
+$C(n)$-decomposable camp. If it were possible to pre-calculate the number of
+samples to draw from each partition, then a constant-time merge operation could
+be used.
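+
+For instance, if the number of records satisfying the predicate in
+each partition were known (say, from a cheap range count against each
+partition), the per-partition sample allocation could be pre-computed
+as in the following hypothetical sketch, after which each partition
+can be sampled independently and the results simply concatenated.
+
+\begin{verbatim}
+#include <cstddef>
+#include <random>
+#include <vector>
+
+// Pre-compute how many of the k samples to draw from each partition,
+// weighting each partition by the number of its records that satisfy
+// the query predicate. Assumes at least one record matches overall.
+std::vector<std::size_t>
+allocate_samples(const std::vector<std::size_t> &counts,
+                 std::size_t k, std::mt19937 &rng) {
+    std::vector<std::size_t> alloc(counts.size(), 0);
+    // Each sample independently selects a partition with probability
+    // proportional to that partition's matching-record count.
+    std::discrete_distribution<std::size_t>
+        pick(counts.begin(), counts.end());
+    for (std::size_t i = 0; i < k; i++) {
+        alloc[pick(rng)] += 1;
+    }
+    return alloc;
+}
+\end{verbatim}
+
+Applied to the example above, the matching-record counts $(2, 1, 5)$
+weight the partitions so that the overall sample follows the correct
+distribution of $p(3) = 0.25$ and $p(4) = 0.75$.
+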
+
+\section{Conclusion}
+This chapter discussed the necessary background information pertaining to
+queries and search problems, indexes, and techniques for dynamic extension. It
+described the potential for using custom indexes to accelerate particular
+kinds of queries, as well as the challenges associated with constructing these
+indexes. The remainder of this document will seek to address these challenges
+through modification and extension of the Bentley-Saxe method, describing work
+that has already been completed, as well as the additional work that must be
+done to realize this vision.
diff --git a/chapters/background.tex.bak b/chapters/background.tex.bak
new file mode 100644
index 0000000..d57b370
--- /dev/null
+++ b/chapters/background.tex.bak
@@ -0,0 +1,574 @@
+\chapter{Background}
+
+This chapter will introduce important background information that
+will be used throughout the remainder of the document. We'll first
+define precisely what is meant by a query, and consider some special
+classes of query that will become relevant in our discussion of dynamic
+extension. We'll then consider the difference between a static and a
+dynamic structure, and techniques for converting static structures into
+dynamic ones in a variety of circumstances.
+
+\section{Database Indexes}
+
+The term \emph{index} is often abused within the database community
+to refer to a range of closely related, but distinct, conceptual
+categories.\footnote{
+The word index can be used to refer to a structure mapping record
+information to the set of records matching that information, as a
+general synonym for ``data structure'', to data structures used
+specifically in query processing, etc.
+}
+This ambiguity is rarely problematic, as the subtle differences
+between these categories are not often significant, and context
+clarifies the intended meaning in situations where they are.
+However, this work explicitly operates at the interface of two of
+these categories, and so it is important to disambiguate between
+them. As a result, we will be using the word index to
+refer to a very specific structure.
+
+\subsection{The Traditional Index}
+A database index is a specialized structure which provides a means
+to efficiently locate records that satisfy specific criteria. This
+enables more efficient query processing for supported queries. A
+traditional database index can be modeled as a function, mapping a
+set of attribute values, called a key, $\mathcal{K}$, to a set of
+record identifiers, $\mathcal{R}$. Technically, the codomain of an
+index can be either a record identifier, a set of record identifiers,
+or the physical record itself, depending upon the configuration of
+the index. For the purposes of this work, the focus will be on the
+first of these, but in principle any of the three index types could
+be used with little material difference to the discussion.
+
+Formally speaking, we will use the following definition of a traditional
+database index,
+\begin{definition}[Traditional Index]
+Consider a set of database records, $\mathcal{D}$. An index over
+these records, $\mathcal{I}_\mathcal{D}$ is a map of the form
+$F:(\mathcal{I}_\mathcal{D}, \mathcal{K}) \to \mathcal{R}$, where
+$\mathcal{K}$ is a set of attributes of the records in $\mathcal{D}$,
+called a \emph{key}.
+\end{definition}
+
+In order to facilitate this mapping, indexes are built using data
+structures. The specific data structure used has particular
+implications for the performance of the index, and the situations
+in which the index is effective. Broadly speaking, traditional
+database indexes can be categorized in two ways: ordered indexes
+and unordered indexes. The former of these allows for iteration
+over the set of record identifiers in some sorted order, starting
+at the returned record. The latter allows for point-lookups only.
+
+There is a very small set of data structures that are usually used
+for creating database indexes. The most common range index in RDBMSs
+is the B-tree\footnote{ By \emph{B-tree} here, I am referring not
+to the B-tree data structure, but to a wide range of related structures
+derived from the B-tree. Examples include the B$^+$-tree,
+B$^\epsilon$-tree, etc. } based index, and key-value stores commonly
+use indices built on the LSM-tree. Some databases support unordered
+indexes using hash tables. Beyond these, some specialized databases or
+database extensions have support for indexes based on other structures,
+such as the R-tree\footnote{
+Like the B-tree, R-tree here is used as a signifier for a general class
+of related data structures.} for spatial databases or approximate small
+world graph models for similarity search.
+
+\subsection{The Generalized Index}
+
+The previous section discussed the traditional definition of index
+as might be found in a database systems textbook. However, this
+definition is limited by its association specifically with mapping
+key fields to records. For the purposes of this work, I will be
+considering a slightly broader definition of index,
+
+\begin{definition}[Generalized Index]
+Consider a set of database records, $\mathcal{D}$ and a search
+problem, $\mathcal{Q}$. A generalized index, $\mathcal{I}_\mathcal{D}$
+is a map of the form $F:(\mathcal{I}_\mathcal{D}, \mathcal{Q}) \to
+\mathcal{R}$.
+\end{definition}
+
+\emph{Search problems} are the topic of the next section, but in
+brief a search problem represents a general class of query, such
+as range scan, point lookup, k-nearest neighbor, etc. A traditional
+index is a special case of a generalized index, with $\mathcal{Q}$
+being a point-lookup or range query based on a set of record
+attributes.
+
+\subsection{Indices in Query Processing}
+
+A database management system utilizes indices to accelerate certain
+types of query.
Queries are expressed to the system in some high
+level language, such as SQL or Datalog. These are generalized
+languages capable of expressing a wide range of possible queries.
+The DBMS is then responsible for converting these queries into a
+set of primitive data access procedures that are supported by the
+underlying storage engine. There are a variety of techniques for
+this, including mapping directly to a tree of relational algebra
+operators and interpreting that tree, query compilation, etc. But,
+ultimately, the expressiveness of this internal query representation
+is limited by the routines supported by the storage engine.
+
+As an example, consider the following SQL query (representing a
+2-dimensional k-nearest neighbor)\footnote{There are more efficient
+ways of answering this query, but I'm aiming for simplicity here
+to demonstrate my point},
+
+\begin{verbatim}
+SELECT dist(A.x, A.y, Qx, Qy) as d, A.key FROM A
+    WHERE A.property = filtering_criterion
+    ORDER BY d
+    LIMIT 5;
+\end{verbatim}
+
+This query will be translated into a logical query plan (a sequence
+of relational algebra operators) by the query planner, which could
+result in a plan like this,
+
+\begin{verbatim}
+query plan here
+\end{verbatim}
+
+With this logical query plan, the DBMS will next need to determine
+which supported operations it can use to most efficiently answer
+this query. For example, the selection operation (A) could be
+physically manifested as a table scan, or could be answered using
+an index scan if there is an ordered index over \texttt{A.property}.
+The query optimizer will make this decision based on its estimate
+of the selectivity of the predicate. This may result in one of the
+following physical query plans
+
+\begin{verbatim}
+physical query plan
+\end{verbatim}
+
+In either case, however, the space of possible physical plans is
+limited by the available access methods: either a sorted scan on
+an attribute (index) or an unsorted scan (table scan). The database
+must filter for all elements matching the filtering criterion,
+calculate the distances between all of these points and the query,
+and then sort the results to get the final answer. Additionally,
+note that the sort operation in the plan is a pipeline-breaker. If
+this plan were to appear as a subtree in a larger query plan, the
+overall plan would need to wait for the full evaluation of this
+sub-query before it could proceed, as sorting requires the full
+result set.
+
+Imagine a world where a new index was available to our DBMS: a
+nearest neighbor index. This index would allow the iteration over
+records in sorted order, relative to some predefined metric and a
+query point. If such an index existed over \texttt{(A.x, A.y)} using
+\texttt{dist}, then a third physical plan would be available to the DBMS,
+
+\begin{verbatim}
+\end{verbatim}
+
+This plan pulls records in order of their distance to \texttt{Q}
+directly, using an index, and then filters them, avoiding the
+pipeline breaking sort operation. While it's not obvious in this
+case that this new plan is superior (this would depend a lot on the
+selectivity of the predicate), it is a third option. It becomes
+increasingly superior as the selectivity of the predicate grows,
+and is clearly superior in the case where the predicate has unit
+selectivity (requiring only the consideration of $5$ records total).
+The construction of this special index will be considered in
+Section~\ref{ssec:knn}.
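+
+Such a nearest neighbor index might expose an iterator-style access
+method to the executor, along the lines of the following hypothetical
+C++ sketch, in which records are produced in increasing order of
+distance to the query point so that a plan can stop after consuming
+only as many records as it needs. The interface and type names are
+purely illustrative.
+
+\begin{verbatim}
+#include <optional>
+
+struct Point { double x, y; };
+
+// Hypothetical access-method interface for a nearest neighbor
+// index: next() yields records in increasing distance order.
+class NNIndexScan {
+public:
+    virtual ~NNIndexScan() = default;
+
+    // Position the scan at a query point.
+    virtual void open(Point q) = 0;
+
+    // Return the next-nearest record, or nothing when exhausted.
+    virtual std::optional<Point> next() = 0;
+};
+\end{verbatim}
+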
+
+This use of query-specific indexing schemes also presents a query
+planning challenge: how does the database know when a particular
+specialized index can be used for a given query, and how can
+specialized indexes broadcast their capabilities to the query planner
+in a general fashion? This work is focused on the problem of enabling
+the existence of such indexes, rather than facilitating their use;
+however, these are important questions that must be considered in
+future work for this solution to be viable. There has been work
+done surrounding the use of arbitrary indexes in queries in the past,
+such as~\cite{byods-datalog}. This problem is considered out-of-scope
+for the proposed work, but will be considered in the future.
+
+\section{Queries and Search Problems}
+
+In our discussion of generalized indexes, we encountered \emph{search
+problems}. A search problem is a term used within the literature
+on data structures in a manner similar to how the database community
+sometimes uses the term query\footnote{
+Like with the term index, the term query is often abused and used to
+refer to several related, but slightly different things. In the vernacular,
+a query can refer to either a) a general type of search problem (as in
+``range query''), b) a specific instance of a search problem, or c) a
+program written in a query language.
+}, to refer to a general
+class of questions asked of data. Examples include range queries,
+point-lookups, nearest neighbor queries, predicate filtering, random
+sampling, etc. Formally, for the purposes of this work, we will define
+a search problem as follows,
+\begin{definition}[Search Problem]
+Given three multisets, $D$, $R$, and $Q$, a search problem is a function
+$F: (D, Q) \to R$, where $D$ represents the domain of data to be searched,
+$Q$ represents the domain of query parameters, and $R$ represents the
+answer domain.
+\footnote{
+It is important to note that it is not required for $R \subseteq D$. As an
+example, a \texttt{COUNT} aggregation might map a set of strings onto
+an integer. Most common queries do satisfy $R \subseteq D$, but this need
+not be a universal constraint.
+}
+\end{definition}
+
+And we will use the word \emph{query} to refer to a specific instance
+of a search problem, except when used as part of the generally
+accepted name of a search problem (i.e., range query).
+
+\begin{definition}[Query]
+Given three multisets, $D$, $R$, and $Q$, a search problem $F$ and
+a specific set of query parameters $q \in Q$, a query is a specific
+instance of the search problem, $F(D, q)$.
+\end{definition}
+
+As an example of using these definitions, a \emph{membership test}
+or \emph{range query} would be considered search problems, and a
+range query over the interval $[10, 99]$ would be a query.
+
+\subsection{Decomposable Search Problems}
+
+An important subset of search problems is that of decomposable
+search problems (DSPs). This class was first defined by Saxe and
+Bentley as follows,
+
+\begin{definition}[Decomposable Search Problem~\cite{saxe79}]
+    \label{def:dsp}
+    Given a search problem $F: (D, Q) \to R$, $F$ is decomposable if and
+    only if there exists a constant-time computable, associative, and
+    commutative binary operator $\square$ such that,
+    \begin{equation*}
+        F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+    \end{equation*}
+\end{definition}
+
+The constant-time requirement was used to prove bounds on the costs of
+evaluating DSPs over data broken across multiple partitions.
Further work
+by Overmars lifted this constraint and considered a more general class
+of DSP,
+\begin{definition}[$C(n)$-decomposable Search Problem~\cite{overmars83}]
+    Given a search problem $F: (D, Q) \to R$, $F$ is $C(n)$-decomposable
+    if and only if there exists an $O(C(n))$-time computable, associative,
+    and commutative binary operator $\square$ such that,
+    \begin{equation*}
+        F(A \cup B, q) = F(A, q)~ \square ~F(B, q)
+    \end{equation*}
+\end{definition}
+
+Decomposability is an important property because it allows for
+search problems to be answered over partitioned datasets. The details
+of this will be discussed in Section~\ref{ssec:bentley-saxe} in the
+context of creating dynamic data structures. Many common types of
+search problems appearing in databases are decomposable, such as
+range queries or predicate filtering.
+
+To demonstrate that a search problem is decomposable, it is necessary
+to show the existence of the merge operator, $\square$, and to show
+that $F(A \cup B, q) = F(A, q)~ \square ~F(B, q)$. With these two
+results, simple induction demonstrates that the problem is decomposable
+even in cases with more than two partial results.
+
+As an example, consider range queries,
+\begin{definition}[Range Query]
+Let $D$ be a set of $n$ points in $\mathbb{R}$. Given an interval,
+$ q = [x, y],\quad x,y \in \mathbb{R}$, a range query returns all points in
+$D \cap q$.
+\end{definition}
+
+\begin{theorem}
+Range Queries are a DSP.
+\end{theorem}
+
+\begin{proof}
+Let $\square$ be the set union operator ($\cup$). Applying this to
+Definition~\ref{def:dsp}, we have
+\begin{align*}
+    (A \cup B) \cap q = (A \cap q) \cup (B \cap q)
+\end{align*}
+which is true by the distributive property of set union and
+intersection. Assuming an implementation allowing for an $O(1)$
+set union operation, range queries are DSPs.
+\end{proof}
+
+Because the codomain of a DSP is not restricted, more complex output
+structures can be used to allow for problems that are not directly
+decomposable to be converted to DSPs, possibly with some minor
+post-processing. For example, the calculation of the mean of a set
+of numbers can be constructed as a DSP using the following technique,
+\begin{theorem}
+The calculation of the average of a set of numbers is a DSP.
+\end{theorem}
+\begin{proof}
+Define the search problem as $A:D \to (\mathbb{R}, \mathbb{Z})$,
+where $D\subset\mathbb{R}$ and is a multiset. The output tuple
+contains the sum of the values within the input set, and the
+cardinality of the input set. Let $A(D_1) = (s_1, c_1)$ and
+$A(D_2) = (s_2, c_2)$. Then, define $A(D_1)\square A(D_2) = (s_1 +
+s_2, c_1 + c_2)$.
+
+Applying Definition~\ref{def:dsp}, we have
+\begin{align*}
+    A(D_1 \cup D_2) &= A(D_1)\square A(D_2) \\
+    (s_1 + s_2, c_1 + c_2) &= (s_1 + s_2, c_1 + c_2) = (s, c)
+\end{align*}
+From this result, the average can be determined in constant time by
+taking $\nicefrac{s}{c}$. Therefore, calculating the average of a set
+of numbers is a DSP.
+\end{proof}
+
+\section{Dynamic Extension Techniques}
+
+Because data in a database is regularly updated, data structures
+intended to be used as an index must support updates (inserts,
+in-place modification, and deletes) to their data. In principle,
+any data structure can support updates to its underlying data through
+global reconstruction: adjusting the record set and then rebuilding
+the entire structure.
Setting aside this trivial (and highly inefficient)
approach, a data structure with support for updates is called
\emph{dynamic}, and one without support for updates is called
\emph{static}. In this section, we discuss approaches for modifying
a static data structure to grant it support for updates, a process
called \emph{dynamic extension} or \emph{dynamization}. A theoretical
survey of this topic can be found in~\cite{overmars83}, but that
work does not cover several techniques that are used in practice.
As such, much of this section constitutes our own analysis, tying
together threads from a variety of sources.

\subsection{Local Reconstruction}

One way of viewing updates to a data structure is as reconstructing
all or part of the structure. To minimize the cost of an update,
it is ideal to minimize the size of the reconstruction that accompanies
it, either by carefully structuring the data to ensure
minimal disruption to surrounding records by an update, or by
deferring the reconstructions and amortizing their costs over as
many updates as possible.

While minimizing the size of a reconstruction seems the most obvious,
and best, approach, it is limited in its applicability. The more
related ``nearby'' records in the structure are, the more records
will be affected by a change. Records can be related in terms of
some ordering of their values, which we'll term a \emph{spatial
ordering}, or in terms of their order of insertion into the structure,
which we'll term a \emph{temporal ordering}. Note that these terms
don't imply anything about the nature of the data, and instead
relate to the principles used by the data structure to arrange the
records.

Arrays provide the extreme version of both of these ordering
principles. In an unsorted array, in which records are appended to
the end of the array, there is no spatial ordering dependence between
records. This means that any insert or update will require no local
reconstruction, aside from the record being directly affected.\footnote{
A delete can also be performed without any structural adjustments
in a variety of ways. Reorganization of the array as a result of
deleted records serves an efficiency purpose, but isn't required
for the correctness of the structure. } However, the order of
records in the array \emph{does} express a strong temporal dependency:
the index of a record in the array provides the exact insertion
order.

A sorted array provides exactly the opposite situation. The position
of a record in the array reflects an exact spatial ordering of
records with respect to their sorting function. This means that an
update or insert will require reordering a large number of records
(potentially all of them, in the worst case). Because of the stronger
spatial dependence of records in the structure, an update will
require a larger-scale reconstruction. Additionally, there is no
temporal component to the ordering of the records: inserting a set
of records into a sorted array will produce the same final structure
irrespective of insertion order.

It's worth noting that the spatial dependency discussed here, as
it relates to reconstruction costs, is based on the physical layout
of the records and not their logical ordering. To exemplify
this, a sorted singly-linked list can maintain the same logical
order of records as a sorted array, but limits the spatial dependence
between records to each record's preceding node. This means that an
insert into this structure requires only a single node update,
regardless of where in the structure the insert occurs, as the
sketch below illustrates.
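The following fragment (ours, for illustration only) assumes the
predecessor of the insertion point has already been located,

\begin{verbatim}
// A record's only spatial dependence is on its predecessor, so an
// insert touches exactly one existing node, wherever it lands.
struct Node {
    int key;
    Node *next;
};

void insert_after(Node *pred, Node *fresh) {
    fresh->next = pred->next;  // new node adopts the successor
    pred->next = fresh;        // the single local update
}
\end{verbatim}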
The amount of spatial dependence in a structure directly reflects
a trade-off between read and write performance. In the above example,
performing a lookup for a given record in a sorted array requires
asymptotically fewer comparisons in the worst case than in an unsorted
array, because the spatial dependencies can be exploited for an
accelerated search (binary vs. linear search). Interestingly, this
remains the case for lookups against a sorted array vs. a sorted
linked list. Even though both structures have the same logical order
of records, the limited spatial dependencies between nodes in a linked
list force the lookup to perform a scan anyway.

A balanced binary tree sits between these two extremes. Like a
linked list, individual nodes have very few connections. However,
the nodes are arranged in such a way that a connection existing
between two nodes implies further information about the ordering
of the children of those nodes. In this light, rebalancing of the tree
can be seen as maintaining a certain degree of spatial dependence
between the nodes in the tree, ensuring that it is balanced between
the two children of each node. A very general summary of tree
rebalancing techniques can be found in~\cite{overmars83}. Using an
AVL tree~\cite{avl} as a specific example, each insert in the tree
involves adding the new node and updating its parent (as in a
simple linked list), followed by some larger-scale local
reconstruction in the form of tree rotations, to maintain the balance
factor invariant. This means that insertion requires more reconstruction
effort than the single pointer update in the linked list case, but
results in much more efficient searches (which, as it turns out,
makes insertion more efficient in general too, even with the overhead,
because finding the insertion point is much faster).

\subsection{Amortized Local Reconstruction}

In addition to controlling update cost by arranging the structure so
as to reduce the amount of reconstruction necessary to maintain the
desired level of spatial dependence, update costs can also be reduced
by amortizing the local reconstruction cost over multiple updates.
This is often done in one of two ways: leaving gaps or adding
overflow buckets. These gaps and buckets allow the data structure
to sustain a buffer of insertion capacity before a reconstruction
is triggered.

A classic example of the gap approach is found in the
B$^+$-tree~\cite{b+tree} commonly used in RDBMS indexes, as well
as in open addressing for hash tables. In a B$^+$-tree, each node has
a fixed size, which must be at least half-utilized (aside from the
root node). The empty spaces within these nodes are gaps, which can
be cheaply filled with new records on insert. Only when a node has
been filled must a local reconstruction (called a structural
modification operation, or SMO, for B-trees) occur to redistribute
the data into multiple nodes and replenish the supply of gaps. This
approach is particularly well suited to data structures in contexts
where the natural unit of storage is larger than a record, as in
disk-based (with 4KiB pages) or cache-optimized (with 64B cachelines)
structures. This gap-based approach was also used to create ALEX, an
updatable learned index~\cite{ALEX}.
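The sketch below (our own; the capacity and split trigger are
illustrative, rather than being tied to any particular B$^+$-tree
implementation) shows the essence of gap-based insertion into a
fixed-capacity sorted node,

\begin{verbatim}
#include <algorithm>
#include <cstddef>
#include <vector>

// A fixed-capacity sorted node that absorbs inserts into its
// spare slots.
class GappedNode {
    static constexpr std::size_t capacity = 64;
    std::vector<int> keys;  // sorted; keys.size() <= capacity
public:
    // Returns false when the gaps are exhausted, signalling that
    // the caller must perform a local reconstruction (an SMO).
    bool insert(int key) {
        if (keys.size() == capacity)
            return false;
        keys.insert(std::upper_bound(keys.begin(), keys.end(), key),
                    key);
        return true;  // cheap insert into an existing gap
    }
};
\end{verbatim}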
The gap approach has a number of disadvantages. It results in a
somewhat sparse structure, thereby wasting storage. For example, a
B$^+$-tree requires all nodes other than the root to be at least
half full--meaning that, in the worst case, up to half of the space
required by the structure could be taken up by gaps. Additionally, this
scheme results in some inserts being more expensive than others:
most new records will occupy an available gap, but some will trigger
more expensive SMOs. In particular, it has been observed with
B$^+$-trees that this can lead to ``waves of misery''~\cite{wavesofmisery}:
the gaps in many nodes fill at about the same time, leading to
periodic clusters of high-cost structural modifications.

Overflow buckets are seen in ISAM-tree based indexes~\cite{myisam},
as well as in hash tables with closed addressing. In this approach,
the parts of the structure into which records would be inserted (leaf
nodes in ISAM-trees, directory entries in closed-addressing hashing)
have a pointer to an overflow location, where newly inserted records
can be placed. This allows the structure to, theoretically, sustain
an unlimited number of insertions. However, read performance degrades,
because the more the overflow capacity is utilized, the less the
records in the structure are ordered according to the data structure's
definition. Thus, a periodic reconstruction is necessary to distribute
the overflow records into the structure itself.

\subsection{Amortized Global Reconstruction: The Bentley-Saxe Method}

Another approach to supporting updates is to amortize the cost of
global reconstruction over multiple updates. This approach can
take three forms,
\begin{enumerate}

	\item Pairing a dynamic data structure (called a buffer or
	memtable) with an instance of the structure being extended.
	Updates are written to the buffer, and when the buffer is
	full its records are merged with those in the static
	structure, and the structure is rebuilt. This approach is
	used by one version of the originally proposed
	LSM-tree~\cite{oneil93}. Technically, this technique was
	proposed in that work for the purpose of converting random
	writes into sequential ones (all structures involved are
	dynamic), but it can be used for dynamization as well.

	\item Creating multiple, smaller data structures, each
	containing a partition of the records from the dataset, and
	reconstructing individual structures to accommodate new
	inserts in a systematic manner. This technique is the basis
	of the Bentley-Saxe method~\cite{saxe79}.

	\item Using both of the above techniques at once. This is
	the approach used by modern incarnations of the
	LSM~tree~\cite{rocksdb}.

\end{enumerate}

In all three cases, it is necessary for the search problem associated
with the index to be a DSP, as answering it will require querying
multiple structures (the buffer and/or one or more instances of the
data structure) and merging the results together to get a final
result. This section will focus exclusively on the Bentley-Saxe
method, as it is the basis for our proposed methodology.

When dividing records across multiple structures, there is a clear
trade-off between read performance and write performance. Keeping
the individual structures small reduces the cost of reconstruction,
and thereby increases update performance. However, this also means
that more structures will be required to accommodate the same number
of records, when compared to a scheme that allows the structures
to be larger. As each structure must be queried independently, this
will lead to worse query performance. The reverse is also true:
fewer, larger structures will have better query performance and
worse update performance, with the extreme limit of this being a
single structure that is fully rebuilt on each insert.

\begin{figure}
	\caption{Inserting a new record using the Bentley-Saxe method.}
	\label{fig:bsm-example}
\end{figure}

The key insight of the Bentley-Saxe method~\cite{saxe79} is that a
good balance can be struck by using a geometrically increasing
structure size. In Bentley-Saxe, the sub-structures are ``stacked'',
with the bottom level having a capacity of a single record, and
each subsequent level doubling in capacity. When an update is
performed, the first empty level is located and a reconstruction
is triggered, merging the structures of all levels below this empty
one, along with the new record. An example of this process is shown
in Figure~\ref{fig:bsm-example}. The merit of this approach is
that it ensures that ``most'' reconstructions involve the smaller
data structures towards the bottom of the sequence, while most of
the records reside in large, infrequently updated structures towards
the top. This balances the read and write implications of
structure size, while also allowing the number of structures required
to represent $n$ records to be worst-case bounded by $O(\log n)$.
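The insertion procedure mirrors incrementing a binary counter, as the
following sketch (our own; \texttt{Static} is assumed to be
default-constructible, constructible from a vector of records, and to
expose \texttt{empty()} and \texttt{records()}) shows,

\begin{verbatim}
#include <cstddef>
#include <utility>
#include <vector>

// levels[i] is either empty or a structure over exactly 2^i records.
template <typename Static, typename Record>
class BentleySaxe {
    std::vector<Static> levels;
public:
    void insert(const Record &r) {
        std::vector<Record> batch{r};
        std::size_t i = 0;
        // Gather the records of every full level until the first
        // empty one is found (like carries in a binary counter).
        while (i < levels.size() && !levels[i].empty()) {
            auto recs = levels[i].records();
            batch.insert(batch.end(), recs.begin(), recs.end());
            levels[i] = Static{};  // this level is now empty
            i++;
        }
        if (i == levels.size())
            levels.emplace_back();  // worst case: grow by a level
        levels[i] = Static(std::move(batch));  // one 2^i rebuild
    }
};
\end{verbatim}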
Given a structure with an $O(P(n))$ construction cost and $O(Q_s(n))$
query cost, the Bentley-Saxe method will produce a dynamic data
structure with,

\begin{align}
	\text{Query Cost} \qquad & O\left(Q_s(n) \cdot \log n\right) \\
	\text{Amortized Insert Cost} \qquad & O\left(\frac{P(n)}{n} \log n\right)
\end{align}

However, the method has poor worst-case insertion cost: if the
entire structure is full, it must grow by another level, requiring
a full reconstruction involving every record within the structure.
A slight adjustment to the technique, due to Overmars and van Leeuwen
\cite{}, allows for the worst-case insertion cost to be bounded by
$O\left(\frac{P(n)}{n} \log n\right)$; however, it does so by dividing
each reconstruction into small pieces, one of which is executed
each time a new update occurs. This has the effect of bounding the
worst-case performance, but sacrifices expected-case performance
and adds significant complexity to the method. This technique is
not used much in practice.\footnote{
	I've yet to find any example of it used in a journal article
	or conference paper.
}

\subsection{Limitations of the Bentley-Saxe Method}

diff --git a/chapters/beyond-bsm.tex b/chapters/beyond-bsm.tex
new file mode 100644
index 0000000..290d9b1
--- /dev/null
+++ b/chapters/beyond-bsm.tex
@@ -0,0 +1,3 @@
+\chapter{Expanding the Design Space}
+\section{The LSM Tree}
+\section{Benefits of Buffering}
diff --git a/chapters/beyond-dsp.tex b/chapters/beyond-dsp.tex
new file mode 100644
index 0000000..77f5fb4
--- /dev/null
+++ b/chapters/beyond-dsp.tex
@@ -0,0 +1,863 @@
+\chapter{Generalizing the Framework}
+\label{chap:framework}

The previous chapter demonstrated the possible utility of
designing indexes based upon the dynamic extension of static data
structures. However, the presented strategy falls short of a general
framework, as it is specific to sampling problems. In this chapter,
the techniques of that work will be discussed in more general terms,
to arrive at a more broadly applicable solution.
A general
framework is proposed, which places only two requirements on supported
data structures,

\begin{itemize}
	\item Extended Decomposability
	\item Record Identity
\end{itemize}

In this chapter, these two properties are first defined. Then,
a general dynamic extension framework is described, which can
be applied to any data structure supporting these properties. Finally,
an experimental evaluation is presented that demonstrates the viability
of this framework.

\section{Extended Decomposability}

Chapter~\ref{chap:sampling} demonstrated how non-DSPs can be efficiently
addressed using Bentley-Saxe, so long as the query interface is
modified to accommodate their needs. For independent sampling
problems, this involved a two-pass approach, where some pre-processing
work was performed against each shard and used to construct a shard
alias structure. This structure was then used to determine how many
samples to draw from each shard.

To generalize this approach, a new class of decomposability is proposed,
called \emph{extended decomposability}. At present, this class is
defined in terms of a query interface, rather than by a formal
mathematical definition. In extended decomposability,
rather than treating a search problem as a monolith, the query algorithm
is decomposed into multiple components. This allows
for communication between shards as part of the query process.
Additionally, rather than using a binary merge operator, extended
decomposability uses a variadic function that merges all of the
result sets in one pass, reducing the cost due to merging by a
logarithmic factor without introducing any new restrictions.

The basic interface that must be supported by an extended-decomposable
search problem (eDSP) is,
\begin{itemize}

	\item $\mathbftt{local\_preproc}(\mathcal{I}_i, \mathcal{Q}) \to
	\mathscr{S}_i$ \\
	Pre-processes each partition $\mathcal{D}_i$ using index
	$\mathcal{I}_i$ to produce preliminary information about the
	query result on this partition, encoded as an object
	$\mathscr{S}_i$.

	\item $\mathbftt{distribute\_query}(\mathscr{S}_1, \ldots,
	\mathscr{S}_m, \mathcal{Q}) \to \mathcal{Q}_1, \ldots,
	\mathcal{Q}_m$\\
	Processes the list of preliminary information objects
	$\mathscr{S}_i$ and emits a list of local queries
	$\mathcal{Q}_i$ to run independently on each partition.

	\item $\mathbftt{local\_query}(\mathcal{I}_i, \mathcal{Q}_i)
	\to \mathcal{R}_i$ \\
	Executes the local query $\mathcal{Q}_i$ over partition
	$\mathcal{D}_i$ using index $\mathcal{I}_i$ and returns a
	partial result $\mathcal{R}_i$.

	\item $\mathbftt{merge}(\mathcal{R}_1, \ldots \mathcal{R}_m) \to
	\mathcal{R}$ \\
	Merges the partial results to produce the final answer.

\end{itemize}
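One possible C++ rendering of this interface is sketched below; every
name here is illustrative, and the framework's actual interface (which
must also account for mutable buffers) appears later in this chapter,

\begin{verbatim}
#include <vector>

template <typename Index, typename Query, typename LocalQuery,
          typename State, typename Result>
struct ExtendedDSP {
    // Per-partition pre-processing: produce the object S_i.
    State local_preproc(const Index &idx, const Query &q);

    // Combine the S_i to emit one local query per partition.
    std::vector<LocalQuery> distribute_query(
        const std::vector<State> &states, const Query &q);

    // Execute a local query against a single partition.
    Result local_query(const Index &idx, const LocalQuery &lq);

    // Variadic merge: all partial results combined in one pass.
    Result merge(const std::vector<Result> &partials);
};
\end{verbatim}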
The pseudocode for the query algorithm using this interface is,
\begin{algorithm}
	\DontPrintSemicolon
	\SetKwProg{Proc}{procedure}{ BEGIN}{END}
	\SetKwProg{For}{for}{ DO}{DONE}

	\Proc{\mathbftt{QUERY}($D[]$, $\mathscr{Q}$)} {
		\For{$i \in [0, |D|)$} {
			$S[i] := \mathbftt{local\_preproc}(D[i], \mathscr{Q})$
		} \;

		$ Q := \mathbftt{distribute\_query}(S, \mathscr{Q}) $ \; \;

		\For{$i \in [0, |D|)$} {
			$R[i] := \mathbftt{local\_query}(D[i], Q[i])$
		} \;

		$OUT := \mathbftt{merge}(R)$ \;

		\Return {$OUT$} \;
	}
\end{algorithm}

In this system, each query can report a partial result with
\mathbftt{local\_preproc}, which can be used by
\mathbftt{distribute\_query} to adjust the per-partition query
parameters, allowing for direct communication of state between
partitions. Queries which do not need this functionality can simply
return empty $\mathscr{S}_i$ objects from \mathbftt{local\_preproc}.

\subsection{Query Complexity}

Before describing how to use this new interface and definition to
support more efficient queries than standard decomposability allows,
a more general expression for the cost of querying such a structure
should be derived.
Recall that Bentley-Saxe, when applied to a $C(n)$-decomposable
problem, has the following query cost,

\begin{equation}
	\label{eq3:Bentley-Saxe}
	O\left(\log n \cdot \left( Q_s(n) + C(n)\right)\right)
\end{equation}
where $Q_s(n)$ is the cost of the query against one partition, and
$C(n)$ is the cost of the merge operator.

Let $Q_s(n)$ represent the cost of \mathbftt{local\_query} and
$C(n)$ the cost of \mathbftt{merge} in the extended decomposability
case. Additionally, let $P(n)$ be the cost of $\mathbftt{local\_preproc}$
and $\mathcal{D}(n)$ be the cost of \mathbftt{distribute\_query}.
Recall also that $|D| = \log n$ for the Bentley-Saxe method.
In this case, the cost of a query is
\begin{equation}
	O \left( \log n \cdot P(n) + \mathcal{D}(n) +
	\log n \cdot Q_s(n) + C(n) \right)
\end{equation}

Superficially, this looks to be strictly worse than the Bentley-Saxe
case in Equation~\ref{eq3:Bentley-Saxe}. However, the important
thing to understand is that for $C(n)$-decomposable queries, $P(n)
\in O(1)$ and $\mathcal{D}(n) \in O(1)$, as these steps are unneeded.
Thus, for normal decomposable queries, the cost actually reduces
to,
\begin{equation}
	O \left( \log n \cdot Q_s(n) + C(n) \right)
\end{equation}
which is actually \emph{better} than Bentley-Saxe, because the merge
operator is applied once, rather than once per partition. Meanwhile,
the ability to share state between the per-partition queries can
facilitate better solutions than would otherwise be possible.

In light of this new approach, consider the two examples of
non-decomposable search problems from Section~\ref{ssec:decomp-limits}.

\subsection{k-Nearest Neighbor}
\label{ssec:knn}
The KNN problem is $C(n)$-decomposable, and Section~\ref{sssec-decomp-limits-knn}
arrived at a Bentley-Saxe based solution to this problem built on the
VPTree, with a query cost of
\begin{equation}
	O \left( k \log^2 n + k \log n \log k \right)
\end{equation}
by running KNN on each partition, and then merging the result sets
with a heap.

Applying the interface of extended decomposability to this problem
allows for some optimizations. Pre-processing is not necessary here,
but the variadic merge function can be leveraged to get an asymptotically
better solution. Simply dropping the existing algorithm into this
interface will result in a merge algorithm with cost,
\begin{equation}
	C(n) \in O \left( k \log n \left( \log k + \log\log n\right)\right)
\end{equation}
which results in a total query cost that is slightly \emph{worse}
than the original,

\begin{equation}
	O \left( k \log^2 n + k \log n \left(\log k + \log\log n\right) \right)
\end{equation}

The problem is that the number of records considered in a given
merge has grown from $O(k)$ in the binary merge case to $O(\log n
\cdot k)$ in the variadic merge. However, because the merge function
now has access to all of the data at once, the algorithm can be modified
slightly for better efficiency by only pushing $\log n$ elements
into the heap at a time. This trick only works if
the $R_i$s are in sorted order relative to $f(x, q)$;
however, this condition is satisfied by the result sets returned by
KNN against a VPTree. Thus, for each $R_i$, the first element in sorted
order can be inserted into the heap, tagged with a reference to
the $R_i$ it was taken from. Then, when the heap is popped, the
next element from the associated $R_i$ can be inserted.
This allows the heap's size to be maintained at no larger
than $O(\log n)$, and limits the algorithm to no more than
$k$ pop operations and $\log n + k - 1$ pushes.
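A sketch of this merge routine follows (our own illustration; records
are represented by their distance to the query point for brevity, and
each partial result is assumed to be sorted by that distance),

\begin{verbatim}
#include <cstddef>
#include <queue>
#include <vector>

// One heap entry per partial result list: the heap never holds
// more than log n entries.
struct Entry {
    double dist;       // distance of the candidate record
    std::size_t list;  // which partial result it came from
    std::size_t pos;   // its position within that list
    bool operator>(const Entry &o) const { return dist > o.dist; }
};

std::vector<double> merge_knn(
        const std::vector<std::vector<double>> &partials,
        std::size_t k) {
    std::priority_queue<Entry, std::vector<Entry>,
                        std::greater<>> heap;
    for (std::size_t i = 0; i < partials.size(); i++)
        if (!partials[i].empty())
            heap.push({partials[i][0], i, 0});  // one seed per list

    std::vector<double> result;
    while (result.size() < k && !heap.empty()) {
        Entry e = heap.top();
        heap.pop();
        result.push_back(e.dist);
        if (e.pos + 1 < partials[e.list].size())
            heap.push({partials[e.list][e.pos + 1],  // refill from
                       e.list, e.pos + 1});          // the same list
    }
    return result;
}
\end{verbatim}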
This algorithm reduces the cost of KNN on this structure to,
\begin{equation}
	O(k \log^2 n + \log n)
\end{equation}
which is strictly better than the original.

\subsection{Independent Range Sampling}

The eDSP abstraction also provides sufficient features to implement
IRS, using the same basic approach as was used in the previous
chapter. Unlike KNN, IRS will take advantage of the extended query
interface. Recall from Chapter~\ref{chap:sampling} that the approach used
for answering sampling queries (ignoring the buffer, for now) was,

\begin{enumerate}
	\item Query each shard to establish the weight that should be assigned to the
	shard in sample size assignments.
	\item Build an alias structure over those weights.
	\item For each sample, reference the alias structure to determine which shard
	to sample from, and then draw the sample.
\end{enumerate}

This approach can be mapped easily onto the eDSP interface as follows,
\begin{itemize}
	\item[\texttt{local\_preproc}] Determine and return the total weight of candidate records for
	sampling in the shard.
	\item[\texttt{distribute\_query}] Using the shard weights, construct an alias structure associating
	each shard with its total weight. Then, query this alias structure $k$ times. For shard $i$, the
	local query $\mathscr{Q}_i$ will have its sample size assigned based on how many times $i$ is returned
	during the alias querying.
	\item[\texttt{local\_query}] Process the local query using the underlying data structure's normal sampling
	procedure.
	\item[\texttt{merge}] Union all of the partial results together.
\end{itemize}

This division of the query maps closely onto the cost function,
\begin{equation}
	O\left(W(n) + P(n) + kS(n)\right)
\end{equation}
used in Chapter~\ref{chap:sampling}, where the $W(n) + P(n)$ pre-processing
cost is associated with the cost of \texttt{local\_preproc} and the
$kS(n)$ sampling cost is associated with $\texttt{local\_query}$.
The \texttt{distribute\_query} operation will require $O(\log n)$
time to construct the shard alias structure, and $O(k)$ time to
query it. Accounting then for the fact that \texttt{local\_preproc}
will be called once per shard ($\log n$ times), and a total of $k$
records will be sampled at a cost of $S(n)$ each, this results
in a total query cost of,
\begin{equation}
	O\left(\left[W(n) + P(n)\right]\log n + k S(n)\right)
\end{equation}
which matches the cost in Equation~\ref{eq:sample-cost}.
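The sample-size assignment step of \texttt{distribute\_query} can be
sketched as follows (our own illustration; a real implementation would
use an alias structure for $O(1)$ draws, for which
\texttt{std::discrete\_distribution} stands in here),

\begin{verbatim}
#include <cstddef>
#include <random>
#include <vector>

// Split a k-record sample across shards in proportion to each
// shard's weight within the query range.
std::vector<std::size_t> assign_sample_sizes(
        const std::vector<double> &shard_weights, std::size_t k) {
    std::mt19937 rng(std::random_device{}());
    std::discrete_distribution<std::size_t> pick(
        shard_weights.begin(), shard_weights.end());

    std::vector<std::size_t> sizes(shard_weights.size(), 0);
    for (std::size_t i = 0; i < k; i++)
        sizes[pick(rng)]++;  // one weighted draw per sample
    return sizes;
}
\end{verbatim}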
\section{Record Identity}

Another important consideration for the framework is support for
deletes, which are important in the context of database systems.
The sampling extension framework supported two techniques
for the deletion of records: tombstone-based deletes and tagging-based
deletes. In both cases, the solution required that the shard support
point lookups, either for checking tombstones or for finding the
record to mark it as deleted. Implicit in this is an important
property of the underlying data structure which was taken for granted
in that work, but which will be made explicit here: record identity.

Delete support requires that each record within the index be uniquely
identifiable, and linkable directly to a location in storage. This
property is called \emph{record identity}. In the context of database
indexes, it isn't a particularly contentious requirement. Indexes
are already designed to provide a mapping directly to a record in
storage, which (at least in the context of an RDBMS) must have a unique
identifier attached. However, in more general contexts, this
requirement will place some restrictions on the applicability of
the framework.

For example, approximate data structures or summaries, such as Bloom
filters~\cite{bloom70} or count-min sketches~\cite{countmin-sketch},
are data structures which don't necessarily store the underlying
records. In principle, some summaries \emph{could} be supported by
normal Bentley-Saxe, as there exist mergeable
summaries~\cite{mergeable-summaries}. But because these data structures
violate the record identity property, they would not support deletes
(either in the framework or in Bentley-Saxe). The framework considers
deletes to be a first-class feature, and this is formalized by
requiring record identity as a property that supported data structures
must have.

\section{The General Framework}

Based on these properties, and the work described in
Chapter~\ref{chap:sampling}, a dynamic extension framework has been devised
with broad support for data structures. It is implemented in C++20, using
templates and concepts to define the necessary interfaces. A user of this
framework needs to provide a definition for their data structure with a
prescribed interface (called a \texttt{shard}), and a definition for their
query following an interface based on the above definition of an eDSP. These
two classes can then be used as template parameters to automatically create a
dynamic index, which exposes methods for inserting and deleting records, as
well as executing queries.

\subsection{Framework Design}

\Paragraph{Structure.} The overall design of the general framework
itself is not substantially different from the sampling framework
discussed in Chapter~\ref{chap:sampling}. It consists of a mutable buffer
and a set of levels containing data structures with geometrically
increasing capacities. The \emph{mutable buffer} is a small unsorted
record array of fixed capacity that buffers incoming inserts. As
the mutable buffer is kept sufficiently small (e.g.,
small enough to fit in the L2 CPU
cache), the cost of querying it without any auxiliary structures
can be minimized, while still allowing better insertion performance
than Bentley-Saxe, which requires rebuilding an index structure for
each insertion. The use of an unsorted buffer is necessary to
ensure that the framework doesn't require an existing dynamic version
of the index structure being extended, which would defeat the purpose
of the entire exercise.

The majority of the data within the structure is stored in a sequence
of \emph{levels} with geometrically increasing record capacities,
such that the capacity of level $i$ is $s^{i+1}$, where $s$ is a
configurable parameter called the \emph{scale factor}. Unlike
Bentley-Saxe, these levels are permitted to be partially full, which
allows significantly more flexibility in terms of how reconstruction
is performed. This also opens up the possibility of allowing each
level to allocate its record capacity across multiple data structures
(named \emph{shards}) rather than just one. This decision is called
the \emph{layout policy}, with the use of a single structure being
called \emph{leveling}, and multiple structures being called
\emph{tiering}.

\begin{figure}
\centering
\subfloat[Leveling]{\includegraphics[width=.5\textwidth]{img/leveling} \label{fig:leveling}}
\subfloat[Tiering]{\includegraphics[width=.5\textwidth]{img/tiering} \label{fig:tiering}}
	\caption{\textbf{An overview of the general structure of the
	dynamic extension framework} using leveling (Figure~\ref{fig:leveling}) and
tiering (Figure~\ref{fig:tiering}) layout policies. The pictured extension has
a scale factor of 3, with $L_0$ being at capacity, and $L_1$ being at
one third capacity. Each shard is shown as a dotted box, wrapping its associated
dataset ($D_i$), data structure ($I_i$), and auxiliary structures $(A_i)$. }
\label{fig:framework}
\end{figure}

\Paragraph{Shards.} The basic building block of the dynamic extension
is called a shard, defined as $\mathcal{S}_i = (\mathcal{D}_i,
\mathcal{I}_i, A_i)$, which consists of a partition of the data
$\mathcal{D}_i$, an instance of the static index structure being
extended $\mathcal{I}_i$, and an optional auxiliary structure $A_i$.
To ensure the viability of level reconstruction, the extended data
structure should at least support a construction method
$\mathtt{build}(\mathcal{D})$ that can build a new static index
from a set of records $\mathcal{D}$ from scratch. This set of records
may come from the mutable buffer, or from a union of the underlying
data of multiple other shards. It is also beneficial for $\mathcal{I}_i$
to support efficient point-lookups, which can search for a record's
storage location by its identifier (given by the record identity
requirement of the framework). The shard can also be customized
to provide any necessary features for supporting the index being
extended. For example, auxiliary data structures like Bloom filters
or hash tables can be added to improve point-lookup performance,
or additional, specialized query functions can be provided for use
by the associated query classes.

From an implementation standpoint, the shard object provides a shim
between the data structure and the framework itself. At minimum,
it must support the following interface,
\begin{itemize}
	\item $\mathbftt{construct}(B) \to S$ \\
	Construct a new shard from the contents of the mutable buffer, $B$.
	\item $\mathbftt{construct}(S_0, \ldots, S_n) \to S$ \\
	Construct a new shard from the records contained within a list of already
	existing shards.

	\item $\mathbftt{point\_lookup}(r) \to *r$ \\
	Search for a record, $r$, by identity and return a reference to its
	location in storage.
\end{itemize}

\Paragraph{Insertion \& deletion.} The framework supports inserting
new records and deleting records already in the index. These two
operations also allow for updates to existing records, by first
deleting the old version and then inserting a new one. These
operations are added by the framework automatically, and require
only a small shim or minor adjustments to the code of the data
structure being extended within the implementation of the shard
object.

Insertions are performed by first wrapping the record to be inserted
with a framework header, and then appending it to the end of the
mutable buffer. If the mutable buffer is full, it is flushed to
create a new shard, which is combined into the first level of the
structure. The level reconstruction process is layout policy
dependent. In the case of leveling, the underlying data of the
source shard and the target shard are combined, resulting in a new
shard that replaces the target shard in the target level. When using
tiering, the newly created shard is simply placed into the target
level. If the target level is full, the framework first triggers a merge on the
target level, which will create another shard one level higher,
and then inserts the former shard at the now-empty target level.
Note that each time a new shard is created, the framework must invoke
$\mathtt{build}$ to construct a new index from scratch for this
shard.

The framework supports deletes using two approaches: either by
inserting a special tombstone record or by performing a lookup of
the record to be deleted and setting a bit in its header. This
decision is called the \emph{delete policy}, with the former being
called \emph{tombstone delete} and the latter \emph{tagged delete}.
The framework will automatically filter deleted records from query
results before returning them to the user, either by checking for
the delete tag, or by performing a lookup of each record for an
associated tombstone. The number of deleted records within the
framework can be bounded by canceling tombstones and their associated
records when they meet during reconstruction, or by dropping all
tagged records when a shard is reconstructed. The framework also
supports aggressive reconstruction (called \emph{compaction}) to
precisely bound the number of deleted records within the index,
which can be helpful for certain types of query. This was seen with
sampling queries in Chapter~\ref{chap:sampling}, but is not
necessary to bound query cost in most cases.

\Paragraph{Design space.} The framework described in this section
has a large design space. In fact, much of the design space has
similar knobs to the well-known LSM tree~\cite{dayan17}, albeit in
a different environment: the framework targets in-memory static
index structures for general extended decomposable queries without
efficient index merging support, whereas the LSM-tree targets
external range indexes that can be efficiently merged.

The framework's design trades off among auxiliary memory usage, read
performance, and write performance. The two most significant decisions
are the choice of layout policy and delete policy. A tiering layout
policy reduces write amplification compared to leveling, requiring each
record to only be written once per level, but increases the number of
shards within the structure, which can hurt query performance. As for
delete policy, the use of tombstones turns deletes into insertions,
which are typically faster. However, depending upon the nature of
the query being executed, the delocalization of the presence
information for a record may result in one extra point lookup for
each record in the result set of a query, vastly reducing read
performance. In these cases, tagging may make more sense. This
results in each delete turning into a slower point-lookup, but
always allows for constant-time visibility checks of records. The
other two major parameters, scale factor and buffer size, can be
used to tune the performance once the policies have been selected.
Generally speaking, larger scale factors result in fewer shards,
but can increase write amplification under leveling. Large buffer
sizes can adversely affect query performance when an unsorted buffer
is used, while allowing higher update throughput. Because the overall
design of the framework remains largely unchanged, the design space
exploration of Section~\ref{ssec:ds-exp} remains relevant here.
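Returning to the delete policies, the two approaches can be sketched
as follows (our own illustration; the header layout and the
\texttt{Buffer} and \texttt{Extension} interfaces are illustrative,
not the framework's exact API),

\begin{verbatim}
#include <cstdint>

// Records are wrapped with a small framework header.
struct Header { bool tombstone; bool deleted; };
struct Record { Header hdr; uint64_t key; uint64_t value; };

// Tombstone delete: the delete becomes an ordinary (fast) insert
// of a marker record, canceled during later reconstructions.
template <typename Buffer>
void delete_by_tombstone(Buffer &buf, uint64_t key, uint64_t value) {
    buf.append(Record{{true, false}, key, value});
}

// Tagged delete: a point lookup (hence the record identity
// requirement) locates the record, and a bit is set in its header.
template <typename Extension>
bool delete_by_tagging(Extension &ext, uint64_t key, uint64_t value) {
    Record *loc = ext.point_lookup(key, value);
    if (loc != nullptr)
        loc->hdr.deleted = true;
    return loc != nullptr;
}
\end{verbatim}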
\subsection{The Shard Interface}

The shard object serves as a ``shim'' between a data structure and
the extension framework, providing a set of mandatory functions
which are used by the framework code to facilitate reconstruction
and record deletion. The data structure being extended can be
provided by a different library and included as an attribute via
composition/aggregation, or can be directly implemented within the
shard class. Additionally, shards can contain any auxiliary
structures, such as Bloom filters or hash tables, necessary to
support the required interface.

The required interface for a shard object is as follows,
\begin{verbatim}
    new(MutableBuffer) -> Shard
    new(Shard[]) -> Shard
    point_lookup(Record, Boolean) -> Record
    get_data() -> Record*
    get_record_count() -> Int
    get_tombstone_count() -> Int
    get_memory_usage() -> Int
    get_aux_memory_usage() -> Int
\end{verbatim}

The first two functions are constructors, necessary to build a new Shard
from either an array of other shards (for a reconstruction), or from
a mutable buffer (for a buffer flush).\footnote{
	This is the interface as it currently stands in the existing implementation, but
	is subject to change. In particular, we are considering changing the shard reconstruction
	procedure to allow for only one necessary constructor, with a more general interface. As
	we look to concurrency, being able to construct shards from arbitrary combinations of shards
	and buffers will become convenient, for example.
	}
The \texttt{point\_lookup} operation is necessary for delete support, and is
used either to locate a record for deletion when tagging is used, or to search
for a tombstone associated with a record when tombstones are used. The boolean
argument communicates to the shard whether the lookup is intended to locate a
tombstone or a record; it is meant to allow the shard to control whether a
point lookup checks a filter before searching, but could also be used for
other purposes.
The \texttt{get\_data}
function exposes a pointer to the beginning of the array of records contained
within the shard--it imposes no restriction on the order of these records, but
does require that all records can be accessed sequentially from this pointer,
and that the order of records does not change. The rest of the functions are
accessors for various shard metadata. The record and tombstone counts
are used by the framework for reconstruction purposes.\footnote{The record
count includes tombstones as well, so the true record count on a level is
$\text{reccnt} - \text{tscnt}$.} The memory usage statistics are, at present,
only exposed directly to the user and have no effect on the framework's
behavior. In the future, these may be used for concurrency control and task
scheduling purposes.

Beyond these, a shard can expose any additional functions that are necessary
for its associated query classes. For example, a shard intended to be used for
range queries might expose upper and lower bound functions, and a shard used for
nearest neighbor search might expose a nearest-neighbor function.

\subsection{The Query Interface}
\label{ssec:fw-query-int}

The required interface for a query in the framework is a bit more
complicated than the interface defined for an eDSP, because the
framework needs to query the mutable buffer as well as the shards.
As a result, there is some slight duplication of functions, with
specialized query and pre-processing routines for both shards and
buffers. Specifically, a query must define the following functions,
\begin{verbatim}
    get_query_state(QueryParameters, Shard) -> ShardState
    get_buffer_query_state(QueryParameters, Buffer) -> BufferState

    process_query_states(QueryParameters, ShardStateList,
                         BufferStateList) -> LocalQueryList

    query(LocalQuery, Shard) -> ResultList
    buffer_query(LocalQuery, Buffer) -> ResultList

    merge(ResultList) -> FinalResult

    delete_query_state(ShardState)
    delete_buffer_query_state(BufferState)

    bool EARLY_ABORT;
    bool SKIP_DELETE_FILTER;
\end{verbatim}

The \texttt{get\_query\_state} and \texttt{get\_buffer\_query\_state} functions
map to the \texttt{local\_preproc} operation of the eDSP definition for shards
and buffers respectively. \texttt{process\_query\_states} serves the function
of \texttt{distribute\_query}. Note that this function takes a list of buffer
states; although the proposed framework above contains only a single buffer,
future support for concurrency will require multiple buffers, and so the
interface is set up with support for this. The \texttt{query} and
\texttt{buffer\_query} functions execute the local query against the shard or
buffer and return the intermediate results, which are merged using
\texttt{merge} into a final result set. The \texttt{EARLY\_ABORT} flag can
be set to \texttt{true} to force the framework to return as soon as
the first result is found, rather than querying the entire structure, and the
\texttt{SKIP\_DELETE\_FILTER} flag disables the framework's automatic delete
filtering, allowing deletes to be handled manually within the \texttt{merge}
function by the developer. These flags exist to allow for optimizations for
certain types of query. For example, point-lookups can take advantage of
\texttt{EARLY\_ABORT} to stop as soon as a match is found, and
\texttt{SKIP\_DELETE\_FILTER} can be used for more efficient tombstone delete
handling in range queries, where tombstones for results will always be in the
\texttt{ResultList}s going into \texttt{merge}.
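As an example of a query class written against this interface, the
following sketch implements range-count over a shard exposing
\texttt{lower\_bound}/\texttt{upper\_bound} functions (all names and
types here are illustrative, not the framework's exact API),

\begin{verbatim}
#include <cstddef>
#include <cstdint>
#include <vector>

struct RangeCountArgs { uint64_t low, high; };
struct PartialCount { std::size_t records = 0, tombstones = 0; };

template <typename Shard, typename Record>
struct RangeCountQuery {
    static constexpr bool EARLY_ABORT = false;
    static constexpr bool SKIP_DELETE_FILTER = true;

    // Count records and tombstones in the range on one shard.
    static PartialCount local_query(const Shard &s,
                                    const RangeCountArgs &q) {
        PartialCount pc;
        const Record *data = s.get_data();
        std::size_t lo = s.lower_bound(q.low);
        std::size_t hi = s.upper_bound(q.high);
        for (std::size_t i = lo; i < hi; i++) {
            if (data[i].hdr.tombstone) pc.tombstones++;
            else                       pc.records++;
        }
        return pc;
    }

    // Tombstones cancel their targets during the merge.
    static std::size_t merge(const std::vector<PartialCount> &parts) {
        std::size_t recs = 0, ts = 0;
        for (const auto &p : parts) {
            recs += p.records;
            ts += p.tombstones;
        }
        return recs - ts;
    }
};
\end{verbatim}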
The framework itself answers queries by simply calling these routines in
a prescribed order,
\begin{verbatim}
query(QueryArguments qa) BEGIN
    FOR i < BufferCount DO
        BufferStates[i] = get_buffer_query_state(qa, Buffers[i])
    DONE

    FOR i < ShardCount DO
        ShardStates[i] = get_query_state(qa, Shards[i])
    DONE

    LocalQueries = process_query_states(qa, ShardStates, BufferStates)

    FOR i < BufferCount DO
        temp = buffer_query(LocalQueries[i], Buffers[i])
        IF NOT SKIP_DELETE_FILTER THEN
            temp = filter_deletes(temp)
        END
        Results[i] = temp

        IF EARLY_ABORT AND Results[i].size() > 0 THEN
            delete_states(ShardStates, BufferStates)
            return merge(Results)
        END
    DONE

    FOR i < ShardCount DO
        temp = query(LocalQueries[i + BufferCount], Shards[i])
        IF NOT SKIP_DELETE_FILTER THEN
            temp = filter_deletes(temp)
        END
        Results[i + BufferCount] = temp
        IF EARLY_ABORT AND Results[i + BufferCount].size() > 0 THEN
            delete_states(ShardStates, BufferStates)
            return merge(Results)
        END
    DONE

    delete_states(ShardStates, BufferStates)
    return merge(Results)
END
\end{verbatim}

\subsubsection{Standardized Queries}

Provided with the framework are several ``standardized'' query classes, including
point lookup, range query, and IRS. These queries can be freely applied to any
shard class that implements the necessary optional interfaces. For example, the
provided IRS and range queries both require the shard to implement
\texttt{lower\_bound} and \texttt{upper\_bound} functions that return an index.
They then use this index to access the record array exposed via
\texttt{get\_data}. This is convenient, because it helps to separate the search
problem from the data structure, and moves towards presenting these two objects
as orthogonal.

In the next section the framework is evaluated by producing a number of indexes
for three different search problems. Specifically, the framework is applied to
a pair of learned indexes, as well as an ISAM-tree. All three of these shards
provide the bound interface described above, meaning that the same range query
class can be used for all of them. It also means that the learned indexes
automatically have support for IRS. And, of course, they can all be used
with the provided point-lookup query, which simply uses the required
\texttt{point\_lookup} function of the shard.

At present, the framework only supports associating a single query class with
an index. However, this is simply a limitation of the implementation. In the
future, approaches will be considered for associating arbitrary query classes,
to allow truly multi-purpose indexes to be constructed. This is not to say that
every data structure will necessarily be efficient at answering every type of
query that could be answered using its interface--but in a database system,
being able to repurpose an existing index to accelerate a wide range of query
types would certainly seem worth considering.

\section{Framework Evaluation}

The framework was evaluated using three different types of search problem:
range-count, high-dimensional k-nearest neighbor, and independent range
sampling.
In all three cases, an extended static data structure was compared
with dynamic alternatives for the same search problem, to demonstrate the
framework's competitiveness.

\subsection{Methodology}

All tests were performed using Ubuntu 22.04
LTS on a dual-socket Intel Xeon Gold 6242R server with 384 GiB of
installed memory and 40 physical cores. Benchmark code was compiled
using \texttt{gcc} version 11.3.0 at the \texttt{-O3} optimization level.


\subsection{Range Queries}

The first test evaluates the performance of the framework in the context of
range queries against learned indexes. In Chapter~\ref{chap:intro}, the
lengthy development cycle of this sort of data structure was discussed,
and so learned indexes were selected as an evaluation candidate to demonstrate
how this framework could allow such lengthy development cycles to be largely
bypassed.

Specifically, the framework is used to produce dynamic learned indexes based on
TrieSpline~\cite{plex} (DE-TS) and the static version of PGM~\cite{pgm} (DE-PGM). These
are both single-pass construction static learned indexes, and thus well suited for use
within this framework, compared to more complex structures like RMI~\cite{RMI}, which have
more expensive construction algorithms. The two framework-extended data structures are
compared with dynamic learned indexes, namely ALEX~\cite{ALEX} and the dynamic version of
PGM~\cite{pgm}. PGM provides an interesting comparison, as its native
dynamic version was implemented using a slightly modified version of the
Bentley-Saxe method.

When performing range queries over large data sets, the
copying of query results can introduce significant overhead. Because the four
tested structures have different data copy behaviors, a range count query was
used for testing, rather than a pure range query. This search problem exposes
the searching performance of the data structures while controlling for
differences in data copying, and so should provide more directly comparable
results.

Range count
queries were executed with a selectivity of $0.01\%$ against three datasets
from the SOSD benchmark~\cite{sosd-datasets}: \texttt{book}, \texttt{fb}, and
\texttt{osm}, which all have 200 million 64-bit keys following a variety of
distributions, paired with uniquely generated 64-bit values. There
is a fourth dataset in SOSD, \texttt{wiki}, which was excluded from testing
because it contains duplicate keys, which are not supported by dynamic
PGM.\footnote{The dynamic version of PGM supports deletes using tombstones,
but doesn't wrap records with a header to accomplish this. Instead, it reserves
one possible value to represent a tombstone. Records are deleted by inserting a
record having the same key and this reserved value. This means that duplicate
keys, even if they have different values, are unsupported, as two records with
the same key will be treated as a delete by the index.~\cite{pgm} }

The shard implementations for DE-PGM and DE-TS required about 300 lines of
C++ code each, and no modification to the data structures themselves. For both
data structures, the framework was configured with a buffer of 12,000 records, a scale
factor of 8, the tombstone delete policy, and tiering. Each shard stored $D_i$
as a sorted array of records, used an instance of the learned index for
$\mathcal{I}_i$, and had no auxiliary structures.
The local query routine used
the learned index to locate the first key in the query range and then iterated
over the sorted array until the end of the range was reached, counting the
records and tombstones encountered. The mutable buffer query performed the
counting with a full scan. No local pre-processing was needed, and the merge
operation simply summed the record and tombstone counts and returned their
difference.

\begin{figure*}[t]
	\centering
	\subfloat[Update Throughput]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-insert} \label{fig:rq-insert}}
	\subfloat[Query Latency]{\includegraphics[width=.5\textwidth]{img/fig-bs-rq-query} \label{fig:rq-query}} \\
	\subfloat[Index Sizes]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0 ]{img/fig-bs-rq-space} \label{fig:idx-space}}
	\caption{Range Count Evaluation}
	\label{fig:results1}
\end{figure*}

Figure~\ref{fig:rq-insert} shows the update throughput of all competitors. ALEX
performs the worst in all cases, and PGM performs the best, with the extended
indexes falling in the middle. It is not unexpected that PGM performs better
than the framework, because the Bentley-Saxe extension in PGM is custom-built,
and thus has a tighter integration than a general framework would allow.
However, even with this advantage, DE-PGM still reaches up to 85\% of PGM's
insertion throughput. Additionally, Figure~\ref{fig:rq-query} shows that PGM
pays a large cost in query latency for its advantage in insertion, with the
framework-extended indexes significantly outperforming it. Further, DE-TS even
outperforms ALEX in query latency in some cases. Finally,
Figure~\ref{fig:idx-space} shows the storage cost of the indexes, without
counting the space necessary to store the records themselves. The storage cost
of a learned index is fairly variable, as it is largely a function of the
distribution of the data, but in all cases the extended learned
indexes, which build compact data arrays without gaps, occupy three orders of
magnitude less storage than ALEX, which requires leaving gaps
in its data arrays.

\subsection{High-Dimensional k-Nearest Neighbor}
The next test evaluates the framework for the extension of high-dimensional
metric indexes for the k-nearest neighbor search problem. An M-tree~\cite{mtree}
was used as the dynamic baseline,\footnote{
	Specifically, the M-tree implementation tested can be found at \url{https://github.com/dbrumbaugh/M-Tree}
	and is a fork of a structure written originally by Eduardo D'Avila, modified to compile under C++20. The
	tree uses a random selection algorithm for ball splitting.
} and a VPTree~\cite{vptree} as the static structure. The framework was used to
extend the VPTree to produce the dynamic version, DE-VPTree.
An M-tree is a tree that partitions records based on
high-dimensional spheres and supports updates by splitting and merging these
partitions.
A VPTree is a binary tree that is produced by recursively selecting
a point, called the vantage point, and partitioning records based on their
distance from that point. This results in a difficult-to-modify structure that
can be constructed in $O(n \log n)$ time and can answer KNN queries in $O(k
\log n)$ time.

DE-VPTree used a buffer of 12,000 records, a scale factor of 6, tiering, and
delete tagging. The query was implemented without a pre-processing step, using
the standard VPTree algorithm for KNN queries against each shard.
All $k$
records were determined for each shard, and then the merge operation used a
heap to merge the result sets together and return the $k$ nearest neighbors
from the $k \log n$ intermediate results. This is a type of query that pays a
non-constant merge cost of $O(k \log k)$, even with the framework's expanded
query interface. In effect, the KNN query must be answered twice: once for each
shard to get the intermediate result sets, and then a second time within the
merge operation to select the $k$ nearest neighbors from those result sets.

\begin{figure}
	\centering
	\includegraphics[width=.75\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-knn}
	\caption{KNN Index Evaluation}
	\label{fig:knn}
\end{figure}
Euclidean distance was used as the metric for both structures, and $k=1000$ was
used for all queries. The reference point for each query was selected randomly
from the points within the dataset. Tests were run using the Spanish Billion Words
dataset~\cite{sbw}, which consists of 300-dimensional vectors. The results are shown in
Figure~\ref{fig:knn}. In this case, the static nature of the VPTree allows
DE-VPTree to dominate the M-tree in query latency, and its simpler reconstruction
procedure shows a significant insertion performance improvement as well.

\subsection{Independent Range Sampling}
Finally, the
framework was tested using one-dimensional IRS queries. As before,
a static ISAM-tree was used as the data structure to be extended;
however, the sampling query was implemented using the query interface from
Section~\ref{ssec:fw-query-int}. The pre-processing step identifies the first
and last record falling into the range to be sampled from, and determines the
total weight of this range, for each shard. Then, in the local query
generation step, these weights are used to construct an alias structure, which
is used to assign sample sizes to each shard based on weight, to avoid
introducing skew into the results. After this, the query routine generates
random numbers between the established bounds to sample records, and the merge
operation appends the individual result sets together. This procedure
requires only a pair of tree traversals per shard, regardless of how many
samples are taken.

\begin{figure}
	\centering
	\subfloat[Query Latency]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-query} \label{fig:irs-query}}
	\subfloat[Update Throughput]{\includegraphics[width=.5\textwidth, trim=5mm 5mm 0 0]{img/fig-bs-irs-insert} \label{fig:irs-insert}}
	\caption{IRS Index Evaluation}
	\label{fig:results2}
\end{figure}

The extended ISAM structure (DE-IRS) was compared to a B$^+$-tree
with aggregate weight tags on internal nodes (AGG B+Tree) for sampling
and insertion performance, and to a single instance of the static ISAM-tree (ISAM),
which does not support updates. DE-IRS was configured with a buffer size
of 12,000 records, a scale factor of 6, tiering, and delete tagging. The IRS
queries had a selectivity of $0.1\%$ with a sample size of $k=1000$. Testing
was performed using the same datasets as were used for range queries.

Figure~\ref{fig:irs-query}
shows the significant latency advantage that the dynamically extended ISAM tree
enjoys compared to the B$^+$-tree. DE-IRS is up to 23 times faster than the B$^+$-tree at
answering sampling queries, and only about 3 times slower than the fully static
solution.
In this case, the extra query cost caused by needing to query
multiple structures is more than balanced by the query efficiency of each of
those structures, relative to tree sampling. Interestingly, the framework also
results in better update performance compared to the B$^+$-tree, as shown in
Figure~\ref{fig:irs-insert}. This is likely because the ISAM shards can be
efficiently constructed using a combination of sorted-merge operations and
bulk-loading, avoiding the expensive structural modification operations that are
necessary for maintaining a B$^+$-tree.

\subsection{Discussion}


The results demonstrate not only that the framework's update support is
competitive with custom-built dynamic data structures, but also that the framework
is able, in many cases, to retain some of the query performance advantage
of the static data structure it extends. This is particularly evident in the k-nearest
neighbor and independent range sampling tests, where the static version of the
structure was directly tested as well. These tests demonstrate one of the advantages
of static data structures: they are able to maintain much tighter inter-record relationships
than dynamic ones, because update support typically requires relaxing these relationships
to make updates cheaper. While the framework introduces the overhead of querying
multiple structures and merging the partial results, it is clear from the results that this
overhead is generally less than the overhead incurred by the update support techniques used
in the dynamic structures. The only case where the framework failed to win outright in query
performance was in competition with ALEX, where the resulting query latencies were comparable.

It is also evident that the update support provided by the framework is on par with, if not
superior to, that provided by the dynamic baselines, at least in terms of throughput. The
framework will certainly suffer from larger tail latency spikes, which weren't measured in
this round of testing, due to the larger scale of its reconstructions, but the amortization
of these costs over a large number of inserts allows a respectable level of throughput to be
maintained. In fact, the only case where the framework loses in insertion throughput
is against the dynamic PGM. However, an examination of the query latency reveals that the
standard configuration of the Bentley-Saxe variant used by PGM is highly tuned for insertion
performance: the query latencies against this data structure are far worse than those of any
other learned index tested. Even this result, then, shouldn't be taken as a ``clear'' defeat
of the framework's implementation.

Overall, it is clear from this evaluation that the dynamic extension framework is a
promising alternative to manual index redesign for accommodating updates. In almost
all cases, the framework-extended static data structures provided superior insertion
throughput, and query latencies that either matched or exceeded those of
the dynamic baselines. Additionally, though it is hard to quantify, the code complexity
of the framework-extended data structures was much lower, with the shard implementations
requiring only a small amount of relatively straightforward code to interface with pre-existing
static data structures, or with the necessary data structure implementations themselves being
simpler.
+
+\section{Conclusion}
+
+In this chapter, a generalized version of the framework originally proposed in
+Chapter~\ref{chap:sampling} was presented. This framework is based on two
+key properties: extended decomposability and record identity. It is capable
+of extending any data structure and search problem supporting these two properties
+with support for inserts and deletes. An evaluation of this framework was performed
+by extending several static data structures, and comparing the resulting structures'
+performance against dynamic baselines capable of answering the same type of search
+problem. The extended structures generally performed as well as, if not better than,
+their dynamic baselines in query performance, insert performance, or both. This
+demonstrates the capability of this framework to produce viable indexes in a variety
+of contexts. However, the framework is not yet complete. In the next chapter, the
+work required to bring this framework to completion will be described.
diff --git a/chapters/chapter1-old.tex b/chapters/chapter1-old.tex
new file mode 100644
index 0000000..fca257d
--- /dev/null
+++ b/chapters/chapter1-old.tex
@@ -0,0 +1,256 @@
+\chapter{Introduction}
+
+It probably goes without saying that database systems are heavily
+dependent upon data structures, both for auxiliary use within the system
+itself, and for indexing the data in storage to facilitate faster access.
+As a result of this, the design of novel data structures constitutes a
+significant sub-field within the database community. However, there is a
+stark divide between theoretical work and so-called ``practical'' work in
+this area, with many theoretically oriented data structures not seeing
+much, if any, use in real systems. I would go so far as to assert that
+many of these published data structures have \emph{never} actually been
+used.
+
+This situation exists with reason, of course. Fundamentally, the rules
+of engagement within the theory community differ from those within the
+systems community. Asymptotic analysis, which eschews constant factors,
+dominates theoretical analysis of data structures, whereas the systems
+community cares a great deal about these constants. We'll see within
+this document itself just how significant a divide this is in terms of
+real performance numbers. But, perhaps an even more significant barrier
+to theoretical data structures is that of support for features.
+
+A data structure, technically speaking, only needs to define algorithms
+for constructing and querying it. I'll describe such minimal structures
+as \emph{static data structures} within this document. Many theoretical
+structures that seem potentially useful fall into this category. Examples
+include alias-augmented structures for independent sampling, vantage-point
+trees for multi-dimensional similarity search, ISAM trees for traditional
+one-dimensional indexing, the vast majority of learned indexes, etc.
+
+These structures allow for highly efficient answering of their associated
+types of query, but have either fallen out of use (ISAM Trees) or have
+yet to see widespread adoption in database systems. This is because the
+minimal interface provided by a static data structure is usually not
+sufficient to address the real-world engineering challenges associated
+with database systems. Instead, data structures used by such systems must
+support a variety of additional features: updates to the underlying data,
+concurrent access, fault-tolerance, etc.
+This lack of feature support
+is a major barrier to the adoption of such structures.
+
+In the current data structure design paradigm, support for such features
+requires extensive redesign of the static data structure, often over a
+lengthy development cycle. Learned indexes provide a good case study for
+this. The first learned index, RMI, was proposed by Kraska \emph{et al.}
+in 2017~\cite{kraska-rmi}. As groundbreaking as this data structure,
+and the idea behind it, was, it lacked support for updates and thus was
+of very limited practical utility. Work then proceeded to develop an
+updatable data structure based on the concepts of RMI, culminating in
+ALEX~\cite{alex}, which first appeared on arXiv a year-and-a-half later.
+The next several years saw the
+development of a wide range of learned indexes, promising support for
+updates and concurrency. However, a recent survey found that all of them
+were still largely inferior to more mature indexing techniques, at least
+on certain workloads.
+
+These adventures in learned index design represent much of the modern
+index design process in microcosm. It is not unreasonable to expect
+that, as the technology matures, learned indexes may one day become
+commonplace. But the amount of development and research effort to get
+there is, clearly, vast.
+
+On the opposite end of the spectrum, theoretical data structure works
+also attempt to extend their structures with update support using a
+variety of techniques. However, the differing rules of engagement often
+result in solutions to this problem that are horribly impractical in
+database systems. As an example, Hu, Qiao, and Tao have proposed a data
+structure for efficient range sampling, and included in their design a
+discussion of efficient support for updates~\cite{irs}. Without getting
+into details, they need to add multiple additional data structures besides
+their sampling structure to facilitate this, including a hash table and
+multiple linked lists. Asymptotically, this approach doesn't affect space
+or time complexity, as there is a constant number of extra structures,
+and the costs of maintaining and accessing them are on par with the costs
+associated with their main structure. But it's clear that the space
+and time costs of these extra data structures would have relevance in
+a real system. A similar problem arises in a recent attempt to create a
+dynamic alias structure, which uses multiple auxiliary data structures,
+and further assumes that the key space size is a constant that can be
+neglected~\cite{that-paper}.
+
+Further, update support is only one of many features that a data
+structure must support for use in database systems. Given these challenges
+associated with just update support, one can imagine the amount of work
+required to get a data structure fully ``production ready''!
+
+However, all of these tribulations are, I'm going to argue, not
+fundamental to data structure design, but rather a consequence of the
+modern data structure design paradigm. Rather than this process of manual
+integration of features into the data structure itself, we propose a
+new paradigm: \emph{Framework-driven Data Structure Design}. Under this
+paradigm, the process of designing a data structure is reduced to the
+static case: an algorithm for querying the structure and an algorithm
+for building it from a set of elements.
+Once these are defined, a
+high-level framework can be used to automatically add support for other
+desirable features, such as updates, concurrency, and fault-tolerance,
+in a manner that is mostly transparent to the static structure itself.
+
+This idea is not without precedent. For example, a similar approach
+is used to provide fault-tolerance to indexes within traditional,
+disk-based RDBMSs. The RDBMS provides a storage engine which has its own
+fault tolerance systems. Any data structure built on top of this storage
+engine can benefit from its crash recovery, requiring only a small amount
+of effort to integrate with the system. As a result, crash recovery/fault
+tolerance is not handled at the level of the data structure in such
+systems. The B+Tree index itself doesn't have the mechanism built into
+it; it relies upon the framework provided by the RDBMS.
+
+Similarly, there is an existing technique, commonly called the
+Bentley-Saxe method, which uses an analogous process to add support
+for updates to static structures.
+
+\section{Research Objectives}
+The proposed project has four major objectives:
+\begin{enumerate}
+\item Automatic Dynamic Extension
+
+    The first phase of this project has seen the development of a
+    \emph{dynamic extension framework}, which is capable of adding
+    support for inserts and deletes of data to otherwise static data
+    structures, so long as a few basic assumptions about the structure
+    and associated queries are satisfied. This framework is based on
+    the core principles of the Bentley-Saxe method, and is implemented
+    using C++ templates to allow for ease of use.
+
+    As part of the extension of BSM, a large design space has been added,
+    giving the framework a trade-off space between memory usage, insert
+    performance, and query performance. This allows for the performance
+    characteristics of the framework-extended data structure to be tuned
+    for particular use cases, and provides a large degree of flexibility
+    to the technique.
+
+\item Automatic Concurrency Support
+
+    Because the Bentley-Saxe method is based on the reconstruction
+    of otherwise immutable blocks, a basic concurrency implementation
+    is straightforward. While there are hard blocking points when a
+    reconstruction requires the results of an as-yet-incomplete
+    reconstruction, all other operations can be easily performed
+    concurrently, so long as the destruction of blocks can be deferred
+    until all operations actively using them are complete. This lends itself
+    to a simple epoch-based system, where a particular configuration of
+    blocks constitutes an epoch, and the reconstruction of one or more
+    blocks triggers a shift to a new epoch upon its completion. Each
+    query will see exactly one epoch, and that epoch will remain in
+    existence until all queries using it have terminated.
+
+    With this strategy, the problem of adding support for concurrent
+    operations is largely converted into one of resource management.
+    Retaining old epochs, adding more buffers, and running reconstruction
+    operations all require storage. Further, large reconstructions
+    consume memory bandwidth and CPU resources, which must be shared
+    with active queries. And, at least some reconstructions will actively
+    block others, which will lead to tail latency spikes.
+
+    The objective of this phase of the project is the creation of a
+    scheduling system, built into the framework, that will schedule
+    queries and merges so as to ensure that the system operates within
+    specific tail latency and resource utilization constraints. In
+    particular, it is important to effectively hide the large insertion
+    tail latencies caused by reconstructions, and to limit the storage
+    required to retain old versions of the structure. Alongside
+    scheduling, the use of admission control will be considered for helping
+    to maintain latency guarantees even in adversarial conditions.
+
+\item Automatic Multi-node Support
+
+    It is increasingly the case that the requirements for data management
+    systems exceed the capacity of a single node, requiring horizontal
+    scaling. Unfortunately, the design of data structures that work
+    effectively in a distributed, multi-node environment is non-trivial.
+    However, the same design elements that make it straightforward to
+    implement a framework-driven concurrency system should also lend
+    themselves to adding multi-node support to a data structure. The
+    framework uses immutable blocks of data, which are periodically
+    reconstructed by combining them with other blocks. This system is
+    superficially similar to the RDDs used by Apache Spark, for example.
+
+    What is not so straightforward, however, is the implementation
+    decisions that underlie this framework. It is not obvious that the
+    geometric block sizing technique used by BSM is well suited to this
+    task, and so a comprehensive evaluation of block sizing techniques
+    will be required. Additionally, there are significant challenges
+    to be overcome regarding block placement on nodes, fault-tolerance
+    and recovery, how best to handle buffering, and the effect of block
+    sizing strategies and placement on end-to-end query performance. All
+    of these problems will be studied during this phase of the project.
+
+
+\item Automatic Performance Tuning
+
+    During all phases of the project, various tunable parameters will
+    be introduced that allow for various trade-offs between insertion
+    performance, query performance, and memory usage. These allow for a
+    user to fine-tune the performance characteristics of the framework
+    to suit her use-cases. However, this tunability may introduce an
+    obstacle to adoption for the system, as it is not necessarily trivial
+    to arrive at an effective configuration of the system, given a set of
+    performance requirements. Thus, the final phase of the project will
+    consider systems to automatically tune the framework. As a further
+    benefit, such a system could allow dynamic adjustment of the tunable
+    parameters of the framework during execution, enabling automatic
+    and transparent evolution in the face of changing workloads.
+
+\end{enumerate}
+
+
+\begin{enumerate}
+    \item Thrust 1. Automatic Concurrency and Scheduling
+
+    The design of the framework lends itself to a straightforward, data
+    structure independent, concurrency implementation, but ensuring good
+    performance of this implementation will require intelligent scheduling.
+    In this thrust, we will study the problem of scheduling operations
+    within the framework to meet certain tail latency guarantees, within a
+    particular set of resource constraints.
+
+    RQ1: How best to parameterize merge and query operations?
+    RQ2: Develop a real-time (or nearly real-time) scheduling
+         system to make decisions about when to merge, while
+         ensuring that certain tail latency requirements are met
+         within a set of resource constraints.
+
+
+    \item Thrust 2. Temporal and Spatial Data Partitioning
+
+    The framework is based upon a temporal partitioning of data; however,
+    there are opportunities to improve the performance of certain
+    operations by introducing a spatial partitioning scheme as well. In
+    this thrust, we will expand the framework to support arbitrary
+    partitioning schemes, and assess the efficacy of spatial partitioning
+    under a variety of contexts.
+
+    RQ1: What effect does spatial partitioning within levels have on
+         the performance of inserts and queries?
+    RQ2: Do trade-offs exist between spatial and temporal partitioning?
+    RQ3: To what degree do results about spatial partitioning generalize
+         across different types of index (particularly multi-dimensional
+         ones)?
+
+    \item Thrust 3. Dynamic Performance Tuning
+
+    The framework contains a large number of tunable parameters which allow
+    for trade-offs between memory usage, read performance, and write
+    performance. In this thrust, we will comprehensively evaluate this
+    design space, and develop a system for automatically adjusting these
+    parameters during system operation. This will allow the system to
+    dynamically change its own configuration when the workload changes.
+
+    RQ1: Quantify and model the effects of framework tuning parameters on
+         various performance metrics.
+    RQ2: Evaluate the utility of having a heterogeneous configuration, with
+         different parameter values on different levels.
+    RQ3: Develop a system for dynamically adjusting these values based on
+         current performance data.
+
+\end{enumerate}
diff --git a/chapters/chapter1.tex.bak b/chapters/chapter1.tex.bak
new file mode 100644
index 0000000..c66ba2c
--- /dev/null
+++ b/chapters/chapter1.tex.bak
@@ -0,0 +1,204 @@
+\chapter{Introduction}
+
+It probably goes without saying that database systems are heavily
+dependent upon data structures, both for auxiliary use within the system
+itself, and for indexing the data in storage to facilitate faster access.
+As a result of this, the design of novel data structures constitutes a
+significant subfield within the database community. However, there is a
+stark divide between theoretical work and so-called ``practical'' work in
+this area, with many theoretically oriented data structures not seeing
+much, if any, use in real systems. I would go so far as to assert that
+many of these published data structures have \emph{never} actually been
+used.
+
+This situation exists with reason, of course. Fundamentally, the rules
+of engagement within the theory community differ from those within the
+systems community. Asymptotic analysis, which eschews constant factors,
+dominates theoretical analysis of data structures, whereas the systems
+community cares a great deal about these constants. We'll see within
+this document itself just how significant a divide this is in terms of
+real performance numbers. But, perhaps an even more significant barrier
+to theoretical data structures is that of support for features.
+
+A data structure, technically speaking, only needs to define algorithms
+for constructing and querying it. I'll describe such minimal structures
+as \emph{static data structures} within this document.
+Many theoretical
+structures that seem potentially useful fall into this category. Examples
+include alias-augmented structures for independent sampling, vantage-point
+trees for multi-dimensional similarity search, ISAM trees for traditional
+one-dimensional indexing, the vast majority of learned indexes, etc.
+
+These structures allow for highly efficient answering of their associated
+types of query, but have either fallen out of use (ISAM Trees) or have
+yet to see widespread adoption in database systems. This is because the
+minimal interface provided by a static data structure is usually not
+sufficient to address the real-world engineering challenges associated
+with database systems. Instead, data structures used by such systems must
+support a variety of additional features: updates to the underlying data,
+concurrent access, fault-tolerance, etc. This lack of feature support
+is a major barrier to the adoption of such structures.
+
+In the current data structure design paradigm, support for such features
+requires extensive redesign of the static data structure, often over a
+lengthy development cycle. Learned indexes provide a good case study for
+this. The first learned index, RMI, was proposed by Kraska \emph{et al.}
+in 2017~\cite{kraska-rmi}. As groundbreaking as this data structure,
+and the idea behind it, was, it lacked support for updates and thus was
+of very limited practical utility. Work then proceeded to develop an
+updatable data structure based on the concepts of RMI, culminating in
+ALEX~\cite{alex}, which first appeared on arXiv a year-and-a-half later.
+The next several years saw the
+development of a wide range of learned indexes, promising support for
+updates and concurrency. However, a recent survey found that all of them
+were still largely inferior to more mature indexing techniques, at least
+on certain workloads.
+
+These adventures in learned index design represent much of the modern
+index design process in microcosm. It is not unreasonable to expect
+that, as the technology matures, learned indexes may one day become
+commonplace. But the amount of development and research effort to get
+there is, clearly, vast.
+
+On the opposite end of the spectrum, theoretical data structure works
+also attempt to extend their structures with update support using a
+variety of techniques. However, the differing rules of engagement often
+result in solutions to this problem that are horribly impractical in
+database systems. As an example, Hu, Qiao, and Tao have proposed a data
+structure for efficient range sampling, and included in their design a
+discussion of efficient support for updates~\cite{irs}. Without getting
+into details, they need to add multiple additional data structures besides
+their sampling structure to facilitate this, including a hash table and
+multiple linked lists. Asymptotically, this approach doesn't affect space
+or time complexity, as there is a constant number of extra structures,
+and the costs of maintaining and accessing them are on par with the costs
+associated with their main structure. But it's clear that the space
+and time costs of these extra data structures would have relevance in
+a real system. A similar problem arises in a recent attempt to create a
+dynamic alias structure, which uses multiple auxiliary data structures,
+and further assumes that the key space size is a constant that can be
+neglected~\cite{that-paper}.
+
+Further, update support is only one of many features that a data
+structure must support for use in database systems. Given these challenges
+associated with just update support, one can imagine the amount of work
+required to get a data structure fully ``production ready''!
+
+However, all of these tribulations are, I'm going to argue, not
+fundamental to data structure design, but rather a consequence of the
+modern data structure design paradigm. Rather than this process of manual
+integration of features into the data structure itself, we propose a
+new paradigm: \emph{Framework-driven Data Structure Design}. Under this
+paradigm, the process of designing a data structure is reduced to the
+static case: an algorithm for querying the structure and an algorithm
+for building it from a set of elements. Once these are defined, a
+high-level framework can be used to automatically add support for other
+desirable features, such as updates, concurrency, and fault-tolerance,
+in a manner that is mostly transparent to the static structure itself.
+
+This idea is not without precedent. For example, a similar approach
+is used to provide fault-tolerance to indexes within traditional,
+disk-based RDBMSs. The RDBMS provides a storage engine which has its own
+fault tolerance systems. Any data structure built on top of this storage
+engine can benefit from its crash recovery, requiring only a small amount
+of effort to integrate with the system. As a result, crash recovery/fault
+tolerance is not handled at the level of the data structure in such
+systems. The B+Tree index itself doesn't have the mechanism built into
+it; it relies upon the framework provided by the RDBMS.
+
+Similarly, there is an existing technique, commonly called the
+Bentley-Saxe method, which uses an analogous process to add support
+for updates to static structures.
+
+\section{Research Objectives}
+The proposed project has four major objectives:
+\begin{enumerate}
+\item Automatic Dynamic Extension
+
+    The first phase of this project has seen the development of a
+    \emph{dynamic extension framework}, which is capable of adding
+    support for inserts and deletes of data to otherwise static data
+    structures, so long as a few basic assumptions about the structure
+    and associated queries are satisfied. This framework is based on
+    the core principles of the Bentley-Saxe method, and is implemented
+    using C++ templates to allow for ease of use.
+
+    As part of the extension of BSM, a large design space has been added,
+    giving the framework a trade-off space between memory usage, insert
+    performance, and query performance. This allows for the performance
+    characteristics of the framework-extended data structure to be tuned
+    for particular use cases, and provides a large degree of flexibility
+    to the technique.
+
+\item Automatic Concurrency Support
+
+    Because the Bentley-Saxe method is based on the reconstruction
+    of otherwise immutable blocks, a basic concurrency implementation
+    is straightforward. While there are hard blocking points when a
+    reconstruction requires the results of an as-yet-incomplete
+    reconstruction, all other operations can be easily performed
+    concurrently, so long as the destruction of blocks can be deferred
+    until all operations actively using them are complete. This lends itself
+    to a simple epoch-based system, where a particular configuration of
+    blocks constitutes an epoch, and the reconstruction of one or more
+    blocks triggers a shift to a new epoch upon its completion.
+    Each
+    query will see exactly one epoch, and that epoch will remain in
+    existence until all queries using it have terminated.
+
+    With this strategy, the problem of adding support for concurrent
+    operations is largely converted into one of resource management.
+    Retaining old epochs, adding more buffers, and running reconstruction
+    operations all require storage. Further, large reconstructions
+    consume memory bandwidth and CPU resources, which must be shared
+    with active queries. And, at least some reconstructions will actively
+    block others, which will lead to tail latency spikes.
+
+    The objective of this phase of the project is the creation of a
+    scheduling system, built into the framework, that will schedule
+    queries and merges so as to ensure that the system operates within
+    specific tail latency and resource utilization constraints. In
+    particular, it is important to effectively hide the large insertion
+    tail latencies caused by reconstructions, and to limit the storage
+    required to retain old versions of the structure. Alongside
+    scheduling, the use of admission control will be considered for helping
+    to maintain latency guarantees even in adversarial conditions.
+
+\item Automatic Multi-node Support
+
+    It is increasingly the case that the requirements for data management
+    systems exceed the capacity of a single node, requiring horizontal
+    scaling. Unfortunately, the design of data structures that work
+    effectively in a distributed, multi-node environment is non-trivial.
+    However, the same design elements that make it straightforward to
+    implement a framework-driven concurrency system should also lend
+    themselves to adding multi-node support to a data structure. The
+    framework uses immutable blocks of data, which are periodically
+    reconstructed by combining them with other blocks. This system is
+    superficially similar to the RDDs used by Apache Spark, for example.
+
+    What is not so straightforward, however, is the implementation
+    decisions that underlie this framework. It is not obvious that the
+    geometric block sizing technique used by BSM is well suited to this
+    task, and so a comprehensive evaluation of block sizing techniques
+    will be required. Additionally, there are significant challenges
+    to be overcome regarding block placement on nodes, fault-tolerance
+    and recovery, how best to handle buffering, and the effect of block
+    sizing strategies and placement on end-to-end query performance. All
+    of these problems will be studied during this phase of the project.
+
+
+\item Automatic Performance Tuning
+
+    During all phases of the project, various tunable parameters will
+    be introduced that allow for various trade-offs between insertion
+    performance, query performance, and memory usage. These allow for a
+    user to fine-tune the performance characteristics of the framework
+    to suit her use-cases. However, this tunability may introduce an
+    obstacle to adoption for the system, as it is not necessarily trivial
+    to arrive at an effective configuration of the system, given a set of
+    performance requirements. Thus, the final phase of the project will
+    consider systems to automatically tune the framework. As a further
+    benefit, such a system could allow dynamic adjustment of the tunable
+    parameters of the framework during execution, enabling automatic
+    and transparent evolution in the face of changing workloads.
+
+\end{enumerate}
diff --git a/chapters/conclusion.tex b/chapters/conclusion.tex
new file mode 100644
index 0000000..b4439ec
--- /dev/null
+++ b/chapters/conclusion.tex
@@ -0,0 +1,43 @@
+\chapter{Conclusion}
+\label{chap:conclusion}
+
+Using data structures, a wide range of analytical queries against large data
+sets can be accelerated. Unfortunately, these data structures must be
+concurrently updatable to ensure timely results, as the underlying data is
+frequently subject to change. This requirement for concurrent update support
+excludes many possible data structures from use in these contexts, and the
+creation of a data structure with update support is non-trivial.
+
+The framework proposed by this work would allow for existing data
+structures to be automatically extended with tunable support for
+concurrent updates, with potential for future work to add even more
+features. It is based on an extension of the Bentley-Saxe method,
+which supports updates in static structures by splitting the data
+structure into multiple partitions and systematically reconstructing
+them. The Bentley-Saxe method has been adjusted to utilize a different
+query interface, based on the newly proposed extended decomposability,
+which brings with it more efficient support for many types of search
+problems not well served by the original techniques. It also introduces
+two approaches for handling deletes, buffering of inserts, and a more
+tunable reconstruction strategy, as well as support for concurrency,
+none of which were present in the original method.
+
+Using this framework, many data structures and search problems can be
+used as the basis of an index, requiring only that they support the
+eDSP abstraction and can uniquely identify and locate each record. The
+creation of an index requires only a small amount of shim code between
+the structure and the framework (called a shard).
+
+The current version of the framework supports tunable, single-threaded
+updates, and has been experimentally validated to extend static data
+structures with update support, and to maintain performance on par
+with or better than existing dynamic alternatives for a number of
+complex search problems, including k-nearest neighbor and a variety
+of independent sampling problems. Beyond presenting these results,
+this work proposes extending this framework with support for
+concurrency and tail-latency mitigations, online and fine-grained
+tuning, and more sophisticated data partitioning schemes to ease
+certain challenges associated with large-scale reconstructions.
+The completion of this framework would be a major milestone in a larger
+project to vastly expand the capabilities of database management systems
+through the use of more complex data access primitives.
diff --git a/chapters/dynamic-extension-sampling.tex b/chapters/dynamic-extension-sampling.tex
new file mode 100644
index 0000000..58db672
--- /dev/null
+++ b/chapters/dynamic-extension-sampling.tex
@@ -0,0 +1,22 @@
+\chapter{Dynamic Extension Framework for Sampling Indexes}
+\label{chap:sampling}
+
+\begin{center}
+    \emph{The following chapter is an adaptation of work completed in collaboration with Dr. Dong Xie and published
+    in PACMMOD Volume 1, Issue 4 (December 2023) under the title ``Practical Dynamic Extension of Sampling Indexes''.
+    }
+    \hrule
+\end{center}
+
+\input{chapters/sigmod23/introduction}
+\input{chapters/sigmod23/background}
+\input{chapters/sigmod23/framework}
+\input{chapters/sigmod23/examples}
+\input{chapters/sigmod23/extensions}
+\input{chapters/sigmod23/experiment}
+\input{chapters/sigmod23/exp-parameter-space}
+\input{chapters/sigmod23/exp-baseline}
+\input{chapters/sigmod23/exp-extensions}
+%\input{chapters/sigmod23/relatedwork}
+\input{chapters/sigmod23/conclusion}
+
diff --git a/chapters/future-work.tex b/chapters/future-work.tex
new file mode 100644
index 0000000..d4ddd52
--- /dev/null
+++ b/chapters/future-work.tex
@@ -0,0 +1,174 @@
+\chapter{Proposed Work}
+\label{chap:proposed}
+
+The previous two chapters described work that has already been completed;
+however, a number of tasks remain to be done as part of this
+project. Update support is only one of the important features that an
+index requires of its data structure. In this chapter, the remaining
+research problems will be discussed briefly, to lay out a set of criteria
+for project completion.
+
+\section{Concurrency Support}
+
+Database management systems are designed to hide the latency of
+IO operations, and one of the techniques they use to do so is a high
+degree of concurrency. As a result, any data structure used to build
+a database index must also support concurrent updates and queries. The
+sampling extension framework described in Chapter~\ref{chap:sampling} had basic
+concurrency support, but work is ongoing to integrate a superior system
+into the framework of Chapter~\ref{chap:framework}.
+
+Because the framework is based on the Bentley-Saxe method, it has a number
+of desirable properties for making concurrency management simpler. With
+the exception of the buffer, the vast majority of the data resides in
+static data structures. When using tombstones, these static structures
+become fully immutable. This turns concurrency control into a resource
+management problem, and suggests a simple multi-version concurrency
+control scheme. Each version of the structure, defined as being the
+state between two reconstructions, is tagged with an epoch number. A
+query, then, will read only a single epoch, which will be preserved
+in storage until all queries accessing it have terminated. Because the
+mutable buffer is append-only, a consistent view of it can be obtained
+by storing the tail of the log at the start of query execution. Thus,
+a fixed snapshot of the index can be represented as a two-tuple containing
+the epoch number and buffer tail index.
+
+The major limitation of the Chapter~\ref{chap:sampling} system was
+the handling of buffer expansion. While the mutable buffer itself is
+an unsorted array, and thus supports concurrent inserts using a simple
+fetch-and-add operation, the real hurdle to insert performance is managing
+reconstruction. During a reconstruction, the buffer is full and cannot
+support any new inserts. Because active queries may be using the buffer,
+it cannot be immediately flushed, and so inserts are blocked. Because of
+this, it is necessary to use multiple buffers to sustain insertions. When
+a buffer is filled, a background thread is used to perform the
+reconstruction, and a new buffer is added to continue inserting while that
+reconstruction occurs. In Chapter~\ref{chap:sampling}, the solution used
+was limited by its restriction to only two buffers (and as a result,
+a maximum of two active epochs at any point in time).
+Any sustained
+insertion workload would quickly fill up the pair of buffers, and then
+be forced to block until one of the buffers could be emptied. This
+emptying of the buffer was contingent on \emph{both} all queries using
+the buffer finishing, \emph{and} on the reconstruction using that buffer
+finishing. As a result, the length of the block on inserts could be long
+(multiple seconds, or even minutes for particularly large reconstructions)
+and indeterminate (a given index could be involved in a very long-running
+query, and the buffer would be blocked until the query completed).
+
+Thus, a more effective concurrency solution would need to support
+dynamically adding mutable buffers as needed to maintain insertion
+throughput. This would allow for insertion throughput to be maintained
+so long as memory for more buffer space is available.\footnote{For the
+in-memory indexes considered thus far, it isn't clear that running out of
+memory for buffers is a recoverable error in all cases. The system would
+require the same amount of memory for storing records (technically more,
+considering index overhead) in a shard as it does in the buffer. In the
+case of an external storage system, the calculus would be different,
+of course.} It would also ensure that a long-running query could only block
+insertion if there is insufficient memory to create a new buffer or to
+run a reconstruction. However, as the number of buffered records grows,
+there is the potential for query performance to suffer, which leads to
+another important aspect of an effective concurrency control scheme.
+
+\subsection{Tail Latency Control}
+
+The concurrency control scheme discussed thus far allows for maintaining
+insertion throughput by allowing an unbounded portion of the new data
+to remain buffered in an unsorted fashion. Over time, this buffered
+data will be moved into data structures in the background, as the
+system performs merges (which are moved off of the critical path for
+most operations). While this system allows for fast inserts, it has the
+potential to damage query performance. This is because the more buffered
+data there is, the more a query must fall back on its inefficient
+scan-based buffer path, as opposed to using the data structure.
+
+Unfortunately, reconstructions can be incredibly lengthy (recall that
+the worst-case scenario involves rebuilding a static structure over
+all of the records; this is, thankfully, quite rare). This implies that
+it may be necessary in certain circumstances to throttle insertions to
+maintain certain levels of query performance. Additionally, it may be
+worth preemptively performing large reconstructions during periods of
+low utilization, similar to systems like SILK, which is designed to
+mitigate tail latency spikes in LSM-tree based systems~\cite{balmau19}.
+
+Additionally, it is possible that large reconstructions may have a
+negative effect on query performance, due to system resource utilization.
+Reconstructions can use a large amount of memory bandwidth, which must
+be shared by queries. The effects of parallel reconstruction on query
+performance will need to be assessed, and strategies for mitigation of
+this effect, be it a scheduling-based solution, or a resource-throttling
+one, considered if necessary.
+
+
+\section{Fine-Grained Online Performance Tuning}
+
+The framework has a large number of configurable parameters, and
+introducing concurrency control will add even more.
+The parameter sweeps
+in Section~\ref{ssec:ds-exp} show that there are trade-offs between
+read and write performance across this space. Unfortunately, the current
+framework applies these configuration parameters globally, and does not
+allow them to be changed after the index is constructed. It seems apparent
+that better performance might be obtained by adjusting this approach.
+
+First, there is nothing preventing these parameters from being configured
+on a per-level basis: different layout policies on different levels
+(for example, tiering on higher levels and leveling on lower ones),
+different scale factors, and so on. More index-specific tuning, like
+controlling the memory budget for auxiliary structures, could also be
+considered.
+
+This fine-grained tuning will open up an even broader design space,
+which has the benefit of improving the configurability of the system,
+but the disadvantage of making configuration more difficult. Additionally,
+it does nothing to address the problem of workload drift: a configuration
+may be optimal now, but will it remain effective in the future as the
+read/write mix of the workload changes? Both of these challenges can be
+addressed using dynamic tuning.
+
+The theory is that the framework could be augmented with some workload
+and performance statistics tracking. Based on these numbers, during
+reconstruction, the framework could decide to adjust the configuration
+of one or more levels in an online fashion, to lean more towards read
+or write performance, or to dial back memory budgets as the system's
+memory usage increases. Additionally, buffer-related parameters could
+be tweaked in real time as well. If insertion throughput is high, it
+might be worth it to temporarily increase the buffer size, rather than
+spawning multiple smaller buffers.
+
+A system like this would allow for more consistent performance of the
+system in the face of changing workloads, and also increase the ease
+of use of the framework by removing the burden of configuration from
+the user.
+
+
+\section{Alternative Data Partitioning Schemes}
+
+One problem with Bentley-Saxe or LSM-tree derived systems is temporary
+memory usage spikes. When performing a reconstruction, the system needs
+enough storage to store the shards involved in the reconstruction,
+and also the newly constructed shard. This is made worse in the face
+of multi-version concurrency, where multiple older versions of shards
+may be retained in memory at once. It's well known that, in the worst
+case, such a system may temporarily require double its current memory
+usage~\cite{dayan22}.
+
+One approach to addressing this problem in LSM-tree based systems is
+to adjust the compaction granularity~\cite{dayan22}. In the terminology
+associated with this framework, the idea is to further sub-divide each
+shard into smaller chunks, partitioned based on keys. That way, when a
+reconstruction is triggered, rather than reconstructing an entire shard,
+these smaller partitions can be used instead. One of the partitions in
+the source shard can be selected, and then merged with the partitions
+in the next level down having overlapping key ranges. The amount of
+memory required for reconstruction (and also reconstruction time costs)
+can then be controlled by adjusting the size of these partitions.
+
+Unfortunately, while this system works incredibly well for LSM-tree
+based systems, which store one-dimensional data in sorted arrays, it
+encounters some problems in the context of a general index.
+It isn't
+clear how to effectively partition multi-dimensional data in the same
+way. Additionally, in the general case, each partition would need to
+contain its own instance of the index, as the framework supports data
+structures that don't themselves support effective partitioning in the
+way that a simple sorted array would. These challenges will need to be
+overcome to devise effective, general schemes for data partitioning to
+address the problems of reconstruction size and memory usage.
diff --git a/chapters/introduction.tex b/chapters/introduction.tex
new file mode 100644
index 0000000..a5d9740
--- /dev/null
+++ b/chapters/introduction.tex
@@ -0,0 +1,95 @@
+\chapter{Introduction}
+\label{chap:intro}
+
+One of the major challenges facing current data systems is the processing
+of complex and varied analytical queries over vast data sets. One commonly
+used technique for accelerating these queries is the application of data
+structures to create indexes, which are the basis for specialized database
+systems and data processing libraries. Unfortunately, the development
+of these indexes is difficult because of the requirements placed on
+them by data processing systems. Data is frequently subject to updates,
+yet a large number of potentially useful data structures are static.
+Further, many large-scale data processing systems are highly concurrent,
+which raises the barrier to entry even higher. The process for
+developing data structures that satisfy these requirements is arduous.
+
+To demonstrate this difficulty, consider the recent example of the
+evolution of learned indexes. These are data structures designed to
+efficiently solve a simple problem: single dimensional range queries
+over sorted data. They seek to reduce the size of the structure, as
+well as lookup times, by replacing a traditional data structure with a
+learned model capable of predicting the location of a record in storage
+that matches a key value to within bounded error. This concept was first
+proposed by Kraska et al. in 2017, when they published a paper on the
+first learned index, RMI~\cite{RMI}. This index succeeded in showing
+that a learned model can be both faster and smaller than a conventional
+range index, but the proposed solution did not support updates. The
+first (non-concurrently) updatable learned index, ALEX, took a year
+and a half to appear~\cite{ALEX}. Over the course of the subsequent
+three years, several learned indexes were proposed with concurrency
+support~\cite{10.1145/3332466.3374547,10.14778/3489496.3489512}, but a
+recent performance study~\cite{10.14778/3551793.3551848} showed that these
+were still generally inferior to ART-OLC~\cite{10.1145/2933349.2933352},
+a traditional index. This same study did, however, demonstrate that a new
+design, ALEX+, was able to outperform ART-OLC under certain circumstances,
+but even with this result, learned indexes are not generally considered
+production ready, because they suffer from significant performance
+regressions under certain workloads, and are highly sensitive to the
+distribution of keys~\cite{10.14778/3551793.3551848}. Despite the
+demonstrable advantages of the technique and over half a decade of
+development, learned indexes still have not reached a generally usable
+state.\footnote{
+    In Chapter~\ref{chap:framework}, we apply our proposed technique to
+    existing static learned indexes to produce an effective dynamic index.
+}
+
+This work proposes a strategy for addressing this problem by providing a
+framework for automatically introducing support for concurrent updates
+(including both inserts and deletes) to many static data structures. With
+this framework, a wide range of static, or otherwise impractical, data
+structures will be made practically useful in data systems. Based
+on a classical, theoretical framework called the Bentley-Saxe
+Method~\cite{saxe79}, the proposed system will provide a library
+that can automatically extend many data structures with support for
+concurrent updates, as well as a tunable design space that allows the
+user to make trade-offs between read performance, write performance,
+and storage usage. The framework will address a number of limitations
+present in the original technique, greatly increasing its applicability
+and practicality. It will also provide a workload-adaptive, online tuning
+system that can automatically adjust the tuning parameters of the data
+structure in the face of changing workloads.
+
+This framework is based on the splitting of the data structure into
+several smaller pieces, which are periodically reconstructed to support
+updates. A systematic partitioning and reconstruction approach is used
+to provide specific guarantees on amortized insertion performance, and
+worst-case query performance. The underlying Bentley-Saxe method is
+extended using a novel query abstraction to broaden its applicability,
+and the partitioning and reconstruction processes are adjusted to improve
+performance and introduce configurability.
+
+Specifically, the proposed work will address the following points:
+\begin{enumerate}
+    \item The proposal of a theoretical framework for analyzing queries
+          and data structures that extends existing theoretical
+          approaches and allows for more data structures to be dynamized.
+    \item The design of a system based upon this theoretical framework
+          for automatically dynamizing static data structures in a performant
+          and configurable manner.
+    \item The extension of this system with support for concurrent operations,
+          and the use of concurrency to provide more effective worst-case
+          performance guarantees.
+\end{enumerate}
+
+The rest of this document is structured as follows. First,
+Chapter~\ref{chap:background} introduces relevant background information,
+including the importance of data structures and indexes in database systems,
+the concept of a search problem, and techniques for designing updatable data
+structures. Next, in Chapter~\ref{chap:sampling}, the application of the
+Bentley-Saxe method to a number of sampling data structures is presented. The
+extension of these structures introduces a number of challenges which must be
+addressed, resulting in significant modification of the underlying technique.
+Then, Chapter~\ref{chap:framework} discusses the generalization of the
+modifications from the sampling framework into a more general framework.
+Chapter~\ref{chap:proposed} discusses the work that remains to be completed as
+part of this project, and Chapter~\ref{chap:conclusion} concludes the work.
diff --git a/chapters/sigmod23/abstract.tex b/chapters/sigmod23/abstract.tex
new file mode 100644
index 0000000..3ff0c08
--- /dev/null
+++ b/chapters/sigmod23/abstract.tex
@@ -0,0 +1,29 @@
+\begin{abstract}
+
+    The execution of analytical queries on massive datasets presents challenges
+    due to long response times and high computational costs.
+    As a result, the
+    analysis of representative samples of data has emerged as an attractive
+    alternative; this avoids the cost of processing queries against the entire
+    dataset, while still producing statistically valid results. Unfortunately,
+    the sampling techniques in common use sacrifice either sample quality or
+    performance, and so are poorly suited for this task. However, it is
+    possible to build high quality sample sets efficiently with the assistance
+    of indexes. This introduces a new challenge: real-world data is subject to
+    continuous update, and so the indexes must be kept up to date. This is
+    difficult, because existing sampling indexes present a dichotomy: efficient
+    sampling indexes are difficult to update, while easily updatable indexes
+    have poor sampling performance. This paper seeks to address this gap by
+    proposing a general and practical framework for extending most sampling
+    indexes with efficient update support, based on splitting indexes into
+    smaller shards, combined with a systematic approach to their periodic
+    reconstruction. The framework's design space is examined, with an eye
+    towards exploring trade-offs between update performance, sampling
+    performance, and memory usage. Three existing static sampling indexes are
+    extended using this framework to support updates, and the generalization of
+    the framework to concurrent operations and larger-than-memory data is
+    discussed. Through a comprehensive suite of benchmarks, the extended
+    indexes are shown to match or exceed the update throughput of
+    state-of-the-art dynamic baselines, while presenting significant
+    improvements in sampling latency.
+
+\end{abstract}
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
new file mode 100644
index 0000000..58324bd
--- /dev/null
+++ b/chapters/sigmod23/background.tex
@@ -0,0 +1,182 @@
+\section{Background}
+\label{sec:background}
+
+This section formalizes the sampling problem and describes relevant existing
+solutions. Before discussing these topics, though, a clarification of
+terminology is in order. The nomenclature used to describe sampling varies
+slightly throughout the literature. In this chapter, the term \emph{sample} is
+used to indicate a single record selected by a sampling operation, and a
+collection of these samples is called a \emph{sample set}; the number of
+samples within a sample set is the \emph{sample size}. The term \emph{sampling}
+is used to indicate the selection of either a single sample or a sample set;
+the specific usage should be clear from context.
+
+
+\Paragraph{Independent Sampling Problem.} When conducting sampling, it is often
+desirable for the drawn samples to have \emph{statistical independence}. This
+requires that the sampling of a record does not affect the probability of any
+other record being sampled in the future. Independence is a requirement for the
+application of statistical tools such as the Central Limit
+Theorem~\cite{bulmer79}, which is the basis for many concentration bounds.
+A failure to maintain independence in sampling invalidates any guarantees
+provided by these statistical methods.
+
+In each of the problems considered, sampling can be performed either with
+replacement (WR) or without replacement (WoR). It is possible to answer any WoR
+sampling query using a constant number of WR queries, followed by a
+deduplication step~\cite{hu15}, and so this chapter focuses exclusively on WR
+sampling.
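+
+As an aside, this reduction is simple to realize in code. The sketch
+below draws a WoR sample set on top of a generic WR sampling primitive
+by discarding duplicates; the \texttt{wr\_sample} callback is a
+hypothetical stand-in for any WR sampling index, and the loop assumes
+the underlying data contains at least $k$ distinct records.
+\begin{verbatim}
+#include <cstddef>
+#include <functional>
+#include <unordered_set>
+#include <vector>
+
+// WoR sampling via WR sampling plus deduplication: repeatedly draw
+// batches with replacement and keep only first occurrences, until k
+// distinct records have been collected.
+std::vector<long>
+sample_wor(const std::function<std::vector<long>(std::size_t)> &wr_sample,
+           std::size_t k) {
+    std::unordered_set<long> seen;
+    std::vector<long> result;
+
+    while (result.size() < k) {
+        // Draw the number of records still needed, with replacement.
+        for (long rec : wr_sample(k - result.size())) {
+            if (seen.insert(rec).second) {
+                result.push_back(rec);  // first occurrence; keep it
+            }
+        }
+    }
+
+    return result;
+}
+\end{verbatim}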
+
+A basic version of the independent sampling problem is \emph{weighted set
+sampling} (WSS),\footnote{
+    This nomenclature is adopted from Tao's recent survey of sampling
+    techniques~\cite{tao22}. This problem is also called
+    \emph{weighted random sampling} (WRS) in the literature.
+}
+in which each record is associated with a weight that determines its
+probability of being sampled. More formally, WSS is defined
+as:
+\begin{definition}[Weighted Set Sampling~\cite{walker74}]
+    Let $D$ be a set of data whose members are associated with positive
+    weights $w: D \to \mathbb{R}^+$. Given an integer $k \geq 1$, a weighted
+    set sampling query returns $k$ independent random samples from $D$ with
+    each data point $d \in D$ having a probability of $\frac{w(d)}{\sum_{p\in
+    D}w(p)}$ of being sampled.
+\end{definition}
+Each query returns a sample set of size $k$, rather than a
+single sample. Queries returning sample sets are the common case, because the
+robustness of analysis relies on having a sufficiently large sample
+size~\cite{ben-eliezer20}. The common \emph{simple random sampling} (SRS)
+problem is a special case of WSS, where every element has unit weight.
+
+In the context of databases, it is also common to discuss a more general
+version of the sampling problem, called \emph{independent query sampling}
+(IQS)~\cite{hu14}. An IQS query samples a specified number of records from the
+result set of a database query. In this context, it is insufficient to merely
+ensure individual records are sampled independently; the sample sets returned
+by repeated IQS queries must be independent as well. This provides a variety of
+useful properties, such as fairness and representativeness of query
+results~\cite{tao22}. As a concrete example, consider simple random sampling on
+the result set of a single-dimensional range reporting query. This is
+called independent range sampling (IRS), and is formally defined as:
+
+\begin{definition}[Independent Range Sampling~\cite{tao22}]
+    Let $D$ be a set of $n$ points in $\mathbb{R}$. Given a query
+    interval $q = [x, y]$ and an integer $k$, an independent range sampling
+    query returns $k$ independent samples from $D \cap q$ with each
+    point having equal probability of being sampled.
+\end{definition}
+A generalization of IRS exists, called \emph{Weighted Independent Range
+Sampling} (WIRS)~\cite{afshani17}, which is similar to WSS. Each point in $D$
+is associated with a positive weight $w: D \to \mathbb{R}^+$, and samples are
+drawn from the range query results $D \cap q$ such that each data point has a
+probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$ of being sampled.
+
+
+\Paragraph{Existing Solutions.} While many sampling techniques exist,
+few are supported in practical database systems. The existing
+\texttt{TABLESAMPLE} operator provided by SQL in all major DBMS
+implementations~\cite{postgres-doc} requires either a linear scan (e.g.,
+Bernoulli sampling) that results in high sample retrieval costs, or relaxed
+statistical guarantees (e.g., block sampling~\cite{postgres-doc} used in
+PostgreSQL).
+
+Index-assisted sampling solutions have been studied
+extensively. Olken's method~\cite{olken89} is a classical solution to
+independent sampling problems. This algorithm operates upon traditional search
+trees, such as the B+tree used commonly as a database index. It conducts a
+uniform random walk down the tree from the root to a leaf, resulting in an
+$O(\log n)$ sampling cost for each returned record.
+Should weighted samples be
+desired, rejection sampling can be performed. A sampled record, $r$, is
+accepted with probability $\nicefrac{w(r)}{w_{max}}$, meaning an expected
+$\nicefrac{w_{max}}{w_{avg}}$ samples must be taken per element in the
+sample set. Olken's method can also be extended to support general IQS by
+rejecting all sampled records failing to satisfy the query predicate. It can be
+accelerated by adding aggregated weight tags to internal
+nodes~\cite{olken-thesis,zhao22}, allowing rejection sampling to be performed
+during the tree-traversal to abort dead-end traversals early.
+
+\begin{figure}
+    \centering
+    \includegraphics[width=.5\textwidth]{img/sigmod23/alias.pdf}
+    \caption{\textbf{A pictorial representation of an alias
+    structure}, built over a set of weighted records. Sampling is performed by
+    first (1) selecting a cell by uniformly generating an integer index on
+    $[0,n)$, and then (2) selecting an item by generating a
+    second uniform float on $[0,1]$ and comparing it to the cell's normalized
+    cutoff values. In this example, the first random number is $0$,
+    corresponding to the first cell, and the second is $.7$. This is larger
+    than $\nicefrac{.15}{.25}$, and so $3$ is selected as the result of the
+    query.
+    This allows $O(1)$ independent weighted set sampling, but adding a new
+    element requires a weight adjustment to every element in the structure, and
+    so isn't generally possible without performing a full reconstruction.}
+    \label{fig:alias}
+
+\end{figure}
+
+There also exist static data structures, referred to in this chapter as static
+sampling indexes (SSIs)\footnote{
+    The name SSI was established in the published version of this paper prior to the
+    realization that a distinction between the terms index and data structure would
+    be useful. We'll continue to use the term SSI for the remainder of this chapter,
+    to maintain consistency with the published work, but technically an SSI refers to
+    a data structure, not an index, in the nomenclature established in the previous
+    chapter.
+}, that are capable of answering sampling queries in
+near-constant time\footnote{
+    The designation
+    ``near-constant'' is \emph{not} used in the technical sense of being constant
+    to within a polylogarithmic factor (i.e., $\tilde{O}(1)$). It is instead used to mean
+    constant to within an additive polylogarithmic term, i.e., $f(x) \in O(\log n +
+    1)$.
+    %For example, drawing $k$ samples from $n$ records using a near-constant
+    %approach would require $O(\log n + k)$ time. This is in contrast to a
+    %tree-traversal approach, which would require $O(k\log n)$ time.
+} relative to the size of the dataset. An example of such a
+structure is used in Walker's alias method \cite{walker74,vose91}, a technique
+for answering WSS queries with $O(1)$ query cost per sample, but requiring
+$O(n)$ time to construct. It distributes the weight of the items across $n$
+cells, with each cell partitioned between at most two items, such that the
+total proportion of the cells assigned to each item corresponds to its share
+of the total weight. A query
+selects one cell uniformly at random, then chooses one of the two items in the
+cell by weight, thus selecting items with probability proportional to their
+weight in $O(1)$ time. A pictorial representation of this structure is shown in
+Figure~\ref{fig:alias}.
+
+The alias method can also be used as the basis for creating SSIs capable of
+answering general IQS queries using a technique called alias
+augmentation~\cite{tao22}.
As a concrete example, previous
+papers~\cite{afshani17,tao22} have proposed solutions for WIRS queries
+requiring $O(\log n + k)$ time, where the $\log n$ cost is paid only once per
+query, after which elements can be sampled in constant time. This structure is
+built by breaking the data up into disjoint chunks of size
+$\nicefrac{n}{\log n}$, called \emph{fat points}, each with an alias
+structure. A B+tree is then constructed, using the fat points as its leaf
+nodes. The internal nodes are augmented with an alias structure over the total
+weight of each child. This alias structure is used instead of rejection
+sampling to determine the traversal path to take through the tree, and then
+the alias structure of the fat point is used to sample a record. Because
+rejection sampling is not used during the traversal, two traversals suffice to
+establish the valid range of records for sampling, after which samples can be
+collected without requiring per-sample traversals. More examples of alias
+augmentation applied to different IQS problems can be found in a recent survey
+by Tao~\cite{tao22}.
+
+There do exist specialized sampling indexes~\cite{hu14} with both efficient
+sampling and support for updates, but these are restricted to specific query
+types and are often very complex structures, with poor constant factors
+associated with sampling and update costs, and so are of limited practical
+utility. There has also been work~\cite{hagerup93,matias03,allendorf23} on
+extending the alias structure to support weight updates over a fixed set of
+elements. However, these solutions do not allow insertion or deletion in the
+underlying dataset, and so are not well suited to database sampling
+applications.
+
+\Paragraph{The Dichotomy.} Among these techniques, there exists a clear
+trade-off between efficient sampling and support for updates. Tree-traversal
+based sampling solutions pay a per-sample cost that grows with the size of the
+dataset, in exchange for update support. The static solutions lack support for
+updates, but support near-constant time sampling. While some data structures
+exist with support for both, these are restricted to highly specialized query
+types. Thus in the general case there exists a dichotomy: existing sampling
+indexes can support either data updates or efficient sampling, but not both.
diff --git a/chapters/sigmod23/conclusion.tex b/chapters/sigmod23/conclusion.tex
new file mode 100644
index 0000000..de6bffc
--- /dev/null
+++ b/chapters/sigmod23/conclusion.tex
@@ -0,0 +1,17 @@
+\section{Conclusion}
+\label{sec:conclusion}
+
+This chapter discussed the creation of a framework for the dynamic extension of
+static indexes designed for various sampling problems. Specifically, extensions
+were created for the alias structure (WSS), the in-memory ISAM tree (IRS), and
+the alias-augmented B+tree (WIRS). In each case, the SSIs were extended
+successfully with support for updates and deletes, without compromising their
+sampling performance advantage relative to existing dynamic baselines. This was
+accomplished by leveraging ideas borrowed from the Bentley-Saxe method and the
+design space of the LSM tree to divide the static index into multiple shards,
+which could be individually reconstructed in a systematic fashion to
+accommodate new data. This framework provides a large design space for trading
+between update performance, sampling performance, and memory usage, which was
+explored experimentally.
The resulting extended indexes were shown to approach
+or match the insertion performance of the B+tree, while simultaneously
+performing significantly faster in sampling operations under most situations.
diff --git a/chapters/sigmod23/examples.tex b/chapters/sigmod23/examples.tex
new file mode 100644
index 0000000..cdbc398
--- /dev/null
+++ b/chapters/sigmod23/examples.tex
@@ -0,0 +1,143 @@
+\section{Framework Instantiations}
+\label{sec:instance}
+In this section, the framework is applied to three sampling problems and their
+associated SSIs. All three sampling problems draw random samples from records
+satisfying a simple predicate, and so result sets for all three can be
+constructed by directly merging the result sets of the queries executed against
+individual shards, the primary requirement for the application of the
+framework. The SSIs used for each problem are discussed, including their
+support of the remaining two optional requirements for framework application.
+
+\subsection{Dynamically Extended WSS Structure}
+\label{ssec:wss-struct}
+As a first example of applying this framework for dynamic extension,
+the alias structure for answering WSS queries is considered. This is a
+static structure that can be constructed in $O(n)$ time and supports WSS
+queries in $O(1)$ time. The alias structure will be used as the SSI, with
+each shard containing an alias structure paired with a sorted array of
+records. The use of sorted arrays for storing the records allows for more
+efficient point-lookups, without requiring any additional space. The total
+weight associated with a query for a given alias structure is the total
+weight of all of its records, and can be tracked at the shard level and
+retrieved in constant time.
+
+Using the formulae from Section~\ref{sec:framework}, the worst-case
+costs of insertion, sampling, and deletion are easily derived. The
+initial construction cost from the buffer is $C_c(N_b) \in O(N_b
+\log N_b)$, requiring the sorting of the buffer followed by alias
+construction. After this point, the shards can be reconstructed in
+linear time while maintaining sorted order. Thus, the reconstruction
+cost is $C_r(n) \in O(n)$. As each shard contains a sorted array,
+the point-lookup cost is $L(n) \in O(\log n)$. The total weight can
+be tracked with the shard, requiring $W(n) \in O(1)$ time to access,
+and there is no necessary preprocessing, so $P(n) \in O(1)$. Samples
+can be drawn in $S(n) \in O(1)$ time. Plugging these results into the
+formulae for insertion, sampling, and deletion costs gives,
+\begin{align*}
+    \text{Insertion:} \quad &O\left(\log_s n\right) \\
+    \text{Sampling:} \quad &O\left(\log_s n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
+    \text{Tagged Delete:} \quad &O\left(\log_s n \log n\right)
+\end{align*}
+where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log n)$ for
+tombstones.
+
+\Paragraph{Bounding Rejection Rate.} In the weighted sampling case,
+the framework's generic record-based compaction trigger mechanism
+is insufficient to bound the rejection rate. This is because the
+probability of a given record being sampled is dependent upon its
+weight, as well as the number of records in the index. If a highly
+weighted record is deleted, it will be preferentially sampled, resulting
+in a larger number of rejections than would be expected based on record
+counts alone. This problem can be rectified using the framework's
+user-specified compaction trigger mechanism.
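+
+Concretely, such a trigger amounts to simple per-level bookkeeping. The
+following sketch (a hypothetical fragment; the names are illustrative) flags
+a level for compaction once its observed rejection rate exceeds a
+configurable cap:
+\begin{verbatim}
+#include <cstddef>
+
+// Per-level rejection-rate bookkeeping. Under tombstones, a rejection
+// is charged to the level holding the tombstone, not the level holding
+// the sampled record.
+struct LevelStats {
+    size_t attempts = 0;
+    size_t rejections = 0;
+
+    void record_sample(bool rejected) {
+        attempts++;
+        if (rejected) rejections++;
+    }
+
+    // True if rho_i = rejections / attempts exceeds the cap rho.
+    bool needs_compaction(double rho) const {
+        return attempts > 0 &&
+               static_cast<double>(rejections) / attempts > rho;
+    }
+};
+\end{verbatim}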
+
+In addition to
+tracking record counts, each level also tracks its rejection rate,
+\begin{equation*}
+\rho_i = \frac{\text{rejections}}{\text{sampling attempts}}.
+\end{equation*}
+A configurable rejection rate cap, $\rho$, is then defined. If $\rho_i
+> \rho$ on a level, a compaction is triggered. In the case of the
+tombstone delete policy, it is not the level containing the sampled
+record, but rather the level containing its tombstone, that is considered
+the source of the rejection. This is necessary to ensure that the tombstone
+is moved closer to canceling its associated record by the compaction.
+
+\subsection{Dynamically Extended IRS Structure}
+\label{ssec:irs-struct}
+Another sampling problem to which the framework can be applied is
+independent range sampling (IRS). The SSI in this example is the in-memory
+ISAM tree. The ISAM tree supports efficient point-lookups directly, and the
+total weight of an IRS query can be easily obtained by counting the number
+of records within the query range, which is determined as part of the
+preprocessing of the query.
+
+The static nature of shards in the framework allows for an ISAM tree
+to be constructed with adjacent nodes positioned contiguously in memory.
+By selecting a leaf node size that is a multiple of the record size, and
+avoiding placing any headers within leaf nodes, the set of leaf nodes can
+be treated as a sorted array of records with direct indexing, and the
+internal nodes allow for faster searching of this array.
+Because of this layout, per-sample tree-traversals are avoided. The
+start and end of the range from which to sample can be determined using
+a pair of traversals, and then records can be sampled from this range
+using random number generation and array indexing.
+
+Assuming a sorted set of input records, the ISAM tree can be bulk-loaded
+in linear time. The insertion analysis proceeds like the WSS example
+previously discussed. The initial construction cost is $C_c(N_b) \in
+O(N_b \log N_b)$ and the reconstruction cost is $C_r(n) \in O(n)$. The ISAM
+tree supports point-lookups in $L(n) \in O(\log_f n)$ time, where $f$
+is the fanout of the tree.
+
+The process for performing range sampling against the ISAM tree involves
+two stages. First, the tree is traversed twice: once to establish the index of
+the first record greater than or equal to the lower bound of the query,
+and again to find the index of the last record less than or equal to the
+upper bound of the query. This process has the effect of providing the
+number of records within the query range, and can be used to determine
+the weight of the shard in the shard alias structure. Its cost is $P(n)
+\in O(\log_f n)$. Once the bounds are established, samples can be drawn
+by randomly generating uniform integers between the upper and lower bound,
+in $S(n) \in O(1)$ time each.
+
+This results in the extended version of the ISAM tree having the following
+insert, sampling, and delete costs,
+\begin{align*}
+    \text{Insertion:} \quad &O\left(\log_s n\right) \\
+    \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta}\cdot R(n)\right) \\
+    \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
+\end{align*}
+where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
+tombstones.
+
+
+\subsection{Dynamically Extended WIRS Structure}
+\label{ssec:wirs-struct}
+As a final example of applying this framework, the WIRS problem will be
+considered.
Specifically, the alias-augmented B+tree approach, described
+by Tao~\cite{tao22}, which generalizes work by Afshani and
+Wei~\cite{afshani17} and by Hu et al.~\cite{hu14}, will be extended.
+This structure allows for efficient point-lookups, as it is based on the
+B+tree, and the total weight of a given WIRS query can be calculated given
+the query range using aggregate weight tags within the tree.
+
+The alias-augmented B+tree is a static structure of linear space, capable
+of being built initially in $C_c(N_b) \in O(N_b \log N_b)$ time, being
+bulk-loaded from sorted lists of records in $C_r(n) \in O(n)$ time,
+and answering WIRS queries in $O(\log_f n + k)$ time, where the query
+cost consists of preliminary work to identify the sampling range
+and calculate the total weight, with $P(n) \in O(\log_f n)$ cost, and
+constant-time drawing of samples from that range with $S(n) \in O(1)$.
+This results in the following costs,
+\begin{align*}
+    \text{Insertion:} \quad &O\left(\log_s n\right) \\
+    \text{Sampling:} \quad &O\left(\log_s n \log_f n + \frac{k}{1 - \delta} \cdot R(n)\right) \\
+    \text{Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
+\end{align*}
+where $R(n) \in O(1)$ for tagging and $R(n) \in O(\log_s n \log_f n)$ for
+tombstones. Because this is a weighted sampling structure, the custom
+compaction trigger discussed in Section~\ref{ssec:wss-struct} is applied
+to maintain bounded rejection rates during sampling.
+
diff --git a/chapters/sigmod23/exp-baseline.tex b/chapters/sigmod23/exp-baseline.tex
new file mode 100644
index 0000000..9e7929c
--- /dev/null
+++ b/chapters/sigmod23/exp-baseline.tex
@@ -0,0 +1,98 @@
+\subsection{Comparison to Baselines}
+
+Next, the performance of indexes extended using the framework is compared
+against tree sampling on the aggregate B+tree, as well as problem-specific
+SSIs for WSS, WIRS, and IRS queries. Unless otherwise specified, IRS and WIRS
+queries were executed with a selectivity of $0.1\%$, and 500 million randomly
+selected records from the OSM dataset were used. The uniform and zipfian
+synthetic datasets were 1 billion records in size. All benchmarks warmed up the
+data structure by inserting 10\% of the records, and then measured the
+throughput inserting the remaining records, while deleting 5\% of them over the
+course of the benchmark. Once all records were inserted, the sampling
+performance was measured. The reported update throughputs were calculated using
+both inserts and deletes, following the warmup period.
+
+\begin{figure*}
+    \centering
+    \subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wss-insert} \label{fig:wss-insert}}
+    \subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wss-sample} \label{fig:wss-sample}} \\
+    \subfloat[Insertion Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-wss-insert} \label{fig:wss-insert-s}}
+    \subfloat[Sampling Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-wss-sample} \label{fig:wss-sample-s}}
+    \caption{Framework Comparisons to Baselines for WSS}
+\end{figure*}
+
+Starting with WSS, Figure~\ref{fig:wss-insert} shows that the DE-WSS structure
+is competitive with the AGG B+tree in terms of insertion performance, achieving
+about 85\% of the AGG B+tree's insertion throughput on the Twitter dataset, and
+beating it by similar margins on the other datasets.
In terms of sampling
+performance in Figure~\ref{fig:wss-sample}, it beats the B+tree handily, and
+compares favorably to the static alias structure. Figures~\ref{fig:wss-insert-s}
+and \ref{fig:wss-sample-s} show the performance scaling of the three structures
+as the dataset size increases. All of the structures exhibit the same type of
+performance degradation with respect to dataset size.
+
+\begin{figure*}
+    \centering
+    \subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wirs-insert} \label{fig:wirs-insert}}
+    \subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wirs-sample} \label{fig:wirs-sample}}
+    \caption{Framework Comparison to Baselines for WIRS}
+\end{figure*}
+
+Figures~\ref{fig:wirs-insert} and \ref{fig:wirs-sample} show the performance of
+the DE-WIRS index, relative to the AGG B+tree and the alias-augmented B+tree.
+This example shows the same pattern of behavior as was seen with DE-WSS, though
+the margin between DE-WIRS and its corresponding SSI is much narrower.
+Additionally, the constant factors associated with the construction cost of the
+alias-augmented B+tree are much larger than those of the alias structure. The
+loss of insertion performance due to this is seen clearly in
+Figure~\ref{fig:wirs-insert}, where the margin of advantage between DE-WIRS and
+the AGG B+tree in insertion throughput shrinks compared to the DE-WSS index,
+and the AGG B+tree's advantage on the Twitter dataset is expanded.
+
+\begin{figure*}
+    \subfloat[Insertion Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-insert} \label{fig:irs-insert-s}}
+    \subfloat[Sampling Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-sample} \label{fig:irs-sample-s}} \\
+
+    \subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-insert} \label{fig:irs-insert1}}
+    \subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-sample} \label{fig:irs-sample1}} \\
+
+    \subfloat[Delete Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-delete} \label{fig:irs-delete}}
+    \subfloat[Sampling Latency vs. Sample Size]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-samplesize} \label{fig:irs-samplesize}}
+    \caption{Framework Comparison to Baselines for IRS}
+
+\end{figure*}
+Finally, Figures~\ref{fig:irs-insert1} and \ref{fig:irs-sample1} show a
+comparison of the in-memory DE-IRS index against the in-memory ISAM tree and
+the AGG B+tree for answering IRS queries. The cost of bulk-loading the ISAM
+tree is less than the cost of building the alias structure or the
+alias-augmented B+tree, and so here DE-IRS beats the AGG B+tree by wider
+margins in insertion throughput, though its margin of advantage in sampling
+performance narrows significantly.
+
+DE-IRS was further tested to evaluate scalability.
+Figure~\ref{fig:irs-insert-s} shows average insertion throughput,
+Figure~\ref{fig:irs-delete} shows average delete latency (under tagging), and
+Figure~\ref{fig:irs-sample-s} shows average sampling latencies for DE-IRS and
+the AGG B+tree over a range of data sizes. In all cases, DE-IRS and the B+tree
+show similar patterns of performance degradation as the data size grows.
Note that
+the delete latencies of DE-IRS are worse than those of the AGG B+tree, because
+of the B+tree's cheaper point-lookups.
+
+Figure~\ref{fig:irs-sample-s}
+also includes one other point of interest: the sampling performance of
+DE-IRS \emph{improves} when the data size grows from one million to ten million
+records. While at first glance the performance increase may appear paradoxical,
+it actually demonstrates an important result concerning the effect of the
+unsorted mutable buffer on index performance. At one million records, the
+buffer constitutes approximately 1\% of the total data size; this results in
+the buffer being sampled from with greater frequency (as it has more total
+weight) than would be the case with larger data. The greater the frequency of
+buffer sampling, the more rejections will occur, and the worse the sampling
+performance will be. This illustrates the importance of keeping the buffer
+small, even when a scan is not used for buffer sampling. Finally,
+Figure~\ref{fig:irs-samplesize} shows the decreasing per-sample cost as the
+number of records requested by a sampling query grows for DE-IRS, compared to
+the AGG B+tree. Note that DE-IRS benefits significantly more from batching
+samples than the AGG B+tree, and that the improvement is greatest up to
+$k=100$ samples per query.
+
diff --git a/chapters/sigmod23/exp-extensions.tex b/chapters/sigmod23/exp-extensions.tex
new file mode 100644
index 0000000..d929e92
--- /dev/null
+++ b/chapters/sigmod23/exp-extensions.tex
@@ -0,0 +1,40 @@
+\subsection{External and Concurrent Extensions}
+
+\begin{figure*}[h]%
+    \centering
+    \subfloat[External Insertion Throughput]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-ext-insert.pdf} \label{fig:ext-insert}}
+    \subfloat[External Sampling Latency]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-ext-sample.pdf} \label{fig:ext-sample}} \\
+
+    \subfloat[Concurrent Insert Latency vs. Throughput]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-cc-irs-scale} \label{fig:con-latency}}
+    \subfloat[Concurrent Insert Throughput vs. Thread Count]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-cc-irs-thread} \label{fig:con-tput}}
+
+    \caption{External and Concurrent Extensions of DE-IRS}
+    \label{fig:irs-extensions}
+\end{figure*}
+
+Proof-of-concept implementations of external and concurrent extensions were
+also tested for IRS queries. Figures~\ref{fig:ext-sample} and
+\ref{fig:ext-insert} show the performance of the external DE-IRS sampling
+index against the AB-tree. DE-IRS was configured with 4 in-memory levels,
+using at most 350 MiB of memory in testing, including Bloom filters.
+For DE-IRS, the \texttt{O\_DIRECT} flag was used to disable OS caching, and
+CGroups were used to limit process memory to 1 GiB to simulate a
+memory-constrained environment. The AB-tree implementation tested
+had a cache, which was configured with a memory budget of 64 GiB. This extra
+memory was provided to be fair to the AB-tree: because it uses per-sample
+tree-traversals, it is much more reliant on caching for good performance.
+DE-IRS was tested without a caching layer. The tests were performed with
+4 billion (80 GiB) and 8 billion (162 GiB) uniform and zipfian
+records, and 2.6 billion (55 GiB) OSM records. DE-IRS outperformed the AB-tree
+by over an order of magnitude in both insertion and sampling performance.
+
+Finally, Figures~\ref{fig:con-latency} and \ref{fig:con-tput} show the
+multi-threaded insertion performance of the in-memory DE-IRS index with
+concurrency support, compared to the AB-tree running entirely in memory, using
+the synthetic uniform dataset. Note that in Figure~\ref{fig:con-latency}, some
+of the AB-tree results are cut off, due to having significantly lower
+throughput and higher latency compared with DE-IRS. Even without concurrent
+merging, the framework shows linear scaling up to 4 threads of insertion,
+before leveling off; throughput remains flat even up to 32 concurrent
+insertion threads. An implementation with support for concurrent merging would
+scale even better.
diff --git a/chapters/sigmod23/exp-parameter-space.tex b/chapters/sigmod23/exp-parameter-space.tex
new file mode 100644
index 0000000..d2057ac
--- /dev/null
+++ b/chapters/sigmod23/exp-parameter-space.tex
@@ -0,0 +1,105 @@
+\subsection{Framework Design Space Exploration}
+\label{ssec:ds-exp}
+
+The proposed framework brings with it a large design space, described in
+Section~\ref{ssec:design-space}. First, this design space will be examined
+using a standardized benchmark to measure the average insertion throughput and
+sampling latency of DE-WSS at several points within this space. Tests were run
+using a random selection of 500 million records from the OSM dataset, with the
+index warmed up by the insertion of 10\% of the total records prior to
+beginning any measurement. Over the course of the insertion period, 5\% of the
+records were deleted, except for the tests in
+Figures~\ref{fig:insert_delete_prop}, \ref{fig:sample_delete_prop}, and
+\ref{fig:bloom}, in which 25\% of the records were deleted. Reported update
+throughputs were calculated using both inserts and deletes, following the
+warmup period. The standard values used for parameters not being varied in a
+given test were $s = 6$, $N_b = 12000$, $k=1000$, and $\delta = 0.05$, with
+buffer rejection sampling.
+
+\begin{figure*}
+    \centering
+    \subfloat[Insertion Throughput vs. Mutable Buffer Capacity]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-ps-wss-mt-insert} \label{fig:insert_mt}}
+    \subfloat[Insertion Throughput vs. Scale Factor]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-ps-wss-sf-insert} \label{fig:insert_sf}} \\
+
+    \subfloat[Insertion Throughput vs.\\Max Delete Proportion]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-ps-wss-tp-insert} \label{fig:insert_delete_prop}}
+    \subfloat[Per 1000 Sampling Latency vs.\\Mutable Buffer Capacity]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-ps-wss-mt-sample} \label{fig:sample_mt}} \\
+
+    \caption{DE-WSS Design Space Exploration I}
+    \label{fig:parameter-sweeps1}
+\end{figure*}
+
+The results of this testing are displayed in
+Figures~\ref{fig:parameter-sweeps1}, \ref{fig:parameter-sweeps2},
+and~\ref{fig:parameter-sweeps3}. The two largest contributors to differences
+in performance were the selection of layout policy and of delete policy.
+Figures~\ref{fig:insert_mt} and \ref{fig:insert_sf} show that the choice of
+layout policy plays a larger role than delete policy in insertion performance,
+with tiering outperforming leveling in both configurations. The situation is
+reversed in sampling performance, seen in Figures~\ref{fig:sample_mt} and
+\ref{fig:sample_sf}, where the performance difference between layout policies
+is far less than between delete policies.
+
+The values used for the scale factor and buffer size have less influence than
+the layout and delete policies. Sampling performance is largely independent of
+them over the ranges of values tested, as shown in Figures~\ref{fig:sample_mt}
+and \ref{fig:sample_sf}. This isn't surprising, as these parameters adjust the
+number of shards, which only contributes to shard alias construction time
+during sampling and is amortized over all samples taken in a query. The
+buffer also contributes rejections, but the cost of a rejection is small and
+the buffer constitutes only a small portion of the total weight, so these are
+negligible. However, under tombstones there is an upward trend in latency with
+buffer size, as delete checks occasionally require a full buffer scan. The
+effect of buffer size on insertion is shown in Figure~\ref{fig:insert_mt}.
+There is only a small improvement in insertion performance as the mutable
+buffer grows. This is because a larger buffer results in fewer reconstructions,
+but these reconstructions individually take longer, and so the net positive
+effect is less than might be expected. Finally, Figure~\ref{fig:insert_sf}
+shows the effect of scale factor on insertion performance. As expected, tiering
+performs better with higher scale factors, whereas the insertion performance of
+leveling trails off as the scale factor is increased, due to write
+amplification.
+
+\begin{figure*}
+    \centering
+    \subfloat[Per 1000 Sampling Latency vs. Scale Factor]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-ps-wss-sf-sample} \label{fig:sample_sf}}
+    \subfloat[Per 1000 Sampling Latency vs. Max Delete Proportion]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-ps-wss-tp-sample}\label{fig:sample_delete_prop}} \\
+    \caption{DE-WSS Design Space Exploration II}
+    \label{fig:parameter-sweeps2}
+\end{figure*}
+
+Figures~\ref{fig:insert_delete_prop} and \ref{fig:sample_delete_prop} show the
+cost of maintaining $\delta$ with a base delete rate of 25\%. The low cost of
+an in-memory sampling rejection results in only a slight upward trend in the
+sampling latency as the number of deleted records increases. While compaction
+is necessary to avoid pathological cases, there does not seem to be a
+significant benefit to aggressive compaction thresholds.
+Figure~\ref{fig:insert_delete_prop} shows the effect of compactions on insert
+performance. There is little effect on performance under tagging, but there is
+a clear negative performance trend associated with aggressive compaction when
+using tombstones. Under tagging, a single compaction is guaranteed to remove
+all deleted records on a level, whereas with tombstones a compaction can
+cascade for multiple levels before the delete bound is satisfied, resulting in
+a larger cost per incident.
+
+\begin{figure*}
+    \centering
+    \subfloat[Sampling Latency vs. Sample Size]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-ps-wss-samplesize} \label{fig:sample_k}}
+    \subfloat[Per 1000 Sampling Latency vs. Bloom Filter Memory]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-ps-wss-bloom}\label{fig:bloom}} \\
+    \caption{DE-WSS Design Space Exploration III}
+    \label{fig:parameter-sweeps3}
+\end{figure*}
+
+Figure~\ref{fig:bloom} demonstrates the trade-off between memory usage for
+Bloom filters and sampling performance under tombstones. This test was run
+using 25\% incoming deletes with no compaction, to maximize the number of
+tombstones within the index as a worst-case scenario.
As expected, allocating
+more memory to Bloom filters, decreasing their false positive rates,
+accelerates sampling. Finally, Figure~\ref{fig:sample_k} shows the relationship
+between average per-sample latency and the sample set size. It shows the effect
+of amortizing the initial shard alias setup work across an increasing number of
+samples, with $k=100$ as the point at which latency levels off.
+
+Based upon these results, a set of parameters was established for the extended
+indexes, which is used in the next section for baseline comparisons. This
+standard configuration uses tagging as the delete policy and tiering as the
+layout policy, with $k=1000$, $N_b = 12000$, $\delta = 0.05$, and $s = 6$.
diff --git a/chapters/sigmod23/experiment.tex b/chapters/sigmod23/experiment.tex
new file mode 100644
index 0000000..75cf32e
--- /dev/null
+++ b/chapters/sigmod23/experiment.tex
@@ -0,0 +1,48 @@
+\section{Evaluation}
+\label{sec:experiment}
+
+\Paragraph{Experimental Setup.} All experiments were run under Ubuntu 20.04 LTS
+on a dual-socket Intel Xeon Gold 6242R server with 384 GiB of physical memory
+and 40 physical cores. External tests were run using a 4 TB WD Red SA500 SATA
+SSD, rated for 95000 and 82000 IOPS for random reads and writes respectively.
+
+\Paragraph{Datasets.} Testing utilized a variety of synthetic and real-world
+datasets. For all datasets used, the key was represented as a 64-bit integer,
+the weight as a 64-bit integer, and the value as a 32-bit integer. Each record
+also contained a 32-bit header. The weight was omitted from IRS testing.
+Keys and weights were pulled from the dataset directly, and values were
+generated separately and were unique for each record. The following datasets
+were used,
+\begin{itemize}
+\item \textbf{Synthetic Uniform.} A non-weighted, synthetically generated list
+      of keys drawn from a uniform distribution.
+\item \textbf{Synthetic Zipfian.} A non-weighted, synthetically generated list
+      of keys drawn from a Zipfian distribution with a skew of $0.8$.
+\item \textbf{Twitter~\cite{data-twitter,data-twitter1}.} $41$ million Twitter
+      user ids, weighted by follower counts.
+\item \textbf{Delicious~\cite{data-delicious}.} $33.7$ million URLs,
+      represented using unique integers, weighted by the number of associated
+      tags.
+\item \textbf{OSM~\cite{data-osm}.} $2.6$ billion geospatial coordinates for
+      points of interest, collected by OpenStreetMap. The latitude, converted
+      to a 64-bit integer, was used as the key and the number of its
+      associated semantic tags as the weight.
+\end{itemize}
+The synthetic datasets were not used for weighted experiments, as they do not
+have weights. For unweighted experiments, the Twitter and Delicious datasets
+were not used, as they have uninteresting key distributions.
+
+\Paragraph{Compared Methods.} In this section, indexes extended using the
+framework are compared against existing dynamic baselines. Specifically, DE-WSS
+(Section~\ref{ssec:wss-struct}), DE-IRS (Section~\ref{ssec:irs-struct}), and
+DE-WIRS (Section~\ref{ssec:wirs-struct}) are examined. In-memory extensions are
+compared against the B+tree with aggregate weight tags on internal nodes (AGG
+B+tree)~\cite{olken95}, and concurrent and external extensions are compared
+against the AB-tree~\cite{zhao22}. Sampling performance is also compared
+against comparable static sampling indexes: the alias
+structure~\cite{walker74} for WSS, the in-memory ISAM tree for IRS, and the
+alias-augmented B+tree~\cite{afshani17} for WIRS.
Note that all structures under test, with the exception of the +external DE-IRS and external AB-tree, were contained entirely within system +memory. All benchmarking code and data structures were implemented using C++17 +and compiled using gcc 11.3.0 at the \texttt{-O3} optimization level. The +extension framework itself, excluding the shard implementations and utility +headers, consisted of a header-only library of about 1200 SLOC. diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex new file mode 100644 index 0000000..6c242e9 --- /dev/null +++ b/chapters/sigmod23/extensions.tex @@ -0,0 +1,57 @@ +\captionsetup[subfloat]{justification=centering} +\section{Extensions} +\label{sec:discussion} +In this section, various extensions of the framework are considered. +Specifically, the applicability of the framework to external or distributed +data structures is discussed, as well as the use of the framework to add +automatic support for concurrent updates and sampling to extended SSIs. + +\Paragraph{Larger-than-Memory Data.} This framework can be applied to external +static sampling structures with minimal modification. As a proof-of-concept, +the IRS structure was extended with support for shards containing external ISAM +trees. This structure supports storing a configurable number of shards in +memory, and the rest on disk, making it well suited for operating in +memory-constrained environments. The on-disk shards contain standard ISAM +trees, with $8\text{KiB}$ page-aligned nodes. The external version of the +index only supports tombstone-based deletes, as tagging would require random +writes. In principle a hybrid approach to deletes is possible, where a delete +first searches the in-memory data for the record to be deleted, tagging it if +found. If the record is not found, then a tombstone could be inserted. As the +data size grows, though, and the preponderance of data is found on disk, this +approach would largely revert to the standard tombstone approach in practice. +External settings make the framework even more attractive, in terms of +performance characteristics, due to the different cost model. In external data +structures, performance is typically measured in terms of the number of IO +operations, meaning that much of the overhead introduced by the framework for +tasks like querying the mutable buffer, building auxiliary structures, extra +random number generations due to the shard alias structure, and the like, +become far less significant. + +Because the framework maintains immutability of shards, it is also well suited for +use on top of distributed file-systems or with other distributed data +abstractions like RDDs in Apache Spark~\cite{rdd}. Each shard can be +encapsulated within an immutable file in HDFS or an RDD in Spark. A centralized +control node or driver program can manage the mutable buffer, flushing it into +a new file or RDD when it is full, merging with existing files or RDDs using +the same reconstruction scheme already discussed for the framework. This setup +allows for datasets exceeding the capacity of a single node to be supported. As +an example, XDB~\cite{li19} features an RDD-based distributed sampling +structure that could be supported by this framework. + +\Paragraph{Concurrency.} The immutability of the majority of the structures +within the index makes for a straightforward concurrency implementation. +Concurrency control on the buffer is made trivial by the fact it is a simple, +unsorted array. 
The rest of the structure is never updated (aside from possible
+delete tagging), and so concurrency becomes a simple matter of delaying the
+freeing of memory used by internal structures until all the threads accessing
+them have exited, rather than immediately on merge completion. A very basic
+concurrency implementation can be achieved by using the tombstone delete
+policy, and a reference counting scheme to control the deletion of the shards
+following reconstructions. Multiple insert buffers can be used to improve
+insertion throughput, as this will allow inserts to proceed in parallel with
+merges, ultimately allowing concurrency to scale up to the point of being
+bottlenecked by memory bandwidth and available storage. This proof-of-concept
+implementation is based on a simplified version of an approach proposed by
+Golan-Gueta et al. for concurrent log-structured data stores
+\cite{golan-gueta15}.
+
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
new file mode 100644
index 0000000..32a32e1
--- /dev/null
+++ b/chapters/sigmod23/framework.tex
@@ -0,0 +1,573 @@
+\section{Dynamic Sampling Index Framework}
+\label{sec:framework}
+
+This work is an attempt to design a solution to independent sampling
+that achieves \emph{both} efficient updates and near-constant cost per
+sample. As the goal is to tackle the problem in a generalized fashion,
+rather than to design problem-specific data structures for use as the basis
+of an index, a framework is created that allows already existing static
+data structures to be used as the basis for a sampling index, by
+automatically adding support for data updates using a modified version of
+the Bentley-Saxe method.
+
+Unfortunately, Bentley-Saxe as described in Section~\ref{ssec:bsm} cannot be
+directly applied to sampling problems. The concept of decomposability is not
+cleanly applicable to sampling, because the distribution of records in the
+result set, rather than the records themselves, must be matched following the
+result merge. Efficiently controlling the distribution requires each sub-query
+to access information external to the structure against which it is being
+processed, a contingency unaccounted for by Bentley-Saxe. Further, the process
+of reconstruction used in Bentley-Saxe provides poor worst-case complexity
+bounds~\cite{saxe79}, and attempts to modify the procedure to provide better
+worst-case performance are complex and have worse performance in the common
+case~\cite{overmars81}. Despite these limitations, this chapter will argue that
+the core principles of the Bentley-Saxe method can be profitably applied to
+sampling indexes, once a system for controlling result set distributions and a
+more effective reconstruction scheme have been devised. The solution to
+the former will be discussed in Section~\ref{ssec:sample}. For the latter,
+inspiration is drawn from the literature on the LSM tree.
+
+The LSM tree~\cite{oneil96} is a data structure proposed to optimize
+write throughput in disk-based storage engines. It consists of a memory
+table of bounded size, used to buffer recent changes, and a hierarchy
+of external levels containing indexes of exponentially increasing
+size. When the memory table has reached capacity, it is emptied into the
+external levels. Random writes are avoided by treating the data within
+the external levels as immutable; all writes go through the memory
+table.
This introduces write amplification but maximizes sequential
+writes, which is important for maintaining high throughput in disk-based
+systems. The LSM tree is associated with a broad and well-studied design
+space~\cite{dayan17,dayan18,dayan22,balmau19,dayan18-1} containing
+trade-offs between three key performance metrics: read performance, write
+performance, and auxiliary memory usage. The challenges
+faced in reconstructing predominantly in-memory indexes are quite
+different from those which the LSM tree is intended
+to address, having little to do with disk-based systems and sequential IO
+operations. However, the LSM tree possesses a rich design space for managing
+the periodic reconstruction of data structures in a manner that is both
+more practical and more flexible than that of Bentley-Saxe. By borrowing
+from this design space, this preexisting body of work can be leveraged,
+and many of Bentley-Saxe's limitations addressed.
+
+\captionsetup[subfloat]{justification=centering}
+
+\begin{figure*}
+    \centering
+    \subfloat[Leveling]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-leveling} \label{fig:leveling}}\\
+    \subfloat[Tiering]{\includegraphics[width=.75\textwidth]{img/sigmod23/merge-tiering} \label{fig:tiering}}
+
+    \caption{\textbf{A graphical overview of the sampling framework and its insert procedure.} A
+    mutable buffer (MB) sits atop two levels (L0, L1) containing shards (pairs
+    of SSIs and auxiliary structures [A]) using the leveling
+    (Figure~\ref{fig:leveling}) and tiering (Figure~\ref{fig:tiering}) layout
+    policies. Records are represented as black/colored squares, and grey
+    squares represent unused capacity. An insertion requiring a multi-level
+    reconstruction is illustrated.} \label{fig:framework}
+
+\end{figure*}
+
+
+\subsection{Framework Overview}
+The goal of this chapter is to build a general framework that extends most SSIs
+with efficient support for updates by splitting the index into small data structures
+to reduce reconstruction costs, and then distributing the sampling process over these
+smaller structures.
+The framework is designed to work efficiently with any SSI, so
+long as it has the following properties,
+\begin{enumerate}
+    \item The underlying full query $Q$ supported by the SSI, from whose results
+          samples are drawn, satisfies the following property:
+          for any dataset $D = \cup_{i = 1}^{n}D_i$
+          where $D_i \cap D_j = \emptyset$ for $i \neq j$,
+          $Q(D) = \cup_{i = 1}^{n}Q(D_i)$.
+    \item \emph{(Optional)} The SSI supports efficient point-lookups.
+    \item \emph{(Optional)} The SSI is capable of efficiently reporting the total weight of all records
+          returned by the underlying full query.
+\end{enumerate}
+
+The first property applies to the query being sampled from, and is essential
+for the correctness of sample sets reported by extended sampling
+indexes.\footnote{ This condition is stricter than the definition of a
+decomposable search problem in the Bentley-Saxe method, which allows for
+\emph{any} constant-time merge operation, not just union.
+However, this condition is satisfied by many common types of database
+query, such as predicate-based filtering queries.} The latter two properties
+are optional, but reduce deletion and sampling costs respectively. Should the
+SSI fail to support point-lookups, an auxiliary hash table can be attached to
+the data structures. Should it fail to support query result weight reporting,
+rejection sampling can be used in place of the more efficient scheme discussed
+in Section~\ref{ssec:sample}.
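+
+Taken together, these requirements suggest a minimal interface that a shard
+must expose to the framework. The following declaration is a hypothetical
+sketch of such an interface (the names are illustrative, not the framework's
+actual API):
+\begin{verbatim}
+#include <optional>
+#include <vector>
+
+template <typename Record, typename Query>
+class Shard {
+public:
+    // Built from raw records: initial construction from the buffer,
+    // or reconstruction from the pooled records of existing shards.
+    explicit Shard(std::vector<Record> records);
+
+    // Required: the full query must decompose by union, so per-shard
+    // result sets can be merged directly.
+    std::vector<Record> query(const Query& q) const;
+
+    // Optional: efficient point-lookup; if unsupported, an auxiliary
+    // hash table can be attached to the shard instead.
+    std::optional<Record> point_lookup(const Record& r) const;
+
+    // Optional: total weight of the records satisfying q; if
+    // unsupported, rejection sampling is used during shard selection.
+    double query_weight(const Query& q) const;
+};
+\end{verbatim}
+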
The analysis of this framework will generally
+assume that all three conditions are satisfied.
+
+Given an SSI with these properties, a dynamic extension can be produced as
+shown in Figure~\ref{fig:framework}. The extended index consists of disjoint
+shards, each containing an instance of the SSI being extended and optional
+auxiliary data structures. The auxiliary structures allow acceleration of
+certain operations that are required by the framework, but which the SSI being
+extended does not itself support efficiently. Examples of possible auxiliary
+structures include hash tables, Bloom filters~\cite{bloom70}, and range
+filters~\cite{zhang18,siqiang20}. The shards are arranged into levels of
+increasing record capacity, with either one shard, or up to a fixed maximum
+number of shards, per level. The decision to place one or many shards per level
+is called the \emph{layout policy}. The policy names are borrowed from the
+literature on the LSM tree, with the former called \emph{leveling} and the
+latter called \emph{tiering}.
+
+To avoid a reconstruction on every insert, an unsorted array of fixed capacity
+($N_b$), called the \emph{mutable buffer}, is used to buffer updates. Because it is
+unsorted, it is kept small to maintain reasonably efficient sampling
+and point-lookup performance. All updates are performed by appending new
+records to the tail of this buffer.
+If a record currently within the index is
+to be updated to a new value, it must first be deleted, and then a record with
+the new value inserted. This ensures that old versions of records are properly
+filtered from query results.
+
+When the buffer is full, it is flushed to make room for new records. The
+flushing procedure is based on the layout policy in use. When using leveling
+(Figure~\ref{fig:leveling}) a new SSI is constructed using both the records in
+$L_0$ and those in the buffer. This is used to create a new shard, which
+replaces the one previously in $L_0$. When using tiering
+(Figure~\ref{fig:tiering}) a new shard is built using only the records from the
+buffer, and placed into $L_0$ without altering the existing shards. The $i$-th
+level has a record capacity of $N_b \cdot s^{i+1}$, controlled by a
+configurable parameter, $s$, called the \emph{scale factor}. Records are
+organized in one large shard under leveling, or in $s$ shards of $N_b \cdot
+s^i$ capacity each under tiering. When a level reaches its capacity, it must
+be emptied to make room for the records flushed into it. This is accomplished
+by moving its records down to the next level of the index. Under leveling,
+this requires constructing a new shard containing all records from both the
+source and target levels, and placing this shard into the target, leaving the
+source empty. Under tiering, the shards in the source level are combined into
+a single new shard that is placed into the target level. Should the target be
+full, it is first emptied by applying the same procedure. New empty levels
+are dynamically added as necessary to accommodate these reconstructions.
+Note that shard reconstructions are not necessarily performed using
+merging, though merging can be used as an optimization of the reconstruction
+procedure where such an algorithm exists. In general, reconstruction requires
+only pooling the records of the shards being combined and then applying the SSI's
+standard construction algorithm to this set of records.
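+
+The following sketch illustrates the level-emptying cascade under both layout
+policies. It is a hypothetical, simplified fragment (the names are
+illustrative, and \texttt{build()} stands in for the SSI's construction
+algorithm applied to the pooled records):
+\begin{verbatim}
+#include <cstddef>
+#include <memory>
+#include <vector>
+
+struct Shard {
+    std::vector<int> records; // stand-in for the SSI's contents
+    static std::shared_ptr<Shard> build(std::vector<int> pooled) {
+        // The SSI's bulk construction algorithm would run here.
+        return std::make_shared<Shard>(Shard{std::move(pooled)});
+    }
+};
+
+struct Level {
+    std::vector<std::shared_ptr<Shard>> shards;
+    size_t record_count() const {
+        size_t n = 0;
+        for (auto& s : shards) n += s->records.size();
+        return n;
+    }
+};
+
+// Pool every record stored in a level (the reconstruction input).
+std::vector<int> pool(const Level& lvl) {
+    std::vector<int> out;
+    for (auto& s : lvl.shards)
+        out.insert(out.end(), s->records.begin(), s->records.end());
+    return out;
+}
+
+// Empty level i into level i+1, recursively making room below if the
+// target (capacity Nb * s^(i+2)) would overflow.
+void push_down(std::vector<Level>& levels, size_t i, size_t Nb,
+               size_t s, bool tiering) {
+    if (i + 1 == levels.size()) levels.emplace_back(); // grow the index
+    size_t cap = Nb;
+    for (size_t j = 0; j <= i + 1; j++) cap *= s;
+    if (levels[i + 1].record_count() + levels[i].record_count() > cap)
+        push_down(levels, i + 1, Nb, s, tiering);
+
+    if (tiering) { // combine source shards into one new target shard
+        levels[i + 1].shards.push_back(Shard::build(pool(levels[i])));
+    } else {       // leveling: rebuild the target's single shard
+        auto pooled = pool(levels[i]);
+        auto rest = pool(levels[i + 1]);
+        pooled.insert(pooled.end(), rest.begin(), rest.end());
+        levels[i + 1].shards = {Shard::build(std::move(pooled))};
+    }
+    levels[i].shards.clear();
+}
+\end{verbatim}
+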
+
+\begin{table}[t]
+\caption{Frequently Used Notation}
+\centering
+
+\begin{tabular}{|p{2.5cm} p{5cm}|}
+    \hline
+    \textbf{Variable} & \textbf{Description} \\ \hline
+    $N_b$ & Capacity of the mutable buffer \\ \hline
+    $s$ & Scale factor \\ \hline
+    $C_c(n)$ & SSI initial construction cost \\ \hline
+    $C_r(n)$ & SSI reconstruction cost \\ \hline
+    $L(n)$ & SSI point-lookup cost \\ \hline
+    $P(n)$ & SSI sampling pre-processing cost \\ \hline
+    $S(n)$ & SSI per-sample sampling cost \\ \hline
+    $W(n)$ & Shard weight determination cost \\ \hline
+    $R(n)$ & Shard rejection check cost \\ \hline
+    $\delta$ & Maximum delete proportion \\ \hline
+    %$\rho$ & Maximum rejection rate \\ \hline
+\end{tabular}
+\label{tab:nomen}
+
+\end{table}
+
+Table~\ref{tab:nomen} lists frequently used notation for the various parameters
+of the framework, which will be used in the coming analysis of the costs and
+trade-offs associated with operations within the framework's design space. The
+remainder of this section will discuss the performance characteristics of
+insertion into this structure (Section~\ref{ssec:insert}), how it can be used
+to correctly answer sampling queries (Section~\ref{ssec:sample}), and efficient
+approaches for supporting deletes (Section~\ref{ssec:delete}). Finally, it will
+close with a detailed discussion of the trade-offs within the framework's
+design space (Section~\ref{ssec:design-space}).
+
+
+\subsection{Insertion}
+\label{ssec:insert}
+The framework supports inserting new records by first appending them to the end
+of the mutable buffer. When it is full, the buffer is flushed into a sequence
+of levels containing shards of increasing capacity, using a procedure
+determined by the layout policy as discussed in Section~\ref{sec:framework}.
+This method allows for the cost of repeated shard reconstruction to be
+effectively amortized.
+
+Let the cost of constructing the SSI from an arbitrary set of $n$ records be
+$C_c(n)$ and the cost of reconstructing the SSI given two or more shards
+containing $n$ records in total be $C_r(n)$. The cost of an insert is composed
+of three parts: appending to the mutable buffer, constructing a new
+shard from the buffered records during a flush, and the total cost of
+reconstructing shards containing the record over the lifetime of the index. The
+cost of appending to the mutable buffer is constant, and the cost of constructing a
+shard from the buffer can be amortized across the records participating in the
+buffer flush, giving $\nicefrac{C_c(N_b)}{N_b}$. These costs are paid exactly once for
+each record. To derive an expression for the cost of repeated reconstruction,
+first note that each record will participate in at most $s$ reconstructions on
+a given level, resulting in a worst-case amortized cost of $O\left(s\cdot
+\nicefrac{C_r(n)}{n}\right)$ paid per level. The index itself will contain at most
+$\log_s n$ levels. Thus, over the lifetime of the index a given record
+will pay $O\left(s\cdot \nicefrac{C_r(n)}{n}\log_s n\right)$ cost in repeated
+reconstruction.
+
+Combining these results, the total amortized insertion cost is
+\begin{equation}
+O\left(\frac{C_c(N_b)}{N_b} + s \cdot \frac{C_r(n)}{n} \log_s n\right)
+\end{equation}
+This can be simplified by noting that $s$ is a constant, and that $N_b \ll n$
+is also a constant.
By neglecting these terms, the amortized insertion cost of the
+framework is,
+\begin{equation}
+O\left(\frac{C_r(n)}{n}\log_s n\right)
+\end{equation}
+
+
+\subsection{Sampling}
+\label{ssec:sample}
+
+\begin{figure}
+    \centering
+    \includegraphics[width=\textwidth]{img/sigmod23/sampling}
+    \caption{\textbf{Overview of the multiple-shard sampling query process} for
+    Example~\ref{ex:sample} with $k=1000$. First, (1) the normalized weights of
+    the shards are determined, then (2) these weights are used to construct an
+    alias structure. Next, (3) the alias structure is queried $k$ times to
+    determine per-shard sample sizes, and then (4) sampling is performed.
+    Finally, (5) any rejected samples are retried starting from the alias
+    structure, and the process is repeated until the desired number of samples
+    has been retrieved.}
+    \label{fig:sample}
+
+\end{figure}
+
+For many SSIs, sampling queries are completed in two stages. Some preliminary
+processing is done to identify the range of records from which to sample, and then
+samples are drawn from that range. For example, IRS over a sorted list of
+records can be performed by first identifying the upper and lower bounds of the
+query range in the list, and then sampling records by randomly generating
+indexes within those bounds. The general cost of a sampling query can be
+modeled as $P(n) + k S(n)$, where $P(n)$ is the cost of preprocessing, $k$ is
+the number of samples drawn, and $S(n)$ is the cost of sampling a single
+record.
+
+When sampling from multiple shards, the situation grows more complex. For each
+sample, the shard to select the record from must first be decided. Consider an
+arbitrary sampling query $X(D, k)$ asking for a sample set of size $k$ against
+dataset $D$. The framework splits $D$ across $m$ disjoint shards, such that $D
+= \bigcup_{i=1}^m D_i$ and $D_i \cap D_j = \emptyset$ for all $i \neq j$. The
+framework must ensure that $X(D, k)$ and $\bigcup_{i=1}^m X(D_i, k_i)$ follow
+the same distribution, by selecting appropriate values for the $k_i$s. If care
+is not taken to balance the number of samples drawn from a shard with the total
+weight of the shard under $X$, then bias can be introduced into the sample
+set's distribution. The selection of $k_i$s can be viewed as an instance of WSS,
+and solved using the alias method.
+
+When sampling using the framework, first the weight of each shard under the
+sampling query is determined and a \emph{shard alias structure} built over
+these weights. Then, for each sample, the shard alias is used to
+determine the shard from which to draw the sample. Let $W(n)$ be the cost of
+determining this total weight for a single shard under the query. The initial
+setup cost, prior to drawing any samples, will be $O\left([W(n) + P(n)]\log_s
+n\right)$, as the preliminary work for sampling from each shard must be
+performed, as well as the weights determined and the alias structure
+constructed. In many cases, however, the preliminary work will also determine
+the total weight, and so the relevant operation need only be applied once to
+accomplish both tasks.
+
+To ensure that all records appear in the sample set with the appropriate
+probability, the mutable buffer itself must also be a valid target for
+sampling. There are two generally applicable techniques for this, both of
+which can be supported by the framework. The query being sampled from can be
+directly executed against the buffer and the result set used to build a
+temporary SSI, which can be sampled from.
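+
+Putting these pieces together, the complete sampling procedure might be
+sketched as follows. This is a hypothetical fragment (the interfaces and
+names are illustrative, and \texttt{std::discrete\_distribution} stands in
+for the shard alias structure); each sampling target is a shard, with the
+buffer, or a temporary SSI built over it, treated as one more target:
+\begin{verbatim}
+#include <optional>
+#include <random>
+#include <vector>
+
+template <typename Target, typename Query>
+std::vector<int> sample(std::vector<Target*>& targets, const Query& q,
+                        size_t k, std::mt19937& rng) {
+    // P(n) and W(n) work: per-target preprocessing and weights.
+    std::vector<double> weights;
+    for (auto* t : targets) {
+        t->preprocess(q);
+        weights.push_back(t->query_weight(q));
+    }
+    // Stand-in for the shard alias structure over these weights.
+    std::discrete_distribution<size_t> shard_alias(weights.begin(),
+                                                   weights.end());
+    std::vector<int> out;
+    while (out.size() < k) {
+        Target* t = targets[shard_alias(rng)];
+        std::optional<int> rec = t->draw(q, rng); // S(n) per record
+        if (rec) out.push_back(*rec); // rejections retry from the top
+    }
+    return out;
+}
+\end{verbatim}
+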
Alternatively, rejection +sampling can be used to sample directly from the buffer, without executing the +query. In this case, the total weight of the buffer is used for its entry in +the shard alias structure. This can result in the buffer being +over-represented in the shard selection process, and so any rejections during +buffer sampling must be retried starting from shard selection. These same +considerations apply to rejection sampling used against shards, as well. + + +\begin{example} + \label{ex:sample} + Consider executing a WSS query, with $k=1000$, across three shards + containing integer keys with unit weight. $S_1$ contains only the + key $-2$, $S_2$ contains all integers on $[1,100]$, and $S_3$ + contains all integers on $[101, 200]$. These structures are shown + in Figure~\ref{fig:sample}. Sampling is performed by first + determining the normalized weights for each shard: $w_1 = 0.005$, + $w_2 = 0.4975$, $w_3 = 0.4975$, which are then used to construct a + shard alias structure. The shard alias structure is then queried + $k$ times, resulting in a distribution of $k_i$s that is + commensurate with the relative weights of each shard. Finally, + each shard is queried in turn to draw the appropriate number + of samples. +\end{example} + + +Assuming that rejection sampling is used on the mutable buffer, the worst-case +time complexity for drawing $k$ samples from an index containing $n$ elements +with a sampling cost of $S(n)$ is, +\begin{equation} + \label{eq:sample-cost} + O\left(\left[W(n) + P(n)\right]\log_s n + kS(n)\right) +\end{equation} + +%If instead a temporary SSI is constructed, the cost of sampling +%becomes: $O\left(N_b + C_c(N_b) + (W(n) + P(n))\log_s n + kS(n)\right)$. + +\begin{figure} + \centering + \subfloat[Tombstone Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tombstone} \label{fig:delete-tombstone}}\\ + \subfloat[Tagging Rejection Check]{\includegraphics[width=.75\textwidth]{img/sigmod23/delete-tagging} \label{fig:delete-tag}} + + \caption{\textbf{Overview of the rejection check procedure for deleted records.} First, + a record is sampled (1). + When using the tombstone delete policy + (Figure~\ref{fig:delete-tombstone}), the rejection check starts by (2) querying + the bloom filter of the mutable buffer. The filter indicates the record is + not present, so (3) the filter on $L_0$ is queried next. This filter + returns a false positive, so (4) a point-lookup is executed against $L_0$. + The lookup fails to find a tombstone, so the search continues and (5) the + filter on $L_1$ is checked, which reports that the tombstone is present. + This time, it is not a false positive, and so (6) a lookup against $L_1$ + (7) locates the tombstone. The record is thus rejected. When using the + tagging policy (Figure~\ref{fig:delete-tag}), (1) the record is sampled and + (2) checked directly for the delete tag. It is set, so the record is + immediately rejected.} + + \label{fig:delete} + +\end{figure} + + +\subsection{Deletion} +\label{ssec:delete} + +Because the shards are static, records cannot be arbitrarily removed from them. +This requires that deletes be supported in some other way, with the ultimate +goal being the prevention of deleted records' appearance in sampling query +result sets. This can be realized in two ways: locating the record and marking +it, or inserting a new record which indicates that an existing record should be +treated as deleted. 
The framework supports both of these techniques, the +selection of which is called the \emph{delete policy}. The former policy is +called \emph{tagging} and the latter \emph{tombstone}. + +Tagging a record is straightforward. Point-lookups are performed against each +shard in the index, as well as the buffer, for the record to be deleted. When +it is found, a bit in a header attached to the record is set. When sampling, +any records selected with this bit set are automatically rejected. Tombstones +represent a lazy strategy for deleting records. When a record is deleted using +tombstones, a new record with identical key and value, but with a ``tombstone'' +bit set, is inserted into the index. A record's presence can be checked by +performing a point-lookup. If a tombstone with the same key and value exists +above the record in the index, then it should be rejected when sampled. + +Two important aspects of performance are pertinent when discussing deletes: the +cost of the delete operation, and the cost of verifying the presence of a +sampled record. The choice of delete policy represents a trade-off between +these two costs. Beyond this simple trade-off, the delete policy also has other +implications that can affect its applicability to certain types of SSI. Most +notably, tombstones do not require any in-place updating of records, whereas +tagging does. This means that using tombstones is the only way to ensure total +immutability of the data within shards, which avoids random writes and eases +concurrency control. The tombstone delete policy, then, is particularly +appealing in external and concurrent contexts. + +\Paragraph{Deletion Cost.} The cost of a delete under the tombstone policy is +the same as an ordinary insert. Tagging, by contrast, requires a point-lookup +of the record to be deleted, and so is more expensive. Assuming a point-lookup +operation with cost $L(n)$, a tagged delete must search each level in the +index, as well as the buffer, requiring $O\left(N_b + L(n)\log_s n\right)$ +time. + +\Paragraph{Rejection Check Costs.} In addition to the cost of the delete +itself, the delete policy affects the cost of determining if a given record has +been deleted. This is called the \emph{rejection check cost}, $R(n)$. When +using tagging, the information necessary to make the rejection decision is +local to the sampled record, and so $R(n) \in O(1)$. However, when using tombstones +it is not; a point-lookup must be performed to search for a given record's +corresponding tombstone. This look-up must examine the buffer, and each shard +within the index. This results in a rejection check cost of $R(n) \in O\left(N_b + +L(n) \log_s n\right)$. The rejection check process for the two delete policies is +summarized in Figure~\ref{fig:delete}. + +Two factors contribute to the tombstone rejection check cost: the size of the +buffer, and the cost of performing a point-lookup against the shards. The +latter cost can be controlled using the framework's ability to associate +auxiliary structures with shards. For SSIs which do not support efficient +point-lookups, a hash table can be added to map key-value pairs to their +location within the SSI. This allows for constant-time rejection checks, even +in situations where the index would not otherwise support them. However, the +storage cost of this intervention is high, and in situations where the SSI does +support efficient point-lookups, it is not necessary. 
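+
+Under these assumptions, the tombstone rejection check can be sketched as
+follows (a hypothetical fragment; the names are illustrative):
+\begin{verbatim}
+#include <vector>
+
+// A sampled record is rejected if a matching tombstone exists above
+// it: in the mutable buffer, or in any shard newer than the one the
+// record was sampled from.
+template <typename Record, typename Shard>
+bool is_deleted(const Record& r, size_t source_shard,
+                const std::vector<Record>& buffer,
+                const std::vector<Shard*>& shards /* newest first */) {
+    for (const auto& b : buffer)       // O(N_b) buffer scan
+        if (b.is_tombstone() && b.matches(r)) return true;
+    for (size_t i = 0; i < source_shard; i++)
+        if (shards[i]->find_tombstone(r)) return true; // L(n) each
+    return false;
+}
+\end{verbatim}
+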
Further performance
+improvements can be achieved by noting that the probability of a given record
+having an associated tombstone in any particular shard is relatively small.
+This means that many point-lookups will be executed against shards that do not
+contain the tombstone being searched for. Many of these unnecessary lookups
+can be avoided by maintaining Bloom filters~\cite{bloom70} for tombstones. By
+inserting tombstones into these filters during reconstruction, point-lookups
+against shards that do not contain the tombstone being searched for can often
+be bypassed. Filters can be attached to the buffer as well, which may be even
+more significant due to the linear cost of scanning it. As the goal is a
+reduction of rejection check costs, these filters need only be populated with
+tombstones. In a later section, techniques for bounding the number of
+tombstones on a given level are discussed, which allow the memory usage of
+these filters to be tightly controlled while still ensuring precise bounds on
+filter error.
+
+\Paragraph{Sampling with Deletes.} The addition of deletes to the framework
+alters the analysis of sampling costs. A record that has been deleted cannot
+be present in the sample set, and therefore the presence of each sampled
+record must be verified. If a record has been deleted, it must be rejected.
+When retrying samples rejected due to deletes, the process must restart from
+shard selection, as deleted records may be counted in the weight totals used
+to construct that structure. This increases the cost of sampling to,
+\begin{equation}
+\label{eq:sampling-cost}
+    O\left([W(n) + P(n)]\log_s n + \frac{k}{1 - \mathbf{Pr}[\text{rejection}]} \cdot \left(S(n) + R(n)\right)\right)
+\end{equation}
+where $R(n)$ is the cost of checking whether a sampled record has been
+deleted, and $\nicefrac{k}{1 -\mathbf{Pr}[\text{rejection}]}$ is the expected
+number of sampling attempts required to obtain $k$ samples, given a fixed
+rejection probability. The rejection probability itself is a function of the
+workload, and is unbounded.
+
+\Paragraph{Bounding the Rejection Probability.} Rejections during sampling
+constitute wasted memory accesses and random number generations, and so steps
+should be taken to minimize their frequency. The probability of a rejection is
+directly related to the number of deleted records, which is itself a function
+of workload and dataset. This means that, without building counter-measures
+into the framework, tight bounds on sampling performance cannot be provided in
+the presence of deleted records. It is therefore critical that the framework
+support some method for bounding the number of deleted records within the
+index.
+
+While the static nature of shards prevents the direct removal of records at
+the moment they are deleted, it does not prevent the removal of records during
+reconstruction. When using tagging, all tagged records encountered during
+reconstruction can simply be dropped. When using tombstones, however, the
+removal process is non-trivial. In principle, a rejection check could be
+performed for each record encountered during reconstruction, but this would
+increase reconstruction costs and introduce a new problem of tracking
+tombstones associated with records that have already been removed. Instead, a
+lazier approach can be used: delaying removal until a tombstone and its
+associated record participate in the same shard reconstruction.
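+
+The following sketch shows how this delayed, merge-time removal might be
+folded into a reconstruction pass. It reuses the hypothetical record layout
+from the previous sketch, and assumes that key-value pairs are unique and
+that the participating shards can be merged into a single sorted run.
+
+\begin{verbatim}
+#include <algorithm>
+#include <cstddef>
+#include <vector>
+
+// Reconstruction with cancellation: sort so that each tombstone lands
+// immediately before the record it deletes, then drop such pairs in a
+// single pass. A tombstone whose record is not participating in this
+// reconstruction survives, to be cancelled in a later one.
+std::vector<Record> reconstruct(std::vector<Record> in) {
+    std::sort(in.begin(), in.end(), [](const Record &a, const Record &b) {
+        if (a.key != b.key)     return a.key < b.key;
+        if (a.value != b.value) return a.value < b.value;
+        return (a.header & 0x1u) > (b.header & 0x1u);  // tombstone first
+    });
+
+    std::vector<Record> out;
+    for (size_t i = 0; i < in.size(); i++) {
+        bool ts = in[i].header & 0x1u;
+        if (ts && i + 1 < in.size() && in[i + 1].key == in[i].key
+               && in[i + 1].value == in[i].value) {
+            i++;  // skip the tombstone and its record: both cancelled
+            continue;
+        }
+        out.push_back(in[i]);
+    }
+    return out;
+}
+\end{verbatim}
+
+For SSIs that are already built from a sorted merge of their inputs, this
+check can ride along with the merge itself at no additional cost.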
This delay allows both the record
+and its tombstone to be removed at the same time, an approach called
+\emph{tombstone cancellation}. In general, this can be implemented using an
+extra linear scan of the input shards before reconstruction to identify
+tombstones and associated records for cancellation, but for many SSIs it can
+be performed during the reconstruction itself at no extra cost.
+
+The passive removal of deleted records during reconstruction is not, by
+itself, enough to bound the number of deleted records within the index. It is
+not difficult to envision pathological scenarios in which deletes result in
+unbounded rejection rates, even with this mitigation in place. However, the
+dropping of deleted records does provide a useful property: any specific
+deleted record will eventually be removed from the index after a finite
+number of reconstructions. Using this fact, a bound on the number of deleted
+records can be enforced. A new parameter, $\delta$, is defined, representing
+the maximum proportion of deleted records within the index. Each level, and
+the buffer, tracks the number of deleted records it contains by counting its
+tagged records or tombstones. Following each buffer flush, the proportion of
+deleted records is checked against $\delta$. If any level is found to exceed
+it, a proactive reconstruction is triggered, pushing its shards down into the
+next level. The process is repeated until all levels respect the bound,
+allowing the number of deleted records to be precisely controlled, which, by
+extension, bounds the rejection rate. This process is called
+\emph{compaction}.
+
+Assuming every record is equally likely to be sampled, this new bound can be
+applied to the analysis of sampling costs. The probability of a record being
+rejected is $\mathbf{Pr}[\text{rejection}] = \delta$. Applying this result to
+Equation~\ref{eq:sampling-cost} yields,
+\begin{equation}
+%\label{eq:sampling-cost-del}
+    O\left([W(n) + P(n)]\log_s n + \frac{k}{1 - \delta} \cdot \left(S(n) + R(n)\right)\right)
+\end{equation}
+
+Asymptotically, proactive compaction does not alter the analysis of insertion
+costs. Each record is still written at most $s$ times on each level, there
+are at most $\log_s n$ levels, and the buffer insertion and SSI construction
+costs are unchanged. The amortized insertion cost therefore remains the same.
+
+This compaction strategy is based upon tombstone and record counts, and the
+bounds assume that every record is equally likely to be sampled. For certain
+sampling problems (such as WSS), other conditions must be considered to
+provide a bound on the rejection rate. To account for these situations in a
+general fashion, the framework supports problem-specific compaction triggers
+that can be tailored to the SSI being used. These allow compactions to be
+triggered based on other properties, such as the rejection rate of a level or
+the total weight of deleted records.
+
+
+\subsection{Trade-offs on Framework Design Space}
+\label{ssec:design-space}
+The framework has several tunable parameters, allowing it to be tailored to
+specific applications. This design space contains trade-offs among three
+major performance characteristics: update cost, sampling cost, and auxiliary
+memory usage. The two most significant decisions when implementing this
+framework are the selection of the layout and delete policies.
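+
+To summarize the preceding discussion, the design space might be captured in
+a configuration block along the following lines. The names and default values
+are hypothetical, chosen for illustration rather than taken from any
+particular implementation.
+
+\begin{verbatim}
+#include <cstddef>
+
+enum class LayoutPolicy { LEVELING, TIERING };
+enum class DeletePolicy { TOMBSTONE, TAGGING };
+
+struct Config {
+    size_t       buffer_capacity = 12000; // N_b; ideally L2-cache sized
+    size_t       scale_factor    = 6;     // s; growth factor per level
+    LayoutPolicy layout          = LayoutPolicy::TIERING;
+    DeletePolicy deletes         = DeletePolicy::TOMBSTONE;
+    double       delete_bound    = 0.05;  // delta; exceeding it after a
+                                          // buffer flush triggers compaction
+    bool         tombstone_bloom = true;  // Bloom filters on tombstones
+};
+\end{verbatim}
+
+The remainder of this section examines these parameters in turn.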
The asymptotic analysis of the
+previous sections obscures some of the differences between these policies,
+but they have significant practical performance implications.
+
+\Paragraph{Layout Policy.} The choice of layout policy represents a clear
+trade-off between update and sampling performance. Leveling results in fewer
+shards of larger size, whereas tiering results in a larger number of smaller
+shards. As a result, leveling reduces the costs associated with point-lookups
+and sampling query preprocessing by a constant factor, compared to tiering.
+However, it results in more write amplification: a given record may be
+involved in up to $s$ reconstructions on a single level, as opposed to the
+single reconstruction per level under tiering.
+
+\Paragraph{Delete Policy.} The choice of delete policy trades delete
+performance against sampling performance. Tagging requires a point-lookup
+when performing a delete, which is more expensive than the insert required by
+tombstones. However, it also allows constant-time rejection checks, unlike
+tombstones, which require a point-lookup for each sampled record. In
+situations where deletes are common and write throughput is critical,
+tombstones may be more useful. Tombstones are also ideal in situations where
+immutability is required, or random writes must be avoided. Generally
+speaking, however, tagging is superior when using SSIs that support it,
+because sampling rejection checks will usually be more common than deletes.
+
+\Paragraph{Mutable Buffer Capacity and Scale Factor.} The mutable buffer
+capacity and scale factor both influence the number of levels within the
+index, and by extension the number of distinct shards. Sampling and
+point-lookups perform better with fewer shards. Smaller shards are also
+faster to reconstruct, although the same adjustments that reduce shard size
+also result in a larger number of reconstructions, so the trade-off here is
+less clear.
+
+The scale factor interacts with the layout policy: when using leveling, the
+scale factor directly controls the amount of write amplification per level.
+Larger scale factors mean more time is spent reconstructing shards on a
+level, reducing update performance. Tiering does not have this problem, and
+should see its update performance benefit directly from a larger scale
+factor, as this reduces the number of reconstructions.
+
+The buffer capacity also influences the number of levels, but is more
+significant in its effect on point-lookup performance: a lookup must perform
+a linear scan of the buffer. Likewise, the unstructured nature of the buffer
+will also contribute negatively to sampling performance, irrespective of
+which buffer sampling technique is used. As a result, although a large buffer
+will reduce the number of shards, it will also hurt sampling and (under
+tagging) delete performance. It is important to minimize the cost of these
+buffer scans, and so it is preferable to keep the buffer small, ideally small
+enough to fit within the CPU's L2 cache. The number of shards within the
+index is, then, better controlled by changing the scale factor, rather than
+the buffer capacity. Using a smaller buffer will result in more compactions
+and shard reconstructions; however, the empirical evaluation in
+Section~\ref{ssec:ds-exp} demonstrates that this is not a serious performance
+problem when the scale factor is chosen appropriately.
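+
+To give a rough sense of scale for these two parameters, suppose (as one
+common convention) that level $i$ has a capacity of $N_b \cdot s^{i+1}$
+records. An index of $n$ records then requires approximately
+\begin{equation}
+    \left\lceil \log_s \left(\frac{n}{N_b}\right) \right\rceil
+\end{equation}
+levels. With $N_b = 12000$ and $s = 6$, an index of $n = 10^9$ records spans
+roughly seven levels; doubling the buffer capacity saves at most one level,
+while doubling the scale factor to $s = 12$ reduces the count to five. These
+figures are illustrative only, but they show why the shard count is more
+effectively controlled through the scale factor than through the buffer
+capacity.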
Moreover, when the shards are in memory, frequent small
+reconstructions do not carry a significant performance penalty compared to
+less frequent, larger ones.
+
+\Paragraph{Auxiliary Structures.} The framework's support for arbitrary
+auxiliary data structures allows memory to be traded for insertion or
+sampling performance. The use of Bloom filters to accelerate tombstone
+rejection checks has already been discussed, but many other options exist.
+Bloom filters could also be used to accelerate point-lookups for delete
+tagging, though such filters would require much more memory than
+tombstone-only ones to be effective. An auxiliary hash table could be used to
+accelerate point-lookups, or range filters such as SuRF~\cite{zhang18} or
+Rosetta~\cite{siqiang20} could be added to accelerate pre-processing for
+range queries, as in IRS or WIRS.
diff --git a/chapters/sigmod23/introduction.tex b/chapters/sigmod23/introduction.tex
new file mode 100644
index 0000000..0155c7d
--- /dev/null
+++ b/chapters/sigmod23/introduction.tex
@@ -0,0 +1,20 @@
+\section{Introduction} \label{sec:intro}
+
+As a first attempt at realizing a dynamic extension framework, one of the
+non-decomposable search problems discussed in the previous chapter was
+considered: independent range sampling, along with a number of other
+independent sampling problems. These sorts of queries are important in a
+variety of contexts, including approximate query processing
+(AQP)~\cite{blinkdb,quickr,verdict,cohen23}, interactive data
+exploration~\cite{sps,xie21}, financial audit sampling~\cite{olken-thesis}, and
+feature selection for machine learning~\cite{ml-sampling}. However, they are
+not well served by existing techniques, which tend to sacrifice statistical
+independence for performance, or vice versa. In this chapter, a solution for
+independent sampling is presented that achieves both statistical independence
+and good performance, by designing a Bentley-Saxe-inspired framework for
+introducing update support to efficient static sampling data structures. It
+seeks to demonstrate the viability of Bentley-Saxe as the basis for adding
+update support to data structures, as well as to show that the limitations of
+the decomposable search problem abstraction can be overcome through
+alternative query processing techniques that preserve good performance.
diff --git a/chapters/sigmod23/relatedwork.tex b/chapters/sigmod23/relatedwork.tex
new file mode 100644
index 0000000..600cd0d
--- /dev/null
+++ b/chapters/sigmod23/relatedwork.tex
@@ -0,0 +1,33 @@
+\section{Related Work}
+\label{sec:related}
+
+The general IQS problem was first proposed by Hu, Qiao, and Tao~\cite{hu14} and
+has since been the subject of extensive research
+\cite{irsra,afshani17,xie21,aumuller20}. These papers involve the use of
+specialized indexes to assist in drawing samples efficiently from the result
+sets of specific types of query, and are largely focused on in-memory
+settings. A recent survey by Tao~\cite{tao22} acknowledged that dynamization
+remains a major challenge for efficient sampling indexes. There do exist
+specific examples of sampling indexes~\cite{hu14} designed to support dynamic
+updates, but they are specialized, and impractical due to their
+implementation complexity and high constant factors in their cost functions.
A
+static index for spatial independent range sampling~\cite{xie21} has been
+proposed with a dynamic extension similar to the one described in this
+chapter, but the method was not generalized, and its design space was not
+explored. There are also weight-updatable implementations of the alias
+structure~\cite{hagerup93,matias03,allendorf23} that function under various
+assumptions about the weight distribution. These are of limited utility in a
+database context, as they do not support direct insertion or deletion of
+entries. Efforts have also been made to improve tree-traversal-based sampling
+approaches. Notably, the AB-tree~\cite{zhao22} extends tree-sampling with
+support for concurrent updates, which has been a historical pain point.
+
+The Bentley-Saxe method was first proposed by Saxe and Bentley~\cite{saxe79}.
+Overmars and van Leeuwen extended this framework to provide better worst-case
+bounds~\cite{overmars81}, but their approach hurts common-case performance by
+splitting reconstructions into small pieces and executing these pieces each
+time a record is inserted. Though not commonly used in database systems, the
+method has been applied to specialized problems, such as the creation of
+dynamic metric indexing structures~\cite{naidan14}, the analysis of
+trajectories~\cite{custers19}, and genetic sequence search
+indexes~\cite{almodaresi23}.
diff --git a/chapters/vita.tex b/chapters/vita.tex
new file mode 100644
index 0000000..e69de29
--
cgit v1.2.3