| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-06-27 15:21:38 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-06-27 15:21:38 -0400 |
| commit | fcdbcbcd45dc567792429bb314df53b42ed9f22e | |
| tree | 3f7c135b7b32022fa0a9f03361e60cc0cc4f86e0 /chapters/sigmod23 | |
| parent | ff528e8595e82802832930fae6c9ccee7afd23cb | |
| download | dissertation-fcdbcbcd45dc567792429bb314df53b42ed9f22e.tar.gz | |
updates
Diffstat (limited to 'chapters/sigmod23')
| -rw-r--r-- | chapters/sigmod23/background.tex | 4 |
| -rw-r--r-- | chapters/sigmod23/examples.tex | 10 |
| -rw-r--r-- | chapters/sigmod23/exp-baseline.tex | 12 |
| -rw-r--r-- | chapters/sigmod23/experiment.tex | 18 |
| -rw-r--r-- | chapters/sigmod23/extensions.tex | 4 |
| -rw-r--r-- | chapters/sigmod23/framework.tex | 16 |
| -rw-r--r-- | chapters/sigmod23/introduction.tex | 2 |
7 files changed, 33 insertions, 33 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index 42a52de..984e36c 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -226,7 +226,7 @@ structure, a technique called \emph{alias augmentation}~\cite{tao22}.
 For example, alias augmentation can be used to construct an SSI capable
 of answering WIRS queries in $\Theta(\log n + k)$~\cite{afshani17,tao22}.
 This structure breaks the data into multiple disjoint partitions of size
-$\nicefrac{n}{\log n}$, each with an associated alias structure. A B+Tree
+$\nicefrac{n}{\log n}$, each with an associated alias structure. A B+tree
 is then built, using the augmented partitions as its leaf nodes. Each
 internal node is also augmented with an alias structure over the aggregate
 weights associated with the children of each pointer. Constructing this
@@ -266,7 +266,7 @@ following desiderata,
 
 \begin{enumerate}
 \item Support data updates (including deletes) with similar average
-performance to a standard B+Tree.
+performance to a standard B+tree.
 \item Support IQS queries that do not pay a per-sample cost
 proportional to some function of the data size. In other words,
 $k$ should \emph{not} be be multiplied by any function of $n$
diff --git a/chapters/sigmod23/examples.tex b/chapters/sigmod23/examples.tex
index 4e7f9ac..32807e1 100644
--- a/chapters/sigmod23/examples.tex
+++ b/chapters/sigmod23/examples.tex
@@ -74,7 +74,7 @@ makes progress towards removing it.
 \subsection{Independent Range Sampling (ISAM Tree)}
 \label{ssec:irs-struct}
 We will next considered independent range sampling. For this decomposable
-sampling problem, we use the ISAM Tree for the SSI. Because our shards are
+sampling problem, we use the ISAM tree for the SSI. Because our shards are
 static, we can build highly compact and efficient ISAM trees by storing
 the records directly in a sorted array. So long as the leaf node size is
 a multiple of the record size, this array can be treated as a sequence of
@@ -106,7 +106,7 @@ operations are,
 \text{Worst-case Tagged Delete:} \quad &O\left(\log_s n \log_f n\right)
 \end{align*}
 where $R(n) \in \Theta(1)$ for tagging and $R(n) \in \Theta(\log_s n \log_f n)$
-for tombstones and $f$ is the fanout of the ISAM Tree.
+for tombstones and $f$ is the fanout of the ISAM tree.
 
 \subsection{Weighted Independent Range Sampling (Alias-augmented B+Tree)}
 
@@ -114,13 +114,13 @@ for tombstones and $f$ is the fanout of the ISAM tree.
 \label{ssec:wirs-struct}
 As a final example of applying this framework, we consider WIRS. This
 is a decomposable sampling problem that can be answered using the
-alias-augmented B+Tree structure~\cite{tao22, afshani17,hu14}. This
+alias-augmented B+tree structure~\cite{tao22, afshani17,hu14}. This
 data structure is built over sorted data, but can be bulk-loaded from
 this data in linear time, resulting in costs of $B(n) \in \Theta(n \log
 n)$ and $B_M(n) \in \Theta(n)$, though the constant factors associated
 with these functions are quite high, as each bulk-loading requires multiple
-linear-time operations for building both the B+Tree and the alias
-structures, among other things. As it is built on a B+Tree, the structure
+linear-time operations for building both the B+tree and the alias
+structures, among other things. As it is built on a B+tree, the structure
 supports $L(n) \in \Theta(\log n)$ point lookups.
 Answering sampling queries requires $P(n) \in \Theta(\log n)$ pre-processing
 time to establish the query interval, during which the weight of the interval
diff --git a/chapters/sigmod23/exp-baseline.tex b/chapters/sigmod23/exp-baseline.tex
index d0e1ce0..4ae744b 100644
--- a/chapters/sigmod23/exp-baseline.tex
+++ b/chapters/sigmod23/exp-baseline.tex
@@ -1,7 +1,7 @@
 \subsection{Comparison to Baselines}
 
 Next, we compared the performance of our dynamized sampling indices with
-Olken's method on an aggregate B+Tree. We also examine the query performance
+Olken's method on an aggregate B+tree. We also examine the query performance
 of a single instance of the SSI in question to establish how much query
 performance is lost in the dynamization. Unless otherwise specified, IRS
 and WIRS queries are run with a selectivity of $0.1\%$. Additionally,
@@ -51,15 +51,15 @@ resulting in better performance.
 
 In Figures~\ref{fig:wirs-insert} and \ref{fig:wirs-sample} we examine
 the performed of \texttt{DE-WIRS} compared to \text{AGG B+tree} and an
-alias-augmented B+Tree. We see the same basic set of patterns in this
+alias-augmented B+tree. We see the same basic set of patterns in this
 case as we did with WSS. \texttt{AGG B+Tree} defeats our dynamized
 index on the \texttt{twitter} dataset, but loses on the others, in
 terms of insertion performance. We can see that the alias-augmented
-B+Tree is much more expensive to build than an alias structure, and
+B+tree is much more expensive to build than an alias structure, and
 so its insertion performance advantage is eroded somewhat compared to
 the dynamic structure. For queries we see that the \texttt{AGG B+Tree}
 performs similarly for WIRS sampling as it did for WSS sampling, but the
-alias-augmented B+Tree structure is quite a bit slower at WIRS than the
+alias-augmented B+tree structure is quite a bit slower at WIRS than the
 alias structure was at WSS. This results in \texttt{DE-WIRS} defeating
 the dynamic baseline by less of a margin in this test, but it still is
 superior in terms of sampling performance, and is still quite close in
@@ -81,7 +81,7 @@ being introduced by the dynamization.
 
 We next considered IRS queries. Figures~\ref{fig:irs-insert1} and
 \ref{fig:irs-sample1} show the results of our testing of single-threaded
-\texttt{DE-IRS} running in-memory against the in-memory ISAM Tree and
+\texttt{DE-IRS} running in-memory against the in-memory ISAM tree and
 \texttt{AGG B+tree}. The ISAM tree structure can be efficiently bulk-loaded,
 which results in a much faster construction time than the alias structure
 or alias-augmented B+tree. This gives it a significant update performance
@@ -112,7 +112,7 @@ to answer queries. However, as the sample set size increases, this cost
 increasingly begins to pay off, with \texttt{DE-IRS} quickly defeating
 the dynamic structure in average per-sample latency. One other interesting
 note is the performance of the static ISAM tree, which begins on-par with
-the B+Tree, but also sees an improvement as the sample set size increases.
+the B+tree, but also sees an improvement as the sample set size increases.
 This is because of cache effects. During the initial tree traversal,
 both the B+tree and ISAM tree have a similar number of cache misses.
 However, the ISAM tree needs to perform its traversal only once, and then samples
diff --git a/chapters/sigmod23/experiment.tex b/chapters/sigmod23/experiment.tex
index 1eb704c..14f59a7 100644
--- a/chapters/sigmod23/experiment.tex
+++ b/chapters/sigmod23/experiment.tex
@@ -60,7 +60,7 @@ following dynamized structures,
 \item \textbf{DE-WSS.} An implementation of the dynamized alias
 structure~\cite{walker74} for weighted set sampling discussed in
 Section~\ref{ssec:wss-struct}. We compare this against a WSS
-implementation of Olken's method on a B+Tree with aggregate weight tags
+implementation of Olken's method on a B+tree with aggregate weight tags
 (\textbf{AGG-BTree})~\cite{olken95}, based on the B+tree implementation
 in the TLX library~\cite{tlx}.
 
@@ -71,24 +71,24 @@ Section~\ref{ssec:ext-concurrency} and an external version from
 Section~\ref{ssec:ext-external}. We compare the external and concurrent
 versions against the AB-tree~\cite{zhao22}, and the single-threaded, in
 memory version was compare with an IRS implementation of Olken's
-method on an AGG-BTree.
+method on an \texttt{AGG-BTree}.
 
 \item \textbf{DE-WIRS.} An implementation of the dynamized alias-augmented
-B+Tree~\cite{afshani17} as discussed in Section~\ref{ssec:wirs-struct} for
+B+tree~\cite{afshani17} as discussed in Section~\ref{ssec:wirs-struct} for
 weighted independent range sampling. We compare this against a WIRS
-implementation of Olken's method on an AGG-BTree.
+implementation of Olken's method on \texttt{AGG-BTree}.
 \end{itemize}
 
 All of the tested structures, with the exception of the external memory
-DE-IRS implementation and AB-Tree, were wholly contained within system
-memory. AB-Tree is a native external structure, so for the in-memory
+DE-IRS implementation and AB-tree, were wholly contained within system
+memory. AB-tree is a native external structure, so for the in-memory
 concurrency evaluation we configured it with enough cache to maintain the
 entire structure in memory to simulate an in-memory implementation.\footnote{
     Because of the nature of sampling queries, traditional
-    efficient locking techniques for B+Trees are not able to be
-    used~\cite{zhao22}. The alternatives were to run AB-Tree in this
-    manner, or to globally lock the B+Tree for every operation. We
+    efficient locking techniques for B+trees are not able to be
+    used~\cite{zhao22}. The alternatives were to run AB-tree in this
+    manner, or to globally lock the B+tree for every operation. We
     elected to use the former approach for this chapter. We used the
     latter approach in the next chapter.
 }
diff --git a/chapters/sigmod23/extensions.tex b/chapters/sigmod23/extensions.tex
index 053c8e2..f77574d 100644
--- a/chapters/sigmod23/extensions.tex
+++ b/chapters/sigmod23/extensions.tex
@@ -19,7 +19,7 @@ to reside in memory, and the rest on disk. This allows for the smallest
 few shards, which sustain the most reconstructions, to reside in memory
 for performance, while storing most of the data on disk, in an attempt
 to get the best of both worlds, so to speak.\footnote{
-    In traditional LSM Trees, which are an external data structure,
+    In traditional LSM trees, which are an external data structure,
     only the memtable resides in memory. We have decided to break
     with this model because, for query performance reasons, the
     mutable buffer must remain small. By placing a few levels in memory, the
@@ -58,7 +58,7 @@ structure using in XDB~\cite{li19}.
 
 Because our dynamization technique is built on top of static data
 structures, a limited form of concurrency support is straightforward
 to implement. To that end, we created a proof-of-concept dynamization of an
-ISAM Tree for IRS based on a simplified version of a general concurrency
+ISAM tree for IRS based on a simplified version of a general concurrency
 controlled scheme for log-structured data stores~\cite{golan-gueta15}.
 
 First, we restrict ourselves to tombstone deletes. This ensures that
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index b3a8215..1eb2589 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -512,19 +512,19 @@ Though it has thus far gone unmentioned, some readers may have noted
 the astonishing similarity between decomposition-based dynamization
 techniques, and a data structure called the Log-structured
 Merge-tree. First proposed by O'Neil in the mid '90s\cite{oneil96},
-the LSM Tree was designed to optimize write throughput for external data
+the LSM tree was designed to optimize write throughput for external data
 structures. It accomplished this task by buffer inserted records in a
-small in-memory AVL Tree, and then flushing this buffer to disk when
+small in-memory AVL tree, and then flushing this buffer to disk when
 it filled up. The flush process itself would fully rebuild the on-disk
-structure (a B+Tree), including all of the currently existing records
+structure (a B+tree), including all of the currently existing records
 on external storage. O'Neil also proposed version which used several,
 layered, external structures, to reduce the cost of reconstruction.
 
-In more recent times, the LSM Tree has seen significant development and
+In more recent times, the LSM tree has seen significant development and
 been used as the basis for key-value stores like RocksDB~\cite{dong21}
 and LevelDB~\cite{leveldb}. This work has produced an incredibly
 large and well explored parametrization of the reconstruction
-procedures of LSM Trees, a good summary of which can be bounded in
+procedures of LSM trees, a good summary of which can be bounded in
 this recent tutorial paper~\cite{sarkar23}. Examples of this design
 space exploration include: different ways to organize each "level"
 of the tree~\cite{dayan19, dostoevsky, autumn}, different growth
@@ -534,7 +534,7 @@ auxiliary structures attached to the main ones for accelerating certain
 types of query~\cite{dayan18-1, zhu21, monkey}. This work is discussed
 in greater depth in Chapter~\ref{chap:related-work}.
 
-Many of the elements within the LSM Tree design space are based upon the
+Many of the elements within the LSM tree design space are based upon the
 specifics of the data structure itself, and are not applicable to our
 use case. However, some of the higher-level concepts can be imported and
 applied in the context of dynamization. Specifically, we have decided to
@@ -590,7 +590,7 @@ that can be used to help improve the performance of these searches,
 without requiring as much storage as adding auxiliary hash tables to
 every block, is to include bloom filters~\cite{bloom70}. A bloom filter
 is an approximate data structure that answers tests of set membership
-with bounded, single-sided error. These are commonly used in LSM Trees
+with bounded, single-sided error. These are commonly used in LSM trees
 to accelerate point lookups by allowing levels that don't contain the
 record being searched for to be skipped. In our case, we only care
 about tombstone records, so rather than building these filters over all records,
@@ -599,7 +599,7 @@ the sampling performance of the structure when tombstone deletes are used.
 
 \Paragraph{Layout Policy.} The Bentley-Saxe method considers
 blocks individually, without any other organization beyond increasing
-size. In contrast, LSM Trees have multiple layers of structural
+size. In contrast, LSM trees have multiple layers of structural
 organization. Record capacity restrictions are enforced on structures
 called \emph{levels}, which are partitioned into individual data
 structures, and then further organized into non-overlapping key ranges.
diff --git a/chapters/sigmod23/introduction.tex b/chapters/sigmod23/introduction.tex
index 8f0635d..7ff82cd 100644
--- a/chapters/sigmod23/introduction.tex
+++ b/chapters/sigmod23/introduction.tex
@@ -22,7 +22,7 @@ them. Existing implementations tend to sacrifice either performance,
 by requiring the entire result set of be materialized prior to applying
 Bernoulli sampling, or statistical independence. There exists techniques
 for obtaining both sampling performance and independence by leveraging
-existing B+Tree indices with slight modification~\cite{olken-thesis},
+existing B+tree indices with slight modification~\cite{olken-thesis},
 but even this technique has worse sampling performance than could be
 achieved using specialized static sampling indices.
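A recurring reference point in the hunks above is Olken's method over a B+tree whose internal nodes carry aggregate weights (the \texttt{AGG B+Tree} baseline): a sample is drawn by walking root to leaf, choosing each child with probability proportional to its aggregate weight. The following is a minimal sketch of that walk, with hypothetical names and a flat node layout rather than code from this repository:

```cpp
// Illustrative sketch of an Olken-style sampling walk over a tree whose
// internal nodes store aggregate subtree weights. Not the dissertation's code.
#include <memory>
#include <random>
#include <vector>

struct Node {
    double weight = 0.0;                          // aggregate weight of this subtree
    std::vector<std::unique_ptr<Node>> children;  // empty => leaf
    int record_id = -1;                           // meaningful only at a leaf
};

// Draw one record id with probability proportional to its weight by
// descending root-to-leaf, picking each child in proportion to its aggregate.
int sample_one(const Node &root, std::mt19937 &rng) {
    const Node *cur = &root;
    while (!cur->children.empty()) {
        double r = std::uniform_real_distribution<double>(0.0, cur->weight)(rng);
        const Node *next = cur->children.back().get();  // fallback for FP rounding
        for (const auto &child : cur->children) {
            if (r < child->weight) { next = child.get(); break; }
            r -= child->weight;
        }
        cur = next;
    }
    return cur->record_id;
}
```

Each sample pays a full descent, which is why the baseline's per-sample cost grows with the data size; the static ISAM-tree approach discussed in exp-baseline.tex, by contrast, performs its traversal once and then draws many samples from the located range.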