| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-12 19:59:26 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-12 19:59:26 -0400 |
| commit | 5ffc53e69e956054fdefd1fe193e00eee705dcab (patch) | |
| tree | 74fd32db95211d0be067d22919e65ac959e4fa46 /chapters/sigmod23/experiment.tex | |
| parent | 901a04fd8ec9a07b7bd195517a6d9e89da3ecab6 (diff) | |
Updates
Diffstat (limited to 'chapters/sigmod23/experiment.tex')
| -rw-r--r-- | chapters/sigmod23/experiment.tex | 115 |
1 files changed, 82 insertions, 33 deletions
diff --git a/chapters/sigmod23/experiment.tex b/chapters/sigmod23/experiment.tex
index 75cf32e..4dbb4c2 100644
--- a/chapters/sigmod23/experiment.tex
+++ b/chapters/sigmod23/experiment.tex
@@ -1,18 +1,36 @@
 \section{Evaluation}
 \label{sec:experiment}
+In this section, we provide comprehensive performance benchmarks
+of implementations of the dynamized structures discussed in
+Sections~\ref{sec:instance} and \ref{sec:discussion}. All of the code was
+written using C++17. The full implementations, including benchmarking
+code, are available on GitHub under the Modified BSD License, at
+\url{https://github.com/psu-db/sampling-extension-original}.\footnote{
+    We also provide a ``cleaner'' implementation for WSS and WIRS,
+    with a structure and nomenclature better aligned with this
+    chapter, here: \url{https://github.com/psu-db/sampling-extension}.
+}
 
-\Paragraph{Experimental Setup.} All experiments were run under Ubuntu 20.04 LTS
-on a dual-socket Intel Xeon Gold 6242R server with 384 GiB of physical memory
-and 40 physical cores. External tests were run using a 4 TB WD Red SA500 SATA
-SSD, rated for 95000 and 82000 IOPS for random reads and writes respectively.
-
-\Paragraph{Datasets.} Testing utilized a variety of synthetic and real-world
-datasets. For all datasets used, the key was represented as a 64-bit integer,
-the weight as a 64-bit integer, and the value as a 32-bit integer. Each record
-also contained a 32-bit header. The weight was omitted from IRS testing.
-Keys and weights were pulled from the dataset directly, and values were
-generated separately and were unique for each record. The following datasets
-were used,
+\Paragraph{Experimental Setup.} We ran all of our experiments on Ubuntu
+20.04 LTS using a server equipped with dual-socket Intel Xeon Gold 6242R
+processors with 40 physical cores and 384 GiB of physical memory. We
+performed testing of external structures with a 4 TB WD Red SA500 SATA
+drive rated at 95000 IOPS for random reads and 82000 IOPS for random
+writes. All benchmarking code was compiled with GCC version 11.3.0 with
+the \texttt{-O3} optimization level.
+
+\Paragraph{Datasets.} We used a variety of synthetic and real-world
+datasets of various distributions to test sampling performance. For all
+of our datasets, we treated the data as a sequence of key-value pairs
+with a 64-bit integer key and a 32-bit integer value. Our dynamizations
+introduced a 32-bit header to each record as well. This header was not
+added to records when testing dynamic baselines. Additionally, weighted
+testing attached a 64-bit integer weight to each record. This weight was
+not included in the record for non-weighted testing. The weights and
+keys were both used directly from the datasets, and values were added
+separately and were unique to each record.
+
+We used the following datasets for testing,
 \begin{itemize}
 \item \textbf{Synthetic Uniform.} A non-weighted, synthetically
 generated list of keys drawn from a uniform distribution.
@@ -23,26 +41,57 @@ were used,
 \item \textbf{Delicious~\cite{data-delicious}.} $33.7$ million URLs,
 represented using unique integers, weighted by the number of associated tags.
 \item \textbf{OSM~\cite{data-osm}.} $2.6$ billion geospatial coordinates for points
-    of interest, collected by OpenStreetMap. The latitude, converted
-    to a 64-bit integer, was used as the key and the number of
+    of interest, collected by OpenStreetMap. We used the latitude, converted
+    to a 64-bit integer, as the key and the number of
     its associated semantic tags as the weight.
 \end{itemize}
-The synthetic datasets were not used for weighted experiments, as they do not
-have weights. For unweighted experiments, the Twitter and Delicious datasets
-were not used, as they have uninteresting key distributions.
-
-\Paragraph{Compared Methods.} In this section, indexes extended using the
-framework are compared against existing dynamic baselines. Specifically, DE-WSS
-(Section~\ref{ssec:wss-struct}), DE-IRS (Section~\ref{ssec:irs-struct}), and
-DE-WIRS (Section~\ref{ssec:irs-struct}) are examined. In-memory extensions are
-compared against the B+tree with aggregate weight tags on internal nodes (AGG
-B+tree) \cite{olken95} and concurrent and external extensions are compared
-against the AB-tree \cite{zhao22}. Sampling performance is also compared against
-comparable static sampling indexes: the alias structure \cite{walker74} for WSS,
-the in-memory ISAM tree for IRS, and the alias-augmented B+tree \cite{afshani17}
-for WIRS. Note that all structures under test, with the exception of the
-external DE-IRS and external AB-tree, were contained entirely within system
-memory. All benchmarking code and data structures were implemented using C++17
-and compiled using gcc 11.3.0 at the \texttt{-O3} optimization level. The
-extension framework itself, excluding the shard implementations and utility
-headers, consisted of a header-only library of about 1200 SLOC.
+
+We did not use the synthetic uniform and Zipfian datasets for testing
+WSS and WIRS, as these datasets lacked weights. We also did not use the
+Twitter and Delicious datasets for unweighted testing, as they have
+uninteresting key distributions.
+
+\Paragraph{Structures Compared.} As a basis of comparison, we tested
+both our dynamized SSI implementations and existing dynamic baselines
+for each sampling problem considered. Specifically, we consider the
+following dynamized structures,
+\begin{itemize}
+
+\item \textbf{DE-WSS.} An implementation of the dynamized alias
+structure~\cite{walker74} for weighted set sampling discussed
+in Section~\ref{ssec:wss-struct}. We compare this against a WSS
+implementation of Olken's method on a B+Tree with aggregate weight tags
+(\textbf{AGG-BTree})~\cite{olken95}, based on the B+Tree implementation
+in the TLX library~\cite{tlx}.
+
+\item \textbf{DE-IRS.} An implementation of the dynamized ISAM tree for
+independent range sampling, discussed in Section~\ref{ssec:irs-struct}. We
+also implement a concurrent version based on our discussion in
+Section~\ref{ssec:ext-concurrency} and an external version from
+Section~\ref{ssec:ext-external}. We compare the external and concurrent
+versions against the AB-Tree~\cite{zhao22}, and the single-threaded,
+in-memory version was compared with an IRS implementation of Olken's
+method on an AGG-BTree.
+
+\item \textbf{DE-WIRS.} An implementation of the dynamized alias-augmented
+B+Tree~\cite{afshani17} as discussed in Section~\ref{ssec:wirs-struct} for
+weighted independent range sampling. We compare this against a WIRS
+implementation of Olken's method on an AGG-BTree.
+
+\end{itemize}
+
+All of the tested structures, with the exception of the external memory
+DE-IRS implementation and AB-Tree, were wholly contained within system
+memory. AB-Tree is a native external structure, so for the in-memory
+concurrency evaluation we configured it with enough cache to maintain
+the entire structure in memory to simulate an in-memory implementation.\footnote{
+    Because of the nature of sampling queries, traditional
+    efficient locking techniques for B+Trees are not able to be
+    used~\cite{zhao22}. The alternatives were to run AB-Tree in this
+    manner, or to globally lock the B+Tree for every operation. We
+    elected to use the former approach for this chapter. We used the
+    latter approach in the next chapter.
+}
+
+
+