| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-04-27 17:36:57 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-04-27 17:36:57 -0400 |
| commit | 5e4ad2777acc4c2420514e39fb98b7cf2e200996 (patch) | |
| tree | 276c075048e85426436db8babf0ca1f37e9fdba2 /chapters/sigmod23/exp-baseline.tex | |
| download | dissertation-5e4ad2777acc4c2420514e39fb98b7cf2e200996.tar.gz | |
Initial commit
Diffstat (limited to 'chapters/sigmod23/exp-baseline.tex')
| -rw-r--r-- | chapters/sigmod23/exp-baseline.tex | 98 |
1 file changed, 98 insertions, 0 deletions
diff --git a/chapters/sigmod23/exp-baseline.tex b/chapters/sigmod23/exp-baseline.tex
new file mode 100644
index 0000000..9e7929c
--- /dev/null
+++ b/chapters/sigmod23/exp-baseline.tex
@@ -0,0 +1,98 @@
\subsection{Comparison to Baselines}

Next, the performance of indexes extended using the framework is compared
against tree sampling on the aggregate B+tree, as well as against
problem-specific SSIs for WSS, WIRS, and IRS queries. Unless otherwise
specified, IRS and WIRS queries were executed with a selectivity of $0.1\%$,
and 500 million randomly selected records from the OSM dataset were used. The
uniform and Zipfian synthetic datasets contained 1 billion records each. Every
benchmark warmed up the data structure by inserting 10\% of the records and
then measured insertion throughput over the remaining records, deleting 5\% of
the records over the course of the benchmark. Once all records were inserted,
sampling performance was measured. The reported update throughputs were
calculated over both inserts and deletes, following the warmup period.

\begin{figure*}
    \centering
    \subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wss-insert} \label{fig:wss-insert}}
    \subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wss-sample} \label{fig:wss-sample}} \\
    \subfloat[Insertion Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-wss-insert} \label{fig:wss-insert-s}}
    \subfloat[Sampling Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-wss-sample} \label{fig:wss-sample-s}}
    \caption{Framework Comparisons to Baselines for WSS}
\end{figure*}

Starting with WSS, Figure~\ref{fig:wss-insert} shows that the DE-WSS structure
is competitive with the AGG B+tree in terms of insertion performance, achieving
about 85\% of the AGG B+tree's insertion throughput on the Twitter dataset and
beating it by similar margins on the other datasets. In terms of sampling
performance, Figure~\ref{fig:wss-sample} shows that it beats the B+tree handily
and compares favorably to the static alias structure.
Figures~\ref{fig:wss-insert-s} and \ref{fig:wss-sample-s} show the performance
scaling of the three structures as the dataset size increases; all of them
exhibit the same pattern of performance degradation with respect to dataset
size.

\begin{figure*}
    \centering
    \subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wirs-insert} \label{fig:wirs-insert}}
    \subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wirs-sample} \label{fig:wirs-sample}}
    \caption{Framework Comparison to Baselines for WIRS}
\end{figure*}

Figures~\ref{fig:wirs-insert} and \ref{fig:wirs-sample} show the performance of
the DE-WIRS index relative to the AGG B+tree and the alias-augmented B+tree.
This example shows the same pattern of behavior seen with DE-WSS, though the
margin between DE-WIRS and its corresponding SSI is much narrower.
Additionally, the constant factors associated with the construction cost of the
alias-augmented B+tree are much larger than those of the alias structure.
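For a sense of why the flat alias structure is so cheap to construct, the sketch below shows standard alias-table construction (Vose's method) in C++. This is an illustrative sketch only, not the implementation evaluated here; the `AliasTable` type and its member names are hypothetical, and error handling is omitted.

```cpp
// Minimal sketch of Vose's alias method: O(n) construction, O(1) sampling.
// Illustrative only; names and layout do not reflect the evaluated codebase.
#include <cstddef>
#include <random>
#include <vector>

struct AliasTable {
    std::vector<double> prob;   // acceptance probability per bucket
    std::vector<size_t> alias;  // fallback index per bucket

    explicit AliasTable(const std::vector<double>& weights) {
        size_t n = weights.size();
        prob.resize(n);
        alias.resize(n);

        double total = 0;
        for (double w : weights) total += w;

        // Scale weights so the average bucket has value 1, then split
        // buckets into underfull ("small") and overfull ("large") sets.
        std::vector<double> scaled(n);
        std::vector<size_t> small, large;
        for (size_t i = 0; i < n; i++) {
            scaled[i] = weights[i] * n / total;
            (scaled[i] < 1.0 ? small : large).push_back(i);
        }

        // Pair each underfull bucket with an overfull donor.
        while (!small.empty() && !large.empty()) {
            size_t s = small.back(); small.pop_back();
            size_t l = large.back(); large.pop_back();
            prob[s]  = scaled[s];
            alias[s] = l;
            scaled[l] -= (1.0 - scaled[s]);
            (scaled[l] < 1.0 ? small : large).push_back(l);
        }
        for (size_t i : small) prob[i] = 1.0;
        for (size_t i : large) prob[i] = 1.0;
    }

    // Draw one index: pick a bucket uniformly, then keep it or take its alias.
    template <typename RNG>
    size_t sample(RNG& rng) const {
        std::uniform_int_distribution<size_t> bucket(0, prob.size() - 1);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        size_t i = bucket(rng);
        return coin(rng) < prob[i] ? i : alias[i];
    }
};
```

Construction here is a single linear pass over flat arrays; assembling comparable alias data inside a tree structure involves considerably more bookkeeping, which is consistent with the larger constants noted above.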
The loss of insertion performance due to these higher construction costs is
seen clearly in Figure~\ref{fig:wirs-insert}: the margin by which DE-WIRS beats
the AGG B+tree in insertion throughput shrinks compared to the DE-WSS index,
and the AGG B+tree's advantage on the Twitter dataset widens.

\begin{figure*}
    \subfloat[Insertion Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-insert} \label{fig:irs-insert-s}}
    \subfloat[Sampling Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-sample} \label{fig:irs-sample-s}} \\

    \subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-insert} \label{fig:irs-insert1}}
    \subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-sample} \label{fig:irs-sample1}} \\

    \subfloat[Delete Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-delete} \label{fig:irs-delete}}
    \subfloat[Sampling Latency vs. Sample Size]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-samplesize} \label{fig:irs-samplesize}}
    \caption{Framework Comparison to Baselines for IRS}
\end{figure*}

Finally, Figures~\ref{fig:irs-insert1} and \ref{fig:irs-sample1} compare the
in-memory DE-IRS index against the in-memory ISAM tree and the AGG B+tree for
answering IRS queries. The cost of bulk-loading the ISAM tree is lower than the
cost of building either the alias structure or the alias-augmented B+tree, so
here DE-IRS beats the AGG B+tree by wider margins in insertion throughput,
though its advantage in sampling performance narrows significantly.

DE-IRS was further tested to evaluate scalability.
Figure~\ref{fig:irs-insert-s} shows average insertion throughput,
Figure~\ref{fig:irs-delete} shows average delete latency (under tagging), and
Figure~\ref{fig:irs-sample-s} shows average sampling latency for DE-IRS and the
AGG B+tree over a range of data sizes. In all cases, DE-IRS and the B+tree show
similar patterns of performance degradation as the data size grows. Note that
the delete latencies of DE-IRS are worse than those of the AGG B+tree because
of the B+tree's cheaper point lookups.

Figure~\ref{fig:irs-sample-s} also includes one other point of interest: the
sampling performance of DE-IRS \emph{improves} when the data size grows from
one million to ten million records. While at first glance this increase may
appear paradoxical, it demonstrates an important result concerning the effect
of the unsorted mutable buffer on index performance. At one million records,
the buffer constitutes approximately 1\% of the total data size, so it is
sampled from with greater frequency (as it holds a larger share of the total
weight) than would be the case with larger data. The more frequently the buffer
is sampled, the more rejections occur, and the worse the sampling performance
becomes. This illustrates the importance of keeping the buffer small, even when
a scan is not used for buffer sampling. Finally, Figure~\ref{fig:irs-samplesize}
shows the decreasing per-sample cost as the number of records requested by a
sampling query grows, for DE-IRS compared to the AGG B+tree. Note that DE-IRS
benefits significantly more from batching samples than the AGG B+tree does, and
that the improvement is greatest up to $k=100$ samples per query.
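To make the buffer-rejection argument above concrete, the following is a simple back-of-the-envelope model; it is an illustrative assumption, not a quantity measured in these experiments. Let $f$ be the fraction of the total weight held in the mutable buffer and $p_a$ the probability that a sampling attempt routed to the buffer is accepted, and assume for simplicity that attempts routed to the immutable levels always succeed. Each attempt then succeeds with probability $(1-f) + f\,p_a = 1 - f(1-p_a)$, so the expected number of attempts per returned sample is

\begin{equation*}
    E[\text{attempts}] = \frac{1}{1 - f\,(1 - p_a)},
\end{equation*}

which grows with the buffer's share of the total weight, $f$. With a fixed-capacity buffer, growing the dataset from one million to ten million records shrinks $f$ by roughly an order of magnitude, which is consistent with the observed improvement in sampling latency at that transition.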