author    Douglas Rumbaugh <dbr4@psu.edu>   2025-04-27 17:36:57 -0400
committer Douglas Rumbaugh <dbr4@psu.edu>   2025-04-27 17:36:57 -0400
commit    5e4ad2777acc4c2420514e39fb98b7cf2e200996 (patch)
tree      276c075048e85426436db8babf0ca1f37e9fdba2 /chapters/sigmod23/exp-baseline.tex
Initial commit
Diffstat (limited to 'chapters/sigmod23/exp-baseline.tex')
-rw-r--r--   chapters/sigmod23/exp-baseline.tex   98
1 file changed, 98 insertions, 0 deletions
diff --git a/chapters/sigmod23/exp-baseline.tex b/chapters/sigmod23/exp-baseline.tex
new file mode 100644
index 0000000..9e7929c
--- /dev/null
+++ b/chapters/sigmod23/exp-baseline.tex
@@ -0,0 +1,98 @@
+\subsection{Comparison to Baselines}
+
+Next, the performance of indexes extended using the framework is compared
+against tree sampling on the aggregate B+tree, as well as against
+problem-specific SSIs for WSS, WIRS, and IRS queries. Unless otherwise
+specified, IRS and WIRS queries were executed with a selectivity of $0.1\%$,
+and 500 million randomly selected records from the OSM dataset were used. The
+uniform and Zipfian synthetic datasets were 1 billion records in size. All
+benchmarks warmed up the data structure by inserting 10\% of the records, and
+then measured the throughput of inserting the remaining records, while
+deleting 5\% of them over the course of the benchmark. Once all records were
+inserted, the sampling performance was measured. The reported update
+throughputs were calculated using both inserts and deletes, following the
+warmup period.
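+
+To make the measurement procedure concrete, the following is a minimal sketch
+of the benchmark driver described above. The \texttt{ExtendedIndex} type and
+its \texttt{insert}/\texttt{erase} methods are illustrative placeholders
+standing in for a framework-extended index, not the framework's actual
+interface.
+\begin{verbatim}
+#include <chrono>
+#include <cstdint>
+#include <cstdio>
+#include <random>
+#include <set>
+#include <vector>
+
+// Hypothetical stand-in for an extended sampling index; it only needs
+// insert/erase for the driver below to compile and run.
+struct ExtendedIndex {
+    std::multiset<uint64_t> data;
+    void insert(uint64_t r) { data.insert(r); }
+    void erase(uint64_t r) {
+        auto it = data.find(r);
+        if (it != data.end()) data.erase(it);
+    }
+};
+
+int main() {
+    std::vector<uint64_t> records(1'000'000);
+    std::mt19937_64 rng(42);
+    for (auto &r : records) r = rng();
+
+    ExtendedIndex idx;
+    size_t warmup = records.size() / 10;   // warm up with 10% of the records
+    for (size_t i = 0; i < warmup; i++) idx.insert(records[i]);
+
+    // Measured phase: insert the remaining records, interspersing deletes of
+    // already-inserted records (roughly 5% of the dataset over the run).
+    size_t ops = 0;
+    auto start = std::chrono::steady_clock::now();
+    for (size_t i = warmup; i < records.size(); i++) {
+        idx.insert(records[i]);
+        ops++;
+        if (i % 20 == 0) {
+            idx.erase(records[rng() % i]);
+            ops++;
+        }
+    }
+    auto stop = std::chrono::steady_clock::now();
+
+    double secs = std::chrono::duration<double>(stop - start).count();
+    std::printf("update throughput: %.0f ops/s\n", ops / secs);
+    return 0;
+}
+\end{verbatim}
+Sampling performance would then be measured in a separate pass over the fully
+loaded structure, as described above.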
+
+\begin{figure*}
+ \centering
+ \subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wss-insert} \label{fig:wss-insert}}
+ \subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wss-sample} \label{fig:wss-sample}} \\
+ \subfloat[Insertion Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-wss-insert} \label{fig:wss-insert-s}}
+ \subfloat[Sampling Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-wss-sample} \label{fig:wss-sample-s}}
+    \caption{Framework Comparison to Baselines for WSS}
+\end{figure*}
+
+Starting with WSS, Figure~\ref{fig:wss-insert} shows that the DE-WSS structure
+is competitive with the AGG B+tree in terms of insertion performance, achieving
+about 85\% of the AGG B+tree's insertion throughput on the Twitter dataset, and
+beating it by similar margins on the other datasets. In terms of sampling
+performance in Figure~\ref{fig:wss-sample}, it beats the B+tree handily, and
+compares favorably to the static alias structure. Figures~\ref{fig:wss-insert-s}
+and \ref{fig:wss-sample-s} show the performance scaling of the three structures as
+the dataset size increases. All of the structures exhibit the same type of
+performance degradation with respect to dataset size.
+
+\begin{figure*}
+ \centering
+ \subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wirs-insert} \label{fig:wirs-insert}}
+ \subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-wirs-sample} \label{fig:wirs-sample}}
+ \caption{Framework Comparison to Baselines for WIRS}
+\end{figure*}
+
+Figures~\ref{fig:wirs-insert} and \ref{fig:wirs-sample} show the performance of
+the DE-WIRS index relative to the AGG B+tree and the alias-augmented B+tree.
+This example shows the same pattern of behavior as was seen with DE-WSS, though
+the margin between DE-WIRS and its corresponding SSI is much narrower.
+Additionally, the constant factors associated with the construction cost of the
+alias-augmented B+tree are much larger than those of the alias structure. The
+resulting loss of insertion performance is clearly visible in
+Figure~\ref{fig:wirs-insert}: DE-WIRS's advantage over the AGG B+tree in
+insertion throughput shrinks compared to the DE-WSS index, and the AGG B+tree's
+advantage on the Twitter dataset widens.
+
+\begin{figure*}
+    \centering
+ \subfloat[Insertion Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-insert} \label{fig:irs-insert-s}}
+ \subfloat[Sampling Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-sample} \label{fig:irs-sample-s}} \\
+
+ \subfloat[Insertion Throughput vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-insert} \label{fig:irs-insert1}}
+ \subfloat[Sampling Latency vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-sample} \label{fig:irs-sample1}} \\
+
+ \subfloat[Delete Scalability vs. Baselines]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-sc-irs-delete} \label{fig:irs-delete}}
+ \subfloat[Sampling Latency vs. Sample Size]{\includegraphics[width=.5\textwidth]{img/sigmod23/plot/fig-bs-irs-samplesize} \label{fig:irs-samplesize}}
+ \caption{Framework Comparison to Baselines for IRS}
+
+\end{figure*}
+Finally, Figures~\ref{fig:irs-insert1} and \ref{fig:irs-sample1} show a
+comparison of the in-memory DE-IRS index against the in-memory ISAM tree and
+the AGG B+tree for answering IRS queries. The cost of bulk-loading the ISAM
+tree is less than the cost of building either the alias structure or the
+alias-augmented B+tree, and so here DE-IRS beats the AGG B+tree by wider
+margins in insertion throughput, though its advantage in sampling performance
+narrows significantly.
+
+DE-IRS was further tested to evaluate its scalability.
+Figure~\ref{fig:irs-insert-s} shows average insertion throughput,
+Figure~\ref{fig:irs-delete} shows average delete latency (under tagging), and
+Figure~\ref{fig:irs-sample-s} shows average sampling latency for DE-IRS and
+the AGG B+tree over a range of data sizes. In all cases, DE-IRS and the AGG
+B+tree show similar patterns of performance degradation as the data size
+grows. Note that the delete latencies of DE-IRS are worse than those of the
+AGG B+tree, because of the B+tree's cheaper point lookups.
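+
+To make concrete why delete latency under tagging tracks point-lookup cost,
+the following is a minimal sketch of a tagging-based delete. The
+\texttt{Shard} and \texttt{Record} types and the \texttt{tagged\_delete}
+helper are illustrative placeholders, not the framework's actual interfaces;
+the point is that a delete must first locate the record with a point lookup
+before flagging it so that later samples reject it.
+\begin{verbatim}
+#include <cstdint>
+#include <cstdio>
+#include <vector>
+
+struct Record {
+    uint64_t key;
+    double   weight;
+    bool     deleted = false;           // the deletion "tag"
+};
+
+// Stand-in for one immutable shard; a real shard would answer the point
+// lookup through its internal index rather than by scanning.
+struct Shard {
+    std::vector<Record> records;
+    Record *find(uint64_t key) {
+        for (auto &r : records)
+            if (r.key == key && !r.deleted) return &r;
+        return nullptr;
+    }
+};
+
+// Tagging delete: the point lookup dominates the cost; the flag itself is a
+// single write, and tagged records are rejected by later sampling queries.
+bool tagged_delete(std::vector<Shard> &shards, uint64_t key) {
+    for (auto &shard : shards) {
+        if (Record *r = shard.find(key)) {
+            r->deleted = true;
+            return true;
+        }
+    }
+    return false;                        // key not present in any shard
+}
+
+int main() {
+    std::vector<Shard> shards(1);
+    shards[0].records = {{1, 0.5}, {2, 1.5}, {3, 1.0}};
+    std::printf("deleted key 2: %s\n", tagged_delete(shards, 2) ? "yes" : "no");
+    return 0;
+}
+\end{verbatim}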
+
+Figure~\ref{fig:irs-sample-s} also includes one other point of interest: the
+sampling performance of DE-IRS \emph{improves} when the data size grows from
+one million to ten million records. While at first glance this performance
+increase may appear paradoxical, it actually demonstrates an important result
+concerning the effect of the unsorted mutable buffer on index performance. At
+one million records, the buffer constitutes approximately 1\% of the total
+data size; as a result, the buffer holds a larger fraction of the total weight
+and is sampled from more frequently than would be the case with larger data
+sizes. The more frequently the buffer is sampled, the more rejections occur,
+and the worse the sampling performance becomes. This illustrates the
+importance of keeping the buffer small, even when a scan is not used for
+buffer sampling.
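+
+To make this intuition concrete, consider a simplified model (an illustration
+only, not a formal analysis of these measurements) in which draws from
+outside the buffer are never rejected. If the mutable buffer holds weight
+$W_B$ out of a total weight $W$, then each individual sample is drawn from
+the buffer with probability $W_B / W$. If a fraction $r$ of buffer draws is
+rejected, then a query for $k$ samples wastes, to first order,
+\[
+    k \cdot \frac{W_B}{W} \cdot r
+\]
+draws on rejections. Shrinking the buffer's share of the data from roughly
+$1\%$ to $0.1\%$, as happens when the data size grows from one million to ten
+million records, therefore reduces this wasted work by roughly an order of
+magnitude.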
+
+Finally, Figure~\ref{fig:irs-samplesize} shows how the per-sample cost
+decreases as the number of records requested by a sampling query grows, for
+both DE-IRS and the AGG B+tree. Note that DE-IRS benefits significantly more
+from batching samples than the AGG B+tree does, and that the improvement is
+greatest up to $k=100$ samples per query.
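+
+One simple way to read this result is through an illustrative per-query cost
+model (not something measured directly in these experiments). If a sampling
+query pays a fixed setup cost $C_{\mathrm{setup}}$, for work such as
+determining per-shard weights or locating the query range, plus a marginal
+cost $c_{\mathrm{sample}}$ for each returned record, then the per-sample cost
+of a query returning $k$ records is
+\[
+    \frac{C(k)}{k} \approx \frac{C_{\mathrm{setup}}}{k} + c_{\mathrm{sample}}.
+\]
+The setup term is amortized away as $k$ grows and contributes little once $k$
+is on the order of $100$, consistent with the improvement flattening beyond
+that point. Under this model, the fact that DE-IRS benefits more from
+batching suggests that setup work accounts for a larger share of its total
+query cost than it does for the AGG B+tree.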
+