author: Douglas Rumbaugh <dbr4@psu.edu> 2025-05-05 16:23:25 -0400
committer: Douglas Rumbaugh <dbr4@psu.edu> 2025-05-05 16:23:25 -0400
commit: ac1244fced7e6c6ba93d4292dd9a18ce293236eb
tree: 671696721d572a9e9ec2b92f94e1ff347ac26760 /chapters/sigmod23/background.tex
parent: eb519d35d7f11427dd5fc877130b02478f0da80d
Updates
Diffstat (limited to 'chapters/sigmod23/background.tex')
-rw-r--r-- chapters/sigmod23/background.tex | 23
1 files changed, 21 insertions, 2 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index ad89e03..b4ccbf1 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -124,7 +124,13 @@ query, and selecting or rejecting it for inclusion within the sample
with a fixed probability~\cite{db2-doc}. This process requires that each
record in the result set be considered, and thus provides no performance
benefit relative to the query being sampled from, as it must be answered
-in full anyway before returning only some of the results.
+in full anyway before returning only some of the results.\footnote{
+ To clarify, this is not to say that Bernoulli sampling isn't
+ useful. It \emph{can} be used to improve the performance of queries
+ by limiting the cardinality of intermediate results, etc. But it is
+ not particularly useful for improving the performance of IQS queries,
+ where the sampling is performed on the final result set of the query.
+}
For performance, the statistical guarantees can be discarded and
systematic or block sampling used instead. Systematic sampling considers
@@ -230,6 +236,7 @@ structures attached to the nodes. More examples of alias augmentation
applied to different IQS problems can be found in a recent survey by
Tao~\cite{tao22}.
+\Paragraph{Miscellanea.}
There also exist specialized data structures with support for both
efficient sampling and updates~\cite{hu14}, but these structures have
poor constant factors and are very complex, rendering them of little
@@ -252,7 +259,19 @@ only once per sample set (if at all), but fail to support updates. Thus,
there appears to be a general dichotomy of sampling techniques: existing
sampling data structures support either updates, or efficient sampling,
but generally not both. It will be the purpose of this chapter to resolve
-this dichotomy.
+this dichotomy. In particular, we seek to develop structures with the
+following desiderata:
+
+\begin{enumerate}
+ \item Support data updates (including deletes) with similar average
+ performance to a standard B+Tree.
+ \item Support IQS queries that do not pay a per-sample cost
+ proportional to some function of the data size. In other words,
+ $k$ should \emph{not} be multiplied by any function of $n$
+ in the query cost function.
+ %FIXME: this guy comes out of nowhere...
+ \item Provide the user with some basic performance tuning capability.
+\end{enumerate}
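
The first hunk describes Bernoulli sampling as selecting or rejecting each record of a query's result set with a fixed probability. The following is a minimal Python sketch of that procedure, not code from the dissertation. The function name `bernoulli_sample`, its parameters, and the `examined` counter are hypothetical and introduced only for illustration. The counter makes explicit why the text says the query must be answered in full: every record of the result set is visited, regardless of how few are kept.

```python
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def bernoulli_sample(
    result_set: Iterable[T],
    p: float,
    rng: Optional[random.Random] = None,
) -> Tuple[List[T], int]:
    """Keep each record independently with fixed probability p.

    Returns the sample together with the number of records examined.
    (Hypothetical helper for illustration only.)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if rng is None:
        rng = random.Random()

    sample: List[T] = []
    examined = 0
    for record in result_set:
        # Every record in the result set is considered exactly once.
        examined += 1
        # rng.random() is uniform on [0, 1), so this accepts with probability p.
        if rng.random() < p:
            sample.append(record)
    return sample, examined


if __name__ == "__main__":
    records = range(1000)
    # Stand-in for a range query: the full result set is still produced.
    in_range = (r for r in records if 100 <= r < 600)
    sample, examined = bernoulli_sample(in_range, p=0.01, rng=random.Random(42))
    print(f"examined {examined} records, kept {len(sample)}")
```

Under these assumptions, `examined` always equals the size of the result set, so the work done is independent of the sample size. This is the behavior the footnote contrasts with IQS queries, where the aim is to avoid touching the full result.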