From ac1244fced7e6c6ba93d4292dd9a18ce293236eb Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Mon, 5 May 2025 16:23:25 -0400
Subject: Updates

---
 chapters/sigmod23/background.tex | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

(limited to 'chapters/sigmod23/background.tex')

diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index ad89e03..b4ccbf1 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -124,7 +124,13 @@ query, and selecting or rejecting it for inclusion within the sample with a
 fixed probability~\cite{db2-doc}. This process requires that each record in
 the result set be considered, and thus provides no performance benefit
 relative to the query being sampled from, as it must be answered
-in full anyway before returning only some of the results.
+in full anyway before returning only some of the results.\footnote{
+    To clarify, this is not to say that Bernoulli sampling isn't
+    useful. It \emph{can} be used to improve the performance of queries
+    by limiting the cardinality of intermediate results, etc. But it is
+    not particularly useful for improving the performance of IQS queries,
+    where the sampling is performed on the final result set of the query.
+}
 
 For performance, the statistical guarantees can be discarded and
 systematic or block sampling used instead. Systematic sampling considers
@@ -230,6 +236,7 @@ structures attached to the nodes. More examples of alias augmentation
 applied to different IQS problems can be found in a recent survey by
 Tao~\cite{tao22}.
 
+\Paragraph{Miscellanea.}
 There also exist specialized data structures with support for both
 efficient sampling and updates~\cite{hu14}, but these structures have
 poor constant factors and are very complex, rendering them of little
@@ -252,7 +259,19 @@ only once per sample set (if at all), but fail to support updates.
 Thus, there appears to be a general dichotomy of sampling techniques: existing
 sampling data structures support either updates, or efficient sampling, but
 generally not both. It will be the purpose of this chapter to resolve
-this dichotomy.
+this dichotomy. In particular, we seek to develop structures with the
+following desiderata,
+
+\begin{enumerate}
+    \item Support data updates (including deletes) with similar average
+    performance to a standard B+Tree.
+    \item Support IQS queries that do not pay a per-sample cost
+    proportional to some function of the data size. In other words,
+    $k$ should \emph{not} be multiplied by any function of $n$
+    in the query cost function.
+    %FIXME: this guy comes out of nowhere...
+    \item Provide the user with some basic performance tuning capability.
+\end{enumerate}
-- 
cgit v1.2.3