| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-05 16:23:25 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-05 16:23:25 -0400 |
| commit | ac1244fced7e6c6ba93d4292dd9a18ce293236eb (patch) | |
| tree | 671696721d572a9e9ec2b92f94e1ff347ac26760 /chapters/sigmod23/background.tex | |
| parent | eb519d35d7f11427dd5fc877130b02478f0da80d (diff) | |
Updates
Diffstat (limited to 'chapters/sigmod23/background.tex')
| -rw-r--r-- | chapters/sigmod23/background.tex | 23 |
1 files changed, 21 insertions, 2 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index ad89e03..b4ccbf1 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -124,7 +124,13 @@ query, and selecting or rejecting it for inclusion within the sample
 with a fixed probability~\cite{db2-doc}. This process requires that each
 record in the result set be considered, and thus provides no performance
 benefit relative to the query being sampled from, as it must be answered
-in full anyway before returning only some of the results.
+in full anyway before returning only some of the results.\footnote{
+    To clarify, this is not to say that Bernoulli sampling isn't
+    useful. It \emph{can} be used to improve the performance of queries
+    by limiting the cardinality of intermediate results, etc. But it is
+    not particularly useful for improving the performance of IQS queries,
+    where the sampling is performed on the final result set of the query.
+}
 
 For performance, the statistical guarantees can be discarded and
 systematic or block sampling used instead. Systematic sampling considers
@@ -230,6 +236,7 @@ structures attached to the nodes. More examples of alias augmentation
 applied to different IQS problems can be found in a recent survey by
 Tao~\cite{tao22}.
 
+\Paragraph{Miscellanea.}
 There also exist specialized data structures with support for both
 efficient sampling and updates~\cite{hu14}, but these structures have
 poor constant factors and are very complex, rendering them of little
@@ -252,7 +259,19 @@ only once per sample set (if at all), but fail to support updates. Thus,
 there appears to be a general dichotomy of sampling techniques: existing
 sampling data structures support either updates, or efficient sampling,
 but generally not both. It will be the purpose of this chapter to resolve
-this dichotomy.
+this dichotomy. In particular, we seek to develop structures with the
+following desiderata,
+
+\begin{enumerate}
+    \item Support data updates (including deletes) with similar average
+        performance to a standard B+Tree.
+    \item Support IQS queries that do not pay a per-sample cost
+        proportional to some function of the data size. In other words,
+        $k$ should \emph{not} be multiplied by any function of $n$
+        in the query cost function.
+    %FIXME: this guy comes out of nowhere...
+    \item Provide the user with some basic performance tuning capability.
+\end{enumerate}