Diffstat (limited to 'chapters/sigmod23/background.tex')
| -rw-r--r-- | chapters/sigmod23/background.tex | 18 |
1 files changed, 10 insertions, 8 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index af3b80a..d600c27 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -19,16 +19,16 @@ is used to indicate the selection of either a single sample or a sample
 set; the specific usage should be clear from context.
 
 In each of the problems considered, sampling can be performed either
-with replacement or without replacement. Sampling with replacement
+with-replacement or without-replacement. Sampling with-replacement
 means that a record that has been included in the sample set for a given
 sampling query is "replaced" into the dataset and allowed to be sampled
-again. Sampling without replacement does not "replace" the record,
+again. Sampling without-replacement does not "replace" the record,
 and so each individual record can only be included within the a sample
 set once for a given query. The data structures that will be discussed
-support sampling with replacement, and sampling without replacement can
-be implemented using a constant number of with replacement sampling
+support sampling with-replacement, and sampling without-replacement can
+be implemented using a constant number of with-replacement sampling
 operations, followed by a deduplication step~\cite{hu15}, so this chapter
-will focus exclusive on the with replacement case.
+will focus exclusive on the with-replacement case.
 
 \subsection{Independent Sampling Problem}
 
@@ -115,8 +115,10 @@ of problems that will be directly addressed within this chapter.
 
 Relational database systems often have native support for IQS using
 SQL's \texttt{TABLESAMPLE} operator~\cite{postgress-doc}. However, the
-algorithms used to implement this operator have significant limitations:
-users much choose between statistical independence or performance.
+algorithms used to implement this operator have significant limitations
+and do not allow users to maintain statistical independence of the results
+without also running the query to be sampled from in full. Thus, users must
+choose between independece and performance.
 
 To maintain statistical independence, Bernoulli sampling is used. This
 technique requires iterating over every record in the result set of the
@@ -240,7 +242,7 @@ Tao~\cite{tao22}.
 There also exist specialized data structures with support for both
 efficient sampling and updates~\cite{hu14}, but these structures have
 poor constant factors and are very complex, rendering them of little
-practical utility. Additionally, efforts have been made to extended
+practical utility. Additionally, efforts have been made to extend
 the alias structure with support for weight updates over a fixed set of
 elements~\cite{hagerup93,matias03,allendorf23}. These approaches do not
 allow the insertion or removal of new records, however, only in-place
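Note on the first hunk: the paragraph it touches says that without-replacement sampling can be built from with-replacement sampling operations followed by a deduplication step. The Python sketch below illustrates that general idea only. It is a simplified rejection-style stand-in, not the construction from the cited work, and the names `sample_with_replacement` and `sample_without_replacement` are hypothetical. It assumes the records fit in an in-memory sequence and that the requested sample size is small relative to the dataset, so that few redraw rounds are needed in practice.

```python
import random


def sample_with_replacement(records, k, rng):
    """Draw k records independently and uniformly; duplicates are allowed."""
    n = len(records)
    return [records[rng.randrange(n)] for _ in range(k)]


def sample_without_replacement(records, k, rng):
    """Draw k distinct records using with-replacement draws plus deduplication.

    Positions (not values) are deduplicated, so equal-valued records at
    different positions are still treated as distinct records.
    """
    n = len(records)
    if k > n:
        raise ValueError("cannot draw more distinct records than exist")

    seen = set()   # positions already placed in the sample
    picked = []    # sampled records, in the order they were first drawn
    while len(picked) < k:
        # Draw only as many positions as are still missing, then drop repeats.
        for i in sample_with_replacement(range(n), k - len(picked), rng):
            if i not in seen:
                seen.add(i)
                picked.append(records[i])
    return picked


if __name__ == "__main__":
    rng = random.Random(42)
    data = list(range(1000))
    print(sample_with_replacement(data, 5, rng))
    print(sample_without_replacement(data, 5, rng))
```

Each redraw round requests only the number of records still missing, so the result never exceeds `k` entries. When `k` approaches the dataset size the number of rounds grows, which is why the sketch is only reasonable for small samples.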