| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-13 17:29:40 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-05-13 17:29:40 -0400 |
| commit | 40bff24fc2e2da57f382e4f49a5ffb7c826bbcfb (patch) | |
| tree | c00441b058255de08a32d227ce7af46bf11d8eb8 /chapters/sigmod23/background.tex | |
| parent | 5ffc53e69e956054fdefd1fe193e00eee705dcab (diff) | |
| download | dissertation-40bff24fc2e2da57f382e4f49a5ffb7c826bbcfb.tar.gz | |
Updates
Diffstat (limited to 'chapters/sigmod23/background.tex')
| -rw-r--r-- | chapters/sigmod23/background.tex | 12 |
1 files changed, 6 insertions, 6 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index b4ccbf1..af3b80a 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -37,12 +37,12 @@ have \emph{statistical independence} and for the distribution of records in the sample set to match the distribution of source data set. This requires that the sampling of a record does not affect the probability of any other record being sampled in the future. Such sample sets are said
-to be drawn i.i.d (idendepently and identically distributed). Throughout
+to be drawn i.i.d (independently and identically distributed). Throughout
 this chapter, the term "independent" will be used to describe both statistical independence, and identical distribution. Independence of sample sets is important because many useful statistical
-results are derived from assumping that the condition holds. For example,
+results are derived from assuming that the condition holds. For example,
 it is a requirement for the application of statistical tools such as the Central Limit Theorem~\cite{bulmer79}, which is the basis for many concentration bounds. A failure to maintain independence in sampling
@@ -54,7 +54,7 @@ sampling} (IQS)~\cite{hu14}. In IQS, a sample set is constructed from a specified number of records in the result set of a database query. In this context, it isn't enough to ensure that individual records are sampled independently; the sample sets from repeated queries must also be
-indepedent. This precludes, for example, caching and returning the same
+independent. This precludes, for example, caching and returning the same
 sample set to multiple repetitions of the same query. This inter-query independence provides a variety of useful properties, such as fairness and representativeness of query results~\cite{tao22}.
@@ -194,7 +194,7 @@ call static sampling indices (SSIs) in this chapter,\footnote{ is based, which was published prior to our realization that a strong distinction between an index and a data structure would be useful. I am retaining the term SSI in this chapter for consistency with the
- original paper, but understand that in the termonology established in
+ original paper, but understand that in the terminology established in
 Chapter~\ref{chap:background}, SSIs are data structures, not indices. }, that are capable of answering sampling queries more efficiently than
@@ -216,7 +216,7 @@ per sample. Thus, a WSS query can be answered in $\Theta(k)$ time, assuming the structure has already been built. Unfortunately, the alias structure cannot be efficiently updated, as inserting new records would change the relative weights of \emph{all} the records, and require fully
-repartitioning the structure.
+re-partitioning the structure.
 While the alias method only applies to WSS, other sampling problems can be solved by using the alias method within the context of a larger data
@@ -245,7 +245,7 @@ the alias structure with support for weight updates over a fixed set of elements~\cite{hagerup93,matias03,allendorf23}. These approaches do not allow the insertion or removal of new records, however, only in-place weight updates. While in principle they could be constructed over the
-entire domain of possible records, with the weights of non-existant
+entire domain of possible records, with the weights of non-existent
 records set to $0$, this is hardly practical. Thus, these structures are not suited for the database sampling applications that are of interest to us in this chapter.