| author | Douglas Rumbaugh <dbr4@psu.edu> | 2025-06-08 15:04:00 -0400 |
|---|---|---|
| committer | Douglas Rumbaugh <dbr4@psu.edu> | 2025-06-08 15:04:00 -0400 |
| commit | 33bc7e620276f4269ee5f1820e5477135e020b3f (patch) | |
| tree | 03a7bb2ccbf7f1d2943871a69bca18006270bd20 /chapters/sigmod23 | |
| parent | 50adf588694170699adfa75cd2d1763263085165 (diff) | |
Julia updates v2
Diffstat (limited to 'chapters/sigmod23')
| -rw-r--r-- | chapters/sigmod23/background.tex | 6 |
| -rw-r--r-- | chapters/sigmod23/framework.tex | 8 |
2 files changed, 7 insertions, 7 deletions
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index 88f2585..42a52de 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -104,7 +104,7 @@ sampling} (WIRS), positive weights $w: D\to \mathbb{R}^+$. Given a query interval $q = [x, y]$ and an integer $k$, an independent range sampling query returns $k$ independent samples from $D \cap q$ with each
- point having a probability of $\nicefrac{w(d)}{\sum_{p \in D \cap q}w(p)}$
+ point having a probability of $\frac{w(d)}{\sum_{p \in D \cap q}w(p)}$
 of being sampled. \end{definition}
@@ -118,7 +118,7 @@ SQL's \texttt{TABLESAMPLE} operator~\cite{postgres-doc}. However, the algorithms used to implement this operator have significant limitations and do not allow users to maintain statistical independence of the results without also running the query to be sampled from in full. Thus, users must
-choose between independece and performance.
+choose between independence and performance.
 To maintain statistical independence, Bernoulli sampling is used. This technique requires iterating over every record in the result set of the
@@ -198,7 +198,7 @@ call static sampling indices (SSIs) in this chapter,\footnote{ am retaining the term SSI in this chapter for consistency with the original paper, but understand that in the terminology established in Chapter~\ref{chap:background}, SSIs are data structures, not indices.
-},
+}
 that are capable of answering sampling queries more efficiently than Olken's method relative to the overall data size. An example of such a structure is used in Walker's alias method \cite{walker74,vose91}.
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index d51c2cb..b3a8215 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -532,7 +532,7 @@ rates, buffering, sub-partitioning of structures to allow finer-grained reconstruction~\cite{dayan22}, and approaches for allocating resources to auxiliary structures attached to the main ones for accelerating certain types of query~\cite{dayan18-1, zhu21, monkey}. This work is discussed
-in greater depth in Chapter~\ref{chap:related-work}
+in greater depth in Chapter~\ref{chap:related-work}.
 Many of the elements within the LSM Tree design space are based upon the specifics of the data structure itself, and are not applicable to our
@@ -561,7 +561,7 @@ the case of sampling this isn't a serious problem. The implications of this will be discussed in Section~\ref{ssec:sampling-cost-funcs}. The size of this buffer, $N_B$ is a user-specified constant. Block capacities are defined in terms of multiples of $N_B$, such that each buffer flush
-corresponds to an insert in the traditioanl Bentley-Saxe method. Thus,
+corresponds to an insert in the traditional Bentley-Saxe method. Thus,
 rather than the $i$th block containing $2^i$ records, it contains $N_B \cdot 2^i$ records. We call this unsorted array the \emph{mutable buffer}.
@@ -750,8 +750,8 @@ operations must be used, the the cost becomes $I_a(n) \in \Paragraph{Delete.} The framework supports both tombstone and tagged deletes, each with different performance. Using tombstones, the cost of a delete is identical to that of an insert. When using tagging, the
-cost of a delete is the same as cost of doing a point lookup, as the
-"delete" itself is simply setting a bit in the header of the record,
+cost of a delete is the same as the cost of a point lookup, because the
+"delete" itself only sets a bit in the header of the record,
 once it has been located. There will be $\Theta(\log_s n)$ total shards in the structure, each with a look-up cost of $L(n)$ using either the SSI's native point-lookup, or an auxiliary hash table, and the lookup