Began re-writing/updating the sampling extension stuff

author: Douglas Rumbaugh <dbr4@psu.edu> 2025-05-04 16:43:45 -0400
committer: Douglas Rumbaugh <dbr4@psu.edu> 2025-05-04 16:43:45 -0400
commit: eb519d35d7f11427dd5fc877130b02478f0da80d (patch)
tree: 2eb5bc349c82517fdc6484fce71c862b92b0213b /chapters/sigmod23/introduction.tex
parent: 873fd659e45e80fe9e229d3d85b3c4c99fb2c121 (diff)
download: dissertation-eb519d35d7f11427dd5fc877130b02478f0da80d.tar.gz
1 files changed, 36 insertions, 18 deletions
diff --git a/chapters/sigmod23/introduction.tex b/chapters/sigmod23/introduction.tex
index 0155c7d..befdbba 100644
--- a/chapters/sigmod23/introduction.tex
+++ b/chapters/sigmod23/introduction.tex
@@ -1,20 +1,38 @@
 \section{Introduction} \label{sec:intro} 
 
-As a first attempt at realizing a dynamic extension framework, one of the
-non-decomposable search problems discussed in the previous chapter was
-considered: independent range sampling, along with a number of other
-independent sampling problems. These sorts of queries are important in a
-variety of contexts, including including approximate query processing
-(AQP)~\cite{blinkdb,quickr,verdict,cohen23}, interactive data
-exploration~\cite{sps,xie21}, financial audit sampling~\cite{olken-thesis}, and
-feature selection for machine learning~\cite{ml-sampling}. However, they are
-not well served using existing techniques, which tend to sacrifice statistical
-independence for performance, or vise versa. In this chapter, a solution for
-independent sampling is presented that manages to achieve both statistical
-independence, and good performance, by designing a Bentley-Saxe inspired
-framework for introducing update support to efficient static sampling data
-structures. It seeks to demonstrate the viability of Bentley-Saxe as the basis
-for adding update support to data structures, as well as showing that the
-limitations of the decomposable search problem abstraction can be overcome
-through alternative query processing techniques to preserve good
-performance.
+Having discussed the relevant background materials, we will now turn to a
+discussion of our first attempt to address the limitations of dynamization
+in the context of one particular class of non-decomposable search problem:
+independent random sampling. We've already discussed one representative
+problem of this class, independent range sampling, and shown how it is
+not traditionally decomposable. This specific problem is one of several
+very similar types of problem, however, and in this chapter we will also
+attend to simple random sampling, weighted set sampling, and weighted
+independent range sampling.
+
+Independent sampling presents an interesting motivating example
+because it is nominally supported within many relational databases,
+and is useful in a variety of contexts, such as approximate
+query processing (AQP)~\cite{blinkdb,quickr,verdict,cohen23},
+interactive data exploration~\cite{sps,xie21}, financial audit
+sampling~\cite{olken-thesis}, and feature selection for machine
+learning~\cite{ml-sampling}. However, existing support for these search
+problems is limited by the techniques used within databases to implement
+them. Existing implementations tend to sacrifice either performance,
+by requiring the entire result set to be materialized prior to applying
+Bernoulli sampling, or statistical independence. There exist techniques
+for obtaining both sampling performance and independence by leveraging
+existing B+Tree indices with slight modification~\cite{olken-thesis},
+but even this technique has worse sampling performance than could be
+achieved using specialized static sampling indices.
+
+Thus, we decided to attempt to apply a Bentley-Saxe based dynamization
+technique to these data structures. In this chapter, we discuss our
+approach, which addresses the decomposability problems discussed in
+Section~\ref{ssec:background-irs}, introduces two physical mechanisms
+for supporting deletes, and also introduces an LSM-tree inspired design
+space to allow for performance tuning. The results in this chapter are
+highly specialized to sampling problems; however, they will serve as a
+starting point for our discussion of a generalized framework in
+the subsequent chapter.
+