From eb519d35d7f11427dd5fc877130b02478f0da80d Mon Sep 17 00:00:00 2001
From: Douglas Rumbaugh
Date: Sun, 4 May 2025 16:43:45 -0400
Subject: Began re-writing/updating the sampling extension stuff

---
 chapters/sigmod23/introduction.tex | 54 +++++++++++++++++++++++++-------------
 1 file changed, 36 insertions(+), 18 deletions(-)

diff --git a/chapters/sigmod23/introduction.tex b/chapters/sigmod23/introduction.tex
index 0155c7d..befdbba 100644
--- a/chapters/sigmod23/introduction.tex
+++ b/chapters/sigmod23/introduction.tex
@@ -1,20 +1,38 @@
 \section{Introduction}
 \label{sec:intro}
-As a first attempt at realizing a dynamic extension framework, one of the
-non-decomposable search problems discussed in the previous chapter was
-considered: independent range sampling, along with a number of other
-independent sampling problems. These sorts of queries are important in a
-variety of contexts, including including approximate query processing
-(AQP)~\cite{blinkdb,quickr,verdict,cohen23}, interactive data
-exploration~\cite{sps,xie21}, financial audit sampling~\cite{olken-thesis}, and
-feature selection for machine learning~\cite{ml-sampling}. However, they are
-not well served using existing techniques, which tend to sacrifice statistical
-independence for performance, or vise versa. In this chapter, a solution for
-independent sampling is presented that manages to achieve both statistical
-independence, and good performance, by designing a Bentley-Saxe inspired
-framework for introducing update support to efficient static sampling data
-structures. It seeks to demonstrate the viability of Bentley-Saxe as the basis
-for adding update support to data structures, as well as showing that the
-limitations of the decomposable search problem abstraction can be overcome
-through alternative query processing techniques to preserve good
-performance.
+Having discussed the relevant background materials, we will now turn to a
+discussion of our first attempt to address the limitations of dynamization
+in the context of one particular class of non-decomposable search problem:
+independent random sampling. We have already discussed one representative
+problem of this class, independent range sampling, and shown how it is
+not traditionally decomposable. This specific problem is one of several
+very similar types of problem, however, and in this chapter we will also
+consider simple random sampling, weighted set sampling, and weighted
+independent range sampling.
+
+Independent sampling presents an interesting motivating example
+because it is nominally supported within many relational databases,
+and is useful in a variety of contexts, such as approximate
+query processing (AQP)~\cite{blinkdb,quickr,verdict,cohen23},
+interactive data exploration~\cite{sps,xie21}, financial audit
+sampling~\cite{olken-thesis}, and feature selection for machine
+learning~\cite{ml-sampling}. However, existing support for these search
+problems is limited by the techniques used within databases to implement
+them. Existing implementations tend to sacrifice either performance,
+by requiring the entire result set to be materialized prior to applying
+Bernoulli sampling, or statistical independence. There exist techniques
+for obtaining both sampling performance and independence by leveraging
+existing B+Tree indices with slight modification~\cite{olken-thesis},
+but even this technique has worse sampling performance than could be
+achieved using specialized static sampling indices.
+
+Thus, we decided to attempt to apply a Bentley-Saxe based dynamization
+technique to these data structures. In this chapter, we discuss our
+approach, which addresses the decomposability problems discussed in
+Section~\ref{ssec:background-irs}, introduces two physical mechanisms
+for supporting deletes, and also introduces an LSM-tree inspired design
+space to allow for performance tuning. The results in this chapter are
+highly specialized to sampling problems; however, they will serve as a
+launching point for our discussion of a generalized framework in
+the subsequent chapter.
+
-- 
cgit v1.2.3