\begin{abstract}
The execution of analytical queries on massive datasets presents challenges
due to long response times and high computational costs. As a result, the
analysis of representative samples of data has emerged as an attractive
alternative; this avoids the cost of processing queries against the entire
dataset, while still producing statistically valid results. Unfortunately,
the sampling techniques in common use sacrifice either sample quality or
performance, and so are poorly suited for this task. However, it is
possible to build high quality sample sets efficiently with the assistance
of indexes. This introduces a new challenge: real-world data is subject to
continuous update, and so the indexes must be kept up to date. This is
difficult, because existing sampling indexes present a dichotomy: efficient
sampling indexes are difficult to update, while easily updatable indexes
have poor sampling performance. This paper addresses this gap by
proposing a general and practical framework for extending most sampling
indexes with efficient update support, based on splitting indexes into
smaller shards and systematically reconstructing them periodically. The
framework's design space is examined, with an eye
towards exploring trade-offs between update performance, sampling
performance, and memory usage. Three existing static sampling indexes are
extended using this framework to support updates, and the generalization of
the framework to concurrent operations and larger-than-memory data is
discussed. Through a comprehensive suite of benchmarks, the extended
indexes are shown to match or exceed the update throughput of
state-of-the-art dynamic baselines, while presenting significant
improvements in sampling latency.
\end{abstract}