Diffstat (limited to 'chapters/sigmod23')
-rw-r--r--  chapters/sigmod23/background.tex |  8 ++++----
-rw-r--r--  chapters/sigmod23/framework.tex  | 18 +++++++++---------
2 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/chapters/sigmod23/background.tex b/chapters/sigmod23/background.tex
index 984e36c..8d3a88f 100644
--- a/chapters/sigmod23/background.tex
+++ b/chapters/sigmod23/background.tex
@@ -21,8 +21,8 @@ set; the specific usage should be clear from context.
 In each of the problems considered, sampling can be performed either
 with-replacement or without-replacement. Sampling with-replacement
 means that a record that has been included in the sample set for a given
-sampling query is "replaced" into the dataset and allowed to be sampled
-again. Sampling without-replacement does not "replace" the record,
+sampling query is ``replaced'' into the dataset and allowed to be sampled
+again. Sampling without-replacement does not ``replace'' the record,
 and so each individual record can only be included within a sample
 set once for a given query. The data structures that will be discussed
 support sampling with-replacement, and sampling without-replacement can
@@ -38,7 +38,7 @@ in the sample set to match the distribution of the source data set. This
 requires that the sampling of a record does not affect the probability
 of any other record being sampled in the future. Such sample sets are
 said to be drawn i.i.d. (independently and identically distributed). Throughout
-this chapter, the term "independent" will be used to describe both
+this chapter, the term ``independent'' will be used to describe both
 statistical independence and identical distribution.

 Independence of sample sets is important because many useful statistical
@@ -192,7 +192,7 @@ requiring greater than $k$ traversals to obtain a sample set of size $k$.
 \Paragraph{Static Solutions.} There are also a large number of static
 data structures, which we'll call static sampling indices (SSIs) in this
 chapter,\footnote{
-	We used the term "SSI" in the original paper on which this chapter
+	We used the term ``SSI'' in the original paper on which this chapter
 	is based, which was published prior to our realization that a strong
 	distinction between an index and a data structure would be useful. I
 	am retaining the term SSI in this chapter for consistency with the
diff --git a/chapters/sigmod23/framework.tex b/chapters/sigmod23/framework.tex
index 218c290..b413802 100644
--- a/chapters/sigmod23/framework.tex
+++ b/chapters/sigmod23/framework.tex
@@ -252,7 +252,7 @@ of the major limitations of the ghost structure approach for handling
 deletes is that there is not a principled method for removing deleted
 records from the decomposed structure. The standard approach is to set an
 arbitrary threshold on the number of deleted records, and rebuild the entire structure when
-this threshold is crossed~\cite{saxe79}. Mixing the "ghost" records into
+this threshold is crossed~\cite{saxe79}. Mixing the ``ghost'' records into
 the same structures as the original records allows for deleted records
 to naturally be cleaned up over time as they meet their tombstones during
 reconstructions using a technique called tombstone cancellation. This
@@ -280,7 +280,7 @@ mechanism.
 The cost of using a tombstone delete in a Bentley-Saxe dynamization is
 the same as a simple insert,
 \begin{equation*}
-\mathscr{D}(n)_A \in \Theta\left(\frac{B(n)}{n} \log_2 (n)\right)
+D_A(n) \in \Theta\left(\frac{B(n)}{n} \log_2 (n)\right)
 \end{equation*}
 with the worst-case cost being $\Theta(B(n))$.
 Note that there is also a minor performance effect resulting from deleted records appearing
@@ -309,7 +309,7 @@ on a Bentley-Saxe decomposition of that SSI will require, at worst,
 executing a point-lookup on each block, with a total cost of
 \begin{equation*}
-\mathscr{D}(n) \in \Theta\left( L(n) \log_2 (n)\right)
+D(n) \in \Theta\left( L(n) \log_2 (n)\right)
 \end{equation*}

 If the SSI being considered does \emph{not} support an efficient
@@ -391,7 +391,7 @@ a natural way of controlling the number of deleted records within the
 structure, and thereby bounding the rejection rate. During reconstruction,
 we have the opportunity to remove deleted records. This will cause the
 record counts associated with each block of the structure to gradually
-drift out of alignment with the "perfect" powers of two associated with
+drift out of alignment with the ``perfect'' powers of two associated with
 the Bentley-Saxe method, however. In the theoretical literature on this
 topic, the solution to this problem is to periodically re-partition all
 of the records to re-align the block sizes~\cite{merge-dsp, saxe79}. This
@@ -450,7 +450,7 @@ deleted records involved in the reconstruction will be dropped. Tombstones
 may require multiple cascading rounds of compaction to occur, because
 a tombstone record will only cancel when it encounters the record that
 it deletes. However, because tombstones always follow the record they
-delete in insertion order, and will therefore always be "above" that
+delete in insertion order, and will therefore always be ``above'' that
 record in the structure, each reconstruction will move every tombstone
 involved closer to the record it deletes, ensuring that eventually the
 bound will be satisfied.
@@ -526,7 +526,7 @@ and LevelDB~\cite{leveldb}. This work has produced
 an incredibly large and well explored parametrization of the
 reconstruction procedures of LSM trees, a good summary of which can
 be found in this recent tutorial paper~\cite{sarkar23}. Examples of this design
-space exploration include: different ways to organize each "level"
+space exploration include: different ways to organize each ``level''
 of the tree~\cite{dayan19, dostoevsky, autumn}, different growth rates,
 buffering, sub-partitioning of structures to allow finer-grained
 reconstruction~\cite{dayan22}, and approaches for allocating resources to
@@ -739,19 +739,19 @@ Assuming that $N_B \ll n$, the first two terms of this expression are
 constant. Dropping them and amortizing the result over $n$ records give
 us the amortized insertion cost,
 \begin{equation*}
-I_a(n) \in \Theta\left(\frac{B_M(n)}{n}\log_s(n)\right)
+I_A(n) \in \Theta\left(\frac{B_M(n)}{n}\log_s(n)\right)
 \end{equation*}

 If the SSI being considered does not support a more efficient
 construction procedure from other instances of the same SSI,
 and the general Bentley-Saxe \texttt{unbuild} and \texttt{build}
-operations must be used, the the cost becomes $I_a(n) \in
+operations must be used, the cost becomes $I_A(n) \in
 \Theta\left(\frac{B(n)}{n}\log_s(n)\right)$ instead.

 \Paragraph{Delete.} The framework supports both tombstone and tagged
 deletes, each with different performance. Using tombstones, the cost
 of a delete is identical to that of an insert. When using tagging,
 the cost of a delete is the same as the cost of a point lookup, because the
-"delete" itself only sets a bit in the header of the record,
+``delete'' itself only sets a bit in the header of the record,
 once it has been located.
 There will be $\Theta(\log_s n)$ total shards in the structure,
 each with a look-up cost of $L(n)$ using either the SSI's native
 point-lookup, or an auxiliary hash table, and the lookup
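
The hunks above all turn on the same mechanics: a Bentley-Saxe decomposition whose blocks double in capacity, tombstone deletes that are inserted like ordinary records, tombstone cancellation during reconstruction, and point lookups that probe each of the $\Theta(\log n)$ blocks from newest to oldest. The following is a minimal Python sketch of those mechanics, not the framework's actual code: the "SSI" here is just a plain sorted array, and the names (build, cancel_tombstones, BentleySaxe) are invented for illustration.

    # A minimal sketch, assuming a trivial sorted-array "SSI"; every
    # name below is invented for illustration, not the framework's API.
    import bisect


    def build(records):
        """B(n): build a static block; here, just a sorted array."""
        return sorted(records)


    def cancel_tombstones(records):
        """Tombstone cancellation: a record and its tombstone annihilate
        when a reconstruction brings them together; unmatched tombstones
        survive until a later reconstruction reaches their targets."""
        net = {}
        for key, is_tombstone in records:
            net[key] = net.get(key, 0) + (-1 if is_tombstone else 1)
        out = []
        for key, count in net.items():
            out += [(key, False)] * max(count, 0)   # surviving records
            out += [(key, True)] * max(-count, 0)   # surviving tombstones
        return out


    class BentleySaxe:
        """Logarithmic decomposition: slot i holds at most 2**i records.
        Cancellation lets block sizes drift below the "perfect" powers
        of two, matching the drift discussed in the framework.tex hunks."""

        def __init__(self):
            self.blocks = []  # slot 0 is the newest block

        def _put(self, rec):
            run = [rec]
            for i, blk in enumerate(self.blocks):
                if blk is None:          # empty slot: rebuild here
                    self.blocks[i] = build(cancel_tombstones(run))
                    return
                run += blk               # cascade, like a binary counter
                self.blocks[i] = None
            self.blocks.append(build(cancel_tombstones(run)))

        def insert(self, key):
            self._put((key, False))

        def delete(self, key):
            # Tombstone delete: same cost profile as an insert.
            self._put((key, True))

        def contains(self, key):
            # Point lookup: probe each of the O(log n) blocks, newest
            # first. A tombstone always lives in a block at least as
            # new as its target, so hitting it first masks the record.
            for blk in self.blocks:
                if blk:
                    i = bisect.bisect_left(blk, (key, False))
                    if i < len(blk) and blk[i][0] == key:
                        return not blk[i][1]
            return False


    if __name__ == "__main__":
        d = BentleySaxe()
        for k in (5, 3, 9, 7):
            d.insert(k)
        d.delete(3)
        assert d.contains(5) and d.contains(9) and not d.contains(3)

A real SSI would substitute an efficient merge-based construction $B_M(n)$ for the generic build above and could grow blocks by a factor $s$ rather than two, which is where the $\log_s$ terms in the hunks' cost formulas come from.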