
Sequential Unbiased Sampling from a Set of Unknown Size


Big data analysis techniques place high demands on effective memory management. In many cases it is possible to perform effective training using only a fraction of the training samples, which significantly reduces both memory consumption and processing time.

Naive sampling

In order to select an unbiased (representative) subset A of samples from a large set S of all training data, one may use the following algorithm:

for each incoming sample s
    insert s into S
S <- shuffle(S)
for i <- 0 to |A| - 1
    A[i] <- S[i]

This algorithm guarantees that each element s of the set S has probability p = |A| / |S| of being selected into the subset A.
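
As an illustration, here is a minimal C++ sketch of the naive approach (C++ is assumed as the implementation language; the Sample alias is a placeholder for whatever type the training samples actually have):

#include <algorithm>
#include <random>
#include <vector>

using Sample = std::vector<float>;  // placeholder sample type

// Naive sampling: store all of S in memory, shuffle it, and keep the
// first `capacity` elements. Requires O(|S|) memory.
std::vector<Sample> naiveSample(std::vector<Sample> S, size_t capacity)
{
    std::mt19937 gen(std::random_device{}());
    std::shuffle(S.begin(), S.end(), gen);
    if (S.size() > capacity)
        S.resize(capacity);
    return S;
}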

Reservoir sampling

The algorithm above requires storing all |A| + |S| elements in memory simultaneously. To reduce memory consumption, we propose using Algorithm R, which needs to keep only |A| elements in memory:

i <- 0
for each incoming sample s
    if i < |A|
        A[i] <- s
    else
        k <- random(0, i)
        if k < |A|
            A[k] <- s
    i <- i + 1

Here the function random(0, i) returns a random integer from the closed interval [0; i]. One can show by induction that after the i-th sample has been processed, every sample seen so far resides in the reservoir A with probability |A| / i; when the stream ends, each element of S has therefore been selected with the same probability p = |A| / |S| as in the naive algorithm. Since usually |A| << |S|, the proposed sampling algorithm allows for significant memory savings.
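
Under the same assumptions as above (Sample is a placeholder type; next is a hypothetical callback that writes the next sample into its argument and returns false once the stream is exhausted), a minimal C++ sketch of Algorithm R could look as follows:

#include <random>
#include <vector>

using Sample = std::vector<float>;  // placeholder sample type

// Algorithm R: keeps at most `capacity` elements in memory, regardless of
// how many samples the stream delivers.
template <typename NextFn>
std::vector<Sample> reservoirSample(size_t capacity, NextFn next)
{
    std::vector<Sample> A;
    A.reserve(capacity);
    std::mt19937 gen(std::random_device{}());
    Sample s;
    for (size_t i = 0; next(s); ++i) {
        if (i < capacity) {
            A.push_back(s);                     // fill the reservoir first
        } else {
            // random integer from the closed interval [0; i]
            std::uniform_int_distribution<size_t> dist(0, i);
            size_t k = dist(gen);
            if (k < capacity)
                A[k] = s;  // s replaces a random element with probability |A| / (i + 1)
        }
    }
    return A;
}

Only the reservoir A ever resides in memory, so the memory footprint stays at O(|A|) no matter how long the stream is.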