Sequential Unbiased Sampling from a Set of Unknown Size
Big data analysis places high demands on effective memory management. In many cases, effective training is possible using only a fraction of the training samples, which significantly reduces both memory consumption and processing time.

To select an unbiased (representative) subset A of samples from a large set S of all training data, one may use the following algorithm:
    while there is a next sample s
        insert s into S
    S <- shuffle(S)
    for i <- 0 to |A| - 1
        A[i] <- S[i]
This algorithm guarantees that each element s of the set S is selected into the subset A with probability p = |A| / |S|.
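For reference, a minimal Python sketch of this shuffle-based approach could look as follows (the function name sample_by_shuffle and the use of the standard random module are illustrative choices, not part of the original pseudocode):

    import random

    def sample_by_shuffle(samples, n):
        # Collect the whole set S in memory, shuffle it uniformly,
        # and return the first n elements as the subset A.
        S = list(samples)
        random.shuffle(S)
        return S[:n]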
This shuffle-based approach requires keeping all |A| + |S| elements in memory simultaneously. To reduce memory consumption, we propose using Algorithm R, a classic reservoir sampling algorithm, which keeps only |A| elements in memory:
    i <- 0
    while there is a next sample s
        if i < |A|
            A[i] <- s
        else
            k <- random(0, i)
            if k < |A|
                A[k] <- s
        i <- i + 1
Here the function random(0, i) returns a uniformly distributed random integer from the closed interval [0, i]. One can show by induction that after processing the first m samples (for any m >= |A|), each of them is present in A with probability |A| / m; hence, once all of S has been processed, each element is selected with the same probability |A| / |S| as in the shuffle-based algorithm. Since usually |A| << |S|, the proposed sampling algorithm yields significant memory savings.
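A direct Python rendering of Algorithm R might look like this (the function name reservoir_sample is an illustrative choice; random.randint(0, i) matches random(0, i) in returning an integer from the closed interval [0, i]):

    import random

    def reservoir_sample(samples, n):
        # Algorithm R: maintain a uniform sample of n elements in O(n) memory.
        A = []
        for i, s in enumerate(samples):
            if i < n:
                A.append(s)               # fill the reservoir with the first n samples
            else:
                k = random.randint(0, i)  # uniform integer from [0, i]
                if k < n:
                    A[k] = s              # s enters the reservoir with probability n / (i + 1)
        return A

As a quick sanity check, one can verify empirically that every element ends up in the sample with frequency close to |A| / |S| (the set size, sample size, and trial count below are arbitrary illustration values):

    from collections import Counter

    counts = Counter()
    trials = 10_000
    for _ in range(trials):
        counts.update(reservoir_sample(range(100), 10))

    # Each of the 100 elements should appear in roughly 10% of the trials.
    print(min(counts.values()) / trials, max(counts.values()) / trials)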