Showing posts from February, 2016

Finding random subsets

Suppose you have a large set of records and you need to process them in random batches over a long period of time. By "random batches" I mean subsets containing random elements of the full set. The solution that has worked well for us consists of the following steps: 1) load the unprocessed record ids into memory; 2) periodically extract a random batch of ids; 3) process the extracted records and persist them as processed. The tricky part is step 2): how do you efficiently find a random subset of a big set? It turns out there is a ready-made algorithm for this, the Fisher-Yates shuffle: https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle . Following is the JavaScript implementation, which proved useful for the task described above:
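Before that implementation, a minimal sketch of the core idea may help. It is an illustration written for this explanation, not the code we run: the names `extractRandomBatch`, `ids`, `batchSize` and `random` are hypothetical, and it assumes the unprocessed ids already sit in a plain in-memory array. It runs only as many Fisher-Yates steps as the batch needs (a partial shuffle) and then cuts the shuffled tail off the array.

```javascript
// Hypothetical helper: removes and returns up to `batchSize` random ids
// from `ids`, mutating `ids` so extracted ids are never picked twice.
function extractRandomBatch(ids, batchSize, random = Math.random) {
  const count = Math.min(batchSize, ids.length);
  for (let i = 0; i < count; i++) {
    // Position at the end of the array that receives the next random pick.
    const last = ids.length - 1 - i;
    // Choose uniformly among the not-yet-picked positions [0, last].
    const j = Math.floor(random() * (last + 1));
    [ids[last], ids[j]] = [ids[j], ids[last]];
  }
  // The last `count` elements now form the batch; splice removes them in place.
  return ids.splice(ids.length - count, count);
}
```

Called periodically, for example `extractRandomBatch(unprocessedIds, 100)`, this would cover step 2): each call costs time proportional to the batch size rather than to the size of the whole set, and the array shrinks until every id has been handed out.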