srand() but cooler
Let’s say, for some odd reason, you’re hosting a Python service that:
- takes requests that run large computations based on a randomly generated number (a “seed”)
- spreads work across multiple workers to handle many requests
Then, it’s likely you’ll introduce a subtle randomness bug that leads to duplicate seeds appearing. Let me explain:
The following tests were done on an Ubuntu 20.04 LTS Hetzner instance using:
I additionally made efforts to test certain outcomes on Ubuntu 18 LTS with Python 3.6.9, albeit with much uglier logging due to the lack of f-string improvements.
Randomness: single-process single-threaded
Typically, there’s no need to fiddle with the internal configuration of Python’s random module. Python seeds its Mersenne Twister with os.urandom by default, and even if you’re on some really obscure operating system without it, it falls back to seeding by time, which is pretty hard to collide with at millisecond accuracy:
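(The snippet the post showed here isn’t reproduced. As a stand-in, here’s a minimal sketch of my own demonstrating the point: two independently constructed generators seed themselves separately, so their streams differ.)

```python
import random

# Two fresh generators; each seeds itself (from os.urandom under the hood)
# on construction, so their output streams differ.
a = random.Random()
b = random.Random()
print(a.random(), b.random())  # two different values
```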
That’s not to say that this is cryptographically secure or anything, but for the purposes laid out in the introduction, it’s more than sufficiently random.
But, as with all things Python, things get a bit dicier as we scale.
As explained on the multiprocessing docpage, the default methodology for spawning multiple Python processes on Linux is to abuse fork().
The fork syscall, for those who don’t know, is approximately “an OS function that creates a bit-for-bit copy of an existing process”, with exceptions listed in the preceding link. The random state of a PRNG should be stored somewhere in process memory, so it stands to reason that multiple Python processes fork()ed from the same parent will have the same seed.
We can test this theory with a simple script:
random.randint with 4 Python processes in parallel; print a dividing line every 4 outputs.
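A sketch of what such a script might look like (reconstructed from the description above; the structure and the name `sample` are my guesses):

```python
import multiprocessing as mp
import random

import numpy as np

NUM_WORKERS = 4

def sample(_):
    # Draw once from each PRNG inside the worker process.
    return int(np.random.randint(0, 2**31)), random.randint(0, 2**31)

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # the Linux default discussed above
    with ctx.Pool(NUM_WORKERS) as pool:
        results = pool.map(sample, range(NUM_WORKERS * 4))
    for i, (np_val, py_val) in enumerate(results):
        print(f"numpy: {np_val:<12} random: {py_val}")
        if (i + 1) % NUM_WORKERS == 0:
            print("-" * 40)
```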
The results from this script are fairly surprising:
Observe that the numpy.random outputs repeat 4 times (every 4 steps), while the outputs of vanilla random stay distinct across processes. So we can infer that:
1. numpy.random’s PRNG state is captured (pickled) by the multiprocessing library and copied identically across 4 workers;
2. random’s PRNG state is not duplicated. It is either shared across all workers, or reseeded within each worker on startup.
To figure out the answer to (2), we can just modify the script to print out the initial state of the PRNGs on worker start:
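One possible version of that modification (mine; the author’s exact logging isn’t shown) fingerprints each worker’s PRNG states the moment it starts:

```python
import hashlib
import multiprocessing as mp
import os
import random

import numpy as np

def fingerprint(data) -> str:
    # Short stable hash of a PRNG state, for easy visual comparison.
    return hashlib.md5(repr(data).encode()).hexdigest()[:8]

def show_initial_state():
    py_fp = fingerprint(random.getstate())
    np_fp = fingerprint(np.random.get_state()[1].tolist())
    print(f"pid={os.getpid()} random={py_fp} numpy={np_fp}")

if __name__ == "__main__":
    ctx = mp.get_context("fork")
    workers = [ctx.Process(target=show_initial_state) for _ in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```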
As expected, the numpy.random initial states are equivalent across all workers. Unexpectedly, the random initial states are not, and they remain different even when I extend the sleep duration to infinity, indicating that a new random._inst state is initialised per worker:
At this point I’m pretty confused. The numpy module’s state was getting pickled, but the random module’s was not. Could I force a pickling of the random state myself?
I tried extending the test script to cover more things:
In short, I’m asking (over 128 iterations): how many unique PRNG states / random numbers are there across all processes?
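A reconstruction of that experiment (the three strategies come from the text below; the structure and names like `via_parent_inst` are my guesses):

```python
import multiprocessing as mp
import random

import numpy as np

ITERS = 128
NUM_WORKERS = 4

def via_numpy(_):
    return int(np.random.randint(0, 2**31))

def via_random(_):
    return random.randint(0, 2**31)

def via_parent_inst(inst):
    # `inst` is an unpickled copy of the parent's random._inst.
    return inst.randint(0, 2**31)

if __name__ == "__main__":
    ctx = mp.get_context("fork")
    with ctx.Pool(NUM_WORKERS) as pool:
        for name, fn, args in [
            ("numpy.random", via_numpy, list(range(ITERS))),
            ("random", via_random, list(range(ITERS))),
            ("random._inst", via_parent_inst, [random._inst] * ITERS),
        ]:
            unique = len(set(pool.map(fn, args)))
            print(f"{name:>13}: {unique} unique outcomes out of {ITERS}")
```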
And the answer is weird:
- if you use numpy.random, you get ITERS/NUM_WORKERS unique outcomes (or NUM_WORKERS collisions of the same seed)
- if you use the random module, you get ITERS unique outcomes (regardless of the number of processes)
- if you use the parent process’s random.Random instance, stored at random._inst, as an argument to a multiworker task, the random.Random instance will get pickled, and the number of unique outcomes will be ITERS/NUM_WORKERS/4. This formula scales well with both varying ITERS and NUM_WORKERS.
So at this point, I’m doubly confused. I haven’t figured out why random succeeds in re-initing its state in new processes, and I now have a new question of what mechanism keeps the pickled random._inst PRNG state ticking forwards.
And so, I move into the fog of unverified ideas and poorly substantiated hypotheses.
fork method is important
I tried switching the start method from fork to spawn and forkserver. In both cases, the random states of both numpy.random and random became different in all workers:
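Switching methods looks roughly like this (a sketch; `report` is my placeholder, not the post’s code):

```python
import multiprocessing as mp
import os
import random

import numpy as np

def report():
    # First draw from each PRNG in a fresh worker.
    print(f"pid={os.getpid()} "
          f"numpy_first={np.random.randint(0, 2**31)} "
          f"random_first={random.getrandbits(31)}")

if __name__ == "__main__":
    # On Linux this yields fork, spawn and forkserver.
    for method in mp.get_all_start_methods():
        print(f"--- start method: {method} ---")
        ctx = mp.get_context(method)
        workers = [ctx.Process(target=report) for _ in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
```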
Only the choice of fork causes the initial seeds of numpy to coincide. Why? I’m not sure; you can start with the source code here, but it’s not simple.
random state is copied, but overwritten
The pickle representations of both numpy.random.randint and random.randint are large and similar in size:
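A sketch of that measurement (the exact byte counts vary by Python and numpy version):

```python
import pickle
import random

import numpy as np

# Both randint functions are bound methods, so pickling them drags the
# owning generator instance -- and its full Mersenne Twister state -- along.
py_blob = pickle.dumps(random.randint)      # bound to random._inst
np_blob = pickle.dumps(np.random.randint)   # bound to the global RandomState
print(f"random.randint:       {len(py_blob)} bytes")
print(f"numpy.random.randint: {len(np_blob)} bytes")
```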
Plugging the outputs of either randint() function into pickletools.dis() shows a large stored array of numbers, which I assume is representative of their PRNG states. If the multiprocessing library does capture the state of random before execution, then something about the initialisation process of each worker must recreate the random state.
Alternatively, perhaps the random.randint() function is not captured by the multiprocessing module entirely. But I find this difficult to believe, because it has to capture something to run the randint function, and since pickle is unable to dump module types, I’m unable to come up with another idea here.
This also explains why explicitly capturing random._inst succeeds in creating duplicate seeds – the multiprocessing library reinitialises random._inst in each worker, but the function-provided r_inst object remains a separate instance.
- Using random.* explicitly in a multiprocessed function will never go wrong. Speculation: the multiprocessing library recreates a random.Random object per worker.
- Using numpy.random with multiprocessing using fork will cause predictable random number duplication. The PRNG state is captured by pickle and copied between processes.
- Using a copy of random._inst as an argument for a multiprocessed task will cause even more duplication than the numpy option. Mechanism unknown.