Multiprocessing and random()
srand() but cooler
Let’s say, for some odd reason, you’re hosting a Python service that:
- takes requests that run large computations based on a randomly generated number (a “seed”)
- spreads work across multiple workers to handle many requests
Then, it’s likely you’ll introduce a subtle randomness bug that leads to duplicate seeds appearing. Let me explain:
The following tests were done on an Ubuntu 20.04 LTS Hetzner instance using:
numpy==1.22.4
python==3.8.10
I additionally tested certain outcomes on Ubuntu 18.04 LTS with Python 3.6.9, albeit with much uglier logging due to the lack of newer f-string features.
Randomness: single-process single-threaded
Typically, there's no need to fiddle with the internal configuration of Python's random module. Python seeds its Mersenne Twister with os.urandom by default, and even if you're on some really obscure operating system without it, it falls back to seeding by time, which is pretty hard to collide with millisecond accuracy.
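For instance (a quick throwaway snippet), two Random instances created back to back with no explicit seed already land on completely different states:

```python
import random

# Each Random() built without an explicit seed pulls fresh entropy from
# os.urandom(), so their states and outputs essentially never collide.
a, b = random.Random(), random.Random()
print(a.randint(0, 2**63), b.randint(0, 2**63))
print(a.getstate() == b.getstate())  # False
```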
That's not to say that this is cryptographically secure or anything, but for the use case laid out in the introduction, it's more than sufficiently random.
But, as with all things Python, things get a bit dicier as we scale.
Multiprocessing and fork()
As explained on the multiprocessing documentation page, the default method for spawning multiple Python processes on Linux is to abuse os.fork().
The fork syscall, for those who don't know, is approximately "an OS function that creates a bit-for-bit copy of an existing process", with exceptions listed in the preceding link. The random state of a PRNG should be stored somewhere in process memory, so it stands to reason that multiple Python processes fork()ed from the same parent will have the same seed.
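You can see the inheritance with the raw syscall, before multiprocessing even enters the picture (a minimal sketch; the draw range is arbitrary):

```python
import os
import numpy as np

# numpy seeds its global PRNG once, at import time; fork() then copies
# that state byte-for-byte into the child.
pid = os.fork()
value = np.random.randint(0, 2**31)   # parent and child draw from identical state
print("child " if pid == 0 else "parent", value)
if pid == 0:
    os._exit(0)   # keep the child from running anything past this point
```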
We can test this theory with a simple script.
TLDR: run np.random.randint and random.randint with 4 Python processes in parallel; print a dividing line every 4 outputs.
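A minimal sketch along these lines (helper names and draw ranges are arbitrary):

```python
import random
import time
import numpy as np
from multiprocessing import Pool

NUM_WORKERS = 4

def draw(_):
    # One draw from numpy's global PRNG and one from the random module's
    # global instance; the sleep keeps the 4 workers roughly in lockstep.
    time.sleep(0.1)
    return np.random.randint(0, 2**31), random.randint(0, 2**31)

if __name__ == "__main__":
    with Pool(NUM_WORKERS) as pool:
        for i, (np_val, py_val) in enumerate(pool.map(draw, range(16))):
            print(f"numpy={np_val}  random={py_val}")
            if (i + 1) % NUM_WORKERS == 0:
                print("-" * 40)
```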
The results from this script are fairly surprising: the numpy.random outputs repeat 4 times (every 4 steps), while the outputs of vanilla random remain distinct across processes. So we can infer that:
1. numpy.random's PRNG state is captured (pickled) by the multiprocessing library and copied identically across 4 workers;
2. random's PRNG state is not duplicated. It could either be shared across all workers, or reseeded within each worker on startup.
To figure out the answer to (2), we can just modify the script to print out the initial state of the PRNGs on worker start:
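Something like the following does the trick, using a Pool initializer so the print happens exactly once per freshly started worker (function names are arbitrary):

```python
import os
import random
import time
import numpy as np
from multiprocessing import Pool

def show_initial_state():
    # Runs once per worker, right after it starts: print a fingerprint
    # of each global PRNG's state as that worker first sees it.
    np_key = np.random.get_state()[1]   # 624-word MT19937 key
    py_key = random.getstate()[1]       # 625-word MT19937 state tuple
    print(f"pid={os.getpid()} numpy={np_key[:3]} random={py_key[:3]}")

def work(_):
    time.sleep(1)   # keep each worker busy so all 4 actually start

if __name__ == "__main__":
    with Pool(4, initializer=show_initial_state) as pool:
        pool.map(work, range(4))
```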
As expected, the numpy.random initial state is identical across all workers. Unexpectedly, the random initial states are not, and they remain different even when I extend the sleep duration to infinity, indicating that a new random._inst state is initialised per worker.
At this point I’m pretty confused. The numpy
module was getting pickled, but the random
module was not. Could I force a pickling of the random
state?
I tried extending the test script to cover more things.
In short, I’m asking (over 128 iterations): how many unique PRNG states / random numbers are there across all processes?
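A rough sketch of the comparison (helper names and draw ranges are arbitrary):

```python
import random
import numpy as np
from multiprocessing import Pool

ITERS = 128
NUM_WORKERS = 4

def via_numpy(_):
    return np.random.randint(0, 2**31)

def via_random(_):
    return random.randint(0, 2**31)

def via_inst(inst):
    # The parent's Random instance is passed as an argument, so
    # multiprocessing has to pickle it along with each task.
    return inst.randint(0, 2**31)

if __name__ == "__main__":
    with Pool(NUM_WORKERS) as pool:
        print("numpy.random:", len(set(pool.map(via_numpy, range(ITERS)))))
        print("random      :", len(set(pool.map(via_random, range(ITERS)))))
        print("random._inst:", len(set(pool.map(via_inst, [random._inst] * ITERS))))
```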
And the answer is weird:
- if you use numpy.random, you get ITERS/NUM_WORKERS unique outcomes (or NUM_WORKERS collisions of the same seed)
- if you use random, the module, you get ITERS unique outcomes (regardless of the number of processes)
- if you use the parent process's random.Random instance, stored at random._inst, as an argument to a multiworker task, the random.Random instance will get pickled, and the number of unique outcomes will be ITERS/NUM_WORKERS/4. This formula scales well with both varying ITERS and NUM_WORKERS.
So at this point, I’m doubly confused. I haven’t figured out why random
succeeds in re-initing state in new processes, and I now have a new question of what mechanism keeps the pickled random._inst
PRNG state ticking forwards.
Speculation
And so, I move into the fog of unverified ideas and poorly substantiated hypotheses.
The start method is important
I tried switching the start method from fork to forkserver/spawn:
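Switching is a one-liner with set_start_method; a sketch, reusing the same per-worker state print as before:

```python
import multiprocessing as mp
import random
import numpy as np

def show_initial_state():
    # Print a fingerprint of each global PRNG as the worker starts.
    print(np.random.get_state()[1][:3], random.getstate()[1][:3])

def noop(_):
    pass

if __name__ == "__main__":
    # "fork" is the default on Linux; "spawn" and "forkserver" start each
    # worker from a fresh interpreter instead of copying the parent's memory.
    mp.set_start_method("forkserver")   # try "spawn" as well
    with mp.Pool(4, initializer=show_initial_state) as pool:
        pool.map(noop, range(4))
```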
In both cases, the random states of both numpy and random became different in all workers.
Only the fork start method causes the initial numpy seeds to collide. Why? I'm not sure; you can start with the source code here, but it's not simple.
random state is copied, but overwritten
The pickle representations of both numpy.random.randint and random.randint are large and similar in size.
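A sketch of that check; pickling the bound randint methods drags the owning PRNG instance, and with it the state, into the payload:

```python
import pickle
import pickletools
import random
import numpy as np

# Pickling a bound method serialises the instance it is bound to, so the
# Mersenne Twister state comes along for the ride in both cases.
np_blob = pickle.dumps(np.random.randint)
py_blob = pickle.dumps(random.randint)
print(len(np_blob), len(py_blob))

# Disassembling shows the long run of state integers.
pickletools.dis(py_blob)
```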
Plugging the pickled output of either randint function into pickletools.dis() shows a large stored array of numbers, which I assume is representative of their PRNG states. If the multiprocessing library does capture the state of random before execution, then something about the initialisation process of each worker must recreate the random state again.
Alternatively, the random.randint() function is not captured by the multiprocessing module at all. But I find this difficult to believe, because it has to capture something to run the randint function, and since pickle is unable to dump module types, I'm unable to come up with another idea here.
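For reference, pickling a module object fails outright:

```python
import pickle
import random

# Modules themselves are not picklable, so multiprocessing cannot simply
# ship the whole random module over to a worker.
try:
    pickle.dumps(random)
except TypeError as exc:
    print(exc)   # e.g. "cannot pickle 'module' object"
```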
The reinitialisation idea also explains why explicitly capturing random._inst succeeds in creating duplicate seeds: the multiprocessing library reinitialises random._inst in each worker, but the instance passed in as an argument remains a separate, pickled copy.
In conclusion
- Using random.* explicitly in a multiprocessed function will never go wrong. Speculation: the multiprocessing library recreates a random.Random object per worker.
- Using numpy.random with multiprocessing using fork will cause predictable random number duplication. The PRNG state is captured by pickle and copied between processes.
- Using a copy of random._inst as an argument for a multiprocessed task will cause even more duplication than the numpy option. Mechanism unknown.