Multiprocessing and random()
srand() but cooler
Let’s say, for some odd reason, you’re hosting a Python service that:
- takes requests that run large computations based on a randomly generated number (a “seed”)
- spreads work across multiple workers to handle many requests
Then, it’s likely you’ll introduce a subtle randomness bug that leads to duplicate seeds appearing. Let me explain:
The following tests were done on an Ubuntu 20.04 LTS Hetzner instance using:
numpy==1.22.4
python==3.8.10
I additionally made efforts to test certain outcomes on Ubuntu 18.04 LTS with Python 3.6.9, albeit with much uglier logging due to the lack of newer f-string conveniences (the = debug specifier only landed in 3.8).
Randomness: single-process single-threaded
Typically, there’s no need to fiddle with the internal configuration of Python’s random module. Python seeds its Mersenne Twister generator with os.urandom by default, and even if you’re on some really obscure operating system without an entropy source, it falls back to seeding by time, which is pretty hard to collide with at millisecond accuracy.
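The fallback looks roughly like this (a paraphrase of CPython’s seeding logic, not an exact excerpt; older versions do this in Lib/random.py, newer ones in C inside _randommodule.c):

```python
# Paraphrase of CPython's default seeding when no seed is given:
import os
import time


def default_seed_material():
    try:
        # Enough bytes to span the 19937-bit Mersenne Twister state space.
        return int.from_bytes(os.urandom(32), 'big')
    except NotImplementedError:
        # No OS entropy source available: fall back on the clock
        # (newer CPythons also mix in the process id here).
        return int(time.time() * 256)  # fractional seconds
```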
That’s not to say that this is cryptographically secure or anything, but for the use case laid out in the introduction, it’s more than sufficiently random.
But, as with all things Python, it gets a bit dicier as we scale.
Multiprocessing and fork()
As explained on the multiprocessing doc page, the default method for starting multiple Python processes on Linux is to abuse os.fork().
The fork syscall, for those who don’t know, is approximately “an OS function that creates a bit-for-bit copy of an existing process”, with exceptions listed in the preceding link. The random state of a PRNG should be stored somewhere in process memory, so it stands to reason that multiple Python processes fork()ed from the same parent will have the same seed.
We can test this theory with a simple script:
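(Something like the following; the exact draw ranges, task count, and chunksize=1 scheduling are incidental choices.)

```python
import multiprocessing as mp
import random

import numpy as np

NUM_WORKERS = 4


def draw(_):
    # One draw from each PRNG, taken inside the worker process.
    return np.random.randint(2**31), random.randint(0, 2**31)


if __name__ == "__main__":
    with mp.Pool(NUM_WORKERS) as pool:
        # chunksize=1 deals tasks out one at a time, so consecutive
        # outputs come from (roughly) alternating workers.
        results = pool.map(draw, range(16), chunksize=1)
    for i, (np_val, py_val) in enumerate(results):
        print(f"numpy={np_val:<12} random={py_val:<12}")
        if (i + 1) % NUM_WORKERS == 0:
            print("-" * 40)
```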
TLDR: run np.random.randint and random.randint across 4 Python processes in parallel, printing a dividing line every 4 outputs.
The results from this script are fairly surprising: the numpy.random outputs repeat 4 times each (at a period of 4 steps), while vanilla random produces distinct values across all processes. So we can infer that:
1. numpy.random’s PRNG state is captured (pickled) by the multiprocessing library and copied identically across the 4 workers;
2. random’s PRNG state is not duplicated. It could either be shared across all workers, or reseeded within each worker on startup.
To figure out the answer to (2), we can just modify the script to print out the initial state of the PRNGs on worker start:
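(A sketch of the modification, using a pool initializer; the state digesting is just to keep each line short.)

```python
import hashlib
import multiprocessing as mp
import os
import random
import time

import numpy as np


def digest(state):
    # Fingerprint a big PRNG state tuple so it fits on one line.
    return hashlib.md5(repr(state).encode()).hexdigest()[:8]


def report_initial_state():
    # Runs once in each worker, before it picks up any tasks.
    print(
        f"pid={os.getpid()} "
        f"numpy={digest(np.random.get_state())} "
        f"random={digest(random.getstate())}"
    )


def idle(_):
    time.sleep(1)  # keep workers alive long enough to observe


if __name__ == "__main__":
    with mp.Pool(4, initializer=report_initial_state) as pool:
        pool.map(idle, range(4))
```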
As expected, the numpy.random initial state is equivalent across all workers. Unexpectedly, the random initial states are not, and they remain different even when I extend the sleep duration to infinity, indicating that a new random._inst state is initialised per worker.
At this point I’m pretty confused. The numpy PRNG state was getting pickled, but the random state was not. Could I force a pickling of the random state?
I tried extending the test script to cover more things:
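(A condensed sketch of the extended test; note that pool.map is left at its default chunksize.)

```python
import multiprocessing as mp
import random

import numpy as np

ITERS = 128
NUM_WORKERS = 4


def numpy_draw(_):
    return np.random.randint(2**31)


def random_draw(_):
    return random.randint(0, 2**31)


def inst_draw(r_inst):
    # r_inst is the parent's random.Random instance, shipped as an argument.
    return r_inst.randint(0, 2**31)


if __name__ == "__main__":
    for name, task, args in [
        ("numpy.random", numpy_draw, range(ITERS)),
        ("random", random_draw, range(ITERS)),
        ("random._inst", inst_draw, [random._inst] * ITERS),
    ]:
        # A fresh pool per experiment, so each starts from freshly
        # forked workers.
        with mp.Pool(NUM_WORKERS) as pool:
            unique = len(set(pool.map(task, args)))
        print(f"{name}: {unique} unique outcomes over {ITERS} draws")
```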
In short, I’m asking (over 128 iterations): how many unique PRNG states / random numbers are there across all processes?
And the answer is weird:
- if you use numpy.random, you get ITERS/NUM_WORKERS unique outcomes (or NUM_WORKERS collisions of the same seed)
- if you use random, the module, you get ITERS unique outcomes (regardless of the number of processes)
- if you use the parent process’s random.Random instance, stored at random._inst, as an argument to a multiworker task, the random.Random instance will get pickled, and the number of unique outcomes will be ITERS/NUM_WORKERS/4. This formula scales well with both varying ITERS and NUM_WORKERS.
So at this point, I’m doubly confused. I haven’t figured out why random succeeds in re-initing state in new processes, and I now have a new question of what mechanism keeps the pickled random._inst PRNG state ticking forwards.
Speculation
And so, I move into the fog of unverified ideas and poorly substantiated hypotheses.
The fork method is important
I tried switching up the fork method from fork to forkserver/spawn:
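(The switch itself is one line, set before any workers are created; a minimal sketch.)

```python
import multiprocessing as mp

if __name__ == "__main__":
    # Must be called at most once, before creating any pools/processes.
    # "fork" is the Linux default; "spawn" and "forkserver" start fresh
    # interpreters instead of duplicating the parent's memory wholesale.
    mp.set_start_method("forkserver")  # or "spawn"
    with mp.Pool(4) as pool:
        ...  # same tests as before
```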
In both cases, the initial random states of both numpy and random became different in all workers.
Only the choice of fork causes the initial numpy seeds to coincide across workers. Why? I’m not sure; presumably spawn and forkserver boot fresh interpreters that re-import numpy and reseed its global RandomState from OS entropy, whereas fork copies the parent’s already-seeded state wholesale. You can start with the source code here, but it’s not simple.
random state is copied, but overwritten
The pickle representations of numpy.random.randint and random.randint are both large and similar in size:
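(A quick check along these lines; exact byte counts will vary by version.)

```python
import pickle
import random

import numpy as np

# Both attributes are methods bound to a hidden global instance
# (numpy.random.mtrand._rand / random._inst), so pickling the
# function also pickles that instance's full Mersenne Twister state.
for fn in (np.random.randint, random.randint):
    print(fn.__qualname__, len(pickle.dumps(fn)))
```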
Plugging the pickled bytes of either randint function into pickletools.dis() shows a large stored array of numbers, which I assume represents their PRNG states. If the multiprocessing library does capture the state of random before execution, then something about the initialisation process of each worker must recreate the random state again.
Alternatively, the random.randint() function might not be captured by the multiprocessing module at all. But I find this difficult to believe: it has to capture something to run the randint function, and since pickle is unable to dump module types, I can’t come up with another explanation here.
This also explains why explicitly capturing r._inst succeeds in creating duplicate seeds: the multiprocessing library reinitialises random._inst, but the function-provided r_inst object remains a separate instance.
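For what it’s worth, there is a mechanism in CPython itself that fits this hypothesis: since 3.7, Lib/random.py registers a fork hook that reseeds the default instance in every forked child, which would explain why random recovers under fork while numpy (which registers no such hook) does not:

```python
# At the bottom of CPython's Lib/random.py (3.7+); _os is the os
# module and _inst is the module-level Random instance:
if hasattr(_os, "fork"):
    _os.register_at_fork(after_in_child=_inst.seed)
```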
In conclusion
- Using random.* explicitly in a multiprocessed function will never go wrong. Speculation: the random module reseeds its default Random instance in each forked worker (plausibly via the register_at_fork hook noted above).
- Using numpy.random with multiprocessing using fork will cause predictable random number duplication. The PRNG state is captured by pickle and copied between processes.
- Using a copy of random._inst as an argument for a multiprocessed task will cause even more duplication than the numpy option. Mechanism unknown.