Tell us what’s happening:
Test 3 has an incorrect expected value. I've enumerated the possibilities manually and counted 2,088 ways to draw at least 2 blue balls and at least 1 green ball when drawing 4 balls randomly without replacement from a hat initialized with 3 blue balls, 2 red balls, and 6 green balls. With 7,920 total ordered draws (11 * 10 * 9 * 8), the probability should be 0.26363636… The test looks for a probability of 0.272 (delta 0.01), so the true value sits very close to the edge of the accepted range.
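For anyone who wants to check the count, here is a short brute-force sketch (the variable names are mine; only the hat contents come from the test):

```python
from itertools import combinations
from math import comb, factorial

# The hat from test 3: 3 blue, 2 red, 6 green (11 balls total).
balls = ['blue'] * 3 + ['red'] * 2 + ['green'] * 6

# Count unordered 4-ball draws with at least 2 blue and at least 1 green.
hits = sum(
    1
    for draw in combinations(balls, 4)
    if draw.count('blue') >= 2 and draw.count('green') >= 1
)
total = comb(11, 4)                       # 330 unordered draws
print(hits, total, hits / total)          # 87 330 0.26363636...

# Ordered counts scale both numbers by 4!:
print(hits * factorial(4), total * factorial(4))  # 2088 7920
```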
The expected result is correct; it corresponds to the seed value set in the test suite. With only 1000 experiments, it is extremely unlikely that you will recover the true probability, so the test instead checks that your result is close to the target produced by running 1000 experiments with that fixed seed.
If you post your code, we can help you determine why your code is not close to the expected value.
Indeed it does work with the seed value, but only if you consume randomness exactly as the test writers expected. I discovered the discrepancy because I implemented the draw using random.shuffle() instead of random.randint(). Changing my code to call random.randint() to choose the index of each ball to draw, instead of shuffling the contents with random.shuffle() and then calling pop() to remove the balls, did result in passing the test. That does not mean the test is not in error.
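Here is a minimal sketch of the two strategies (the function names are mine, and the Hat class plumbing is omitted):

```python
import random

def draw_with_shuffle(contents, num_balls):
    """Shuffle the whole list, then pop balls off the end."""
    random.shuffle(contents)
    return [contents.pop() for _ in range(num_balls)]

def draw_with_randint(contents, num_balls):
    """Pick a random index for each ball removed."""
    return [contents.pop(random.randint(0, len(contents) - 1))
            for _ in range(num_balls)]
```

Both draws are uniformly random, but they consume the underlying random stream differently, so with the fixed seed they yield different draw sequences, and only the randint() pattern reproduces the expected 0.272.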
FYI, I timed both methods running an experiment of 1000000 tries using time.process_time(). The random.shuffle() code ran slightly faster.
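Roughly like this, using the draw helpers from my earlier sketch (the harness is a reconstruction, not my exact code):

```python
import time

def time_draws(draw, n_tries=1_000_000):
    """CPU time for n_tries draws of 4 balls, each from a fresh 11-ball hat."""
    start = time.process_time()
    for _ in range(n_tries):
        contents = ['blue'] * 3 + ['red'] * 2 + ['green'] * 6
        draw(contents, 4)
    return time.process_time() - start

# e.g. time_draws(draw_with_shuffle) vs. time_draws(draw_with_randint)
```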
I agree that 52.2 s vs 50.6 s could be noise.
The test is incorrect, or the assignment documentation is incomplete. The test only passes with a specific implementation detail that was never assigned, so implementations that match the assignment as written can still fail.
1000 x 1000 experiments using random.shuffle():
prob mean  0.263874
prob stdev 0.013596580121155137

1000 x 1000 experiments using random.randint():
prob mean  0.264042
prob stdev 0.01375214010099364
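For reference, these numbers came from something like the following (a sketch of my harness; the randint variant just swaps in the other draw function):

```python
import random
import statistics

def draw_with_shuffle(contents, num_balls):
    # Same draw strategy as the earlier sketch.
    random.shuffle(contents)
    return [contents.pop() for _ in range(num_balls)]

def experiment(draw, num_experiments=1000):
    """Fraction of experiments drawing >= 2 blue and >= 1 green in 4 balls."""
    successes = 0
    for _ in range(num_experiments):
        contents = ['blue'] * 3 + ['red'] * 2 + ['green'] * 6
        drawn = draw(contents, 4)
        if drawn.count('blue') >= 2 and drawn.count('green') >= 1:
            successes += 1
    return successes / num_experiments

probs = [experiment(draw_with_shuffle) for _ in range(1000)]
print('prob mean ', statistics.mean(probs))
print('prob stdev', statistics.stdev(probs))
```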
I recommend modifying the test to expect the actual value of 0.264 with a delta of 0.03.
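As a sanity check, the measured spread lines up with the binomial standard error for 1000 experiments, so a delta of 0.03 would sit a bit over two standard deviations out:

```python
from math import sqrt

p, n = 87 / 330, 1000           # true probability and experiments per run
print(sqrt(p * (1 - p) / n))    # ~0.01393, matching the measured ~0.0136/0.0138
```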
That approach makes the tests easier to game, which I suspect is why they do not use the approach you recommend. And without a seed it's easier to get false failures.
Another alternative would be to design the tests with statistical analysis, so that the test is smart enough to figure out whether the result is converging on an incorrect value.
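Something along these lines, for example (entirely a sketch; p_true, runs, and the z threshold are illustrative choices, not anything from the test suite):

```python
from math import sqrt

def converges_to_truth(experiment, p_true=87/330, n=1000, runs=100, z=4.0):
    """Pool many unseeded runs and check that the mean is within z standard
    errors of the true probability. An implementation converging elsewhere
    (e.g. 0.272) fails, while ordinary sampling noise almost never does."""
    estimates = [experiment(n) for _ in range(runs)]
    mean = sum(estimates) / runs
    se = sqrt(p_true * (1 - p_true) / (n * runs))  # std error of the pooled mean
    return abs(mean - p_true) <= z * se
```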