Drawing numbered balls from a bag with replacement

Jennifer Murphy

Registered Senior Member
Given a bag of N balls numbered 1 to N. I'd like to calculate the expected (average?) number of draws needed to have probability P of drawing each of the N balls with replacement. For example, how many balls would I need to draw (with replacement) to have a 90% chance of drawing every ball?

I believe the odds involve one or more infinite series, but my stat skills are way too rusty to be able to work it out.

Thanks for any help.
 
Had to think about this one for a while! A quick note about terminology: "expected", "average", and "mean" typically mean the same thing in statistics, but that's not what you're looking for here. In operational terms, the "expected number of draws needed" is what you would get by drawing until you got all N balls, writing down the number of draws it took you, and then repeating the experiment many times to get an average. What you want is for a number of draws D, what is the probability P to draw each ball in D draws or less. As far as I can tell, this is a significantly tougher problem. My solution isn't super-elegant, but I think it works; please point out any errors you see, or any cleaner solutions.

When faced with a problem of the form "what is the probability of x happening at least once?" it often helps to re-frame it as "what is the probability of x not happening at all?" Since x must either happen or not happen, you can subtract the answer to either problem from 1 to get the answer to the other. In this case, the re-framed question becomes "what is the probability that one or more balls will never be drawn across D attempts?" For starters, we can note that for a given draw, any particular ball will not be drawn with probability $$\frac{N-1}{N}$$. The probability of a given ball not being drawn during D consecutive attempts is then $$\left(\frac{N-1}{N}\right)^D$$. Since any of the N balls not being drawn is sufficient for failure, we might naively just multiply this probability by N:

$$1-P\approx N\left(\frac{N-1}{N}\right)^D$$

Unfortunately, this expression is too high, because it over counts the cases in which more than one ball is never drawn. For example, the above expression includes probabilities for both "ball 1 was never drawn" and "ball 2 was never drawn", both of which will include cases where neither ball 1 nor ball 2 was drawn. To correct such double-counted events, we can subtract off the cases where two given balls are never drawn. By similar logic to above, the probability of each one of these cases is $$\left(\frac{N-2}{N}\right)^D$$. How many such cases are there? We want all the ways to choose 2 elements out of a list of N options, which is the choose function $${N \choose 2}$$. This gives us the less-naive estimate

$$1-P\approx N\left(\frac{N-1}{N}\right)^D-{N \choose 2}\left(\frac{N-2}{N}\right)^D$$

While this is better, we now face the problem of under-counting cases in which three balls are never drawn. Just counting by hand, we can find that we need one more copy of each triple-failure case, so we need to add an extra $${N \choose 3}\left(\frac{N-3}{N}\right)^D$$. By now, we see that every change we make will force a higher-order correction, but fortunately, a pattern is emerging. For even-n terms, we subtract $${N \choose n}\left(\frac{N-n}{N}\right)^D$$, while for odd-n terms, we add the same. Formally, we know this pattern will hold for all n from the identity

$$\displaystyle\sum_{n=1}^N (-1)^{n-1}{N \choose n}=1$$

I leave the proof of this as an exercise to the reader. :p Carrying this pattern through all the way to the end, we finally get the exact expression we want:

$$1-P = \displaystyle\sum_{n=1}^N (-1)^{n-1}{N \choose n}\left(\frac{N-n}{N}\right)^D$$

Not too shabby! There are a number of ways one can shuffle the terms in this sum around and pull out common factors, but I couldn't find anything that really simplified it. Please let me know if you see any improvements!
 
Last edited:
Wow! That's impressive.

It will take me a bit of time to fully understand it. Your opening statement did bring back one old rusty memory from by stat classes -- that part about turning the problem into the odds of it not happening at all. But even if I had remembered that, I still doubt that I could have come up with what you did.

You have provided P=f(N,D). I want D=f(N,P), so I guess I'll have to add one more level of complexity by taking logs.

In the meantime, I gave up and wrote a little simulation. Here are the results from 1 million trials with 32 balls:
Code:
Number of trials = 1,000,000
Number of balls  = 32

     Trial Summary
Draws   Tally   Cum Prob
  43       1       0.00
  44       1       0.00
  45       3       0.00
  46       2       0.00
  47       7       0.00
  48       7       0.00
  49       6       0.00
  50      15       0.00
  51      30       0.01
     . . .
  68    1538       0.92
  69    1786       1.10
  70    2042       1.30
  71    2279       1.53
     . . .
  87    7903       9.83
  88    8385      10.67
  89    8553      11.53
     . . .
101   11651      24.14
102   11702      25.31
103   11998      26.51
     . . .
122   11671      49.66
123   11596      50.82
124   11361      51.96
     . . .
149    7269      74.93
150    6915      75.62
151    6627      76.29
     . . .
180    3053      89.93
181    2952      90.22
182    2934      90.51
     . . .
202    1663      94.89
203    1556      95.04
204    1562      95.20
     . . .
253     349      98.99
254     338      99.02
255     315      99.05
     . . .
423       1      99.99
424       1     100.00
425       2     100.00
     . . .
556       0     100.00
557       0     100.00
558       1     100.00
 
Back
Top