Probability of drawing a given word from a bag of letters in Scrabble

Suppose you had a bag with n tiles, each with a letter on it. There are n_A tiles with letter ‘A’, n_B with ‘B’, and so on, and n_* ‘wildcard’ tiles (we have n = n_A + n_B + \ldots + n_Z + n_*). Suppose you had a dictionary with a finite number of words.

You pick k tiles from the bag without replacement.

How would you compute (or estimate) the probability that you can form a given word, of length l (with 1 < l =< k) from the dictionary given the k tiles selected?

For those not familiar with Scrabble (TM), the wildcard character can be used to match any letter. Thus the word ‘BOOT’ could be ‘spelled’ with the tiles ‘B’, ‘*’, ‘O’, ‘T’. The order in which the letters are drawn does not matter.

Suggestion: in order to simplify the writing of answers, it might be better to just answer the question: what is the probability of having the word ‘BOOT’ among your possible moves after drawing 7 letters from a fresh bag.

(the problem’s introduction has been copied from this similar question)

Answer

A formula is requested. Unfortunately, the situation is so complicated it appears that any formula will merely be a roundabout way of enumerating all the possibilities. Instead, this answer offers an algorithm which is (a) tantamount to a formula involving sums of products of binomial coefficients and (b) can be ported to many platforms.


To obtain such a formula, break down the possibilities into mutually disjoint groups in two ways: according to how many letters not in the word are selected in the rack (let this be m) and according to how many wildcards (blanks) are selected (let this be w). When there are r=7 tiles in the rack, N available tiles, M available tiles with letters not in the word, and W=2 blanks available, the number of possible choices given by (m,w) is

\binom{M}{m}\binom{W}{w}\binom{N-M-W}{r-m-w}

because the choices of non-word letters, blanks, and word letters are independent conditional on (m,w,r).

This reduces the problem to finding the number of ways to spell a word when selecting only from the tiles representing the word’s letters, given that w blanks are available and r-m-w tiles will be selected. The situation is messy and no closed formula seems available. For instance, with w=0 blanks and m=3 out-of-word letters are drawn there will be precisely four letters left to spell “boot” that were drawn from the “b”, “o”, and “t” tiles. Given there are 2 “b”‘s, 8 “o”‘s, and 6 “t”‘s in the Scrabble tile set, there are positive probabilities of drawing (multisets) “bboo”, “bbot”, “bbtt”, “booo”, “boot”, “bott”, “bttt”, “oooo”, “ooot”, “oott”, “ottt”, and “tttt”, but only one of these spells “boot”. And that was the easy case! For example, supposing the rack contains five tiles chosen randomly from the “o”, “b”, and “t” tiles, together with both blanks, there are many more ways to spell “boot”–and not to spell it. For instance, “boot” can be spelled from “__boott” and “__bbttt”, but not from “__ttttt”.

This counting–the heart of the problem–can be handled recursively. I will describe it with an example. Suppose we wish to count the ways of spelling “boot” with one blank and four more tiles from the collection of “b”, “o”, and “t” tiles (whence the remaining two tiles show non-blank letters not in {“b”, “o”, “t”}). Consider the first letter, “b”:

  1. A “b” can be drawn in \binom{2}{1} ways from the two “b” tiles available. This reduces the problem to counting the number of ways of spelling the suffix “oot” using both blanks and just three more tiles from the collection of “o” and “t” tiles.

  2. One blank can be designated as a “b”. This reduces the problem to counting the number of ways of spelling “oot” using the remaining blank and just three more tiles from the collection of “o” and “t” tiles.

In general, steps (1) and (2)–which are disjoint and therefore contribute additively to the probability calculations–can be implemented as a loop over the possible number of blanks that might be used for the first letter. The reduced problem is solved recursively. The base case occurs when there’s one letter left, there is a certain number of tiles with that letter available, and there may be some blanks in the rack, too. We only have to make sure that the number of blanks in the rack plus the number of available tiles will be enough to obtain the desired quantity of that last letter.

Here is R code for the recursive step. rack usually equals 7, word is an array of counts of the letters (such as c(b=1, o=2, t=1)), alphabet is a similar structure giving the numbers of available tiles with those letters, and wild is the number of blanks assumed to occur in the rack.

f <- function(rack, word, alphabet, wild) {
  if (length(word) == 1) {
    return(ifelse(word > rack+wild, 0, choose(alphabet, rack)))
  }
  n <- word[1]
  if (n <= 0) return(0)
  m <- alphabet[1]
  x <- sapply(max(0, n-wild):min(m, rack), 
              function(i) {
                choose(m, i) * f(rack-i, word[-1], alphabet[-1], wild-max(0, n-i))
              })
  return(sum(x))
}

An interface to this function specifies the standard Scrabble tiles, converts a given word into its multiset data structure, and performs the double sum over m and w. Here is where the binomial coefficients \binom{M}{m} and \binom{W}{w} are computed and multiplied.

scrabble <- function(sword, n.wild=2, rack=7, 
              alphabet=c(a=9,b=2,c=2,d=4,e=12,f=2,g=3,h=2,i=9,j=1,k=1,l=4,m=2,
                         n=6,o=8,p=2,q=1,r=6,s=4,t=6,u=4,v=2,w=2,x=1,y=2,z=1),
              N=sum(alphabet)+n.wild) {
  word = sort(table(strsplit(sword, NULL))) # Sorting speeds things a little
  a <- sapply(names(word), function(s) alphabet[s])
  names(a) <- names(word)
  x <- sapply(0:n.wild, function(w) {
    sapply(sum(word):rack-w, 
           function(i) {
             f(i, word, a, wild=w) *
               choose(n.wild, w) * choose(N-n.wild-sum(a), rack-w-i)
           })
  })
  return(list(numerator = sum(x), denominator = choose(N, rack),
              value=sum(x) / choose(N, rack)))
}

Let’s try out this solution and time it as we go. The following test uses the same inputs employed in the simulations by @Rasmus Bååth:

system.time(x <- sapply(c("boot", "red", "axe", "zoology"), scrabble))

This machine reports 0.05 seconds total elapsed time: reasonably quick. The results?

> x
            boot        red         axe         zoology     
numerator   114327888   1249373480  823897928   11840       
denominator 16007560800 16007560800 16007560800 16007560800 
value       0.007142118 0.07804896  0.0514693   7.396505e-07

The probability for “boot” of 114327888/16007560800 exactly equals the value 2381831/333490850 obtained in my other answer (which uses a similar method but couches it in a more powerful framework requiring a symbolic algebra computing platform). The probabilities for all four words are reasonably close to Bååth’s simulations (which could not be expected to give an accurate value for “zoology” due to its low probability of 11840/16007560800, which is less than one in a million).

Attribution
Source : Link , Question Author : Sébastien , Answer Author : Community

Leave a Comment