How (and why) to program, part 1

This entry is part 1 of 2 in the series How (and why) to program

On May 15th, listeners to the NPR program Weekend Edition were given this challenge by puzzlemaster Will Shortz:

Create a 4-by-4 crossword square with four four-letter words reading across and four different four-letter words reading down. Use the word “nags” at 1 across and the word “newt” at 1 down. All eight words must be common, uncapitalized words, and all 16 letters must be different.

Here is the starting grid as described by Will Shortz:

1 N   2 A   3 G   4 S
5 E   _     _     _
6 W   _     _     _
7 T   _     _     _

This puzzle is a ripe one for solving (some might say cheating at…) by computer. All we need is a list of four-letter words and a way to test every combination of words in the grid for validity; so let’s get to it. For this example I will use the Python programming language for its conciseness and readability.

First, the list of words. On Linux and Mac OS X (and most other Unix-based operating systems) there is a file called /usr/share/dict/words, or sometimes /usr/lib/dict/words or /usr/dict/words, which is nothing but a more-or-less comprehensive list of English words, one per line. We’ll read that file to get our list of four-letter words, discarding words that are shorter or longer. In fact, since we know that every word in the grid begins with one of the letters in NEWT and NAGS (namely N, E, W, T, A, G, and S), we can even discard four-letter words not beginning with one of those seven letters. And we can discard N words too, because NEWT and NAGS are already given; we don’t need to search for N words. The words we keep will be stored in six different “sets,” one for each of the six remaining starting letters.

Here’s the first section of our code: creating our six empty word-sets and giving each one a name.

a_words = set()
g_words = set()
s_words = set()
e_words = set()
w_words = set()
t_words = set()

Now to get the words we want out of the “words” file. First we “open” the file to get a special placeholder we’ll call wordlist.

wordlist = open('/usr/share/dict/words')

We’re going to read one line of the file at a time. The placeholder remembers our position in the file in between reads.

To read one line of the file at a time and do something with it, we’ll write a “loop” that begins like this:

for word in wordlist:
  ...stuff...

This will do stuff once for each line of the file, with word referring to the contents of that line. Unfortunately the first line of that stuff is a necessary but confusing bookkeeping detail:

for word in wordlist:
  word = word[:-1]
  ...more stuff...

This confusing bit of code is here because each line of the file ends with an invisible “newline” character that means “this line is over, start a new one.” (The same is true of text files in general, except that the very last line of a file is sometimes missing its newline.) Since we want to deal only with the visible content of the line while doing more stuff, we need to discard that newline character, and

word = word[:-1]

is how you do that. (For our purposes right now it’s not terribly important how that works, but you can read it roughly as, “replace word with everything-in-word-except-the-last-character.”)
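To see the slice in action, here is a small sketch with made-up sample strings (they are not read from the real file). It also shows rstrip('\n'), a standard string method that removes a trailing newline only if one is present, which is the safer choice if the last line of a file might lack one.

```python
# A line as it might come out of the words file, newline included.
line = 'newt\n'

# Slicing with [:-1] keeps everything except the last character.
trimmed = line[:-1]

# A hypothetical final line with no newline at the end.
last_line = 'zulu'
sliced = last_line[:-1]             # chops off a real letter
stripped = last_line.rstrip('\n')   # leaves the word alone
```

For the rest of this article we’ll stick with word[:-1], since every word we care about is followed by a newline.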

We’ll begin more stuff with a test to make sure that the word we’re looking at is four letters long:

for word in wordlist:
  word = word[:-1]
  if len(word) == 4:
    ...the rest of the stuff...

Here, len(word) means “the length of word” (i.e., the number of characters it contains), and == is for testing whether two things are equal. (A single = means “make this equal that,” as we’ve already seen a few times.) The rest of the stuff will only happen if len(word) is 4.

If it is 4, then we want to save the word in the correct set — either the set of A words, or the set of G words, etc. Here’s what the complete loop looks like:

for word in wordlist:
  word = word[:-1]
  if len(word) == 4:
    first_letter = word[0]
    if first_letter == 'a':
      a_words.add(word)
    elif first_letter == 'g':
      g_words.add(word)
    elif first_letter == 's':
      s_words.add(word)
    elif first_letter == 'e':
      e_words.add(word)
    elif first_letter == 'w':
      w_words.add(word)
    elif first_letter == 't':
      t_words.add(word)

After determining that len(word) was 4, we gave the name first_letter to the first letter of word (which is written as word[0], because most things in programming are counted beginning at 0 rather than at 1). We then tested first_letter to see if it was a, and added word to a_words if it was; if it wasn’t, we tested first_letter to see if it was g, and added word to g_words if so; etc. The strange word elif appearing in the code above is simply Python’s abbreviation for “else if.”

If first_letter isn’t one of the six we care about — or if word isn’t 4 characters long to begin with — then nothing happens and we loop back around to the for word in wordlist line to get the next word.
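The elif chain can also be written more compactly. What follows is a sketch of one alternative, not the approach used in the rest of this article: a “dictionary” (called word_sets here, a name made up for this illustration) maps each starting letter to its own set. The short sample list stands in for the lines of the words file.

```python
# Hypothetical sample standing in for lines read from the words file.
sample_lines = ['achy\n', 'grip\n', 'sumo\n', 'nags\n', 'apple\n', 'ecru\n']

# One empty set per starting letter we care about.
word_sets = {}
for letter in 'agsewt':
  word_sets[letter] = set()

for word in sample_lines:
  word = word[:-1]
  # Keep four-letter words whose first letter is one of our six.
  if len(word) == 4 and word[0] in word_sets:
    word_sets[word[0]].add(word)
```

Here nags is skipped because n isn’t one of the six letters, and apple is skipped because it’s five letters long. The six-separate-sets version is longer, but it needs fewer new ideas, which is why the article keeps it.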

After reading all the lines from wordlist, the loop exits and the program does whatever comes next, which is:

wordlist.close()

This is the way to say we’re finished using that placeholder and don’t need it anymore. Doing so releases resources, such as memory space, that the computer can now use for other purposes. (In a little program like this, that doesn’t matter much, so it’s OK to leave out wordlist.close(). But in large programs you can “leak” memory and other resources if you do things like fail to close files when you’re finished with them.)
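Python can also close the file for us, using a with statement. This is not what the program in this article does; it’s a minimal sketch, and it writes a tiny temporary file (using the standard tempfile module) purely so that the example runs on any machine.

```python
import os
import tempfile

# Write a tiny stand-in word file so the example runs anywhere.
# (The real program would open /usr/share/dict/words instead.)
handle, path = tempfile.mkstemp()
os.write(handle, b'nags\nnewt\nachy\n')
os.close(handle)

words = []
with open(path) as wordlist:
  for word in wordlist:
    words.append(word[:-1])
# Leaving the indented block closes the file; no explicit close() needed.

file_was_closed = wordlist.closed
os.remove(path)
```

The advantage is that there is no close() call to forget.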

OK. We’ve got our sets of candidate A words, G words, S words, E words, W words, and T words. Now the strategy will be to try every possible combination of A, G, and S words for 2, 3, and 4 Down; then, for each of those combinations, check that the resulting words in 5, 6, and 7 Across make sense, and additionally that no letter is repeated anywhere in the grid.

To try every combination of A, G, and S words, first we start by trying every A word:

for a_word in a_words:
  ...stuff...

As we’ve already seen, this is a loop that will do stuff once for each entry in a_words (with a_word referring to that entry each time through the loop). Now if we nest another loop inside this loop, like so:

for a_word in a_words:
  for g_word in g_words:
    ...more stuff...

then more stuff will happen once for each possible combination of a_word and g_word. That is, first a_word will be aced (let’s say), and while a_word is aced, g_word will range from gabs to gyro; and when the g_words loop is finished, a_word will advance to aces, and g_word will again range from gabs to gyro, and so on through all the four-letter A words, each one running through all of the four-letter G words, like the digits of an odometer. (A Python set doesn’t actually promise alphabetical order, or any particular order, but every combination still gets visited exactly once.)
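The odometer behavior is easy to watch on a small scale. This sketch uses two tiny made-up lists (a_sample and g_sample, names chosen just for this example) rather than the real sets, because lists keep their order and so the result is predictable.

```python
# Tiny made-up lists (not sets, so the order is predictable).
a_sample = ['aced', 'aces']
g_sample = ['gabs', 'gyro']

pairs = []
for a_word in a_sample:
  for g_word in g_sample:
    pairs.append((a_word, g_word))

# pairs is now:
# [('aced', 'gabs'), ('aced', 'gyro'), ('aces', 'gabs'), ('aces', 'gyro')]
```

Two A words times two G words gives four pairs, with the inner loop’s word changing fastest.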

From that you should be able to guess that we need to nest another loop inside our nested loop, this one for the S words.

for a_word in a_words:
  for g_word in g_words:
    for s_word in s_words:
      ...test this combination...

Now to test this combination once for each possible combination of A word, G word, and S word. The first test is to see whether the words created by placing a_word in 2 Down, g_word in 3 Down, and s_word in 4 Down result in sensible words at 5 Across, 6 Across, and 7 Across.

Let’s construct the word at 5 Across — the E word — from the second letters of a_word, g_word, and s_word.

e_word = 'e' + a_word[1] + g_word[1] + s_word[1]

(Remember that counting the letters in a word begins at 0, so the second letter of each word is numbered 1.)

Let’s construct the W word and the T word the same way — w_word from the third letter of each Down word, and t_word from each Down word’s fourth letter.

w_word = 'w' + a_word[2] + g_word[2] + s_word[2]
t_word = 't' + a_word[3] + g_word[3] + s_word[3]

Now if a_word is ammo and g_word is gulp and s_word is shot, then e_word will be emuh, w_word will be wmlo, and t_word will be topt, which don’t make sense! But we can easily check whether the Across words make sense by seeing if they can be found in the e_words, w_words, and t_words sets that we created earlier. So:

for a_word in a_words:
  for g_word in g_words:
    for s_word in s_words:
      e_word = 'e' + a_word[1] + g_word[1] + s_word[1]
      w_word = 'w' + a_word[2] + g_word[2] + s_word[2]
      t_word = 't' + a_word[3] + g_word[3] + s_word[3]
      if e_word in e_words and w_word in w_words and t_word in t_words:
        ...remaining test...

If we get as far as the remaining test (which is to ensure that no letter is duplicated anywhere in the grid), we know that e_word and w_word and t_word are real words.
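To make that test concrete, here is the ammo/gulp/shot example from above run against three tiny stand-in sets. The contents of these sets are made up for illustration; the real ones come from the words file.

```python
# Tiny stand-ins for the real sets built from the words file.
e_words = set(['echo', 'emit'])
w_words = set(['wham', 'wilt'])
t_words = set(['trip', 'tusk'])

a_word = 'ammo'
g_word = 'gulp'
s_word = 'shot'

e_word = 'e' + a_word[1] + g_word[1] + s_word[1]   # 'emuh'
w_word = 'w' + a_word[2] + g_word[2] + s_word[2]   # 'wmlo'
t_word = 't' + a_word[3] + g_word[3] + s_word[3]   # 'topt'

# False here: none of the three constructed words is in its set.
all_real = e_word in e_words and w_word in w_words and t_word in t_words
```

Because all_real is False, this combination would be skipped and the loops would move on to the next one.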

We can ensure no letter is duplicated by using another set called letters_used. The plan: go through the letters of each Down word one by one, checking whether the letter is in the set. If it’s not, then add it to the set and move to the next letter. If it is, then we’ve already seen that letter once and it’s duplicated.

We know the first Down word is newt, so we can create our set with those letters in it.

letters_used = set(['n', 'e', 'w', 't'])

(We could also have created this set like this:

letters_used = set()
letters_used.add('n')
letters_used.add('e')
letters_used.add('w')
letters_used.add('t')

which matches the way we created and used the other sets above, but the first way does the same thing more concisely.)
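For the curious, Python has a couple of even shorter spellings of the same set. This is a side note rather than something the program needs; the three names below are invented for the comparison.

```python
# Three equivalent ways to build the same four-letter set.
from_list = set(['n', 'e', 'w', 't'])
from_string = set('newt')              # a string is a sequence of letters
from_literal = {'n', 'e', 'w', 't'}    # set literal, Python 2.7 and later
```

All three produce a set containing exactly n, e, w, and t.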

Now to use that set to find duplicates.

any_duplicates = False
for letter in a_word + g_word + s_word:
  if letter in letters_used:
    any_duplicates = True
    break
  else:
    letters_used.add(letter)

Here, break means “leave the loop immediately.” There’s no need to keep looping once we know there are duplicates.

By the time the loop finishes, either we’ve looped through all the letters and found no duplicates, or we exited the loop with break because we did find duplicates. We check any_duplicates to see which of those two things happened. If it’s True there were duplicates; but if it’s False then we’ve found a valid solution and can print it to display it to the user.

if any_duplicates == False:
  print(a_word, g_word, s_word, e_word, w_word, t_word)
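There is also a shortcut for the duplicate check that leans on the fact that a set never holds the same item twice. It’s sketched here as an alternative, not as the method this article uses, and the example words are picked by hand.

```python
a_word = 'amid'
g_word = 'grub'
s_word = 'shop'

# All 16 letters in the grid: the given first column plus the three Down words.
grid_letters = 'newt' + a_word + g_word + s_word

# A set drops repeats, so equal lengths mean no letter appeared twice.
any_duplicates = len(set(grid_letters)) != len(grid_letters)

# A combination that does repeat letters (two m's in ammo, for one).
clashing_letters = 'newt' + 'ammo' + 'gulp' + 'shot'
clash_has_duplicates = len(set(clashing_letters)) != len(clashing_letters)
```

This only checks letters, of course; amid, grub, and shop don’t produce sensible Across words. The loop with break spells out the same idea one letter at a time, which is easier to follow when sets are new to you.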

To recap, here’s the complete program.

a_words = set()
g_words = set()
s_words = set()
e_words = set()
w_words = set()
t_words = set()

wordlist = open('/usr/share/dict/words')

for word in wordlist:
  word = word[:-1]
  if len(word) == 4:
    first_letter = word[0]
    if first_letter == 'a':
      a_words.add(word)
    elif first_letter == 'g':
      g_words.add(word)
    elif first_letter == 's':
      s_words.add(word)
    elif first_letter == 'e':
      e_words.add(word)
    elif first_letter == 'w':
      w_words.add(word)
    elif first_letter == 't':
      t_words.add(word)

wordlist.close()

for a_word in a_words:
  for g_word in g_words:
    for s_word in s_words:
      e_word = 'e' + a_word[1] + g_word[1] + s_word[1]
      w_word = 'w' + a_word[2] + g_word[2] + s_word[2]
      t_word = 't' + a_word[3] + g_word[3] + s_word[3]
      if e_word in e_words and w_word in w_words and t_word in t_words:
        letters_used = set(['n', 'e', 'w', 't'])
        any_duplicates = False
        for letter in a_word + g_word + s_word:
          if letter in letters_used:
            any_duplicates = True
            break
          else:
            letters_used.add(letter)
        if any_duplicates == False:
          print(a_word, g_word, s_word, e_word, w_word, t_word)

Put this program in a file named puzzle.py and run it with python puzzle.py. On my computer, /usr/share/dict/words contains 470 four-letter words starting with a, 359 starting with g, and 634 starting with s, and it took about two and a quarter minutes for this program to test all of the 470×359×634 combinations (roughly 107 million). (We could make this program much faster, but at the expense of clarity and simplicity.) In the end it produced this solution:

achy grip sumo ecru whim typo

which is the same solution that Will Shortz gave on the air a week later.

(Actually, my computer produced two solutions, because /usr/share/dict/words includes a lot of questionable junk in its quest to be comprehensive. The other solution was “achy goup sld. ecol whud typ.”)

Update: As a couple of friends have pointed out, the Mac OS X version of /usr/share/dict/words does not include all the words needed to find the solution! I ran mine on Linux, which does.
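If your words file is missing some of the needed words, one way to experiment anyway is to package the same search as a function that accepts any collection of words. The sketch below is a rearrangement of the program above under a few assumptions of mine: the names solve, sets, and sample_words are invented here, the word sets live in a dictionary, and the duplicate check compares the number of distinct letters with the total number of letters.

```python
def solve(words):
  """Return a list of (a, g, s, e, w, t) word tuples that fit the grid."""
  # Sort the four-letter words into sets by first letter.
  sets = {}
  for letter in 'agsewt':
    sets[letter] = set()
  for word in words:
    if len(word) == 4 and word[0] in sets:
      sets[word[0]].add(word)

  solutions = []
  for a_word in sets['a']:
    for g_word in sets['g']:
      for s_word in sets['s']:
        e_word = 'e' + a_word[1] + g_word[1] + s_word[1]
        w_word = 'w' + a_word[2] + g_word[2] + s_word[2]
        t_word = 't' + a_word[3] + g_word[3] + s_word[3]
        if e_word in sets['e'] and w_word in sets['w'] and t_word in sets['t']:
          letters = 'newt' + a_word + g_word + s_word
          # No repeats if there are as many distinct letters as letters.
          if len(set(letters)) == len(letters):
            solutions.append((a_word, g_word, s_word, e_word, w_word, t_word))
  return solutions

# A hand-picked list: the six words of the solution above plus some decoys.
sample_words = ['achy', 'grip', 'sumo', 'ecru', 'whim', 'typo',
                'ammo', 'gulp', 'shot', 'echo', 'wham', 'trip']
found = solve(sample_words)
```

With this sample list, found holds the single tuple ('achy', 'grip', 'sumo', 'ecru', 'whim', 'typo'). To run it on a real word list, pass in the words after stripping their newlines.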


3 thoughts on “How (and why) to program, part 1”

  1. Vagrant

    Now, how would you solve this problem if the dictionary was spread out over 1000’s of machines and it contained 4,294,967,296 different words?

  2. Zorak

    If I’m reading this correctly (I skimmed over large parts of this 🙂 ) you’re building a_words as all the 4-letter words starting with A, g_words as all the 4-letter words starting with G, etc., then seeing which ones ‘cross’ properly.

    But as an obvious optimization, you don’t need to consider all the four-letter A— words, just the subset that contain 4 unique characters, since no letter can repeat in the entire grid anyway.

  3. bobg Post author

    Zorak, you’re right of course, and for a small up-front cost (i.e., checking individual words for duplicate letters) we can reduce by 58% the number of combinations that the main loop has to check, which is huge. But excluding words from the word sets because they contain duplicates doesn’t relieve us of the need to check for duplicates again in the main loop, so I opted for clarity — having just one check for duplicates — over speed, there and elsewhere.

    Optimization is a different lesson…
