Chapter 2

Go Forth

image

2.1 Constraints

  • Existence of a data stack. All operations (conditionals, arithmetic, etc.) are done over data on the stack.
  • Existence of a heap for storing data that's needed for later operations. The heap data can be associated with names (i.e. variables). As said above, all operations are done over data on the stack, so any heap data that needs to be operated upon needs to be moved first to the stack and eventually back to the heap.
  • Abstraction in the form of user-defined "procedures" (i.e. names bound to a set of instructions), which may be called something else entirely.

2.2 A Program in this Style

  1 #!/usr/bin/env python
  2 import sys, re, operator, string
  3
  4 #
  5 # The all-important data stack
  6 #
  7 stack = []
  8
  9 #
 10 # The heap. Maps names to data (i.e. variables)
 11 #
 12 heap = {}
 13
 14 #
 15 # The new "words" (procedures) of our program
 16 #
 17 def read_file():
 18 """
 19 Takes a path to a file on the stack and places the entire
 20 contents of the file back on the stack.
 21 """
 22 f = open(stack.pop())
 23 # Push the result onto the stack
 24 stack.append([f.read()])
 25 f.close()
 26
 27 def filter_chars():
 28 """
 29 Takes data on the stack and places back a copy with all
 30 nonalphanumeric chars replaced by white space.
 31 """
 32 # This is not in style. RE is too high-level, but using it
 33 # for doing this fast and short. Push the pattern onto stack
 34 stack.append(re.compile('[W_]+'))
 35 # Push the result onto the stack
 36 stack.append([stack.pop().sub(' ', stack.pop()[0]).lower()])
 37
 38 def scan():
 39 """
 40 Takes a string on the stack and scans for words, placing
 41 the list of words back on the stack
 42 """
 43 # Again, split() is too high-level for this style, but using
 44 # it for doing this fast and short. Left as exercise.
 45 stack.extend(stack.pop()[0].split())
 46
 47 def remove_stop_words():
 48 """
 49 Takes a list of words on the stack and removes stop words.
 50 """
 51 f = open('../stop_words.txt')
 52 stack.append(f.read().split(','))
 53 f.close()
 54 # add single-letter words
 55 stack[-1].extend(list(string.ascii_lowercase))
 56 heap['stop_words'] = stack.pop()
 57 # Again, this is too high-level for this style, but using it
 58 # for doing this fast and short. Left as exercise.
 59 heap['words'] = []
 60 while len(stack)> 0:
 61    if stack[-1] in heap['stop_words']:
 62        stack.pop() # pop it and drop it
 63    else:
 64        heap['words'].append(stack.pop()) # pop it, store it
 65 stack.extend(heap['words']) # Load the words onto the stack
 66 del heap['stop_words']; del heap['words'] # Not needed
 67
 68 def frequencies():
 69 """
 70 Takes a list of words and returns a dictionary associating
 71 words with frequencies of occurrence.
 72 """
 73 heap['word_freqs'] = {}
 74 # A little flavour of the real Forth style here...
 75 while len(stack)> 0:
 76    # ... but the following line is not in style, because the
 77    # naive implementation would be too slow
 78    if stack[-1] in heap['word_freqs']:
 79        # Increment the frequency, postfix style: f 1 +
 80        stack.append(heap['word_freqs'][stack[-1]]) # push f
 81        stack.append(1) # push 1
 82        stack.append(stack.pop() + stack.pop()) # add
 83    else:
 84        stack.append(1) # Push 1 in stack[2]
 85    # Load the updated freq back onto the heap
 86    heap['word_freqs'][stack.pop()] = stack.pop()
 87
 88 # Push the result onto the stack
 89 stack.append(heap['word_freqs'])
 90 del heap['word_freqs'] # We don't need this variable anymore
 91
 92 def sort():
 93 # Not in style, left as exercise
 94 stack.extend(sorted(stack.pop().iteritems(), key=operator.
      itemgetter(1)))
 95
 96 # The main function
 97 #
 98 stack.append(sys.argv[1])
 99 read_file(); filter_chars(); scan(); remove_stop_words()
 100 frequencies(); sort()
 101
 102 stack.append(0)
 103 # Check stack length against 1, because after we process
 104 # the last word there will be one item left
 105 while stack[-1] < 25 and len(stack)> 1:
 106 heap['i'] = stack.pop()
 107 (w, f) = stack.pop(); print w, ' - ', f
 108 stack.append(heap['i']); stack.append(1)
 109 stack.append(stack.pop() + stack.pop())

Note: If not familiar with Python, please refer to the Prologue (Pythonisms) for an explanation of lists, indexes and bounds.

2.3 Commentary

THIS STYLE takes inspiration in Forth, a small programming language first developed as a personal programming system in the late 1950s by Charles Moore, a programmer working at the time for the Smithsonian Astrophysical Laboratory. This programming system – an interpreter for a simple language – supported the need to handle different equations without having to recompile the program – a time-consuming activity at the time.

This curious little language has at its heart the concept of a stack. Equations are entered in postfix notation, e.g. "3 4 +". Operands are pushed onto the stack one at a time; operators take the operands on the stack and replace them with the result. When data is not immediately needed, it can be placed on a part of the memory called the heap. Besides the stack machine, Forth supports the definition of procedures (called "words" in Forth); these procedures, like the built-in ones, operate over data on the stack.

The syntax of Forth may be hard to understand due to its postfix nature and several special symbols that aren't used in any other language. But the language model is surprisingly simple once we understand the constraints – the stack, the heap, procedures and names. Rather than using a Forth interpreter written in Python, this chapter shows how the constraints underlying Forth can be codified in Python programs, resulting, roughly, in a Forth style of programming. Let's analyze the example program.

To start with, to support the style, we first define the stack (line #7) and the heap (line #12).1 Next we define a set of procedures ("words" in Forth terminology). These procedures help us divide the problem in smaller sub-steps, such as reading a file (lines #17 through #25), filtering the characters (lines #27 through #36), scanning for words (lines #38 through #45), removing stop words (lines #47 through #66), computing the frequencies (lines #68 through #90), and sorting the result (lines #92 through #94). We'll look into some of these in more detail next. But one thing to notice is that all these procedures use (pop) data from the stack (e.g. lines #22, #36, #45) and end by pushing data onto the stack (e.g. lines #24, #36, #45).

Forth's heap supports the allocation of data blocks that can be – and usually are – bound to names. In other words, variables. The mechanism is relatively low level, as the programmer needs to define the size of the data. In our emulation of Forth's style, we simply use a dictionary (line #12). So for example, in line #56, we are popping the stop words on the stack directly into a variable on the heap named stop_words.

Many parts of the example program are not written in Forth style, but some parts are true to it, so let's focus on those. The procedure remove_stop_words (starting in line #47), as the name suggests, removes the stop words. When that procedure is called, the stack contains all the words of the input file, properly normalized. The first few words of Pride and Prejudice are:

['the', 'project', 'gutenberg', 'ebook', 'of', 'pride', 'and', 'prejudice', ...]

That is how the stack looks like at that point, for that book. Next, we open the stop words file and push the list of stop words onto the stack (lines #51 through #55). To make things simple, we keep them in a list of their own instead of merging them with the rest of the data on the stack. The stack now looks like this:

['the', 'project', 'gutenberg', 'ebook', 'of', 'pride', 'and', 'prejudice', ..., ['a', 'able', 'about', 'across', ...]]

After reading all the stop words from the file and placing them onto the stack, we then pop them out to the heap (line #56), in preparation to process the words of the book that are still on the stack. Lines #60 through #64 iterate through the words on the stack in the following way. Until the stack is empty (test in line #60), we check if the word at the top of the stack (stack[-1] in Python) is in the list of stop words (line #61). In real Forth, this test would be much more low level than what is shown here, as we would need to explicitly iterate through the list of stop words too. In any case, if the word is in the stop words list, we simply pop it out of the stack and ignore it. If the word is not in the list of stop words (line #63), we pop it out of the stack onto another variable in the heap called words – the list accumulates the non-stop words (line #64). When the iteration is over, we take the variable words and place its entire contents back on the stack (line #65). We end up with the stack containing all non-stop words, like this:

['project', 'gutenberg', 'ebook', 'pride', 'prejudice', ...]

At that point, we don't need the variables on the heap anymore, so we discard them (line #66). Forth supports deletion of variables from the heap in the spirit of what is being done here.

The frequencies procedure (starting in line #68) shows one more stylistic detail related to arithmetic operations. That procedure starts with the non-stop words on the stack (as shown above) and ends with placing a dictionary of word frequencies onto the stack (line #89). It works as follows. First, it allocates a variable on the heap called word_freqs that stores the word-frequency pairs (line #73) – it starts with the empty dictionary. It then iterates through the words on the stack. For each word at the top of the stack, it checks whether that word has been seen before (line #78). Again, this test is expressed at a much higher level than it would be in Forth, for performance reasons. If the word had been seen before, we need to increment its frequency count. That is done by pushing the current frequency count onto the stack (line #80), then pushing the value 1 onto the stack (line #81), and then adding those 2 top-most operands on the stack and placing the result on the stack (line #82). If the word had not been seen before (line #83), we simply push the value 1 onto the stack. Finally, we pop both the frequency count (right side of assignment in line #86) and the word itself (left side of assignment in line #86) out of the stack and into the variable on the heap, and move to the next word on top of the stack, until the stack is empty (back to line #75). At the end, as stated before, we push the entire content of the heap variable onto the stack, and delete that variable.

The main function starting in line #98 is the beginning of the program. We start by pushing the name of the input file onto the stack (line #98), and then invoke the procedures sequentially. Note that these procedures are not completely independent of each other: each of them relies on strong assumptions about the data that is left on the stack by the previous one.

Once the counting and sorting is done, we then print out the result (lines #105 through #109). This block of code shows a final stylistic detail related to what Forth calls "indefinite loops", or loops that run until a condition is true. In our case, we want to iterate through the dictionary of word frequencies until we count 25 iterations. So we do the following. We start by pushing the number 0 onto the stack (line #102), on top of the data that is already there (word frequencies) and proceed to an indefinite loop until the top of the stack reaches the number 25. In each iteration, we pop the count out of the stack into a variable (line #106), then pop the next word and frequency out of the stack and print them (line #107); then, we push the count in the variable back to the stack, followed by the value 1, and add them, effectively incrementing the count. The loop, and the program, terminates when the top of the stack has the value 25. The second clause for termination (len(stack)> 1) is there in the case of small test files that might not even have 25 words.

Many options could have been pursued, and the reader is encouraged to explore the solution space within this style.

2.4 Historical Notes

Early computing machines did not have stacks. The earliest reference for the idea of using a stack in computers can be found in Alan Turing's Automatic Computing Engine (ACE) report in 1945. Unfortunately, that report was classified for many years, so not many knew about it.

Stacks were invented again in the late 1950s by several people independently. It was several more years before computer architectures started to include stacks, and to use them for purposes such as subroutine calls.

Forth was a personal project from a computer maverick that never caught the attention of the dominant players of the time. Forth was entirely done in software, and it has been ported [by Moore] to several generations of computers since 1958. Considering that Moore started using it in the late 1950s, the fact that Forth was a stack machine interpreter that early on makes it historically relevant.

Another well-known stack machine based language is PostScript, a language used to describe documents for printing purposes. PostScript was developed at Xerox PARC in the late 1970s by John Warnock and others; it was based on an earlier language designed by John Warnock. The group eventually left PARC to start Adobe Systems.

2.5 Further Reading

Koopman, P. (1989). Stack Computers: The New Wave. Ellis Horwood Publisher. Available at http://www.ece.cmu.edu/˜koopman/stack computers/
Synopsis: An introduction to stack machines, a not so new wave, but interesting nevertheless.

Rather, E., Colburn, D. and Moore, C. (1993). The evolution of Forth. ACM SIGPLAN Notices 28(3) – HOPL II, pp. 177–199.
Synopsis: Charles Moore is a maverick in the computing world, and everyone should know about his work. This paper tells the story of Forth.

Warnock, J. E. (2012). Simple ideas that changed printing and publishing. Proceedings of the American Philosophical Society 156(4): 363–378.
Synopsis: A historical perspective on PostScript, a stack machine language for printing.

2.6 Glossary

Stack: A stack is "Last-In-First-Out" data structure. Its main operations are push – add an element to the top of the stack – and pop – remove the element from the top of the stack. Stacks play a critical role in programming language implementations well beyond Forth. Although usually invisible to the programmer, in virtually every modern programming language, a stack is the piece of memory that supports a thread of program execution. When procedures/functions are called, a block of data related to the parameters and return addresses is usually pushed onto the stack; subsequent procedure/function calls push other similar blocks onto the stack. Upon return, the corresponding blocks on the stack are usually popped.

Heap: The heap is another piece of memory underlying the implementation of many modern programming languages. The heap is used for dynamic memory allocation/deallocation, such as the creation of lists and objects. (Not to be confused with a data structure called Heap, a specialized tree-based data structure.)

Stack machine: A stack machine is a real or emulated computing machine that uses a stack, instead of registers, as its main support for evaluating the expressions of the program. Forth is a stack machine programming language. So are many modern virtual machines, such as the Java Virtual Machine.

2.7 Exercises

2.1 Another language. Implement the example program in another language, but preserve the style.

2.2 Search. In the example program, the comparison in line #78 is not in style, as it uses Python's high-level containment checking if x in y. Rewrite this piece of the program, i.e. searching whether a given word is in the dictionary on the heap, using go-forth style. Explain what happens to the performance of your program.

2.3 True stack. Python uses lists for implementing stacks, which makes the example program a bit confusing. Implement a true stack data structure in Python (possibly wrapping around a list) that provides the expected push, pop, peek and empty operations. Use your data structure in the example program, instead of the list introduced in line #7.

2.4 A different task. Write one of the tasks proposed in the Prologue using the go-forth style.

1In Python, stacks are simply lists; we use them as stacks by invoking the stack operations pop and append (acting as push). Occasionally, extend is also used, as a shortcut to append/push the elements of an entire list onto the stack.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset