There's more...

Of course, there are ways to improve the code, such as using a plain loop rather than an iterator, but this example demonstrates a couple of points:

  • Just because PyPy is being used doesn't mean that program performance will improve. Not only do you have to stay within the subset of Python that PyPy supports, you also have to write the code in a manner that takes advantage of PyPy's optimization capabilities (see the loop sketch after this list).
  • While maximum performance can be achieved with a compiled language, using PyPy means that you rarely have to rewrite your code. Of course, if your code takes a long time to process and can't be optimized for PyPy, then compiling may be your best bet.

    For example, writing a C version of the Million Bottles code resulted in a completion time of less than 1 second. This is 99 percent faster than PyPy's time.

  • This also points out that it is better to write your code first, then conduct performance analysis to identify bottlenecks. Those areas are the key places to focus on, whether that means rewriting them in a compiled language or looking into PyPy.
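
As a quick illustration of the first point above, the following is a minimal sketch of the kind of code PyPy's JIT handles well: a tight, type-stable loop. The count_multiples function and the iteration counts are purely illustrative. Under PyPy, the later timed passes are typically faster than the first because the JIT has warmed up; under CPython, all three passes take roughly the same time.

    import time

    def count_multiples(limit):
        # A tight, type-stable loop like this is exactly the kind of
        # code that PyPy's JIT compiles to fast machine code.
        total = 0
        for i in range(limit):
            if i % 7 == 0:
                total += 1
        return total

    # Time several passes; under PyPy, later passes usually run
    # faster than the first because the loop has been JIT-compiled.
    for attempt in range(3):
        start = time.perf_counter()
        count_multiples(10000000)
        print(attempt, time.perf_counter() - start)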

The PyPy documentation (http://pypy.org/performance.html) provides some hints on how to optimize your code prior to refactoring or rewriting it:

  • Use regression testing. Like any test code, it requires significant time upfront to determine which tests are needed, as well as to write the actual code. But the payoff comes during refactoring, as it lets you try different optimizations without worrying about introducing hidden bugs (a minimal test sketch follows this list).
  • Use profilers to measure the time of your code overall, as well as of its individual portions. That way, you know exactly where the time sinks are and can focus on those areas, rather than guessing where the bottlenecks are (a profiling sketch follows this list).
  • Harking back to parallel processing, be aware of whether code is I/O-bound or CPU-bound. I/O-bound code waits on data transfers and benefits most from multithreading rather than from significant code optimization; there is only so much you can do with your code before processing becomes limited by the speed of the I/O connections (see the threading sketch after this list).

    CPU-bound code is where you get the most value from refactoring and optimization. Because the CPU has to process a lot of data, any optimization of the code, such as compiling or parallelizing it, will have a direct impact on performance.

  • While you can always rewrite your code in a compiled language, that defeats the purpose of using Python. A better technique is to tune your algorithms to maximize performance in terms of how the data is processed (see the list-versus-set sketch after this list). You will probably go through several iterations of tuning and algorithm optimization as you discover new bottlenecks.

  • Smaller programs are intrinsically faster than larger ones. This is because the different levels of cache on a CPU get progressively smaller the closer they are to the core, but also progressively faster. If you can create a program, or at least subroutines, that fits inside a cache, it will run at the speed of that cache.

    Smaller programs imply simpler code, as simple code compiles down to shorter sequences of machine instructions. The tension comes from algorithm tuning: improving algorithm performance generally means using time-saving but space-consuming techniques, such as precomputation or reverse maps (a precomputation sketch follows this list).
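
Picking up the first hint, the following is a minimal regression test sketch using the standard unittest module. The bottle_count function is a hypothetical stand-in for whatever routine you are about to optimize; the point is that the test pins down the current, known-good output before you start refactoring.

    import unittest

    def bottle_count(start, verses):
        # Hypothetical routine about to be optimized; the test below
        # captures its current behavior so refactoring can't silently
        # change it.
        return start - verses

    class RegressionTest(unittest.TestCase):
        def test_known_good_output(self):
            # If an optimized version breaks this expectation, the
            # refactoring introduced a bug.
            self.assertEqual(bottle_count(99, 10), 89)

    if __name__ == '__main__':
        unittest.main()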
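
For the second hint, here is a small sketch using the standard cProfile and pstats modules. The slow_part and fast_part functions are made-up placeholders; sorting the collected statistics by cumulative time surfaces the real time sinks.

    import cProfile
    import pstats

    def slow_part():
        # Deliberately heavy work; this should dominate the profile.
        return sum(i * i for i in range(1000000))

    def fast_part():
        return sum(range(1000))

    def main():
        slow_part()
        fast_part()

    if __name__ == '__main__':
        # Profile the whole run, then print the functions that consumed
        # the most cumulative time; those are the areas to focus on.
        cProfile.run('main()', 'profile.stats')
        pstats.Stats('profile.stats').sort_stats('cumulative').print_stats(5)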
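
For the I/O-bound case, the following is a minimal threading sketch using concurrent.futures; the URLs are illustrative. Because each fetch spends most of its time waiting on the network, the threads overlap those waits instead of competing for the CPU, which is why multithreading helps here but not in CPU-bound code.

    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    # Illustrative pages to fetch; the work is dominated by network
    # waits, not computation.
    URLS = ['http://pypy.org', 'http://python.org']

    def fetch(url):
        with urlopen(url) as page:
            return len(page.read())

    # Threads overlap the I/O waits, so total time approaches that of
    # the slowest single fetch rather than the sum of all fetches.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for url, size in zip(URLS, pool.map(fetch, URLS)):
            print(url, size)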
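
As one small, concrete example of algorithm tuning, here is a sketch that changes nothing but the data structure: membership tests against a list scan every element (O(n)), while the same data held in a set is an average O(1) hash lookup. The sizes are arbitrary.

    # Same data, two structures: the only change is how membership
    # is tested.
    needles = range(10000)
    haystack_list = list(range(100000))
    haystack_set = set(haystack_list)

    # Each test against the list scans up to 100,000 elements...
    hits_slow = sum(1 for n in needles if n in haystack_list)

    # ...while each test against the set is a single hash lookup.
    hits_fast = sum(1 for n in needles if n in haystack_set)

    assert hits_slow == hits_fast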
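
Finally, here is a tiny sketch of the space-for-time techniques mentioned above: a precomputed table of squares and a reverse map built from it. The table size is arbitrary; the trade is extra memory in exchange for constant-time lookups.

    # Precompute results once (spending space to save time), and build
    # a reverse map so lookups by value avoid a linear search.
    squares = {n: n * n for n in range(1000)}
    reverse = {v: k for k, v in squares.items()}

    print(squares[31])   # 961: a table lookup instead of arithmetic
    print(reverse[961])  # 31: found without scanning all the entries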
