There are plenty of substantive open source software projects out there for data scientists, so why Python?1 After all, there is the R language. R is a robust and well-supported language written initially by statisticians for statisticians. Our aim is not to promote one language over the other. The goal is to illustrate how adding Python to the SAS user’s toolkit is a valuable way to augment existing skills. Besides, Bob Muenchen has already written R for SAS and SPSS Users.2
Python is used in a wide range of computing applications, from web and internet development to scientific and numerical analysis. Its pedigree in scientific and technical computing gives the language a natural affinity for data analysis. This is one of the reasons why Google uses it so extensively and has developed an outstanding tutorial for programmers.3
Perhaps the best answer to the question “why Python?” is expressed in the Zen of Python, written by Tim Peters.4 While these are design principles used to influence the development of a language like Python, they apply (mostly) to our own efforts. These aphorisms are worth bookmarking and re-reading periodically.
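The Zen of Python also ships with the interpreter: entering import this at the Python prompt prints it. The short sketch below does the same thing from a script and captures the printed text in a variable (zen_text is simply a name chosen for this illustration) so it can be inspected.

```python
import contextlib
import io

# Importing the standard module `this` prints the Zen of Python as a
# side effect. We redirect standard output so the text lands in a string.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    import this

zen_text = buffer.getvalue()
print(zen_text)
```

Note that the text is printed only the first time the module is imported in a given Python session.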
Setting Up a Python Environment
One of the first questions a new Python user confronts is which version to use, Python 2 or Python 3. For this writing we used Python 3.6.4 (Version 3.6, Maintenance 4). The current release of Python is 3.7.2, released on December 24, 2018. Python release 3.8.0 is expected in November 2019. As with any language, minor changes in syntax occur as the developers make feature improvements, and Python is no exception. We chose Python 3.6 since it was the latest release at the time of writing, and the release of 3.7 has not affected any of the chapters. You can read more about the differences between Python 2 and Python 3.5
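If you are unsure which version a given installation provides, the interpreter can report it. The following is a minimal sketch using the Standard Library sys module; the helper names version_string and is_python3 are our own and are not part of Python.

```python
import sys


def version_string(info=sys.version_info):
    """Return the version as 'major.minor.micro', for example '3.6.4'."""
    return "{}.{}.{}".format(info[0], info[1], info[2])


def is_python3(info=sys.version_info):
    """Return True when the major version number is 3 or later."""
    return info[0] >= 3


print("Running Python", version_string())
print("Python 3 or later:", is_python3())
```

Both helpers accept any tuple-like value, so they can also be used to compare against a version you expect, such as (3, 6, 4).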
An attractive feature of Python is the availability of community-contributed modules. Python comes with a base library or core set of modules, referred to as the Standard Library. Thanks to Python’s design, individuals and organizations have contributed thousands of additional modules, most of them written in Python. Interested in astronomical calculations used to predict any planet’s location in space? Then the kplr package is what you need.6 Closer to home, we will utilize the python-dateutil 2.7.3 package to extend Python’s base capabilities for handling datetime arithmetic.7
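To give a feel for the base capabilities that such packages build on, here is a small sketch of date arithmetic using only the Standard Library datetime module. The dates are arbitrary values chosen for illustration. The Standard Library’s timedelta works in fixed units such as days and seconds; calendar-aware offsets such as “one month later” are the kind of thing python-dateutil adds.

```python
from datetime import date, timedelta

# Arbitrary example dates, chosen only for illustration.
start = date(2018, 12, 24)

# Adding a timedelta to a date produces a new date.
later = start + timedelta(days=30)

# Subtracting two dates produces a timedelta.
elapsed = later - start

print(later.isoformat())
print(elapsed.days)
```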
Just as you can configure your SAS development environment in numerous ways, the same is true for Python. And while there are various implementations of Python, such as Jython, IronPython, and PyPy, organizations make life simpler by packaging distributions for you, so you can avoid having to understand dependencies or use build scripts to assemble a custom environment. At the time of this writing, we are using the Anaconda distribution 5.2.0 for Windows 10, available from Anaconda’s distribution page at www.anaconda.com/download/.
The Anaconda distribution of Python also supports macOS and Linux. Anaconda conveniently takes care of all the details for you by providing familiar tools for installing, uninstalling, upgrading, determining package dependencies, and so on. But it does much more than make a convenient distribution: it provides detailed documentation, supports a community of enthusiastic users, and offers a supported enterprise product around the free distribution.
Anaconda3 Install Process for Windows
- 1. From www.anaconda.com/download/ download the Anaconda3-5.2.0 for Windows Installer for Python 3.6. Select the 32-bit or 64-bit installer (depending on your Windows machine architecture).
- 2. In the download location on your machine, you should see the file Anaconda3-5.2.0-Windows-x86-64.exe (assuming your Windows machine is 64-bit) for launching the Windows Installer.
- 3. Launch the Windows Installer (see Figure 1-1).
- 4. Click Next to review the license agreement and click the “I Agree” button (see Figure 1-2).
- 5. Select the installation type, stand-alone or multi-user (see Figure 1-3).
- 6. Select the installation location (see Figure 1-4).
- 7. Register Anaconda as the default Python 3.6 installation by ensuring the “Register Anaconda as my default Python 3.6” box is checked. Press the “Install” button (see Figure 1-5).
- 8. Start the installation process (see Figure 1-6). You may be asked if you would like to install Microsoft’s Visual Studio Code. Visual Studio Code is an editor for writing and debugging Python scripts. It is an optional component and is not used in this book.
- 9. Validate the install by opening a Windows Command Prompt window and entering the command python (after the > symbol prompt):
Python Command for Windows
If the Python interpreter starts, the installation is complete, including the modifications made to the Windows environment variable PATH.
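Once Python starts, you can also ask it which executable is running and which one the operating system would find first on PATH. This is a minimal sketch using the Standard Library; the helper name locate is our own invention for this illustration.

```python
import shutil
import sys


def locate(command="python"):
    """Return the full path found on PATH for `command`, or None."""
    return shutil.which(command)


# The interpreter executing this script.
print("Running interpreter:", sys.executable)

# The first match on PATH; None means no such command was found.
print("First 'python' on PATH:", locate())
```

If the second line prints None, or a location other than your Anaconda installation directory, the PATH variable is the first place to look.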
Troubleshooting Python Installation for Windows
- 1. On Windows 10, open File Explorer and select “Properties” for “This PC” (see Figure 1-7).
- 2. Right-click the “Properties” dialog to open the Control Panel for the System (see Figure 1-8).
- 3. Select the “Advanced” tab for System Properties and press the Environment Variables… button (see Figure 1-9).
- 4. Highlight the “Path” environment variable (see Figure 1-10).
- 5. Edit the Path environment variable by clicking the “New” button (see Figure 1-11).
- 6. Add the Anaconda Python installation path specified in step 6 of the Anaconda3 Install Process for Windows, as seen earlier (see Figure 1-12).
- 7. Ensure the path you entered is correct and click “OK”.
- 8. To validate, start a new Windows Command Prompt and enter the command “Python”. The output should look similar to the one in Listing 1-2.
Validate Python for Windows
The three angle brackets (>>>) are the default prompt for the interactive Python 3 interpreter.
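The PATH check in the steps above can also be done from Python. The sketch below splits the PATH variable into its entries and tests whether a given directory is among them. The function names and the directory C:\path\to\Anaconda3 are placeholders invented for this illustration; substitute the installation location you chose.

```python
import os


def path_entries(path_value=None):
    """Split a PATH-style string into its non-empty entries."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return [entry for entry in path_value.split(os.pathsep) if entry]


def is_on_path(directory, path_value=None):
    """Return True if `directory` is one of the PATH entries."""
    def normalize(name):
        # Ignore trailing separators and, on Windows, letter case.
        return os.path.normcase(os.path.normpath(name))

    target = normalize(directory)
    return any(normalize(entry) == target for entry in path_entries(path_value))


# Hypothetical installation directory; replace with your own.
print(is_on_path(r"C:\path\to\Anaconda3"))
```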
Anaconda3 Install Process for Linux
- 1. From www.anaconda.com/download/ download the Anaconda3-5.2.0 for Linux Installer for Python 3.6. This is actually a script file. Select the 32-bit or 64-bit installer (depending on your machine architecture). Select “Save File” and click “OK” (see Figure 1-13).
- 2. Open a Linux terminal window and navigate to the download location, by default the directory /<userhome>/Downloads.
- 3. Change the file permissions with the chmod command to allow the script to execute.
- 4. If you are using a Bash shell, you can execute the shell script by placing ./ before the script filename (see Figure 1-14).
- 5. Press <enter> to continue and display the License Agreement (Figure 1-15).
- 6. Accept the license terms by entering “yes” and pressing <enter> (Figure 1-16).
- 7. Confirm the Anaconda3 installation directory and press <enter> (Figure 1-17).
- 8. Append the Anaconda3 installation directory to the $PATH environment variable by entering “yes” and pressing <enter> (Figure 1-18).
- 9. Confirm the installation by closing the terminal window used to execute the installation script and opening a new terminal window. This action will execute the .bashrc file in your home directory and “pick up” the updated $PATH environment variable that includes the Anaconda3 installation directory (Figure 1-19).
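For readers curious about what the chmod step does, the same permission change can be sketched in Python on Linux. The example below is an illustration only: it creates a throwaway script named installer.sh in a temporary directory (a stand-in, not the Anaconda installer) and adds the owner-execute permission bit to it. The helper name make_executable is our own.

```python
import os
import stat
import tempfile


def make_executable(path):
    """Add the owner-execute bit to `path`; return True if it is now set."""
    current_mode = os.stat(path).st_mode
    os.chmod(path, current_mode | stat.S_IXUSR)
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


with tempfile.TemporaryDirectory() as folder:
    # A stand-in script, not the real installer.
    script = os.path.join(folder, "installer.sh")
    with open(script, "w") as handle:
        handle.write("#!/bin/sh\necho hello\n")
    executable_now = make_executable(script)

print(executable_now)
```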
Executing a Python Script on Windows
loop.py Program
Notice that no symbol is used to end a program statement. The end-of-line character ends a Python statement. This also helps to enforce legibility by keeping each statement on a separate physical line.
Coincidentally, like SAS, Python honors a semicolon as an end-of-statement terminator. However, you rarely see this. That’s because multiple statements on the same physical line are considered an affront to program legibility.
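The short sketch below illustrates both points. It is a hypothetical stand-in written for this illustration, not the loop.py listing itself: each statement ends at the end of its line, and the final assignment line shows that semicolons are legal even though they are discouraged.

```python
def squares(limit):
    """Return the squares of 0, 1, ..., limit - 1."""
    results = []
    for i in range(limit):
        # No terminator is needed; the end of the line ends the statement.
        results.append(i * i)
    return results


# Legal, but discouraged: two statements on one physical line.
a = 1; b = 2

print(squares(5))
print(a + b)
```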
Equivalent of loop.py Written in SAS
- 1.
- 2. Open a Windows Command window and navigate to the directory where you saved the loop.py Python script.
- 3. Execute the loop.py script from the Windows Command window by entering
Output from loop.py
No Such File or Directory
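The same behavior can be observed from within Python. The sketch below, offered as an illustration under our own assumptions, writes a small script called demo_loop.py (a hypothetical stand-in for loop.py) to a temporary directory, runs it with the current interpreter, and then tries to run a file that does not exist. The helper name run_script is our own.

```python
import os
import subprocess
import sys
import tempfile


def run_script(path):
    """Run `path` with the current interpreter; return (exit code, stdout)."""
    completed = subprocess.run(
        [sys.executable, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return completed.returncode, completed.stdout


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical stand-in for loop.py.
    script = os.path.join(folder, "demo_loop.py")
    with open(script, "w") as handle:
        handle.write("for i in range(3):\n    print(i)\n")

    ok_code, ok_output = run_script(script)
    missing_code, _ = run_script(os.path.join(folder, "no_such_file.py"))

print("Exit code for existing script:", ok_code)
print("Exit code for missing script:", missing_code)
```

A zero exit code signals success. A non-zero exit code, as for the missing file, signals that Python could not run the script, which is why navigating to the correct directory first matters.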
Modified loop.py with No Indentation
Expected an Indented Block Error
Once you get over the shock of how Python imposes its indentation requirements, you will come to see them as an important feature for creating and maintaining legible, easy-to-understand code. The standard coding practice is to indent with four spaces per level rather than using tabs. In the section “Integrated Development Environment (IDE) for Python,” you will see how this and other formatting details are handled for you automatically.
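To see the indentation rule in action without creating separate files, the following sketch asks Python to compile two small pieces of source code held in strings. Both snippets are hypothetical examples written for this illustration. The first indents the loop body by four spaces; the second does not, so Python rejects it with an IndentationError.

```python
# The loop body is indented by four spaces.
indented_source = "for i in range(3):\n    print(i)\n"

# The loop body is not indented, which Python does not accept.
unindented_source = "for i in range(3):\nprint(i)\n"


def compiles(source):
    """Return True if `source` is valid Python, False on an indentation error."""
    try:
        compile(source, "<example>", "exec")
        return True
    except IndentationError as error:
        print("IndentationError:", error.msg)
        return False


print(compiles(indented_source))
print(compiles(unindented_source))
```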
Case Sensitivity
Python scripts can also be executed interactively. In this example, we invoke the Python command. This causes the command-line prompt to change to the default Python prompt, >>>. To end an interactive Python session, submit the statement exit().
The variable Y (uppercase) is assigned the integer value of 201. The Python print() function is called for the variable y (lowercase). Since the variable y is not presently defined in the Python namespace, a NameError is raised.
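The behavior just described can be reproduced with the following sketch. Instead of letting the error stop the program, it catches the NameError and keeps the error message in a variable (message is a name chosen for this illustration).

```python
# Uppercase Y is assigned the integer value 201.
Y = 201

try:
    # Lowercase y is a different name and has never been assigned.
    print(y)
except NameError as error:
    message = str(error)
    print("NameError:", message)
```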
Line Continuation Symbol
Line Continuation
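As a brief illustration, Python’s line continuation symbol is the backslash (\) placed as the last character on a line. An expression enclosed in parentheses can also span several lines without any symbol. The variable names below are our own.

```python
# A backslash at the very end of a line continues the statement.
total = 1 + 2 + \
        3 + 4

# Inside parentheses, no continuation symbol is needed.
grouped = (1 + 2 +
           3 + 4)

print(total)
print(grouped)
```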
A language that makes it hard to write elegant code makes it hard to write good code.
—Eric Raymond, Why Python?8
Executing a Python Script on Linux
- 1.
- 2. Open a terminal window and navigate to the directory where you saved the loop.py Python script.
- 3. Execute the loop.py Python script with command
Now that you understand how to execute Python scripts in “non-interactive mode,” you are probably wondering about Python’s equivalent for SAS Display Manager or the SAS Studio client. This leads us to the next topic, “Integrated Development Environment (IDE) for Python.”
Integrated Development Environment (IDE) for Python
In order to improve our Python coding productivity, we need a tool for interactive script development, as opposed to the non-interactive methods we have discussed thus far. We need the equivalent of the SAS Display Manager or SAS Studio.
SAS Display Manager, SAS Enterprise Guide, and SAS Studio are examples of an integrated development environment, or IDE for short. Beyond just editing your SAS programs, these IDEs provide a set of services, such as submitting programs for execution, logging execution, rendering output, and managing resources. For example, in the SAS Display Manager, opening the LIBREF window to view assigned SAS Data Libraries shows the IDE’s ability to provide a non-programming method to visually inspect the properties and members for the SAS LIBNAME statements assigned to the current session.
As you might expect, not all IDEs are created equal. The more sophisticated IDEs permit setting breakpoints to enable a “walk through” of code execution, displaying variable values and resource states one line at a time. They also provide methods to store a collection of programs as a coherent set of packages. These packages can then be redistributed to others for execution. Perhaps the most compelling feature is how an IDE encourages team collaboration by allowing multiple users to work together creating, testing, and documenting a project composed of a collection of these artifacts.
One of the more interesting IDEs developed specifically for the data scientist community is the Jupyter notebook. It uses a web-based interface to write, execute, test, and document your code. Jupyter notebooks support over 40 languages, including Python, R, Scala, and Julia. It also has an open architecture, so vendors and users can write plug-ins for their own execution engines, or what Jupyter refers to as kernels. SAS Institute supports a bare-bones SAS kernel executing on Linux for Jupyter notebooks.9
A compelling feature for Jupyter notebooks is the ability to develop and share them across the Web. All of the Python examples used in this book were developed using the Jupyter notebook. Best of all, the Anaconda distribution of Python comes bundled with the Jupyter notebook IDE.
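If you want to confirm that the Jupyter notebook software is available in your Python environment, the Standard Library can tell you whether a package can be imported without actually importing it. This is a minimal sketch; is_installed is a helper name of our own, and notebook is the name of the Python package that provides the classic Jupyter notebook.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package of this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# 'notebook' is the package behind the classic Jupyter notebook.
print("Jupyter notebook available:", is_installed("notebook"))
```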
Jupyter Notebook
- 1. On the dashboard (labeled Home page), click the New button on the upper right and then select Python 3 from the drop-down. This launches a new untitled notebook page.
- 2. Enter the loop.py script created earlier into a cell. You can also copy the script you created earlier and paste it directly into the notebook cell.
- 3. Click the “Play” button to execute the code you copied into the cell (Figure 1-25).
You may have multiple notebooks, each represented by a browser tab, opened at the same time. You may also have multiple instances of Jupyter notebooks opened at the same time (with multiple notebooks open, pay attention to names to avoid accidental overwriting).
Jupyter Notebook for Linux
On Linux the terminal window remains open while the Jupyter notebook is active.
- 1. Start a new browser instance.
- 2. The notebook should launch a browser session. If it does not start the browser, then look for the message in the Linux terminal window:
Summary
In this chapter we illustrated how to install and configure the Python environment for Windows and Linux. We also introduced basic formatting and syntax rules needed to execute simple Python scripts. And we introduced different methods for executing Python scripts including the use of Jupyter notebooks. With a working Python environment established, we can begin exploring Python as a language to augment SAS for data exploration and analysis.